Differentiating dopamine

Original citation

Kim, Malik, Mikhael, Bech, Tsutsui-Kimura, Sun, Zhang, Li, Watabe-Uchida, Gershman, and Uchida. "A Unified Framework for Dopamine Signals Across Timescales", Cell, 2020.


In its original form, the RPE hypothesis predicts that dopamine output should be essentially flat in well-trained animals navigating a predictable environment. As a result, observations of value-like ramping dopamine output in animals approaching a known reward represent one of the most significant challenges to this hypothesis.

However, subsequent theoretical work has clarified that even if the environment is perfectly predictable, various forms of subjective uncertainty can prevent the animal's value estimates $\hat{v}$ from converging to the corresponding true values $v$, leading to persistent TD error-driven dopamine ramps. Experimental evidence points to spatial uncertainty being an important contributor to this phenomenon, but even a mechanism as simple as forgetfulness can produce a similar effect.

To build intuition for how uncertainty can produce dopamine ramps, adjust the forgetting rate slider in the widget below. Notice how more rapid forgetting pulls the learned values towards zero. Importantly, since this forgetting compounds over distance, the learned values drop off more quickly than the corresponding true values. This excessively steep drop leads to a string of positive TD errors resembling a ramp.
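The same effect can be reproduced in a few lines of code. The following is a minimal sketch rather than the widget's actual model: it assumes a linear track with a single reward on the final step, tabular TD(0) learning, and a forgetting rule in which every learned value shrinks by a fixed fraction after each run. The function names, the parameter values, and the forgetting rule itself are illustrative assumptions.

```python
def learn_values(n_states=20, gamma=0.98, alpha=0.1, forget=0.01, n_episodes=5000):
    """Tabular TD(0) on a linear track, with assumed multiplicative forgetting."""
    # One extra entry for the terminal state, whose value stays at zero.
    v_hat = [0.0] * (n_states + 1)
    for _ in range(n_episodes):
        for s in range(n_states):
            r = 1.0 if s == n_states - 1 else 0.0  # reward only on the final step
            delta = r + gamma * v_hat[s + 1] - v_hat[s]
            v_hat[s] += alpha * delta
        # Assumed forgetting rule: after each run, all estimates decay toward zero.
        v_hat = [(1.0 - forget) * v for v in v_hat]
    return v_hat


def td_errors(v_hat, gamma=0.98):
    """TD error at each track position, given learned values (terminal entry last)."""
    n_states = len(v_hat) - 1
    errors = []
    for s in range(n_states):
        r = 1.0 if s == n_states - 1 else 0.0
        errors.append(r + gamma * v_hat[s + 1] - v_hat[s])
    return errors


def true_values(n_states=20, gamma=0.98):
    """Exact discounted values for the same track, for comparison."""
    return [gamma ** (n_states - 1 - s) for s in range(n_states)]


if __name__ == "__main__":
    learned = learn_values(forget=0.01)
    for s, (v, d) in enumerate(zip(true_values(), td_errors(learned))):
        print(f"state {s:2d}  true v = {v:.3f}  learned v = {learned[s]:.3f}  TD error = {d:.3f}")
```

With `forget=0.0` the learned values converge to the true values and the TD errors vanish. With any positive forgetting rate, the learned values fall below the true values, and the TD errors stay positive and grow toward the reward, which is the ramp.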

What now?

If value and TD error can have the same ramp-like appearance in animals approaching reward, how can we differentiate between them? In their 2020 paper "A Unified Framework for Dopamine Signals Across Timescales", Kim et al. argue that the answer lies in the fact that under mild discounting, TD error is more closely related to the rate of change of value than to value itself.

The relationship between TD errors and changes in value is clearest in Kenji Doya's continuous-time version of the TD error, where the temporal derivative of value appears directly: $\delta(t) = r(t) + \frac{d \hat{v}}{dt} - \frac{1}{\tau} \hat{v}(t)$, where $\tau$ is the time constant of discounting. However, the conventional discrete-time TD error can be rewritten in a similar way: $\delta_t = r_{t+1} + \gamma (\hat{v}(s_{t+1}) - \hat{v}(s_t)) - (1 - \gamma) \hat{v}(s_t).$ In both cases, mild discounting ($\gamma \approx 1$ or large $\tau$) shrinks the value term to almost nothing, so the derivative-like term dominates in the absence of reward.
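The discrete-time rewrite follows from adding and subtracting $\gamma \hat{v}(s_t)$ in the usual expression $r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t)$. As a quick numerical check, the sketch below computes the TD error both ways and returns the two terms separately so their sizes can be compared. The function names and the example numbers are illustrative assumptions, not taken from the paper.

```python
def td_error_standard(r_next, v_now, v_next, gamma):
    """Conventional discrete-time TD error."""
    return r_next + gamma * v_next - v_now


def td_error_split(r_next, v_now, v_next, gamma):
    """Same TD error, split into a change-in-value term and a value term."""
    change_term = gamma * (v_next - v_now)  # derivative-like part
    value_term = (1.0 - gamma) * v_now      # part that shrinks as gamma approaches 1
    return r_next + change_term - value_term, change_term, value_term


if __name__ == "__main__":
    # Hypothetical unrewarded step in which the value estimate rises from 0.5 to 0.6.
    for gamma in (0.5, 0.9, 0.99):
        delta, change_term, value_term = td_error_split(0.0, 0.5, 0.6, gamma)
        assert abs(delta - td_error_standard(0.0, 0.5, 0.6, gamma)) < 1e-12
        print(f"gamma={gamma}: change term={change_term:.4f}, value term={value_term:.4f}")
```

As `gamma` approaches 1, the value term shrinks toward zero while the change term approaches the raw difference in value, so the TD error on an unrewarded step is dominated by the change in value.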

Beam me up!

Kim et al. exploit the derivative-like property of TD error by designing an experiment in which value and its time derivative can be separately controlled. In their experiment, mice are trained to expect a reward at the end of a virtual-reality track. On most trials, value and dopamine ramp up smoothly as the animals approach the reward. However, on a subset of trials, the animals are virtually teleported forward, which produces a sustained, staircase-like jump in value but only a transient spike in TD error. Since dopamine signals more closely resemble spikes than staircases, the authors conclude that dopamine encodes TD error rather than value.
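The contrast between the two signals can be sketched in code. This is an illustration under simplifying assumptions, not the authors' analysis: the value at each track position is taken to be the exact exponentially discounted value of a reward at the end of the track, the animal advances one position per time step, and a single teleport moves it forward by a fixed number of positions. Because the values are exact, the TD error here is zero outside the teleport rather than ramping. All names and parameter values are hypothetical.

```python
def simulate_run(track_len=50, gamma=0.98, teleport_at=20, teleport_by=10):
    """Return (positions, values, td_errors) for one run with a single forward teleport."""

    def value(x):
        # Assumed value function: discounted value of a reward at the end of the track.
        return gamma ** (track_len - x)

    positions = [0]
    t = 0
    while positions[-1] < track_len:
        step = teleport_by if t == teleport_at else 1  # teleport on exactly one time step
        positions.append(min(positions[-1] + step, track_len))
        t += 1

    values = [value(x) for x in positions]
    # No reward is delivered on these transitions, so r = 0 in every TD error.
    td_errors = [gamma * values[i + 1] - values[i] for i in range(len(values) - 1)]
    return positions, values, td_errors


if __name__ == "__main__":
    positions, values, td_errors = simulate_run()
    for i, delta in enumerate(td_errors):
        print(f"t={i:2d}  x={positions[i]:2d}  value={values[i]:.3f}  TD error={delta:.3f}")
```

In the printed trace, value steps up at the teleport and stays elevated for the rest of the run, whereas the TD error is nonzero only on the teleport step itself.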

To see the differing effects of teleportation on value and TD error, click Play then Teleport in the widget below.