Dopamine

Temporal difference learning

Best known as one of the basic building blocks of modern AI, temporal difference (TD) learning was first introduced by Richard Sutton and Andrew Barto as a model of animal learning.

In a 1981 paper published in Psychological Review, they argued that existing learning rules could not easily account for one of the most basic aspects of animal conditioning: order matters. Only stimuli that precede a reward are useful for predicting it. As a result, animals can more easily associate a stimulus with a future reward than they can a reward with a concurrent or future stimulus. For example, this means that it is much easier to train a puppy to recognize the phrase "Good dog!" as praise by giving praise and then offering a treat than by doing the reverse.

To see how a modern version of TD learning associates stimuli ⬤ with a reward 🥩, drag the reward size slider in the widget below. Then use the dropdown to see how it compares with other learning rules.

[Interactive widget: value against time to reward.]

$$\Delta \hat{v}(s_t) = \alpha\left( r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t) \right)$$

How long is a moment of time?

Much of the early success of TD learning in explaining features of animal conditioning can be traced back to its close relationship with the earlier model of Rescorla and Wagner. Both models centre around the use of some type of reward prediction error (RPE)

$$\underbrace{\delta_t}_\text{reward prediction error} = \text{observed reward} - \underbrace{\hat{v}(s_t)}_\text{expected reward}$$

to incrementally update a reward prediction

$$\Delta \hat{v}(s_t) \propto \delta_t,$$

with the main difference being in the definition of a time step: the duration of the interval between time $t$ and time $t+1$.

For Rescorla and Wagner, a time step is an entire stimulus-reward pairing, potentially lasting several seconds. This leads to a very simple definition of the reward prediction $\hat{v}(s_t)$ as the amount of reward an animal associates with one presentation of the stimulus, and of the RPE $\delta_t$ as the difference between the amount of reward given $r_t$ and the animal's current prediction

$$\delta_t = r_t - \hat{v}(s_t).$$

The evolution of learning under this rule is equally simple: over successive stimulus-reward pairings, the prediction associated with each stimulus converges exponentially towards the average paired reward. The cost of this simplicity is that the Rescorla–Wagner model is unable to explain how reward predictions evolve before the start of a stimulus, after the end of a reward, or even during a short delay between stimulus and reward.
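The exponential convergence described above can be checked with a few lines of Python. This is only a minimal sketch: the function name `rescorla_wagner`, the learning rate `alpha = 0.1`, and the constant reward of 1 are illustrative assumptions, not values taken from the article.

```python
def rescorla_wagner(rewards, alpha=0.1, v0=0.0):
    """Apply the Rescorla-Wagner update once per stimulus-reward pairing.

    Returns the prediction before the first pairing and after every pairing.
    The name and default parameters are hypothetical, chosen for illustration.
    """
    v = v0
    history = [v]
    for r in rewards:
        delta = r - v        # reward prediction error for this pairing
        v += alpha * delta   # move the prediction a fraction alpha towards r
        history.append(v)
    return history


# 100 pairings with a constant reward of 1.0 (assumed for illustration).
history = rescorla_wagner([1.0] * 100, alpha=0.1)
```

With a constant reward, the gap between prediction and reward shrinks by a factor of `1 - alpha` on every pairing, which is what gives the learning curve its exponential shape.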

[Interactive widget: value against number of stimulus-reward pairings, with an adjustable reward size.]

In the TD model, a time step is an essentially arbitrary time interval, usually much shorter than a single stimulus-reward pairing. (In neuroscience and psychology applications, it is conventional to set the time step equal to the duration of a reward.) This motivates a more nuanced definition of the reward prediction, called value, as the total reward over a weighted time interval in the near future.

[Interactive widget: weight and reward against time until reward (s), with value shown as a function of the discount factor.]

Equivalent timescale: $\tau = 24.7$ s
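As a rough illustration of value as a weighted sum of near-future rewards, the sketch below weights the reward arriving $i+1$ steps ahead by $\gamma^i$. The conversion from discount factor to a timescale assumes the weights are read as an exponential decay, $\gamma^{t/\Delta t} = e^{-t/\tau}$; this is one common convention and may not be the exact definition used by the widget. Both function names are hypothetical.

```python
import math


def discounted_value(future_rewards, gamma):
    """Discounted sum of rewards; future_rewards[i] arrives i + 1 steps from now."""
    return sum(gamma ** i * r for i, r in enumerate(future_rewards))


def equivalent_timescale(gamma, step_duration):
    """Time constant tau with gamma ** (t / step_duration) == exp(-t / tau).

    Assumed conversion, not taken from the article.
    """
    return -step_duration / math.log(gamma)


# A single reward of 1.0 arriving three steps from now, discounted at 0.5 per step.
value = discounted_value([0.0, 0.0, 1.0], gamma=0.5)
tau = equivalent_timescale(gamma=0.5, step_duration=1.0)
```

Under this assumption, discount factors closer to 1 correspond to longer timescales, so rewards further in the future contribute more to the value.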

As shown in the widget at the top of this page, the TD learning rule that produces these nuanced reward predictions is only slightly more complex than the Rescorla–Wagner rule. Instead of the "observed" reward consisting of the reward collected in the current time step $r_t$, TD learning uses an estimate of the cumulative future reward bootstrapped from the animal's current reward predictions

$$\underbrace{\mathbb{E}\left[\sum_{i=0}^\infty \gamma^i r_{t+i+1} \mid s_t\right]}_\text{average cumulative future reward} \approx \underbrace{r_{t+1} + \gamma \hat{v}(s_{t+1})}_\text{bootstrapped estimate}.$$

The inconsistency between the current value $\hat{v}(s_t)$ and the bootstrapped estimate is referred to as the temporal difference error

$$\delta_t = \underbrace{r_{t+1} + \gamma \hat{v}(s_{t+1})}_{\text{reward prediction at time } t+1} - \underbrace{\hat{v}(s_t)}_{\text{reward prediction at time } t},$$

so called because it revolves around measuring the change in the reward prediction across a time interval.

The decision to treat a time step as an interval potentially shorter than an entire stimulus-reward pairing allows TD learning to offer detailed predictions about how reward predictions might evolve. However, the cost of this added detail is that even simple applications of the model reveal learning dynamics that do not follow the simple exponential paths observed in the Rescorla–Wagner model, particularly when the time step is small. To see this, click Play in the widget below and watch how the value in the reward area evolves.

[Interactive widget: TD error and value against time (s).]
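The non-exponential dynamics can also be reproduced in a small simulation. The sketch below splits a single trial into `n_steps` short time steps, delivers a reward on the transition out of the last step, and applies the update $\Delta \hat{v}(s_t) = \alpha\left(r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t)\right)$ at every step. The function name, the number of steps, and the parameter values are all illustrative assumptions, and treating each time step as its own state is a simplification.

```python
def td0_trial_sweeps(n_steps=10, reward=1.0, alpha=0.5, gamma=1.0, n_trials=3):
    """Run TD(0) over repeated trials and return the values after each trial.

    Each time step within a trial is treated as its own state (an assumed,
    simplified stimulus representation). The state after the last step is
    terminal and its value stays at zero.
    """
    v = [0.0] * (n_steps + 1)  # v[n_steps] is the terminal state
    snapshots = []
    for _ in range(n_trials):
        for t in range(n_steps):
            # Reward arrives only on the transition out of the final step.
            r_next = reward if t == n_steps - 1 else 0.0
            delta = r_next + gamma * v[t + 1] - v[t]  # temporal difference error
            v[t] += alpha * delta
        snapshots.append(v[:n_steps])
    return snapshots


snapshots = td0_trial_sweeps()
```

In this sketch the value at the final step follows a Rescorla–Wagner-like curve, but each earlier step cannot start learning until the step after it has acquired some value. Value therefore spreads backwards by one step per trial, and the learning curves of early steps begin with a delay rather than following a simple exponential.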

Evolution

The first version of the temporal difference learning rule suffered from three main problems. First, Sutton and Barto did not explicitly differentiate between value $\hat{v}$ and reward $r$ in their learning rule, and instead had to assume that the values of certain states had a special property of being fixed and equal to $r$. Second, their rule predicted that in environments where animals could collect rewards on an ongoing basis, the values $\hat{v}$ would continue increasing the longer the animal spent in the environment instead of stabilizing.

These first two problems would be solved within a few years, culminating in the introduction of the temporal difference learning rule in the form we know it today:

$$\begin{aligned} \hat{v}(s_t) &\leftarrow \hat{v}(s_t) + \alpha \delta_t \\ \delta_t &= r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t), \end{aligned}$$

where $\delta_t$ is called the temporal difference error, and $0 < \gamma < 1$ is called the temporal discount factor. Rather than pulling the value of one state towards the value of the next, as in the original temporal difference learning rule, this updated rule pulls the value of each state towards the immediate reward $r_{t+1}$ plus a scaled down version of the value of the next state $\gamma \hat{v}(s_{t+1})$. By explicitly separating value and reward, the new rule obviated the need for special fixed values for certain states. By pulling the value of each state towards slightly less than the value of the next (in the absence of an immediate reward), the new rule ensured that values would not run away to infinity even when animals could collect rewards on an ongoing basis.
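A toy example makes the role of the discount factor concrete. The sketch below assumes a single state that leads back to itself and pays the same reward on every step, which is a hypothetical stand-in for an environment with ongoing rewards. Setting `gamma = 1.0` is used here only to illustrate an undiscounted update; it is not a reproduction of Sutton and Barto's original rule.

```python
def td_self_loop(reward=1.0, alpha=0.1, gamma=0.9, n_updates=2000):
    """Repeatedly apply the TD update to one state that transitions to itself.

    The environment and all parameter values are illustrative assumptions.
    """
    v = 0.0
    for _ in range(n_updates):
        delta = reward + gamma * v - v  # the next state is the same state
        v += alpha * delta
    return v


v_discounted = td_self_loop(gamma=0.9)    # settles near reward / (1 - gamma)
v_undiscounted = td_self_loop(gamma=1.0)  # grows by alpha * reward per update
```

With $\gamma < 1$ the update has a fixed point where $\delta_t = 0$, namely $\hat{v} = r / (1 - \gamma)$, so the value stabilizes. With $\gamma = 1$ the error equals the reward on every update, so the value keeps growing for as long as the simulation runs.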

The third problem, mentioned in the concluding remarks of Sutton and Barto's paper, was more serious: to what extent such a "primitive" learning rule could actually produce anything remotely useful was completely unknown. Decades later, after this learning rule had been used to train AI systems to beat the best human players at challenging games like Go, the question remains unsettled.

[Interactive widget: steps to reach goal against episode, comparing TD(0) and Monte-Carlo learning.]