Temporal difference learning

Original citation

Sutton and Barto. "Toward a Modern Theory of Adaptive Networks: Expectation and Prediction", Psychological Review, 1981.


Best known as one of the basic building blocks of modern AI, temporal difference (TD) learning was first introduced by Richard Sutton and Andrew Barto as a model of animal learning.

In a 1981 paper published in Psychological Review, they argued that existing learning rules could not easily account for one of the most basic aspects of animal conditioning: order matters. Only stimuli that precede a reward are useful for predicting it. As a result, animals can more easily associate a stimulus with a future reward than they can a reward with a concurrent or future stimulus. For example, this means that it is much easier to train a puppy to recognize the phrase "Good dog!" as praise by giving praise and then offering a treat than by doing the reverse.

To see how a modern version of TD learning associates stimuli ⬤ with a reward 🥩, drag the reward size slider in the widget below. Then use the dropdown to see how it compares with other learning rules.

How long is a moment of time?

Much of the early success of TD learning in explaining features of animal conditioning can be traced back to its close relationship with the earlier model of Rescorla and Wagner. Both models centre around the use of some type of reward prediction error (RPE) $\underbrace{\delta_t}_\text{reward prediction error} = \text{observed reward} - \underbrace{\hat{v}(s_t)}_\text{expected reward}$ to incrementally update a reward prediction $\Delta \hat{v}(s_t) \propto \delta_t,$ with the main difference being in the definition of a time step: the duration of the interval between time $t$ and time $t+1$ .

For Rescorla and Wagner, a time step is an entire stimulus-reward pairing, potentially lasting several seconds. This leads to a very simple definition of the reward prediction $\hat{v}(s_t)$ as the amount of reward an animal associates with one presentation of the stimulus, and of the RPE $\delta_t$ as the difference between the amount of reward given $r_t$ and the animal's current prediction $\delta_t = r_t - \hat{v}(s_t).$ The evolution of learning under this rule is equally simple: over successive stimulus-reward pairings, the prediction associated with each stimulus converges exponentially towards the average paired reward. The cost of this simplicity is that the Rescorla–Wagner model is unable to explain how reward predictions evolve before the start of a stimulus, after the end of a reward, or even during a short delay between stimulus and reward.
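The exponential convergence described above can be illustrated with a minimal sketch. The function name `rescorla_wagner`, the learning rate `alpha`, and the constant reward of 1 are assumptions chosen for illustration; they do not come from the original paper.

```python
def rescorla_wagner(rewards, alpha=0.1, v0=0.0):
    """Track one stimulus's reward prediction across stimulus-reward pairings.

    Each element of `rewards` is the reward given on one pairing (one "time step"
    in the Rescorla-Wagner sense). `alpha` is an assumed learning rate.
    """
    v = v0
    history = [v]
    for r in rewards:
        delta = r - v        # reward prediction error for this pairing
        v += alpha * delta   # move the prediction a fraction alpha towards r
        history.append(v)
    return history


# Fifty pairings, each with a reward of 1.
history = rescorla_wagner([1.0] * 50, alpha=0.1)
```

With a constant reward $r$ and an initial prediction of zero, the prediction after $n$ pairings in this sketch is $r\,(1 - (1-\alpha)^n)$, which is the exponential approach to the paired reward.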

In the TD model, a time step is an essentially arbitrary time interval, usually much shorter than a single stimulus-reward pairing. (In neuroscience and psychology applications, it is conventional to set the time step equal to the duration of a reward.) This motivates a more nuanced definition of the reward prediction, called value: the total reward expected over the near future, with rewards further away weighted less heavily.
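As a rough illustration of value as a weighted sum of near-future rewards, the sketch below assumes geometric weighting by a discount factor `gamma` and a known reward sequence. The function name `discounted_value` and the example numbers are hypothetical.

```python
def discounted_value(rewards, gamma=0.9):
    """Return v[t] = sum_i gamma**i * rewards[t + i + 1] for every time step t.

    `rewards[t]` is the reward collected in time step t. Rewards beyond the end
    of the list are assumed to be zero.
    """
    values = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using v[t] = rewards[t + 1] + gamma * v[t + 1].
    for t in reversed(range(len(rewards) - 1)):
        running = rewards[t + 1] + gamma * running
        values[t] = running
    return values


# A single reward of 1 in time step 3; gamma = 0.5 keeps the arithmetic simple.
values = discounted_value([0.0, 0.0, 0.0, 1.0, 0.0], gamma=0.5)
# values == [0.25, 0.5, 1.0, 0.0, 0.0]
```

The value ramps up as the reward approaches and drops to zero once no further reward lies ahead, which is the kind of within-trial detail a single Rescorla–Wagner prediction cannot express.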

As shown in the widget at the top of this page, the TD learning rule that produces these nuanced reward predictions is only slightly more complex than the Rescorla–Wagner rule. Instead of the "observed" reward consisting of the reward collected in the current time step $r_t$ , TD learning uses an estimate of the cumulative future reward bootstrapped from the animal's current reward predictions $\underbrace{\mathbb{E}\left[\sum_{i=0}^\infty \gamma^i r_{t+i+1} \mid s_t\right]}_\text{average cumulative future reward} \approx \underbrace{r_{t+1} + \gamma \hat{v}(s_{t+1})}_\text{bootstrapped estimate},$ where the factor $\gamma$ down-weights rewards further in the future. The inconsistency between the current value $\hat{v}(s_t)$ and the bootstrapped estimate is referred to as the temporal difference error $\delta_t = \underbrace{r_{t+1} + \gamma \hat{v}(s_{t+1})}_\text{reward prediction at time t+1} - \underbrace{\hat{v}(s_t)}_\text{reward prediction at time t},$ so called because it revolves around measuring the change in the reward prediction across a time interval.

The decision to treat a time step as an interval potentially shorter than an entire stimulus-reward pairing allows TD learning to offer detailed predictions about how reward predictions might evolve. However, the cost of this added detail is that even simple applications of the model reveal learning dynamics that do not follow the simple exponential paths observed in the Rescorla–Wagner model, particularly when the time step is small. To see this, click Play in the widget below and watch how the values in the stimulus and reward areas evolve.
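A small simulation can make these non-exponential dynamics concrete. The sketch below is only one possible setup, under several assumptions that are not from the original paper: each time step between stimulus onset and reward is treated as its own state, the reward arrives at the end of the last step, and the names `td_trials`, `alpha`, and `n_steps` are hypothetical.

```python
def td_trials(n_steps=5, n_trials=200, alpha=0.1, gamma=0.9, reward=1.0):
    """Tabular TD learning over repeated stimulus-reward trials.

    State 0 is stimulus onset; the reward arrives on leaving state n_steps - 1.
    Returns the value at stimulus onset after each trial and the final values.
    """
    v = [0.0] * (n_steps + 1)  # extra entry: terminal state, value stays 0
    stimulus_value = []
    for _ in range(n_trials):
        for t in range(n_steps):
            r_next = reward if t == n_steps - 1 else 0.0
            delta = r_next + gamma * v[t + 1] - v[t]  # temporal difference error
            v[t] += alpha * delta
        stimulus_value.append(v[0])
    return stimulus_value, v


stimulus_value, v = td_trials()
```

In this sketch the value at stimulus onset stays at exactly zero for the first `n_steps - 1` trials, because the prediction has to propagate backwards one time step per trial before reaching the stimulus. It then rises along an S-shaped curve towards `gamma ** (n_steps - 1) * reward` rather than along a simple exponential. Increasing `n_steps` (a smaller time step relative to the trial) lengthens the initial delay.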

Evolution

The first version of the temporal difference learning rule suffered from three main problems. First, Sutton and Barto did not explicitly differentiate between value $\hat{v}$ and reward $r$ in their learning rule, and instead had to assume that the values of certain states had a special property of being fixed and equal to $r$ . Second, their rule predicted that in environments where animals could collect rewards on an ongoing basis, the values $\hat{v}$ would continue increasing the longer the animal spent in the environment instead of stabilizing.

These first two problems would be solved within a few years, culminating in the introduction of the temporal difference learning rule in the form we know it today: $\begin{aligned} \hat{v}(s_t) &\leftarrow \hat{v}(s_t) + \alpha \delta_t \\ \delta_t &= r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t), \end{aligned}$ where $\delta_t$ is called the temporal difference error, $\alpha$ is a learning rate that sets the size of each update, and $0 < \gamma < 1$ is called the temporal discount factor. Rather than pulling the value of one state towards the value of the next, as in the original temporal difference learning rule, this updated rule pulls the value of each state towards the immediate reward $r_{t+1}$ plus a scaled-down version of the value of the next state $\gamma \hat{v}(s_{t+1})$ . By explicitly separating value and reward, the new rule obviated the need for special fixed values for certain states. By pulling the value of each state towards slightly less than the value of the next (in the absence of an immediate reward), the new rule ensured that values would not run away to infinity even when animals could collect rewards on an ongoing basis.
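The stabilizing effect of the discount factor can be checked with a deliberately simple sketch: a single state that leads back to itself and yields the same reward on every time step. This toy environment, the function name `td_continuing`, and the parameter values are assumptions for illustration. Setting `gamma=1.0` here is not a reproduction of the 1981 rule; it only shows what the modern rule does without discounting.

```python
def td_continuing(gamma, alpha=0.1, reward=1.0, n_steps=2000):
    """Apply the TD update repeatedly to one self-looping, always-rewarded state."""
    v = 0.0
    for _ in range(n_steps):
        delta = reward + gamma * v - v  # next state is the same state
        v += alpha * delta
    return v


discounted = td_continuing(gamma=0.9)    # settles near reward / (1 - gamma) = 10
undiscounted = td_continuing(gamma=1.0)  # grows by alpha * reward every step
```

With $\gamma < 1$ the update has a fixed point at $r / (1 - \gamma)$, so the value levels off. With $\gamma = 1$ the error $\delta_t$ equals the reward on every step and the value keeps growing for as long as the simulation runs.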

The third problem, mentioned in the concluding remarks of Sutton and Barto's paper, was more serious: to what extent such a "primitive" learning rule could actually produce anything remotely useful was completely unknown. Decades later, after this learning rule had been used to train AI systems to beat the best human players at challenging games like Go, that question has been answered emphatically.