Emerson Harkin

Asymmetric reward learning

July 28, 2024

The Rescorla–Wagner model

Since the days of Pavlov and Thorndike, neuroscientists and psychologists have been seeking to understand how animals learn to associate sensory stimuli with future rewards or punishments.

The Rescorla–Wagner model of associative learning provides a particularly simple and elegant description of this process. According to this model, the value attached to a stimulus $v$ is updated incrementally towards the size of the associated reward (or punishment) $r$

$$v \leftarrow v + \underbrace{\alpha (r - v)}_\text{update},$$

where $\alpha$ is called the learning rate ($0 \leq \alpha \leq 1$). If the reward is random, repeatedly applying this rule causes the value to converge to the expected reward $\mathbb{E}[r]$ (equivalent to the mean), which can be seen by setting the expected update to zero.
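As a quick sanity check, here is a minimal Python sketch (the parameter values and reward distribution are arbitrary choices, not anything from a real experiment): simulating the update rule on noisy rewards shows the value settling near the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 0.1                                           # learning rate
v = 0.0                                               # initial value
rewards = rng.normal(loc=2.0, scale=1.0, size=5000)   # noisy rewards with mean 2

for r in rewards:
    v += alpha * (r - v)                              # Rescorla-Wagner update

print(v, rewards.mean())                              # v ends up close to the average reward
```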

The connection between the Rescorla–Wagner model and the average reward can be seen another way if we explicitly consider how value and reward change over time. If we rewrite the model as

$$v_t = \alpha r_{t-1} + (1 - \alpha) v_{t-1},$$

we can expand the resulting recursion and show that value is an exponentially-weighted moving average of past rewards:

$$v_t = \alpha \sum_{i=0}^\infty (1 - \alpha)^i r_{t-i-1}.$$

We can make it even clearer that this is an exponentially-weighted moving average by rewriting $1 - \alpha$ as $e^{-1/\tau}$ (where $\tau = -1 / \ln(1-\alpha)$), such that the equation above becomes

$$v_t = \left(1 - e^{-1/\tau}\right) \sum_{i=0}^\infty e^{-i/\tau} r_{t-i-1}.$$

I call $\tau$ the learning timescale, since it represents the amount of time it takes to (mostly) learn the correct value $v_t$.
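To convince yourself that the recursive update really is this exponentially-weighted average, here is a throwaway sketch (again with arbitrary parameters) comparing the two forms numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.2
rewards = rng.random(200)

# Recursive form: v_t = alpha * r_{t-1} + (1 - alpha) * v_{t-1}, starting from v = 0
v = 0.0
for r in rewards:
    v = alpha * r + (1 - alpha) * v

# Closed form: exponentially-weighted sum over past rewards (most recent weighted most)
weights = alpha * (1 - alpha) ** np.arange(len(rewards))
v_closed = np.sum(weights * rewards[::-1])

print(v, v_closed)  # identical (starting from v = 0 makes the finite sum exact)
```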

To build an intuition for how the learning rate $\alpha$ and learning timescale $\tau$ relate to each other and how they affect the behaviour of the Rescorla–Wagner model, try adjusting them in the widget below.


Asymmetric learning rates

A common modification of the Rescorla–Wagner model is to use different learning rates $\alpha_+$ and $\alpha_-$ depending on whether the reward $r$ is above or below the current value $v$

$$v_t = \begin{cases} v_{t-1} + \alpha_+ (r_{t-1} - v_{t-1}) & \text{if $r \geq v$,} \\ v_{t-1} + \alpha_- (r_{t-1} - v_{t-1}) & \text{if $r < v$.} \end{cases}$$

This type of modified model is often referred to in the literature as a reward learning model with asymmetric learning rates.
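In code, the asymmetric rule is a one-line change to the update. Here is a minimal sketch (the function name and default learning rates are placeholders, not anything standard):

```python
def asymmetric_update(v, r, alpha_pos=0.2, alpha_neg=0.05):
    """One step of the Rescorla-Wagner rule with asymmetric learning rates."""
    alpha = alpha_pos if r >= v else alpha_neg  # pick a learning rate based on the sign of the error
    return v + alpha * (r - v)
```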

Changing the Rescorla–Wagner rule in this way has two effects:

  1. If the reward suddenly changes, then how quickly the value adjusts depends on whether the reward went up or down.
  2. If the reward is inherently random, then making the positive and negative learning rates unequal shifts the value above or below the expected reward.

The first effect is easy enough to visualize and understand, especially for non-random rewards like in the widget below.
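If you prefer code to widgets, the same experiment can be sketched in a few lines (the reward schedule and learning rates here are arbitrary choices): a constant reward that turns on and then off, learned with a larger positive than negative learning rate.

```python
import numpy as np

rewards = np.concatenate([np.ones(50), np.zeros(50)])  # reward turns on, then off
v, trace = 0.0, []
for r in rewards:
    alpha = 0.2 if r >= v else 0.05  # alpha_+ = 0.2, alpha_- = 0.05
    v += alpha * (r - v)
    trace.append(v)

# trace rises quickly while the reward is on (governed by alpha_+)
# and decays slowly after it is removed (governed by alpha_-)
```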



The second effect is less intuitive (at least to me) because it relies on the reward $r_t$ randomly fluctuating around the value $v_t$, and it can be hard to mentally simulate the overall effect of these fluctuations. To make things easier, here I’ll focus on the average effect of a particularly simple type of random reward that is common in neuroscience: binary rewards.

If we assume that at each point in time the reward is either $r=1$ with probability $p$ or $r=0$ with probability $1-p$, we can write an average-case version of the Rescorla–Wagner model with asymmetric learning rates as follows

$$\mathbb{E}[v_t] = \mathbb{E}[v_{t-1}] + \underbrace{\alpha_+ (1 - \mathbb{E}[v_{t-1}]) p + \alpha_- (0 - \mathbb{E}[v_{t-1}]) (1 - p)}_\text{update}.$$

(To derive this equation, take the expected value of $v_t - v_{t-1}$, then use linearity of expectation and the law of total expectation, together with the fact that the reward on each trial is independent of the current value, to simplify the result.) The equation above says that at each time step, we have a $p$ chance of getting a reward of $1$ and incrementing the value towards one using the positive learning rate $\alpha_+$, and a $1 - p$ chance of getting a reward of $0$ and decrementing it towards zero using $\alpha_-$, so we can find the average behaviour by taking the average of these two options.
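Setting the update term above to zero gives the value that this average-case model settles at:

$$\mathbb{E}[v_\infty] = \frac{\alpha_+ p}{\alpha_+ p + \alpha_- (1 - p)},$$

which equals the expected reward $p$ when $\alpha_+ = \alpha_-$, sits above it when $\alpha_+ > \alpha_-$, and sits below it when $\alpha_+ < \alpha_-$.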

To build an intuition for the effect of asymmetric learning rates in the context of binary rewards, try adjusting the parameters in the widget below.



Asymmetric learning rates in the dopamine system

The fact that asymmetric learning rates cause value to converge to something above or below the mean reward is not necessarily a bad thing. Learning the best- and worst-case scenarios in addition to the average case covered by the vanilla Rescorla–Wagner model can be very helpful for risk-based decision-making, and asymmetric learning rates are a simple way to accomplish this. In fact, an influential paper from Dabney et al. makes the case that dopamine neurons broadcast asymmetric Rescorla–Wagner updates with different positive and negative learning rates, allowing a downstream system to learn the whole shape of the reward distribution rather than just the mean.
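As a rough illustration of that idea (this is my own toy sketch, not the model from the paper), a small population of value units that see the same reward stream but differ in how positive-leaning their updates are will settle at different points of the reward distribution, collectively capturing its shape rather than just its mean.

```python
import numpy as np

rng = np.random.default_rng(2)
rewards = rng.exponential(scale=1.0, size=20000)  # a skewed reward distribution

# Each "unit" has its own asymmetry: the fraction alpha_+ / (alpha_+ + alpha_-)
asymmetries = [0.1, 0.3, 0.5, 0.7, 0.9]
values = np.zeros(len(asymmetries))
base_rate = 0.01

for r in rewards:
    for i, a in enumerate(asymmetries):
        alpha = base_rate * (a if r >= values[i] else 1 - a)
        values[i] += alpha * (r - values[i])

# Positive-leaning units (large asymmetry) settle above the mean reward, negative-leaning
# ones below it; together they trace out the shape of the reward distribution.
print(dict(zip(asymmetries, values.round(2))))
```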