Asymmetric reward learning
The Rescorla–Wagner model
Since the days of Pavlov and Thorndike, neuroscientists and psychologists have been seeking to understand how animals learn to associate sensory stimuli with future rewards or punishments.
The Rescorla–Wagner model of associative learning provides a particularly simple and elegant description of this process. According to this model, the value $V$ attached to a stimulus is updated incrementally towards the size of the associated reward (or punishment) $R$:

$$V_{t+1} = V_t + \alpha \, (R_t - V_t)$$

where $\alpha$ is called the learning rate ($0 < \alpha < 1$). If the reward is random, repeatedly applying this rule causes the value to converge to the expected reward $\mathbb{E}[R]$ (equivalent to the mean), which can be seen by setting the expected update $\alpha \, (\mathbb{E}[R] - V)$ to zero.
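As a quick sanity check on this convergence claim, here is a minimal simulation (my own sketch, not from the original post; the reward distribution and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = 0.1                                           # learning rate
V = 0.0                                               # initial value estimate
rewards = rng.normal(loc=2.0, scale=1.0, size=5000)   # random rewards with mean 2

for R in rewards:
    V = V + alpha * (R - V)   # Rescorla-Wagner update towards the observed reward

print(f"learned value: {V:.2f}, mean reward: {rewards.mean():.2f}")
# The learned value ends up fluctuating around the mean reward (about 2.0).
```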
The connection between the Rescorla–Wagner model and the average reward can be seen another way if we explicitly consider how value and reward change over time. If we rewrite the model as

$$V_{t+1} = (1 - \alpha) \, V_t + \alpha \, R_t$$

we can expand the resulting recursion and show that value is an exponentially-weighted moving average of past rewards:

$$V_t = (1 - \alpha)^t \, V_0 + \alpha \sum_{k=0}^{t-1} (1 - \alpha)^k \, R_{t-1-k}$$

We can make it even clearer that this is an exponentially-weighted moving average by rewriting $(1 - \alpha)$ as $e^{-1/\tau}$ (where $\tau = -1 / \ln(1 - \alpha)$), such that the equation above becomes

$$V_t = e^{-t/\tau} \, V_0 + \alpha \sum_{k=0}^{t-1} e^{-k/\tau} \, R_{t-1-k}$$
I call $\tau$ the learning timescale, since it represents the amount of time it takes to (mostly) learn the correct value $\mathbb{E}[R]$.
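To double-check the algebra above, the snippet below (again my own sketch) compares the recursive form against the explicit exponentially-weighted sum, and reports the timescale $\tau$ implied by the learning rate:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.2
tau = -1.0 / np.log(1.0 - alpha)   # learning timescale implied by the learning rate
rewards = rng.normal(size=200)

# Recursive form: V_{t+1} = (1 - alpha) * V_t + alpha * R_t, starting from V_0 = 0
V = 0.0
for R in rewards:
    V = (1 - alpha) * V + alpha * R

# Explicit form: exponentially-weighted moving average of past rewards
t = len(rewards)
weights = alpha * (1 - alpha) ** np.arange(t)   # weight on R_{t-1-k}
V_ewma = np.dot(weights, rewards[::-1])         # most recent reward gets the largest weight

print(np.isclose(V, V_ewma))                    # True: the two forms agree when V_0 = 0
print(f"alpha = {alpha}, learning timescale tau ≈ {tau:.1f} steps")
```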
To build an intuition for how the learning rate $\alpha$ and the learning timescale $\tau$ relate to each other and how they affect the behaviour of the Rescorla–Wagner model, try adjusting them in the widget below.
Asymmetric learning rates
A common modification of the Rescorla–Wagner model is to use different learning rates $\alpha^+$ and $\alpha^-$ depending on whether the reward is above or below the current value:

$$V_{t+1} = \begin{cases} V_t + \alpha^+ \, (R_t - V_t) & \text{if } R_t > V_t \\ V_t + \alpha^- \, (R_t - V_t) & \text{if } R_t \le V_t \end{cases}$$
This type of modified model is often referred to in the literature as a reward learning model with asymmetric learning rates.
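Written as code, the modification is a one-line change to the update (the function below is my own illustrative sketch; the name and signature are hypothetical):

```python
def asymmetric_update(V, R, alpha_plus, alpha_minus):
    """One asymmetric Rescorla-Wagner step: use alpha_plus when the reward
    exceeds the current value, and alpha_minus when it falls short."""
    alpha = alpha_plus if R > V else alpha_minus
    return V + alpha * (R - V)
```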
Changing the Rescorla–Wagner rule in this way has two effects:
- If the reward suddenly changes, then how quickly value will adjust depends on whether the reward has gone up or down.
- If the reward is inherently random, then changing the positive and negative learning rates can shift the value to be above or below the expected reward.
The first effect is easy enough to visualize and understand, especially for non-random rewards like in the widget below.
The second effect is less intuitive (at least to me) because it relies on the reward randomly fluctuating around the value $V$, and it can be hard to mentally simulate the overall effect of these fluctuations. To make things easier, here I’ll focus on the average effect of a particularly simple type of random reward that is common in neuroscience: binary rewards.
If we assume that at each point in time the reward is either $1$ with probability $p$ or $0$ with probability $1 - p$, we can write an average-case version of the Rescorla–Wagner model with asymmetric learning rates as follows:

$$V_{t+1} = p \left[ V_t + \alpha^+ (1 - V_t) \right] + (1 - p) \left[ V_t + \alpha^- (0 - V_t) \right]$$

(To derive this equation, take the expected value of $V_{t+1}$, then use linearity of expectation and the law of total expectation to simplify the result.) The equation above says that at each time step, we have a $p$ chance of getting a reward of $1$ and incrementing the value towards one using the positive learning rate $\alpha^+$, and a $1 - p$ chance of getting a reward of $0$ and incrementing in that direction using $\alpha^-$, so we can find the average behaviour by taking the average of these two options.
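Setting $V_{t+1} = V_t = V^*$ in this average-case equation (the same fixed-point argument used earlier) shows where the value settles on average:

$$V^* = V^* + p \, \alpha^+ (1 - V^*) - (1 - p) \, \alpha^- V^* \quad \Longrightarrow \quad V^* = \frac{p \, \alpha^+}{p \, \alpha^+ + (1 - p) \, \alpha^-}$$

When $\alpha^+ = \alpha^-$ this reduces to $V^* = p$, the expected reward; making $\alpha^+$ larger than $\alpha^-$ pushes the value above the mean, and vice versa.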
To build an intuition for the effect of asymmetric learning rates in the context of binary rewards, try adjusting the parameters in the widget below.
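If you prefer simulation to sliders, the sketch below (my own, with arbitrary parameter values) runs the asymmetric rule on binary rewards and compares the long-run value to the fixed point derived above:

```python
import numpy as np

rng = np.random.default_rng(2)

p, alpha_plus, alpha_minus = 0.5, 0.3, 0.1   # reward probability and asymmetric learning rates
V = 0.5
values = []
for _ in range(20000):
    R = float(rng.random() < p)              # binary reward: 1 with probability p, else 0
    alpha = alpha_plus if R > V else alpha_minus
    V += alpha * (R - V)
    values.append(V)

empirical = np.mean(values[5000:])           # long-run average after a burn-in period
predicted = p * alpha_plus / (p * alpha_plus + (1 - p) * alpha_minus)
print(f"simulated value ≈ {empirical:.3f}, predicted fixed point = {predicted:.3f}")
# With alpha_plus > alpha_minus, the value settles above the mean reward p = 0.5.
```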
Asymmetric learning rates in the dopamine system
The fact that asymmetric learning rates cause value to converge to something above or below the mean reward is not necessarily a bad thing. Learning the best- and worst-case scenarios in addition to the average case covered by the vanilla Rescorla–Wagner model can be very helpful for risk-based decision-making, and asymmetric learning rates are a simple way to accomplish this. In fact, an influential paper from Dabney et al. makes the case that dopamine neurons broadcast asymmetric Rescorla–Wagner updates with different positive and negative learning rates, allowing a downstream system to learn the whole shape of the reward distribution rather than just the mean.