Vigor and the opportunity cost of time
Have you ever downed cup after cup of coffee to meet a deadline? Or jumped for joy when your favourite team scored a winning goal? If you have, then the idea that dopamine is not just involved in learning, but is also a potent stimulant, might feel obvious to you. However, from the perspective of the RPE hypothesis, this dual role of dopamine in learning and behaviour is anything but obvious.
To explain this duality, Niv et al. (2005) proposed that slow changes in background dopamine concentrations in the striatum might serve a separate computational role from the fast spikes and dips that had previously been connected to learning. Specifically, they argued that the slow, so-called tonic part of the dopamine signal reflects the prevailing reward rate in the environment, which dictates how quickly animals should complete tasks that lie between themselves and reward. In short, the higher the background reward rate, the higher the opportunity cost of being slow, and therefore the faster animals should act.
The widget below shows this principle in action. The black circle on the left represents an animal that must travel a short distance and press a button to collect a reward. If the animal moves more quickly, it can collect rewards more frequently, pushing up its average reward rate. However, speed comes at a cost. As a result, the net reward rate peaks at a moderate travel time, indicated by the blue circle.
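The trade-off can also be sketched numerically. The snippet below is a minimal illustration, not the widget's actual code: it assumes that each trip yields a fixed reward, that the effort cost of a trip is inversely proportional to the travel time (`vigor_cost / tau`), and that there are no other delays. The function names and parameter values are hypothetical. Under these assumptions the net reward rate is `(reward - vigor_cost / tau) / tau`, which peaks at `tau = 2 * vigor_cost / reward`.

```python
def net_reward_rate(tau, reward=4.0, vigor_cost=1.0):
    """Net reward per unit time when every trip takes `tau` time units.

    Assumption: the effort cost of one trip is vigor_cost / tau, so
    faster trips (smaller tau) are more expensive.
    """
    return (reward - vigor_cost / tau) / tau


def best_travel_time(taus, reward=4.0, vigor_cost=1.0):
    """Return the candidate travel time with the highest net reward rate."""
    return max(taus, key=lambda tau: net_reward_rate(tau, reward, vigor_cost))


# Candidate travel times from 0.05 to 3.0 in steps of 0.05.
taus = [k / 20 for k in range(1, 61)]
tau_star = best_travel_time(taus)

print("best travel time on the grid:", tau_star)
print("closed-form optimum 2c/r:    ", 2 * 1.0 / 4.0)
print("net reward rate at optimum:  ", net_reward_rate(tau_star))
```

Raising `reward` in this sketch shortens the best travel time, which is the opportunity cost argument in miniature: the more reward there is to be had per unit time, the more expensive it is to be slow.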
Learning vigor
The normative model of Niv et al. prescribes how quickly animals should act given a particular set of reward opportunities and effort costs, but how could animals actually learn to act accordingly? One way is through a reward-rate-maximizing variant of TD learning known as R learning.
To see how a variant of R learning behaves in the button-pushing task shown above, adjust the sliders in the widget below. Notice how the travel times towards the button and towards the cheese converge to the same optimal value, as does the estimated reward rate.
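For readers who prefer code to sliders, here is a rough sketch of how an R-learning-style agent could learn its travel time. It is a simplification under stated assumptions, not the widget's implementation. It collapses the task into a single repeating state instead of separate legs towards the button and the cheese. The agent picks from a small, hypothetical set of candidate travel times. Each trip costs `vigor_cost / tau`. The TD error charges an opportunity cost of `rho * tau` for the time spent. All names and parameter values are illustrative.

```python
import random


def r_learning_vigor(latencies, reward=4.0, vigor_cost=1.0,
                     alpha=0.1, beta=0.05, epsilon=0.1,
                     n_steps=50_000, seed=0):
    """Learn a travel time with an average-reward (R-learning-style) rule.

    Returns the greedy travel time and the learned reward rate estimate.
    """
    rng = random.Random(seed)
    q = [0.0] * len(latencies)  # relative value of each travel time
    rho = 0.0                   # running estimate of the reward rate

    for _ in range(n_steps):
        greedy = max(range(len(q)), key=q.__getitem__)
        # Epsilon-greedy exploration over the candidate travel times.
        if rng.random() < epsilon:
            action = rng.randrange(len(q))
        else:
            action = greedy

        tau = latencies[action]
        payoff = reward - vigor_cost / tau  # reward minus effort cost

        # Undiscounted TD error: payoff, minus the opportunity cost of
        # the elapsed time, plus the change in (relative) value.
        delta = payoff - rho * tau + q[greedy] - q[action]
        q[action] += alpha * delta

        # As in R learning, update the reward rate on greedy steps only.
        if action == greedy:
            rho += beta * delta

    best = max(range(len(q)), key=q.__getitem__)
    return latencies[best], rho


tau_learned, rho_learned = r_learning_vigor([0.25, 0.5, 0.75, 1.0, 1.5, 2.0])
print("learned travel time:", tau_learned)
print("learned reward rate:", round(rho_learned, 3))
```

With these illustrative settings, the candidate travel time with the highest net reward rate is 0.5, which gives a net rate of 4 rewards per unit time. The greedy choice and the estimate `rho` should settle near those values.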
What happened to RPE?
While the main focus of Niv et al. is on the normative role of a dopaminergic reward rate signal rather than its origins, they offer some speculation as to how it might arise.
One particularly simple possibility is that the tonic reward rate component of the signal is the result of accumulating undiscounted phasic TD errors over time. To see the relationship between TD error and reward rate, notice that the continuous-time undiscounted TD error is the reward plus the rate of change of value, $\delta(t) = r(t) + \dot{V}(t)$. Accordingly, the average TD error over a time window of duration $T$ is the average reward plus the average change in value over the corresponding period:
$$\frac{1}{T}\int_0^T \delta(t)\,dt = \frac{1}{T}\int_0^T r(t)\,dt + \frac{V(T) - V(0)}{T}.$$
Provided that the value does not run off towards infinity, in the long run the average change in value is zero, leaving the average TD error equal to the reward rate.
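The argument is easy to check numerically in discrete time, where the undiscounted TD error at step t is the reward plus the change in value, `r[t] + V[t + 1] - V[t]`. The sketch below uses an arbitrary reward schedule (one reward every five steps) and an arbitrary bounded value trace (a sine wave). Both are illustrative assumptions chosen only because the value stays bounded. The value differences telescope, so the average TD error differs from the reward rate by `(V[T] - V[0]) / T`, which shrinks as the window grows.

```python
import math


def average_td_error(rewards, values):
    """Mean undiscounted TD error, delta_t = r_t + V_{t+1} - V_t."""
    deltas = [r + values[t + 1] - values[t] for t, r in enumerate(rewards)]
    return sum(deltas) / len(deltas)


def demo(n_steps):
    """Return (average TD error, reward rate) over a window of n_steps."""
    # Illustrative inputs: a reward on every fifth step and a bounded,
    # oscillating value trace.
    rewards = [1.0 if t % 5 == 0 else 0.0 for t in range(n_steps)]
    values = [math.sin(0.3 * t) for t in range(n_steps + 1)]
    reward_rate = sum(rewards) / n_steps
    return average_td_error(rewards, values), reward_rate


for n_steps in (10, 100, 10_000):
    avg_delta, reward_rate = demo(n_steps)
    print(n_steps, round(avg_delta, 4), round(reward_rate, 4))
```

The gap between the two printed columns is bounded by the range of the value trace divided by the window length, so it vanishes for long windows regardless of how the value fluctuates in between.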