Dopamine

Glossary

Important concepts

| Concept | Description | Example |
| --- | --- | --- |
| Reward | Something we assume animals want. | Food. |
| Reward prediction | Subjective expectation of reward associated with a stimulus. See value. | Upon hearing the ticking of a metronome, Pavlov's dogs expected to receive food. |
| Reward prediction error | Difference between observed and expected reward. See temporal difference error. | If Pavlov started his metronome and skipped feeding his dogs, this would produce a negative reward prediction error. |
| Temporal discounting | The phenomenon of treating a fixed reward as though its worth decreases the longer an animal has to wait to obtain it. See temporal discount parameter. | A child might choose one marshmallow now over two in half an hour due to temporal discounting. |
| Episodic task | A type of reinforcement learning problem structured around "start task", "complete task", "reset" cycles. | A trial-structured experiment where each trial is considered an episode and inter-trial intervals are ignored. |
| Continuing task | An ongoing reinforcement learning problem with no explicit reset. | Unstructured behaviour, or a trial-structured experiment modeled as an ongoing sequence of trials and inter-trial intervals. |
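The marshmallow example can be made concrete with a short sketch. It assumes exponential discounting with a per-minute discount factor; the function name `discounted_worth` and both discount factors are hypothetical values chosen for illustration, not measurements.

```python
def discounted_worth(reward: float, delay_steps: int, gamma: float) -> float:
    """Worth of `reward` received after `delay_steps` time steps.

    Assumes exponential discounting: the worth shrinks by a factor
    `gamma` for every step the animal has to wait.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    return (gamma ** delay_steps) * reward


# Hypothetical per-minute discount factors for an impatient and a patient child.
IMPATIENT_GAMMA = 0.97
PATIENT_GAMMA = 0.99

for label, gamma in [("impatient", IMPATIENT_GAMMA), ("patient", PATIENT_GAMMA)]:
    one_now = discounted_worth(1.0, delay_steps=0, gamma=gamma)
    two_later = discounted_worth(2.0, delay_steps=30, gamma=gamma)
    choice = "one now" if one_now > two_later else "two later"
    print(f"{label}: one now = {one_now:.2f}, two later = {two_later:.2f} -> {choice}")
```

Under these assumed values, the steeper discounter prefers the single immediate marshmallow, while the shallower discounter waits for two.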

Formal definitions

| Term | Description | Discrete time notation | Continuous time notation |
| --- | --- | --- | --- |
| Reward | Something we assume animals want to maximize. | $r_t$ | $r(t)$ |
| Temporal discount parameter | Parameter describing how quickly the worth of a reward decreases as a function of delay. | $\gamma : 0 \leq \gamma < 1$ | $\tau = -1/\ln\gamma$ |
| Return | Cumulative future reward, which we assume animals seek to maximize. | $g_t = \sum_{i=0}^\infty \gamma^i r_{t+i+1}$ | $g(t) = \int_0^\infty e^{-z/\tau} r(t+z) \, dz$ |
| True value | Statistical expectation of the return following a particular state. Informally, true average future reward for a given set of circumstances. | $v(s_t) = \mathbb{E}[g_t]$ | $v(s(t)) = \mathbb{E}[g(t)]$ |
| Value | Estimate of the return following a particular state. | $\hat{v}(s_t)$ | $\hat{v}(s(t))$ |
| State | All aspects of the animal's context that could be used to define its location in a task. For example, position on a linear track, or time since a stimulus was presented. | $s_t$ | $s(t)$ |
| Bellman equation | Relationship between the true value of one state and the true value of the next implied by the definition of true value as the expected return. | $v(s_t) = \mathbb{E}[r_{t+1} + \gamma v(s_{t+1})]$ | $\frac{1}{\tau} v(s(t)) = \mathbb{E}\left[r(t) + \frac{d}{dt} v(s(t))\right]$ |
| Temporal difference (TD) error | Estimate of the error in the current values with respect to the Bellman equation. | $\delta_t = r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t)$ | $\delta(t) = r(t) - \frac{1}{\tau} \hat{v}(s(t)) + \frac{d}{dt} \hat{v}(s(t))$ |
| Undiscounted TD error | Special case of TD error without temporal discounting. This version is found in Sutton and Barto's original 1981 paper and in many dopamine models. Not suitable for continual learning. | $\delta_t = r_{t+1} + \hat{v}(s_{t+1}) - \hat{v}(s_t)$ | $\delta(t) = r(t) + \frac{d}{dt} \hat{v}(s(t))$ |
| Average reward TD error | Type of undiscounted TD error suitable for continual learning. | $\delta_t = r_{t+1} - \rho + \hat{v}(s_{t+1}) - \hat{v}(s_t)$ | |
| Reward rate | Average reward per time step in a continuing task. | $\rho$ | |
| Learning rate | How quickly value is updated in response to TD errors. | $\alpha$ | $\alpha$ |
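To show how the discrete-time definitions fit together, here is a minimal sketch of tabular TD learning on a short episodic task. It assumes the standard update $\hat{v}(s_t) \leftarrow \hat{v}(s_t) + \alpha \delta_t$, which the table only implies through its description of the learning rate. The function `run_episode`, the five-state layout (time steps since a cue, with a reward on leaving the last state), and the parameter values are all hypothetical choices for illustration.

```python
import numpy as np


def run_episode(rewards, values, gamma, alpha):
    """Run one episode of tabular TD learning, updating `values` in place.

    rewards[t] plays the role of r_{t+1}: the reward received on leaving state t.
    values[t] plays the role of v_hat(s_t). The value after the final state
    is taken to be zero because the task is episodic.
    Returns the TD error delta_t observed at each step.
    """
    n_states = len(values)
    deltas = np.zeros(n_states)
    for t in range(n_states):
        v_next = values[t + 1] if t + 1 < n_states else 0.0
        deltas[t] = rewards[t] + gamma * v_next - values[t]  # TD error
        values[t] += alpha * deltas[t]                       # assumed update rule
    return deltas


gamma, alpha = 0.9, 0.1            # hypothetical discount factor and learning rate
n_states = 5                       # time steps between cue and reward
rewards = np.zeros(n_states)
rewards[-1] = 1.0                  # one unit of reward at the end of each trial

values = np.zeros(n_states)
for _ in range(2000):
    run_episode(rewards, values, gamma, alpha)

# True value of each state for this deterministic task: the discounted return.
true_values = gamma ** np.arange(n_states - 1, -1, -1)

# Omit the reward on a probe trial (on a copy, so the learned values are kept).
omission_deltas = run_episode(np.zeros(n_states), values.copy(), gamma, alpha)

print("learned values:    ", np.round(values, 3))
print("true values:       ", np.round(true_values, 3))
print("omission TD errors:", np.round(omission_deltas, 3))
```

After training, the learned values should sit close to the true values, so the TD errors on a rewarded trial are near zero. On the omission trial the TD error is near zero until the step where the reward was expected, where it becomes negative, matching the negative reward prediction error described for Pavlov's skipped feeding.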

Note about units: In discrete time, all quantities are in the same units as the reward $r_t$, except for $\gamma$ and $\alpha$, which are unitless. For example, if the reward is an amount of food measured in grams, then the value $\hat{v}(s_t)$ and TD error $\delta_t$ are both also in grams. However, in continuous time, $r(t)$ and $\delta(t)$ are both rates in units of reward per time interval (e.g., grams of food per second), while the value $\hat{v}(s(t))$ and related quantities are in units of reward (e.g., grams of food), the discount timescale $\tau$ is in units of time, and the learning rate $\alpha$ is unitless as in the discrete case.
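The unit bookkeeping can be checked numerically. The sketch below assumes a constant reward rate and time bins of width `dt`, with a per-bin discount factor $\gamma = e^{-\Delta t/\tau}$; this generalizes the relation $\tau = -1/\ln\gamma$, which corresponds to a bin width of one time unit. The function `discretized_return` and the numbers used are hypothetical.

```python
import math


def discretized_return(reward_rate: float, tau: float, dt: float, horizon: float) -> float:
    """Approximate the continuous-time return for a constant reward rate.

    reward_rate: reward per unit time (e.g. grams per second)
    tau:         discount timescale (e.g. seconds)
    dt:          width of each time bin (same time unit as tau)
    horizon:     how far into the future to sum (same time unit as tau)
    The result is in units of reward (e.g. grams).
    """
    gamma = math.exp(-dt / tau)          # unitless per-bin discount factor
    reward_per_bin = reward_rate * dt    # (grams / second) * seconds = grams
    n_bins = int(round(horizon / dt))
    return sum(gamma ** i * reward_per_bin for i in range(n_bins))


reward_rate = 2.0   # hypothetical: grams of food per second
tau = 5.0           # hypothetical: seconds

# For a constant rate, the integral of exp(-z / tau) * reward_rate is reward_rate * tau.
exact = reward_rate * tau

for dt in (1.0, 0.1, 0.01):
    approx = discretized_return(reward_rate, tau, dt, horizon=200.0)
    print(f"dt = {dt:>4}: discrete return = {approx:.3f} g, continuous return = {exact:.3f} g")
```

As the bins shrink, the discrete return (a sum of per-bin rewards in grams) approaches the continuous return (a rate in grams per second multiplied by a timescale in seconds), so both end up in grams.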