Glossary

Important concepts

Concept	Description	Example
Reward	Something we assume animals want.	Food.
Reward prediction	Subjective expectation of reward associated with a stimulus. See value.	Upon hearing the ticking of a metronome, Pavlov's dogs expected to receive food.
Reward prediction error	Difference between observed and expected reward. See temporal difference error.	If Pavlov started his metronome and skipped feeding his dogs, this would produce a negative reward prediction error.
Temporal discounting	The phenomenon of treating a fixed reward as though its worth decreases the longer an animal has to wait to obtain it. See temporal discount factor.	A child might choose one marshmallow now over two in half an hour due to temporal discounting.
Episodic task	A type of reinforcement learning problem structured around "start task", "complete task", "reset" cycles.	A trial-structured experiment where each trial is considered an episode and inter-trial intervals are ignored.
Continuing task	An ongoing reinforcement learning problem with no explicit reset.	Unstructured behaviour, or a trial-structured experiment modeled as an ongoing sequence of trials and inter-trial intervals.

Formal definitions

Term	Description	Discrete time notation	Continuous time notation
Reward	Something we assume animals want to maximize.	$r_t$	$r(t)$
Temporal discount parameter	Parameter describing how quickly the worth of a reward decreases as a function of delay.	$\gamma : 0 \leq \gamma < 1$	$\tau = -1/\ln\gamma$
Return	Cumulative future reward, which we assume animals seek to maximize.	$g_t = \sum_{i=0}^\infty \gamma^i r_{t+i+1}$	$g(t) = \int_0^\infty e^{-z/\tau} r(t+z) \, dz$
True value	Statistical expectation of the return following a particular state. Informally, true average future reward for a given set of circumstances.	$v(s_t) = \mathbb{E}[g_t]$	$v(s(t)) = \mathbb{E}[g(t)]$
Value	Estimate of the return following a particular state.	$\hat{v}(s_t)$	$\hat{v}(s(t))$
State	All aspects of the animal's context that could be used to define its location in a task. For example, position on a linear track, or time since a stimulus was presented.	$s_t$	$s(t)$
Bellman equation	Relationship between the true value of one state and the true value of the next implied by the definition of true value as the expected return.	$v(s_t) = \mathbb{E}[r_{t+1} + \gamma v(s_{t+1})]$	$\frac{1}{\tau} v(s(t)) = \mathbb{E}\left[r(t) + \frac{d}{dt} v(s(t))\right]$
Temporal difference (TD) error	Estimate of the error in the current values with respect to the Bellman equation.	$\delta_t = r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t)$	$\delta(t) = r(t) - \frac{1}{\tau} \hat{v}(s(t)) + \frac{d}{dt} \hat{v}(s(t))$
Undiscounted TD error	Special case of TD error without temporal discounting, often used as an approximation of dopamine neuron activity. Not suitable for continuing tasks.	$\delta_t = r_{t+1} + \hat{v}(s_{t+1}) - \hat{v}(s_t)$	$\delta(t) = r(t) + \frac{d}{dt} \hat{v}(s(t))$
Average reward TD error	Type of undiscounted TD error suitable for continuing tasks.	$\delta_t = r_{t+1} - \rho + \hat{v}(s_{t+1}) - \hat{v}(s_t)$
Reward rate	Average reward per time step in a continuing task.	$\rho$
Learning rate	How quickly value is updated in response to TD errors.	$\alpha$	$\alpha$
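
The discrete time definitions above can be tied together in a short sketch. The Python below is illustrative only: the three-state deterministic chain, the function names, and the parameter values are all assumptions made for this example and are not taken from any particular experiment or library. It computes the return $g_t$ directly from its definition, computes the TD error $\delta_t$, and uses $\delta_t$ scaled by the learning rate $\alpha$ to update the value estimates $\hat{v}$.

```python
def discounted_return(rewards, gamma):
    """g_t = sum_i gamma^i * r_{t+i+1}, where rewards[i] plays the role of r_{t+i+1}."""
    return sum(gamma**i * r for i, r in enumerate(rewards))


def td_error(r_next, v_next, v_current, gamma):
    """delta_t = r_{t+1} + gamma * v_hat(s_{t+1}) - v_hat(s_t)."""
    return r_next + gamma * v_next - v_current


def run_td0(rewards, gamma=0.9, alpha=0.1, n_episodes=2000):
    """Tabular value learning on a hypothetical deterministic chain (episodic task).

    State s always moves to state s + 1 and yields rewards[s]. The episode
    ends after the last reward, and the terminal state's value is held at 0.
    """
    n_states = len(rewards)
    v_hat = [0.0] * (n_states + 1)  # final entry is the terminal state
    for _ in range(n_episodes):
        for s in range(n_states):
            delta = td_error(rewards[s], v_hat[s + 1], v_hat[s], gamma)
            v_hat[s] += alpha * delta  # learning rate scales the update
    return v_hat[:-1]


# Assumed example: reward of 1 delivered only on the last step of the chain.
rewards = [0.0, 0.0, 1.0]
values = run_td0(rewards)

# Because the chain is deterministic, the true value of each state is
# simply the return that follows it.
true_values = [discounted_return(rewards[s:], 0.9) for s in range(len(rewards))]
```

In this deterministic setting the learned values approach the true values (the returns), at which point the TD errors shrink toward zero. With stochastic rewards or transitions, the estimates would instead fluctuate around the expected return.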

Units

In discrete time, all quantities are in the same units as the reward $r_t$, except for $\gamma$ and $\alpha$, which are unitless. For example, if the reward is an amount of food measured in grams, then the value $\hat{v}(s_t)$ and TD error $\delta_t$ are both also in grams. In continuous time, however, $r(t)$ and $\delta(t)$ are both rates in units of reward per unit time (e.g., grams of food per second), while the value $\hat{v}(s(t))$ and related quantities are in units of reward (e.g., grams of food). The discount timescale $\tau$ is in units of time, and the learning rate $\alpha$ is unitless, as in the discrete case.
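
The unit bookkeeping can be checked numerically. The sketch below assumes a time step of duration `dt` (a symbol not used in the tables above, where $\tau = -1/\ln\gamma$ implicitly measures time in steps), so that $\gamma = e^{-\Delta t/\tau}$ and the reward delivered in one step is $r(t)\,\Delta t$. The value function `v_hat`, the constant reward rate, and all numbers are hypothetical choices for illustration. Under these assumptions, the discrete TD error divided by `dt` should approach the continuous TD error as `dt` shrinks, which matches the change in units from reward to reward per unit time.

```python
import math


def gamma_to_tau(gamma, dt):
    """Discount timescale tau, in the same time units as dt (dt = 1 gives -1/ln(gamma))."""
    return -dt / math.log(gamma)


def tau_to_gamma(tau, dt):
    """Per-step discount factor for a step of duration dt."""
    return math.exp(-dt / tau)


def discrete_td_error(r_step, v_next, v_current, gamma):
    """delta_t = r_{t+1} + gamma * v_hat(s_{t+1}) - v_hat(s_t), in units of reward."""
    return r_step + gamma * v_next - v_current


def continuous_td_error(r, v, dv_dt, tau):
    """delta(t) = r(t) - v_hat / tau + d v_hat / dt, in units of reward per time."""
    return r - v / tau + dv_dt


def v_hat(t):
    """Hypothetical smooth value estimate (grams) as a function of time (seconds)."""
    return t**2


tau = 2.0                # seconds (assumed)
dt = 1e-4                # seconds per step (assumed)
reward_per_second = 1.0  # grams per second (assumed constant)
t = 1.0

gamma = tau_to_gamma(tau, dt)
delta_discrete = discrete_td_error(
    reward_per_second * dt,  # grams delivered during one step
    v_hat(t + dt),
    v_hat(t),
    gamma,
)
delta_continuous = continuous_td_error(
    reward_per_second,
    v_hat(t),
    2 * t,  # analytic derivative of t**2
    tau,
)

# delta_discrete is in grams; dividing by dt gives grams per second,
# which is the quantity comparable to delta_continuous.
delta_discrete_rate = delta_discrete / dt
```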

Bibliographic and historical remarks

Sutton and Barto's Reinforcement Learning: An Introduction is the definitive text on discrete time RL, with the 2018 edition using the modern notation shown above. TD learning was first introduced as a model of animal conditioning by Sutton and Barto (Psychological Review, 1981). Sutton (Machine Learning, 1988) gives a detailed presentation of the theory of TD as a general learning method. The continuous time TD error given above is due to Doya (Neural Computation, 2000). A continuous time version of the discounted TD error with different scaling (value is normalized by a factor of $1/\tau$ and the learning rate is multiplied by $\tau$) was first introduced a few years earlier by Doya (NeurIPS, 1995). A derivation of the continuous time TD error as the small $\Delta t$ limit of the discrete time error is given by Mikhael et al. (Current Biology, 2022). For details on the average reward setting, see Mahadevan (Machine Learning, 1996).