| Reward | Something we assume animals want to maximize. | $r_t$ | $r(t)$ |
| Temporal discount parameter | Parameter describing how quickly the worth of a reward decreases as a function of delay. | $\gamma : 0 \le \gamma < 1$ | $\tau = -1/\ln\gamma$ |
| Return | Cumulative future reward, which we assume animals seek to maximize. | $g_t = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i+1}$ | $g(t) = \int_{0}^{\infty} e^{-z/\tau}\, r(t+z)\, dz$ |
| True value | Statistical expectation of the return following a particular state. Informally, true average future reward for a given set of circumstances. | $v(s_t) = \mathbb{E}[g_t]$ | $v(s(t)) = \mathbb{E}[g(t)]$ |
| Value | Estimate of the return following a particular state. | $\hat{v}(s_t)$ | $\hat{v}(s(t))$ |
| State | All aspects of the animal's context that could be used to define its location in a task. For example, position on a linear track, or time since a stimulus was presented. | $s_t$ | $s(t)$ |
| Bellman equation | Relationship between the true value of one state and the true value of the next, implied by the definition of true value as the expected return. | $v(s_t) = \mathbb{E}[r_{t+1} + \gamma v(s_{t+1})]$ | $\frac{1}{\tau} v(s(t)) = \mathbb{E}\left[r(t) + \frac{d}{dt} v(s(t))\right]$ |
| Temporal difference (TD) error | Estimate of the error in the current values with respect to the Bellman equation. | $\delta_t = r_{t+1} + \gamma \hat{v}(s_{t+1}) - \hat{v}(s_t)$ | $\delta(t) = r(t) - \frac{1}{\tau} \hat{v}(s(t)) + \frac{d}{dt} \hat{v}(s(t))$ |
| Undiscounted TD error | Special case of TD error without temporal discounting. This version is found in Sutton and Barto's original 1981 paper and in many dopamine models. Not suitable for continual learning. | $\delta_t = r_{t+1} + \hat{v}(s_{t+1}) - \hat{v}(s_t)$ | $\delta(t) = r(t) + \frac{d}{dt} \hat{v}(s(t))$ |
| Average reward TD error | Type of undiscounted TD error suitable for continual learning. | $\delta_t = r_{t+1} - \rho + \hat{v}(s_{t+1}) - \hat{v}(s_t)$ | |
| Reward rate | Average reward per time step in a continuing task. | $\rho$ | |
| Learning rate | How quickly value is updated in response to TD errors. | $\alpha$ | $\alpha$ |
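
The discrete-time definitions in the table can be tied together with a small numerical sketch. The example below is illustrative only and is not taken from the source: it assumes a hypothetical deterministic linear track with `n_states` positions and a single reward of 1 on leaving the last position, and the function names (`time_constant`, `discounted_return`, `td_error`, `run_td0`) are invented for this illustration. It computes the return $g_t$, the TD error $\delta_t$, and applies the tabular update $\hat{v}(s_t) \leftarrow \hat{v}(s_t) + \alpha \delta_t$.

```python
import math


def time_constant(gamma):
    """Continuous-time discount constant tau = -1 / ln(gamma), for 0 < gamma < 1."""
    return -1.0 / math.log(gamma)


def discounted_return(rewards, gamma):
    """Return g_t = sum_i gamma^i * r_{t+i+1}, given rewards [r_{t+1}, r_{t+2}, ...]."""
    g = 0.0
    # Accumulate backwards so each step applies one factor of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def td_error(r_next, v_next, v_curr, gamma):
    """Discrete-time TD error: delta_t = r_{t+1} + gamma * v(s_{t+1}) - v(s_t)."""
    return r_next + gamma * v_next - v_curr


def run_td0(n_states=5, gamma=0.9, alpha=0.1, n_episodes=2000):
    """Tabular TD(0) on an assumed deterministic track: state i always moves to i + 1.

    A reward of 1 is delivered on leaving the last state; the terminal state
    (index n_states) has value 0 and is never updated.
    """
    v_hat = [0.0] * (n_states + 1)
    for _ in range(n_episodes):
        for s in range(n_states):
            r_next = 1.0 if s == n_states - 1 else 0.0
            delta = td_error(r_next, v_hat[s + 1], v_hat[s], gamma)
            v_hat[s] += alpha * delta  # learning rate scales the correction
    return v_hat[:n_states]


if __name__ == "__main__":
    gamma = 0.9
    print("tau:", time_constant(gamma))
    print("return from first state:", discounted_return([0, 0, 0, 0, 1], gamma))
    print("learned values:", run_td0(gamma=gamma))
```

Because the assumed track is deterministic, the true value of each state equals the return from that state, so the learned values should approach $\gamma^{k}$, where $k$ is the number of steps remaining before the rewarded transition. Setting `gamma=1.0` in `td_error` gives the undiscounted TD error from the table.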