Distributional reinforcement learning

Original citation

Dabney, Kurth-Nelson, Uchida, Starkweather, Hassabis, Munos, and Botvinick. "A Distributional Code for Value in Dopamine-Based Reinforcement Learning", Nature, 2020.

In classical reinforcement learning, the value of a state or stimulus is defined to be an average of the rewards that follow. However, in AI as in real life, it is sometimes useful to base decisions not on the likely average outcome, but on a best-case or worst-case scenario. Distributional RL tackles these cases using algorithms designed to learn the full distribution of possible outcomes — best case, worst case, and everything in between.

One approach to distributional learning is to build up a bank of reward learning systems that differ in how sensitive they are to positive versus negative reward prediction errors (RPEs). The learners most sensitive to positive prediction errors will eventually learn optimistic predictions, while those that are relatively more sensitive to negative prediction errors will learn pessimistic ones. To see this process in action, click "Play" below and adjust the learning rate asymmetry slider to control the level of optimism.
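The same process can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the function name `train_learners`, the asymmetry parameter `tau`, the base learning rate `base_lr`, and the uniform reward distribution are all assumptions chosen for the example.

```python
import random


def train_learners(rewards, taus, base_lr=0.02):
    """Return one learned value per asymmetry setting in `taus`.

    Each tau in (0, 1) is the share of the learning rate given to positive
    RPEs (an illustrative parameterization, not taken from the paper).
    """
    values = [0.0] * len(taus)
    for r in rewards:
        for i, tau in enumerate(taus):
            delta = r - values[i]  # reward prediction error (RPE)
            # Scale positive and negative RPEs by different learning rates.
            lr = base_lr * tau if delta > 0 else base_lr * (1.0 - tau)
            values[i] += lr * delta
    return values


rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical reward distribution: uniform between 0 and 10.
rewards = [rng.uniform(0.0, 10.0) for _ in range(20000)]

taus = [0.1, 0.5, 0.9]  # pessimistic, balanced, optimistic
values = train_learners(rewards, taus)
for tau, v in zip(taus, values):
    print(f"tau={tau:.1f} -> learned value {v:.2f}")
```

Under these assumptions the balanced learner should settle near the mean reward of 5, with the pessimistic learner below it and the optimistic learner above it. Together the three estimates give a coarse picture of the spread of outcomes.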

Increased sensitivity to positive RPEs leads to an optimism bias because the added weight given to unusually good outcomes compensates for their rarity. The value estimate drifts upwards until negative RPEs are frequent enough to balance out the inflated positive RPEs. To see how changes in learning rate asymmetry affect the balance between positive and negative RPEs, use the sliders below to adjust the degree of asymmetry and the hypothetical value estimate used to calculate the RPE. Notice that a higher value estimate increases the probability of negative RPEs (top), but that reducing the relative weight of those negative RPEs (middle) can still bring the average RPE to zero (bottom).
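The balance point can be written out explicitly. The following is a sketch that assumes each update is proportional to the RPE; the symbols $V$, $R$, $\alpha^{+}$, $\alpha^{-}$, and $\tau$ are notation introduced here for illustration, with $(x)_{+} = \max(x, 0)$.

```latex
% V: value estimate, R: random reward, \delta = R - V: the RPE
% \alpha^{+}, \alpha^{-}: learning rates for positive and negative RPEs
\begin{align*}
\Delta V &=
  \begin{cases}
    \alpha^{+}\,\delta & \text{if } \delta > 0 \\
    \alpha^{-}\,\delta & \text{if } \delta \le 0
  \end{cases} \\
\mathbb{E}[\Delta V] = 0
  &\iff \alpha^{+}\,\mathbb{E}\big[(R - V)_{+}\big]
      = \alpha^{-}\,\mathbb{E}\big[(V - R)_{+}\big] \\
  &\iff \tau\,\mathbb{E}\big[(R - V)_{+}\big]
      = (1 - \tau)\,\mathbb{E}\big[(V - R)_{+}\big],
  \qquad \tau = \frac{\alpha^{+}}{\alpha^{+} + \alpha^{-}}
\end{align*}
```

Raising $\tau$ shrinks the weight on the right-hand side, so $V$ has to rise until negative RPEs are frequent and large enough to restore the balance. The value satisfying the last line is known as the $\tau$-expectile of the reward distribution, and $\tau = 1/2$ recovers the ordinary mean.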

Distributional dopamine

As shown in the middle row of the widget above, the hinge-like differential scaling of positive and negative RPEs is a signature of one type of distributional RL algorithm. Inspired by their own fundamental AI research on distributional RL, in the late 2010s a group of DeepMind researchers decided to hunt for evidence of this signature in recordings of dopamine neuron activity collected by the Uchida lab. In 2020, they would publish their positive findings in Nature, catapulting dopaminergic heterogeneity into the spotlight.