Post-Hoc Approaches to Interpreting Reinforcement Learning Agents
The field of interpretability seeks to explain the behaviour of neural networks. Post-hoc interpretability is the sub-field that seeks to explain the behaviour of models without altering them to make such an explanation easier (in this post, I will use "interpretability" to mean post-hoc interpretability, though it should be noted that there are many other, equally important, approaches to interpreting neural networks). In the course of writing my thesis, I read a lot of papers on interpreting RL agents, and this post is a quick summary of work in this field. The aim is to give a quick overview of some common approaches to interpreting RL agents, along with examples of papers applying each approach for anyone interested in reading further. As such, this post is not a complete literature review, and I am sure I have missed some important pieces of work.
Most modern methods for interpreting RL agents (in a post-hoc manner) can, roughly, be seen as falling into one of four categories: (1) concept-based interpretability; (2) mechanistic interpretability; (3) example-based interpretability; and (4) attribution-based interpretability. This division isn't perfect (it leaves out some methods at the boundary of what counts as a post-hoc explanation, e.g. decision trees and related approaches), but it feels like a good enough typology for the purposes of quickly getting to grips with the field.
Concept-Based Interpretability
Concept-based approaches to interpretability explain neural network behaviours in terms of the concepts that networks learn to internally represent (Kim et al., 2018). A concept is an interpretable feature of an input to a model, such as, when investigating a chess-playing RL agent, the number of rooks remaining on the board. A concept can be discrete or continuous, though this post focuses on discrete-valued concepts. Concept-based interpretability approaches determine which concepts a model internally represents by investigating which input features can be reverse-engineered from model activations.
More concretely, concept-based interpretability works as follows. Suppose we are interested in whether a model internally represents a discrete concept \(Q\) within a component \(l\) such as its \(l\)-th feedforward layer. Let \(Q\) take one of \(P\) values in the set \(\Lambda_Q=\{q_1, \cdots, q_P\}\) for every possible model input \(x \in X\). The concept \(Q\) is typically then defined as a mapping \(Q: X \rightarrow \Lambda_Q\) that maps every input \(x\) to the value taken by the concept on that input, \(Q(x) \in \Lambda_Q\). Further, denote the activations at component \(l\) of the model on a forward pass on an input \(x_i\) as \(c_l^i \in \mathbb{R}^{D_l}\). Concept-based interpretability methods determine whether the concept \(Q\) is represented in the activations by introducing a low-capacity auxiliary classifier \(p^l: \mathbb{R}^{D_l} \rightarrow \Lambda_Q\). If \(p^l\) can successfully classify concept value \(Q(x_i)\) of inputs \(x_i\) based on the activations \(c_l^i\), we say that the concept \(Q\) is represented within the model activations at component \(l\). The intuition for this is that, since \(p^l\) is low-capacity, its ability to discriminate based on activations must come from the concept being represented within those activations.
There are different options for \(p^l\), but the most common approach is to use a supervised linear classifier (Alain & Bengio, 2016). The linear classifier \(p^l\) is then referred to as a linear probe, and computes a distribution over concept values for a given vector of activations \(c_l^i\) by projecting the activations along learned directions and passing the resulting logits through a softmax. To obtain a trained linear probe, a dataset of model activations labelled according to the concept value of the input they correspond to is collected and the probe is trained in a standard supervised fashion. Importantly, the directions learned by a linear probe can be viewed as the distributed representations of the concept of interest. Thus, references to learned representations of a concept are references to directions in a model’s activation space that a linear probe learns when decoding that concept.
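To make this concrete, below is a minimal sketch of how such a linear probe might be trained on a collected dataset of activations. The names `acts`, `labels` and the hyperparameters are hypothetical placeholders, and the sketch assumes the activations from component \(l\) and the corresponding concept labels have already been gathered by running the model over a set of inputs.

```python
# Minimal sketch of training a linear probe p^l : R^{D_l} -> Lambda_Q.
# Assumes `acts` (shape [N, D_l]) and `labels` (ints in {0, ..., P-1}) have
# already been collected; both names are illustrative placeholders.
import torch
import torch.nn as nn


def train_linear_probe(acts, labels, num_classes, epochs=100, lr=1e-2):
    X = torch.as_tensor(acts, dtype=torch.float32)
    y = torch.as_tensor(labels, dtype=torch.long)
    probe = nn.Linear(X.shape[1], num_classes)   # one learned direction per concept value
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()              # softmax over projected activations + NLL
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(X), y)
        loss.backward()
        opt.step()
    return probe
```

The rows of the trained probe's weight matrix can then be read as candidate directions representing each concept value, and classification accuracy on a held-out split of activations is what supports the claim that the concept is (or is not) represented at component \(l\).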
Concept-based interpretability has been extensively applied to interpret AlphaZero-style agents. For instance, AlphaZero has been found to utilise an array of human, and super-human, chess concepts (Schut et al., 2023; McGrath et al., 2022). Similar results have been found when interpreting the open-source LeelaZero (Hammersborg & Strümke, 2023), and when investigating AlphaZero’s behaviour in Hex (Lovering et al., 2022). AlphaZero has additionally been found to compute concepts relating to future moves when playing chess (Jenner et al., 2024). However, concept-based interpretability has been less extensively applied to model-free RL agents. An exception to this is recent work that found a representation of the concept of a goal within a maze-solving model-free RL agent (Mini et al., 2023), though this work manually inspected convolutional channels to locate representations rather than using linear probes. Another application of concept-based interpretability to model-free RL agents investigates using concept representations decoded by probes to provide natural language explanations of agent behaviour (Das et al., 2023). In relevant work in the supervised learning context, concept-based interpretability has been applied to locate representations of apparent world models in language models trained to predict transcripts of board games (Li et al., 2024; Nanda et al., 2023; Karvonen, 2024).
Mechanistic Interpretability
Mechanistic interpretability aims to reverse engineer the computations performed by specific model components, with the aim of finding “circuits”, or, groups of model components responsible for certain behaviours (Olah et al., 2020). Whilst similar to concept-based interpretability due to the common focus on model internals, mechanistic interpretability can be distinguished in that its focus is on reverse engineering individual model components - such as individual neurons or attention heads - as opposed to locating meaningful distributed representations.
Mechanistic interpretability includes an array of different methods, and a full description of them all would take far too long, but a method of particular interest is activation patching. In activation patching, the activations of a set of model components are modified to some counterfactual value, and the change in model output when patching in these counterfactual activations is noted. The counterfactual activations can be drawn from a forward pass on some similar-but-importantly-different input, the mean activations over some validation set, or simply zero activations. If altering specific components causes the model output to consistently change on some connected set of inputs, the model’s behaviour on those inputs is attributed to the altered model components.
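As an illustration, here is a hedged sketch of what activation patching can look like in practice using PyTorch forward hooks. The names `model`, `layer`, `clean_input` and `counterfactual_input` are placeholders rather than any particular codebase's API, and the sketch assumes the component of interest is an `nn.Module` whose output can be overwritten (and that the two inputs produce activations of the same shape).

```python
# Sketch of activation patching via PyTorch forward hooks. All argument names
# are hypothetical placeholders for illustration.
import torch


@torch.no_grad()
def activation_patch(model, layer, clean_input, counterfactual_input):
    """Run `model` on `clean_input` with `layer`'s output replaced by its
    activations from a forward pass on `counterfactual_input`."""
    cached = {}

    def cache_hook(module, inputs, output):
        cached["act"] = output.detach().clone()

    def patch_hook(module, inputs, output):
        return cached["act"]  # returning a value from a forward hook overwrites the output

    # 1) cache the counterfactual activations
    handle = layer.register_forward_hook(cache_hook)
    model(counterfactual_input)
    handle.remove()

    # 2) rerun on the clean input with the cached activations patched in
    handle = layer.register_forward_hook(patch_hook)
    patched_out = model(clean_input)
    handle.remove()

    clean_out = model(clean_input)
    return clean_out, patched_out  # compare these to attribute behaviour to `layer`
```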
In supervised learning, mechanistic interpretability methods have successfully located circuits for behaviours like curve detection in vision models (Cammarata et al., 2020) and indirect object identification in language models (Wang et al., 2022). While mechanistic interpretability has primarily been applied in the supervised learning context, recent work has successfully identified model components responsible for specific behaviours in model-based planners in the games of Go and chess (Haoxing, 2023; Jenner et al., 2024), and in model-free agents (Bloom & Colognese, 2023). Applying mechanistic interpretability analysis to RL agents seems like a really cool area and is something I am (hopefully) going to look into more at some point soon.
Example-Based Interpretability
Example-based interpretability methods seek to explain the behaviour of RL agents by providing examples of trajectories, transitions or states that are particularly insightful regarding the agent’s behaviour. A popular approach is to construct some quantitative measure of the “interestingness” of a transition - defined in terms of features such as being observed incredibly frequently or being assigned an abnormally low value by a value function - which can then be used to generate examples of transitions that illustrate some informative aspect of the agent’s behaviour (Sequeira & Gervasio, 2020; Rupprecht et al., 2020). These examples can then be used to build up a qualitative description of the agent’s behaviour. Other example-based approaches exist, such as determining which training trajectories are crucial for some learned behaviour (Deshmukh et al., 2023) and clustering sequences of transitions to locate clusters corresponding to behaviours (Zahavy et al., 2016). Note that in all of these methods, the focus is on generating a qualitative understanding of the agent’s behaviour.
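As a toy illustration of the kind of "interestingness" measure described above, the sketch below scores recorded transitions by how frequently their state is observed and by how abnormally low the agent's value estimate is. The data format (`state_id`, `value`) and the scoring function are assumptions made for illustration, not the measures used in the cited papers.

```python
# Toy "interestingness" score over recorded transitions, loosely in the spirit of
# Sequeira & Gervasio (2020). `transitions` is assumed to be a list of dicts with
# keys "state_id" and "value" (the agent's value estimate); all names are illustrative.
from collections import Counter
import numpy as np


def interesting_transitions(transitions, k=5):
    """Return the k transitions judged most 'interesting' under this toy measure:
    frequently-visited states and abnormally low value estimates score highly."""
    counts = Counter(t["state_id"] for t in transitions)
    values = np.array([t["value"] for t in transitions])
    mean, std = values.mean(), values.std() + 1e-8

    def score(t):
        frequency = counts[t["state_id"]] / len(transitions)   # how often this state is observed
        low_value = max(0.0, (mean - t["value"]) / std)        # how abnormally low the value is
        return frequency + low_value

    return sorted(transitions, key=score, reverse=True)[:k]
```

The returned transitions can then be inspected (or rendered as video clips) to build up the kind of qualitative description of agent behaviour discussed above.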
Attribution-Based Interpretability
Attribution-based interpretability seeks to determine which features in an agent’s observation are important for agent behaviour. Importance is typically characterised by the production of a saliency map (Simonyan et al., 2014), a heat map over the input that activates strongly on input components judged to be important. Saliency maps in RL can be constructed using gradient-based methods (Weitkamp et al., 2019; Wang et al., 2016) or input perturbation-based methods (Iyer et al., 2018; Puri et al., 2020). Saliency maps have been used for purposes ranging from explaining policy failures by highlighting when agents perform poorly due to focusing on the “wrong” parts of the environment (Greydanus et al., 2018; Hilton et al., 2020) to explaining learned strategies by illustrating what agents focus on when performing certain actions (Iyer et al., 2018; Puri et al., 2020). Attribution-based methods are very popular for interpreting RL agents. However, they have been shown to provide misleading explanations of agent behaviour in RL (Atrey et al., 2020; Puri et al., 2020) and of model behaviour more generally (Adebayo et al., 2018; Kindermans et al., 2019). Thus, they are (probably) best used for exploratory purposes: to generate hypotheses about agent behaviour that can then be investigated more rigorously using other methods.
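For concreteness, here is a hedged sketch of a perturbation-based saliency map in the spirit of Greydanus et al. (2018): each patch of the observation is blurred in turn, and the saliency assigned to that patch is how much the policy's action distribution shifts as a result. The `policy` (mapping a batched observation to action logits), the `obs` tensor (a single H x W greyscale observation), and the patch size and blur strength are all illustrative assumptions, not the original implementation.

```python
# Sketch of perturbation-based saliency for an RL policy. `policy` and `obs`
# are hypothetical placeholders (see lead-in above).
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter


@torch.no_grad()
def saliency_map(policy, obs, patch=5, sigma=3.0):
    base = F.softmax(policy(obs.unsqueeze(0)), dim=-1)          # unperturbed action distribution
    blurred = torch.as_tensor(gaussian_filter(obs.numpy(), sigma=sigma), dtype=obs.dtype)
    H, W = obs.shape[-2:]
    sal = torch.zeros(H, W)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            perturbed = obs.clone()
            perturbed[..., i:i + patch, j:j + patch] = blurred[..., i:i + patch, j:j + patch]
            out = F.softmax(policy(perturbed.unsqueeze(0)), dim=-1)
            # saliency of this patch = squared shift in the action distribution when it is blurred
            sal[i:i + patch, j:j + patch] = 0.5 * torch.sum((base - out) ** 2)
    return sal
```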
References
2024
- Evidence of Learned Look-Ahead in a Chess-Playing Neural Network. arXiv preprint arXiv:2406.00877, 2024.
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. 2024.
- Emergent world models and latent variable estimation in chess-playing language models. arXiv preprint arXiv:2403.15498, 2024.
2023
- Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero. 2023.
- Information based explanation methods for deep learning agents – with applications on large open-source chess models. 2023.
- Understanding and Controlling a Maze-Solving Policy Network. 2023.
- State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding. 2023.
- Emergent Linear Representations in World Models of Self-Supervised Sequence Models. 2023.
- Inside the mind of a superhuman Go model: How does Leela Zero read ladders? 2023.
- Decision Transformer Interpretability. 2023.
- Explaining RL Decisions with Trajectories. In The Eleventh International Conference on Learning Representations, 2023.
2022
- Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences, 2022.
- Evaluation beyond task performance: analyzing concepts in AlphaZero in Hex. Advances in Neural Information Processing Systems, 2022.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. 2022.
2020
- Zoom In: An Introduction to Circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in
- Curve Detectors. Distill, 2020. https://distill.pub/2020/circuits/curve-detectors
- Interestingness elements for explainable reinforcement learning: Understanding agents’ capabilities and limitations. Artificial Intelligence, 2020.
- Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents. In International Conference on Learning Representations, 2020.
- Explain Your Move: Understanding Agent Actions Using Specific and Relevant Feature Attribution. In International Conference on Learning Representations, 2020.
- Understanding RL Vision. Distill, 2020. https://distill.pub/2020/understanding-rl-vision
- Exploratory Not Explanatory: Counterfactual Analysis of Saliency Maps for Deep Reinforcement Learning. In International Conference on Learning Representations, 2020.
2019
- Visual rationalizations in deep reinforcement learning for atari games. In Artificial Intelligence: 30th Benelux Conference, BNAIC 2018, ’s-Hertogenbosch, The Netherlands, November 8–9, 2018, Revised Selected Papers 30, 2019.
- The (un)reliability of saliency methods. Explainable AI: Interpreting, explaining and visualizing deep learning, 2019.
2018
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). 2018.
- Transparency and explanation in deep reinforcement learning neural networks. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018.
- Visualizing and understanding atari agents. In International conference on machine learning, 2018.
- Sanity checks for saliency maps. Advances in neural information processing systems, 2018.
2016
- Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
- Graying the black box: Understanding DQNs. In International conference on machine learning, 2016.
- Dueling network architectures for deep reinforcement learning. In International conference on machine learning, 2016.
2014
- Deep inside convolutional networks: visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.