Post-Hoc Approaches to Interpreting Reinforcement Learning Agents
The field of interpretability seeks to explain the behaviour of neural networks. Post-hoc interpretability is the sub-field that seeks to explain the behaviour of models without altering them to make such an explanation easier (in this post, I will use "interpretability" to mean post-hoc interpretability, though it should be noted that there are many other, equally important, approaches to interpreting neural networks). In the course of writing my thesis, I read a lot of papers on interpreting RL agents, and this post is a quick summary of work in this field. The aim is to give a quick overview of some common approaches to interpreting RL agents, along with examples of papers applying each approach for anyone interested in reading further. As such, this post is not a complete literature review, and I am sure I have missed some important pieces of work.
Most modern methods for interpreting RL agents (in a post-hoc manner) can, roughly, be seen as falling into one of four categories: (1) concept-based interpretability; (2) mechanistic interpretability; (3) example-based interpretability; and (4) attribution-based interpretability. This division isn't perfect (it leaves out some methods at the boundary of what counts as a post-hoc explanation, e.g. decision trees and related approaches), but it feels like a good enough typology for the purposes of quickly getting to grips with the field.
Concept-Based Interpretability
Concept-based approaches to interpretability explain neural network behaviours in terms of the concepts that networks learn to internally represent (Kim et al., 2018). A concept is an interpretable feature of an input to a model, such as, when investigating a chess-playing RL agent, the number of rooks remaining on the board. A concept can be discrete or continuous, though this post focuses on discrete-valued concepts. Concept-based interpretability approaches determine which concepts a model internally represents by investigating which input features can be reverse-engineered from model activations.
More concretely, concept-based interpretability works as follows. Suppose we are interested in whether a model internally represents a discrete concept \(Q\) within a component \(l\) such as its \(l\)-th feedforward layer. Let \(Q\) take one of \(P\) values in the set \(\Lambda_Q=\{q_1, \cdots, q_P\}\) for every possible model input \(x \in X\). The concept \(Q\) is typically then defined as a mapping \(Q: X \rightarrow \Lambda_Q\) that maps every input \(x\) to the value taken by the concept on that input, \(Q(x) \in \Lambda_Q\). Further, denote the activations at component \(l\) of the model on a forward pass on an input \(x_i\) as \(c_l^i \in \mathbb{R}^{D_l}\). Concept-based interpretability methods determine whether the concept \(Q\) is represented in the activations by introducing a low-capacity auxiliary classifier \(p^l: \mathbb{R}^{D_l} \rightarrow \Lambda_Q\). If \(p^l\) can successfully classify concept value \(Q(x_i)\) of inputs \(x_i\) based on the activations \(c_l^i\), we say that the concept \(Q\) is represented within the model activations at component \(l\). The intuition for this is that, since \(p^l\) is low-capacity, its ability to discriminate based on activations must come from the concept being represented within those activations.
There are different options for \(p^l\), but the most common approach is to use a supervised linear classifier (Alain & Bengio, 2016). The linear classifier \(p^l\) is then referred to as a linear probe, and computes a distribution over concept values for a given vector of activations \(c_l^i\) by projecting the activations along learned directions and passing the resulting logits through a softmax. To obtain a trained linear probe, a dataset of model activations labelled according to the concept value of the input they correspond to is collected and the probe is trained in a standard supervised fashion. Importantly, the directions learned by a linear probe can be viewed as the distributed representations of the concept of interest. Thus, references to learned representations of a concept are references to directions in a model’s activation space that a linear probe learns when decoding that concept.
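To make this concrete, below is a minimal sketch of how such a linear probe might be trained on a collected dataset of activations. The names `acts`, `labels` and the hyperparameters are hypothetical placeholders, and the sketch assumes the activations from component \(l\) and the corresponding concept labels have already been gathered by running the model over a set of inputs.

```python
# Minimal sketch of training a linear probe p^l : R^{D_l} -> Lambda_Q.
# Assumes `acts` (shape [N, D_l]) and `labels` (ints in {0, ..., P-1}) have
# already been collected; both names are illustrative placeholders.
import torch
import torch.nn as nn


def train_linear_probe(acts, labels, num_classes, epochs=100, lr=1e-2):
    X = torch.as_tensor(acts, dtype=torch.float32)
    y = torch.as_tensor(labels, dtype=torch.long)
    probe = nn.Linear(X.shape[1], num_classes)   # one learned direction per concept value
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()              # softmax over projected activations + NLL
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(X), y)
        loss.backward()
        opt.step()
    return probe
```

The rows of the trained probe's weight matrix can then be read as candidate directions representing each concept value, and classification accuracy on a held-out split of activations is what supports the claim that the concept is (or is not) represented at component \(l\).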
Concept-based interpretability has been extensively applied to interpret AlphaZero-style agents. For instance, AlphaZero has been found to utilise an array of human, and super-human, chess concepts (Schut et al., 2023; McGrath et al., 2022). Similar results have been found when interpreting the open-source LeelaZero (Hammersborg & Strümke, 2023), and when investigating AlphaZero’s behaviour in Hex (Lovering et al., 2022). AlphaZero has additionally been found to compute concepts relating to future moves when playing chess (Jenner et al., 2024). However, concept-based interpretability has been less extensively applied to model-free RL agents. An exception to this is recent work that found a representation of the concept of a goal within a maze-solving model-free RL agent (Mini et al., 2023), though this work manually inspected convolutional channels to locate representations rather than using linear probes. Another application of concept-based interpretability to model-free RL agents investigates using concept representations decoded by probes to provide natural language explanations of agent behaviour (Das et al., 2023). In relevant work in the supervised learning context, concept-based interpretability has been applied to locate representations of apparent world models in language models trained to predict transcripts of board games (Li et al., 2024; Nanda et al., 2023; Karvonen, 2024).
Mechanistic Interpretability
Mechanistic interpretability aims to reverse engineer the computations performed by specific model components, with the aim of finding “circuits”, or, groups of model components responsible for certain behaviours (Olah et al., 2020). Whilst similar to concept-based interpretability due to the common focus on model internals, mechanistic interpretability can be distinguished in that its focus is on reverse engineering individual model components - such as individual neurons or attention heads - as opposed to locating meaningful distributed representations.
Mechanistic interpretability includes an array of different methods, and a full description of them all would take far too long, but a method of particular interest is activation patching. In activation patching, the activations of a set of model components are modified to some counterfactual value, and the change in model output when patching in these counterfactual activations is noted. The counterfactual activations can be drawn from a forward pass on some similar-but-importantly-different input, the mean activations over some validation set, or simply zero activations. If altering specific components causes the model output to consistently change on some connected set of inputs, the model’s behaviour on those inputs is attributed to the altered model components.
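As an illustration, here is a hedged sketch of what activation patching can look like in practice using PyTorch forward hooks. The names `model`, `layer`, `clean_input` and `counterfactual_input` are placeholders rather than any particular codebase's API, and the sketch assumes the component of interest is an `nn.Module` whose output can be overwritten (and that the two inputs produce activations of the same shape).

```python
# Sketch of activation patching via PyTorch forward hooks. All argument names
# are hypothetical placeholders for illustration.
import torch


@torch.no_grad()
def activation_patch(model, layer, clean_input, counterfactual_input):
    """Run `model` on `clean_input` with `layer`'s output replaced by its
    activations from a forward pass on `counterfactual_input`."""
    cached = {}

    def cache_hook(module, inputs, output):
        cached["act"] = output.detach().clone()

    def patch_hook(module, inputs, output):
        return cached["act"]  # returning a value from a forward hook overwrites the output

    # 1) cache the counterfactual activations
    handle = layer.register_forward_hook(cache_hook)
    model(counterfactual_input)
    handle.remove()

    # 2) rerun on the clean input with the cached activations patched in
    handle = layer.register_forward_hook(patch_hook)
    patched_out = model(clean_input)
    handle.remove()

    clean_out = model(clean_input)
    return clean_out, patched_out  # compare these to attribute behaviour to `layer`
```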
In supervised learning, mechanistic interpretability methods have successfully located circuits for behaviours like curve detection in vision models (Cammarata et al., 2020) and indirect object identification in language models (Wang et al., 2022). While mechanistic interpretability has primarily been applied in the supervised learning context, recent work has successfully identified model components responsible for specific behaviours in model-based planners in the games of Go and chess (Haoxing, 2023; Jenner et al., 2024), and in model-free agents (Bloom & Colognese, 2023). Applying mechanistic interpretability analysis to RL agents seems like a really cool area and is something I am (hopefully) going to look into more at some point soon.
Example-Based Interpretability
Example-based interpretability methods seek to explain the behaviour of RL agents by providing examples of trajectories, transitions or states that are particularly insightful regarding the agent’s behaviour. A popular approach is to construct some quantitative measure of the “interestingness” of a transition - defined in terms of features such as being observed incredibly frequently or being assigned an abnormally low value by a value function - which can then be used to generate examples of transitions that illustrate some informative aspect of the agent’s behaviour (Sequeira & Gervasio, 2020; Rupprecht et al., 2020). These examples can then be used to build up a qualitative description of the agent’s behaviour. Other example-based approaches exist, such as determining which training trajectories are crucial for some learned behaviour (Deshmukh et al., 2023) and clustering sequences of transitions to locate clusters corresponding to behaviours (Zahavy et al., 2016). Note that in all of these methods, the focus is on generating a qualitative understanding of the agent’s behaviour.
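As a toy illustration of the kind of "interestingness" measure described above, the sketch below scores recorded transitions by how frequently their state is observed and by how abnormally low the agent's value estimate is. The data format (`state_id`, `value`) and the scoring function are assumptions made for illustration, not the measures used in the cited papers.

```python
# Toy "interestingness" score over recorded transitions, loosely in the spirit of
# Sequeira & Gervasio (2020). `transitions` is assumed to be a list of dicts with
# keys "state_id" and "value" (the agent's value estimate); all names are illustrative.
from collections import Counter
import numpy as np


def interesting_transitions(transitions, k=5):
    """Return the k transitions judged most 'interesting' under this toy measure:
    frequently-visited states and abnormally low value estimates score highly."""
    counts = Counter(t["state_id"] for t in transitions)
    values = np.array([t["value"] for t in transitions])
    mean, std = values.mean(), values.std() + 1e-8

    def score(t):
        frequency = counts[t["state_id"]] / len(transitions)   # how often this state is observed
        low_value = max(0.0, (mean - t["value"]) / std)        # how abnormally low the value is
        return frequency + low_value

    return sorted(transitions, key=score, reverse=True)[:k]
```

The returned transitions can then be inspected (or rendered as video clips) to build up the kind of qualitative description of agent behaviour discussed above.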
Attribution-Based Interpretability
Attribution-based interpretability seeks to determine which features in an agent’s observation are important for agent behaviour. Importance is typically characterised by the production of a saliency map (Simonyan et al., 2014), a heat map over the input that activates strongly on input components judged to be important. Saliency maps in RL can be constructed using gradient-based methods (Weitkamp et al., 2019; Wang et al., 2016) or input perturbation-based methods (Iyer et al., 2018; Puri et al., 2020). Saliency maps have been used for purposes ranging from explaining policy failures by highlighting when agents perform poorly due to focusing on the “wrong” parts of the environment (Greydanus et al., 2018; Hilton et al., 2020) to explaining learned strategies by illustrating what agents focus on when performing certain actions (Iyer et al., 2018; Puri et al., 2020). Attribution-based methods are very popular for interpreting RL agents. However, they have been shown to provide misleading explanations of agent behaviour in RL (Atrey et al., 2020; Puri et al., 2020) and of model behaviour more generally (Adebayo et al., 2018; Kindermans et al., 2019). Thus, they are (probably) best used for exploratory purposes: to generate hypotheses about agent behaviour that can then be investigated more rigorously using other methods.
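For concreteness, here is a hedged sketch of a perturbation-based saliency map in the spirit of Greydanus et al. (2018): each patch of the observation is blurred in turn, and the saliency assigned to that patch is how much the policy's action distribution shifts as a result. The `policy` (mapping a batched observation to action logits), the `obs` tensor (a single H x W greyscale observation), and the patch size and blur strength are all illustrative assumptions, not the original implementation.

```python
# Sketch of perturbation-based saliency for an RL policy. `policy` and `obs`
# are hypothetical placeholders (see lead-in above).
import torch
import torch.nn.functional as F
from scipy.ndimage import gaussian_filter


@torch.no_grad()
def saliency_map(policy, obs, patch=5, sigma=3.0):
    base = F.softmax(policy(obs.unsqueeze(0)), dim=-1)          # unperturbed action distribution
    blurred = torch.as_tensor(gaussian_filter(obs.numpy(), sigma=sigma), dtype=obs.dtype)
    H, W = obs.shape[-2:]
    sal = torch.zeros(H, W)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            perturbed = obs.clone()
            perturbed[..., i:i + patch, j:j + patch] = blurred[..., i:i + patch, j:j + patch]
            out = F.softmax(policy(perturbed.unsqueeze(0)), dim=-1)
            # saliency of this patch = squared shift in the action distribution when it is blurred
            sal[i:i + patch, j:j + patch] = 0.5 * torch.sum((base - out) ** 2)
    return sal
```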
References
2024
- Evidence of Learned Look-Ahead in a Chess-Playing Neural Network. arXiv preprint arXiv:2406.00877, 2024.
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. 2024.
- Emergent world models and latent variable estimation in chess-playing language models. arXiv preprint arXiv:2403.15498, 2024.
2023
- Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero. 2023.
- Information based explanation methods for deep learning agents – with applications on large open-source chess models. 2023.
- Understanding and Controlling a Maze-Solving Policy Network. 2023.
- State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding. 2023.
- Emergent Linear Representations in World Models of Self-Supervised Sequence Models. 2023.
- Inside the mind of a superhuman Go model: How does Leela Zero read ladders? 2023.
- Decision Transformer Interpretability. 2023.
- Explaining RL Decisions with Trajectories. In The Eleventh International Conference on Learning Representations, 2023.
2022
- Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences, 2022.
- Evaluation beyond task performance: analyzing concepts in AlphaZero in Hex. Advances in Neural Information Processing Systems, 2022.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. 2022.
2020
- Zoom In: An Introduction to Circuits. Distill, 2020. https://distill.pub/2020/circuits/zoom-in
- Curve Detectors. Distill, 2020. https://distill.pub/2020/circuits/curve-detectors
- Interestingness elements for explainable reinforcement learning: Understanding agents’ capabilities and limitations. Artificial Intelligence, 2020.
- Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents. In International Conference on Learning Representations, 2020.
- Explain Your Move: Understanding Agent Actions Using Specific and Relevant Feature Attribution. In International Conference on Learning Representations, 2020.
- Understanding RL Vision. Distill, 2020. https://distill.pub/2020/understanding-rl-vision
- Exploratory Not Explanatory: Counterfactual Analysis of Saliency Maps for Deep Reinforcement Learning. In International Conference on Learning Representations, 2020.
2019
- Visual rationalizations in deep reinforcement learning for atari games. In Artificial Intelligence: 30th Benelux Conference, BNAIC 2018, ’s-Hertogenbosch, The Netherlands, November 8–9, 2018, Revised Selected Papers 30, 2019.
- The (un)reliability of saliency methods. Explainable AI: Interpreting, explaining and visualizing deep learning, 2019.
2018
- Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). 2018.
- Transparency and explanation in deep reinforcement learning neural networks. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018.
- Visualizing and understanding atari agents. In International conference on machine learning, 2018.
- Sanity checks for saliency maps. Advances in neural information processing systems, 2018.
2016
- Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
- Graying the black box: Understanding DQNs. In International conference on machine learning, 2016.
- Dueling network architectures for deep reinforcement learning. In International conference on machine learning, 2016.
2014
- Deep inside convolutional networks: visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.