Post-Hoc Approaches to Interpreting Reinforcement Learning Agents

The field of interpretability seeks to explain the behaviour of neural networks. Post-hoc interpretability is the sub-field that seeks to explain the behaviour of models without altering them to make such an explanation easier (in this post, I will use "interpretability" to mean post-hoc interpretability, though it should be noted that there are many other, equally important, approaches to interpreting neural networks). In the course of writing my thesis, I read a lot of papers on interpreting RL agents, and this post is a quick summary of work in this field. The aim is to give a quick overview of some common approaches to interpreting RL agents, along with examples of papers applying each approach, for anyone interested to read further. As such, this post is not a complete literature review, and I am sure I have missed some important pieces of work.

Most modern methods for interpreting RL agents (in a post-hoc manner) can, roughly, be seen as falling into one of four categories: (1) concept-based interpretability; (2) mechanistic interpretability; (3) example-based interpretability; and (4) attribution-based interpretability. This division isn’t perfect (it leaves out some methods at the boundary of what counts as a post-hoc explanation, e.g. decision trees and related approaches), but it feels like a good enough typology for the purposes of quickly getting to grips with the field.

Concept-Based Interpretability

Concept-based approaches to interpretability explain neural network behaviours in terms of the concepts that networks learn to internally represent (Kim et al., 2018). A concept is an interpretable feature of a model's input: for example, when investigating a chess-playing RL agent, the number of rooks remaining on the board. A concept can be discrete or continuous, though this post focuses on discrete-valued concepts. Concept-based interpretability approaches determine which concepts a model internally represents by investigating which input features can be reverse-engineered from model activations.

More concretely, concept-based interpretability works as follows. Suppose we are interested in whether a model internally represents a discrete concept \(Q\) within a component \(l\) such as its \(l\)-th feedforward layer. Let \(Q\) take one of \(P\) values in the set \(\Lambda_Q=\{q_1, \cdots, q_P\}\) for every possible model input \(x \in X\). The concept \(Q\) is typically then defined as a mapping \(Q: X \rightarrow \Lambda_Q\) that maps every input \(x\) to the value taken by the concept on that input, \(Q(x) \in \Lambda_Q\). Further, denote the activations at component \(l\) of the model on a forward pass on an input \(x_i\) as \(c_l^i \in \mathbb{R}^{D_l}\). Concept-based interpretability methods determine whether the concept \(Q\) is represented in the activations by introducing a low-capacity auxiliary classifier \(p^l: \mathbb{R}^{D_l} \rightarrow \Lambda_Q\). If \(p^l\) can successfully classify concept value \(Q(x_i)\) of inputs \(x_i\) based on the activations \(c_l^i\), we say that the concept \(Q\) is represented within the model activations at component \(l\). The intuition for this is that, since \(p^l\) is low-capacity, its ability to discriminate based on activations must come from the concept being represented within those activations.
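To make this concrete, here is a minimal sketch (in PyTorch) of how one might build such a probing dataset: run the model on a batch of inputs, record the activations \(c_l^i\) at the component of interest with a forward hook, and label each activation vector with the concept value \(Q(x_i)\) of its input. The names `model`, `layer`, `inputs` and `concept_fn` are placeholders rather than any particular agent or concept.

```python
import torch

def collect_probe_dataset(model, layer, inputs, concept_fn):
    """Return (activations, concept labels) for probing a single component."""
    activations = []

    def save_hook(module, inp, out):
        activations.append(out.detach().flatten(start_dim=1))   # (batch, D_l)

    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(inputs)                        # forward pass fills `activations`
    handle.remove()

    c = torch.cat(activations)               # activations c_l^i, shape (N, D_l)
    q = torch.tensor([concept_fn(x) for x in inputs])   # concept values Q(x_i)
    return c, q
```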

There are different options for \(p^l\), but the most common approach is to use a supervised linear classifier (Alain & Bengio, 2016). The linear classifier \(p^l\) is then referred to as a linear probe, and computes a distribution over concept values for a given vector of activations \(c_l^i\) by projecting the activations along learned directions and passing the resulting logits through a softmax. To obtain a trained linear probe, a dataset of model activations labelled according to the concept value of the input they correspond to is collected and the probe is trained in a standard supervised fashion. Importantly, the directions learned by a linear probe can be viewed as the distributed representations of the concept of interest. Thus, references to learned representations of a concept are references to directions in a model’s activation space that a linear probe learns when decoding that concept.
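As an illustration, a linear probe is just multinomial logistic regression on the activations. The sketch below trains such a probe with cross-entropy on synthetic stand-in data (the dimensions and labels are placeholders, not any real agent); the rows of the learned weight matrix are the candidate concept directions.

```python
import torch
import torch.nn as nn

# Synthetic stand-ins: D_l-dimensional activations and P-valued concept labels.
D_l, P, N = 512, 3, 1024
c = torch.randn(N, D_l)                      # activations at component l
q = torch.randint(0, P, (N,))                # concept labels Q(x_i) in {0, ..., P-1}

probe = nn.Linear(D_l, P)                    # rows of probe.weight are the learned
optimiser = torch.optim.Adam(probe.parameters(), lr=1e-3)   # concept directions

for _ in range(200):
    optimiser.zero_grad()
    loss = nn.functional.cross_entropy(probe(c), q)
    loss.backward()
    optimiser.step()

# Accuracy on the training data; in practice the probe is evaluated on held-out data.
accuracy = (probe(c).argmax(dim=1) == q).float().mean()
```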

Concept-based interpretability has been extensively applied to interpret AlphaZero-style agents. For instance, AlphaZero has been found to utilise an array of human, and super-human, chess concepts (Schut et al., 2023; McGrath et al., 2022). Similar results have been found when interpreting the open-source LeelaZero (Hammersborg & Strümke, 2023), and when investigating AlphaZero’s behaviour in Hex (Lovering et al., 2022). AlphaZero has additionally been found to compute concepts relating to future moves when playing chess (Jenner et al., 2024). However, concept-based interpretability has been less extensively applied to model-free RL agents. An exception to this is recent work (that, rather than using linear probes manually inspected convolutional channels to locate representation) that found the concept of a goal within a maze-solving model-free RL agent (Mini et al., 2023). Another application of concept-based interpretability to model-free RL agents investigates using concept representations decoded by probes to provide natural language explanations of agent behaviour (Das et al., 2023). In relevant work in the supervised learning context, concept-based interpretability has been applied to locate representations of apparent world models in language models trained to predict transcripts of board games (Li et al., 2024; Nanda et al., 2023; Karvonen, 2024).

Mechanistic Interpretability

Mechanistic interpretability aims to reverse engineer the computations performed by specific model components with the aim of finding “circuits”, or, groups of model components responsible for certain behaviours (Olah et al., 2020). Whilst similar to concept-based interpretability due to the common focus on model internals, mechanistic interpretability can be distinguished by its focus on reverse engineering individual model components - such as individual neurons or attention heads - as opposed to locating meaningful distributed representations.

Mechanistic interpretability includes an array of different methods. A full description of all of them would take far too long, but one method of particular interest is activation patching. In activation patching, the activations of a set of model components are replaced with counterfactual values, and the resulting change in model output is measured. If altering specific components causes the model output to consistently change on some connected set of inputs, the model’s behaviour on those inputs is attributed to the altered components. The counterfactual activations can be drawn from a forward pass on a similar-but-importantly-different input, from the mean activations over some validation set, or simply set to zero.
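To give a flavour of what this looks like in practice, below is a minimal activation-patching sketch for a PyTorch model. The names `model`, `component`, `x_orig` and `x_counterfactual` are placeholders: the idea is to cache the component's activations on the counterfactual input, overwrite the component's output with them on a forward pass over the original input, and compare the two outputs (e.g. action logits).

```python
import torch

def patched_output(model, component, x_orig, x_counterfactual):
    """Compare model outputs on x_orig with and without patched activations."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["act"] = output.detach()       # store the counterfactual activations

    def patch_hook(module, inputs, output):
        return cache["act"]                  # returning a value overrides the output

    # 1) cache the component's activations on the counterfactual input
    handle = component.register_forward_hook(save_hook)
    with torch.no_grad():
        model(x_counterfactual)
    handle.remove()

    # 2) re-run on the original input with the counterfactual activations patched in
    handle = component.register_forward_hook(patch_hook)
    with torch.no_grad():
        out_patched = model(x_orig)
    handle.remove()

    with torch.no_grad():
        out_clean = model(x_orig)
    return out_clean, out_patched            # compare these to attribute behaviour
```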

In supervised learning, mechanistic interpretability methods have successfully located circuits for behaviours like curve detection in vision models (Cammarata et al., 2020) and indirect object identification in language models (Wang et al., 2022). While mechanistic interpretability has primarily been applied in the supervised learning context, it has recently been used to identify model components responsible for specific behaviours in model-based planners in the games of Go and chess (Haoxing, 2023; Jenner et al., 2024), and in model-free agents (Bloom & Colognese, 2023). Applying mechanistic interpretability analysis to RL agents seems like a really cool area and is something I am (hopefully) going to look into more at some point soon.

Example-Based Interpretability

Example-based interpretability methods seek to explain the behaviour of RL agents by providing examples of trajectories, transitions or states that are particularly insightful regarding the agent’s behaviour. A popular method here is to construct some quantitative measure of the “interestingness” of a transition - defined in terms of features such as being observed very frequently or being assigned an abnormally low value by a value function - which can then be used to generate examples of transitions that illustrate some informative aspect of the agent’s behaviour (Sequeira & Gervasio, 2020; Rupprecht et al., 2020). These examples can then be used to build up a qualitative description of the agent’s behaviour. Alternative example-based approaches exist, such as determining which training trajectories are crucial for some learned behaviour (Deshmukh et al., 2023) and clustering sequences of transitions to locate clusters corresponding to behaviours (Zahavy et al., 2016). Note that in all of these methods, the focus is on generating a qualitative understanding of the agent’s behaviour.
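As a toy illustration of the flavour of these methods (and not the specific metrics used in the cited papers), one very simple “interestingness” measure scores each logged transition by how anomalous the agent’s value estimate is relative to the rest of the data, and surfaces the top-k transitions as examples. The `values` array here is assumed to hold the agent’s value estimates \(V(s_t)\) for each logged transition.

```python
import numpy as np

def most_interesting(values: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k transitions with the most anomalous value estimates."""
    z = (values - values.mean()) / (values.std() + 1e-8)    # standardised values
    return np.argsort(-np.abs(z))[:k]                       # largest |z| first

# Example: transitions with unusually low (or high) value estimates stand out.
values = np.array([0.5, 0.52, 0.48, -2.0, 0.51, 3.1, 0.49])
print(most_interesting(values, k=2))    # -> indices of the -2.0 and 3.1 transitions
```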

Attribution-Based Interpretability

Attribution-based interpretability seeks to determine which features in an agent’s observation are important for agent behaviour. Importance is typically characterised by the production of a saliency map (Simonyan et al., 2014), which is a heat map over the input that activates strongly over input components judged as important. Saliency maps in RL can be constructed using gradient-based methods (Weitkamp et al., 2019; Wang et al., 2016) or input perturbation-based methods (Iyer et al., 2018; Puri et al., 2020). Saliency maps have been used for purposes ranging from explaining policy failures by highlighting when agents perform poorly due to focusing on the “wrong” parts of the environment (Greydanus et al., 2018; Hilton et al., 2020) to explaining learned strategies by illustrating what agents focus on when performing certain actions (Iyer et al., 2018; Puri et al., 2020).

Attribution-based methods are very popular for interpreting RL agents. However, they have been shown to provide misleading explanations of agent behaviour in RL (Atrey et al., 2020; Puri et al., 2020) and of model behaviour more generally (Adebayo et al., 2018; Kindermans et al., 2019). Thus, they are (probably) best used for exploratory purposes: generating hypotheses about agent behaviour which can then be investigated more rigorously using alternative methods.
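For concreteness, here is a minimal gradient-based saliency sketch in the spirit of Simonyan et al. (2014): the saliency of each pixel of the observation is taken to be the magnitude of the gradient of the chosen action’s logit with respect to that pixel (perturbation-based methods instead measure how the output changes when regions of the observation are blurred or masked). `policy` and `obs` are placeholders for an agent’s policy network and a single image observation.

```python
import torch

def gradient_saliency(policy, obs):
    """Heat map over a (C, H, W) observation for the policy's greedy action."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy(obs.unsqueeze(0))          # (1, num_actions)
    action = logits.argmax(dim=-1).item()      # greedy action
    logits[0, action].backward()               # gradient of its logit w.r.t. the input
    return obs.grad.abs().sum(dim=0)           # aggregate magnitude over channels -> (H, W)
```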

References

2024

  1. Evidence of Learned Look-Ahead in a Chess-Playing Neural Network
    Erik Jenner, Shreyas Kapur, Vasil Georgiev, and 3 more authors
    arXiv preprint arXiv:2406.00877, 2024
  2. Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
    Kenneth Li, Aspen K. Hopkins, David Bau, and 3 more authors
    2024
  3. Emergent world models and latent variable estimation in chess-playing language models
    Adam Karvonen
    arXiv preprint arXiv:2403.15498, 2024

2023

  1. Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero
    Lisa Schut, Nenad Tomasev, Tom McGrath, and 3 more authors
    2023
  2. Information based explanation methods for deep learning agents – with applications on large open-source chess models
    Patrik Hammersborg, and Inga Strümke
    2023
  3. Understanding and Controlling a Maze-Solving Policy Network
    Ulisse Mini, Peli Grietzer, Mrinank Sharma, and 3 more authors
    2023
  4. State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding
    Devleena Das, Sonia Chernova, and Been Kim
    2023
  5. Emergent Linear Representations in World Models of Self-Supervised Sequence Models
    Neel Nanda, Andrew Lee, and Martin Wattenberg
    2023
  6. Inside the mind of a superhuman Go model: How does Leela Zero read ladders?
    Du Haoxing
    2023
  7. Decision Transformer Interpretability
    Joseph Bloom, and Paul Colognese
    2023
  8. Explaining RL Decisions with Trajectories
    Shripad Vilasrao Deshmukh, Arpan Dasgupta, Balaji Krishnamurthy, and 4 more authors
    In The Eleventh International Conference on Learning Representations, 2023

2022

  1. Acquisition of chess knowledge in AlphaZero
    Thomas McGrath, Andrei Kapishnikov, Nenad Tomašev, and 6 more authors
    Proceedings of the National Academy of Sciences, 2022
  2. Evaluation beyond task performance: analyzing concepts in AlphaZero in Hex
    Charles Lovering, Jessica Forde, George Konidaris, and 2 more authors
    Advances in Neural Information Processing Systems, 2022
  3. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
    Kevin Wang, Alexandre Variengien, Arthur Conmy, and 2 more authors
    2022

2020

  1. Zoom In: An Introduction to Circuits
    Chris Olah, Nick Cammarata, Ludwig Schubert, and 3 more authors
    Distill, 2020
    https://distill.pub/2020/circuits/zoom-in
  2. Curve Detectors
    Nick Cammarata, Gabriel Goh, Shan Carter, and 3 more authors
    Distill, 2020
    https://distill.pub/2020/circuits/curve-detectors
  3. Interestingness elements for explainable reinforcement learning: Understanding agents’ capabilities and limitations
    Pedro Sequeira, and Melinda Gervasio
    Artificial Intelligence, 2020
  4. Finding and Visualizing Weaknesses of Deep Reinforcement Learning Agents
    Christian Rupprecht, Cyril Ibrahim, and Christopher J Pal
    In International Conference on Learning Representations, 2020
  5. Explain Your Move: Understanding Agent Actions Using Specific and Relevant Feature Attribution
    Nikaash Puri, Sukriti Verma, Piyush Gupta, and 4 more authors
    In International Conference on Learning Representations, 2020
  6. Understanding RL Vision
    Jacob Hilton, Nick Cammarata, Shan Carter, and 2 more authors
    Distill, 2020
    https://distill.pub/2020/understanding-rl-vision
  7. Exploratory Not Explanatory: Counterfactual Analysis of Saliency Maps for Deep Reinforcement Learning
    Akanksha Atrey, Kaleigh Clary, and David Jensen
    In International Conference on Learning Representations, 2020

2019

  1. Visual rationalizations in deep reinforcement learning for atari games
    Laurens Weitkamp, Elise Pol, and Zeynep Akata
    In Artificial Intelligence: 30th Benelux Conference, BNAIC 2018, ’s-Hertogenbosch, The Netherlands, November 8–9, 2018, Revised Selected Papers 30, 2019
  2. The (un)reliability of saliency methods
    Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, and 5 more authors
    Explainable AI: Interpreting, explaining and visualizing deep learning, 2019

2018

  1. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)
    Been Kim, Martin Wattenberg, Justin Gilmer, and 4 more authors
    2018
  2. Transparency and explanation in deep reinforcement learning neural networks
    Rahul Iyer, Yuezhang Li, Huao Li, and 3 more authors
    In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018
  3. Visualizing and understanding atari agents
    Samuel Greydanus, Anurag Koul, Jonathan Dodge, and 1 more author
    In International conference on machine learning, 2018
  4. Sanity checks for saliency maps
    Julius Adebayo, Justin Gilmer, Michael Muelly, and 3 more authors
    Advances in neural information processing systems, 2018

2016

  1. Understanding intermediate layers using linear classifier probes
    Guillaume Alain, and Yoshua Bengio
    arXiv preprint arXiv:1610.01644, 2016
  2. Graying the black box: Understanding dqns
    Tom Zahavy, Nir Ben-Zrihem, and Shie Mannor
    In International conference on machine learning, 2016
  3. Dueling network architectures for deep reinforcement learning
    Ziyu Wang, Tom Schaul, Matteo Hessel, and 3 more authors
    In International conference on machine learning, 2016

2014

  1. Deep inside convolutional networks: visualising image classification models and saliency maps
    Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman
    In Proceedings of the International Conference on Learning Representations (ICLR), 2014


