Interpreting planning in model-free RL
In this post, I summarise our forthcoming paper, in which we present evidence that a model-free reinforcement learning agent can learn to internally perform planning. The post is organised as follows. In Section 0, I provide a high-level TL;DR. After this, in Section 1, I provide a brief introduction to our work. Sections 2, 3 and 4 then detail the three primary steps of our analysis. Finally, Sections 5 and 6 conclude with some discussion and high-level takeaways.
In the context of modern deep learning, “decision-time planning” – that is, the capacity to select immediate actions by predicting and evaluating the consequences of different candidate actions – is conventionally associated with model-based, AlphaZero-style agents.
We investigate this question in the context of a Deep Repeated ConvLSTM (DRC) agent – a generic model-free agent architecture introduced by Guez et al. (2019).
Figure 1: Illustration of the DRC architecture. At each time step, the architecture encodes the input $x_t$ as a convolutional encoding $i_t$ and passes it to a stack of three ConvLSTMs, which perform three ticks of recurrent computation. The output of the final ConvLSTM after the final internal tick is then flattened, passed through an MLP, and projected to produce policy logits $\pi_t$ and a value estimate $v_t$.
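To make this concrete, below is a minimal PyTorch-style sketch of a DRC(3,3) forward pass. The layer sizes, the exact wiring between layers, and all names are illustrative assumptions rather than the precise architecture of Guez et al. (2019); the point is simply the encoder-to-stacked-ConvLSTM-to-heads structure and the three internal ticks per environment step.

```python
# Minimal, illustrative sketch of a DRC(3,3)-style forward pass. Layer sizes
# and the exact wiring are assumptions, not the precise Guez et al. (2019) setup.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels):
        super().__init__()
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class DRC(nn.Module):
    def __init__(self, obs_channels=3, hidden=32, n_layers=3, ticks=3,
                 n_actions=5, grid=8):
        super().__init__()
        self.ticks = ticks
        self.encoder = nn.Conv2d(obs_channels, hidden, kernel_size=3, padding=1)
        # Layer 0 sees the encoding i_t; higher layers also see the layer below.
        self.cells = nn.ModuleList(
            [ConvLSTMCell(hidden if l == 0 else 2 * hidden, hidden)
             for l in range(n_layers)])
        self.mlp = nn.Sequential(nn.Flatten(),
                                 nn.Linear(hidden * grid * grid, 256), nn.ReLU())
        self.policy_head = nn.Linear(256, n_actions)
        self.value_head = nn.Linear(256, 1)

    def forward(self, x_t, states):
        i_t = self.encoder(x_t)                   # convolutional encoding of x_t
        for _ in range(self.ticks):               # three internal ticks per step
            below = None
            for l, cell in enumerate(self.cells):
                inp = i_t if l == 0 else torch.cat([i_t, below], dim=1)
                states[l] = cell(inp, states[l])  # (h, c); c is the cell state probed later
                below = states[l][0]
        feats = self.mlp(states[-1][0])           # final layer, final tick
        return self.policy_head(feats), self.value_head(feats), states


# One environment step on a dummy 8x8 observation.
agent = DRC()
states = [(torch.zeros(1, 32, 8, 8), torch.zeros(1, 32, 8, 8)) for _ in range(3)]
logits, value, states = agent(torch.zeros(1, 3, 8, 8), states)
```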
Sokoban is a deterministic, episodic environment in which an agent must navigate around an 8x8 grid to push four boxes onto four targets. The agent can move a box by stepping onto the square it occupies: if, for example, the agent steps up onto a square containing a box, the box is pushed one square up. Because boxes can only be pushed, Sokoban allows agents to push boxes in such a way that levels become permanently unsolvable. It is hence a hard domain in which a single short-sighted push can doom an episode.
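To pin down these dynamics, here is a minimal sketch of the push rule just described. The coordinate-set state representation and the function name are illustrative choices for this post, not the environment implementation we actually used.

```python
# Illustrative sketch of the Sokoban push rule described above; not the
# actual environment implementation used in our experiments.
def step(agent, boxes, walls, move):
    """Apply one move and return the new (agent, boxes).

    agent: (row, col); boxes, walls: sets of (row, col);
    move: (d_row, d_col), e.g. (-1, 0) for "up".
    """
    target = (agent[0] + move[0], agent[1] + move[1])
    if target in walls:
        return agent, boxes                    # blocked by a wall: nothing moves
    if target in boxes:
        beyond = (target[0] + move[0], target[1] + move[1])
        if beyond in walls or beyond in boxes:
            return agent, boxes                # the box cannot be pushed
        boxes = (boxes - {target}) | {beyond}  # box is pushed one square along `move`
    return target, boxes
```

Because boxes can only be pushed, never pulled, a box pushed into a corner that is not a target can never be recovered; this is what makes short-sighted moves so costly.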
Figure 2: Illustration of Sokoban
Despite lacking an explicit world model, DRC agents have been shown to behave as though they are performing decision-time planning when playing Sokoban. For example, DRC agents have been shown to perform better when given additional “thinking time” prior to acting.
Yet, it was not previously known why DRC agents behave in this way: is this behaviour merely the result of complex learned heuristics, or do DRC agents truly learn to internally plan?
In our work, we take a concept-based approach to interpreting a Sokoban-playing DRC agent and demonstrate that this agent is indeed internally performing planning. Specifically, we perform three steps of analysis: (1) we probe the agent’s activations for planning-relevant concepts; (2) we investigate how the agent uses its representations of these concepts to form internal plans; and (3) we intervene on these representations to check that they influence the agent’s behaviour in the way genuine plans would.
In performing our analysis, we provide the first non-behavioural evidence that it is possible for agents to learn to internally plan without relying on an explicit world model or planning algorithm.
How might an agent learn to plan in the absence of an explicit world model? We hypothesised that a natural way for such an agent to learn to plan would be for it to internally represent a collection of planning-relevant concepts. Note that we understand a “concept” to simply be a minimal piece of task-relevant knowledge.
Two aspects of Sokoban are dynamic: the location of the agent, and the locations of the boxes. As such, we hypothesise that a planning-capable agent in Sokoban would learn the following two concepts, each relating to an individual square of the 8x8 Sokoban board: Agent Approach Direction ($C_A$), the direction (if any) from which the agent will next step onto the square, and Box Push Direction ($C_B$), the direction (if any) in which a box will next be pushed off of the square.
We use linear probes – that is, linear classifiers trained to predict these concepts from the agent’s internal activations – to determine whether the Sokoban-playing DRC agent we investigate indeed represents them. Specifically, we train 1x1 probes, which predict a square’s concept from the activations at the corresponding spatial position of the agent’s cell state alone, and 3x3 probes, which additionally use the activations at the neighbouring positions.
We measure the extent to which linear probes can accurately decode these concepts from the agent’s cell state using the Macro F1 score they achieve.
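As a concrete illustration, the snippet below sketches how a 1x1 probe of this kind could be trained and scored. The channel count, the 5-way labelling (four directions plus “not part of the plan”), and the placeholder data are all assumptions made for the sake of the example.

```python
# Illustrative sketch of training a 1x1 linear probe on the agent's cell state.
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

# A 1x1 convolution is a linear classifier applied independently at each of the
# 8x8 spatial positions; a 3x3 probe would instead use kernel_size=3, padding=1.
probe = nn.Conv2d(in_channels=32, out_channels=5, kernel_size=1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder data: in practice, cell states recorded from the agent paired
# with square-level concept labels derived from its future behaviour.
dataloader = [(torch.randn(16, 32, 8, 8), torch.randint(0, 5, (16, 8, 8)))
              for _ in range(10)]

for cell_state, labels in dataloader:
    logits = probe(cell_state)      # (B, 5, 8, 8): one prediction per square
    loss = loss_fn(logits, labels)  # cross-entropy at every board square
    opt.zero_grad()
    loss.backward()
    opt.step()

# Macro F1 of the probe's square-level predictions (here, on the last batch).
preds = probe(cell_state).argmax(dim=1)
print(f1_score(labels.flatten().numpy(), preds.flatten().numpy(), average="macro"))
```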
Figure 3: Macro F1s achieved by 1x1 and 3x3 probes when predicting (a) Agent Approach Direction and (b) Box Push Direction using the agent's cell state at each layer, or, for the baseline probes, using the observation. Error bars show ±1 standard deviation.
Our linear probes are able to accurately predict both (1) Agent Approach Direction and (2) Box Push Direction for squares of Sokoban boards. For instance, when probing the agent’s cell state at each layer, all macro F1 scores exceed 0.8 for Agent Approach Direction and 0.86 for Box Push Direction. In contrast, probes trained to predict these concepts from the agent’s observation of the Sokoban board only achieve macro F1s of 0.2 and 0.25 respectively. We take this as evidence that the agent indeed (linearly) represents the two hypothesised planning-relevant concepts. Similarly, we take the relatively minimal increase in performance when moving from 1x1 to 3x3 probes as evidence that the agent represents these concepts at localised spatial positions of its cell state.
So, the Deep Repeated ConvLSTM (DRC) agent internally represents the aforementioned planning-relevant, square-level concepts. How, then, does the agent use these representations to engage in planning?
We find that the agent uses its internal representations of Agent Approach Direction and Box Push Direction for each individual Sokoban square to formulate coherent planned paths to take around the Sokoban board, and to predict the consequences of taking these paths on the locations of boxes. That is, when we apply 1x1 probes to the agent’s cell state to predict $C_A$ and $C_B$ for every square of observed Sokoban boards, we find what appear to be (1) paths the agent expects to navigate along and (2) routes the agent expects to push boxes along. Examples (a)-(c) in Figure 4 below show the agent forming “internal plans” in this way in three example levels.
Figure 4: Examples of internal plans decoded from the agent’s cell state by a probe. Teal and purple arrows respectively indicate that a probe applied to the agent’s cell state decodes that the agent expects to next step on to, or push a box off, a square from the associated direction.
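For concreteness, here is a sketch of how plans like those in Figure 4 can be read off the cell state with a trained 1x1 probe; the class ordering and probe shape follow the illustrative assumptions of the probing sketch above.

```python
# Illustrative sketch of decoding an "internal plan": apply a trained 1x1 probe
# at every square of the cell state and keep the predicted direction.
import torch
import torch.nn as nn

DIRECTIONS = ["none", "up", "down", "left", "right"]  # assumed label ordering


def decode_plan(probe: nn.Conv2d, cell_state: torch.Tensor):
    """Return the probe's predicted direction for every square of the board.

    probe:      trained 1x1 probe, e.g. nn.Conv2d(32, 5, kernel_size=1).
    cell_state: (32, 8, 8) cell state of one ConvLSTM layer at one tick.
    """
    with torch.no_grad():
        classes = probe(cell_state.unsqueeze(0)).argmax(dim=1)[0]  # (8, 8) ids
    return [[DIRECTIONS[c] for c in row] for row in classes.tolist()]
```

Decoding in this way at every internal tick – once with a $C_A$ probe and once with a $C_B$ probe – yields square-level direction predictions of the kind visualised as arrows in Figure 4.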
This, however, raises a question: how does the agent arrive at these internal plans? Is the agent merely remembering past sequences of actions that lead to good outcomes – that is, is it performing some form of heuristic-based lookup – or has the agent learned something more akin to a generalisable planning algorithm?
One way to answer this question is to look at what happens if we force the agent to pause and “think” prior to acting at the start of episodes. If the agent is merely performing something akin to heuristic-based lookup, we would not expect its internal plan to become any more accurate when given extra “thinking time”. In contrast, if the agent were indeed performing some form of iterative planning, this is exactly what we would expect.
To test this, we forced the agent to remain stationary and not perform any actions for the first 5 steps of 1000 episodes. Since the agent performs 3 internal ticks of computation for each real time step, this corresponds to giving the agent 15 additional computational ticks of “thinking time” before it has to act in these episodes. We then used 1x1 probes to decode the agent’s internal plan (in terms of both $C_A$ and $C_B$) at each tick, and measured the correctness of the agent’s plan at each tick by computing the macro F1 obtained when using that plan to predict the agent’s actual future interactions with the environment. Figure 5 below shows that, as would be expected if the agent planned via an iterative search, the agent’s plans iteratively improve over the course of the 15 extra internal ticks of computation it performs when forced to remain still prior to acting. This is all to say that, when given extra time to think, the agent seems to iteratively refine its internal plan despite nothing in its training objective explicitly encouraging this!
Figure 5: Macro F1 when using 1x1 probes to decode $C_A$ and $C_B$ from the agent’s final-layer cell state at each of the 15 additional internal ticks performed when the agent is given 5 ‘thinking steps’, averaged over 1000 episodes.
We can also investigate whether the agent forms plans via some internal planning mechanism by qualitatively inspecting the manner in which its plans form. When visualising how the routes the agent plans to push boxes along develop over the course of episodes, we observed a few recurring “plan-formation motifs”, described below. Note that, in the examples below, a blue arrow indicates that the agent expects to push a box off of a square in the associated direction. Additionally, note that we visualise the agent’s internal plan at each of the 3 computational ticks it performs each time step.
Figure 6: An example of an episode in which the agent iteratively constructs a plan by extending its plan forward from the lower-most box.
Figure 7: An example of an episode in which the agent iteratively constructs a plan by extending its plan backward from the lower-right target.
Figure 8: An example of the agent appearing to evaluate, and subsequently modify, part of its plan. In this episode, the shortest path between the center-right box and center-most target is to push the box rightwards along the corridor. However, it is not feasible for the agent to push the box along this path, since doing so requires the agent to "get under the box" to push it up to the target. The agent cannot do this as pushing the box in this way blocks the corridor.
Figure 9: An example in which the agent initially plans to push two boxes to the same target (the left-most target) but modifies its plan such that one of the boxes is pushed to an alternate target. Note that, in this episode, the agent appears to realise that the upper-left box must be pushed to the left-most target as it, unlike the lower-left box, cannot be pushed to any other targets.
We think these motifs provide qualitative evidence that the agent forms plans using a planning algorithm that (i) is capable of evaluating plans, and that (ii) utilises a form of bidirectional search. This is interesting, as the apparent use of bidirectional search indicates that the agent has learned to plan in a manner notably different from the forward planning algorithms commonly used in model-based RL.
If the agent does indeed form plans by performing a bidirectional, evaluative search, we would expect it to be able to continue to form coherent plans in levels drawn from distributions different from the one it saw during training. This is because the agent’s learned search procedure would presumably be largely unaffected by distribution shift. In contrast, if the agent solely relied on memorised heuristics, we would expect it to be unable to form coherent plans in such levels, as the memorised heuristics may no longer apply. Interestingly, the agent can indeed form coherent plans in levels drawn from different distributions. We now provide two examples of this.
Figure 10: An example of an episode in which the agent iteratively constructs a plan to push boxes to targets despite not being able to see its own location.
Figure 11: An example of an episode in which the agent iteratively constructs a plan to push 6 boxes to 6 targets despite never having seen such a level during training.
However, the evidence presented thus far only shows that the agent constructs internal plans. It does not confirm that these apparent plans influence the agent’s behaviour in the way that would be expected if the agent were truly engaged in planning.
As such, we performed intervention experiments to determine whether the agent’s representations of Agent Approach Direction and Box Push Direction influence its behaviour in the way that would be expected if these representations were indeed used for planning. Specifically, we performed interventions in two types of handcrafted levels: (1) levels designed to test whether the agent’s representations of Agent Approach Direction shape the path it plans to navigate along, and (2) levels designed to test whether its representations of Box Push Direction shape the routes along which it plans to push boxes.
Our interventions consisted of adding the vectors learned by the probes to specific positions of the agent’s cell state, with the aim of inducing the desired behaviour; a minimal sketch of this kind of intervention is given below.
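In the sketch, the scale factor and the convention of discouraging a square by using a negative scale are simplifying assumptions for illustration, and do not necessarily match the exact procedure used in the paper.

```python
# Illustrative sketch of a probe-based intervention: shift the representation
# at a chosen square towards (or away from) a given direction by adding the
# probe's weight vector for that direction to the cell state at that position.
import torch


def intervene(cell_state, probe, square, direction_class, scale=1.0):
    """Nudge the cell state at `square` along one probe direction, in place.

    cell_state:      (C, 8, 8) cell state of one ConvLSTM layer.
    probe:           trained 1x1 probe; weight has shape (n_classes, C, 1, 1).
    square:          (row, col) position on the board.
    direction_class: index of the direction to encourage; pass scale < 0
                     to discourage it instead.
    """
    vector = probe.weight[direction_class, :, 0, 0]  # probe's direction vector, shape (C,)
    with torch.no_grad():
        cell_state[:, square[0], square[1]] += scale * vector
    return cell_state
```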
Figure 12: An example of the effect of an intervention on the path that the agent plans to follow. Teal arrows correspond to squares that the agent expects to step onto, and the direction it expects to step onto them from. The left-most gif shows the agent's plan over the initial steps of an episode when no intervention is performed. The middle gif shows the squares intervened upon, with our interventions encouraging the agent to step onto the square with the white arrow, and discouraging the agent from stepping onto the squares with the white crosses. The right-most gif shows the agent's plan over the initial steps of an episode when the intervention is performed.
Figure 13: An example of the effect of an intervention on the route that the agent plans to push boxes. Purple arrows correspond to squares that the agent expects to push boxes off of, and the direction it expects to push them in when it does so. The left-most gif shows the agent's plan over the initial steps of an episode when no intervention is performed. The middle gif shows the squares intervened upon, with our interventions encouraging the agent to push the box in the direction of the white arrow, and discouraging the agent from pushing boxes off of the squares with the white crosses. The right-most gif shows the agent's plan over the initial steps of an episode when the intervention is performed.
What is the takeaway from all of this? Well, in seeming to plan by searching over potential future actions and their environmental impacts, the agent appears to have learned to plan in a manner analogous to model-based planning. However, this raises an obvious question: how can a nominally model-free agent have learned to plan in this way?
We think the answer here is that the inductive bias of ConvLSTMs has encouraged the agent to learn to organise its internal representations in a manner that corresponds to a learned (implicit) environment model. Recall that the agent has learned a spatial correspondence between its cell states and the Sokoban grid such that the agent represents the spatially-localised concepts of Agent Approach Direction and Box Push Direction at the corresponding positions of its cell state. We contend this means that the agent’s 3D recurrent state can be seen as containing a learned “implicit” model of the environment. This is, of course, not a true world model in the sense that the agent is not explicitly approximating the full dynamics of the environment. However, the agent’s recurrent state does seemingly “do enough” to play the role of a world model in that it allows the agent to (i) formulate potential sequences of future actions and (ii) predict their environmental impacts.
Our work provides the first non-behavioural evidence that agents can learn to perform decision-time planning without either an explicit world model or an explicit planning algorithm. Specifically, we provide evidence that appears to indicate that a Sokoban-playing DRC agent internally performs a process with similarities to bidirectional search-based planning. This represents a further blurring of the classic distinction between model-based and model-free RL, and confirms that – at least with a specific architecture in a specific environment – model-free agents can learn to perform planning.
However, our work leaves many questions unanswered. For instance, what are the conditions under which an agent can learn to plan without an explicit world model? Notably, as exemplified by its multiple internal recurrent ticks, the agent we study has a slightly atypical architecture. Likewise, the 3D nature of the agent’s ConvLSTM cell states means that the agent is especially well-suited to learning to plan in the context of Sokoban’s localised, grid-based transition dynamics. While we suspect our findings hold more broadly, further work is needed to determine whether, and under what conditions, agents with more standard architectures learn to plan in this way in other environments.
This project would not have been possible without the amazing guidance and mentorship I have received from Stephen Chung, Usman Anwar, Adria Garriga-Alonso and David Krueger, to whom I am deeply grateful. I would also like to thank Alex Cloud for helpful feedback on this post.