Hierarchical Reinforcement Learning

There is a lot of hype around Reinforcement Learning (RL) nowadays. I believe this is with good reason: RL has achieved super-human performance in Atari video games and has even beaten the world’s best Go player. RL is also being applied to systems management and configuration, autonomous driving, robotics and more. Will RL take over the world? Not yet; first there are many problems to deal with:

  • Sample efficiency: humans need practice to become decent at a video game, but the amount of practice an RL agent needs is enormous. For simple video games like Breakout, we’re talking about thousands of hours of gameplay. Of course, this is not a problem when you have a powerful server and you’re playing a video game, but what if you need decent performance and have no access to the environment beforehand? Several approaches have been proposed to address this, such as few-shot learning or learning from demonstrations, but it is still a hot topic in RL research.
  • Scaling: efficiency gets worse as the state and action spaces grow. Think of tabular Q-learning: you have a column per action and a row per state, so with 10 states and 10 actions you have 100 q-values to calculate. But when you’re dealing with video games, for example, the state is defined by the frames of the game, which might be 100×100 pixels (just to make it round), where each pixel can take one of 256 values. The number of possible states is then 256¹⁰⁰⁰⁰, and with, say, 10 actions, you end up with 10·256¹⁰⁰⁰⁰ q-values… oops (see the short sketch after this list). To make it worse, think about continuous state or action spaces; this is indeed a complicated problem. Deep RL can cope with these huge state spaces, but doing so in a sample-efficient manner is still a challenge.
  • Generalization and transfer learning: in addition, RL agents can perform very well on the task they have been trained on, and only on that task. If they are given a new task, their performance will be terrible again.
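
Here is the arithmetic from the scaling bullet as a minimal Python sketch; the numbers are the same toy figures used above and are purely illustrative:

```python
# Toy illustration of why tabular Q-learning does not scale.
# A Q-table needs one entry per (state, action) pair.

n_states = 10           # a small grid-world-like problem
n_actions = 10
print(n_states * n_actions)            # 100 Q-values: easy

# A 100x100 greyscale frame, each pixel taking one of 256 values:
n_pixels = 100 * 100
n_values = 256
n_image_states = n_values ** n_pixels  # 256**10000 possible frames
n_q_values = n_image_states * n_actions

# Python integers are arbitrary precision, so we can at least count the digits.
print(len(str(n_q_values)))            # over 24,000 digits: no table can hold this
```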

As I have said before, many approaches have been developed to manage these problems, and Hierarchical Reinforcement Learning (HRL) is one of them. But what is HRL really about?

Hierarchical Reinforcement Learning to the rescue

Walking is not as simple as it looks [Photo by Kira auf der Heide on Unsplash.]

Let’s say that while you’re reading this, your doorbell rings. What are you going to do? Well, you might think of it as simply getting up and going to the door, but what you actually have to do is much more complex: you have to control many muscles in order to stand up, then you have to walk to the door, which again requires controlling your muscles and keeping your balance until you reach it. However, you only think about the high-level actions (getting up, going to the door), because the lower-level actions come naturally.

HRL was born exactly with this in mind. Instead of having just one policy that has to achieve the goal of the task, we now have several sub-policies that work together in a hierarchical structure. This approach brings many benefits:

  • Improved exploration: thanks to the use of high-level actions, the action space is reduced and exploration is simpler.
  • Sample efficiency: states can also be managed in a hierarchical way, and low-level policies can hide irrelevant information from their higher-level policies. This reduces the state space too, which improves sample efficiency as I explained before.
  • Transfer learning: sub-policies, or low-level policies, can be re-used for different tasks. Let’s say an agent has learnt how to make coffee, but now you want it to learn how to make a cappuccino. If you split the task of making a cappuccino into making coffee and then warming up milk and making it foamy, the agent can already use its experience making coffee, which will accelerate learning for the new task.

There are different approaches to implement HRL in practice. I have hand-picked a few and made a summary of them for your delight:

The Options Framework

The Options Framework, introduced by Sutton, Precup & Singh in 1999, provides a way to implement hierarchies and macro-actions in RL. This is probably one of the most common formulations of HRL.

In the Options Framework, the agent is in a Semi-Markov Decision Process (SMDP), an extension of Markov Decision Processes in which the time between actions is not constant, allowing it to include different levels of abstraction for its actions.

The Options Framework introduces the concept of an option, a generalization of an action that lets us introduce macro-actions. An example of an option could be picking up an object (which involves a sequence of muscle twitches), but it could also be a single muscle twitch. In these two examples, the first is a macro-action formed by multiple primitive actions, while the second is a primitive action. Sutton et al. (1999) introduce the concept of options and describe them like this:

Options consist of three components: a policy π : S × A → [0, 1], a termination condition β : S⁺ → [0, 1], and an initiation set I ⊆ S. An option ⟨I, π, β⟩ is available in state sₜ if and only if sₜ ∈ I. If the option is taken, then actions are selected according to π until the option terminates stochastically according to β.

This means that each time a new state is observed, the initiation set I is checked to see whether the current state belongs to it. If so, the option starts: from then on, π determines the action to take at each step (instead of the global policy), and in each state the termination condition β is checked; if it returns True, the option finishes and we’re back to using the global policy. In the image below, we can see how using options lets us manage actions and states with different temporal lengths.

State transitions in an MDP are represented by discrete-time steps of the same duration, while in an SMDP steps have different durations. Options allow us to analyze the trajectory either way [figure from Sutton et al. (1999).]
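
To make that execution loop concrete, here is a minimal sketch in plain Python; the Option container, the Gym-like environment interface (simplified to return state, reward, done) and the random choice among available options are my own illustrative assumptions, not part of the original formulation:

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

# Minimal container for an option <I, pi, beta> as described by Sutton et al. (1999).
@dataclass
class Option:
    initiation_set: Set[int]             # I: states where the option may start
    policy: Callable[[int], int]         # pi: maps a state to a primitive action
    termination: Callable[[int], float]  # beta: probability of terminating in a state

def run_episode(env, options, global_policy, max_steps=100):
    """Alternate between the global policy and option sub-policies."""
    state = env.reset()
    for _ in range(max_steps):
        # Options whose initiation set contains the current state are available.
        available = [o for o in options if state in o.initiation_set]
        if available:
            # For the sketch we start a random available option;
            # a real agent would let its policy over options make this choice.
            option = random.choice(available)
            # Follow the option's internal policy until beta says stop.
            while True:
                state, _, done = env.step(option.policy(state))
                if done or random.random() < option.termination(state):
                    break
        else:
            state, _, done = env.step(global_policy(state))
        if done:
            break
```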

The definition of the options for each problem is left to the use case, which increases the complexity of specifying the problem to be treated with RL. Also, the Options Framework does not consider task segmentation explicitly.

Feudal Reinforcement Learning

Control hierarchy of managers in FRL [image from Pixabay.]

Feudal Reinforcement Learning (FRL) defines a control hierarchy in which a level of managers controls sub-managers while being controlled, in turn, by super-managers. Each manager assigns goals to its sub-managers, and the sub-managers perform actions to achieve those goals and obtain a reward.

FRL applied to a maze [figure from Dayan & Hinton (1993).]

FRL was originally conceived to solve a maze problem. The agent starts at a random location in the maze and must find its way to a certain location. The managers are structured as shown in the figure above: the blue area is the goal for the task and the “U” represents a barrier the agent cannot cross. The actions for each manager are North, South, East, West or *, where * is a special action meaning the manager should explore inside its own region. In the top layer the manager can only perform *, while in the bottom layer the managers can only perform NSEW. At the start, learning takes place bottom-up, with each manager in the bottom layer learning where it should go; then some top-down learning optimises the overall trajectory in the maze. The actions learned by each manager are shown in the following image:

Actions and their learnt probabilities are shown for each manager: the circle (or *) represents the action of exploring within the area assigned to the manager, and the colors on the area represent the actions to go NSEW. The area of each action is proportional to its probability [figure from Dayan & Hinton (1993).]

FRL is based on two main principles: (1) Reward Hiding and (2) Information Hiding. Reward Hiding means that managers must reward sub-managers for doing their bidding, whether or not this satisfies the commands of the super-managers; each manager thus learns on its own how to satisfy the level above it. Information Hiding refers to the fact that managers only need to know the state of the system at the granularity of their own choices of tasks. Thanks to this, higher levels work at a broader granularity with a simplified state.
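
As a toy illustration of these two principles (my own simplification, not code from the paper), the sketch below shows a manager that only sees which quadrant of a maze the agent is in (information hiding) and rewards its sub-manager for reaching the commanded quadrant regardless of the external reward (reward hiding):

```python
# Toy illustration of Reward Hiding and Information Hiding in Feudal RL.
# Positions are (x, y) cells in a 16x16 maze; quadrants are the manager's coarse view.

def quadrant(pos, size=16):
    """Information hiding: the manager only sees the quadrant, not the exact cell."""
    x, y = pos
    return (x >= size // 2, y >= size // 2)

def manager_reward(commanded_quadrant, new_pos, external_reward):
    """Reward hiding: the sub-manager is paid for following orders,
    whether or not the external reward was obtained (it is deliberately ignored)."""
    return 1.0 if quadrant(new_pos) == commanded_quadrant else 0.0

# Example: the manager ordered "go to the top-right quadrant".
print(manager_reward((True, True), new_pos=(12, 9), external_reward=0.0))   # 1.0
print(manager_reward((True, False), new_pos=(12, 9), external_reward=5.0))  # 0.0
```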

FRL represented a big step forward for HRL, but its application to domains other than the one shown here was not successful, being highly inefficient in some cases.

In 2017, DeepMind brought FRL back and applied it to Deep RL. They took inspiration from the original architecture and some of the principles of FRL: they created a modular neural network in which a Manager receives the state and reward from the environment as input and outputs an embedding of the state together with a sub-goal that the Worker has to achieve. The Worker then chooses the actions and tries to achieve the goal set by its Manager.
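
A very rough sketch of that modular idea is shown below, assuming nothing beyond NumPy; the layer sizes, random weights, greedy action choice and the direction-based intrinsic reward are my own illustrative choices, not the exact equations of the DeepMind paper:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, goal_dim, n_actions = 8, 4, 5

# Manager: maps the (embedded) state to a goal direction for the Worker.
W_manager = rng.normal(size=(goal_dim, state_dim))

# Worker: maps state + goal to action preferences.
W_worker = rng.normal(size=(n_actions, state_dim + goal_dim))

def manager(state):
    goal = W_manager @ state
    return goal / (np.linalg.norm(goal) + 1e-8)      # unit-length sub-goal

def worker(state, goal):
    logits = W_worker @ np.concatenate([state, goal])
    return int(np.argmax(logits))                    # greedy action for the sketch

def intrinsic_reward(old_state, new_state, goal):
    """Reward the Worker for moving in the direction the Manager asked for."""
    delta = new_state - old_state
    return float(delta @ goal / (np.linalg.norm(delta) + 1e-8))

state = rng.normal(size=state_dim)
goal = manager(state)
action = worker(state, goal)
new_state = rng.normal(size=state_dim)               # stand-in for the next observation
print(intrinsic_reward(state, new_state, goal))
```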

The Option-Critic Architecture

The Option-Critic architecture extends the Options Framework: it is capable of learning both the internal policies and the termination conditions of options, without the need to provide any additional rewards or subgoals.

In this case, options are assumed to be available in every state, and the internal policy and termination condition are learned, so options do not have to be defined beforehand. The goal of the policy over options is to decide at each step which option to use; it then leaves the sub-policy of that option in control until that option’s termination condition fires. The policies and termination conditions are learned with an approach based on the policy gradient theorem from Sutton et al. (2000).

In addition, the Option-Critic architecture is based on the actor-critic architecture, in which there is an actor that tries actions and a critic that evaluates how good each action is and communicates this back to the actor. The intra-option policies, termination functions and policy over options belong to the actor part of the system, while the critic consists of Q_U, the value of executing an action in the context of a state–option pair, and A_Ω, the advantage function over options. You can see the architecture in the following figure:

The Option-Critic architecture [figure from Bacon et al. (2016).]
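
To see how the actor and critic pieces fit together at execution time, here is a heavily simplified sketch; the tabular shapes, the epsilon-greedy policy over options and the random stand-in for the environment transition are my own simplifications, and all learning updates are omitted:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, n_options = 6, 4, 2
epsilon = 0.1

# --- Actor: intra-option policies, termination functions, policy over options ---
intra_option_logits = rng.normal(size=(n_options, n_states, n_actions))
termination_logits = rng.normal(size=(n_options, n_states))

# --- Critic: option-value table Q_Omega(s, omega); Q_U and A_Omega would drive
#     the (omitted) learning updates for the actor ---
Q_omega = np.zeros((n_states, n_options))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def choose_option(state):
    """Epsilon-greedy policy over options, driven by the critic's Q_Omega."""
    if rng.random() < epsilon:
        return int(rng.integers(n_options))
    return int(np.argmax(Q_omega[state]))

def act(state, option):
    """Intra-option policy: sample a primitive action."""
    return int(rng.choice(n_actions, p=softmax(intra_option_logits[option, state])))

def terminates(state, option):
    """Learned termination function beta(s, omega)."""
    beta = 1.0 / (1.0 + np.exp(-termination_logits[option, state]))
    return rng.random() < beta

# Execution loop (actor and critic updates omitted):
state, option = 0, choose_option(0)
for _ in range(10):
    action = act(state, option)
    state = int(rng.integers(n_states))   # stand-in for an environment transition
    if terminates(state, option):
        option = choose_option(state)
```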

Try it yourself

If you want to give it a try yourself, I have found an implementation of the Option-Critic architecture in TensorFlow, which you can download from GitHub. The repository uses OpenAI Gym, but if you want to try it in other environments, you can check this list of RL environments.

Thank you for reading!