Selecting an Algorithm

As you have probably already noticed, Coach has a lot of algorithms implemented in it (see the algorithms overview diagram: _images/algorithms.png). “Ok, that’s perfect, but I am trying to build a solution for my application. How do I select the right algorithm?”

We have collected some guidelines to help you choose the right algorithm for your application. Answer the following questions to see which algorithms are the best fit for your task. The algorithms below are ordered by their release date, in descending order.

What type of actions does your task require: discrete or continuous? Do you have expert demonstrations for your task? Can you collect new data for your task dynamically? Do you have a simulator for your task? Each of these questions is discussed in more detail below, after the list of algorithms.
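Whichever algorithm the answers point you to, selecting it in Coach mostly amounts to choosing the corresponding agent parameters class in a preset. The following is a minimal sketch along the lines of the basic Coach preset structure; the module and class names follow the rl_coach package layout, but verify them against your installed version:

```python
# Minimal preset sketch: the algorithm is selected by the agent parameters class.
from rl_coach.agents.clipped_ppo_agent import ClippedPPOAgentParameters
from rl_coach.environments.gym_environment import GymVectorEnvironment
from rl_coach.graph_managers.basic_rl_graph_manager import BasicRLGraphManager
from rl_coach.graph_managers.graph_manager import SimpleSchedule

graph_manager = BasicRLGraphManager(
    # Swap in another agent's parameters class (e.g. DQNAgentParameters) to change algorithms.
    agent_params=ClippedPPOAgentParameters(),
    env_params=GymVectorEnvironment(level='CartPole-v0'),
    schedule_params=SimpleSchedule()
)
graph_manager.improve()
```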

DQN Learns action values for discrete actions, and allows learning from a replay buffer with old experiences

Rainbow Combines multiple recent innovations on top of DQN for discrete controls, and achieves much better results on known benchmarks

HAC Works only for continuous actions, and uses a hierarchy of agents to simplify learning.

DDQN An improvement over DQN, which learns more accurate action values, and therefore achieves better results on known benchmarks

DFP Works only for discrete actions, by learning to predict the future values of a set of measurements from the environment, and then using a goal vector to weight the importance of each of the measurements

MMC A simple modification to DQN which, instead of learning action values only by bootstrapping the current action value prediction, mixes in the total discounted return as well. This helps learn the correct action values faster, and is particularly useful for environments with delayed rewards.
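As a sketch of the idea, with a mixing coefficient α (a hyperparameter, introduced here only for illustration), the MMC target blends the bootstrapped one-step target with the Monte Carlo return:

    y_t = (1 - \alpha)\left(r_t + \gamma \max_{a'} Q(s_{t+1}, a')\right) + \alpha \sum_{k \ge t} \gamma^{k-t} r_k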

PAL An improvement over DQN, that tries to deal with the approximation errors present in reinforcement learning by increasing the gap between the value of the best action and the second best action.

NAF A variant of Q learning for continuous control.

NEC Uses a memory to "memorize" its experience and learn much faster by querying the memory on newly seen states.

QR DQN Uses quantile regression to learn a distribution over the action values instead of only their mean. This boosts performance on known benchmarks.

Bootstrapped DQN Uses an ensemble of DQN networks, where each network learns from a different subset of the experience in order to improve exploration.

N-Step Q Learning A variant of Q learning that uses bootstrapping of N steps ahead, instead of 1 step. Doing this makes the algorithm on-policy and therefore requires having multiple workers training in parallel in order for it to work well.
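For reference, the N-step target replaces the one-step bootstrapped target with:

    y_t = \sum_{k=0}^{N-1} \gamma^k r_{t+k} + \gamma^N \max_{a'} Q(s_{t+N}, a')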

Categorical DQN Learns a distribution over the action values instead of only their mean. This boosts performance on known benchmarks, but requires knowing the range of possible values of the accumulated rewards beforehand.
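This is why the value range must be chosen up front: in the C51 formulation, the distribution is represented on a fixed support of N atoms between V_min and V_max, and the action value is the expectation over that support:

    z_i = V_{min} + i\,\frac{V_{max} - V_{min}}{N - 1}, \qquad Q(s, a) = \sum_{i=0}^{N-1} z_i\, p_i(s, a)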

Policy Gradient Based on the REINFORCE algorithm, this algorithm learns a probability distribution over the actions. This is the simplest algorithm available in Coach, but also achieves the worst results.
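The underlying REINFORCE update follows the likelihood-ratio gradient of the expected return:

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \qquad G_t = \sum_{k \ge t} \gamma^{k-t} r_k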

Actor Critic (A3C / A2C) Combines REINFORCE with a learned baseline (Critic) to improve learning stability. It also introduced parallel training with multiple workers to speed up data collection and further improve learning stability and speed, both for discrete and continuous action spaces.

ACER Similar to A3C, with the addition of experience replay and off-policy training. To reduce variance and improve stability, it also employs bias correction and trust region optimization techniques.

SAC Soft Actor-Critic is an algorithm which optimizes a stochastic policy in an off-policy way. One of the key features of SAC is that it solves a maximum entropy reinforcement learning problem.
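The maximum entropy objective augments the return with a policy entropy bonus weighted by a temperature α (a hyperparameter):

    J(\pi) = \sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right)\right]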

DDPG An actor critic scheme for continuous action spaces which assumes that the policy is deterministic, and therefore it is able to use a replay buffer in order to improve sample efficiency.
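The deterministic policy μ_θ is updated by backpropagating the critic's value through the chosen action (the deterministic policy gradient), using states sampled from the replay buffer D:

    \nabla_\theta J \approx \mathbb{E}_{s \sim \mathcal{D}}\!\left[\left.\nabla_a Q_\phi(s, a)\right|_{a = \mu_\theta(s)}\, \nabla_\theta \mu_\theta(s)\right]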

TD3 Very similar to DDPG, i.e. an actor-critic for continuous action spaces that uses a replay buffer to improve sample efficiency. TD3 uses two critic networks to mitigate the overestimation in the Q state-action value prediction, slows down the actor updates to increase stability, and adds noise to the actions while training the critic in order to smooth out the critic's predictions.

PPO An actor critic scheme which uses bounded updates to the policy in order to make the learning process very stable.

Clipped PPO A simplification of PPO, that reduces the code complexity while achieving similar results.
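The bounded update is implemented with the clipped surrogate objective, where r_t(θ) is the probability ratio between the new and old policies and ε is the clipping range:

    L^{CLIP}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}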

BC The simplest form of imitation learning. Uses supervised learning on a dataset of expert demonstrations in order to imitate the expert behavior.
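To make the idea concrete, here is a toy, framework-agnostic sketch (not the Coach implementation): a softmax policy is fit to (state, expert action) pairs with plain supervised learning, using a synthetic dataset that stands in for recorded demonstrations.

```python
# Toy behavioral cloning sketch: fit a linear softmax policy to expert (state, action) pairs.
import numpy as np

rng = np.random.default_rng(0)
n_samples, state_dim, n_actions = 1000, 4, 2
states = rng.normal(size=(n_samples, state_dim))          # recorded expert states (synthetic here)
expert_actions = (states[:, 0] > 0).astype(int)           # stand-in "expert" action labels

W = np.zeros((state_dim, n_actions))
lr = 0.1
for _ in range(200):
    logits = states @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Cross-entropy gradient: predicted probabilities minus one-hot expert actions.
    grad = probs.copy()
    grad[np.arange(n_samples), expert_actions] -= 1.0
    W -= lr * (states.T @ grad) / n_samples

accuracy = (np.argmax(states @ W, axis=1) == expert_actions).mean()
print(f"imitation accuracy on the demonstration set: {accuracy:.2f}")
```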

CIL A variant of behavioral cloning, where the learned policy is disassembled to several skills (such as turning left or right in an intersection), and each skill is learned separately from the human demonstrations.

1. Does your environment have a discrete or continuous action space?

Some reinforcement learning algorithms work only for discrete action spaces, where the agent needs to select one out of several possible actions. Other algorithms work only for continuous action spaces, where there are infinite possible actions, but there is some spatial relationship between the actions. And there are some algorithms that can be applied in both cases. The available algorithms highly depend on the task at hand.
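If you are using an OpenAI Gym environment, a quick way to check which case you are in is to inspect the action space type (a minimal sketch, assuming gym is installed):

```python
# Discrete action spaces are gym.spaces.Discrete, continuous ones are gym.spaces.Box.
import gym

env = gym.make('CartPole-v0')
if isinstance(env.action_space, gym.spaces.Discrete):
    print(f"discrete action space with {env.action_space.n} actions")
elif isinstance(env.action_space, gym.spaces.Box):
    print(f"continuous action space with shape {env.action_space.shape}")
```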

2. Is collecting more samples from your environment painful?

Reinforcement learning algorithms are notorious for the number of samples they need for training. Typically, on-policy algorithms are much less sample efficient than off-policy algorithms, but there are other algorithmic features that improve sample efficiency even further, such as using a DND in NEC, or using Hindsight Experience Replay. It is hard to say which algorithm is the most sample efficient, but we can at least say which ones are not sample efficient.

3. Do you have a simulator that can be parallelized across multiple processes or nodes?

Parallelizing training across multiple workers, located on the same node or on different nodes, is a technique that was introduced in recent years and has achieved a lot of success in improving the results of multiple algorithms. Some algorithms do not work well without being parallelized across multiple workers, which requires having a simulator instance for each worker.
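For example, single-node multi-worker training can be launched from the Coach command line with the -n flag (the preset name and flag are shown as they appear in the Coach README; verify them against your installed version):

```
coach -p Atari_A3C -lvl breakout -n 8
```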

4. Do you have human demonstrations for solving the task?

If human demonstrations are available for a task, most of the time it is better to use them instead of training with regular reinforcement learning from scratch. To make use of human demonstrations, Coach implements several tools and algorithms for imitation learning.