Control Flow
Coach is built in a modular way, encouraging module reuse and reducing the amount of boilerplate code needed for developing new algorithms or integrating a new challenge as an environment. On the other hand, it can be overwhelming for new users to ramp up on the code. To help with that, here’s a short overview of the control flow.
Graph Manager
The main entry point for Coach is coach.py. The main functionality of this script is to parse the command line arguments and invoke all the sub-processes needed for the given experiment. coach.py executes the given preset file, which returns a GraphManager object.
A preset is a design pattern intended for concentrating the entire definition of an experiment in a single file. This helps with experiment reproducibility, improves readability and prevents confusion. The outcome of a preset is a GraphManager, which will usually be instantiated in the final lines of the preset.
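As a rough illustration of the preset pattern, the sketch below writes a tiny preset to a temporary file, executes it, and picks up the object it defines. Everything here is an assumption made for illustration: ToyGraphManager, the load_preset helper, the preset contents, and the convention that the preset exposes a variable named graph_manager are hypothetical and are not taken from the Coach code base.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical preset contents: the whole experiment definition lives in one
# file, and the final lines instantiate the object the launcher picks up.
PRESET_SOURCE = '''
class ToyGraphManager:
    """Stand-in for a real graph manager (hypothetical)."""
    def __init__(self, agent, environment, heatup_steps):
        self.agent = agent
        self.environment = environment
        self.heatup_steps = heatup_steps

agent = "dqn"
environment = "CartPole"
graph_manager = ToyGraphManager(agent, environment, heatup_steps=1000)
'''


def load_preset(path, attribute="graph_manager"):
    """Execute a preset file and return the object stored under `attribute`."""
    spec = importlib.util.spec_from_file_location("toy_preset", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the preset top to bottom
    return getattr(module, attribute)


with tempfile.TemporaryDirectory() as tmp:
    preset_path = Path(tmp) / "toy_preset.py"
    preset_path.write_text(PRESET_SOURCE)
    graph_manager = load_preset(preset_path)

print(type(graph_manager).__name__, graph_manager.environment)
```

Keeping the agent, environment and schedule choices in one file means that re-running the same file reproduces the same experiment definition.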
A GraphManager is an object that holds all the agents and environments of an experiment, and is mostly responsible for scheduling their work. Why is it called a graph manager? Because agents and environments are structured into a graph of interactions. For example, in hierarchical reinforcement learning schemes, there will often be a master policy agent that controls a sub-policy agent, which in turn interacts with the environment. Other schemes can have much more complex graphs of control, such as several hierarchy layers, each with multiple agents.
The graph manager’s main loop is the improve loop. The improve loop moves between 3 main phases - heatup, training and evaluation:
Heatup - the goal of this phase is to collect initial data for populating the replay buffers. The heatup phase takes place only at the beginning of the experiment, and the agents act completely randomly during this phase. Importantly, the agents do not train their networks during this phase. DQN, for example, uses 50k random steps to initialize the replay buffers.
Training - the training phase is the main phase of the experiment. The details of this phase vary between agent types, but it essentially consists of repeated cycles of acting, collecting data from the environment, and training the agent networks. During this phase, the agent uses its exploration policy in training mode, which adds noise to its actions in order to improve its knowledge about the environment state space.
Evaluation - the evaluation phase is intended for evaluating the current performance of the agent. The agents act greedily in order to exploit the knowledge aggregated so far, and the performance is averaged over multiple evaluation episodes in order to reduce the effects of stochasticity in all the components.
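The ordering of the three phases can be sketched as a small scheduler. This is a minimal illustration under stated assumptions, not Coach's implementation: the class ToyGraphManager, its method names, and the schedule parameters (heatup_steps, improve_steps, steps_between_evaluations, evaluation_episodes) are hypothetical, and each phase only records that it ran instead of driving real agents.

```python
from enum import Enum


class Phase(Enum):
    HEATUP = "heatup"
    TRAIN = "train"
    TEST = "test"


class ToyGraphManager:
    """Hypothetical scheduler that only records which phase ran and for how long."""

    def __init__(self, heatup_steps, improve_steps,
                 steps_between_evaluations, evaluation_episodes):
        self.heatup_steps = heatup_steps
        self.improve_steps = improve_steps
        self.steps_between_evaluations = steps_between_evaluations
        self.evaluation_episodes = evaluation_episodes
        self.log = []

    def heatup(self, steps):
        # Random actions only; no network updates happen in this phase.
        self.log.append((Phase.HEATUP, steps))

    def train_and_act(self, steps):
        # Act with exploration noise, collect data, and train the networks.
        self.log.append((Phase.TRAIN, steps))

    def evaluate(self, episodes):
        # Act greedily and average the results over several episodes.
        self.log.append((Phase.TEST, episodes))

    def improve(self):
        # Heatup runs once, then training and evaluation alternate.
        self.heatup(self.heatup_steps)
        done = 0
        while done < self.improve_steps:
            chunk = min(self.steps_between_evaluations, self.improve_steps - done)
            self.train_and_act(chunk)
            done += chunk
            self.evaluate(self.evaluation_episodes)


manager = ToyGraphManager(heatup_steps=50, improve_steps=100,
                          steps_between_evaluations=40, evaluation_episodes=5)
manager.improve()
for phase, amount in manager.log:
    print(phase.value, amount)
```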
Level Manager
In each of the 3 phases described above, the graph manager will invoke all the hierarchy levels in the graph in a synchronized manner. In Coach, agents do not interact directly with the environment. Instead, they go through a LevelManager, which is a proxy that manages their interaction. The level manager passes the current state and reward from the environment to the agent, and the actions from the agent to the environment.
The motivation for having a level manager is to disentangle the code of the environment and the agent, so as to allow more complex interactions. Each level can have multiple agents which interact with the environment. Who gets to choose the action for each step is controlled by the level manager. Additionally, each level manager can act as an environment for the hierarchy level above it, such that each hierarchy level can be seen as an interaction between an agent and an environment, even if the environment is just more agents in a lower hierarchy level.
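The proxy idea can be sketched in a few lines. This is a toy illustration under stated assumptions: ToyEnvironment, ConstantAgent and ToyLevelManager are hypothetical classes, and the step, act and observe signatures are chosen for this sketch and are not Coach's actual interfaces. Because the level manager exposes the same step method as the environment, a second level manager can treat it as its environment.

```python
class ToyEnvironment:
    """Hypothetical environment: the state is a step counter; 3 steps per episode."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.state >= 3
        return self.state, reward, done


class ConstantAgent:
    """Hypothetical agent that always picks the same action."""

    def __init__(self, action):
        self.action = action
        self.observations = []

    def act(self):
        return self.action

    def observe(self, state, reward, done):
        self.observations.append((state, reward, done))


class ToyLevelManager:
    """Proxy that sits between the agents of one level and their environment."""

    def __init__(self, agents, environment, acting_agent):
        self.agents = agents              # dict: name -> agent
        self.environment = environment    # a real environment or a lower level
        self.acting_agent = acting_agent  # which agent chooses the action
        self.goal = None

    def step(self, action=None):
        # An action from the level above is kept as a goal (unused in this sketch).
        self.goal = action
        chosen = self.agents[self.acting_agent].act()
        state, reward, done = self.environment.step(chosen)
        # Every agent in the level sees the environment response.
        for agent in self.agents.values():
            agent.observe(state, reward, done)
        return state, reward, done


worker = ConstantAgent(1)
master = ConstantAgent(0)
lower = ToyLevelManager({"worker": worker}, ToyEnvironment(), "worker")
upper = ToyLevelManager({"master": master}, lower, "master")
result = upper.step()
print(result, lower.goal)
```

Neither agent holds a reference to the environment here; swapping the environment or adding agents to a level only touches the level manager.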
Agent
The base agent class has 3 main functions that are used during those phases - observe, act and train.
- Observe - this function gets the latest response from the environment as input, and updates the internal state of the agent with the new information. The environment response is first passed through the agent’s InputFilter object, which processes the values in the response according to the specific agent definition. The environment response is then converted into a Transition, which contains the information from a single step \((s_t, a_t, r_t, s_{t+1}, \textrm{terminal signal})\), and is stored in the memory.
- Act - this function uses the current internal state of the agent in order to select the next action to take on the environment. This function calls the per-agent custom function choose_action, which uses the network and the exploration policy in order to select an action. The action is stored, together with any additional information (like the action value, for example), in an ActionInfo object. The ActionInfo object is then passed through the agent’s OutputFilter to allow any processing of the action (like discretization or shifting, for example) before passing it to the environment.
- Train - this function samples a batch from the memory and trains on it. The batch of transitions is first wrapped into a Batch object to allow efficient querying of the batch values. It is then passed into the agent-specific learn_from_batch function, which extracts network target values from the batch and trains the networks accordingly. Lastly, if there’s a target network defined for the agent, it syncs the target network weights with the online network.
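The division of labor between observe, act and train can be sketched with a small tabular agent. This is a simplified illustration and not the Coach base class: ToyAgent and its constructor parameters are hypothetical, the Q-tables stand in for the online and target networks, the input filter is assumed to be reward clipping, the output filter is the identity, and the target table is synced on every train call, which is a simplification chosen for this sketch.

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state terminal")


class ToyAgent:
    """Hypothetical tabular agent mirroring the observe / act / train split."""

    def __init__(self, num_states, num_actions, epsilon=0.1, gamma=0.9, lr=0.5, seed=0):
        self.rng = random.Random(seed)
        self.num_actions = num_actions
        self.epsilon, self.gamma, self.lr = epsilon, gamma, lr
        self.online_q = [[0.0] * num_actions for _ in range(num_states)]
        self.target_q = [row[:] for row in self.online_q]
        self.memory = deque(maxlen=1000)
        self.current_state = 0
        self.last_action = None

    def input_filter(self, reward):
        # Stand-in for an input filter: clip rewards to [-1, 1].
        return max(-1.0, min(1.0, reward))

    def output_filter(self, action):
        # Stand-in for an output filter: identity in this sketch.
        return action

    def choose_action(self, state):
        # Exploration policy: epsilon-greedy over the online Q-values.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.num_actions)
        values = self.online_q[state]
        return values.index(max(values))

    def act(self):
        self.last_action = self.choose_action(self.current_state)
        return self.output_filter(self.last_action)

    def observe(self, next_state, reward, terminal):
        # Filter the response, store one transition, update the internal state.
        reward = self.input_filter(reward)
        self.memory.append(Transition(self.current_state, self.last_action,
                                      reward, next_state, terminal))
        self.current_state = 0 if terminal else next_state

    def learn_from_batch(self, batch):
        # Build targets from the target table and update the online table.
        for t in batch:
            bootstrap = 0.0 if t.terminal else self.gamma * max(self.target_q[t.next_state])
            target = t.reward + bootstrap
            self.online_q[t.state][t.action] += self.lr * (target - self.online_q[t.state][t.action])

    def train(self, batch_size=4):
        batch = self.rng.sample(list(self.memory), min(batch_size, len(self.memory)))
        self.learn_from_batch(batch)
        # Sync the target table with the online table.
        self.target_q = [row[:] for row in self.online_q]


agent = ToyAgent(num_states=2, num_actions=2, epsilon=0.0)
action = agent.act()
agent.observe(next_state=1, reward=5.0, terminal=True)
agent.train()
print(action, agent.memory[0], agent.online_q)
```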