Exploration Policies
Exploration policies are the component that allows the agent to trade off exploration and exploitation according to a predefined policy. This is one of the most important aspects of reinforcement learning agents, and it can require some tuning to get right. Coach supports several pre-defined exploration policies, and it can easily be extended with custom policies. Note that not all exploration policies are expected to work for both discrete and continuous action spaces. The table below summarizes the supported action spaces, and a usage sketch follows it.
| Exploration Policy | Discrete Action Space | Box Action Space |
|---|---|---|
| AdditiveNoise | ✗ | ✓ |
| Boltzmann | ✓ | ✗ |
| Bootstrapped | ✓ | ✗ |
| Categorical | ✓ | ✗ |
| ContinuousEntropy | ✗ | ✓ |
| EGreedy | ✓ | ✓ |
| Greedy | ✓ | ✓ |
| OUProcess | ✗ | ✓ |
| ParameterNoise | ✓ | ✓ |
| TruncatedNormal | ✗ | ✓ |
| UCB | ✓ | ✗ |
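As a usage sketch, an exploration policy is typically chosen by assigning its parameters class to the agent's `exploration` field in a preset. The snippet below is a minimal sketch, assuming the common Coach classes `DQNAgentParameters`, `EGreedyParameters`, and `LinearSchedule` live under the import paths shown; verify them against your Coach version.

```python
# Minimal sketch: selecting an exploration policy in a Coach preset.
# Assumes these import paths exist in your rl_coach version.
from rl_coach.agents.dqn_agent import DQNAgentParameters
from rl_coach.exploration_policies.e_greedy import EGreedyParameters
from rl_coach.schedules import LinearSchedule

agent_params = DQNAgentParameters()

# Use e-greedy exploration, decaying epsilon from 1.0 to 0.01 over 10,000 steps
agent_params.exploration = EGreedyParameters()
agent_params.exploration.epsilon_schedule = LinearSchedule(1.0, 0.01, 10000)
agent_params.exploration.evaluation_epsilon = 0.001
```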
ExplorationPolicy
- class rl_coach.exploration_policies.exploration_policy.ExplorationPolicy(action_space: rl_coach.spaces.ActionSpace)[source]

An exploration policy takes the predicted actions or action values from the agent, and selects the action to actually apply to the environment using some predefined algorithm. A sketch of a custom policy follows the method descriptions below.
- Parameters
action_space – the action space used by the environment
change_phase(phase)[source]
Change between the running phases of the algorithm.
- Parameters
phase – either Heatup or Train
- Returns
None
get_action(action_values: List[Union[int, float, numpy.ndarray, List]]) → Union[int, float, numpy.ndarray, List][source]
Given a list of values corresponding to each action, choose one action according to the exploration policy.
- Parameters
action_values – a list of action values
- Returns
The chosen action, and the probability of the action (if available; otherwise 1, for absolute certainty in the action)
requires_action_values() → bool[source]
Allows exploration policies to declare whether they require the action values for the current step. This can save a lot of computation. For example, in e-greedy, if the random value generated is smaller than epsilon, the action is completely random, and the action values don't need to be calculated.
- Returns
True if the action values are required, False otherwise
reset()[source]
Used for resetting the exploration policy parameters when needed.
- Returns
None
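Custom policies are added by subclassing ExplorationPolicy. The sketch below is a hypothetical uniform-random policy for discrete action spaces. It follows the interface described above (returning the chosen action together with its probability); the exact import paths, the `.actions` attribute, and a matching parameters class (not shown) are assumptions to check against your Coach version.

```python
# A minimal sketch of a custom exploration policy, assuming the base class
# lives at rl_coach.exploration_policies.exploration_policy.ExplorationPolicy.
import numpy as np
from rl_coach.exploration_policies.exploration_policy import ExplorationPolicy
from rl_coach.spaces import DiscreteActionSpace


class UniformRandom(ExplorationPolicy):
    """Hypothetical policy: ignores the action values and acts uniformly at random."""

    def __init__(self, action_space: DiscreteActionSpace):
        super().__init__(action_space)

    def get_action(self, action_values):
        # Assumes the discrete action space exposes its actions as `.actions`
        num_actions = len(self.action_space.actions)
        action = np.random.randint(num_actions)
        # Every action is equally likely, so the probability is 1 / num_actions
        return action, 1.0 / num_actions

    def requires_action_values(self) -> bool:
        # The agent can skip the network forward pass entirely for this policy
        return False

    def reset(self):
        pass
```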
AdditiveNoise
- class rl_coach.exploration_policies.additive_noise.AdditiveNoise(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, noise_as_percentage_from_action_space: bool = True)[source]

AdditiveNoise is an exploration policy intended for continuous action spaces. It takes the action from the agent and adds Gaussian-distributed noise to it. The amount of noise can be given in two different ways (a sketch follows the parameter list):

1. Specified by the user as a noise schedule, taken as a percentage of the action space size.
2. Specified by the agent's action. If the agent's action is a list with 2 values, the 1st is assumed to be the mean of the action, and the 2nd is assumed to be its standard deviation.
- Parameters
action_space – the action space used by the environment
noise_schedule – the schedule for the noise
evaluation_noise – the noise variance that will be used during evaluation phases
noise_as_percentage_from_action_space – a bool deciding whether the noise is absolute or a percentage of the action space
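The following is a minimal NumPy sketch of the first variant (a user-supplied noise level taken as a percentage of the action-space range); it is illustrative only and does not reproduce the class internals.

```python
import numpy as np

def additive_noise_action(agent_action: np.ndarray,
                          noise_percentage: float,
                          action_low: np.ndarray,
                          action_high: np.ndarray) -> np.ndarray:
    """Illustrative only: add Gaussian noise scaled by a fraction of the action range."""
    action_range = action_high - action_low
    noise_std = noise_percentage * action_range
    noisy_action = agent_action + np.random.normal(0.0, noise_std)
    # Keep the result inside the action space bounds
    return np.clip(noisy_action, action_low, action_high)

# Example: 10% noise on a 1-D action bounded in [-1, 1]
action = additive_noise_action(np.array([0.3]), 0.1, np.array([-1.0]), np.array([1.0]))
```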
Boltzmann
- class rl_coach.exploration_policies.boltzmann.Boltzmann(action_space: rl_coach.spaces.ActionSpace, temperature_schedule: rl_coach.schedules.Schedule)[source]

The Boltzmann exploration policy is intended for discrete action spaces. It assumes that each of the possible actions has some value assigned to it (such as the Q value), and uses a softmax function to convert these values into a distribution over the actions. It then samples the action to play from the calculated distribution. A temperature schedule, given by the user, controls the steepness of the softmax function (a sketch follows the parameter list).
- Parameters
action_space – the action space used by the environment
temperature_schedule – the schedule for the temperature parameter of the softmax
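A minimal NumPy sketch of temperature-controlled softmax sampling (illustrative, not the class internals):

```python
import numpy as np

def boltzmann_action(q_values: np.ndarray, temperature: float) -> int:
    """Sample an action from softmax(Q / temperature)."""
    # Subtract the max for numerical stability before exponentiating
    logits = (q_values - q_values.max()) / temperature
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.random.choice(len(q_values), p=probs))

# High temperature -> close to uniform; low temperature -> close to greedy
action = boltzmann_action(np.array([1.0, 2.0, 0.5]), temperature=0.5)
```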
Bootstrapped
- class rl_coach.exploration_policies.bootstrapped.Bootstrapped(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, architecture_num_q_heads: int, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = ...)[source]

The Bootstrapped exploration policy is currently only used for discrete action spaces, along with the Bootstrapped DQN agent. It assumes that there is an ensemble of network heads, where each one predicts the values for all the possible actions. For each episode, a single head is selected to lead the agent, according to its value predictions. In evaluation, the action is selected using a majority vote over all the heads' predictions (a sketch follows the parameter list).
Note
This exploration policy will only work for discrete action spaces with Bootstrapped DQN style agents, since it requires the agent to have a network with multiple heads.
- Parameters
action_space – the action space used by the environment
epsilon_schedule – a schedule for the epsilon values
evaluation_epsilon – the epsilon value to use for evaluation phases
continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if the e-greedy is used for a continuous policy
architecture_num_q_heads – the number of q heads to select from
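A NumPy sketch of the head-selection logic described above (illustrative only): a single head leads during training episodes, and evaluation takes a majority vote across the heads' greedy choices.

```python
import numpy as np

def select_head(num_heads: int) -> int:
    """Pick one head to lead the agent for the upcoming episode."""
    return int(np.random.randint(num_heads))

def bootstrapped_action(q_values_per_head: np.ndarray, lead_head: int, evaluation: bool) -> int:
    """q_values_per_head has shape (num_heads, num_actions)."""
    if evaluation:
        # Majority vote: each head votes for its own greedy action
        votes = q_values_per_head.argmax(axis=1)
        return int(np.bincount(votes).argmax())
    # During training, follow the head chosen for this episode
    return int(q_values_per_head[lead_head].argmax())
```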
Categorical
- class rl_coach.exploration_policies.categorical.Categorical(action_space: rl_coach.spaces.ActionSpace)[source]

The Categorical exploration policy is intended for discrete action spaces. It expects the action values to represent a probability distribution over the actions, from which a single action will be sampled. In evaluation, the action with the highest probability is selected. This is particularly useful for actor-critic schemes, where the actor's output is a probability distribution over the actions (a sketch follows the parameter list).
- Parameters
action_space – the action space used by the environment
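A small NumPy sketch of the sampling behavior (illustrative): sample from the given probabilities during training, pick the argmax at evaluation.

```python
import numpy as np

def categorical_action(action_probs: np.ndarray, evaluation: bool) -> int:
    """action_probs is the actor's output: a probability distribution over actions."""
    if evaluation:
        return int(action_probs.argmax())
    return int(np.random.choice(len(action_probs), p=action_probs))
```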
ContinuousEntropy
- class rl_coach.exploration_policies.continuous_entropy.ContinuousEntropy(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, noise_as_percentage_from_action_space: bool = True)[source]

ContinuousEntropy is an exploration policy that is actually implemented as part of the network; the exploration policy class is only a placeholder for choosing this policy. The exploration is implemented by adding a regularization factor to the network loss, which regularizes the entropy of the action distribution. This exploration policy is only intended for continuous action spaces, and assumes that the entire calculation is implemented as part of the network head (a sketch of the entropy term follows the parameter list).
Warning
This exploration policy expects the agent or the network to implement the exploration functionality. Only a few heads are relevant and actually implement the entropy regularization factor.
- Parameters
action_space – the action space used by the environment
noise_schedule – the schedule for the noise
evaluation_noise – the noise variance that will be used during evaluation phases
noise_as_percentage_from_action_space – a bool deciding whether the noise is absolute or a percentage of the action space
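For intuition, entropy regularization for a Gaussian policy typically adds a term proportional to the policy entropy to the loss; the minimal sketch below illustrates the general idea and is not Coach's head implementation.

```python
import numpy as np

def gaussian_entropy(sigma: np.ndarray) -> float:
    """Differential entropy of a diagonal Gaussian policy: 0.5 * log(2*pi*e*sigma^2) per dimension."""
    return float(np.sum(0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)))

def regularized_loss(policy_loss: float, sigma: np.ndarray, beta: float = 0.01) -> float:
    # Subtracting the entropy term encourages the policy to keep exploring
    return policy_loss - beta * gaussian_entropy(sigma)
```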
EGreedy
- class rl_coach.exploration_policies.e_greedy.EGreedy(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = ...)[source]

e-greedy is an exploration policy intended for both discrete and continuous action spaces (a sketch of the discrete case follows the parameter list).

For discrete action spaces, it assumes that each action is assigned a value, and it selects the action with the highest value with probability 1 - epsilon. Otherwise, it selects an action sampled uniformly out of all the possible actions. The epsilon value is given by the user and can be given as a schedule. In evaluation, a different epsilon value can be specified.

For continuous action spaces, it assumes that the mean action is given by the agent. With probability epsilon, it samples a random action from within the action space bounds. Otherwise, it selects the action according to a given continuous exploration policy, which is set to AdditiveNoise by default. In evaluation, the action is always selected according to the given continuous exploration policy (with its phase set to evaluation as well).
- Parameters
action_space – the action space used by the environment
epsilon_schedule – a schedule for the epsilon values
evaluation_epsilon – the epsilon value to use for evaluation phases
continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if the e-greedy is used for a continuous policy
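A minimal NumPy sketch of the discrete case (illustrative only):

```python
import numpy as np

def e_greedy_action(q_values: np.ndarray, epsilon: float) -> int:
    """With probability epsilon act uniformly at random, otherwise act greedily."""
    if np.random.rand() < epsilon:
        # Random branch: the action values are not needed at all
        return int(np.random.randint(len(q_values)))
    return int(q_values.argmax())
```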
Greedy
- class rl_coach.exploration_policies.greedy.Greedy(action_space: rl_coach.spaces.ActionSpace)[source]

The Greedy exploration policy is intended for both discrete and continuous action spaces. For discrete action spaces, it always selects the action with the maximum value, as given by the agent. For continuous action spaces, it always returns the exact action, as it was given by the agent.
- Parameters
action_space – the action space used by the environment
OUProcess
- class rl_coach.exploration_policies.ou_process.OUProcess(action_space: rl_coach.spaces.ActionSpace, mu: float = 0, theta: float = 0.15, sigma: float = 0.2, dt: float = 0.01)[source]

The OUProcess exploration policy is intended for continuous action spaces, and selects the action according to an Ornstein-Uhlenbeck process. The Ornstein-Uhlenbeck process implements the action noise as a Gaussian process whose samples are correlated between consecutive time steps (a sketch follows the parameter list).
- Parameters
action_space – the action space used by the environment
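A minimal sketch of the discretized Ornstein-Uhlenbeck update using the constructor's mu, theta, sigma, and dt parameters (illustrative only, not the class internals):

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process: temporally correlated Gaussian noise."""

    def __init__(self, size: int, mu: float = 0.0, theta: float = 0.15,
                 sigma: float = 0.2, dt: float = 0.01):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.state = np.full(size, mu)

    def sample(self) -> np.ndarray:
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        dx = self.theta * (self.mu - self.state) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state

noise = OUNoise(size=2)
noisy_action = np.array([0.1, -0.4]) + noise.sample()
```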
ParameterNoise
- class rl_coach.exploration_policies.parameter_noise.ParameterNoise(network_params: Dict[str, rl_coach.base_parameters.NetworkParameters], action_space: rl_coach.spaces.ActionSpace)[source]

The ParameterNoise exploration policy is intended for both discrete and continuous action spaces. It applies exploration by replacing all the dense network layers with noisy layers. The noisy layers have both weight means and weight standard deviations, and for each forward pass of the network the weights are sampled from a normal distribution that follows the learned weight mean and standard deviation values (a sketch follows the parameter list).
Warning
Currently supported only by DQN variants.
- Parameters
action_space – the action space used by the environment
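For intuition, a noisy dense layer keeps a mean and a standard deviation for every weight and resamples the weights on each forward pass. The NumPy sketch below illustrates the idea; it is not Coach's noisy-layer implementation, and the initial sigma value is an arbitrary illustrative choice.

```python
import numpy as np

class NoisyDense:
    """Illustrative noisy layer: weights ~ N(w_mu, w_sigma), resampled per forward pass."""

    def __init__(self, in_features: int, out_features: int):
        self.w_mu = np.random.randn(in_features, out_features) * 0.1
        self.w_sigma = np.full((in_features, out_features), 0.017)  # learned in practice
        self.b_mu = np.zeros(out_features)
        self.b_sigma = np.full(out_features, 0.017)

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Sample fresh weights and biases for this forward pass
        w = self.w_mu + self.w_sigma * np.random.randn(*self.w_mu.shape)
        b = self.b_mu + self.b_sigma * np.random.randn(*self.b_mu.shape)
        return x @ w + b
```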
TruncatedNormal
- class rl_coach.exploration_policies.truncated_normal.TruncatedNormal(action_space: rl_coach.spaces.ActionSpace, noise_schedule: rl_coach.schedules.Schedule, evaluation_noise: float, clip_low: float, clip_high: float, noise_as_percentage_from_action_space: bool = True)[source]

The TruncatedNormal exploration policy is intended for continuous action spaces. It samples the action from a normal distribution, where the mean action is given by the agent, and the standard deviation can be given in two different ways (a sketch follows the parameter list):

1. Specified by the user as a noise schedule, taken as a percentage of the action space size.
2. Specified by the agent's action. If the agent's action is a list with 2 values, the 1st is assumed to be the mean of the action, and the 2nd is assumed to be its standard deviation.

When the sampled action is outside of the action bounds given by the user, it is resampled repeatedly until it falls within the bounds.
- Parameters
action_space – the action space used by the environment
noise_schedule – the schedule for the noise variance
evaluation_noise – the noise variance that will be used during evaluation phases
noise_as_percentage_from_action_space – whether to consider the noise as a percentage of the action space or as an absolute value
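A minimal NumPy sketch of the rejection-sampling behavior (illustrative only; the bounded retry count and clipping fallback are simplifications of the "resample until in bounds" rule):

```python
import numpy as np

def truncated_normal_action(mean_action: np.ndarray, noise_std: np.ndarray,
                            clip_low: np.ndarray, clip_high: np.ndarray,
                            max_tries: int = 100) -> np.ndarray:
    """Sample around the agent's mean action, rejecting samples outside the bounds."""
    for _ in range(max_tries):
        action = np.random.normal(mean_action, noise_std)
        if np.all(action >= clip_low) and np.all(action <= clip_high):
            return action
    # Fallback: clip if rejection sampling fails to find an in-bounds sample
    return np.clip(action, clip_low, clip_high)
```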
UCB
- class rl_coach.exploration_policies.ucb.UCB(action_space: rl_coach.spaces.ActionSpace, epsilon_schedule: rl_coach.schedules.Schedule, evaluation_epsilon: float, architecture_num_q_heads: int, lamb: int, continuous_exploration_policy_parameters: rl_coach.exploration_policies.exploration_policy.ExplorationParameters = ...)[source]

The UCB exploration policy follows the upper confidence bound heuristic to sample actions in discrete action spaces. It assumes that there are multiple network heads predicting action values, and that the standard deviation between the heads' predictions represents the uncertainty of the agent in each of the actions. It then updates the action value estimates to be mean(actions) + lambda * stdev(actions), where lambda is given by the user. This exploration policy aims to take advantage of the uncertainty of the agent in its predictions, and selects the action according to the tradeoff between how uncertain the agent is and how large it predicts the outcome of those actions to be (a sketch follows the parameter list).
- Parameters
action_space – the action space used by the environment
epsilon_schedule – a schedule for the epsilon values
evaluation_epsilon – the epsilon value to use for evaluation phases
architecture_num_q_heads – the number of q heads to select from
lamb – lambda coefficient for taking the standard deviation into account
continuous_exploration_policy_parameters – the parameters of the continuous exploration policy to use if the e-greedy is used for a continuous policy
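A minimal NumPy sketch of the upper-confidence-bound action scoring described above (illustrative only):

```python
import numpy as np

def ucb_action(q_values_per_head: np.ndarray, lamb: float) -> int:
    """q_values_per_head has shape (num_heads, num_actions)."""
    mean_q = q_values_per_head.mean(axis=0)
    std_q = q_values_per_head.std(axis=0)   # disagreement between heads = uncertainty
    ucb_values = mean_q + lamb * std_q
    return int(ucb_values.argmax())
```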