Whether you are a student in Georgia Tech's OMSCS CS 7642 Reinforcement Learning course or a curious learner, this article discusses the high-level concepts behind the Advantage Actor-Critic with Proximal Policy Optimization algorithm.
Here are some of the topics that will be discussed:
This article is part of a series of articles for the OMSCS CS 7642 Reinforcement Learning class.
OMSCS Reinforcement Learning Series:
Hello dear reader! I'm thrilled that I get to share my knowledge. However, I am still a student, journeying through the complex field of computer science. While I strive to present information that is accurate and insightful, there may be instances where I misinterpret or oversimplify concepts. Please approach this article with a curious and critical mind and consult the learning material sections for the most accurate and comprehensive information available.
Learning Materials:
This article assumes that you are somewhere in the middle of your reinforcement learning journey. However I have an article on the introductory concepts called "" which can be found here.
This section is meant to showcase items I had to understand at the primary level before attempting to code the proximal policy optimization algorithm. I will not teach these concepts from scratch; I will quickly summarize them and post learning materials for deeper learning.
Policy iteration aims to find the state-value function of all states and use that to find an agent policy.
We can learn more about the world from that policy and use that to recalculate the state-value function. This loop between improving the agent (actor) and the value function (critic) is essential to understanding the actor-critic framework.
Despite policy iteration being in the family of dynamic programming, where the model must be fully known, this algorithm shows that it's possible to use a mix of policy evaluation and policy improvement to help an agent understand more about the world.
Lastly, although the policy is initially set to some random policy, the algorithm will converge to the optimal policy.
Policy Iteration Algorithm:
$$ \begin{aligned} & \text{1. Initialization: } \\ & \text{Hyperparameters: } \text{small threshold } \theta > 0 \\ & V(s) \in \mathbb{R} \text{ and } \pi (s) \in A(s) \text{ arbitrarily for all } s \in S \\ & \\ & \text{2. Policy Evaluation} \\ & \text{Loop:} \\ & \quad \triangle \leftarrow 0 \\ & \quad \text{Loop for each } s \in S \text{:} \\ & \quad \quad v \leftarrow V(s) \\ & \quad \quad V(s) \leftarrow \sum_{s', r} p(s',r \mid s, \pi(s))[r + \gamma V(s')] \\ & \quad \quad \triangle \leftarrow \max(\triangle, \mid v - V(s) \mid) \\ & \text{until } \triangle < \theta \\ & \\ & \text{3. Policy Improvement} \\ & \text{policy-stable } \leftarrow \text{true} \\ & \text{For each } s \in S \text{:} \\ & \quad \text{old-action } \leftarrow \pi(s) \\ & \quad \pi(s) \leftarrow \text{argmax}_{a} \sum_{s', r} p(s',r \mid s, a)[r + \gamma V(s')] \\ & \quad \text{If old-action } \neq \pi(s) \text{, then policy-stable} \leftarrow \text{false} \\ & \text{If policy-stable, then stop and return } V \approx v_* \text{ and } \pi \approx \pi_* \text{; else go to 2} \end{aligned} $$
Learning Materials:
Policy evaluation is the first half of the algorithm. It updates the state-value function of all states based on the current policy. The goal is simply to understand what the state values are given the current policy. Lastly, note that every state-value update requires a policy to evaluate.
With the updated state-value function for all states based on the current policy, all the policy improvement phase aims to do is seek a better policy based on the new state-value functions of all states.
If the policy changed in at least one state, the algorithm must run the policy evaluation phase again in order to determine the state-value function of all states based on the new policy.
This section is key to understanding A2C, as it tells us that the state-value functions found in the policy evaluation phase do not represent the new policy; the state-value functions are, in effect, outdated. This is how the actor Neural Network updates will work.
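To make the evaluation and improvement loop concrete, here is a minimal sketch in plain Python. The two-state MDP in `P`, the discount `GAMMA`, and all function names are made up for illustration; they are not from the course material or any library.

```python
# Hypothetical 2-state, 2-action MDP (an assumption for illustration):
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9   # discount factor
THETA = 1e-8  # evaluation stops once no state value changes by more than this


def q_value(V, s, a):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def policy_evaluation(policy, V):
    """Sweep all states until the largest change drops below THETA."""
    while True:
        delta = 0.0
        for s in P:
            v = V[s]
            V[s] = q_value(V, s, policy[s])
            delta = max(delta, abs(v - V[s]))
        if delta < THETA:
            return V


def policy_improvement(policy, V):
    """Make the policy greedy with respect to V; report whether it stayed the same."""
    stable = True
    for s in P:
        old_action = policy[s]
        policy[s] = max(P[s], key=lambda a: q_value(V, s, a))
        if old_action != policy[s]:
            stable = False
    return stable


def policy_iteration():
    policy = {s: 0 for s in P}  # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        V = policy_evaluation(policy, V)
        if policy_improvement(policy, V):
            return policy, V
```

In this toy MDP the initial policy is changed by the first improvement step, which forces another evaluation pass: the values computed for the old policy are outdated as soon as the policy moves.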
Value iteration is a value-based method. It is essentially the policy evaluation phase of the policy iteration algorithm, except that each update moves towards the best action.
In a sense we can find the optimal policy without knowing what our policy is beforehand, meaning we don't need a policy in order to make updates. We can simply aim to improve the values of all states by updating in the direction that leads to the best rewards and, when satisfied, derive a policy from the state-value functions of all states.
Value Iteration Algorithm:
$$\begin{aligned} & \text{Hyperparameters: } \text{small threshold } \theta > 0 \text{, } \triangle = +\infty \\ & \text{Initialize } V(s) = 0 \text{, for all } s \in S^+, V(terminal) = 0 \\ & \\ & \text{Repeat until } \triangle \leq \theta \text{:} \\ & \quad \triangle \leftarrow 0 \text{; } V'(s)= V(s); \\ & \quad \text{for each } s \in S \text{:}\\ & \quad \quad V'(s) = \max_{a \in A} \sum_{s',r}p(s',r \mid s,a)[r + \gamma V(s')] \\ & \quad \quad \triangle \leftarrow \max(\triangle, \mid V'(s) - V(s) \mid) \\ & \quad V \leftarrow V' \\ & \\ & \text{Output a deterministic policy, } \pi \approx \pi_{*}, \text{ such that} \\ & \pi(s) = \text{ argmax}_{a} \sum_{s',r} p(s',r \mid s,a)[r + \gamma V(s')] \end{aligned}$$
Key Bellman Equation Update:
$$ V'(s) = \max_{a \in A} \sum_{s',r}p(s',r \mid s,a)[r + \gamma V(s')]$$
Notice that in this update equation, unlike policy evaluation, all we are doing is updating towards the best action possible. In a sense this almost feels like a quasi action-value update where one action rules them all. Lastly, deriving the policy requires going over all states and selecting the best action you can take from each state based on the transition probabilities of the actions available in that state.
The main concept to take away is that the A2C algorithm has a very similar updating feel for the critic Neural Network. The critic does not have to worry about the policy in order to improve the state-value function of states.
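The same idea in code, as a rough sketch: the values are updated towards the best action without ever holding a policy, and a policy is only derived at the end. The tiny MDP in `P`, `GAMMA`, and the function names are assumptions made up for this example.

```python
# Hypothetical 2-state, 2-action MDP (an assumption for illustration):
# P[s][a] is a list of (probability, next_state, reward) outcomes.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9   # discount factor
THETA = 1e-8  # stop once no state value changes by more than this


def backup(V, s, a):
    """Expected value of taking action a in state s under the current estimates V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration():
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        new_V = dict(V)
        for s in P:
            # No policy needed: update towards the best available action.
            new_V[s] = max(backup(V, s, a) for a in P[s])
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta <= THETA:
            break
    # Only now derive a deterministic greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: backup(V, s, a)) for s in P}
    return policy, V
```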
Learning Materials:
What makes Monte-Carlo Methods and the Temporal Difference Monte-Carlo TD(1) special is that they can both be considered full trajectory algorithms when the environment's model is unknown, also called model-free methods.
In this case, Monte-Carlo Methods are ways of solving the problem based on only the rewards seen in an episode. TD(1) has the same spirit in that regard, except the algorithm can be updated based on some arbitrary metric, such as total time steps.
More information on this will be helpful if one wants to do an Advantage Actor-Critic with Proximal Policy Optimization algorithm.
Lastly, I recommend understanding the concept of Monte-Carlo Methods and why they are important in the context of the Law of Large Numbers and the Central Limit Theorem. In essence, they involve improving an approximation, or a guess, by increasing the number of samples.
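A small sketch of that idea: averaging more sampled returns gives a better guess of the true expected return. The `noisy_return` sampler is a made-up stand-in for an episode's return, with a true mean of 1.0 chosen purely for illustration.

```python
import random


def noisy_return(rng):
    """Hypothetical episode return: true mean 1.0 plus uniform noise in [-1, 1]."""
    return 1.0 + rng.uniform(-1.0, 1.0)


def monte_carlo_mean(sample_fn, n, seed=0):
    """Estimate an expected value by averaging n independent samples."""
    rng = random.Random(seed)
    return sum(sample_fn(rng) for _ in range(n)) / n
```

Calling `monte_carlo_mean(noisy_return, n)` with a growing `n` should give estimates that tend to settle closer to 1.0, which is the Law of Large Numbers at work.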
Learning Materials:
The REINFORCE algorithm, also called the Monte Carlo Policy Gradient algorithm, seeks to optimize the policy by updating the parameters of the Neural Network to move towards the direction of the true goal.
REINFORCE Algorithm:
$$\begin{aligned} & \text{Loop until satisfied (for each episode):} \\ & \quad \text{Generate an episode } S_0, A_0, R_1, \text{ ... }, S_{T-1}, A_{T-1}, R_{T} \text{ , following } \pi (\cdot \mid \cdot, \theta) \\ & \quad \text{Loop for each step of the episode } t = 0,1, \text{ ... , } T-1: \\ & \quad \quad G \leftarrow \sum_{k=t+1}^T \gamma^{k-t-1} R_k \\ & \quad \quad \theta \leftarrow \theta + \alpha \gamma^t G \nabla \ln \pi(A_t \mid S_t, \theta) \end{aligned}$$
The thing to notice about the REINFORCE algorithm is that you have to commit to the entire episode trajectory before any updates occur. Once an update on the policy (Neural Network) occurs, the previously sampled trajectory is no longer helpful because it was generated based on an older and outdated policy.
One thing to keep in mind is that these samples have high variance and low bias for full episode trajectories.
The variance is high because of the possible "radical" changes to the policy on every update along with the many different possible trajectories. The bias is low because REINFORCE updates based on the rewards seen in the environment and propagates those back to all visited states, thus eliminating bootstrapping, or deriving a guess from another guess.
Low bias does not mean the removal of bias, as bad biases can still result from bad initial samples. Then again, the law of large numbers does come into play sometimes.
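As a rough sketch of the REINFORCE update, the code below computes the return \(G\) for every step of a finished episode and nudges a tabular softmax policy. The tabular `theta` layout (a list of action logits per state) and the function names are my own assumptions; a real agent would use a Neural Network and an autograd library.

```python
import math


def discounted_returns(rewards, gamma):
    """Return G_t for every step t, where rewards[t] holds R_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, episode, alpha, gamma):
    """episode: list of (state, action, reward); theta[state] is a list of logits."""
    returns = discounted_returns([r for _, _, r in episode], gamma)
    for t, ((s, a, _), g) in enumerate(zip(episode, returns)):
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # Gradient of ln softmax w.r.t. logit b: 1{b == a} - pi(b | s).
            grad_log = (1.0 if b == a else 0.0) - probs[b]
            theta[s][b] += alpha * (gamma ** t) * g * grad_log
    return theta
```

Note that the whole episode has to be collected before `discounted_returns` can run, which is exactly the "commit to the entire trajectory" property described above.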
To better understand the actor-critic framework introduced in the next section, one must appreciate the REINFORCE with Baseline algorithm and how it incorporates a state-value function into its equation to "stabilize" the algorithm and, therefore, allow for faster learning.
Learning Materials:
The Actor-Critic framework extends the REINFORCE with a baseline algorithm and incorporates a form of bootstrapping into the equation.
This is done by introducing two Neural Networks into the equation, one Neural Network for the agent (actor) and the second for the state-value function (critic).
Then all one has to do is take the \(\delta\) or Temporal Difference Error term from the equation \(\delta = R + \gamma v(s') - v(s)\) and apply that error term to both the actor and critic neural networks (based on Sutton and Barto algorithm).
One-Step Actor-Critic (Episodic):
$$\begin{aligned} & \text{Actor: a differentiable policy parameterization } \pi(a \mid s, \theta) \\ & \text{Critic: a differentiable state-value function parameterization } V(s,w) \\ & \text{Parameters: learning rate } \alpha^\theta \in (0,1], \alpha^w \in (0,1] \\ & \text{Initialize policy parameter } \theta \in \mathbb{R}^{d'} \text{ and state-value weights } w \in \mathbb{R}^d \text{ (e.g., to zero)} \\ & \\ & \text{Loop until satisfied (for each episode):} \\ & \quad \text{Initialize } S \\ & \quad I \leftarrow 1 \\ & \quad \text{Loop while } S \text{ is not terminal (for each time step):} \\ & \quad \quad A \sim \pi(\cdot \mid S, \theta) \\ & \quad \quad \text{Take action } A, \text{ observe } S',R \\ & \quad \quad \delta \leftarrow R + \gamma V(S',w) - V(S,w) \quad \text{ (if } S' \text{ is terminal, then } V(S',w)=0 \text{ )} \\ & \quad \quad w \leftarrow w + \alpha^w \delta \nabla V(S,w) \\ & \quad \quad \theta \leftarrow \theta + \alpha^\theta I \delta \nabla \ln \pi(A \mid S,\theta) \\ & \quad \quad I \leftarrow \gamma I \\ & \quad \quad S \leftarrow S' \end{aligned}$$
The most important thing to note is the TD error. It will help you understand the Advantage Actor-Critic algorithm.
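Here is a minimal sketch of the TD error and the critic half of the update, assuming a tabular critic stored in a dict `V` (my simplification; the algorithm above uses a differentiable function with weights \(w\)). The function names are hypothetical.

```python
def td_error(reward, v_s, v_next, gamma, terminal):
    """delta = R + gamma * V(S') - V(S), with V(S') = 0 when S' is terminal."""
    bootstrap = 0.0 if terminal else gamma * v_next
    return reward + bootstrap - v_s


def critic_update(V, s, s_next, reward, alpha_w, gamma, terminal):
    """Move V[s] towards the bootstrapped target and return the TD error."""
    delta = td_error(reward, V[s], V[s_next], gamma, terminal)
    # For a tabular critic the gradient of V(s) w.r.t. its own entry is 1.
    V[s] += alpha_w * delta
    return delta
```

The returned `delta` is the same number the actor would multiply with \(\nabla \ln \pi(A \mid S, \theta)\), which is why one error term can drive both networks.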
Learning Materials:
Keep in mind that the Advantage Actor-Critic (A2C) algorithm and the Proximal Policy Optimization (PPO) algorithm are two different things.
A2C is an evolution of Asynchronous Advantage Actor-Critic (A3C), and all A2C does differently is make updates synchronously. A3C makes use of the advantage function \(A(s,a) = Q(s,a) - V(s)\), which in practice is estimated with the critic as \(r + \gamma V(s') - V(s)\).
A2C w/ PPO Algorithm:
$$\begin{aligned} & \text{Definitions: } \theta_A = \text{actor NN, } \theta_C = \text{critic NN} \\ & \text{Learning Rate Parameters: } \alpha_A \geq 0, \alpha_C \geq 0 \\ & \text{Entropy Strength Parameter: } \beta \geq 0 \\ & J^{Clip} \text{ update range: } \epsilon \geq 0 \\ & \text{Amount of Workers Parameter: } N \geq 1 \\ & \text{Timestep Collection Parameter: } T \geq 1 \\ & \text{Minibatch Size Parameter: } M \leq NT \\ & \text{Epoch Parameter: } K \geq 1 \\ & \text{Randomly initialize } \theta_A \text{ and } \theta_C \text{ parameters} \\ & \theta_{A_{old}} = \theta_A \\ & \\ & \text{Loop for every iteration until satisfied:} \\ & \quad \theta_{A_{old}} = \theta_A \\ & \quad \text{Loop for } N \text{ workers:} \\ & \quad \quad \text{Gather and store } T \text{ experiences } \langle s_{t},a_{t},r_{t+1},s_{t+1} \rangle \text{ from } \theta_{A_{old}} \\ & \quad \quad \text{Compute Advantages (} A_{t} = A_{t_{GAE}} \text{) of the experiences using } \theta_C \\ & \quad \quad \text{Compute V-targets } V_{target}(s_t) \text{ from } \theta_C \\ & \\ & \quad \text{Let batch of size } NT \text{ consist of the collected experiences, Advantages, and V-targets} \\ & \\ & \quad \text{Loop for } K \text{ epochs:} \\ & \quad \quad \text{Shuffle batch of size } NT \\ & \\ & \quad \quad \text{Loop for each minibatch of size } M \text{ from batch:} \\ & \quad \quad \text{(Calculations done on entire set of minibatch of size } M \text{)} \\ & \\ & \quad \quad \quad \text{Normalize Advantages:} \\ & \quad \quad \quad \forall A_{minibatch}, A_t := \frac{A_t - A_{mean}}{A_{std}} \\ & \\ & \quad \quad \quad \text{Calculate Importance Sampling from } \theta_A \text{ and } \theta_{A_{old}} \text{:} \\ & \quad \quad \quad r_{t}(\theta) = \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)} \\ & \\ & \quad \quad \quad \text{Calculate Clipped Surrogate Function from Advantages:} \\ & \quad \quad \quad J^{Clip}_{t}(\theta) = min(r_t(\theta)A_t, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t) \\ & \\ & \quad \quad \quad \text{Calculate Entropy Regularization from } \theta_A \text{:} \\ & \quad \quad \quad H(p,\pi) = \text{ any entropy formula e.g. Shannon's Entropy} \\ & \\ & \quad \quad \quad \text{Calculate Actor Loss, from } J^{Clip}_t \text{ and entropy:} \\ & \quad \quad \quad Loss_{\theta_{A}} = -(J^{Clip}_t + \beta H(p,\pi)) \\ & \\ & \quad \quad \quad \text{Calculate predicted V-value } V_\pi(s_t) \text{ using } \theta_C \\ & \quad \quad \quad \text{Calculate Critic Loss (e.g. Mean-Squared-Error):} \\ & \quad \quad \quad Loss_{\theta_{C}} = MSE(V_\pi(s_t), V_{target}(s_t)) \\ & \\ & \quad \quad \quad \text{Update NNs by gradient descent on the losses:} \\ & \quad \quad \quad \theta_C = \theta_C - \alpha_C \nabla_{\theta_C} Loss_{\theta_C} \\ & \quad \quad \quad \theta_A = \theta_A - \alpha_A \nabla_{\theta_A} Loss_{\theta_A} \end{aligned}$$
The great thing is that with the introduction of the advantage function, we are now able to update the policy directly or indirectly from bootstrapping; however, our collected trajectories are still based on only the actions generated by the policy.
Proximal Policy Optimization (PPO), although referred to as an algorithm, is more of a framework or methodology of how an algorithm should incorporate "extra" things to reduce variance, increase stability, and improve sample efficiency by allowing multiple updates on the same sample generated; something that vanilla policy algorithms like REINFORCE are not capable of doing.
Due to this, PPO can work with many policy-based algorithms, such as Actor-Critics and REINFORCE, but not with value-based algorithms, such as SARSA, Q-Learning, and Deep-Q Networks.
What makes A2C with PPO hard is that getting a good algorithm working requires the concepts of multiple papers combined into one algorithm you use for your agent in the environment.
Lastly, it should be noted that Actor-Critic is often referred to as "A2C" in the literature, as Actor-Critic and Advantage Actor-Critic are very similar.
Learning Materials:
The advantage function measures how much better an action turned out to be than what the critic expected from the state. It is the metric that the actor's objective function tries to maximize.
Advantage Function:
$$ A(s_t,a_t) = r_{t+1} + \gamma V(s_{t+1}) - V(s_{t})$$
This should look familiar as it's the standard bootstrapping formula: the reward for taking an action in a state, plus the next state's discounted cumulative reward estimate, minus the current state's discounted cumulative reward estimate.
The advantage function just tells us that every state-action pair has an advantage function that can be compared to other actions in the same state, where the goal of the agent is to learn which actions to maximize.
One important thing about the advantage function is that we now have a better way of updating the actor's Neural Network using the critic's value function, as presented in the paper.
Lastly, we are now able to replace the advantage function with other types of updates such as N-Step or TD updates. In this article's case, I only consider the generalized advantage estimation for replacing the advantage function presented in the A3C paper.
Generalized Advantage Estimation (GAE) is simply an evolution of how TD errors are calculated. As I see it, the inspiration for GAE comes from both TD(\(\lambda\)) and N-Step algorithms.
Generalized Advantage Estimation:
$$ A_{GAE}(s_t,a_t) = \sum_{l=0}^{\infty} (\lambda \gamma)^{l} \delta_{t+l}$$
where \(\delta_t\) is the temporal difference error \(\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_{t})\), \(\lambda \in [0,1]\) is a discounting factor for the impact of future temporal difference errors (similar in effect to TD(\(\lambda\))), \(\gamma \in [0,1]\) is the discount factor for cumulative future reward, \(t\) is the timestep where a state-action pair was initiated, and \(l\) indexes the future temporal difference errors being discounted. Also note that the \(V(s')\) of a terminal state is 0.
In the context of an Actor-Critic paradigm, consider the environment, agents' actions, and the requirements of the task at hand when selecting the values of both \(\lambda\) and \(\gamma\). Eventually, the advantage function starts to converge as the policy improves, and therefore, the policy eventually converges.
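The infinite sum in GAE is usually computed backwards over a finite rollout, since \(A_t = \delta_t + \gamma \lambda A_{t+1}\). Below is a minimal sketch of that recursion in plain Python. The argument layout (`rewards`, `values`, `dones`, `last_value`) and the default values of `gamma` and `lam` are my own assumptions, not something prescribed by the paper.

```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one rollout.

    rewards[t] is r_{t+1}, values[t] is V(s_t), and dones[t] is True when
    s_{t+1} is terminal. last_value is V(s_T), used to bootstrap when the
    rollout was cut off before the episode ended.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # TD error; the value of a terminal next state is treated as 0.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Recursive form of the discounted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

With `lam=0` this collapses to the one-step TD error, and with `lam=1` it becomes the full Monte-Carlo return minus the critic's estimate, so \(\lambda\) trades bias against variance between those two extremes.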
The GAE paper pairs the estimator with Trust Region Policy Optimization (TRPO) for the policy update, which is expensive to calculate because it relies on second-order derivative information. GAE itself is only an advantage estimator built from first-order quantities (the TD errors). It is PPO's clipped surrogate function, discussed later in this article, that keeps the robustness of trust regions while staying with first-order derivatives.
This article uses GAE to replace the advantage function discussed in the A3C paper.
Lastly, the literature in the field discusses normalizing the advantage. This keeps everything centered around a mean of 0 and a standard deviation of 1.
Learning Materials:
Based on the literature on PPO, one technique for stabilizing the actor during training is normalizing the advantages. This normalization should only be applied to the advantages of the experiences that are part of the minibatch collection.
For all the advantages that are part of the minibatch collection:
$$ \forall A_{minibatch}, A := \frac{A - A_{mean}}{A_{std}} $$
where \(\forall A_{minibatch}\) means for all advantages in the minibatch, and \(A_{mean}\) and \(A_{std}\) are the mean and standard deviation of the entire advantage collection in the minibatch. \(A\) represents the advantage value, which we would like to replace with the formula on the right side of the equation.
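A short sketch of the normalization: subtract the minibatch mean and divide by the minibatch standard deviation so the result has mean 0 and standard deviation 1. The small `eps` term is my own addition to avoid dividing by zero when all advantages in the minibatch are equal.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Center a minibatch of advantages at 0 and scale it to unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]
```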
Entropy regularization is the idea that an algorithm can promote exploration by simply adding an entropy term to the actor update, or in the paper's case, any function approximator in charge of determining the policy.
Entropy Regularization Formula: $$ V^{\pi}_{\tau} := V^{\pi}(p) + \tau H(p,\pi)$$
where \(\tau \geq 0\) determines how much one would like to "explore," which is also called the regularization parameter. The regularization parameter \(\tau\) controls the weight or influence of the entropy term \(H(p,\pi)\).
When adding the entropy regularization term \( \tau H(p,\pi) \), we give an extra "reward" for the uncertainty in the agent's policy; the more uniformly distributed a policy is, the more the agent does not know which actions lead to better rewards, the better the "bonus reward" the agent gets. Exploration comes from the fact that entropy regularization incentivizes the agents to explore state-action pairs where the outcomes of rewards are unknown.
The original paper does not mention an upper limit for \(\tau\), meaning that any value \(\tau \geq 0 \) should suffice, although it does mention that the value of \(\tau\) should be "sufficiently small." Empirical results in reinforcement learning research suggest \(\tau \in [0.01, 0.02]\), a range that gives a good mix of exploration and exploitation, although this depends on the problem at hand. One thing to note is that \(\tau\) is referred to as \(\beta\) in the reinforcement learning literature since \(\tau\) is typically reserved for the concept of episodic experience in reinforcement learning.
In the Actor-Critic framework, over time, as the environment becomes better known and as the agent's policy starts to converge, the entropy bonus reward starts to diminish as well.
Learning Materials:
Discounted entropy is what is suggested by Cen et al. in their paper "Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization."
Discounted Entropy:
$$ H(p,\pi) = \sum_{t=0}^{\infty} ( - \lambda^{t} \times log(\pi(a_t|s_t) ) )$$
where \(\lambda^{t}\) represents a discount factor at the timestep the action was taken, meaning that less emphasis is placed on future timesteps, and \(\pi(a|s)\) is just the probability of taking an action from a state based on the policy.
This suggests to me that heavy emphasis is placed on earlier actions when policies in states are uncertain. As earlier policies improve, consideration is then taken for the uncertainty in policies of future states with regard to a decay factor \(\lambda\). Keep in mind that I may be reading the original paper wrong or, more likely, interpreting the math wrong.
From my experiments, I could replace the term \(H(p,\pi)\) with other forms of entropy. In my final result, I replaced the paper discounted entropy with Shannon's Entropy. It may or may not have improved my results; I needed more time to compare and contrast. Do what you will with this information.
Shannon's Entropy:
$$ H(p,\pi) = -\sum_{a \in A} (\pi(a|s_t) \times \log_{2}(\pi(a|s_t)))$$
where \(A\) is the entire action space for policy \(\pi\), and \(H(p,\pi) \in [0, \log_{2}|A|]\), with \(|A|\) being the number of actions.
The environment in which an agent operates is a significant factor that can heavily influence the agent's actions. This consideration is critical in Project 3, where the environment is static and predictable.
In the case of Shannon's Entropy, when all actions are equally likely, the value is at its highest. Likewise, if there is one action that has a probability of 100%, then the value is 0.
One thing to keep in mind is that the larger the action space, the larger the entropy bonus will be for a uniform policy.
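These properties are easy to check with a few lines of Python. This is only a sketch for a single state's action probabilities; the function name is my own, and terms with probability 0 are skipped because \(0 \log 0\) is treated as 0.

```python
import math


def shannon_entropy(probs):
    """Entropy in bits of a discrete action distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)
```

A uniform policy over 4 actions gives 2 bits and over 8 actions gives 3 bits, while a policy that puts all its probability on one action gives 0.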
Is it possible to incentivize aggressive behavior through the concept I am coining for this article as "Reverse Entropy Regularization?"
Reverse Entropy Regularization:
$$ V^{\pi}_{\tau} := V^{\pi}(p) - \tau H(p,\pi) $$
Take note of the entropy formula's subtraction term instead of the addition. When adding, we give an extra "reward" for uncertainty in the agent's policy; the more uniformly distributed a policy is, the more reward said agent receives.
However, if one were to give a "penalty" for uncertain policies, as shown in the formula above, would this incentivize the agent to approach a more aggressive or deterministic behavior?
I leave this as food for thought, as I am not well-versed in math, but intuitively, this makes sense.
In my experiments this does not keep the agent from learning an optimal policy, but it does learn slower than when "rewarding" uniform policies. This is based on LunarLander and Shannon's Entropy, where \(\tau\) and \(H(p,\pi)\) are non-negative values, \(V^{\pi}(p)\) is replaced with \(J^{Clip}_{t}(\theta)\) as discussed later in this article, and \(H(p,\pi) \in [0, 2]\). Training for both entropy regularization methods led to +200 episodic rewards.
As stated before, proximal policy optimization (PPO) is the idea that by introducing a change to the objective function, also called the surrogate objective, we can change the way the A2C algorithm updates its actor and critic, which allows for multiple updates.
Since A2C can update actor and critic Neural Networks in multiple epochs and batches using multiple trajectories from the policy, we can use larger values for the learning rate.
For this article, I am only concerned with showcasing two concepts: importance sampling and the clipped surrogate function.
Learning Materials:
As explained in the Foundations of Deep Reinforcement Learning and the Berkeley slide on Importance Sampling, the concept of importance sampling can be used to update the experiences collected from an older policy.
Importance Sampling Ratio:
$$ r_{t}(\theta) = \frac{\pi(a_t|s_t)}{\pi_{old}(a_t|s_t)}$$
Importance sampling allows us to determine how far off our current policy is from the policy used to collect the original experiences. By creating this ratio, we are calculating how likely our new policy will select the action the old policy "did" select. Ultimately, we can use this ratio to add weight to the older samples or, in our case, the advantage function.
In essence, if the new policy is more likely than the old policy to select the action the old policy selected, the ratio will be greater than one and, therefore, increase the weight of the sample. The same is true in reverse. If the new policy is less likely to select the action the old policy selected, then the weight of that sample is reduced.
This concept works but also risks causing exploding or vanishing values. This "issue" is reduced by introducing the Clipped Surrogate Function.
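In practice the ratio is often computed from log-probabilities, since that is what policy networks commonly return and it is numerically safer than dividing two small numbers. A minimal sketch, with a hypothetical function name:

```python
import math


def importance_ratio(new_log_prob, old_log_prob):
    """r_t = pi(a|s) / pi_old(a|s), computed as exp(log pi - log pi_old)."""
    return math.exp(new_log_prob - old_log_prob)
```

For example, if the old policy picked the action with probability 0.3 and the new policy would pick it with probability 0.6, the ratio is 2, so the sample's weight doubles.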
Learning Materials:
In essence, the clipped surrogate function allows us to use importance sampling along with the advantage function without the drawbacks of exploding or vanishing values from the importance sample. This is done by clipping the minimum and maximum value possible for these updates. Also, keep in mind that the clipped surrogate function seeks to maximize.
Clipped Surrogate Function:
$$ J^{Clip}_{t}(\theta) = min(r_t(\theta)A_t, clip(r_t(\theta), 1 - \epsilon, 1 + \epsilon)A_t)$$
where \(\epsilon\) is the "clipping" parameter keeping \(r_t(\theta)\) from becoming too large or too small of a ratio. \(\epsilon = 0.2\) is recommended in the literature.
To summarize, we take the minimum of two candidates: the unclipped importance ratio times the advantage, and the clipped importance ratio times the advantage. The clipping helps with the issue of vanishing and exploding importance values. Whichever candidate is smaller is the more conservative estimate of the improvement, meaning that the algorithm makes pessimistic updates to the Neural Network.
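A small sketch of the clipped surrogate for a single sample makes the pessimism visible. The function name is my own, and a real implementation would apply this to a whole minibatch of tensors rather than to floats.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - epsilon, 1 + epsilon) * A) for one sample."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a ratio of 1.5 and an advantage of 2.0, the unclipped value is 3.0 but the result is capped at 2.4. With the same ratio and an advantage of -1.0, the unclipped value of -1.5 is kept because it is the smaller, more pessimistic of the two.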
Proximal Policy Optimization (PPO) requires breaking up similarities and correlations between the collected experiences during minibatch updates.
Most environments require a sequential set of state-action pairs before achieving the goal. So, when saving all experiences inside a shared experience buffer, the experience at index 100 is similar to the experience at index 99 and index 101. Due to these similarities, we wish to break these up during minibatch updates.
We do this by collecting many different experiences and randomly ordering them before taking mini-batches. This has a stabilizing effect and, therefore, leads to more performant results.
Note that the PPO algorithm should utilize all experiences on every epoch regardless of the ordering.
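One way to sketch this is to shuffle the indices of the experience buffer once per epoch and then slice them into minibatches, so every experience is used exactly once per epoch. The generator name and its arguments are assumptions of mine.

```python
import random


def minibatch_indices(batch_size, minibatch_size, rng):
    """Yield shuffled index lists that together cover the whole batch once."""
    indices = list(range(batch_size))
    rng.shuffle(indices)
    for start in range(0, batch_size, minibatch_size):
        yield indices[start:start + minibatch_size]
```

Calling `minibatch_indices(N * T, M, random.Random())` once per epoch gives a fresh ordering each time; the last minibatch is smaller when `M` does not divide the batch size evenly.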
Since this article focuses on the surrogate objective function, we must modify how the actor improves its policy. In this case, the actor should aim to update the policy based on the surrogate objective and the entropy term you have defined.
We can update the actor Neural Network by simply using the results from the clipped surrogate function:
Loss (Actor):
$$ Loss_{actor} = -(J^{Clip}_{t}(\theta) + \beta H(p,\pi))$$
where \(\beta\) is the regularization parameter and \(H\) is the entropy term.
Also, note how we flip the results with a negative sign as the clipped surrogate function seeks to maximize.
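As a sketch, the actor loss for a minibatch could be computed like this. Averaging over the minibatch is my assumption (the formula above is written per sample), and the default `beta` is just a placeholder value.

```python
def actor_loss(surrogates, entropies, beta=0.01):
    """Negative of (mean clipped surrogate + beta * mean entropy) over a minibatch."""
    n = len(surrogates)
    mean_surrogate = sum(surrogates) / n
    mean_entropy = sum(entropies) / n
    # Optimizers minimize, so we negate the quantity we want to maximize.
    return -(mean_surrogate + beta * mean_entropy)
```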
Now that we have calculated the advantage function, we also need to use that to update the critic neural network.
Since we used GAE, we make the target use that calculation:
$$ V_{target}(s_t) = A_{GAE}(s_t,a_t) + V(s_t)$$
We then use \(V_{target}(s_t)\) along with \(V(s_t)\) as inputs into the critic's loss function.
Loss (Critic):
$$ Loss_{critic} = MSE(V(s_t), V_{target}(s_t))$$
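The two critic formulas above fit in a few lines of Python. This is only a sketch on plain lists with function names of my own; with a Neural Network, `predictions` would come from a fresh forward pass of the critic, while the targets stay fixed from collection time.

```python
def value_targets(advantages, values):
    """V_target(s_t) = A_GAE(s_t, a_t) + V(s_t), using the values stored at collection time."""
    return [a + v for a, v in zip(advantages, values)]


def mse(predictions, targets):
    """Mean squared error between the critic's predictions and the targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```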
I'm sure many have said the same thing, as I will say: hyperparameter selection is problem domain dependent. For example, my hyperparameter selection for the problem of CartPole never seems to work with LunarLander and vice versa. However, I will note that solving larger problems lends itself to easier hyperparameter tuning for smaller problems.
Many people have delved into hyperparameter choices, and I recommend reading academic papers to get an idea of where to start.
When selecting the \(\beta\) coefficient for entropy regularization, consider how deterministic or stochastic your environment is for the agent. LunarLander has built-in randomness acting against your agent at the start of every episode, whereas CartPole has no stochastic changes to your agent's starting position.
When selecting the size and shape of your Neural Networks, consider the complexity of the environment. If a neural network is too small, the agent will not be able to pick up on the environment's nuances and patterns. If it is too large of a neural network, the agent will most likely learn nothing simply because the backpropagation calculations are not reaching the entire network.
As for the learning rates of the actor and critic Neural Networks, who says they have to be the same? Remember that the critic (the teacher) should learn at the same rate or faster than the actor (the student), simply because the teacher should know more than the student.
With all things, experimentation is your best friend.
I suggest considering A2C with PPO as an annealing algorithm. Initially, it moves around fast and feels like learning does not occur. The more data it can sample from, the more it can batch from, and the quicker it learns. After a while, it will start to pick up on some patterns and make more aggressive actions as probabilities of state-action pairs converge to a single action choice. Eventually, the algorithm will cool down to a policy that will always stay the same as long as nothing changes in future updates. All this, of course, is dependent on the environment.
Lastly, how you save and use experiences is entirely up to you. Based on my experience with LunarLander and CartPole, following books on collecting experiences is excellent, but considering alternatives is also ok.
Learning Materials:
Thank you for taking the time to read this guide on Advantage Actor-Critic with Proximal Policy Optimization.
The next article goes over the pseudocode for A2C w/ PPO.