Before starting, it’s advisable to first complete David Silver’s course on RL and read Lilian’s notes on RL, which walk through the material from David’s course in a sequential manner.
In simple problems, we start with an arbitrary value function and update it incrementally using algorithms such as Monte Carlo (which collects the reward over the whole episode) and temporal difference, aka TD(0) (which bootstraps, i.e. uses only the immediate reward and approximates the remaining rewards with the value function, $r + \gamma V(s_{t+1})$), among others. But maintaining a table is not feasible for real-world problems, so we move toward approximating these value functions with a vector of features or, more concretely, with neural networks. Even approximating state and action value functions can be infeasible, as explained in the next section.
Why we choose policy approximation instead of state/action value approximation.
In real-world tasks, the action space is often not discrete (up, down, right, left) but continuous; for example, a robot can perform infinitely many actions. A Deep Q-Network works by taking the maximum over actions and minimizing the error for that action. BUT, evaluating and maximizing over every possible action becomes an expensive task.
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma \max_{a \in \mathcal{A}} Q(S_{t+1}, a) - Q(S_t, A_t))$$ So we move on to approximating our policy instead: the approximated policy outputs a probability distribution over the available actions, allowing the agent to sample actions rather than search for the best one.
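For intuition, here is a minimal PyTorch sketch (not from the original text) of a policy that outputs a distribution over a continuous action space: the agent samples an action and keeps its log-probability instead of taking a max over discrete Q-values. The network sizes and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Outputs a distribution over continuous actions instead of one Q-value per action."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # state-independent std

    def forward(self, state: torch.Tensor):
        h = self.body(state)
        return torch.distributions.Normal(self.mean_head(h), self.log_std.exp())

policy = GaussianPolicy(state_dim=8, action_dim=2)
state = torch.randn(1, 8)
dist = policy(state)
action = dist.sample()                    # sample an action, no max over actions needed
log_prob = dist.log_prob(action).sum(-1)  # log pi(a|s), used later in the policy gradient
```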
Policy Gradient Algorithms
Proof of policy gradient theorem.
First we collect a trajectory $\tau$, which is a sequence of states and actions. $\tau = s_0, a_0, s_1, a_1, \ldots, s_{H-1}, a_{H-1}, s_H$
The reward for this trajectory is simply the sum of the rewards at each timestep. $$ R(\tau) = \sum_{t=0}^{H-1} R(s_t, a_t) $$ Our overall objective is to construct a policy that gives us the highest reward for our trajectory. We construct an objective $J(\theta)$, parameterized by the policy parameters $\theta$, that is the expected total reward, and we want to maximize this expectation. $$ J(\theta) = E[R(\tau); \theta] \tag{1} $$
$$ \max_\theta J(\theta) = \max_\theta E\left[\sum_{t=0}^{H-1} R(s_t, a_t)\right] $$
Expanding this objective function, it becomes $$ J(\theta) = \sum_\tau P(\tau) \cdot R(\tau) $$ where $P(\tau)$ is the probability of trajectory $\tau$, and $R(\tau)$ is the total reward for that trajectory. Let’s take the derivative of $J(\theta)$.
$$ \nabla_\theta J(\theta) = \sum_\tau \frac{P(\tau)}{P(\tau)} \nabla_\theta P(\tau) \sum_{t=0}^{H-1} R(s_t, a_t) $$ $$ \nabla_\theta J(\theta) = \sum_\tau P(\tau) \cdot \nabla_\theta \log P(\tau) \cdot \sum_{t=0}^{H-1} R(s_t, a_t) \quad \text{(using } \nabla_\theta \log P(\tau) = \frac{1}{P(\tau)} \nabla_\theta P(\tau) \text{)} $$ Representing it back in expectation form, we get $$ \nabla_\theta J(\theta) = E\left[ \nabla_\theta \log P(\tau) \cdot \sum_{t=0}^{H-1} R(s_t, a_t) \right] \tag{2} $$
Now, what is $P(\tau)$? It’s the joint probability of that trajectory, given by the chain rule of probability: $$ P(\tau) = P(s_0) \cdot P(a_0|s_0) \cdot P(s_1|a_0, s_0) \cdot P(a_1|s_1) \cdot P(s_2|a_1, s_1) \cdots $$ In the equation above, we assume the Markov property: the next action depends only on the current state, $$ P(a_t|s_t, s_{t-1}) = P(a_t|s_t) $$
and the next state depends only on the current state and action, $$ P(s_{t+1}|a_t, s_t, a_{t-1}, s_{t-1}) = P(s_{t+1}|a_t, s_t) $$ It boils down to the equation below, $$ P(\tau) = P(s_0) \prod_{t=0}^{H-1} P(a_t|s_t) \cdot P(s_{t+1}|a_t, s_t) $$ and then to $$ P(\tau) = P(s_0) \prod_{t=0}^{H-1} \pi(a_t|s_t) \cdot P(s_{t+1}|a_t, s_t) $$ because the action distribution $\pi(a_t|s_t)$ is governed by the policy parameters $\theta$, whereas the transition probability $P(s_{t+1}|a_t, s_t)$ is governed by the environment. Let’s take the log of $P(\tau)$.
$$ \log P(\tau) = \log P(s_0) + \sum_{t=0}^{H-1} \log \pi(a_t|s_t) + \sum_{t=0}^{H-1}\log P(s_{t+1}|a_t, s_t) $$ Taking the derivative of $\log P(\tau)$, we end up with the equation below, because the other terms don’t depend on our policy parameters $\theta$: $$ \nabla_\theta \log P(\tau) = \sum_{t=0}^{H-1} \nabla_\theta \log \pi(a_t|s_t) $$ Substituting this into equation 2 and approximating the expectation with $M$ sampled trajectories, we get $$ \nabla_\theta J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) \cdot \left( \sum_{k=0}^{H-1} R(s_k^m, a_k^m) \right) \right) \tag{3} $$
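A tiny sketch (my own, under the assumption that per-step log-probs with gradients and rewards have already been collected) of how equation (3) usually becomes code: we build a surrogate loss whose autograd gradient equals the estimator, with a minus sign because optimizers minimize.

```python
import torch

def reinforce_loss(log_probs, rewards):
    """
    log_probs: list of M tensors, each (H,), holding log pi(a_t|s_t) for one trajectory
    rewards:   list of M tensors, each (H,), holding R(s_t, a_t)
    Returns a scalar whose gradient matches minus the estimator in equation (3).
    """
    losses = []
    for lp, r in zip(log_probs, rewards):
        traj_return = r.sum()                      # total trajectory reward R(tau)
        losses.append(-(lp.sum() * traj_return))   # minus sign: we do gradient ascent on J
    return torch.stack(losses).mean()              # average over the M trajectories
```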
Reducing Variance
BUT, there’s a problem with equation (3): it has high variance. In a stochastic environment $R$ can sometimes be high and sometimes low, and since $R$ acts as a magnitude term in equation (3), this makes the gradient estimate and the update unstable. To stabilize it, we subtract a baseline from the return in equation (3).
$$ \nabla_\theta J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) \cdot \left( \sum_{k=0}^{H-1} R(s_k^m, a_k^m) - b\right) \right) $$ Subtracting $b$ from the trajectory return does not introduce bias (there is a proof of that which I’m not going to cover in this section), while it helps by decreasing the variance. If we can keep the estimator unbiased and the variance low, it’s a win-win, and that is what we do here.
What should the b value be?
It would be better if the baseline were the average return we expect from state $s_t$, which is simply our value function $V^\pi(s_t)$.
$$
\nabla_\theta J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) \cdot \left( \sum_{k=0}^{H-1} R(s_k^m, a_k^m) - V^\pi(s_t^m)\right) \right) \tag{4}
$$
Equation (4) tells us we are increasing the log-probs of actions that give us a return better than the average.
There is still one more way to reduce the variance of equation (4): we can break it down into the equations below. $$ \nabla_\theta J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) \cdot \left( \sum_{k=0}^{t-1}R(s_k^m, a_k^m) + \sum_{k=t}^{H-1} R(s_k^m, a_k^m) - V^\pi(s_t^m)\right) \right) \tag{5} $$ $$ \nabla_\theta J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \left( \sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) \cdot \left( \sum_{k=t}^{H-1} R(s_k^m, a_k^m) - V^\pi(s_t^m)\right) \right) \tag{6} $$
We can remove the term $\sum_{k=0}^{t-1}R(s_k^m, a_k^m)$, because why would a reward received at time step 4 affect the log-prob of an action taken at time step 7? We only care about the FUTURE rewards, not the past rewards; we are essentially deleting the past rewards from the sum. For a concrete understanding, let’s say we have H = 3, M = 1:
t=0: $\nabla_\theta \log\pi(a_0|s_0) = 0.5$, $R(s_0,a_0) = 10$
t=1: $\nabla_\theta \log\pi(a_1|s_1) = 0.4$, $R(s_1,a_1) = 4$
t=2: $\nabla_\theta \log\pi(a_2|s_2) = 0.2$, $R(s_2,a_2) = 7$
So,
$\sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) = (0.5 + 0.4 + 0.2)$
$\sum_{k=0}^{H-1} R(s_k^m, a_k^m) = (10 + 4 + 7)$
Let’s plug those values into the return term of equation (4) (ignoring the baseline for clarity).
$(0.5 + 0.4 + 0.2) \cdot (10 + 4 + 7) = (0.5*(10 + 4 + 7) + 0.4* (10 + 4 + 7) + 0.2 * (10 + 4 + 7))$
as you can see, we are including rewards from past timesteps for later actions too (i.e. 0.4 is the log-prob for timestep 1 and it is multiplied by (10 + 4 + 7), where 10 is the reward from the past timestep 0, which we don’t want), so if we use equation (6) instead, we end up with something like this:
$(0.5*(10 + 4 + 7) + 0.4* (4 + 7) + 0.2 * (7))$
We only consider the future rewards, not the past ones.
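A quick plain-Python sketch (mine, not the author’s) of the reward-to-go trick used above: one backward pass gives $\sum_{k=t}^{H-1} r_k$ for every $t$.

```python
def rewards_to_go(rewards):
    """Compute sum_{k=t}^{H-1} r_k for every t with a single backward pass."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out

# The toy example above: H = 3, rewards 10, 4, 7
print(rewards_to_go([10, 4, 7]))  # [21, 11, 7] -> matches (10+4+7), (4+7), (7)
```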
How to approximate value function?
We can approximate the value function $V^\pi$ as taught in this lecture: we simply sample trajectories and then minimize the squared error, i.e.
let’s sample $M$ trajectories, $\tau_1, \tau_2, … \tau_M$ and then minimize this squared error
$$ \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \left( V^\pi(s_t^m) - \sum_{k=t}^{H-1} R(s_k^m, a_k^m) \right)^2 $$ i.e. we regress $V^\pi(s_t)$ onto the observed reward-to-go.
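A minimal PyTorch sketch of this regression (mine; the state dimension, network sizes and learning rate are arbitrary assumptions):

```python
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 1))  # V^pi(s)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value_function(states, returns, epochs: int = 5):
    """
    states:  tensor (N, 8) -- all s_t^m collected from the M trajectories
    returns: tensor (N,)   -- the observed reward-to-go sum_{k=t}^{H-1} R(s_k^m, a_k^m)
    Minimizes the squared error between V^pi(s_t) and the Monte Carlo return.
    """
    for _ in range(epochs):
        optimizer.zero_grad()
        pred = value_net(states).squeeze(-1)
        loss = ((pred - returns) ** 2).mean()
        loss.backward()
        optimizer.step()
```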
There is one more thing we can do: we can approximate the return term $\sum_{k=t}^{H-1} R(s_k^m, a_k^m)$ with a Q-value function, which further reduces the variance. Since $Q^\pi= E[r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots]$, the discount factor $\gamma$ itself dampens the contribution of distant rewards, and approximating this $Q^\pi$ with the help of the value function (bootstrapping) reduces the variance even further.
$$ \nabla_\theta J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) \cdot \left( Q^\pi (s_t^m, a_t^m) - V^\pi(s_t^m)\right) \tag{7} $$ $$ \nabla_\theta J(\theta) = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=0}^{H-1} \nabla_\theta \log \pi( a_t^m |s_t^m) \cdot A^\pi(s_t^m, a_t^m) \tag{8} $$
We can approximate the advantage $A^\pi(s,a)$ using bootstrapping in different ways, such as $$ A^\pi(s_t,a_t)= r_t + \gamma V^\pi (s_{t+1}) - V^\pi(s_t) $$
Or in a more general form, called Generalized Advantage Estimation (GAE):
$$ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$
$$ A_t(s_t, a_t) = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t + l} $$
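Here is a small sketch of GAE in plain Python (mine, using the standard recursive form of the sum above; the default $\gamma$ and $\lambda$ values are just common choices, not from the text):

```python
def compute_gae(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """
    rewards: list of r_t, length H
    values:  list of V(s_t), length H+1 (the extra V(s_H) is the bootstrap value)
    Returns advantages A_t = sum_l (gamma*lambda)^l * delta_{t+l}.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                          # recursive form of the GAE sum
        advantages[t] = gae
    return advantages
```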
General RL setting
$$ J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right],\qquad{(1)} $$ In the RL setting we aim to optimize the objective $J(\pi)$ by updating the policy $\pi$, given a reward function $r(s,a)$ that takes in a state and the action performed in that state. The next action is sampled from $\pi(a|s)$. We take the expected (average) reward over all trajectories $\tau$. $\gamma$ is the discount factor, between 0 and 1, which balances the desirability of near- vs. future rewards.
An example: if our RL setup is navigating a maze, the reward function could be $r(s,a)=-\text{distance to goal}$.
RLHF
RLHF simplifies equation 1 by removing the discount factor: $$ J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} r(s_t, a_t) \right],\qquad{(2)} $$ We aim to maximize the objective function by optimizing the policy. The reward function is designed so that the chosen actions align with human preferences.
The most common reward model predicts the probability that a piece of text is the “preferred” one in the training comparisons.
Reward Models
Given two completions $y_1$ and $y_2$, we want a reward model that gives a high score to $y_1$ and a low score to $y_2$ when $y_1$ is preferred over $y_2$. Their relative preference is given by the Bradley-Terry model.
$$ P(i > j) = \frac{p_i}{p_i + p_j}\qquad{(3)} $$ It gives the probability of $i$ being preferred over $j$, where $p_i$ and $p_j$ represent the “strengths” of those completions.
We want our model to maximize this probability because $i$ will represent the text we want (aligned) and $j$ the text we don’t want (not aligned). The training objective can be derived from the equation above. The “strengths” are exponentiated because we want them to be strictly positive.
$$ P(y_1 > y_2) = \frac{\exp(r(y_1))}{\exp(r(y_1)) + \exp(r(y_2))} = \sigma\big(r(y_1) - r(y_2)\big)\qquad{(4)} $$ Taking the negative log-likelihood of this probability, the loss function becomes
$$ \mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)\qquad{(6)} $$ Our existing model can be configured to output a single scalar by adding a linear head to it.
In the case of language models, we want a model that rates our answer (i.e. gives a score based on how good or bad it is). This is the most important part because it guides our training: it provides the supervision signal for our PPO algorithm.
Steps for training a reward model
Collect pairs of data: either contrasting pairs or pairs of responses where one is preferred over the other. E.g. we want our LLM to be trained (later using PPO) to generate positive responses. In this case our preferred response would be a positive one and the non-preferred response a negative one.
We take a language model and add a linear head to it. For instance, if the LM outputs a 512-dimensional vector for each token, we add a new head that takes in 512 dimensions and outputs a single scalar, which gives us the reward.
The loss function for the reward model is constructed likewise; it is the same as equation 6.
$$ L = -\log(\sigma(r_1 - r_2)) $$ where $r_1$ is the reward for the preferred response and $r_2$ is the reward for the non-preferred response. We want to maximize the difference $r_1 - r_2$, which minimizes this loss. $\sigma$ is the sigmoid function.
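A minimal sketch of these steps using Hugging Face transformers (my own; the gpt2 backbone and right-padding are assumptions, not what any particular paper used):

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    """A language model backbone with a scalar head that scores the final token."""
    def __init__(self, name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)
        self.score_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        last_idx = attention_mask.sum(dim=1) - 1                      # last non-pad token (right padding)
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]  # (B, hidden_size)
        return self.score_head(last_hidden).squeeze(-1)               # one reward per sequence

def bradley_terry_loss(r_chosen, r_rejected):
    # L = -log sigmoid(r1 - r2), i.e. equation (6)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
```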
NOTES
- we calculate the reward at the ending token, which represents the reward for the whole sequence.
- reward models overfit fast, so we can consider a smaller LM for reward model training.
——-Skipping other reward models for now——-
Regularization
We want aligned models that still do not “go off the rails”, meaning they stay within the limits of our reference model. The penalty below keeps the policy being trained close to the reference policy.
$$ r = r_\theta - \lambda r_{\text{reg.}} \qquad{(1)} $$ $$ r = r_\theta - \lambda_{\text{KL}} \mathcal{D}_{\text{KL}} \left( \pi^{\text{RL}}(y \mid x) \,\|\, \pi^{\text{Ref.}}(y \mid x) \right) \qquad{(2)} $$
KL divergence is calculated as
$$ D_{\text{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P} \left[ \log P(x) - \log Q(x) \right]. \qquad{(3)} $$
Rejection Sampling
This process is essentially a filtering step. We first sample outputs from the base language model we are about to train; for example, we generate 5 outputs per prompt. We then score each generated output with our reward model, sort them, keep only the top-K per prompt, and fine-tune the base model on those examples. So it’s a filter that keeps only the highly rewarded samples.
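A short sketch of that filter (mine; `generate` and `reward_fn` are placeholder callables standing in for the base model’s sampler and the trained reward model):

```python
def rejection_sampling_dataset(prompts, generate, reward_fn, n_samples: int = 5, top_k: int = 1):
    """
    prompts:   list of prompt strings
    generate:  fn(prompt, n) -> list of n sampled completions from the base model
    reward_fn: fn(prompt, completion) -> float score from the reward model
    Keeps only the top_k highest-reward completions per prompt for fine-tuning.
    """
    dataset = []
    for prompt in prompts:
        completions = generate(prompt, n_samples)
        scored = sorted(completions, key=lambda c: reward_fn(prompt, c), reverse=True)
        dataset.extend((prompt, c) for c in scored[:top_k])
    return dataset  # fine-tune the base model on these (prompt, completion) pairs
```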
PPO
PPO algorithm analogy
Suppose we are taking an exam. Our objective is to maximize the chance of getting good marks by updating our brain (weights and biases).
$s_t$ is the question we are currently looking at and trying to solve, $a_t$ is our answer to that question, $r_t$ is the immediate reward we get after writing $a_t$, and $s_{t+1}$ is the next question.
$R_t$ is the actual total exam score.
Suppose there is an imaginary machine that gives us the expected exam score that we can get just by looking at our question which is $V(s)$.
$A(s)$ is our advantage i.e how good we are compared to the predicted score.
$A(s) = R_t - V(s)$
This can be rewritten as $\delta_t = R_t - V(s)$ and $A_t = \delta_t + \lambda\cdot\gamma\cdot A_{t+1}$. This is just a modified version of the advantage that trades off bias and variance.
This can be a little confusing: we might have no idea what our actual total score ($R_t$) will be while we are still answering question $s_t$, so we approximate $R_t$ with the help of the current reward $r_t$ and the future reward we expect from the next question, $V(s_{t+1})$, i.e.
this becomes
$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
so this will still give us our advantage at a point t.
i.e. how good/bad we did at point $t$ = immediate reward (marks) after writing the answer to question $t$ + expected future score from the next question onward, minus the expected future score estimated at question $t$.
This will give us our advantage.
PPO proceeds in two phases:
- Rollout phase
- Weight update phase.
Rollout phase
We write many exam papers in parallel in this phase. For each exam paper and each question in it we compute $r_t$, $V(s_t)$, $R_t$ and $A_t$, and use them in the objective below. $$ L^{PPO}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}\left(r_t(\theta), 1 - \epsilon,1+\epsilon\right) \hat{A}_t \right) - c_1 \left( V_\theta(s_t) -V_t^\text{target} \right)^2 + c_2 \mathcal{H}\left[\pi_\theta\right](s_t) \right] $$ $$ r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} $$
Weight update phase
We compute this loss and update the weights: we maximize the clipped surrogate and the entropy bonus while minimizing the value function loss.
The PPO clipped surrogate objective is given as:
$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min \big( r_t(\theta) \hat{A}_t, \; \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t \big) \Big] $$
The gradient ascent update rule is:
$$ \theta \gets \theta + \alpha \nabla_\theta L^{\text{CLIP}}(\theta) $$ The most important part to understand here is
$r_t(\theta) \hat{A}_t$
so when the ratio gets clipped to either $1 + \varepsilon$ or $1-\varepsilon$, the clipped term is constant in $\theta$, its gradient is 0, and there is no update to the weights. When the term is not clipped, the gradient is $\hat{A}_t \nabla_{\theta}r_t(\theta)$ and the update depends on whether $\hat{A}_t > 0$ or $\hat{A}_t < 0$:
If $\hat{A}_t > 0$, the gradient update will be in the direction that increases $\pi_{\theta}(a_t|s_t)$. If $\hat{A}_t < 0$, it will be in the direction that decreases $\pi_{\theta}(a_t|s_t)$.
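A minimal PyTorch sketch of the clipped surrogate (mine), written as a loss to minimize so that minimizing it performs gradient ascent on $L^{\text{CLIP}}$:

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps: float = 0.2):
    """
    log_probs, old_log_probs, advantages: tensors of shape (T,), one entry per (s_t, a_t).
    Returns the negative clipped surrogate objective.
    """
    ratio = torch.exp(log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```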
Integrating reward model to PPO
As you might have noticed, the PPO loss operates at the token level, meaning we need logprobs, value, reward, return and advantage for each token. Buuuuut, the reward model we just trained outputs only one reward, for the last token, so how does that work?
Answer: the reward is propagated.
The advantage is calculated reverse-recursively, i.e. the advantage at token $n$ is also passed to token $n-1$. This means we are looking ahead and telling token $n-1$ that we already end up in a good state because of the path we take from here. Let’s look at this through the lens of the advantage formula.
$\delta_t = R_t - V(s)$ and $A_t = \delta_t + \lambda\cdot\gamma\cdot A_{t+1}$
Let’s consider we are at the 100th token, which is the ending token of our sequence.
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$ The reward model assigns $r_{100} = 100$; let’s ignore both value terms, assuming they roughly cancel since we are already at the end.
$$ \begin{gather} \delta_{100} = 100\\A_{100}=100, \text{ considering } A_{101}=0 \end{gather} $$ At the 100th token we are at an advantage; now let’s calculate $A_{99}$.
$$ A_{99} = \delta_{99} + \lambda\cdot\gamma\cdot A_{100} $$ As you can see, the reward of 100 is propagated to the 99th token. Assuming $\delta_{99}$ is positive here, token 99 is still at an advantage because we can already see the future: the 100th token is at an advantage, so taking action 99 is still advantageous for us.
Reward Hacking
Our models can find loopholes to maximize their return by generating high-reward tokens but not-so-good answers. Example: if we are trying to make our model positive, the model may find a way to output tokens such as “thank you” and append them to its answer, which yields a high reward but is meaningless to us. So we don’t want our new model (trained via PPO) to deviate significantly from the model we started from (the SFT model), so we add a KL divergence penalty to each token’s reward.
As described earlier, the reward-model score is given only to the ending token, and the reward for the other tokens becomes this KL penalty.
i.e if we have token_1, token_2, and token_3
r(token_3) = the score from the reward model
r(token_2) = the KL penalty, i.e. (logprobs_for_token_2_from_model_being_trained - logprobs_for_token_2_from_SFT_model) * -KL_penalty_coefficient
and so on…
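A small sketch of how those per-token rewards might be assembled (mine; in many implementations the final token also keeps its own KL term in addition to the reward-model score, which is the assumption made here):

```python
import torch

def token_rewards(policy_logprobs, sft_logprobs, final_reward, kl_coef: float = 0.1):
    """
    policy_logprobs, sft_logprobs: per-token log-probs (shape (T,)) under the policy being
    trained and the frozen SFT model; final_reward: scalar score from the reward model.
    Rewards are data for PPO, so everything is detached.
    """
    kl = policy_logprobs.detach() - sft_logprobs.detach()  # per-token KL estimate
    rewards = -kl_coef * kl                                # every token pays the KL penalty
    rewards[-1] = rewards[-1] + final_reward               # only the last token gets the RM score
    return rewards
```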
Some key points
- The human preferences used to train LLMs are multi-dimensional, but the reward is a single scalar. LLMs, being complex and “intelligent”, will always find a way to exploit such a reward (hence reward hacking), so in large-scale RLHF reward models saturate very fast, and we might need to train a new reward model.
GRPO
The per-token loss for GRPO is given below, after the key differences from PPO.
Key difference between GRPO and PPO.
GRPO completely removes the value function. Since the value function is removed, the advantage calculation is simplified:
$$ A_i = \frac{r_i - \text{mean}({r_1, r_2, \cdots, r_G})}{\text{std}({r_1,r_2, \cdots,r_G})} $$
For each question/prompt, G different samples are generated, and the reward of each sample is normalized within the group to form its advantage.
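A one-liner sketch of that group normalization (mine; the small epsilon guarding against a zero standard deviation is my own addition):

```python
import torch

def grpo_advantages(rewards):
    """
    rewards: tensor (G,) -- one scalar reward per sampled completion for the same prompt.
    Returns A_i = (r_i - mean) / std, broadcast to every token of completion i during training.
    """
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # eps guards against zero std
```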
Previously, in PPO we added KL penalty to the rewards themselves, but in GRPO we add it to the loss function directly.
$$ J(\theta) = \frac{1}{G}\sum_{i=1}^G \frac{1}{|a_i|} \sum_{t=1}^{|a_i|} \left( \min\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}A_{i,t}, \text{clip} \left( \frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}, 1-\varepsilon, 1+\varepsilon \right) A_{i,t} \right) - \beta D_{KL}(\pi_\theta(\cdot|s_{i,t})||\pi_{ref}(\cdot|s_{i,t})) \right) $$
Observations from DAPO paper
Decoupled Clipping parameter
Using the same $\epsilon$ on both sides of the clip does not favor exploration tokens, i.e. tokens with low probability (e.g. 0.02). After some steps, the entropy of the model starts to dampen, i.e. the model becomes less explorative. To restrict this entropy collapse, the DAPO paper recommends using a larger $\epsilon$ for the upper bound $(1 + \epsilon_{\text{high}})$ while keeping $\epsilon_{\text{low}}=0.2$ for the lower bound $(1-\epsilon_{\text{low}})$.
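A sketch of the decoupled clip (mine; the value 0.28 for the upper range is only an illustrative choice in the spirit of the paper):

```python
import torch

def dapo_clip_loss(log_probs, old_log_probs, advantages,
                   eps_low: float = 0.2, eps_high: float = 0.28):
    """Same as the PPO surrogate, but with a larger upper clip range (eps_high > eps_low)
    so low-probability 'exploration' tokens can still receive a meaningful update."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantages
    return -torch.min(ratio * advantages, clipped).mean()
```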
Increase sample output variations
When all the samples in a group G have the same accuracy (e.g. all 1), the advantage becomes 0 and the gradient becomes 0 as well. Also, the number of prompts whose samples all reach accuracy 1 increases as we train further, which decreases the number of effective prompts in the batch and can lead to larger variance in the gradients: with few effective prompts per batch, the direction of the gradient update is less accurate and the updates are noisy.
The proposed fix is to oversample and filter out prompts whose samples all have accuracy 0 or 1.
Per token loss
In GRPO, the loss is first averaged over the tokens of each sample and then averaged over the G samples. In doing so, samples with very long and very short sequences contribute equally: each sample contributes one averaged loss, so tokens in a long sequence get less weight, which is a problem if an important token sits inside a long sequence.
DAPO solves this by first summing the loss over all tokens of all samples and then averaging over the total number of tokens.
$$ \mathcal{J}_{\text{DAPO}}(\theta) = \mathbb{E}_{(q,a) \sim \mathcal{D},\ \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(\cdot|q)} \left[ \frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \min \left( r_{i,t}(\theta) \hat{A}_{i,t}, \; \text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_{\text{low}}, 1 + \epsilon_{\text{high}} \right) \hat{A}_{i,t} \right) \right] $$
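A small sketch contrasting the two aggregations (mine; `token_losses` is a hypothetical list of per-token loss tensors, one per sample):

```python
import torch

def grpo_aggregate(token_losses):
    """GRPO: mean over tokens within each sample, then mean over the G samples."""
    return torch.stack([t.mean() for t in token_losses]).mean()

def dapo_aggregate(token_losses):
    """DAPO: sum every token loss across all samples, then divide by the total token count,
    so long sequences contribute proportionally more tokens to the gradient."""
    total = sum(t.sum() for t in token_losses)
    n_tokens = sum(t.numel() for t in token_losses)
    return total / n_tokens
```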
Overlong Reward Shaping
Instead of adding a penalty for overly long outputs, just mask out the truncated samples (exclude them from the loss). This keeps training stable and enhances performance. Why do this? A length penalty may confuse the model when the long reasoning is actually a valid/good reasoning trace.
Multi-turn RL
The LLM and the user interact over multiple turns.
Problems
(BENCHMARK) There is no benchmark for multi-turn interactions that is also reasoning-intensive; existing ones focus on narrow domains and require a lot of engineering overhead.
(RL ALGORITHMS) PPO and DPO often suffer from high variance when the horizon gets longer, resulting in poor performance.
veRL algorithm
I keep forgetting what the variables we provide in the .sh file mean while training, so I wrote this boilerplate to remember how these variables play out in the veRL implementation.
rollout code: https://github.com/volcengine/verl/blob/332c7d53c131869690578c0ab84f1e016ed4d817/verl/trainer/ppo/ray_trainer.py#L1081 update code: https://github.com/volcengine/verl/blob/332c7d53c131869690578c0ab84f1e016ed4d817/recipe/sppo/dp_actor.py#L60
# Hyperparameters:
# - train_batch_size = Total experiences per PPO iteration (rollout buffer size)
# - ppo_mini_batch_size = Batch size for each gradient step (subdivided from train_batch_size)
# - ppo_micro_batch_size_per_gpu = Per-GPU batch size (for multi-GPU training)
# - ppo_epochs = Number of passes over the rollout data
for epoch in range(total_epochs):
    train_dataloader = complete_train_data.split(train_batch_size)
    for batch in train_dataloader:  # each batch has size train_batch_size
        generate_rollout()          # if GRPO, actor.rollout.n responses per prompt
        generate_old_logprobs()
        generate_ref_logprobs()
        calculate_advantages()
        # split batch into mini-batches
        minibatch_dataloader = batch.split(ppo_mini_batch_size)  # each minibatch has size ppo_mini_batch_size
        for _ in range(ppo_epochs):
            for minibatch in minibatch_dataloader:
                # split minibatch into micro-batches if needed to train on different GPUs
                micro_batches = minibatch.split(ppo_micro_batch_size_per_gpu)
                gradient_accumulation = ppo_mini_batch_size // ppo_micro_batch_size_per_gpu
                for data in micro_batches:
                    generate_logprobs()
                    loss = calculate_ppo_loss() / gradient_accumulation
                    loss.backward()
                optimizer.step()
The gradient_accumulation factor is not used in the sense we usually use it while pretraining; it simply counts the number of micro-batches processed (possibly on separate GPUs). By dividing the loss by gradient_accumulation we obtain the same loss as if the minibatch had been processed directly, without any micro-batch splits.
In this code, total_training_steps counts the number of rollouts we do, while the gradient update steps count the number of gradient updates; keep in mind these are two different things.
Use dynamic batch size to prevent OOM. Here’s a more complete version:
# ---------------------------------
# 0. CONSTANTS (from config)
# ---------------------------------
TOTAL_EPOCHS = config.trainer.total_epochs
TRAIN_BATCH_SIZE = config.data.train_batch_size # raw data loader batch
GEN_BATCH_SIZE = config.data.get("gen_batch_size", TRAIN_BATCH_SIZE)
ROLLOUT_N = config.actor_rollout_ref.rollout.n # GRPO: responses per prompt
PPO_MINI_BATCH_SIZE = config.actor_rollout_ref.actor.ppo_mini_batch_size
PPO_MICRO_BATCH_SIZE_PER_GPU = config.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu
PPO_EPOCHS = config.actor_rollout_ref.actor.ppo_epochs
GRAD_ACCUM_STEPS = PPO_MINI_BATCH_SIZE // PPO_MICRO_BATCH_SIZE_PER_GPU # only when NOT dynamic-bsz
# ---------------------------------
# 1. OUTER LOOP
# ---------------------------------
for epoch in range(TOTAL_EPOCHS):
for raw_dict in train_dataloader: # yields batches of size GEN_BATCH_SIZE
batch = DataProto.from_single_dict(raw_dict)
# ---------------------------------
# 2. ROLLOUT PHASE
# ---------------------------------
gen_batch = batch.pop_generation_keys() # strip everything except what the policy needs
gen_batch = gen_batch.repeat(ROLLOUT_N) # GRPO: duplicate prompts ROLLOUT_N times
rollout_output = actor_rollout_wg.generate_sequences(gen_batch)
# optional: max-baseline generation for REMAX
if adv_estimator == REMAX:
baseline_output = actor_rollout_wg.generate_sequences(gen_batch, do_sample=False)
# ---------------------------------
# 3. BUILD FULL PPO BATCH
# ---------------------------------
ppo_batch = batch.repeat(ROLLOUT_N)             # align prompt copies with responses
ppo_batch = ppo_batch.union(rollout_output)     # concat prompts + responses
# 3a. FILL IN EXTRA FIELDS
ppo_batch.batch["response_mask"] = compute_response_mask(ppo_batch)
ppo_batch.batch["old_log_probs"] = actor_rollout_wg.compute_log_prob(ppo_batch).batch["old_log_probs"]
if use_reference_policy:
ppo_batch.batch["ref_log_prob"] = ref_policy_wg.compute_log_prob(ppo_batch).batch["ref_log_prob"]
if use_critic:
ppo_batch.batch["values"] = critic_wg.compute_values(ppo_batch).batch["values"]
rewards = reward_fn(ppo_batch) # rule or RM
ppo_batch.batch["token_level_scores"] = rewards
ppo_batch = compute_advantage(ppo_batch) # adds "advantages" & "returns"
# ---------------------------------
# 4. MINI-BATCH LOOP
# ---------------------------------
mini_batches = ppo_batch.split(PPO_MINI_BATCH_SIZE) # list of DataProto, each ≤ PPO_MINI_BATCH_SIZE
for _ in range(PPO_EPOCHS):
for mini_batch in mini_batches:
# 4a. MICRO-BATCH LOOP (gradient accumulation)
if use_dynamic_bsz:
micro_batches = prepare_dynamic_batch(mini_batch,
max_token_len=PPO_MAX_TOKEN_LEN_PER_GPU * sp_size)
else:
micro_batches = mini_batch.split(PPO_MICRO_BATCH_SIZE_PER_GPU)
actor_optimizer.zero_grad()
for micro_batch in micro_batches:
log_probs, entropy = actor_rollout_wg.forward_micro_batch(micro_batch)
loss = compute_ppo_loss(log_probs,
micro_batch.batch["old_log_probs"],
micro_batch.batch["advantages"],
micro_batch.batch["response_mask"])
loss = loss / GRAD_ACCUM_STEPS # scale for accumulation
loss.backward()
actor_optimizer.step() # update happens once per mini-batch
In https://github.com/ChenmienTan/RL2/blob/main/RL2/workers/actor.py, the algorithm is almost the same, but they define update_per_rollout instead of PPO_MINI_BATCH_SIZE; in veRL the updates per rollout are effectively TRAIN_BATCH_SIZE / PPO_MINI_BATCH_SIZE.