General RL setting

$$ J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right],\qquad{(1)} $$ In the RL setting we aim to optimize the objective function $J(\pi)$ by updating the policy $\pi$, given a reward function $r(s,a)$ that takes in a state and the action performed at that state. The next action is sampled from $\pi(a|s)$. We take the expected (average) discounted reward over all trajectories $\tau$. $\gamma$ is the discount factor, between 0 and 1, which balances the desirability of near-term vs. future rewards.

An example: if our RL setup is confined to navigating a maze, a reward function could be $r(s,a)=-\text{distance to goal}$.
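To make Equation 1 concrete, here is a minimal sketch (plain Python, made-up reward values) that computes the discounted return of a single sampled trajectory:

```python
# Minimal sketch: discounted return of one sampled trajectory.
# `rewards` is a hypothetical list of r(s_t, a_t) values collected along the trajectory.
def discounted_return(rewards, gamma=0.99):
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

# Maze example: reward = -distance to goal, which shrinks as we approach the goal.
print(discounted_return([-3.0, -2.0, -1.0, 0.0]))  # -3 - 0.99*2 - 0.99**2*1 + 0 = -5.9601
```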

RLHF

RLHF simplifies Equation 1 by dropping the discount factor: $$ J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} r(s_t, a_t) \right],\qquad{(2)} $$ We aim to maximize this objective by optimizing the policy. The reward function is designed so that highly rewarded actions align with human preferences.

The most common reward model predicts the probability that a piece of text would be the "preferred" one in the training comparisons.

Reward Models

Given a prompt and two completions $y_1$ and $y_2$, where $y_1$ is preferred over $y_2$, we want a reward model that gives a higher score to $y_1$ than to $y_2$. Their relative preference is modeled by the Bradley-Terry model.

$$ P(i > j) = \frac{p_i}{p_i + p_j}\qquad{(3)} $$ It gives the probability of $i$ being preferred over $j$, where $p_{i}$ and $p_{j}$ represent "strengths" for those completions.

We want our model to maximize this probability, because later $i$ will represent the text we want (aligned) and $j$ the text we don't want (not aligned). The training objective can be derived from the equation above. The "strengths" are parameterized as exponentials of the reward so that they are strictly positive.

$$ P(y_1 > y_2) = \frac{\exp(r(y_1))}{\exp(r(y_1)) + \exp(r(y_2))}\qquad{(4)} $$ The loss function becomes

$$ \mathcal{L}(\theta) = - \log \left( \sigma \left( r_{\theta}(x, y_w) - r_{\theta}(x, y_l) \right) \right)\qquad{(6)} $$ Our existing language model can be configured to output a single scalar by adding a linear head to it.
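To see how Equation 6 follows from Equation 4, divide the numerator and denominator by $\exp(r(y_1))$:

$$ P(y_1 > y_2) = \frac{\exp(r(y_1))}{\exp(r(y_1)) + \exp(r(y_2))} = \frac{1}{1 + \exp\left(-(r(y_1) - r(y_2))\right)} = \sigma\big(r(y_1) - r(y_2)\big) $$

so maximizing the log-likelihood of the preferred completion $y_w$ over the rejected one $y_l$ is the same as minimizing $-\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$.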

In the case of language models, we want a model that rates our answer, i.e. gives a score based on how good or bad it is. This is the most important part because it guides our training: it provides the supervision for our PPO algorithm.

Steps for training a reward model

Collect pairs of data: for each prompt, either contrasting completions or completions ranked by priority. E.g. suppose we want our LLM to be trained (later using PPO) to generate positive responses. In this case the priority example would be a positive response and the non-priority example a negative one.

We take a language model and add a linear head to it. For instance, if for each token the LM outputs a 512-dimensional vector, we add a new head that takes in that 512-dimensional vector and outputs a one-dimensional value, which gives us the reward.
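A minimal sketch of such a head in PyTorch, assuming a hidden size of 512 as in the example; the base LM that produces `hidden_states` is left out:

```python
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    """Maps the final token's hidden state to a single scalar reward."""
    def __init__(self, hidden_size=512):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, last_token_index):
        # hidden_states: (batch, seq_len, hidden_size) from the base language model
        # last_token_index: (batch,) index of the final (e.g. EOS) token per sequence
        batch_idx = torch.arange(hidden_states.size(0))
        last_hidden = hidden_states[batch_idx, last_token_index]  # (batch, hidden_size)
        return self.score(last_hidden).squeeze(-1)                # (batch,) scalar rewards
```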

The loss function for the reward model is constructed accordingly; it is the same as Equation 6.

$$ L = -\log(\sigma(r_1 - r_2)) $$ where $r_1$ is the reward for the priority completion and $r_2$ is the reward for the non-priority completion. We want to maximize the difference $r_1 - r_2$, which minimizes this loss. $\sigma$ is the sigmoid function.
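A minimal sketch of that loss, assuming `r_priority` and `r_non_priority` are batches of scalar rewards produced by the head above for the two completions of each prompt:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_priority, r_non_priority):
    # -log(sigmoid(r1 - r2)); logsigmoid is the numerically stable form
    return -F.logsigmoid(r_priority - r_non_priority).mean()

# Toy check: a larger gap r1 - r2 gives a smaller loss.
print(reward_model_loss(torch.tensor([2.0]), torch.tensor([0.0])))  # ~0.127
print(reward_model_loss(torch.tensor([0.5]), torch.tensor([0.0])))  # ~0.474
```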

NOTES

  • we calculate the reward at the ending token, which represents a reward given to the whole sequence.
  • reward models overfit fast, so we can consider a smaller LM for reward model training.

——-Skipping other reward models for now——-

Regularization

We want an aligned model that still does "not go off the rails", meaning it stays within the limits of our reference model. The regularization term keeps the policy being trained close to the reference policy.

$$ r = r_\theta - \lambda r_{\text{reg.}} \qquad{(1)} $$ $$ r = r_\theta - \lambda_{\text{KL}} \mathcal{D}_{\text{KL}} \left( \pi^{\text{RL}}(y \mid x) \,\|\, \pi^{\text{Ref.}}(y \mid x) \right) \qquad{(2)} $$

KL divergence is calculated as

$$ D_{\text{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P} \left[ \log P(x) - \log Q(x) \right]. \qquad{(3)} $$
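A minimal sketch of estimating that expectation from samples, where `logp_rl` and `logp_ref` are hypothetical log-probabilities that the trained policy and the reference policy assign to tokens sampled from the trained policy:

```python
import torch

def kl_estimate(logp_rl, logp_ref):
    # E_{x ~ P}[log P(x) - log Q(x)], averaged over tokens sampled from P (= pi_RL)
    return (logp_rl - logp_ref).mean()

logp_rl = torch.tensor([-1.2, -0.7, -2.1])   # made-up numbers
logp_ref = torch.tensor([-1.5, -0.9, -1.8])
print(kl_estimate(logp_rl, logp_ref))        # tensor(0.0667)
```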

Rejection Sampling

This is essentially a filtering process. We first sample outputs from the base language model that we will be training next; for example, we generate 5 candidate outputs per prompt. We then score each generated output with our reward model, sort them, and keep only the top-K per prompt, then fine-tune the base model on these examples. In other words, we filter and keep only the highly rewarded examples.
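A minimal sketch of this loop, assuming two hypothetical helpers: `generate(prompt, n)` samples n completions from the base model and `score(prompt, completion)` calls the reward model:

```python
def rejection_sample(prompts, generate, score, n_samples=5, top_k=1):
    """Keep only the highest-reward completions per prompt for fine-tuning."""
    kept = []
    for prompt in prompts:
        completions = generate(prompt, n_samples)
        ranked = sorted(completions, key=lambda c: score(prompt, c), reverse=True)
        kept.extend((prompt, c) for c in ranked[:top_k])
    return kept  # fine-tune the base model on these (prompt, completion) pairs
```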

PPO

PPO algorithm analogy

Suppose we are taking an exam. Our objective is to maximize the chance of getting good marks by updating our brain (weights and biases).

$s_t$ is the question that we are currently looking at and trying to solve, $a_t$ is our answer to that question, $r_t$ is the immediate reward we get after writing $a_t$, and $s_{t+1}$ is the next question.

$R_t$ is the actual total exam score.

Suppose there is an imaginary machine that tells us the expected exam score we can get just by looking at the current question; that is $V(s)$.

$A(s)$ is our advantage, i.e. how well we actually did compared to the predicted score.

$A(s) = R_t - V(s)$

This can be rewritten as $\delta_t = R_t - V(s)$ and $A_t = \delta_t + \lambda\cdot\gamma\cdot A_{t+1}$. This is just a modified version of the advantage (generalized advantage estimation) that trades off bias and variance.

This can be a little confusing: we have no idea what our actual total score ($R_t$) will be while we are still in the middle of writing answers. So we approximate $R_t$ with the help of the current reward $r_t$ and the expected future reward from the next question, $V(s_{t+1})$.

This becomes

$\delta_t = r_t + V(s_{t+1}) - V(s_t)$

So this still gives us our advantage at step $t$,

i.e. how well/badly we did at step $t$ = the immediate reward (marks) after answering question $t$, plus the expected future score starting from the next question, minus the expected score we had starting from question $t$.

Plugging this $\delta_t$ into the recursion $A_t = \delta_t + \lambda\cdot\gamma\cdot A_{t+1}$ gives us our advantage.
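Putting the two formulas together, here is a minimal sketch of the backward recursion, with hypothetical per-step rewards `r`, value estimates `v`, and the $\gamma$, $\lambda$ coefficients (the RLHF setting above effectively uses $\gamma = 1$):

```python
def compute_advantages(r, v, gamma=1.0, lam=0.95):
    # r: list of per-step rewards r_t; v: list of value estimates V(s_t), same length
    T = len(r)
    advantages = [0.0] * T
    next_advantage, next_value = 0.0, 0.0  # A_{t+1} and V(s_{t+1}) past the last step
    for t in reversed(range(T)):
        delta = r[t] + gamma * next_value - v[t]               # delta_t = r_t + V(s_{t+1}) - V(s_t)
        next_advantage = delta + gamma * lam * next_advantage  # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = next_advantage
        next_value = v[t]
    return advantages
```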

PPO is done in two phases:

  1. Rollout phase
  2. Weight update phase.

Rollout phase

We write many exam papers in parallel in this phase. For each exam paper, and for each question in it, we calculate $r_t$, $V(s_t)$, $R_t$, and $A_t$ and use them in our equation. $$ L^{PPO}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}\left(r_t(\theta), 1 - \epsilon, 1+\epsilon\right) \hat{A}_t \right) - c_1 \left( V_\theta(s_t) - V_t^\text{target} \right)^2 + c_2\, \mathcal{H}\left[\pi_\theta\right](s_t) \right] $$ $$ r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_\text{old}}(a_t|s_t)} $$
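A minimal sketch of that combined loss in PyTorch, assuming all inputs are per-token tensors gathered during the rollout (log-probabilities under the current and old policy, advantages, value predictions, value targets, and per-token policy entropy):

```python
import torch

def ppo_loss(logprobs, old_logprobs, advantages, values, value_targets, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    ratio = torch.exp(logprobs - old_logprobs)                       # r_t(theta)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_term = torch.min(ratio * advantages, clipped * advantages)
    value_term = (values - value_targets) ** 2                       # (V_theta(s_t) - V_t^target)^2
    objective = policy_term - c1 * value_term + c2 * entropy
    return -objective.mean()  # we maximize the objective, so we minimize its negative
```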

Weight update phase

We compute this loss and try to maximize the clipped surrogate term and the entropy term, while minimizing the value-function term.

The PPO clipped surrogate objective is given as:

$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \Big[ \min \big( r_t(\theta) \hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon) \hat{A}_t \big) \Big] $$

The gradient ascent update rule is:

$$ \theta \gets \theta + \alpha \nabla_\theta L^{\text{CLIP}}(\theta) $$ The most important part to understand here is

$r_t(\theta) \hat{A}_t$

So when the ratio gets clipped to either $1 + \varepsilon$ or $1-\varepsilon$, the gradient of this term, $\nabla_{\theta}\, r_t(\theta) \hat{A}_t$, is 0, so there is no update to the weights. But when the ratio is not clipped, the gradient is $\nabla_{\theta}\, r_t(\theta) \hat{A}_t$ and the update depends on whether $\hat{A}_t$ is $>0$ or $<0$:

If $\hat{A}_t > 0$, the gradient update will be in the direction that increases $\pi_{\theta}(a_t|s_t)$. If $\hat{A}_t < 0$, the gradient update will be in the direction that decreases $\pi_{\theta}(a_t|s_t)$.
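A tiny illustration of this with made-up numbers, treating the ratio as a differentiable scalar: with a positive advantage, a ratio already outside the clip range gets zero gradient, while a ratio inside the range still gets pushed up.

```python
import torch

eps, advantage = 0.2, torch.tensor(1.0)        # positive advantage, clip range [0.8, 1.2]

ratio = torch.tensor(1.5, requires_grad=True)  # already past 1 + eps
torch.min(ratio * advantage,
          torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).backward()
print(ratio.grad)  # tensor(0.) -> clipped, no weight update from this sample

ratio = torch.tensor(1.1, requires_grad=True)  # inside the clip range
torch.min(ratio * advantage,
          torch.clamp(ratio, 1 - eps, 1 + eps) * advantage).backward()
print(ratio.grad)  # tensor(1.) -> gradient pushes pi_theta(a_t|s_t) up
```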

Integrating reward model to PPO

As you might have noticed, the PPO loss operates at the token level, meaning we need logprobs, value, reward, return, and advantage for each token. But the reward model we just trained outputs only one reward, for the last token, so how does that work?

Answer: the reward is propagated.

The advantage is calculated recursively in reverse, i.e. the advantage at token $n$ is also passed back to token $n-1$. This means we are looking ahead and telling token $n-1$ that we already ended up in a good state from here on. Let's look at this through the lens of the advantage formula.

$\delta_t = R_t - V(s)$ and $A_t = \delta_t + \lambda\cdot\gamma\cdot A_{t+1}$

Let's say we are at the 100th token, which is the ending token of our sequence.

$$\delta_t = r_t + V(s_{t+1}) - V(s_t)$$ The reward model assigns $r_{100} = 100$, and let's ignore both value terms, assuming they cancel out since we are already at the end.

$$ \begin{gather} \delta_{100} = 100\\ A_{100}=100, \text{ considering } A_{101}=0 \end{gather} $$ At the 100th token we are at an advantage. Now let's calculate $A_{99}$.

$$ A_{99} = \delta_{99} + \lambda\cdot\gamma\cdot A_{100} $$ As you can see, the reward of 100 is propagated to the 99th token. Assuming $\delta_{99}$ is positive here, token 99 is still at an advantage because we can already see the future: the 100th token was at an advantage, so taking action 99 is still an advantage to us.
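A tiny numeric illustration of this backward propagation (made-up numbers: only the last of three tokens gets a reward, all value estimates are assumed to be 0):

```python
rewards = [0.0, 0.0, 100.0]    # hypothetical r_t for tokens 98, 99, 100
values = [0.0, 0.0, 0.0, 0.0]  # hypothetical V(s_t), with V past the end = 0
gamma, lam = 1.0, 0.95

advantage = 0.0                # A_{t+1} beyond the last token
for t in reversed(range(len(rewards))):
    delta = rewards[t] + gamma * values[t + 1] - values[t]
    advantage = delta + gamma * lam * advantage
    print(f"token {98 + t}: A = {advantage}")
# token 100: A = 100.0, token 99: A = 95.0, token 98: A = 90.25
```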

Reward Hacking

Our model can find a loophole to maximize its return by generating high-reward tokens but a not-so-good answer. Example: if we are trying to make our model positive, it may find a way to output tokens such as "thank you" and add them to its answer, which provides a high reward but is meaningless to us. So we don't want the new model (trained via PPO) to deviate significantly from the model we started from (the SFT model), and we add a KL divergence penalty to each token's reward.

As described earlier, the reward model's score is given only to the ending token, and the reward for every other token becomes this KL penalty.

i.e if we have token_1, token_2, and token_3

r(token_3) = some reward from the reward model
r(token_2) = KL penalty, i.e. (logprobs_for_token_2_from_model_being_trained - logprobs_for_token_2_from_SFT_model) * -KL_penalty_coefficient
and so on…
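A minimal sketch of this per-token reward construction. The log-probability tensors are hypothetical, and here the reward model's score is simply added on top of the KL term at the final token, which is a common implementation choice:

```python
import torch

def per_token_rewards(score, logprobs_policy, logprobs_sft, kl_coef=0.1):
    # logprobs_*: (seq_len,) log-probs each model assigned to the generated tokens
    kl = logprobs_policy - logprobs_sft  # per-token KL estimate
    rewards = -kl_coef * kl              # every token pays the KL penalty
    rewards[-1] = rewards[-1] + score    # reward model score lands on the final token
    return rewards
```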

Some key points

  • Human preferences used to train LLMs are multi-dimensional, but the reward is a single score. LLMs, being complex and "intelligent", will always find a way to exploit such rewards, hence reward hacking; so in large-scale RLHF, reward models get saturated very fast, and we might need to train a new reward model.

GRPO

The per-token loss for GRPO looks similar to PPO's.

Key differences between GRPO and PPO:

GRPO completely removes the value function. With the value function removed, the advantage calculation is simplified:

$$ A_i = \frac{r_i - \text{mean}({r_1, r_2, \cdots, r_G})}{\text{std}({r_1,r_2, \cdots,r_G})} $$

For each question/prompt, $G$ different samples are generated, and each sample's reward is normalized within the group to form its advantage.

Previously, in PPO, we added the KL penalty to the rewards themselves, but in GRPO we add it to the loss function directly.

$$ J(\theta) = \frac{1}{G}\sum_{i=1}^G \frac{1}{|a_i|} \sum_{t=1}^{|a_i|} \left( \min\left(\frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}A_{i,t}, \text{clip} \left( \frac{\pi_\theta(a_{i,t}|s_{i,t})}{\pi_{\theta_{old}}(a_{i,t}|s_{i,t})}, 1-\varepsilon, 1+\varepsilon \right) A_{i,t} \right) - \beta D_{KL}(\pi_\theta(\cdot|s_{i,t})||\pi_{ref}(\cdot|s_{i,t})) \right) $$
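A minimal sketch of both GRPO ingredients, assuming all $G$ completions have the same length $T$ (no padding/masking, for brevity): `group_rewards` is a $(G,)$ tensor of scalar rewards, and the three log-probability tensors (current policy, rollout policy, reference policy, all evaluated on the sampled tokens) are $(G, T)$. The per-token KL below uses a low-variance estimator, which is an implementation choice rather than something the equation above prescribes:

```python
import torch

def grpo_advantages(group_rewards):
    # Normalize each sample's reward within its group of G completions.
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

def grpo_loss(logprobs, old_logprobs, ref_logprobs, group_rewards,
              clip_eps=0.2, beta=0.04):
    adv = grpo_advantages(group_rewards).unsqueeze(-1)  # (G, 1), broadcast over tokens
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * adv, clipped * adv)
    # Per-token KL(pi_theta || pi_ref) estimate: exp(d) - d - 1, with d = ref - current.
    d = ref_logprobs - logprobs
    kl = torch.exp(d) - d - 1
    per_token = surrogate - beta * kl
    return -per_token.mean(dim=-1).mean()  # average over tokens, then over the group
```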
