Paper Summaries

21 Jan 2025 - cohlem

Papers that I’ve read with their respective notes.

LLaMA: Open and Efficient Foundation Language Models

SmolLM2

LR decay

p2 p3

SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

Backend-programming questions. Train set: question, reference answer, 10 sample tests. Generate trajectories from the train set; each sampled sequence looks like this:

question, agent's answer, human simulator's answer → agent's answer, human simulator's answer → ... → end. Run the final solution through the 10 sample tests and record the reward: 1 if all tests pass, else 0. Sample 15k such trajectories.
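The binary outcome reward above can be sketched like this (hypothetical names: the final solution is modeled as a callable and the sample tests as input/expected pairs):

```python
def trajectory_reward(final_solution, sample_tests):
    # Outcome reward for one trajectory: 1 if the final solution
    # passes all sample tests, 0 otherwise.
    return int(all(final_solution(inp) == expected for inp, expected in sample_tests))

# e.g. a toy "solution" that doubles its input, checked against two tests
trajectory_reward(lambda x: x * 2, [(1, 2), (3, 6)])  # -> 1
```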

Now train the advantage LLM using the Bradley-Terry loss (p4). $o_t^+$ is taken from the trajectories that had the higher reward.
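A minimal sketch of the Bradley-Terry objective, assuming the advantage LLM emits a scalar score per step (function and argument names are mine):

```python
import math

def bradley_terry_loss(score_pos, score_neg):
    # -log sigmoid(s(o+) - s(o-)): minimized when the advantage LLM scores
    # the step from the higher-reward trajectory above the lower-reward one.
    return -math.log(1.0 / (1.0 + math.exp(-(score_pos - score_neg))))
```

At equal scores the loss is log 2; it shrinks as the score gap in favor of $o_t^+$ grows.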

p5

Now train the policy using the DPO loss on those 15k sampled trajectories.

For each $o_t$, sample 16 actions $a_t$ and rate them with the advantage LLM; take the top 50% as $a^+$ and the rest as $a^-$, then compute the loss over those actions. $\log \pi (a^+|o_t)$ is the joint log-probability of all the tokens in $a^+$ (p6).
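The per-pair loss can be sketched as a standard DPO objective on sequence-level log-probs (helper names and the `beta` value are my assumptions, not taken from the paper):

```python
import math

def seq_logprob(token_logprobs):
    # joint log-probability of an action = sum of its per-token log-probs
    return sum(token_logprobs)

def dpo_loss(pi_pos, pi_neg, ref_pos, ref_neg, beta=0.1):
    # -log sigmoid(beta * [(log pi(a+|o) - log ref(a+|o))
    #                      - (log pi(a-|o) - log ref(a-|o))])
    margin = beta * ((pi_pos - ref_pos) - (pi_neg - ref_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a zero margin the loss is log 2; raising the policy's log-prob on $a^+$ relative to the reference lowers it.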

Notably, process-only filtering consistently yields the highest accuracy, suggesting that focusing on the procedural aspects of data refinement is more important than the correctness of a training trajectory.

Process filtering (filtering a trajectory based on whether its action at each step is plausible) yields better performance.

Filtering for correctness (filtering based on the final answer) usually harms performance.

\[ r_t = -\left| \log p_\theta(A^{\text{truth}}_t \mid \text{<think>}, A^{\text{truth}}_{<t}) - \log p_\theta(\hat{y}_t \mid \text{<think>}, \hat{y}_{<t}) \right| \]
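In code, the step reward just compares the two conditional log-probabilities (a toy sketch with made-up numbers; the function name is mine):

```python
def step_reward(logp_truth_step, logp_pred_step):
    # r_t = -|log p_theta(A_truth_t | <think>, A_truth_<t)
    #         - log p_theta(yhat_t | <think>, yhat_<t)|
    # maximal (zero) when the model is equally confident on both continuations
    return -abs(logp_truth_step - logp_pred_step)

step_reward(-1.5, -1.5)  # -> 0.0
step_reward(-0.5, -2.0)  # -> -1.5
```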

veRL algorithm, sketched just to show how the variables provided in the .sh file play out in the main training loop.

for epoch in range(total_epochs):
	for batch in train_dataloader: # each batch size is provided by train_batch_size
		generate_rollout() # if GRPO use actor.rollout.n variable
		generate_old_logprobs()
		generate_ref_logprobs()
		calculate_advantages()

		# split batch into mini_batches.
		minibatch_dataloader = batch.split(ppo_mini_batch_size) # this is a dataloader with each minibatch of size ppo_mini_batch_size
		for _ in range(ppo_epochs):
			for minibatch in minibatch_dataloader:
				#split minibatch into microbatches if needed to train on different GPUs
				micro_batches = minibatch.split(ppo_micro_batch_size_per_gpu)
				gradient_accumulation = ppo_mini_batch_size // ppo_micro_batch_size_per_gpu
				for data in micro_batches:
					generate_logprobs()
					loss = calculate_ppo_loss() / gradient_accumulation
					loss.backward()
				optimizer.step()
				optimizer.zero_grad()


The gradient_accumulation step is not used the way it usually is in pretraining (running successive backward passes to fit a large batch into memory); it just counts the total number of micro-batches processed across the separate GPUs. By dividing each micro-batch loss by gradient_accumulation, we obtain the same loss as if the mini-batch had been processed directly, without any micro-batch splits.
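A quick numeric check of that equivalence (made-up per-example losses; the split mirrors how ppo_mini_batch_size is chunked into ppo_micro_batch_size_per_gpu pieces):

```python
# Per-example losses for one mini-batch, split into micro-batches.
losses = [0.9, 1.1, 0.7, 1.3, 1.0, 0.6]
micro_size = 2
micros = [losses[i:i + micro_size] for i in range(0, len(losses), micro_size)]
gradient_accumulation = len(micros)  # ppo_mini_batch_size // ppo_micro_batch_size_per_gpu

# Each micro-batch contributes mean(micro) / gradient_accumulation;
# summed, this equals the mean loss over the whole mini-batch.
accumulated = sum(sum(m) / len(m) / gradient_accumulation for m in micros)
full_minibatch = sum(losses) / len(losses)
assert abs(accumulated - full_minibatch) < 1e-12
```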