CohleM

RLHF

Before starting, it’s advisable to first complete David Silver’s course on RL and read Lilian’s notes on RL, which follow David’s course in order. In simple problems, we start with an arbitrary value function and then update it incrementally, using algorithms such as Monte Carlo (which collects the reward over the whole episode), temporal difference, aka TD(0) (which bootstraps, i.e. uses only the immediate reward and approximates the remaining rewards with the value function of the next state, $r + \gamma V(s')$), and other algorithms....
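The two update rules mentioned above can be sketched in a few lines of tabular Python. This is a minimal illustration, not code from the post: the function names, the step size `alpha`, and the discount factor `gamma` (set to 1 by default) are assumptions introduced here.

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit Monte Carlo update.

    `episode` is a list of (state, reward) pairs, where `reward` is the
    reward received after leaving `state`. The full return G is only
    known once the episode has ended.
    """
    G = 0.0
    for state, reward in reversed(episode):
        G = reward + gamma * G              # return from this state onward
        V[state] += alpha * (G - V[state])  # move V(s) toward the observed return
    return V


def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0, done=False):
    """TD(0) update: bootstrap from the current estimate of the next state."""
    bootstrap = 0.0 if done else gamma * V[next_state]
    target = reward + bootstrap             # immediate reward + estimated remainder
    V[state] += alpha * (target - V[state])
    return V


# Hypothetical two-state example.
V_td = td0_update({"A": 0.0, "B": 0.0}, "A", reward=1.0, next_state="B", alpha=0.5)
V_mc = mc_update({"A": 0.0, "B": 0.0}, [("A", 0.0), ("B", 1.0)], alpha=1.0)
```

Monte Carlo has to wait for the episode to finish before it can update, while TD(0) can update after every single step.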

Flops calculation

Calculation of FLOPs. A multiply-accumulate costs 2 FLOPs: 1 for the multiplication and 1 for the accumulation (addition). If we multiply two matrices of sizes (a x b) and (b x c), each of the a x c output elements needs b multiply-add operations, i.e. 2 x b x (a x c) FLOPs in total. Embedding lookup: we initially have tokens in a (seq_len, vocab_size) one-hot representation and the embedding lookup matrix is (vocab_size, d_model), it will take...
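The matrix-multiplication rule above fits in a one-line helper. This is a small sketch; the function name `matmul_flops` and the example sizes are hypothetical, chosen only for illustration.

```python
def matmul_flops(a: int, b: int, c: int) -> int:
    """FLOPs for multiplying an (a x b) matrix by a (b x c) matrix.

    Each of the a * c output elements needs b multiply-adds,
    and each multiply-add counts as 2 FLOPs.
    """
    return 2 * b * (a * c)


# Hypothetical sizes: a (2 x 3) matrix times a (3 x 4) matrix.
small = matmul_flops(2, 3, 4)           # 2 * 3 * (2 * 4) = 48

# Hypothetical projection: 1024 tokens through a (768 x 768) weight matrix.
projection = matmul_flops(1024, 768, 768)
```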

Post Training Strategies

After training, we generally perform alignment, i.e. teaching the model to behave/act in the desired manner. Post-training mainly consists of 1) supervised fine-tuning (SFT) and 2) RLHF. The current consensus within the research community seems to be that the optimal approach to alignment is to i) perform SFT over a moderately sized dataset of very high-quality examples and ii) invest the remaining effort into curating human preference data for fine-tuning via RLHF....

Notes-while-building-lilLM

Pre-training. Document packing: while pretraining, different documents can be packed into a single sequence. For instance, a model with context_length 1024 can have 256 tokens from one document and the rest from another, delimited by an EOS token. The packed samples may contaminate each other's attention, which is what cross-sample attention masking is for. But it isn’t used by DeepSeek v3, so let's not use it. While packing documents, we simply pack them in the order they appear and then add an EOS token (used by GPT-2 and GPT-3)....
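A minimal sketch of the packing scheme described above, assuming the documents are already tokenized into lists of integer ids. The function name `pack_documents`, the `eos_id` value, and the choice to drop the trailing partial sequence are assumptions made for this illustration, not details from the post.

```python
def pack_documents(docs, context_length, eos_id):
    """Pack tokenized documents, in order, into fixed-length sequences.

    An EOS token is appended after every document, the result is treated
    as one long token stream, and the stream is cut into chunks of
    `context_length`. No cross-sample attention mask is built here.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)

    # Assumption: the incomplete tail of the stream is simply dropped.
    n_full = len(stream) // context_length
    return [
        stream[i * context_length:(i + 1) * context_length]
        for i in range(n_full)
    ]


# Hypothetical toy data: three short "documents", context length 4, EOS id 0.
packed = pack_documents([[1, 2, 3], [4, 5], [6, 7, 8, 9]], context_length=4, eos_id=0)
```

In this toy run the second sequence contains tokens from two different documents separated by the EOS id, which is exactly the situation where cross-sample attention could occur.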

Pytorch Commands I forget time to time/ commands that are essential

torch.stack(tensors, dim) stacks the tensors along a new dimension dim
# usage (data has to be a tensor)
torch.stack([data[i:i+some_number] for i in range(10)])

torch.from_numpy(numpy_array) returns a tensor that shares memory with numpy_array
a = np.array([1,2,3])
b = torch.tensor(a) # creates a copy
c = torch.from_numpy(a) # shares memory
a[0] = 11
c # outputs: tensor([11, 2, 3])

torch.flatten(input, start_dim=0, end_dim=-1) flattens the input from dim start_dim to end_dim (-1 by default)
t = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
torch....