Welcome to CohleM’s Blog

Hi there, this is Manish. Here, I document and share what I learn along the way.

Mixture of Experts

Image Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts Basic MoE structure: experts are FFNNs themselves; instead of passing the input representation to only one dense FFNN, we now have the option to route it to multiple FFNNs. Since most LLMs have several decoder blocks, a given text will pass through multiple experts before the text is generated. Down the line it could use multiple experts, but at different blocks, i.e. layers. A routing layer is set to choose experts. Depending on how many experts are selected, MoE are categorized into two i....
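As a rough illustration of the routing idea above, here is a minimal top-k router sketch in PyTorch. The class name, dimensions, and choice of k are illustrative assumptions, not taken from the post:

```python
import torch
import torch.nn as nn

class TopKRouter(nn.Module):
    """Minimal sketch of an MoE routing layer: scores experts per token
    and keeps the top-k. Sizes here are arbitrary examples."""
    def __init__(self, d_model, n_experts, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # one score per expert
        self.k = k

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        logits = self.gate(x)                       # (batch, seq_len, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # keep the k best experts
        weights = weights.softmax(dim=-1)           # normalize over the chosen k
        return weights, idx                         # mixing weights and expert ids

router = TopKRouter(d_model=64, n_experts=8, k=2)
w, idx = router(torch.randn(1, 10, 64))
```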

January 5, 2025 · 14 min · CohleM

LoRA

LoRA Main idea is to approximate the change in weights dW through low-rank matrices. E.g.: usually the weight update is done by adding the change in weights dW to the original weight matrix W. dW is obtained through backpropagation; for example, if W is 512 x 512, the parameter size of dW is 262,144. In LoRA, we approximate that dW by breaking it down into two low-rank matrices B @ A, where B = matrix of size 512 x r and A = matrix of size r x 512,...
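A minimal sketch of that decomposition in PyTorch, using the 512 x 512 shape from the excerpt; the rank r=8 and the init scheme are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch: frozen W plus a trainable low-rank update B @ A."""
    def __init__(self, d=512, r=8):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)
        self.W.weight.requires_grad_(False)              # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # r x 512
        self.B = nn.Parameter(torch.zeros(d, r))         # 512 x r, zero init so dW starts at 0

    def forward(self, x):
        # effective weight is W + B @ A; only A and B are trained:
        # 2 * 512 * r = 8,192 params for r=8, vs 262,144 for a full dW
        return self.W(x) + x @ (self.B @ self.A).T

layer = LoRALinear()
out = layer(torch.randn(4, 512))
```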

April 7, 2025 · 5 min · CohleM

Interpretability

Induction circuits Induction behaviour: the task of detecting and repeating subsequences in a text by finding patterns. For example: if a text contains the name “James Bond” and later in the text the model sees the word “James”, it predicts/repeats the word “Bond”, because it has already seen the words “James Bond” and infers that “Bond” should come after the word “James”. Also called “Strict Induction”. Induction head: a head which implements the induction behaviour....
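A toy, model-free sketch of the strict induction rule described above, in plain Python; this only illustrates the pattern an induction head learns, not how a head computes it:

```python
def strict_induction_predict(tokens):
    """If the last token appeared earlier, predict the token that
    followed it then (most recent match first)."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan earlier positions backwards
        if tokens[i] == last:
            return tokens[i + 1]              # repeat what followed last time
    return None

text = "James Bond drove off . Later , James".split()
print(strict_induction_predict(text))  # -> 'Bond'
```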

March 3, 2025 · 11 min · CohleM

RLHF

How is RLHF different from the standard RL setup? No state transitions happen; the generation of one state does not affect another. We switch from a reward function to a reward model; the reward model could be any classification model.
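A minimal sketch of a reward model as a classifier-style head, trained with a pairwise preference loss as is common in RLHF; the dimensions, names, and pooled-hidden-state assumption are illustrative, not from the post:

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Sketch: scores a completion with a single scalar, like a classifier head.
    Assumes some encoder already produced a (batch, d_model) summary."""
    def __init__(self, d_model=64):
        super().__init__()
        self.head = nn.Linear(d_model, 1)   # one scalar score per completion

    def forward(self, h):                   # h: pooled hidden state of a completion
        return self.head(h).squeeze(-1)     # (batch,) reward scores

# pairwise preference loss (Bradley-Terry style), commonly used for reward models
rm = RewardModel()
h_chosen, h_rejected = torch.randn(8, 64), torch.randn(8, 64)
loss = -torch.log(torch.sigmoid(rm(h_chosen) - rm(h_rejected))).mean()
```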

February 24, 2025 · 1 min · CohleM

Flops calculation

Calculation of FLOPs. Multiply-accumulate cost: 2 FLOPs, i.e. 1 for the multiplication and 1 for the accumulation (addition). If we multiply two matrices with sizes (a x b) and (b x c), the FLOPs involved are b multiply-add operations per element of the output of size (a x c), i.e. 2 x b x (a x c). Embedding lookup: we initially have tokens with a (seq_len, vocab_size) one-hot representation and an embedding lookup matrix of size (vocab_size, d_model); it will take...
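A quick sanity check of the 2 x b x (a x c) formula in Python; the GPT-2-like sizes below are illustrative assumptions, not figures from the post:

```python
def matmul_flops(a, b, c):
    """FLOPs for (a x b) @ (b x c): one multiply-add (2 FLOPs) per
    inner-dimension step, for each of the a * c output elements."""
    return 2 * b * (a * c)

# embedding lookup as a matmul: (seq_len x vocab_size) @ (vocab_size x d_model)
seq_len, vocab_size, d_model = 1024, 50257, 768
print(matmul_flops(seq_len, vocab_size, d_model))  # ~79 GFLOPs for these sizes
```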

February 11, 2025 · 3 min · CohleM