Mixture of Experts
Image Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts
Basic MoE structure
Experts are FFNNs themselves: instead of passing the input representation to a single dense FFNN, we now have the option to route it to one of several FFNNs. Since most LLMs have several decoder blocks, a given text will pass through multiple experts before the text is generated, and down the line it could use different experts at different blocks, i.e. layers. A routing layer is set up to choose the experts. Depending on how many experts are selected, MoE are categorized into two i....
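A minimal PyTorch sketch of this structure, assuming a top-k routed sparse MoE layer; the names (TopKMoE, Expert) and all sizes are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A single expert: an ordinary position-wise FFNN."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class TopKMoE(nn.Module):
    """Sparse MoE layer: a routing layer picks the top-k experts per token."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)  # routing layer
        self.k = k

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        logits = self.router(x)                 # (batch, seq_len, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalise over the chosen k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(2, 10, 64)
print(moe(tokens).shape)   # torch.Size([2, 10, 64])
```

In a full model one such layer would replace the dense FFNN in each decoder block, which is why a token can end up using different experts at different layers.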
Interpretability
Induction circuits
Induction behaviour
The task of detecting and repeating subsequences in a text by finding patterns. For example: if the text contains the name "James Bond" and later in the text the model sees the word "James", it predicts/repeats the word "Bond", because it has already seen "James Bond" and infers that "Bond" should come after "James". Also called "Strict Induction".
Induction head
A head which implements the induction behaviour....
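A small sketch of how induction behaviour is often quantified: on a sequence built by repeating a random token block twice, a strict induction head attends from each token in the second copy to the token that followed the same token in the first copy. The function name induction_score and the toy numbers are assumptions for illustration:

```python
import torch

def induction_score(attn, period):
    """
    Mean attention mass a head places on the "induction offset".

    attn:   (seq_len, seq_len) attention pattern of one head on a sequence
            built by repeating a random block of length `period` twice.
    For strict induction, position t in the second copy should attend to
    position t - period + 1, i.e. the token that followed the matching
    token in the first copy.
    """
    seq_len = attn.shape[0]
    rows = torch.arange(period, seq_len)   # query positions in the second copy
    cols = rows - period + 1               # key positions one step after the match
    return attn[rows, cols].mean().item()

# Toy check: a "perfect" induction pattern scores 1.0 (hypothetical example).
period, seq_len = 8, 16
perfect = torch.zeros(seq_len, seq_len)
perfect[torch.arange(period, seq_len), torch.arange(period, seq_len) - period + 1] = 1.0
print(induction_score(perfect, period))  # 1.0
```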
RLHF
How's RLHF different from the RL setup
No state transitions happen: generating one state does not affect another.
We switch from a reward function to a reward model; the reward model could be any classification model.
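A minimal sketch of a reward model as a classifier-style scorer, trained with a pairwise (Bradley-Terry style) preference loss; the class name RewardModel, the mean-pooled embedding backbone, and the tensor sizes are assumptions, since in practice the backbone is usually a pretrained LM with a scalar head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt + response) token sequence to a single scalar reward."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, token_ids):               # (batch, seq_len)
        h = self.embed(token_ids).mean(dim=1)   # mean-pool over the sequence
        return self.score(h).squeeze(-1)        # (batch,) scalar rewards

# Pairwise preference loss: the preferred ("chosen") response should score
# higher than the rejected one.
rm = RewardModel()
chosen   = torch.randint(0, 1000, (4, 20))
rejected = torch.randint(0, 1000, (4, 20))
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
print(loss.item())
```

Because there are no state transitions, each generation is scored independently; the reward model only needs to judge a finished response, not a trajectory.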
FLOPs calculation
Calculation of FLOPs
Multiply-accumulate cost: 2 FLOPs, i.e. 1 for the multiplication and 1 for the accumulation (addition).
If we multiply two matrices of sizes (a x b) and (b x c), the FLOPs involved are b multiply-add operations per element of the (a x c) output, i.e. 2 x b x (a x c).
Embedding lookup
We initially have tokens in a (seq_len, vocab_size) one-hot representation and the embedding lookup matrix is (vocab_size, d_model), so it will take...
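A short sketch of these counts as code; the helper names and the example sizes (2048-token sequence, 50k vocab, d_model 4096) are made up for illustration:

```python
def matmul_flops(a, b, c):
    """FLOPs for multiplying an (a x b) matrix by a (b x c) matrix:
    b multiply-add operations per output element, 2 FLOPs each."""
    return 2 * b * (a * c)

def embedding_lookup_flops(seq_len, vocab_size, d_model):
    """Treat the lookup as a (seq_len x vocab_size) one-hot matrix times
    the (vocab_size x d_model) embedding matrix."""
    return matmul_flops(seq_len, vocab_size, d_model)

print(matmul_flops(2048, 4096, 4096))             # ~6.9e10 FLOPs
print(embedding_lookup_flops(2048, 50000, 4096))  # ~8.4e11 FLOPs if done as a dense matmul
```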
Post Training Strategies
After training, we generally perform alignment, i.e. teaching the model how to behave/act in the desired manner. Post-training mainly consists of 1) supervised fine-tuning (SFT) and 2) RLHF. The current consensus within the research community seems to be that the optimal approach to alignment is to i) perform SFT over a moderately-sized dataset of examples with very high quality and ii) invest the remaining effort into curating human preference data for fine-tuning via RLHF....
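A minimal sketch of the SFT step, assuming the common setup where next-token cross-entropy is computed only on the response tokens (the prompt is masked out); the function name sft_loss and all shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, response_mask):
    """
    logits:        (batch, seq_len, vocab_size) model outputs
    labels:        (batch, seq_len) target token ids (inputs shifted by one)
    response_mask: (batch, seq_len) 1 where the token belongs to the response
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    mask = response_mask.float()
    return (per_token * mask).sum() / mask.sum()

# Tiny example with random tensors (hypothetical shapes).
logits = torch.randn(2, 8, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 8))
mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1],
                     [0, 0, 1, 1, 1, 1, 1, 1]])
print(sft_loss(logits, labels, mask).item())
```

The RLHF step then builds on this SFT model, using the curated human preference data to train the reward model described in the RLHF section.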