Mixture of Experts
Image Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts
Basic MoE structure
Experts are FFNNs themselves: instead of passing the input representation to a single dense FFNN, we now have the option to route it to one of several FFNNs. Since most LLMs have several decoder blocks, a given text will pass through multiple experts before the text is generated, and down the line it could use different experts at different blocks, i.e. layers. A routing layer is set up to choose the experts. Depending on how many experts are selected, MoE are categorized into two i....
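A minimal PyTorch sketch of this structure, assuming a top-k routed sparse MoE layer; the names (TopKMoE, Expert) and all sizes are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A single expert: an ordinary position-wise FFNN."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)

class TopKMoE(nn.Module):
    """Sparse MoE layer: a routing layer picks the top-k experts per token."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)  # routing layer
        self.k = k

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        logits = self.router(x)                 # (batch, seq_len, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalise over the chosen k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e      # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
tokens = torch.randn(2, 10, 64)
print(moe(tokens).shape)   # torch.Size([2, 10, 64])
```

In a full model one such layer would replace the dense FFNN in each decoder block, which is why a token can end up using different experts at different layers.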
Interpretability
Induction circuits
Induction behaviour
The task of detecting and repeating subsequences in a text by finding patterns. For example: if the text contains the name "James Bond" and later in the text the model sees the word "James", it predicts/repeats the word "Bond", because it has already seen "James Bond" and infers that "Bond" should come after "James". Also called "Strict Induction".
Induction head
A head which implements the induction behaviour....
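A small sketch of how induction behaviour is often quantified: on a sequence built by repeating a random token block twice, a strict induction head attends from each token in the second copy to the token that followed the same token in the first copy. The function name induction_score and the toy numbers are assumptions for illustration:

```python
import torch

def induction_score(attn, period):
    """
    Mean attention mass a head places on the "induction offset".

    attn:   (seq_len, seq_len) attention pattern of one head on a sequence
            built by repeating a random block of length `period` twice.
    For strict induction, position t in the second copy should attend to
    position t - period + 1, i.e. the token that followed the matching
    token in the first copy.
    """
    seq_len = attn.shape[0]
    rows = torch.arange(period, seq_len)   # query positions in the second copy
    cols = rows - period + 1               # key positions one step after the match
    return attn[rows, cols].mean().item()

# Toy check: a "perfect" induction pattern scores 1.0 (hypothetical example).
period, seq_len = 8, 16
perfect = torch.zeros(seq_len, seq_len)
perfect[torch.arange(period, seq_len), torch.arange(period, seq_len) - period + 1] = 1.0
print(induction_score(perfect, period))  # 1.0
```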
RLHF
How's RLHF different from the RL setup
No state transitions happen: generating one state does not affect another.
We switch from a reward function to a reward model; the reward model could be any classification model.
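A minimal sketch of a reward model as a classifier-style scorer, trained with a pairwise (Bradley-Terry style) preference loss; the class name RewardModel, the mean-pooled embedding backbone, and the tensor sizes are assumptions, since in practice the backbone is usually a pretrained LM with a scalar head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (prompt + response) token sequence to a single scalar reward."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)

    def forward(self, token_ids):               # (batch, seq_len)
        h = self.embed(token_ids).mean(dim=1)   # mean-pool over the sequence
        return self.score(h).squeeze(-1)        # (batch,) scalar rewards

# Pairwise preference loss: the preferred ("chosen") response should score
# higher than the rejected one.
rm = RewardModel()
chosen   = torch.randint(0, 1000, (4, 20))
rejected = torch.randint(0, 1000, (4, 20))
loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
loss.backward()
print(loss.item())
```

Because there are no state transitions, each generation is scored independently; the reward model only needs to judge a finished response, not a trajectory.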
FLOPs calculation
Calculation of FLOPs
Multiply-accumulate cost: 2 FLOPs, i.e. 1 for the multiplication and 1 for the accumulation (addition).
If we multiply two matrices of sizes (a x b) and (b x c), the FLOPs involved are b multiply-add operations per element of the (a x c) output, i.e. 2 x b x (a x c).
Embedding lookup
We initially have tokens in a (seq_len, vocab_size) one-hot representation and the embedding lookup matrix is (vocab_size, d_model), so it will take...
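A short sketch of these counts as code; the helper names and the example sizes (2048-token sequence, 50k vocab, d_model 4096) are made up for illustration:

```python
def matmul_flops(a, b, c):
    """FLOPs for multiplying an (a x b) matrix by a (b x c) matrix:
    b multiply-add operations per output element, 2 FLOPs each."""
    return 2 * b * (a * c)

def embedding_lookup_flops(seq_len, vocab_size, d_model):
    """Treat the lookup as a (seq_len x vocab_size) one-hot matrix times
    the (vocab_size x d_model) embedding matrix."""
    return matmul_flops(seq_len, vocab_size, d_model)

print(matmul_flops(2048, 4096, 4096))             # ~6.9e10 FLOPs
print(embedding_lookup_flops(2048, 50000, 4096))  # ~8.4e11 FLOPs if done as a dense matmul
```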
Post Training Strategies
After training, we generally perform alignment, i.e. teaching the model how to behave/act in the desired manner. Post-training mainly consists of 1) supervised fine-tuning (SFT) and 2) RLHF. The current consensus within the research community seems to be that the optimal approach to alignment is to i) perform SFT over a moderately-sized dataset of examples with very high quality and ii) invest the remaining effort into curating human preference data for fine-tuning via RLHF....
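A minimal sketch of the SFT step, assuming the common setup where next-token cross-entropy is computed only on the response tokens (the prompt is masked out); the function name sft_loss and all shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, response_mask):
    """
    logits:        (batch, seq_len, vocab_size) model outputs
    labels:        (batch, seq_len) target token ids (inputs shifted by one)
    response_mask: (batch, seq_len) 1 where the token belongs to the response
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).reshape(labels.shape)
    mask = response_mask.float()
    return (per_token * mask).sum() / mask.sum()

# Tiny example with random tensors (hypothetical shapes).
logits = torch.randn(2, 8, 100, requires_grad=True)
labels = torch.randint(0, 100, (2, 8))
mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1],
                     [0, 0, 1, 1, 1, 1, 1, 1]])
print(sft_loss(logits, labels, mask).item())
```

The RLHF step then builds on this SFT model, using the curated human preference data to train the reward model described in the RLHF section.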