Mixture of Experts

Image Source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts Basic MoE structure: the experts are FFNNs themselves; instead of passing the input representation to a single dense FFNN, we now have the option to route it to several FFNNs. Since most LLMs have several decoder blocks, a given text will pass through multiple experts before it is generated, and it can use different experts at different blocks (i.e. layers). A routing layer chooses the experts; depending on how many experts are selected, MoE are categorized into two i....
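A minimal sketch of that structure (my own illustration, with made-up sizes and names such as TopKMoELayer): a linear router scores the experts per token and the output is a weighted sum over the top-k expert FFNNs.

import torch
import torch.nn as nn

class TopKMoELayer(nn.Module):
    """Toy MoE layer: a router picks the top-k expert FFNNs per token."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # routing layer
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (tokens, d_model)
        scores = self.router(x)                       # (tokens, n_experts)
        weights, idx = torch.topk(scores.softmax(dim=-1), self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                    # weighted sum of the chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(10, 64)
print(TopKMoELayer()(x).shape)  # torch.Size([10, 64])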

January 5, 2025 · 14 min · CohleM

Unpacking over indexing

Unpacking over indexing: why? It's less noisy.

# do this
a, b = something
# over
a = something[0]
b = something[1]

Another example:

# don't do this
snacks = [('bacon', 350), ('donut', 240), ('muffin', 190)]
for i in range(len(snacks)):
    item = snacks[i]
    name = item[0]
    calories = item[1]
    print(f'#{i+1}: {name} has {calories} calories')

# do this
for rank, (name, calorie) in enumerate(snacks, 1):
    print(rank, name, calorie)

Unpacking can be applied to any iterables (dict, lists, tuples)...
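As a quick sketch of that last point (my own example, not from the post), the same unpacking works directly when iterating over a dict's items:

# unpacking a dict's (key, value) pairs
calories = {'bacon': 350, 'donut': 240, 'muffin': 190}
for name, kcal in calories.items():
    print(f'{name} has {kcal} calories')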

June 9, 2025 · 3 min · CohleM

Multi-head latent attention

Scaled dot-product attention. Q1: Given the attention equation $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{(xW_q)(xW_k)^\top}{\sqrt{d_k}}\right)(xW_v)W_O $$ why don't we train by combining $W_qW_k^\top$ and $W_vW_O$, since mathematically they seem equivalent? $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{x(W_qW_k^\top)x^\top}{\sqrt{d_k}}\right)x(W_vW_O) $$ I initially thought that if we could combine those weights we would not need to compute $Q, K, V$, meaning fewer matrix multiplications. Answer: we lose the purpose of $Q, K, V, O$; they are meant to operate independently....
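A quick numerical check of that equivalence (my own sketch with arbitrary sizes, not from the post): for a single head, folding $W_qW_k^\top$ and $W_vW_O$ into single matrices gives exactly the same output.

import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                       # sequence length, model dim (illustrative sizes)
x = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# standard form: project to Q, K, V first
A1 = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d)) @ (x @ Wv) @ Wo
# "merged" form: fold Wq Wk^T and Wv Wo into single matrices
A2 = softmax(x @ (Wq @ Wk.T) @ x.T / np.sqrt(d)) @ x @ (Wv @ Wo)

print(np.allclose(A1, A2))        # True: the two forms are numerically identical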

April 28, 2025 · 5 min · CohleM

LoRA

LoRA. The main idea is to approximate the change in weights dW using low-rank matrices. Usually the weight update adds the change in weights dW, obtained through backpropagation, to the original weight matrix W; for example, if W is 512 x 512, dW has 262,144 parameters. In LoRA, we approximate that dW by breaking it down into two low-rank matrices B @ A, where B is a matrix of size 512 x r and A is a matrix of size r x 512,...
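A minimal sketch of that idea (my own, with illustrative sizes d = 512 and r = 8): W stays frozen and only B and A are trained, so the trainable update has 2 * 512 * 8 = 8,192 parameters instead of 262,144.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: frozen W plus a trainable low-rank update B @ A."""
    def __init__(self, d=512, r=8, alpha=16):
        super().__init__()
        self.W = nn.Parameter(torch.randn(d, d), requires_grad=False)  # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)                # r x 512
        self.B = nn.Parameter(torch.zeros(d, r))                       # 512 x r, zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        # original projection plus the low-rank approximation of dW
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear()
print(layer(torch.randn(3, 512)).shape)             # torch.Size([3, 512])
full = 512 * 512                                    # params in a full dW: 262,144
lora = sum(p.numel() for p in (layer.A, layer.B))   # 2 * 512 * 8 = 8,192
print(full, lora)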

April 7, 2025 · 5 min · CohleM

Interpretability

Induction circuits. Induction behaviour: the task of detecting and repeating subsequences in a text by finding patterns. For example, if the text contains the name "James Bond" and the model later sees the word "James", it predicts/repeats the word "Bond", because it has already seen "James Bond" and infers that "Bond" should come after "James". This is also called "strict induction". Induction head: a head which implements the induction behaviour....
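A toy Python sketch of that pattern at the token level (my own illustration of the behaviour, not the attention-head mechanism itself):

def strict_induction(tokens):
    """Predict the next token by copying whatever followed the most recent
    earlier occurrence of the current token: [A][B] ... [A] -> [B]."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan earlier positions right-to-left
        if tokens[i] == current:
            return tokens[i + 1]               # copy the token that followed it before
    return None

text = "James Bond drinks martini . later James".split()
print(strict_induction(text))  # -> 'Bond'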

March 3, 2025 · 11 min · CohleM