Latest Posts
All unstructured blogs/posts/notes
-
RL Papers
Skywork Open Reasoner 1 Technical Report -
Building RL GRPO
This post is a continuation of this blog, where I experiment with the basics of distributed training. This post will explain how we apply those to RL train... -
Distributed RL Training Step
I was learning how we can do distributed RL training, saw Karpathy posting this, and thought: why not make a complete blog about what I learned? So here it is. -
Python
Unpacking over indexing -
Multi-Head Latent Attention
Scaled dot-product attention -
LoRA
LoRA -
Interpretability
Induction circuits -
RLHF
Before starting, it’s advisable to first complete David Silver’s course on RL and read Lilian’s notes on RL, which explain/provide notes on David’s cour... -
FLOPs Calculation
Calculation of FLOPs -
Post Training Strategies
After training, we generally perform alignment, i.e., teaching the model how to behave/act in a desired manner. Post-training mainly consists of 1) Supervised Fine-t... -
PyTorch
torch.stack(tensors, dim) -
Building Lillm
Pre-training -
Tokenization
Unicode -
Paper Summaries
Papers that I’ve read with their respective notes. -
KV Cache GQA
KV Cache -
RoPE
Recap of Absolute PE -
RMSNorm
Recap of LayerNorm -
GPUs
GPU physical structure -
Mixture of Experts
Image source: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts -
Gradient Accumulation
Gradient Accumulation -
DDP
When we have enough resources, we would want to train our neural networks in parallel. The way to do this is to train our NN with different data (different ba... -
Training Speed Optimization
Precision -
Skip Connections
Skip connections simply skip layers by adding the identity of the input to their output, as shown in the figure below. -
Optimization Algorithms
The simplest algorithm is gradient descent, in which we simply calculate the loss over all the training data and then update our parameters, but it would be t... -
Manual Backpropagation On Tensors
Main code -
Matrix Visualization
In deep learning, it’s important to visualize a matrix and how it is represented in a dimension space because the operations that we perform on those matrix ... -
GPT Implementation
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/dat... -
Diagnostic Tool While Training NN
source: Building makemore Part 3: Activations & Gradients, BatchNorm -
Optimizing Loss
Problem -
Batch Normalization
As we saw in our previous note, it is important for the pre-activation values to be roughly Gaussian (0 mean and unit std). We saw how we can initial... -
Maximum Likelihood Estimate As Loss
-
Why We Need Regularization
It penalizes the weights and prioritizes uniformity in weights. -
Backpropagation From Scratch
Source: The spelled-out intro to neural networks and backpropagation: building micrograd -
Using Genetic Algorithm For Weights Optimization
BackStory