Manual Backpropagation on Tensors

Main code

n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 64 # the number of neurons in the hidden layer of the MLP

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_embd), generator=g)
# Layer 1
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3)/((n_embd * block_size)**0.5)
b1 = torch.randn(n_hidden, generator=g) * 0.1 # using b1 just for fun, it's useless because of BN
# Layer 2
W2 = torch....
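The post then backpropagates through every tensor of this network by hand instead of calling loss.backward(). As a minimal sketch of the idea on a single linear layer (the shapes, variable names, and loss below are illustrative, not the post's exact setup), we can derive the gradients with the chain rule and check them against autograd:

```python
import torch

# toy setup: 4 examples, 3 input features, 2 outputs (illustrative shapes)
x = torch.randn(4, 3)
W = torch.randn(3, 2, requires_grad=True)
b = torch.randn(2, requires_grad=True)

# forward pass: a linear layer followed by a mean-of-squares loss
out = x @ W + b            # (4, 2)
loss = (out ** 2).mean()   # scalar
loss.backward()            # autograd's gradients, for comparison

# manual backward pass via the chain rule
dout = 2 * out / out.numel()   # d(loss)/d(out) for a mean of squares
dW = x.T @ dout                # d(loss)/d(W), accumulated over the batch
db = dout.sum(0)               # d(loss)/d(b), summed over the batch dimension

print(torch.allclose(dW, W.grad), torch.allclose(db, b.grad))  # True True
```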

December 24, 2024 · 8 min · CohleM

Matrix Visualization

In deep learning, it’s important to visualize a matrix and how it is represented in a dimension space, because the operations we perform on those matrices become much more intuitive afterwards. Visualizing a two-dimensional matrix has to be the most intuitive case:

[ [12, 63, 10, 42, 70, 31, 34, 8, 34, 5],
  [10, 97, 100, 39, 64, 25, 86, 22, 31, 25],
  [28, 44, 82, 61, 70, 94, 22, 88, 89, 56] ]

We can simply imagine the rows as examples and the columns as those examples’ features....
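A quick way to keep that picture in mind (a minimal sketch using the same numbers as above): check the tensor's shape and see how reducing along each dimension corresponds to "per example" versus "per feature".

```python
import torch

# 3 examples (rows), each with 10 features (columns)
X = torch.tensor([
    [12, 63, 10, 42, 70, 31, 34, 8, 34, 5],
    [10, 97, 100, 39, 64, 25, 86, 22, 31, 25],
    [28, 44, 82, 61, 70, 94, 22, 88, 89, 56],
], dtype=torch.float32)

print(X.shape)        # torch.Size([3, 10])
print(X.mean(dim=1))  # one value per example: reduce across the feature columns
print(X.mean(dim=0))  # one value per feature: reduce across the example rows
```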

December 24, 2024 · 2 min · CohleM

Diagnostic Tool While Training NN

source: Building makemore Part 3: Activations & Gradients, BatchNorm

Things to look out for while training a NN. Take a look at the previous notes to understand this note better. Consider we have this simple 6-layer NN:

# Linear Layer
g = torch.Generator().manual_seed(2147483647) # for reproducibility

class Layer:
    def __init__(self, fan_in, fan_out, bias=False):
        self.w = torch.randn((fan_in, fan_out), generator=g) / (fan_in)**(0.5) # applying kaiming init
        self.bias = bias
        if bias:
            self.b = torch....
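Since the excerpt's Layer class is cut off here, the sketch below is a hedged, self-contained version of that kind of layer (with a tanh nonlinearity folded in just for this sketch, and illustrative sizes), plus one typical diagnostic: printing the standard deviation of each layer's activations to see whether they shrink or saturate as depth grows.

```python
import torch

g = torch.Generator().manual_seed(2147483647)  # for reproducibility

class Layer:
    def __init__(self, fan_in, fan_out, bias=False):
        # Kaiming-style init: scale weights by 1/sqrt(fan_in)
        self.w = torch.randn((fan_in, fan_out), generator=g) / fan_in**0.5
        self.b = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        preact = x @ self.w if self.b is None else x @ self.w + self.b
        self.out = torch.tanh(preact)  # nonlinearity folded in for this sketch
        return self.out

# a simple 6-layer stack and one diagnostic: activation std per layer
layers = [Layer(100, 100) for _ in range(6)]
x = torch.randn(32, 100, generator=g)
for i, layer in enumerate(layers):
    x = layer(x)
    print(f"layer {i}: activation std = {x.std().item():.3f}")
```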

December 20, 2024 · 6 min · CohleM

BatchNormalization

As we saw in our previous note, it is important for the pre-activation values to be roughly Gaussian (0 mean and unit std). We saw how we can initialize our weights with Kaiming init so that our pre-activations start out roughly Gaussian. But how do we always maintain our pre-activations to be roughly Gaussian? Answer: BatchNormalization. Benefits: stable training, prevents vanishing gradients. As the name suggests, pre-activations are normalized across the batch dimension; by normalizing across the batch we preserve the Gaussian property of our pre-activations....
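A minimal sketch of the batch-norm forward pass described here (the shapes and the bngain/bnbias names are illustrative): each hidden unit's pre-activations are normalized across the batch dimension, then scaled and shifted by learnable parameters so the network can still move away from exactly 0 mean / unit std if it needs to.

```python
import torch

# toy pre-activations: 32 examples in the batch, 64 hidden units (illustrative shapes)
hpreact = torch.randn(32, 64) * 5 + 3   # deliberately far from 0 mean / unit std

# learnable scale and shift, one per hidden unit
bngain = torch.ones(1, 64)
bnbias = torch.zeros(1, 64)

# normalize each unit across the batch dimension, then scale and shift
mean = hpreact.mean(0, keepdim=True)
std = hpreact.std(0, keepdim=True)
hpreact_bn = bngain * (hpreact - mean) / (std + 1e-5) + bnbias

print(hpreact_bn.mean().item(), hpreact_bn.std().item())  # roughly 0 and 1
```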

December 19, 2024 · 4 min · CohleM

Maximum likelihood estimate as loss function

December 16, 2024 · 0 min · CohleM