Optimization Algorithms (SGD with momentum, RMSProp, Adam)

The simplest algorithm is gradient descent, in which we calculate the loss over all the training data and then update our parameters, but this is slow and consumes too many resources. A faster approach is SGD, where we calculate the loss on a single training example and then do the parameter update, but the gradient estimate is noisy. A more robust approach is mini-batch SGD....
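A minimal sketch of the three update rules named in the title, written as plain tensor updates (the hyperparameter values are illustrative assumptions, not the post's; t is the 1-indexed step count for Adam's bias correction):

import torch

lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8  # illustrative hyperparameters

def sgd_momentum(p, grad, v, momentum=0.9):
    # velocity accumulates an exponentially decaying sum of past gradients
    v.mul_(momentum).add_(grad)
    p.data -= lr * v
    return v

def rmsprop(p, grad, s, decay=0.99):
    # running average of squared gradients scales the step per parameter
    s.mul_(decay).add_((1 - decay) * grad**2)
    p.data -= lr * grad / (s.sqrt() + eps)
    return s

def adam(p, grad, m, v, t):
    # Adam combines momentum (m) with RMSProp-style scaling (v), plus bias correction
    m.mul_(beta1).add_((1 - beta1) * grad)
    v.mul_(beta2).add_((1 - beta2) * grad**2)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    p.data -= lr * m_hat / (v_hat.sqrt() + eps)
    return m, v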

December 27, 2024 · 3 min · CohleM

manual-backpropagation-on-tensors

Main code

n_embd = 10 # the dimensionality of the character embedding vectors
n_hidden = 64 # the number of neurons in the hidden layer of the MLP

g = torch.Generator().manual_seed(2147483647) # for reproducibility
C = torch.randn((vocab_size, n_embd), generator=g)
# Layer 1
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g) * (5/3)/((n_embd * block_size)**0.5)
b1 = torch.randn(n_hidden, generator=g) * 0.1 # using b1 just for fun, it's useless because of BN
# Layer 2
W2 = torch....
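For context, a hedged sketch of the kind of check manual backpropagation involves: derive a gradient by hand for one op and compare it against PyTorch's autograd (this toy linear layer is an assumption for illustration, not the post's network):

import torch

g = torch.Generator().manual_seed(2147483647)
x = torch.randn((4, 3), generator=g, requires_grad=True)
W = torch.randn((3, 5), generator=g, requires_grad=True)

h = x @ W            # forward: a single linear layer
loss = h.sum()
loss.backward()      # autograd gradients for reference

# manual backprop: dloss/dh is all ones, then chain rule through the matmul
dh = torch.ones_like(h)
dx = dh @ W.T        # gradient w.r.t. the input
dW = x.T @ dh        # gradient w.r.t. the weights

print(torch.allclose(dx, x.grad))  # True
print(torch.allclose(dW, W.grad))  # True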

December 24, 2024 · 8 min · CohleM

Matrix Visualization

In deep learning, it’s important to visualize a matrix and how it is represented in dimension space, because the operations we perform on those matrices become much more intuitive afterwards. Visualizing a two-dimensional matrix: this has to be the most intuitive visualization.

[ [12, 63, 10, 42, 70, 31, 34, 8, 34, 5],
  [10, 97, 100, 39, 64, 25, 86, 22, 31, 25],
  [28, 44, 82, 61, 70, 94, 22, 88, 89, 56] ]

We can simply imagine rows as some examples and columns as those examples’ features....
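A small sketch of that reading of the excerpt's own 3×10 matrix, with rows as examples and columns as features (the mean-reduction at the end is an illustrative assumption):

import torch

# 3 examples (rows), 10 features (columns)
M = torch.tensor([
    [12, 63, 10, 42, 70, 31, 34, 8, 34, 5],
    [10, 97, 100, 39, 64, 25, 86, 22, 31, 25],
    [28, 44, 82, 61, 70, 94, 22, 88, 89, 56],
], dtype=torch.float32)

print(M.shape)        # torch.Size([3, 10])
print(M[1])           # one example: row 1 with all of its features
print(M.mean(dim=0))  # per-feature mean, averaged across the 3 examples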

December 24, 2024 · 2 min · CohleM

Diagnostic-tool-while-training-nn

source: Building makemore Part 3: Activations & Gradients, BatchNorm

Things to look out for while training a NN. Take a look at the previous notes to understand this note better. Consider this simple 6-layer NN:

# Linear Layer
g = torch.Generator().manual_seed(2147483647) # for reproducibility
class Layer:
    def __init__(self, fan_in, fan_out, bias=False):
        self.w = torch.randn((fan_in, fan_out), generator=g) / (fan_in)**0.5 # applying Kaiming init
        self.bias = bias
        if bias:
            self.b = torch....
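A hedged sketch of the kind of diagnostic this points at: push a batch through a stack of Kaiming-initialized layers and print each layer's activation statistics, which should stay roughly unit-scale rather than collapsing or saturating (the layer width, depth, and tanh nonlinearity here are illustrative assumptions):

import torch

g = torch.Generator().manual_seed(2147483647)

# six stacked linear layers with Kaiming init, as in the excerpt's Layer class
fan = 100
Ws = [torch.randn((fan, fan), generator=g) / fan**0.5 for _ in range(6)]

x = torch.randn((32, fan), generator=g)  # a batch of 32 examples
for i, W in enumerate(Ws):
    x = torch.tanh(x @ W)
    # healthy signs: mean near 0, std neither shrinking toward 0 nor saturating
    print(f"layer {i}: mean {x.mean():+.3f}, std {x.std():.3f}")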

December 20, 2024 · 6 min · CohleM

BatchNormalization

As we saw in our previous note, it is important for the pre-activation values to be roughly Gaussian (0 mean and unit std). We saw how to initialize our weights so that the pre-activations start out roughly Gaussian, using Kaiming init. But how do we maintain roughly Gaussian pre-activations throughout training? Answer: BatchNormalization. Benefits: stable training; prevents vanishing gradients. As the name suggests, activations are normalized across the batch, and by normalizing across the batch we preserve the Gaussian property of our pre-activations....
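A minimal sketch of the batch-norm computation itself, normalizing each hidden unit across the batch dimension and then applying a learnable scale and shift (the batch size, hidden width, and eps value are illustrative assumptions):

import torch

x = torch.randn(32, 64)  # pre-activations: batch of 32, 64 hidden units

# normalize each unit across the batch dimension (dim=0)
mean = x.mean(dim=0, keepdim=True)
var = x.var(dim=0, keepdim=True)
xhat = (x - mean) / torch.sqrt(var + 1e-5)

# learnable scale and shift let the network undo the normalization if needed
gamma = torch.ones(64)
beta = torch.zeros(64)
out = gamma * xhat + beta

print(out.mean(dim=0)[:3], out.std(dim=0)[:3])  # roughly 0 mean, unit std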

December 19, 2024 · 4 min · CohleM