I was learning how to do distributed RL training, saw Karpathy posting about it, and figured why not write a complete blog about what I learned, so here it is. The end goal of this blog is to explain clearly how to do distributed RL training. For now, it covers the fundamentals of distributed training: data parallelism, model parallelism, and tensor parallelism. Consider this part 1; in the next blog I’ll explain how to apply the techniques learned here....

33 min
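As a quick taste of the data-parallel idea mentioned above, here is a minimal toy sketch (my own illustration, not code from the post): each "worker" holds a shard of the batch, computes its local gradient, and the gradients are averaged, which for equally sized shards matches the gradient over the full batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # full batch: 8 examples, 3 features
y = rng.normal(size=(8,))
w = rng.normal(size=(3,))     # model weights, replicated on every worker

def grad(Xs, ys, w):
    # gradient of mean((Xs @ w - ys)^2) w.r.t. w
    return 2.0 / len(ys) * Xs.T @ (Xs @ w - ys)

# single-device gradient over the full batch
g_full = grad(X, y, w)

# "data parallel": 2 workers, each with half the batch; the all-reduce
# step is simulated here by averaging the per-worker gradients
shards = [(X[:4], y[:4]), (X[4:], y[4:])]
g_avg = np.mean([grad(Xs, ys, w) for Xs, ys in shards], axis=0)

print(np.allclose(g_full, g_avg))  # → True
```

In a real setup the averaging is done by an all-reduce across devices (e.g. what `torch.nn.parallel.DistributedDataParallel` performs during backward), but the arithmetic is the same.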

Optimizing Loss with Weight Initialization

Problem: Consider a simple MLP that takes 3 concatenated character embeddings as input and predicts the next character.

```python
# A simple MLP
n_embd = 10    # the dimensionality of the character embedding vectors
n_hidden = 200 # the number of neurons in the hidden layer of the MLP
g = torch.Generator().manual_seed(2147483647)  # for reproducibility
C  = torch.randn((vocab_size, n_embd),            generator=g)
W1 = torch.randn((n_embd * block_size, n_hidden), generator=g)
b1 = torch....
```

7 min · CohleM
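For context, a self-contained sketch of what that MLP's forward pass computes (my own numpy rendering; `vocab_size`, `block_size`, and the output-layer parameters are assumed values, since the preview truncates before defining them):

```python
import numpy as np

rng = np.random.default_rng(2147483647)
vocab_size, block_size = 27, 3   # assumed; the preview cuts off before these
n_embd, n_hidden = 10, 200

C  = rng.normal(size=(vocab_size, n_embd))             # embedding table
W1 = rng.normal(size=(n_embd * block_size, n_hidden))  # hidden layer weights
b1 = rng.normal(size=(n_hidden,))                      # hidden layer bias
W2 = rng.normal(size=(n_hidden, vocab_size))           # output layer weights
b2 = rng.normal(size=(vocab_size,))                    # output layer bias

idx = np.array([[1, 2, 3]])              # one context of block_size characters
emb = C[idx].reshape(idx.shape[0], -1)   # concat the 3 embeddings -> (1, 30)
h = np.tanh(emb @ W1 + b1)               # hidden activations -> (1, 200)
logits = h @ W2 + b2                     # scores over next character -> (1, 27)
print(logits.shape)                      # → (1, 27)
```

The scale of those random initializations is exactly what the post's topic is about: too-large weights saturate the `tanh` and inflate the initial loss.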
