Notes-while-building-lilLM

Pre-training

Document packing: while pretraining, different documents can be packed into a single sequence. For instance, a model with context_length 1024 can have 256 tokens from one document and the rest from another, delimited by an EOS token. The samples may contaminate each other's attention, which cross-sample attention masking prevents. But since it isn't used by DeepSeek-V3, let's not use it either. When packing documents, we simply pack them in the order they appear and add an EOS token between them (as used by GPT-2/3)....
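A minimal sketch of this naive packing (no cross-sample attention masking), assuming the documents are already tokenized; `pack_documents`, `eos_id`, and the sample tokens are illustrative names, not taken from the post:

```python
import numpy as np

def pack_documents(token_docs, eos_id, context_length=1024):
    """Concatenate tokenized docs in order, separated by EOS,
    then chop the stream into fixed-length training sequences."""
    stream = []
    for doc in token_docs:
        stream.extend(doc)
        stream.append(eos_id)  # delimit documents with EOS
    # drop the tail that doesn't fill a full context window
    n_full = len(stream) // context_length
    stream = stream[: n_full * context_length]
    return np.array(stream, dtype=np.int64).reshape(n_full, context_length)

# usage: two "documents" packed into sequences of length 8
seqs = pack_documents([[1, 2, 3], [4, 5, 6, 7, 8, 9, 10, 11, 12]],
                      eos_id=0, context_length=8)
print(seqs)  # a row may mix tokens from different documents, separated by 0
```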

January 29, 2025 · 1 min · CohleM

Pytorch Commands

PyTorch commands I forget from time to time / commands that are essential.

torch.stack(tensors, dim) stacks the tensors along dim.
# usage: data has to be a tensor
torch.stack([data[i:i+some_number] for i in range(10)])

torch.from_numpy(numpy_array) shares memory with the numpy_array but has tensor type.
a = np.array([1, 2, 3])
b = torch.tensor(a)      # creates a copy
c = torch.from_numpy(a)  # shares memory
a[0] = 11
c  # outputs: tensor([11, 2, 3])

torch.flatten(input, start_dim=0, end_dim=-1) flattens the input from dim start_dim to end_dim (-1 by default).
t = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
torch....
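For reference, a small self-contained snippet that exercises all three commands together (the tensors here are made-up examples, not from the post):

```python
import numpy as np
import torch

data = torch.arange(20)
stacked = torch.stack([data[i:i + 5] for i in range(4)])  # shape (4, 5)

a = np.array([1, 2, 3])
b = torch.tensor(a)       # copy
c = torch.from_numpy(a)   # shares memory with a
a[0] = 11
print(b, c)               # tensor([1, 2, 3]) tensor([11, 2, 3])

t = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(torch.flatten(t).shape)                # torch.Size([8])
print(torch.flatten(t, start_dim=1).shape)   # torch.Size([2, 4])
```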

January 29, 2025 · 1 min · CohleM

Tokenization

Unicode: a character encoding standard that aims to incorporate all available digital characters. Each character in Unicode has a unique 4-to-6-digit hexadecimal code; for example, the letter 'A' has the code 0041, represented as U+0041. It is compatible with ASCII: the first 128 characters in Unicode directly correspond to the characters in the 7-bit ASCII table. Unicode Transformation Format (UTF-8) uses 1-4 bytes to represent each character, can encode all the Unicode code points, and is backward compatible with ASCII. Example (1 byte): the character 'A' (U+0041) is encoded as `01000001` (0x41 in hexadecimal)....
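A quick way to inspect code points and their UTF-8 byte lengths in Python (the characters chosen here are just examples):

```python
# Inspect Unicode code points and their UTF-8 byte encodings.
for ch in ["A", "é", "€", "😀"]:
    code_point = ord(ch)
    utf8_bytes = ch.encode("utf-8")
    print(f"{ch!r}  U+{code_point:04X}  {len(utf8_bytes)} byte(s)  {list(utf8_bytes)}")

# 'A' (U+0041) -> 1 byte: [65] (0x41, i.e. 0b01000001)
# '€' (U+20AC) -> 3 bytes; '😀' (U+1F600) -> 4 bytes
```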

January 22, 2025 · 7 min · CohleM

Papers Summaries

Papers that I've read, with their respective notes. LLaMA: Open and Efficient Foundation Language Models. Trained on 1.4T tokens; the Wikipedia and Books domains are trained for 2 epochs (maybe because they are cleaner, smaller, and offer coherent long sequences). Uses manual backprop for training efficiency, i.e. save checkpoints of the activations that take longer to compute (linear layers) and use them during backprop, while regenerating the others, such as ReLU, on the fly. SmolLM2 including specific data e.g....
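The "regenerate activations on the fly" idea is activation checkpointing. A rough sketch with torch.utils.checkpoint follows; the `Block` module is invented for illustration, and note the paper describes a more selective scheme (keep expensive linear-layer activations, recompute only cheap ones like ReLU), whereas this generic utility recomputes everything inside the wrapped function:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        # Activations inside `ff` are not stored; they are recomputed
        # during the backward pass, trading compute for memory.
        return x + checkpoint(self.ff, x, use_reentrant=False)

x = torch.randn(8, 64, requires_grad=True)
Block()(x).sum().backward()  # re-runs the block's forward during backward
```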

January 21, 2025 · 2 min · CohleM

KV cache and Grouped Query Attention

KV Cache (KV cache visual operation). In the note below, I first describe how inference is done if we simply run the operations without a KV cache, and then describe how the KV cache helps remove redundant operations. We don't make use of the KV cache while training because the data is already filled in for every position of the sequence and we don't need to calculate the loss token by token; instead we do it in batches. While inferencing, we generally work with 1 batch of a few sequences and keep appending the next predicted token to the sequence one by one....
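A bare-bones sketch of caching K/V during decoding (single head, no batching; all names and shapes are illustrative, not lilLM's actual code):

```python
import torch

def attend(q, K, V):
    # q: (1, d), K/V: (t, d) -> weighted sum over all cached keys/values
    w = torch.softmax(q @ K.T / K.shape[-1] ** 0.5, dim=-1)
    return w @ V

d = 16
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
K_cache = torch.empty(0, d)
V_cache = torch.empty(0, d)

for step in range(5):                      # autoregressive decoding
    x = torch.randn(1, d)                  # embedding of the newest token
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = torch.cat([K_cache, k])      # only the new token's k/v are computed;
    V_cache = torch.cat([V_cache, v])      # earlier ones are reused from the cache
    out = attend(q, K_cache, V_cache)      # (1, d)
```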

January 18, 2025 · 11 min · CohleM