Notes-while-building-lilLM
Pre-training
Document packing: during pretraining, multiple documents can be packed into a single sequence. For instance, a model with context_length 1024 can have 256 tokens from one doc and the rest from another, delimited by an EOS token. Packed samples can contaminate each other's attention, which cross-sample attention masking prevents. But DeepSeek-V3 doesn't use it, so let's not use it either. When packing documents, we simply concatenate them in the order they appear and add an EOS token after each (as done in GPT-2/3).
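A minimal sketch of this in-order packing (GPT-2/3 style) might look like the following. The EOS id and the tail-dropping policy are assumptions, not lilLM's actual choices: we flatten the documents into one token stream with EOS after each, then slice it into fixed-length training sequences with no cross-sample attention mask.

```python
EOS_ID = 0  # hypothetical EOS token id (assumption)

def pack_documents(docs, context_length):
    """docs: list of token-id lists. Returns fixed-length sequences."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(EOS_ID)  # delimit documents with EOS
    # Slice the flat stream into context_length chunks;
    # here the incomplete tail is simply dropped (assumption).
    n_full = len(stream) // context_length
    return [stream[i * context_length:(i + 1) * context_length]
            for i in range(n_full)]

docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
seqs = pack_documents(docs, context_length=4)
# Documents cross chunk boundaries freely, so one training
# sequence can contain pieces of two different docs.
```

Note that the second chunk here straddles two documents, which is exactly the cross-sample attention contamination the masking would address.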