Papers that I’ve read with their respective notes.

LLaMA: Open and Efficient Foundation Language Models

  • Trained on 1.4T tokens.
  • Wikipedia and Books domains are trained for 2 epochs (maybe because they are cleaner, smaller, and offer coherent long sequences).
  • They use a manually implemented backward pass for training efficiency, i.e. they checkpoint activations that are expensive to compute (outputs of linear layers) and reuse them during backprop, while recomputing cheaper ones (e.g. ReLU) on the fly.
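
A minimal sketch of the general technique (activation checkpointing) using PyTorch's stock utility. LLaMA instead hand-writes the backward pass and selectively keeps only the expensive linear-layer outputs, so this is a coarser, block-level illustration, not their implementation.

```python
# Block-level activation checkpointing with PyTorch's stock utility: only the
# block inputs are stored; everything inside the block is recomputed in backward.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.ff(x)


class CheckpointedStack(nn.Module):
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList([Block(dim) for _ in range(depth)])

    def forward(self, x):
        for block in self.blocks:
            x = checkpoint(block, x, use_reentrant=False)
        return x


x = torch.randn(8, 512, requires_grad=True)
loss = CheckpointedStack(dim=512, depth=4)(x).sum()
loss.backward()  # intermediate activations (e.g. the ReLU output) are recomputed here
```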

SmolLM2

  • Including specific data, e.g. math, doesn’t only help on math; it also seems to improve reasoning.

  • Rather than training on one specific dataset, training on a mixture of datasets yields better results; for instance, a 60–40 mixture of FineWeb-Edu and DCLM yielded almost the same performance as training on FineWeb-Edu alone (see the mixture-sampling sketch after this list).

  • Decontamination of the curated datasets is generally done via bi-gram matching against the eval datasets (see the decontamination sketch after this list).

  • They use a multi-stage training approach rather than a fixed data mixture.
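
A toy sketch of the weighted dataset mixing mentioned above; the document names and the 60/40 weights just mirror the FineWeb-Edu/DCLM example and are not SmolLM2's actual pipeline.

```python
# Toy weighted-mixture sampler over two document streams. Dataset contents and
# the 60/40 weights are stand-ins, not SmolLM2's actual data pipeline.
import random
from itertools import cycle


def mixture(datasets, weights, seed=0):
    """Yield examples, picking the source dataset with probability given by weights."""
    rng = random.Random(seed)
    streams = [cycle(ds) for ds in datasets]  # cycle() keeps the toy streams from exhausting
    while True:
        i = rng.choices(range(len(streams)), weights=weights, k=1)[0]
        yield next(streams[i])


fineweb_edu = [f"fineweb_edu_doc_{i}" for i in range(5)]
dclm = [f"dclm_doc_{i}" for i in range(5)]

stream = mixture([fineweb_edu, dclm], weights=[0.6, 0.4])
print([next(stream) for _ in range(8)])
```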
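
A sketch of bi-gram-overlap decontamination against an eval set; the whitespace tokenizer, the overlap metric, and the 0.5 threshold are illustrative assumptions, not the exact rule used in the paper.

```python
# Bi-gram-overlap decontamination sketch: flag a training document if too many
# of its bi-grams also appear in any eval document. Tokenizer and threshold are assumptions.
def bigrams(text):
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))


def is_contaminated(train_doc, eval_docs, threshold=0.5):
    doc_bigrams = bigrams(train_doc)
    if not doc_bigrams:
        return False
    for eval_doc in eval_docs:
        overlap = len(doc_bigrams & bigrams(eval_doc)) / len(doc_bigrams)
        if overlap >= threshold:
            return True
    return False


eval_set = ["what is the capital of france", "solve for x in 2x + 3 = 7"]
print(is_contaminated("the capital of france is paris, what is it", eval_set))  # True
```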

LR decay

  • Warmup Phase (Steps 0–2,000):

    • Learning rate increases linearly from near 0 to 5.0×10⁻⁴.
  • Stable Phase (Steps 2,000–N):

    • Learning rate remains constant at 5.0×10⁻⁴.
  • Decay Phase (Last 10% of Steps):

    • Learning rate decreases linearly from 5.0×10⁻⁴ to 0 (see the scheduler sketch at the end of this list).
  • Had loss spikes during stage 3, which persisted even after rewinding training and changing the data that caused the spike. The cause of the spikes remains undetermined, but the eval metrics recovered in the end.

  • They include high-quality math data at the end and decay the learning rate to 0.

  • They expand the context length from 2k to 8k before the final 75 billion tokens of training, and the mixture is adjusted to include 40% long-context documents.

  • They curate their own instruction dataset, SmolTalk, because of low performance after training on previously available datasets, i.e. MagPie-Pro and OpenHermes2.5.

  • They filter the conversational datasets and deduplicate using the gte-large embedding model (see the deduplication sketch at the end of this list).

  • In short, they do a lot of decontamination (using bi-gram overlaps), deduplication, and filtering.

  • For smaller models, during SFT they filter the SmolTalk dataset (e.g., removing function-calling examples) and hard examples from MagPie-Ultra to better align with the models’ capacity, and do DPO on the UltraFeedback dataset (see the DPO loss sketch below).
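
The warmup–stable–decay schedule above maps directly to a step→learning-rate function. A minimal sketch using the 2,000 warmup steps, 5.0×10⁻⁴ peak, and 10% decay fraction from the notes; the total step count is a made-up placeholder.

```python
# Warmup-stable-decay schedule as a step -> LR function. Warmup steps, peak LR,
# and the 10% decay fraction follow the notes; total_steps here is a placeholder.
def wsd_lr(step, total_steps, peak_lr=5.0e-4, warmup_steps=2000, decay_frac=0.10):
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:            # linear warmup from ~0 to the peak
        return peak_lr * step / warmup_steps
    if step < decay_start:             # long constant ("stable") phase
        return peak_lr
    # final linear decay from the peak to 0 over the last 10% of steps
    return peak_lr * (total_steps - step) / (total_steps - decay_start)


total = 100_000
for s in (0, 1_000, 2_000, 50_000, 90_000, 95_000, 100_000):
    print(s, wsd_lr(s, total))
```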
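
A sketch of the embedding-based deduplication mentioned above; loading gte-large through sentence-transformers, the checkpoint name, and the 0.9 cosine-similarity threshold are my assumptions, not the SmolTalk pipeline.

```python
# Embedding-based near-duplicate filtering. The gte-large checkpoint name and the
# 0.9 cosine threshold are assumptions, not the exact SmolTalk pipeline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-large")  # assumed Hugging Face checkpoint

docs = [
    "How do I reverse a list in Python?",
    "What's the way to reverse a Python list?",
    "Explain the Chinchilla scaling laws.",
]
emb = model.encode(docs, normalize_embeddings=True)
sim = util.cos_sim(emb, emb)

kept = []
for i in range(len(docs)):
    # keep doc i only if it is not too similar to any previously kept doc
    if all(sim[i][j] < 0.9 for j in kept):
        kept.append(i)
print([docs[i] for i in kept])
```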
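
For the DPO step, a sketch of the standard DPO objective from Rafailov et al.; the log-prob tensors are placeholders for sequence log-probabilities computed by the policy and a frozen reference model on chosen/rejected pairs (e.g. from UltraFeedback).

```python
# Standard DPO objective (Rafailov et al.). The log-prob tensors are placeholders
# for sequence log-probs from the policy and a frozen reference model on
# chosen/rejected preference pairs (e.g. UltraFeedback).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # push the implicit reward of the chosen response above the rejected one
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()


loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```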


Training Compute-Optimal Large Language Models

  • Given a fixed FLOPs budget, how should one trade-off model size and the number of training tokens?

https://lifearchitect.ai/chinchilla/#deepmind
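
A back-of-envelope answer to that question using two common approximations (not the paper's exact fitted laws): training compute C ≈ 6·N·D and the Chinchilla rule of thumb of roughly 20 training tokens per parameter.

```python
# Back-of-envelope compute-optimal sizing, using the common approximations
# C ≈ 6 * N * D (training FLOPs) and D ≈ 20 * N (tokens per parameter).
def chinchilla_optimal(flops_budget, tokens_per_param=20.0):
    # Solve C = 6 * N * (tokens_per_param * N) for N.
    n_params = (flops_budget / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


n, d = chinchilla_optimal(5.76e23)  # roughly Chinchilla's training budget
print(f"params ~ {n / 1e9:.1f}B, tokens ~ {d / 1e12:.2f}T")  # ~69B params, ~1.4T tokens
```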