
Pre-training

Definition

Pre-training is the initial, large-scale training phase where a model learns general language understanding and generation capabilities by training on massive text corpora. It is called "pre-training" because it precedes more focused training stages (fine-tuning, RLHF). The result is a base model or foundation model.

Core Objective: Next-Token Prediction

The model is trained to predict the next token in a sequence:

```
Input:  "The capital of France is"
Target: "Paris"
Loss = CrossEntropy(model_output, "Paris")
```

This simple objective, applied at scale over trillions of tokens, forces the model to implicitly learn:

  • Grammar and syntax
  • World knowledge and facts
  • Reasoning patterns
  • Coding conventions
  • Multiple languages
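
To make the next-token objective concrete, here is a minimal sketch in plain Python. The four-word vocabulary and the logit values are invented for illustration; a real model produces logits over tens of thousands of subword tokens. The loss is simply the negative log-probability the model assigns to the true next token.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_id):
    """Cross-entropy loss: negative log-probability of the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_id])

# Hypothetical vocabulary and made-up model scores for the
# context "The capital of France is".
vocab = ["Paris", "London", "banana", "the"]
logits = [4.0, 2.0, -1.0, 0.5]

loss_good = next_token_loss(logits, vocab.index("Paris"))
loss_bad = next_token_loss(logits, vocab.index("banana"))
print(f"loss if target is 'Paris':  {loss_good:.3f}")
print(f"loss if target is 'banana': {loss_bad:.3f}")
```

The loss is small when the model already ranks the correct token highly and large when it does not, which is the signal that drives learning.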

Training Data

| Source Type | Examples |
|-------------|----------|
| Web crawls | Common Crawl, C4, RefinedWeb |
| Books | Books1, Books2, Project Gutenberg |
| Code | GitHub repositories, Stack Overflow |
| Wikipedia | All language editions |
| Scientific papers | arXiv, PubMed |
| Curated datasets | The Pile, Dolma, RedPajama |

Typical scale: 1–15 trillion tokens for frontier models

The Training Loop

1. Sample a batch of token sequences from the dataset
2. Run the forward pass: the model predicts the next token at every position
3. Compute the cross-entropy loss between the predictions and the actual next tokens
4. Run the backward pass: compute gradients via backpropagation
5. Update all parameters using an optimizer (typically AdamW)
6. Repeat for many thousands to millions of steps, until the token budget is consumed
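
The loop can be sketched end to end on a toy problem. The example below is a deliberately simplified stand-in, not how production systems are written: the "model" is a table of bigram logits rather than a transformer, the optimizer is plain SGD rather than AdamW, and the corpus, batch size, and learning rate are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)

# Toy corpus, tokenized by whitespace (real pre-training uses subword tokens).
text = "the cat sat on the mat . the dog sat on the rug ."
words = text.split()
vocab = sorted(set(words))
tok = {w: i for i, w in enumerate(vocab)}
ids = [tok[w] for w in words]
V = len(vocab)

# "Model": a V x V table of logits; row i scores every possible next token
# given that the current token is i.
W = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def train_step(batch, lr=0.5):
    """One loop iteration: forward, loss, backward, parameter update."""
    grads = [[0.0] * V for _ in range(V)]
    loss = 0.0
    for cur, nxt in batch:
        probs = softmax(W[cur])            # forward pass
        loss += -math.log(probs[nxt])      # cross-entropy vs. actual next token
        for j in range(V):                 # backward: dLoss/dlogit = p - onehot
            grads[cur][j] += probs[j] - (1.0 if j == nxt else 0.0)
    n = len(batch)
    for i in range(V):                     # plain SGD update (AdamW in practice)
        for j in range(V):
            W[i][j] -= lr * grads[i][j] / n
    return loss / n

pairs = list(zip(ids[:-1], ids[1:]))       # (current token, next token) examples
losses = []
for step in range(200):
    batch = random.sample(pairs, 4)        # step 1: sample a batch
    losses.append(train_step(batch))       # steps 2-5: forward, loss, backward, update

print(f"first-step loss: {losses[0]:.3f}, last-step loss: {losses[-1]:.3f}")
```

With all logits initialized to zero, the first loss equals the log of the vocabulary size (a uniform guess); it falls as the table learns which tokens tend to follow which.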

Infrastructure Requirements

  • Hundreds to thousands of GPUs/TPUs (H100, A100, TPU v4/v5)
  • Distributed training: Data Parallelism, Tensor Parallelism, Pipeline Parallelism
  • Mixed-precision training (bf16) to reduce memory use and increase throughput
  • Gradient checkpointing to trade compute for memory
  • Efficient data loaders to prevent I/O bottlenecks
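
A back-of-the-envelope memory estimate shows why a single accelerator is not enough. The sketch below assumes a commonly cited accounting for mixed-precision Adam-style training of 16 bytes per parameter (bf16 weights and gradients, plus fp32 master weights and two fp32 optimizer moments), ignores activations entirely, and assumes 80 GiB devices with the state split perfectly evenly. All of these are simplifying assumptions, and the function names are illustrative.

```python
import math

def training_state_gib(n_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and optimizer state, in GiB.

    The default of 16 bytes/parameter assumes bf16 weights (2) + bf16
    gradients (2) + fp32 master weights, Adam momentum, and Adam variance
    (4 + 4 + 4). Activations are NOT included.
    """
    return n_params * bytes_per_param / 2**30

def min_gpus_for_state(n_params, gpu_mem_gib=80):
    """Smallest GPU count whose combined memory holds the training state,
    assuming it is split perfectly evenly across devices (an idealization)."""
    return math.ceil(training_state_gib(n_params) / gpu_mem_gib)

for n in (7e9, 70e9):
    print(f"{n / 1e9:.0f}B params: ~{training_state_gib(n):.0f} GiB of state, "
          f">= {min_gpus_for_state(n)} GPUs of 80 GiB")
```

Real runs need considerably more than this lower bound once activations, communication buffers, and throughput targets are taken into account.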

Compute Scale (Chinchilla Scaling Laws)

The Chinchilla paper (DeepMind, 2022) estimated the compute-optimal ratio of training tokens to parameters:

  • Rule of thumb: about 20 tokens per parameter for compute-optimal training
  • 7B model → ~140B tokens at that ratio; many recent models train on 10–100× more, deliberately over-training a smaller model so that it is cheaper to run at inference
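
The rule of thumb is easy to apply directly. The helper names below are illustrative, and the 2T-token run is just an example budget for comparison.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb ratio quoted above

def compute_optimal_tokens(n_params):
    """Approximate compute-optimal training tokens for a given model size."""
    return TOKENS_PER_PARAM * n_params

def overtraining_factor(n_params, n_tokens):
    """How many times more tokens than the rule of thumb a run uses."""
    return n_tokens / compute_optimal_tokens(n_params)

n = 7e9
print(f"compute-optimal tokens for 7B: {compute_optimal_tokens(n):.2e}")
print(f"over-training factor at 2T tokens: {overtraining_factor(n, 2e12):.1f}x")
```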

Compute Cost Estimate (Rough Formula)

```
FLOPs ≈ 6 × N × D

where:
  N = number of parameters
  D = number of training tokens
```

Example: 7B model × 2T tokens ≈ 6 × (7 × 10^9) × (2 × 10^12) ≈ 8.4 × 10^22 FLOPs, roughly $1–5M in GPU cost depending on hardware, utilization, and pricing
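
The formula can be turned into a small calculator. Converting FLOPs into GPU-hours requires two extra inputs that the formula does not provide: a peak throughput per device and the fraction of it actually sustained. The values used here (312 TFLOP/s, the A100 dense bf16 spec-sheet figure, and 40% utilization) are assumptions for illustration, not measurements.

```python
def training_flops(n_params, n_tokens):
    """Approximate total training compute: FLOPs ≈ 6 × N × D."""
    return 6 * n_params * n_tokens

def gpu_hours(flops, peak_flops_per_s, utilization):
    """GPU-hours needed at a given peak throughput and sustained utilization."""
    return flops / (peak_flops_per_s * utilization) / 3600

flops = training_flops(7e9, 2e12)
# Assumed hardware numbers; substitute your own device and utilization.
hours = gpu_hours(flops, peak_flops_per_s=312e12, utilization=0.40)
print(f"{flops:.2e} FLOPs, ~{hours:,.0f} GPU-hours under these assumptions")
```

Multiplying the GPU-hours by an hourly price gives a dollar estimate; the result is very sensitive to the utilization and price assumed.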

Pre-training Phases

Some models use a multi-phase curriculum:

1. Phase 1: Broad web data for general language understanding
2. Phase 2: High-quality curated data (books, code, math) to boost specific capabilities
3. Phase 3 (optional): Domain-specific data for specialized models
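
One simple way to express such a curriculum is a schedule keyed on training progress. The sketch below is hypothetical: the phase boundaries (80% and 95% of the token budget) and the phase names are invented for illustration and do not describe any particular model's recipe.

```python
# Hypothetical schedule: (fraction of total tokens at which the phase ends, name).
PHASES = [
    (0.80, "broad_web"),        # Phase 1: general web data
    (0.95, "curated_quality"),  # Phase 2: books, code, math
    (1.00, "domain_specific"),  # Phase 3 (optional): specialized data
]

def phase_at(tokens_seen, total_tokens):
    """Return the name of the data phase active at this point in training."""
    progress = tokens_seen / total_tokens
    for end_fraction, name in PHASES:
        if progress < end_fraction:
            return name
    return PHASES[-1][1]  # at or past the end: stay in the final phase

for seen in (0, 50, 80, 96):
    print(f"{seen}% of tokens seen -> {phase_at(seen, 100)}")
```

A data loader could call a function like this at each step to decide which source mixture to sample from.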

What Pre-training Produces

A base/foundation model that:

  • Can complete text coherently
  • Has broad world knowledge
  • Does NOT reliably follow instructions (may continue a question rather than answer it)
  • Requires further alignment work to be useful as an assistant

Notable Pre-trained Models

| Model | Organization | Parameters | Training tokens |
|-------|--------------|------------|-----------------|
| GPT-3 | OpenAI | 175B | 300B |
| Llama 3 | Meta | 8B–70B | ~15T |
| Mistral 7B | Mistral AI | 7B | undisclosed |
| Gemma 2 | Google | 2B–27B | 2T–13T |
| Claude (base) | Anthropic | undisclosed | undisclosed |

Related Concepts

  • Base Model, Fine-Tuning, RLHF, Parameters, Token, Scaling Laws, Chinchilla
