LLM (Large Language Model)

Definition

A Large Language Model is a deep learning model trained on massive text corpora that generates text by predicting the most probable next token given a sequence of prior tokens.

Core Mechanism

  • Built on the Transformer architecture (introduced in "Attention Is All You Need", 2017)
  • Uses self-attention to weigh the relevance of every token against every other token in the input
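  • Processes all input positions in parallel during training (unlike RNNs, which process tokens sequentially); generation still emits one token at a time
  • Output is a probability distribution over the vocabulary at each step; the next token is either the most probable one (greedy decoding) or a sample drawn from that distribution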
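
The self-attention step can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a production implementation: the input vectors are made up, and it assumes queries, keys, and values are all the raw input (a real model applies learned projections first, and decoder-only LLMs also mask out future positions, which is omitted here).

```python
import math

def softmax(xs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of vectors, one vector per token.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: 3 tokens with 2-dimensional vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: Q = K = V = x, i.e. the learned projections are skipped.
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how strongly one token attends to every token in the input, and each row sums to 1.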

Architecture Components

  • Embedding Layer — converts tokens to dense vectors
  • Transformer Blocks (stacked): each contains
    - Multi-Head Self-Attention
    - Feed-Forward Network (FFN)
    - Layer Normalization
    - Residual Connections
  • Output Head (LM Head): a linear layer that projects to vocabulary size, followed by softmax
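
The data flow through these components can be sketched as follows. Everything here is illustrative: the vocabulary and weight values are made up, the sublayers are simple stand-ins rather than trained layers, and the pre-norm ordering (normalize before each sublayer) is an assumption, since the original Transformer normalized after each sublayer.

```python
import math

VOCAB = ["the", "cat", "sat", "mat"]
# Made-up weights (vocab_size x d_model); a trained model learns these.
EMBEDDING = [[0.1, 0.3, -0.2], [0.5, -0.1, 0.4], [-0.3, 0.2, 0.1], [0.0, -0.4, 0.6]]
LM_HEAD = [[0.2, -0.1, 0.4], [0.7, 0.1, -0.3], [-0.5, 0.6, 0.2], [0.1, 0.1, 0.1]]

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def toy_attention(x):
    """Stand-in for self-attention: with a single token and no learned
    projections, attention returns that token's own vector."""
    return x

def toy_ffn(x):
    """Stand-in for the feed-forward network: just a ReLU, no weights."""
    return [max(0.0, v) for v in x]

def transformer_block(x):
    # Residual connection: each sublayer's output is added back to its input.
    x = [a + b for a, b in zip(x, toy_attention(layer_norm(x)))]
    x = [a + b for a, b in zip(x, toy_ffn(layer_norm(x)))]
    return x

def next_token_distribution(token_id, n_blocks=2):
    x = EMBEDDING[token_id]            # embedding layer: token id -> dense vector
    for _ in range(n_blocks):          # stacked blocks keep the vector size unchanged
        x = transformer_block(x)
    # Output head: one score (logit) per vocabulary entry, then softmax.
    logits = [sum(w * v for w, v in zip(row, x)) for row in LM_HEAD]
    return softmax(logits)

probs = next_token_distribution(VOCAB.index("cat"))
predicted = VOCAB[probs.index(max(probs))]
print(len(probs), round(sum(probs), 6), predicted)
```

The point of the sketch is the shapes: the embedding turns one token id into a `d_model`-sized vector, every block preserves that size, and the output head expands it to one probability per vocabulary entry.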

Scale

  • "Large" refers to parameter count: billions to trillions of parameters
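  • Examples: GPT-4 (size undisclosed; ~1T is an unofficial estimate), Claude 3 Opus (size undisclosed), Llama 3 (8B and 70B), Mistral 7B
  • Scaling laws (e.g., Chinchilla): loss falls predictably, roughly as a power law, as parameters, training data, and compute grow together; Chinchilla found that for a fixed compute budget, parameters and training tokens should be scaled in roughly equal proportion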
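
To make these numbers concrete, here is a back-of-the-envelope sketch. The formulas are common approximations, not exact counts: about 12·d² weights per block (assuming a feed-forward hidden size of 4·d), roughly 20 training tokens per parameter as the Chinchilla rule of thumb, and about 6 FLOPs per parameter per training token. The example configuration (12 layers, width 768, vocabulary 50,257) matches GPT-2 small and is used only for illustration.

```python
def estimate_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a GPT-style decoder.

    Assumes per block: 4*d^2 attention weights (Q, K, V, output projections)
    plus 8*d^2 feed-forward weights (hidden size 4*d).
    Ignores biases, layer norms, and position embeddings.
    """
    per_block = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_block + embeddings

def chinchilla_tokens(n_params, tokens_per_param=20):
    """Compute-optimal training tokens under the ~20 tokens/parameter rule of thumb."""
    return n_params * tokens_per_param

def training_flops(n_params, n_tokens):
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens

small = estimate_params(n_layers=12, d_model=768, vocab_size=50257)
print(f"{small:,} parameters")  # 123,532,032 parameters

n = 7_000_000_000  # a 7B-parameter model
tokens = chinchilla_tokens(n)
print(f"{tokens:,} tokens, {training_flops(n, tokens):.2e} FLOPs")
```

The estimate for the small configuration lands close to the commonly quoted 124M figure for that model; the gap comes from the terms the approximation ignores.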

Training Objective

  • Next-token prediction (autoregressive/causal language modeling)
  • Given tokens [t1, t2, ..., tn], predict t(n+1)
  • Loss function: Cross-entropy between predicted distribution and true next token
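
The objective can be illustrated with a toy example. The token ids and the predicted distributions below are made up (a real model would produce the distributions); the sketch only shows how targets are the inputs shifted by one position and how the loss is the negative log-probability assigned to the true next token.

```python
import math

def cross_entropy(predicted_probs, target_id):
    """Loss at one position: negative log-probability of the true next token."""
    return -math.log(predicted_probs[target_id])

tokens = [2, 0, 3, 1]                      # toy token ids, vocabulary size 4
inputs, targets = tokens[:-1], tokens[1:]  # each target is the token that follows

# Hypothetical model outputs: one distribution per input position.
predictions = [
    [0.70, 0.10, 0.10, 0.10],  # after token 2, true next token is 0 (confident, correct)
    [0.10, 0.10, 0.10, 0.70],  # after token 0, true next token is 3 (confident, correct)
    [0.25, 0.25, 0.25, 0.25],  # after token 3, true next token is 1 (uniform guess)
]

losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
mean_loss = sum(losses) / len(losses)
print([round(l, 3) for l in losses], round(mean_loss, 3))
```

A uniform guess over 4 tokens costs ln(4) per position; training pushes the loss down by moving probability mass onto the tokens that actually come next.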

Capabilities (Emergent at Scale)

  • Text generation, summarization, translation
  • Code generation and debugging
  • Reasoning, question answering
  • Few-shot and zero-shot task generalization
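
The difference between zero-shot and few-shot use comes down to what goes into the prompt. The helper below is a hypothetical sketch: `build_prompt` and the `Input:`/`Output:` layout are illustrative choices, not a required format.

```python
def build_prompt(task, examples, query):
    """Assemble a plain-text prompt.

    With no examples this is zero-shot; with a few solved examples it is few-shot.
    """
    lines = [task, ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

task = "Translate English to French."
zero_shot = build_prompt(task, [], "cheese")
few_shot = build_prompt(task, [("sea otter", "loutre de mer"), ("bread", "pain")], "cheese")
print(few_shot)
```

In both cases the model weights are unchanged; the examples only steer the next-token predictions through the context.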

Limitations

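  • No real-time knowledge: the model knows nothing after its training data cutoff unless given external information
  • Prone to hallucination (fluent output that is false or unsupported)
  • Context window limits: only a fixed number of tokens can be processed at once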
  • No persistent memory across sessions by default
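
The context window limit is easy to see in code. This sketch assumes the simplest possible policy, keeping only the most recent tokens, and uses whitespace splitting as a stand-in for a real tokenizer; `fit_to_context` is a hypothetical helper, not a library function.

```python
def fit_to_context(tokens, max_tokens):
    """Keep only the most recent tokens that fit in the context window."""
    if max_tokens <= 0:
        return []
    return tokens[-max_tokens:]

# Whitespace splitting stands in for real tokenization.
history = "user: hi assistant: hello user: what did I say first ?".split()
window = fit_to_context(history, max_tokens=6)
print(window)
```

With a window of 6 tokens, the earliest turns are dropped before the model sees the input, so it has no way to answer the question about them.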

Key Variants

| Type | Description |
|------|-------------|
| Base/Pretrained | Raw next-token predictor |
| Instruct-tuned | Fine-tuned to follow instructions |
| RLHF-aligned | Further shaped by human feedback |
| Multimodal | Handles text + images/audio |

Related Concepts

  • Token, Tokenization, Embeddings, Parameters, Pre-training, Inference
