LLM (Large Language Model) — FDE@ProdAI Blog

Definition

A Large Language Model is a deep learning model trained on massive text corpora that generates text by predicting the most probable next token given a sequence of prior tokens.

Core Mechanism

Built on the Transformer architecture (introduced in "Attention Is All You Need", 2017)
Uses self-attention to weigh the relevance of every token against every other token in the input
Processes input in parallel (unlike RNNs which process sequentially)
Output is a probability distribution over the vocabulary at each step: the next token is either the most probable one (greedy decoding) or sampled from that distribution
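To make the self-attention step concrete, here is a minimal single-head sketch in plain Python. It rests on simplifying assumptions: queries, keys, and values are the token vectors themselves (a real model applies learned projections first), and the toy vectors and the `self_attention` helper are hypothetical. The optional `causal` flag hides later tokens from earlier ones, as autoregressive models do.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Single-head scaled dot-product self-attention.

    x: list of token vectors (lists of floats of equal length).
    Assumption: Q, K and V are the inputs themselves; a real model
    would first apply learned projection matrices.
    """
    d = len(x[0])
    outputs, weights = [], []
    for i, query in enumerate(x):
        # Token i attends to all tokens, or only to tokens 0..i if causal.
        limit = i + 1 if causal else len(x)
        scores = [
            sum(a * b for a, b in zip(query, x[j])) / math.sqrt(d)
            for j in range(limit)
        ]
        # Masked (future) positions get exactly zero weight.
        row = softmax(scores) + [0.0] * (len(x) - limit)
        weights.append(row)
        # Output i is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(row[j] * x[j][k] for j in range(len(x))) for k in range(d)]
        )
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy embeddings
full_out, full_w = self_attention(tokens)
causal_out, causal_w = self_attention(tokens, causal=True)
```

Each row of `full_w` sums to 1 and says how much one token draws on every other token. All rows are computed independently of each other, which is what allows the parallel processing mentioned above.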

Architecture Components

Embedding Layer — converts tokens to dense vectors
Transformer Blocks (stacked) — each contains:

- Multi-Head Self-Attention
- Feed-Forward Network (FFN)
- Layer Normalization
- Residual Connections

Output Head (LM Head) — linear layer + softmax projecting to vocabulary size
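The output head can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not any particular model's implementation: the three-word vocabulary, the hidden vector, and the weight matrix are made-up toy values, and the bias term is omitted.

```python
import math
import random

def lm_head(hidden, weight):
    """Linear projection: one logit per vocabulary entry.

    weight has shape [vocab_size][hidden_size]; bias omitted for brevity.
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def sample(probs, rng):
    """Draw one token index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

vocab = ["the", "cat", "sat"]                     # toy vocabulary (hypothetical)
hidden = [0.5, -1.0]                              # final hidden state (made up)
weight = [[1.0, 0.0], [0.0, 1.0], [2.0, -0.5]]    # LM head weights (made up)

logits = lm_head(hidden, weight)
probs = softmax(logits)
next_greedy = vocab[greedy(probs)]
next_sampled = vocab[sample(probs, random.Random(0))]
```

Greedy selection always returns the same token for the same input, while sampling trades determinism for variety. The `temperature` parameter controls how far sampling strays from the top choice.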

Scale

"Large" refers to parameter count: billions to trillions of parameters
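Examples: GPT-4 (parameter count undisclosed; ~1T is an unofficial estimate), Claude 3 Opus (undisclosed), LLaMA 3 (8B and 70B), Mistral 7B
Scaling laws (e.g. Chinchilla) describe how loss falls predictably as parameters, training data, and compute grow; Chinchilla's main finding is that, for a fixed compute budget, parameters and training tokens should be scaled up together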
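A back-of-the-envelope sketch of what such scaling laws imply. It relies on two widely quoted approximations, both assumptions here rather than exact laws: training compute is about 6 FLOPs per parameter per token, and a compute-optimal model sees about 20 training tokens per parameter. The function names are hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # common approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20       # rule-of-thumb ratio, not an exact constant

def training_flops(params, tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens

def compute_optimal(flops):
    """Split a FLOP budget into (params, tokens) under the 20:1 heuristic.

    With D = 20 * N and C = 6 * N * D, solving for N gives
    N = sqrt(C / (6 * 20)).
    """
    params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params

# Chinchilla itself: roughly 70B parameters trained on roughly 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
params, tokens = compute_optimal(budget)
print(f"budget ~ {budget:.2e} FLOPs, N ~ {params:.2e}, D ~ {tokens:.2e}")
```

Under these assumptions, quadrupling the compute budget doubles both the suggested parameter count and the suggested token count.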

Training Objective

Next-token prediction (autoregressive/causal language modeling)
Given tokens [t1, t2, ..., tn], predict t(n+1)
Loss function: Cross-entropy between predicted distribution and true next token
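The objective can be written out as a short sketch: the logits produced at position i are scored against the token at position i+1, and the per-position cross-entropies are averaged. The token ids and logits below are made-up toy values, and `next_token_loss` is a hypothetical helper name.

```python
import math

def log_softmax(logits):
    """Log-probabilities, computed stably via the log-sum-exp trick."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_total for l in logits]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy for next-token prediction.

    logits_per_position[i] is the model output after reading tokens 0..i;
    its target is token_ids[i + 1].
    """
    targets = token_ids[1:]
    losses = [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, targets)
    ]
    return sum(losses) / len(losses)

token_ids = [0, 2, 1]          # toy sequence over a 3-token vocabulary
logits = [
    [0.0, 0.0, 0.0],           # position 0: uniform guess, target is token 2
    [0.0, 5.0, 0.0],           # position 1: confident in token 1, the target
]
loss = next_token_loss(logits, token_ids)
```

A uniform guess over three tokens costs ln 3 (about 1.10), while a confident correct prediction costs almost nothing, so training pushes probability mass toward the true next token.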

Capabilities (Emergent at Scale)

Text generation, summarization, translation
Code generation and debugging
Reasoning, question answering
Few-shot and zero-shot task generalization

Limitations

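No knowledge of events after the training data cutoff (knowledge cutoff)
Prone to hallucination (fluent but factually incorrect output)
Limited context window (a fixed maximum number of tokens per request)
No persistent memory across sessions by default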

Key Variants

| Type | Description |
|------|-------------|
| Base/Pretrained | Raw next-token predictor |
| Instruct-tuned | Fine-tuned to follow instructions |
| RLHF-aligned | Further shaped by human feedback |
| Multimodal | Handles text + images/audio |

Related Concepts

Token, Tokenization, Embeddings, Parameters, Pre-training, Inference