Definition
Tokenization is the process of converting raw text into a sequence of tokens: discrete units of text, each mapped to an integer ID that an LLM can process. It brackets every LLM pipeline: text → token IDs (encoding) on the way in, and token IDs → text (decoding) on the way out.
The Full Pipeline
```
Raw Text → [Tokenizer] → Token IDs → [Model] → Token IDs → [Detokenizer] → Output Text
"Hello!" → [15496, 0] → ...model... → [2159] → "World"
```
Steps in Tokenization
1. Normalization — Unicode normalization, lowercasing (model-dependent), whitespace handling
2. Pre-tokenization — split on spaces/punctuation as initial boundary hints
3. Subword splitting — apply BPE/WordPiece/SentencePiece rules to produce final token units
4. Vocabulary lookup — map each token string → integer ID
5. Special token injection — add [BOS], [EOS], [PAD] as required by the model
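The five steps above can be sketched end to end with a toy tokenizer. This is only an illustration under stated assumptions: the vocabulary, the `[UNK]` fallback, and the function names are hypothetical, and step 3 is deliberately skipped so that each pre-token is looked up whole.

```python
import re
import unicodedata

# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries.
VOCAB = {"[BOS]": 0, "[EOS]": 1, "[UNK]": 2, "hello": 3, "world": 4, "!": 5, ",": 6}

def normalize(text: str) -> str:
    # Step 1: Unicode normalization, lowercasing, whitespace cleanup.
    text = unicodedata.normalize("NFKC", text).lower()
    return " ".join(text.split())

def pre_tokenize(text: str) -> list[str]:
    # Step 2: split into word runs and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def encode(text: str) -> list[int]:
    pieces = pre_tokenize(normalize(text))
    # Step 3 (subword splitting) is skipped in this sketch: each piece is kept whole.
    # Step 4: vocabulary lookup, with [UNK] for anything not in the vocabulary.
    ids = [VOCAB.get(p, VOCAB["[UNK]"]) for p in pieces]
    # Step 5: wrap the sequence with special tokens.
    return [VOCAB["[BOS]"], *ids, VOCAB["[EOS]"]]

print(encode("  Hello,   WORLD! "))  # [0, 3, 6, 4, 5, 1]
```

Which steps a real tokenizer runs, and in what form, is model-dependent; for example, many tokenizers do not lowercase at all.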
Major Algorithms
Byte Pair Encoding (BPE)
- Start with individual characters as vocabulary
- Repeatedly merge the most frequent adjacent pair
- Stop when vocabulary reaches target size
- Used by: GPT family, LLaMA, Mistral
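The merge loop described above can be sketched in a few lines. This is a minimal character-level illustration, not the implementation used by any of the listed models; the function name `train_bpe`, the toy corpus, and the tie-breaking rule (first pair seen wins) are assumptions made for the example.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties: first pair seen wins
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

merges = train_bpe(["low", "low", "lower", "lowest", "newest", "newest"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 's')]
```

Here the stopping rule is a fixed number of merges, which is equivalent to a target vocabulary size of the initial character count plus `num_merges`.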
WordPiece
- Similar to BPE, but chooses merges that maximize the likelihood of the training data rather than raw pair frequency
- Words not in the vocabulary are split into subwords, with a `##` prefix marking continuation pieces
- Used by: BERT, DistilBERT
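The `##` continuation convention is easiest to see at tokenization time. The sketch below uses greedy longest-match-first segmentation, the strategy commonly described for BERT-style tokenizers; it covers splitting only, not how the vocabulary is trained, and the vocabulary and function name are hypothetical.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no segmentation exists: the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary for illustration.
vocab = {"token", "##ization", "##ize", "un", "##aff", "##able"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("unaffable", vocab))     # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
```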
SentencePiece
- Treats input as a raw stream of Unicode characters with no language-specific pre-tokenization (language-agnostic)
- Supports both BPE and unigram language model modes
- Handles spaces explicitly with the `▁` symbol
- Used by: T5, XLNet, Gemini
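Treating the space as an ordinary symbol is what makes decoding a plain concatenation. The following sketch imitates that idea in pure Python without the SentencePiece library; the helper names and the segmentation in `pieces` are hypothetical, and the leading marker assumes a prefix is added to the start of the text.

```python
WS = "\u2581"  # the "▁" marker character used to stand in for a space

def escape_whitespace(text: str) -> str:
    # Prefix a marker and turn every space into a marker,
    # so spaces become ordinary symbols inside tokens.
    return WS + text.replace(" ", WS)

def detokenize(pieces: list[str]) -> str:
    # Concatenate pieces, restore spaces, and drop the space
    # produced by the leading prefix marker.
    return "".join(pieces).replace(WS, " ")[1:]

escaped = escape_whitespace("Hello world.")
pieces = [WS + "Hello", WS + "wor", "ld", "."]  # hypothetical segmentation
print("".join(pieces) == escaped)  # True
print(detokenize(pieces))          # Hello world.
```

Because word boundaries live inside the tokens themselves, no language-specific rule is needed to decide where spaces go when decoding.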
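Unigram Language Model
- Starts with a large vocabulary and iteratively prunes the tokens whose removal increases the training loss the least
- Probabilistic: each token has a probability, and text is segmented into the most likely sequence of tokens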
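The probabilistic side of the unigram approach can be illustrated with a small Viterbi search that picks the segmentation with the highest total probability. This sketch covers segmentation only, not the pruning procedure used to build the vocabulary; the probabilities and the function name are hypothetical.

```python
import math

def viterbi_segment(text: str, probs: dict[str, float]) -> list[str]:
    # best[i] = (log-probability, start index of last piece) for the best split of text[:i]
    best = [(0.0, 0)] + [(-math.inf, 0)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[-1][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], len(text)
    while end > 0:
        start = best[end][1]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Hypothetical token probabilities; a trained model would estimate these from data.
probs = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02, "u": 0.01, "n": 0.01}
print(viterbi_segment("unrelated", probs))  # ['un', 'related']
```

Several segmentations are valid here (for example `un` + `rel` + `ated`), and the search returns the one whose token probabilities multiply to the largest value.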
Why It Matters
- Model performance: poor tokenization leads to poor handling of rare words, numbers, and code
- Multilingual support: byte-level tokenizers can represent text in any language; word-level tokenizers struggle with non-Latin scripts
- Arithmetic tasks: models struggle with math partly because numbers tokenize inconsistently ("123" may be 1 or 3 tokens)
- Prompt engineering: knowing how text tokenizes helps design efficient, precise prompts
Common Pitfalls
- Leading spaces: `" word"` ≠ `"word"` as tokens
- Numbers: "1000000" may tokenize as ["100", "0000"], not ["1000000"]
- Code symbols: `->`, `=>`, `//` each tokenize differently per tokenizer
- Emoji/Unicode: may tokenize into many byte-level tokens
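Two of these points can be checked without any tokenizer library by looking at the UTF-8 bytes a byte-level tokenizer starts from. The sketch below only counts bytes; the actual token count depends on which merges a given tokenizer has learned, so the byte count is an upper bound rather than a prediction. The helper name is hypothetical.

```python
def utf8_byte_count(text: str) -> int:
    # Number of UTF-8 bytes, i.e. the number of base symbols
    # a byte-level tokenizer starts from before any merges.
    return len(text.encode("utf-8"))

# "word", " word", e-acute, a single emoji, an emoji with a skin-tone modifier
samples = ["word", " word", "\u00e9", "\U0001F642", "\U0001F44D\U0001F3FD"]
for s in samples:
    print(repr(s), utf8_byte_count(s))
```

The leading space makes `" word"` a different byte sequence from `"word"` (5 bytes versus 4), and the last sample is one visible glyph but 8 bytes.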
Tools
- `tiktoken` (OpenAI): fast Python tokenizer for GPT models
- `transformers.AutoTokenizer` (HuggingFace): universal tokenizer loader
- Tokenizer Playground (OpenAI): visual token inspection
Related Concepts
- Token, Embeddings, Vocabulary, Context Window, BPE, WordPiece, SentencePiece, Unigram Language Model
Encoding vs. Decoding
| Direction | Term | Description |
|-----------|------|-------------|
| Text → IDs | Encoding / Tokenizing | Used at input time |
| IDs → Text | Decoding / Detokenizing | Used at output time |
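The two directions in the table are inverse lookups over the same vocabulary. The sketch below shows a round trip with a toy vocabulary and greedy longest-prefix matching; the vocabulary, the IDs, and the `[UNK]` fallback are hypothetical and do not correspond to any real model.

```python
# Hypothetical four-entry vocabulary, for illustration only.
token_to_id = {"Hello": 0, "!": 1, " World": 2, "[UNK]": 3}
id_to_token = {i: t for t, i in token_to_id.items()}

def encode(text: str) -> list[int]:
    # Text -> IDs: greedy longest-prefix match against the vocabulary.
    ids, pos = [], 0
    while pos < len(text):
        match = max(
            (t for t in token_to_id if text.startswith(t, pos)),
            key=len,
            default=None,
        )
        if match is None:
            ids.append(token_to_id["[UNK]"])  # unknown character
            pos += 1
        else:
            ids.append(token_to_id[match])
            pos += len(match)
    return ids

def decode(ids: list[int]) -> str:
    # IDs -> text: look up each ID and concatenate the token strings.
    return "".join(id_to_token[i] for i in ids)

ids = encode("Hello World!")
print(ids)          # [0, 2, 1]
print(decode(ids))  # Hello World!
```

The round trip is lossless only when every part of the input is covered by the vocabulary; anything mapped to `[UNK]` cannot be recovered on decoding.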