Embeddings

Definition

Embeddings are dense numerical vectors that represent tokens (or sentences, documents, images) in a continuous high-dimensional space. They encode semantic meaning so that similar concepts are geometrically close to each other.

Why Embeddings?

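  • Computers can't process raw text directly; models need numbers
  • One-hot encoding (for example, a 10,000-dim sparse vector for a 10,000-word vocabulary) is inefficient and captures no meaning: every pair of distinct words is equally far apart
  • Embeddings are compact (typically 256–8192 dims) and encode rich semantic relationships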
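
The contrast between one-hot and dense vectors can be sketched in a few lines. The three-word vocabulary and the 2-D dense values below are hand-made assumptions for illustration, not output from any real model.

```python
# Hypothetical three-word vocabulary, for illustration only.
vocab = ["king", "queen", "car"]

def one_hot(word):
    """Sparse vector with a single 1.0 at the word's vocabulary index."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Every pair of distinct one-hot vectors has dot product 0,
# so "king" looks exactly as unrelated to "queen" as it does to "car".
one_hot_king_queen = dot(one_hot("king"), one_hot("queen"))
one_hot_king_car = dot(one_hot("king"), one_hot("car"))

# Hand-made dense vectors (assumed values): related words point the same way.
dense = {"king": [0.9, 0.8], "queen": [0.9, 0.7], "car": [-0.8, 0.1]}
dense_king_queen = dot(dense["king"], dense["queen"])  # positive
dense_king_car = dot(dense["king"], dense["car"])      # negative
```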

How Token Embeddings Work

  1. Each token ID maps to a row in an embedding matrix (shape: vocab_size × embed_dim)
  2. The model does a simple lookup, during both training and inference: token_id → embedding vector
  3. The embedding matrix is learned during pre-training
  4. The same matrix is often reused (transposed) at the output layer to predict the next token (weight tying)
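
The lookup and weight-tying steps above can be sketched in plain Python. This is a minimal illustration, assuming toy sizes and random values in place of learned weights; the names `embed` and `output_logits` are hypothetical helpers, not part of any library.

```python
import random

random.seed(0)  # reproducible toy values

vocab_size, embed_dim = 6, 4  # assumed toy sizes

# Embedding matrix: one row per token ID (vocab_size x embed_dim).
# Real models learn these values during pre-training; here they are random.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up one embedding row per token ID."""
    return [embedding_matrix[t] for t in token_ids]

def output_logits(hidden):
    """Weight tying: score a hidden vector against every row of the same
    matrix, which equals multiplying by the transposed embedding matrix."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding_matrix]

token_ids = [3, 1, 4]                # hypothetical token IDs
vectors = embed(token_ids)           # three vectors of length embed_dim
logits = output_logits(vectors[-1])  # one score per vocabulary entry
```

In a real transformer the hidden vector passed to the output layer comes from the final layer, not directly from the input embedding; the shortcut here only shows that the same matrix serves both roles.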

Dimensions (Typical Values by Model Size)

| Model Size | Embedding Dimension |
|------------|---------------------|
| Small (125M) | 768 |
| Medium (1.3B) | 2048 |
| Large (7B) | 4096 |
| XL (70B+) | 8192 |
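
Because the embedding matrix has shape vocab_size × embed_dim, its parameter count is the product of the two. The sketch below works this out for the dimensions in the table, assuming a 50,257-entry vocabulary (the size GPT-2 uses); other models use different vocabulary sizes, so treat the totals as illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    """Number of weights in a vocab_size x embed_dim embedding matrix."""
    return vocab_size * embed_dim

vocab_size = 50_257  # assumption: GPT-2's vocabulary size

table = [("Small", 768), ("Medium", 2048), ("Large", 4096), ("XL", 8192)]
counts = {label: embedding_params(vocab_size, dim) for label, dim in table}

for label, count in counts.items():
    print(f"{label}: {count:,} embedding parameters")
# The Small row works out to 50,257 * 768 = 38,597,376
```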

Properties of Good Embeddings

  • Semantic similarity: "king" and "queen" are close; "king" and "car" are far
  • Arithmetic: king - man + woman ≈ queen (the famous Word2Vec example)
  • Contextual vs. static:
    - Static (Word2Vec, GloVe): one fixed vector per word regardless of context
    - Contextual (BERT, GPT): each occurrence gets a different vector based on surrounding tokens
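
The similarity and arithmetic properties can be illustrated with hand-made vectors. The four dimensions below (person, royalty, male, vehicle) are an assumption chosen to make the arithmetic visible; real Word2Vec vectors are learned, have hundreds of dimensions, and only approximately satisfy the analogy.

```python
import math

# Hand-made toy vectors: (person, royalty, male, vehicle). Illustrative only.
vectors = {
    "king":  [1.0, 1.0, 1.0, 0.0],
    "queen": [1.0, 1.0, 0.0, 0.0],
    "man":   [1.0, 0.0, 1.0, 0.0],
    "woman": [1.0, 0.0, 0.0, 0.0],
    "car":   [0.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed component by component
target = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

# Nearest remaining word by cosine similarity, excluding the three inputs
candidates = {w: v for w, v in vectors.items() if w not in {"king", "man", "woman"}}
best = max(candidates, key=lambda w: cosine(target, candidates[w]))
print(best)  # queen
```

In this toy space, "king" is also closer to "queen" than to "car" under cosine similarity, matching the first property in the list.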

Types of Embeddings

| Type | Description | Use Case |
|------|-------------|----------|
| Token embeddings | Per-token lookup vectors | Input to every transformer |
| Positional embeddings | Encode position in sequence | Combined with token embeddings |
| Sentence embeddings | Single vector for entire sentence | Semantic search, RAG |
| Image embeddings | Encode visual content | Multimodal models |
| Document embeddings | Encode entire documents | Long-doc retrieval |

Positional Embeddings

Transformers have no inherent notion of order, because all tokens are processed in parallel. Positional embeddings add position information:

  • Absolute (sinusoidal) — original Transformer, fixed sine/cosine patterns
  • Learned absolute — trained position vectors (BERT, GPT-2)
  • Relative (RoPE) — Rotary Position Embedding, encodes relative distance; used by LLaMA, Mistral, GPT-NeoX
  • ALiBi — adds a linear bias to attention based on distance; good for length generalization
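
As a sketch of the first option in the list, the fixed sine/cosine scheme from the original Transformer can be written as follows. The function name `sinusoidal_position` and the toy token vector are assumptions for illustration; the 10000 base comes from the original formulation.

```python
import math

def sinusoidal_position(pos, d_model):
    """Fixed sine/cosine encoding for one position:
    PE[pos, 2k]   = sin(pos / 10000^(2k / d_model))
    PE[pos, 2k+1] = cos(pos / 10000^(2k / d_model))
    """
    encoding = []
    for i in range(0, d_model, 2):  # i plays the role of 2k
        angle = pos / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

# Position information is added element-wise to the token embedding.
token_vector = [0.5, -0.2, 0.1, 0.7]   # hypothetical token embedding
position_vector = sinusoidal_position(pos=0, d_model=4)
combined = [t + p for t, p in zip(token_vector, position_vector)]
```

Learned absolute embeddings replace the function with a trained lookup table, while RoPE and ALiBi act inside attention instead of adding a vector to the input.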

Embedding Space (Latent Space)

  • The full embedding space is also called latent space
  • High-dimensional geometry encodes language structure
  • Nearest-neighbor search in this space = semantic search
  • Dimensionality reduction (t-SNE, UMAP) is used to project embeddings to 2D or 3D and visualize clusters

Practical Uses of Embeddings

  • Semantic search: embed query + documents → cosine similarity → retrieve relevant docs
  • RAG (Retrieval-Augmented Generation): store doc embeddings in vector DB, retrieve at query time
  • Clustering: group similar documents
  • Classification: feed embedding into a classifier head
  • Anomaly detection: find outliers in embedding space
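
The semantic search flow (embed query and documents, score by cosine similarity, return the best matches) might look like the sketch below. The document texts, the 3-D vectors, and the `search` helper are all hypothetical; in practice the vectors would come from an embedding model and usually live in a vector database.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-computed document embeddings (toy 3-D values).
doc_embeddings = {
    "How to reset your password":    [0.9, 0.1, 0.0],
    "Quarterly revenue report":      [0.0, 0.2, 0.9],
    "Account login troubleshooting": [0.8, 0.3, 0.1],
}

def search(query_embedding, top_k=2):
    """Rank documents by cosine similarity to the query, highest first."""
    scored = [
        (cosine_similarity(query_embedding, vec), text)
        for text, vec in doc_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

# Hypothetical embedding of a query such as "I can't sign in"
query_embedding = [1.0, 0.2, 0.0]
results = search(query_embedding)
```

A RAG pipeline would pass the retrieved texts to a language model as context; this brute-force scan stands in for the approximate nearest-neighbor index a vector database would normally provide.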

Similarity Metrics

| Metric | Formula | Notes |
|--------|---------|-------|
| Cosine similarity | cos(θ) = A·B / (‖A‖ ‖B‖) | Most common, direction-based |
| Dot product | A·B | Fast, used in attention |
| Euclidean distance | √(Σ(aᵢ − bᵢ)²) | Magnitude-sensitive |
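
To make the difference between the three metrics concrete, the sketch below evaluates them on two assumed vectors that point in the same direction but differ in length. Cosine similarity treats them as identical, while the dot product and Euclidean distance both reflect the difference in magnitude.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(cosine_similarity(a, b))   # 1.0 (direction only)
print(dot(a, b))                 # 18.0 (grows with magnitude)
print(euclidean_distance(a, b))  # 3.0 (nonzero despite identical direction)
```

When vectors are normalized to unit length, cosine similarity and dot product give the same ranking, which is one reason many retrieval setups normalize embeddings first.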

Popular Embedding Models

  • text-embedding-3-large (OpenAI) — 3072 dims
  • amazon.titan-embed-text-v2 (AWS Bedrock)
  • all-MiniLM-L6-v2 (HuggingFace/SentenceTransformers) — lightweight
  • nomic-embed-text — open source, long context

Related Concepts

  • Token, Tokenization, Latent Space, Attention, RAG, Vector Database
