Definition
Multimodality in LLMs refers to a model's ability to process, understand, and generate content across multiple data modalities: not just text, but also images, audio, video, and structured data. A multimodal LLM (MLLM) can accept mixed inputs and reason across modalities.
Modalities in AI Models
| Modality | Description | Input Example | Output Example |
|----------|-------------|--------------|----------------|
| Text | Natural language | Prompts, documents | Responses, summaries |
| Image | Static visuals | Photos, diagrams, screenshots | Descriptions, captions |
| Audio | Sound/speech | Voice recordings | Transcriptions, speech |
| Video | Moving images | Recorded clips | Descriptions, timestamps |
| Code | Programming languages | Source files | Generated code |
| Structured data | Tables, JSON, CSV | Spreadsheets | Analysis, SQL queries |
| Documents | PDFs with layout | Business reports | Extraction, Q&A |
Input vs. Output Modalities
Not all models are symmetric: many accept multimodal inputs but produce text output only.
| Capability | Examples |
|------------|---------|
| Text + Image → Text | GPT-4o, Claude 3.5, Gemini |
| Audio (+ optional Text) → Text | Whisper, GPT-4o Audio |
| Text → Image | DALL-E 3, Midjourney, Stable Diffusion |
| Text → Audio | ElevenLabs, OpenAI TTS |
| Text + Image + Video → Text | Gemini 1.5 Pro |
| Any → Any (native) | GPT-4o (approaching this) |
How Image Understanding Works in Transformers
Vision Encoder
Images are processed by a vision encoder (typically a Vision Transformer, ViT) before being fed to the language model:
1. Image → split into fixed-size patches (e.g., 14×14 pixels each)
2. Each patch → flattened → linear projection → patch embedding vector
3. Sequence of patch embeddings → vision encoder's transformer layers → visual features
4. Visual features → mapped into the LLM's embedding space, where the LLM attends over text tokens and image patch tokens together
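Steps 1 and 2 above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the image is a single-channel grid of floats, the projection weights are random rather than learned, and position embeddings and biases are left out. The function names (`split_into_patches`, `embed_patches`, `count_image_tokens`) are made up for this example and do not come from any library.

```python
import random

def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row into a single vector.
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches

def embed_patches(patches, weights):
    """Linear projection: weights has one row per pixel and one column per embedding dim."""
    columns = list(zip(*weights))
    return [
        [sum(pixel * w for pixel, w in zip(patch, column)) for column in columns]
        for patch in patches
    ]

def count_image_tokens(height, width, patch_size):
    """Number of patch tokens an image of this size contributes."""
    return (height // patch_size) * (width // patch_size)

# Toy demo: a 28x28 image with 14x14 patches and an 8-dimensional embedding.
rng = random.Random(0)
PATCH_SIZE = 14
EMBED_DIM = 8
image = [[rng.random() for _ in range(28)] for _ in range(28)]
weights = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
           for _ in range(PATCH_SIZE * PATCH_SIZE)]

patches = split_into_patches(image, PATCH_SIZE)
embeddings = embed_patches(patches, weights)
print(len(patches), len(patches[0]))        # 4 196
print(len(embeddings), len(embeddings[0]))  # 4 8
print(count_image_tokens(336, 336, 14))     # 576
```

The last line shows why image inputs are expensive in context terms: a 336×336 image cut into 14×14 patches already yields 576 patch tokens before any text is added.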
Connector / Projection Layer
A projection layer maps vision encoder outputs to the LLM's embedding dimension:
```
Image → [Vision Encoder] → visual features → [Projection MLP] → LLM embedding space → [LLM]
```
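As a rough illustration of the projection step, the sketch below maps a few visual feature vectors into a larger embedding dimension and concatenates them with text token embeddings. It assumes a two-layer MLP with a GELU activation, which is one common connector design but not the only one. The dimensions are toy values, the weights are random instead of trained, and all names (`make_layer`, `project_visual_features`) are hypothetical.

```python
import math
import random

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vector, weights, bias):
    """Dense layer; weights holds one row of input weights per output unit."""
    return [sum(v * w for v, w in zip(vector, row)) + b
            for row, b in zip(weights, bias)]

def make_layer(in_dim, out_dim, rng):
    """Randomly initialised layer (a stand-in for trained parameters)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def project_visual_features(features, layer1, layer2):
    """Map each visual feature vector into the LLM embedding dimension."""
    tokens = []
    for feature in features:
        hidden = [gelu(h) for h in linear(feature, *layer1)]
        tokens.append(linear(hidden, *layer2))
    return tokens

# Toy sizes; real encoders and LLMs use far larger dimensions.
rng = random.Random(0)
VISION_DIM, LLM_DIM = 6, 10
layer1 = make_layer(VISION_DIM, LLM_DIM, rng)
layer2 = make_layer(LLM_DIM, LLM_DIM, rng)

visual_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]

visual_tokens = project_visual_features(visual_features, layer1, layer2)
# After projection, image tokens and text tokens share one sequence.
sequence = visual_tokens + text_embeddings
print(len(sequence), len(sequence[0]))  # 7 10
```

The point of the connector is visible in the final two lines: once the visual features have the same width as the text embeddings, the LLM can treat both as one token sequence.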
End-to-End Training
Modern multimodal models are trained end-to-end or fine-tuned on (image, text) paired datasets:
- Image captioning datasets (COCO, CC12M)
- Visual QA datasets (VQA, GQA, ScienceQA)
- Document understanding datasets (DocVQA)
- Interleaved image-text web data (MMC4)
Multimodal Tasks
| Task | Input | Output |
|------|-------|--------|
| Image captioning | Image | Text description |
| Visual QA | Image + Question | Text answer |
| OCR / Document Q&A | Document image | Extracted text/answers |
| Chart/diagram analysis | Chart image | Data interpretation |
| Code from screenshot | UI screenshot | HTML/CSS/code |
| Medical image analysis | X-ray/MRI | Clinical description |
| Video understanding | Video frames | Summary/events |
| Audio transcription | Audio file | Text transcript |
Leading Multimodal Models (2024–2025)
| Model | Modalities | Notes |
|-------|-----------|-------|
| GPT-4o | Text, Image, Audio | Native multimodal, real-time |
| Claude 3.5 Sonnet | Text, Image, PDF | Strong document understanding |
| Gemini 1.5 Pro | Text, Image, Audio, Video | 1M token context, video native |
| LLaVA / LLaVA-1.6 | Text, Image | Open-source vision model |
| Qwen2-VL | Text, Image, Video | Strong open-source option |
| Pixtral | Text, Image | Mistral's vision model |
Audio Multimodality
Speech-to-Text (ASR)
- Whisper (OpenAI): industry-standard open-source ASR
- Deepgram, AssemblyAI: API-based transcription
Text-to-Speech (TTS)
- OpenAI TTS, ElevenLabs, Azure Cognitive Services
Native Audio LLMs
- GPT-4o Audio: processes raw audio natively (not speech-to-text first)
- Gemini: native audio understanding
Multimodal Challenges
| Challenge | Description |
|-----------|-------------|
| Hallucination on images | Model describes objects not present in image |
| Cultural/context bias | Images from underrepresented contexts misunderstood |
| Small text in images | OCR quality degrades at small font sizes |
| Complex charts | Mathematical/scientific charts require specialized training |
| Long video | Processing many frames within context limits |
| Audio with noise | Background noise degrades transcription quality |
Related terms: LLM, Embeddings, Vision Transformer, RAG, Inference, Token, Fine-Tuning
Multimodal RAG
Extend RAG to handle images and mixed documents:
- Index document pages as images + extracted text
- Retrieve relevant page images
- Feed retrieved images + query to multimodal LLM
- Particularly powerful for PDFs with charts, tables, diagrams
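A minimal sketch of such a pipeline (index each page as an image plus its extracted text, retrieve the most relevant pages, then pass their images and the query to a multimodal LLM) might look like the following. Everything here is an assumption made for illustration: the `Page` record, the file paths, the bag-of-words `embed_text` stand-in for a learned embedding model, and the request dictionary, whose field names do not correspond to any real API.

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class Page:
    doc_id: str
    page_number: int
    image_path: str      # rendered page image (hypothetical path)
    extracted_text: str  # text from OCR or a PDF parser

def embed_text(text):
    """Toy bag-of-words vector; a real system would use a learned embedding model."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(pages):
    """Index step: keep each page together with the embedding of its text."""
    return [(page, embed_text(page.extracted_text)) for page in pages]

def retrieve(index, query, top_k=2):
    """Retrieval step: rank pages by similarity to the query."""
    query_vector = embed_text(query)
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_vector, item[1]),
                    reverse=True)
    return [page for page, _ in ranked[:top_k]]

def build_request(query, pages):
    """Generation step input: illustrative payload, not a real API schema."""
    return {
        "query": query,
        "images": [page.image_path for page in pages],
        "context": [page.extracted_text for page in pages],
    }

pages = [
    Page("report", 1, "pages/report_p1.png", "Quarterly revenue chart by region"),
    Page("report", 2, "pages/report_p2.png", "Employee onboarding checklist for new hires"),
    Page("report", 3, "pages/report_p3.png", "Revenue table and growth forecast"),
]
index = build_index(pages)
retrieved = retrieve(index, "revenue growth chart", top_k=2)
request = build_request("revenue growth chart", retrieved)
print(request["images"])
```

Retrieval here runs over the extracted text only. A production system could also embed the page images themselves, which matters most for pages where charts or diagrams carry information the extracted text misses.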
Practical Use Cases
| Industry | Use Case | Modalities |
|----------|---------|-----------|
| Healthcare | Medical report analysis | Image + Text |
| Finance | Chart interpretation from reports | Image + Text |
| Retail | Product image search + description | Image → Text |
| Legal | Contract image OCR + analysis | Image + Text |
| Education | Diagram explanation | Image + Text |
| Accessibility | Image-to-audio description | Image → Text → Audio |