Multimodality — FDE@ProdAI Blog

Definition

Multimodality in LLMs refers to the ability of a model to process, understand, and generate content across multiple types of data modalities — not just text, but also images, audio, video, and structured data. A multimodal LLM (MLLM) can accept mixed inputs and reason across modalities.

Modalities in AI Models

|----------|-------------|--------------|----------------|

Input vs. Output Modalities

Not all models are symmetric — many accept multi-modal inputs but produce text output only:

| Capability | Examples |

|------------|---------|

| Text + Image → Text | GPT-4o, Claude 3.5, Gemini |

| Text + Audio → Text | Whisper, GPT-4o Audio |

| Text → Image | DALL-E 3, Midjourney, Stable Diffusion |

| Text → Audio | ElevenLabs, OpenAI TTS |

| Text + Image + Video → Text | Gemini 1.5 Pro |

| Any → Any (native) | GPT-4o (approaching this) |

How Image Understanding Works in Transformers

Vision Encoder

Images are processed by a visual encoder (typically a Vision Transformer / ViT) before being fed to the language model:

1. Image → split into patches (e.g., 14×14 pixel patches)

2. Each patch → linear projection → patch embedding vector

3. Sequence of patch embeddings → fed to the LLM's attention layers

4. LLM reasons over both text tokens and image patch tokens together

Connector / Projection Layer

A projection layer maps vision encoder outputs to the LLM's embedding dimension:

Image → [Vision Encoder] → visual features → [Projection MLP] → LLM embedding space → [LLM]

End-to-End Training

Modern multimodal models are trained end-to-end or fine-tuned on (image, text) paired datasets:

Image captioning datasets (COCO, CC12M)
Visual QA datasets (VQA, GQA, ScienceQA)
Document understanding datasets (DocVQA)
Interleaved image-text web data (MMC4)

Multimodal Tasks

| Task | Input | Output |

|------|-------|--------|

| Image captioning | Image | Text description |

| Visual QA | Image + Question | Text answer |

| OCR / Document Q&A | Document image | Extracted text/answers |

| Chart/diagram analysis | Chart image | Data interpretation |

| Code from screenshot | UI screenshot | HTML/CSS/code |

| Medical image analysis | X-ray/MRI | Clinical description |

| Video understanding | Video frames | Summary/events |

| Audio transcription | Audio file | Text transcript |

Leading Multimodal Models (2024–2025)

| Model | Modalities | Notes |

|-------|-----------|-------|

| GPT-4o | Text, Image, Audio | Native multimodal, real-time |

| Claude 3.5 Sonnet | Text, Image, PDF | Strong document understanding |

| Gemini 1.5 Pro | Text, Image, Audio, Video | 1M token context, video native |

| LLaVA / LLaVA-1.6 | Text, Image | Open-source vision model |

| Qwen2-VL | Text, Image, Video | Strong open-source option |

| Pixtral | Text, Image | Mistral's vision model |

Audio Multimodality

Speech-to-Text (ASR)

Whisper (OpenAI): industry-standard open-source ASR
Deepgram, AssemblyAI: API-based transcription

Text-to-Speech (TTS)

OpenAI TTS, ElevenLabs, Azure Cognitive Services

Native Audio LLMs

GPT-4o Audio: processes raw audio natively (not speech-to-text first)
Gemini: native audio understanding

Multimodal Challenges

| Challenge | Description |

|-----------|-------------|

| Hallucination on images | Model describes objects not present in image |

| Cultural/context bias | Images from underrepresented contexts misunderstood |

| Small text in images | OCR quality degrades at small font sizes |

| Complex charts | Mathematical/scientific charts require specialized training |

| Long video | Processing many frames within context limits |

| Audio with noise | Background noise degrades transcription quality |

Multimodal RAG

Extend RAG to handle images and mixed documents:

Index document pages as images + extracted text
Retrieve relevant page images
Feed retrieved images + query to multimodal LLM
Particularly powerful for PDFs with charts, tables, diagrams

Practical Use Cases

| Industry | Use Case | Modalities |

|----------|---------|-----------|

| Healthcare | Medical report analysis | Image + Text |

| Finance | Chart interpretation from reports | Image + Text |

| Retail | Product image search + description | Image → Text |

| Legal | Contract image OCR + analysis | Image + Text |

| Education | Diagram explanation | Image + Text |

| Accessibility | Image-to-audio description | Image → Text → Audio |

Related Concepts

LLM, Embeddings, Vision Transformer, RAG, Inference, Token, Fine-Tuning