Fine-Tuning — FDE@ProdAI Blog

Definition

Fine-tuning is the process of continuing to train a pre-trained model on a smaller, task-specific or domain-specific dataset to adapt its behavior. It modifies the model's parameters (all or a subset) to improve performance on a target domain, task, or behavioral style.

Why Fine-Tune?

Base/instruct models are general-purpose — they may underperform on specialized tasks
Fine-tuning gives the model domain knowledge and task-specific behavior
More efficient than training from scratch (leverages existing pre-trained knowledge)
Can shape tone, format, persona, refusal behavior

Types of Fine-Tuning

Full Fine-Tuning

All model parameters are updated during training
Most expressive — best performance potential
Requires significant GPU memory (same as pre-training the model size)
Risk of catastrophic forgetting (model forgets general capabilities)

Parameter-Efficient Fine-Tuning (PEFT)

Fine-tune only a small subset of parameters to save compute/memory:

| Method | Description | Trainable Params |

|--------|-------------|-----------------|

| LoRA | Adds low-rank decomposition matrices to attention layers | ~0.1–1% of total |

| QLoRA | LoRA on a quantized (4-bit) base model | ~0.1–1% |

| Prefix Tuning | Prepends trainable tokens to input | Tiny |

| Prompt Tuning | Learns soft prompt embeddings only | Tiny |

| Adapters | Inserts small trainable modules between layers | ~1–5% |

Instruction Fine-Tuning (IFT / SFT)

Fine-tune on (instruction, response) pairs
Teaches the model the instruct format and helpful behavior
Also called Supervised Fine-Tuning (SFT)

Domain-Specific Fine-Tuning

Fine-tune on domain text (medical papers, legal documents, code)
Model learns domain vocabulary, conventions, and reasoning
Examples: BioMedLM, LegalBERT, CodeLLaMA

The Fine-Tuning Process

1. Choose a base/instruct model to start from

2. Prepare dataset: (prompt, response) pairs, typically 1K–100K examples

3. Format using chat template: apply the model's expected instruct format

4. Configure training: learning rate, batch size, epochs, max sequence length

5. Train with low learning rate: typically 1e-5 to 1e-4 (much lower than pre-training)

6. Evaluate: compare against base model on target task metrics

7. Merge or deploy: with LoRA, merge adapter weights back into base model

Dataset Requirements

| Quantity | Quality | Format |

|---------|---------|--------|

| 1K–10K examples sufficient for format/style | High quality >> high quantity | Must match model's chat template |

| More data needed for knowledge injection | Diverse examples generalize better | Consistent instruction style |

LoRA: The Dominant PEFT Method

LoRA (Low-Rank Adaptation) works by decomposing weight updates:

Original weight matrix: W (d × d) — frozen

LoRA update: ΔW = A × B where A is (d × r), B is (r × d), r << d

New weight at inference: W + ΔW = W + AB

r (rank) is typically 4–64
Only A and B are trained (tiny vs. full W)
After training, merge: W_new = W + AB — no inference overhead

QLoRA: Fine-Tuning on Consumer Hardware

QLoRA (Quantized LoRA):

1. Quantize the base model to 4-bit (NF4 format)

2. Add LoRA adapters in full precision

3. Train only the LoRA adapters

4. Result: Fine-tune a 70B model on a single 48GB GPU (vs. 8× 80GB GPUs for full fine-tuning)

Common Fine-Tuning Platforms

| Platform | Notes |

|----------|-------|

| HuggingFace TRL | SFTTrainer, DPOTrainer — most popular |

| Axolotl | Config-driven, supports many architectures |

| LLaMA Factory | Flexible UI and CLI fine-tuning |

| Unsloth | 2× faster training, low VRAM |

| AWS SageMaker | Managed cloud fine-tuning |

| Azure ML / Vertex AI | Enterprise cloud fine-tuning |

Evaluation After Fine-Tuning

Task-specific metrics: BLEU, ROUGE, accuracy, F1
Human evaluation: preference over base model
Benchmark regression: ensure general capabilities didn't degrade
MT-Bench, Alpaca Eval: instruction-following quality

Risks and Mitigations

| Risk | Description | Mitigation |

|------|-------------|-----------|

| Catastrophic forgetting | Loses general capabilities | Use PEFT (LoRA), mix in general data |

| Overfitting | Memorizes training set | More data, regularization, early stopping |

| Alignment degradation | Safety behaviors weaken | Include safety examples in fine-tune data |

| Data quality issues | Noisy data hurts performance | Curate and filter carefully |

Related Concepts

Base Model, Instruct Model, LoRA, RLHF, Pre-training, Parameters, SFT, QLoRA