Definition
PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that fine-tune only a tiny fraction of a model's parameters instead of updating all weights. LoRA (Low-Rank Adaptation) is the dominant PEFT method: it adds small trainable low-rank matrices to the model's attention layers and often comes close to full fine-tuning quality while training only about 0.1–1% of the parameters.
Why PEFT?
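Fully fine-tuning a 70B model requires ~140GB of GPU memory for the fp16 weights alone. Adding gradients and Adam optimizer states (roughly 3× more) brings the total to ~560GB, which is at least 7× A100 80GB GPUs before counting activations. With fp32 optimizer states and master weights (about 16 bytes per parameter, the accounting behind the ~112GB figure in the table below), the total is closer to 1.1TB.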
PEFT methods make fine-tuning accessible:
| Method | Trainable Params | VRAM for 7B Model |
|--------|-----------------|-------------------|
| Full fine-tune | 100% (7B) | ~112GB (fp16 + Adam) |
| LoRA (r=16) | ~0.5% (~35M) | ~18GB |
| QLoRA (r=16, 4-bit base) | ~0.5% (~35M) | ~6GB (fits on single consumer GPU) |
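The memory figures above are simple bytes-per-parameter arithmetic. A minimal sketch, assuming decimal gigabytes and ignoring activations; the per-parameter byte counts are illustrative assumptions, not measurements, and the function name is hypothetical:

```python
GB = 1e9  # decimal gigabytes, matching the round figures in the text

def training_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for weights plus training state, ignoring activations."""
    return n_params * bytes_per_param / GB

# Assumed per-parameter costs (illustrative):
FP16_WEIGHTS = 2           # fp16/bf16 weights only
MIXED_PRECISION_ADAM = 16  # 2 weights + 2 grads + 4 fp32 master copy + 8 Adam moments
FOUR_BIT_WEIGHTS = 0.5     # 4-bit quantized weights, ignoring quantization constants

weights_70b = training_memory_gb(70e9, FP16_WEIGHTS)
full_ft_7b = training_memory_gb(7e9, MIXED_PRECISION_ADAM)
frozen_4bit_7b = training_memory_gb(7e9, FOUR_BIT_WEIGHTS)

print(f"70B fp16 weights only:      {weights_70b:.0f} GB")
print(f"7B full fine-tune (Adam):   {full_ft_7b:.0f} GB")
print(f"7B frozen base in 4-bit:    {frozen_4bit_7b:.1f} GB")
```

The LoRA and QLoRA rows are much smaller because the frozen base model needs no gradients or optimizer states; only the adapter parameters carry that overhead.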
LoRA: Low-Rank Adaptation
Core Idea
Weight updates during fine-tuning tend to have low intrinsic dimensionality — they don't need the full rank of the weight matrix. LoRA exploits this by decomposing the weight update into two small matrices:
```
Original weight (frozen): W ∈ ℝ^(d×d)
LoRA update: ΔW = B × A
where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), r << d
```
At inference:
```
output = (W + ΔW) × input = (W + BA) × input
```
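The decomposition can be checked numerically. The following is a toy sketch in plain Python, taking B as d×r and A as r×d so that the product BA is d×d; the sizes, the random values, and the helper names (`matvec`, `matmul`, `lora_forward`) are all illustrative assumptions. It shows that applying W + BA gives the same result as computing W·x and B·(A·x) separately, and that a zero B leaves the base model's output unchanged.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(W, A, B, x):
    """Compute W x + B (A x) without forming the d x d product BA."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

random.seed(0)
d, r = 6, 2  # toy sizes; real models use d in the thousands

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen, d x d
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable, r x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # trainable, d x r
x = [random.gauss(0, 1) for _ in range(d)]

# Explicit route: build W + BA, then apply it.
delta_W = matmul(B, A)
W_adapted = [[w + dw for w, dw in zip(w_row, dw_row)]
             for w_row, dw_row in zip(W, delta_W)]
y_explicit = matvec(W_adapted, x)

# Factored route: never materialize the d x d update.
y_factored = lora_forward(W, A, B, x)
max_gap = max(abs(a - b) for a, b in zip(y_explicit, y_factored))

# With B set to zero the adapter contributes nothing.
B_zero = [[0.0] * r for _ in range(d)]
unchanged = lora_forward(W, A, B_zero, x) == matvec(W, x)

print(f"largest difference between the two routes: {max_gap:.2e}")
print(f"adapter is a no-op when B is zero: {unchanged}")
print(f"trainable values: {2 * d * r} (A and B) vs {d * d} for a full update")
```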
The Rank Hyperparameter (r)
- r is the bottleneck dimension — controls how expressive the adapter is
- Typical values: r = 4, 8, 16, 32, 64
- Higher r → more expressive, more trainable params, slower training
- For most tasks: r=8 or r=16 is sufficient
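To make the effect of r concrete, the sketch below counts trainable parameters for a hypothetical 7B-class shape (32 layers, hidden size 4096) with LoRA on the four attention projections only. The shape constants and the `lora_params` helper are assumptions for illustration, not taken from a specific model card.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r

# Hypothetical 7B-class shape, attention projections (Q, K, V, O) only.
LAYERS, HIDDEN, MATRICES_PER_LAYER = 32, 4096, 4

for r in (4, 8, 16, 32, 64):
    total = LAYERS * MATRICES_PER_LAYER * lora_params(HIDDEN, HIDDEN, r)
    print(f"r={r:>2}: {total / 1e6:.1f}M trainable parameters")
```

The count grows linearly with r. Adapting additional matrices, such as the FFN projections, raises the total further, which is one reason quoted figures for the same rank can differ.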
Initialization
- A: initialized with random Gaussian noise
- B: initialized to zero → ΔW = BA = 0 at the start of training
- This ensures the LoRA adapter has zero effect initially (preserves the pretrained model)
Scaling Factor (α/r)
```
ΔW = (α/r) × BA
```
- α (alpha) is a scaling hyperparameter, typically set to r or 2r
- Normalizes the update magnitude relative to rank
Which Layers to Apply LoRA To?
Typically applied to the attention projection matrices (Q, K, V, O):
```
W_q, W_k, W_v, W_o → each gets its own A and B matrices
```
Later work reports that also applying LoRA to the FFN layers, or to all linear layers, often improves results.
Merging LoRA at Inference
After training, the LoRA matrices can be merged back into the base model:
```
W_new = W + (α/r) × BA
```
Result: original model size and zero inference overhead; the adapter disappears into the base weights.
QLoRA: LoRA on Quantized Models
Quantize base model to 4-bit (NF4 format) → enormous memory reduction:
1. Load base model in 4-bit precision
2. Add LoRA adapters in bf16/fp16 precision
3. Train only the LoRA adapters
4. De-quantize + merge at the end (optional)
QLoRA makes it possible to fine-tune models in the 65–70B range on a single 48GB GPU.
NF4 (4-bit NormalFloat): a 4-bit data type whose quantization levels are spaced for normally distributed weights, so it loses less information than uniform INT4.
Other PEFT Methods
Adapter Layers
- Insert small MLP modules between Transformer layers
- Only the adapter weights are trained
- Older approach; less popular now than LoRA
Prefix Tuning / Prompt Tuning
- Prepend learnable "soft tokens" (continuous embeddings) to the input
- Only these tokens are trained
- Very parameter-efficient but less expressive
IA³
- Rescales attention and FFN activations with learned vectors
- Even fewer parameters than LoRA
- Works well for instruction following
DoRA (Weight-Decomposed LoRA)
- Decomposes weight into magnitude + direction
- Updates direction with LoRA + magnitude separately
- Often slightly better quality than standard LoRA
LoRA in Practice
Tools
| Tool | LoRA Support | Notes |
|------|-------------|-------|
| HuggingFace PEFT | Full | peft.LoraConfig, industry standard |
| HuggingFace TRL | Full | SFTTrainer + LoraConfig |
| Unsloth | Full + fast | Reported ~2× faster LoRA training, lower memory |
| Axolotl | Full | Config-driven, many PEFT options |
| LLaMA Factory | Full | Web UI + CLI for LoRA |
Typical Training Configuration
```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,               # rank
    lora_alpha=32,      # scaling factor α (here 2×r)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,  # dropout applied on the adapter path
    bias="none",        # leave bias terms frozen
    task_type="CAUSAL_LM",
)
# Wrap an already-loaded causal LM (placeholder name `base_model`) so only the adapters train:
# model = get_peft_model(base_model, config)
```
When to Use LoRA vs. Full Fine-Tune
| Use LoRA | Use Full Fine-Tune |
|----------|-------------------|
| Limited GPU budget | Ample compute available |
| Style/format adaptation | Deep knowledge injection |
| Instruction format tuning | Domain-specific specialization |
| Quick experiments | Production-grade fine-tune |
| Multiple task adapters needed | Single-purpose model |
LoRA Adapters as Plugins
One powerful pattern: maintain a single base model plus multiple LoRA adapters for different tasks/personas:
- Base model: LLaMA 3 8B (unchanged)
- Adapter A: legal writing style
- Adapter B: code generation specialist
- Adapter C: customer support persona
Switch adapters dynamically at inference → multi-purpose deployment without multiple full models.
- Related terms: Fine-Tuning, Parameters, Quantization, Pre-training, Instruct Model, QLoRA, RLHF
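The adapter-as-plugin pattern, together with merging, can be sketched with toy matrices in plain Python. Everything here is hypothetical: the registry keys (`"legal"`, `"code"`), the tiny 3×3 shapes, and the helper names are illustrative only, and the scaled update (α/r)·BA follows the scaling-factor formula from earlier in the document.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(W, adapter, x):
    """Unmerged path: W x + (alpha / r) * B (A x). W is shared and never modified."""
    A, B, alpha = adapter["A"], adapter["B"], adapter["alpha"]
    scale = alpha / len(A)  # r is the number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

def merge(W, adapter):
    """Return a new matrix W + (alpha / r) * B A for adapter-free inference."""
    A, B, alpha = adapter["A"], adapter["B"], adapter["alpha"]
    r = len(A)
    scale = alpha / r
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))]
            for i in range(len(W))]

# One frozen base weight shared by all adapters (toy 3 x 3 identity).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical adapter registry: each entry holds A (r x d), B (d x r), and alpha.
adapters = {
    "legal": {"A": [[1.0, 0.0, -1.0]], "B": [[0.5], [0.0], [0.25]], "alpha": 2.0},
    "code":  {"A": [[0.0, 1.0, 0.0]],  "B": [[0.0], [1.0], [0.0]],  "alpha": 1.0},
}

x = [1.0, 2.0, 3.0]

# Switching adapters is a dictionary lookup; the base weight stays the same object.
y_legal = forward(W, adapters["legal"], x)
y_code = forward(W, adapters["code"], x)

# Merging one adapter yields a plain weight matrix with identical outputs.
W_legal = merge(W, adapters["legal"])
y_merged = matvec(W_legal, x)

print("legal adapter:", y_legal)
print("code adapter: ", y_code)
print("merged legal: ", y_merged)
```

Keeping adapters unmerged allows switching per request at the cost of a small extra matrix product; merging removes that cost but fixes the model to one adapter.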