LoRA and PEFT (Parameter-Efficient Fine-Tuning)

Definition

PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that fine-tune only a tiny fraction of a model's parameters instead of updating all weights. LoRA (Low-Rank Adaptation) is the dominant PEFT method: it adds small trainable low-rank matrices to the model's attention layers and, on many tasks, comes close to full fine-tuning quality while training only about 0.1–1% of the parameters.

Why PEFT?

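Full fine-tuning a 70B model requires ~140GB of GPU memory for the fp16 weights alone. Gradients and Adam optimizer states add roughly 3× more, for ~560GB in total before counting activations. That is at least seven 80GB A100 GPUs just to hold the training state. Mixed-precision setups that also keep fp32 master weights and fp32 Adam moments need about twice as much per parameter (~16 bytes), which is the accounting behind the 7B figure in the table below.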
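The arithmetic above can be reproduced with a short sketch. It uses the same rough accounting as the paragraph (2 bytes per weight, plus about 3× the weight memory for Adam-related state) and ignores activations. The function name and its parameters are illustrative, not taken from any library.

```python
import math


def full_finetune_memory_gb(n_params, bytes_per_weight=2, extra_state_multiplier=3):
    """Rough training-memory estimate in GB (1 GB = 1e9 bytes), ignoring activations.

    bytes_per_weight=2 corresponds to fp16/bf16 weights.
    extra_state_multiplier=3 is the rough "3x more for Adam" figure from the text,
    not a measured value.
    """
    weights_gb = n_params * bytes_per_weight / 1e9
    total_gb = weights_gb * (1 + extra_state_multiplier)
    return weights_gb, total_gb


weights_gb, total_gb = full_finetune_memory_gb(70e9)
gpus_needed = math.ceil(total_gb / 80)  # 80 GB per A100
print(f"weights: {weights_gb:.0f} GB, total: {total_gb:.0f} GB, A100-80GB GPUs: {gpus_needed}")
```

Changing `bytes_per_weight` or `extra_state_multiplier` shows how sensitive the total is to the precision and optimizer assumptions.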

PEFT methods make fine-tuning accessible:

| Method | Trainable Params | VRAM for 7B Model |
|--------|-----------------|-------------------|
| Full fine-tune | 100% (7B) | ~112GB (fp16 + Adam) |
| LoRA (r=16) | ~0.5% (~35M) | ~18GB |
| QLoRA (r=16, 4-bit base) | ~0.5% (~35M) | ~6GB (fits on single consumer GPU) |
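To see where a trainable-parameter figure of this order comes from, here is a small counting sketch. Each adapted weight of shape d_out × d_in gains r × (d_out + d_in) trainable parameters. The layer count and dimensions below are assumed Llama-7B-like values chosen for illustration; they are not read from a real checkpoint, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(shapes, r):
    """Trainable parameters added by LoRA for a list of (d_out, d_in) weight shapes.

    Each adapted matrix gets B (d_out x r) and A (r x d_in),
    i.e. r * (d_out + d_in) parameters.
    """
    return sum(r * (d_out + d_in) for d_out, d_in in shapes)


# Assumed Llama-7B-like dimensions (illustrative only).
n_layers, hidden, ffn = 32, 4096, 11008
attention = [(hidden, hidden)] * 4                    # q, k, v, o projections
mlp = [(ffn, hidden), (ffn, hidden), (hidden, ffn)]   # gate, up, down projections

attn_only = n_layers * lora_param_count(attention, r=16)
all_linear = n_layers * lora_param_count(attention + mlp, r=16)

print(f"attention only: {attn_only / 1e6:.1f}M trainable parameters")
print(f"all linear:     {all_linear / 1e6:.1f}M trainable parameters "
      f"({all_linear / 7e9:.2%} of 7B)")
```

Under these assumptions the count lands at roughly 17M for attention-only adapters and roughly 40M when every linear layer is adapted, the same order of magnitude as the table's estimate.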

LoRA: Low-Rank Adaptation

Core Idea

Weight updates during fine-tuning tend to have low intrinsic dimensionality — they don't need the full rank of the weight matrix. LoRA exploits this by decomposing the weight update into two small matrices:

Original weight (frozen): W ∈ ℝ^(d×d)

LoRA update: ΔW = B × A

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), r << d (so that BA has the same d×d shape as W)

At inference:

output = (W + ΔW) × input = (W + BA) × input
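A minimal NumPy sketch of this decomposition follows. The shapes are chosen so that the product BA has the same d×d shape as W, and the sizes are deliberately small. B is filled with random values here only to make the equality check non-trivial; it does not reflect how LoRA initializes B.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                      # illustrative sizes; real layers are much larger

W = rng.normal(size=(d, d))        # frozen pretrained weight
B = rng.normal(size=(d, r))        # trainable factor, d x r
A = rng.normal(size=(r, d))        # trainable factor, r x d
x = rng.normal(size=(d,))          # one input vector

# Two equivalent ways to compute the adapted output.
merged = (W + B @ A) @ x           # materializes the full d x d update
factored = W @ x + B @ (A @ x)     # never forms the d x d update

print(np.allclose(merged, factored))
print(f"full update params: {d * d}, LoRA params: {2 * d * r}")
```

The factored form is what makes training cheap: only the 2·d·r entries of A and B receive gradients, while W stays fixed.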

The Rank Hyperparameter (r)

r is the bottleneck dimension — controls how expressive the adapter is
Typical values: r = 4, 8, 16, 32, 64
Higher r → more expressive, more trainable params, slower training
For most tasks: r=8 or r=16 is sufficient

Initialization

A: initialized with random Gaussian noise
B: initialized to zero → ΔW = BA = 0 at the start of training
This ensures the LoRA adapter has zero effect initially (preserves the pretrained model)

Scaling Factor (α/r)

ΔW = (α/r) × BA

α (alpha) is a scaling hyperparameter, typically set to r or 2r
Normalizes the update magnitude relative to rank
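The initialization and scaling rules can be combined in a toy layer. This is a forward-pass-only sketch under stated assumptions: the class name `LoRALinear`, the Gaussian standard deviation of 0.01, and the layer sizes are all illustrative choices, not values from the source or from a specific library.

```python
import numpy as np


class LoRALinear:
    """Toy LoRA-adapted linear layer (forward pass only, no training loop)."""

    def __init__(self, W, r, alpha, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # Gaussian init (scale is an assumption)
        self.B = np.zeros((d_out, r))                     # zero init, so BA = 0 before training
        self.scaling = alpha / r                          # the alpha/r factor

    def __call__(self, x):
        # Base path plus the scaled low-rank path.
        return self.W @ x + self.scaling * (self.B @ (self.A @ x))


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 64))
x = rng.normal(size=(64,))
layer = LoRALinear(W, r=16, alpha=32)

print(layer.scaling)                      # alpha / r
print(np.array_equal(layer(x), W @ x))    # adapter is a no-op before training
```

Because B starts at zero, the first forward pass reproduces the pretrained model exactly, and the adapter only starts to contribute once training moves B away from zero.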

Which Layers to Apply LoRA To?

Typically applied to the attention projection matrices (Q, K, V, O):

W_q, W_k, W_v, W_o → each gets its own A and B matrices

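Later work, including the QLoRA paper, reports that also adapting the FFN projections (that is, all linear layers) generally improves results over attention-only LoRA, at the cost of more trainable parameters.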

Merging LoRA at Inference

After training, the LoRA matrices can be merged back into the base model:

W_new = W + (α/r) × BA

Result: original model size, zero inference overhead — the adapter disappears.
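The merge step can be checked numerically. The sketch below includes the α/r scaling from the earlier section and uses random matrices as stand-ins for trained adapters. It also shows that subtracting the same term recovers the base weight, which is what allows an adapter to be unmerged later.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16
W = rng.normal(size=(d, d))            # frozen base weight
B = rng.normal(size=(d, r))            # stand-in for a trained B
A = rng.normal(size=(r, d))            # stand-in for a trained A
x = rng.normal(size=(d,))

scaling = alpha / r
delta = scaling * (B @ A)              # full-size update, same shape as W
W_merged = W + delta                   # fold the adapter into the base weight

adapter_out = W @ x + scaling * (B @ (A @ x))   # base and adapter as separate paths
merged_out = W_merged @ x                        # single matmul after merging
W_restored = W_merged - delta                    # unmerge: recover the base weight

print(W_merged.shape == W.shape)
print(np.allclose(adapter_out, merged_out))
print(np.allclose(W_restored, W))
```

In HuggingFace PEFT, the corresponding operation on a wrapped model is `merge_and_unload()`; the NumPy version here only illustrates the underlying arithmetic.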

QLoRA: LoRA on Quantized Models

Quantize base model to 4-bit (NF4 format) → enormous memory reduction:

1. Load base model in 4-bit precision

2. Add LoRA adapters in bf16/fp16 precision

3. Train only the LoRA adapters

4. De-quantize + merge at the end (optional)

QLoRA makes it feasible to fine-tune models in the 65–70B range on a single 48GB GPU (the original paper reports a 65B model).

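NF4 (Normal Float 4): a 4-bit data type whose 16 quantization levels are placed at quantiles of a normal distribution. Because pretrained weights are roughly normally distributed, NF4 loses less information than uniformly spaced INT4.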
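The idea of placing quantization levels at normal quantiles can be illustrated with the standard library alone. This is a simplified sketch, not the actual NF4 codebook used by bitsandbytes (which is constructed differently, for example to represent zero exactly). The helper names, the block size of 64, and the weight standard deviation are assumptions made for the example.

```python
from statistics import NormalDist
import random


def quantile_levels(bits=4):
    """2**bits levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    z_max = max(abs(v) for v in z)
    return [v / z_max for v in z]


def quantize_block(weights, levels):
    """Absmax-normalize one block and map each value to the index of the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero for an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Look up each code and undo the absmax normalization."""
    return [levels[c] * absmax for c in codes]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(64)]   # one block of 64 weights (assumed size)
levels = quantile_levels()
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)

max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"codes fit in 4 bits: {all(0 <= c < 16 for c in codes)}, max abs error: {max_err:.5f}")
```

Each weight is stored as a 4-bit index plus one shared scale per block, and the levels are denser near zero, where most normally distributed weights fall.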

Other PEFT Methods

Adapter Layers

Insert small MLP modules between Transformer layers
Only the adapter weights are trained
Older approach; less popular now than LoRA

Prefix Tuning / Prompt Tuning

Prepend learnable "soft tokens" (continuous embeddings) to the input
Only these tokens are trained
Very parameter-efficient but less expressive

IA³

Rescales attention and FFN activations with learned vectors
Even fewer parameters than LoRA
Works well for instruction following

DoRA (Weight-Decomposed LoRA)

Decomposes weight into magnitude + direction
Updates direction with LoRA + magnitude separately
Often slightly better quality than standard LoRA
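A rough NumPy sketch of the magnitude/direction split follows. It assumes the magnitude is a per-column norm of the weight matrix (implementations may normalize along a different axis), and the sizes, initialization scale, and the function name `dora_weight` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
W0 = rng.normal(size=(d, d))                    # frozen pretrained weight
B = np.zeros((d, r))                            # LoRA factors for the direction update
A = rng.normal(scale=0.01, size=(r, d))
m = np.linalg.norm(W0, axis=0, keepdims=True)   # trainable magnitude, initialized to column norms


def dora_weight(W0, B, A, m):
    """Effective weight: magnitude m times the unit-norm direction of (W0 + BA)."""
    V = W0 + B @ A                                          # direction before normalization
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)


# At initialization (B = 0) the effective weight equals the pretrained weight.
print(np.allclose(dora_weight(W0, B, A, m), W0))

# With a non-zero B the direction changes, but column norms stay equal to m.
B_trained = rng.normal(scale=0.01, size=(d, r))             # stand-in for a trained B
W_new = dora_weight(W0, B_trained, A, m)
print(np.allclose(np.linalg.norm(W_new, axis=0, keepdims=True), m))
```

The split means the LoRA factors can only rotate each column, while its length is controlled solely by the separately trained vector `m`.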

LoRA in Practice

Tools

| Tool | LoRA Support | Notes |
|------|-------------|-------|
| HuggingFace PEFT | Full | peft.LoraConfig, industry standard |
| HuggingFace TRL | Full | SFTTrainer + LoraConfig |
| Unsloth | Full + fast | 2× faster LoRA training, lower memory |
| Axolotl | Full | Config-driven, many PEFT options |
| LLaMA Factory | Full | Web UI + CLI for LoRA |

Typical Training Configuration

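```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                  # rank of the low-rank update
    lora_alpha=32,         # scaling factor alpha (here 2×r)
    # Module names are model-specific; these match Llama-style attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,     # dropout applied on the LoRA path
    bias="none",           # keep bias terms frozen
    task_type="CAUSAL_LM",
)

# Wrap an already-loaded base model (not shown here) so that only the adapters train:
# model = get_peft_model(base_model, config)
# model.print_trainable_parameters()
```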

When to Use LoRA vs. Full Fine-Tune

| Use LoRA | Use Full Fine-Tune |
|----------|-------------------|
| Limited GPU budget | Unlimited compute |
| Style/format adaptation | Deep knowledge injection |
| Instruction format tuning | Domain-specific specialization |
| Quick experiments | Production-grade fine-tune |
| Multiple task adapters needed | Single-purpose model |

LoRA Adapters as Plugins

One powerful pattern: maintain a single base model + multiple LoRA adapters for different tasks/personas:

Base model: LLaMA 3 8B (unchanged)
Adapter A: legal writing style
Adapter B: code generation specialist
Adapter C: customer support persona

Switch adapters dynamically at inference → multi-purpose deployment without multiple full models.
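The plugin pattern can be sketched with one shared weight and a dictionary of adapters. Everything here is a toy stand-in: the adapter names mirror the examples above, the (B, A) pairs are random rather than trained, and `make_adapter` and `forward` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 256, 4, 8
W = rng.normal(size=(d, d))                      # shared frozen base weight


def make_adapter():
    """Random stand-in for a trained (B, A) pair; real adapters come from training."""
    return rng.normal(scale=0.1, size=(d, r)), rng.normal(scale=0.1, size=(r, d))


# Hypothetical adapter names mirroring the examples in the text.
adapters = {"legal": make_adapter(), "code": make_adapter(), "support": make_adapter()}


def forward(x, adapter_name=None):
    """Apply the base weight, plus the selected adapter if one is named."""
    y = W @ x
    if adapter_name is not None:
        B, A = adapters[adapter_name]
        y = y + (alpha / r) * (B @ (A @ x))
    return y


x = rng.normal(size=(d,))
base_out = forward(x)
legal_out = forward(x, "legal")
code_out = forward(x, "code")

adapter_params = sum(B.size + A.size for B, A in adapters.values())
print(f"base params: {W.size}, all three adapters: {adapter_params}")
print(np.allclose(legal_out, code_out))   # False: each adapter changes the output differently
```

With HuggingFace PEFT, the same pattern is typically expressed with `load_adapter` and `set_adapter` on a wrapped model; the sketch above only shows why switching is cheap, since the base weight is never copied or modified.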

Related Concepts

Fine-Tuning, Parameters, Quantization, Pre-training, Instruct Model, QLoRA, RLHF