
LoRA and PEFT (Parameter-Efficient Fine-Tuning)

Definition

PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that fine-tune only a tiny fraction of a model's parameters instead of updating all weights. LoRA (Low-Rank Adaptation) is the dominant PEFT method: it adds small trainable low-rank matrices alongside the frozen weights, typically in the attention layers, and often comes close to full fine-tuning quality while training only about 0.1–1% of the parameters.

Why PEFT?

Full fine-tuning a 70B model requires ~140GB of GPU memory for the fp16 weights alone. Optimizer states add roughly 3× more with Adam, for ~560GB total. That is seven A100 80GB GPUs for memory alone, before activations are counted.
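
The arithmetic behind that estimate can be reproduced in a few lines. This is a rough sketch that assumes 2 bytes per parameter (fp16) and reuses the 3× multiplier quoted above for optimizer state; the function name `full_finetune_memory_gb` is made up for illustration, and activations are ignored.

```python
import math

def full_finetune_memory_gb(n_params, bytes_per_param=2, optimizer_multiplier=3):
    """Rough memory estimate in GB (1 GB = 1e9 bytes), ignoring activations."""
    weights_gb = n_params * bytes_per_param / 1e9
    optimizer_gb = optimizer_multiplier * weights_gb  # assumption: 3x the weight memory
    return weights_gb, weights_gb + optimizer_gb

weights_gb, total_gb = full_finetune_memory_gb(70e9)
gpus_needed = math.ceil(total_gb / 80)  # A100 80GB cards, counting memory only
print(f"weights: {weights_gb:.0f} GB, total: {total_gb:.0f} GB, GPUs: {gpus_needed}")
```

A stricter accounting that also counts gradients and fp32 master weights is often quoted as about 16 bytes per parameter, which is where a figure like 112 GB for a 7B model comes from.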

PEFT methods make fine-tuning accessible:

| Method | Trainable Params | VRAM for 7B Model |
|--------|-----------------|-------------------|
| Full fine-tune | 100% (7B) | ~112GB (fp16 + Adam) |
| LoRA (r=16) | ~0.5% (~35M) | ~18GB |
| QLoRA (r=16, 4-bit base) | ~0.5% (~35M) | ~6GB (fits on single consumer GPU) |

LoRA: Low-Rank Adaptation

Core Idea

Weight updates during fine-tuning tend to have low intrinsic dimensionality — they don't need the full rank of the weight matrix. LoRA exploits this by decomposing the weight update into two small matrices:

```
Original weight (frozen): W ∈ ℝ^(d×d)

LoRA update: ΔW = B × A

where B ∈ ℝ^(d×r), A ∈ ℝ^(r×d), r << d
```

At inference:

```
output = (W + ΔW) × input = (W + BA) × input
```
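
A minimal sketch of the decomposition, using plain Python lists in place of tensors so it has no dependencies. It takes B with shape d×r and A with shape r×d, so that the product BA is d×d like W, and checks that applying the adapter as a separate path gives the same output as adding ΔW to W. The toy sizes and the helper names `matmul` and `add` are illustrative assumptions.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

random.seed(0)
d, r = 6, 2  # toy sizes; real layers have d in the thousands

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen, d x d
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # trainable, d x r
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trainable, r x d
x = [[random.gauss(0, 1)] for _ in range(d)]                    # input column, d x 1

delta_W = matmul(B, A)                                  # d x d, rank at most r
merged = matmul(add(W, delta_W), x)                     # (W + BA) x
two_path = add(matmul(W, x), matmul(B, matmul(A, x)))   # W x + B (A x)

max_diff = max(abs(m[0] - t[0]) for m, t in zip(merged, two_path))
full_params, lora_params = d * d, 2 * d * r
print(max_diff < 1e-9, full_params, lora_params)
```

At realistic sizes the saving is much larger than in this toy case: with d = 4096 and r = 16, the full matrix has 16,777,216 entries while the two LoRA factors together have 131,072, under 1% of the original.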

The Rank Hyperparameter (r)

  • r is the bottleneck dimension — controls how expressive the adapter is
  • Typical values: r = 4, 8, 16, 32, 64
  • Higher r → more expressive, more trainable params, slower training
  • For most tasks: r=8 or r=16 is sufficient

Initialization

  • A: initialized with random Gaussian noise
  • B: initialized to zero → ΔW = BA = 0 at the start of training
  • This ensures the LoRA adapter has zero effect initially (preserves the pretrained model)
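
The zero-effect property is easy to check directly. The sketch below assumes toy sizes and an arbitrary standard deviation of 0.02 for A; neither value comes from the text above.

```python
import random

random.seed(0)
d, r = 4, 2  # toy sizes

A = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(r)]  # r x d, random
B = [[0.0] * r for _ in range(d)]                                    # d x r, zeros

# delta_W[i][j] = sum_k B[i][k] * A[k][j]
delta_W = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
           for i in range(d)]

adapter_is_noop = all(v == 0.0 for row in delta_W for v in row)
print(adapter_is_noop)
```

Only one of the two matrices is zeroed for a reason: the gradient with respect to B depends on A, so B can start learning immediately. If both started at zero, neither would receive a gradient.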

Scaling Factor (α/r)

```
ΔW = (α/r) × BA
```

  • α (alpha) is a scaling hyperparameter, typically set to r or 2r
  • Normalizes the update magnitude relative to rank
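
A small sketch of how the multiplier behaves under different conventions for α. The helper name `lora_scale` is hypothetical, and the fixed value α = 16 is just an example.

```python
def lora_scale(alpha, r):
    """Multiplier applied to BA before it is added to the frozen weight."""
    return alpha / r

for r in (4, 8, 16, 64):
    tied = lora_scale(r, r)         # alpha = r
    doubled = lora_scale(2 * r, r)  # alpha = 2r
    fixed = lora_scale(16, r)       # alpha held at 16 while r changes
    print(f"r={r:>2}  alpha=r: {tied:.2f}  alpha=2r: {doubled:.2f}  alpha=16: {fixed:.2f}")
```

Tying α to r keeps the multiplier constant as the rank changes. Holding α fixed instead makes the multiplier shrink as r grows, which is worth keeping in mind when sweeping r on its own.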

Which Layers to Apply LoRA To?

Typically applied to the attention projection matrices (Q, K, V, O):

```
W_q, W_k, W_v, W_o → each gets its own A and B matrices
```

More recent work reports that also adapting the FFN projections (that is, all linear layers) often improves results, at the cost of more trainable parameters.
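
To see how the choice of target layers affects adapter size, the sketch below counts trainable parameters for two configurations. The dimensions (hidden size 4096, FFN size 11008, 32 layers) and the gated three-projection FFN layout are assumptions chosen to resemble a 7B-class decoder; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(shapes, r, n_layers):
    """Trainable LoRA parameters: each (d_out, d_in) layer adds r * (d_in + d_out)."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in shapes)
    return per_layer * n_layers

# Assumed dimensions for a 7B-class decoder.
hidden, ffn, n_layers, r = 4096, 11008, 32, 16

attention = [(hidden, hidden)] * 4                    # W_q, W_k, W_v, W_o
mlp = [(ffn, hidden), (ffn, hidden), (hidden, ffn)]   # gate, up, down projections

attn_only = lora_param_count(attention, r, n_layers)
all_linear = lora_param_count(attention + mlp, r, n_layers)
print(f"attention only: {attn_only:,}")
print(f"all linear:     {all_linear:,}")
```

Under these assumptions, attention-only LoRA trains about 17M parameters and all-linear LoRA about 40M, so the broader target set roughly doubles the adapter size while staying well under 1% of a 7B model.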

Merging LoRA at Inference

After training, the LoRA matrices can be merged back into the base model:

```
W_new = W + (α/r) × BA
```

Result: the merged model has the original size and no added inference latency, because the adapter is folded into the weights.
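
The merge can be sketched with plain Python lists. The example includes the α/r scaling factor in the merged weight and checks that the merged matrix gives the same output as running the base path and the adapter path separately. Sizes, seed, and the `matmul` helper are illustrative assumptions.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

random.seed(1)
d, r, alpha = 5, 2, 4
scale = alpha / r

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # frozen base weight
B = [[random.gauss(0, 1) for _ in range(r)] for _ in range(d)]  # trained adapter, d x r
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]  # trained adapter, r x d
x = [[random.gauss(0, 1)] for _ in range(d)]

# Unmerged: base path plus the scaled adapter path (two extra small matmuls).
BAx = matmul(B, matmul(A, x))
unmerged = [[wx[0] + scale * bax[0]] for wx, bax in zip(matmul(W, x), BAx)]

# Merged: fold the scaled update into W once, then run a single matmul.
BA = matmul(B, A)
W_new = [[w + scale * u for w, u in zip(rw, ru)] for rw, ru in zip(W, BA)]
merged = matmul(W_new, x)

max_diff = max(abs(u[0] - m[0]) for u, m in zip(unmerged, merged))
same_shape = (len(W_new), len(W_new[0])) == (len(W), len(W[0]))
print(max_diff < 1e-9, same_shape)
```

In Hugging Face PEFT, this step is exposed as `merge_and_unload()` on a LoRA-wrapped model.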

QLoRA: LoRA on Quantized Models

QLoRA quantizes the frozen base model to 4-bit (NF4 format), which cuts weight memory to roughly a quarter of fp16:

1. Load base model in 4-bit precision
2. Add LoRA adapters in bf16/fp16 precision
3. Train only the LoRA adapters
4. De-quantize + merge at the end (optional)

QLoRA makes it possible to fine-tune 70B-class models on a single 48GB GPU.

NF4 (4-bit NormalFloat): a 4-bit data type whose quantization levels are spaced for normally distributed weights, so it usually loses less information than uniform INT4.
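
A back-of-the-envelope check of the 48GB claim. This sketch counts weight storage only and ignores the small overhead of quantization constants; `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 70e9
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)
headroom_gb = 48 - nf4_gb  # left for adapters, optimizer state, and activations

print(f"fp16: {fp16_gb:.0f} GB, 4-bit: {nf4_gb:.0f} GB, headroom on 48 GB: {headroom_gb:.0f} GB")
```

The 4-bit weights take about 35 GB, which leaves roughly 13 GB on a 48GB card. That margin is tight, so batch size and sequence length still need care.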

Other PEFT Methods

Adapter Layers

  • Insert small MLP modules between Transformer layers
  • Only the adapter weights are trained
  • Older approach; less popular now than LoRA

Prefix Tuning / Prompt Tuning

  • Prepend learnable "soft tokens" (continuous embeddings) to the input
  • Only these tokens are trained
  • Very parameter-efficient but less expressive

IA³

  • Rescales attention and FFN activations with learned vectors
  • Even fewer parameters than LoRA
  • Works well for instruction following

DoRA (Weight-Decomposed Low-Rank Adaptation)

  • Decomposes weight into magnitude + direction
  • Updates direction with LoRA + magnitude separately
  • Often slightly better quality than standard LoRA
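
One way to read the magnitude/direction split is sketched below: the direction is the column-normalized matrix W0 + BA, and a separate learned vector m rescales each column. The column-wise norm, the initialization of m to the pretrained column norms, and the helper names are assumptions made for this illustration.

```python
import math
import random

def column_norms(M):
    """Euclidean norm of each column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*M)]

def dora_weight(W0, B, A, m):
    """W' = m * (W0 + BA) / ||W0 + BA||, with the norm and m applied per column."""
    r = len(A)
    V = [[W0[i][j] + sum(B[i][k] * A[k][j] for k in range(r))
          for j in range(len(W0[0]))] for i in range(len(W0))]
    norms = column_norms(V)
    return [[m[j] * V[i][j] / norms[j] for j in range(len(V[0]))]
            for i in range(len(V))]

random.seed(0)
d, r = 4, 2
W0 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]
B = [[0.0] * r for _ in range(d)]  # zero init, as in LoRA
m = column_norms(W0)               # magnitude starts at the pretrained column norms

W_init = dora_weight(W0, B, A, m)
max_diff = max(abs(a - b) for ra, rb in zip(W_init, W0) for a, b in zip(ra, rb))
print(max_diff < 1e-12)
```

With B at zero and m set to the pretrained norms, the recomposed weight equals W0, so training starts from the pretrained model just as it does with standard LoRA.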

LoRA in Practice

Tools

| Tool | LoRA Support | Notes |
|------|-------------|-------|
| HuggingFace PEFT | Full | `peft.LoraConfig`, industry standard |
| HuggingFace TRL | Full | `SFTTrainer` + `LoraConfig` |
| Unsloth | Full + fast | Advertises roughly 2× faster LoRA training and lower memory use |
| Axolotl | Full | Config-driven, many PEFT options |
| LLaMA Factory | Full | Web UI + CLI for LoRA |

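Typical Training Configuration

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,               # rank of the low-rank update
    lora_alpha=32,      # scaling numerator; effective scale is lora_alpha / r = 2
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections; names vary by model
    lora_dropout=0.05,  # dropout applied on the adapter path during training
    bias="none",        # leave bias terms frozen
    task_type="CAUSAL_LM",
)

# Wrap an already-loaded causal LM (`base_model` is a placeholder here):
# model = get_peft_model(base_model, config)
# model.print_trainable_parameters()
```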

When to Use LoRA vs. Full Fine-Tune

| Use LoRA | Use Full Fine-Tune |
|----------|-------------------|
| Limited GPU budget | Unlimited compute |
| Style/format adaptation | Deep knowledge injection |
| Instruction format tuning | Domain-specific specialization |
| Quick experiments | Production-grade fine-tune |
| Multiple task adapters needed | Single-purpose model |

LoRA Adapters as Plugins

One powerful pattern: maintain a single base model + multiple LoRA adapters for different tasks/personas:

  • Base model: LLaMA 3 8B (unchanged)
  • Adapter A: legal writing style
  • Adapter B: code generation specialist
  • Adapter C: customer support persona

Switch adapters dynamically at inference → multi-purpose deployment without multiple full models.
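
The plugin pattern can be illustrated with a toy model of a single layer: one shared weight matrix and a dictionary of named adapters. The class `AdapterBank`, its method names, and the tiny 2×2 example are all hypothetical, chosen only to show that switching adapters never touches the base weights.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

class AdapterBank:
    """Toy stand-in for one frozen weight matrix plus named LoRA adapters."""

    def __init__(self, W):
        self.W = W          # shared base weight, never modified
        self.adapters = {}  # name -> (B, A, scale)
        self.active = None  # None means "base model only"

    def add(self, name, B, A, scale=1.0):
        self.adapters[name] = (B, A, scale)

    def use(self, name):
        if name is not None and name not in self.adapters:
            raise KeyError(f"unknown adapter: {name}")
        self.active = name

    def forward(self, x):
        y = matvec(self.W, x)
        if self.active is not None:
            B, A, scale = self.adapters[self.active]
            update = matvec(B, matvec(A, x))  # B (A x); the d x d update is never formed
            y = [yi + scale * ui for yi, ui in zip(y, update)]
        return y

bank = AdapterBank(W=[[1.0, 0.0], [0.0, 1.0]])
bank.add("legal", B=[[1.0], [0.0]], A=[[0.0, 1.0]])
bank.add("code", B=[[0.0], [1.0]], A=[[1.0, 0.0]])

x = [2.0, 3.0]
outputs = {}
for name in (None, "legal", "code"):
    bank.use(name)
    outputs[name] = bank.forward(x)
print(outputs)
```

Because each adapter is only a pair of small matrices, many of them can stay in memory next to one copy of the base model. Hugging Face PEFT offers the same idea through `load_adapter` and `set_adapter` on a wrapped model.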

Related Concepts

  • Fine-Tuning, Parameters, Quantization, Pre-training, Instruct Model, QLoRA, RLHF
