...
Julien Heiduk

Fine-Tune LLMs with QLoRA

Fine-tuning a large language model on commodity hardware used to require painful trade-offs. QLoRA removed the memory wall by combining 4-bit quantization with low-rank adapters. vLLM removes the inference wall by rethinking how GPU memory is managed during serving. Together they form a practical, end-to-end pipeline — from experiment to production — that fits on a single A100 or even a Colab T4.

This article walks through the complete workflow: adapt a causal LM with QLoRA, then load the result into vLLM for batched, high-throughput generation.

1. Why vLLM?

Conventional inference stacks reserve a contiguous KV-cache region per sequence upfront, sized for the longest sequence they might have to hold. Because real sequence lengths vary, most of that memory sits unused: the vLLM authors report that existing systems waste 60–80% of reserved KV-cache memory to fragmentation and over-reservation.

vLLM introduces PagedAttention: the KV cache is split into fixed-size pages (like OS virtual memory), allocated on demand and freed immediately when a sequence finishes. The gains are concrete:

  • Near-zero internal memory fragmentation
  • Continuous batching — new requests slot in mid-batch instead of waiting for the current batch to drain
  • Up to 24× higher throughput than HuggingFace Transformers in the vLLM authors' published benchmarks
  • Native support for LoRA adapters, GPTQ/AWQ quantization, and tensor parallelism
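
To make the memory argument concrete, here is a minimal sketch in plain Python that compares reserving a maximum-length KV region per sequence with allocating fixed-size pages on demand. The 16-token block size, the 2,048-token maximum, and the sequence lengths are illustrative assumptions, not measurements from vLLM.

```python
import math

def reserved_tokens_preallocated(seq_lens, max_len):
    """Every sequence reserves room for max_len tokens up front."""
    return len(seq_lens) * max_len

def reserved_tokens_paged(seq_lens, block_size):
    """Each sequence holds only the fixed-size pages it has actually touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

def waste_fraction(reserved, used):
    """Share of reserved token slots that hold no real token."""
    return 1 - used / reserved

# Hypothetical batch of mixed-length sequences (prompt + generated tokens)
seq_lens = [37, 120, 512, 64, 900, 15, 300, 48]
used = sum(seq_lens)

pre = reserved_tokens_preallocated(seq_lens, max_len=2048)
paged = reserved_tokens_paged(seq_lens, block_size=16)

print(f"preallocated waste: {waste_fraction(pre, used):.1%}")
print(f"paged waste:        {waste_fraction(paged, used):.1%}")
```

With paging, the waste per sequence is bounded by less than one block, regardless of how long the sequence could have become.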

2. Fine-Tuning with QLoRA

LoRA (Low-Rank Adaptation) freezes the base model weights and injects small trainable matrices $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ into each target layer:

$$\Delta W = B \cdot A, \quad r \ll \min(d, k)$$

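With rank $r = 16$ applied to the four attention projections of a 7B Llama-style model ($d = k = 4096$, 32 layers), the trainable parameter count drops from ~7B to ~17M, a reduction of about 99.8%. The often-quoted ~4M figure corresponds to $r = 8$ on two projections only.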
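
The adapter size follows directly from the shapes of $A$ and $B$: each adapted layer adds $r(d + k)$ parameters. The sketch below counts them for an assumed Llama-7B-style attention block (four 4096 × 4096 projections, 32 layers); the helper names are hypothetical, and models with grouped-query attention have smaller key and value projections.

```python
def lora_params(d_out, d_in, r):
    """A is r x d_in and B is d_out x r, so one adapter adds r * (d_in + d_out) weights."""
    return r * (d_in + d_out)

def count_adapter_params(shapes, n_layers, r):
    """Total trainable LoRA parameters for the given (d_out, d_in) shapes per layer."""
    return n_layers * sum(lora_params(d_out, d_in, r) for d_out, d_in in shapes)

# Assumption: Llama-7B-style attention, all projections 4096 x 4096, 32 layers
all_four = count_adapter_params([(4096, 4096)] * 4, n_layers=32, r=16)
q_and_v = count_adapter_params([(4096, 4096)] * 2, n_layers=32, r=8)

print(f"r=16 on q/k/v/o: {all_four:,} ({all_four / 7e9:.2%} of 7B)")
print(f"r=8  on q/v:     {q_and_v:,} ({q_and_v / 7e9:.2%} of 7B)")
```

The count scales linearly with both the rank and the number of targeted modules, so the choice of `target_modules` matters as much as `r`.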

QLoRA stacks 4-bit NF4 quantization on top:

  • Base model weights stored in 4-bit NF4, a data type whose 16 levels are spaced for normally distributed weights
  • LoRA adapters computed and stored in bf16
  • Double quantization compresses the quantization constants themselves, saving about 0.37 bits per parameter (roughly 0.3 GB for a 7B model)
  • Paged optimizers move optimizer states to CPU RAM when GPU memory runs short, preventing OOM spikes during gradient steps

The result: fine-tuning a 7B model in under 12 GB of VRAM.
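
A rough back-of-the-envelope for the weight storage, as a sketch: it assumes the block sizes described in the QLoRA paper (one 32-bit constant per 64 weights, and with double quantization an 8-bit constant per 64 weights plus one 32-bit constant per 256 of those). Activations, adapters, and optimizer state are deliberately left out, so this is not a full VRAM estimate.

```python
def constant_bits_per_param(block_size=64, const_bits=32, double_quant=False,
                            dq_block_size=256, dq_const_bits=8):
    """Overhead of the quantization constants, in bits per model weight."""
    if not double_quant:
        return const_bits / block_size
    # First-level constants are stored in 8 bits; each group of
    # dq_block_size of them shares one 32-bit second-level constant.
    return dq_const_bits / block_size + const_bits / (block_size * dq_block_size)

def gigabytes(bits):
    return bits / 8 / 1e9

n_params = 7e9  # assumed 7B-parameter model

plain = constant_bits_per_param(double_quant=False)
dq = constant_bits_per_param(double_quant=True)

weights_gb = gigabytes(4 * n_params)             # 4-bit weights
saved_gb = gigabytes((plain - dq) * n_params)    # double-quantization saving

print(f"4-bit weights:       {weights_gb:.2f} GB")
print(f"constants (plain):   {plain:.3f} bits/param")
print(f"constants (double):  {dq:.3f} bits/param")
print(f"double-quant saving: {saved_gb:.2f} GB")
```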

3. The Fine-Tuning Pipeline

Requirements

uv pip install transformers peft trl bitsandbytes accelerate datasets vllm

Step 1 — Load the base model in 4-bit

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # swap for Mistral-7B, Llama-3-8B, etc.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 is optimal for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches bf16 training; use torch.float16 on pre-Ampere GPUs (e.g. T4)
    bnb_4bit_use_double_quant=True,       # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # TinyLlama has no dedicated pad token

Step 2 — Attach LoRA adapters

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # freezes base weights, upcasts remaining fp16/bf16 params to fp32, enables gradient checkpointing

lora_config = LoraConfig(
    r=16,                                              # rank — higher = more capacity
    lora_alpha=32,                                     # scaling: the update BA is multiplied by alpha/r (2 here)
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this config on TinyLlama (22 layers, 2048-dim q/o, 256-dim k/v outputs):
# trainable params: 4,505,600 || all params: ~1.1B || trainable%: ~0.41

Step 3 — Train with SFTTrainer

from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

# 2,000 Python instruction-following examples in Alpaca format
dataset = load_dataset("iamtarun/python_code_instructions_18k_alpaca", split="train[:2000]")

# Combine instruction / input / output into a single text field
dataset = dataset.map(lambda x: {
    "text": f"### Instruction:\n{x['instruction']}\n\n### Input:\n{x['input']}\n\n### Response:\n{x['output']}"
})
dataset = dataset.remove_columns(["prompt"])  # drop original prompt column

sft_config = SFTConfig(
    output_dir="./tinyllama-python-adapter",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch = 16
    warmup_steps=50,
    learning_rate=2e-4,
    bf16=True,                       # needs Ampere or newer; on a T4 set fp16=True instead
    logging_steps=50,
    dataset_text_field="text",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=sft_config,
)
trainer.train()
trainer.model.save_pretrained("./tinyllama-python-adapter")
tokenizer.save_pretrained("./tinyllama-python-adapter")
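
Before launching the run, it helps to check how many optimizer steps the configuration actually produces. The sketch below does that arithmetic for the settings above, assuming a single GPU and no sequence packing; the helper is hypothetical, and the trainer's own rounding may differ by a step.

```python
import math

def optimizer_steps(n_examples, per_device_batch, grad_accum, n_devices=1, epochs=1):
    """Return (effective batch size, approximate number of optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs

effective_batch, total_steps = optimizer_steps(
    n_examples=2000, per_device_batch=4, grad_accum=4
)
warmup_fraction = 50 / total_steps  # warmup_steps=50 from the config above

print(f"effective batch: {effective_batch}")
print(f"optimizer steps: {total_steps}")
print(f"warmup share:    {warmup_fraction:.0%}")
```

With these settings, warmup covers a large share of the run and `logging_steps=50` yields only two log entries, so both values are worth lowering for short experiments.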

4. Model Comparison

Let’s compare the original model with the fine-tuned one.

Step 1 — Merge the adapter into base weights:

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base in full precision and fuse LoRA weights (ΔW = BA)
base = AutoModelForCausalLM.from_pretrained(MODEL_ID)
merged = PeftModel.from_pretrained(base, "./tinyllama-python-adapter")
merged = merged.merge_and_unload()   # fuses ΔW = BA into the base weights
merged.save_pretrained("./tinyllama-merged")
tokenizer.save_pretrained("./tinyllama-merged")
print("Merged model saved.")
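
Conceptually, merging folds the low-rank update into the frozen weight so that inference needs no extra matrix multiplications. The toy example below illustrates this with hand-written 2 × 3 matrices in plain Python, including the alpha/r scaling factor; it is a sketch of the idea, not PEFT's implementation, and all values are made up.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * a for a in row] for row in X]

# Toy shapes: d = 2, k = 3, r = 1
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]      # d x k frozen base weight
B = [[0.5], [-1.0]]         # d x r
A = [[2.0, 0.0, 4.0]]       # r x k
alpha, r = 2, 1
x = [[1.0], [2.0], [3.0]]   # k x 1 input column

# Merged weight: W' = W + (alpha / r) * B A
W_merged = add(W, scale(matmul(B, A), alpha / r))

# Adapter path (base output plus scaled low-rank branch) vs. merged path
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))
y_merged = matmul(W_merged, x)

print(y_adapter, y_merged)
```

Both paths produce the same output, which is why the merged checkpoint can be loaded like any ordinary model.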

Step 2 — Before/after comparison with the HuggingFace pipeline:

from transformers import pipeline

# Use the same Alpaca-style template the adapter was trained on (empty Input field)
PROMPT = (
    "### Instruction:\nWrite a Python function that reverses a linked list.\n\n"
    "### Input:\n\n\n### Response:\n"
)

# Original model, no fine-tuning
pipe = pipeline("text-generation", model=MODEL_ID, torch_dtype="auto", device_map="auto")
output = pipe(PROMPT, max_new_tokens=200)
print(output[0]["generated_text"])

# Fine-tuned model, adapter merged into weights
pipe = pipeline("text-generation", model="./tinyllama-merged", torch_dtype="auto", device_map="auto")
output = pipe(PROMPT, max_new_tokens=200)
print(output[0]["generated_text"])

The base model rephrases the prompt. The fine-tuned version returns actual Python code following the Alpaca response format it was trained on.
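
The merged checkpoint is a regular Hugging Face model directory, so it can be handed to vLLM for batched generation. The following is a minimal sketch, assuming vLLM is installed on a supported GPU; `build_prompt` and `generate_with_vllm` are hypothetical helper names, and the sampling settings are placeholders rather than tuned values.

```python
def build_prompt(instruction, input_text=""):
    """Alpaca-style template matching the text field used for training."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        f"### Response:\n"
    )

def generate_with_vllm(model_path, instructions, max_tokens=200):
    """Run one batched generation pass over all instructions."""
    # Imported lazily so build_prompt stays usable without vLLM installed
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)  # greedy decoding
    prompts = [build_prompt(text) for text in instructions]
    outputs = llm.generate(prompts, params)  # vLLM schedules the whole batch
    return [out.outputs[0].text for out in outputs]
```

Calling `generate_with_vllm("./tinyllama-merged", ["Write a Python function that reverses a linked list."])` would return one completion per instruction; passing many instructions at once is what lets continuous batching pay off.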

Conclusion

QLoRA makes fine-tuning accessible on a single consumer GPU by combining 4-bit NF4 quantization with low-rank adapters. vLLM handles the serving side, replacing naïve model.generate() with PagedAttention and continuous batching.

The two libraries are complementary by design: train with the PEFT/TRL ecosystem, serve with vLLM. For multi-adapter deployments, vLLM’s dynamic LoRA loading means you only need one copy of the base weights in GPU memory, regardless of how many specialized adapters you accumulate.
