Skip to main content
Back to blog

Fine-tuning small language models for specific tasks

·9 min readAI

General-purpose language models are impressive, but they are not always the best tool for a narrow task. If you need a model that classifies support tickets, generates SQL from natural language, or follows a very specific output format, fine-tuning a small model often outperforms prompting a large one.

When to fine-tune (and when not to)

Before reaching for fine-tuning, consider the simpler alternatives:

Start with prompting. If the information the model needs fits in the context window and you just need specific output formatting or a certain tone, a well-crafted system prompt might solve the problem in an afternoon. Most people reach for fine-tuning too early.

Use RAG for knowledge. If the model does not know facts specific to your domain, the answer is retrieval-augmented generation, not fine-tuning. Fine-tuning does not reliably inject facts into a model. RAG handles this better in almost every case.

Fine-tune for behavior. If the model does not behave the way you need it to, that is a fine-tuning problem. Wrong tone, inconsistent output format, poor tool-calling performance, or inability to follow a narrow task specification: these are patterns that fine-tuning teaches effectively.

The best pattern for 2026 is combining approaches: fine-tune for behavior and style (how to respond), RAG for dynamic knowledge (what to respond with). A legal AI fine-tuned on legal reasoning patterns, using RAG to pull current case law, outperforms either approach alone.

Why fine-tune a small model?

Consistency. A fine-tuned model produces consistent output because it learned the pattern from examples, not from instructions it might misinterpret. If you need JSON in a specific schema every time, fine-tuning beats prompt engineering.

Speed and cost. A fine-tuned 7B model runs on consumer hardware and generates tokens 10x faster than calling a 70B model through an API. For high-volume tasks, the cost difference is massive.

Privacy. The model runs on your hardware. No data leaves your network.

How LoRA works

Full fine-tuning updates every weight in the model, which is expensive. A 7B model at full precision has 7 billion parameters. Updating all of them requires enormous GPU memory and takes days.

LoRA (Low-Rank Adaptation) takes a different approach. It freezes all the pre-trained weights and injects small trainable matrices into each transformer layer. Instead of updating a full weight matrix W (dimensions d x d), LoRA adds two small matrices A (d x r) and B (r x d), where r (the rank) is much smaller than d. The update becomes W + BA.

This reduces trainable parameters by 90% or more. The original LoRA paper showed a 10,000x reduction in trainable parameters and 3x reduction in GPU memory versus full fine-tuning. The resulting adapter is typically 10-50MB compared to the full model's 4-16GB.

Key parameters

Rank (r) controls how expressive the adapter is. Lower rank means fewer trainable parameters. r=8 or r=16 works well for simple tasks like classification or formatting. r=64-128 is better for complex or data-rich scenarios. Higher rank increases capacity but also increases memory usage and overfitting risk.

Alpha (lora_alpha) scales how strongly the adapter affects the base model. The adapter output is scaled by alpha/r. A common safe default is setting alpha equal to rank (effective scaling factor of 1.0). Microsoft's original implementation used alpha = 2x rank.

Target modules define which layers get LoRA adapters. Research consistently shows that targeting all linear layers (attention projections plus MLP layers) outperforms targeting only the attention layers. In practice, use target_modules = "all-linear" to apply LoRA to every linear layer.

QLoRA: fine-tuning on consumer hardware

QLoRA quantizes the frozen base model to 4-bit precision during training while keeping the LoRA adapters in higher precision. This cuts GPU memory by about 33% compared to standard LoRA, making it possible to fine-tune a 7B model on a single 16GB consumer GPU.

The tradeoff is roughly 39% longer training time due to the quantization/dequantization overhead during backpropagation. For most use cases on consumer hardware, this is a worthwhile trade.

Preparing your dataset

The dataset is the most important input to fine-tuning. Quality matters more than quantity. 200 carefully curated examples routinely outperform 2,000 sloppy ones. Meta's LIMA paper trained on exactly 1,000 carefully selected examples and matched or exceeded Alpaca's 52,000-example performance.

How many examples you need

TaskExamplesNotes
Classification100-300 per categoryStraightforward pattern
Text extraction / formatting200-500Clear input-output mapping
Style / tone adaptation500-1,000Needs variety to generalize
Complex domain tasks1,000-5,000Reasoning, multi-step

Below 50-100 examples, you are essentially doing few-shot prompting disguised as fine-tuning.

Dataset formats

For single-turn tasks (classification, extraction), use the Alpaca format:

[
  {
    "instruction": "Classify this support ticket",
    "input": "My payment failed and I was charged twice",
    "output": "category: billing\npriority: high\nsentiment: negative"
  },
  {
    "instruction": "Classify this support ticket",
    "input": "How do I change my email address?",
    "output": "category: account\npriority: low\nsentiment: neutral"
  }
]

For multi-turn conversations (chatbots, agents), use the ShareGPT format with role-labeled messages. This matches the OpenAI Chat Completions API structure, so your training data looks exactly like your inference calls.

Synthetic data generation

Using a stronger model (GPT-4, Claude) to generate training data for a smaller model is a proven approach. It works best for well-defined tasks where you can verify output quality programmatically. Generate diverse examples that cover edge cases, then validate through automated checks and human review.

Training with Unsloth

Unsloth is the fastest way to fine-tune on consumer hardware. It optimizes the HuggingFace training loop with custom Triton kernels, achieving 2-2.5x faster training and 70% less VRAM usage than stock HuggingFace.

from unsloth import FastLanguageModel
 
# Load base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-8B-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
 
# Add LoRA adapter
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    lora_dropout=0,
    bias="none",
)

Training parameters that matter

from trl import SFTTrainer
from transformers import TrainingArguments
 
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        warmup_steps=10,
        lr_scheduler_type="cosine",
        output_dir="outputs",
        logging_steps=10,
        save_strategy="epoch",
    ),
)
 
trainer.train()

Learning rate is the most critical hyperparameter. For LoRA/QLoRA, start with 1e-4 to 2e-4. Too high causes catastrophic forgetting (the model loses general knowledge). Too low and the model barely changes.

Epochs. 1-5 is the standard range. 3 epochs is a safe default. With small datasets (under 500 examples), overfitting happens fast. 1-2 epochs may be enough.

Batch size and gradient accumulation. GPU memory limits the per-device batch size (usually 1, 2, or 4 for LLMs). Gradient accumulation simulates a larger effective batch: batch_size=2 with gradient_accumulation_steps=8 gives an effective batch of 16, which is a stable target for most tasks.

Use a scheduler. Cosine annealing is the standard for fine-tuning. Do not use a flat learning rate.

When to stop training

Monitor validation loss. When it starts increasing while training loss keeps decreasing, you are overfitting. Save checkpoints at regular intervals and keep the best one based on validation loss. Run your task-specific evaluation after each checkpoint to track real-world performance, not just loss numbers.

On a single RTX 3060 12GB, training on 500 examples takes about 15-30 minutes with Unsloth. Training time scales linearly with dataset size and epochs.

Beyond SFT: DPO for preference alignment

Standard fine-tuning (SFT) teaches a model to imitate good examples. DPO (Direct Preference Optimization) teaches it to distinguish good outputs from bad ones by training on pairs of responses: one chosen (preferred) and one rejected.

The difference matters. SFT says "here are good examples, imitate them." DPO says "here is a good response and a bad response, learn to prefer the good one." DPO is better at teaching the model to avoid certain behaviors rather than just imitate good ones.

The recommended pipeline is SFT first (teaches basic task structure), then DPO as a refinement step (improves quality and alignment). DPO matches or exceeds the older RLHF approach while being dramatically simpler: no separate reward model, no reinforcement learning, just a single training step on preference pairs.

Deploying the fine-tuned model

Export to GGUF format for use with Ollama:

model.save_pretrained_gguf("model", tokenizer, quantization_method="q4_k_m")

Unsloth auto-generates a Modelfile with the correct chat template. This is important: the most common cause of poor results in Ollama is using the wrong chat template. The template used during inference must match the one used during training.

ollama create ticket-classifier -f Modelfile
ollama run ticket-classifier

Now you have a purpose-built model running locally that is fast, private, and consistent at its specific task.

For production serving to multiple users, vLLM can serve LoRA adapters directly on top of a base model without merging. You can run multiple adapters on a single base model and switch per request, which is efficient when you have several specialized fine-tunes.

Choosing a base model

The base model you fine-tune determines your starting quality. As of early 2026:

Qwen 3 (4B, 8B, 14B, 32B) is the current leader. Fine-tuned Qwen3-4B matches or exceeds models 30x its size on many benchmarks. The dense models match Qwen 2.5 at 2x their size. Apache 2.0 license.

Llama 3.1/3.2 (1B, 3B, 8B, 70B) has the broadest ecosystem support and the most community fine-tunes. Smaller Llama models (1B, 3B) show the largest improvement from fine-tuning because they start weaker but benefit the most.

Phi-4 (3.8B, 14B) punches above its weight on math and reasoning. If your task involves structured reasoning or calculation, Phi-4 is worth testing.

For a first fine-tuning project, start with Qwen3-4B or Llama 3.2-3B. Small enough for a consumer GPU, strong enough to produce useful results.

When not to fine-tune

If your task can be solved with good prompting and a capable base model, skip fine-tuning. The effort of collecting data, training, and evaluating is not trivial. Fine-tune when you need consistency, speed, or privacy that prompting alone cannot provide. And always define success metrics before committing to any approach. Without evaluation, every architecture decision is just guesswork.

Sources

Enjoying the blog? Subscribe via RSS to get new posts in your reader.

Subscribe via RSS