Hi,
this is the final post in my series on RAG systems. Here I look into fine-tuning the generative part of a RAG system; here is the notebook. Mostly, though, this is my way of building a general, high-level understanding of LLM fine-tuning.
For the record: in my previous posts I looked into search, ANN algorithms, fine-tuning embedding models, fine-tuning a reranker, and evaluating RAG systems.
In RAG, the retrieval step provides the facts. Fine-tuning shapes model behavior so that the base model neither hallucinates nor ignores the provided data. As a rule of thumb: retrieval gives the model its memory, and fine-tuning gives it its instructions.
SLM
Small Language Models (SLMs) are language models with parameter counts in the range of 0.1B to 10B. Compared to frontier models, they require less RAM, run faster, and are cheaper to train and deploy. For a domain-specific RAG application, SLMs can sometimes perform comparably to or even better than LLMs (Wang et al. 2024).
Since the RAG system already provides the necessary context through the retrieval phase, the generative model doesn’t need to memorize vast amounts of world knowledge. Fine-tuning makes sense for three reasons:
- domain adaptation (teaching the model domain-specific vocabulary, facts, and proprietary concepts)
- task-specific alignment (producing responses that directly answer the question)
- consistent output (enforcing a reliable tone, style, or format).
PEFT
Full fine-tuning of large language models is computationally intensive and can lead to catastrophic forgetting. Parameter-efficient fine-tuning (PEFT) addresses this by updating only a small subset of parameters (typically <1%), significantly reducing memory and compute requirements.
Han et al. 2024 classify PEFT methods into three categories:
- Reparameterization (low-dimensional updates) via LoRA and QLoRA
- Additive methods (additional parameters) via Adapters and Prefix / Prompt tuning
- Selective updates (partial fine-tuning) via BitFit and Diff pruning
In practice, LoRA and QLoRA are the most widely used methods due to their strong performance and efficiency.
LoRA & QLoRA
The core idea of LoRA by Hu et al. 2021 is that instead of fine-tuning all parameters of the model, external low-rank matrices are used for learning. All base model weights are frozen, and fine-tuning only affects small matrices attached to selected layers. Instead of updating a large weight matrix $W$, LoRA learns an additive update $W + A \times B$, where $A$ and $B$ are low-rank matrices. This drastically reduces the number of trainable parameters while maintaining performance close to full fine-tuning.
Implementations:
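To make the low-rank idea concrete, here is a minimal NumPy sketch (illustrative only, not the API of an actual LoRA library; the hidden size and rank are assumed values):

```python
import numpy as np

# Freeze a large weight matrix W and train only a low-rank update A @ B.
d, r = 4096, 8                      # hidden size and LoRA rank (assumed values)

W = np.zeros((d, d))                # frozen base weight (d x d)
A = np.random.randn(d, r) * 0.01    # trainable low-rank factor (d x r)
B = np.zeros((r, d))                # trainable low-rank factor (r x d), zero-init
                                    # so training starts from the base model

def forward(x):
    # effective weight is W + A @ B, applied without materializing the product
    return x @ W + (x @ A) @ B

full_params = W.size                # what a full fine-tune would update
lora_params = A.size + B.size       # what LoRA actually trains
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

With rank 8 on a 4096-dimensional layer, the trainable fraction is well under 1%, which is where the memory and compute savings come from.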
QLoRA by Dettmers et al. 2023 is a combination of LoRA and quantization. Quantization converts model weights from a high-precision format to a lower-precision format (e.g., 32-bit floating point to 4-bit integer). This reduces the memory requirement of an LLM.
The standard library is bitsandbytes. Alternatives that also build on the Transformers library include HQQ and Unsloth.
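To illustrate what quantization does, here is a sketch of a simple absmax round trip to 4-bit integers (the basic idea only; bitsandbytes' NF4 scheme is more sophisticated):

```python
import numpy as np

def quantize_4bit(w):
    # absmax scaling: map weights onto the signed 4-bit range [-7, 7]
    scale = np.abs(w).max() / 7.0
    q = np.round(w / scale).astype(np.int8)  # each value fits in 4 bits
    return q, scale

def dequantize(q, scale):
    # recover approximate float weights for computation
    return q.astype(np.float32) * scale

w = np.random.randn(256).astype(np.float32)
q, scale = quantize_4bit(w)

# 4-bit storage needs 1/8 the memory of float32 (ignoring the scale),
# at the cost of a small rounding error per weight
error = np.abs(w - dequantize(q, scale)).max()
print(f"max abs rounding error: {error:.4f}")
```

The rounding error per weight is bounded by half the scale, which is why quantized models stay close to their full-precision counterparts.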
DPO
Direct Preference Optimization (DPO) by Rafailov et al. (2023) is a technique for aligning language models with human preferences directly from data. In a standard alignment pipeline, DPO follows an initial stage of Supervised Fine-Tuning (SFT): SFT establishes the model's ability to follow basic formatting and structural instructions, while DPO serves as the optimization layer for behavioral alignment.
Rather than training a separate reward model as in Reinforcement Learning from Human Feedback (RLHF), DPO optimizes the model directly using preference pairs: a “chosen” (good) response and a “rejected” (bad) response for the same prompt. By increasing the relative log probability of the chosen response, we can steer the model toward concise, factual answers grounded in the provided context while actively penalizing hallucinations.
Implementations:
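The preference objective can be written down in a few lines. Here is a sketch of the DPO loss for a single preference pair, following Rafailov et al. (2023); the log-probability values below are made up for illustration:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are summed log-probabilities of the chosen and rejected responses
    # under the policy being trained and under a frozen reference model.
    # The margin measures how much more the policy prefers "chosen" over
    # "rejected", relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # loss = -log sigmoid(beta * margin)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Here the policy prefers the chosen answer more strongly than the reference
# does, so the loss falls below log(2), its value at zero margin.
loss = dpo_loss(policy_chosen=-12.0, policy_rejected=-20.0,
                ref_chosen=-14.0, ref_rejected=-18.0)
print(f"{loss:.4f}")
```

Minimizing this loss pushes up the policy's relative log probability of the chosen response, which is exactly the steering effect described above.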
Conclusion
Overall, this fine-tuning step completes the RAG architecture pipeline. We've gone from setting up an initial vector search, fine-tuning the embedding model, and training a reranker to improve retrieval quality, all the way to fine-tuning the final generative SLM using QLoRA and DPO. I hope you enjoyed this overview and feel encouraged to dig deeper; there is a whole universe inside each of these steps.
By applying these parameter-efficient techniques, we can build, train, and deploy an entirely customized RAG system that runs efficiently on consumer hardware.
Thank you for your attention.