Hi,

Based on my previous post, I looked into fine-tuning a bi-encoder via self-supervised learning. As a logical next step, I now want to fine-tune a cross-encoder for my specific task. This is not as easy as fine-tuning a bi-encoder, because cross-encoders are not designed to be trained with contrastive learning.

Therefore, I looked into four different approaches to fine-tuning a cross-encoder for my specific task:

BCE loss

The simplest approach is to train the model with binary cross-entropy loss. Here we have two labels, 1 for a positive pair and 0 for a negative pair, and treat relevance as a binary classification problem (Passage Re-Ranking with BERT by Nogueira and Cho, 2020). This is also called a pointwise approach. For every query-document pair we calculate a logit and then apply a sigmoid activation, which yields the probability of relevance.

The model is optimized using Binary Cross-Entropy (BCE) Loss, which penalizes the model based on the distance between the predicted probability $p_i$ and the actual label $y_i \in \{0,1\}$:

$$ L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(p_i) + (1 - y_i) \log(1 - p_i)] $$
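A minimal numeric sketch of this loss in plain Python (the logits and labels below are made up for illustration; in practice the logits come from the cross-encoder):

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability of relevance."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_loss(probs, labels):
    """Binary cross-entropy averaged over N query-document pairs."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels))
    return -total / len(probs)

# Hypothetical cross-encoder logits for three query-document pairs
logits = [2.5, -1.0, 0.3]
labels = [1, 0, 0]  # 1 = relevant, 0 = irrelevant
probs = [sigmoid(z) for z in logits]
loss = bce_loss(probs, labels)
```

Note that each pair contributes to the loss independently; nothing in the sum compares a positive document against a negative one for the same query.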

While easy to implement, this method looks at each pair in isolation. It doesn’t directly teach the model to rank a good document higher than a bad one, only to score each pair closer to its binary target. As you can see in the notebook, this approach falls short of distinguishing relevance within the same domain. That’s why I used the same approach again, but with hard negatives.

Hard negatives

When training with BCE, you need both positive pairs (matches) and negative pairs (mismatches). If you simply pair a query with a random document from your corpus, the model learns the distinction very quickly, because random documents are obviously irrelevant. This leads to vanishing gradients and a model that plateaus early (Dense Passage Retrieval for Open-Domain Question Answering by Karpukhin et al., 2020).

To build a robust reranker, you must introduce hard negatives. These are documents that look relevant to a shallow system (like BM25 or a first-stage bi-encoder) because they share overlapping vocabulary with the query, but do not actually answer it. Training on hard negatives forces the cross-encoder to move beyond simple keyword matching and learn deep, fine-grained semantic distinctions.
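A toy sketch of hard-negative mining, using simple lexical overlap as a stand-in for a first-stage retriever such as BM25 (the query, documents, and helper names below are invented for illustration):

```python
def lexical_overlap(query, doc):
    """Toy stand-in for a first-stage retriever score (e.g. BM25)."""
    q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

def mine_hard_negatives(query, positive, corpus, k=2):
    """Return the k non-positive documents that score highest lexically."""
    candidates = [d for d in corpus if d != positive]
    candidates.sort(key=lambda d: lexical_overlap(query, d), reverse=True)
    return candidates[:k]

query = "how to reset a forgotten password"
positive = "steps to reset a forgotten account password"
corpus = [
    positive,
    "how to change a password policy for admins",  # hard: shared vocabulary
    "reset instructions for a forgotten PIN",      # hard: shared vocabulary
    "quarterly sales figures for 2023",            # easy random negative
]
hard_negs = mine_hard_negatives(query, positive, corpus)
```

The easy negative scores zero overlap and is never selected; the two documents that merely share vocabulary with the query become the training negatives, which is exactly the distinction the cross-encoder must learn.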

InfoNCE loss

To overcome the limitations of the pointwise BCE approach, we can use a listwise (or contrastive) approach. Instead of scoring pairs in isolation, the model evaluates a positive document alongside several hard negatives for the exact same query (Representation Learning with Contrastive Predictive Coding by van den Oord et al., 2018).

By applying the InfoNCE (Information Noise-Contrastive Estimation) loss, the model treats the task as a multiple-choice problem. It uses a softmax function over the cross-encoder scores s(q,d) to maximize the probability of the true positive document d+ while simultaneously minimizing the probabilities of the negatives:

$$ L_{InfoNCE} = -\log \frac{e^{s(q, d^+)}}{e^{s(q, d^+)} + \sum_{i=1}^{N} e^{s(q, d_i^-)}} $$
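A minimal sketch of this loss in plain Python, computing the negative log-softmax probability of the positive among all candidates (the scores below are made up for illustration):

```python
import math

def info_nce_loss(pos_score, neg_scores):
    """-log softmax probability of the positive among all candidates."""
    scores = [pos_score] + neg_scores
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[0] / sum(exps))

# Hypothetical cross-encoder scores s(q, d) for one query:
loss = info_nce_loss(pos_score=3.0, neg_scores=[1.0, 0.5, -0.2])
```

If all candidates receive the same score, the loss equals log(N+1); it falls toward zero only as the positive's score pulls ahead of every negative, which is what makes the objective competitive.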

This creates a competitive environment during training. The model directly learns to rank the positive document above the negatives, which aligns much better with the actual goal of a reranking system. In my notebook, however, this approach also falls short of distinguishing relevance within the same domain. The model is trained to maximize the probability of the one true positive while minimizing the probabilities of all negatives. My hypothesis is that this produces a model that is very good at separating relevant from irrelevant documents, but not very good at distinguishing between different levels of relevance. That’s why I tried Margin MSE loss.

Margin MSE loss

While pointwise BCE focuses on absolute scores and InfoNCE looks at a whole list, the Margin MSE (Mean Squared Error) approach focuses on the relative difference, or margin, between pairs of documents. It is often used with knowledge distillation, where a teacher model provides the soft labels. In our case, we don’t have a teacher model, so we can use the same training setup as with InfoNCE, but with a different loss function (Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation by Hofstätter et al., 2021).

Margin MSE addresses ranking by changing the objective. Instead of predicting absolute values, the model is trained so that the positive document d+ scores a fixed margin higher than the negative document d−. Since our true labels are 1 and 0, the target margin between a positive and a negative document is exactly 1 (i.e., 1−0=1). The loss penalizes the model based on how far its predicted margin deviates from this target:

$$L_{MarginMSE} = \frac{1}{N} \sum_{i=1}^{N} (1-(s(q, d_i^+) - s(q, d_i^-)))^2$$
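A minimal sketch of this loss in plain Python, with a target margin of 1 as derived from the binary labels above (the scores are made up for illustration):

```python
def margin_mse_loss(pos_scores, neg_scores, target_margin=1.0):
    """Mean squared deviation of predicted margins from the target margin."""
    return sum((target_margin - (p - n)) ** 2
               for p, n in zip(pos_scores, neg_scores)) / len(pos_scores)

# Hypothetical scores for three (positive, negative) pairs in one batch
loss = margin_mse_loss([2.0, 1.5, 0.8], [0.9, 0.5, 0.7])
```

The absolute scale of the scores no longer matters: a pair scored (5.0, 4.0) incurs the same zero loss as (1.0, 0.0), because only the gap between positive and negative is optimized.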

Conclusion

For fine-tuning a cross-encoder model on self-supervised data, we explored advanced loss functions: moving from absolute scoring (BCE) to multiple-choice listwise ranking (InfoNCE) and relative pairwise distance (Margin MSE). Theoretically, these relative ranking methods provide a more nuanced, flexible learning environment for the model.

However, our empirical results tell a different story. While InfoNCE and Margin MSE delivered solid, consistent improvements on the test set, the standard BCE loss showed the best results. With this series we show that a robust reranker doesn’t always require the most sophisticated loss function. Even with simple data and a strategy for mining hard negatives, a reranker trained with BCE loss can outperform those trained with more advanced loss functions.

When building your own RAG systems, start simple, try to get a good set of hard negatives, and always let the empirical results guide your final architecture. As shown in the notebooks, self-supervised training with hard negatives is a very effective way to train a cross-encoder model and get good results.

Overall, I hope this post helps you start training your own cross-encoder models. This skill is incredibly valuable for improving RAG (Retrieval-Augmented Generation) systems, especially when fine-tuning on your own specific datasets.

Thank you for your attention.