Training an embedding model

Hi,

Based on my previous post, I need to write a correction. I was wrong.

I previously suggested using embeddings directly from pre-trained models. That turns out to be a bad idea, because the training objectives are fundamentally different. As an alternative, I have now explored training an embedding model using Contrastive Learning.

Please look at my notebook for the full code and details.

The main reason you should not use raw embeddings from pre-trained models is the anisotropy problem. The paper How Contextual are Contextualized Word Representations? from Ethayarajh et al. 2019 shows that the embeddings are clustered in a narrow cone of the vector space, making them almost useless for differentiation. So, if you calculate the cosine similarity between two completely different sentences, the result will be around 0.80.

A method to fix this is described in the paper Supervised Contrastive Learning from Khosla et al. 2021. The idea is to use positive pairs (similar sentences) and negative pairs (dissimilar sentences) to train the model. Positve pairs are pulled close together and result in closer similarity, while negative pairs are pushed apart and result in a lower similarity. Supervised contrastive learning needs a labelled dataset to train the model. This takes a lot of time and effort to create.

To avoid the data labeling, I use SimCSE: Simple Contrastive Learning of Sentence Embeddings from Gao et al. 2022 to train the model. The idea is to create positive pairs on the same sentence only relying on dropout noise from the same model. By passing the same sentence through the encoder twice with standard dropout active, we introduce noise to create a valid positive pair for contrastive learning. Negative pairs are created by sampling random sentences from the dataset.

For SimCSE only two hyperparameters besides the model hyperparameters are critical:

batch size: this determines the number of negative samples available during training. Since SimCSE uses in-batch negatives, a batch size of 128 means the model sees 1 positive pair and 127 negative examples (N−1) for every step. Generally, a larger batch size makes the training task harder for the model. It forces the model to distinguish the correct sentence from a larger crowd of distractors, leading to more robust embeddings.
temperature: this parameter scales the cosine similarity scores before they are passed into the softmax function. It essentially controls the model’s sensitivity. A lower temperature (like the 0.05 used in the original SimCSE paper) sharpens the distribution, making the model very discriminative but potentially unstable. A higher temperature smooths the distribution out.

Here is my implementation of SimCSE using the huggingface trainer class:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32


class SimCSETrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):

        outputs_a = model(
            input_ids=inputs["input_ids"], # torch.Size([batch_size, 128])
            attention_mask=inputs["attention_mask"] # torch.Size([batch_size, 128])
        ) # torch.Size([batch_size, 128, 768])

        emb_a = outputs_a.last_hidden_state[:, 0, :] # [CLS] token torch.Size([batch_size, 768])

        # Forward Pass 2 - Re-run same inputs for different Dropout mask
        outputs_b = model(
            input_ids=inputs["input_ids"], # torch.Size([batch_size, 128])
            attention_mask=inputs["attention_mask"] # torch.Size([batch_size, 128])
        ) # torch.Size([batch_size, 128, 768])

        emb_b = outputs_b.last_hidden_state[:, 0, :] # [CLS] token torch.Size([batch_size, 768])

        # Scales vector to L2 norm
        z_a = F.normalize(emb_a, p=2, dim=1)
        z_b = F.normalize(emb_b, p=2, dim=1)

        # Cosine Similarity Matrix
        sim_matrix = torch.matmul(z_a, z_b.T) / temperature

        # Labels are the diagonal (0, 1, 2...)
        batch_size = z_a.size(0)
        labels = torch.arange(batch_size).long().to(z_a.device)

        loss = F.cross_entropy(sim_matrix, labels)

        return (loss, outputs_a) if return_outputs else loss

As you can see, the implementation is surprisingly simple.

Please have a look at the training results and the evaluation of the embeddings in the kaggle notebook. While there is still room for improvement, the embeddings are significantly better than the base model.

Overall, I hope this post helps you start training your own embedding models. This skill is incredibly valuable for improving RAG (Retrieval-Augmented Generation) systems, especially when fine-tuning on your own specific datasets.

Thank you for your attention.