This lesson is all about understanding the internals of the dual encoder. In this lesson, you will learn how to build a dual encoder in PyTorch and how to train it using a dataset of question and answer pairs. Okay, let's dive in. Let's look at the details of the dual encoder architecture for training sentence embedding models. We will train two independent BERT encoders: one associated with the questions and one with the answers. During training, we use the CLS embedding vector from the last layer of each BERT encoder as the vector embedding representing the question or answer, respectively. The dot product similarity between them represents the semantic match: a high value means the answer is semantically relevant for the question, and a low value means it is not. The dual encoder architecture uses a contrastive loss. The idea behind contrastive loss is to ensure that embeddings of similar, or positive, pairs of data points are closer together in the embedding space, while representations of dissimilar, or negative, pairs are further apart. The loss function encourages the model to maximize similarity between embeddings of positive pairs and minimize similarity between negative pairs. In our context, a positive pair is a question and its correct answer, and a negative pair is a question and any other answer in the batch, which is considered wrong. Let's take a look at an example. Here, we have four pairs of questions and answers and a similarity matrix computed between each question and each answer. For Q1, we would like our model to predict A1 as the most likely; for Q2, we would like the model to predict A2 as the most likely, and so on. So we get a single loss based on this 4x4 matrix that becomes smaller as the softmax values on the diagonal get closer to one and the rest get closer to zero. To do this in PyTorch, we use a little trick that involves the cross entropy loss function, resulting in one specific form of contrastive loss that works really well in practice. In code, we set the target argument of the cross entropy loss function to be zero, one, two, and so on, up to n minus one, where n is the batch size. This tells the cross entropy loss that the correct class (in this case, the answer) for each row (the question) is the one associated with it on the diagonal. In other words, A1 is the correct class for Q1, A2 is the correct class for Q2, and so on. In this way, we use cross entropy loss to achieve our desired goal of the contrastive loss. The use of the softmax in the loss function encourages the exponent of S_ii, the similarity between question i and answer i, to be large, which means high similarity for correct pairs, and the exponent of S_ij, where i and j are not the same, to be small, which is exactly what we want with contrastive learning. Let's see all of this in code. So we will ignore warnings as we usually do, and we import some other libraries from transformers, pandas, and NumPy here. To get more intuition about how to compute contrastive loss using the cross entropy loss trick that we talked about, you will now define a 4x4 data frame, similar to what we saw in the slides. And this is our contrastive loss function that, given the data, computes the actual loss using cross entropy loss. Now, recall that the cross entropy loss applies the softmax operator on each row. In other words, this forces the softmax values in each row to sum to one. Let's see an example.
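To make this concrete outside the notebook, here is a minimal sketch of such a contrastive loss function and the row-wise softmax it relies on; the similarity values below are made up for illustration and are not the notebook's exact numbers.

```python
import torch
import torch.nn as nn

# Contrastive loss via the cross entropy "trick":
# rows of `similarity` are questions, columns are answers.
def contrastive_loss(similarity: torch.Tensor) -> torch.Tensor:
    n = similarity.shape[0]
    # Target i says: the correct "class" (answer) for row i (question i)
    # is column i -- the diagonal of the similarity matrix.
    target = torch.arange(n)
    return nn.functional.cross_entropy(similarity, target)

# A 4x4 similarity matrix like the one in the slides (illustrative values).
sim = torch.tensor([[4.0, 1.0, 0.5, 0.2],
                    [0.3, 4.2, 0.8, 0.1],
                    [0.6, 0.4, 3.9, 0.7],
                    [0.2, 0.9, 0.3, 4.5]])

# Cross entropy applies a softmax to each row under the hood;
# we can look at those softmax values explicitly.
softmax_rows = sim.softmax(dim=1)
print(softmax_rows)
print(softmax_rows.sum(dim=1))   # each row sums to 1
print(contrastive_loss(sim))
```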
And if we sum these softmax values in each row, we can see they all do sum to one. Now I want you to run a little experiment. Here, you take this 4x4 matrix and run four rounds of a simple operation: you add 0.5 to the diagonal each time, and you subtract 0.02 from the non-diagonal values. So you increase the values on the diagonal and reduce the values everywhere else. What happens is you see that the diagonal values keep growing with every iteration, the non-diagonal values keep shrinking, and the loss computed through this contrastive loss with cross entropy keeps decreasing. And that's exactly the behavior we want from our contrastive loss: it pushes the diagonal toward higher and higher values and the rest closer to zero. Cool. So now that we've seen how contrastive loss works with cross entropy, let's look at the encoder. The encoder has a bunch of different subcomponents, which we're going to go through. In this example we use a batch size of 32, an embedding size of 512, and an output embedding size of 128. It's not a lot of code, but it packs a lot of important details. The first step is the nn.Embedding module you see here. It takes in individual tokens, which are integer values, and maps them into token embeddings. The input is 32 by the sequence length, and the output is 32 by the sequence length by 512, since every token is represented by a 512-dimensional embedding vector. The embeddings are then processed by the nn.TransformerEncoderLayer modules shown here. Here we just use eight attention heads and three layers, instead of the 12 layers in, for example, the BERT base model, for simplicity, but the structure is the same. The output here is a contextualized embedding vector for each token in the sequence, so it's really 32 by the sequence length by 512. In BERT, the CLS token is a special token that learns a rough embedding of the whole sentence, so our encoder uses the embedding of the CLS token as input to the next step. Finally, we pass the learned embedding of the CLS token through an additional projection layer that learns a transformation into a potentially smaller output embedding dimension, which is 128 here. This is not strictly necessary, but it allows us to reduce the overall embedding size at the output of the encoder and save some memory. So our final output is 32 by 128, and this is our final contextualized embedding.
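Putting those pieces together, here is a minimal sketch of what such an encoder module might look like; the class name, the batch-first tensor layout, and the use of a linear layer for the projection are assumptions for illustration, not necessarily the notebook's exact code.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, output_embed_dim=128,
                 n_heads=8, n_layers=3):
        super().__init__()
        # Token ids -> 512-dimensional token embeddings.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # A small Transformer encoder: 8 attention heads, 3 layers.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Project the CLS embedding down to the output embedding size.
        self.projection = nn.Linear(embed_dim, output_embed_dim)

    def forward(self, input_ids):
        x = self.embedding(input_ids)            # (batch, seq_len, 512)
        x = self.encoder(x)                      # (batch, seq_len, 512)
        cls_embedding = x[:, 0, :]               # CLS token is the first position
        return self.projection(cls_embedding)    # (batch, 128)

# Example: a batch of 32 sequences of length 64 with random token ids
# (30522 is BERT's vocabulary size).
enc = Encoder(vocab_size=30522)
tokens = torch.randint(0, 30522, (32, 64))
print(enc(tokens).shape)  # torch.Size([32, 128])
```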
Okay, so now we have our encoder, so let's see how the training works. We'll build a function to do the training loop. First, let's define our parameters for training: an embedding size of 512, an output embedding size of 128, a max sequence length of 64 in this case, and a batch size of 32. Next, you can define the question and answer encoders. First, you define the tokenizer, in this case the bert-base-uncased tokenizer, and then define the two encoders: one for the questions and one for the answers. Now, you load the dataset with your data loader here. You define the optimizer to be your typical Adam optimizer, with the parameters of both the question encoder and the answer encoder and a learning rate of 1e-5, and define the loss function, which is our cross entropy loss. So now, how do you build the training loop? You first iterate through the data loader in batches of 32. This line takes the question and answer batch, separates them out, and creates the tokenization of each of the questions and each of the answers. Using these tokenized outputs, you can compute the question embeddings and the answer embeddings using the question encoder and the answer encoder you defined above. Now comes the computation of the similarity scores, and there's a really cool PyTorch one-liner that does this: just a dot product of the embeddings. And here's where you compute the contrastive loss using the cross entropy trick. So again, the target needs to be zero, one, two, and so on, all the way up to the first dimension of the question embeddings, which is 32, so up to 31 in this case. And this is the loss, using the loss function, the cross entropy loss, with the similarity scores and the target. We add this particular loss in this iteration to a running loss we're going to use to keep track of how the loss is progressing over time. The final step you have to take in any training loop is, of course, to run backpropagation and the optimizer step to get the learning to actually do its magic. And this happens in a loop over all of your dataset in this iteration. Once the iteration is finished, you want to return the question encoder and the answer encoder as they were trained, with their weights, so we can use them for inference. So that's really our function for training.
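For reference, here is a minimal sketch of that single-pass training function, reusing the Encoder sketch from above; the dataset format, the tokenizer settings, and the omission of attention masks are simplifying assumptions rather than the notebook's exact code.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

def train(dataset, batch_size=32, max_seq_len=64, lr=1e-5):
    # Tokenizer plus one independent encoder per side.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    question_encoder = Encoder(vocab_size=tokenizer.vocab_size)
    answer_encoder = Encoder(vocab_size=tokenizer.vocab_size)

    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(
        list(question_encoder.parameters()) + list(answer_encoder.parameters()),
        lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    running_loss = 0.0
    # Assumes the dataset yields (question, answer) string pairs.
    for questions, answers in dataloader:
        # Tokenize the questions and the answers separately.
        q_tokens = tokenizer(list(questions), padding="max_length",
                             truncation=True, max_length=max_seq_len,
                             return_tensors="pt")
        a_tokens = tokenizer(list(answers), padding="max_length",
                             truncation=True, max_length=max_seq_len,
                             return_tensors="pt")

        # Encode each side with its own encoder.
        q_emb = question_encoder(q_tokens["input_ids"])  # (batch, 128)
        a_emb = answer_encoder(a_tokens["input_ids"])    # (batch, 128)

        # Similarity matrix: every question against every answer.
        similarity = q_emb @ a_emb.T                     # (batch, batch)

        # Contrastive loss via the cross entropy trick:
        # the correct answer for question i is answer i (the diagonal).
        target = torch.arange(q_emb.shape[0])
        loss = loss_fn(similarity, target)
        running_loss += loss.item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Return the trained encoders so they can be used for inference.
    return question_encoder, answer_encoder
```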
So you've seen how the training loop works, but this was just one iteration through the dataset, and in deep learning we want to run multiple iterations. So let's change this function a little bit. We'll use the same starting point with all the setup, and then we'll add an additional loop around it that runs through a number of epochs defined in this function. So we do this again and again for multiple epochs to continue the training over time. All right, that's exciting. So we have our training loop. Let's train. We're going to use a dataset that has some samples of questions and answers. We're going to have this external class called MyDataset that essentially loads it into a pandas DataFrame and makes it available. Let's load this dataset and take a look. Here are some of the questions, and here are some of the answers. So now you can just call the train function on the dataset you loaded, with the number of epochs, and get the training started. This is a long process and it might take a while, but what we will see is a printout of the loss after every one of these epochs. Okay, so training finished, and as you can see, the loss is not too low. But remember, we only did ten epochs, we used 300 examples, which is very small, and this in itself is a really small model, so it's just meant to give you an example to play with. If you have a bigger machine, you can run it with much bigger parameter sets and many more epochs to see much better results, and I encourage you to do that. Having said that, just for the sake of the exercise, let's look at what we got. So you can take a question like "What is the tallest mountain in the world?", tokenize it, and run it through the question encoder you just got back from training. And what you get is, again, these are the tokens and these are the embeddings. And if you create a bunch of potential answers and do a similar run, you will see some interesting things here. Again, the first answer here is the same as our question, "What is the tallest mountain in the world?", and you can see the tokens appear to be exactly the same, which is what we would expect. However, the embeddings here are computed using the answer encoder, and you can see that the embedding values here are not the same as the values you got for the same text as a question. This is exactly because the question encoder and the answer encoder are not the same encoder: they have different weights and they are trained to be different, in the way that we discussed. And if you compute the similarity, the dot product, between those two, you can see that the answer that is the same as the question does come up as the highest match, with the highest dot product value, which is what we would expect given that the model was trained for just ten epochs, is very small, and doesn't have a lot of data to train on. So that's fine. So, what did we cover in this lesson? You saw how the encoder is built using PyTorch modules, and then, step by step, how the training loop works. This was an example with a small-scale model, trained for only ten epochs on a small dataset. In the next lesson, you'll see a fully trained question-answering model and how it works well to provide the full benefit of retrieval for RAG. See you there.