In this lesson, you will learn about the importance of contextualized token embeddings and how transformer models, and specifically BERT, pioneered the ability to learn contextualized token embeddings, which are used in sentence embedding models. All right, let's get to it.

In the previous lesson you learned about word vector embeddings. Here you see a sample word, "Yoda," and its embedding vector from a word embedding model. Models like Word2vec and GloVe give you embeddings that capture the semantic meaning of each word. You can use these embeddings to find the similarity between words. But there is a problem with word embedding models: they don't understand context. Here you see two different sentences that use the word "bat" in two different contexts. Word embedding models like Word2vec will not be able to separate these words by their context; using these models you will get the same vector embedding in both cases. So what we need are contextualized word embeddings.

In 2017, the paper entitled "Attention Is All You Need" introduced the transformer architecture to natural language processing. This was the breakthrough that led to large language models, but along the way it also solved the problem of contextualized word embeddings. Let me show you how.

The transformer architecture was originally designed for translation tasks and thus had two components: an encoder and a decoder. The input to the encoder is a sequence of words, or tokens, and the output is a sequence of continuous representations. The output of the decoder is again words or tokens. And this is how a translation task would work. The encoder would take in a phrase in one language and would produce output vectors that represented the meaning of the input phrase. To produce these vectors, the encoder could attend to tokens to the left or the right of any given token. In contrast, the decoder operates one token at a time and considers the tokens predicted so far along with the outputs of the encoder. The decoder predicts the first word, "the". This is fed back in as input. Now the decoder considers the encoder outputs and the previously generated token and predicts "man", and so on, one token at a time. To recap, the encoder attends to tokens to the left and right of the output it is producing. This results in the encoder output vectors being the contextualized vectors we're looking for. In contrast, the decoder only attends to the inputs to the left.

But transformers with attention were used for more than translation tasks. The most famous implementations are the LLMs, like GPT-2, GPT-3, and GPT-4, which use a decoder-only architecture. And of course there's BERT, an encoder-only transformer model, which is heavily used as a component in sentence embedding models.

Let's talk a little bit more about BERT. It was built in two sizes: BERT Base, with 12 transformer layers and 110 million parameters, and BERT Large, with 24 layers and 340 million parameters. BERT was pre-trained on 3.3 billion words and is often used with an additional task-specific fine-tuning step. The BERT model was pre-trained on two tasks. One task is called Masked Language Modeling, or MLM. Here's an example. The inputs are sentences that start with a special token called CLS and end with a separator token, or SEP. 15% of the input words are masked, and the model is trained to predict those masked words. Now, I'm calling these words, but in reality they are tokens. We'll talk more about that later.
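To make the MLM objective concrete, here is a minimal sketch (not part of the lesson's notebook) that asks a pre-trained BERT to fill in a masked token using the Hugging Face fill-mask pipeline. The model name bert-base-uncased and the example sentence are my own choices here.

```python
# A minimal sketch of BERT's masked-language-modeling objective using the
# Hugging Face "fill-mask" pipeline (model choice is an assumption).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The tokenizer adds the [CLS] and [SEP] special tokens automatically;
# [MASK] marks the token the model must predict from its surrounding context.
predictions = fill_mask("The bat flew out of the [MASK] at night.")

for p in predictions[:5]:
    print(f"{p['token_str']:>10}  score={p['score']:.3f}")
```

Because the model can attend to tokens on both the left and the right of [MASK], its predictions are driven by the full sentence context, which is exactly the contextualization this lesson is about.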
This task is critical, as this is where the model learns to produce contextualized vectors based on the surrounding words. The second task was Next Sentence Prediction, or NSP. In this task, the model predicts whether one sentence is likely to follow another. So for example, if sentence A is "The man went to the store" and sentence B is "He bought a gallon of milk", the output prediction is "true". On the other hand, if sentence A is "The man went to the store" and sentence B is "Penguins are flightless birds", that one would be a "no". It isn't likely that B follows A. So this task trains the model to understand the relationship between two sentences.

After the pre-training, you can apply the idea of transfer learning to BERT and adapt it to specific tasks via fine-tuning, such as classification, named entity recognition, or question answering. One specific task of interest to us in this course is what is known as a cross encoder. Specifically, this is a type of classifier where the input consists of two sentences separated by a special SEP token. The classifier is then asked to determine the semantic similarity between those two sentences.

All right. Let's see all these concepts in code. So, to get started with this notebook, first I'm going to run this import and ignore some of the warnings. And then we're going to import some libraries we're going to use in this notebook. Specifically, you see we have the BERT tokenizer and the BERT model from the transformers library, as well as some others. So let's import those.

Okay. So first, let's load the word embeddings from GloVe. You can also use word2vec if you'd like, as shown here, but it's just too large for our learning platform, so you can use it on your own machine later on. So we load these; this might take a while for the model to load. Now, let's take a look at an example word just to see what a word embedding actually looks like. We look at the word "king" and check the shape of its embedding: it's a 100-dimensional vector. Now you can take the word vector for "king" and look at what it actually contains, limiting here to just the first 20 numbers. You see that it's just floating-point numbers, and, as we've seen, we have 100 of these.

Okay. So, let's do a little visualization. We're going to pick a bunch of words, as shown here, and then we're going to collect their embedding vectors into an array. And because these vectors are high-dimensional, we're going to use PCA to reduce them to two dimensions so we can actually visualize them. So we'll do that here. Now you can take these PCA vectors and visualize them in two dimensions. You can see that there are clusters of words in different parts of this graph. For example, words whose vectors are close in semantic meaning appear in close proximity to each other, like this.

One cool thing about this is that these are truly vectors, and you can do algebraic operations with them, like we see here. The word closest to the embedding of "king" minus the embedding of "man" plus the embedding of "woman" is the embedding for "queen". And you can see that the similarity score between the result of that algebraic operation and "queen" is very, very close. You should try this yourself. It's actually really cool. For example, instead of king minus man plus woman, you can try Paris minus France plus Spain and see what you get.

Okay, so let's compare our word embeddings to the contextualized embeddings from BERT.
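Before we move on to BERT, here is a minimal sketch of the GloVe steps just described, assuming the vectors are fetched through gensim's downloader under the name glove-wiki-gigaword-100; the lesson's notebook may load them differently.

```python
# A sketch of the static word-embedding steps above, assuming gensim's
# downloader is used to fetch 100-dimensional GloVe vectors.
import gensim.downloader as api
import numpy as np
from sklearn.decomposition import PCA

glove = api.load("glove-wiki-gigaword-100")

# Each word maps to one fixed vector, regardless of the sentence it appears in.
print(glove["king"].shape)   # (100,)
print(glove["king"][:20])    # the first 20 of the 100 float values

# Reduce a handful of word vectors to 2-D with PCA for plotting.
words = ["king", "queen", "man", "woman", "paris", "france", "spain", "madrid"]
coords = PCA(n_components=2).fit_transform(np.stack([glove[w] for w in words]))

# Vector arithmetic: king - man + woman lands closest to queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```

The same most_similar call with positive=["paris", "spain"] and negative=["france"] is the Paris minus France plus Spain exercise mentioned above.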
So first, in this cell, we load the tokenizer for BERT and the model for BERT. We'll also create a little function here to help us get the actual embedding given a sentence and a word. So we'll run that.

Now let's look at an example. I have two sentences here. One says "The bat flew out of the cave at night." and the other one is "He swung the bat and hit the home run." The word "bat" is really what we're interested in, and we're going to compute the embedding of the word "bat" once in the context of sentence one, and a second time in the context of sentence two. And then we're going to compare these to the static word embedding from the word embedding model GloVe. So, let's run this. When you run this, you can see for yourself that the word "bat" has an embedding in the context of sentence one that looks like this, while the same word "bat" in the context of sentence two looks very different. These, on the other hand, are the GloVe embeddings for that word, which are not contextual. We can also see that the cosine similarity between the BERT embeddings in the two different contexts is about 0.45 or 0.46 in this case, so they're quite different. Whereas the cosine similarity between the GloVe embedding of the word and itself is, of course, one, because it's exactly the same embedding.

We've discussed how BERT can be used as a cross encoder. Let's take a look at a specific cross encoder that was fine-tuned on the MS MARCO dataset of question and answer pairs; it's a passage retrieval task. We load this model, and now we can define a question and some possible answers. Then we just run the model on the question and the answers we defined. And voila! We see that the highest-scoring answer is the correct answer for this question: Paris is the capital of France. Awesome. I encourage you to try this for yourself with other questions and other answer options and see how a well-trained cross encoder truly understands the semantic meaning of questions and answers. A minimal code sketch of these BERT steps is included below for reference.

So in this lesson, you saw that word embedding models like word2vec and GloVe do not capture the context of words in a sentence, and how BERT embeddings are able to capture that context. But contextualized word embeddings are not sentence embeddings. In the next lesson, you will learn how you can go from contextualized word embeddings to a full sentence embedding model. All right, see you there.
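For reference, here is a minimal sketch of the two BERT steps from this lesson: contextualized embeddings for the word "bat", and a cross encoder fine-tuned on MS MARCO. The model names (bert-base-uncased and cross-encoder/ms-marco-MiniLM-L-6-v2) and the helper function are assumptions on my part; the notebook's exact code may differ.

```python
# Contextualized BERT embeddings plus a cross encoder, sketched under the
# assumptions stated above.
import torch
from transformers import BertTokenizer, BertModel
from sentence_transformers import CrossEncoder

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def contextual_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return BERT's contextualized vector for the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

bat_1 = contextual_embedding("The bat flew out of the cave at night.", "bat")
bat_2 = contextual_embedding("He swung the bat and hit the home run.", "bat")
print(torch.cosine_similarity(bat_1, bat_2, dim=0))   # well below 1.0

# A cross encoder fine-tuned on MS MARCO passage ranking scores (question, answer) pairs.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
question = "What is the capital of France?"
answers = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is a famous landmark.",
]
scores = cross_encoder.predict([(question, answer) for answer in answers])
print(answers[scores.argmax()])   # the first answer scores highest
```

Note that the 768-dimensional "bat" vectors come from BERT's last hidden layer, so each one depends on the whole sentence, unlike the single static GloVe vector for "bat" shown earlier.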