In this lesson, you will gain the tools to create a robust retrieval system capable of handling complex queries involving both text and images. By the end of the lesson, you will be able to ingest data into LanceDB, perform similarity searches, and display results that combine information from different modalities, such as video frames and their associated captions. All right, let's go.

Recall the multimodal RAG system diagram. We will learn how to populate the vector store and how to retrieve from it. Vector stores are databases optimized for storing and retrieving high-dimensional vectors. As we have previously seen, we can represent vision-language data as embeddings in a common multimodal semantic vector space. Vector stores help us go from these embeddings to a search engine.

We will also consider the problem of multimodal retrieval. Once the multimodal vector store is constructed, we will move on to the retrieval task itself. Here we take an input query, compute its embedding, and perform a similarity search between the query's embedding and the embeddings inside the vector store. This way we can retrieve the metadata, including the video frame and its associated caption, that best satisfies the query. So let's see this in action in the notebook for lesson four.

Welcome to the lab for lesson four. In this lab, we'll practice ingesting a video corpus into a LanceDB vector store and performing multimodal retrieval using LangChain. In lesson three, we already downloaded the video corpus, extracted frames, obtained the transcriptions, and generated captions. In this notebook, we will first pass that data through the BridgeTower model to populate a vector store with multimodal embeddings, and then demonstrate multimodal semantic search by querying this vector store with natural language queries.

Let's import some libraries. Since we are working with multimodal data, we've extended a few LangChain classes; these live in the multimodal LanceDB module and the BridgeTower embeddings module that we import here. Now we will set up a connection to the LanceDB vector store by declaring a host file path and a table name, and we will initialize the vector store.

Next we load the metadata files for the videos that we constructed in the previous notebook, and we collect all the transcripts and video frames into their respective lists. A point to note is that for some videos, the transcript associated with a video segment can be fragmented, which means too few words per video frame for contextualization. To mitigate this, we can augment the transcript associated with each frame with the transcripts of its n neighboring frames. This results in some overlap between the transcripts of neighboring frames, but that is okay: overall it increases the textual context provided with each frame. So let's implement this augmentation with n = 7.

Let's take a look at a particular transcript segment before and after this augmentation. Here we see that the transcript example before the update was very short: "spacewalk and we now have the chance to have it done." This would not provide enough context for the video frame. After the update, we see that this short transcript has been augmented with neighboring transcripts, which gives the model enough context to contextualize the video frame.
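To make the augmentation step concrete, here is a minimal sketch of the n-neighbor transcript augmentation. The metadata file path and the field names (`transcript`, `extracted_frame_path`) are illustrative assumptions, not necessarily the notebook's exact names:

```python
import json

# Assumed metadata layout: one JSON file per video, where each entry
# has a 'transcript' and an 'extracted_frame_path' field.
with open("shared_data/videos/video1/metadatas.json", "r") as f:
    metadata = json.load(f)

transcripts = [m["transcript"] for m in metadata]
frame_paths = [m["extracted_frame_path"] for m in metadata]

def augment_transcripts(transcripts, n=7):
    """Replace each frame's transcript with the concatenation of the
    transcripts of its n neighbors on each side (window clipped at the
    video boundaries)."""
    augmented = []
    for i in range(len(transcripts)):
        start = max(0, i - n)
        end = min(len(transcripts), i + n + 1)
        augmented.append(" ".join(transcripts[start:end]))
    return augmented

augmented_transcripts = augment_transcripts(transcripts, n=7)

# Compare a short transcript before and after augmentation.
i = 6
print("Before:", transcripts[i])
print("After: ", augmented_transcripts[i])
```

The overlap between neighboring windows is intentional: each frame simply gets a wider slice of the surrounding speech for context.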
So now we are going to ingest our video data into the LanceDB vector store, as sketched in the code at the end of this lab. We initialize the BridgeTower embeddings and pass them to our vector store as the embedder. We also need to pass the text data and the image data: the text data is the transcriptions of the videos we've collected, and the image data is the video frames we've collected. Now that we've constructed the vector store, let's take a look at the table as a pandas dataframe. We see that it contains all the text from the transcripts, with each row also associated with the frame that we extracted from the video.

Next we add a retriever to our vector store. We define this retriever to use the BridgeTower model, and we specify k = 1, which means the retriever will return the single nearest matching entry in the vector store.

Now we are ready to run a query against our vector store. For example, for the query "a toddler and an adult", we can invoke the retriever we've just defined and get the retrieved results. We can display the results, and we see that the system has retrieved a video frame that accurately matches the query, showing a toddler and an adult. The text associated with this video frame is also retrieved.

Now let's invoke this exact same query, but this time with k = 3. We see that the first result is exactly what we had before. In the second result we see only a toddler, and in the third result we see two toddlers. The first result likely had the highest similarity score, which makes sense because it shows both a toddler and an adult in the same picture.

Let's run a few more examples. Here we have the query "an astronaut spacewalk with an amazing view of the Earth from the space behind". We see that the system has retrieved an appropriate video frame, where we can see the view of the Earth and an astronaut on a spacewalk. The system also retrieves the transcription of the video during this video frame.

Okay, let's try one more query: "a group of astronauts". We see that the system has retrieved a video frame that does show a group of astronauts, and it has also retrieved the transcription segment associated with this video frame.

At this point, feel free to try your own queries. You can also embed your own videos into this vector store. So feel free to try your own examples, and I will see you in lesson five.
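To tie the steps of this lab together, here is a rough sketch of the ingestion and retrieval flow, assuming the lesson's extended classes (`MultimodalLanceDB`, `BridgeTowerEmbeddings`) and illustrative names for the module paths, host file, table, constructor, and metadata fields; the notebook's actual helpers may differ:

```python
import lancedb
# The two imports below are the lesson's extended LangChain classes;
# the module paths and class names here are assumptions for illustration.
from mm_rag.embeddings.bridgetower_embeddings import BridgeTowerEmbeddings
from mm_rag.vectorstores.multimodal_lancedb import MultimodalLanceDB

# Connect to the LanceDB vector store via a host file path and a table name.
LANCEDB_HOST_FILE = "./shared_data/.lancedb"
TBL_NAME = "demo_tbl"
db = lancedb.connect(LANCEDB_HOST_FILE)

# Ingest: each (augmented transcript, video frame) pair is embedded jointly
# by BridgeTower and stored as one row. The constructor name
# `from_text_image_pairs` is assumed for this sketch.
embedder = BridgeTowerEmbeddings()
vectorstore = MultimodalLanceDB.from_text_image_pairs(
    texts=augmented_transcripts,   # transcripts collected and augmented earlier
    image_paths=frame_paths,       # corresponding extracted video frames
    embedding=embedder,
    metadatas=metadata,
    connection=db,
    table_name=TBL_NAME,
)

# Inspect the table as a pandas dataframe.
tbl = db.open_table(TBL_NAME)
print(tbl.to_pandas().head())

# Define a retriever that returns the single nearest match (k = 1).
retriever = vectorstore.as_retriever(
    search_type="similarity", search_kwargs={"k": 1}
)

# Run a natural-language query; each result is a LangChain Document whose
# metadata carries the frame path (field name assumed) and whose
# page_content is the associated transcript.
results = retriever.invoke("a toddler and an adult")
for doc in results:
    print(doc.metadata["extracted_frame_path"])
    print(doc.page_content)
```

Changing `search_kwargs` to `{"k": 3}` would return the three nearest matches instead of one, as in the toddler example above.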