In this lesson, you will learn how to preprocess your videos into a form that can later be ingested into the multimodal RAG pipeline. You will understand how to generate transcripts from videos using the Whisper model, and how to create captions for video frames using large vision-language models. All right, let's dive into the code.

Recall the multimodal RAG system diagram. We will be learning about the pre-processing block in this lesson. We will be considering three cases. Case one is when we have a video and its transcript is available. The transcript is commonly available in the WebVTT format: a sequence of text segments, each associated with a time interval. The transcript file indicates the words spoken during particular time intervals in the video. In the notebook, we will use the CV2 library to extract video frames at these timestamps, and associate each text segment with its central video frame. Case two is when we do not have a transcript file. In this case, we will generate the transcription using the Whisper model from OpenAI. Case three is when we have a video without explicit language associated with it. For example, this could be a video where there is only background music playing, or it could be a silent video. In this case, we will use large vision-language models, or LVLMs, to generate captions from the video frames. We will learn more about LVLMs in lesson five. Note that this is also a useful way to augment the language information for videos that do have transcripts available. With this, let's move on to the notebook for lesson three.

Welcome to the notebook for lesson three. In lesson two, we learned how to create multimodal embeddings from paired image-text data. In this notebook, we'll process a video corpus into video frames and their associated text transcriptions or captions, so that this data can be ingested by the BridgeTower model in a future lesson. We start by importing some necessary libraries.

Let's get one video from YouTube. Notice that we are also going to get the transcript file from YouTube itself. I encourage you to practice this lesson again with a video of your choice. So let's see what got downloaded. We have a video in MP4 format, and we have the transcription of the video in VTT format. Okay. Let's take a quick look at the transcript file. We see that it's in the WebVTT format, so we have the text segment that was spoken between a starting point and an ending point in the video. For the second video, we will download it directly from its URL.

We provide two helper functions which are very helpful for working with the videos we've downloaded. The first is a string-to-time helper: it takes a specific time written as a string in the VTT file and converts it into a time in milliseconds. This will let us extract a specific frame from a video at that particular millisecond. The second is a maintain-aspect-ratio resize, a function we will apply to each extracted frame so that it keeps its aspect ratio; this resizing benefits the computation of embeddings by the BridgeTower model.

Next, we will define a function that will help us extract and save the frames, transcript, and other metadata from the video. Specifically, in this function we are going to loop over the transcript, compute the midpoint of every text segment in milliseconds, and extract a video frame at that timestamp. We will write this video frame out to disk, and we will also prepare the metadata. Finally, this function will save all of the metadata associated with the video to a file on disk. A minimal code sketch of this function is shown below.
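Here is a minimal sketch of that frame-extraction step, assuming the transcript is a WebVTT file and that OpenCV and the webvtt-py package are available. The helpers str2time and resize_keep_aspect approximate the string-to-time and maintain-aspect-ratio helpers described above; the notebook's own function names, metadata keys, and output paths may differ.

```python
# Minimal sketch of the frame-extraction loop for a video with a WebVTT transcript.
# str2time and resize_keep_aspect approximate the helpers described above.
import json
import os
import cv2        # pip install opencv-python
import webvtt     # pip install webvtt-py

def str2time(t):
    """Convert a WebVTT timestamp such as '00:01:23.456' into milliseconds."""
    parts = [float(p) for p in t.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)   # pad missing hours
    h, m, s = parts
    return (h * 3600 + m * 60 + s) * 1000.0

def resize_keep_aspect(frame, height=350):
    """Resize a frame to a fixed height while preserving its aspect ratio."""
    h, w = frame.shape[:2]
    return cv2.resize(frame, (int(w * height / h), height))

def extract_frames_and_metadata(video_path, vtt_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    metadata = []
    for idx, seg in enumerate(webvtt.read(vtt_path)):
        # Midpoint of the text segment, in milliseconds.
        mid_ms = (str2time(seg.start) + str2time(seg.end)) / 2
        cap.set(cv2.CAP_PROP_POS_MSEC, mid_ms)   # seek to that timestamp
        ok, frame = cap.read()
        if not ok:
            continue
        frame_path = f"{out_dir}/frame_{idx:04d}.jpg"
        cv2.imwrite(frame_path, resize_keep_aspect(frame))
        metadata.append({
            "extracted_frame_path": frame_path,
            "transcript": seg.text,
            "video_segment_id": idx,
            "mid_time_ms": mid_ms,
        })
    cap.release()
    # Save the collected metadata to disk (the notebook uses its own helper for this).
    with open(f"{out_dir}/metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)
    return metadata
```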
Now let's call this function on our video and inspect what this metadata looks like. We see that for every video segment, a central frame has been extracted and saved to disk, and the transcript associated with that frame is also saved in the dictionary. We also track the video segment ID and the midpoint of the video segment in milliseconds. So this way, we've converted our video into metadata that we can work with in further downstream applications.

Now we will consider the case where we have a video but not its transcript. In this case, we will use the Whisper model from OpenAI to transcribe the video. Since the Whisper model needs an audio file as its input, we are going to extract the audio from the video as an MP3 file, using the MoviePy library. Now we will load the Whisper model. We will use the small variant of the model in this example, and we will set the language to English. This process will usually take about 1 to 2 minutes. For better performance, and depending on the computer system you have, you can also try the larger Whisper models, and you can also try increasing the best_of parameter to five. Now that we've obtained the results from the Whisper model, we are going to use a helper function, getSubs, that we've provided in the utils file, which converts the output of Whisper into the same WebVTT format that we've seen before. We can take a quick look at the file that's generated, and we see that it's in the same WebVTT format, where for every sequence of text we have the starting time and the ending time. We'll sketch this audio-extraction and transcription step in code at the end of this section.

Now we move on to our final case, where we have a video without any relevant language associated with it. This might be the case, for example, if we have a video with just musical instruments playing in the background, or a video where there is no sound at all, a silent video. In such cases, we will use a large vision-language model, specifically LLaVA, to generate insightful captions for what's happening in the video. We will learn more about LLaVA in lesson five. In this class, we will use the LLaVA model through an API provided by Prediction Guard to generate captions for the frames. We are going to use the following prompt to ask the LLaVA model to generate captions. So let's consider a particular frame. We will encode this frame as a base64 image, and then we will pass it to our LVLM inference function. We notice here that the LLaVA model has generated quite a detailed caption that describes this image: for example, that it features a space shuttle with a person floating in the air, and so on.

Next, we are going to define a function that will help us extract and save frames with captions from a particular video. Specifically, this function is going to read the video and loop over its frames. It will sample the video at a specified number of frames, which is provided as an argument. Then, for each sampled frame, it will encode the frame as a base64 image and generate a caption using the LVLM, which is LLaVA. It will then prepare the metadata in the same format that we are used to, and finally, it will save all the metadata back to disk. So let's call this function on our video and extract all the metadata. Note that this process will take about a minute or two to run.
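To make the no-transcript case concrete, here is a minimal sketch of the audio-extraction and transcription step, assuming the moviepy and openai-whisper packages are installed; the file names are placeholders.

```python
# Sketch of the no-transcript case: extract the audio with MoviePy, then
# transcribe it with the small Whisper model. File names are placeholders.
import whisper                             # pip install openai-whisper
from moviepy.editor import VideoFileClip   # pip install moviepy (1.x import path)

# Whisper expects an audio file, so pull the audio track out as an MP3.
clip = VideoFileClip("video.mp4")
clip.audio.write_audiofile("video.mp3")

model = whisper.load_model("small")
result = model.transcribe("video.mp3", language="en", best_of=5)

print(result["text"][:200])   # plain transcript text
# result["segments"] also carries per-segment start/end times, which a helper
# such as getSubs can turn back into the WebVTT format shown earlier.
```

And for the final case, here is a sketch of the captioning loop. The lvlm_inference call is a placeholder for the LLaVA call made through the Prediction Guard API in the lesson's utils file; its name, signature, and the prompt text used here are assumptions for illustration.

```python
# Sketch of the no-language case: sample frames, base64-encode them, and ask
# an LVLM (LLaVA) for a caption. lvlm_inference is a hypothetical stand-in for
# the Prediction Guard call wrapped in the lesson's utils.
import base64
import os
import cv2

PROMPT = "Describe what is happening in this image."   # illustrative prompt

def caption_video_frames(video_path, out_dir, num_frames=10):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)        # sample evenly spaced frames
    metadata = []
    for i in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i)
        ok, frame = cap.read()
        if not ok:
            break
        frame_path = f"{out_dir}/frame_{i:06d}.jpg"
        cv2.imwrite(frame_path, frame)
        # Encode the frame as base64 so it can be sent to the LVLM endpoint.
        _, buf = cv2.imencode(".jpg", frame)
        b64_image = base64.b64encode(buf.tobytes()).decode("utf-8")
        caption = lvlm_inference(PROMPT, b64_image)   # hypothetical helper
        metadata.append({
            "extracted_frame_path": frame_path,
            "transcript": caption,   # the caption plays the role of the transcript
            "video_segment_id": i,
        })
    cap.release()
    return metadata
```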
One interesting thing to note is that generating captions on videos in this manner is actually quite useful even for videos that already have transcripts; it's a useful way of augmenting the language information in a video. So let's look at one of the generated captions. Here we see that for this particular frame, the generated caption says that the image features a young boy walking on a playground, and that's a pretty appropriate description of this particular video frame.

All right. With this, we saw that there were three cases: either we had a video with its transcript available, or we had a video and generated the transcription on our own using the Whisper model, or we had a video with no pre-existing language information associated with it, and we used the LLaVA model to generate captions. So in this lesson, we've learned how to process our video data so that it can be ingested into our multimodal embedding model. In our next lesson, we are going to work with the data that we've created in this particular notebook.