This lesson focuses on leveraging the capabilities of large vision-language models for complex tasks such as image captioning, answering questions based on visual cues, and reasoning about insights from images. You will use helper functions to make API calls to Prediction Guard. Let's get coding.

Recall the multimodal RAG system diagram. In this lesson, we will learn how to implement the LVLM inference module, where LVLM stands for Large Vision-Language Model. Let's compare LLMs with LVLMs. LLMs focus on natural language processing and generation: an LLM takes only textual input and can generate text output. Large vision-language models, or LVLMs, combine visual and textual information, bridging computer vision and language understanding. These models excel in tasks like image captioning and visual question answering.

We will particularly be working with an open-source LVLM known as LLaVA, which stands for Large Language and Vision Assistant. The LLaVA model makes use of CLIP's vision transformer and combines it with a powerful LLM, typically from the Llama family. An architecture diagram is shown here, where we can see how the image is processed through the vision encoder. The output of the vision encoder goes through projection matrices that bring the image patches into the same space as the language model, and these are then propagated through the language model along with a language query or instruction. One reason LLaVA models are popular is that it is very easy to change the backbone LLM with a little additional training. In our group at Intel Labs, we have been training variants of this LLaVA model with different LLMs such as Gemma and Llama 3, and we have open-sourced these models.

Let's move on to the notebook for lesson five. Welcome to the lab for lesson five. In previous lessons, we learned about multimodal embedding models like BridgeTower, which allow you to process image-text pairs and embed them in a common multimodal semantic space. In this lab, we will learn about a different family of multimodal models, specifically large vision-language models, or LVLMs. These models can take both image and text as input and generate textual output. So let's say you have some multimodal data and you want to do question answering on it: you want to be able to chat with it and get more insight from it. This is exactly where we can use a large vision-language model. In this lesson, we will particularly be using an open-source model called LLaVA, specifically LLaVA 1.5. We will be calling the LLaVA model through an API provided by Prediction Guard, and the model is hosted on Intel Gaudi 2 AI accelerators on the Intel Developer Cloud.

Let's import the necessary libraries for this session. In this lesson, we will use the frames and transcriptions extracted in lesson two, as well as some additional data from the COCO dataset. For example, here we are reading the transcript and the central frame that we extracted previously in lesson three, and we are also including some new data consisting of image URLs and their associated captions. Let's add some more data and download the new images.

Now we will consider a few use cases with chat completion using the LLaVA model. The first use case is image captioning. We are going to provide a prompt to the model, encode the image to base64, and start a conversation object, where we will also store the caption returned from the LVLM inference, as illustrated in the sketch below.
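As a rough sketch of what this captioning setup can look like in code, here is how a frame might be encoded and a conversation built. The helper name, file path, and conversation format below are illustrative assumptions; the lesson's utils module provides the actual Prediction Guard helpers.

```python
# Minimal sketch: encode a video frame to base64 and build a chat-style
# request for an LVLM such as LLaVA 1.5. Helper names and the exact
# conversation format are illustrative, not the course's exact API.
import base64

def encode_image_to_b64(image_path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

frame_path = "extracted_frame.jpg"   # hypothetical path to an extracted frame
b64_image = encode_image_to_b64(frame_path)

# The "conversation" is just an ordered list of turns; each user turn carries
# the text prompt and, optionally, the base64-encoded image to look at.
conversation = [
    {
        "role": "user",
        "content": "Describe this image in detail.",
        "image": b64_image,
    }
]

# The actual LVLM call is made through the course's Prediction Guard helper
# (the exact function name and signature live in the lesson's utils module);
# the returned caption is then appended to `conversation` as an "assistant"
# turn so the history can be reused later.
```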
Let's take a look at the frame and how the model captioned the image. Here we see the caption generated by the LLaVA model, where the model correctly describes that there are five men in this image, along with a lot of other details it has been able to produce for this video frame.

Let's try another example. This time we are going to do Q&A on visual cues in an image. Let's display the image and how the LLaVA model described it. We had a particular prompt asking what is likely going to happen next in this image. In its response, the model indicates that the man is performing a handstand on a skateboard at the beach, which is an accurate description, and it goes on to say that he could lose his balance and fall off the skateboard, and so on. So we are able to see that the LLaVA model is able to reason from the visual content in the image.

Let's try one more example, where we will show how well the model can understand text written in videos. Here, for example, we are prompting the model with "What is the name of one of the astronauts?" for this particular frame of the video. The model is able to detect the text in the video frame as Robert Behnken, and in its response it accurately says that one of the astronauts is named Robert Behnken.

Let's try another example where we will ask a question based on the transcript associated with the image. We construct a new prompt with this template, where we insert the transcript associated with the image, and we want to ask the model what the astronauts feel about their work. Let's go and look at how the prompt was constructed. The prompt contains the transcript we provided along with the query "What do the astronauts feel about their work?", and the model is able to say that the astronauts are proud of their work on the International Space Station. It is grounding its response in the information provided in the transcript. So here we see that a model like LLaVA is able to latch on to information that is either present as visual cues in the image, as we saw with the skateboard example, or present in the associated transcript that we have provided to the model.

Next, we will consider an example where we extend the conversation with the LLaVA model. Specifically, we will ask a follow-up question to its last response. We are going to ask the follow-up query "Where did the astronaut return from?" and we will save the chat history. Let's go ahead and look at the results. Here we see that the model successfully answers that the astronaut in the image returned from the International Space Station. Hopefully, looking at this code, you can work out how we track a chat history that allows us to do multi-turn conversations; a sketch of this pattern follows below.

At this point, feel free to use your own images and text prompts to play around with this hosted LVLM. I will see you in lesson six.
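If you want to experiment with the multi-turn pattern from this lesson, here is a minimal sketch of how the chat history might be carried forward. It reuses the same assumed conversation format as the earlier sketch and is not the course's exact API.

```python
# Minimal sketch of a multi-turn follow-up, reusing the hypothetical
# conversation format from the captioning sketch above.

b64_image = "<base64-encoded frame>"  # placeholder for the encoded frame

conversation = [
    # First turn: question grounded in the image and its transcript.
    {"role": "user",
     "content": "What do the astronauts feel about their work?",
     "image": b64_image},
    # The model's previous answer, stored as an assistant turn.
    {"role": "assistant",
     "content": "The astronauts are proud of their work on the "
                "International Space Station."},
    # Follow-up question that only makes sense given the history above.
    {"role": "user",
     "content": "Where did the astronaut return from?"},
]

# Sending the whole history (not just the last question) is what lets the
# LVLM resolve "the astronaut" and answer from context; in the notebook this
# is handled by the Prediction Guard helper that accepts a conversation.
```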