So how much better can an LLM get when it's able to use private data? In this lesson, you will build an LLM that is fine-tuned on private medical data, and you'll then be able to see how well it can answer complex medical questions. You will start by using existing centralized fine-tuning methods, so you can better appreciate the benefits federated methods provide. Alright, let's go.

Let's first introduce the scenario that we will be using in all the remaining lessons of this course. In the scenario you see in this animation, there are multiple medical locations. You can see them in this animation as the three buildings at the bottom. Each location has highly sensitive medical data. You would like to fine-tune an LLM with this data and therefore improve the ability of the LLM to respond to medical questions. Although the data you will use is medical, the underlying scenario generalizes broadly across industries and applications: legal, enterprise, finance, and so on.

In this lesson, you will not use federated learning. The aim instead is to see just how much an LLM can improve when it has access to the type of private data that we are so interested in leveraging. The animation here in fact illustrates the conventional fine-tuning approach that we will use in this lesson, which is centralized. In the animation you can see the data leaving these different medical institutions and arriving at the server, after which fine-tuning is performed and the LLM is updated, which is exactly what you're going to do in the code that's forthcoming.

Before we get there, let's talk a little bit about the data that you will be using. To provide us with a representative example of private data, we are using Med Alpaca. This is a wonderful resource that contains a variety of different types of medical domain knowledge. Critically, it also includes Q&A pairs, so it can be used to evaluate the response of an LLM and understand whether it has improved or not. Because this dataset uses question and answer pairs, it's easier to evaluate whether the LLM is in fact improving in the quality of its answers, because we know what the expected responses are for these particular question-answer examples in the dataset. Med Alpaca is completely available online and open source, and it can act as a very strong proxy for what private data is and how it can benefit an LLM. So you're going to be able to experience and see the improvements we would expect when you start to apply this to your own private data in the different areas in which you might be working. Let's go check out the code.

So let's begin by importing a few packages and utility functions that we're going to need. You see here, you're going to use the Hugging Face transformers library with PEFT, which is a parameter-efficient method for performing fine-tuning that we're going to explain in the lesson after this. You'll also see that we're importing several functions from the utilities module, and that will help us keep the notebook nice and clean. In the notebooks for this course, you'll make use of Hydra. It's a powerful tool to manage config files. Without further ado, let's load the config file that describes how the centralized fine-tuning is performed.
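As a quick, hedged illustration of what that step might look like, here is a minimal sketch using Hydra's compose API. The config directory, config name, and the field names mentioned in the comments are illustrative assumptions, not the course's actual files.

```python
# Minimal sketch: loading a YAML config with Hydra's compose API (assumes Hydra >= 1.2).
# The config_path, config_name, and field names below are assumptions for
# illustration; the course notebook uses its own config files.
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="conf", version_base=None):
    cfg = compose(config_name="centralized")  # hypothetical config name

# Print the resolved config so we can inspect the dataset, model, and train sections.
print(OmegaConf.to_yaml(cfg))

# Fields a config like this might expose (names are assumptions):
# cfg.dataset.name  -> e.g. the Med Alpaca flashcards subset
# cfg.model.name    -> e.g. the 70M-parameter EleutherAI model
# cfg.train         -> typical hyperparameters (learning rate, batch size, steps)
```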
Let me highlight a few sections of interest in that config file. You can see the name of the dataset that you're using: it is the flashcards Q&A subset of Med Alpaca. You can also see the model that we're using. We're using the 70 million parameter LLM from EleutherAI. We are going to be training this, as I mentioned earlier, with PEFT and LoRA, but we'll explain PEFT and LoRA in the next lesson. You can take a look at the train section and you'll find in there the typical hyperparameters that you would expect when you are training an LLM. For this lesson, they've been adjusted for the fine-tuning that we are performing, and in particular, we've changed these values so that the fine-tuning can be completed in a short amount of time in this notebook.

Before jumping into the fine-tuning, let's peek into the Med Alpaca dataset. Let's first load it using the Hugging Face datasets API and extract 10% of the training partition. Remember, here you're playing the role of a scientist that only has access to 10% of the total data, which has been distributed across the various hospitals that you're working with. The print statement shows us that the dataset contains two columns, Instruction and Response, and that this 10% subset contains almost 3,400 training examples. Let's take a look at one example, in this case the 10th training example. You can see it in two parts: instruction and response. We have a highly medical question regarding diuretics, and then we see the ideal response in this dataset. That, again, is a fairly technical response about how diuretics will cause an increase in certain levels in the body.

Before starting the actual fine-tuning, let's get a sense of our starting point. Let's see how a pre-trained 7 billion parameter LLM responds to, first, a general question and then a more specialized one. Let's begin by loading an LLM evaluator for a pre-trained 7 billion parameter Mistral LLM through the fireworks.ai API. Feel free to change and try different prompts, but we're going to begin with one that says "how to predict the weather." You can see the answer you get back is a bit vague, but still quite acceptable and rational.

Let's evaluate how this pre-trained model does when presented with a domain-specific question. I'm not a doctor, so we're going to take an example from that Med Alpaca dataset. So we're giving it a strong test with a very domain-specific query in the medical domain. Let's take a look at the answer that we got back. You can first see the prompt. This is the question we asked the model, and this, as I said before, came from the Med Alpaca dataset. It's a very technical medical question to do with loop diuretics. You can see the response of the model here. What's interesting in this response is that the model gives it a good try, but it hasn't actually answered what we asked; instead it mainly leans on providing us with a definition of what loop diuretics are. So this is not a bad try, but it clearly didn't include the type of information that we were asking for. Below, in this response, we see the actual ground-truth answer that the training set was looking for, and you can see that this response much more directly answers the question that was asked.

Now that it is clear that a pre-trained LLM is not going to get us very far in answering domain-specific questions, you can start to introduce the code necessary before we can actually fine-tune this model. First, let's instantiate the model. Here is a good point to talk about some of the extracted statistics related to the model, now that we have loaded the config and gotten things ready to start to train.
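To make the PEFT side of this concrete, here is a rough sketch of wrapping a small causal LM with a LoRA adapter and checking how few parameters become trainable. The base model name and the LoRA settings are illustrative assumptions rather than the course's exact configuration.

```python
# Sketch: wrap a small causal LM with a LoRA adapter via PEFT and report
# how many parameters are actually trainable. Model name and LoRA settings
# are illustrative assumptions, not the course's exact configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")  # assumed base model

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the LoRA update
    target_modules=["query_key_value"],   # attention projection in Pythia/GPT-NeoX models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)

# Reports trainable vs. total parameters; only a tiny fraction of the ~70M is trainable.
model.print_trainable_parameters()
```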
You can see that for the 70 million parameter LLM we're about to fine-tune with PEFT, we'll in fact only modify about 170,000 parameters. In other words, less than 0.3% of the total parameters. We'll describe PEFT in some detail in the next lesson. But it's interesting that modifying such a small fraction of the total parameters can make fine-tuning of this kind highly efficient, and we'll make use of that in the next lesson too.

Like with all LLMs, you need to define a tokenizer. Here, we also construct auxiliary objects that will be used to process the training examples when we pass them to the LLM. With all the components ready, the model, the dataset, the tokenizer, you are ready to start the fine-tuning. The only missing part is the code to glue everything together under a single function. So let's do that.

Let's go through what the fine-tune centralized function does. First, we detect the device available in the notebook. CUDA is not available here, so the fine-tuning will make use of a CPU. However, if you run this notebook on your own system, you might have a GPU available, and these lines should allow it to dynamically adjust based on the type of hardware you have available when you run this code. Then, we make use of Hugging Face's Trainer to connect all the components previously instantiated. Finally, the training begins, and after that, you will save a checkpoint. Note that this checkpoint will only include the fine-tuned parameters.
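As a simplified sketch of what a function like that could look like: the function name, most hyperparameter values, and the assumption that the dataset is already tokenized are illustrative, not the course's exact code; only the five steps and small batch size match what's described above.

```python
# Simplified sketch of a centralized fine-tuning routine built on Hugging Face's
# Trainer. Names and most hyperparameters are illustrative assumptions.
import torch
from transformers import (
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

def finetune_centralized(model, tokenizer, train_dataset, output_dir="peft-checkpoint"):
    # Detect the available hardware; Trainer will also pick the device up automatically.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Fine-tuning on: {device}")

    training_args = TrainingArguments(
        output_dir=output_dir,
        max_steps=5,                    # just a handful of steps, as in the notebook
        per_device_train_batch_size=4,  # small batch size (illustrative value)
        learning_rate=2e-4,             # illustrative value
        logging_steps=1,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,    # assumed to be tokenized already
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()

    # For a PEFT-wrapped model this saves only the adapter (fine-tuned) weights.
    model.save_pretrained(output_dir)
```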
Let's do some fine-tuning. Here, the LLM will be fine-tuned for just five steps, and we're going to be using a small batch size, so you'll see the loss go down slowly. This might take a minute until it completes. And there we have it. We have made five steps through the fine-tuning process, and the loss has gone down.

With the code cells above that you've seen already, you've got everything you need to fine-tune any LLM. So far, you have done so only for the smaller 70 million parameter LLM, and just for a handful of steps. To fine-tune a much larger 7 billion parameter model, you can run the exact same code, but you'll need to load a different config. Load the full config like this, then run the cells in the notebook from the beginning. Bear in mind, for this it is recommended to run it on a machine with at least one GPU.

One question remains: is this fine-tuned LLM any better than the un-fine-tuned model we saw at the start? We saw how even a large 7 billion parameter model struggled to give us good responses to a detailed medical question. So now that we have fine-tuned it, is it going to improve? Let's test it again. For this, we're going to test it again on the same training example. Note that we are running the 7 billion parameter Mistral model once again, but this time it has been fine-tuned on the 10% of the Med Alpaca subset of the data that we're using. As you can see, from this fine-tuned model we've got a much better answer this time.

Let's look at the question again. It's the same question we saw before regarding how loop diuretics can impact these various levels. That is shown here in the prompt. At the bottom, we again have the ground truth, the answer that's in the dataset, a response that would be ideal. It's direct and to the point, explaining that loop diuretics can cause a decrease in these levels, and even gives the technical term for this effect. So let's now turn our attention to the actual response of the model. What you can see here is a much better response. It's directly answering the question and explaining how loop diuretics can indeed decrease these levels, and it even explains a little bit about what the implications of that would be, such as saying that it's therefore going to reduce the body's absorption of magnesium. So this is a much better answer. Success.

So far we've seen a great response, but this is with respect to just one single question. Really, we need to see the behavior of the model across a large number of questions before we can know for sure that there's been a systematic improvement. Doing this type of analysis would take quite a long time, since we need to prompt the model and look at the response over a large number of examples. So what we've done is gone ahead and done this offline and saved the result in a data structure, and we're now going to visualize that for you. As I mentioned, in the notebook, in markdown, you'll see the code necessary to run this for yourself, so you can generate this structure that holds these results yourself. You could target different models and different datasets to do the exact same analysis if you'd like to do so. But here, this results variable has been populated by ourselves using that code that you also have available, where we have prompted a large number of questions to see if there's a systematic effect and benefit.

There you have it. You can clearly see quite a pronounced improvement in the fine-tuned model against the pre-trained model. Let me explain this figure a little bit more for you so you can understand what's going on. The y-axis is essentially accuracy, which comes from these Q&A examples where it's very clear what the right answer and the wrong answer are, so it makes evaluation easy for us. We've asked the two different models, the 7 billion parameter model without any fine-tuning and then the same model fine-tuned with 10% of the available Med Alpaca data, and we've seen what happens to the accuracy metric for that set of questions. And what do we observe? We see that accuracy jumps from a little over 30% to almost 50%, even though we're using a relatively small fraction of the entire dataset that is available.

That's the end of lesson two. We've seen in the code how we can take a generic LLM, fine-tune it on private data, and see a pronounced improvement in its responses in the domain of medicine. This has been a running example that we're going to walk through in many of the lessons in the remainder of this course: this medical private-data scenario where we're using the Mistral 7B model fine-tuned with the Med Alpaca dataset. And as we can see, the accuracy of responses jumped significantly. Why is this important? Well, this demonstrates the value and the types of gains you can get by using private data that is normally unavailable. So what's next? The way that we have achieved this so far is by using centralized fine-tuning, and as we discussed in lesson one, due to the fact that LLMs are able to regurgitate their training data, doing this with private and sensitive data is dangerous. So what's the answer? We're going to take a look at the answer in the next lesson. And the answer, in fact, is going to be federated LLM fine-tuning.