This lesson will be a general introduction to o1 to get you familiar with what makes it so different, how it functions, and what people are using it for. Before o1, you could think of most models as being like children: they always say the first thing that comes to mind. As they grow up, they need to be taught a valuable lesson, that they should think before they speak. What makes o1 so different is that it explicitly thinks before it speaks, every time. This helps it reach new levels of performance at complex tasks in domains like math, coding, science, strategy, and logistics. It does this by using a chain of thought to explore possible solution paths and to verify its answers as it produces them. This means that o1 requires less context in its prompts to produce very effective results.

We've released the first two models in the OpenAI o1 series. The first is o1, our mainline reasoning model for complex tasks that require broad general knowledge, which includes function calling and image input. We also have a faster, more cost-efficient version called o1-mini, which is tailored for coding, math, and science, and designed for more cost- or latency-sensitive use cases.

The thing that makes the o1 family of models so different is how they natively integrate chain of thought into how they solve problems. Let's take an example where we give the model some scrambled letters along with their translation, and then give it a new problem to solve. The way it solves it is something like this: it thinks through what it's given and talks itself through how to solve the problem. It recognizes the example and understands the transformation that took place. Then it comes up with hypotheses and tests them. For example, it considers whether there's an anagram or a cipher here. Then it notices that the ciphertext words are exactly twice as long, so it comes up with an idea to test, tests it, and finds that it doesn't work, but uses that result to iterate to a new hypothesis, which gets it to the right answer. This, in a nutshell, is what makes o1 so special: rather than you having to prompt it to do this, we've trained the model to natively apply this reasoning process so that it can work through more complex problems in many disparate domains like science, law, coding, and math.

This introduces a trade-off when you're using the o1 family of models: you generate extra completion tokens which you don't actually see, but which the model is using to problem-solve for your task. Let's look at how that works. Completion tokens can be broken into two distinct categories: the reasoning tokens, shown here in gray, and the actual output tokens, which are the tokens you're provided for your generation. It's important to note that reasoning tokens are not passed from one turn to the next. If you want something like that, you'll need to prompt the model to output some kind of reasoning, which you can then choose to pass from one turn to the next. What this also means is that you need to account for those reasoning tokens in what you're paying for, and when you're calculating your context limit, because reasoning tokens do count towards the context limit, and if your output goes over it, the output will be truncated.

Two of the key breakthroughs that led to o1's increase in performance were, first of all, recognizing that during the post-training process, the more reinforcement learning we did, the more accurate the model got. But what was possibly more surprising was that the more we allowed the model to think at inference time, the sharper the increase in performance we saw. So the key breakthrough here was the ability to think for longer at inference time and get better results.

The other key breakthrough that led to the increase in performance with o1 was teaching it to verify outputs via consensus voting. The way this works is that we generate a bunch of solutions and train the LLM to choose the most common one. You can think of this as similar to when we sample at low temperature. In one paper, we saw a math benchmark go from 33% to 50% purely due to consensus voting. Consensus also flatlines before 100 samples, so you don't need a huge number of samples to realize the performance improvement here.
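To make this concrete, here's a minimal sketch of the inference-time version of this idea, often called self-consistency: sample several answers, then take a majority vote. The model name, prompt, and exact-string vote are illustrative assumptions for the sketch, not the training setup described above.

```python
# A minimal sketch of inference-time consensus voting (self-consistency):
# sample several answers, then take a majority vote. Model name, prompt,
# and the exact-string vote are illustrative assumptions.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def consensus_answer(question: str, n_samples: int = 9) -> str:
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",  # a sampling-friendly model; o1 does this internally
            messages=[{"role": "user",
                       "content": question + " Reply with only the final answer."}],
            temperature=1.0,  # diversity first, then vote
        )
        answers.append(response.choices[0].message.content.strip())
    # In practice you would normalize answers (e.g. parse out the final number)
    # before voting; exact string matching keeps the sketch short.
    return Counter(answers).most_common(1)[0][0]

print(consensus_answer("What is 17 * 24?"))
```

Note that sampling with some temperature is what creates the diversity the vote needs, and that o1 performs this kind of exploration internally, which is why you don't have to orchestrate it yourself.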
So what's the significance of all these breakthroughs? The results of o1 have so far been really good. For example, here's one benchmark where GPT-4o scored 13% and the new o1 model got up to 83%, a huge improvement. Similarly at code, where GPT-4o scores 11% and o1 scores 89%, and science, where we don't see as big an improvement, but still a 20-odd percentage point gain on this benchmark. This carries over to other families of benchmarks: for example math, where we see a giant 30% improvement from o1 over GPT-4o; science, with 20- to 30-odd percent improvements; a variety of different exams, the most impressive being the LSAT, with an almost 30% improvement; and lastly the MMLU categories, where there was improvement across the board, the highest score being 98.1% in college mathematics, which is super impressive and a massive improvement over the previous family of GPT-4o models.

So how does the o1 model actually work? The key factor is that it uses large-scale reinforcement learning to generate a chain of thought before answering. The chain of thought it produces is longer and higher quality than what you can typically attain by prompting alone. It also contains behaviors like error correction, trying multiple strategies and selecting the best one, and naturally breaking down problems into smaller steps. There are a bunch of examples of these chains of thought on the research blog post if you want to dig a little further.

One other area where o1 represents a new standard of performance over the GPT-4o models is abstract reasoning. I'll give you one challenge here. We gave GPT-4o 16 words, and it was challenged to identify the four categories which united those words and then place the right terms in each category. You can see here that its performance was fairly haphazard: other than approval, it was only able to identify one other category, movement/action, and it misidentified one of the terms. Whereas when we tried it with o1, it was able to correctly identify both the four categories and all 16 of the words. The reason we're so excited about this kind of abstract reasoning is that it doesn't easily fit into the same bucket as coding, math, or any of the other benchmarks we've shown, but it demonstrates some of the other emergent capabilities the o1 model is starting to show. And we're very excited to see you all test it with your own use cases to figure out where else these kinds of abstract problem-solving capabilities are going to be significant.

The last area we want to highlight, where we've seen significant emergent capability from the o1 family of models, is where we have a generator-verifier gap. What we mean by that is that for some problems, verifying a good solution is easier than generating the perfect solution upfront. For example, with Sudoku, with math, with programming, you can often generate a half-decent solution, verify it, identify the issues, and then use that to iterate to the next one much more easily than building the perfect solution upfront. There are also examples where this generate-and-verify approach doesn't work: for example, information retrieval, where the question needs to be answered correctly the first time, or image recognition, where you need to correctly identify what's in the image the first time. But where a generator-verifier gap exists, and we have a good verifier that can run the verification process, we can continue spending more compute at inference time to achieve better performance. And for these problems we're seeing o1 demonstrate great capabilities.
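To illustrate the pattern, here's a minimal generate-verify-iterate loop. The task, prompts, and model choice are assumptions for the sketch; the key point is that the verifier, here an executable unit test, is cheap and reliable, so we can keep spending compute until it passes.

```python
# A minimal sketch of a generate-verify-iterate loop. The task, prompts, and
# model choice are illustrative assumptions; the point is that the verifier
# (an executable unit test) is cheap and reliable, so we can keep iterating.
from openai import OpenAI

client = OpenAI()

def verify(candidate_code: str):
    """Run the candidate against known test cases. Returns an error message
    on failure, or None if it passes. Checking a solution like this is much
    cheaper than writing a correct one upfront: the generator-verifier gap."""
    scope = {}
    try:
        exec(candidate_code, scope)
        assert scope["fizzbuzz"](15) == "FizzBuzz", "fizzbuzz(15)"
        assert scope["fizzbuzz"](7) == "7", "fizzbuzz(7)"
        return None
    except Exception as exc:
        return repr(exc)

prompt = ("Write a Python function fizzbuzz(n) that returns 'Fizz', 'Buzz', "
          "'FizzBuzz', or str(n). Reply with only raw code, no markdown.")
for attempt in range(1, 6):
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    code = response.choices[0].message.content
    error = verify(code)
    if error is None:
        print(f"Verified on attempt {attempt}")
        break
    # Feed the verifier's feedback back in and try again.
    prompt += f"\nYour previous attempt failed with: {error}. Please fix it."
```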
So where might the o1 models actually be used? In areas like data analysis, interpreting complex datasets like genome sequences, or mathematical problem solving. We've seen a ton of benchmarks here, but this is an area where we really see a step change in performance versus the previous family of models. Similarly, experimental design: coming up with innovative approaches for niche domains like chemistry or physics is another area where we've seen pretty encouraging results. Scientific coding as well. You'll notice the common theme here: a lot of STEM subjects, a lot of math, a lot of coding, where we're seeing great use cases for o1. Biological and chemical reasoning is another one, and algorithm development. Literature synthesis is the last one I want to highlight here, where reasoning across multiple research papers to form coherent conclusions or summaries is another area where we're seeing great capabilities from o1.

So let's recap the key takeaways from this introduction to o1. The first is that the o1 family of models scales compute at inference time by producing tokens to reason through the problem iteratively. o1 gives you more intelligence as a trade-off against higher latency and cost, so it's not appropriate for all use cases. It can also perform really well at tasks that require a test-and-learn approach, where it can iteratively verify its results. Some of the great emerging use cases we're seeing are planning, coding, and domain-specific reasoning in fields like law and STEM subjects.

So let's jump into some of the practical applications we're seeing and walk through how you can make use of o1 for your use cases. I'm going to show you a quick example of how o1 works with a chat completion. First of all, I import my API key and set up my OpenAI client. I'm going to set o1-mini as the o1 model I'll use, and I'll queue up a chat completion request. You can see o1 requests are exactly the same as your typical chat completion request: we specify an o1 model, and I'm going to give it the classic question, how many r's are in strawberry? We'll generate this and see what comes back. Looking at the response object, we can see the standard chat completion response, with a chat completion message containing the output, which tells us the word strawberry contains three r's, along with a couple of other fields. Now, the thing I want to focus on here is what makes o1 different: here you can print out the tokens.
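The code on screen looks roughly like the following sketch. The variable names are my assumptions, but the client call and the usage fields are the standard ones from the OpenAI Python SDK.

```python
# A sketch of the on-screen demo. Variable names are my assumptions, but the
# client call and usage fields are the standard OpenAI Python SDK ones.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
O1_MODEL = "o1-mini"

response = client.chat.completions.create(
    model=O1_MODEL,
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
)
print(response.choices[0].message.content)

# Completion tokens break down into hidden reasoning tokens plus the
# visible output tokens; both are billed and both count toward context.
usage = response.usage
reasoning = usage.completion_tokens_details.reasoning_tokens
print("prompt tokens:    ", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
print("  reasoning:      ", reasoning)
print("  output:         ", usage.completion_tokens - reasoning)
```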
We're going to focus on the prompt tokens, the total completion tokens, and then the constituent parts of the completion tokens, which consist of the reasoning tokens and the output tokens. If we print these out, you'll see that the model did a ton of thinking here: we gave it only 15 prompt tokens, but it produced over a thousand reasoning tokens to reach the output. This highlights the key trade-off with o1. You're getting much greater intelligence, but in this case we paid a 10x premium in the number of completion tokens we were billed for to get it. This is a key thing to consider: you don't use o1 for everything, only for those use cases where the increase in intelligence is worth the trade-off in latency and cost. I look forward to seeing you in the next lesson, where we dive into how to prompt o1 to get the best performance out of it.