In this lesson, we'll go behind the scenes and look at what it takes to train the model behind an interface like Canvas. All right, let's go.

You've seen how Canvas can make your writing and coding better, faster, and more fun. If you're curious how the model behind Canvas was trained, here's a brief overview of the model development steps. We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, when to make targeted edits, and when to fully rewrite. It also tries to understand the broader context of your document so it can provide precise feedback and suggestions. A key innovation in the Canvas model is that all of the post-training was done with synthetic data: we used techniques like distilling outputs from a stronger model, OpenAI o1-preview, to teach the model its core behaviors. This approach allowed us to rapidly improve writing quality and support new user interactions, all without relying on human-generated or human-collected data.

Our research team developed the following core behaviors for this model: triggering the canvas for writing and coding, generating diverse content types, making targeted edits, rewriting documents, providing inline critique, and other features. We measured progress with over 20 automated internal evaluations that we built, and we kept adding evaluations as we launched the product.

Canvas was developed by a group of designers, engineers, researchers, and product managers, and we worked together from the very beginning of the project all the way to launch. Along the way, we established a framework for developing the product and the model in tandem that I thought might be useful for you too. A general product research lifecycle for a feature like Canvas might look like this: develop a model behavior spec; design robust and trustworthy evaluations; iterate rapidly on prompts to establish a baseline performance, using techniques like in-context learning and few-shot examples; and finally, once you have conviction in the product feature, post-train the targeted behavior into the model. In Canvas specifically, that last step was done entirely with synthetic methods, including distillation. (I'll sketch what the prompted-baseline step could look like in code at the end of this section.)

Obviously, in reality everything is much messier and not as streamlined as the previous slide suggests. Once you establish a baseline, you may want to go back to your evaluations; and once you uncover new edge cases through those evaluations, you need to update your model behavior spec, which in turn changes your strategy for how to train the model. So in the real world, it's much more of an iterative development loop.

Let's dive deeper into what it takes to define model behavior. Defining model behavior requires a very nuanced, detail-oriented spec: thinking through edge cases and thinking about how the whole system interacts. In Canvas, for example, we had to consider how the tool interacts with other tools like search and DALL·E image generation. As models become more capable, the complexity of the system increases, and so the scope and complexity of the model behavior spec grow with it.
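To make the prompted-baseline step from the lifecycle above a bit more concrete, here is a minimal sketch of how you might establish a few-shot, zero-training baseline for the canvas-trigger decision and score it against a small labelled set. The system instructions, the labelled examples, and the use of gpt-4o here are my own illustrative assumptions, not the actual Canvas setup.

```python
# A minimal sketch of a prompted baseline for the "should we open a canvas?"
# decision, using few-shot in-context examples. The instructions, labels, and
# eval set below are illustrative assumptions, not the real Canvas pipeline.
from openai import OpenAI

client = OpenAI()

FEW_SHOT = [
    {"role": "system", "content": (
        "Decide whether the user's request should open a canvas "
        "(a side-by-side editing surface for longer writing or code). "
        "Answer with exactly one word: YES or NO."
    )},
    {"role": "user", "content": "Write a blog post about the history of coffee beans."},
    {"role": "assistant", "content": "YES"},
    {"role": "user", "content": "Help me cook a new recipe for dinner."},
    {"role": "assistant", "content": "NO"},
]

# A tiny hand-labelled eval set; in practice you would build a much larger,
# more diverse one and track trigger accuracy on every iteration.
EVAL_SET = [
    ("Draft a cover letter for a data analyst role.", "YES"),
    ("What's the capital of Australia?", "NO"),
]

def should_trigger(prompt: str) -> str:
    """Return the baseline's YES/NO trigger decision for one prompt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=FEW_SHOT + [{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()

correct = sum(should_trigger(p) == label for p, label in EVAL_SET)
print(f"Trigger accuracy: {correct}/{len(EVAL_SET)}")
```

The specific prompt matters less than the loop: every change to the prompt, and later to the model itself, gets scored against the same evaluation set, which is what makes the comparisons trustworthy.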
Personally, I have also learned that fixing language model behavior often involves applying principles familiar from software engineering, and I can share a few of them. Reproducing a model behavior is much like reproducing a software bug: once you can reproduce it, you can build a much more robust and trustworthy evaluation to optimize against later. Dissecting the problem, much like debugging in software, means isolating the different components of your system to find the root cause. And finally, you have to fix the problem holistically, without breaking other parts of the system. This is where the complexity of model behavior really shows up, because some bugs only surface during training, or once the model is serving real users, and you have to fix them as you keep iterating.

More specifically, let's dive into the behaviors the Canvas model was trained on. A key challenge was defining when to trigger a canvas, which we call the decision boundary. We taught the model to open a canvas for prompts like "write a blog post about the history of coffee beans," while avoiding over-triggering on general question-and-answer tasks like "help me cook a new recipe for dinner." For writing tasks, we prioritized improving correct triggers at the expense of correct non-triggers, reaching 83% compared to a baseline zero-shot GPT-4o with prompted instructions. It's worth noting, in the graph on the right, that the quality of such a baseline is highly sensitive to the prompt you use: with a different prompt, the baseline may still perform poorly, but in a different manner, for instance by being evenly inaccurate across coding and writing tasks, which produces a different distribution of errors and a different form of suboptimal performance. For coding, we intentionally biased the model against triggering, to avoid disrupting our power users, and we'll continue refining this based on user feedback.

A second challenge involved tuning the model's editing behavior once a canvas was triggered: specifically, deciding when to make a targeted edit versus rewriting the entire content. For a prompt like "change the second paragraph to be shorter," it's much better for the model to select just that second paragraph than to rewrite the whole document. So we trained the model to perform targeted edits when users explicitly select text, or when the instruction makes it clear what to select, and to favor rewrites otherwise. (There's a small illustrative sketch of this distinction at the end of the lesson.) As with triggering, this behavior is continually being refined as we develop the model further.

Finally, training the model to generate high-quality comments and suggestions required careful iteration. Unlike the first two behaviors, which adapt easily to automated evaluation with thorough manual reviews, measuring comment quality in an automated way is particularly challenging. So we used human evaluations to assess both comment quality and comment accuracy, that is, when to trigger suggestions and comments. Our Canvas model outperforms zero-shot GPT-4o with prompted instructions by 30% in accuracy and 16% in quality, showing that synthetic training significantly enhances response quality and behavior compared to zero-shot prompting with detailed instructions.

All right. We've covered some of the fundamentals of how we trained the Canvas model, though the focus of this course was, of course, to share some of the delightful use cases that you can try in your own writing and coding tasks and share with the rest of the deep learning community.
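And as promised, here is a small, hypothetical sketch of the targeted-edit-versus-rewrite distinction described above. The Edit structure and the way it is applied are purely my own illustration of the idea; they are not how the Canvas model actually represents or emits edits.

```python
# A minimal, hypothetical sketch of "targeted edit vs. full rewrite".
# The Edit dataclass and decision rule are my own illustration, not the
# actual representation used by the Canvas model.
from dataclasses import dataclass

@dataclass
class Edit:
    kind: str          # "targeted" or "rewrite"
    start: int | None  # character offsets of the span being replaced
    end: int | None
    replacement: str

def apply_edit(document: str, edit: Edit) -> str:
    if edit.kind == "rewrite":
        # The model regenerates the whole document.
        return edit.replacement
    # The model only touches the selected span, leaving the rest untouched.
    return document[:edit.start] + edit.replacement + document[edit.end:]

# "Change the second paragraph to be shorter" -> the preferred behaviour is a
# targeted edit scoped to that paragraph, for example:
doc = "Intro paragraph.\n\nA long, rambling second paragraph.\n\nConclusion."
start = doc.index("A long")
end = doc.index("\n\nConclusion.")
shorter = Edit(kind="targeted", start=start, end=end,
               replacement="A shorter second paragraph.")
print(apply_edit(doc, shorter))
```

The design point is simply that a targeted edit carries an explicit scope, so everything outside that scope is guaranteed to stay untouched, which is exactly what you want for a prompt like "change the second paragraph to be shorter."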