Image understanding with GPT-4o can perform well, but it often needs chain-of-thought prompting, few-shot examples, or fine-tuning to achieve the best results. Conversely, o1 performs well at understanding images out of the box. This is due to the test-and-learn approach it follows in its reasoning, which gives it multiple chances to detect hallucinations before providing an answer. One use case that is emerging is to incur the latency and cost hit for o1 upfront, preprocessing the image and indexing it with rich details so that it can be used for Q&A later.

Let's have some fun! For this image reasoning task, we're going to use the org structure of a fictional organization. This is the kind of nuanced diagram requiring spatial reasoning that 4o would typically hallucinate on, but we're finding that o1 performs much better, and it's because of that testing-and-learning approach we referred to earlier. It will come up with a first impression of what the diagram is telling it and iterate until it thinks it has a decent outline of its purpose.

So we're going to start off fairly basic and just ask o1 to tell us what this is. Then we're going to move on to a use case which we're finding customers use practically in the real world pretty often: you use o1 to do image understanding and extract a detailed JSON that describes the image and what's going on in it in a lot of detail, and then you do text-only follow-ups to interpret that information. That way you're not paying the extra cost and latency of processing the image every time; you effectively preprocess it with o1 in a very high-quality, detailed manner, and then use the result for follow-up Q&A. So let's head on and see how this works practically.

As always, we'll begin with our imports, bringing in our libraries and any variables that we need. You'll see here our same standard libraries. We've stored our vision request in a utils file, which we'll import here. And for this we're going to use the new mainline o1 model, as that one is capable of image reasoning.

To get an understanding of the image we're going to process, you should open up the org chart file we've provided. You'll be able to see that we have a CEO at the top, their C-suite, their managers, and then the reports of those managers. You'll begin with the simple question of "What is this?" so we can get a detailed rundown of what's contained in that org chart image. We'll kick that off and display the contents, and the model gives a detailed rundown: it's an organizational structure chart, it's called out the different levels of hierarchy, and it's given a brief description of how the chart is organized and what its purpose is. This is informative, but not super useful. So what we'll do next is process this into data that we can use for follow-up questions, so that we can build analysis on top of this org chart and understand how these different roles hang together.

Before you begin that task, you should understand the minor differences between the o1 vision function that we've defined and the typical o1 calls that we've made in the previous lessons. You'll see that the content we're supplying in the messages now has two objects inside it. One is the text, as before. The second is an image URL, and in this case it's actually a local image which we've base64 encoded.
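As a rough illustration, here is a minimal sketch of what a vision request helper like this might look like, assuming the OpenAI Python SDK. The names `o1_vision_request`, `encode_image`, `O1_MODEL`, and `org_chart.png` are placeholders for this sketch, not the exact names used in the course's utils file:

```python
import base64
from openai import OpenAI

client = OpenAI()
O1_MODEL = "o1"  # assumed identifier for the mainline o1 model

def encode_image(image_path: str) -> str:
    """Read a local image and return its base64-encoded contents."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def o1_vision_request(prompt: str, image_path: str) -> str:
    """Send a text prompt plus a base64-encoded local image to o1."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model=O1_MODEL,
        messages=[
            {
                "role": "user",
                # Two content objects: the text, and the image itself.
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{base64_image}"
                        },
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

# First pass: ask for a plain description of the org chart.
print(o1_vision_request("What is this?", "org_chart.png"))
```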
But if you do have an image hosted somewhere, feel free to reference it as a URL instead.

What we're about to work through here is where we start to really see the improvement from 4o to o1 in terms of the quality of image understanding we can achieve. What we saw previously with 4o is that it could generally give a high-level description of what was in an image. But once you start to ask nuanced questions involving spatial reasoning, like "What does that arrow point to?" or "Who reports to whom in that org chart?", 4o would give inconsistent performance, and we'd typically need few-shot examples or fine-tuning to achieve a decent level of performance. What we should see here with o1 is that, out of the box, it will perform pretty well and be able to reduce our reasonably complex org chart to a simple JSON array, which we can then use for analysis.

To accomplish that, we've got a structured prompt here, again using the principles we learned in lesson two. We have some instructions establishing a consulting assistant who processes org data, and we'd like it to extract the org hierarchy. We've given it a specification for the JSON that we want: it should make up an arbitrary ID for every person, give their name and their role, and then give us an array of IDs they report to and an array of IDs that report to them. Once we have this, we have something that codifies the relationships in that image as data and enables ongoing processing. (Sketches of this prompt and the follow-up call appear after this section.)

So you'll run this and receive the results. You'll execute the cell, and we print out the prompt, so there it is, detailed here. Now you can feed that prompt into an o1 request with the image. If we run that and print the results, we should receive an array of JSON dictionaries which describe the different people in that org chart and how they relate to each other.

You can see in front of you the data representation of that org chart. We've got our array of dictionaries, each one containing that arbitrary ID. If we reconcile the first one, we've got Juliana Silva, CEO, ID 1. We can see that she reports to no one; she's the CEO. That's correct. And her reports are numbers 2, 3, and 4. If we check 2, 3, and 4, those are indeed the CFO, CTO, and COO. Great.

So now that you've reduced this org chart to data, you can use it for analysis. Let's step on and do some Q&A on top of this data and see if o1 is able to use the data that we've processed to accurately answer some questions about the org chart. You can start by loading the o1 response as JSON, then create a prompt, add this data to it, and ask some questions. In this case you've created an analysis prompt: "You are an org chart expert assistant. Your role is to answer any org chart questions with your org data." Then there's an org data XML tag where we're going to interpolate the org data.

Before you ask your analysis questions of the org data, you'll need to initialize a fresh OpenAI client. The reason we're not using the o1 vision request is that we're simply sending a text-only request now. Now that we've preprocessed that image, we don't need to send the image every time; we're just going to use the data that we extracted for our Q&A. Here you have a simple o1 request: you've got that analysis prompt that you had at the beginning, and then a structured question, with two questions contained within it.
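To make the extraction step concrete, here's a hedged sketch of what that structured prompt and the call might look like. The exact wording and field names (`id`, `name`, `role`, `reports_to`, `direct_reports`) are assumptions based on the spec described above, and `o1_vision_request` is the placeholder helper from the earlier sketch:

```python
# A sketch of the structured extraction prompt described above.
extraction_prompt = """
You are a consulting assistant who processes org data.

Extract the org hierarchy from the attached org chart image and return it
as a JSON array, one object per person, with this specification:
- id: an arbitrary unique ID you assign to each person
- name: the person's name
- role: the person's role or title
- reports_to: an array of the IDs this person reports to
- direct_reports: an array of the IDs that report to this person

Return only the JSON array, with no surrounding commentary.
"""

# Feed the prompt plus the image into the same vision helper from earlier.
org_chart_json = o1_vision_request(extraction_prompt, "org_chart.png")
print(org_chart_json)
```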
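And here is a minimal sketch of the text-only follow-up, reusing the `client` and `O1_MODEL` placeholders from the first sketch. Note that the image is never re-sent; only the extracted JSON travels with each question:

```python
import json

# Parse the o1 response into Python data. This assumes the model returned
# bare JSON; in practice you may need to strip markdown code fences first.
org_data = json.loads(org_chart_json)

# Text-only analysis prompt with the org data interpolated into an XML tag.
analysis_prompt = f"""
You are an org chart expert assistant. Your role is to answer any org chart
questions using the org data provided below.

<org_data>
{json.dumps(org_data, indent=2)}
</org_data>
"""

question = (
    "Who has the highest-ranking reports, "
    "and which manager has the most direct reports?"
)

# A plain text-only o1 request: no image, just the preprocessed data.
response = client.chat.completions.create(
    model=O1_MODEL,
    messages=[{"role": "user", "content": analysis_prompt + "\n" + question}],
)
print(response.choices[0].message.content)
```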
So: who has the highest-ranking reports, and which manager has the most reports? If you execute that, you'll see the results. You can display them as markdown here and interpret the answers we've gotten. First of all, Juliana Silva, the CEO, has the highest-ranking direct reports. That is correct: all of her reports are C-level executives. Which manager has the most reports? First of all, it's successfully filtered for only the folks with the role of manager, and indeed these two do have the most direct reports. So that is a correct answer to the question, and a conclusion to our image understanding task.

Before we close off, I want to share one more example so that you can take what you've learned and test it out on a fresh image and domain. In the folder you can see a photo that we've included of an entity relationship diagram. This is a great use case for image understanding. Imagine all the data warehouses you've dealt with, each with a complex ERD that you needed one of the local data scientists or owners of the warehouse to talk you through. Now we can provide it to o1 with image understanding and get it to parse it for you.

So the challenge for you is to think of a few use cases for this entity relationship diagram. For example, you may want to ask o1 with vision to generate a table of order records with IDs that link to these product and client tables, and then get it to generate SQL to actually query the three tables. Generation of synthetic data is a great use case that we've used vision for in these sorts of cases on customer sites. We're very interested to see what use cases you come up with, and very excited to see what you use o1 with image understanding for.

Look forward to catching up with you in the next one, where we'll be going into meta prompting and how to use o1 to automatically optimize your prompts. Look forward to seeing you there.