Whereas o1 was a major technological advancement, GPT-5 is, above all else, a refined product. During a press briefing, Sam Altman compared GPT-5 to Apple’s Retina displays, and it’s an apt analogy, though perhaps not in the way that he intended. Much like an unprecedentedly crisp screen, GPT-5 will furnish a more pleasant and seamless user experience. That’s not nothing, but it falls far short of the transformative AI future that Altman has spent much of the past year hyping. In the briefing, Altman called GPT-5 “a significant step along the path to AGI,” or artificial general intelligence, and maybe he’s right—but if so, it’s a very small step.
Take the demo of the model’s abilities that OpenAI showed to MIT Technology Review in advance of its release. Yann Dubois, a post-training lead at OpenAI, asked GPT-5 to design a web application that would help his partner learn French so that she could communicate more easily with his family. The model did an admirable job of following his instructions and created an appealing, user-friendly app. But when I gave GPT-4o an almost identical prompt, it produced an app with exactly the same functionality. The only difference was that it wasn’t as aesthetically pleasing.
Some of the other user-experience improvements are more substantial. Having the model rather than the user choose whether to apply reasoning to each query removes a major pain point, especially for users who don’t follow LLM advancements closely.
And, according to Altman, GPT-5 reasons much faster than the o-series models. The fact that OpenAI is releasing it to nonpaying users suggests that it’s also less expensive for the company to run. That’s a big deal: Running powerful models cheaply and quickly is a tough problem, and solving it is key to reducing AI’s environmental impact.
OpenAI has also taken steps to mitigate hallucinations, which have been a persistent headache. OpenAI’s evaluations suggest that GPT-5 models are substantially less likely to make incorrect claims than their predecessor models, o3 and GPT-4o. If that advancement holds up to scrutiny, it could help pave the way for more reliable and trustworthy agents. “Hallucination can cause real safety and security issues,” says Dawn Song, a professor of computer science at UC Berkeley. For example, an agent that hallucinates software packages could download malicious code to a user’s device.
GPT-5 has achieved the state of the art on several benchmarks, including a test of agentic abilities and the coding evaluations SWE-Bench and Aider Polyglot. But according to Clémentine Fourrier, an AI researcher at the company Hugging Face, those evaluations are nearing saturation, which means that current models have achieved close to maximal performance.
“It’s basically like looking at the performance of a high schooler on middle-grade problems,” she says. “If the high schooler fails, it tells you something, but if it succeeds, it doesn’t tell you a lot.” Fourrier says she would be impressed if the system achieved a score of 80% or 85% on SWE-Bench. Instead, it managed only 74.9%.
Ultimately, the headline message from OpenAI is that GPT-5 feels better to use. “The vibes of this model are really good, and I think that people are really going to feel that, especially average people who haven’t been spending their time thinking about models,” said Nick Turley, the head of ChatGPT.
Vibes alone, however, won’t bring about the automated future that Altman has promised. Reasoning felt like a major step forward on the way to AGI. We’re still waiting for the next one.