Can researchers stop AI making up citations?

OpenAI’s GPT-5 hallucinates less than previous models do, but cutting hallucination completely might prove impossible

OpenAI says GPT-5 has reduced the frequency of fake citations and other kinds of hallucination. Credit: Kirill Kudryavtsev/AFP via Getty
Artificial intelligence (AI) models are known to confidently conjure up fake citations. When the company OpenAI released GPT-5, a suite of large language models (LLMs), last month, it said it had reduced the frequency of fake citations and other kinds of ‘hallucination’, as well as ‘deceptions’, whereby an AI claims to have performed a task it hasn’t.
With GPT-5, OpenAI, based in San Francisco, California, is bucking an industry-wide trend, because newer AI models designed to mimic human reasoning tend to generate more hallucinations than do their predecessors. On a benchmark that tests a model’s ability to produce citation-based responses, GPT-5 beat its predecessors. But hallucinations remain inevitable, because of the way LLMs function.
“For most cases of hallucination, the rate has dropped to a level” that seems to be “acceptable to users”, says Tianyang Xu, an AI researcher at Purdue University in West Lafayette, Indiana. But in particularly technical fields, such as law and mathematics, GPT-5 is still likely to struggle, she says. And despite the improvements in hallucination rate, users quickly found that the model errs in basic tasks, such as creating an illustrated timeline of US presidents.
OpenAI is making “small steps that are good, but I don’t think we’re anywhere near where we need to be”, says Mark Steyvers, a cognitive science and AI researcher at the University of California, Irvine. “It’s not frequent enough that GPT says ‘I don’t know’.”
A feature, not a bug
Hallucinations are a result of the fundamental way in which LLMs work. As statistical machines, the models make predictions by generalizing on the basis of learnt associations, leading them to produce answers that are plausible, but sometimes wrong. Another issue is that, similar to a student scoring points for guessing on a multiple-choice exam, during training LLMs are rewarded for having a go rather than acknowledging their uncertainty, according to a preprint published by OpenAI on 4 September.
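The exam analogy can be made concrete with a toy simulation (a hypothetical sketch, not OpenAI’s actual training procedure): if an evaluation awards a point for each correct answer and nothing for wrong answers or for saying ‘I don’t know’, a model that guesses whenever it is unsure earns a higher expected score than one that abstains, so optimizing for such metrics rewards confident guessing.

```python
# Toy illustration only (assumed numbers, not OpenAI's training setup):
# with 1 point for a correct answer and 0 for a wrong answer or a blank,
# always guessing outscores admitting uncertainty.
import random

random.seed(0)

N_QUESTIONS = 10_000
N_CHOICES = 4        # four-option multiple-choice questions
P_KNOWN = 0.6        # assumed fraction of questions the model actually "knows"

def average_score(always_guess: bool) -> float:
    """Average score under a rule that never penalizes wrong answers."""
    score = 0
    for _ in range(N_QUESTIONS):
        if random.random() < P_KNOWN:
            score += 1                                  # known answer: correct
        elif always_guess:
            score += random.random() < 1 / N_CHOICES    # blind guess, right 1-in-4 times
        # an "I don't know" response earns nothing
    return score / N_QUESTIONS

print(f"always guess:      {average_score(True):.3f}")   # roughly 0.70
print(f"admit uncertainty: {average_score(False):.3f}")  # roughly 0.60
```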
Improvements have come from scaling up the size of LLMs — in terms of both the richness of their internal associations and the amount of data they are trained on, says Xu. But hallucinations are particularly prevalent in topics for which the model has scant training data or its underlying information is wrong, she says. Hallucinations can also happen when an AI tries to summarize or analyse papers that are too long for that model to process.
Eliminating hallucinations entirely is likely to prove impossible, says Mushtaq Bilal, a researcher at Silvi, a Copenhagen-based firm that makes an AI app to aid the creation of systematic reviews in science. “I think if it was possible, AI labs would have done it already.”
But reducing errors and getting a model to admit that it doesn’t know an answer have been “a pretty heavy focus” for OpenAI, says Saachi Jain, who manages the firm’s AI safety team. According to technical documents released with GPT-5, OpenAI concentrated on “training our models to browse effectively for up-to-date information”, as well as cutting hallucinations. The firm focused on reducing hallucinations in lengthy, open-ended responses to queries, because this best represents real-life use of ChatGPT, says Jain.
In one literature-review benchmark known as ScholarQA-CS, GPT-5 “performs well” when it is allowed to access the web, says Akari Asai, an AI researcher at the Allen Institute for Artificial Intelligence, based in Seattle, Washington, who ran the tests for Nature. In producing answers to open-ended computer-science questions, for example, the model performed marginally better than human experts, with a correctness score of 55% (based on measures such as how well its statements are supported by citations) compared with 54% for scientists, but just behind a version of the institute’s own LLM-based system for literature review, OpenScholar, which achieved 57%.
However, GPT-5 suffered when it was unable to get online, says Asai. The ability to cross-check claims against academic databases is a key feature of most AI-powered systems designed to help with literature reviews. Without internet access, GPT-5 fabricated or muddled half as many citations as did one of its predecessors, GPT-4o, but it still got them wrong 39% of the time, she says.
On the LongFact benchmark, which tests accuracy in long-form responses to prompts, OpenAI reported that GPT-5 hallucinated 0.8% of claims in responses about people or places when it was allowed to browse the web, compared with 5.1% for OpenAI’s reasoning model o3. Performance dropped when browsing was not permitted, with GPT-5’s error rate climbing to 1.4% compared with 7.9% for o3. Both models showed worse performance than did the non-reasoning model GPT-4o, which had an error rate of 1.1% when offline.
On other independent evaluations — such as the Hughes Hallucination Evaluation Model, which is run by the AI platform Vectara in Palo Alto, California, and looks at how often an LLM makes false claims when summarizing a document — rival models such as Google’s Gemini 2.0 slightly outperformed GPT-5, although both erred less than 1.5% of the time.
Learning to admit defeat
OpenAI also reported that the model was more honest in its responses than the company’s previous models were. When given a coding task that was impossible to complete — for example, owing to a lack of access to necessary hardware — GPT-5 claimed to have done the task 17% of the time, compared with 47% for o3. Although Jain wouldn’t give details of the firm’s methods, she hinted that, in later stages of the model’s training, OpenAI worked on rewarding it for answering honestly.
doi: https://doi.org/10.1038/d41586-025-02853-8
This story originally appeared on Nature. Author: Elizabeth Gibney