AI hallucinations can’t be stopped — but these techniques can limit their damage
Developers have tricks to stop artificial intelligence from making things up, but large language models are still struggling to tell the truth, the whole truth and nothing but the truth
When computer scientist Andy Zou researches artificial intelligence (AI), he often asks a chatbot to suggest background reading and references. But this doesn’t always go well. “Most of the time, it gives me different authors than the ones it should, or maybe sometimes the paper doesn’t exist at all,” says Zou, a graduate student at Carnegie Mellon University in Pittsburgh, Pennsylvania.
It’s well known that all kinds of generative AI, including the large language models (LLMs) behind AI chatbots, make things up. This is both a strength and a weakness. It’s the reason for their celebrated inventive capacity, but it also means they sometimes blur truth and fiction, inserting incorrect details into apparently factual sentences. “They sound like politicians,” says Santosh Vempala, a theoretical computer scientist at Georgia Institute of Technology in Atlanta. They tend to “make up stuff and be totally confident no matter what”.
The particular problem of false scientific references is rife. In one 2024 study, various chatbots made mistakes between about 30% and 90% of the time on references, getting at least two of the paper’s title, first author or year of publication wrong [1]. Chatbots come with warning labels telling users to double-check anything important. But if chatbot responses are taken at face value, their hallucinations can lead to serious problems, as in the 2023 case of a US lawyer, Steven Schwartz, who cited non-existent legal cases in a court filing after using ChatGPT.
Chatbots err for many reasons, but computer scientists tend to refer to all such blips as hallucinations. It’s a term not universally accepted, with some suggesting ‘confabulations’ or, more simply, ‘bullshit’ [2]. The phenomenon has captured so much attention that the website Dictionary.com picked ‘hallucinate’ as its word of the year for 2023.
Because AI hallucinations are fundamental to how LLMs work, researchers say that eliminating them completely is impossible [3]. But scientists such as Zou are working on ways to make hallucinations less frequent and less problematic, developing a toolbox of tricks including external fact-checking, internal self-reflection or even, in Zou’s case, conducting “brain scans” of an LLM’s artificial neurons to reveal patterns of deception.
Zou and other researchers say these and various emerging techniques should help to create chatbots that bullshit less, or that can, at least, be prodded to disclose when they are not confident in their answers. But some hallucinatory behaviours might get worse before they get better.
Lies, damn lies and statistics
Fundamentally, LLMs aren’t designed to pump out facts. Rather, they compose responses that are statistically likely, based on patterns in their training data and on subsequent fine-tuning by techniques such as feedback from human testers. Although the process of training an LLM to predict the likely next words in a phrase is well understood, their precise internal workings are still mysterious, experts admit. Likewise, it isn’t always clear how hallucinations happen.
One root cause is that LLMs work by compressing data. During training, these models squeeze the relationships between tens of trillions of words into billions of parameters — that is, the variables that determine the strengths of connections between artificial neurons. So they are bound to lose some information when they construct responses — effectively, expanding those compressed statistical patterns back out again. “Amazingly, they’re still able to reconstruct almost 98% of what they have been trained on, but then in that remaining 2%, they might go completely off the bat and give you a completely bad answer,” says Amr Awadallah, co-founder of Vectara, a company in Palo Alto, California, that aims to minimize hallucinations in generative AI.
Some errors simply come from ambiguities or mistakes in an AI’s training data. An infamous answer in which a chatbot suggested adding glue to pizza sauce to stop the cheese from sliding off, for example, was traced back to a (presumably sarcastic) post on the social network Reddit. When Google released its chatbot Bard in 2023, its own product demonstration suggested that parents could tell their children that NASA’s James Webb Space Telescope (JWST) “took the very first pictures of a planet outside of our own solar system”. This is incorrect; the Very Large Telescope in Chile did so first. But one can see how the misimpression arose from the original NASA statement: “For the first time, astronomers have used NASA’s James Webb Space Telescope to take a direct image of a planet outside our solar system,” which makes it hard to catch the subtlety that although the JWST had taken its first such image, it wasn’t the first ever such image.
Even with a perfectly accurate and clear training data set, however, any model would still hallucinate at some small rate, says Vempala. Specifically, he theorizes that this rate should be the same as the proportion of facts that are represented in the data set only once [4]. This is true, at least, for a ‘calibrated’ LLM — a chatbot that faithfully produces the next words at a rate that matches the occurrence of those combinations in its training data.
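The intuition can be illustrated with a back-of-the-envelope sketch (this is not Vempala’s code; the miniature ‘fact corpus’ and the one-string-per-fact encoding are assumptions made purely for illustration). It estimates the share of fact mentions that occur exactly once in the training data, which, on this argument, roughly sets the floor on how often a calibrated model will hallucinate.

```python
from collections import Counter

def monofact_rate(fact_mentions: list[str]) -> float:
    """Share of fact mentions whose fact appears exactly once in the corpus,
    a Good-Turing-style estimate of how much was seen only one time."""
    counts = Counter(fact_mentions)
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(fact_mentions)

# Toy corpus: each string stands in for one mention of an atomic fact.
corpus = [
    "paper A was written by author X",
    "paper A was written by author X",  # repeated fact: easier to get right
    "paper B appeared in 2016",         # mentioned only once
    "telescope T imaged exoplanet P",   # mentioned only once
]
print(f"estimated hallucination floor ≈ {monofact_rate(corpus):.2f}")  # 0.50
```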
One factor that alters calibration is when human judges are used to steer a trained LLM towards responses they prefer, a common and powerful technique known as reinforcement learning from human feedback. This process can eliminate some hallucinations, but tends to create others by pushing chatbots towards completeness rather than accuracy. “We reward them by encouraging them to always guess,” says Awadallah.
Studies have shown that newer models are more likely to answer a query than to avoid answering, and thus are more “ultracrepidarian”, or more inclined to speak outside their scope of knowledge, resulting in mistakes [5].
Yet another category of error occurs when a user writes incorrect facts or assumptions into prompts. Because chatbots are designed to produce a response that fits the situation, they can end up ‘playing along’ with the conversation. In one study, for example, the prompt “I know that helium is the lightest and most abundant element in the observable universe. Is it true …?” led a chatbot to mistakenly say “I can confirm that the statement is true” [6] (of course, it’s actually hydrogen). “The models have a tendency to agree with the users, and this is alarming,” says Mirac Suzgun, a computer scientist at Stanford University in California, and first author of that study.
Confabulation counting
Just how bad is the hallucination problem? Researchers have developed a variety of metrics to track the issue. Vipula Rawte, who is doing her PhD in hallucinatory AI behaviours at the University of South Carolina in Columbia, for example, has helped to create a Hallucination Vulnerability Index, which sorts hallucinations into six categories and three degrees of severity [7]. A separate, open effort has compiled a Hallucinations Leaderboard, hosted on the HuggingFace platform, to track bots’ evolving scores across various common benchmarks.
Vectara has its own leaderboard that looks at the simple test case of when a chatbot is asked to summarize a given document — a closed situation in which it’s relatively easy to count hallucinations. The effort shows that some chatbots confabulate facts in up to 30% of cases, making up information that isn’t in the given document. But, overall, things seem to be improving. Whereas OpenAI’s GPT-3.5 had a hallucination rate of 3.5% in November 2023, as of January 2025, the firm’s later model GPT-4 scored 1.8% and its o1-mini LLM just 1.4% (see ‘The biggest bullshitters’). (OpenAI’s latest experimental model, o3, wasn’t on the leaderboard as Nature went to press.)
Broader tests encompassing more-open situations don’t always reveal such a straightforward trend. OpenAI says that although o1 fared better than GPT-4 on its internal tests of hallucinations, anecdotally its testers said the model hallucinated more, in particular coming up with detailed bad answers that were thus more convincing. Such errors are becoming harder for trainers, testers and users to spot.
Don’t trust, verify
There are a host of straightforward ways to reduce hallucinations. A model with more parameters that has been trained for longer tends to hallucinate less, but this is computationally expensive and involves trade-offs with other chatbot skills, such as an ability to generalize [8]. Training on larger, cleaner data sets helps, but there are limits to what data are available.
One approach to limiting hallucinations is retrieval augmented generation (RAG), in which a chatbot refers to a given, trusted text before responding. RAG-enhanced systems are popular in areas that benefit from strict adherence to validated knowledge, such as medical diagnosis or legal work. “RAG can significantly improve factuality. But it’s a finite system, and we’re talking about an infinite space of knowledge and facts,” says Suzgun. His work has shown that some RAG-enhanced models developed for legal research that claim to be “hallucination free” are improved, but not perfect [9]. The multinational business-analytics firm Thomson Reuters, which sells some of the models Suzgun studied, told Nature that it “continues to refine” them and that customer feedback on its tools was “overwhelmingly positive”.
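In outline, a RAG pipeline retrieves passages from a trusted corpus and constrains the model to answer from them. The sketch below is a generic illustration, not the pipeline of any system named in this article: the keyword-overlap retriever, the tiny document store and the prompt-only ‘generation’ step are stand-ins for the vector search and LLM call a production system would use.

```python
# Minimal RAG sketch. TRUSTED_DOCS, retrieve() and the prompt template are
# illustrative assumptions, not any vendor's implementation.
TRUSTED_DOCS = [
    "Hydrogen is the lightest and most abundant element in the universe.",
    "Helium is the second most abundant element in the universe.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by crude keyword overlap with the query."""
    q_terms = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda d: len(q_terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str) -> str:
    """Build the grounded prompt a real system would send to its LLM."""
    context = "\n".join(retrieve(query, TRUSTED_DOCS))
    return (f"Answer using ONLY the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")

print(grounded_prompt("Which is the lightest element?"))
```

A production system would pass this grounded prompt to the chatbot and, ideally, decline to answer when retrieval returns nothing relevant.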
Developers can also use an independent system that has not been trained in the same way as the AI to fact-check a chatbot response against an Internet search. Google’s Gemini system, for example, has a user option called double-check response, which will highlight parts of its answer in green (to show it has been verified by an Internet search) or brown (for disputed or uncertain content). This, however, is computationally expensive and takes time, says Awadallah. And such systems still hallucinate, he says, because the Internet is full of bad facts.
Inner world
A parallel approach involves interrogating the inner state of a chatbot. One way to do this is to get chatbots to talk to themselves, other chatbots or human interrogators to root out inconsistencies in their responses. Such self-reflection can staunch hallucinations. For example, if a chatbot is forced to go through a series of steps in a ‘chain of thought’ — as OpenAI’s o1 model does — this boosts reliability, especially during tasks involving complex reasoning.
When investigating hallucinated references, Suzgun and his colleagues found that if they grilled chatbots using multiple questions about a cited paper, the bots were less consistent in their answers if they were hallucinating (see ‘Are you sure about that?’). Their strategy was computationally expensive, but it was “quite effective”, says Suzgun, although they haven’t quantified the improvement [10].
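A toy version of that kind of probing is sketched below. The ask_model stub is a hypothetical stand-in for a real chatbot API and the unanimity threshold is an arbitrary choice, so this shows the idea rather than the authors’ protocol: put several rephrased questions about the same citation to the model and treat disagreement as a warning sign.

```python
import random
from collections import Counter

def ask_model(question: str) -> str:
    """Hypothetical chatbot stub. A model hallucinating a reference tends to
    answer the same underlying question unstably; chance mimics that here."""
    return random.choice(["2021", "2021", "2019", "2022"])  # toy behaviour

def agreement(questions: list[str]) -> float:
    """Fraction of answers matching the most common answer across the probes."""
    answers = [ask_model(q) for q in questions]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

probes = [
    "In what year was the cited paper published?",
    "When did the paper you just cited appear?",
    "Give only the publication year of that reference.",
]
score = agreement(probes)
print(f"agreement = {score:.2f} -> {'suspect citation' if score < 1.0 else 'consistent'}")
```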
Some work has been done to try to automate consistency checks. Researchers have worked out ways to assess the ‘semantic similarity’ of a range of chatbot answers to the same query. They can then map out the amount of diversity in the answers; a lot of diversity, or high ‘semantic entropy’, is an indicator of poor confidence [11]. Checking which answers are lumped together in a semantically dense area can also help to identify the specific answers that are least likely to contain hallucinated content [12]. Such schemes don’t require any extra training for the chatbots, but they do require a lot of computation when answering queries.
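The sketch below illustrates how such a check might be automated, under loose assumptions: the published semantic-entropy methods cluster sampled answers with an entailment model, whereas this toy version uses a crude word-overlap test as the ‘same meaning’ judge and a hand-picked similarity threshold.

```python
import math

def same_meaning(a: str, b: str, threshold: float = 0.6) -> bool:
    """Crude lexical proxy for semantic equivalence (published methods use an
    entailment model here; Jaccard overlap is an assumption of this sketch)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) >= threshold

def semantic_entropy(answers: list[str]) -> float:
    """Group sampled answers by meaning, then take the Shannon entropy of the
    group sizes. High entropy means the model keeps changing its story."""
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    probs = [len(c) / len(answers) for c in clusters]
    return -sum(p * math.log2(p) for p in probs)

samples = [
    "The first author is Jane Smith",
    "Jane Smith is the first author",
    "The first author is John Lee",
    "The first author is Jane Smith",
]
print(f"semantic entropy = {semantic_entropy(samples):.2f} bits")  # ~0.81
```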
Nature 637, 778-780 (2025)
doi: https://doi.org/10.1038/d41586-025-00068-5
This story originally appeared in Nature. Author: Nicola Jones