Artificial-intelligence technologies are being deployed rapidly across industries, yet most organizations lack even basic guidelines to assess the tools’ effects

Why evaluating the impact of AI needs to start now

Artificial intelligence (AI) has the potential to be a transformative force in science, society and the economy. However, much remains unknown about the broader implications of widespread AI use.

For instance, AI technologies can enhance as well as impede the performance of knowledge workers. They can boost productivity for routine tasks, such as idea generation and writing, but might introduce bottlenecks and errors in more-complex tasks when AI advice is adopted blindly1. Chatbots can aid individual people’s creativity, yet overreliance on them might reduce the overall diversity of original ideas2.

Understanding how users engage with the technology — and the outcomes that follow — requires careful, systematic study to differentiate between the positive and negative impacts. For instance, in education, it’s crucial to test whether students use AI tools to deepen their understanding of a topic, or whether they simply use the technology as a crutch, hindering real learning.

Controlled studies can reveal where AI truly adds value and when its risks outweigh the benefits. Randomized controlled trials — in which a randomly selected group of participants receives an intervention while a control group operates under business-as-usual conditions — might be particularly valuable for assessing AI's impact in public-sector settings. For example, can a chatbot deliver factual and actionable advice to citizens seeking tax-related information, ultimately promoting accurate and timely tax filing? Will partially automating eligibility assessments for social benefits lead to fair, efficient outcomes at reduced cost? Understanding when, how and for whom AI works is crucial to ensuring positive results and a meaningful return on investment.
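
As a rough illustration of the design, the randomization step of such a trial can be done programmatically. The Python sketch below is a minimal example, not a prescribed method; the function name, seed and participant numbers are hypothetical.

```python
import random

def assign_arms(participant_ids, seed=2025):
    """Randomly split participants into an intervention arm and a
    business-as-usual control arm (50/50)."""
    rng = random.Random(seed)        # fixed seed keeps the assignment reproducible and auditable
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return set(shuffled[:half]), set(shuffled[half:])

# Hypothetical example: 10,000 citizens contacting a tax-information service
intervention, control = assign_arms(range(10_000))
print(len(intervention), len(control))   # 5000 5000
```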

Here, we introduce an AI impact-evaluation framework3 developed for the UK public sector, providing a potential blueprint for public organizations as well as the private sector.

The need to test

What do we mean by evaluating AI? Evaluation refers to assessing an intervention’s design, implementation and impact4 — in other words, understanding how and to what extent it changes the outcome of interest.

AI companies currently undertake model evaluations — testing the performance of large language models (LLMs) against benchmarks to assess their capabilities in areas such as language, mathematics, reasoning and problem-solving. Furthermore, over the past 18 months, leading AI-security institutes in the United Kingdom, the United States and Japan have advocated for testing frontier AI models to ensure they are safe to use before public release. However, although model evaluation and safety protocols are necessary, they are not sufficient.

This is because evaluating a model’s technical performance is not the same as assessing its real-world economic and societal impacts. For instance, many organizations now use customized LLMs as internal chatbots to help employees to access organization-wide materials or summarize large volumes of information — from meeting notes and market research to sector-wide consultations. Others are using AI tools to create slide decks, reports and business plans.

Conventional AI-model evaluation will ensure that the outputs are reasonably accurate and safe. But such assessments do not tell us whether these tools improve users' decision-making, increase their efficiency or redirect their time towards more-useful activities. Nor do they tell us whether, in the public sector, they lead to improved services and better outcomes for citizens.

Although some organizations conduct small pilot studies and collect user feedback, such tests rarely have the quality, scale and independence required. Our guidance recommends incorporating evaluation into the design of the AI tool itself.

Because most AI tools are hosted online, it’s relatively easy to test new features by comparing how various user groups respond5. For example, a government website might randomly show some users a new LLM-powered interactive chatbot, whereas others continue to use a simpler, rules-based one. The new tool’s impact can then be assessed by tracking whether users in the LLM group are less likely to request human assistance or call the help centre — signs that their queries are being resolved more effectively than those of the other group.
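
As a minimal sketch of how that comparison might be run, the Python snippet below contrasts the share of users in each randomized group who went on to call the help centre, using a standard two-proportion z-test. The function name and all counts are invented for illustration, not drawn from any real deployment.

```python
from math import sqrt

def compare_helpline_rates(calls_llm, n_llm, calls_rules, n_rules):
    """Two-proportion z-test: did users shown the LLM chatbot call the
    help centre less often than users of the rules-based chatbot?"""
    p_llm, p_rules = calls_llm / n_llm, calls_rules / n_rules
    pooled = (calls_llm + calls_rules) / (n_llm + n_rules)
    se = sqrt(pooled * (1 - pooled) * (1 / n_llm + 1 / n_rules))
    return p_llm, p_rules, (p_llm - p_rules) / se

# Invented outcome counts logged for the two randomized groups
p_llm, p_rules, z = compare_helpline_rates(
    calls_llm=310, n_llm=5_000,      # 6.2% of the LLM-chatbot group called the help centre
    calls_rules=410, n_rules=5_000,  # 8.2% of the rules-based group did
)
print(f"LLM group: {p_llm:.1%}, rules-based group: {p_rules:.1%}, z = {z:.2f}")
```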

People can monitor their blood glucose levels continuously using a smartphone app. Credit: Getty

Surveys can provide useful feedback on user experience for both groups, helping evaluators to understand what worked and why. But it's important not to rely on self-reports alone. Observing actual behaviour — what people do, not just what they say — provides stronger evidence of impact.

More-complex AI projects require proportionately scaled evaluation designs. Consider a hypothetical scenario in which the UK National Health Service provides people with an AI-powered wearable device to help them to manage a chronic condition. The technology monitors the person’s health and sends automated alerts to their physician if it detects signs that medical attention might be needed. A robust evaluation could randomly assign the AI-enabled wearable to some people, whereas others (serving as a control group) could receive a version without such features.

Key outcomes might include a reduction in the number of hospital admissions through more timely preventative care. But the evaluation should also explore potential unintended consequences — for instance, whether physicians become overly reliant on the AI tool and reduce the frequency of in-person consultations or other standard care practices.
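
To make the point concrete, an analysis of such a trial would track both the primary outcome and a marker of possible over-reliance for each arm. The Python sketch below does this with invented per-patient records; the field names and numbers are assumptions for illustration only.

```python
from statistics import mean

def summarize_arm(records):
    """Summarize one trial arm: 'records' holds per-patient counts of hospital
    admissions (primary outcome) and in-person consultations (to flag possible
    over-reliance on the AI alerts) over the study period."""
    return {
        "admissions_per_patient": mean(r["admissions"] for r in records),
        "consultations_per_patient": mean(r["consultations"] for r in records),
    }

# Invented per-patient records for the two randomized arms
ai_wearable = [{"admissions": 0, "consultations": 3}, {"admissions": 1, "consultations": 2}]
standard    = [{"admissions": 1, "consultations": 4}, {"admissions": 2, "consultations": 5}]

print("AI wearable arm:", summarize_arm(ai_wearable))
print("Standard arm:   ", summarize_arm(standard))
```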

Continuous evaluation

Our AI-evaluation guidance not only builds on the UK government's Magenta Book4, which outlines its evaluation standards, but also calls for updating conventional approaches. Policies are typically evaluated only once. For AI, however, this needs to change.

AI models are fast-evolving, and their output and performance can change quickly. Many AI systems adapt through user interactions, meaning that their behaviour can change over time — or differ across user groups. Evaluation strategies must therefore be as dynamic and responsive as the technology itself.

Our guidance highlights the need for continuous, iterative evaluation. Instead of relying on a single assessment, setting up regular checkpoints — or better yet, adopting a system that continuously updates the evidence base as data become available — can be much more effective. This kind of flexible approach enables decision makers to adapt quickly and make informed choices as the technology and its impacts evolve.
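
One simple way to operationalize such checkpoints is to re-estimate the effect each time a fixed batch of new observations arrives, as in the Python sketch below. The batch size, outcome definition and simulated data are all assumptions chosen for illustration, not part of the guidance itself.

```python
import random

def rolling_checkpoints(outcomes_treat, outcomes_ctrl, batch=1_000):
    """Re-estimate the treatment effect at regular checkpoints as data accrue,
    instead of running a single analysis at the end of the study."""
    n_max = min(len(outcomes_treat), len(outcomes_ctrl))
    estimates = []
    for n in range(batch, n_max + 1, batch):
        effect = sum(outcomes_treat[:n]) / n - sum(outcomes_ctrl[:n]) / n
        estimates.append((n, effect))        # (users per arm so far, estimated uplift)
    return estimates

# Simulated binary outcomes (1 = query resolved without human help), arriving over time
rng = random.Random(0)
treat = [int(rng.random() < 0.82) for _ in range(5_000)]
ctrl = [int(rng.random() < 0.75) for _ in range(5_000)]
for n, effect in rolling_checkpoints(treat, ctrl):
    print(f"after {n} users per arm: estimated uplift = {effect:+.3f}")
```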

Nature 643, 910-912 (2025)

doi: https://doi.org/10.1038/d41586-025-02266-7

This story originally appeared on Nature. Author: Oliver P. Hauser