Medical AI can transform medicine — but only if we carefully track the data it touches

The uncontrolled deployment of machine learning in medicine can distort patient information and sacrifice long-term data reliability for short-term benefits
The practice of modern medicine is built on pattern recognition — whether in a patient’s history, physical examination, laboratory results or response to treatment. A skilled physician can identify crucial patterns early and distinguish them from others that appear deceptively similar.
But some patterns are too chaotic, too subtle or too fleeting to raise red flags. No doctor can reliably catch early-stage pancreatic cancer from routine blood tests, for example. Answers to many questions of profound importance that demand knowledge of the future1, such as whether a tumour will spread or how long a person might live, are thus subjective — often coming down to a physician’s cumulative experience or ‘gut feeling’.
One approach to reducing subjectivity in medicine is through supervised machine learning — a technique based on creating computer models that can detect patterns by learning from labelled data. For instance, by examining many mammogram images that either include or lack tumours, models can learn how to recognize the statistical features that tend to go with one label or the other, even when those features aren’t obvious to the human eye.
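To make the idea concrete (and nothing more than that), here is a minimal sketch in Python using scikit-learn: a simple classifier is fitted to synthetic, labelled feature vectors standing in for measurements derived from mammograms, then asked to score a previously unseen case. Every number in it is invented for illustration.

```python
# Minimal sketch of supervised learning on labelled data (synthetic example).
# Assumes scikit-learn is installed; the features stand in for image-derived measurements.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic 'mammogram features': tumour-present cases come from a slightly shifted distribution.
n = 200
features_no_tumour = rng.normal(loc=0.0, scale=1.0, size=(n, 5))
features_tumour = rng.normal(loc=0.8, scale=1.0, size=(n, 5))
X = np.vstack([features_no_tumour, features_tumour])
y = np.array([0] * n + [1] * n)  # 0 = no tumour, 1 = tumour

model = LogisticRegression().fit(X, y)

# Score a previously unseen case: the model returns a probability, not a diagnosis.
new_case = rng.normal(loc=0.8, scale=1.0, size=(1, 5))
print(f"Predicted probability of tumour: {model.predict_proba(new_case)[0, 1]:.2f}")
```

The model never sees an explicit rule; it only sees examples and labels, which is precisely what makes the quality and stability of those labels so important.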
Unsurprisingly, interest in predictive modelling has exploded. In cases involving tumour spread, organ failure or narrow treatment windows, accurate knowledge of how someone’s condition might unfold can conserve resources, reduce suffering and save lives. In 2024 alone, the citation database PubMed indexed more than 26,000 studies mentioning artificial intelligence (AI), machine learning or deep learning in patient health care and clinical medicine. The global market for AI in health care is projected to exceed US$46 billion by the end of this year, and $200 billion by 2030.
Yet any model, no matter how sophisticated, is still a source of uncertainty. If it underestimates risk, it can lead clinicians to overlook serious concerns. And if it overestimates risk, it can prompt unnecessary tests and interventions, and waste resources.
A model’s usefulness is typically judged by how well it generalizes to previously unseen data, which is treated as a proxy for real-world performance. But there’s a catch: in learning to predict outcomes, models also absorb the clinical decisions, relationships and biases that are baked into the data used to train them. Supervised learning relies on the assumption that these conditions, including the biases, will remain stable during model use. Without this foundation, things fall apart.
For example, ‘Is this patient at risk of dying tomorrow?’ is a different question in a rural outpatient clinic than in a cardiac intensive-care unit, and a model trained in one setting is likely to perform poorly in the other.
Current best practices2 emphasize transparency in data sources and encourage testing models in the environments where they will be used. Still, given that many medical data sets are small, biased or tied to narrow populations, the odds that models will underperform or stop working altogether remain uncomfortably high.
However, the greatest threat to the widespread adoption of predictive modelling in health care could come not from the instances in which the model fails outright, but rather from those in which it succeeds in delivering results.
Data contamination
Wherever machine learning is used in a health-care setting, it is typically built on the foundation of the electronic health record (EHR) for patients. Although EHR adoption varies globally, it is deeply embedded in many high-income countries, where it serves as both the source of training data for predictive models and the system through which those predictions are returned to clinicians. At its core, the EHR is a dynamic database that continuously logs almost all aspects of patient care — including lab results, medication, clinical notes and key events such as infections or deaths.
By expanding the amount of patient data available, the EHR enables a standardized workflow: data are pulled from the EHR to train models, and once the models are deployed, they analyse fresh patient data to predict potential health risks. These predictions can guide clinical decisions — for example, prompting a physician to order a chest X-ray or to begin administering antibiotics if a model flags a high risk of pneumonia, even before classic symptoms fully develop.
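Sketched below, purely as an illustration rather than any real system's interface, is what the deployment half of that workflow can look like: an already-fitted model scores a patient's latest values and raises a flag when the predicted risk crosses a threshold. The features, the stand-in training data and the 0.7 cut-off are all hypothetical.

```python
# Hypothetical deployment step: score fresh EHR values and flag high-risk patients.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in training so the sketch runs end to end (in practice the model would be
# trained on historical EHR extracts, not random numbers).
rng = np.random.default_rng(1)
X_hist = rng.normal(size=(500, 4))                         # four illustrative vital signs / lab values
y_hist = (X_hist[:, 0] + X_hist[:, 1] > 1).astype(int)     # synthetic 'pneumonia' label
trained_model = LogisticRegression().fit(X_hist, y_hist)

ALERT_THRESHOLD = 0.7  # assumed cut-off; real thresholds are set clinically

def check_patient(latest_values: np.ndarray) -> None:
    """Score one patient's latest EHR values and alert if the predicted risk is high."""
    risk = trained_model.predict_proba(latest_values.reshape(1, -1))[0, 1]
    if risk >= ALERT_THRESHOLD:
        print(f"Pneumonia risk {risk:.2f}: consider chest X-ray / early antibiotics")
    else:
        print(f"Pneumonia risk {risk:.2f}: no alert")

check_patient(rng.normal(size=4))
```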
But the EHR is also the destination of models' predictions — and of the consequences of those predictions. Take, for example, a model designed to detect early signs of sepsis. Ideally, the physician is alerted and takes timely action, administering antibiotics or fluids to prevent the condition from progressing. This is exactly the kind of impact we want from AI in health care. Sepsis is notoriously hard to catch early and has a mortality rate of 30–40%, so swift intervention can save lives.
But therein lies the rub: because the physician intervened, the patient doesn’t develop sepsis. As a result, the pattern the model flagged — originally linked to sepsis — is now recorded in the EHR as being associated with a non-septic outcome. This creates a ‘contaminated association’3 in the data, in which warning signs of sepsis seem to lead to good outcomes, simply because of successful intervention. As these associations accumulate, they begin to erode the reliability of existing and even future models.
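A toy simulation makes the feedback loop explicit. In the sketch below (every value is invented), patients whose features genuinely signal sepsis are flagged, treatment succeeds in most flagged cases, and the outcome written back to the record is the post-intervention one, so the warning pattern ends up labelled 'no sepsis' for most of the very patients it correctly identified.

```python
# Toy simulation of a contaminated association (all values are invented).
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

# A single 'warning sign' feature: high values genuinely precede sepsis.
warning_sign = rng.uniform(0, 1, size=n)
would_develop_sepsis = warning_sign > 0.7            # ground truth without intervention

# A deployed model alerts on the same pattern, and treatment succeeds 80% of the time.
alert = warning_sign > 0.7
treated_successfully = alert & (rng.uniform(0, 1, size=n) < 0.8)

# What the EHR actually records after care is delivered:
recorded_sepsis = would_develop_sepsis & ~treated_successfully

contaminated = alert & ~recorded_sepsis
print(f"Patients with the warning pattern: {alert.sum()}")
print(f"...recorded as developing sepsis: {recorded_sepsis[alert].sum()}")
print(f"...recorded as sepsis-free despite the warning pattern: {contaminated.sum()}")
```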
Over time, even well-performing AI models can degrade. Shifts in patient demographics, evolving standards of care, new medications or changes in clinical practice can all cause a model’s predictions to become less accurate — a phenomenon known as model drift.

A physician in a French hospital studies an X-ray in which an artificial-intelligence model has flagged possible fractures. Credit: Damien Meyer/AFP/Getty
Retraining models on newer, more representative data is widely considered the best way to recover performance4. But as the EHR database gets corrupted with false associations, retraining becomes effectively impossible. The data set used to train the model now contains a pattern that implies sepsis, but also ‘not-sepsis’. This is the equivalent of teaching addition to a child by telling them that two plus two is four. Sometimes. At other times it’s three, but only when it’s not five3.
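Continuing the same toy set-up, the sketch below (again, all numbers invented) retrains a classifier on those post-deployment labels. Because identical warning patterns now carry both 'sepsis' and 'no sepsis' labels, the retrained model assigns far less risk to the classic warning pattern than the original did.

```python
# Toy illustration of retraining on contaminated labels (all values invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5_000
warning_sign = rng.uniform(0, 1, size=(n, 1))
true_sepsis = warning_sign[:, 0] > 0.7

# Original model, trained before deployment on uncontaminated outcomes.
original = LogisticRegression().fit(warning_sign, true_sepsis.astype(int))

# After deployment: alerts on the same pattern are treated, and 80% of treated
# patients never develop sepsis, so the EHR records them as negative.
treated = true_sepsis & (rng.uniform(0, 1, size=n) < 0.8)
recorded_sepsis = true_sepsis & ~treated

# Retraining on the contaminated record.
retrained = LogisticRegression().fit(warning_sign, recorded_sepsis.astype(int))

high_risk = np.array([[0.9]])   # a patient showing the classic warning pattern
print(f"Original model's sepsis probability:  {original.predict_proba(high_risk)[0, 1]:.2f}")
print(f"Retrained model's sepsis probability: {retrained.predict_proba(high_risk)[0, 1]:.2f}")
```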
Serious conditions such as pneumonia, acute kidney injury (AKI) and sepsis often occur together during a single illness or hospital stay. A model that successfully prevents one of these conditions might indirectly prevent the others as well. This introduces misleading associations into the EHR — not just for current models, but for those yet to be built3.
Things get even more complicated when multiple AI models are used in the same clinical setting. For example, one model might predict the risk of AKI, while another might forecast blood clots. These are different conditions, but both rely on the same lab values, such as measurements of the waste product creatinine, blood platelets or inflammatory markers. If a physician responds to the AKI alert by adjusting fluids or medications, that could render the predictions of the blood-clot model obsolete, or unreliable. In this way, an intervention triggered by one model can quietly disrupt another, even if they are focused on entirely different outcomes3.
Higher-order effects
Current approaches to predictive modelling in health care don’t account for how models interact with each other or with clinical decision-making. This raises serious questions about some of the field’s core practices, starting with how researchers monitor model performance after deployment.
If a model helps to prevent an adverse event, the outcomes it predicts do not occur — for example, patients do not die of sepsis — and its measured real-world performance might appear to decline5. That said, a drop in performance could also mean that the model isn’t working well in practice and is making poor predictions. It is often difficult to tell the difference between these two situations.
One way to improve understanding of what’s happening is to regularly compare outcomes between periods when the model is active and when it is not. This kind of side-by-side comparison can help to determine whether the model is truly effective or if it’s falling short. In this scenario, an expected range of performance change should be established as part of the evaluation process. If performance drops beyond this range, it might indicate model degradation. If the drop is smaller than expected, it could point to limited model use or ineffective integration into clinical practice. Estimating this range in advance can be difficult, because factors such as model drift or clinical variability might interfere. A more reliable approach might be to determine the range experimentally, under controlled conditions.
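One simple way to operationalize such a comparison is sketched below; the weekly grouping, the choice of positive predictive value as the metric and the expected band are all assumptions made for illustration.

```python
# Sketch of monitoring apparent performance across model-on and model-off periods.
# The weekly records, the metric and the expected band are all assumed.
import statistics

# (week, model_active, measured_ppv): positive predictive value of alerts that week.
weeks = [
    (1, False, 0.62), (2, True, 0.41), (3, False, 0.60), (4, True, 0.44),
    (5, False, 0.61), (6, True, 0.39), (7, False, 0.63), (8, True, 0.42),
]

ppv_off = statistics.mean(ppv for _, active, ppv in weeks if not active)
ppv_on = statistics.mean(ppv for _, active, ppv in weeks if active)
observed_drop = ppv_off - ppv_on

# Expected apparent drop when clinicians act on alerts and prevent the outcome,
# ideally estimated in advance under controlled conditions.
EXPECTED_DROP = (0.15, 0.25)  # assumed band

if observed_drop > EXPECTED_DROP[1]:
    verdict = "drop exceeds the expected range: possible model degradation"
elif observed_drop < EXPECTED_DROP[0]:
    verdict = "drop is smaller than expected: possible limited use or poor integration"
else:
    verdict = "within the expected range: consistent with effective intervention"

print(f"Apparent PPV drop when the model is active: {observed_drop:.2f} ({verdict})")
```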
Unfortunately, real-world patient care, especially in environments with multiple models and providers, is far removed from controlled conditions. Although randomized controlled trials6 (RCTs) remain the gold standard for evaluating clinical treatments and models, applying that level of control in day-to-day clinical settings is rarely possible. In practice, clinicians might need to choose between several overlapping or even conflicting models. As the number of deployed models grows, the results of isolated, tightly controlled studies become less reliable as indicators of real-world effectiveness. Unless a model is going to be used in exactly the same controlled environment in which it was tested — free from competing models, system changes or drift — its performance in isolation should be interpreted with caution.
Even if we accept RCTs at face value as being able to provide usable proof of a predictive model’s effectiveness, they come with substantial financial and time costs. A more practical way to assess a model is to test it on entirely new data — such as from another hospital or site. This process, often called external validation, helps to show whether the model can detect real biological patterns rather than just those specific to the data it was trained on. But akin to the challenges related to retraining models, this kind of testing becomes much harder7 when previous models have already shaped or influenced the data being used for testing.
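A minimal external-validation sketch might look like the following, with two synthetic 'hospitals' standing in for a development site and an independent test site; the shift in case mix between them and the use of the area under the receiver operating characteristic curve (AUROC) are assumptions.

```python
# Sketch of external validation: train at one site, evaluate at another.
# Both 'hospitals' are synthetic; the shift between them is assumed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

def make_site(n: int, shift: float) -> tuple[np.ndarray, np.ndarray]:
    """Generate synthetic patients; 'shift' mimics a different case mix."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 6))
    logits = X[:, 0] + 0.5 * X[:, 1] - 1.0
    y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)
    return X, y

X_dev, y_dev = make_site(2_000, shift=0.0)     # development hospital
X_ext, y_ext = make_site(1_000, shift=0.5)     # external hospital

model = LogisticRegression().fit(X_dev, y_dev)

print(f"Internal AUROC: {roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1]):.2f}")
print(f"External AUROC: {roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]):.2f}")
```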
Nature 642, 864-866 (2025)
doi: https://doi.org/10.1038/d41586-025-01946-8
Author: Akhil Vaid