A shout-out for AI studies that don’t make the headlines
In a year that will see many AI achievements and battles, let’s not forget that not all AI research makes the front pages
January is not even over, and 2025 is already proving to be a defining year for artificial intelligence (AI). On 21 January, just one day into his presidency, US President Donald Trump announced the Stargate Project, a joint venture between leading technology companies and financiers in the United States, Japan and the United Arab Emirates. They pledged a staggering US$500 billion to develop AI infrastructure in the United States.
Yet just one day earlier, DeepSeek, an AI-research company based in Hangzhou, China, had shown that such vast sums might not be needed when it released DeepSeek-R1, a large language model (LLM) capable of step-by-step tasks analogous to human reasoning — reportedly built at a fraction of the cost and computing power of existing LLMs. In early tests, its performance on tasks in chemistry and mathematics matched that of the o1 LLM released last September by OpenAI, a firm based in San Francisco, California. The news of a cheap yet advanced AI has sent the price of some technology stocks into a tailspin.
Amid the various visions for AI that are likely to define the coming years, important studies will continue to be published, but not all will make headlines. They, too, need to be heard about, discussed and debated. One such work was published in Nature earlier this month. It is called ‘Accurate predictions on small data with a tabular foundation model’ (N. Hollmann et al. Nature 637, 319–326; 2025), and it could be revolutionary for the field of data science, according to one of its reviewers, Duncan McElfresh, a data engineer at Stanford Health Care in Palo Alto, California, writing in an accompanying News and Views article (D. C. McElfresh Nature 637, 274–275; 2025).
The best-known LLMs are pre-trained on hundreds of billions of examples of real data, such as text and images. This enables them to answer user queries with a degree of reliability. But what if relevant real-world data do not exist in the required quantities? Can AI still provide reliable answers when trained on far less data? This is a key question for researchers who use AI to make predictions from tabulated data sets, of which nowhere near enough exist to train AI models in the conventional way. The Nature study suggests that reliable results can be achieved if AI models are instead trained on ‘synthetic data’ — randomly generated data that mimic the statistical properties of real-world data.
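To make that idea concrete, here is a minimal, purely illustrative Python sketch of what one randomly generated ‘table’ might look like. The function and parameters below are hypothetical; the study itself samples from far richer random generative processes than this toy rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_table(n_rows=50, n_cols=5):
    """One toy synthetic data set: random numeric columns linked to a
    binary label by a randomly drawn, noisy rule."""
    X = rng.normal(size=(n_rows, n_cols))       # random feature columns
    w = rng.normal(size=n_cols)                 # a randomly drawn relationship
    noise = rng.normal(scale=0.5, size=n_rows)  # imperfection, as in real data
    y = (X @ w + noise > 0).astype(int)         # a label column to predict
    return X, y

# In spirit, a model can be pre-trained on millions of such tables;
# the study samples from far more elaborate random processes than this.
tables = [random_table() for _ in range(1_000)]
```

Pre-training on an enormous number of such tables is what allows a model to pick up general patterns of how columns can relate to one another, without ever seeing a real spreadsheet.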
This advance is the work of computer scientists Noah Hollmann, Samuel Müller and Frank Hutter at the University of Freiburg, Germany, and their colleagues. Their model, called TabPFN, is designed to analyse tabulated data, such as those found in spreadsheets. Typically, a user creates a spreadsheet by populating rows and columns with data, and uses mathematical models to make inferences or projections from those data. TabPFN can make predictions on any small data set, from those used in accounting and finance to those from genomics and neuroscience. Moreover, its predictions are accurate even though the model was trained on no real-world data at all, but instead on some 100 million randomly generated data sets.
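For a sense of how such a model might be used in practice, the sketch below assumes the tabpfn Python package released alongside the paper, which exposes a scikit-learn-style classifier; the exact package name, constructor arguments and the example data set are assumptions for illustration, not details taken from this editorial.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from tabpfn import TabPFNClassifier  # assumed import path; see note above

# A small real-world table: a few hundred rows, 30 numeric columns.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The classifier arrives already pre-trained on synthetic tables;
# 'fit' stores the small table so the network can condition on it.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

# Predictions for unseen rows, scored like any scikit-learn model.
print(accuracy_score(y_test, clf.predict(X_test)))
```

A notable design choice, reflected in the comments above, is that ‘fitting’ involves no further training of the network: the pre-trained model conditions on the supplied rows when it makes predictions for new ones.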
Synthetic data do not come free of risks, such as the danger of producing inaccurate results, or hallucinations. This is partly why it is important that such studies are replicated. Replication, a cornerstone of science, also reassures users that they can trust the results of their queries.
Enhancing trust in AI, along with minimizing harms, must remain a global priority, even though it seems to have been downgraded by Trump. The president has rescinded an executive order by his predecessor, which called on the National Institute of Standards and Technology (NIST) and AI companies to collaborate on improving both trust in and the safety of AI, including in the use of synthetic data. Trump’s new executive order, called ‘Removing barriers to American leadership in artificial intelligence’, does not mention the word ‘safety’. Last November, NIST published a report on methods for authenticating AI content and tracking its provenance (see go.nature.com/42c21tn). Researchers should build on these efforts and not let them go to waste.
Hollmann and colleagues’ work is an example of necessity spurring innovation: the researchers realized that there were not enough accessible real-world data sets to train their model, so they found an alternative approach.
All AI models, whether trained on synthetic or real-world data, remain black boxes: users and regulators do not know how a result is reached. So, as 2025 brings more exciting developments, let’s not forget the studies that attempt to understand the ‘how and why’ of AI, and the methods papers, too. They are as important as the publications that announce the breakthroughs.
Nature 637, 1022 (2025)
doi: https://doi.org/10.1038/d41586-025-00214-z