‘In awe’: scientists impressed by latest ChatGPT model o1

The chatbot excels at science, beating PhDs on a hard science test, but it might ‘hallucinate’ more than its predecessors

Technology firm OpenAI released a preview version of its latest chatbot, o1, last month. Credit: GK Images/Alamy

Researchers who helped to test OpenAI’s new large language model, OpenAI o1, say it represents a big step up in terms of chatbots’ usefulness to science.

“In my field of quantum physics, it gives significantly more detailed and coherent responses” than did the company’s last model, GPT-4o, says Mario Krenn, leader of the Artificial Scientist Lab at the Max Planck Institute for the Science of Light in Erlangen, Germany. Krenn was one of a handful of scientists on the ‘red team’ that tested the preview version of o1 for OpenAI, a technology firm based in San Francisco, California, by putting the bot through its paces and checking for safety concerns.

Since the public launch of ChatGPT in 2022, the large language models that drive such chatbots have, on average, become bigger and better, with more parameters, or decision-making nodes; bigger training datasets; and stronger abilities across a variety of standardized tests, or benchmarks.

OpenAI says that its o1 series marks a step change in the company’s approach. The distinguishing feature of this artificial intelligence (AI) model, observers say, is that it has spent more time in certain stages of learning, and ‘thinks’ about its answers for longer, making it slower, but more capable — especially in areas in which right and wrong answers can be clearly defined. The firm adds that o1 “can reason through complex tasks and solve harder problems than previous models in science, coding, and math”. For now, o1-preview and o1-mini — a smaller, more cost-effective version suited to coding — are available to paying customers and certain developers on a trial basis. The company hasn’t released details about how many parameters or how much computing power lie behind the o1 models.

Besting the PhDs

Andrew White, a chemist at FutureHouse, a non-profit organization in San Francisco that focuses on how AI can be applied to molecular biology, says that observers have been surprised and disappointed by a general lack of improvement in chatbots’ ability to support scientific tasks over the past year and a half, since the public release of GPT-4. The o1 series, he says, has changed that.

[Chart: ‘Next level’. Source: OpenAI]

Strikingly, o1 has become the first large language model to beat PhD-level scholars on the hardest series of questions — the ‘diamond’ set — in a test called the Graduate-Level Google-Proof Q&A Benchmark (GPQA)1. OpenAI says that PhD-level scholars scored just under 70% on GPQA Diamond, whereas o1 scored 78% overall, with a particularly high score of 93% in physics (see ‘Next level’). That’s “significantly higher than the next-best reported [chatbot] performance”, says David Rein, who was part of the team that developed the GPQA. Rein now works at Model Evaluation and Threat Research, a non-profit organization in Berkeley, California, that assesses the risks of AI. “It seems plausible to me that this represents a significant and fundamental improvement in the model’s core reasoning capabilities,” he adds.

OpenAI also tested o1 on a qualifying exam for the International Mathematics Olympiad. Its previous best model, GPT-4o, correctly solved only 13% of the problems, whereas o1 scored 83%.

Chain of thought

OpenAI o1 works by using chain-of-thought logic; it talks itself through a series of reasoning steps as it attempts to solve a problem, correcting itself as it goes.

OpenAI has decided to keep the details of any given chain of thought hidden — in part because the chain might contain errors or socially unacceptable ‘thoughts’, and in part to protect company secrets relating to how the model works. Instead, o1 provides a reconstructed summary of its logic for the user, alongside its answers. It’s unclear, White says, whether the full chain of thought, if revealed, would look similar to human reasoning.
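The general idea of chain-of-thought reasoning can be sketched in code. The toy Python snippet below is purely illustrative and is not OpenAI’s implementation: a solver writes out intermediate steps, checks each one before moving on (a stand-in for the model ‘correcting itself as it goes’), and surfaces only a summary of its reasoning to the user, much as o1 hides its full chain.

```python
# Toy sketch of chain-of-thought-style solving: record intermediate
# steps, verify each before continuing, and show the user only a
# summary. Hypothetical code, not OpenAI's actual mechanism.

def solve_with_steps(a: int, b: int, c: int) -> tuple[list[str], int]:
    """Compute a*b + c, recording and checking each reasoning step."""
    steps = []

    product = a * b
    steps.append(f"Step 1: multiply {a} * {b} = {product}")
    # Self-check: re-derive the product by repeated addition.
    assert product == sum(a for _ in range(b)), "step 1 failed check"

    total = product + c
    steps.append(f"Step 2: add {c}: {product} + {c} = {total}")
    # Self-check: subtracting c should recover the product.
    assert total - c == product, "step 2 failed check"

    return steps, total

steps, answer = solve_with_steps(12, 7, 5)
# The full chain stays internal; the user sees a reconstructed summary.
summary = f"Solved in {len(steps)} steps; answer = {answer}"
print(summary)
```

The design point the sketch illustrates is the separation between the (potentially messy) internal chain and the polished summary that is actually shown, which is the trade-off the article describes.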

The new capabilities come with trade-offs. For instance, OpenAI reports that it has received anecdotal feedback that o1 models hallucinate — make up incorrect answers — more often than their predecessors do (although the company’s internal testing showed slightly lower rates of hallucination for o1).

The red-team scientists noted plenty of ways in which o1 was helpful in coming up with protocols for science experiments, but OpenAI says the testers also “highlighted missing safety information pertaining to harmful steps, such as not highlighting explosive hazards or suggesting inappropriate chemical containment methods, pointing to unsuitability of the model to be relied on for high-risk physical safety tasks”.

“It’s still not perfect or reliable enough that you wouldn’t really want to closely check over it,” White says. He adds that o1 is more suited to guiding experts than novices. “For a novice, it’s just beyond their immediate inspection ability” to look at an o1-generated protocol and see that it’s “bunk”, he says.

Science solvers

Krenn thinks that o1 will accelerate science by helping to scan the literature, spot what’s missing and suggest interesting avenues for future research. He has had success looping o1 into a tool that he co-developed that does this, called SciMuse2. “It creates much more interesting ideas than GPT-4 or GPT-4o,” he says.

Kyle Kabasares, a data scientist at the Bay Area Environmental Research Institute in Moffett Field, California, used o1 to replicate some coding from his PhD project that calculated the mass of black holes. “I was just in awe,” he says, noting that it took o1 about an hour to accomplish what took him many months.

Catherine Brownstein, a geneticist at Boston Children’s Hospital in Massachusetts, says the hospital is currently testing several AI systems, including o1-preview, for applications such as connecting the dots between patient characteristics and genes for rare diseases. She says o1 “is more accurate and gives options I didn’t think were possible from a chatbot”.

doi: https://doi.org/10.1038/d41586-024-03169-9

This story originally appeared in Nature. Author: Nicola Jones