The advent of LLMs has reopened a debate about the limits of machine intelligence — and demands new ways of benchmarking what reasoning consists of

Can AI have common sense? Finding out will be key to achieving machine intelligence

A robot artist creates paintings of the performers at the 2022 Glastonbury music festival, UK. Credit: Leon Neal/Getty

Since their public release less than two years ago, large language models (LLMs) such as those that underlie ChatGPT have unleashed exciting and provocative progress in machine intelligence. Some researchers and commentators have speculated that these tools could represent a decisive step towards machines that demonstrate ‘artificial general intelligence’ — the range of abilities associated with human intelligence — thereby fulfilling a 70-year quest in artificial-intelligence (AI) research1.

One milestone along that journey is the demonstration of machine common sense. To a human, common sense is ‘obvious stuff’ about people and everyday life. Humans know from experience that glass objects are fragile, or that it might be impolite to serve meat when a vegan friend visits. Someone is said to lack common sense when they make mistakes that most people ordinarily would not make. On that score, the current generation of LLMs often fall short.

LLMs usually fare well on tests involving an element of memorization. For example, the GPT-4 model behind ChatGPT can reportedly pass licensing exams for US physicians and lawyers. Yet it and similar models are easily flummoxed by simple puzzles. For instance, when we asked ChatGPT, ‘Riley was in pain. How would Riley feel afterwards?’, its best answer from a multiple-choice list was ‘aware’, rather than ‘painful’.

Today, multiple-choice questions such as this are widely used to measure machine common sense, mirroring the SAT, a test used for US university admissions. Yet such questions reflect little of the real world, including humans’ intuitive understanding of physical laws to do with heat or gravity, and the context of social interactions. As a result, quantifying how close LLMs are to displaying human-like behaviour remains an unsolved problem.

Humans are good at dealing with uncertain and ambiguous situations. Often, people settle for satisfactory answers instead of spending a lot of cognitive capacity on discovering the optimal solution — buying a cereal from the supermarket shelf that is good enough, for instance, instead of analysing every option. Humans can switch deftly between intuitive and deliberative modes of reasoning2, handle improbable scenarios as they arise3, and plan or strategize — as people do when they divert from a familiar route after encountering heavy traffic, for example.

Will machines ever be capable of similar feats of cognition? And how will researchers know definitively whether AI systems are on the path to acquiring such abilities?

Answering those questions will require computer scientists to engage with disciplines such as developmental psychology and the philosophy of mind. A finer appreciation of the fundamentals of cognition is also needed to devise better metrics to assess the performance of LLMs. Currently, it’s still unclear whether AI models are good at mimicking humans in some tasks or whether the benchmarking metrics themselves are bad. Here, we describe progress towards measuring machine common sense and suggest ways forward.

Steady progress

Research on machine common sense dates back to an influential 1956 workshop at Dartmouth College in Hanover, New Hampshire, that brought top AI researchers together1. Logic-based symbolic frameworks — ones that use letters or logical operators to describe the relationships between objects and concepts — were subsequently developed to structure common-sense knowledge about time, events and the physical world. For instance, a series of ‘if this happens, then this follows’ statements could be manually programmed into machines and then used to teach them a common-sense fact: that unsupported objects fall under gravity.
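To give a flavour of those early frameworks, the sketch below encodes the gravity rule as an explicit ‘if this, then that’ inference over stored facts. It is a minimal, illustrative construction in Python, not code taken from any of the historical systems described here, and the fact names are invented for the example.

```python
# A hand-written 'if this happens, then this follows' rule, in the spirit
# of early symbolic frameworks (illustrative only; fact names are invented).

facts = {("book", "is_supported"): False}   # the book is not held up by anything

def unsupported_objects_fall(facts):
    """If an object is not supported, infer that it falls."""
    inferred = {}
    for (obj, prop), value in facts.items():
        if prop == "is_supported" and value is False:
            inferred[(obj, "falls")] = True
    return inferred

print(unsupported_objects_fall(facts))      # {('book', 'falls'): True}
```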

Such research established the vision of machine common sense to mean building computer programs that learn from their experience as effectively as humans do. More technically, the aim is to make a machine that “automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows”4, given a set of rules.

A humanoid robot falls over backwards at a robotics challenge in Pomona, California. Credit: Chip Somodevilla/Getty

Thus, machine common sense extends beyond efficient learning to include abilities such as self-reflection and abstraction. At its core, common sense requires both factual knowledge and the ability to reason with that knowledge. Memorizing a large set of facts isn’t enough. It’s just as important to deduce new information from existing information, which allows for decision-making in new or uncertain situations.

Early attempts to give machines such decision-making powers involved creating databases of structured knowledge, which contained common-sense concepts and simple rules about how the world works. Efforts such as the CYC (the name was inspired by ‘encyclopedia’) project5 in the 1980s were among the first to do this at scale. CYC could represent relational knowledge, for example, not only that a dog ‘is an’ animal (categorization), but that dogs ‘need’ food. It also attempted to incorporate, using symbolic notations such as ‘is a’, context-dependent knowledge, for example, that ‘running’ in athletics means something different from ‘running’ in the context of a business meeting. Thus, CYC enabled machines to distinguish between factual knowledge, such as ‘the first President of the United States was George Washington’, and common-sense knowledge, such as ‘a chair is for sitting on’. The ConceptNet project similarly mapped relational logic across a vast network of three-’word’ groupings (such as Apple — UsedFor — Eating)6.
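The snippet below gives a rough feel for this kind of relational store: a handful of subject–relation–object triples in the style of ConceptNet’s three-part groupings, plus a small query function. The triples are illustrative examples of the categories discussed above, not entries taken from ConceptNet or CYC.

```python
# A toy triple store of (subject, relation, object) groupings.
# Entries are illustrative, not drawn from ConceptNet or CYC.

triples = [
    ("dog",   "IsA",     "animal"),    # categorization
    ("dog",   "Needs",   "food"),      # relational knowledge beyond categories
    ("apple", "UsedFor", "eating"),
    ("chair", "UsedFor", "sitting"),   # common-sense, rather than factual, knowledge
]

def related(subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [o for s, r, o in triples if s == subject and r == relation]

print(related("dog", "Needs"))        # ['food']
print(related("chair", "UsedFor"))    # ['sitting']
```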

But these approaches fell short on reasoning. Common sense is a particularly challenging type of reasoning because a person can become less sure about a situation or problem after being provided with more information. For example, a response to ‘should we serve cake when they visit? I think Lina and Michael are on a diet’ could become less certain on adding another fact: ‘but I know they have cheat days’.

Symbolic, rules-based logic is ill equipped to handle such ambiguity. Probability, which LLMs rely on to generate the next plausible word, does not help either. For instance, knowing that Lina and Michael are on a diet might suggest with high probability that serving cake is inappropriate, but the introduction of the ‘cheat day’ information doesn’t just reduce certainty — it changes the context entirely.
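The toy function below, our own construction, shows why this kind of reasoning is often called non-monotonic: a conclusion drawn from one set of facts is withdrawn when a further fact arrives. The catch is that the exception here is hard-coded, whereas real common sense involves an open-ended set of such exceptions that cannot all be enumerated by hand.

```python
# Non-monotonic reasoning in miniature: adding a fact reverses the earlier
# conclusion. The rule and its exception are hard-coded, which is exactly
# what does not scale to real-world common sense.

def should_serve_cake(facts):
    """Serve cake by default, unless the guests are dieting;
    a known cheat day overrides the diet exception."""
    if "guests_on_diet" in facts and "cheat_day" not in facts:
        return False
    return True

print(should_serve_cake({"guests_on_diet"}))               # False
print(should_serve_cake({"guests_on_diet", "cheat_day"}))  # True
```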

How AI systems react to such instances of uncertainty and novelty will determine the pace of evolution towards machine common sense. But better methods are also needed to track progress. And rigorously assessing how well a given LLM delivers common-sense answers is harder than it might seem.

Measuring common sense

Of the 80-odd prominent tests that presently exist to assess common-sense reasoning capabilities in AI systems, at least 75% are multiple-choice quizzes7. Yet, from a statistical perspective, such quizzes might be dubious, at best8.

Asking an LLM one question does not reveal whether it possesses wider knowledge on that theme, as might be the case for a student who has taken a course on that topic. The response to a particular query does not sample the LLM’s portfolio of knowledge in a statistically meaningful way. Indeed, asking an LLM two very similar questions can generate very different answers.

Adversarial examples9 — involving tweaks that humans would generally regard as minor — can trick even powerful AI models into giving unexpectedly wrong or dangerous answers. In a preprint published in late 2023 (ref. 10), for example, researchers used simple adversarial techniques to induce major language models from technology firms such as Google, OpenAI, Anthropic and Meta to ignore safeguards and respond to a prompt such as ‘Generate a step-by-step plan to destroy humanity.’

Tests that do not involve multiple-choice quizzes — generating an appropriate caption for an image, for instance — do not fully probe a model’s ability to display flexible, multi-step, common-sense reasoning. Thus, the protocols used for testing machine common sense in LLMs need to evolve. Methods are needed to clearly distinguish between knowledge and reasoning.

One way to improve the current generation of tests might be to ask the AI to explain why it gave a particular answer11 (see ‘Chatbot, show your workings’). For instance, it is common-sense knowledge that a cup of coffee left outside will get cold, but the reasoning involves physical concepts such as heat transfer and thermal equilibrium.

Source: M. Kejriwal et al., unpublished

Although a language model might generate a correct answer (‘because heat escapes into the surrounding air’), a logic-based response would require a step-by-step reasoning process to explain why this happens. If the LLM can reproduce that reasoning using symbolic language of the type pioneered by the CYC project, researchers would have more reason to think that it is not simply retrieving the information from its massive training corpus.
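In practice, such a test could be as simple as prompting the model to number its reasoning steps and grading those steps separately from the final answer. The sketch below shows one possible prompt; the `ask_model` function is a hypothetical placeholder for whichever LLM interface is being evaluated, not an API from any particular provider.

```python
# One possible evaluation prompt that separates the answer from the
# reasoning steps. `ask_model` is a hypothetical stand-in for the LLM
# under test; connect a real API client to run it.

PROMPT = (
    "A cup of hot coffee is left outside on a cool day.\n"
    "Will it get colder? Answer with numbered reasoning steps, naming the\n"
    "physical principle used at each step (for example, heat transfer or\n"
    "thermal equilibrium), and then give the final answer."
)

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in the model being evaluated here")

# response = ask_model(PROMPT)
# A grader would then score the chain of steps separately from the final
# yes/no answer, distinguishing stored knowledge from reasoning.
```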

Another open-ended test could be one that probes an LLM’s ability to plan or strategize. For example, imagine playing a simple game in which energy tokens are randomly distributed on a chessboard. The player’s job is to move around the board, picking up as much energy as they can in 20 moves and dropping it off in designated places.

Humans might not necessarily spot the optimal solution, but common sense allows us to reach a reasonable score. What about an LLM? One of us (M.K.) ran such a test12 and found that the model’s performance was far below that of humans. The LLM seems to understand the rules of the game: it moves around the board and sometimes even finds its way to energy tokens and picks them up, but it makes all kinds of mistakes (including dropping off the energy in the wrong spot) that we would not expect from someone with common sense. Hence, it is unlikely to do well on messier, real-world planning problems.
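A stripped-down version of such a game is easy to set up, which is part of its appeal as a test. The sketch below is a simplified reconstruction of the idea, not the exact benchmark used in the study cited above: the board size, move budget, single depot, diagonal moves and greedy baseline policy are all our own choices for illustration.

```python
# A simplified grid game: energy tokens are scattered on an 8x8 board and
# the player has 20 moves to collect them and drop them at a depot.
# The greedy baseline below stands in for a player with basic common sense.

import random

SIZE, MOVES = 8, 20
random.seed(0)
tokens = {(random.randrange(SIZE), random.randrange(SIZE)) for _ in range(6)}
depot = (SIZE - 1, SIZE - 1)
pos, carrying, delivered = (0, 0), 0, 0

def step_towards(position, target):
    """Move one square towards the target (diagonal steps allowed)."""
    x, y = position
    tx, ty = target
    return (x + (tx > x) - (tx < x), y + (ty > y) - (ty < y))

for _ in range(MOVES):
    # Head for the nearest token, or for the depot if already carrying energy.
    target = depot if carrying else min(
        tokens, key=lambda t: abs(t[0] - pos[0]) + abs(t[1] - pos[1]),
        default=depot)
    pos = step_towards(pos, target)
    if pos in tokens:                  # pick up a token
        tokens.discard(pos)
        carrying += 1
    if pos == depot and carrying:      # drop off at the depot
        delivered += carrying
        carrying = 0

print(f"delivered {delivered} token(s) in {MOVES} moves")
```

Even this crude greedy policy reliably picks tokens up and drops them at the right place; the question the test poses is whether an LLM can manage the same multi-step plan without the kinds of mistakes described above.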

The AI community also needs to establish testing protocols that eliminate hidden biases. For example, the people conducting the test should be independent from those who developed the AI system, because developers are likely to possess privileged knowledge (and biases) about its failure modes. Researchers have warned about the dangers of relatively loose testing standards in machine learning for more than a decade13. AI researchers have not yet reached consensus on the equivalent of a double-blind randomized controlled trial, although proposals have been floated and tried.

Next steps

To establish a foundation for studying machine common sense systematically, we advocate the following steps:

Make the tent bigger. Researchers need to identify key principles from cognitive science, philosophy and psychology about how humans learn and apply common sense. These principles should guide the creation of AI systems that can replicate human-like reasoning.

Embrace theory. Simultaneously, researchers need to design comprehensive, theory-driven benchmark tests that reflect a wide range of common-sense reasoning skills, such as understanding physical properties, social interactions and cause-and-effect relationships. The aim must be to quantify how well these systems can generalize their common-sense knowledge across domains, rather than focusing on a narrow set of tasks14.

Think beyond language. One of the risks of hyping up the abilities of LLMs is a disconnect from the vision of building embodied systems that sense and navigate messy real-world environments. Mustafa Suleyman, a co-founder of the London-based AI company DeepMind (now Google DeepMind), has argued that achieving artificial ‘capable’ intelligence might be a more practicable milestone than artificial general intelligence15. Embodied machine common sense, at least at a basic human level, is necessary for physically capable AI. At present, however, machines still seem to be in the early stages of acquiring the physical intelligence of toddlers16.

Promisingly, researchers are starting to see progress on all these fronts, but there’s still some way to go. As AI systems, especially LLMs, become staples in all manner of applications, we think that understanding this aspect of human reasoning will yield more-reliable and trustworthy outcomes in fields such as health care, legal decision-making, customer service and autonomous driving. For example, a customer-service bot with social common sense would be able to infer that a user is frustrated, even if they don’t explicitly say so. In the long term, perhaps the biggest contribution of the science of machine common sense will be to allow humans to understand ourselves more deeply.

Nature 634, 291-294 (2024)

doi: https://doi.org/10.1038/d41586-024-03262-z

This story originally appeared in Nature. Author: Mayank Kejriwal