DeepMind’s new AlphaGenome AI tackles the ‘dark matter’ in our DNA

Tool aims to solve the mystery of non-coding sequences — but is still in its infancy

Researchers feed vast quantities of genomic data into machine-learning systems to train them to predict the role of non-coding sequences. Credit: JuSun/iStock via Getty

Nearly 25 years after scientists completed a draft human genome sequence, many of its 3.1 billion letters remain a puzzle. The 98% of the genome that is not made of protein-coding genes — but which can influence their activity — is especially vexing.

An artificial intelligence (AI) model developed by Google DeepMind in London could help scientists to make sense of this ‘dark matter’, and see how it might contribute to diseases such as cancer and influence the inner workings of cells. The model, called AlphaGenome, is described in a 25 June preprint.

“This is one of the most fundamental problems not just in biology — in all of science,” Pushmeet Kohli, the company’s head of AI for science, said at a press briefing.

The ‘sequence to function’ model takes long stretches of DNA and predicts various properties, such as the expression levels of the genes they contain and how those levels could be affected by mutations.

“I think it is an exciting leap forward,” says Anshul Kundaje, a computational genomicist at Stanford University in Palo Alto, California, who has had early access to AlphaGenome. “It is a genuine improvement in pretty much all current state-of-the-art sequence-to-function models.”

An ‘all in one’ approach

When DeepMind unveiled AlphaFold 2 in 2020, it went a long way to solving a problem that had challenged researchers for decades: determining how a protein’s sequence contributes to its three-dimensional shape.

Working out what DNA sequences do is a different kind of problem, because there is no single answer comparable to the 3D structure that AlphaFold delivers. A single DNA stretch can have numerous, interconnected roles. It might attract the cellular machinery that latches onto a particular section of a chromosome and transcribes a nearby gene into an RNA molecule, or it might attract proteins called transcription factors that influence where, when and to what extent gene expression occurs. Many DNA sequences, for example, influence gene activity by altering a chromosome’s 3D shape, either restricting or easing access for the machinery that does the transcription.

Biologists have been chipping away at this question for decades with various kinds of computational tools. In the past decade or so, scientists have developed dozens of AI models to make sense of the genome. Many of these have focused on an individual task, such as predicting levels of gene expression or determining how modular segments of individual genes, called exons, are spliced together in different combinations to encode distinct proteins. But scientists are increasingly interested in ‘all in one’ tools for interpreting DNA sequences.

AlphaGenome is one such model. It can take inputs of up to one million DNA letters — a stretch that could include a gene and myriad regulatory elements — and make thousands of predictions about numerous biological properties. In many cases, AlphaGenome’s predictions are sensitive to single-DNA-letter changes, which means that scientists can predict the consequences of mutations.

In one example, DeepMind researchers applied the AlphaGenome model to diverse mutations identified in previous studies in people with a type of leukaemia. The model accurately predicted that the non-coding mutations indirectly activated a nearby gene that is a common driver of this cancer.
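
The idea of predicting a mutation’s consequences from single-letter sensitivity can be illustrated with a small sketch: score the original sequence, score the sequence with one letter changed, and compare the two. The Python below is only a toy under stated assumptions. The function `toy_expression_model`, the motif `TGACGT` and the ten-letter sequence are all invented for illustration; they are not AlphaGenome, its interface or real biology. A real sequence-to-function model would take the place of the toy scoring function.

```python
from typing import Callable

# Invented motif for illustration only; not a real regulatory element.
TOY_MOTIF = "TGACGT"


def toy_expression_model(sequence: str) -> float:
    """Hypothetical stand-in for a sequence-to-function model.

    Returns a toy 'expression' score: the number of times TOY_MOTIF
    occurs in the sequence (overlapping matches included).
    """
    width = len(TOY_MOTIF)
    hits = 0
    for start in range(len(sequence) - width + 1):
        if sequence[start:start + width] == TOY_MOTIF:
            hits += 1
    return float(hits)


def apply_variant(sequence: str, position: int, alt: str) -> str:
    """Return a copy of sequence with the letter at position (0-based) set to alt."""
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single DNA letter: A, C, G or T")
    if not 0 <= position < len(sequence):
        raise IndexError("position lies outside the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def variant_effect(
    model: Callable[[str], float],
    reference: str,
    position: int,
    alt: str,
) -> float:
    """Predicted effect of a single-letter change: score(alt) - score(ref)."""
    mutated = apply_variant(reference, position, alt)
    return model(mutated) - model(reference)


if __name__ == "__main__":
    reference = "AATGACCTGG"  # made-up sequence with no copy of TOY_MOTIF
    # Changing the C at position 6 to G creates the toy motif.
    print(variant_effect(toy_expression_model, reference, 6, "G"))
    # Changing the A at position 0 to C leaves the toy score unchanged.
    print(variant_effect(toy_expression_model, reference, 0, "C"))
```

In this toy case, the first change creates the invented motif and raises the score, whereas the second leaves it untouched. That loosely mirrors the leukaemia example, in which non-coding mutations were predicted to switch on a nearby gene.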