Beyond AlphaFold: how AI is decoding the grammar of the genome

Scientists are seeking to decipher the role of non-coding DNA in the human genome, helped by a suite of artificial-intelligence tools

In 1862, Victor Hugo reportedly wrote to his publisher to ask how his newly published novel Les Misérables was selling, with a single-character query: “?” The response: “!”

This story of one of the world’s most concise correspondences is apocryphal. But some genome-focused artificial intelligence (AI) systems can, like the French writer’s publisher, respond meaningfully to equally short prompts.

Whereas the chatbot ChatGPT needs detailed queries to be used effectively, Evo, an AI model trained on some 300 billion nucleotide bases, including 80,000 microbial whole-genome sequences, needs only the prompt ‘#’ to dream up a new sequence of mobile DNA, drawing on the many such biological systems it encountered during training (see go.nature.com/3jvp922). Given a prompt such as ‘030’, an AI tool called regLM can spit out 200-base sequences that are predicted to exhibit regulatory activity in any of three human cell lines (go.nature.com/4jpttm8).

Evo and regLM are part of a fast-growing suite of tools that aim to internalize, decode, interpret and build on the grammar of the genome — especially the vast portion that does not code for proteins. Think AlphaFold, but for regulatory DNA: the sequences that control gene expression.

When Google DeepMind released AlphaFold in 2020, the company claimed it had solved a decades-old ‘grand challenge’ in biology — predicting a protein’s 3D shape from its sequence alone. But the non-coding fraction of the genome could prove to be an even grander challenge.

A given sequence of amino acids will generally fold into the same shape, whatever the cellular context. That predictability is not true of the genome, in which short, functional sequence motifs — gene promoters and enhancers, transcription start and stop sites and so on — can be scattered across long stretches of seemingly purposeless DNA. These motifs might overlap, interact over long distances, bind to competing protein factors or respond to signals that are only present in specific cells or at certain times in development. They are also tightly wrapped within chromatin, a complex of DNA and protein, which might be more or less accessible to external proteins depending on what the cell is doing.

“How proteins are encoded in the genome, the code of how genes are expressed, when and where, how much — is one of the most fascinating problems in biology,” says Stein Aerts, a computational biologist at the VIB Center for AI & Computational Biology and the Catholic University of Leuven (KU Leuven) in Belgium. But with training, AI tools can detect subtle differences between sequences and predict what they do and how they behave, identifying crucial motifs and even estimating the impact of altering them. From there, AI models can attempt to predict the physiological impact of genetic variants and even guide the design of new sequences with specified functions.

These tools are not perfect, and researchers cannot even agree on how best to assess their performance. But that makes the field exciting. “It’s so clear that it’s a solvable problem,” says Julia Zeitlinger, a developmental and computational biologist at the Stowers Institute for Medical Research in Kansas City, Missouri, who developed an AI model called BPNet and uses it to decode the mechanistic sequence rules of gene regulation, “but it’s not clear how”.

Of puppies and puffins

DeepSEA, one of the first genomic AI tools, was published1 ten years ago this month by computational biologists Jian Zhou and Olga Troyanskaya at Princeton University in New Jersey.

DeepSEA is a convolutional neural network (CNN) — the same kind of deep-learning architecture used to teach computers to classify images as, say, a cat or a dog. Zhou and Troyanskaya trained a model on epigenetics data, including transcription-factor binding, chromatin accessibility and histone modifications, from a public research project called the Encyclopedia of DNA Elements (ENCODE). The model learnt to predict the presence of such features in 1,000-base segments of DNA it had never encountered.
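
To make that concrete, here is a minimal PyTorch sketch of a sequence-to-feature CNN of the sort described: one-hot-encoded DNA goes in, convolutional filters scan for local motifs, and a final layer emits one probability per epigenetic feature. The filter counts and layer sizes are illustrative placeholders rather than DeepSEA's published configuration.

```python
import torch
import torch.nn as nn

# A minimal sketch of a sequence-to-feature CNN; the numbers here are
# illustrative placeholders, not DeepSEA's published architecture.
class ChromatinCNN(nn.Module):
    def __init__(self, seq_len=1000, n_features=919):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 320, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),   # motif-scanning filters
            nn.Conv1d(320, 480, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
        )
        with torch.no_grad():                    # infer the flattened size from a dummy pass
            flat = self.conv(torch.zeros(1, 4, seq_len)).numel()
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 925), nn.ReLU(),
            nn.Linear(925, n_features), nn.Sigmoid(),   # one probability per epigenetic feature
        )

    def forward(self, x):                        # x: (batch, 4, seq_len), one-hot A/C/G/T
        return self.head(self.conv(x))

model = ChromatinCNN()
probs = model(torch.zeros(2, 4, 1000))           # (2, 919) predicted feature probabilities
```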

DeepSEA’s training enabled it to tease apart the biological consequence and severity of sequence variants associated with human disease. For instance, one breast-cancer-associated sequence variant called rs4784227 seems to strengthen the binding of a DNA-binding protein called FOXA1, whereas a variant associated with the blood condition α-thalassemia creates a possible binding site for GATA1, a transcription factor involved in blood-cell development.
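
The trick behind such variant scoring is conceptually simple: run the model on the reference sequence, run it again on the sequence carrying the variant and compare the two sets of predictions. Below is a rough sketch of that reference-versus-alternate comparison, assuming a trained sequence-to-feature model such as the one sketched above; the helper functions are hypothetical, not DeepSEA's actual interface.

```python
import torch

def one_hot(seq):
    """One-hot encode an upper-case DNA string into a (1, 4, len) tensor."""
    index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    x = torch.zeros(1, 4, len(seq))
    for i, base in enumerate(seq):
        x[0, index[base], i] = 1.0
    return x

def variant_effect(model, ref_seq, alt_base, pos):
    """Score a single-nucleotide variant as the change in predicted feature
    probabilities between the reference and alternate alleles."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    with torch.no_grad():
        ref_pred = model(one_hot(ref_seq))
        alt_pred = model(one_hot(alt_seq))
    return alt_pred - ref_pred   # positive values suggest the variant strengthens that feature

# e.g. effect = variant_effect(model, ref_seq, 'T', pos=500), where effect[0, i]
# is the predicted shift for feature i (say, FOXA1 or GATA1 binding).
```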

Since then, the field has exploded. David Kelley, a principal investigator at the biotechnology company Calico Life Sciences in South San Francisco, California, has created or co-created multiple AI models, many with canine-inspired names. These include Akita2 (for predicting 3D genome folding), Basset3 and Basenji4 (for regulatory-sequence prediction) and Borzoi5, which predicts gene expression across the length of a gene.

These models raised a litter of variants: Basset begat Malinois, and Borzoi begat Scooby. Other researchers have built their own (non-canine) models including Puffin, ChromBPNet and more.

Not all are CNNs. Enformer — a model that predicts both gene expression and epigenetic data over long distances — and Borzoi, for instance, “use both convolution blocks and transformer blocks”, says Kelley, whose laboratory developed both models. “The convolution blocks are great for capturing the local sequence patterns, and then the transformer blocks help look around a larger region to consider the local patterns in a broader context before predicting the data.”

But whatever the architecture, genomic AI models come in two basic forms, says Anshul Kundaje, who researches computational genomics at Stanford University in California. Supervised, or ‘sequence-to-function’, models are trained on functional genomic data — gene expression or chromatin accessibility, for instance — and learn to predict the function of DNA sequences they have never encountered. Often working at or near single-nucleotide resolution, these models can identify key motifs, such as functionally important protein-binding sites, and predict the significance of altering them. DeepSEA is one; Kundaje’s ChromBPNet, which predicts regions of chromatin accessibility, is another.
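
Enformer and Borzoi are themselves sequence-to-function models, and the hybrid design Kelley describes can be sketched roughly as follows (an illustration of the idea, not either model's actual architecture): a convolutional tower compresses the raw sequence into coarser bins that capture local motif content, and transformer layers then let each bin attend to distant bins before per-bin predictions are made for every data track.

```python
import torch
import torch.nn as nn

# An illustration of the hybrid idea, not Enformer's or Borzoi's actual code:
# convolution blocks summarize local sequence patterns into coarser bins, then
# transformer blocks let every bin attend to distant bins before per-bin
# predictions are made for each data track.
class ConvTransformerSketch(nn.Module):
    def __init__(self, n_tracks=100, channels=256):
        super().__init__()
        self.conv_tower = nn.Sequential(          # local motif detection plus pooling
            nn.Conv1d(4, channels, kernel_size=15, padding=7), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)  # long-range context
        self.head = nn.Linear(channels, n_tracks)  # per-bin prediction for each data track

    def forward(self, x):                        # x: (batch, 4, seq_len) one-hot DNA
        h = self.conv_tower(x)                   # (batch, channels, n_bins)
        h = self.transformer(h.transpose(1, 2))  # (batch, n_bins, channels)
        return self.head(h)                      # (batch, n_bins, n_tracks)

preds = ConvTransformerSketch()(torch.zeros(1, 4, 4096))   # (1, 1024, 100)
```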

The other class comprises unsupervised or self-supervised ‘genomic language models’ (gLMs). Like ChatGPT, they are trained on vast quantities of text — in this case, genomic sequence data — and are tasked with either predicting the next base (or ‘token’) in a sequence or filling in missing bases on the basis of surrounding context. These models “are not trying to predict the activity of a sequence, they’re trying to predict the composition of a sequence”, says Avantika Lal, a machine-learning scientist at biotechnology firm Genentech in South San Francisco.
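
In code, that self-supervised objective is strikingly compact. The toy example below, which is not any published gLM, trains a tiny transformer to predict each next base of a DNA string from the bases that precede it, with no functional labels involved.

```python
import torch
import torch.nn as nn

# A toy illustration of the self-supervised objective, not any published gLM:
# the model learns to predict each next base from the bases that precede it.
BASES = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

class TinyDNALanguageModel(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(4, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.to_logits = nn.Linear(d_model, 4)        # one logit per possible next base

    def forward(self, tokens):                        # tokens: (batch, seq_len) base indices
        seq_len = tokens.shape[1]
        causal = torch.full((seq_len, seq_len), float('-inf')).triu(1)  # hide future bases
        h = self.encoder(self.embed(tokens), mask=causal)
        return self.to_logits(h)

seq = torch.tensor([[BASES[b] for b in "ACGTACGTAAACCGT"]])
model = TinyDNALanguageModel()
logits = model(seq[:, :-1])                           # predict base t+1 from bases up to t
loss = nn.functional.cross_entropy(logits.reshape(-1, 4), seq[:, 1:].reshape(-1))
```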

With machine-learning scientist Gökçen Eraslan and their colleagues at Genentech, Lal co-developed regLM, a language model that they trained by labelling regulatory sequences with succinct markers of activity6 — for instance, ‘04<sequence>’ to indicate strong expression in one cell line and low activity in another. The model is therefore not strictly unsupervised, says Eraslan — he calls it a ‘function-to-sequence’ model. But those same labels can then be used to prompt regLM to create new sequences with predicted behaviours.
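
Once a model has been trained that way, generation is just repeated next-base sampling. The sketch below is hypothetical rather than regLM's actual interface: it assumes an autoregressive model whose vocabulary covers both the four bases and the activity-label tokens, prepends an encoded label such as ‘04’ and then samples one base at a time.

```python
import torch

def generate(model, label_tokens, length=200, temperature=1.0):
    """Sample `length` new bases conditioned on a prepended activity label.

    `model` is assumed to map a (batch, seq_len) tensor of token indices to
    (batch, seq_len, vocab) logits; `label_tokens` is the encoded label prompt.
    """
    tokens = label_tokens.clone()                          # e.g. the encoded form of '04'
    for _ in range(length):
        with torch.no_grad():
            logits = model(tokens)[:, -1] / temperature    # logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample one base
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens[:, label_tokens.shape[1]:]               # drop the label, keep the sequence
```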

Evo 2, announced in February7, was trained on 9.3 trillion DNA base pairs — “a representative snapshot of genomes spanning all observed evolution”, as the accompanying bioRxiv preprint puts it. It could then identify intron–exon boundaries, predict the impacts of mutations and generate ‘realistic’ gene and genomic sequences, among other things.

Models made simple

Genomic AI models can also be distinguished by the type of regulatory interactions they predict, Kundaje says. Sequence-to-function models mostly identify important DNA motifs (which, because their function depends on their proximity to the regulated gene, are said to act in cis) without regard to the biology that occurs there.

Trans models, by contrast, aim to identify which genes regulate which others, teasing apart networks of gene regulation. (They are called trans because the factors that mediate this regulation act at a distance.) But this, says Kundaje, “is still very fraught and very problematic” because trans models — which are generally trained on data such as RNA expression — must infer causal relationships without data that can reveal causality. There’s no guarantee that two genes are directly linked just because their expression rises and falls in tandem. Even if they are, it’s not necessarily obvious in which direction the relationship works: does A regulate B or vice versa? If these models are then asked to predict the impact of a perturbation — for example, what happens if a given gene is knocked out — they often fail.

Models can include both cis and trans elements, says Sushmita Roy, a computational biologist at the University of Wisconsin–Madison, for instance by building regulatory networks on the basis of chromatin accessibility data and weighting those predictions by gene expression. But perhaps the first model to truly bridge the divide, Kundaje says, is Scooby — a single-cell version of Borzoi (go.nature.com/3upffnp). By leveraging both chromatin accessibility and transcriptional data from the same cells, Scooby predicts genome features and cell state simultaneously. “It is one of the first cis–trans models,” he says.

Sequence-to-function models can also probe other aspects of gene regulation. In 2024, teams led by Zhou (who is now at the University of Texas Southwestern Medical Center in Dallas), Kundaje and Charles Danko, a computational biologist at Cornell University in Ithaca, New York, independently described sequence-to-function models capable of predicting sites of transcription initiation8–10.

Zhou used his team’s model, Puffin, to identify the common features and placement of key regulatory elements around sites of transcription initiation, including binding sites for the transcription factors YY1, SP1, CREB and Initiator. Danko’s team trained its AI model on matched genome sequences and transcription initiation data from 58 individuals, creating a suite of models that were, he says, “for the first time aware of how differences between individuals in their genome sequence influence the pattern” of transcription initiation.

Collectively, says Zhou, these studies begin to tease apart the motifs that regulate the positioning and strength of transcription initiation, including that of the transcription factor TFIID. TFIID is an essential protein complex that binds to the promoter element known as a TATA box — despite the fact that most eukaryotic promoters don’t seem to contain a TATA box. “One mechanistic interpretation is that TFIID is binding the best available of the ‘bad options’ when it picks a site” in a TATA-less promoter, Danko explains.

Most genomic models make these predictions from relatively small inputs — anywhere from a few hundred to a few thousand bases. But gene regulation can occur over much longer tracts of genome space, and some models are able to make predictions at or near those scales. Borzoi, for instance, accepts 524 kilobases of input DNA, and Evo 2 and Google DeepMind’s newly announced AlphaGenome can work with a megabase.

These models can transform those sequences into vast collections of estimated data. Given an input sequence of 196,608 bases of human DNA, for instance, Enformer outputs 2,131 predictions of transcription factor binding, 1,860 of histone modifications, 684 of chromatin accessibility and 638 of gene expression, at 128-base resolution (go.nature.com/4mbe42h).
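
Some quick bookkeeping shows how those figures combine; the 114,688-base predicted window below is an assumption about how the 128-base bins tile the input, whereas the track counts are the ones given above.

```python
# Illustrative arithmetic only; the predicted window size is an assumption.
input_bases = 196_608
bin_size = 128
predicted_bases = 114_688                    # assumed central window that the bins tile
n_bins = predicted_bases // bin_size         # 896 genomic bins
tracks = {
    "transcription-factor binding": 2_131,
    "histone modifications": 1_860,
    "chromatin accessibility": 684,
    "gene expression": 638,
}
total_tracks = sum(tracks.values())          # 5,313 data tracks
print(n_bins, total_tracks, n_bins * total_tracks)   # -> 896 5313 4760448 predicted values
```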

A finite genome

Yet despite these models’ extensive ‘receptive fields’, they can still miss things, says Jacob Schreiber, a computational biologist at the Research Institute of Molecular Pathology in Vienna, because enhancers might exert effects that are biologically meaningful but invisible to the AI tool. “We have not cracked long-range regulation,” he says.

Another challenge is that, as vast as it is, the human genome is finite — there are only about 20,000–25,000 genes, for instance, and only a fraction of those are regulated in a cell-type-specific manner. That means that for all those billions of bases, there are relatively few examples of regulatory strategies from which a model can learn.

Carl de Boer is a biomedical engineer at the University of British Columbia in Canada. Credit: Paul Joseph

“There’s just so many different biochemical mechanisms that could happen on DNA that there are probably a very large number of them that only occur once or even zero times in our genome sequence,” says biomedical engineer Carl de Boer at the University of British Columbia in Vancouver, Canada.

One approach to broadening an AI model’s knowledge base is to feed it more than just reference genomes. Some model builders, for instance, train their tools on data from multiple individuals or from across the phylogenetic tree to give the models a sense of genetic diversity.

Another approach, advanced by de Boer and Jussi Taipale, a systems biologist at the University of Cambridge, UK, is to look beyond natural genomes to fully artificial DNAs11.

As a postdoc at the Broad Institute of MIT and Harvard in Cambridge, Massachusetts, de Boer and his colleagues tested some 100 million random sequences, each 80 nucleotides long — “about a human genome’s worth” — for their ability to drive expression of a fluorescent protein in yeast (Saccharomyces cerevisiae)12. (The yeast genome is made up of about 12 million bases, compared with roughly 3 billion in the human genome.) This approach, de Boer says, “is actually much better” for understanding the grammar of the genome than using genomic DNA, “because all of the signals you see in the random DNA are causal”. If you see fluorescence, the sequence is active. The genome, by contrast, is a product of evolution, meaning elements might be positioned owing to selective pressures as well as function.

According to de Boer, the yeast exercise yielded two key insights. First, it reinforced that “there are probably widespread biophysical interactions happening in regulatory regions”. Functional motifs were not randomly arranged in active sequences; they were positioned in specific configurations — for instance, to conform to the helical spacing of the DNA double helix.

The second insight involved the importance of low-affinity transcription-factor–DNA interactions. Even weak interactions, the team found, could exert a large influence on gene regulation, just as relatively weak chemical interactions can hold two proteins together.

Nature 644, 829-832 (2025)

doi: https://doi.org/10.1038/d41586-025-02621-8

This story originally appeared in Nature. Author: Jeffrey M. Perkel