‘Dark proteins’ hiding in our cells could hold clues to cancer and other diseases
The search is on to find out what they do
The human genome encodes potentially thousands of tiny proteins that were previously overlooked
In 2009, Jonathan Weissman was hunting for a new way to spy on what happens inside a cell. In particular, the molecular cell biologist wanted to know what proteins are produced at any given moment. So his laboratory came up with a way to directly measure the output of ribosomes — the cell’s protein factories.
The method, developed with then-postdoc Nicholas Ingolia, who is now at the University of California, Berkeley, involves collecting all of a cell’s ribosomes and sequencing the individual strands of messenger RNA that are bound to them. The researchers hoped this tool, called ribosome profiling, would provide an accurate tally of all the proteins a cell makes and their relative quantities.
But, when Weissman and others began trying the method out, they turned up a giant surprise. Not only were ribosomes busily churning out proteins encoded by known genes in a cell’s genome, but they also seemed to be making thousands upon thousands of ‘dark proteins’ that map to portions of the genome that weren’t thought to produce proteins [1]. “That was the ‘Aha!’ moment for us,” says Weissman, who is based at the Whitehead Institute in Cambridge, Massachusetts. Soon, his lab and others were uncovering unexpected translation events in nearly every organism they examined.
Fifteen years later, scientists are still scratching their heads over what to make of these proteins.
Dark proteins tend to be short — often just a few dozen amino acids or fewer. And many are unfamiliar — they don’t have close relatives in the genomes of other organisms. Studies suggest that some could have essential roles in the cell and might influence human health. They seem to be abundant in some cancers, and several companies hope to develop treatments that target dark proteins. But for many of these mysterious entities, the evidence that they’re doing anything important — or even whether they survive for very long in the cell — is equivocal.
The problem, says Marie Brunet, who studies proteomics at the University of Sherbrooke in Canada, is that scientists don’t really know what they might be missing. “If you have a protein lacking from the repository, you’re not even looking for it,” she says.
Gene-counting conundrum
Brunet is part of a global effort to document all of the dark proteins encoded by the human genome (see ‘Exploring the dark proteome’). The goal is to draw researchers’ attention to this dark matter, so they can begin working out what the proteins are doing, molecule by molecule.
The prospects are exciting, says Sebastiaan van Heesch, a systems biologist at the Princess Máxima Center for Pediatric Oncology in Utrecht, the Netherlands, who is also part of the effort: “There’s definitely new biology there.”
Leading up to the first publications of the human genome in the early 2000s, researchers were furiously examining the emerging sequence data trying to estimate the number of protein-coding genes. Typically, they looked for what are known as open reading frames (ORFs), stretches of code with specific three-letter sequences, or codons, that can contain instructions for making a protein. Genomicists looked for further clues, such as evidence that a sequence is conserved among other organisms and of reasonable length, all of which are indications that the resulting protein might have a function in cells.
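The ORF search described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any genome consortium: it scans only the forward strand, treats ATG as the sole start codon, and uses a hypothetical `min_codons` cut-off to stand in for the ‘reasonable length’ criterion. Conservation among organisms, the other clue mentioned, is not modelled at all.

```python
# Illustrative sketch only: real annotation pipelines weigh far more evidence.
START = "ATG"                    # assumed sole start codon
STOPS = {"TAA", "TAG", "TGA"}    # stop codons of the standard genetic code


def find_orfs(dna, min_codons=0):
    """Return (start, end, n_codons) for each forward-strand ORF.

    An ORF here runs from an ATG to the first in-frame stop codon.
    n_codons counts the coding codons (ATG included, stop excluded).
    min_codons is a hypothetical length cut-off, not a published threshold.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                     # three forward reading frames
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] != START:
                i += 3
                continue
            j = i
            # Walk codon by codon until a stop codon or the end of the sequence.
            while j + 3 <= len(dna) and dna[j:j + 3] not in STOPS:
                j += 3
            if j + 3 > len(dna):
                break                          # no stop codon left in this frame
            n_codons = (j - i) // 3
            if n_codons >= min_codons:
                orfs.append((i, j + 3, n_codons))
            i = j + 3                          # resume after the stop codon
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATAGGATGCCCGGGAAATTTTGA"     # made-up sequence for illustration
    print(find_orfs(seq))                      # every ORF, however short
    print(find_orfs(seq, min_codons=3))        # the length filter drops the shortest
```

Raising `min_codons` shows how a length requirement alone removes short ORFs from the list, which is one way that candidates such as those discussed in this article could have gone uncounted.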
Many ORFs that didn’t meet these criteria were ignored or simply missed as biologists refined their estimates. Consortia that maintain lists of predicted genes, such as the GENCODE project, currently list just shy of 20,000 protein-coding genes. About 90% of these have been confirmed, through other efforts, to produce corresponding proteins (individual genes can encode multiple different proteins by including or omitting stretches of code called exons). GENCODE and other projects make periodic adjustments to their lists as data emerge.
But, according to John Prensner, a cancer biologist at the University of Michigan Medical School in Ann Arbor, the accounting is incomplete. The idea that researchers could draw up a complete list of protein-coding genes as early as 2001 was a popular misconception. “The leaders of the Human Genome Project always knew they were just starting a conversation,” Prensner says.
The ability to directly measure ribosomes’ output led to an explosion of interest in overlooked ORFs and their potential to encode working proteins. In a 2022 correspondence to Nature Biotechnology, a team co-led by Prensner, van Heesch and others assembled a list of more than 7,000 of these ‘non-canonical’ ORFs, which generally don’t meet the requirements to be considered protein-coding genes and have therefore been omitted from databases [2]. (That is a lower bound, says van Heesch; other studies have identified tens of thousands of potential dark proteins encoded by the human genome.)
Most of the unconventional ORFs that the scientists listed are found either near to or overlapping with canonical, protein-coding genes (see ‘Where are all the dark proteins?’). About one-third of them were in sequences called long non-coding RNAs, which — as the name suggests — weren’t expected to encode proteins but were thought to have regulatory roles.
But just because an ORF is translated into a protein doesn’t mean that the proteins are stable or have important jobs in a cell. The translation of some non-canonical ORFs, which is carried out by ribosomes, could be a way for cells to control the activity of a nearby gene, for instance, by gumming up the ribosomal machinery with products that will be quickly degraded, say Prensner and others. This sort of control occurs in certain upstream ORFs that appear ahead of a protein-coding sequence.
In a preprint study that follows up on their 2022 publication, Prensner, van Heesch and an expanded consortium of genomics and proteomics specialists trawled through hundreds of proteomics data sets (comprising billions of data points) and findings from studies that used mass spectrometry and other approaches to identify the protein content of cells [3]. The researchers found protein fragments corresponding to more than 1,700 of the non-canonical ORFs they had identified in 2022. For 15 of them, the researchers argued, the evidence was strong enough to make a case for adding the proteins and their corresponding genes to official tallies of protein-coding genes.
But, for most non-canonical ORFs, clear-cut evidence that they can produce a protein is lacking. Part of the challenge is the small size of the potential proteins — researchers call them microproteins because they tend to be much shorter than 100 amino acids (on average human proteins contain several hundred amino acids, and many are much longer). Their short length makes it hard to find matching fragments — which are created in experiments that break proteins apart and identify the resulting shards by their mass. Cell samples will over-represent the fragments of longer proteins, says van Heesch, particularly if the microproteins are less abundant.
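Why short proteins leave so few matching fragments can be illustrated with a toy digestion. The sketch below assumes a simplified trypsin-like rule (cut after every lysine, K, or arginine, R) and a hypothetical detectable length window of 7 to 30 residues. Neither the enzyme nor the window comes from the studies described here, and the sequence is invented.

```python
def digest(protein):
    """Split a protein after every K or R (simplified trypsin-like rule)."""
    peptides, current = [], ""
    for residue in protein:
        current += residue
        if residue in "KR":          # cleavage site: close the current peptide
            peptides.append(current)
            current = ""
    if current:                      # trailing peptide with no K/R at its end
        peptides.append(current)
    return peptides


def detectable(peptides, min_len=7, max_len=30):
    """Keep peptides inside an assumed, illustrative length window."""
    return [p for p in peptides if min_len <= len(p) <= max_len]


if __name__ == "__main__":
    microprotein = "MAPLKGGSWTEDLVRQK"   # invented 17-residue sequence
    fragments = digest(microprotein)
    print(fragments)                     # all fragments from the digest
    print(detectable(fragments))         # only those inside the length window
```

In this toy case a 17-residue protein yields three fragments, only one of which falls inside the assumed window. A protein several hundred residues long would offer many more chances to produce a usable fragment, which is consistent with van Heesch’s point about over-representation of longer proteins.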
When scientists have used the artificial-intelligence tool called AlphaFold to predict the structure of dark proteins, the molecules often bear little resemblance to well-folded, bona fide proteins. But, Prensner says, “there are clear examples that very much look like canonical proteins and were just missed”. GENCODE and other organizations that manage repositories have begun to add these overlooked proteins to their lists.
Jonathan Mudge, who works on the GENCODE project at the European Molecular Biology Laboratory’s European Bioinformatics Institute in Hinxton, UK, and is a co-author of the preprint, says that around 50 sequences identified through ribosome profiling have been included in its list of human protein-coding genes.
But they are moving carefully, he adds. It isn’t just lab biologists who rely on efforts such as GENCODE to make sense of experiments. Clinicians, too, rely on such databases, and adding a slew of suspect protein-coding genes could complicate efforts to identify harmful variants found in patient genomes, says Mudge. “We’re not sceptical. We’re just cautious,” he says.
Edits for clarity
Around the time that researchers started shining a spotlight on the genome’s potential to encode dark proteins, another breakthrough made it possible to systematically study their effects in cells: CRISPR–Cas9 gene editing. “Suddenly we could surgically take out the coding sequence of these non-canonical proteins and ask, are they important for the function of the cell?” Weissman says.
In a 2020 Science paper, Weissman’s team showed just that. The researchers used CRISPR gene editing to interrupt thousands of non-canonical ORFs, preventing them from being translated into proteins in human induced pluripotent cells, as well as in a cancer cell line [4]. In hundreds of instances, the CRISPR editing caused a growth defect in the cells. “Many microproteins were really important for cells,” says Weissman.
With further experiments, they were able to ascertain why. In some instances, the proteins encoded by a non-canonical ORF interacted with a protein encoded on the same mRNA strand. This is reminiscent of the way that co-regulated bacterial genes tend to reside next to one another in units called operons, Weissman says. The functional dark proteins his team identified took on diverse roles in cells: one seemed to be involved in the cell cycle, another in mitochondrial physiology.
Cancer cells might be especially rich in dark proteins. Prensner, who is also a paediatric neuro-oncologist, is studying the possibility that non-canonical ORFs present in all human genomes can become misregulated in some cancers, potentially contributing to different treatment outcomes that he and other cancer physicians see in their patients. “We’re asking the central question of why cancers are making these things,” he says.
In similar experiments to Weissman’s, a team led by Prensner found that about 10% of more than 500 non-canonical ORFs they inactivated with CRISPR caused growth defects in various kinds of human cancer cell [5]. Prensner and his colleagues identified a dark protein that was expressed at elevated levels in breast-cancer cell lines and also seemed to be driving their growth.
In a study published last year, Prensner, van Heesch and their colleagues identified several dark proteins that contribute to medulloblastomas, sometimes fatal paediatric brain cancers [6]. In one example, the researchers showed that a dark protein, independent of a canonical protein encoded by an adjacent ORF, was driving the growth of especially aggressive forms of medulloblastoma that carry an overactive version of a cancer gene called MYC.
Nature 637, 1038-1040 (2025)
doi: https://doi.org/10.1038/d41586-025-00217-w
This story originally appeared on: Nature - Author: Ewen Callaway