DNA recovered from ancient remains is transforming our understanding of organisms and ecosystems from tens, thousands and even millions of years ago – but the growing volume of data must be better preserved

Ancient DNA data hold insights into past organisms and ecosystems — handle them with more care

DNA fragments extracted from archaeological human remains can be sequenced to identify the microorganisms that caused disease. Credit: Microgen/Getty

Over the past ten years or so, investigations of degraded or ‘ancient’ DNA have skyrocketed. By extracting DNA fragments from diverse sources, ranging from human teeth and faeces to soil samples and ice cores, researchers have uncovered the stories of all sorts of organisms and ecosystems stretching back for millennia.

Investigators have used ancient genomics to discover the previously unknown Denisovan hominin, an evolutionary cousin and contemporary of Neanderthals that left hardly any fossil record; to identify which microorganisms caused human disease thousands of years ago; and to establish the geographical origins of domesticated plants and animals, including maize (corn) and horses1. Ancient genomics has also been used to reconstruct the composition of Pleistocene ecosystems that existed up to two million years ago2; and even to identify the wearers of ice-age jewellery3.

Most DNA sequence data are now archived in dedicated, publicly accessible databases, and the ancient DNA field has been heralded by some as a poster child for best practices in genetic data sharing4,5. However, as the pace of ancient DNA research has increased — largely thanks to the latest capabilities in DNA sequencing (see ‘A sampling surge’) — so, too, have problems with data archiving.

Figure: ‘A sampling surge’. Source: https://doi.org/10.5281/zenodo.14203618. Analysis by A. Bergström et al.

Often, only some of the data obtained in any one study are uploaded to publicly available databases. Furthermore, the associated metadata — information on the age of the sample, where it was found, how the DNA was extracted and chemically treated, and so on — are frequently inaccurate or incomplete.

Here, we set out the nature of these problems and outline steps to overcome them, so that this astonishing record of the genetic past can be digitally preserved and used again and again.

Data loss

Ancient genomic data have been obtained from more than 10,000 humans6, some 700 microbes and viruses7 and, by our estimate, more than 2,000 plant and non-human animal samples. At least 2,200 ancient host-associated and environmental microbiomes (communities of microorganisms) have been sequenced7 (see also go.nature.com/3bcaxtv).

One major problem, however, is that not all sequences end up being archived.

Earlier this year, one of us (A.B.) assessed what data and metadata had been uploaded into publicly accessible databases by the authors of 42 studies of ancient DNA. All studies involved the analysis of ancient DNA extracted from humans or non-human animals, and had been published in 2021, 2022 or 2023 in the journals Nature, Science and Cell. In about half of the papers, researchers archived only those sequences that they had managed to align to a reference genome, such as that for ancient human remains, leaving no record of the unaligned sequences (see ‘A snapshot of data-archiving troubles’). This represents a permanent loss of data for more than 3,000 ancient samples analysed in just these studies8.

Figure: ‘A snapshot of data-archiving troubles’. Source: Ref. 8

It might seem that any sequence that does not align to the reference genome is irrelevant. But improvements in computational methods and more-complete reference genomes could enable researchers to align such sequences in the future. Also, even if some of the unaligned sequences are not from the species of interest, this does not mean that they have no scientific value. On the contrary, these sequences could be among the most interesting in the data set, especially if they originated from pathogenic microbes that infected the host.

The study of ancient microbes has become a field in its own right, and is transforming our understanding of microbial evolution and the history of many infectious diseases. Until 2015, for instance, archaeologists and historians thought that plague (caused by the bacterium Yersinia pestis) emerged as a significant human disease only around 1,500 years ago. Analyses of non-aligned sequences from Neolithic and Bronze Age human remains have revealed, however, that outbreaks of the plague were occurring 5,000 years ago9. Researchers have likewise used studies of non-aligned sequences extracted from human remains to illuminate the evolutionary history of infectious agents such as smallpox, hepatitis B virus and parvovirus during the past 10,000 years10. Such work could help scientists to improve their understanding of current infectious-disease threats.

Another problematic archiving practice is the uploading of digitally trimmed sequences. In their analyses, researchers sometimes remove the last few nucleotides from DNA fragments (usually the most degraded parts of the molecule) to increase the likelihood that the fragments will align to a reference genome. Just as with the exclusion of non-aligned sequences, uploading only these trimmed sequences to public databases limits future researchers’ abilities to replicate findings and to authenticate that the data carry the expected patterns of degradation. It leads to the permanent loss of potentially useful data from the scientific record.

To further complicate efforts to reuse data, some researchers upload merged data that have been collected at different times and obtained using various laboratory protocols. We have also found that data sets obtained from different samples are sometimes incorrectly reported as coming from the same sample, and that data sets from a single sample are sometimes incorrectly registered to multiple samples8,11.

Metadata mess

Being able to reuse ancient DNA data easily and efficiently doesn’t just require researchers to make all their primary data available — it also requires them to annotate those data accurately and comprehensively.

At a minimum, information is needed on the estimated age of the remains being studied, where in the world those remains were found, the type of tissue or material sampled, and key technical details, such as whether the ancient DNA was chemically treated to repair or remove the post-mortem damage that accumulates in such molecules1.
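As a rough illustration of what such a minimum might look like in practice, the sketch below checks a sample record for the four kinds of information listed above. The field names and the example record are hypothetical placeholders chosen for this illustration; they are not taken from any archive or published checklist.

```python
# Hypothetical field names for illustration only; they do not come from
# MIxS, MInAS or any archive's submission system.
REQUIRED_FIELDS = (
    "estimated_age",      # estimated age of the remains
    "find_location",      # where in the world the remains were found
    "material_sampled",   # type of tissue or material sampled
    "damage_treatment",   # whether the DNA was chemically treated for damage
)


def missing_metadata(record):
    """Return the required fields that are absent or empty in a sample record."""
    return [field for field in REQUIRED_FIELDS if record.get(field) in (None, "")]


# A made-up record with one empty field and one field left out entirely.
example = {
    "estimated_age": "about 5,000 years",
    "find_location": "",
    "material_sampled": "tooth",
}

print(missing_metadata(example))
```

A check of this kind could in principle be run before submission, so that gaps are caught while the people who know the answers are still on hand.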

By analysing ancient DNA, researchers are uncovering the stories of extinct organisms, such as the woolly mammoth (Mammuthus primigenius; artist’s impression).Credit: Daniel Eskridge/Getty

Public data archives, such as the European Nucleotide Archive (ENA), provide systems for reporting some of this information. But they are underused by the ancient DNA community. In A.B.’s survey of 42 studies, researchers submitting data to archives included information on the geographical origins of samples in only about 60% of studies, and on the age of the sample in only about 17% of studies8.

Part of the problem is that the systems for recording metadata in standard archives are not designed with ancient DNA in mind. It is often unclear how researchers should record information on when an organism lived, for example, or how to indicate that only imprecise age and geographical information is available. Also, researchers often have different understandings about what information should be recorded. Does ‘geographical location’ mean the place of excavation or the museum from where the remains were sampled? For excavation sites, should the name of the nearest town be provided or the latitude and longitude? Does ‘collection date’ refer to when the organism lived, when the excavation took place or when the sampling at the museum occurred?

Currently, managers of databases such as the ENA intend geographical location to refer to where specimens were sampled for DNA or RNA analysis, and collection date to refer to when they were sampled. This also applies to other resources in the International Nucleotide Sequence Database Collaboration (INSDC), an effort to coordinate databases containing DNA and RNA sequences. Most ancient DNA researchers assume, however, that these fields refer to where and when the sampled organism lived.
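One way to picture the distinction is a record that gives each meaning its own explicitly named field, so that where and when the organism lived cannot be confused with where and when the specimen was sampled. The sketch below is only an assumption about how that could be laid out; the class, its field names and the example values are hypothetical and do not reflect any INSDC schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleProvenance:
    """Hypothetical record; none of these field names exist in INSDC archives."""

    excavation_site: Optional[str]          # where the remains were found
    sampling_location: Optional[str]        # where the specimen was sampled for DNA
    organism_age_min_years: Optional[int]   # youngest plausible age of the organism
    organism_age_max_years: Optional[int]   # oldest plausible age of the organism
    sampling_date: Optional[str]            # when the specimen was sampled

    def age_span(self) -> Optional[int]:
        """Width of the age estimate in years, or None if either bound is unknown."""
        if self.organism_age_min_years is None or self.organism_age_max_years is None:
            return None
        return self.organism_age_max_years - self.organism_age_min_years


# Placeholder values, invented for this example.
sample = SampleProvenance(
    excavation_site="Site A (placeholder)",
    sampling_location="Museum B (placeholder)",
    organism_age_min_years=4800,
    organism_age_max_years=5200,
    sampling_date="2022-05-01",
)

print(sample.age_span())
```

Storing the age as a range rather than a single number is one possible way to record that only an imprecise estimate is available.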

These metadata-reporting problems lead to considerable confusion and inefficiencies. Researchers wanting to use published ancient DNA data often have to piece together a lot of the metadata themselves by digging through supplementary tables or by contacting the data producers11.

Several initiatives are under way to recover metadata for published ancient genomic data and systematically package them into more user-friendly resources6,7,11,12. (This includes AncientMetagenomeDir, in which J.A.F.Y. and C.W. are involved.) Extensive volunteer work on such projects provides a bandage. But the fact that such post-publication metadata curation is needed speaks to the urgency of the problem. Furthermore, there is currently little coordination between the subfields that are trying to achieve more consistent standards.

Cultural shift

Data archiving in ancient genomics is not sufficiently prioritized. Too often, it seems to be delegated to inexperienced junior researchers or performed at the last minute to comply with journals’ publishing requirements.

Portions of some genomic data sets might need to be withheld from public archiving13 — for instance, if descendant communities want to restrict data sharing or data reuse on ethical grounds14. When this is the case, it should be made explicit — for example, through the Biocultural Labels Initiative, which involves pairing sequence data with statements about community expectations around the appropriate use of biocultural collections and genomic data. But aside from these cases, comprehensive archiving of data and metadata should be standard practice.

What is most needed now is a culture shift in the ancient DNA community.

In developing better standards, researchers don’t have to start from scratch. The challenges around metadata reporting are not unique to ancient DNA research. In 2008, a group of biologists formed the Genomic Standards Consortium to promote the reporting of standardized metadata for the growing body of genomes and metagenomes being deposited in data archives15. The Minimum Information about any (x) Sequence (MIxS) framework developed by the consortium, comprising checklists with standardized metadata fields that researchers must fill out when submitting genomic data, has since been adopted by the INSDC databases16.

Such checklists provide a model for ancient DNA researchers. Indeed, last year, a network of ancient DNA researchers (including J.A.F.Y. and C.W.) proposed exactly this — a Minimum Information about any Ancient Sequence (MInAS) scheme for ancient DNA.

To make such checklists effective, the ancient DNA community needs to develop them in partnership with the INSDC, museum curators, archaeologists, radiocarbon-dating specialists and so on. This would enable researchers in all subfields to establish which metadata fields are common to everyone and what is needed for each subfield.

Journal editors and research funders can help to ensure that all primary data are uploaded to public databases and annotated appropriately. Editors typically require authors to provide a ‘data accession identifier’ — a code obtained from a public database after an upload — to prove that they have complied with data-reporting standards. But reviewers and editors rarely check that the data have been archived correctly or completely. Journal guidelines should explicitly state that authors should submit to a public database all of the sequences generated in a study — not just those aligned to a reference genome — and that the archived data must be accompanied (in the same database) by at least a minimal set of metadata.

Funders such as the European Research Council and the US National Science Foundation could likewise be more explicit about appropriate standards for data archiving. Similarly, the archaeologists and museum curators providing the biological materials used in ancient DNA research could provide researchers with specimens on the condition that any data obtained from them are archived appropriately.

Data derived from existing cell lines or bacterial cultures can usually be regenerated. But with ancient remains, samples are always limited and often rare. Second tries are not always possible or desirable, for instance if researchers have to destructively resample a bone or a tooth. All science should be reproducible. But for ancient genomics especially, everyone in the field stands to benefit if the data derived from these finite resources are handled with more care.

Nature 636, 296-298 (2024)

doi: https://doi.org/10.1038/d41586-024-03993-z

This story originally appeared in Nature. Author: Anders Bergström