The ability to link private and public data sets could be putting research participants’ private health information at risk

‘Anonymous’ genetic databases vulnerable to privacy leaks

Links between public genetic data sets could be exploited to reveal people’s private information. Credit: imagedepotpro/Getty

A study has raised concerns that a type of genetic database that is increasingly popular with researchers could be exploited to reveal the identities of its participants, or link private health information to their public genetic profiles.

Single-cell data sets can contain information on gene expression in millions of cells collected from thousands of people. They are often freely accessible, providing a valuable resource for researchers who study the effects of diseases at a cellular level. The data are supposed to be anonymized, but a study published on 2 October in Cell¹ shows how genetic data from one study “can be exploited to uncover private information about individuals in another study”, the authors write.

The findings highlight the difficulty of balancing the interests of researchers with the privacy of donors. “Our genomes are very identifying. They can tell a lot about us, our traits, our predisposition to diseases,” says study co-author Gamze Gürsoy, a bioinformatics researcher at Columbia University in New York City. “You can change your credit-card number if it leaks, but you cannot change your genome.”

Sensitive data

Concerns around privacy in genetic data sets have been raised before, but they mainly focused on ‘bulk’ genetic data sets. These contain information on gene activity averaged across a large population of cells, rather than data from individual cells.

It was previously thought that single-cell data sets wouldn’t be as vulnerable to privacy leaks, owing to the level of ‘noise’, or variation in gene expression, between different cells. But Gürsoy and her colleagues demonstrated that was not the case.

The team reviewed three publicly available single-cell data sets, which included blood cells from people with lupus, a chronic autoimmune disease. The researchers found that they could use data on gene expression to predict the structure of a person’s genome, by combining these values with information on expression quantitative trait loci (eQTLs). The details of eQTLs, variations on the chromosome that correlate with gene expression, are also publicly available on single-cell data sets.

To test the reliability of their work, the researchers checked their genome predictions against a genome database corresponding with the cells they used. They were able to link most of the data sets to their corresponding genome, with an accuracy rate of more than 80%.
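
The general idea of such a linkage can be illustrated with a toy sketch. This is not the authors' method: the eQTL table, the linear rule that turns average expression into a genotype, and every name and number below are hypothetical, chosen only to show how expression values plus eQTL information could point to one genome in a database.

```python
from statistics import mean

# Hypothetical eQTL table (illustrative values only):
# gene -> (variant id, baseline expression, change per copy of the variant)
EQTLS = {
    "GENE_A": ("snp1", 1.0, 0.5),
    "GENE_B": ("snp2", 2.0, -0.4),
    "GENE_C": ("snp3", 0.5, 0.3),
}

def predict_genotype(cells):
    """Guess 0, 1 or 2 variant copies per eQTL from one donor's cells.

    Averaging over cells is assumed here to smooth out cell-to-cell noise.
    """
    predicted = {}
    for gene, (snp, baseline, effect) in EQTLS.items():
        avg = mean(cell[gene] for cell in cells)
        copies = round((avg - baseline) / effect)
        predicted[snp] = min(2, max(0, copies))  # clamp to a valid copy number
    return predicted

def link(predicted, genome_db):
    """Return the database entry that agrees with the most predicted genotypes."""
    def score(name):
        genome = genome_db[name]
        return sum(predicted[snp] == genome.get(snp) for snp in predicted)
    return max(genome_db, key=score)

# Toy "genome database" and noisy single-cell expression for one unnamed donor.
genome_db = {
    "person_1": {"snp1": 0, "snp2": 2, "snp3": 1},
    "person_2": {"snp1": 2, "snp2": 0, "snp3": 1},
    "person_3": {"snp1": 1, "snp2": 1, "snp3": 0},
}
donor_cells = [
    {"GENE_A": 2.3, "GENE_B": 1.8, "GENE_C": 0.9},
    {"GENE_A": 1.8, "GENE_B": 2.1, "GENE_C": 0.6},
    {"GENE_A": 2.0, "GENE_B": 2.2, "GENE_C": 0.8},
]

guess = predict_genotype(donor_cells)
match = link(guess, genome_db)
```

In this toy case, the averaged expression values are enough to single out one entry in the database, even though each individual cell is noisy. Real data would involve far more genes, variants and people, and a statistical model rather than a simple rounding rule.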

Unlike the data on gene expression and eQTLs, full-genome databases can usually be accessed only by scientists, to protect donors’ identifying information. But the researchers point out that a participant’s genome data can be publicly available elsewhere. For example, they might have uploaded it to a genealogy website in which users send DNA samples to learn more about their ancestry. In this case, an attacker could identify an individual whose cells are in a single-cell data set using their genome. This could reveal personal data related to a sensitive trait such as a psychiatric disorder, given that research participants are often selected to study the biology of these complex conditions.

Privacy breaches like these could have real-world implications, such as causing employment discrimination, says Gürsoy. She adds that any leaks could even have repercussions for future generations, given that genetic traits can be passed to offspring. “Anything that leaks about us will perpetuate through generations,” she says.

Bradley Malin, who researches large-scale genomic data sharing at Vanderbilt University in Nashville, Tennessee, says that the study is a “novel extension and contribution to the literature”. He adds that future research could explore whether genomic data could still be linked in larger data sets that include samples from thousands or millions of people.

Competing interests

Scientists are unsure about how best to tackle these privacy concerns. “There’s the desire to protect individual privacy, but also the desire to collectively advance medical research, and those are, unfortunately, at odds with each other,” says Mark Gerstein, who researches medical data science at Yale University in New Haven, Connecticut. The simplest solution would be to stop making genetic data so easily accessible, but this would negatively affect research, he says. “We need to share and aggregate large amounts of information,” he adds. “Locking it down and making it more private, really, just gums that whole process up.”

In their study, Gürsoy and her colleagues say that there should be greater transparency about the risks for participants who share their genomic data, and suggest that researchers should ensure that donors consent to their data being shared. Another way forward could be encrypting personal data when it is part of a public database. The authors acknowledge that doing this would complicate the process of building and maintaining data sets, but say that it could help to protect participants’ privacy.

doi: https://doi.org/10.1038/d41586-024-03236-1

This story originally appeared on: Nature - Author: Helena Kudiabor