Generative-AI models require massive amounts of data — scholarly publishers are licensing their content to train them

Publishers are selling papers to train AIs — and making millions of dollars

Microsoft made a US$10-million deal with Taylor & Francis to use the publisher’s papers to train its artificial-intelligence systems. Credit: Omar Marques/SOPA/LightRocket via Getty

Since the explosion in popularity of generative artificial intelligence (AI), several scholarly publishers have forged agreements with technology companies looking to use content to train the large language models (LLMs) that underlie their AI tools. A new tracker aims to catalogue what deals are being made — and by whom.

“We were seeing announcements of these deals, and we got to thinking that this is starting to become a pattern,” says Roger Schonfeld, a co-creator of the tracker and vice-president of libraries, scholarly communication and museums at Ithaka S+R, a higher-education consulting firm in New York City. “We wanted to shine some light on not just the individual deals, but also what the overall pattern was starting to look like — and provide a source for the community.”

Schonfeld and his colleagues launched the Generative AI Licensing Agreement Tracker in October. It includes information about licensing deals — confirmed and forthcoming — between technology companies and six major academic publishers, including Wiley, Sage and Taylor & Francis. Schonfeld says that the list documents only public agreements, and that there are probably several others that remain undisclosed.

Many publishers are considering questions such as how licensing — or not licensing — content to generative-AI companies will affect revenue, and the risks or benefits of being among the first to act in this space, Schonfeld says. “Every publisher of a certain scale and above is absolutely grappling with this issue.”

Growing trend

Several big publishers have cashed in on AI licensing deals this year. In May, Informa, the parent company of the UK academic publisher Taylor & Francis, announced that it had made a US$10-million deal to license content to Microsoft. The next month, the US academic publisher Wiley announced to its investors that it had earned $23 million from a deal with an unnamed firm developing generative-AI models. In September, the company said that it expected to earn another $21 million from such agreements this financial year. Nature’s news team contacted several other publishers, including Elsevier and Springer Nature, Nature’s publisher, about whether they had plans for licensing deals, but received no comment. (Nature’s news team is editorially independent of its publisher.)

“We are providing data and content under license for the purposes of training AI, such as LLMs, so that those models become more accurate and relevant for the benefit of everyone who uses them,” a spokesperson for Taylor & Francis said in a statement. “Licensing activities such as this are a key responsibility for research publishers and part of our ongoing commitment to ensuring authors’ ideas make the fullest possible contribution.”

The spokesperson says that royalties will be paid to authors, and that there are strict boundaries attached to the publisher’s AI partnership agreements. For example, data and content can be used only for training and are under no circumstances permitted to be reproduced in an equivalent format.

A Wiley spokesperson said that royalties will be paid to book authors and other publishing partners, and that it is monitoring AI-model developers for use of copyrighted material without permission. Several of the publishers contacted by Nature said that they had put measures in place to prevent AI tools from scraping their content from the web without permission.

Some publishers haven’t yet entered into any agreements, including the American Association for the Advancement of Science (AAAS), a non-profit academic publisher that publishes Science. Meagan Phelan, communications director for the Science family of journals in Washington DC, says that the AAAS might consider licensing its content to technology companies in the future, if they meet certain criteria. These criteria include a firm’s trustworthiness and the usefulness of the tools that will be created with the content.

Shifting priorities

There are signs that publishers don’t see these deals as a one-off. In October, Wiley launched a programme dubbed Wiley AI Partnerships, aimed at working with technology companies to develop AI applications. “This is being taken very seriously,” says Maya Dayan, a co-creator of the tracker and a programme manager for strategic research and market analysis at ITHAKA, Ithaka S+R’s parent company. “We’re seeing new positions and departments created, new priorities being set; these are not one-off deals.”

Some scholars have been apprehensive about deals being made without their knowledge on content they produced. To address this issue, a few publishers have taken steps to involve authors in the process.

Berlin-based academic publisher De Gruyter Brill has created an information page for authors, explaining its plans to enter into formal agreements with generative-AI developers.

“While many authors have given us their explicit consent to use their titles, we fully understand that some authors remain sceptical or concerned about the overall societal impact of AI and about our recent announcements,” says Pablo Dominguez Andersen, director of communications at De Gruyter Brill. “We are currently engaging with many of these authors directly, to understand their concerns and to explain our approach, and why we believe entering into formal agreements is the only way forward.”

Cambridge University Press & Assessment (CUPA) is taking an opt-in approach — the UK publisher has contacted 20,000 authors for permission to license their content to technology companies developing LLMs. “We wanted to ask authors, not because we think they shouldn’t want their content to be going in there, but we want to be able to tell them why this is a good thing,” Mandy Hill, the managing director of CUPA, told The Bookseller in October. According to Hill, only a few authors have declined to license their content.

“It’s been interesting to see the different approaches to how authors are brought on board,” says Dayan. “I’ve started to see a trend towards communicating very directly with authors from the beginning, rather than announcing a deal and then interacting with authors on the back end.”

doi: https://doi.org/10.1038/d41586-024-04018-5

This story originally appeared on: Nature. Author: Diana Kwon