U.S. Libraries Share Books with AI Developers to Feed Data-Hungry Chatbots


Published: 16 June 2025

Everything ever said on the internet was just the beginning of teaching artificial intelligence about humanity. Now, tech companies are turning to an older repository of knowledge: library bookshelves.

Nearly a million books—published as far back as the 15th century and in 254 languages—are part of a collection recently shared by Harvard University with researchers. Soon, this treasure trove will be joined by historical newspapers and government documents held by the Boston Public Library.

Opening up vaults containing centuries-old volumes could offer a data goldmine for tech firms, which face lawsuits from novelists, visual artists, and others whose creative work has been used—often without consent—to train AI chatbots.

“It’s a prudent decision to begin with public domain material, because that’s less controversial right now than content still under copyright,” said Burton Davis, associate general counsel at Microsoft.

Davis noted that libraries also preserve “tremendous amounts of cultural, historical, and linguistic data” that’s absent from the past few decades of online discourse that has largely fed AI models. Fears of running out of high-quality training data have led AI developers to use “synthetic data,” generated by chatbots themselves, which is often of lower quality.

With the support of unrestricted grants from Microsoft and OpenAI—the maker of ChatGPT—Harvard’s Institutional Data Initiative is working with libraries and museums around the world to prepare their historical collections for AI in a way that also benefits the communities they serve.

“We’re trying to shift some power currently held by AI back to these institutions,” said Aristana Scourtas, who leads research at Harvard Law School’s Library Innovation Lab. “Librarians have always been the stewards of data and information.”

Harvard’s newly released dataset, Institutional Books 1.0, contains over 394 million scanned pages. One of the oldest works is a handwritten 15th-century reflection by a Korean painter on cultivating flowers and trees. The largest concentration of works comes from the 19th century, covering subjects like literature, philosophy, law, and agriculture—carefully preserved and curated by generations of librarians.

This resource promises to be highly valuable for AI developers aiming to improve the accuracy and trustworthiness of their systems.

“Much of the data used in AI training hasn’t come from original sources,” said Greg Leppert, executive director of the data initiative and chief technology officer at the Berkman Klein Center for Internet & Society at Harvard. This collection, he added, includes “the physical copies scanned by the institutions that originally collected them.”

Before ChatGPT sparked a commercial AI boom, most AI researchers weren’t very concerned with the provenance of text extracted from Wikipedia, social media forums like Reddit, or even massive troves of pirated books. All they needed were what computer scientists call “tokens”—units of data representing word fragments or phrases.

Harvard’s new AI training dataset includes an estimated 242 billion tokens, an enormous number by human standards, yet still a small fraction of the data used in today’s most advanced systems. For instance, Meta—the parent company of Facebook—has said the latest version of its large language model was trained on over 30 trillion tokens from text, images, and video.
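Token counts like these depend entirely on which tokenizer does the splitting. As a rough illustration only, and not a depiction of Harvard's or Meta's actual pipelines, the sketch below shows how a single sentence breaks into tokens using the open-source tiktoken library; the choice of the "cl100k_base" encoding is an assumption made for the example.

```python
# Illustrative sketch: how a sentence becomes "tokens" under one common
# byte-pair-encoding tokenizer. Counts will differ with other tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding, for illustration

text = "Librarians have always been the stewards of data and information."
token_ids = enc.encode(text)

print(len(token_ids))                          # how many tokens this sentence uses
print([enc.decode([t]) for t in token_ids])    # the word fragments behind each id
```

Multiplying such per-sentence counts across hundreds of millions of scanned pages is how collections reach figures in the hundreds of billions of tokens.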

Meta is also facing a lawsuit from comedian Sarah Silverman and other published authors accusing the company of stealing their books from “shadow libraries” of pirated content.

Now, with some conditions, real libraries are laying down their own rules.

OpenAI, which is also facing several copyright infringement lawsuits, donated $50 million this year to a group of research institutions, among them Oxford University's 400-year-old Bodleian Library, which is digitizing rare books and using AI to transcribe them.

When OpenAI first approached the Boston Public Library, one of the largest in the U.S., the library made it clear that any digitized materials would be made publicly available, according to Jessica Chapel, its director of digital and online services.

“OpenAI had an interest in massive amounts of training data. We have an interest in massive amounts of digital objects. So, this seems like a case where interests align,” Chapel said.

Digitization is expensive. Boston's library, for example, has undertaken the painstaking task of scanning and organizing dozens of French-language newspapers that circulated widely among Québécois immigrant communities in New England in the late 19th and early 20th centuries. Now that such text is in demand for AI training, the work helps fund projects librarians had already hoped to pursue.

Digitization of Harvard's collection began in 2006 as part of Google's controversial project to build a searchable online library of over 20 million books.

Google spent years fighting lawsuits from authors over its digital library, which included many newer, copyrighted works. That battle ended in 2016, when the U.S. Supreme Court let stand lower court rulings rejecting copyright infringement claims.

Now, for the first time, Google has worked with Harvard to extract volumes from Google Books that are in the public domain and make them available for AI development. U.S. copyright protections typically last 95 years, and even longer for sound recordings.

The initiative has even won praise from the very group of authors that once sued Google and more recently has taken AI firms to court.

“Many of these titles exist only on the shelves of major libraries, and creating and using this dataset will expand access to these volumes and the knowledge they contain,” said Mary Rasenberger, CEO of the Authors Guild, in a statement. “Most importantly, the creation of a large, legally sound training dataset will democratize the development of new AI models.”

It remains to be seen how useful all of this will be for the next generation of AI tools as the data becomes available on Hugging Face, a platform that hosts open-source AI datasets and models.
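For readers curious about what "available on Hugging Face" means in practice, the sketch below shows the typical way open datasets are loaded there with the `datasets` library. The repository id and split name are placeholders, not the confirmed listing for Institutional Books 1.0; check the Hugging Face hub for the actual entry.

```python
# Minimal sketch of streaming an open dataset from Hugging Face.
# The repository id below is a placeholder for illustration only.
from datasets import load_dataset

books = load_dataset(
    "example-org/institutional-books",  # placeholder id, not the real listing
    split="train",                      # assumed split name
    streaming=True,                     # stream records instead of downloading everything
)

for record in books.take(3):            # peek at a few records
    print(record.keys())
```

Streaming matters here: a collection of hundreds of millions of scanned pages is far too large for most researchers to download in full before deciding which slices they actually need.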

The book collection is more linguistically diverse than typical AI training datasets. Less than half of the volumes are in English, though European languages still dominate—particularly German, French, Italian, Spanish, and Latin.

A collection steeped in 19th-century thought could also be “immensely crucial” for tech industry efforts to build AI agents capable of human-like planning and reasoning, said Leppert.

“In a university, you have all these pedagogical materials about what it means to reason,” he noted. “You also have a lot of scientific information about how to execute processes and analyses.”

At the same time, such a vast dataset includes outdated information—from discredited scientific and medical theories to racist and colonialist narratives.

“When you deal with a dataset this large, there are some thorny issues around harmful content and language,” said Kristi Mukk, coordinator at Harvard’s Library Innovation Lab. She added that the initiative is trying to provide guidelines to help mitigate risks and “support users in making informed decisions and using AI responsibly.”
