Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft
Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools. Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries.

Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble.

Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was in line with its broader beliefs about the value of creating “pools of accessible data” for AI startups to use that are “managed in the public’s interest.” In other words, Microsoft isn’t necessarily planning to swap out all of the AI training data it has used in its own models with public-domain alternatives like the books in the new Harvard database.
