4 months ago

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that anyone can use to train large language models and other AI tools. Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries.

Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble.

Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was in line with its broader beliefs about the value of creating “pools of accessible data” for AI startups to use that are “managed in the public’s interest.” In other words, Microsoft isn’t necessarily planning to swap out all of the AI training data it has used in its own models with public-domain alternatives like the books in the new Harvard database.

Wired
