Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

4 months ago

Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools. Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble. Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was in line with its broader beliefs about the value of creating “pools of accessible data” for AI startups to use that are “managed in the public’s interest.” In other words, Microsoft isn’t necessarily planning to swap out all of the AI training data it has used in its own models with public-domain alternatives like the books in the new Harvard database.

Books Data Public Ai Models Openai Microsoft Harvard

In charts: How AI companies' data hunt is sparking copyright wars

4 months ago

India

New York Times accuses OpenAI of deleting crucial evidence in copyright lawsuit

4 months, 3 weeks ago

Training

Everything On The Internet Can Be Used For Free To Train AI Models: Microsoft's AI Division CEO

9 months, 1 week ago

Google

Congress Wants Tech Companies to Pay Up for AI Training Data

Fair Use

Meta used copyrighted books for AI training despite its own lawyers' warnings, authors allege

1 year, 4 months ago

Legal

Researchers warn we could run out of data to train AI by 2026.

1 year, 5 months ago

Data

AI is causing panic for authors. Now the courts are involved

1 year, 5 months ago

Training

Books by J.K. Rowling, Amitav Ghosh part of 183,000-book dataset used for AI training: Report

1 year, 6 months ago

Books

OpenAI defends alleged use of novels in training data sets for “innovation”

1 year, 7 months ago

Copyright

Sarah Silverman and novelists sue ChatGPT-maker OpenAI for ingesting their books

1 year, 9 months ago

The Independent

Books

Google faces class-action lawsuit over unauthorised data scraping for AI: Know more

1 year, 9 months ago

Google

Bestselling authors Mona Awad and Paul Tremblay sue OpenAI over copyright infringement

1 year, 9 months ago

Books

Lawsuit says OpenAI violated U.S. authors' copyrights to train AI chatbot

1 year, 9 months ago

Openai Violated

OpenAI used YouTube data to train some of its models: Report

1 year, 9 months ago

Train

Stack Overflow Will Charge AI Giants for Training Data

Ai

Discover Related

AI tracker: Three cases of AI ethics that gave us food for thought this week

1 week ago

Of course we have an AI-written paper that denies climate change. Climate change deniers are pushing an AI-generated paper questioning human-induced warming, leading experts to warn against the rise of …

Research Paper Ai Models

Manhattan court consolidates OpenAI copyright lawsuits from authors, NYT

Trending 1 week, 2 days ago

A US judicial panel decided on Thursday to consolidate in New York several copyright cases brought by prominent authors and news outlets against OpenAI and its largest backer Microsoft. …

Copyright Cases Microsoft New York

OpenAI gives free ChatGPT Plus to college students, but conditions apply

Trending 1 week, 2 days ago

OpenAI has announced free access to ChatGPT Plus for college students as part of its efforts to support education with AI. According to OpenAI, the initiative aims to support students …

Education Students Ai Free

OpenAI Secures Record $40-Billion Funding Round With SoftBank. Here's What ChatGPT-Maker Plans To Spend It On

1 week, 5 days ago

OpenAI has successfully secured $40 billion in a monumental funding round, making it the largest-ever capital raise for a startup, as reported by Agence France-Presse. …

ABP News

Billion Ai Meta Openai

OpenAI to release new open language model in coming months

1 week, 5 days ago

Just a few days after launching its new image generator, OpenAI has announced that it is planning to release a new “first open language model since GPT-2 in the …

Language Model Open Ai

Sam Altman Says OpenAI Will Release an ‘Open Weight’ AI Model This Summer

Sam Altman today revealed that OpenAI will release an open-weight artificial intelligence model in the coming months. Shortly after DeepSeek’s model was released in January, Altman said that OpenAI was …

Wired

Weight Company Models Meta

Data for training stored overseas, copyright law doesn’t apply: OpenAI

2 weeks, 1 day ago

American company OpenAI on Friday denied allegations that its ChatGPT software reproduces verbatim content from news agency ANI, asserting before the Delhi High Court that its model is designed specifically …

Hindustan Times

India Training Copyright Court

Authors outraged to discover Meta used their pirated work to train its AI systems

2 weeks, 2 days ago

Some of Australia's best-known authors are furious to discover Meta has used their work to develop its AI platform. …

ABC

Authors Books Copyright Fair Use

Newspaper copyright lawsuit against OpenAI to proceed

2 weeks, 3 days ago

A federal judge has ruled that The New York Times and other newspapers can proceed with a copyright lawsuit against OpenAI and Microsoft seeking to end the practice of using …

Microsoft New York Stein Openai

Beijing AI academy slams inclusion on US Entity List

2 weeks, 4 days ago

Beijing Academy of Artificial Intelligence, a Chinese non-profit AI institution, strongly condemned its inclusion on the US Entity List, calling the decision unjustified and urging Washington to reverse the move. …

Ai Baai

Judge allows 'New York Times' copyright case against OpenAI to go forward

2 weeks, 4 days ago

A federal judge on Wednesday rejected OpenAI's request to toss out a copyright lawsuit …

NPR

Legal Copyright Fair Use Judge

Author Richard Osman says he will 'take on Meta' to fight against AI copyright infringement

2 weeks, 5 days ago

Bestselling author Richard Osman says he will 'take on Meta' to fight against AI copyright infringement. The Atlantic magazine, which published the searchable database, says court documents show that Meta …

Copyright Ai Meta Osman

A Case For Harmonizing Gen AI And The Copyright Regime

2 weeks, 6 days ago

Globally, Generative AI developers are facing lawsuits from publishers, and India is no exception. Therefore, legal and policy questions in these matters will likely shift from 'AI-generated content being infringing' …

Live Law

India Copyright Information Indian

Trump's call for AI deregulation gets strong backing from Big Tech

3 weeks, 2 days ago

Major tech firms are pushing the administration of President Donald Trump to loosen rules on building artificial intelligence, arguing it is the only way to maintain a US edge and …

Europe China Tech Trump

Ben Stiller, Mark Ruffalo Among 400 Hollywood Celebs Leading AI Copyright Battle Against OpenAI, Google

3 weeks, 2 days ago

A coalition of over 400 Hollywood figures, including Ben Stiller, Mark Ruffalo, and Paul McCartney, has formally voiced opposition to efforts by Google and OpenAI to loosen copyright protections for …

ABP News

Copyright Google Hollywood Ai

OpenAI’s Deep Research Agent Is Coming for White-Collar Work

3 weeks, 3 days ago

Isa Fulford, a researcher at OpenAI, had a hunch that Deep Research would be a hit even before it was released. Congrats to the folks behind it.” “Deep Research is …

Wired

Research Report Ai Deep

Cate Blanchett, Mark Ruffalo, Paul McCartney and 400 Hollywood stars urge Donald Trump to defend copyrights from AI

3 weeks, 5 days ago

More than 400 prominent figures in the entertainment industry—including former Beatle Paul McCartney, Ava DuVernay, Taika Waititi, Cate Blanchett, Alfonso Cuarón, and Ben Stiller—have urged the Trump administration to resist …

Hindustan Times

Trump Donald Trump Ai Mccartney

Hollywood creatives urge government to defend copyright laws against AI

3 weeks, 5 days ago

“Wicked” star Cynthia Erivo signed a letter urging the U.S. government to uphold copyright protections with regard to AI. In its letter to the White House, OpenAI’s chief global affairs …

LA Times

Copyright Google Companies Hollywood

French publishers accuse Meta of AI copyright violations, file lawsuit

1 month ago

French publishers and authors said Wednesday they’re taking Meta to court, accusing the social media company of using their works without permission to train its artificial intelligence model. Three trade …

Copyright Ai Meta

Democrats Demand Answers on DOGE's Use of AI

Democrats on the House Oversight Committee fired off two dozen requests Wednesday morning pressing federal agency leaders for information about plans to install AI software throughout federal agencies amid the …

Wired

Data Information Software Musk

French publishers and authors sue Meta over copyright works used in AI training

1 month ago

French publishers and authors said Wednesday they’re taking Meta to court, accusing the social media company of using their works without permission to train its artificial intelligence model. Three trade …

Associated Press

Copyright Train Ai Meta

‘Infringement does occur’: ANI on OpenAI using its data for training

1 month ago

News agency ANI pushed back against artificial intelligence company OpenAI during the latest hearing on Monday in the copyright infringement lawsuit against the latter, saying its data has already been …

Hindustan Times

Training Copyright Court Data

Guardrails needed for AI growth

1 month ago

Students learn about AI applications at a free course provided by the government in Huzhou, Zhejiang province, last month. "For example, it will be suitable to interpret some current laws, …

China Development Data Protection Legislation

DOGE Access To 'Sensitive Databases' Sparks Alarm

1 month, 2 weeks ago

Sen. Elizabeth Warren and other lawmakers pressed the Education Department on Wednesday to clarify how much access Elon Musk’s team had been granted to student loan borrowers’ …

Education Access Musk Democrats

India’s AI investment, the budgetary push, the need for regulations

1 month, 2 weeks ago

The allocation of ₹2,000 crore to the IndiaAI mission in the Union Budget 2025-26 demonstrates India’s commitment to developing its artificial intelligence infrastructure. India’s focus and what needs to be …

Technology India Data Legislation

UK creative industries launch campaign against AI tech firms’ content use

Trending 1 month, 2 weeks ago

The Independent

Campaign Creative Content Fair

AI nonprofit CEO says ‘closed nature’ of most artificial intelligence research hinders innovation

1 month, 2 weeks ago

A year before Elon Musk helped start OpenAI in San Francisco, philanthropist and Microsoft co-founder Paul Allen already had established his own nonprofit artificial intelligence research laboratory in Seattle. More …

Associated Press

Data Model Open Able

Consumers warned over AI courses

1 month, 2 weeks ago

While the rapid rise of China's homegrown artificial intelligence reasoning model DeepSeek has sparked a host of enterprises seeking to cash in by offering training courses on how to use …

Deepseek Paid Courses

DeepSeek to share some AI model code, doubling down on open source

1 month, 3 weeks ago

Chinese startup DeepSeek will make its models' code publicly available, it said on Friday, doubling down on its commitment to open-source artificial intelligence. The company said in a post on …

Model Chinese Open Ai

HC experts differ in OpenAI copyright case

1 month, 3 weeks ago

Experts appointed by the Delhi high court in the copyright infringement case between news agency ANI and OpenAI on Friday presented differing views on whether the artificial intelligence company’s use …

Hindustan Times

Delhi Jurisdiction Copyright Court

Perplexity AI launches Deep Research tool for free

1 month, 3 weeks ago

AI startup Perplexity has released its own AI research tool called Deep Research to help users get in-depth answers with citations for more complex queries. The blog claimed that Perplexity’s …

Research Ai Deep Deep Research

US firms adopt DeepSeek despite scrutiny

1 month, 3 weeks ago

Despite mounting regulatory scrutiny, Chinese company DeepSeek's artificial intelligence breakthrough is seeing widespread …

Security Model Ai Openai

The danger of relying on OpenAI’s Deep Research

1 month, 4 weeks ago

In early February OpenAI, the world’s most famous artificial-intelligence firm, released Deep Research, which is “designed to perform in-depth, multi-step research”. “Asking OpenAI’s Deep Research about topics I am writing …

Research Data Model Deep

OpenAI claims it does not ‘steal’ content from Indian media groups to train ChatGPT, shows court filing

2 months ago

In its 31-page court filing, dated February 11, OpenAI firmly denied that it had used content from these Indian media groups for training its AI models. The company also argued …

Media Legal Indian Train

Google’s ex-CEO Eric Schmidt urges open source AI models as DeepSeek looms

2 months ago

Former Google CEO Eric Schmidt called for the development of open-source AI models, as U.S.-based AI firms reassess their model development plans following the entry of China’s DeepSeek R1, which …

China Google Schmidt Open

OpenAI says it does not use Indian media groups' content to train ChatGPT, court filing shows

2 months ago

OpenAI is seeking to stop Indian media groups, including those of Gautam Adani and Mukesh Ambani, from joining a copyright lawsuit against the U.S. company, saying it does not use …

Media Indian Train Content

Thomson Reuters scores early win in AI copyright battles in the US

2 months ago

LOS ANGELES — Thomson Reuters has won an early battle in court over the question of fair use in artificial intelligence-related copyright cases. The media and technology company filed a …

Associated Press

Copyright Fair Use Thomson Ai

Hollywood writers say AI is ripping off their work. They want studios to sue

2 months ago

When the Writers Guild of America approved a contract with major studios in 2023, ending a 148-day strike, the union gained significant guardrails around artificial intelligence in Hollywood. John Rogers, …

LA Times

Fair Use Work Companies Writers

Thomson Reuters Wins First Major AI Copyright Case in the US

Thomson Reuters has won the first major AI copyright case in the United States. “The copying of our content was not ‘fair use.’” The generative AI boom has led to …

Wired

Legal Court Fair Use Companies

Proper data sharing essential for language models

2 months ago

The potential for artificial intelligence to improve lives has captured the attention of governments across the world. There are a range of challenges involved in doing this …

Benefits Policy Innovation Data

Elon Musk’s team at DOGE feeding highly sensitive data to Microsoft Azure's AI bots

2 months ago

Federal agencies typically have strict rules against using AI for handling sensitive data. However, Elon Musk’s DOGE is using Microsoft’s Azure cloud platform to analyse sensitive data that includes personal …

Security Data Microsoft Musk

2 months ago

OpenAI looks across US for sites to build its Trump-backed Stargate AI data centers

The Independent

Trump Data Centers Ai Openai

OpenAI, Microsoft and Google prove again, nothing’s consistent in the world of AI

2 months, 1 week ago

If this is where we are at barely a month into the year, I have absolutely no doubt the AI landscape will effortlessly evolve into a completely different sphere another …

Hindustan Times

India Tech Apple Microsoft

Human artists could disappear if copyright not protected from AI, MPs told

2 months, 1 week ago

The Independent

Music Copyright Artists Think

OpenAI discussing localization of ChatGPT India data

2 months, 1 week ago

Generative AI pioneer OpenAI has begun discussions to house data of its Indian users within the country, three executives who attended Wednesday's New Delhi meeting with the company's chief executive …

India Data Indian Ai

OpenAI says DeepSeek ‘inappropriately’ copied ChatGPT – but it’s facing copyright claims too

2 months, 1 week ago

Melbourne, Feb 5: Until a few weeks ago, few people in the Western world had heard of a small Chinese artificial intelligence company known as DeepSeek. DeepSeek’s new offering …

Model Chinese Ai Models

Google drops pledge not to use AI for weapons, surveillance

2 months, 1 week ago

Tech giant says in updated ethics policy that it will use AI in line with ‘international law and human rights’. Google’s revised policy announced on Tuesday states that the company …

Human Rights Tech Google Ai

Download DeepSeek, go to jail for 20 yrs: US Senator Hawley proposes prison time for using Chinese AI

2 months, 1 week ago

Senator Hawley bill proposes that American companies be barred from researching AI in China or collaborating with Chinese firms. It would also prevent US companies from investing in Chinese AI …

China Research Companies Download

OpenAI launches New AI Tool 'Deep Search'

Trending 2 months, 1 week ago

Artificial intelligence company OpenAI launched a new AI tool called Deep Research, which it said conducts multi-step research on the internet for complex tasks. “Deep Research is OpenAI’s next agent …

Deccan Chronicle

Research Search Ai Deep