A class-action lawsuit has hit Anthropic, alleging that the company used authors' books to train its AI chatbot, Claude. In the suit, the authors claim that Anthropic relied on a sprawling open-source dataset known as "The Pile" to train its Claude models. The complaint was filed on Monday by writers and journalists including Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson.
The Pile is an 886.03 GB open-source dataset of English text, built by EleutherAI in 2020 as a training dataset for large language models (LLMs) and released at the end of that December. It comprises 22 smaller datasets, including 14 new ones.
One of those constituent datasets is Books3, a massive library of pirated ebooks that includes works by Stephen King, Michael Pollan, and others. Anthropic has confirmed it used The Pile to train Claude. "It is apparent that Anthropic downloaded and reproduced copies of The Pile and Books3, knowing that these datasets were comprised of a trove of copyrighted content sourced from pirate websites like Bibliotik," the complaint states.
The authors are asking the court to certify the suit as a class action, to require Anthropic to pay damages, and to bar the company from using copyrighted material in the future. An Anthropic spokesperson said the company was aware of the lawsuit and was evaluating the complaint.
The lawsuit joins other high-stakes complaints filed by visual artists, news outlets, and record labels over the material used to train generative artificial intelligence systems. Though Books3 has been removed from the "official" version of The Pile, the original remains available online.
Other groups of authors have also accused OpenAI and Meta Platforms of using their work to train AI chatbots. A recent investigation found that companies including Anthropic and Apple trained AI models on subtitles scraped from YouTube videos and included in The Pile.
"In the process of building and operating AI models, Anthropic unlawfully copies and disseminates vast amounts of copyrighted works," the lawsuit stated. Many publications and media outlets are trying to protect their businesses from the spread of AI-generated content. "Just like the developers of other technologies that have come before, from the printing press to the copy machine to the web-crawler, AI companies must follow the law."