Meta CEO Mark Zuckerberg appears to have used YouTube's fight to remove pirated content to defend his own company's use of a dataset of copyrighted e-books, according to newly released excerpts of testimony he gave at the end of last year.
The testimony, which was part of a complaint filed with the court by plaintiffs' lawyers, relates to the AI copyright case Kadrey v. Meta. It's one of many such cases working their way through the U.S. court system, pitting AI companies against authors and other intellectual property owners. The defendants in these cases, the AI companies, largely claim that training on copyrighted content is “fair use.” Many copyright holders disagree.
“For example, I believe YouTube will end up hosting some things that people will pirate for a period of time, but YouTube is trying to remove those things,” Zuckerberg said during his testimony, according to parts of a transcript made available Wednesday evening. “And I assume that the vast majority of the stuff on YouTube is pretty good and they have the license to do it.”
Excerpts from Zuckerberg's testimony provide some clues to his thinking on copyrighted content and fair use. However, it should be noted that a full transcript of the testimony has not been released. TechCrunch has reached out to Meta for additional context and will update the article if the company responds.
Based on the excerpts, Zuckerberg appears to be defending Meta's use of a training dataset of e-books called LibGen to develop its family of AI models called Llama. Meta's Llama competes with flagship models from AI companies like OpenAI.
LibGen, which describes itself as a “link aggregator,” provides access to copyrighted works from publishers such as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.
According to court documents unsealed this week, Zuckerberg allegedly approved using LibGen to train at least one of Meta's Llama models, despite concerns from the company's AI executives and research teams about the legal implications.
Attorneys for the plaintiffs, who include best-selling authors Sarah Silverman and Ta-Nehisi Coates, quoted Meta employees as saying that LibGen is a “dataset that we know is pirated” and that its use “could undermine (Meta’s) negotiating position with regulators,” according to a legal filing.
During his testimony, Zuckerberg claimed he had “not really heard of” LibGen.
“I understand that you're trying to get me to give an opinion on LibGen, which I've never really heard of,” Zuckerberg said during the testimony. “It's just that I don’t know about this particular matter.”
When questioned by one of the plaintiffs' lawyers, David Boies, Zuckerberg explained why it would be inappropriate to ban the use of a dataset like LibGen.
“So do I need to have a policy against people using YouTube because some content may be copyrighted? No,” he said. “There are cases where such a blanket ban may not be the right thing to do.”
Zuckerberg did say that Meta should be “fairly careful” when training on copyrighted material.
“You know, if there's someone who's putting up a website and intentionally trying to violate people's rights…then obviously it's something that we would want to be careful about or handle carefully or maybe even discourage our teams from getting involved in it,” Zuckerberg said during his testimony, according to the transcript.
New allegations
Attorneys for the plaintiffs in Kadrey v. Meta have amended the lawsuit several times since it was filed in 2023 in the U.S. District Court for the Northern District of California, San Francisco Division. The latest amended complaint, filed late Wednesday by plaintiffs' counsel, includes new allegations against Meta, including that the company compared certain pirated copies of books in LibGen to copyrighted books that were available for licensing. The lawyers claim Meta used this tactic to determine whether it made sense to enter into a licensing agreement with a publisher.
According to the amended filing, Meta allegedly used LibGen to train its newest family of Llama models, Llama 3. The plaintiffs also allege that Meta is using the dataset to train its next-generation Llama 4 models.
According to the amended filing, Meta researchers allegedly attempted to conceal the fact that Llama models were trained on copyrighted materials by inserting “supervised samples” into the fine-tuning of Llama. Meta also downloaded pirated e-books from another source, Z-Library, for Llama training as recently as April 2024, the amended complaint says.
Z-Library, or Z-Lib, has been the subject of numerous legal actions by publishers, including domain seizures and takedowns. In 2022, the Russian nationals who allegedly maintained it were charged with copyright infringement, wire fraud and money laundering.