Counsel for the plaintiffs in a copyright lawsuit filed against Meta allege that Meta CEO Mark Zuckerberg gave the team behind the company's Llama AI models the green light to use a dataset of pirated e-books and articles for training.
The case, Kadrey v. Meta, is one of many against tech giants developing AI that accuse the companies of training models on copyrighted works without permission. Defendants like Meta have largely claimed that they're protected by “fair use,” the U.S. legal doctrine that allows the use of copyrighted works to create something new, so long as the result is sufficiently transformative. Many creators reject this argument.
In newly unredacted documents filed late Wednesday with the U.S. District Court for the Northern District of California, the plaintiffs in Kadrey v. Meta, who include best-selling authors Sarah Silverman and Ta-Nehisi Coates, recount Meta's testimony from late last year, which revealed that Zuckerberg approved Meta's use of a dataset called LibGen for Llama-related training.
LibGen, which describes itself as a “link aggregator,” provides access to copyrighted works from publishers such as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education. LibGen has been sued multiple times, ordered to shut down, and fined tens of millions of dollars for copyright infringement.
According to Meta's testimony, as relayed by plaintiffs' counsel, Zuckerberg approved the use of LibGen to train at least one of Meta's Llama models, despite concerns within Meta's AI leadership team and among others at the company. The filing quotes Meta employees calling LibGen a “dataset that we know is pirated” and warning that its use “could undermine [Meta's] negotiating position with regulators.”
The filing also cites a memo to Meta AI decision-makers stating that Meta's AI team “received approval to use LibGen” following “escalation to MZ.” (MZ here is fairly obvious shorthand for “Mark Zuckerberg.”)
The details appear to match New York Times reporting from last April, which suggested that Meta cut corners to gather data for its AI. According to the Times, Meta at one point hired contractors in Africa to compile book summaries and considered buying the publisher Simon & Schuster. However, company executives concluded that it would take too long to negotiate licenses and reasoned that fair use was a solid defense.
Wednesday's filing contains new allegations, such as that Meta may have tried to conceal its alleged infringement by stripping attribution from the LibGen data.
According to plaintiffs' counsel, Meta engineer Nikolay Bashlykov, who works on the Llama research team, wrote a script to remove copyright information, including the words “copyright” and “acknowledgments,” from e-books in LibGen. Separately, Meta allegedly removed copyright markers from scientific journal articles and “source metadata” in the training data it used for Llama.
“This discovery suggests that Meta is removing [copyright information] not just for training purposes,” the filing says, “but also to conceal its copyright infringement, because removing copyrighted works…prevents Llama from displaying copyright information that may alert Llama users and the general public to Meta's infringement.”
According to the latest filing, Meta also revealed during depositions that it had torrented LibGen, a move that gave some Meta research engineers pause. Torrenting, a way of distributing files across the internet, requires those downloading files to simultaneously “seed,” or upload, the files they are trying to obtain.
Plaintiffs' counsel assert that Meta effectively committed another form of copyright infringement by torrenting LibGen, thereby helping to distribute its contents. Meta also tried to hide its activities by minimizing the number of files it uploaded, counsel claim.
According to the filing, Meta's head of generative AI, Ahmad Al-Dahle, “cleared the path” for torrenting LibGen, brushing aside Bashlykov's reservations that this “may not be legal.”
“If Meta had purchased plaintiffs' works from a bookstore or borrowed them from a library and trained its Llama models on them without a license, it would have committed copyright infringement,” plaintiffs' counsel wrote in the filing. “Meta's decision to bypass legitimate methods of acquiring books and become a knowing participant in an illegal torrenting network…serves as evidence of copyright infringement.”
The case against Meta is far from decided. For now, it pertains only to Meta's earliest Llama models, not its most recent releases. And the court could well rule in Meta's favor if it is persuaded by the company's fair use argument.
But the allegations don't reflect well on Meta, as the judge presiding over the case, Judge Vince Chhabria, noted Wednesday in an order denying Meta's request to redact large portions of the filing.
“It is obvious that Meta's sealing motion is not intended to guard against disclosure of sensitive business information that competitors could use to their advantage,” Chhabria wrote. “Rather, it's about avoiding negative publicity.”
We've reached out to Meta for comment and will update this article if we hear back.