YouTuber files class action lawsuit over OpenAI scraping creators’ transcripts

August 6, 2024

A YouTube creator is seeking to bring a class action lawsuit against OpenAI, alleging that the company trained its generative AI models on hundreds of thousands of transcripts from YouTube videos without notifying or compensating the videos' owners.

In a complaint filed Friday in the U.S. District Court for the Northern District of California, lawyers for David Millette, a YouTube user from Massachusetts, allege that OpenAI secretly transcribed Millette's and other YouTubers' videos to train the models that power the company's AI chatbot platform, ChatGPT, and other generative AI tools and products. By collecting this data, OpenAI “substantially profited from the YouTubers' work,” the lawsuit says, while also violating copyright law and YouTube's terms of service, which prohibit using videos for applications independent of the YouTube service.

“As (OpenAI's) AI products become more sophisticated through the use of training datasets, they become more useful to prospective and current users who purchase subscriptions to access (OpenAI's) AI products,” the complaint states. “However, much of the material in OpenAI's training datasets comes from works copied by OpenAI without consent, without attribution, and without compensation.”

Millette, represented by the law firm Bursor & Fisher, is demanding a jury trial and more than $5 million in damages for all YouTube users and creators whose data may have been harvested for OpenAI's training.

Generative AI models like those from OpenAI don't have any real intelligence. Fed an enormous number of examples (e.g. movies, voice recordings, essays), the models “learn” patterns that indicate how likely certain data is to occur, taking into account the context of the surrounding data.

Most models are trained on data obtained from public websites and datasets across the web. Companies argue that fair use shields their efforts to indiscriminately collect data and use it to train commercial models. But many copyright holders disagree, and are filing lawsuits to stop the practice.

Video transcriptions have become an important part of training data as other data sources dry up.

More than 35% of the world's top 1,000 websites now block OpenAI's web crawler, according to data from Originality.AI. And around 25% of data from “high-quality” sources has been excluded from the major datasets used to train AI models, a study by MIT's Data Provenance Initiative found. If the current trend of access blocking continues, the research group Epoch AI predicts that developers will run out of data to train generative AI models between 2026 and 2032.

In April, The New York Times reported that OpenAI built its first speech recognition model, Whisper, to transcribe audio from videos and collect additional training data. An OpenAI team that included the company's president, Greg Brockman, transcribed more than 1,000,000 hours of YouTube video using Whisper, according to The Times, and used the transcripts to train OpenAI's text-generating and text-analyzing model, GPT-4.

According to The Times, some OpenAI employees discussed how such a move could violate YouTube's rules.

In July, Proof News reported that companies including Anthropic, Apple, Salesforce, and Nvidia used a dataset called “The Pile,” which contains captions from hundreds of thousands of YouTube videos, to train generative AI models. Many YouTube creators whose captions appeared in “The Pile” were unaware of this and didn't consent to it; Apple later released a statement saying it didn't intend to use these models to power AI features in its products.

Google, YouTube's parent company, has also tried using transcripts to train its models.

Last year, Google expanded its terms of service in part to allow the company to use more user data to train generative AI models. Under the old terms of service, it was unclear whether Google could use YouTube data to develop products beyond the video platform. Under the new terms, that's no longer the case; they loosen the reins considerably.

We have reached out to OpenAI and Google for comment on the class action lawsuit and will update this article if they respond.

The month has gotten off to a rough start for OpenAI.

Tesla and X CEO Elon Musk filed a new lawsuit against OpenAI and its CEO, Sam Altman, on Monday, accusing the company of abandoning its original nonprofit mission by reserving some of its most advanced technology for commercial customers. Musk made the same allegations in a February lawsuit against OpenAI, but the new suit claims OpenAI is also engaged in racketeering activity.
