Inside Big Tech’s tussle over AI training data

In the frantic pursuit of AI training data, tech giants OpenAI, Google, and Meta have reportedly bypassed corporate policies, altered their rules, and discussed circumventing copyright law. 

A New York Times investigation reveals the lengths these corporations have gone to in harvesting online information to feed their data-hungry AI systems.

In late 2021, facing a shortage of reputable English-language text data, OpenAI researchers developed a speech recognition tool called Whisper to transcribe YouTube videos.

Despite internal discussions about potentially violating YouTube’s rules, which prohibit using its videos for “independent” applications, the NYT found that OpenAI ultimately transcribed over a million hours of YouTube content. Greg Brockman, OpenAI’s president, personally assisted in collecting the videos. The transcribed text was then fed into GPT-4.

OpenAI CEO Sam Altman acknowledged the finite nature of online data in a speech at a tech conference in May 2023: “That will run out,” he said, referring to the viable data on the web for training AI models.

The NYT journalists also found that Google transcribed YouTube videos to harvest text for its AI models, potentially infringing on video creators’ copyrights.

This comes days after YouTube’s CEO said such activity would violate the company’s terms of service and undermine creators. 

In June 2023, Google’s legal department requested changes to the company’s privacy policy, allowing publicly available content from Google Docs and other Google apps to be used for a wider range of AI products.

Meta, facing its own data shortage, has considered various options to accumulate more training data. 

Executives discussed paying for book licensing rights, buying the publishing house Simon & Schuster, and even harvesting copyrighted material from the web without permission, risking potential lawsuits. 

Meta’s lawyers argued that using data to train AI systems should fall under “fair use,” citing a 2015 court decision involving Google’s book scanning project.

Ethical concerns and the future of AI training data

The collective actions of these tech companies highlight the critical importance of online data to the booming AI industry.

These practices have raised concerns about copyright infringement and the fair compensation of creators. 

A filmmaker and writer, Justine Bateman, told the Copyright Office that AI models were taking content – including her writing and movies – without permission or payment.

“This is the largest theft in the United States, period,” she said in an interview.

In the visual arts, MidJourney and other image models have been shown to generate copyrighted content, like scenes from Marvel movies.

With some experts predicting that high-quality online data could be exhausted by 2026, companies are exploring alternative methods, such as generating synthetic data using AI models themselves. However, synthetic training data comes with its own risks and challenges and might adversely affect the quality of models.

Sy Damle, a lawyer representing Andreessen Horowitz, a Silicon Valley venture capital firm, also discussed the challenge: “The only practical way for these tools to exist is if they can be trained on massive amounts of data without having to license that data. The data needed is so massive that even collective licensing really can’t work.”

The NYT and OpenAI are locked in a bitter copyright lawsuit, with the Times seeking what would likely be millions in damages.

OpenAI hit back, accusing the Times of ‘hacking’ their models to retrieve examples of copyright infringement.

Undoubtedly, this inside investigation further paints Big Tech’s data heist as ethically and legally unacceptable.

With lawsuits mounting, the legal landscape surrounding the use of online data for AI training is extremely precarious.

