YouTube CEO Neal Mohan said OpenAI’s potential use of YouTube videos to train its text-to-video model Sora would violate the platform’s terms of service.
Mohan told Bloomberg that if Sora used content from YouTube, it would be a “clear violation” of its terms of service.
There will likely be no love lost between YouTube and OpenAI, with the two sitting on opposite sides of the Big Tech divide.
Sora is OpenAI’s new text-to-video model, which is still being tested. It signifies generative AI’s expansion across all media forms, starting with text, then images, and now audio and video.
Generative video and audio bring a new set of risks for AI companies to navigate, such as their models producing near-exact replicas of copyrighted material.
We’ve already witnessed this with text-to-audio model Suno, which produces audio very similar to famous songs like Queen’s “Bohemian Rhapsody” and ABBA’s “Dancing Queen.”
Neither OpenAI nor most other AI companies have been notably transparent about their reliance on vast amounts of internet-sourced data, including copyrighted material, to train models.
OpenAI acknowledged the challenges of avoiding copyrighted data in its development processes, stating in a submission to the British House of Lords that it was “impossible” to build the technology without it.
That was something of a Freudian slip, exposing an inconvenient truth for the generative AI industry. However, infringement has not yet been proven in a court of law, reflecting how copyright law in its current form was simply not designed for this era.
On the subject of Sora specifically, OpenAI CTO Mira Murati, in an interview with The Wall Street Journal, expressed uncertainty about the specific types of content used to train Sora, including whether any YouTube content was involved.
Murati said, “I’m actually unsure about that,” when asked about the content sources for Sora’s training, adding that any data used was either “publicly available or licensed.”
It’s not a glowing record of transparency for OpenAI as it prepares to release its groundbreaking new model, one it is already using to pitch for business in Hollywood for its potential applications in film and TV.
Sora has already led producer Tyler Perry to pause an $800 million studio expansion, hinting at potentially massive upheaval ahead for the creative industries.
YouTube’s CEO speaks about Sora
Mohan showed his awareness of the ongoing discussions about AI training practices. He hinted at OpenAI’s need to clarify its use of YouTube data.
He told Bloomberg, “From a creator’s perspective, when a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service is going to be abided by. It doesn’t allow for things like transcripts or video bits to be downloaded, and that may be a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform.”
YouTube’s terms of service explicitly “prohibit unauthorized scraping or downloading of YouTube content,” a policy confirmed by a spokesperson for YouTube in light of Mohan’s comments.
Alphabet, YouTube’s parent company, is busy developing its own AI tools. If OpenAI directly or indirectly used YouTube videos to train Sora, then we can expect backlash.
The AI data gold rush has led to strategic partnerships and licensing agreements between tech companies and content providers. Numerous lawsuits are still in progress in the domains of text and image generation, but these remain largely inconclusive.
First, even when AI models expose themselves by producing copyrighted work, their black-box nature makes it nigh-impossible to determine where this data was retrieved from and when exactly the infringement occurred.
Second, the generated audio, image, or video might look like strong evidence of infringement, but that’s not as clear-cut as you or me copying a picture of Mickey Mouse and selling it for hundreds of thousands without permission.
In response to legal pressures, AI companies are beginning to strike deals for data.
For instance, Reddit’s $60 million per year licensing deal with Google for training AI tools exemplifies the formal arrangements emerging in the industry.
Similarly, media organizations such as The Associated Press and Axel Springer have entered into agreements allowing their content to be used for AI training, with provisions for attribution in AI-generated responses.
This brings its own challenges. Generative AI is expensive to build and run, and now AI companies must pay for the data rather than simply lift it from the web.