DeepMind's latest AI generates soundtracks and dialogues for videos

DeepMind, Google's AI research lab, says it’s developing AI technology to generate soundtracks for videos.

In a post on its official blog, DeepMind says it sees V2A (short for "video-to-audio") technology as a vital piece of the AI-generated media puzzle. While many organizations, including DeepMind, have developed AI models for video generation, these models can't create sound effects synchronized with the videos they produce.

"Video generation models are advancing at an incredible pace, but many current systems can only generate silent output," DeepMind writes. "V2A technology [could] become a promising approach for bringing generated movies to life."

DeepMind's V2A technology combines a description of a soundtrack (e.g. "pulsating jellyfish underwater, marine life, ocean") with a video to create music, sound effects and even dialogue matching the video's characters and tone, watermarked with DeepMind's deepfake-fighting SynthID technology. The AI model powering V2A, a diffusion model, was trained on a combination of sounds and dialogue transcripts as well as video clips, DeepMind says.

"By training on video, audio and the additional annotations, our technology learns to associate specific audio events with various visual scenes, while responding to the information provided in the annotations or transcripts," DeepMind writes.

It isn't clear whether any of the training data was copyrighted, or whether the data's creators were informed of DeepMind's work. We've reached out to DeepMind for clarification and will update this post if we hear back.

AI-powered sound-generating tools are nothing new. Startup Stability AI released one just last week, and ElevenLabs launched one in May. Nor are models for creating video sound effects. A Microsoft project can generate talking and singing videos from a still image, and platforms like Pika and GenreX have trained models to take a video and guess what music or effects are appropriate for a given scene.

However, DeepMind claims that its V2A technology is unique in that it can understand the raw pixels of a video and automatically sync generated sounds to it, optionally without a description.

V2A isn't perfect, and DeepMind acknowledges this. Because the underlying model wasn't trained on many videos with artifacts or distortions, it doesn't produce particularly high-quality audio for them. And in general, the generated audio is unconvincing; my colleague Natasha Lomas described it as "a hodgepodge of stereotypical sounds," and I can't say I disagree.

For these reasons, and to prevent misuse, DeepMind says it won't make the technology available to the general public anytime soon, if at all.

"To make sure our V2A technology can have a positive impact on the creative community, we're gathering diverse perspectives and insights from leading creators and filmmakers, and using this valuable feedback to inform our ongoing research and development," DeepMind writes. "Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing."

DeepMind touts its V2A technology as a particularly useful tool for archivists and people working with historical footage. But generative AI along these lines also threatens to turn the film and TV industry on its head. It will take some seriously strong labor protections to ensure that generative media tools don't destroy jobs, or, as the case may be, entire professions.
