No one really knows yet what generative video models are useful for, but that hasn't stopped companies like Runway, OpenAI, and Meta from pouring millions into their development. Meta's latest is called Movie Gen, and true to its name, it turns text prompts into relatively realistic video with sound… but thankfully still no voice. And wisely, they aren't giving this one a public release.
Movie Gen is actually a collection (or “cast,” as they put it) of foundation models, the largest of which is the text-to-video component. Meta claims it outperforms the likes of Runway's Gen3, LumaLabs' latest, and Kling1.5, though as always, comparisons like this show more that they're all playing the same game than that Movie Gen wins outright. The technical details can be found in the paper Meta published describing all the components.
The audio is generated to match the content of the video, for example by adding engine noises that correspond to the car's movements, the rush of a waterfall in the background, or a crack of thunder halfway through the video when it's called for. It will even add music if that seems relevant.
It was trained on “a combination of licensed and publicly available datasets,” which Meta characterized as “proprietary/commercially sensitive” and declined to give further details about. We can only guess that means a lot of Instagram and Facebook videos, plus some partner content and plenty of other material that is inadequately protected from scrapers – aka “publicly available.”
However, what Meta is clearly aiming for here is not simply to hold the “state of the art” crown for a month or two, but a practical, soup-to-nuts approach in which a solid final product can be produced from a very simple, natural-language prompt. Something like, “Imagine me as a baker making a shiny hippopotamus cake in a thunderstorm.”
For instance, one sticking point with these video generators has been how difficult they usually are to edit. If you ask for a video of someone walking across the street, then realize you want them walking right to left instead of left to right, there's a good chance the whole shot will look different if you repeat the prompt with that extra instruction. Meta adds a simple, text-based editing method where you can just say, “Change the background to a busy intersection” or “Change her clothes to a red dress,” and it will attempt to make that change – and only that change.
Camera movements are also generally understood, with things like “tracking shot” and “pan left” taken into account when generating the video. This is still pretty clumsy compared with real camera controls, but far better than nothing.
The model's limitations are a little odd. It produces video 768 pixels wide, a dimension most will know from the famous-but-outdated 1024×768, but which is also three times 256, making it play well with other HD formats. The Movie Gen system upscales this to 1080p, which is the source of the claim that it produces that resolution. Not really true, but we'll give it a pass because upscaling is surprisingly effective.
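For the curious, here's the arithmetic behind that claim as a quick sketch; the 16:9 landscape frame is an assumption on my part, since the model supports multiple aspect ratios:

```python
# Back-of-the-envelope resolution math (the 16:9 frame is an assumption;
# the model reportedly supports multiple aspect ratios).
native_width = 768                # what the model actually generates
native_height = 768 * 9 // 16     # 432, assuming a 16:9 frame

target_width, target_height = 1920, 1080  # "1080p" after upscaling

print(native_width / 256)             # 3.0 -- three times 256
print(target_width / native_width)    # 2.5 -- the upscaler's linear scale factor
print(target_height / native_height)  # 2.5 -- same factor vertically
```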
Weirdly, it produces videos up to 16 seconds long… at 16 frames per second, a frame rate no one in history has ever wanted or asked for. You can, however, also do 10 seconds of video at 24 FPS. Lead with that one!
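Those two modes look less arbitrary if you assume the model generates a roughly fixed budget of frames, with the duration/frame-rate trade-off falling out of that. That framing is my inference from the numbers, not anything Meta states:

```python
# The two advertised modes land on nearly the same total frame count,
# which suggests a fixed frame budget rather than a fixed duration
# (an inference from the numbers, not an official statement):
frames_16fps = 16 * 16   # 16 seconds at 16 fps -> 256 frames
frames_24fps = 10 * 24   # 10 seconds at 24 fps -> 240 frames
print(frames_16fps, frames_24fps)  # 256 240
```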
As for why there's no voice… well, there are probably two reasons. First, it's super hard. Generating speech is easy now, but matching it to lip movements, and those lips to face movements, is a far more complicated proposition. I can't blame them for leaving this one until later, since it would be a failure case from minute one. Someone could say, “Make a clown deliver the Gettysburg Address while riding a tiny bike in circles,” and it would be nightmare fuel primed to go viral.
The second reason is likely political: putting out what amounts to a deepfake generator a month before a major election is… not the best look. A practical preventive step is to crimp its capabilities a bit, so that would-be malicious actors would have to do some real work to misuse it. You could certainly combine this generative model with a speech generator and an open lip-syncing model, but you can't just have it generate a candidate making wild claims.
“Movie Gen is currently a pure AI research concept, and even at this early stage, safety is a top priority, as it has been with all of our generative AI technologies,” a Meta representative said in response to TechCrunch's questions.
Unlike, say, the Llama large language models, Movie Gen won't be made publicly available. You can replicate its techniques to some extent by following the research paper, but the code won't be published, except for the “underlying evaluation prompt dataset,” i.e., the record of the prompts used to generate the test videos.