OpenAI has unveiled Sora, a state-of-the-art text-to-video (TTV) model that generates realistic videos of up to 60 seconds from a user's text prompt.
We’ve seen big advancements in AI video generation recently. Last month we were excited when Google gave us a demo of Lumiere, its TTV model that generates 5-second video clips with excellent coherence and movement.
Just a few weeks later, the impressive demo videos generated by Sora already make Google’s Lumiere look quite quaint.
Sora generates high-fidelity video that can include multiple scenes with simulated camera panning while adhering closely to complex prompts. It can also generate images, extend existing videos, and generate a video from a still image used as a prompt.
Much of Sora’s impressive performance lies in things we take for granted when watching a video but that are difficult for AI to produce.
Here’s an example of a video Sora generated from the prompt: “A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colours.”
This short clip demonstrates a few key features of Sora that make it truly special.
- The prompt was fairly complex, and the generated video adhered to it closely.
- Sora maintains character coherence. Even when the character disappears from a frame and reappears, the character’s appearance stays consistent.
- Sora maintains object permanence. An object in a scene is retained in later frames while panning or during scene changes.
- The generated video reveals an accurate understanding of physics and changes to the environment. The lighting, shadows, and footprints in the salt pan are great examples of this.
Sora doesn’t just understand what the words in the prompt mean; it understands how those objects interact with one another in the physical world.
Here’s another great example of the impressive video Sora can generate.
The prompt for this video was: “A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colourful lights. Many pedestrians walk about.”
A step closer to AGI
We may be blown away by the videos, but it is this understanding of the physical world that OpenAI is especially excited about.
In the Sora blog post, the company said, “Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.”
Several researchers believe that embodied AI is necessary to achieve artificial general intelligence (AGI). Embedding AI in a robot that can sense and explore a physical environment is one way to achieve this, but it comes with a range of practical challenges.
Sora was trained on an enormous amount of video and image data, which OpenAI says is responsible for the emergent capabilities the model displays in simulating aspects of people, animals, and environments from the physical world.
OpenAI says that Sora wasn’t explicitly trained on the physics of 3D objects but that the emergent abilities are “purely phenomena of scale”.
This means that Sora could eventually be used to accurately simulate a digital world that an AI could interact with, without the need for it to be embodied in a physical device like a robot.
On a simpler scale, this is what Chinese researchers are trying to achieve with their AI toddler called Tong Tong.
For now, we’ll have to be satisfied with the demo videos OpenAI provided. Sora is only being made available to red teamers and a few visual artists, designers, and filmmakers to gather feedback and test the alignment of the model.
Once Sora is released publicly, might we see SAG-AFTRA movie industry workers dust off their picket signs?