
Combining Next-Token Prediction and Video Diffusion in Computer Vision and Robotics

In the current AI zeitgeist, sequence models have become increasingly popular because of their ability to analyze data and predict what to do next. For example, you've probably used next-token prediction models like ChatGPT, which predict each word (token) in a sequence to form responses to user queries. There are also full-sequence diffusion models like Sora, which turn words into dazzling, realistic visuals by successively "denoising" an entire video sequence.

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have proposed a straightforward change to the diffusion training scheme that makes this sequence denoising significantly more flexible.

When applied to fields such as computer vision and robotics, next-token and full-sequence diffusion models come with performance tradeoffs. Next-token models can produce sequences of varying lengths. However, they generate these outputs without awareness of desirable states in the far future (such as steering their sequence generation toward a specific goal 10 tokens away) and therefore require additional mechanisms for long-horizon planning. Diffusion models can perform such future-conditioned sampling, but they lack the ability of next-token models to generate sequences of variable length.

Researchers at CSAIL wanted to combine the strengths of both models, so they developed a sequence model training technique called "diffusion forcing." The name comes from "teacher forcing," the conventional training scheme that breaks full-sequence generation into the smaller, simpler steps of next-token generation (much like a good teacher simplifying a complex concept).

Diffusion forcing found common ground between diffusion models and teacher forcing: both use training schemes in which masked (noisy) tokens are predicted from unmasked ones. In the case of diffusion models, noise is gradually added to the data, which can be viewed as fractional masking. The MIT researchers' diffusion forcing method trains neural networks to cleanse a collection of tokens, removing a different amount of noise from each one while simultaneously predicting the next few tokens. The result: a flexible, reliable sequence model that produced higher-quality synthetic videos and more precise decision-making for robots and AI agents.
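To make the idea concrete, here is a minimal, hypothetical sketch of a diffusion-forcing-style training step in PyTorch. The model architecture, noise schedule, and tensor shapes below are illustrative assumptions rather than the researchers' actual implementation; the point is simply that each token in a sequence is corrupted with its own independently sampled noise level, and the network learns to denoise them all jointly.

```python
# Illustrative sketch only: per-token noise levels in the spirit of diffusion forcing.
# The architecture, noise schedule, and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn

T, B, D = 16, 8, 32      # sequence length, batch size, token dimension
K = 100                  # number of discrete noise levels

class TinyDenoiser(nn.Module):
    """A stand-in causal sequence model that predicts clean tokens."""
    def __init__(self):
        super().__init__()
        self.level_emb = nn.Embedding(K, D)   # embed each token's noise level
        self.rnn = nn.GRU(D, D)               # takes (T, B, D), returns (T, B, D)
        self.out = nn.Linear(D, D)

    def forward(self, noisy_seq, levels):
        h, _ = self.rnn(noisy_seq + self.level_emb(levels))
        return self.out(h)

model = TinyDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.randn(T, B, D)             # a batch of clean token sequences
levels = torch.randint(0, K, (T, B))     # key idea: independent noise level per token
alpha = 1.0 - levels.float().unsqueeze(-1) / K        # toy noise schedule (assumption)
noisy = alpha.sqrt() * clean + (1 - alpha).sqrt() * torch.randn_like(clean)

pred = model(noisy, levels)              # denoise every token at once,
loss = ((pred - clean) ** 2).mean()      # each from its own corruption level
opt.zero_grad()
loss.backward()
opt.step()
```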

By sorting through noisy data and reliably predicting the next steps of a task, diffusion forcing can help a robot ignore visual distractions while performing manipulation tasks. It can also produce stable and consistent video sequences and even guide an AI agent through digital mazes. This method could potentially enable household and factory robots to take on new tasks and improve AI-generated entertainment.

“Sequence models aim to condition on the known past and predict the unknown future, a type of binary masking. However, masking doesn’t need to be binary,” says lead author Boyuan Chen, an MIT PhD student in electrical engineering and computer science (EECS) and CSAIL member. “With diffusion forcing, we add a different level of noise to each token, effectively serving as a type of fractional masking. At test time, our system can ‘unmask’ a collection of tokens and diffuse a sequence in the near future at a lower noise level. It knows what to trust within its data to overcome out-of-distribution inputs.”
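As a rough illustration of that fractional masking at sampling time, one could assign each future token a noise level that grows with its distance from the present, so the near future is trusted more than the far future. The linear ramp below is purely an assumption for illustration, not the schedule used in the paper.

```python
# Illustrative only: a monotone noise schedule over a planning horizon,
# so near-future tokens carry less noise than far-future ones.
import numpy as np

def horizon_noise_levels(horizon: int, max_level: int = 100) -> np.ndarray:
    """Assign an increasing noise level to each future token (toy linear ramp)."""
    return np.linspace(0, max_level, num=horizon).astype(int)

print(horizon_noise_levels(8))  # -> [  0  14  28  42  57  71  85 100]
```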

In several experiments, diffusion forcing managed to disregard misleading data to perform tasks while anticipating future actions.

For example, when implemented on a robotic arm, it helped the arm swap two toy fruits across three circular mats, a minimal example of a family of long-horizon tasks that require memory. The researchers trained the robot by controlling (or teleoperating) it remotely in virtual reality, teaching it to mimic the user's movements from its camera. Even though it started from random positions and saw distractions like a grocery bag blocking the markers, it placed the objects at their target spots.

To generate videos, they trained diffusion forcing on "Minecraft" gameplay and colorful digital environments created in Google DeepMind Lab Simulator. Given a single frame of video, the method produced more stable, higher-resolution videos than comparable baselines, such as a Sora-like full-sequence diffusion model and ChatGPT-like next-token models. Those approaches produced videos that appeared inconsistent, with the latter sometimes failing to generate working video beyond just 72 frames.

In addition to generating fancy videos, diffusion forcing can serve as a motion planner that steers toward desired outcomes or rewards. Thanks to its flexibility, diffusion forcing can uniquely generate plans with varying horizons, perform tree search, and incorporate the intuition that the distant future is more uncertain than the near future. In solving a 2D maze, diffusion forcing outperformed six baselines by generating faster plans leading to the goal location, suggesting it could be an effective planner for robots in the future.

In each demo, diffusion forcing acted as a full-sequence model, a next-token prediction model, or both. According to Chen, this versatile approach could potentially serve as a powerful backbone for a "world model," an AI system that can simulate the dynamics of the world by training on billions of internet videos. This would allow robots to perform novel tasks by imagining what they need to do based on their surroundings. For example, if you asked a robot to open a door without it having been trained to do so, the model could produce a video showing the machine how to do it.

The team is currently looking to scale their method to larger datasets and the latest transformer models to improve performance. They intend to expand this work toward a ChatGPT-like robot brain that helps robots perform tasks in new environments without human intervention.

“With diffusion forcing, we are taking a step toward bringing video generation and robotics closer together,” says senior author Vincent Sitzmann, an assistant professor at MIT and a member of CSAIL, where he leads the Scene Representation group. “Ultimately, we hope that we can use all of the knowledge stored in videos on the internet to enable robots to help in everyday life. Many more exciting research challenges remain, such as how robots can learn to imitate humans by watching them, even when their own bodies are so different from ours!”

Chen and Sitzmann co-authored the paper with recent MIT visiting researcher Diego Martí Monsó and CSAIL members Yilun Du, an EECS graduate student; Max Simchowitz, a former postdoc and incoming assistant professor at Carnegie Mellon University; and Russ Tedrake, the Toyota Professor of EECS, Aeronautics and Astronautics, and Mechanical Engineering at MIT, vice president of robotics research at the Toyota Research Institute, and CSAIL member. Their work was supported, in part, by the U.S. National Science Foundation, the Singapore Defence Science and Technology Agency, Intelligence Advanced Research Projects Activity via the U.S. Department of the Interior, and the Amazon Science Hub. They will present their research at NeurIPS in December.
