
Meta's new world model lets robots manipulate objects in environments they've never encountered before

While large language models (LLMs) have mastered text (and, to an extent, other modalities), they lack the physical "common sense" to operate in dynamic, real-world environments. This has limited the use of AI in areas such as manufacturing and logistics, where understanding cause and effect is critical.

Meta's latest model, V-JEPA 2, takes a step toward bridging this gap by learning a world model from video and physical interactions.

V-JEPA 2 can help create AI applications that require predicting outcomes and planning actions in unpredictable environments with many edge cases. This approach can offer a clear path toward more capable robots and advanced automation in physical environments.

How a "world model" learns to plan

People develop physical intuition early in life by observing their surroundings. When you see a ball thrown, you instinctively know its trajectory and can predict where it will land. V-JEPA 2 learns a similar "world model": an AI system's internal simulation of how the physical world works.

The model is built on three core capabilities that are essential for enterprise applications: understanding what is happening in a scene, predicting how the scene will change in response to an action, and planning a sequence of actions to achieve a specific goal. As Meta states in its blog, its "long-term vision is that world models will enable AI agents to plan and reason in the physical world."

The model's architecture, called the Video Joint Embedding Predictive Architecture (V-JEPA), consists of two key parts. An "encoder" watches a video clip and condenses it into a compact numerical summary, known as an embedding. This embedding captures the essential information about the objects and their relationships in the scene. A second component, the "predictor," takes that summary and imagines how the scene will evolve, generating a prediction of what the next summary will look like.

V-JEPA consists of an encoder and a predictor (source: Meta blog)
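For intuition, the encoder-predictor split can be sketched in a few lines of PyTorch. This is a minimal illustration only: the layer sizes, the mean-pooling, and the MLP stand-ins below are assumptions, whereas the released model uses transformer-based components.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Condenses a short video clip into a compact embedding (illustrative stand-in)."""
    def __init__(self, frame_dim: int = 1024, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim, 512),
            nn.GELU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) -- pre-extracted frame features for simplicity
        return self.net(frames).mean(dim=1)   # pooled clip embedding: (batch, embed_dim)

class Predictor(nn.Module):
    """Takes the current clip embedding and predicts the embedding of what comes next."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 512),
            nn.GELU(),
            nn.Linear(512, embed_dim),
        )

    def forward(self, clip_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(clip_embedding)       # predicted future embedding
```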

This architecture is the latest evolution of the JEPA framework, which was first applied to images with I-JEPA and now extends to video, showing a consistent approach to building world models.

In contrast to generative AI models that try to predict the exact color of every pixel in a future frame, a computation-intensive task, V-JEPA 2 works in an abstract space. It focuses on predicting high-level features of a scene, such as an object's position and trajectory, rather than its texture or background details, which makes it far more efficient than other, larger models at just 1.2 billion parameters.

This translates into lower compute costs and makes the model better suited for deployment in real-world settings.
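The efficiency argument shows up directly in the training objective, sketched loosely below: the loss compares embeddings, not pixels, so compute scales with the embedding size rather than the video resolution. The exact distance function and the `target_encoder` (typically an exponential-moving-average copy of the encoder in JEPA-style training) are assumptions here, not Meta's published recipe.

```python
def jepa_loss(encoder, target_encoder, predictor, context_clip, future_clip):
    """Hypothetical latent-space objective: no pixels are ever reconstructed."""
    context_emb = encoder(context_clip)           # what the model has seen
    with torch.no_grad():
        target_emb = target_encoder(future_clip)  # what actually happened next
    predicted_emb = predictor(context_emb)        # what the model expected to happen
    return torch.mean((predicted_emb - target_emb) ** 2)
```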

Learning from observation and action

V-JEPA 2 is trained in two stages. First, it builds its foundational understanding of physics through self-supervised learning, watching over one million hours of unlabeled web videos. By simply observing how objects move and interact, it develops a general world model without any human guidance.

In the second stage, this pre-trained model is fine-tuned on a small, specialized dataset. By processing just 62 hours of video showing a robot performing tasks, along with the corresponding control commands, V-JEPA 2 learns to connect specific actions with their physical outcomes. The result is a model that can plan and control actions in the real world.

V-JEPA 2's two-stage training pipeline (source: Meta)
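In code, the two stages might be organized roughly as follows. This is a sketch that reuses the hypothetical `jepa_loss` above; `action_predictor` stands in for the action-conditioned predictor added in the second stage, and the clip split is arbitrary.

```python
def pretrain_step(encoder, target_encoder, predictor, clip, optimizer):
    """Stage 1 (hypothetical step): self-supervised learning on an unlabeled web-video clip."""
    context, future = clip[:, :-4], clip[:, -4:]   # observe most of the clip, predict the tail
    loss = jepa_loss(encoder, target_encoder, predictor, context, future)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def finetune_step(encoder, target_encoder, action_predictor, clip, actions, optimizer):
    """Stage 2 (hypothetical step): condition the prediction on robot control commands."""
    context, future = clip[:, :-4], clip[:, -4:]
    with torch.no_grad():
        context_emb = encoder(context)             # the pretrained encoder can stay frozen
        target_emb = target_encoder(future)
    predicted_emb = action_predictor(context_emb, actions)  # prediction now depends on the actions
    loss = torch.mean((predicted_emb - target_emb) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```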

This two-stage training unlocks a critical capability for real-world automation: zero-shot robot planning. A robot powered by V-JEPA 2 can be deployed in a new environment and successfully manipulate objects it has never encountered before, without needing to be retrained for that specific setting.

This is a significant advance over previous models that required training data from the exact robot and environment in which they would operate. The model was trained on an open-source dataset and then successfully deployed on different robots in Meta's labs.

To perform a task such as picking up an object, the robot is given a goal image of the desired outcome. It then uses the V-JEPA 2 predictor to internally simulate a range of possible next moves. It scores each imagined action by how close it gets to the goal, executes the highest-rated action, and repeats the process until the task is complete.
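That loop amounts to model-predictive control in embedding space. The sketch below uses naive random action proposals as a stand-in for the more refined sampling-based optimization used in practice; the action dimension, candidate count, and distance metric are illustrative assumptions, and `action_predictor` is the hypothetical action-conditioned predictor from the earlier sketch.

```python
ACTION_DIM = 7   # e.g. a 7-DoF arm; illustrative only

def plan_next_action(encoder, action_predictor, current_frames, goal_frames, num_candidates=64):
    """Imagine candidate actions in embedding space and pick the one closest to the goal."""
    with torch.no_grad():
        current_emb = encoder(current_frames)                 # where the scene is now
        goal_emb = encoder(goal_frames)                       # where we want it to be
        candidates = torch.randn(num_candidates, ACTION_DIM)  # naive random action proposals
        imagined = action_predictor(
            current_emb.expand(num_candidates, -1), candidates
        )                                                     # predicted outcome embeddings
        scores = torch.norm(imagined - goal_emb, dim=-1)      # smaller = closer to the goal
    return candidates[scores.argmin()]                        # execute, observe, replan
```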

Using this method, the model achieved success rates between 65% and 80% on pick-and-place tasks with unfamiliar objects in new settings.

Real-world impact of physical reasoning

This ability to plan and act in new situations has a direct impact on business operations. In logistics and manufacturing, it enables more adaptable robots that can handle variations in products and warehouse layouts without extensive reprogramming. This could be especially useful as companies explore the use of humanoid robots in factories and on assembly lines.

The same world model can power highly realistic digital twins, allowing companies to simulate new processes or train other AI systems in a physically accurate virtual environment. In industrial settings, a model could monitor video feeds from machinery and, based on its learned understanding of physics, predict safety issues and failures before they occur.

This research is a key step toward what Meta calls "advanced machine intelligence (AMI)," in which AI systems can "learn about the world as humans do, plan how to execute unfamiliar tasks, and efficiently adapt to the ever-changing world around us."

Meta has released the model and its training code and hopes to "build a broad community around this research, driving progress toward our ultimate goal of developing world models that can transform the way AI interacts with the physical world."

What it means for enterprise technical decision-makers

V-JEPA 2 moves robotics closer to the software-defined model that cloud teams already recognize: pre-train once, deploy anywhere. Because the model learns general physics from public video and needs only a few dozen hours of task-specific footage, enterprises can shrink the data-collection cycle that typically drags down pilot projects. In practical terms, you could prototype a pick-and-place robot on an affordable desktop arm, then roll the same policy onto an industrial rig on the factory floor without collecting thousands of fresh samples or writing custom motion scripts.

Lower training overhead also reshapes the cost equation. At 1.2 billion parameters, V-JEPA 2 fits comfortably on a single high-end GPU, and its abstract prediction targets further reduce inference load. That lets teams run closed-loop control on-premise or at the edge, avoiding the latency and compliance headaches of streaming video outside the plant. Budget that once went to massive compute clusters can instead fund additional sensors, redundancy, or faster iteration cycles.
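A rough back-of-envelope check supports the single-GPU claim, counting weights only and ignoring activations and video-preprocessing buffers:

```python
params = 1.2e9                    # parameter count cited above
bytes_per_param = 2               # fp16/bf16 weights for inference
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.1f} GB of weights")   # ~2.4 GB, ample headroom on a 24-80 GB GPU
```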
