In the classic animated series “The Jetsons,” Rosie the robotic maid seamlessly switches from vacuuming the house to cooking dinner to taking out the trash. But in real life, training a general-purpose robot remains a major challenge.
Typically, engineers collect data specific to a certain robot and task, which they use to train the robot in a controlled environment. However, gathering this data is costly and time-consuming, and the robot will likely struggle to adapt to environments or tasks it hasn’t seen before.
To train better general-purpose robots, MIT researchers developed a versatile technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks.
Their method involves aligning data from varied domains, such as simulations and real robots, and multiple modalities, including vision sensors and robotic-arm position encoders, into a shared “language” that a generative AI model can process.
By combining such an enormous amount of data, this approach can be used to teach a robot to perform a variety of tasks without the need to train it from scratch each time.
This method could be faster and less expensive than traditional techniques because it requires far fewer task-specific data. In addition, it outperformed training from scratch by more than 20 percent in simulation and real-world experiments.
“In robotics, people often say that we don’t have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware. Our work shows how you can train a robot with all of them combined,” says Lirui Wang, a doctoral student in electrical engineering and computer science (EECS) and lead author of a paper on this technique.
Wang’s co-authors include fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems.
Inspired by LLMs
A robot “policy” takes in sensor observations, such as camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells the robot how and where to move.
Policies are typically trained with imitation learning: a human demonstrates actions or teleoperates a robot to generate data, which are fed into an AI model that learns the policy. Because this method uses a small amount of task-specific data, robots often fail when their environment or task changes.
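In code, imitation learning often reduces to behavior cloning: regress the policy’s output onto the demonstrated actions. Below is a minimal, hypothetical PyTorch sketch of that idea; the network shape, the 7-dimensional action space, and the stand-in demonstration data are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps sensor observations (e.g., camera features and joint
    positions) to robot actions such as target arm poses."""
    def __init__(self, obs_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

policy = Policy(obs_dim=64, action_dim=7)  # 7-DoF arm assumed
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

# Stand-in for recorded demonstrations: (observation, expert action) pairs.
demos = [(torch.randn(64), torch.randn(7)) for _ in range(256)]
loader = torch.utils.data.DataLoader(demos, batch_size=32)

for obs, expert_action in loader:
    # Behavior cloning: match the demonstrated action as closely as possible.
    loss = nn.functional.mse_loss(policy(obs), expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A policy trained this way only sees its small demonstration set, which is why it tends to break when the scene or task drifts from what the human showed it.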
To develop a better approach, Wang and his collaborators drew inspiration from large language models like GPT-4.
These models are pretrained using an enormous amount of diverse language data and then fine-tuned by feeding them a small amount of task-specific data. Pretraining on so much data helps the models adapt to perform well on a variety of tasks.
“In the language domain, the data are all just sentences. In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture,” he says.
Robot data can take many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers, and sensors. Plus, the environments where the data are collected vary widely.
The MIT researchers developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that unifies data from these varied modalities and domains.
At the heart of their architecture, they placed a machine-learning model known as a transformer, which processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models.
The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process. Each input is represented with the same fixed number of tokens.
The transformer then maps all the inputs into one shared space, growing into a huge pretrained model as it processes and learns from more data. The larger the transformer becomes, the better it performs.
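As a rough illustration of that idea, the hypothetical sketch below gives each modality its own “stem” that emits the same fixed number of tokens, then runs everything through one shared transformer trunk. All module names and sizes here are assumptions chosen for clarity, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn

D, N_TOKENS = 128, 16  # shared token width; fixed token count per modality

class Stem(nn.Module):
    """Projects one modality into exactly N_TOKENS tokens of width D,
    so every input ends up in the same tokenized form."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, N_TOKENS * D)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x).view(x.shape[0], N_TOKENS, D)

vision_stem = Stem(in_dim=512)   # e.g., pooled camera features
proprio_stem = Stem(in_dim=14)   # e.g., joint positions and velocities

# One shared trunk processes tokens from every modality and domain.
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=4,
)

vision_feats = torch.randn(1, 512)
proprio = torch.randn(1, 14)
tokens = torch.cat([vision_stem(vision_feats), proprio_stem(proprio)], dim=1)
shared = trunk(tokens)           # both modalities in one common space
print(shared.shape)              # torch.Size([1, 32, 128])
```

The key design choice is that the trunk never sees raw sensor data, only tokens, so data from different robots and modalities can all feed the same growing model.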
Then a user only needs to feed HPT a small amount of data on their robot’s design, setup, and the task they want it to perform. HPT transfers the knowledge the transformer gained during pretraining to learn the new task.
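Continuing the hypothetical sketch above (and reusing its names: `trunk`, `vision_stem`, `proprio_stem`, `D`), the transfer step might look like the following: keep the pretrained trunk fixed and train small robot-specific pieces on a little new-task data. The 7-dimensional action head and the mean-pooled readout are illustrative assumptions.

```python
# Robot-specific action head for a hypothetical 7-DoF arm.
action_head = nn.Linear(D, 7)

# Reuse the pretrained trunk unchanged; only lightweight parts are tuned.
for p in trunk.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    [*action_head.parameters(),
     *vision_stem.parameters(),
     *proprio_stem.parameters()],
    lr=1e-4,
)

# One illustrative gradient step on a small batch of new-task data.
obs_v, obs_p = torch.randn(8, 512), torch.randn(8, 14)
target_action = torch.randn(8, 7)
tokens = torch.cat([vision_stem(obs_v), proprio_stem(obs_p)], dim=1)
pred = action_head(trunk(tokens).mean(dim=1))  # pool tokens, then act
loss = nn.functional.mse_loss(pred, target_action)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```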
Enabling dexterous motions
One of the biggest challenges of developing HPT was building the massive dataset to pretrain the transformer, which included 52 datasets with more than 200,000 robot trajectories in four categories, including human demonstration videos and simulation.
The researchers also needed to develop an efficient way to turn raw proprioception signals from an array of sensors into data the transformer could handle.
“Proprioception is key to enabling a lot of dexterous motions. Because the number of tokens in our architecture is always the same, we place the same importance on proprioception and vision,” Wang explains.
When they tested HPT, it improved robot performance by more than 20 percent on simulation and real-world tasks, compared with training from scratch each time. Even when the task was very different from the pretraining data, HPT still improved performance.
“This paper provides a novel approach to training a single policy across multiple robot embodiments. This enables training across diverse datasets, enabling robot learning methods to significantly scale up the size of the datasets they can train on. It also allows the model to quickly adapt to new robot embodiments, which is important as new robot designs are continuously being produced,” says David Held, an associate professor at Carnegie Mellon University’s Robotics Institute, who was not involved in this work.
In the future, the researchers want to study how data diversity could boost HPT’s performance. They also want to enhance HPT so it can process unlabeled data, like GPT-4 and other large language models do.
“Our dream is to have a universal robot brain that you could download and use for your robot without any training at all. While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models,” he says.
This work was funded, in part, by the Amazon Greater Boston Tech Initiative and the Toyota Research Institute.