Someday, you may want your household robot to carry a load of dirty laundry downstairs and deposit it in the washing machine in the far-left corner of the basement. The robot will need to combine your instructions with its visual observations to determine the steps it should take to complete this task.
For an AI agent, this is easier said than done. Current approaches often use multiple hand-crafted machine-learning models to tackle different parts of the task, which require a great deal of human effort and expertise to build. These methods, which use visual representations to directly make navigation decisions, demand huge amounts of visual data for training, which are often hard to come by.
To overcome these challenges, researchers at MIT and the MIT-IBM Watson AI Lab developed a navigation method that converts visual representations into pieces of language, which are then fed into one large language model that handles all parts of the multistep navigation task.
Instead of encoding visual features from images of a robot's surroundings as visual representations, which is computationally intensive, their method creates text captions that describe the robot's point of view. A large language model uses the captions to predict the actions a robot should take to fulfill a user's language-based instructions.
Because their method employs purely language-based representations, they can use a large language model to efficiently generate a huge amount of synthetic training data.
While this approach does not outperform techniques that use visual features, it performs well in situations that lack enough visual data for training. The researchers also found that combining their language-based inputs with visual signals leads to better navigation performance.
“By using only language as the perceptual representation, our approach is more straightforward. Since all inputs can be encoded as language, we can generate a trajectory that is understandable to humans,” says Bowen Pan, a PhD student in electrical engineering and computer science (EECS) and lead author of a paper on this approach.
Pan's co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Philip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
Solving a vision problem with language
Since large language models are the most powerful machine-learning models available, the researchers sought to incorporate them into the complex task of vision-and-language navigation, Pan says.
However, such models accept text-based inputs and cannot process visual data from a robot's camera, so the team had to find a way to use language instead.
Their technique uses a simple captioning model to obtain text descriptions of a robot's visual observations. These captions are combined with language-based instructions and fed into a large language model, which decides which navigation step the robot should take next.
The large language model outputs a caption of the scene the robot should see after completing this step. This is used to update the trajectory history so the robot can keep track of where it has been.
The model repeats these processes to generate a trajectory that guides the robot step-by-step toward its goal.
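To make that loop concrete, here is a minimal sketch in Python of how such a caption-then-reason cycle could be wired together. It is an illustration based only on the description above, not the team's actual implementation; `caption_image` and `query_llm` are hypothetical stand-ins for a captioning model and a large language model.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins for the two models described above: a captioning
# model that turns a camera image into text, and a large language model that
# reasons over that text. A real system would call actual models here.
def caption_image(image) -> str:
    """Return a text description of the robot's current view (stub)."""
    return "30 degrees to your left is a door with a potted plant next to it"

def query_llm(prompt: str) -> str:
    """Return the language model's response to a text prompt (stub)."""
    return ("ACTION: turn left 30 degrees and walk to the door\n"
            "EXPECTED VIEW: a hallway leading toward a staircase")

@dataclass
class NavigationState:
    instruction: str                              # user's language-based instruction
    history: list = field(default_factory=list)   # trajectory so far, as text

def parse_response(response: str):
    """Split the model's output into the chosen action and the predicted caption."""
    fields = dict(line.split(": ", 1) for line in response.splitlines() if ": " in line)
    return fields.get("ACTION", ""), fields.get("EXPECTED VIEW", "")

def navigation_step(state: NavigationState, current_image) -> str:
    """One iteration of the caption -> reason -> act loop."""
    observation = caption_image(current_image)

    # Every input reaches the large language model as plain text.
    prompt = (
        f"Instruction: {state.instruction}\n"
        "Trajectory so far:\n" + "\n".join(state.history) + "\n"
        f"Current observation: {observation}\n"
        "Choose the next navigation step and describe the scene the robot "
        "should see after taking it."
    )
    action, expected_view = parse_response(query_llm(prompt))

    # The predicted caption of the next scene updates the trajectory history,
    # so the robot can keep track of where it has been.
    state.history.append(f"Did: {action}. Then saw: {expected_view}")
    return action

state = NavigationState("Carry the laundry to the washer in the far-left corner of the basement.")
print(navigation_step(state, current_image=None))  # repeated until the goal is reached
```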
To streamline the process, the researchers designed templates so the observation information is presented to the model in a standardized form, as a series of choices the robot can make based on its surroundings.
For example, a caption might read: “30 degrees to your left is a door with a potted plant next to it; behind you is a small office with a desk and a computer,” etc. The model chooses whether the robot should move toward the door or the office.
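As an illustration of that idea (the exact template wording is not given here), a standardized observation could be rendered with a small helper like the following, where each candidate heading and its caption become one numbered choice:

```python
def format_observation(view_captions: dict) -> str:
    """Render candidate viewpoints as a standardized list of choices.

    view_captions maps a relative heading in degrees (negative = left,
    positive = right, 180 = behind) to a caption of what is visible there.
    """
    lines = ["Choose one of the following directions:"]
    for i, (heading, caption) in enumerate(sorted(view_captions.items()), start=1):
        if heading == 0:
            where = "straight ahead"
        elif abs(heading) == 180:
            where = "behind you"
        elif heading < 0:
            where = f"{-heading} degrees to your left"
        else:
            where = f"{heading} degrees to your right"
        lines.append(f"({i}) {where} is {caption}")
    return "\n".join(lines)

print(format_observation({
    -30: "a door with a potted plant next to it",
    180: "a small office with a desk and a computer",
}))
# Choose one of the following directions:
# (1) 30 degrees to your left is a door with a potted plant next to it
# (2) behind you is a small office with a desk and a computer
```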
“One of the biggest challenges was figuring out how to encode this kind of information into language in a proper way so the agent understands what the task is and how it should respond,” says Pan.
Advantages of language
When they tested this approach, they found that while it could not outperform vision-based techniques, it offered several advantages.
First, because text requires fewer computational resources to synthesize than complex image data, their method can be used to rapidly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real-world visual trajectories.
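The exact prompting scheme for this augmentation is not detailed here, but conceptually, because trajectories are plain text, a language model can be asked to write many variations of each real one. A hedged sketch, reusing the hypothetical `query_llm` stub from the earlier example:

```python
def synthesize_trajectories(real_trajectories, per_seed: int):
    """Ask a language model to write new trajectories modeled on real ones."""
    synthetic = []
    for seed in real_trajectories:
        for _ in range(per_seed):
            prompt = (
                "Here is a navigation trajectory described in text:\n"
                f"{seed}\n"
                "Write a new, plausible trajectory in a similar indoor "
                "environment, keeping the same step-by-step format."
            )
            synthetic.append(query_llm(prompt))
    return synthetic

# e.g., 10 real trajectories with 1,000 generations each gives 10,000 synthetic examples
```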
The technique can also bridge the gap that can prevent an agent trained in a simulated environment from performing well in the real world. This gap often arises because computer-generated images can look quite different from real-world scenes due to elements like lighting or color. But language that describes a synthetic image versus a real one would be much harder to tell apart, Pan says.
Furthermore, the representations their model uses are easier for humans to understand because they are written in natural language.
“If the agent fails to achieve its goal, we can more easily determine where it failed and why. Perhaps the history information is not clear enough, or the observation ignores some important details,” says Pan.
In addition, their method could be applied more easily to different tasks and environments because it uses only one type of input. As long as data can be encoded as language, they can use the same model without modification.
One drawback, however, is that their method naturally loses some information that would be captured by vision-based models, such as depth information.
However, the researchers were surprised to find that combining language-based representations with vision-based methods improves an agent's ability to navigate.
“Perhaps this means that language can capture higher-level information that cannot be captured with purely visual features,” he says.
This is an area the researchers hope to explore further. They also want to develop a navigation-oriented captioner that could boost the method's performance. In addition, they want to investigate the ability of large language models to exhibit spatial awareness and how this could aid language-based navigation.
This research is funded, in part, by the MIT-IBM Watson AI Lab.