A robot searching for workers trapped in a partially collapsed mineshaft must rapidly generate a map of the scene and determine its location within that scene as it navigates the treacherous terrain.
Researchers have recently begun building powerful machine-learning models that perform this complex task using only images from the robot's onboard cameras. But even the best models can process only a limited number of images at a time. In a real-world disaster where every second counts, a search-and-rescue robot would need to quickly traverse large areas and process hundreds of images to complete its mission.
To overcome this problem, MIT researchers drew on ideas from both recent artificial intelligence vision models and classical computer vision to develop a new system that can process an arbitrary number of images. Their system creates accurate 3D maps of complicated scenes, like a crowded office corridor, in a matter of seconds.
The AI-driven system incrementally creates and aligns smaller submaps of the scene, which it stitches together to reconstruct a full 3D map while simultaneously estimating the robot's position in real time.
Unlike many other approaches, their technique does not require calibrated cameras or an expert to tune a complex system implementation. The simplicity of their approach, together with the speed and quality of its 3D reconstructions, would make it easier to scale up for real-world applications.
In addition to helping search and rescue robots navigate, this method may be used to create augmented reality applications for wearable devices like VR headsets or enable industrial robots to quickly find and move goods inside a warehouse.
“For robots to perform increasingly complex tasks, they need much more complex map representations of the world around them. But at the same time, we don't want to make it harder to put these maps into practice. We have shown that it is possible to create an accurate 3D reconstruction in a matter of seconds with a tool that works out of the box,” says Dominic Maggio, an MIT graduate student and lead author of a paper on this technique.
Maggio is joined on the paper by postdoctoral researcher Hyungtae Lim and senior author Luca Carlone, associate professor in MIT's Department of Aeronautics and Astronautics (AeroAstro), principal investigator in the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. The research will be presented at the Conference on Neural Information Processing Systems.
Designing a solution
For years, researchers have been working on an essential element of robot navigation called simultaneous localization and mapping (SLAM). In SLAM, a robot creates a map of its surroundings while orienting itself within that space.
Traditional optimization methods for this task often fail in challenging scenes, or they require the robot's onboard cameras to be calibrated in advance. To avoid these pitfalls, researchers train machine learning models to learn the task from data.
Although they are easier to implement, even the best of these models can only process about 60 camera images at a time, making them unsuitable for applications in which a robot must move quickly through a varied environment while processing hundreds of images.
To solve this problem, the MIT researchers developed a system that generates smaller submaps of the scene instead of the entire map. Their method “glues” these submaps together into one complete 3D reconstruction. The model still processes only a limited number of images at a time, but the system can recreate much larger scenes far faster by stitching the smaller submaps together.
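The submap idea itself is simple to sketch. The snippet below only illustrates splitting a long image stream into small overlapping chunks, one per submap, so that neighboring submaps share a few frames that can later be used to align them; the helper name, chunk size, and overlap are made up for the example and are not the values used by the MIT system.

```python
def make_chunks(frames, chunk_size=32, overlap=4):
    """Split a long frame sequence into overlapping chunks, one per submap."""
    chunks, start = [], 0
    while start < len(frames):
        chunks.append(frames[start:start + chunk_size])
        if start + chunk_size >= len(frames):
            break
        start += chunk_size - overlap  # shared frames let neighboring submaps be aligned later
    return chunks

# Toy run on 100 frame indices: each chunk overlaps the next by 4 frames.
print([(c[0], c[-1]) for c in make_chunks(list(range(100)))])
# [(0, 31), (28, 59), (56, 87), (84, 99)]
```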
“This seemed like a very simple solution, but when I first tried it, I was surprised that it didn't work all that well,” says Maggio.
Searching for an explanation, he delved into computer vision research papers from the 1980s and 1990s. Through this analysis, Maggio realized that flaws in the way the machine learning models process images make aligning the submaps a more complex problem.
Traditional methods align submaps by applying rotations and translations until the maps line up. But these new models can introduce ambiguity into the submaps, which makes them harder to align. For example, a 3D submap of one side of a room might have slightly curved or stretched walls. Simply rotating and shifting these deformed submaps until they line up doesn't work.
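To make the distinction concrete, here is the classic rigid alignment that traditional methods rely on: given corresponding 3D points from two submaps, the Kabsch/Procrustes solution finds the single rotation and translation that best maps one onto the other. A rotation plus a translation can undo a change of viewpoint, but it has no degrees of freedom left over to absorb curved or stretched walls, which is exactly where the learned submaps cause trouble.

```python
import numpy as np

def rigid_align(P, Q):
    """Find R (3x3 rotation) and t (3,) minimizing ||R @ P_i + t - Q_i||^2."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)          # 3x3 cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against a reflection solution
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Toy check: points related by a pure rotation and translation are recovered exactly.
rng = np.random.default_rng(0)
P = rng.random((50, 3))
R_true = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
Q = P @ R_true.T + np.array([0.5, -0.2, 1.0])
R, t = rigid_align(P, Q)
print(np.allclose(Q, P @ R.T + t))  # True -- but a stretched submap would break this model
```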
“We need to make sure all the submaps are deformed in a consistent way so that we can align them well with one another,” Carlone explains.
A more flexible approach
Drawing on ideas from classical computer vision, the researchers developed a more flexible mathematical technique that can represent all the deformations in these submaps. By applying mathematical transformations to each submap, this more flexible method can align them in a way that resolves the ambiguity.
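The article does not spell out which family of transformations the researchers use, so the sketch below stands in with an ordinary affine fit: instead of only a rotation and translation, it solves for a general linear map plus a translation between corresponding submap points. The extra degrees of freedom can absorb a uniform stretch or shear that defeats the rigid model above. It is meant only to show why a richer transformation makes deformed submaps alignable, not to reproduce the team's actual formulation.

```python
import numpy as np

def affine_align(P, Q):
    """Least-squares A (3x3) and b (3,) such that A @ P_i + b ~= Q_i."""
    P_h = np.hstack([P, np.ones((len(P), 1))])    # homogeneous coordinates
    X, *_ = np.linalg.lstsq(P_h, Q, rcond=None)   # solves P_h @ X ~= Q
    A, b = X[:3].T, X[3]
    return A, b

# A submap stretched 10 percent along one axis defeats rigid alignment,
# but the affine fit recovers the relationship exactly.
rng = np.random.default_rng(1)
P = rng.random((50, 3))
Q = P @ np.diag([1.1, 1.0, 1.0])                  # anisotropic stretch (a mild deformation)
A, b = affine_align(P, Q)
print(np.allclose(Q, P @ A.T + b))  # True
```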
Based on the input images, the system outputs a 3D reconstruction of the scene and estimates of the camera positions that the robot would use to locate itself in space.
“Once Dominic had the intuition to bridge these two worlds, learning-based approaches and traditional optimization methods, the implementation was fairly straightforward,” says Carlone. “Coming up with something this effective and simple has potential for a lot of applications.”
Their system worked faster and produced fewer reconstruction errors than other methods, without requiring special cameras or additional tools to process the data. The researchers generated near-real-time 3D reconstructions of complex scenes, like the interior of the MIT Chapel, using only short videos captured on a cell phone.
The average error in these 3D reconstructions was less than 5 centimeters.
In the future, the researchers want to make their method more reliable for especially complicated scenes, and they are working toward deploying it on real robots in challenging environments.
“It pays to be familiar with traditional geometry. If you understand exactly what is going on in the model, you can get much better results and make things much more scalable,” says Carlone.
This work is supported, in part, by the U.S. National Science Foundation, the U.S. Office of Naval Research, and the National Research Foundation of Korea. Carlone, who is currently on research leave as an Amazon Scholar, completed this work before joining Amazon.

