Sensing technologies have improved steadily, but the ability of robots to make decisions in real time based on what they perceive still has a long way to go to equal or surpass human capabilities. Researchers from Microsoft Corp., Carnegie Mellon University, and Oregon State University have been collaborating to improve perception-action loops.
As members of Team Explorer, they are participating in the Defense Advanced Research Projects Agency (DARPA) Subterranean Challenge, or SubT Challenge.
In a blog post, the research team explained how it has created machine learning systems that enable robots or drones to make decisions based on camera data.
“The [perception-action loop] system is trained via simulations and learns to independently navigate challenging environments and conditions in [the] real world, including unseen situations,” the researchers wrote. “We wanted to push current technology to get closer to a human’s ability to interpret environmental cues, adapt to difficult conditions, and operate autonomously.”
Abstract
Machines are a long way from robustly solving open-world perception-control tasks, such as first-person view (FPV) aerial navigation. While recent advances in end-to-end machine learning, especially imitation and reinforcement learning, appear promising, they are constrained by the need for large amounts of difficult-to-collect, labeled real-world data. Simulated data, on the other hand, is easy to generate, but generally does not render safe behaviors in diverse real-life scenarios.
In this work, we propose a novel method for learning robust visuomotor policies for real-world deployment that can be trained purely with simulated data. We develop rich state representations that combine supervised and unsupervised environment data. Our approach takes a cross-modal perspective, where separate modalities correspond to the raw camera data and to the system states relevant to the task, such as the relative pose of the gates with respect to the drone in the case of drone racing.
We feed both data modalities into a novel factored architecture, which learns a joint low-dimensional embedding via variational autoencoders (VAEs). This compact representation is then fed into a control policy, which we train using imitation learning on expert trajectories in a simulator.
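For readers who want a concrete picture of this kind of pipeline, the sketch below is a simplified, hypothetical variant of the cross-modal idea, written in PyTorch as an assumption (the abstract does not specify a framework). An image encoder maps an FPV frame to a low-dimensional latent, decoders reconstruct both modalities (the image and a gate-pose state vector), and a small policy head maps the latent to control commands learned by behavior cloning from expert trajectories. The input resolution, layer sizes, latent dimension, and action dimension are illustrative placeholders, not the authors’ configuration.

```python
# Simplified cross-modal VAE + imitation-learned policy head (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalVAE(nn.Module):
    """Encodes a 64x64 FPV image into a shared low-dimensional latent and
    decodes both modalities: the image and a task state (e.g., gate pose)."""

    def __init__(self, latent_dim=10, state_dim=4):
        super().__init__()
        # Image encoder: 64x64 RGB -> latent mean and log-variance.
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),    # 64 -> 31
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),   # 31 -> 14
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),  # 14 -> 6
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 2 * latent_dim),
        )
        # Decoders reconstruct each modality from the shared latent.
        self.img_dec = nn.Sequential(
            nn.Linear(latent_dim, 128 * 6 * 6), nn.ReLU(),
            nn.Unflatten(1, (128, 6, 6)),
            nn.ConvTranspose2d(128, 64, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2),
        )
        self.state_dec = nn.Linear(latent_dim, state_dim)

    def forward(self, image):
        mu, logvar = self.img_enc(image).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.img_dec(z), self.state_dec(z), mu, logvar


class PolicyHead(nn.Module):
    """Maps the compact latent to control commands; trained by behavior
    cloning against expert trajectories collected in simulation."""

    def __init__(self, latent_dim=10, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, z):
        return self.net(z)


if __name__ == "__main__":
    vae, policy = CrossModalVAE(), PolicyHead()
    img = torch.randn(2, 3, 64, 64)       # dummy batch of FPV frames
    true_state = torch.randn(2, 4)        # dummy relative gate poses
    expert_action = torch.randn(2, 4)     # dummy expert velocity commands

    recon_img, pred_state, mu, logvar = vae(img)

    # Training signal: reconstruct both modalities, regularize the latent,
    # and clone the expert's actions from the latent (behavior cloning).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    loss = (F.mse_loss(recon_img, img)
            + F.mse_loss(pred_state, true_state)
            + 1e-3 * kl
            + F.mse_loss(policy(mu), expert_action))
    loss.backward()
```

In this sketch, decoding the gate pose from the same latent as the image is what ties the compact embedding to task-relevant state; at deployment, only the image encoder and policy head would need to run onboard.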
We analyze the rich latent spaces learned with our proposed representations and show that using our cross-modal architecture significantly improves control policy performance compared with end-to-end learning or purely unsupervised feature extractors.
We also present real-world results for drone navigation through gates in different track configurations and environmental conditions. Our proposed method, which runs fully onboard, can successfully generalize the learned representations and policies across simulation and reality, significantly outperforming baseline approaches.
The full paper can be downloaded here.
Sources: The Robot Report; Microsoft Research