The way most robots navigate is very different from the way most humans navigate. Robots are happiest when they have total environmental understanding, with some sort of full geometric reconstruction of everything around them plus exact knowledge of their own position and orientation. Lidars, pre-existing maps, powerful computers, and even a motion capture system if you can afford it; the demands of autonomous robots never end.
Obviously, this stuff doesn’t scale all that well. With that in mind, Dhruv Shah and professor Sergey Levine at the University of California, Berkeley are working on a different approach. Their take on robotic navigation does away with the high-end, power-hungry components. What suffices for their navigation technique are a monocular camera, some neural networks, a basic GPS system, and some simple hints in the form of a very basic human-readable overhead map. Such hints may not sound all that impactful, but they enable a very simple robot to efficiently and intelligently travel through unfamiliar environments to reach far-off destinations.
ViKiNG: Vision-Based Kilometer-Scale Navigation with Geographic Hints (Summary Video)
If that little robot looks familiar, that’s because we met it a couple of years ago through Greg Kahn, a student of Levine’s. Back then, the robot was named BADGR, and its special skill was learning to navigate through novel environments based on simple images and lived experience (or whatever the robot equivalent of lived experience is). BADGR has now evolved into ViKiNG, which stands for “Vision-Based Kilometer-Scale Navigation with Geographic Hints,” a slightly less forgivable acronym. While BADGR was perfectly happy to wander around small areas, its successor is intended to traverse long distances in search of a goal, which is an important step towards practical applications.
Navigation, very broadly, consists of understanding where you are, where you want to go, and how you want to get there. For robots, where you want to go is the equivalent of a long-term goal. Some far-off GPS coordinate can be reached by achieving a series of short-term goals, like staying on a particular path for the next couple of meters. Achieve enough short-term goals, and you reach your long-term goal. But there’s a sort of medium-term goal in the mix too, which is especially tricky, because it involves making more complex and abstract decisions about what the “best” path might be. In other words, the robot has to decide which combination of short-term goals best serves the mission to reach the long-term goal.
This is where the hints come in for ViKiNG. Using either a satellite map or a road map, the robot can make more informed choices about what short-term goals to aim for, vastly increasing the likelihood that it’ll achieve its objectives. Even with a road map, ViKiNG is not restricted to roads; it just may favor roads because that’s the information it has. Satellite images, which include roads but also other terrain, give the robot more information to work with. The maps are hints, not instructions, which means that ViKiNG can adapt to obstacles it wasn’t expecting. Of course, maps can’t tell the robot exactly where to go at smaller scales (whether those short-term goals are traversable or not), but ViKiNG can handle that by itself with just its monocular camera.
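One way to picture how a map hint can steer the choice among short-term goals is a best-first search in which a heuristic score stands in for the map-derived estimate. The sketch below is illustrative only and is not the researchers’ code: the waypoint graph, the `plan` function, and the `hint_cost` table are all hypothetical, and a straight-line distance plus a per-node penalty is used where ViKiNG would use a learned model.

```python
import heapq
import math

def plan(graph, coords, start, goal, hint_cost=None):
    """A*-style search over a waypoint graph.

    graph:     node -> {neighbor: traversal cost}
    coords:    node -> (x, y) position
    hint_cost: optional node -> extra estimated cost, standing in for
               what an overhead map suggests (hypothetical)
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    hint_cost = hint_cost or {}

    def heuristic(node):
        # Straight-line distance to the goal plus the map hint.
        (x1, y1), (x2, y2) = coords[node], coords[goal]
        return math.hypot(x2 - x1, y2 - y1) + hint_cost.get(node, 0.0)

    frontier = [(heuristic(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for nbr, step in graph[node].items():
            new_cost = cost + step
            if new_cost < best.get(nbr, math.inf):
                best[nbr] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nbr), new_cost, nbr, path + [nbr]),
                )
    return None, math.inf
```

Because the hint only changes the order in which waypoints are explored, a misleading hint can make the resulting route longer, but it cannot stop the search from trying other waypoints, which is the sense in which a hint differs from an instruction.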
The performance of ViKiNG is impressive; as you can see in the figure above, the blue line shows that ViKiNG takes a sensible, efficient route to its goal. Remember, it doesn’t have a comprehensive environmental map. It gets the job done with a very basic GPS, a picture and the general GPS coordinates of its goal, a monocular camera, and the map. This figure shows the robot traversing a short route, but ViKiNG can navigate autonomously without any problems until the researchers get tired of following it, over distances as long as several kilometers.
“I think this is exciting because the entire method is very simple,” says UC Berkeley’s Levine. “In contrast to autonomous driving systems that use enormous software stacks with many interacting components, this system uses two neural networks (one to process first-person images, and one to process the map images), and a planning algorithm that uses them to decide where to drive. This is significant because the complexity of today’s robotic navigation systems is one of the big obstacles preventing their large-scale deployment: if simple learning-based systems can match or surpass the performance of complex, hand-engineered methods, this may point the way to much more tractable and scalable application of machine navigation in the future. Among the imagined applications are delivery robots and higher-level autonomous driving.”
For more details, we sent Levine a couple of questions via email.
IEEE Spectrum: What constitutes a hint, and what would the training process be for new kinds of hints?
Sergey Levine: In our prototype, the robot gets either a satellite image or a road map (both from Google Maps right now). But the image could really be anything that is spatially organized. For example, we also thought about giving it park hiking trail maps, but we just didn’t have quite the right type of data for it (most of our current data is not in parks). The current system does require these images to be spatial (i.e., 2D layouts), so it wouldn’t work, for example, with textual instructions in its current form. However, we are currently exploring extensions that could support the use of strings of text in the future. It is not conceptually a big leap, since our current system already uses contrastive learning methods similar to those that have been used in recent work on combining language and images (e.g., CLIP).
To add new types of hints (e.g., park trail maps, amusement park maps, whatever), the robot would need data that contains trajectories of the robot driving through such environments, and an approximate GPS registration to the image. These details could be pretty rough, as it is not trying to do exact reconstruction. The need for this last step could be eliminated in future work, but that’s still something we are working on.
Can you talk about the impact that different hints have on the performance of ViKiNG? When dealing with outdated imagery, ViKiNG can handle unexpected obstacles, but can it (for example) identify new shortcuts as well?
Levine: ViKiNG will respond differently depending on whether it is given a satellite image or a road map. See, for example, Figure 9 in the paper: the road map is enough for it to follow the road, but if it gets a satellite image, it can tell that it’s possible to cut across a grassy field instead. The system does not currently account for safety explicitly, so in that sense it will always opt for the most direct path that is collision free—in fact, properly accounting for risk (e.g., not driving into the middle of a road) is something we are planning to investigate in future work.
In terms of identifying new shortcuts: we haven’t tested this, but the map is used primarily as a heuristic, so if it sees a path to the goal, in principle it could take it even if the map suggests it’s not possible. However, we have not observed this behavior (and we haven’t tested for it); we only tried adding new obstacles (well, we didn’t so much add them as find one day that someone had parked a giant truck in the path we were going to test…).
Did ViKiNG demonstrate any behaviors that surprised you?
Levine: I was surprised by a few things. First, I was actually quite surprised by just how far ViKiNG drives without intervention—this is not a complete self-driving system, it’s really just two neural net models stapled together with a clever (but simple) planning algorithm in the middle, so having it drive for several kilometers without crashes or interventions (about the range that Dhruv can run after it with a video camera), in environments ranging from forest trails to industrial parks, was quite surprising to me. Another thing that I actually found quite interesting is its ability to deviate from the map in the presence of obstacles, and to “backtrack” when the map leads it astray (or when it simply makes a mistake).
GPS is necessary for ViKiNG to be successful, correct? What are the other constraints on the system?
Levine: Yes, it requires GPS or some other form of approximate localization. Technically, all it needs is something to give it an approximate “you are here” dot on the map. We could even imagine training a separate first-person-to-map model to predict this (that would be an interesting future work direction!). Besides this, all it requires is the dataset to train on (images + actions), a GPU, a way to steer the vehicle, and enough battery life. The requirements are actually pretty minimal. The camera images are monocular (no depth) and are processed by a standard conv net, so in principle it could be retrained with any other sensor modality that can be sent into a neural net (which is practically anything).
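As a rough illustration of what an approximate “you are here” dot involves, the sketch below converts a GPS fix into a pixel position on an overhead image. It assumes the image uses the Web Mercator projection common to online map tiles, which the article does not confirm for ViKiNG, and the function names and the 256-pixel tile size are assumptions made for the example.

```python
import math

TILE_SIZE = 256  # pixels per tile at zoom 0 (assumed, standard for web maps)

def latlon_to_world_pixel(lat_deg, lon_deg, zoom):
    """Project latitude/longitude to Web Mercator pixel coordinates."""
    scale = TILE_SIZE * (2 ** zoom)
    x = (lon_deg + 180.0) / 360.0 * scale
    lat_rad = math.radians(lat_deg)
    y = (1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * scale
    return x, y

def dot_on_image(lat_deg, lon_deg, top_left_lat, top_left_lon, zoom):
    """Pixel offset of a GPS fix from the top-left corner of a map image."""
    x, y = latlon_to_world_pixel(lat_deg, lon_deg, zoom)
    x0, y0 = latlon_to_world_pixel(top_left_lat, top_left_lon, zoom)
    return x - x0, y - y0
```

Since only a rough position is needed, GPS error of a few meters simply moves the dot by a few pixels at typical zoom levels.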
How would this system scale to larger or more complex environments?
Levine: Scaling to larger and more complex environments is mostly a matter of training data, though the environments are already quite large and complex. Probably the biggest thing that is technically preventing this from being used today, as-is, for sidewalk delivery is actually safety: right now, there is nothing that tells ViKiNG to, for example, avoid driving in the middle of the street or attempting to cross a busy freeway. One of the things we are still developing is a way to issue more fine-grained “rewards” to the robot to encourage it to follow “social conventions” like staying off the street. If we wanted to scale it up to larger vehicles, like self-driving cars, then there are also many safety issues that will come up (like in any system of that sort). This will also require significant additional technical development to address. All this is to say that safety still remains a major obstacle for autonomous robots, same as for any other kind of system. But, hopefully, learning-based approaches like ViKiNG will bring us closer to eventually bridging this gap.
As it turns out, UC Berkeley is also involved in DARPA’s RACER program alongside the JPL team. We have an article on RACER here, but the gist is that it’s a long-distance off-road competition for wheeled robots that need to navigate through uncharted environments with help from a low-resolution topographic map. Sounds like something ViKiNG might be cut out for, though maybe with some bigger wheels on it.