Lecture 8.2: John Leonard - Mapping, Localization, & Self-Driving Vehicles

Description: Technical challenges for self-driving cars, a historical perspective on robot mapping and localization, and object-based mapping.

Instructor: John Leonard

The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

JOHN LEONARD: OK, thanks. Thanks for the opportunity to talk. So hi, everyone. It's a great pleasure to talk here at MBL. I've been coming to the Woods Hole Oceanographic Institution for many years, but this is my first time over here at MBL. And so I'm going to try to cover three different topics, which is probably a little ambitious on time. But there's so much I'd love to say to you.

I want to talk about self-driving cars. And use it as a context to think about questions of representation for localization and mapping, and maybe connect it into some of the brain questions that you folks are interested in, and time permitting, at the end mention a little bit of work we've done on object-based mapping in my lab.

So my background-- I grew up in Philadelphia. Went to UPenn for engineering. But then went to Oxford to do my PhD at a very exciting time when their computer vision and robotics group was just being formed at Oxford under Michael Brady. And then I came back to MIT and started working with underwater vehicles. And that's when I got involved with Woods Hole Oceanographic Institution. And I was very fortunate to join the AI lab back around 2002, which became part of CSAIL. And really, I've been able to work with really amazing colleagues and amazing robots in a challenging set of environments.

So autonomous underwater vehicles provide a very unique challenge because we have very poor communications to them. Typically, we use acoustic modems that might give you 96 bytes if you're lucky every 10 seconds to a few kilometers range. And so we also need to think about the sort of constraints of running in real time onboard a vehicle.

And so the sort of work that my lab's done-- while we investigate fundamental questions about robot perception, navigation, and mapping, we're also involved in building systems. So this is a project I did for the Office of Naval Research some years ago using small vehicles that would reacquire mine-like targets on the bottom for the Navy. It's an example of a more applied system where we had a very small, resource-constrained platform. The robot built a map as it performed its mission, and then matched that map against a prior map to do terminal guidance to a target.

Another big system I was involved with, as Russ mentioned, was the Urban Challenge. And I'll say a bit about that in the context of self-driving cars. So let's see. So who's heard any of the recent statements from Elon Musk from Tesla?

So he said self-driving cars are solved. And a particular thing he said that just made my-- I don't know, maybe steam came out of my head-- was that he compared autonomous cars with elevators that used to require operators but are now self-service. So imagine getting in a car, pressing a button, and arriving at MIT in Cambridge 80 miles away, navigating through the Boston downtown highways and intersections.

And maybe that will happen. But I think it's going to take a lot longer than folks are saying. And some of that comes from fundamental questions in intelligence and robotics. So in a nutshell, when Musk says that self-driving is solved, I think he's wrong, as much as I admire what Tesla and SpaceX have done.

And so to talk about that, I think we need to be very honest as a field about our failures as well as our successes, and try to balance what you hear in the media with the reality of where I think we are. And so I wanted to quote verbatim what Russ said about the robotics challenge, about a project that was so exhausting and just all-consuming and so stressful, yet so rewarding.

So we did this in 2006 and 2007-- my wonderful colleagues, Seth Teller, Jonathan How, Emilio Frazzoli-- amazing students and postdocs. We had a very large team. And we tried to push the limit on what was possible with perception and real-time motion planning.

So our vehicle built a local map as it traveled from its perceptual data, using data from laser scanners and cameras. And we didn't want to blindly follow GPS. We wanted the car to make its own decisions, because going beyond simple GPS navigation was part of the original spirit of the challenge.

And so Seth Teller and his student, Albert Huang, developed a vision-based perceptual system where the car tried to detect curbs and lane markings in very challenging vision conditions-- for example, looking into the sun, which you'll see in a second. A really challenging situation for trying to perceive the world.

And so our vehicle-- at the time, we went a little crazy on the computation. We had 10 blades, each with four cores-- 40 cores-- which may not seem like a lot now, but we needed 3.5 kilowatts just to power the computer at full tilt. We fully loaded the computer with a randomized motion planner and all these perception algorithms. We had a Velodyne laser scanner on the roof, about 12 other laser scanners, 5 cameras, and 15 radars, and we really pushed the envelope on algorithms.

And so when faced with a choice in a DARPA challenge, if you want to win at all costs you might simplify, or try to read the rules carefully, or guess the rule simplifications. But that would have meant just sort of turning off the work of our PhD students, and we didn't want to do that.

So at the end of the day, all credit to the teams that did well. Carnegie Mellon-- first, $2 million; Stanford-- second, $1 million; Virginia Tech-- third, half a million dollars; MIT-- fourth, and nothing for fourth place. But it was quite an amazing experience. And in the spirit of advertising our failures I think I have time to show this. This used to be painful for me to watch. But now I've gotten over it. This is our--

[VIDEO PLAYBACK]

- Let's check in once again with the boss.

JOHN LEONARD: Even though we finished the race, we had a few incidents so DARPA stopped things and let us continue.

- --across the line.

JOHN LEONARD: Carnegie-Mellon, who won the race. Why did that stop? Let's see.

- --at the end of mission two behind Virginia Tech. Virginia Tech got a little issue. [INAUDIBLE] Here's--

JOHN LEONARD: We were trying to pass Cornell for a few minutes.

- Looks like they're stopped. And it looks like they're-- that the 79 is trying to pass and has passed the chase vehicle for Skynet, the 26 vehicle. Wow. And now he's done it. And Talos is going to pass. Very aggressive. And, whoa. Ohh. We had our first collision. Crash in turn one. Oh boy. That is, you know, that's a bold maneuver.

[END PLAYBACK]

JOHN LEONARD: So what actually happened? So it turned out Cornell were having problems with their actuators. They were sort of stopping and starting and stopping and starting. And we had some problems. It turned out we had about five bugs, and they had about five bugs, and those bugs interacted. And here's a computer's-eye view-- sort of the brain of the robot's view.

Now back in '07, we weren't using a lot of vision for object detection and classification. So with the laser scanner-- the Cornell vehicle's there. It has a license plate. It has tail lights. It has a big number 26. It's on the middle of a road. We should know that's a car. Stay away from it. But to the laser scanner it's just a blob of laser scanner data. And even when we pull around the side of the car we weren't clever enough with our algorithms to fill in the fact that it's a car.

And when it starts moving, you have the aperture problem-- as you're moving, and it's moving, it's very hard to deduce the true motion. Now, another thing that happened was we had a threshold. In our 150,000 lines of code, our wonderfully gifted student Ed Olson, who's now a tenured professor at Michigan, had a threshold of 3 meters per second. Anything moving faster than 3 meters per second could be a car. Anything less than 3 meters per second couldn't be a car.

Now that might seem kind of silly. But it turns out that slowly moving obstacles are much harder to detect and classify than fast moving obstacles. That's one reason that city driving or driving, say, in a shopping mall parking lot is actually in many ways more challenging than driving on the highway. And so despite our best efforts to stop at the last minute, we steered into the car and had this little minor fender bender.
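
To make that failure mode concrete, here is a minimal sketch of that kind of fixed speed threshold. The 3 meters per second value is from the talk; everything else (names, the example speeds) is purely illustrative and not the actual race code.

```python
# Minimal sketch of a fixed speed threshold for deciding whether a tracked
# obstacle could be a moving car (illustrative only).
CAR_SPEED_THRESHOLD_MPS = 3.0  # value mentioned in the talk

def could_be_car(tracked_speed_mps: float) -> bool:
    """Treat anything moving faster than the threshold as a possible car."""
    return tracked_speed_mps > CAR_SPEED_THRESHOLD_MPS

# Failure mode: a vehicle creeping along in stop-and-go traffic at ~1 m/s
# falls below the threshold and gets treated like a static obstacle.
print(could_be_car(1.0))   # False -- the slow-moving car is missed
print(could_be_car(15.0))  # True  -- fast-moving traffic is the easy case
```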

But one thing that we did is we made all our data available open source. And we actually wrote a journal article on this incident and a few others. And so if you'd asked me then in 2007, I would have said we're a long way from turning your car loose on the streets of Boston with absolutely no user input.

And the real challenge is uncertainty and robustness-- developing robust systems that really work. But for our system, some of the algorithmic progress we made-- I mentioned the lane tracking. Albert Huang, who's now, I think, working at Google, developed-- was given very sparse-- I'd say about 10% or more of our recent graduates are working at Google these days.

AUDIENCE: Albert's at [INAUDIBLE].

JOHN LEONARD: Oh. OK. And then here is a video for the qualifying event to get into the final race. We had to navigate-- whoops, I can't press the mouse. That's going to stop. So we had to navigate along a curved road with very sparse waypoints.

And so, in real time the computer has to make decisions about what it sees. Where is the road? Where am I? Are there obstacles? And there are no parked cars in this situation, but other stretches had parked cars.

And our car-- in a nutshell, if our robot became confused about where the road was it would stop. It would have to wait and get its courage up, like lowering its thresholds as it was stuck. But we were the only team to our knowledge to qualify without actually adding waypoints. So it turns out the other top teams, they just went in with a Google satellite image and just added a breadcrumb trail for the robot to follow, simplifying the perception.

So this was back in '07. Now let's fast forward to 2015. And right now-- so of course, we have the Google self-driving car, which has just been an amazing project. And you've all probably seen these videos, each with millions of hits on YouTube. The earlier one, taking a blind person for a ride to Taco Bell, was from 2012; then city streets in 2014 and spring 2015. And then the new Google car, which won't have a steering wheel in its final instantiation, won't have pedals. It will just have a stop button. And that's your analogy to the elevator.

And so I think that the Google car is an amazing research project that might one day transform mobility. But I do think, with all sincerity-- so I rode in the Google car last summer. I was blown away. I felt like I was on the beach at Kitty Hawk. It's like this just really profound technology that could in the long term have a very big impact. And I have amazing respect for that team-- Chris Urmson, Mike Montemerlo, et cetera.

But I think in the media and elsewhere, the technology has been a bit overhyped, and it's poorly understood. And a lot of it comes down to how the car localizes itself, how it uses prior maps, and how they simplify the task of driving. And so even though people like Musk have said driving is a solved problem, I think we have to be aware that just because it works for Google, doesn't mean it'll work for everybody else.

So critical differences between Google and, say, everyone else. And this is with all respect to all players. I'm not trying to criticize. It's more just trying to balance the debate. The Google car-- on the left-- localizes with a prior map, where they map the LiDAR intensity of the ground surface. And they will annotate the map by hand-- adding pedestrian crossings, adding stoplights. They'll drive a car around many, many times, and then do a SLAM process to optimize the map.

But if the world changes, they're going to have to adapt to that. Now, they've shown the ability to respond to construction, and to bicyclists with hand signals. When I was in the car we crossed the railroad tracks. That just blew me away. I mean, it's a pretty impressive capability. Contrast that with a more vision-based approach that just follows the lane markings: if the lane markings are good, everything's fine.

In fact, Tesla either just have released-- or are about to release-- their autopilot software, which is an advanced lane keeping system. And Elon Musk, a few weeks ago, posted on Twitter that there's one last corner case for us to fix.

And apparently, on part of his commute in the Los Angeles area there are well-defined lane markings, and part of it is a concrete road with weeds and skid marks and so forth. And he said publicly that the system works well if the lane markings are well-defined, but for more challenging vision conditions, like looking into the sun, it doesn't work as well.

And so the critical difference is if you're going to use the LiDAR with prior maps, you can do very precise localization down to less than 10 centimeters accuracy. And the way I think about it is robot navigation is about three things-- where do you want the robot to be? Where does the robot think it is? And where really is the robot? And when the robot thinks it's somewhere, but it's really somewhere different, that's really bad. That happens.

We've lost underwater vehicles and had very nervous searches to find them-- luckily-- when the robot made a mistake. And so with the Google approach they really nail this "where am I" problem-- the localization problem. But it means having an expensive LiDAR. It means having accurate maps. It means maintaining them.

One critical distinction is between level four and level three. These are definitions of autonomy from the US government-- from NHTSA. A level four car is what Google are trying to do now, which is really, you just-- you could go to sleep. The car has 100% control. You couldn't intervene if you wanted to. You just press a button. Go to sleep. Wake up at your destination.

Musk has said that he thinks within five years you can go to sleep in your car, which to me I just-- five decades would impress me, to be honest. But level three is when the car is going to do most of the job, but you have to take over if something goes wrong.

And for example Delphi drove 99% of the way across the US in spring of this year, which is pretty impressive. But 50 miles had to be driven by people-- getting on and off of highways and city streets. And so there's something about human nature, and the way humans interact with autonomous systems, that it's actually kind of hard for a person to pay attention.

Imagine if 99% of the time the car does it perfectly. But 1% of the time it's about to make a mistake, and you have to be alert to take over. And research experience from aviation has shown that humans are actually bad at that.

And another issue is-- and this is-- I mean, Mountain View is pretty complicated-- lots of cyclists, pedestrians, I mentioned the railroad crossings, construction. But in California they've had this historic drought. And most of the testing has been done with no rain, for example, and no snow. And if you think about Boston and Boston roads, there are some pretty challenging situations.

And so for myself-- a couple of years ago I said I didn't expect, in my lifetime, a fully autonomous taxi that could go anywhere in Manhattan. And I got criticized online for saying that. So I put a dash cam on my car, and actually had my son record cell phone footage.

The upper left is making a left turn near my house in Newton, Mass. And if you look to the right, there's cars as far as the eye can see. And if you look to the left, there's cars coming at a pretty high rate of speed, with a mailbox and a tree.

And this is a really challenging behavior for a human, because it requires making a decision in real time. We want very high reliability in terms of detecting the cars coming from the left. But the way that I pulled out was to wave at a person in another car. And those sorts of nods and waves are some of the most challenging forms of human-computer interaction. So imagine vision algorithms that could detect a person nodding at you from the other direction.

Or here's another situation. This is going through Coolidge Corner in Brookline. And I'll show a longer version of this in a second. But the light's green. And see here-- this police officer? So despite the green light, the police officer just raises their hand, and that means the signal to stop. And so interacting with crossing guards and people-- very challenging, as well as changes to the road surface and, of course, adverse weather. And so here's a longer sequence for that police officer.

First of all, you'll see flashing lights on the left-- flashing lights might mean you should pull over, but here you should just drive past them. It's just the cop who left his lights on when he parked his car. But the light's red. And this police officer is waving me through a red light, which I think is a really advanced behavior. So imagine the logic for a car-- OK, stop at red lights unless there's a police officer waving you through-- and how you make that reliable.

And now we're going to pull up to the next intersection, and this police officer is going to stop us at a green light. And so despite all the recent progress in vision-- things like image labeling, ImageNet-- most of those systems are trained with vast archives of images from the internet where there's no context, and they're challenging for even humans to classify. So on some data sets, like the Caltech pedestrian data set, if you get 78% performance, that's really good. But we need 99.9999% or better performance before we're going to turn cars loose in the wild in these challenging situations.

Now going back more to localization and mapping. Here I collected data for about three or four weeks of my commuting. This is crossing the Mass. Ave. Bridge going from Boston into Cambridge. And the lighting is a little tricky. But tell me what's different between the top and the bottom video.

And notice, by the way, how close we come to this truck. With the slightest angular error in your position estimate, really bad things could happen. But the top-- this is a long weekend. This is Veterans Day weekend. They repaved the Mass. Ave. Bridge. So on the bottom, the lane lines are gone. And so if you had an appearance-based localization algorithm like Google's, you would need to remap the bridge before you drove on it. But the lines aren't there yet. And how well is it going to work? So this is just a really tricky situation.

And, of course, there's weather. Now, snow is difficult for things like traction and control. But for perception, if you look at how the Google car actually works-- if you're going to localize yourself based on precisely knowing the car's position down to centimeters so that you can predict what you should see, then if you can't see the road surface you're not going to be able to localize. And so this is just a reminder of the sorts of maps that Google uses. So I think to make it to really challenging weather and very complex environments, we need a higher level understanding of the world. I think more a semantic or object-based understanding of the world.

And then, of course, there's difficulties in perception. And so what do you see in this picture? The sun? There's a green light there. I realize the lighting is really harsh, and maybe you could do polarization or something better.

But does anyone see the traffic cop standing there? You can just make out his legs. There's a policeman there who gave me this little wave, even though I was sort of blinded by the sun. And he walked out and put his back to me and was waving pedestrians across, even though the light was green. So a purely vision-based system is going to just need dramatic leaps in visual performance.

So to wrap up the self-driving car part, I think the big technical challenges going forward are maintaining the maps, dealing with adverse weather, interacting with people-- both inside and outside of the car-- and then getting truly robust computer vision algorithms. We want to get to a totally different place on the ROC curves, or the precision-recall curves, where we're approaching perfect detection with no false alarms. And that's a really hard thing to do.
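
As a back-of-the-envelope illustration of why per-frame accuracy in the high 90s is not enough, here is a small calculation with assumed values (a 10 Hz detector over one hour of driving); the frame rate and error rates are hypothetical, not measurements from any real system.

```python
# Rough arithmetic (assumed values): even a detector that is wrong only
# 0.1% of the time per frame produces a steady stream of errors per hour.
frames_per_second = 10                       # assumed detector rate
frames_per_hour = frames_per_second * 3600   # 36,000 decisions per hour

for per_frame_error in (0.22, 1e-3, 1e-6):   # 78% accuracy, 99.9%, 99.9999%
    expected_errors = frames_per_hour * per_frame_error
    print(f"error rate {per_frame_error:g}: ~{expected_errors:.2f} bad frames/hour")
```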

So I've worked my whole life on the robot mapping and localization problem. And for this audience I wanted to just ask you a little question. Does anyone know what the 2014 Nobel Prize in medicine or physiology was for? Anybody?

AUDIENCE: [INAUDIBLE]

AUDIENCE: Grid cells.

JOHN LEONARD: Grid cells. Grid cells and place cells. And so this has been called SLAM in the brain. Now, you might argue. And we might be very far from knowing. But I think it's just really exciting to-- so for myself, I'll explain.

I've had what's called an ONR MURI grant-- multidisciplinary university research initiative grant-- with Mike Hasselmo and his colleagues at Boston University. And these are a couple of Mike's videos. And so, I think Matt Wilson spoke to your group. And the notion that in the entorhinal cortex that there is this sort of position information that's very metrical, and it seems to be at the heart of memory formation, to me is very powerful and very important.

And so, we have this underlying question of representation. How do we represent the world? And I believe location is just absolutely vital to building memories and to developing advanced reasoning in the world. And the fact that grid cells exist and have this role in memory formation is, to me, just a really exciting concept.

And so, in robotics we call the problem of how a robot builds a map and uses that map to navigate SLAM-- simultaneous localization and mapping. This is a PR2 robot being driven around the second floor of our building, not far from Patrick's office, if you recognize any of that. And this is using stereo vision.

My PhD student, Hordur Johannsson, who graduated a couple of years ago, created a system to do real time SLAM and try to address how to get temporally scalable representations. And one thing you'll see as the robot goes around occasionally is loop closing, where the robot might come back and have like, an error and then correct that error.

So this is the part of the SLAM problem that in some ways is well understood in robotics: how you detect features from images, track them over time, and try to bootstrap up, building a representation and using that for location estimation.
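
As an illustrative sketch of that front end (not the actual PR2 stereo system), here is roughly what detecting and matching features between two frames looks like using OpenCV; the feature count and number of matches kept are assumed values.

```python
# Illustrative visual-SLAM front end: detect ORB features in two frames and
# match their descriptors. These correspondences feed the pose/landmark
# estimator in the back end.
import cv2

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def match_frames(frame_prev, frame_curr, keep=200):
    """Return the best descriptor matches between two grayscale images."""
    kp1, des1 = orb.detectAndCompute(frame_prev, None)
    kp2, des2 = orb.detectAndCompute(frame_curr, None)
    if des1 is None or des2 is None:   # no features found in one of the frames
        return []
    matches = matcher.match(des1, des2)
    # Keep the most consistent matches (smallest descriptor distance).
    return sorted(matches, key=lambda m: m.distance)[:keep]
```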

And I've worked on this my whole career. And as a grad student at Oxford, I had very primitive sensors. So for a historical SLAM talk I recently digitized an old video and some old pictures. This was in the basement of the engineering building at Oxford. This is just the localization part of how you have a map, and you generate predictions-- in this case for sonar measurements.

And at the time-- there I'm sitting at a Sun workstation. To my left is something called a Datacube, which for about $100,000 could just barely do real-time frame grabbing and then edge detection. And so vision just wasn't ready.

And the exciting thing now in our field is vision is ready-- that we're really using vision in a substantial way. But I think a lot about prediction. If you know your position, you can predict what you should see and create a feedback loop. And that's sort of what we're trying to do.
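
Here is a toy version of that prediction idea, assuming a simple 2D range-bearing sensor rather than the original sonar models: given an estimated pose and a mapped landmark, predict the measurement you expect, and use the difference from what the sensor actually returns as the correction signal in the feedback loop.

```python
# Toy measurement prediction for a 2D range-bearing sensor (illustrative).
import math

def predict_measurement(pose, landmark):
    """pose = (x, y, heading in radians); landmark = (x, y).
    Returns the (range, bearing) we expect the sensor to report."""
    dx, dy = landmark[0] - pose[0], landmark[1] - pose[1]
    expected_range = math.hypot(dx, dy)
    expected_bearing = math.atan2(dy, dx) - pose[2]
    return expected_range, expected_bearing

def innovation(measured, predicted):
    """Prediction error (the 'innovation') that drives the correction step."""
    bearing_err = measured[1] - predicted[1]
    bearing_err = math.atan2(math.sin(bearing_err), math.cos(bearing_err))  # wrap
    return measured[0] - predicted[0], bearing_err

# Example: robot believes it is at (1, 2) facing east; landmark mapped at (4, 6).
print(predict_measurement((1.0, 2.0, 0.0), (4.0, 6.0)))  # (5.0, ~0.927 rad)
```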

And so SLAM is a wonderful problem, I believe, for addressing a whole great set of questions, because there are these different axes of difficulty that interact with one another. And one is representation. How do we represent the world? And I think that question-- we still have a ton of things to think about.

Another is inference. We want to do real-time inference about what's where in the world and how we combine it all together. And finally, there's a systems and autonomy axis, where we want to build systems, deploy them, and have them operate robustly and reliably in the world.

So in SLAM, here's an example of how we pose this as an inference problem. This is from the classic Victoria Park data set from Sydney University. A robot drives around, in this case, a park with some trees. There are landmarks shown in green. The robot's position estimate drifts over time-- we have dead reckoning error, shown in blue. And we estimate the trajectory of the robot, in red, and the positions of the landmarks, from relative measurements.

So as you take relative measurements and move through the world, how do you put that all together? And so we cast this as an inference problem where we have the robot poses, the odometric inputs, landmarks-- you can do it with or without landmarks-- and measurements.
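
As a concrete sketch of posing SLAM this way, here is a tiny pose-graph example written against the GTSAM library's Python bindings (assuming a recent GTSAM release); the odometry values and noise levels are made up, and a full system like the Victoria Park one would also include landmark and measurement factors.

```python
# Minimal pose-graph inference sketch with GTSAM: three 2D poses connected by
# noisy odometry factors, anchored by a prior on the first pose.
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.3, 0.3, 0.1]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

graph.add(gtsam.PriorFactorPose2(1, gtsam.Pose2(0, 0, 0), prior_noise))
graph.add(gtsam.BetweenFactorPose2(1, 2, gtsam.Pose2(2, 0, 0), odom_noise))
graph.add(gtsam.BetweenFactorPose2(2, 3, gtsam.Pose2(2, 0, 0), odom_noise))

# Deliberately poor initial guesses; the optimizer pulls them into agreement
# with the factors -- this is the inference step.
initial = gtsam.Values()
initial.insert(1, gtsam.Pose2(0.5, 0.0, 0.2))
initial.insert(2, gtsam.Pose2(2.3, 0.1, -0.2))
initial.insert(3, gtsam.Pose2(4.1, 0.1, 0.1))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result)
```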

And an interesting thing-- so we have this inference problem on a belief network. The key thing about SLAM is it's building up over time. So you start with nothing and the problem's growing ever larger.

And, let's see, if I had to say-- over 25 years of thinking about this, up through 2012, the most important thing I learned is that maintaining sparsity in the underlying representation is critical. And, in fact, for biological systems I wonder if there is evidence of sparsity. Because sparsity is the key to doing efficient inference when you pose this problem. And so many algorithms have basically boiled down to maintaining sparsity in the underlying representations.
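
To illustrate that sparsity point with a toy computation (the pose count and loop-closure pairs below are arbitrary): the information matrix of a pose graph only has non-zero entries where two variables share a factor, so a long trajectory with a handful of loop closures stays overwhelmingly sparse.

```python
# Toy illustration: sparsity pattern of a pose-graph information matrix.
import numpy as np
import scipy.sparse as sp

n_poses = 500
rows, cols = [], []

def connect(i, j):
    """A factor between poses i and j contributes to these matrix entries."""
    rows.extend([i, j, i, j])
    cols.extend([i, j, j, i])

for i in range(n_poses - 1):                      # odometry chain
    connect(i, i + 1)
for i, j in [(0, 120), (50, 300), (10, n_poses - 1)]:   # a few loop closures
    connect(i, j)

H = sp.coo_matrix((np.ones(len(rows)), (rows, cols)),
                  shape=(n_poses, n_poses)).tocsr()
print(f"non-zero fraction: {H.nnz / n_poses**2:.4%}")   # well under 1%
```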

So just briefly, the most important thing I've learned since then, in the last few years-- I'm really excited by building dense representations. So this is work in collaboration with some folks in Ireland-- Tom Whelan, John McDonald-- building on KinectFusion from Richard Newcombe and Andrew Davison: how you can use a GPU to build a volumetric representation, build rich, dense models, and estimate your motion as you go through the world. So this is something we call continuous or spatially extended KinectFusion.
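
Here is a highly simplified sketch of the volumetric idea behind that KinectFusion-style pipeline, not the actual GPU implementation: each depth frame contributes a truncated signed distance to every voxel near the observed surface, and the map is a running weighted average. The truncation distance and grid size are assumed values.

```python
# Toy truncated signed distance function (TSDF) volume, CPU-only, for
# illustration of the fusion update used in KinectFusion-style systems.
import numpy as np

TRUNCATION_M = 0.05   # assumed truncation distance in meters

class TSDFVolume:
    def __init__(self, shape=(128, 128, 128)):
        self.sdf = np.zeros(shape, dtype=np.float32)      # averaged distances
        self.weight = np.zeros(shape, dtype=np.float32)   # per-voxel confidence

    def integrate(self, signed_dist, valid):
        """signed_dist: per-voxel signed distance to the surface seen this frame.
        valid: boolean mask of voxels actually observed in this frame."""
        d = np.clip(signed_dist, -TRUNCATION_M, TRUNCATION_M) / TRUNCATION_M
        w_new = valid.astype(np.float32)
        w_total = self.weight + w_new
        # Running weighted average; unobserved voxels keep their old value.
        self.sdf = np.where(w_total > 0,
                            (self.sdf * self.weight + d * w_new)
                            / np.maximum(w_total, 1e-6),
                            self.sdf)
        self.weight = w_total
```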

This little video here from three years ago is going on in an apartment in Ireland. And I'll show you the end result. Just hand-carrying a sensor through the world-- and you can see the quality of the reconstructions you can build, say, in the bathroom, the sink, the tub, the stairs, to have really rich 3D models that we can build and then enable the more advanced interactions that Russ showed. That's fantastic.

And I mentioned loop closing-- something we did a couple of years ago was adding loop closing to these dense representations. So this is-- again, in CSAIL-- this is walking around the Stata Center with about eight minutes of data going up and down stairs. If you watch the two blue chairs near Randy Davis's office, you can see how they get locked into place as you correct the error.

So this is taking mesh deformation techniques from graphics and combining them, so the underlying pose graph representation is like a foundation or a skeleton on which you build the rich representation. OK. So this is the resulting map. And there's been some really exciting work just this year from Whelan and from Newcombe in this space, doing deformable objects, and then really scalable algorithms where you can sort of paint the world.

So the final thing I want to talk about in my last few minutes is our latest work on object-based representations. And for this audience, I think if you go back to David Marr, who I feel is underappreciated in a historical sense, vision is the process of discovering from images what is present in the world and where it is. And to me, the what and the where are coupled. And maybe that's been lost a bit. And I think that's one way in which robotics can help with vision and brain sciences.

I think we need to develop an object-based understanding of the world-- so instead of just having representations that are a massive number of points, or purely appearance-based, we can start to build a higher-level, symbolic understanding of the world. And so I want to build rich representations that leverage knowledge of your location to better understand where objects are, and knowledge about objects to better understand your location.

And just as a step in that direction, my student, Sudeep Pillai, who was one of Seth's students, has an RSS paper where we looked at using SLAM to get better object recognition. So here's an example of an input data stream from Dieter Fox's group. There's just some objects on the table. I realize it's a relatively uncluttered scene. But this has been a benchmark for RGB-D perception.

And so, if you combine data as you move through the world, using a SLAM system to do 3D reconstruction of the scene, and then using the reconstructed points to help improve the prediction process for object recognition, it leads to a more scalable system for recognizing objects. And it comes back to this notion, to me, that a big part of perception is prediction-- the ability to predict what you see from a given location. And so what we're doing is leveraging techniques in object detection and feature encoding, and the newer SLAM algorithms-- particularly the semi-dense ORB-SLAM technique from Zaragoza, Spain.
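
As a hedged sketch of the general idea only (this is not the paper's actual formulation): once SLAM associates detections from many viewpoints with the same reconstructed object, their per-class evidence can be fused, which is more robust than classifying each frame in isolation. The class count and the association key used here are hypothetical.

```python
# Fusing per-frame detection scores for objects associated via a SLAM map.
import numpy as np
from collections import defaultdict

NUM_CLASSES = 5   # hypothetical number of object categories

class ObjectBelief:
    """Accumulates per-class evidence for one reconstructed object."""

    def __init__(self, num_classes=NUM_CLASSES):
        self.log_odds = np.zeros(num_classes)

    def update(self, frame_probs):
        """frame_probs: per-class probabilities from a single-view detector."""
        p = np.clip(np.asarray(frame_probs, dtype=float), 1e-6, 1 - 1e-6)
        self.log_odds += np.log(p / (1 - p))   # naive-Bayes-style accumulation

    def best_class(self):
        return int(np.argmax(self.log_odds))

# Detections are keyed by whichever SLAM landmark/segment they project onto.
beliefs = defaultdict(ObjectBelief)
beliefs["object_7"].update([0.2, 0.5, 0.1, 0.1, 0.1])    # ambiguous single view
beliefs["object_7"].update([0.1, 0.7, 0.1, 0.05, 0.05])  # second viewpoint agrees
print(beliefs["object_7"].best_class())                   # class 1 wins
```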

And so I'm just going to jump to the end here. The key concept is that by combining SLAM with object detection we get much better performance in object recognition. The left shows our system; the right is a classical approach just looking at individual frames. And you can see, for example, here, the red cup that's been misclassified-- we get substantially better performance by using location to cue the object detection techniques.

All right. So I'm going to wrap up. And just a little bit of biological inspiration from our BU collaborators, Eichenbaum has looked at the what and the where pathways in the entorhinal cortex. And there's this duality between location-based and object-based representations in the brain. And I think that's very important.

OK. So my dream is persistent autonomy and lifelong map learning and making things robust. And just for this group I made a-- I just want to pose some questions on the biological side, and I'll stop here. So some questions-- do biological representations support multiple location hypotheses? Even though we think we know where we are, robots are faced with multimodal situations all the time. And I wonder if there is any evidence for multiple hypotheses in the underlying representations in the brain, even if they don't rise to the conscious level, and how experiences build over time.

And the question-- what are the grid cells really doing? Are they a form of path integration? Obviously, to me, there seems to be some correction as well. And my crazy hypothesis, as a non-brain scientist, is: do grid cells serve as an indexing mechanism that effectively facilitates search-- a location-indexed search-- so that you can have these pointers by which what and where information get coupled together?
