MITRE's Tech Futures Podcast

Imitation Learning

June 09, 2022 MITRE Season 1 Episode 12

How do humans learn new or complicated tasks? One possible way is by following a demonstration. What if we could train autonomous agents in the same way?

By Eliza Mace

Guests: Amanda Vu, Alex Tapley, and Guido Zarrella

Amanda Vu (00:02):

A researcher was trying to teach a robot to hammer a nail into a piece of wood, and so initially he defined the reward to correspond to how far the nail was driven into the wood. But the robot learned a policy that basically involved smashing its body into – or its limb into the nail, which is obviously not ideal; you want to preserve your robot. The team ended up adding a term into the reward function that explicitly told the robot to pick up the hammer. But even then, the robot was able to find a loophole where it would actually throw the hammer against the nail instead of maintaining control over it.

Eliza Mace (host) (00:44):

Hello and welcome to MITRE’s Tech Futures podcast. I’m your host, Eliza Mace. I’m also a machine learning engineer here at MITRE and will be joining Brinley as the co-host of Season 2. 

At MITRE, we offer a unique vantage point and objective insights that we share in the public interest. In this podcast series, we showcase emerging technologies that will affect the government and our nation in the future. 

Today we are going to be exploring a recent MITRE investigation into imitation learning, which is a family of machine learning techniques where an agent learns by being rewarded for following human examples of a task.

But before we begin, I want to say a huge thank you to Dr. Kris Rosfjord, the Tech Futures Innovation Area leader in MITRE’s Research and Development program. This episode would not have happened without her support.

Eliza Mace (host) (01:44):

Okay, first things first: let’s unpack the wacky situation that you heard Amanda Vu describe in our opening clip. Amanda is a Lead Autonomous Systems and Machine Learning engineer here at MITRE and was the principal investigator on the Imitation Learning project. The bizarre behavior she described for a hammering robot is a great example of a weakness of reinforcement learning. Reinforcement learning is a general subfield of machine learning in which agents, such as physical robots or decision-making algorithms, are trained to complete tasks by seeking out rewards.

The idea behind this method of training is pulled straight from the human experience. When you were growing up, if you were well-behaved, perhaps you received a reward in the form of encouraging words, or maybe even a piece of candy! This incentivized you to repeat this behavior in the future. In the same vein, if you, for example, touched a hot pot on the stove, you immediately received a very impactful negative reward. This inspiration seems intuitive but can have some unexpected outcomes in practice when applied to autonomous systems.

To incentivize an agent, the engineers training it have to come up with a reward function that defines what actions, or series of actions, will result in the agent receiving a positive or negative reward. The agent isn’t explicitly told this reward function; instead, it must explore its environment by taking actions and receiving the rewards set by the reward function. Over time, the agent learns what is known as a policy, meaning that when it reaches a state within its environment that it is familiar with, it has a plan for what action to take next. The agent will encounter the same scenario over and over again. Each of these trials is known as an episode.
As the episodes go by, the agent must balance trying new actions, called “exploration,” with following the policy it has figured out so far, which is known as “exploitation.” After many episodes – maybe even tens of thousands – if the engineers were able to define an appropriate reward function, the agent will figure out a plan to get the highest possible reward by optimizing its actions, and voila! A trained agent!
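To make that reward-seeking loop concrete, here is a minimal sketch of tabular Q-learning on a made-up five-state "chain" task. The environment, reward values, and hyperparameters are all illustrative assumptions for this sketch, not anything from the MITRE project:

```python
import random

# A tiny made-up "chain" environment: states 0..4; reaching state 4 earns reward 1.
# Actions: 0 = step left, 1 = step right. Episodes end at the goal or after 10 steps.
N_STATES, GOAL, MAX_STEPS = 5, 4, 10

def env_step(state, action):
    next_state = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-value table: q[state][action]
    for _ in range(episodes):
        state = 0
        for _ in range(MAX_STEPS):
            # Exploration vs. exploitation: sometimes try a random action,
            # otherwise follow the best action found so far (the policy).
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward = env_step(state, action)
            # Nudge the estimate of this state-action's long-term reward.
            q[state][action] += alpha * (reward + gamma * max(q[next_state]) - q[state][action])
            state = next_state
            if state == GOAL:
                break
    return q

q = train()
policy = [0 if q[s][0] > q[s][1] else 1 for s in range(N_STATES)]
print(policy)  # the learned plan: step right from every state, toward the goal
```

Even in this toy, `epsilon` is the exploration/exploitation dial just described: higher values mean more random trial actions, lower values mean more reliance on the policy learned so far.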

So, there was a big “if” there, that is “if the engineers were able to define an appropriate reward function.” That’s exactly what went wrong with the rogue hammer bot in the introduction; by only incentivizing the agent by how far the nail went into the wood, the reward function accidentally encouraged the robot to smash itself into the nail. So how can we prevent crazy or unreasonable actions? Let’s go back to human inspiration. Remember touching the hot pot on the stove? Well, maybe at some point, a parent or other trusted person demonstrated for you how to use an oven mitt or potholder to safely interact with hot pots. You learned by following a demonstration! Imitation learning extends this idea to autonomous agents. Instead of an agent starting with no knowledge of its task or environment, a human demonstrates the task to the agent, so it has a model to start from and is less likely to take actions way outside the realm of reason.

In this paradigm, instead of being rewarded for completing the task, the agent is rewarded for how well it can mimic the human demonstrator. MITRE Engineer Alex Tapley summed up this benefit quite nicely:

Alex Tapley (05:54):

With reinforcement learning models, they do exactly what you tell them to do, and not what you want them to do. Reward functions are very finicky, and there are a lot of things that can go wrong with them. So, with imitation learning, when you’re just saying, “hey, just copy what this person’s doing,” it’s a lot easier for the model to pick up on that, and then you don’t have to deal with any of the reward function tuning or anything.

Eliza Mace (host) (05:16):

As with any machine learning problem, there are many different imitation learning algorithms that aim to optimize the agent’s performance in different ways. For example, training behavioral cloning algorithms involves showing your agent many different examples of human operators completing a task, and the agent is rewarded for copying the example actions as closely as possible. In contrast, another type of imitation learning, known as inverse reinforcement learning, encourages the agent to find a set of “guiding principles” that are common across the human examples it has been exposed to. In these algorithms, the agent is essentially learning its own reward function, which, as we discussed earlier, can be desirable in situations where hand-crafting rewards is difficult. Another benefit of training with multiple human demonstrator examples is that the agent can more easily adapt to new environments. This is especially important when considering real-world applications, as Alex explains:

Alex Tapley (06:07):

With reinforcement learning or imitation learning, you need to train in simulation, and that is just because reinforcement learning and imitation learning, it’s like trial-and-error, and you just can’t buy a drone and go and crash it into the wall 10,000 times, because it just won’t work. Not to mention, it would take forever. All training needs to happen in simulation, and the biggest thing that you need to overcome with that is things in simulation do not look like what things in real life look like. Even if, let’s say you recreate the environment in a simulation to a T, then you move it outside onto a real robot, it’s going to go, “wow, this is completely new to me; I have no idea what to do, I’m just not going to do whatever I learned.”

Eliza Mace (host) (06:47):

So how can imitation learning help handle unexpected scenarios? Amanda hopes that continued research into the balance between exploration and exploitation will lead to improvements, along with providing human demonstrations that cover a wide variety of ways a task can play out.

Amanda Vu (07:03):

Exploration/exploitation is still definitely something that needs to be addressed in imitation learning in comparison to reinforcement learning.

From the exploration sense, you’re putting the agent in a good place to start and bounding where it should explore, so there is a lot less exploration that needs to be done. Therefore, it’s able to exploit good states more quickly, and especially if you’re trying to teach your agent something that is a very complex behavior, or you’re trying to teach them a very sequential task, it can help the agent quickly get to the next part of the task and continue learning versus continually getting stuck in that first state. One thing you should be careful about with imitation learning is that you could be restricting that initial exploration a little bit too tightly. An example of this is if you’re trying to train, through imitation learning, a racecar to follow a line around a loop; it may never encounter, in your human demonstrations, a crash condition, or a condition where you maybe deviate from the line a little bit. In that way, imitation learning is really similar to regular machine learning, where your model will learn what it’s presented with, and if it’s never presented failure scenarios, then it may react really unexpectedly. So, with some imitation learning methods, like behavioral cloning, the answer is to basically provide these expert demonstrations that show how to recover from failure cases.
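A minimal sketch of what "copying the expert" means in the racecar example, reduced to its simplest supervised form. The states, actions, and numbers here are hypothetical illustrations, not the project's actual toolbox; real behavioral cloning typically fits a neural network rather than a nearest-neighbor lookup:

```python
# Expert demonstrations as (state, action) pairs, where state is the car's
# lateral offset from the racing line and action is a steering correction.
# Note the "recovery" demonstrations at large offsets: they show the agent
# how to get back on the line, covering states a flawless run never visits.
expert_demos = [(-2.0, +1), (-0.5, +1), (0.0, 0), (0.5, -1), (2.0, -1)]

def cloned_policy(state):
    # Behavioral cloning at its simplest: predict the expert's action at
    # the nearest demonstrated state (1-nearest-neighbor regression).
    _, action = min(expert_demos, key=lambda sa: abs(sa[0] - state))
    return action

print(cloned_policy(-1.5))  # steers right, as the expert did when left of the line
print(cloned_policy(0.1))   # holds steady near the line
```

Because the demonstrations include recovery cases, the cloned policy has a sensible answer even for off-line states; without them, those states would be outside everything the model was ever shown.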

Eliza Mace (host) (08:37):

Not only does providing expert demonstrations better prepare an agent for a variety of outcomes but doing so can also give imitation learning agents a leg up over traditional reinforcement learning agents in terms of efficiency in learning a task. As Alex explains,

Alex Tapley (08:51):

The biggest issue facing reinforcement learning is definitely the lack of data or the training time needed in order to train the reinforcement models. So typically for reinforcement learning, you’re starting with a model that knows pretty much absolutely nothing, and the only way for it to learn the best way to react in a specific scenario is to repeat that scenario hundreds of times. When you start thinking about that, then you know your training times are going to be super long. The good thing about imitation learning is the data efficiency involved is a lot better. For example, if you’re doing a behavioral cloning aspect, the model’s just trying to copy the expert, and in some cases it’s able to learn how to copy exactly within less than 100 episodes or something, whereas for reinforcement learning, that same process could end up taking 20,000, 30,000 episodes. So, the data efficiency is just significantly better.

Eliza Mace (host) (09:50):

So we have discussed how imitation learning compares to reinforcement learning, but you may be wondering how these algorithms stack up against more traditional methods of programming robots, such as through optimal control theory. Amanda provided some comparisons for us to consider:

Amanda Vu (10:03):

Comparing imitation learning with optimal control theory, there’s definitely pros and cons. Leading with a con, with imitation learning and with a lot of machine learning methods in general, there’s still that gap in trust and explainability, and also performance guarantees that you get with optimal control theory. This is actually a pretty active research field both in machine learning and imitation learning, so hopefully this situation will only improve with machine learning methods in general over time, and we’re already starting to see that in some of the literature. There are a lot of pros for imitation learning over control theory that make it much more adaptable and scalable in the long run. Traditional control theory is very optimization-based, so you’ll usually have a bunch of mathematical equations that basically describe how the robot operates within the environment and how it can take actions, and it will basically try and compute the optimal trajectory it should take to achieve its goal. Mathematically, control theory is so elegant and so beautiful, and it feels simple, but in real life these methods can actually be pretty brittle. That’s because it’s pretty hard to basically model all of your potential noise sources, and then you start throwing in that your environment, in fact, is not static, and maybe there are other players that are in there, like other robots or other humans. Those are pretty hard to account for in your model, and then that makes the use case for your control theory-based model pretty limited to constrained and controllable environments. 
Imitation learning is a lot more scalable because in order to adapt to a new and changing environment, you basically just need to provide new demonstrations in that environment; it’s a very data-driven method, and because the demonstrations lie in the bounds of human experience, even when the agent encounters new and challenging scenarios, hopefully it would take an action that’s within human reason.

Eliza Mace (host) (12:07):

As we’ve discussed, there is a lot of research being done to aid in training agents to adapt to unforeseen scenarios or new challenges. Currently, many applications of robotics are confined to settings where unexpected environmental changes are unlikely, such as in factories. However, for future challenges impacting our nation, we will need agents that are able to adapt easily. This is especially true if we expect our robots to team up with humans. Amanda sees this as a great opportunity for imitation learning.

Amanda Vu (12:36):

Imitation learning is a tool within a greater toolbox of reinforcement learning and control theory and all these other techniques. Imitation learning definitely isn’t going to be that golden bullet that solves all of your robot teaching problems, but I do think that it does apply in a wide variety of scenarios where the current state of robotics is that robots aren’t super generalizable to variations in task or environment. The big dream for a lot of autonomous systems, especially with our sponsors, is that they’re able to coexist and work with human teams seamlessly. I think imitation learning offers a great solution for that because if you teach your imitation learning agent to act the way that a human acts, or at least behave in a way that you might expect, then you’re going to have a lot better human-machine teaming.

Eliza Mace (host) (13:34):

Beyond training physical robots with imitation learning, there is also government interest in leveraging agents as decision makers in complicated environments. I spoke to Decision Science innovation area lead, Guido Zarrella, about possible applications.

Guido Zarrella (13:48):

Using these as tools that let us study mission-critical problems, I think, is a really exciting path forward, and that’s something that we’re pretty aggressively investing in today within MITRE. So, we’ve got a project that is focusing on some of these types of questions in the context of space Command and Control and the future of conflict in space. We’ve also got a project that’s looking at some of these topics within the context of environmental resilience to climate change and preparing for wildfires. So, recognizing that the strategies that used to work to manage our public lands have led us to the point where we’re getting increasingly destructive wildfires every year as the planet dries out and heats up. And so, understanding if there are new strategies we can explore that are informed by these simulation tools is another example of a spot we’re kind of aggressively interested in.

Eliza Mace (host) (14:37):

I’d like to challenge you to think of some robotics or planning tasks that would benefit from the inclusion of a trained agent. If you come up with a good use case – you’re in luck! Amanda and her team, as part of their research efforts, designed an imitation learning toolbox that is available within MITRE and open source!

Amanda Vu (14:53):

We purposefully designed it to be very modular and customizable. We have separate modules for the agent, the environment, the network, the policy, and things like that, so you can easily customize it – swap and switch different components – and basically create novel algorithms. This entire infrastructure plugs into, and works well with, the OpenAI Gym framework. So, if you wanted to create a custom environment and custom task, you not only are able to easily create your custom algorithm, but you also have implementations of famous and well-known imitation learning algorithms and reinforcement learning algorithms that have been verified on other benchmarks to be good implementations, and you’re able to quickly benchmark what you’re getting versus what other common models in the field would be getting. The toolbox is a tool to accelerate research in the area internally within MITRE, and it’s also open source, so people in the academic and research community can engage with it as well.
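To illustrate the kind of modular, Gym-compatible design Amanda describes, here is a sketch of the pattern, not her team's actual toolbox, whose internals aren't detailed in the episode. A made-up environment exposes the classic OpenAI Gym interface, and interchangeable policies plug into one generic rollout loop:

```python
import random

class CoinGuessEnv:
    """A made-up environment exposing the classic OpenAI Gym interface:
    reset() -> observation, step(action) -> (observation, reward, done, info).
    Because the loop below depends only on this interface, environments and
    policies can be swapped independently."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
    def reset(self):
        self.obs = self.rng.randrange(2)
        self.steps = 0
        return self.obs
    def step(self, action):
        reward = 1.0 if action == self.obs else 0.0  # reward for echoing the shown bit
        self.obs = self.rng.randrange(2)
        self.steps += 1
        done = self.steps >= 10  # episodes last ten steps
        return self.obs, reward, done, {}

def run_episode(env, policy):
    # One generic rollout loop, reused unchanged for any env/policy pairing.
    obs, done, total = env.reset(), False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total

constant_policy = lambda obs: 0   # ignores observations entirely
copycat_policy = lambda obs: obs  # always echoes the observation back

print(run_episode(CoinGuessEnv(), copycat_policy))  # 10.0: a perfect episode
```

Swapping in a different environment or policy means changing one argument, not the loop, which is the benefit of standardizing on a shared interface like Gym's.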

Eliza Mace (host) (15:56):

You heard Amanda mention OpenAI Gym, which is an open-source set of challenges that can be used to compare learning algorithms. One of the strengths of the machine learning community is the open collaboration between research groups in academia and industry, and Amanda and her team are excited to contribute their toolbox. Participating in the broader research community benefits our government organizations, and the nation as a whole, because the toolbox allows researchers to apply imitation learning algorithms to a wide range of problems beyond the standard academic assessments. If you want to learn more about how you can leverage imitation learning for your needs, check out Amanda and her team’s recent study and their open-source toolbox. A link to the team’s paper, “Imitation Learning: Applications in Autonomous Systems,” can be found in the show notes.

Thanks so much for tuning in to this episode of MITRE’s Tech Futures podcast. This show was written by me. It was produced and edited by myself and my co-host, Brinley Macnamara, Dr. Kris Rosfjord, and Dr. Heath Farris.

Our guests were Amanda Vu, Alex Tapley, and Guido Zarrella.

The music in this episode is brought to you by Truvio, Ooyy, Sarah the Instrumentalist, and West & Zander.

Copyright 2022 The MITRE Corporation.

MITRE: solving problems for a safer world.