Podcast: Play in new window | Download | Embed

# LM101-044: What happened at the Deep Reinforcement Learning Tutorial at the 2015 Neural Information Processing Systems Conference?

## Episode Summary:

This is the third of a short subsequence of podcasts providing a summary of events associated with Dr. Golden’s recent visit to the 2015 Neural Information Processing Systems Conference. This is one of the top conferences in the field of Machine Learning. This episode reviews and discusses topics associated with the Introduction to Reinforcement Learning with Function Approximation Tutorial presented by Professor Richard Sutton on the first day of the conference.

## Show Notes:

Hello everyone! Welcome to the forty-fourth podcast in the podcast series *Learning Machines 101*. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner. This is the third of a short subsequence of podcasts providing a summary of events associated with Dr. Golden’s recent visit to the 2015 Neural Information Processing Systems Conference. This is one of the top conferences in the field of Machine Learning. This episode reviews and discusses topics associated with the Introduction to Reinforcement Learning with Function Approximation Tutorial presented by Professor Richard Sutton on the first day of the conference.

A general introduction to the Neural Information Processing Systems Conference and also a review of the Morning Deep Learning Tutorial which took place on the first day of the conference can be found in Episode 41 of Learning Machines 101. Episode 42 provides a review of the Monte Carlo Methods Tutorial.

I will now provide a brief review of the Deep Reinforcement Learning Tutorial based upon the notes that I took during the conference. This review will be flavored with my own random opinions, comments, and interjections. I will try to do my best to distinguish my thoughts from the ideas presented when I have opinions which are different or additional to what was presented and I will also try not to distort the presentations.

The Deep Reinforcement Learning Tutorial was presented by Professor Rich Sutton who has done seminal work in the area of reinforcement learning and temporal difference learning since the mid 1980s. The underlying concepts of reinforcement learning and temporal difference learning been around forever. These ideas have their foundations in classical control theory work published in the 1950s and were used in the Checkers playing program developed by Samuels in the late 1950s. Concepts regarding Reinforcement Learning have been previously discussed in Episode 2, Episode 9, and Episode 25 of Learning Machines 101.

In supervised learning, the learning machine is provided a set of training data consisting of input patterns and desired responses. The learning machine’s goal is to generate the desired response for a given input pattern. Generalization performance is a key issue. Given an input pattern the learning machine has never seen before, the learning machine’s performance is evaluated on whether it can generate an appropriate response to novel input patterns based upon its past experiences. Supervised learning requires a teacher.

In unsupervised learning, the learning machine is provided a set of training data consisting only of input patterns without desired responses. The learning machine’s goal in this case is to discover clusters of input patterns which belong to the same category. For example, a child learning the letters of the alphabet might autonomously decide that different ways of writing a particular letter are simply equivalent representations of the same letter and this learning can be done independently of a teacher. In some cases, the child might request help from the teacher. This would be an example of semi-supervised “active learning” where the learning machine or child in this case actively alters its statistical environment to facilitate the learning process.

Like semi-supervised active learning, reinforcement learning can be viewed as a type of mixture of supervised and unsupervised learning with a temporal component. An example of reinforcement learning would be a pilot who is learning how to land an airplane in a flight simulator. In order to land the airplane, the pilot must generate a sequence of actions. Each action which is generated alters the statistical learning environment of the pilot. In addition, if an instructor is not present, then the pilot does not know if a particular action is appropriate or inappropriate. Only after the plane lands gently or crashes in the simulator does the pilot know whether or not the sequence of generated actions was appropriate. And then, the learning problem is complicated because the pilot learning machine must figure out how to use information from the safe landing or the crash landing to modify its parameters so that it can have more successful landings in the future. Reinforcement learning has its roots in classical control theory. The terminology “control law” in control theory corresponds to the concept of a “policy” in reinforcement learning. Instead of saying a learning machine follows a “control law” we say the learning machine follows a “policy” but the key idea is the same.

In classical control theory, one often has detailed mathematical models of the pilot and the aircraft and these detailed mathematical models greatly facilitate reinforcement learning machine design. In the real world, however, such mathematical models may not be readily available. Thus, function approximation methods such as those discussed in Episodes 14, 23, and 41 of learning machines 101 may be appropriate. Hence the title of this tutorial which was: *“Introduction to Reinforcement Learning with Function Approximation”.*

Professor Sutton began by introducing the basic principles of reinforcement learning where the learning machine learns by interacting with its environment to achieve some goal. The learning machine does not receive specific information from the environment regarding whether or not a particular action is correct. Rather, the learning machine only receives vague delayed feedback from its environment about how it is doing and must figure out for itself whether it is behaving in an appropriate manner. Professor Sutton stressed this is the beginning of a science of both natural and artificial minds. These ideas are the foundations of intelligent goal seeking behavior.

Professor Sutton started the talk with a video of a little caterpillar type robot. Initially it is just sitting on the table. Then it flops around at random a little. And then you can begin to see it figure out how to inch forward and inch backwards along the table.

Sutton emphasized that the goal of such learning machines is not to maximize the reward per step but the cumulative future reward. When learning how to drive a car, the driving instructor doesn’t congratulate you upon your unique brilliant approach to turning the steering wheel a little to the left but rather congratulates you at the end of the driving assessment which is based fundamentally upon how the car performed in the environment rather than your actions in operating the steering wheel, gas pedal, and brake pedal at each instant in time.

Sutton also provided three example successes of reinforcement learning machines. In the late 1950’s Samuel’s developed a checker playing program which looked ahead several moves and evaluated the choice of a particular move using a reinforcement type learning rule. In 1992, Gerald Tesauro applied these techniques to develop a computer program which could play back-gammon. In 2003 and 2006, Andrew Ng and his colleagues used reinforcement learning methods to teach a drone helicopter to not only fly but perform radical acrobatic maneuvers in the air. More recently, reinforcement learning methods have been used to figure out how to place ads on websites, help Watson the computer play jeopardy, and automatically learn to play ATARI video games.

To keep things simple, the case where the states of the environment, the actions of the learning machine, and the rewards received by the learning machine from its environment are finite sets. In addition, assume (for now) a “deterministic policy” which means that for a given state of the environment the learning machine’s goal is to learn a function which maps that state of the environment into an action. As the learning machine interacts with its environment it attempts to estimate not just the immediate reward which could be received for a specific action but rather it tries to estimate the sum of all future rewards it expects from the current action, the immediate reward it expects from the action after the current action, the immediate reward it expects from the second action after the current action and so on. Typically, the value of these rewards is adjusted so that rewards for actions which will be executed in the distant future are weighted to have less of an effect on the sum of all future rewards then actions which will be executed. This type or weighting scheme is called an **“unlimited horizon”** weighting scheme. In contrast, for some reinforcement learning problems there is a clear end to the sequence of interactions. This corresponds to the case of a **“limited horizon”.** For the limited horizon case, rewards for actions in the infinitely distant future are not required…we just need all future rewards up to the point where the action sequence ends. For example, a learning machine whose goal is to win a game of checkers or land a lunar lander safely. In some cases, this information can be represented by a function which takes the current state of the environment and possible action of the learning machine and returns a number which is not the immediate reward expected for executing that action with in the current environment state but the cumulative future award to be received as a result of executing that action in the current environment state.

Such a function which maps the current state of the environment and the current action of the learning machine into an estimate of the total cumulative future reward the learning machine expects to receive into the distant future is called the **“action-value function”.**

In addition to limited and unlimited horizon problems, there are “**on policy”** and “**off policy”** learning methods. An “on-policy” learning method involves taking actions based upon the learning algorithm, while an “off-policy” learning method simply wanders around the state space and doesn’t necessary select actions which are intended to maximize future rewards. For example, a “off-policy” learning method might work by just picking a state and action at random and adjusting the Q function for that state and action pair based upon the learning machine’s performance into the future. An “on-policy” learning method might work by having some deterministic or probabilistic rule for generating an action based upon the current state of the environment and attempting to adjust that rule as a function of the learning process.

If one assumes that an “unlimited horizon” weighting schemes where reinforcement signals received in the distant future are weighted as less important than reinforcement signals received in the immediate future, one obtains two simple but very fundamentally important learning rules.

The first of these fundamentally important reinforcement learning rules is Q-learning which is typically presented as an “off-policy” rule which works as follows. Given the current state of the environment S and current action A we are going to take our current guess for the Q function evaluated at S and A and increase it’s value for the current choice of S and A if Q evaluated at S and A is too small. We will decrease it’s value for the current choice of S and A if Q evaluated at S and A is too large. The magnitude of the perturbation to Q is called the “learning rate”. If this was a supervised learning problem, then the environment would provide us with the DESIRED value of Q for the current choice of S and A but unfortunately this information is not available. It turns out, however, that the DESIRED value of Q can be ESTIMATED if one makes the assumption that we have an “unlimited horizon” weighting scheme where reinforcement signals in the distant future are weighted as less important than reinforcement signals in the immediate future in a particular manner. There is a special number between 0 and 1 called the LAMBDA parameter which adjusts how much future reinforcement signals should be weighted. When LAMBDA is equal to zero, then the sum total of all expected future reinforcement is simply estimated by the current immediately received reinforcement. When LAMBDA is close to one, then all future reinforcement signals are weighted to be almost as important as the current immediately received reinforcement signal.

Let S be the current state of the environment, A be the action executed in the current state, R be the reinforcement signal received as a result of this execution and SNEXT be the next environmental state resulting from the execution of the action. Then the DESIRED value of Q(S,A) may be estimated by the reinforcement signal R + LAMBDA multiplied by the Q(SNEXT,A*) where A* is the action which makes the current Q

Evaluated at SNEXT as large as possible. Notice that the method for estimating the DESIRED value of Q(S,A) does not depend upon the action A which is actually used by the learning machine to explore its environment. Thus, this is an off-policy method.

Next, Professor Sutton provided a nice pedagogical video demonstration of the performance of a Q LEARNING algorithm in a grid-world. The grid world is an array of 48 squares arranged in a grid consisting of 8 squares per row. The learning machine starts at the lower-left hand corner of the grid and the final goal state is the upper-right hand corner of the grid. The learning machine moves around the grid by choosing a direction at random—either left, right, up, or down and then moving one square in that direction. Things are complicated a little by the fact that the there are a few “barriers” in the grid world which are squares which the learning machine can not move onto. So when the learning machine makes a movement, it is constrained to stay in the grid world…its can’t move outside the boundaries. It is also constrained such that it can’t move onto a “barrier square”. So basically this learning machine simply wanders around the grid world at random. Ok so now as it is wandering around the grid world at random it is building up a “mental model” of what move is the correct choice choice in order for it to reach its future goal. So the Q learning method described earlier is used to estimate a Q function. The learning machine receives a reinforcement signal when it visits any square in the grid world except for the “goal state” square where it receives a reinforcement signal of +1. This is thus a simplified model of many real-world situations where the intelligent decision maker generates a sequence of actions in a complex environment but does not obtain immediate feedback regarding the consequences of those actions. The demo illustrated clearly the idea of “off policy learning”. You learn the right thing to do but you just don’t do it!!

This leads to the “exploitation” versus “exploration” dilemma. If you just do “off policy learning” then you are exploring the space effectively but you are generating random actions. If you do “on policy learning” then you are generating actions based upon your current policy but this could have seriously bad consequences if you are in the initial stages of the learning process. Accordingly, Professor Sutton makes the following key distinction between two distinct types of “policies”. The “target policy” is the “control law” or “policy” which the learning machine is learning. The “behavior policy” is the “control law” or “policy” which the learning machine is using to control its behavior. On-policy learning occurs when the target policy and the behavior policy are identical.

Off-policy methods have the advantage that learning takes place in a steady manner regardless of whether the learning machine makes good or poor choices for its actions. However, off-policy methods have the disadvantage that the performance of the learning machine in the environment could seem quite silly if the actions are simply chosen at random. The algorithm could be turned into an on-policy method if the learning machine doesn’t pick action A at random but rather chooses action A at random sometimes and at other times chooses action A so that it computes the expected value of the the current Q function evaluated at the current environmental state S and current action. However, the probability distribution used to select an action is chosen so that the optimal action based upon the current Q function and the next state tends to be chosen with a very high probability.

Professor Sutton illustrates the difference between on-policy and off-policy methods with a different type of grid world example called the “cliff grid world”. In this grid world, the starting position is the lower-left hand corner of the grid and the lower right-hand corner of the grid is the goal state. As before, the learning machine can move to the left, right, up, or down but can not move off the grid. However, here is the interesting part of the “cliff grid world”. The bottom row of the grid starts with the “start state” on the far left and then we have a sequence of squares which correspond to a cliff and then the final square is the “goal state”. If the learning machine lands on the cliff then it “falls off” and is automatically transported to the start state and receives a penalty of -100. If the learning machine moves onto any other square in the grid world which is not a “start state”, “goal state”, or “cliff square” then the learning machine receives a penalty of -1.

So what’s interesting when you run an “off-policy” reinforcement learning machine is that it wanders around aimlessly and thus falls off the cliff multiple times. Thus, over the long term the average reward received by the “off-policy” reinforcement learning machine is less than the average reward received by the “on-policy” reinforcement learning machine because as the “on-policy” learning machine gets smarter…it learns to stay away from the edge of the cliff. However, what is interesting is that the types of routes that the “on-policy” reinforcement learning machine learns tend to keep a safe distance away from the edge of the cliff while the the “off policy” reinforcement learning machine learns the optimal route which is to travel as close as possible to the edge of the cliff without stepping off the edge!

The next part of the talk covered the rationale behind introducing function approximation methods into reinforcement learning. The basic idea here is that in the real-world we might have an environment with a large number of states and a large number of actions. For example, suppose there were 1000 possible actions of the learning machine and the environment had 10,000 possible states. This would imply that we would need to learn to figure out an action-value function which had 1000 multiplied by 10,000 possible values or 10,000,000 values. In other words, we would have a function with 10,000,000 free parameters. Moreover, for more complicated environments, the number of states or actions could be much larger. Furthermore, in order to correctly estimate such a function it is necessary for the learning machine to experience the consequences of each of these 10,000,000 possible state-action pairs at least a few times and perhaps some of these combinations might be very rare.

Function approximation methods can address these problems by using a smooth function which has free parameters to map a state and action into the Desired Q-value. For example, suppose we have a grid world which consists of 1000 rows and 1000 columns. Thus, there are 1000 multiplied by 1000 possible environmental states or a million possible environmental states. Also assume the learning machine can go up, down, left, or right so this means there are four possible actions. This means that if we don’t use function approximation methods and estimate a value of the state-action Q function for each possible state-action pair we have approximately 4 million free parameters for this relatively simple problem. Now let’s consider an alternative representational scheme. Let’s represent a location in this grid world as an ordered pair of variables R and C where R indicates the row location of the learning machine in the grid world and C indicates the column location of the learning machine in the grid world. Also suppose that we represent the action “move left” as the list of numbers: 1,0,0,0, the action “move right” as the list of numbers: 0,1,0,0, the action “move up” as the list of numbers: 0,0,1,0 and the action “move down” as the list of numbers: 0,0,0,1. Thus a state-action pair such as: R=3, C=5, Action=0,1,0,0 means that the learning machine is on grid square which is in row 3 and column 5 and moves right. Now define the state-action Q function as a weighted sum of the numerical components state-action pair. That is, the predicated Q value is a weight multiplied by R plus a weight multiplied by C plus a weight associated with the relevant action plus an additional weight. The seven free parameters needs to be estimated but what’s interesting here is that the original problem required about four million parameters but using function approximation we can represent the original problem using only seven free parameters!

So the big disadvantage here is that the function approximation method assumes that we can adequately represent the original state-action Q function as a weighted sum. If this approximation is inappropriate, then this guarantees wrong predictions regardless of the amount of training data. On the other hand, the good news is that the learning is a lot faster and if the linear approximation is acceptable then the system will have excellent generalization performance because the linear approximation implies that the predicted Q-value for one location R, C will be similar in value to another location if that other location is physically close to R,C. Briefly, this is the double-edged sword of function approximation methods. If the function approximation is good, this enhances both learning and generalization performance. If the function approximation is poor, this hurts both learning and generalization performance.

Now, to take this to the next level, we can approximate a Q state-action function by a deep learning network with multiple layers of softplus, radial basis function, or sigmoidal units. See Episodes 14 and 23 of Learning Machines 101 for more details. This can possibly result in a nice compromise where the increase in number of free parameters is still not as excessive as not using function approximation, yet the deep learning network is flexible enough to learn some crazy functions. Gradient descent type learning methods can be used in this case to update the weights. Specifically, stochastic approximation expectation maximization methods such as those used in the lunar lander reinforcement learning machine discussed in Episode 25. Episode 43 also discusses stochastic approximation expectation maximization methods which can be used to design adaptive deep reinforcement learning machines.

A 2 and a half hour Reinforcement Learning Tutorial video as well as the original slides for the Deep Reinforcement Learning Tutorial can be found on the Neural Information Processing Systems Conference website. For your convenience, I have provided a hyperlink in the show notes of this episode which take you directly to the video and the slides. If you have the opportunity, I would recommend watching the video.

I have provided in the show notes at: www.learningmachines101.com hyperlinks to all of the papers published at the Neural Information Processing Systems conference since 1987, the workshop and conference schedule for the Neural Information Processing Systems conference, and links to related episodes of Learning Machines 101!

I have been reviewing the user profiles and many of you still have not updated your user profiles. Regardless of whether or not you are currently a member of the Learning Machines 101 community, simply go to the website: www.learningmachines101.com and fill out the “subscription form”. Make sure to fill in the entry labeled “Briefly describe your interests”.

If you are already a member, you also have the option of updating your user profile when you receive the email newsletter by simply clicking on the: *“Let us know what you want to hear”* link!

Also follow either me on Twitter! My twitter handle is:“lm101talk”.

You are also encouraged to link up with me on LINKEDIN and we are starting a new group called “Statistical Machine Learning” you are welcome to join! I also have been constructing some boards on PINTEREST as well.

From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group! So please make sure to visit the website: www.learningmachines101.com and update your profile!

So thanks for your participation in today’s show! I greatly appreciate your support and interest in the show!!

## Further Reading:

**Neural Information Processing System 2015 Conference Schedule and Proceedings**

Proceedings of ALL Neural Information Processing System Conferences.

2015 Neural Information Processing Systems Conference Book.

2015 Neural Information Processing Systems Workshop Book.

Reinforcement Learning Tutorial video

Deep Reinforcement Learning Tutorial

**Reinforcement Learning References
**An original description of Samuel’s Checker playing program written by Samuels in 1959 may be found in:Arthur, Samuel (1959). “Some Studies in Machine Learning Using the Game of Checkers” (PDF).

*IBM Journal*

**3**(3): 210–229. This was reprinted recently in the

*IBM Journal of Research and Development*(Vol. 44, Issue 1.2) in the January, 2000. There’s also a free copy you might be able to find on the web!!! Also check it out for free from your local university library (not public library or community college library).

http://www.scholarpedia.org/article/Temporal_difference_learning

https://webdocs.cs.ualberta.ca/~sutton/papers/sutton-88-with-erratum.pdf

https://en.wikipedia.org/wiki/TD-Gammon

Video of Helicopter which Learns To Fly using Reinforcement Learning

http://papers.nips.cc/paper/3151-an-application-of-reinforcement-learning-to-aerobatic-helicopter-flight (NIPS 2006)

http://papers.nips.cc/paper/2455-autonomous-helicopter-flight-via-reinforcement-learning (NIPS 2003)

**Stochastic Approximation Expectation Maximization Episodes from Learning Machines 101
**http://www.learningmachines101.com/lm101-022-learn-to-solve-constraint-satisfaction-problems/

**Reinforcement Learning Episodes from Learning Machines 101
**http://www.learningmachines101.com/lm101-025-how-to-build-a-lunar-lander-autopilot-learning-machine/

http://www.learningmachines101.com/build-machine-learns-play-checkers/

**Deep Learning (Function Approximation) Episodes from Learning Machines 101**

http://www.learningmachines101.com/lm101-014-function-approximation/