LM101-048: How to Build a Lunar Lander Autopilot Learning Machine (Rerun)
In this episode we consider the problem of learning when the actions of the learning machine can alter the characteristics of the learning machine’s statistical environment. We illustrate the solution to this problem by designing an autopilot for a lunar lander module that learns from its experiences.
Hello everyone! Welcome to the twenty-fifth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.
In this episode we consider the problem of learning when the actions of the learning machine can alter the characteristics of the learning machine’s statistical environment. We illustrate the solution to this problem by designing an autopilot for a lunar lander module that learns from its experiences. In other words, we have a lunar lander circling around the moon and we want it to land on the moon without a pilot! In addition, we want the lunar lander to LEARN on its own how to land on the moon!
However, before we begin our discussion, let’s review the differences between three types of learning machines: unsupervised learning machines, supervised learning machines, and temporal reinforcement learning machines.
Suppose we are teaching young children the letters of the alphabet. One way to teach the letters is simply to show the children a large collection of letters. We visually present the children with the letter A written in a variety of styles, as well as the letters B, C, D, and so on. However, we don’t tell the children that there are letter categories; we just show them all of these different letters. So, for example, we might show the children the letter A written in Tahoma font, then the letter B written in Georgia font, then the letter B written in Arial font, then the letter A written in Segoe Script, then the letter B written in Segoe Script, then the lowercase letter b written in Tahoma font, and so on. By carefully studying these examples, and without being told which groups of letters “go together”, the children will eventually become very good at determining which letters belong together without any guidance. For example, you might not have to tell the children that the letter A written in Tahoma font and the A written in Georgia font are different ways of representing the same letter. It is important to emphasize, however, that upon initial exposure to this large collection of letters the children would not perceive the alphabetic categorical structure. Also note that this phenomenon is not restricted to young children learning the letters of the alphabet.
If you were an American who did not speak or write Russian, then your experience learning the letters of the Russian alphabet would probably be very similar to a young child’s experience learning the letters of the English alphabet. More generally, “experts” in different fields have learned to classify and categorize patterns even though the labels for those patterns have never been formally presented to them. Through experience, the experts have identified statistical regularities which are common to particular groups of objects. Examples include: recognition of literary authorship, patterns of checker pieces on a checker board, weather conditions associated with a good day of sailing, and the interpretation of an x-ray of an injured arm. The ability to learn without supervision how to group and classify patterns is an example of unsupervised learning.
In contrast to unsupervised learning, supervised learning provides the learner with a category label for an example from that category. For a child learning the letters of the alphabet, supervised learning would require a teacher who informs the child that a particular visual pattern is called a “B” and then the teacher might show the child several specific examples of the letter “B”. Although supervised learning can certainly be done without unsupervised learning, the combination of both unsupervised and supervised learning is very powerful. In this latter case, the novice learning machine learns to group and classify patterns on its own and then the teacher simply provides supervised learning on the misclassified patterns.
But in addition to supervised learning and unsupervised learning, there is another category of learning which lies somewhere between the two. This is called “reinforcement learning”. In reinforcement learning, the learning machine is only provided “hints” regarding the desired response. For example, if the instructor simply tells the learning machine when the machine has incorrectly classified a letter but does not provide information regarding the correct classification, then this is an example of reinforcement learning. So, for example, suppose that a child has misclassified the visual pattern “b” as the letter “d”. In unsupervised learning mode, the child would not be aware that a mistake had occurred. In supervised learning mode, a teacher would tell the child that the visual pattern “b” should not be classified as the letter “d” but rather should be classified as the letter “b”. In reinforcement learning, the teacher might simply say that the visual pattern “b” was misclassified but not provide the correct classification.
Now suppose the instructor lets the learning machine classify all of the alphabetic letters in the training data set, and then simply informs the learning machine after (but not during) the testing session whether the learning machine made incorrect classifications, without revealing the correct classifications. This type of learning is an example of temporal reinforcement learning. Another example of temporal reinforcement learning arises in the control of action. Suppose a robot is learning a complex sequence of actions in order to achieve some goal such as picking up a coke can. There are many different action sequences which are equally effective, and the correct evaluation of the robot’s performance is whether or not the robot has successfully picked up the coke can. Furthermore, this reinforcement is non-specific in the sense that it does not provide explicit instructions to the robot’s actuators regarding what they should do and how they should activate in sequence. And the problem is even more challenging because the feedback is provided to the robot only after the sequence of actions has been executed. To summarize, the temporal reinforcement learning problem involves learning a collection of action sequences or trajectories when only a few hints regarding the quality of a trajectory are provided. In many important cases, these hints are provided only at the end of the trajectory, but the hints can be provided at various points in time during the reinforcement learning process. Examples of temporal reinforcement learning machines discussed in previous episodes include: the Episode 1 scenario where Data the Starship commander has to decide whether or not to attack, the Episode 2 checker playing learning machine, and the Episode 9 robot named “Herbert” who roamed the halls of MIT looking for soda cans to pick up.
Let’s now make a distinction between two types of reinforcement learning situations by considering the situation where you are being taught how to land a lunar lander module on the moon’s surface. We assume that your learning will take place using a flight simulator which simulates the behavior of the lunar lander module as it lands on the moon under your control. In one learning scenario, you are a complete novice but have the benefit of watching an instructor expertly land the lunar lander module in the flight simulator. As you watch the expert, you notice which controls should be modified and altered at different altitudes and different velocities. Although you are not actually controlling the lunar lander module in the flight simulator, you are nevertheless learning how to control the lunar lander module by observing. In this mode of learning, the behavior of the statistical environment is not influenced by you since you are simply watching someone else land the lunar lander module. We will refer to a learning machine such as yourself which lives in this type of statistical environment as a passive learning machine. You are a passive learning machine because your current actions do not modify, create, or eliminate potential future learning experiences.
Also note that this is an unsupervised learning scenario because you are simply watching the instructor; the instructor is not telling you how to land the lunar lander. If the instructor explains every single step of the landing procedure as you watch, then this would be a supervised learning scenario. If the instructor explains some but not all of the steps of the landing procedure as you watch, then this would be called a reinforcement learning scenario.
Now suppose that we return to the lunar lander flight simulator, but let’s consider an alternative learning scenario. In this scenario you are a complete novice, but you do not have the benefit of watching an instructor expertly land the lunar lander module in the flight simulator. Instead, you simply interact with the flight simulator, experimenting with different landing strategies. You notice that sometimes you land safely, sometimes you crash into the lunar surface, and sometimes you end up floating away from the moon! There are two unique features of this alternative learning scenario which must be emphasized. First, you are in a situation where you need to make a choice regarding how much fuel to apply to the thrusters. This choice is made at the current moment, but whether or not it was the correct choice will not be clear until sometime in the near future, when either the lunar lander module lands gently and safely on the lunar surface or the lunar lander module crashes into the moon in a ball of fire. Thus, this is an example of the temporal reinforcement learning which we discussed earlier. So the first unique feature of this “learn by doing without instruction” scenario is that it is a temporal reinforcement learning situation.
But there is a second unique feature of this “learn by doing without instruction” scenario. And this second unique feature is that the statistical environment is not a passive learning environment. It is an active learning environment because the actions you exercise at the control will create, delete, or modify your future learning experiences. For example, suppose that it is relatively easy to use the simulator to land the lunar lander module on the moon. In other words, suppose that by simply fiddling around with the controls you can usually manage to land the module safely. Then you probably could learn to successfully land the lunar lander module on your own. However, if the learning problem was very difficult and you kept crashing the lander on the moon every time, then you might never learn to successfully land the lunar lander regardless of the number of learning experiences because you never have the opportunity to experience “informative learning experiences”. In other words, in an active statistical learning environment, the environment responds to the behaviors of the learning machine and then those responses, in turn, determine the future experiences of the learning machine.
The problem of building a lunar lander autopilot learning machine is an example of a temporal reinforcement learning problem in an active statistical learning environment. We will call this “active temporal reinforcement learning”.
Ok…so now that we have identified the learning problem…how can we solve it? We approach this problem using a variation of standard methods encountered in optimal control theory.
More specifically, the problem of active temporal reinforcement learning is typically addressed by a learning machine which has two specific components. The first component is called the control law or policy. The control law is typically defined as a machine which takes as input the current state of the environment and generates an action. One aspect of the learning process involves having the learning machine figure out what is the appropriate control law to use. So, for example, the concept of a “control law” in the lunar module landing example means that for each possible situation encountered on the spacecraft, there is an appropriate dial to turn or button to push. If we knew exactly which was the correct dial to turn and which was the correct button to push for every possible situation, then this would be all the knowledge we would need to land the lunar lander module safely. This collection of rules indicating which actions to follow in particular situations is referred to as the “control law” or “policy”.
The second component is called the “adaptive critic”. The adaptive critic takes as input an action of the learning machine and the current state of the environment, and returns some measure of performance. For the case of temporal reinforcement learning, it is helpful to think of a state of the environment as a short sequence of environmental states generated through interactions of the learning machine’s control law with the environment. That is, the environment presents a situation to the lunar lander. The lunar lander then uses its current control law to pick an action. Thus, the next environment presented to the lunar lander is functionally dependent upon the lunar lander’s last action.
For example, if the last action was to apply thrust, then the lunar lander is likely to experience situations where it has less fuel and where it is accelerating away from the moon. These situations, in turn, will generate different actions according to the current control law. After a short sequence of these interactions, the learning machine tries to adjust its control law so that its predictive performance is improved. In addition, the learning machine may modify its adaptive critic so that the adaptive critic’s performance will be improved in the future as well.
We are now going to provide an even more concrete example which shows how these ideas can be used to design a lunar lander autopilot learning machine.
The first step in the design is to define a list of numbers which represents the state of the lunar lander learning machine’s world at a particular instant in time. We will call this the state vector. So, for example, the process in which the lunar lander leaves its home spacecraft and departs for the lunar surface is documented as a sequence of these state vectors. We will call this sequence of state vectors a trajectory. The process of learning then corresponds to obtaining experience with a collection of trajectories. Hopefully, at the end of each trajectory we have a “safe landing”, but it is possible that we could have a “crash landing” at the end of a trajectory, or be “drifting off into outer space” at the end of a trajectory!
Returning to the concept of a state vector: a state vector consists of four numbers. The first number in the list is the height of the lunar lander above the lunar surface measured in meters. The second number in the list is the downwards velocity of the lunar lander measured in meters/second, and the third number in the list is the number of kilograms of fuel remaining in the lunar lander. The fourth number is a very special number which is called the “reinforcement signal”; it takes on a large value when the lunar lander is performing poorly and a small value when the lunar lander is performing well. In particular, the reinforcement signal will be a large number when the lunar lander’s height or velocity changes by a large amount from one instant of time to the next, since this is a “hint” that the lunar lander is “out of control”! The reinforcement signal will also become a REALLY REALLY large number if the spacecraft hits the lunar surface at a crash velocity and blows up!! And…the reinforcement signal will also be a large number if, at the end of the trajectory, the lunar lander is not on the lunar surface but rather is drifting off out into outer space!
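The state vector described above can be sketched as a small data structure. This is a minimal sketch; the field names are illustrative assumptions, not terms from the episode.

```python
from dataclasses import dataclass

@dataclass
class LanderState:
    """The four numbers making up one state vector."""
    height: float         # meters above the lunar surface
    velocity: float       # downwards velocity in meters/second
    fuel: float           # kilograms of fuel remaining
    reinforcement: float  # large when performing poorly, small when doing well

# Example: a lander 1000 m up, descending at 20 m/s, with 3000 kg of fuel left
state = LanderState(height=1000.0, velocity=20.0, fuel=3000.0, reinforcement=0.0)
```

A trajectory is then simply a Python list of these `LanderState` objects, one per instant of simulated time.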
The second step in the design is to specify in detail the environment of the lunar lander. Our lunar lander is going to learn in a simulated environment: since the lunar lander is simulated on a computer, it is necessary to build a simulated environment in which the lunar lander can live. The assumptions of the simulated environment are fundamentally crucial, because here is the “profound idea”: one of the most important teachers for an active temporal reinforcement learning machine is the machine’s “environment”. The environment is going to respond differently depending upon the machine’s actions, and these responses provide “hints” to the learning machine regarding how it is doing! The main point is that the consequences of the decisions the learning machine makes at the current point in time are not realized by the learning machine until some future time. So let’s now get into the details of the lunar lander’s environment!
It is assumed that the lunar lander’s initial height is randomly chosen to be approximately 15000 meters, and we allow this height to randomly vary by about 20 meters each time we start a new simulated landing. We assume the lunar lander’s initial downwards velocity also varies randomly between about 100 meters per second and 200 meters per second.
The downwards gravitational acceleration of the lunar lander due to the moon’s gravitational field is simply equal to the moon’s gravitational constant which is 1.63 meters per second squared. To counteract the downwards gravitational acceleration of the lunar lander, an accelerative upwards force is generated by the rocket thrusters at a particular instant in time. This force can be explicitly calculated using the laws of Newtonian physics by the following formula.
Maximum upwards acceleration is equal to the maximum accelerative upwards force divided by the sum of the mass of the lander and the mass of the remaining fuel. We assume the maximum upwards force is 25000 newtons and the mass of the lander is 4000 kilograms. The initial amount of fuel in the lander is 5000 kilograms, which is actually more than the mass of the lander! Note that as the lunar lander ejects fuel, the mass of the lander changes, and so the maximum acceleration will actually change in a strange nonlinear manner as the fuel in the lander gets used up! We will assume that there is a “throttle control” on the lunar lander which specifies the desired percentage of upwards acceleration. When the “throttle control” takes on the value of one, the maximum upwards acceleration is realized.
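The maximum-acceleration formula above can be checked with a short computation using the figures just given (25000 newtons of maximum thrust, a 4000 kilogram lander, and up to 5000 kilograms of fuel). This is a sketch of the stated formula, nothing more.

```python
MAX_THRUST_N = 25000.0   # maximum accelerative upwards force, in newtons
LANDER_MASS_KG = 4000.0  # mass of the lander itself, in kilograms

def max_upwards_acceleration(fuel_kg: float) -> float:
    """Newton's second law: acceleration = force / total mass,
    where total mass is the lander mass plus the remaining fuel mass."""
    return MAX_THRUST_N / (LANDER_MASS_KG + fuel_kg)

# With a full 5000 kg fuel load: 25000 / 9000, about 2.78 m/s^2, which
# exceeds the moon's 1.63 m/s^2 gravity, so the lander can decelerate.
# With an empty tank: 25000 / 4000 = 6.25 m/s^2, illustrating the
# nonlinear change in maximum acceleration as fuel is used up.
print(round(max_upwards_acceleration(5000.0), 2))
print(round(max_upwards_acceleration(0.0), 2))
```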
We then complete the analysis by noting that the velocity of the lunar lander decreases from one time to the next by an amount directly proportional to the difference between the downwards acceleration due to the moon’s gravity and the upwards acceleration due to the rocket thrusters.
The height of the lunar lander decreases from one time to the next by an amount directly proportional to the current downwards velocity of the lunar lander.
Also note that the fuel in the lunar lander decreases by an amount directly proportional to the amount of fuel currently used where the proportionality constant needs to take into account the efficiency of the lunar lander’s engines.
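The three update rules just described (velocity, height, and fuel) can be sketched as a single simulation step. The time step `DT` and the fuel burn rate are illustrative assumptions, since the episode gives only proportionality relationships, not the constants themselves.

```python
MOON_GRAVITY = 1.63       # m/s^2, downwards (the moon's gravitational constant)
MAX_THRUST_N = 25000.0    # maximum accelerative upwards force, in newtons
LANDER_MASS_KG = 4000.0   # mass of the lander itself, in kilograms
DT = 1.0                  # seconds per simulation step (assumed)
BURN_RATE_KG = 10.0       # kg of fuel burned per second at full throttle (assumed)

def step(height, velocity, fuel, throttle):
    """Advance the lander one time step. Velocity is measured downwards,
    and throttle is the fraction of maximum thrust, between 0 and 1."""
    thrust_accel = throttle * MAX_THRUST_N / (LANDER_MASS_KG + fuel)
    velocity += DT * (MOON_GRAVITY - thrust_accel)  # gravity minus thrust
    height -= DT * velocity                         # descending reduces height
    fuel = max(0.0, fuel - DT * throttle * BURN_RATE_KG)
    return height, velocity, fuel
```

With the throttle off, a lander falling at 20 m/s speeds up to 21.63 m/s after one step, exactly the moon's gravitational acceleration.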
The reinforcement signal is formally defined as the sum of the squares of the differences between the current and previous height and between the current and previous velocity, plus a “safe landing” term. The safe landing term is equal to zero when the lander is above a height of 0 meters, and is equal to the square of the lunar lander’s velocity when the lander reaches a height of 0 meters or below. Clearly, when the downwards velocity is greater than zero at a height of 0 meters, this is called a “crash landing”.
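The reinforcement signal definition above translates directly into a short function. This is a sketch of the stated definition; treating a height of exactly 0 meters as “at the surface” is an assumption.

```python
def reinforcement(height, velocity, prev_height, prev_velocity):
    """Sum of squared changes in height and velocity (a hint the lander
    is 'out of control'), plus a safe-landing term equal to the squared
    velocity once the lander is at or below the surface."""
    signal = (height - prev_height) ** 2 + (velocity - prev_velocity) ** 2
    if height <= 0.0:
        signal += velocity ** 2  # hard impact at the surface: large penalty
    return signal

# A steady hover well above the surface incurs no penalty at all,
# while touching down with leftover downwards velocity is penalized.
print(reinforcement(10.0, 5.0, 10.0, 5.0))
print(reinforcement(0.0, 3.0, 3.0, 3.0))
```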
The third step in the lunar lander control design is to develop the control law. The control law involves constructing a set of “features” from the current state of the lunar lander. Next, a weighted sum of these features is computed and converted into a probability P between zero and one. The probability that the throttle control is increased by a fixed percentage is equal to P, and the probability that the throttle control is decreased by that same fixed percentage is equal to 1-P. The “weights” are initially chosen at random and correspond to the “state of knowledge” of the learning machine. We can refer to the weights as the learning machine’s parameter values. A different choice of the parameter values corresponds to a different control law. This is a crucial point! The goal of the learning process is to figure out these parameter values or weights!
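One common way to turn a weighted feature sum into the probability P is the logistic (sigmoid) function. The episode does not name the squashing function, so that choice, like the function names below, is an assumption.

```python
import math
import random

def policy_probability(weights, features):
    """Compute P, the probability of increasing the throttle, by squashing
    the weighted feature sum through a logistic function (an assumption)."""
    s = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-s))

def choose_action(weights, features, rng=random):
    """Return +1 (increase throttle by the fixed percentage) with
    probability P, and -1 (decrease it) with probability 1 - P."""
    p = policy_probability(weights, features)
    return 1 if rng.random() < p else -1

# With all-zero weights (a blank 'state of knowledge'), P is exactly 0.5:
# the untrained lander increases or decreases the throttle at random.
print(policy_probability([0.0, 0.0], [1.0, 1.0]))
```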
The fourth step in the lunar lander control design is to develop the learning rule. An adaptive gradient descent algorithm can be derived using the methods of Episode 16. The resulting learning rule can be expressed by the following simple formula: the change in a learning parameter value is directly proportional to the corresponding feature value, multiplied by the difference between whether the throttle was actually incremented and the predicted probability of incrementing the throttle, with the result then multiplied by the reinforcement signal. Although this is a very simple rule, it is not intuitively obvious why this particular form was chosen or why it works. However, it can be shown that this formula can be mathematically derived from the assumptions we made in the previous three steps of the design process.
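The learning rule just described can be sketched as a REINFORCE-style gradient update. The learning rate and the sign convention (the reinforcement signal is a penalty, so the update descends on it) are assumptions consistent with the episode's description, not details it spells out.

```python
def update_weights(weights, features, increased_throttle, p, penalty,
                   learning_rate=0.01):
    """One gradient step on the control-law weights.
    increased_throttle: 1 if the throttle was actually incremented, else 0.
    p: the predicted probability of incrementing the throttle.
    penalty: the reinforcement signal (large when the lander does poorly)."""
    error = increased_throttle - p
    # Each weight changes in proportion to its feature value, times the
    # prediction error, times the penalty; the minus sign steers the
    # policy away from action sequences with large reinforcement penalties.
    return [w - learning_rate * f * error * penalty
            for w, f in zip(weights, features)]
```

For example, an action that was taken with probability 0.5 and then drew a large penalty has its weights pushed so that the same action becomes less likely in that situation next time.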
Intuitively, the adaptive gradient descent algorithm tweaks the parameter values such that sequences of actions which are associated with large reinforcement penalties or negative reinforcement are avoided in the future.
And that is all there is to it. We have just built a Lunar Lander Autopilot Learning Machine!
The learning machine adjusts its control law weights by simply tweaking those weights based upon responses from its environment! But how well does this lunar lander autopilot learning machine really work? When will it have difficulties and when will it do very well? These are important questions and really deserve an entirely new episode of Learning Machines 101!
Stay tuned for a future episode of Learning Machines 101 where we will study the learning behavior of this simple lunar lander autopilot learning machine and see how well it actually does!!!
Also…one more thing before we go, I’d like to bring everyone’s attention to a really great podcast that you should check out. The podcast is called “The Data Skeptic Podcast”. You can find it on iTunes, or you can go to the website at: “dataskeptic.com”.
My favorite episodes that I have listened to thus far are: “Ant colony Optimization”, “Monkeys on Typewriters”, “Easily fooling deep neural networks”, “Partially observable state spaces”, and “Ordinary Least Squares Regression”. These episodes and others at the “Data Skeptic Podcast” are really great but I have to admit that I have a soft spot in my heart for the “Monkeys on Typewriters” episode…
“The Data Skeptic Podcast” is highly informative and FUN…so check it out!!!!
And finally…If you are a member of the Learning Machines 101 community, please update your user profile.
You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time.
From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group!
Keywords: Temporal Reinforcement Learning, Unsupervised Learning, Supervised Learning, Adaptive gradient descent
- A good reference for temporal reinforcement learning: www.scholarpedia.org/article/Reinforcement_learning
- An original description of Samuel’s checker-playing program, written by Samuel in 1959, may be found in: Samuel, A. L. (1959). “Some Studies in Machine Learning Using the Game of Checkers” (PDF). IBM Journal of Research and Development, 3(3), 210–229. This article was reprinted in the IBM Journal of Research and Development (Vol. 44, Issue 1.2) in January 2000.
- The book The Quest for Artificial Intelligence by Nils Nilsson (Cambridge University Press) has a nice discussion of Samuel’s checker-playing program on pages 90–93. There is also a nice discussion of temporal reinforcement learning on pages 415–420.
- Niv, Y. (2009). Reinforcement learning in the brain. Journal of Mathematical Psychology, 53, 139–154.
- Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Copyright © 2015 by Richard M. Golden. All rights reserved.