LM101-065: How to Design Gradient Descent Learning Machines (Rerun)
In this episode we introduce the concept of gradient descent which is the fundamental principle underlying learning in the majority of deep learning and neural network learning algorithms.
Hello everyone! Welcome to the sixteenth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.
The great seventeenth-century samurai warrior Miyamoto Musashi wrote: “From one thing, know ten thousand things.” This means that the proper mastery of some particular skill or piece of knowledge may instantly provide mastery of 10,000 other skills. In this episode we introduce the concept of gradient descent which captures the essence of most popular machine learning algorithms. It is an extremely powerful concept and if you master the concept of gradient descent, then you will have instantly acquired an understanding of the principles of learning that guide essentially all existing machine learning rules. In other words, if you truly understand the concept of gradient descent, then you will truly understand ten thousand things!
In the previous episode, the concept of learning arbitrary functions using a Perceptron architecture was introduced. It was noted that if one is provided with a network architecture capable of representing the solution to a particular input pattern classification problem, then the Perceptron learning rule will converge to that solution in a finite number of learning trials.
But the Perceptron Learning Rule will not work properly if a correct classification does not exist! This would correspond to the case where the connections from the input units to the hidden units are not correctly chosen. That is, the assumption that the learning machine is capable of classifying input patterns perfectly may not be appropriate in many situations in the real world.
So, in order to avoid situations where the Perceptron Learning Rule does not work properly because a correct classification does not exist, we need to redefine our objectives for learning. Instead of trying to find a learning machine that can correctly classify all of the input patterns, our new goal is to find a learning machine that achieves the best possible performance. Or, in other words, the goal of the learning machine is to improve its classification performance rather than to obtain perfect classification performance.
This reformulation of the perfect performance problem in terms of an explicit performance measure provides a method for us to define an objective for a learning machine which is trying to solve a problem which does not have an exact solution. This is a very desirable goal if we want to design learning machines that can solve really difficult problems. In the real world, we often do not encounter problems which can be perfectly solved. Instead, we often try to find a solution to a problem which is as close as possible to a perfect solution with the understanding that perfect solutions to complex problems in the real world are rare.
For example, suppose you are playing baseball and you want to hit the baseball with the bat. Your objective should not be to hit the “perfect home run” in which the baseball attains the maximum possible distance, height, and speed while you swing with absolutely perfect form and perfectly take into account wind velocity and air density. Instead, athletes often measure their performance in a relative manner, striving for improvement rather than perfection.
Another example would be the design of a learning machine which learns to filter spam emails. Most email systems have a spam filter which is designed to capture and isolate email messages which are unsolicited and likely to have no value, while simultaneously allowing important email messages to pass through the system. One can imagine a system such as the Perceptron, which was discussed in Episode 15, whose input pattern might be features associated with a particular email, such as the number of exclamation points in the subject header or specific word patterns. The output of the Perceptron would be either a 1 or a 0, where a 1 indicates the email message is SPAM and a 0 indicates the email message is not SPAM. One can easily imagine some important email messages which have structural features, such as word patterns, similar to those of SPAM emails. One can also imagine some email messages which are SPAM yet have word patterns characteristic of important emails. In this type of environment, the goal of ALWAYS correctly classifying an email message as SPAM or not SPAM is unrealistic. Like the human athlete, a well-designed learning machine in this case should not strive for perfection but rather strive for the best possible improvement.
Let’s now discuss how to approach the problem of decision making in environments in which perfect performance is not possible. Let’s begin by reformulating the goal of “perfect performance”. For now, we will continue to assume that the learning machine is capable of classifying input patterns perfectly. Next, we define some sort of performance measure which measures the effectiveness of the learning machine’s performance. If the learning machine can perfectly classify all input patterns in the set of training data, then the performance measure takes on its smallest possible value which is equal to zero. So a performance measure of ZERO corresponds to perfect performance. The learning machine’s learning rule is then designed to adjust the parameters of the learning machine such that the performance measure takes on its optimal or equivalently its minimal value.
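To make this concrete, here is a minimal sketch in Python of what such a performance measure might look like. The toy classifier, threshold, and training data below are purely illustrative assumptions, not from the episode:

```python
# Illustrative sketch: a performance measure that counts misclassified
# training patterns. A value of ZERO corresponds to perfect performance.

def performance(classify, training_data):
    """Count how many training patterns the learning machine misclassifies."""
    return sum(1 for x, target in training_data if classify(x) != target)

# A toy "learning machine": output 1 when the input exceeds a threshold.
def classify(x):
    return 1 if x > 0.5 else 0

training_data = [(0.1, 0), (0.4, 0), (0.7, 1), (0.9, 1)]
print(performance(classify, training_data))  # prints 0: perfect performance
```

A learning rule would then adjust the machine's parameters (here, the threshold) so that this count is driven toward its minimal value.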
Although at first glance it appears that nothing is gained from this alternative perspective, in fact this shift in perspective provides a new and powerful way of thinking about the analysis and design of learning machines. As previously discussed, when perfect classification performance is possible then the learning machine seeks the best performance by adjusting its parameters to minimize the performance measure. When the learning machine obtains its best performance this corresponds to the perfect classification case where the performance measure is zero.
But now the power of this new perspective can be revealed. Suppose that perfect classification is impossible. Even in this case, the objective of the learning machine is clearly defined. The goal of the learning machine is to minimize the performance measure. However, the smallest possible value of the performance measure is some positive number which is always greater than zero since the Nirvana of perfect classification is never reached.
So, to summarize, this new perspective is to view the learning machine as a machine whose goal is to minimize the value of some performance function. If a perfect solution exists, then the minimum value of the performance function is equal to ZERO. If a perfect solution does not exist, then the minimum value of the performance function is not equal to zero. However, the really cool thing about this formulation is that we can design learning machines which can learn in situations where a perfect solution doesn’t exist by reformulating the learning machine’s goal as seeking the minimum value of the performance function rather than seeking perfect performance. The performance function framework provides a general theoretical framework for thinking about learning machine goals which includes as a special case the situation where the learning machine is capable of actually achieving those goals.
The objective for learning for any learning machine whose performance can be measured by a performance function can be reformulated as computing the minimum value of a performance function.
So now the question arises: what method should we use to minimize the performance measure? The performance measure's value depends upon many parameter values. One could perturb each parameter, evaluate the performance function at the perturbed values, and use those evaluations as a basis for minimizing the performance function.
This leads us to a particular method, called the method of steepest descent, for adjusting the parameters of a learning machine which has many free parameters. In the method of steepest descent, the derivative of the performance measure is computed. If the performance measure depends upon K different parameters, then the derivative of the performance measure is a list of K numbers, where each number specifies a perturbation for a different parameter value. It can be shown that these perturbations have a very special property. When the next guess for the parameter values of the learning machine is obtained by subtracting a sufficiently small multiple of the first derivative of the performance measure from the current guess for the parameter values, the new updated parameter values yield a performance measure value that is less than the performance measure value associated with the original set of parameter values.
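As a rough sketch of this update rule (an illustrative example, not from the episode), here is steepest descent in Python for a toy performance measure with a single parameter:

```python
# A minimal sketch of the steepest descent update described above:
# new guess = current guess - stepsize * derivative.
# The toy performance measure here is f(w) = (w - 3)^2.

def derivative(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w = 0.0          # initial guess for the parameter value
stepsize = 0.1   # a sufficiently small positive stepsize
for trial in range(200):
    w = w - stepsize * derivative(w)  # step in the downhill direction

print(round(w, 4))  # prints 3.0: the parameter value minimizing f
```

Each learning trial strictly decreases the performance measure, and the parameter value converges to the minimizer.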
To illustrate these ideas, suppose that we are caught in a snowstorm on a mountain which forms the side of a large valley. We can still find our way down the mountain by trying a step in each possible direction and choosing the step which takes us down the mountain most rapidly. Our location on the mountain corresponds to a particular pattern of parameter values of the learning machine. The height of the mountain at that location corresponds to the performance of the learning machine for the parameter values associated with that location. The strategy of always taking the step which brings us downhill most rapidly will eventually lead us to the bottom of the valley.
However, this strategy has some technical difficulties. First, if the steps are too small, then it will take a long time to reach the bottom of the valley. On the other hand, suppose we are on a pogo stick and we hop down the mountain in large jumps. It's possible that we would hop right over the bottom of the valley if our jumps are too large. That is, we would “overshoot” the bottom of the valley and might actually start going uphill, up the opposite side of the valley. This would require that we turn around, but again, if our jumps are too large, we will again overshoot the bottom of the valley. This process might continue indefinitely: we continually try to jump downhill but continually jump over the bottom of the valley. Thus, the steps can't be too small because it will take a very long time to reach the bottom of the valley, but they can't be too large, otherwise we might “jump over” the bottom of the valley.
So, to summarize, the choice of the size of the steps one takes down the mountain is important. The stepsize cannot be too small or too large. Returning to the problem of adjusting the parameters of our learning machine, this observation implies that the change in the parameter values at each adjustment cannot be too small or too large either. If the change in the parameter values is very small at each learning trial, then the learning process will be very slow. If the change in the parameter values is very large at each learning trial, then the pattern of parameter values may “overshoot” or “jump over” the optimal choice of parameter values.
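The stepsize trade-off can be sketched numerically. The quadratic performance measure and the particular stepsize values below are illustrative choices, not from the episode:

```python
# Illustrative sketch of the stepsize trade-off on the toy performance
# measure f(w) = w**2, whose derivative is 2*w and whose minimum is w = 0.

def descend(stepsize, steps=50):
    w = 1.0  # starting parameter value
    for _ in range(steps):
        w = w - stepsize * 2.0 * w
    return w

print(abs(descend(0.4)))  # small enough: approaches the minimum at 0
print(abs(descend(1.1)))  # too large: each step overshoots and w diverges
```

With the larger stepsize, every update jumps over the bottom of the valley to a point higher than where it started, exactly like the pogo stick analogy.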
The search direction is also extremely important in the design of an effective learning rule. Notice that if you are on the mountain and have identified the direction which takes you down the mountainside most rapidly, then you will still go downhill even if you don't step in that direction but instead take a step in a direction which is “close” to the direction which takes you downhill most rapidly. For example, you could choose a direction which deviated 20 degrees from the direction which would take you downhill most rapidly and you would still go downhill. Also recall that an explicit formula for the direction of steepest descent can be derived by computing the derivative of the performance measure with respect to the parameter values and changing its sign. Although in many important situations this strategy for going downhill may be advantageous, there are important situations where traveling downhill in the direction of steepest descent is not the best possible choice.
For example, suppose that you are trying to move down a mountain which overlooks a narrow valley. The valley is shaped like a banana: it has a very short width and a very long length. The deepest location in the valley is located at the valley's center. If you are a skier located on the side of this valley at some high elevation and you are far from the valley's center, then skiing directly downhill is not an efficient path. You will rapidly reach the floor of the valley, but once there you will be unable to ski because the slope is so gradual. You will have to walk along the valley floor, and it will be difficult to find your way because the slope on the floor of the valley is so gradual. It would have been much more efficient to “traverse” the side of the mountain and descend towards the center of the valley at an angle rather than skiing directly downhill.
Now suppose that, as before, we are located on the side of a banana-shaped valley at some high elevation, far from the deepest point of the valley, so we can't simply ski straight down the slopes to the center. Assume the valley is very long and that we are miles away from the bottom of the valley. The method we have been using for finding the bottom of the valley is based upon essentially estimating the slope of the valley at a given location for a particular direction of travel. Or, in other words, we approximate the shape of the valley at a particular point by a straight line.
Suppose now that we estimate not only the slope of the valley at a particular location but also the change in the slope of the valley at that location. This corresponds to approximating the shape of the valley at a particular point by a curve rather than a line. In particular, if one knows both the slope of the valley at a particular location and the change in the slope of the valley at that location, then it can be shown that one can calculate theoretically where the deepest location of the valley is, under the assumption that the valley has a quadratic shape. For a learning machine, this means that one can actually solve a simple formula for the deepest location of the valley given one's current position. This formula will not give exactly the right answer if the shape of the valley cannot be expressed as a quadratic function. However, it will identify a location in the valley which is very close to the deepest location, since the quadratic approximation is appropriate in many important cases. Still, this is a super-fast method. It is equivalent to instantly vanishing from one end of the valley and rematerializing at the predicted location of the deepest spot, just like using the Transporter Beam in the TV show Star Trek!
This latter approach is called “Newton's Method” for adjusting the parameter values of a learning machine in order to optimize performance. Unfortunately, it is very computationally intensive. If the performance function depends upon the values of K free parameters, then Newton's method requires that a matrix with K² elements be inverted and multiplied by a vector with K elements. However, Newton's method is very fast: mathematicians refer to the convergence rate of Newton's method as quadratic and the convergence rate of the standard gradient descent method as linear. Newton's method is an entire order of magnitude faster than the standard gradient descent method for choosing the search direction. And, in practice, it is even faster than that!!
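For a one-dimensional quadratic valley, the single-jump behavior of Newton's method can be sketched as follows (an illustrative toy example, not from the episode):

```python
# Sketch of one Newton step on the quadratic valley f(w) = (w - 5)^2.
# Knowing both the slope and the change in slope lets us jump straight
# to the deepest point, like the Transporter Beam analogy above.

def slope(w):
    return 2.0 * (w - 5.0)       # first derivative of f

def change_in_slope(w):
    return 2.0                   # second derivative (constant for a quadratic)

w = -100.0                       # current position, far from the bottom
w = w - slope(w) / change_in_slope(w)  # one Newton step
print(w)  # prints 5.0: the exact minimizer, reached in a single step
```

For a non-quadratic valley, the same step lands near (rather than exactly at) the minimum, and the step is repeated.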
Modern methods for the derivation of learning rules not only build upon the standard gradient descent method and Newton's method, but additionally generate search directions which do not require the computation or inversion of a high-dimensional matrix yet which can be shown to be almost as fast as Newton's method. One such method is the memoryless Broyden-Fletcher-Goldfarb-Shanno or BFGS method, which includes conjugate gradient descent methods as an important special case.
Memoryless BFGS methods, which compute different combinations of search directions in a computationally efficient manner yet achieve Newton-method-like speeds, can be shown to have “faster than linear” or “superlinear” convergence rates.
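As a rough sketch of how a conjugate-gradient-style search (an important special case of this family, per the discussion above) combines search directions without ever forming or inverting a K-by-K matrix, here is the Fletcher-Reeves update on a toy two-parameter quadratic. All details below are illustrative assumptions:

```python
# Fletcher-Reeves conjugate gradient sketch on the toy performance
# measure f(w) = w[0]**2 + 10 * w[1]**2 (a narrow, banana-like bowl).
# No K-by-K matrix is ever computed or inverted.

def grad(w):
    return [2.0 * w[0], 20.0 * w[1]]

w = [5.0, 5.0]
g = grad(w)
d = [-g[0], -g[1]]  # first search direction: steepest descent
for _ in range(50):
    # Exact line-search stepsize for this quadratic: -(g . d) / (d . A . d)
    gd = g[0] * d[0] + g[1] * d[1]
    dAd = 2.0 * d[0] ** 2 + 20.0 * d[1] ** 2
    alpha = -gd / dAd
    w = [w[0] + alpha * d[0], w[1] + alpha * d[1]]
    g_new = grad(w)
    if g_new[0] ** 2 + g_new[1] ** 2 < 1e-18:
        break  # gradient is essentially zero: we have reached the bottom
    # Fletcher-Reeves rule: blend the new steepest descent direction
    # with the previous search direction.
    beta = (g_new[0] ** 2 + g_new[1] ** 2) / (g[0] ** 2 + g[1] ** 2)
    d = [-g_new[0] + beta * d[0], -g_new[1] + beta * d[1]]
    g = g_new

print([round(abs(x), 6) for x in w])  # reaches the minimum at [0, 0]
```

On a quadratic with K parameters, this scheme reaches the minimum in at most K iterations, which is the sense in which it approaches Newton-like speed while only ever storing a few vectors.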
I will sometimes use the terminology “gradient descent” to refer to all of these methods because they are all essentially variations of the basic method of steepest descent which is based upon computing the first derivative or equivalently the gradient of the performance measure. However, in the literature, the terminology “gradient descent” is often used interchangeably with the terminology “steepest descent”.
To summarize, we have covered the following important concepts. First, a learning rule can be designed so that the parameter values are perturbed at each learning trial so that the performance measure decreases. Second, the particular pattern of parameter value changes is important. If the parameter values are changed by an amount proportional to the derivative of the performance measure, then this perturbation is in the direction of “steepest descent”. This means that the parameter values are perturbed in such a way that they decrease the value of the performance measure as rapidly as possible. Third, the size of the changes in the parameters can’t be too small or too large. If they are too small, then it will take a long time for the learning process to complete. If the changes to the parameter values are too large, then the parameter values will tend to “overshoot” the optimal parameter values that minimize the performance measure. This is analogous to taking large jumps down a ski slope and jumping over the lowest point in the valley. And fourth, it is not necessary to move in the direction of steepest descent. One can go downhill if one moves in a direction whose angle with the direction of steepest descent is not too large. In fact, methods exist for choosing a downhill search direction which can greatly accelerate the learning process and result in superlinear convergence speeds.
However, there is still one more very important concept which we have not yet covered. Suppose we have a collection of N training stimuli. The performance measure is based upon all N training stimuli which means that at each learning trial the parameters of the learning machine are modified based upon all N training stimuli. Although it is true that one may have the entire set of training stimuli available for the purpose of learning in many cases, in many other situations in the real world updating the parameters of the learning machine with all possible training stimuli is not practical.
As a first example, consider a learning machine operating within a real environment where the learning machine’s experiences are generated in real-time by the environment. In this situation, it is impossible for the learning machine to update its parameter values based upon “future experience”. The learning machine can only update its parameter values based upon its past experiences. In such a case, the only possible option is for the learning machine to update its parameter values in real-time. As each new training stimulus is presented to the learning machine, the learning machine updates its parameter values.
As a second example, consider a learning machine in which the amount of computation required to update its parameter values based upon a single training stimulus from the environment is very costly. Such a situation could arise, for example, in situations where a learning machine is processing audio, image, or video data. If the goal of the learning machine was to update hundreds of parameter values based upon training data from thousands of hours of video data, then it might not be computationally feasible to use all of the training data to update the learning machine’s parameters at each learning trial. Rather, one might select some video data at random, update the parameters of the learning machine and then select some more video data at random, and then repeat this process.
When all of the training data is used to update the parameter values at each learning trial, then this is called “off-line” or “batch” learning. When only one training stimulus is used to update the parameter values of the learning machine at each learning trial this is called “on-line” or “adaptive” learning. When small batches of training stimuli are selected at random and used to update the parameter values of the learning machine at each learning trial, this is called “mini-batch” learning.
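The three regimes can be sketched side by side. The toy performance measure below (learning the mean of a set of training targets) and all parameter choices are illustrative assumptions, not from the episode:

```python
import random

# Toy training data; the performance measure is the squared error
# (w - target)**2 averaged over whichever batch of targets is used,
# so the optimal parameter value is the mean of the targets, 2.5.
targets = [1.0, 2.0, 3.0, 4.0]
stepsize = 0.1

def gradient(w, batch):
    return sum(2.0 * (w - t) for t in batch) / len(batch)

random.seed(0)
w_batch = w_online = w_mini = 0.0
for trial in range(500):
    w_batch -= stepsize * gradient(w_batch, targets)                     # batch: all data
    w_online -= stepsize * gradient(w_online, [random.choice(targets)])  # on-line: one stimulus
    w_mini -= stepsize * gradient(w_mini, random.sample(targets, 2))     # mini-batch: two stimuli

print(round(w_batch, 3))  # converges to 2.5; the on-line and mini-batch
                          # estimates fluctuate around the same value
```

With a fixed stepsize, the on-line and mini-batch estimates hover around the optimum rather than settling exactly on it; in practice the stepsize is typically decreased over time.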
The essential idea behind these “on-line” or “adaptive” learning methods is that a downhill search direction is computed using just one training stimulus or a small “mini-batch” of training stimuli. The downhill search direction computed in this manner is clearly not the same as the downhill search direction which would be computed if all of the training data were used. Suppose, however, that it can be shown that the average value of the downhill search direction based upon a single training stimulus or a “mini-batch” of training stimuli is the same as the downhill search direction based upon all of the training stimuli. Given this assumption, it is sometimes possible to establish that on-line adaptive methods of learning will find parameter values that minimize the performance function based upon ALL of the training data, even though the learning machine updates its parameter values using only PART of the training data. The relevant branch of mathematics associated with such techniques is called “Stochastic Approximation Theory”.
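This key averaging assumption can be checked numerically for a toy squared-error performance measure (an illustrative example, not from the episode):

```python
# Check: the average of the per-stimulus downhill directions equals the
# downhill direction computed from ALL of the training data at once.
# Toy setup: squared error (w - target)**2 for each training stimulus.

targets = [1.0, 2.0, 3.0, 4.0]
w = 0.7  # an arbitrary current parameter value

per_stimulus = [2.0 * (w - t) for t in targets]      # one direction per stimulus
average_direction = sum(per_stimulus) / len(targets)
full_data_direction = 2.0 * (w - sum(targets) / len(targets))

print(abs(average_direction - full_data_direction) < 1e-9)  # prints True
```

Because the per-stimulus directions average out to the full-data direction, following randomly selected per-stimulus directions tends, on average, to move the parameters downhill on the full performance measure.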
Although it is not possible to provide explicit formulas and equations in this podcast for the methods discussed here, if you go to the website: www.learningmachines101.com you will find the transcript of this episode with additional ADVANCED references for you to learn more about both off-line and on-line gradient descent methods. These additional ADVANCED references provide specific mathematical formulas and theorems associated with today’s discussion. And please visit our website at: www.learningmachines101.com and join the Learning Machines 101 community. At some point in the near future, I will provide an episode similar to episode 13 which describes specific machine learning software you will be able to download from this website which can be used to implement ideas such as those discussed in this podcast.
“Wolfe Conditions” http://en.wikipedia.org/wiki/Wolfe_conditions Provides explicit formulas for selecting the stepsize for batch learning
Steepest Descent http://en.wikipedia.org/wiki/Gradient_descent
This is an introduction to the discussion of gradient descent.
Note that the terminology “gradient descent” in this article is used interchangeably with the concept of “steepest descent”.
Newton’s Method http://en.wikipedia.org/wiki/Newton%27s_method
Accelerated Gradient Descent Methods
Adaptive Learning Methods http://en.wikipedia.org/wiki/Stochastic_gradient_descent
Keywords: Steepest descent, gradient descent, line search, stochastic approximation, off-line learning, on-line learning, batch learning, memoryless BFGS, Newton’s method, stochastic gradient descent
The Wind Sound for this episode was provided by Mark DiAngelo and distributed under the Creative Commons Attribution 3.0 License via SoundBible.com
Copyright © 2014 by Richard M. Golden. All rights reserved.