LM101-085: Ch7: How to Guarantee your Batch Learning Algorithm Converges

May 20, 2021
Interpreting skiing as gradient descent.

Episode Summary:

This 85th episode of Learning Machines 101 discusses formal convergence guarantees for a broad class of machine learning algorithms designed to minimize smooth non-convex objective functions using batch learning methods. In particular, it considers unsupervised, supervised, and reinforcement machine learning algorithms that iteratively update their parameter vector by adding a perturbation computed from all of the training data. This process is repeated until a parameter vector is generated which exhibits improved predictive performance. The magnitude of the perturbation at each learning iteration is called the “stepsize” or “learning rate” and the direction of the perturbation vector is called the “search direction”. Simple mathematical conditions are presented, based upon research from the late 1960s by Philip Wolfe and G. Zoutendijk, that ensure convergence of the generated sequence of parameter vectors. These conditions may be used to ensure that your batch learning algorithm converges. The material in this podcast is designed to provide an overview of Chapter 7 of my new book “Statistical Machine Learning” and is based upon material originally presented in Episode 68 of Learning Machines 101!

Show Notes:

Hello everyone! Welcome to the 85th podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.  

This particular podcast is actually the eighth episode in a special series of episodes designed to provide commentary on my new book. The book’s title is “Statistical Machine Learning: A unified framework” published by CRC Press in their “Texts in Statistical Science” series. The book is designed to provide mathematical tools from the fields of nonlinear optimization theory and mathematical statistics for the analysis and design of machine learning algorithms. If you have an undergraduate degree in computer science, engineering, or math, then this book is a great way to get up to speed with some of the most important and widely used mathematical tools in machine learning used by the top experts in the field. You can learn more about my new book by visiting the website: www.statisticalmachinelearning.com.

Each podcast in this special series describes a chapter from my new book. Each podcast is typically divided into four sections. The first section provides an easy-to-understand self-contained overview of the key concepts from the relevant chapter. The second section is specifically designed to provide guidance to the human student who is using the book for self-study. The third section is designed to provide guidance for the human instructor who is using the book for instruction. The fourth section is designed to provide guidance for robot web spiders (also called web spiderbots or web crawlers) so they can find this podcast and properly index it for humans. But even if you are not a robot web spider…you shouldn’t feel left out! Humans familiar with the relevant terminology will also find this fourth section of the podcast useful for a quick technical overview of what is covered in the book chapter.

This particular episode covers the material in Chapter 7 of my new book titled “Statistical Machine Learning: A unified framework.” Chapter 7 is titled “Batch Learning Algorithm Convergence”.

[MUSICAL INTERLUDE]

Overview of Book Chapter 7

Most machine learning algorithms (including unsupervised, supervised, and reinforcement learning algorithms) work by perturbing the parameters of the learning machine based upon information in the set of training data. The magnitude of this perturbation governs the rate of learning and is typically specified by a number which is called the “stepsize” or “learning rate”. The particular pattern of perturbations is called the “search direction”. After each such perturbation of the parameters we obtain a revised set of parameter values which we will call “parameter estimates”. Ideally, we would like the algorithm to generate a sequence of parameter estimates that converges to parameter values which minimize the learning machine’s prediction error.

Note that if the learning rate is too small, so that we change the parameter values of the learning machine by only a very small amount at each learning trial, then learning takes forever. However, if we try to increase the speed of learning by making large changes to the parameter values using a large learning rate, then a large change to the parameter values might damage the learning machine’s existing knowledge of its environment. Thus, the first question is how to choose a learning rate which is not too small and not too large.

In addition, we need to impose some constraints on the “search direction”. Basically, we want the pattern of perturbations at each iteration of the learning algorithm to be designed so that the prediction error of the learning machine becomes smaller. The method of gradient descent provides one answer: choose the pattern of perturbations so that it is proportional to the negative derivative of the prediction error evaluated at the current parameter values. This choice identifies the pattern of perturbations which decreases the prediction error most rapidly.
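
To make this concrete, here is a minimal sketch of a single gradient descent update in Python. The function and variable names are illustrative rather than taken from the book.

```python
import numpy as np

def gradient_descent_step(theta, grad_fn, stepsize):
    """One batch gradient descent update: perturb the parameters in the
    direction that decreases the average prediction error most rapidly."""
    search_direction = -grad_fn(theta)            # negative derivative of the prediction error
    return theta + stepsize * search_direction    # revised parameter estimates

# Example usage on a toy quadratic error surface (illustrative only):
theta = np.array([2.0, -1.0])
grad_fn = lambda th: 2.0 * th                     # gradient of ||theta||^2
theta = gradient_descent_step(theta, grad_fn, stepsize=0.1)
```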

In the late 1960s, Philip Wolfe, G. Zoutendijk, and other researchers developed simple mathematical conditions which ensure that batch learning algorithms of the type typically encountered in unsupervised, supervised, and reinforcement machine learning generate sequences of parameter estimates which converge to some desired solution set.

Many machine learning algorithms can be interpreted as variants of gradient descent. Gradient descent type methods are often used in unsupervised and reinforcement learning algorithms as well as supervised learning algorithms. Indeed, the recent advances in learning methods for deep neural networks are based upon variations of the concept of gradient descent, and the standard methods for estimating parameters in classical logistic and multinomial logistic regression models are based upon gradient descent as well.

Consider a machine learning algorithm which generates a prediction for a given input pattern based upon the current parameter values of the learning machine. One can then compute the prediction error for each input pattern if the desired response of the learning machine for each input pattern is known. For a given training data set, the average prediction error across all training stimuli is assumed to be a differentiable function of the learning machine’s parameter values. The generic descent algorithm works by first making an initial guess for the learning machine’s parameter values and then perturbing these parameter values in such a way that the new parameter values will have a smaller average prediction error than the previous parameter values. This perturbation is often represented as a positive number called the “step size” multiplied by a list of numbers called the “search direction”.

If the descent algorithm is a gradient descent algorithm, then the search direction is chosen to be negative one multiplied by the derivative of the prediction error with respect to the learning machine’s parameter values. It can be shown that this choice of search direction decreases the prediction error most rapidly if the step size is sufficiently small.

In fact, it can also be shown that if we choose a search direction whose angular separation from the gradient descent search direction is strictly less than 90 degrees, then the prediction error is guaranteed to decrease provided the step size is sufficiently small. This provides quite a bit of flexibility in choosing the pattern of perturbations to the parameter values of the learning machine during the learning process.

We can think of a gradient descent algorithm as specifying the trajectory of a spaceship traveling from Earth to a distant star. Every so often, the spaceship computes the angular separation between two arrows pointing away from the spaceship: the planned direction of flight (the search direction) and an arrow pointing directly towards the target star. This latter arrow corresponds to the negative gradient direction when the performance function is convex or bowl-shaped. The pilot of the spaceship checks that the angular separation between the direction of flight (the search direction) and the arrow pointing directly to the star (the negative gradient) is always less than 90 degrees. The spaceship then travels along the search direction a short distance called the stepsize, after which a new flight direction and stepsize are proposed. In order for the spaceship not to “overshoot” its target star, it should probably take larger steps initially and then, as it gets closer to the target star, decrease the stepsize, resulting in more “in-flight corrections” to the search direction.

So to summarize the discussion thus far, many machine learning algorithms can be interpreted as descent algorithms which work in the following manner. First, guess parameter values for the learning machine. Then, using all of the training data, compute the derivative of the prediction error (the gradient) with respect to the parameter values.

Then change the current parameter values by adding a positive number called the stepsize multiplied by the search direction. Choose the search direction so that the dot product of the normalized gradient and the normalized search direction is strictly less than some small negative number.
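
This downhill requirement on the dot product is easy to check in code. Here is a minimal sketch, where the threshold delta is an assumed small positive constant rather than a value taken from the book:

```python
import numpy as np

def is_downhill(gradient, search_direction, delta=1e-3):
    """Return True when the normalized search direction and the normalized
    gradient have a dot product below -delta, i.e. the search direction is
    strictly less than 90 degrees away from the negative gradient."""
    g = gradient / np.linalg.norm(gradient)                  # assumes a nonzero gradient
    d = search_direction / np.linalg.norm(search_direction)  # assumes a nonzero direction
    return float(np.dot(g, d)) < -delta
```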

Note that merely showing the average prediction error decreases at each parameter update is not sufficient to show that the resulting sequence of parameter estimates converges. It is quite possible to generate a sequence of parameter estimates whose corresponding prediction errors are decreasing yet the sequence of parameter estimates itself does not converge. As an analogy similar to the spaceship example, consider the case where a parameter vector corresponds to a particular location of a hiker on a mountainside and the height of the hiker above sea level is defined as the average prediction error. If the hiker’s goal is to reach sea level, then one strategy is to choose each step so that the hiker’s height above sea level decreases at each step. This is analogous to a machine learning algorithm which decreases its prediction error at each step by updating its parameter values in an appropriate manner.

Note that, as in the spaceship and hiker examples, if the step size is chosen to be some constant number, then convergence is not possible in general since the learning algorithm (or hiker) will “overstep” a solution. On the other hand, it is also possible to decrease the step size too rapidly. For example, if a downhill hiker takes shorter and shorter steps, this may eventually lead to the hiker stopping not at the bottom of the mountain but at some arbitrary point on the mountainside.

For this reason, the problem of choosing the stepsize must be carefully addressed. One approach is to select an “optimal stepsize”, which means that once the search direction is identified, one attempts to choose the stepsize which generates the largest decrease in prediction error over some pre-specified stepsize range. However, in a high-dimensional parameter space, it is computationally expensive to compute the optimal stepsize exactly. Instead, one would like a computationally cheap “sloppy stepsize” which includes the optimal stepsize as a special case and still ensures that the descent algorithm will converge.

What types of technical conditions are required for a “sloppy stepsize” to ensure that a descent algorithm will converge? A sufficient set of conditions is obtained if the “sloppy stepsize” satisfies the strong Wolfe conditions, which were discussed by Philip Wolfe and others in the late 1960s. We refer to such sloppy stepsizes as Wolfe stepsizes. A Wolfe stepsize must satisfy two conditions called the Wolfe conditions.

The first Wolfe condition states that the prediction error evaluated after the proposed parameter update for a candidate stepsize, minus the prediction error evaluated at the current parameter values, must be less than or equal to some positive number α (alpha) multiplied by the stepsize multiplied by the dot product of the gradient vector at the current parameter values and the proposed search direction. Since this dot product is always negative for a downhill search direction, this condition ensures not only that the prediction error decreases for the new candidate stepsize but that it decreases with an extra safety margin whose magnitude is governed by α (alpha).

The second Wolfe condition requires that the magnitude of the derivative of the prediction error with respect to the stepsize, evaluated at the new candidate stepsize, is less than or equal to a positive number β (beta) multiplied by the magnitude of the prediction error derivative evaluated at a stepsize of zero (that is, at the current parameter values). This second Wolfe condition ensures that the slope of the prediction error at the new stepsize is smaller in magnitude by some safety margin which is governed by the magnitude of β (beta).

In addition, it is required that β is less than one and that β is larger than α; that is, 0 < α < β < 1. The constants α (alpha) and β (beta) are chosen before the learning process is initiated.

These conditions might seem complicated when stated in words, but they are essentially easily computable requirements that a stepsize must satisfy, and checking them only requires evaluating the prediction error and the prediction error derivative, which are typically functions that have already been implemented. We can write a simple software function to check if a candidate stepsize is a Wolfe stepsize.
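
Here is one way such a checking function might look in Python: a minimal sketch of the strong Wolfe conditions as described above, with illustrative default values for α and β (the book may use different constants and notation).

```python
import numpy as np

def is_wolfe_stepsize(f, grad, theta, d, t, alpha=1e-4, beta=0.9):
    """Check the strong Wolfe conditions for a candidate stepsize t.

    f     -- average prediction error function (returns a scalar)
    grad  -- derivative of f with respect to the parameters (returns a vector)
    theta -- current parameter vector
    d     -- proposed (downhill) search direction
    alpha, beta -- constants chosen so that 0 < alpha < beta < 1
    """
    slope0 = float(np.dot(grad(theta), d))  # slope of the error w.r.t. the stepsize at t = 0
    theta_new = theta + t * d
    # First Wolfe condition: sufficient decrease with a safety margin governed by alpha.
    sufficient_decrease = f(theta_new) <= f(theta) + alpha * t * slope0
    # Second (strong) Wolfe condition: slope magnitude at t shrinks by a factor beta.
    flattened_slope = abs(float(np.dot(grad(theta_new), d))) <= beta * abs(slope0)
    return sufficient_decrease and flattened_slope
```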

At this point we are ready to discuss the Zoutendijk-Wolfe convergence theorem. Assume the prediction error function has continuous derivatives and also has a lower bound.

We also define a subset of the parameter space, called OMEGA, as the set of all parameter vectors whose prediction error is less than or equal to the prediction error of our initial guess. Then it follows that the parameter updating process will eventually generate a sequence of parameter values which gets closer and closer to the set of critical points of the prediction error function in OMEGA, or the sequence of parameter vectors will grow without bound. From an engineering perspective, this result is useful because it means that if the algorithm is empirically observed to converge to a parameter vector, then that parameter vector must be a critical point of the prediction error objective function.

As a reminder, recall that a critical point is defined as a parameter vector which makes the first derivative of the prediction error function vanish. This is a necessary but not sufficient condition for a local or global minimizer of the prediction error function.
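
In practice, approximate convergence to a critical point is typically declared when the gradient norm falls below a small numerical tolerance. Here is a minimal sketch; the tolerance value and function name are illustrative assumptions:

```python
import numpy as np

def near_critical_point(gradient, tol=1e-6):
    """Declare approximate convergence to a critical point when the
    derivative of the prediction error is numerically close to zero."""
    return float(np.linalg.norm(gradient)) < tol
```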

In practice, it would be great if we could find the smallest possible prediction error, that is, a global minimizer of the prediction error, but for complex problems it is not unusual that simply finding a relatively deep minimizer of the prediction error function yields acceptable performance. Humans and animals are highly optimized to survive in their environments, but there is not technically an optimal way to pick up an apple and eat it. There is not an optimal way to run away from a predator such as a tiger. What is important is that the biological learning machine eats apples and avoids tigers. This is usually good enough for the purpose of survival!

These results can also be used to design automatic stepsize selection algorithms for descent algorithms. The basic idea of such algorithms is as follows. One first starts with the current parameter estimates. Then one picks a search direction whose angular separation with the negative gradient descent direction is strictly less than 90 degrees minus some small positive constant. Then we need to pick a stepsize associated with the search direction. This is done by first starting with the maximum stepsize, which is usually chosen to be one. Next, one checks the Wolfe conditions. If they are satisfied, then that stepsize is chosen, because it is a large stepsize which satisfies the Wolfe conditions. If not, then the current stepsize is reduced in magnitude and the Wolfe conditions are checked again. This latter step is repeated until the Wolfe conditions are satisfied or until it is determined that it is too computationally expensive to consider additional stepsizes. Typically, for high-dimensional parameter estimation problems, one only wants to spend a few iterations looking for a good stepsize. Next, the parameters are updated by adding the stepsize multiplied by the search direction to the current parameter estimates. This process is then repeated until convergence to a critical point is reached or the system is empirically observed to exhibit convergence failure.

Automatic stepsize selection methods of this type are called “backtracking stepsize” search methods and are discussed in the references at the end of the blog for this podcast, which is located at: www.learningmachines101.com. In this reference list, I have also included some relatively recent research on this topic as well!
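
A minimal sketch of such a backtracking stepsize search follows, reusing the is_wolfe_stepsize function from the earlier sketch. The maximum stepsize, shrinkage factor, and iteration budget are illustrative choices, not prescriptions from the book:

```python
def backtracking_stepsize(f, grad, theta, d, t_max=1.0, shrink=0.5,
                          max_tries=10, alpha=1e-4, beta=0.9):
    """Start at the maximum stepsize and repeatedly shrink the candidate
    until the Wolfe conditions hold or the attempt budget runs out."""
    t = t_max
    for _ in range(max_tries):
        if is_wolfe_stepsize(f, grad, theta, d, t, alpha, beta):
            return t          # accept the first (largest) stepsize that passes
        t *= shrink           # otherwise reduce the stepsize and try again
    return t                  # budget exhausted: fall back to the last candidate

def descent_iteration(f, grad, theta):
    """One full iteration: downhill search direction, Wolfe stepsize, update."""
    d = -grad(theta)          # gradient descent search direction (always downhill)
    t = backtracking_stepsize(f, grad, theta, d)
    return theta + t * d
```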

[MUSICAL INTERLUDE]
How to Read Book Chapter 7

This section of the podcast is designed to provide guidance for the student whose goal is to read Chapter 7. Carefully read the beginning of the chapter before Section 7.1. Then read Section 7.1.1, which discusses search direction selection. Don’t bother working through Examples 7.1.1 and 7.1.2 in Section 7.1.1 at this time, but make sure you really understand how Figure 7.1 graphically illustrates the main equation (7.2).

Section 7.1.2 discusses selection of the stepsize. Carefully read this section up to the discussion of the definition of a Wolfe stepsize. Make sure you understand the definition of an “optimal stepsize”, which is illustrated in Figure 7.2. You will probably have difficulty understanding the definition of a Wolfe stepsize provided in Definition 7.1.3 if you don’t carefully understand the relevant notation. The critical notation used in the definition is provided in the paragraph immediately before Definition 7.1.3, so read that paragraph very carefully before you launch into the definition. Some insights into the definition are also provided in the text following Definition 7.1.3.

Make sure you understand the distinction between the Wolfe Conditions and the Strong Wolfe conditions.

For now, you can temporarily ignore Theorem 7.1.1, which basically says that the conditions for the Wolfe stepsize are not impossible to achieve. Algorithm 7.1.1 may be helpful in making the concept of checking the Wolfe stepsize practical.

Section 7.2 introduces the Zoutendijk-Wolfe convergence theorem. For your first reading of this chapter, you might want to read the formal statement of Theorem 7.2.1, skip the proof, and then read Recipe Box 7.1 very carefully. It will be helpful to compare the presentation in Recipe Box 7.1 with the statement of Theorem 7.2.1. Note that this convergence theorem requires that the objective function is smooth, but it does not require that the objective function is convex.

Section 7.3 discusses a variety of important strategies for choosing search directions. It begins with classical gradient descent, which is based upon approximating the objective function locally by a hyperplane; then it introduces Newton’s method, which is based upon approximating the objective function locally by a quadratic surface. The challenges of applying Newton’s method in practice are then discussed, which motivates the Levenberg-Marquardt algorithm, a hybrid which combines the gradient descent search direction with the Newton search direction. This is followed by several advanced optimization methods, developed in the late 1970s and still used today, which build up implicit approximations to the Newton search direction on a quadratic surface by taking weighted sums of previous gradient descent directions and previous search directions with the current gradient descent direction.
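
To make the hybrid idea concrete, here is a minimal sketch of a Levenberg-Marquardt style search direction, assuming access to the gradient and Hessian of the objective function. The damping constant mu and the function names are illustrative assumptions, not the book’s notation:

```python
import numpy as np

def lm_search_direction(gradient, hessian, mu=1e-2):
    """Hybrid Levenberg-Marquardt style direction: solve (H + mu*I) d = -g.
    Small mu approximates the Newton direction; large mu approximates a
    scaled gradient descent direction."""
    n = gradient.shape[0]
    return np.linalg.solve(hessian + mu * np.eye(n), -gradient)
```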

After you reach this point, you might want to skim the exercises and examples in Chapter 7 because many of the exercises and examples have been chosen to illustrate commonly used machine learning algorithm search directions or variations of commonly used machine learning algorithm search direction strategies.

[MUSICAL INTERLUDE]

How to Teach Book Chapter 7

This section of the podcast is designed to provide guidance to the instructor whose goal is to teach Chapter 7. Remember these notes are available for download by simply visiting this episode on the website: www.learningmachines101.com. In addition, supplemental teaching materials and errata for the book (as they become available) will be available on the website: www.statisticalmachinelearning.com. These class notes are designed for advanced undergraduate computer science students or first year computer science graduate students.

Begin this chapter by spending some time carefully defining the concept of a descent algorithm, which basically has the form: new parameter estimates are equal to the old parameter estimates plus the stepsize multiplied by the search direction. Also emphasize that the goal of the descent algorithm is to minimize an objective function. Make sure the terminology “stepsize”, “search direction”, and “objective function” is understood in this context. Then explain that certain constraints need to be imposed to ensure that the convergence theorem introduced later in the chapter works. At this point it is helpful to go over the downhill condition in Equation (7.2) and carefully explain that although the search direction is still “downhill” if DELTA in Equation (7.2) is equal to 0, as illustrated in Figure 7.1, it is a requirement of the convergence theorem in the following section that DELTA is a small positive number. This requirement effectively prevents situations where the search direction is changing over time and converging to a direction which is orthogonal to the negative gradient.

Note that there are two general approaches which are commonly used to satisfy Equation (7.2). First, we can simply check whether (7.2) is violated for some positive DELTA; if so, then we take a gradient descent step, as shown in the sketch below. Alternatively, we might be able to prove that (7.2) is never violated under certain conditions and try to understand those conditions. An example of this latter strategy is provided in Example 7.1.1. It is worthwhile working through Example 7.1.1 at this point.
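
The first strategy takes only a few lines of code. Here is a minimal sketch, reusing the is_downhill check from the earlier sketch, with an illustrative DELTA value:

```python
def safeguarded_direction(gradient, proposed_d, delta=1e-3):
    """If the proposed search direction violates the downhill condition
    for a positive DELTA, fall back to a plain gradient descent step,
    which always satisfies it."""
    if is_downhill(gradient, proposed_d, delta):   # from the earlier sketch
        return proposed_d
    return -gradient
```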

Then it is helpful to go over Figure 7.3 and Figure 7.4 which demonstrate that the stepsize needs to be dynamically adjusted during the learning process. Then go over Figure 7.2 which describes the concept of an “optimal stepsize”. Then explain that it is computationally expensive to compute an optimal stepsize so we want to use a sloppy stepsize which is called the “Wolfe stepsize”.

Then explain the concept of a Wolfe stepsize as described in Definition 7.1.3 and the discussion following it. Spend some time carefully introducing the notation introduced above Definition 7.1.3, which is the objective function projected along the direction of the search direction. Explain the “strong” Wolfe conditions. In my class, I just emphasize the strong Wolfe conditions. I skip the “Existence Theorem” but mention that it basically says that the Wolfe conditions are not impossible to achieve at each algorithm iteration.

Then go over Recipe Box 7.1 in the lecture. This is a simplified special case of Theorem 7.2.1, but it is a good format for lecture presentation. Carefully explain that the conclusion of the theorem is that we either have convergence to the set of critical points in a particular closed and bounded set, or the trajectories are not ultimately confined to that closed and bounded set. Note that if the closed and bounded set is defined about a particular critical point, this is helpful for establishing stability of that critical point, while a large closed and bounded set is helpful for establishing global stability. Note that convergence to a set of critical points is a necessary but not sufficient condition for convergence to a local or global minimizer, and that if the objective function is convex then we have a stronger statement. Also note that if the objective function’s value approaches infinity when the magnitude of the parameter values approaches infinity, it is possible to construct a closed and bounded set which contains all trajectories of the dynamical system, thus ruling out the case where we have unbounded trajectories.

Then cover the important material in Section 7.3, which covers gradient descent, Newton’s method, the Levenberg-Marquardt algorithm, and the Limited Memory Broyden-Fletcher-Goldfarb-Shanno algorithm, which is also known as the L-BFGS algorithm. I tried to include some examples of commonly used machine learning algorithms and heuristics scattered throughout the examples and exercises, so you might want to look over those and decide which you would like to present. Examples include block coordinate gradient descent, natural gradient descent, and variants of RMSPROP.

My goal in this chapter is to teach students to identify the search direction and stepsize associated with a particular descent algorithm and then show them how to check if the stepsize satisfies the Wolfe conditions, check if the search direction satisfies the downhill condition, and then understand the limited conclusions of the Wolfe-Zoutendijk convergence theorem.

[MUSICAL INTERLUDE]

Technical Overview of Book Chapter 7

This section is designed mainly for search engines and web spider robots but humans will find this final section relevant also!!

The objective of Chapter 7 is to provide explicit conditions for ensuring the convergence of a large class of commonly used batch learning algorithms. The Zoutendijk-Wolfe convergence theorem is introduced as a tool for investigating the asymptotic behavior of batch learning algorithms. The change in parameter estimates at each iteration is defined as a stepsize multiplied by a search direction vector. The search direction must satisfy a downhill condition which requires that the search direction vector is less than ninety degrees from the negative gradient direction. Rather than choosing a computationally expensive optimal stepsize, the concept of a Wolfe stepsize is introduced that includes an optimal stepsize as a special case but allows for sloppy stepsize choices. The Zoutendijk-Wolfe convergence theorem provides sufficient conditions ensuring that a batch learning algorithm with a Wolfe stepsize and a downhill search direction will converge to a set of critical points of a possibly nonconvex objective function. Convergence of commonly used parameter estimation algorithms is analyzed using the Zoutendijk-Wolfe convergence theorem. These algorithms include: linear regression, logistic regression, multilayer perceptrons, gradient descent, Newton-Raphson, Levenberg-Marquardt, limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS), conjugate gradient, block coordinate gradient descent, natural gradient descent, and variants of RMSPROP.

[MUSICAL INTERLUDE]

My new book “Statistical Machine Learning” can be purchased right now on AMAZON. The hardcover is about $100 and the Kindle version is about $40; the paperback version of the book should be available in about 1 or 2 months, and the cost of the paperback edition will be about $50.

If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time.  You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear”  link!

In addition, I encourage you to join the Statistical Machine Learning Forum on LINKED-IN. The purpose of the Statistical Machine Learning Forum is to provide an opportunity for more advanced discussions of machine learning topics which are typically not covered in these introductory podcasts.

And don’t forget to follow us on TWITTER. The Twitter handle for Learning Machines 101 is “lm101talk”! Also please visit us on Apple Podcasts or Stitcher and leave a review. You can do this by going to the website: www.learningmachines101.com and then clicking on Apple Podcasts, Google Podcasts, or Stitcher. This will be very helpful to this podcast! Thank you again so much for listening to this podcast and participating as a member of the Learning Machines 101 community!

REFERENCES:

Golden, R. M. (2020). Statistical Machine Learning: A Unified Framework. CRC Press. (Chapter 7).

Nocedal, J., and Wright, S. (1999). Numerical Optimization. Springer.

“Steepest Descent.” http://en.wikipedia.org/wiki/Gradient_descent

Truong, T. T., and Nguyen, H.-T. Backtracking gradient descent method and some applications in large scale optimization. Part 1: Theory. Accepted in Minimax Theory and its Applications. This is the more theoretical part of arXiv:1808.05160, with some additional experiments. Accompanying source code is available on GitHub: [Link]

Truong, T. T., and Nguyen, H.-T. Backtracking gradient descent method and some applications in large scale optimization. Part 2: Algorithms and experiments. Published online (Open Access) in Applied Mathematics and Optimization. doi:10.1007/s00245-020-09718-8. The main part of the paper is based on the more experimental part of arXiv:1808.05160, together with arXiv:2001.02005 and arXiv:2007.03618. Accompanying source code is available on GitHub: [Link] [Link to PDF]

Wolfe, P. (1969). “Convergence Conditions for Ascent Methods.” SIAM Review, 11(2), 226–235. JSTOR 2028111. doi:10.1137/1011036.

Wolfe, P. (1971). “Convergence Conditions for Ascent Methods. II: Some Corrections.” SIAM Review, 13(2), 185–188. doi:10.1137/1013035.

“Wolfe Conditions.” http://en.wikipedia.org/wiki/Wolfe_conditions. Provides explicit formulas for selecting the stepsize for batch learning.
