LM101-079: Ch1: How to View Learning as Risk Minimization

By | December 23, 2019
Android presenting his empirical risk family tree.

Episode Summary:

This particular podcast covers the material in Chapter 1 of my new (unpublished) book “Statistical Machine Learning: A unified framework”. The chapter shows how supervised, unsupervised, and reinforcement learning algorithms can be viewed as special cases of a general empirical risk minimization framework. This is useful because it provides a framework not only for understanding existing algorithms but also for suggesting new algorithms for specific applications.

Show Notes:

Hello everyone! Welcome to the 79th podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in a hopefully entertaining and educational manner.

This particular podcast is actually the second episode in a new special series of episodes designed to provide commentary on a new book that I am in the process of writing. The book’s title is “Statistical Machine Learning: A unified framework” and it will be published by CRC Press in their “Texts in Statistical Science” series sometime in early 2021. You can learn more about my new book by visiting the website: www.statisticalmachinelearning.com.

Each podcast in this special series describes a chapter from my new book. Each podcast is divided into four sections. The first section provides an easy-to-understand self-contained overview of the key concepts from the relevant chapter. The second section is specifically designed to provide guidance to the human student who is using the book for self-study. The third section is designed to provide guidance for the human instructor who is using the book for instruction. The fourth section is designed to provide guidance for robot web spiders (also called web spiderbots or web crawlers) so they can find this podcast and properly index it for humans. But even if you are not a robot web spider…you shouldn’t feel left out! Humans familiar with the relevant terminology will also find this fourth section of the podcast useful for a quick technical overview of what is covered in the book chapter.

This particular episode covers the material in Chapter 1 of my new book titled “Statistical Machine Learning: A unified framework.”

[MUSICAL INTERLUDE]

In this episode we discuss Chapter 1 of my new (unpublished) book, which shows how supervised, unsupervised, and reinforcement learning algorithms can be viewed as special cases of a general empirical risk minimization framework. Such a framework has important advantages for machine learning algorithm analysis and design. Suppose we have methods for analyzing and designing empirical risk minimization algorithms. And now suppose we can interpret many different machine learning algorithms as empirical risk minimization algorithms. Then by understanding how to analyze and design empirical risk minimization algorithms, we automatically have learned how to analyze and design many different types of machine learning algorithms!

So what is an empirical risk minimization learning algorithm? We begin by assuming a learning environment where events are independent and identically distributed. A simple way to visualize such a learning environment is to imagine that the set of all possible events the learning machine can possibly experience are in a box. The data generating process then picks an event at random from the box and shows that event to the learning machine. After showing the event to the learning machine, the event is placed back in the box. In other words, the data generating process samples with replacement from the box of possible events to generate a sequence of environmental events.
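This sampling-with-replacement picture can be sketched in a few lines of Python. The box of events here is a hypothetical stand-in for the set of all possible environmental events:

```python
import random

random.seed(0)

# The "box" of all possible environmental events (hypothetical examples).
box_of_events = ["event_A", "event_B", "event_C", "event_D"]

def data_generating_process(n):
    """Sample n events with replacement: each draw is independent and
    identically distributed over the box of possible events."""
    return [random.choice(box_of_events) for _ in range(n)]

observed_events = data_generating_process(10)
# Every observed event is a member of the original box, and events may repeat.
```

Because each event is returned to the box before the next draw, earlier draws never change the distribution of later draws, which is exactly the i.i.d. assumption.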

Within the empirical risk minimization framework, the learning process corresponds to the process of adjusting the parameters of the learning machine. When the learning machine encounters an environmental event, it is assumed to receive a penalty for having inappropriate parameter values. This penalty is computed by the learning machine’s loss function. That is, the penalty loss received by the learning machine for having a particular set of parameter values will vary depending upon the particular environmental event experienced by the learning machine. More precisely, the received penalty loss is a function of both the parameter values and the event.

The goal of the learning process is to find parameters of the learning machine that minimize the average penalty loss with respect to the data generating process which is generating the observations. This average penalty loss is called the “risk”. In practice, the learning machine never gets the opportunity to obtain absolute knowledge of the data generating process but only observes a sample from the data generating process. Suppose the data generating process generates 1000 events. Then the learning machine’s estimate of the average penalty loss is based not upon an infinite number of events but only upon those 1000 events. This estimate of the average penalty loss, or “risk”, based upon 1000 events rather than an infinite number of events is called the “empirical risk”.
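As a minimal numerical sketch of the empirical risk idea, assume a squared-error penalty loss for a one-parameter linear prediction model. The loss function and events here are illustrative assumptions, not the book’s specific examples:

```python
# Sketch: the empirical risk is the average penalty loss over a finite sample.

def loss(theta, event):
    """Penalty loss for parameter value theta on one event (x, y):
    squared error of the linear prediction theta * x (illustrative)."""
    x, y = event
    return (y - theta * x) ** 2

def empirical_risk(theta, observed_events):
    """Average penalty loss over the observed sample of events."""
    return sum(loss(theta, e) for e in observed_events) / len(observed_events)

sample = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # hypothetical (x, y) events
# A parameter value near 2 yields a smaller empirical risk than one near 0,
# because the sample was roughly generated by y = 2x.
```

With infinitely many events, the same average would converge to the true risk; with a finite sample, it is only an estimate of that risk.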

The learning algorithm is designed such that the parameter values of the learning machine are modified during the learning process as the learning machine experiences environmental events. Each time the parameter values of the learning machine are modified, the empirical risk, or equivalently the estimated average penalty loss, decreases. Therefore, as the learning process continues, the learning machine refines its parameter values so that they minimize the empirical risk function.

This relationship between the updating of the parameter values and the decrease in the value of the empirical risk function is an emergent property of the learning mechanism. Typically, the method of gradient descent or one of its variants is used to design both adaptive learning algorithms, which update the parameter values of the learning machine as it acquires experience, and batch learning algorithms, which update the parameter values of the learning machine using all of the data at once. Adaptive and batch learning algorithms are reviewed in Episode 16 and Episode 68 of Learning Machines 101.
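A hedged sketch of the batch and adaptive variants of gradient descent for a squared-error empirical risk; the loss, data, and step sizes are illustrative assumptions rather than the book’s algorithms:

```python
# One-parameter linear model with squared-error loss (y - theta*x)^2.

def gradient(theta, event):
    """d/d(theta) of the loss (y - theta*x)^2 for a single event."""
    x, y = event
    return -2.0 * x * (y - theta * x)

def batch_gradient_descent(theta, data, step, iters):
    """Each update uses the gradient averaged over ALL of the data."""
    for _ in range(iters):
        g = sum(gradient(theta, e) for e in data) / len(data)
        theta -= step * g
    return theta

def adaptive_gradient_descent(theta, data, step, sweeps):
    """Each update uses one event at a time, as experience is acquired."""
    for _ in range(sweeps):
        for event in data:
            theta -= step * gradient(theta, event)
    return theta

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated by y = 2x
theta_batch = batch_gradient_descent(0.0, data, step=0.05, iters=200)
theta_adaptive = adaptive_gradient_descent(0.0, data, step=0.05, sweeps=200)
# Both variants drive theta toward 2, the empirical risk minimizer.
```

Both variants decrease the same empirical risk; they differ only in whether a parameter update waits for the whole data set or happens after each event.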

So this is basically how an empirical risk learning machine works. How is the empirical risk learning machine framework related to the major types of machine learning algorithms? It is convenient to view machine learning algorithms as belonging to one of three different learning categories.

Supervised learning algorithms learn from events that are typically partitioned into two feature vectors: the input pattern and the desired response to that input pattern. The goal of the supervised learning machine is to minimize an empirical risk function which measures the discrepancy between predicted and desired responses. Within the empirical risk minimization framework, the penalty loss received by the learning machine for a particular set of parameter values is determined by an environmental event consisting of an input pattern vector and a desired response pattern vector.

Unsupervised learning algorithms simply process events generated by the environment with the goal of learning a parameter vector which reduces the empirical risk. Each environmental event is represented by a feature vector. The learning machine seeks a parameter vector that minimizes the empirical risk, which means it receives a greater penalty for experiencing a novel environmental event than a familiar one. That is, the goal of the unsupervised learning machine is to adjust its parameter values so as to avoid punishment for failing to recognize familiar patterns. Within the empirical risk minimization framework, the penalty loss received by the learning machine for a particular set of parameter values is determined by an environmental event represented by a pattern vector.
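One minimal unsupervised example (an illustrative assumption, not one of the chapter’s specific examples): take the parameter vector to be a single “prototype” and the penalty loss to be the squared distance from an observed event to that prototype. The empirical risk is then minimized by the sample mean, so familiar events incur small losses and novel events incur large ones:

```python
# Sketch: unsupervised ERM with a prototype parameter vector.

def loss(prototype, event):
    """Squared distance from an event to the prototype (illustrative loss)."""
    return sum((e - p) ** 2 for e, p in zip(event, prototype))

def empirical_risk(prototype, events):
    """Average penalty loss over the observed events."""
    return sum(loss(prototype, e) for e in events) / len(events)

events = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2)]  # familiar cluster near (1, 1)

# The sample mean minimizes this particular empirical risk.
mean = tuple(sum(coords) / len(events) for coords in zip(*events))
# A typical event near the prototype is penalized less than a novel one.
```

The “punishment for novelty” interpretation is visible here: once the prototype has settled near the familiar cluster, an event far from the cluster produces a large penalty loss.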

Reinforcement learning assumes that a particular environmental event consists of a series of state vectors. The data generating process first generates an initial state vector specifying the initial conditions for the environmental episode. The learning machine then uses its current parameter values to generate an action vector based upon that initial environmental state vector. The learning machine’s action vector modifies its statistical environment, which causes the statistical environment to generate the next environmental state vector from the action vector and the previous environmental state vector. This perception-action cycle, in which the learning machine uses the current environmental state to generate an action and the environment then generates a new environmental state based upon that action, continually repeats itself until the end of the episode. The resulting sequence of environmental state vectors intermixed with the learning machine’s actions constitutes an episode. It is assumed that episodes are independent and identically distributed. Within the empirical risk minimization framework, the penalty loss received by the learning machine for a particular set of parameter values is determined by this episode, which is cooperatively constructed through interactions of the learning machine with its environment. Note that the statistical environment for this case of reinforcement learning is special since its characteristics actually depend upon the behavior of the learning machine! A discussion of policy gradient reinforcement learning algorithms of this type can be found in Learning Machines 101 Episode 63: Reinforcement Learning.
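The perception-action cycle that constructs one episode can be sketched as follows; the environment dynamics, action rule, and episode length here are all hypothetical:

```python
import random

random.seed(1)

def action_rule(theta, state):
    """Learning machine maps the current state to an action using theta."""
    return theta * state

def environment(state, action):
    """Next state depends on both the previous state and the machine's
    action, so the statistical environment is shaped by the machine."""
    return 0.5 * state + 0.1 * action + random.gauss(0.0, 0.01)

def generate_episode(theta, episode_length=5):
    """One episode: initial state, then (action, state) for each cycle."""
    state = random.gauss(0.0, 1.0)      # initial state from the DGP
    episode = [state]
    for _ in range(episode_length):
        a = action_rule(theta, state)   # perception -> action
        state = environment(state, a)   # action -> next state
        episode.extend([a, state])      # states intermixed with actions
    return episode

episode = generate_episode(theta=0.3)
```

Because `theta` enters the episode through every action, the distribution over episodes depends on the learning machine’s parameters, which is exactly what makes this statistical environment special.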

So this is the empirical risk minimization framework in a nutshell. The majority of popular machine learning algorithms can be naturally viewed as empirical risk minimization algorithms although there are a few notable and important exceptions. The main point of this first chapter is to present a variety of widely used unsupervised learning, supervised learning, and reinforcement learning algorithms which can be viewed within a common empirical risk minimization framework. By viewing these algorithms within a common framework, methods for the analysis and design of an empirical risk minimization algorithm can be applied to a large class of machine learning algorithms.

Or as Miyamoto Musashi, a 17th century samurai warrior, has stated:

 “From one thing, know 10,000 things.”

[MUSICAL INTERLUDE]

This section of the podcast is designed to provide guidance for the student whose goal is to read Chapter 1. The contents of Chapter 1 are relatively straightforward, so simply reading the chapter from beginning to end is a reasonable strategy. But Chapter 1 contains a lot of information and many different types of algorithms. You might find it helpful to review the symbol notation at the beginning of the book, the notation reviewed in the preface, and the notation provided at the beginning of Chapter 4, as necessary. Don’t worry about how the derivatives were computed or about the concept of a vector or matrix derivative. Just skip the vector and matrix derivative calculations. This topic is discussed in Chapter 5. After you get through Chapter 5, you will be prepared to reread the sections involving vector and matrix derivatives.

The algorithms in Chapter 1 are provided as concrete examples designed to illustrate the more general abstract concepts. Don’t try to memorize the specific details of the algorithms. Instead, focus on understanding how the specific algorithms are implemented and how those algorithms can be viewed as minimizing empirical risk functions.

Make sure you understand the concept of an empirical risk function. Make sure you understand how to design an adaptive gradient descent algorithm and a batch gradient descent algorithm to minimize an empirical risk function. Make sure you understand the difference between supervised, unsupervised, and reinforcement learning algorithms. In reality, the distinctions between these classes of algorithms can be quite blurred but, for the purposes of this introduction, it is convenient to view these algorithm classes as distinct.

[MUSICAL INTERLUDE]

This section of the podcast is designed to provide guidance to the instructor whose goal is to teach Chapter 1. Remember these notes are available for download by simply visiting this episode on the website: www.learningmachines101.com. In addition, supplemental teaching materials (as they become available) will be available on the website: www.statisticalmachinelearning.com.

Begin the class by discussing the Statistical Machine Learning Framework in Figure 1.1. After explaining Figure 1.1, explain how viewing the problem within a statistical inductive framework differs from a deductive framework. For example, if you give a statistical learning machine a set of training vectors and then test its performance, it may perform perfectly on the training vectors it has learned yet nevertheless be inaccurate on novel feature vectors. The emphasis in the statistical learning framework is on generalization performance: classifying feature vectors that have never been seen before.

Spend about 5-10 minutes on Section 1.1.

The next topic should be how to represent information as a vector. This is also very important material and is covered in Section 1.2.1, which is titled “Feature Vectors”. Make sure you cover Section 1.2.1 carefully. Section 1.2.2, which covers the concept of stationarity, is also important, but I would rate it as slightly less important than Section 1.2.1. Section 1.2.3 provides the first brief mention of the topics of adaptive learning and batch learning. I also introduce the distinction between a “passive learning environment” and a “reactive learning environment”. The terminology “reactive learning environment” is not standard in the field, but I have introduced it as a pedagogical device. The basic idea is that, within the framework presented in this book, we assume the data is generated from an i.i.d. process with common density Pe, which we call the data generating process (DGP) density. If this data generating process density changes when the learning machine’s parameters change, then we call the DGP reactive. Otherwise, if the DGP density Pe is not functionally dependent upon the learning machine’s parameters, then we call the DGP passive.
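The passive versus reactive distinction can be illustrated with a tiny sketch, with hypothetical Gaussian densities standing in for Pe:

```python
import random

random.seed(0)

def passive_dgp(theta):
    """Passive DGP: the event density ignores the learning machine's
    parameter value theta (illustrative density)."""
    return random.gauss(0.0, 1.0)

def reactive_dgp(theta):
    """Reactive DGP: the event density shifts when the learning machine's
    parameter value theta changes (illustrative density)."""
    return random.gauss(theta, 1.0)

theta = 5.0
passive_mean = sum(passive_dgp(theta) for _ in range(2000)) / 2000
reactive_mean = sum(reactive_dgp(theta) for _ in range(2000)) / 2000
# Changing theta shifts the reactive DGP's samples but not the passive DGP's.
```

The reinforcement learning episodes discussed earlier are the natural example of a reactive DGP, since the machine’s actions change what it subsequently observes.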

Section 1.2.4 is titled “Prior Knowledge”. It consists of a few important ideas regarding how prior knowledge is incorporated into a machine learning algorithm. Make sure you emphasize that the feature vector representation is an important source of prior knowledge. Also emphasize that the architecture and loss function are important sources of prior knowledge. Incorporating prior knowledge into a machine learning algorithm is quite different from adding a few rules to a rule-based system, so providing these concrete examples is helpful. Clearly this section is not intended to be a list of all possible ways prior knowledge could be incorporated but rather is designed to briefly mention just a few ideas to help the student reconceptualize the prior knowledge concept within a statistical machine learning framework.

Section 1.3 introduces the Empirical Risk Minimization Framework. The best way to teach this is to introduce key ideas by first presenting Example 1.3.1 very slowly and very carefully. Explain the graphical notation. Explain everything. Once they understand this, note that this is easily generalizable to a deep layer feedforward network with 100 hidden layers by simply replacing the function which makes the prediction with a more complicated function.  This may seem obvious to you but it will be a revelation to many of the students.

Then present the nonlinear regression Example 1.3.2 to illustrate this. Then repeat yourself again by working through Example 1.3.3 for the students who dozed off or didn’t quite get Example 1.3.2.

Section 1.3.3 covers Regularization Terms. Explain carefully what a regularization term is. Explain that a regularization term can make performance on the training data worse, but the idea is that it will make performance on the test data better. Explain that the network is now attempting to improve its predictive performance while simultaneously imposing prior knowledge constraints on the network architecture. Give examples of L1 and L2 regularization terms.
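A minimal sketch of L1 and L2 regularization terms added to an empirical risk; the model, data, and regularization strength `lam` are illustrative assumptions:

```python
# Squared-error empirical risk for a one-parameter linear model, plus
# L1 and L2 regularization terms.

def empirical_risk(theta, data):
    return sum((y - theta * x) ** 2 for x, y in data) / len(data)

def l1_penalized_risk(theta, data, lam):
    """L1 term lam*|theta| encourages sparse (exactly zero) parameters."""
    return empirical_risk(theta, data) + lam * abs(theta)

def l2_penalized_risk(theta, data, lam):
    """L2 term lam*theta^2 encourages small-magnitude parameters."""
    return empirical_risk(theta, data) + lam * theta ** 2

data = [(1.0, 2.0), (2.0, 4.1)]  # hypothetical training events
# For any nonzero theta the penalized risk exceeds the training risk, which
# is the sense in which training performance can get worse under
# regularization.
```

Minimizing the penalized risk trades off fit to the training data against the prior knowledge constraint that parameters should be small or sparse.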

Section 1.3.4 covers batch gradient descent and adaptive gradient descent algorithms and briefly mentions that many variations of these exist. Explain the basic idea first. Then illustrate it with Example 1.3.5 and Example 1.3.6. Spend hardly any time discussing derivatives. Just tell the students “here is the derivative of this function”. We will learn how to compute these vector derivatives in Chapter 5.

Section 1.4 presents a particular approach to machine learning algorithm development. You may choose to present a different version of this approach. Regardless, the main point of this section is to help students appreciate that correct implementation of a machine learning algorithm in software is not a sufficient condition for the machine learning algorithm to be effective. For example, if the learning machine is not able to identify relevant statistical regularities in its environment, the learning machine will not work.

Section 1.5 discusses Supervised Learning. A discrepancy function is used to compare the predicted response of a learning machine to the desired response. Examples 1.5.1, 1.5.2, and 1.5.3 introduce different discrepancy functions appropriate for numerical, binary, and categorical targets respectively.  The hidden unit concept is introduced in the context of supervised learning in this section but it is clearly applicable to unsupervised and reinforcement learning applications. Both feedforward and recurrent networks are discussed.
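One way to sketch the three kinds of discrepancy functions: these are the standard squared-error and cross-entropy choices and may differ in detail from the chapter’s Examples 1.5.1 through 1.5.3:

```python
import math

def squared_error(predicted, desired):
    """Numerical targets: squared difference between responses."""
    return (predicted - desired) ** 2

def binary_cross_entropy(p, desired):
    """Binary targets: desired is 0 or 1, p is the predicted probability
    that the target equals 1."""
    return -(desired * math.log(p) + (1 - desired) * math.log(1 - p))

def categorical_cross_entropy(probs, desired_index):
    """Categorical targets: probs is a predicted probability distribution
    over the classes, desired_index is the correct class."""
    return -math.log(probs[desired_index])

# A confident correct prediction is penalized less than an uncertain one.
```

In each case the discrepancy plays the role of the penalty loss, so supervised learning with any of these targets fits the same empirical risk minimization template.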

Section 1.6 discusses Unsupervised Learning. Example 1.6.2, the Denoising Autoencoder, is an important example which should definitely be discussed. Another important example is Example 1.6.5, which provides a general framework for the classical clustering analysis methods developed in the 1970s.

Section 1.7 discusses Reinforcement Learning. Definitely explain both Value Function Reinforcement Learning in Section 1.7.2 and Policy Gradient Reinforcement Learning in Section 1.7.3.

[MUSICAL INTERLUDE]

This final section is designed mainly for search engines and web spider robots, but humans will find it relevant as well.

The objective of Chapter 1 is to introduce the empirical risk minimization framework and show how a large class of commonly used supervised, unsupervised, and reinforcement machine learning algorithms can be naturally represented within this framework. The method of gradient descent is also introduced as a general purpose method for designing machine learning algorithms for minimizing empirical risk functions. Empirical risk functions are constructed for representing a large class of important supervised learning machines including linear regression, logistic regression, multilayer perceptrons, deep learning architectures, and recurrent neural network architectures.  Empirical risk functions are also constructed for a large class of important unsupervised learning machines including the nonlinear denoising autoencoder, K-means clustering, latent semantic indexing, general dissimilarity measure clustering algorithms, stochastic neighborhood embedding algorithms, and Markov logic nets.  And finally, empirical risk functions are constructed for supporting value function reinforcement learning and policy gradient reinforcement learning algorithm design of a large class of linear and nonlinear machine learning algorithms.

[MUSICAL INTERLUDE]

As I said before, my book is not yet published. The current plan is to publish the book early in the year 2021.  If you are a member of the Learning Machines 101 community, make sure to watch your email because when the book is released I want to provide special discounts to members of the Learning Machines 101 community so you won’t have to pay full price for my book. In addition, I will send out more details about the availability of my book as we move closer to the publication date!!!

If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time. This will allow me to keep in touch with you and keep you posted on the release of new podcasts and the availability of the book which I am currently writing. You can update your user profile when you receive the email newsletter by simply clicking on the “Let us know what you want to hear” link!

In addition, I encourage you to join the Statistical Machine Learning Forum on LINKED-IN. When the book is available, I will be posting information about the book as well as information about book discounts in that forum.

And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”! Also please visit us on ITUNES and leave a review. You can do this by going to the website: www.learningmachines101.com and then clicking on the ITUNES icon. This will be very helpful to this podcast! Thank you again so much for listening to this podcast and participating as a member of the Learning Machines 101 community! 

Related Learning Machines 101 Podcasts and Further Reading:
