LM101-086: Ch8: How to Learn the Probability of Infinitely Many Outcomes

July 20, 2021

Episode Summary:

This 86th episode of Learning Machines 101 discusses the problem of assigning probabilities to a possibly infinite set of outcomes in the space-time continuum which characterizes our physical world. Such a set is called an “environmental event”. The machine learning algorithm uses information about the frequency of environmental events to support learning. If we want to study statistical machine learning, then we must be able to discuss how to represent and compute the probability of an environmental event. It is essential that we have methods for communicating probability concepts to other researchers, methods for calculating probabilities, and methods for calculating the expectation of specific environmental events. This episode discusses the challenges of assigning probabilities to events when we allow for the case of events comprised of an infinite number of outcomes. Along the way we introduce essential concepts for representing and computing probabilities using mathematical tools from measure theory, such as sigma-fields and the Radon-Nikodym probability density function. Near the end we also briefly discuss the intriguing Banach-Tarski paradox and how it motivates the development of some of these special mathematical tools.

Show Notes:

Hello everyone! Welcome to the 86th podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.

This particular podcast is actually the ninth episode in a special series of episodes designed to provide commentary on my new book. The book’s title is “Statistical Machine Learning: A unified framework”, published by CRC Press in their “Texts in Statistical Science” series. The basic idea of the book is that it is designed to provide mathematical tools from the fields of nonlinear optimization theory and mathematical statistics for the analysis and design of machine learning algorithms. If you have an undergraduate degree in computer science, engineering, or math, then this book is a great way to get up to speed with some of the most important and widely used mathematical tools employed by top experts in machine learning. You can learn more about my new book by visiting the website: www.statisticalmachinelearning.com.

Each podcast in this special series describes a chapter from my new book. Each podcast is typically divided into four sections. The first section provides an easy-to-understand self-contained overview of the key concepts from the relevant chapter. The second section is specifically designed to provide guidance to the human student who is using the book for self-study. The third section is designed to provide guidance for the human instructor who is using the book for instruction. The fourth section is designed to provide guidance for robot web spiders (also called web spiderbots or web crawlers) so they can find this podcast and properly index it for humans. But even if you are not a robot web spider…you shouldn’t feel left out! Humans familiar with the relevant terminology will also find this fourth section of the podcast useful for a quick technical overview of what is covered in the book chapter.

Finally, it is important to appreciate that there are many components associated with the development and application of artificial intelligence. The Learning Machines 101 podcast series focuses upon introducing tools and techniques designed to provide a deeper understanding of machine learning algorithms. However, algorithm design should not be done in a vacuum. The best machine learning algorithm designers try to develop a clear understanding of the intended use of the algorithm and how it will be deployed in practical applications. This understanding can greatly influence the actual design of the machine learning algorithm as well as the development of the methods for assessing the effectiveness of the machine learning algorithm.

However, the analysis and design of machine learning algorithms is only one of many components associated with deploying an artificially intelligent system.  At the end of this podcast, I will briefly introduce a great podcast called AI Today which is produced by Cognilytica. The  AI Today podcast provides a nice overview of the issues and challenges of developing artificial intelligence algorithms as well as deploying those algorithms in practical settings.



For now, however, let’s return to the main focus of this particular episode, which covers the material in Chapter 8 of my new book titled “Statistical Machine Learning: A unified framework.” Chapter 8 is titled “Random Vectors and Random Functions.”

[MUSICAL INTERLUDE]

Overview of Book Chapter 8

The essential idea behind statistical machine learning is that the learning machine finds itself in an environment where a specific event in the environment occurs with a particular probability. The machine learning algorithm uses information about the frequency of events in its environment in order to learn to make appropriate decisions and inferences. If we want to study statistical machine learning, then we must be able to discuss how to represent and compute the probability of an environmental event. It is absolutely essential that we have methods for communicating probability concepts to other researchers, methods for calculating probabilities, and methods for calculating the expectation of specific environmental events.

Some environmental events are naturally represented as discrete space-time chunks of the environment. So, for example, consider learning machines whose statistical environments involve learning to play a game of checkers, solve logic and scheduling problems, translate a sentence from one language such as English into another language such as Japanese, or retrieve a document from a database. Problems of these types naturally represent the inputs to the learning machine as well as the outputs of the learning machine using symbolic representations. That is, there are only a finite number of inputs to the learning machine and a finite number of outputs. Of course these finite numbers are quite large. The number of possible board positions in a game of checkers is about 500 quintillion (that is, 5 followed by 20 zeros), but it is still finite. So we want to have a method for assigning a probability to each of these 500 quintillion possible positions as well as assigning probabilities to different combinations of positions. This is called the case where we have a FINITE number of environmental outcomes, despite the fact that some might argue that the number 500 quintillion is pretty close to infinity!

But this is not sufficiently general for many practical problems. We also need to handle the case of continuous space-time chunks. That is, we need to deal with the problem of modeling the representation of an INFINITE NUMBER OF ENVIRONMENTAL OUTCOMES. In this case, we need to somehow figure out a way to assign a probability to any arbitrary space-time event without first dividing the space-time continuum into a finite set of chunks. An example of this situation would be an intelligent control system where the heart rate of a patient and other vital signs of a patient are monitored in real-time and medication dosages are provided to the patient as necessary in real-time. Another example would be an intelligent control system for a self-driving car, aircraft, helicopter, or spacecraft which monitors in real-time continuous inputs from its environment such as physical position, velocity, and acceleration and makes real-time course corrections so the vehicle can safely reach its desired destination. This is called the case of learning in a statistical environment where an event consists of an UNCOUNTABLY INFINITE number of environmental outcomes.

Furthermore, there are many cases where environmental events have BOTH discrete space-time and continuous space-time properties. Problems of these types involve processing real-time audio signals such as translating the electrical signals generated by a microphone into verbal codes, determining if an image represented by pixel intensities contains a specific object, or translating a verbal command to a self-driving car such as “Take me to the nearest gas station” into a sequence of real-time signals which control the steering wheel, brakes, and accelerator of the car as the car receives images and radar signals from its environment over time. In these cases, a particular environmental outcome consists of some components which are CONTINUOUS and some components which are DISCRETE. So we will need special mathematical tools for representing the probabilities of these more complicated state spaces.

So let’s begin with the simplest case in which we only have a finite number of possible environmental outcomes. This is pretty straightforward: we assume the environment generates a numerical vector with a specific probability. For example, the numerical vector might specify the intensity values of the pixels in an image or the results of a Fourier analysis of an acoustical waveform. This mathematical model of the statistical environment is naturally represented by a probability mass function whose domain is a finite number of vectors and whose range is a number between 0 and 1. The domain of the probability mass function is called the SAMPLE SPACE. Each vector in the sample space is called an OUTCOME. Another way of thinking of this setup is that every possible vector that the learning machine observes in either training data or test data must be an outcome in the sample space.

So we start with the sample space, which for now consists of a finite number of outcomes, and we specify every possible subset of outcomes in the sample space. Some subsets will contain several outcomes while other subsets may contain exactly one outcome. Even if the sample space is very small, the number of subsets of the sample space can be quite large. For example, consider a machine learning algorithm whose objective is to identify an object in an image. Assume that there are 36 possible objects for the machine learning algorithm to consider. Then the goal is to assign a probability to each possible object. This is achieved using a special function which is called a PROBABILITY MASS FUNCTION. The probability mass function simply takes as input an object and returns as an output a number between zero and one.

In many applications, we will want to compute the probability of different types of events. For example, one event would be that the machine learning algorithm identifies the object in the image as the object DOG, another event might be that the machine learning algorithm identifies the object in the image as the object CAT, and a third event might be that the machine learning algorithm identifies the object in the image as a TRUCK. The environment generates the object DOG with probability PDOG, the object CAT with probability PCAT, and the object TRUCK with probability PTRUCK. Using probability theory, we can calculate the probability that the environment will generate either a CAT or a DOG. This is achieved by simply adding the probability of the DOG object to the probability of the CAT object. Note therefore that the probability mass function implicitly assigns a probability to every subset of the sample space. The explicit function which takes as input a subset of the sample space and returns a probability for that subset is called a PROBABILITY MEASURE. Note that the domain of a probability measure gets big very quickly. If we have 36 possible object outcomes in the sample space, then there will be about 68.7 billion subsets of outcomes (2 raised to the 36th power). We define an EVENT as a subset of the sample space or, equivalently, as a set of outcomes.
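To make the distinction between a probability mass function and a probability measure concrete, here is a minimal sketch in Python. The three object labels and the particular probabilities are invented for illustration; the point is only that the probability measure on events is induced by summing the probability mass function over the outcomes in an event.

```python
# A minimal sketch (with invented probabilities) of a probability mass function
# on a finite sample space and the probability measure it induces on events.

sample_space = {"DOG", "CAT", "TRUCK"}

# Probability mass function: each outcome gets a number between 0 and 1,
# and the numbers sum to 1.
pmf = {"DOG": 0.5, "CAT": 0.3, "TRUCK": 0.2}

def probability_of_event(event, pmf):
    """Probability measure: maps an event (a subset of the sample space)
    to a number by summing the mass assigned to each outcome in the event."""
    return sum(pmf[outcome] for outcome in event)

print(probability_of_event({"DOG", "CAT"}, pmf))  # 0.8 = PDOG + PCAT
```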

It is also often convenient to define the concept of a “random vector”. A “random vector” is actually a DETERMINISTIC function. That is, it is not random at all! The easiest way to visualize the concept of a random vector is to think of a random vector as a “feature detector”. Something occurs in the environment and the random vector “detects” the occurrence of that something and generates a vector of numbers as its output. The random vector itself is not random at all; rather, the randomness comes from the environment, which generates the outcomes that the random vector maps into numbers. When the sample space is finite, this type of random vector is called a DISCRETE RANDOM VECTOR.

When the sample space contains an uncountably infinite number of outcomes, the story becomes more complicated. For example, suppose we have an outcome such as a vector whose elements specify the intensities of different pixels in an image, or a vector which specifies the physical position of a robot arm. These situations are most naturally represented by an infinite sample space. More technically, we would take the sample space in these cases to be a vector space. So a 100-dimensional vector space would consist of every possible 100-dimensional vector of numbers you could possibly imagine. The vector space would include vectors whose elements could contain numbers such as negative 1005 or positive 7.1345.

We can’t simply assign a small positive probability to each element in the vector space because, no matter how small that probability is, when we add up all of the probabilities we will get a number which is greater than one. In fact, we will get infinity! This forces us to realize that the probability of any particular outcome in this sample space must be equal to zero. So, for example, the probability that the robot arm’s position is exactly 1.5327894 inches must be zero.

But if we assign a zero probability to each outcome in the sample space, then we realize that we can’t just add up the probabilities of the outcomes in a subset of the sample space, because then the probability of that subset would always be zero. The solution is to define the concept of a probability density function, which more technically is called an ABSOLUTELY CONTINUOUS probability density function. The probability density function assigns either the number zero or a positive number to each element of the sample space, but that positive number is not interpretable as a probability.

To solve these problems, the probability of a particular subset of the sample space is then defined as the integral of the probability density function over that region of the sample space. Instead of talking about the probability of a particular physical position of the robot arm, we talk about the probability that the robot arm’s physical position is in a particular range of physical positions.
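As an illustration of this idea, here is a small sketch that, purely for concreteness, models the robot arm’s position with a Gaussian probability density; the mean and standard deviation are invented. The probability of any exact position is zero, while the probability of a range of positions is obtained by integrating the density over that range.

```python
# A hedged illustration: if we model the robot arm's position with a Gaussian
# probability density (invented mean and standard deviation), the probability
# of any single exact position is zero, but the probability that the position
# falls in a range is the integral of the density over that range.
from scipy.stats import norm
from scipy.integrate import quad

arm_position = norm(loc=1.5, scale=0.1)  # hypothetical density for the arm position

# Probability the arm is between 1.4 and 1.6 inches: integrate the density.
p_interval, _ = quad(arm_position.pdf, 1.4, 1.6)
print(p_interval)                                     # about 0.683
print(arm_position.cdf(1.6) - arm_position.cdf(1.4))  # same number via the CDF

# The "probability" of the exact position 1.5327894 is zero,
# even though the density at that point is positive.
print(quad(arm_position.pdf, 1.5327894, 1.5327894)[0])  # 0.0
```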

We can define the concept of a random vector in this case as well but we need to be a little more careful and place some restrictions on the function which specifies the random vector. A sufficient condition for a function to be a random vector in this case of continuous random vectors is that the function is a continuous function. We refer to random vectors of this type as CONTINUOUS RANDOM VECTORS or ABSOLUTELY CONTINUOUS RANDOM VECTORS.

You might think this solves all of our problems, but we are not done yet, because suppose we have a random vector which is the concatenation of a continuous random vector and a discrete random vector. We refer to a random vector of this type as a MIXED RANDOM VECTOR. Mixed random vectors occur all of the time in machine learning. For example, suppose we have a vector which contains numerical variables specifying the physical position of a robot arm as well as a variable which takes on a finite number of possible values and indicates the label or name of the object which is being held by the robot arm. Or, consider a random vector which contains one subvector which is a list of all of the pixel intensities in an image represented by numerical values and another subvector which can only take on a finite number of values and specifies the object displayed in the image. Note that the probability distribution of such a random vector cannot be specified by a probability mass function and cannot be specified by an absolutely continuous probability density function. This corresponds to the case where the probabilities of some outcomes in the sample space are zero while the probabilities of other outcomes in the sample space are positive.

In the mixed random vector case, a vector outcome in the sample space may be assigned a positive probability, yet, as in the absolutely continuous random vector case, we still require the probability of any subset of outcomes, which may contain infinitely many outcomes, to be a number between zero and one. Thus, we can’t simply add up the probabilities of all of the outcomes in a subset of outcomes to obtain the probability of that subset. Also, we can’t integrate over the subset of outcomes using the ordinary Riemann integral. Instead we use a new integral called the Lebesgue integral, which includes the ordinary Riemann integral and the simple adding up of probabilities with a summation as special cases.

Finally, there is one additional little complication which needs to be at least briefly discussed. When we assign probabilities to finite sets of outcomes, it is easy to construct a probability mass function which allows us to compute the probability of a set of outcomes by simply adding up the probability assigned to each outcome in the set. However, when the set of outcomes is uncountably infinite, it turns out that it is not easy to construct a probability density function which has the property that the integral of that density function over a set of infinitely many outcomes is equal to the probability of interest. To solve this problem, we only assign probabilities to well-behaved sets. Informally, a “well-behaved” set is a member of a collection of sets called a SIGMA-FIELD. A SIGMA-FIELD is simply a collection of subsets of the sample space, and each element of a sigma-field is an EVENT. And we want to assign probabilities to events. But the SIGMA-FIELD has a few other constraints. For example, every countable intersection or union of events in the sigma-field generates another event which is also required to be a member of that sigma-field. Also, if an event is in the sigma-field, then its complement is a member of the sigma-field. This basically ensures that every imaginable subset of outcomes relevant in a practical engineering application is an event in the sigma-field, so we don’t lose any generality with this approach. The probability density which is constructed over these well-behaved sets is called a RADON-NIKODYM PROBABILITY DENSITY function. It can be shown that such a probability density function exists under very general conditions by the famous Radon-Nikodym theorem. The Radon-Nikodym probability density includes the probability mass function and the absolutely continuous probability density function as special cases, but it is more general since it can be used to compute the probability that a mixed random vector takes on values in some subset of the sample space.

So, in summary, the nice thing about the Radon-Nikodym probability density function is that it does exactly what we need it to do. If we have a high-dimensional mixed random vector which has both continuous and discrete random variables, we can specify the probability distribution of the mixed random vector by a Radon-Nikodym probability density which maps a particular realization of the mixed random vector into a non-negative number. We can then compute expectations with respect to the Radon-Nikodym probability density by integrating with respect to the density using the Lebesgue integral.
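Here is a hedged sketch of that computation for a simple two-component mixed random vector: a discrete label L and a continuous coordinate X. All of the probabilities, densities, and the function g whose expectation is computed are invented for illustration; the point is that the integral with respect to the density factors into a sum over the discrete coordinate combined with an ordinary integral over the continuous coordinate.

```python
# A sketch of computing an expectation for a mixed random vector (L, X), where
# L is discrete (an object label) and X is continuous (e.g., an arm position).
# The specific probabilities, means, and the function g are invented.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

label_pmf = {0: 0.3, 1: 0.7}               # hypothetical marginal pmf for the label L
conditional_density = {0: norm(0.0, 1.0),  # hypothetical density of X given L = 0
                       1: norm(2.0, 1.0)}  # hypothetical density of X given L = 1

def g(label, x):
    """An arbitrary function of the mixed random vector whose expectation we want."""
    return (x - label) ** 2

# E[g(L, X)] = sum over labels k of P(L = k) * integral of g(k, x) p(x | k) dx
expectation = sum(
    p_k * quad(lambda x, k=k: g(k, x) * conditional_density[k].pdf(x),
               -np.inf, np.inf)[0]
    for k, p_k in label_pmf.items()
)
print(expectation)  # 0.3 * E[(X-0)^2 | L=0] + 0.7 * E[(X-1)^2 | L=1] = 0.3*1 + 0.7*2 = 1.7
```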

Finally, I would like to end with a comment which is not really central to the discussion but is certainly intriguing and fascinating. As previously mentioned, probabilities are assigned to sets of outcomes and (in general) such sets must be “well-behaved” sets. What types of things might go wrong if we simply allowed probabilities to be assigned to any arbitrary subset of the sample space? One famous example of the problems which arise when we try to assign volumes to arbitrary subsets of a vector space is the Banach-Tarski Theorem.

Note that when we are computing the probability of a set by integrating the probability density function over that set we are basically computing a type of generalization of the area or VOLUME of the set. Integration is the computation of the area or volume under a curve. So it is fundamentally important that volume is preserved when we divide that set into pieces.

The Banach-Tarski Theorem states that a solid sphere can be divided into a finite number of pieces and then those pieces can be rearranged to form two solid spheres, each with the same volume, weight, and shape as the original sphere. At first glance this seems impossible: it suggests that you could take a solid ball of gold, cut it up into a few pieces, rearrange those pieces, and get two solid balls of gold, each with the same volume as the original ball of gold!! This might seem, at first glance, to imply that mathematicians have figured out a quick and easy way to make a lot of money!!! But of course there is a catch!!

The key to understanding the Banach-Tarski Theorem is that the theorem is about an idealized object which contains an infinite number of points. Thus, the theorem won’t work with a real-life ball of gold, which is made up of a finite number of atoms. Furthermore, note that the Banach-Tarski conclusion only holds because the construction uses pieces which are not “well-behaved” sets, or equivalently, sets which are not members of a Borel sigma-field. That is why the theory of assigning probabilities to sets for environmental events which have both discrete and continuous properties requires that the sets be well-behaved members of a Borel sigma-field.

[MUSICAL INTERLUDE]
How to Read Book Chapter 8

This section of the podcast is designed to provide guidance to the student who is taking a course based upon this book or is using this book for self-study. Chapter 8 is quite different from many of the previous chapters because the focus of Chapter 8 is to provide you with tools for communicating ideas about probabilities. Later mathematical theorems in this book which analyze the behavior of adaptive learning algorithms and generalization performance will use the notation and ideas introduced in Chapter 8. In addition, the notation in Chapter 8 is widely used in the machine learning literature at top scientific conferences such as NeurIPS. Unlike many of the other chapters in this book, it is best to read the entire chapter and not focus too much on the exercises on a first reading. Spend a lot of time reading and studying the definitions and examples very carefully. Do not try to memorize anything. Read a definition and read the examples following the definition. Then stop and try to explain the concept in your own words. Section 8.3 is optional reading and Recipe Box 8.1 is optional as well. I would skip Section 8.3 and make sure you understand the rest of Chapter 8. After you have mastered all of Chapter 8 except for Section 8.3, then you can read Section 8.3, which essentially attempts to fill in missing gaps.

[MUSICAL INTERLUDE]

How to Teach Book Chapter 8

This section of the podcast is designed to provide guidance to the instructor whose goal is to teach Chapter 8. Remember these notes are available for download by simply visiting this episode on the website: www.learningmachines101.com. In addition, supplemental teaching materials and errata for the book (as they become available) will eventually be available on the website: www.statisticalmachinelearning.com. These class notes are designed for advanced undergraduate computer science students or first year computer science graduate students.

I like to begin by focusing on probability mass functions. The probability mass function is a relatively familiar concept to students and is a great starting point. Explain how the probability mass function concept can be used to model a data generating process which generates a sequence of independent and identically distributed random vectors with some common probability mass function. I find it very helpful to draw a picture of a hat on the whiteboard; inside the hat is a fixed number of vectors which are labeled with superscripts. The contents of the hat are defined as the sample space. Now we sample with replacement from the hat, and the vectors which are generated as a result of this experiment are labeled with subscripts. So I spend some time making clear that everybody is on the same page regarding notation: the realization of the ith random vector in a sequence of random vectors can take on the jth outcome in the sample space.
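A minimal sketch of the hat metaphor, with invented vectors and probabilities, might look like this: the sample space is a small fixed set of vectors (the superscripted vectors in the hat), and the training data are produced by sampling with replacement (the subscripted realizations).

```python
# A minimal sketch of the "hat" metaphor: the sample space is a small, fixed
# set of vectors (labeled with superscripts in the lecture), and training data
# are generated by sampling with replacement (labeled with subscripts).
# The vectors and their probabilities are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

sample_space = np.array([[0.0, 1.0],   # outcome x^(1)
                         [1.0, 0.0],   # outcome x^(2)
                         [1.0, 1.0]])  # outcome x^(3)
pmf = np.array([0.2, 0.5, 0.3])        # probability of each outcome

# Draw n i.i.d. realizations x_1, ..., x_n by sampling with replacement.
n = 5
indices = rng.choice(len(sample_space), size=n, p=pmf, replace=True)
data = sample_space[indices]
print(data)
```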

Introduce the concept of a random variable and a random vector as a deterministic function defined on the sample space which maps outcomes into numerical values or vectors. Spend a little time illustrating how this works. (This discussion corresponds to Example 8.0.1 and Example 9.1.2 from Chapter 9.)

Then write down the empirical risk function in terms of the notation introduced above. Show that the expected value of the empirical risk can be explicitly calculated, and provide that explicit formula as a summation weighted by the probability mass function defined by the data generating process. Emphasize that the probability mass function which generates the data is never observable, but it is essential for communicating in an explicit way what the empirical risk function is attempting to estimate (see Example 8.4.1, but just the probability mass function part). Show that the variance of this estimator approaches zero as the sample size increases if you wish, but this will be discussed further in Chapter 9.
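A hedged sketch of this point: the loss function, parameter value, outcomes, and probabilities below are all invented, but they show that the expected risk is an explicit pmf-weighted sum while the empirical risk is a sample average over data drawn from that (unobservable) pmf.

```python
# Expected risk as an explicit pmf-weighted sum versus the empirical risk as a
# sample average. All specific numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)

outcomes = np.array([0.0, 1.0, 2.0])   # finite sample space of scalar outcomes
pmf = np.array([0.2, 0.5, 0.3])        # data generating pmf (never observable in practice)

def loss(x, theta):
    return (x - theta) ** 2             # a simple squared-error loss

theta = 1.0
expected_risk = np.sum(pmf * loss(outcomes, theta))        # explicit formula
sample = rng.choice(outcomes, size=10_000, p=pmf)          # i.i.d. training data
empirical_risk = np.mean(loss(sample, theta))              # sample-average estimate
print(expected_risk, empirical_risk)   # the empirical risk approximates the expected risk
```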

Similarly, write down an adaptive learning algorithm and show how to write down an explicit form for the expected search direction in an adaptive learning algorithm. Show that the variance of the error term is proportional to the learning rate (this is Example 8.4.6, but just the probability mass function version).
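For the expected search direction, a sketch along the same lines (again with an invented pmf, loss, and learning rate) compares the explicit pmf-weighted formula with the average of the stochastic per-example search directions.

```python
# The expected search direction for a stochastic gradient step on a squared-error
# loss, compared with simulated per-example search directions. The pmf, loss,
# and learning rate are invented for illustration.
import numpy as np

rng = np.random.default_rng(2)

outcomes = np.array([0.0, 1.0, 2.0])
pmf = np.array([0.2, 0.5, 0.3])

def gradient(x, theta):
    return 2.0 * (theta - x)            # derivative of (x - theta)^2 with respect to theta

theta, learning_rate = 0.0, 0.1
expected_direction = -np.sum(pmf * gradient(outcomes, theta))  # explicit pmf-weighted formula

# One stochastic step uses the gradient evaluated at a single sampled outcome.
stochastic_directions = -gradient(rng.choice(outcomes, size=10_000, p=pmf), theta)
print(expected_direction, stochastic_directions.mean())
# The per-step update is learning_rate * stochastic_direction; shrinking the
# learning rate shrinks the random fluctuation of the update about its mean.
print(np.var(learning_rate * stochastic_directions))
```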

Then discuss the Chebyshev and Hoeffding inequalities using only probability mass functions (Examples 8.5.1 and 8.5.2, but just the probability mass function versions).
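A small sketch of how these inequalities are used in practice: for a sample average of independent losses assumed to lie in the unit interval, the Chebyshev and Hoeffding bounds below limit the probability that the empirical average deviates from its expectation by more than a chosen tolerance. The tolerance, variance bound, and sample sizes are invented for illustration.

```python
# Chebyshev and Hoeffding bounds for a sample average of i.i.d. bounded random
# variables (e.g., per-example losses in [0, 1]). All numbers are invented.
import numpy as np

epsilon = 0.05          # desired accuracy of the empirical average
loss_range = 1.0        # losses assumed to lie in an interval of width 1
variance_bound = 0.25   # worst-case variance of a [0, 1]-valued variable

for n in (100, 1_000, 10_000):
    chebyshev = variance_bound / (n * epsilon ** 2)
    hoeffding = 2.0 * np.exp(-2.0 * n * epsilon ** 2 / loss_range ** 2)
    print(n, min(chebyshev, 1.0), min(hoeffding, 1.0))
# Both bound the probability that the empirical average deviates from its
# expectation by more than epsilon; the Hoeffding bound decays much faster in n.
```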

Then introduce the familiar concept of a continuous probability density function which is technically an absolutely continuous probability density function. Give a concrete example such as the Gaussian probability density function. Show how the concepts previously introduced can be reworked using a continuous probability density function. (Example 8.0.2). Rework the expected risk example (Example 8.4.1) for the continuous random vector case and the expected search direction for the continuous random vector case (Example 8.4.6).

Then introduce a two-dimensional random vector whose first element is a discrete random variable and whose second element is a continuous random variable. Explain why the probability distribution of this random vector cannot be represented as a probability mass function and cannot be represented using the concept of an absolutely continuous probability density function. Also draw a picture of a linear artificial neuron with two inputs which computes a weighted sum of its inputs or, equivalently, a simple linear regression model which has two predictors. Make one input a discrete random variable and one input a continuous random variable, then ask what the distribution of the response variable, which is a weighted sum of the discrete and continuous random variables, looks like.
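Here is a hedged simulation of that classroom example, with invented weights and distributions: the first input is discrete, the second is continuous, so the input vector is a mixed random vector even though each individual input is easy to describe.

```python
# A two-input linear unit whose first input is a discrete random variable and
# whose second input is a continuous random variable, so the input vector has
# a mixed distribution. The weights and distributions are invented.
import numpy as np

rng = np.random.default_rng(3)

n = 100_000
discrete_input = rng.choice([0.0, 1.0], size=n, p=[0.4, 0.6])  # e.g., an object label
continuous_input = rng.normal(loc=0.0, scale=1.0, size=n)      # e.g., a pixel statistic

weights = np.array([2.0, 0.5])
response = weights[0] * discrete_input + weights[1] * continuous_input

# The input vector (discrete_input, continuous_input) is a mixed random vector:
# neither a probability mass function nor an absolutely continuous density
# describes its distribution, which motivates the Radon-Nikodym density notation.
print(response.mean(), response.var())
```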

At this point I introduce the Radon-Nikodym density notation. I introduce the term “support specification measure” which is non-standard. Typically, mathematicians simply say define a Radon-Nikodym density with respect to some sigma-finite measure. Here, in this book, I say define a Radon-Nikodym density with respect to a support specification measure which I feel helps communicate more clearly the role of the sigma-finite measure. This notation is informally introduced in Section 8.2.2 which is probably a good section to use for lecture notes. The more formal introduction is provided in the optional Section 8.3.

Once this notation is introduced, rework the empirical risk and expected search direction examples (Examples 8.4.1 and 8.4.6) using the new Radon-Nikodym density notation.

[MUSICAL INTERLUDE]

Technical Overview of Book Chapter 8

This section is designed mainly for search engines and web spider robots but humans will find this final section relevant also!!

The objective of Chapter 8 is to provide mathematical tools for working with probability distributions of mixed random vectors whose components may include both discrete and absolutely continuous random variables. In particular, the Radon-Nikodym density is used to specify the probability distribution of a mixed random vector. In addition, for the purpose of providing a better understanding of the Radon-Nikodym density, Chapter 8 introduces measure theory concepts including the concepts of a sigma-field, probability measure, sigma-finite measure, Borel-measurable function, almost everywhere, and Lebesgue integration. The concept of a random function is introduced, and then conditions for ensuring the existence of the expectations of random functions are provided. The chapter also includes discussions of important concentration inequalities for obtaining error bounds on estimators and predictions. In particular, the Markov inequality, Chebyshev inequality, and Hoeffding inequality are used to show how to estimate generalization error bounds and estimate the amount of training data required for learning.

[MUSICAL INTERLUDE]

Before wrapping things up, I would like to talk a little bit about a great podcast which is produced by Cognilytica and called the AI Today podcast.

Cognilytica is a research and advisory firm that provides real-world, industry and adoption focused market intelligence, advisory, and education on Artificial Intelligence and related areas. Cognilytica analysts Kathleen Walch and Ronald Schmelzer are hosts of the very popular AI Today podcast, going strong with over 200 episodes over the past 4 years. The AI Today podcast, which I encourage you to listen and subscribe to, focuses on how AI is being applied in both private and public sectors, with discussions on relevant topics related to AI and ML, insights into market trends, and interviews with thought leaders in the space.

Cognilytica provides executive level education and training on AI, Machine Learning, and Cognitive Technologies, including certification on the Cognitive Project Management for AI (CPMAI) methodology, and the best practices approach for implementing AI and ML projects. If you’re looking for best practices and methodologies for doing AI projects right, make sure to check out Cognilytica education at courses.cognilytica.com to join the hundreds of certified CPMAI trainees building world-class AI implementations.

Check out the show notes of this episode at www.learningmachines101.com for more information about how to access the AI Today podcast as well as a complete script of today’s Learning Machines 101 podcast, and references to the topics of measure theory, random vectors, Radon-Nikodym density functions, and the Banach-Tarski Paradox!!

My book “Statistical Machine Learning” can be purchased right now on AMAZON for about $100, the Kindle version is about $50. Currently there is no paperback edition of the book planned in the near future despite what is currently advertised on AMAZON. You can learn more about the book by checking out the amazon link in the show notes or visiting www.statisticalmachinelearning.com

And finally, if you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time.  You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear”  link!

In addition, I encourage you to join the Statistical Machine Learning Forum on LINKED-IN. The purpose of the Statistical Machine Learning Forum is to provide an opportunity for more advanced discussions of machine learning topics which are typically not covered in these introductory podcasts.

And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”! Also please visit us on Apple Podcast or Stitcher and leave a review. You can do this by going to the website: www.learningmachines101.com and then clicking on the Apple Podcast or Google Podcast or Stitcher.com.
