# LM101-067: How to use Expectation Maximization to Learn Constraint Satisfaction Solutions (Rerun)

## Episode Summary:

In this episode we discuss how to **learn** to solve constraint satisfaction inference problems. The goal of the inference process is to infer the most probable values for unobservable variables. These constraints, however, can be **learned** from experience. Specifically, the important machine learning method for handling unobservable components of the data using Expectation Maximization is introduced.

## Show Notes:

Hello everyone! Welcome to the twenty-second podcast in the podcast series *Learning Machines 101*. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.

In this episode we discuss how to **learn** to solve constraint satisfaction inference problems. The goal of the inference process is to infer the most probable values for unobservable variables. These constraints, however, can be **learned** from experience.

In Episode 21, we introduced the constraint satisfaction problem. In the constraint satisfaction inference problem, the inference machine is provided a collection of constraints and tries to find a solution which best satisfies those constraints. If you haven’t yet listened to Episode 21, you might want to check it out by visiting: www.learningmachines101.com since this episode builds upon concepts introduced in Episode 21.

In both this episode and Episode 21, we consider situations where we have a large collection of variables. Each variable can take on a certain number of possible values. Some of these variables we can observe, while some of these variables have values which are unobservable. For simplicity, assume for now each variable corresponds to an assertion which is either “true” or “false”. Thus, each variable takes on the value of “1” indicating its corresponding assertion is “true” or each variable takes on the value of “0” indicating its corresponding assertion is “false”. In a rule-based system, we might have a large collection of IF-THEN logical rules as discussed in Episode 3 of this podcast series. For example, consider a medical diagnosis problem where the first variable corresponds to whether or not the patient has high blood pressure. We will call this variable the “high blood pressure” variable. Also assume that we have a second variable which corresponds to the outcome of a Computed Tomography or CT exam of the patient’s heart intended to determine if there is a build up of calcium in the patient’s heart. This is a relatively expensive and invasive exam. We will call this second variable the CT outcome variable which indicates whether or not there is a build up of calcium in the arteries. We will call this the “CT exam outcome” variable. And finally, let’s have a third variable which indicates whether or not the person has experienced a non-lethal “heart attack” in the past week. We will call this the “heart attack” variable.

To address the problem of exceptions, one can naturally extend the concept of an IF-THEN logical rule to the concept of a probabilistic IF-THEN logical rule. Specifically, the specific conditions in the IF part of the probabilistic rule hold, then with a certain probability the THEN part of the probabilistic IF-THEN rule holds. Within this framework, the problem of inference is now redefined within a probabilistic setting. Given a patient has low blood-pressure and a CT exam outcome indicating a healthy heart, the goal of the probabilistic inference machine is simply to calculate the probability that the patient will have a heart attack. It is also important to note that within this probabilistic framework the constraints and variables are unordered. In other words, we can use the collection of probabilistic IF-THEN rules to infer whether someone will have a heart attack given they have low blood pressure but we can also use those same rules to infer whether someone will have low blood pressure given they have a heart attack and their CT exam outcome indicates a healthy heart. Every variable can be either observable or unobservable. Every variable can be an input or an output of the inference machine.

The concept of stating that a particular IF-THEN rule does not hold absolutely but just holds with some probability thus provides a simple solution to the problem of exceptions since if we have a statement such as: IF your blood-pressure is LOW, THEN it is very improbable that you will have a heart attack provides a natural way of representing knowledge when uncertainty about whether or not a rule is appropriate is present.

Given this collection of probabilistic IF-THEN rules, the next problem is to use these rules to make probabilistic inferences. In Episode 21, we discussed the details of a method called Monte Carlo Markov Chain or MCMC. For details regarding MCMC methods of inference, please take another look at Episode 21. However, the basic idea of both these algorithms is simple yet amazing. Briefly, we start with an initial guess for the values of the unobservable variables and then we randomly change those values according to a simple procedure. Eventually, it can be shown that the patterns of variables values which best satisfy the constraints specified by a particular measure of constraint satisfaction will be visited most frequently by this procedure! Thus, one just simply needs to make these random changes to the variable values and keep track of the good solutions as defined by the constraint satisfaction measure.

However, in Episode 21, the problem of “learning” probabilistic constraints was not discussed. In order to** learn** probabilistic constraints, we will define the probabilistic constraints in a special way. Specifically, the probability that a particular variable such as “heart attack” takes on the value “heart attack imminent” is assumed to depend not only upon other variables such as “blood pressure”, “genetic disposition”, “exercise”, and “diet” but also upon some additional variables which we call parameters. These parameters (also sometimes called “connection weights”) are strengthened and weakened by the learning machine as it learns. Different choices of the parameter values modify the specific probability that a particular variable variable such as “heart attack” takes on the value “heart attack imminent” given the values of other variables such as “blood pressure”, “genetic disposition”, “exercise”, and “diet”. **How does this learning process work?**

One method for learning probabilistic constraints is called the “**pseudolikelihood method**”. The pseudolikelihood method is based upon the idea that we might have some probabilistic rules which try to predict the probability of one variable such as “heart attack imminent” given the values of other variables such as “exercise” and “diet” as well as some parameters which are adjusted by the learning process. During the learning process, the learning machine might be lucky enough to see situations where the values of all three of these variables are simultaneously observable. These situations can then be used as “training stimuli” and the Gradient Descent methods of Episode 16 can be used to adjust the parameters of the learning machine so that it can predict the required probabilities. The basic idea of the Gradient Descent method is that one computes the “error” between the learning machine’s current guess for the probability of an event and the observed percentage of times that event occurs in the environment. Then one uses that “error signal” to “tweak” the parameters of the learning machine so that learning machine’s guess will be slightly improved in the future.

The Pseudolikelihood Method works fairly well but it has problems in at least two situations.

We now discuss the first problem of the Pseudolikelihood Method.

If a particular variable such as “heart attack imminent” depends upon large numbers of variables then situations where all of the relevant predictive variables that predict “heart attack imminent” may be very rare. So this method works best when each probabilistic rule is only formulated using a small number of variables. For example, the variables influencing the likelihood of a heart attack may be very numerous and include factors such as the presence or absence of a pain in one part of your body such as the arms, shoulder, neck, teeth, jaw, belly area, or back. In addition, the pain could be severe or mild. Furthermore, other symptoms of a heart attack include the presence or absence of: anxiety, cough, fainting, dizziness, nausea, heart palpitations, shortness of breath, and sweating. If each probabilistic rule was formulated in terms of all of these variables then learning using the Pseudolikelihood Method could only occur when ALL of these variables are observed. In many real-world situations, the availability of such data may be difficult to obtain.

We now discuss the second problem of the Pseudolikelihood Method.

This is a more subtle problem associated with “non-independent” training-stimuli. In fact, this is where the terminology “Pseudo” in the phrase “Pseudolikelihood Method” arises! Consider a very complicated event in the learning machine’s environment where many variables are observable and many variables are unobservable. During this event, we notice that we observe the variables: “shortness of breath”, “sweating”,”anxiety”, “dizziness”, and “fainting”. In addition, we have the following two probabilistic rules which we want to learn. The first probabilistic rule tries to predict the probability of “shortness of breath” from “sweating” and “anxiety”. The second probabilistic rule tries to predict the probability of “dizziness” from “sweating” and “fainting”. Notice that both rules share a common variable “fainting”. This turns out to be a problem. If we experience TWO DISTINCT events and then use one event to adjust the parameters of the first probabilistic rule and the second event to adjust the parameters of the second probabilistic rule, then there are no technical difficulties. However, if we adjust the parameters of BOTH rules using data from a SINGLE event then there is a potential problem that certain statistical regularities associated with the INTERACTIONS of the two rules will not be properly handled.

There is, however, an alternative to the Pseudolikelihood Method.

A method for learning probabilistic constraints which has some success in handling both of these problems simultaneously but is much more complicated is called Monte Carlo Expectation Maximization. The basic idea of Monte Carlo Expectation Maximization is that when an environmental event is experienced by the learning machine some variables are observable and other variables are unobservable. For example, perhaps the variables “blood pressure” and “diet” are observed but the variables “dizziness” and “neck pain” are not observable. The learning machine has an imperfect probabilistic model of its world since it is in the process of learning about its environment. Nevertheless, the learning machine calculates the probability that the patient has “dizziness” and “neck pain” and then uses that probability to make a random guess regarding whether or not the those factors are present in the patient. Next, the learning machine LEARNS not only the variables it actually observed but also it learns the variables that it guessed. The learning machine then uses this “constructed” training stimulus to make tiny changes to its parameters so that in the future it is more likely to assign a high probability to the “constructed” training stimulus using the Gradient Descent Method described in Episode 16. Although the Monte Carlo Expectation Maximization seems very heuristic, one can use mathematical arguments to identify the conditions where this procedure will allow the learning machine to correctly learn its probabilistic environment.

There are several variations of the Monte Carlo Expectation Maximization method. In the Monte Carlo Expectation Maximization Method, the learning machine uses its imperfect probabilistic model of the world to randomly guess the values of unobservable variables given the observable variable values. Unfortunately, sometimes for very complicated problems, it requires a lot of computation to do this random guess correctly. To solve this problem, one can use Monte Carlo Markov Chain methods such as discussed in Episode 21 to make these random guesses. Once these values of the unobservable variables are randomly guessed, then the “completed” training stimulus is learned by perturbing the parameters using the Gradient Descent Method.

Another potential difficulty in implementing the Monte Carlo Expectation Maximization in practice is that the gradient descent method which is the method for adjusting the parameters of the learning machine based upon error signals may be computationally too complicated to implement in practice. In such situations, a method called “Contrastive-Divergence Learning” is sometimes used. The basic idea of this method is that the learning machine uses a simpler method for adjusting its parameters when it sees a training stimulus but then the mathematics of the Contrastive-Divergence Learning method dictates that the parameter changes to the learning machine have to be modified in a special way. They are modified by having the learning machine generate random guesses for both observable and unobservable variables to generate “training stimuli”. But here is the kicker!!! The parameters of the learning machine are adjusted to make these random guesses LESS PROBABLE. That is, the learning machine adjusts its parameters so that these random guesses about both the observable variable values and unobservable variable values are predicted to be LESS LIKELY.

To illustrate the general ideas we have just discussed, let’s consider a very special case of the Gibbs Sampler using a Monte Carlo Expectation Maximization learning for the case where only correlations between binary-valued units are learned. This special case of constraint satisfaction learning is called the “Wake-Sleep Algorithm” for the Boltzmann machine but has also been referred to as a Helmholtz machine. The learning machine works as follows. First, one makes an initial guess for the parameters of the learning machine. Second, the first training stimulus is presented which specifies values for particular binary variables in the machine. The remaining binary variables in the machine are assigned values of either zero or one at random. Third, a variable whose value is not known is selected at random and the probability that the variable takes on the value of one is computed. The formula for computing this probability involves having the computing unit associated with that variable compute a weighted sum of the variables which influence that variable where the weights

are the parameters of the learning machine. So, for example, if the unknown variable was “blood pressure is high” then the probability that variable was equal to one or “true” would be computed by multiplying the value of each variable influencing the “blood pressure is high” variable by a parameter value and then adding up all of these products. The resulting number is the “evidence” that the “blood pressure is high” but is not yet a probability. To convert the evidence into a probability, the evidence has to be “normalized” so that as the evidence increases the probability comes close to one

and as the evidence decreases the probability comes close to zero. This is done using sigmoidal logistic computing units which were described in Episode 14. Fourth, the learning machine then randomly assigns the value of “1” or “true” to the variable “blood pressure is high” such that the frequency of this assignment is the probability computed by taking the sigmoidal function of the evidence that the “blood pressure is high”. This process is repeated for multiple variables in the system whose values are not known. In Step 5, when this variable update process has completed, the first stage of learning begins. The weighting parameter between a pair of variables is **increased** by a tiny amount directly proportional to the observed correlation between the two variables. Note that some of the variable values are never observed and were simply generated at random by the learning machine based upon its current guesses regarding their probabilities. In Step 6, the second stage of learning begins which corresponds to the Contrastive-Divergence mechanism of learning. The learning machine randomly assigns the value of “1” or “0” to a variable according to its expected probability the variable takes on that value but the learning machine does this not only for variables whose values are unknown…it also does this for variables whose values are known! But here is the interesting point! These correlations which were computed by IGNORING the values of the observable variables are **decreased** by a tiny amount directly proportional to the observed correlation between the two variables! That is, the learning machine adjusts its parameters to INCREASE its memory for correlations with actual observed events in its environment and adjusts its parameters to DECREASE its memory for correlations with events which are pure fantasies! The above process is then repeated over and over again beginning with Step 1.

Some neuroscientists have proposed, in fact, that this Contrastive-Divergence mechanism of learning may be used by humans and animals and corresponds to a theory of how “dreams” help us learn and update connections among neurons in the brain. According to this theory called the “Wake-Sleep Algorithm”, humans and animals strengthen and weaken connections among neurons based upon the experiences they encounter during the day. At night, they dream about situations they have not experienced and “unlearn” these total fantasies. This is consistent with some evidence, for example, that the sleep process tends to eliminate irrelevant memories. If humans and animals do indeed learn according to this procedure, then a mathematician might say that our “dreams” correspond to neural simulations of events in our environment which facilitate learning by the brain’s implementation of a Monte Carlo Expectation Maximization method!!!

If you go to the website: www.learningmachines101.com and check out the references for Episode 22 you can learn more details about this topic.

Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile.

You can update your user profile when you receive the email newsletter by simply clicking on the: *“Let us know what you want to hear”* link!

Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time.

From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group!

**Keywords: **Monte Carlo Expectation Maximization, Expectation Maximization, Stochastic Approximation Expectation Maximization, Wake-Sleep Algorithm, Dreams, Gibbs Sampler, Monte Carlo Markov Chain, Markov Random Field, Random Field, Metropolis-Hastings algorithm, Metropolis Algorithm, Hastings Algorithm, Constraint Satisfaction, Boltzmann Machine

## Further Reading:

- Cirelli, Charia; Giulio Tuononi (August 2013). “Perchance to Prune”.
*Scientific American*. - http://www.podcasts.com/science_selections/episode/perchance-to-prune-aug-2013-scientific-american
- Hinton, Dayan, Frey, and Neal (1994). The Wake-Sleep Algorithm for Unsupervised Neural Networks. http://www.gatsby.ucl.ac.uk/~dayan/papers/hdfn95.pdf
- Gu and Kong (1998). A Stochastic Approximation Algorithm with Monte-Carlo Method for Incomplete Data Estimation Problems, 95, 7270-7274.
- Wikipedia: Neuroscience of Sleep.
*http://en.wikipedia.org/wiki/Neuroscience_of_sleep#cite_note-82* - Episode 21: LM101-021: How to Solve Large Complex Constraint Satisfaction Problems (Monte Carlo Markov Chain)
- Fischer, A., and Igel, C. An introduction to restricted Boltzmann Machines.

Click here to get the paper! - Wikipedia: Pseudolikelihood (http://en.wikipedia.org/wiki/Pseudolikelihood )
- Boltzmann Machine Article by Professor Geoffrey Hinton in Scholarpedia (http://www.scholarpedia.org/article/Boltzmann_machine)