LM101-039: How to Solve Large Complex Constraint Satisfaction Problems (Monte Carlo Markov Chain)[Rerun]

November 9, 2015
Robot trying to solve a probabilistic constraint satisfaction problem.


Episode Summary:

In this episode we discuss how to solve constraint satisfaction inference problems where knowledge is represented as a large unordered collection of complicated probabilistic constraints among a collection of variables. The goal of the inference process is to infer the most probable values of the unobservable variables given the observable variables.

Show Notes:

Hello everyone! Welcome to the twenty-first podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.

Many of the recent episodes of Learning Machines 101 have focused upon the problem of learning. This episode does not discuss the learning problem at all. In this episode, we will focus upon the problem of inference rather than learning. Thus, in some respects, this episode is most similar to Episodes 3, 7, and 8 which can be found at the Learning Machines 101 website: www.learningmachines101.com .

In this episode, we consider situations where we have a large collection of variables. Each variable can take on a certain number of possible values. Some of these variables we can observe, while others have values which are unobservable. For simplicity, assume for now that each variable corresponds to an assertion which is either “true” or “false”. Thus, each variable takes on the value of “1”, indicating its corresponding assertion is “true”, or the value of “0”, indicating its corresponding assertion is “false”. In a rule-based system, we might have a large collection of IF-THEN logical rules as discussed in Episode 3 of this podcast series. For example, consider a medical diagnosis problem where the first variable corresponds to whether or not the patient has high blood pressure. We will call this variable the “high blood pressure” variable. Also assume that we have a second variable which corresponds to the outcome of a Computed Tomography or CT exam of the patient’s heart intended to determine if there is a buildup of calcium in the arteries of the patient’s heart. This is a relatively expensive and invasive exam. We will call this second variable the “CT exam outcome” variable. And finally, let’s have a third variable which indicates whether or not the person has experienced a non-lethal “heart attack” in the past week. We will call this the “heart attack” variable.

Of course in real life, we would want to have more than three variables for the purpose of representing the factors that influence the presence or absence of a heart attack. Additional variables might include the presence or absence of a pain in one part of your body such as the arms, shoulder, neck, teeth, jaw, belly area, or back. In addition, the pain could be severe or mild. Furthermore, other symptoms of a heart attack include the presence or absence of: anxiety, cough, fainting, dizziness, nausea, heart palpitations, shortness of breath, and sweating.

It is not difficult to imagine that a realistic medical diagnosis problem might have as many as 40 binary variables. The total number of possible medical situations that can be represented using 40 binary variables is 2^40, which is over one trillion possible situations, several times the number of stars in the Milky Way galaxy! Our ultimate goal is to discuss practical algorithms for making inferences involving hundreds or thousands of variables rather than just 40 variables, but for illustrative purposes we will focus on just three binary variables: “high blood pressure”, “CT exam outcome”, and “heart attack outcome”. For three binary variables, this corresponds to 2^3, or just 8, possible patterns of outcomes.

Suppose that we are in the medical profession and we begin by postulating specific IF-THEN logical rules which specify relationships among these three variables. For example, we might have IF-THEN logical rules such as:

  • IF the patient has high-blood pressure, THEN the patient will NOT have a heart attack.
  • IF the patient has a heart attack, THEN the patient will have high-blood pressure.
  • IF the patient has high-blood pressure AND a poor CT exam outcome, THEN the patient will have a heart attack.

As discussed in Episodes 3, 7, and 8 of this podcast series, there are at least three challenging problems associated with representing knowledge using IF-THEN logical rules. First, although an individual rule by itself might make sense, it is possible that the rule in conjunction with other rules is illogical. This is called the problem of determining if a set of IF-THEN logical rules is mutually consistent. Second, for a particular situation, it is possible that one will not have sufficient information about the situation with respect to the given collection of IF-THEN logical rules to come to a unique decision regarding a patient’s health. And third, in real life, it is difficult to imagine that one could postulate a legitimate rule that holds without exception. In real life there are always exceptions, as discussed in previous podcast Episodes 3, 7, and 8.

To address the problem of exceptions, one can naturally extend the concept of an IF-THEN logical rule to the concept of a probabilistic IF-THEN logical rule. Specifically, if the conditions in the IF part of the probabilistic rule hold, then with a certain probability the THEN part of the probabilistic IF-THEN rule holds. Within this framework, the problem of inference is now redefined within a probabilistic setting. Given that a patient has low blood pressure and a CT exam outcome indicating a healthy heart, the goal of the probabilistic inference machine is simply to calculate the probability that the patient will have a heart attack. It is also important to note that within this probabilistic framework the constraints and variables are unordered. In other words, we can use the collection of probabilistic IF-THEN rules to infer whether someone will have a heart attack given they have low blood pressure, but we can also use those same rules to infer whether someone will have low blood pressure given they have a heart attack and their CT exam outcome indicates a healthy heart. Every variable can be either observable or unobservable. Every variable can be an input or an output of the inference machine.

The concept of stating that a particular IF-THEN rule does not hold absolutely but holds only with some probability thus provides a simple solution to the problem of exceptions. A statement such as “IF your blood pressure is LOW, THEN it is very improbable that you will have a heart attack” is a natural way of representing knowledge when there is uncertainty about whether or not a rule is appropriate.

On the other hand, the problems of consistency and uniqueness still remain. Fortunately, the probabilistic formulation provides a natural solution to this problem as well through the use of the concept of a Markov Random Field. In Episode 11, we introduced the concept of a Markov chain. A Markov Random Field is a more generalized concept which includes Markov chains as special cases. Briefly, a Markov Random Field may be loosely defined as a collection of probabilistic IF-THEN rules.

The Markov Random Field framework provides us with a methodology for constructing a collection of probabilistic logical rules which not only handle exceptions but also are guaranteed to be mutually consistent and yield unique solutions in the form of specific probabilistic statements. The only real assumption that is required for this to work is the Positivity Assumption, which essentially says that every possible situation has at least some tiny, tiny chance of occurring. Or, in other words, the Positivity Assumption can be paraphrased as stating that “nothing is impossible”! Given this assumption, arbitrary collections of probabilistic IF-THEN logical rules can be combined and manipulated in a straightforward manner! So, for example, in our simple three-variable cardiological medical diagnosis problem, we would have to assume that someone with excellent blood pressure and no indication of clogged arteries on their CT exam could still have a heart attack. We would simply regard such situations as highly improbable, yet conceivable.

Intuitively, the Positivity Assumption seems consistent with world knowledge representation in humans. For example, Kane Waselenchuk, the current world champion in racquetball, has won the world championship in pro racquetball nine times and has lost only one match in the past five years. Five years ago, if you were to ask a racquetball enthusiast whether they thought it would be possible for Kane to lose only one match in a five-year period while playing continuously on the pro tour, they would probably say it was possible but not very probable. In fact, if you asked Kane’s parents the day after he was born whether they thought that he might one day become a dominant force on the racquetball pro tour for over a decade, I am sure they would respond with “anything is possible!”. Human cognition appears to embrace all possibilities, never surrendering to the concept of impossibility. Or, in other words, perhaps the “Positivity Assumption” is the secret ingredient for true artificial intelligence!

So, given that we have a Markov Random Field, how can we make inferences with respect to the Markov field? The answer to this question can be traced back in part to an early paper published in 1953 in the Journal of Chemical Physics written by Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller. The title of the paper was “Equation of State Calculations by Fast Computing Machines”. This paper was, if not the first, one of the earliest examples of an important class of computing strategies used in statistical machine learning which are called Monte Carlo Markov Chain methods. The algorithm it introduced ultimately came to be known as the Metropolis algorithm, despite the fact that all members of the research group probably participated equally in its development.

In 1984, perhaps one of the most influential papers which resurrected, popularized, and extended this early research thread was published by the mathematicians Stuart Geman and Donald Geman (who are brothers). Their paper, entitled “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images”, was published in the journal IEEE Transactions on Pattern Analysis and Machine Intelligence. In this paper, Geman and Geman discussed a special Monte Carlo Markov Chain algorithm called the Gibbs Sampler.

Given the Positivity Assumption, we can always represent the evidence for a particular pattern of binary values in terms of some function or rule that maps a pattern of binary values into a number. Such a function or rule is called the “Energy Function” of the Markov Random Field. Now for the good news and bad news. The good news is that, in principle, we can write down a mathematical formula which lets us calculate the probability of any arbitrary pattern of values of the variables whose values are unknown in terms of the Energy function. Moreover, the Energy function is usually easy to compute. The bad news is that for problems involving more than 12 or 15 variables, it is often computationally impossible using the fastest computers to calculate the desired probability. This is where the concept of a Monte Carlo Markov Chain computing methodology becomes important.
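To make the relationship between Energy and probability concrete, here is a minimal Python sketch (my own toy example, not from the episode) using the standard formulation in which the probability of a pattern is exp(-Energy(pattern)) divided by a normalizing constant Z. Computing Z exactly requires summing over every possible pattern, which is precisely the computation that becomes hopeless as the number of variables grows.

    import itertools
    import math

    # Toy Energy function over three binary variables (high blood pressure,
    # bad CT outcome, heart attack). Lower Energy means the pattern agrees
    # better with the probabilistic rules; the penalty weights are made up.
    def energy(pattern):
        bp, ct, ha = pattern
        e = 0.0
        if bp == 0 and ha == 1:
            e += 2.0   # heart attack despite low blood pressure: improbable
        if bp == 1 and ct == 1 and ha == 0:
            e += 1.5   # both risk factors present yet no heart attack: improbable
        return e

    # Exact probabilities require the normalizing constant Z, a sum over
    # ALL 2^n patterns: trivial for n = 3, hopeless for n = 40.
    patterns = list(itertools.product([0, 1], repeat=3))
    Z = sum(math.exp(-energy(p)) for p in patterns)

    print(math.exp(-energy((1, 1, 1))) / Z)   # P(high BP, bad CT, heart attack)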

A Monte Carlo Markov Chain or MCMC algorithm randomly transforms one pattern of values for the unobservable variables in the field into another pattern of values. The calculations are computationally simple and easy to perform regardless of the number of variables. For example, this method can be and has been applied to problems involving inference over thousands of variables, corresponding to searching through a collection of binary patterns which is many times greater than the number of atoms in the universe!

The first important example of an MCMC algorithm is called the Metropolis algorithm. The essential idea of the Metropolis algorithm is as follows. Remember that the “Energy” of a pattern of variable values is a number which is typically easy to calculate. Smaller values of energy correspond to patterns of values which are better solutions to the constraint satisfaction problem.

The terminology “unclamped” variable refers to a variable whose values are allowed to change.

The terminology “clamped” variable refers to a variable whose values are not allowed to change.

OK, let’s now introduce the Metropolis MCMC algorithm!

 

Step 1: Pick a pattern of values for the unclamped variables at random.

Step 2: Randomly select another pattern of values for the unclamped variables based
upon the current values of the unclamped variables and the current clamped variables.

We will call the old pattern of values X and the new pattern of values C.

Step 3: Evaluate the Energy function at the new pattern of values and at the old pattern of values. If the Energy decreases, indicating you accidentally picked a good choice of values, then go ahead and keep that good choice of values.

On the other hand, if your new choice of values INCREASES the Energy, then you flip a weighted coin which is weighted according to the change in the Energy. If the outcome of the coin flip is heads, you keep the new pattern of values even though that pattern of values is an INFERIOR solution relative to the previous pattern of values.

Step 4: Keep track of patterns of values which have very low Energy on a separate list.

Step 5: Go to Step 2 and repeat this process over and over again.

Notice that sometimes the Metropolis MCMC algorithm changes a pattern of values which has a particular Energy to another pattern of values which has a greater Energy value. This means the Metropolis MCMC is actually selecting a pattern of values which is a WORSE solution when it makes this type of step.

Although it may seem a little non-intuitive that one would be interested in a pattern of values which is a worse solution than the current solution, this is one of the secrets to MCMC algorithms. The algorithm does this because sometimes in order to get to a really great solution you might have to first visit some solutions which are not so good.
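Here is a short Python sketch of the five steps above. The proposal scheme (flip one randomly chosen unclamped variable) and the temperature parameter weighting the coin are my own illustrative assumptions; the episode specifies only the overall accept/reject logic.

    import math
    import random

    def metropolis_search(energy, n_vars, n_steps, temperature=1.0):
        x = [random.randint(0, 1) for _ in range(n_vars)]   # Step 1: random start
        best_e, best_x = energy(x), list(x)                 # Step 4: low-Energy record
        for _ in range(n_steps):                            # Step 5: repeat
            c = list(x)                                     # Step 2: propose pattern C
            i = random.randrange(n_vars)                    #   by flipping one variable
            c[i] = 1 - c[i]
            delta = energy(c) - energy(x)                   # Step 3: compare Energies
            # Keep C if the Energy decreases; if it INCREASES, keep C anyway
            # with probability exp(-delta / T), the "weighted coin" that lets
            # the search escape locally good but globally poor solutions.
            if delta <= 0 or random.random() < math.exp(-delta / temperature):
                x = c
                if energy(x) < best_e:
                    best_e, best_x = energy(x), list(x)
        return best_e, best_x

With the toy Energy function sketched earlier, a call such as metropolis_search(energy, n_vars=3, n_steps=10000) will almost always return a zero-Energy pattern.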

A second important special case of the Metropolis-Hastings algorithm is the Gibbs Sampler MCMC algorithm, which is a more general version of the well-known Boltzmann Machine MCMC algorithm. The essential idea of the Gibbs Sampler MCMC algorithm is as follows.

Step 1: Pick one variable X from the current field at random.
Step 2: Using the probabilistic IF-THEN logical rule set, calculate the probability P that the variable X is equal to 1 given the current values of all of the other variables.
Step 3: Flip a weighted coin which comes up heads with probability P.

Step 4: If the coin comes up heads, then set the variable X equal to 1; otherwise
set the variable X equal to 0.

Step 5: Keep track of patterns of values which have very low Energy on a separate list.

Step 6: Go to Step 1 and repeat this process over and over again.
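Here is a Python sketch of a single Gibbs Sampler update. Note that the probability in Step 2 is conditional on the current values of all the other variables; in my formulation (an assumption, equivalent to reading the probability off the rule set) it is computed from two Energy evaluations, with the chosen variable set to 0 and then to 1.

    import math
    import random

    def gibbs_step(x, energy):
        i = random.randrange(len(x))            # Step 1: pick a variable at random
        x0, x1 = list(x), list(x)
        x0[i], x1[i] = 0, 1
        e0, e1 = energy(x0), energy(x1)         # Energies with X_i = 0 and X_i = 1
        p = math.exp(-e1) / (math.exp(-e0) + math.exp(-e1))  # Step 2: P(X_i = 1 | rest)
        x[i] = 1 if random.random() < p else 0  # Steps 3-4: weighted coin flip
        return x                                # Steps 5-6: caller records and repeats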

 

Thus, in a Monte Carlo Markov Chain algorithm, the pattern of variable values is progressively perturbed at random, but the chance of selecting a particular pattern of values depends upon the Energy function. It can be shown that for any arbitrary Markov Random Field, it is always possible to construct a Monte Carlo Markov Chain algorithm. Moreover, it can be shown that a Monte Carlo Markov Chain algorithm has a remarkable property: if the perturbation of variable values is repeated multiple times, eventually the percentage of times that a particular pattern of values is visited converges to the probability of that pattern of values as specified by the Markov field’s energy function. This means that low energy states corresponding to good solutions to the constraint satisfaction problem will be visited more frequently, provided the algorithm is allowed to continue for a long enough time period. Thus, it is only necessary for the computer to “record” situations where the pattern of variable values has a low “energy state” during this random process. The pattern of variable values with the lowest energy obtained as a result of the MCMC algorithm’s journey through state space is used as the solution to the constraint satisfaction problem.
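This convergence property is easy to check empirically on a toy problem. The sketch below (a made-up Energy function; my own check, not from the episode) runs a single-variable-flip Metropolis chain over three binary variables and compares each pattern’s visit frequency with its exact probability exp(-Energy)/Z; with enough steps the two columns agree closely.

    import itertools
    import math
    import random
    from collections import Counter

    def energy(x):
        return float(sum(x))    # toy Energy: patterns with fewer 1s are better

    patterns = list(itertools.product([0, 1], repeat=3))
    Z = sum(math.exp(-energy(p)) for p in patterns)

    x, counts, n_steps = [0, 0, 0], Counter(), 200000
    for _ in range(n_steps):
        c = list(x)
        c[random.randrange(3)] ^= 1             # propose: flip one variable
        delta = energy(c) - energy(x)
        if delta <= 0 or random.random() < math.exp(-delta):
            x = c
        counts[tuple(x)] += 1

    for p in patterns:                          # visit frequency vs. exact probability
        print(p, round(counts[p] / n_steps, 3), round(math.exp(-energy(p)) / Z, 3))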

But that’s not all!! There is even more good news. The rate of convergence can be shown to be “geometric” which, in English, means “super fast”. There are versions of MCMC algorithms based upon an approach called “simulated annealing”, which involves gradually decreasing a special parameter called the temperature parameter toward zero. As the temperature parameter is decreased, the probability of randomly moving from a low energy state to a high energy state gradually decreases. We will not focus on such methods here because the rate of convergence of MCMC simulated annealing algorithms is “logarithmic” which, in English, means “super slow”. Moreover, the mathematics is much more complicated. Simulated annealing methods will be discussed in greater detail in a future episode.

If you go to the website: www.learningmachines101.com and click on Episode 21, then scroll to the bottom of Episode 21 you will find a pointer to the Wikipedia Entry  which describes how to implement a particular type of MCMC algorithm called the Metropolis-Hastings MCMC algorithm. The Metropolis-Hastings MCMC algorithm includes the popular Gibbs Sampler, Metropolis, and Boltzmann machine Monte Carlo Markov Chain algorithms as special cases.

This is a very exciting approach which has numerous applications especially with the availability of high-speed computing resources.  I will now go ahead and list just a few example applications of MCMC algorithms which are not in the area of medical diagnosis.

Our first example is related to an area of interest mentioned by a new member of the Learning Machines 101 community named Stephen. Stephen is interested in the problem of unsupervised classification of spatial data. Markov Random Fields are ideally suited for this type of application. Each variable in the field corresponds to a different spatial location. In addition, we allow each variable in the field to take on many possible values corresponding to the set of possible objects at that spatial location. So, for example, suppose that one is studying how rivers erode land masses. One might have a large collection of variables corresponding to different spatial locations in a geographic region containing a river surrounded by land masses. We call these locations “sites”. At each “site”, we assume that either: (1) land mass is present without evidence of erosion, (2) land mass is present with evidence of erosion, or (3) water is present. In practice, the Markov Field might have thousands of such site variables where each site variable can take on one of three possible values. Now we incorporate simple probabilistic IF-THEN logical rules such as:

 

  • The probability that a particular target site is a “land mass without evidence of erosion” will become larger when there are more sites surrounding the target site which are “land masses without evidence of erosion”.
  • The probability that a particular target site is a “land mass with evidence of erosion” will become larger when there are more sites surrounding the target site which are “land masses with evidence of erosion”.
  • The probability that a particular target site is a “water mass” will become larger when there are more sites surrounding the target site which are of type “water mass”.

 

These three simple probabilistic IF-THEN logical rules can then be used to support spatial classification. In a particular spatial classification problem, it is assumed that some of the “sites” in the spatial field are known and therefore clamped. For example, it might be known that some sites are definitely “land masses”, some sites are definitely “river or water”, and some sites are definitely “land masses in the process of erosion”. However, there may be many other sites which are difficult to classify. One can use an MCMC type of algorithm to infer the “most probable” values of the unclamped site variables given the clamped variables!! This is achieved, as previously mentioned, by using an MCMC algorithm to update the state variables of the Markov Field and keeping track of which patterns of state variables have the smallest Energy.
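As a concrete illustration, here is one way the three neighbor-count rules might be turned into a Gibbs-style update for a single unclamped site. The exponential weighting (a Potts-model-style formulation) and the coupling parameter are my own assumptions; the episode states the rules only qualitatively.

    import math
    import random

    # grid[r][c] holds 0 = land, 1 = eroding land, 2 = water; `clamped` is a
    # set of (row, col) sites whose values are known and never updated.
    def update_site(grid, clamped, r, c, coupling=1.0):
        if (r, c) in clamped:
            return
        neighbors = [grid[r + dr][c + dc]
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= r + dr < len(grid) and 0 <= c + dc < len(grid[0])]
        # Each label's weight grows with the number of matching neighbors,
        # implementing the three qualitative rules above.
        weights = [math.exp(coupling * neighbors.count(label)) for label in (0, 1, 2)]
        grid[r][c] = random.choices((0, 1, 2), weights=weights)[0]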

 

Notice that in the medical diagnosis example, each variable in the Markov Field corresponds roughly to one probabilistic IF-THEN rule. If you have 100 variables, then you have about 100 probabilistic IF-THEN rules. In this problem concerned with the unsupervised classification of spatial data, there might be millions of variables but only a few probabilistic IF-THEN rules. The reason why this is possible in this application is that the probabilistic IF-THEN rules which are defined are assumed to be “shared” by the millions of variables in the Markov Field. This assumption is sometimes called the “spatial homogeneity” assumption.

 

MCMC algorithms have also been used to provide optimal layouts of integrated circuits. One way to set up such a problem would be to assign each component in the integrated circuit design a variable value indicating: (1) how it is connected to other components, (2) the orientation of the component, and (3) the location of the component. The Energy function is chosen to be the total volume of the layout of the components.
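A minimal sketch of what such an Energy function might look like, using a two-dimensional bounding-box area as a stand-in for the “total volume” of the layout; the component representation is an illustrative assumption, and a realistic version would also penalize wire length and overlaps.

    # Each placement is (x, y, width, height) after applying the component's
    # orientation; the Energy is the area of the box enclosing all components.
    def layout_energy(placements):
        xs = [x for x, _, w, _ in placements] + [x + w for x, _, w, _ in placements]
        ys = [y for _, y, _, h in placements] + [y + h for _, y, _, h in placements]
        return (max(xs) - min(xs)) * (max(ys) - min(ys))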

 

Another example application is statistical natural language processing. In this case, each clamped variable corresponds to a particular word or phrase. Associated with each clamped variable is an unclamped variable indicating the syntactic or semantic interpretation of that word or phrase. Additional unclamped variables can also be included which correspond to higher-order syntactic structures such as “noun phrases” or higher-order semantic structures. That is, in addition to probabilistic IF-THEN rules which specify the probability that a particular word or word phrase should be labeled in a particular manner, one can have additional probabilistic IF-THEN rules specifying which patterns of labels are plausible and which are implausible.

 

A fourth example is the Scheduling Problem. Suppose the goal of a scheduling problem is to determine whether a particular employee should work the day shift, night shift, or late night shift. For each employee and each day of the week, a particular variable is assigned which takes on the three possible values of “day shift”, “night shift”, and “late night shift”. Then, we have probabilistic constraints such as: (1) a certain number of employees must work each shift, (2) some employees may be on vacation, (3) some employees prefer certain shifts on particular days, and (4) one wants to avoid assigning a late night shift followed by a day shift and similar situations.
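Here is one hedged way to encode those four constraints as an Energy function; the shift codes, penalty weights, and the extra “off” value I use to handle vacations are all my own illustrative assumptions.

    # schedule[e][d] is employee e's assignment on day d:
    # 0 = day shift, 1 = night shift, 2 = late night shift, 3 = off.
    def schedule_energy(schedule, required=(3, 2, 1), vacations=(), prefs=None):
        prefs = prefs or {}
        total, n_days = 0.0, len(schedule[0])
        for d in range(n_days):
            for shift in (0, 1, 2):                      # (1) staffing level per shift
                staffed = sum(1 for emp in schedule if emp[d] == shift)
                total += abs(staffed - required[shift])
        for e, emp in enumerate(schedule):
            for d in range(n_days):
                if (e, d) in vacations and emp[d] != 3:  # (2) scheduled on a vacation day
                    total += 10.0
                want = prefs.get((e, d))
                if want is not None and emp[d] != want:  # (3) unmet shift preference
                    total += 0.5
            for d in range(n_days - 1):                  # (4) late night then day shift
                if emp[d] == 2 and emp[d + 1] == 0:
                    total += 5.0
        return total

Running a Metropolis or Gibbs Sampler chain that perturbs one employee-day variable at a time, with schedule_energy as the Energy function, then amounts to searching for low-Energy schedules.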

The Monte Carlo Markov Chain methodology is a generic methodology which is thus applicable to a broad range of complicated constraint satisfaction problems. It has the advantage that it affords considerable flexibility in dealing with a broad range of problems in a straightforward manner. However, it is important to emphasize that for specific types of small-scale problems with appropriate constraints, computer scientists have developed sophisticated specialized strategies for performing a search for a solution. We will not discuss those specialized strategies today, however. The goal of today’s discussion was to introduce the powerful MCMC methodology as a possible tool for consideration in the solution of very complicated and very high-dimensional probabilistic constraint satisfaction problems.

In addition, this entire podcast did not discuss the problem of learning probabilistic constraints. Rather the focus was on using probabilistic constraints. In a future podcast, we will discuss how one can build a probabilistic constraint satisfaction machine that learns from experience without a teacher! That is, we will describe unsupervised learning procedures for probabilistic constraint satisfaction machines in a future episode of Learning Machines 101!

Finally, as you may have noticed, I discussed an application today which was relevant to one of the members of the Learning Machines 101 community.

If you are a member of the Learning Machines 101 community, please update your user profile.

You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!

Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time.

From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group!

Keywords: Gibbs Sampler, Monte Carlo Markov Chain, Markov Random Field, Random Field, Metropolis-Hastings algorithm, Metropolis Algorithm, Hastings Algorithm, Iterated Conditional Modes Algorithm, Constraint Satisfaction, Boltzmann Machine

Further Reading:

Wikipedia Entry: Metropolis-Hastings Algorithm.

Wikipedia Entry: Gibbs Sampling.

Geman, S., and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21(6), 1087–1092.

Sheng, Y., and Takahashi, A. (2012). A simulated annealing based approach to integrated circuit design.

Golden, R. M. (2000). Kirchoff law Markov fields for analog circuit design. In Neural Information Processing Systems Proceedings (Editors: Solla, Leen, and Muller), MIT Press, 907–913.

Salomao, M. C., and Remacre, A. Z. (2001). The use of discrete Markov random fields in reservoir characterization. Journal of Petroleum Science and Engineering, 32, 257–264.

Strebelle, S. (2002). Conditional simulation of complex geological structures using multiple-point statistics. Mathematical Geology, 34.

Fischer, A., and Igel, C. An introduction to restricted Boltzmann machines.

Lavecchia, C., Langlois, D., and Smaili, K. (2008). Phrase-based machine translation based on simulated annealing.

Gastner, M. T. Traffic flow in a spatial network model.

He, X., Zemel, R. S., and Carreira-Perpinan, M. A. Multiscale conditional random fields for image labeling.

 

Copyright © 2015 by Richard M. Golden. All rights reserved.
