LM101-071: How to Model Common Sense Knowledge using First-Order Logic and Markov Logic Nets

February 23, 2018
A self-driving car doesn't see the difference between hitting a baby carriage and hitting a garbage can because it does not have common sense.

Episode Summary:

In this podcast, we provide some insights into the complexity of common sense. First, we discuss the importance of building common sense into learning machines. Second, we discuss how first-order logic can be used to represent common sense knowledge. Third, we describe a large database of common sense knowledge, represented using first-order logic, which is free for machine learning researchers; we provide a hyperlink to this free database. Fourth, we discuss some problems with first-order logic and explain how these problems can be resolved by transforming logical rules into probabilistic rules using Markov Logic Nets. Fifth, we review the book “Markov Logic: An Interface Layer for Artificial Intelligence” by Pedro Domingos and Daniel Lowd, which provides further discussion of the issues in this podcast. In this book review, we cover some additional important applications of Markov Logic Nets not covered in detail in this podcast, such as object labeling, social network link analysis, information extraction, and support for robot navigation. Finally, at the end of the podcast we provide information about a free software program which you can use to build and evaluate your own Markov Logic Net!

Show Notes:

Hello everyone! Welcome to the 71st podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in a hopefully entertaining and educational manner.

Before listening to this podcast, it would be helpful if you have already listened to Episode 3, which covers how to represent knowledge using logical rules; Episode 8, which covers how to represent beliefs using probability theory; Episode 66, which covers how to solve large complex constraint satisfaction problems; Episode 67, which covers how to learn to solve large constraint satisfaction problems; Episode 55, which covers how to learn statistical regularities using MAP and maximum likelihood estimation; and Episode 65, which covers how to design gradient descent learning machines.

You can find these episodes at the website: www.learningmachines101.com

A good example of the complexity of the issues associated with applying artificial intelligence to problems in the real world is the challenge of developing self-driving cars which have common sense. Suppose that you are driving down the freeway and you suddenly see a baby carriage in the middle of the freeway, moving slowly across it at about 1 mile per hour. What types of thoughts would run through your mind? You might think it is quite possible the baby carriage is empty, so it might be ok to hit it on its side to knock it out of the way, so that you don’t have to swerve into the gasoline tanker in your right lane or the station wagon carrying a family of six in your left lane. Alternatively, if there is in fact a baby inside the carriage, then hitting it with your car would be a terrible outcome. All of these thoughts go through your mind, and you try to make the best possible decision in order to minimize the risk to your own life, the life of the baby (if a baby is in fact inside the carriage), and the lives of your fellow drivers.

Ok…so now let’s consider what goes through the mind of a self-driving car which has been trained on millions of hours of training videos and is equipped with every imaginable radar, infrared, and echolocation sensing device. Chances are that, despite millions of hours of training videos, it has rarely encountered a scenario involving a baby carriage on the freeway. It would probably treat the baby carriage scenario in the same manner in which it would treat the scenario of a plastic garbage can rolling across the freeway. It would know an obstruction was present and it could identify its velocity, but it wouldn’t be able to think more deeply about the problem. The self-driving car, for example, might deliberately hit the baby carriage to avoid injuring its human passengers or the passengers traveling in cars in neighboring lanes. The self-driving car would not have the common sense knowledge that baby carriages are likely to contain babies but plastic garbage cans are not.

This is the key problem that we want to focus on in today’s podcast: the problem of building common sense knowledge into our learning machines. Most automobile drivers possess the common sense knowledge that there is a strong possibility that a baby carriage contains a baby. However, this “common sense knowledge” is not possessed by most self-driving cars. How can we get this type of knowledge into our learning machines? To address this problem, we need to develop a mathematical theory for representing common sense knowledge that can be implemented in computer software.

We start off with a set of primitive objects in the world which I will call S. A “term” is the technical term for a “model of some object”. A term could be an element of S, or a term could be a variable which represents any element in a subset of S, or a term could be a function of other terms. So, for example, suppose that BABY, BOY, GIRL, and CARRIAGE are elements of S. Then BABY, BOY, GIRL, and CARRIAGE correspond to primitive object model concepts denoting a “human baby”, a “human boy”, a “human girl”, and a “baby carriage”.  We could define a new variable called HUMAN which can take on the possible values of BABY, BOY, and GIRL. We could also define the function MOTHER-OF so that the term MOTHER-OF(BABY) denotes the baby’s mother. We can even construct the term MOTHER-OF(MOTHER-OF(BABY)) to denote the baby’s grandmother. The concepts BABY, BOY, GIRL, CARRIAGE, HUMAN, MOTHER-OF(BABY) are examples of “object models” which are called “terms”. So this first step provides us with a collection of mathematical tools intended to flexibly represent and manipulate fairly sophisticated models of objects in the world. These concept models which are called “terms” are the building blocks which we will use to express pieces of common knowledge.
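As a rough illustration (a hypothetical Python encoding, not part of the episode), the three kinds of terms described above can be represented as nested data structures:

```python
# Hypothetical sketch: constants, variables, and function terms from first-order logic.
from dataclasses import dataclass

@dataclass(frozen=True)
class Constant:
    name: str                  # a primitive object in S, e.g. "BABY"

@dataclass(frozen=True)
class Variable:
    name: str                  # e.g. "HUMAN"
    domain: tuple              # the subset of S it may take values in

@dataclass(frozen=True)
class FunctionTerm:
    function: str              # e.g. "MOTHER-OF"
    argument: object           # another term (constant, variable, or function term)

BABY, BOY, GIRL = Constant("BABY"), Constant("BOY"), Constant("GIRL")
HUMAN = Variable("HUMAN", (BABY, BOY, GIRL))

# MOTHER-OF(MOTHER-OF(BABY)) denotes the baby's grandmother.
grandmother = FunctionTerm("MOTHER-OF", FunctionTerm("MOTHER-OF", BABY))
print(grandmother)
```

Because terms nest, arbitrarily sophisticated object models can be composed from a small set of primitives.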

An “atomic formula” is a predicate applied to terms which takes the value of “true” or “false”. So basically the “atomic formula” is how we specify a little piece of common-sense knowledge about the world, while the “terms” are used to represent aspects of the world. Examples of atomic formulas are CONTAINS(CARRIAGE, BABY) and CONTAINS(CARRIAGE, MOTHER-OF(BABY)). We can also use logical operators such as AND, OR, NOT, and IF-THEN to combine atomic formulas into complex formulas such as CONTAINS(CAR,BABY) AND CONTAINS(CAR,MOTHER-OF(BABY)), which represents the concept that “the car contains a baby and its mother”.
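A minimal sketch of this idea (the facts below are invented purely for illustration): a predicate is a truth-valued function of terms, and logical operators combine predicate values.

```python
# Hypothetical predicate: returns True/False for a pair of terms,
# backed by a toy knowledge base of (container, thing) facts.
def contains(container, thing):
    facts = {("CARRIAGE", "BABY"), ("CAR", "BABY"), ("CAR", "MOTHER-OF(BABY)")}
    return (container, thing) in facts

# Complex formula: "the car contains a baby AND its mother".
print(contains("CAR", "BABY") and contains("CAR", "MOTHER-OF(BABY)"))   # True
```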

This is pretty good, but what if there is more than one baby in the world, or the world contains more than one baby carriage? This is why it is useful to write our atomic formulas using variables. For example, we could write the atomic formula CONTAINS(CARRIAGE(X), BABY(Y)), which means “The baby carriage named X contains a baby named Y.” Now this looks like a more general piece of common-sense knowledge. We could also represent common sense knowledge such as “For every baby named Y there exists a mother of Baby Y” more formally by the quantified formula:
“FOR EVERY ENTITY BABY(Y): THERE EXISTS AN ENTITY Z SUCH THAT Z=MOTHER-OF(BABY(Y))”.

Our goal is to identify the set of possible worlds which are consistent with the logical constraints imposed by the evidence from our self-driving car’s sensors and the common-sense knowledge base of our self-driving car. Less formally, the car’s sensors are telling us that there is a baby carriage in the driving lane. Then we have one piece of common-sense knowledge that every baby carriage contains exactly one baby. We also have another piece of common sense knowledge that states that if a baby carriage contains a baby and the baby carriage is in the driving lane, then that baby is in the driving lane. To keep things simple, we assume that there is only one baby carriage in the world and there are two babies named JOHN and SUSAN. We now list all of the ground atoms involved in representing both logical rules in this example as well as the evidence ground atom. These ground atoms are: CONTAINS(CARRIAGE, BABY(JOHN)), CONTAINS(CARRIAGE, BABY(SUSAN)), IN-LANE(CARRIAGE), IN-LANE(BABY(JOHN)), and IN-LANE(BABY(SUSAN)). Since we have 5 ground atoms which can each be assigned the values of “true” or “false”, it follows that there are 32 possible worlds. The ground atom IN-LANE(CARRIAGE) is assigned the value of “true” because it represents evidence obtained from the car’s sensors, so this reduces the number of possible worlds to 16, corresponding to all possible assignments of truth values to the 4 remaining ground atoms CONTAINS(CARRIAGE, BABY(JOHN)), CONTAINS(CARRIAGE, BABY(SUSAN)), IN-LANE(BABY(JOHN)), and IN-LANE(BABY(SUSAN)).
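This enumeration is small enough to verify directly in code (a hypothetical sketch, not from the episode):

```python
# Enumerate every truth assignment to the 5 ground atoms of the baby-carriage example.
from itertools import product

atoms = [
    "CONTAINS(CARRIAGE, BABY(JOHN))",
    "CONTAINS(CARRIAGE, BABY(SUSAN))",
    "IN-LANE(CARRIAGE)",
    "IN-LANE(BABY(JOHN))",
    "IN-LANE(BABY(SUSAN))",
]

# Every assignment of True/False to the 5 ground atoms is a possible world.
all_worlds = [dict(zip(atoms, values))
              for values in product([True, False], repeat=len(atoms))]
print(len(all_worlds))    # 32

# Sensor evidence: the carriage is observed in the driving lane.
consistent = [w for w in all_worlds if w["IN-LANE(CARRIAGE)"]]
print(len(consistent))    # 16
```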

Now we start using the rules to score possible worlds. For example, the possible worlds where the ground atoms CONTAINS(CARRIAGE, BABY(JOHN)) and CONTAINS(CARRIAGE, BABY(SUSAN)) are both true or both false are not consistent with the rule that every baby carriage contains exactly one baby, so these possible worlds don’t receive a point for that rule. However, the possible worlds where exactly one of the ground atoms CONTAINS(CARRIAGE, BABY(JOHN)) and CONTAINS(CARRIAGE, BABY(SUSAN)) is true are consistent with the rule that every baby carriage contains exactly one baby, so these possible worlds each get 1 point.

The remaining rule, which states that “if a baby carriage is in the driving lane and contains some baby, then that baby is also in the driving lane”, gives an additional point to the possible worlds where both of the ground atoms CONTAINS(CARRIAGE,BABY(JOHN)) and IN-LANE(BABY(JOHN)) are true and the ground atom CONTAINS(CARRIAGE,BABY(SUSAN)) is false, so these possible worlds get a grand total of 3 points. In addition, the possible worlds where both of the ground atoms CONTAINS(CARRIAGE,BABY(SUSAN)) and IN-LANE(BABY(SUSAN)) are true and the ground atom CONTAINS(CARRIAGE,BABY(JOHN)) is false are also assigned an additional point for a grand total of 3 points. Ultimately only four of the original 32 possible worlds receive 3 points (the rules say nothing about whether the baby outside the carriage is in the lane, so that baby’s IN-LANE atom is free to be true or false), while the remaining 28 possible worlds receive 2 or fewer points.

Another way to visualize this solution is to view it as a nonlinear optimization problem where the objective function assigns a certain number of points to each possible world and the goal is to find the set of possible worlds that maximize the value of the objective function.

More formally, for each logical formula we define an indicator function which returns the value of 1 when that formula is true in a possible world and returns the value of 0 when it is false in that possible world. The objective function is simply the sum of these indicator functions over all of the formulas representing the self-driving car’s knowledge base of common-sense rules and sensor evidence. The goal of the logical inference problem is to find the set of possible worlds which make the value of the objective function as large as possible. Or equivalently, find the possible worlds which get the most points, because those are the possible worlds in which the greatest number of formulas are satisfied.
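One way to sketch this scoring procedure in code (a hypothetical formalization; exactly how groundings are grouped into scoring formulas affects the point totals, so the tallies here may differ slightly from the spoken walkthrough):

```python
# Score each possible world by the number of satisfied formulas:
# sensor evidence, the exactly-one-baby rule, and the in-lane implication rule.
from itertools import product

ATOMS = ["contains_john", "contains_susan", "in_lane_carriage",
         "in_lane_john", "in_lane_susan"]

def score(w):
    points = 0
    points += w["in_lane_carriage"]                       # sensor evidence
    points += w["contains_john"] != w["contains_susan"]   # exactly one baby (XOR)
    # If the carriage is in the lane and contains a baby, that baby is in the lane.
    points += all((not (w[f"contains_{x}"] and w["in_lane_carriage"]))
                  or w[f"in_lane_{x}"] for x in ("john", "susan"))
    return points

worlds = [dict(zip(ATOMS, v)) for v in product([True, False], repeat=5)]
best = max(score(w) for w in worlds)
print(best)                            # 3: all three formulas can be satisfied at once
winners = [w for w in worlds if score(w) == best]
```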

So this provides us with some useful machinery for representing common-sense knowledge using first-order logic. Using these representations, it is possible for different pieces of common-sense knowledge to be represented and manipulated in computer software programs. The next problem is: where does one get all of this common-sense knowledge? One possible idea is that you just sit down in front of the computer and start typing in a bunch of common-sense knowledge. However, the difficulty with this approach is that it would take years and years before you could get even a small fraction of common-sense knowledge into the computer in the form of first-order logic expressions. But don’t worry, there is good news. People have, in fact, been typing common sense knowledge into computers for over 38 years!

In 1979, Professor Douglas Lenat of Stanford University began the development of a representation language for the purpose of representing common-sense knowledge. The key objective was to represent facts about the world such as “Babies travel in baby carriages”, “Mickey Mouse is not the same individual as Bullwinkle the Moose”, “Every American has a mother”, and “A baby carriage has wheels” in this large database. Every year since 1979, scientists have added little pieces of common sense knowledge to this database. More technically, the modern version of this representation language, called CycL, uses first-order logic to represent common sense knowledge in a computationally efficient manner and to support computationally efficient retrieval strategies.

How did the CycL system “learn” common sense knowledge facts? A simple answer: it did not learn anything! Human beings simply typed a bunch of simple knowledge facts about the world into the system, with the idea that simple facts such as “The sky is blue”, “The earth is a planet”, and “Rain drops are wet” would be important and valuable information for any artificially intelligent system. Ok…so how many facts could the CycL system have in its knowledge base if the only way the facts could be obtained is by typing them in by hand? Well, today in 2018 the knowledge base contains about 7 million common-sense rules and assertions, covers more than 630,000 concepts, and has taken over 1,000 person-years to construct. All of this common-sense knowledge is represented using the method of first-order logic which we have just described.

The good news for scientists and engineers interested in building learning machines with common sense is that this database is freely available for research. You can have all of this common sense knowledge at your fingertips for free!! ResearchCyc is a software development environment which can be accessed by machine learning researchers for free. CYC also contains English parsing and generation tools, Java-based interfaces for knowledge editing and querying, and ontology-based data integration technology.

But there are some problems with using logic to represent common-sense knowledge. The first problem is that one can have inconsistencies in the database. So, for example, suppose we have a piece of common-sense knowledge represented as the ground atom CAN-FLY(BIRD) assigned the value of TRUE, which represents the common-sense rule that “A bird can fly”. Then someone inputs the knowledge that “A penguin is a bird”, and someone else years later inputs the knowledge that “Penguins can not fly”. Now the database has a logical inconsistency built into it. In the real world, we try to prevent these situations from happening, but they will happen, and their resolution can be complicated. So this is an example of a collection of common-sense knowledge rules which are logically inconsistent.

Now for the second problem: suppose we want to determine if an entity called SAM is in a particular baby carriage. We know that baby carriages contain babies, but we don’t know if SAM is a baby. This is a situation where we have a collection of common-sense knowledge but we don’t have enough information to logically determine a unique inference.

A third problem is that it is very difficult to write a common-sense rule which is generally applicable in all situations. It is possible to have baby carriages which contain babies and it is also possible to have baby carriages which do not contain babies. So what types of conditions are required to ensure that a baby carriage definitely contains a baby or definitely does not contain a baby? It’s very difficult to write such rules in practice.

You can certainly write rules such as “A baby carriage which is being sold in a department store does not contain a baby” and “A baby carriage owner who is pushing their baby carriage through a department store will have a baby in that baby carriage”. These rules will help resolve some of these issues, but it is virtually impossible to write such rules perfectly. Even simple rules such as “Birds can fly” need lots of exception conditions. For example, the bird could be a penguin, the bird could be dead, the bird could have its feet stuck in concrete, or the bird could be tied to some train tracks. In the real world, there is always an unexpected possibility which must be mathematically taken into account to support successful inference.

To address these three problems, we introduce the concept of a Markov Logic Net. This is accomplished by modifying the objective function we previously defined in the following manner. Specifically, we first multiply each logical predicate by a free parameter which is called a “weight”. The “weight parameter” specifies the relative importance of that logical formula. The weight parameter is really useful because it can be used to weight the relative likelihood of truth of different pieces of common sense knowledge. So, for example, an assertion such as “Every Baby Carriage Contains Exactly One Baby” might be weighted with the weight parameter value of 10, an assertion such as “A Baby Carriage is in the Driving Lane on a Freeway” might have a weight parameter value of -10  because it is very rare, while an assertion such as “John is the name of a Particular Baby” might have a weight parameter of +2.

Now we construct the objective function which we used previously for logical inference, but rather than defining the objective function as an unweighted sum of indicator functions, the objective function is defined as a weighted sum in which each formula’s indicator is multiplied by that formula’s weight parameter. Maximizing this objective function assigns points to possible worlds as before, but since some formulas carry large or small positive weights while others carry large or small negative weights, the nonlinear optimization problem takes these weightings into account. That is, the nonlinear optimization problem generates the possible worlds with the most points, or equivalently the possible worlds which maximize the weighted sum specified by the objective function.
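The weighted objective can be sketched as follows (a hypothetical sketch: the formula names and weight values below are illustrative, echoing the examples in this episode, not a real knowledge base):

```python
# Each formula's indicator (1 if true in the world, 0 otherwise)
# is multiplied by that formula's weight before summing.
weights = {"exactly_one_baby": 10.0,
           "carriage_in_lane": -10.0,   # very rare event, so a negative weight
           "baby_named_john": 2.0}

def weighted_score(indicators):
    # indicators maps formula name -> 1 (true in this world) or 0 (false)
    return sum(weights[f] * indicators[f] for f in weights)

print(weighted_score({"exactly_one_baby": 1,
                      "carriage_in_lane": 1,
                      "baby_named_john": 1}))   # 10 - 10 + 2 = 2.0
```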

In order to semantically interpret the weighting parameters, or to estimate them from observed data, it is helpful to assign a probability to every possible world in the following manner. Let the probability of a possible world be defined as the exponential function evaluated at the weighted sum described above, divided by an appropriate normalization constant. The normalization constant is chosen so that the probabilities assigned to all possible worlds sum to one. Then the Markov Chain Monte Carlo techniques of Episode 66 of Learning Machines 101 can be used to solve the problem of identifying the most probable possible worlds. The key idea, as discussed in Episode 66, is to randomly hop from one possible world to the next in such a manner that the most probable possible worlds are visited more frequently.
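The exponentiate-and-normalize construction can be sketched numerically (the scores below are made-up weighted sums for four hypothetical possible worlds):

```python
# P(world) = exp(weighted score) / Z, where Z normalizes over all possible worlds.
import math

scores = [3.0, 1.0, 1.0, -2.0]                 # hypothetical weighted sums

Z = sum(math.exp(s) for s in scores)           # normalization constant
probs = [math.exp(s) / Z for s in scores]

assert abs(sum(probs) - 1.0) < 1e-12           # probabilities sum to one
print(round(probs[0], 2))                      # the highest-scoring world is most probable, ~0.78
```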

Another advantage of assigning probabilities to possible worlds in this way is that one can adjust the weighting parameters so that the probability assigned to a possible world by the exponentiated weighted sum “matches” as closely as possible the observed frequency of that possible world in the environment. Episode 55 describes the method of Maximum Likelihood estimation which can be used to construct an objective function for learning, and Episode 65 describes the method of gradient descent which can be used to minimize that objective function. The method of Maximum Likelihood estimation described in Episode 55, however, only works if the training data indicates the truth values of all logical predicates in a given possible world. If the training data contains some logical predicates whose truth values are unknown, then a maximum likelihood objective function for learning needs to be constructed which can handle this missing data. In this scenario, involving training data consisting of only partially observable possible worlds, the Monte Carlo Expectation Maximization techniques of Episode 67 of Learning Machines 101 can be used to estimate the weighting parameters.
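As a toy illustration of maximum likelihood weight estimation (a hypothetical single-formula model with made-up numbers, not the podcast's code): for these exponential-family models, the gradient of the log-likelihood with respect to a formula's weight is the observed frequency of that formula minus the model's expected frequency.

```python
# Gradient ascent on the log-likelihood for a model with one formula and one weight w.
# With a single binary formula, the model's expected frequency is the logistic of w.
import math

observed = 0.8          # fraction of training worlds in which the formula is true
w = 0.0                 # initial weight
for _ in range(2000):
    expected = math.exp(w) / (1.0 + math.exp(w))   # model's expected frequency
    w += 0.1 * (observed - expected)               # gradient ascent step

print(round(math.exp(w) / (1.0 + math.exp(w)), 3))   # converges to 0.8, matching the data
```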

It’s now time for the new Book Review Segment of this Learning Machines 101 Podcast!

Today, we will review a great discussion of Markov Logic Nets written by Dr. Pedro Domingos and Dr. Daniel Lowd. Dr. Domingos and Dr. Lowd are computer science professors at the University of Washington and the University of Oregon respectively and both have made important scientific contributions to the exciting field of Markov Logic Nets. In 2009, they published a very nice book titled “Markov Logic: An Interface Layer for Artificial Intelligence” which will now be briefly reviewed. The book begins with a short introduction to first-order logic and Markov random fields. It then provides a more in-depth discussion of many of the topics covered in this podcast. Chapters 3, 4, and 5 cover more advanced topics in Markov Logic Nets which we did not have an opportunity to discuss in this podcast.

Chapter 6 discusses the applications of Markov Logic Nets to several real-world problems such as: labeling objects; analyzing social networks such as Facebook and LinkedIn for the purpose of determining what types of properties connect people together; and information extraction, where the goal is to process raw text or semi-structured data sources and extract key pieces of information which can then be stored in a standard relational database. In addition, they also discuss the use of Markov Logic Nets for “robot mapping”. In robot mapping, a robot collects information about its environment using laser beams which provide information about the location of points in three-dimensional space. The Markov Logic Net is then used to construct a map of the environment which indicates the location of doors and walls to support robot navigation. Another really exciting application of a Markov Logic Net is semantic network extraction from text. This application is totally cool. The goal of this type of research is to avoid having humans type common sense knowledge and facts into a big database such as CYC. Rather, the learning machine would simply go surfing on the web and read millions of web pages looking for little facts mentioned on those web pages. Once it finds a fact, the learning machine would automatically add it to the common sense knowledge database!

In Appendix A of the book “Markov Logic: An Interface Layer for Artificial Intelligence” the authors describe a free open-source software package which can be used by researchers and engineers to build their own Markov Logic Nets. They call this system ALCHEMY. If you go to the show notes of this episode, I provide a hyperlink to this software so that you can download this free software system for building and evaluating Markov Logic Nets!!

Although this book is not really designed for the novice machine learning student, it is written in a relatively clear manner and is relatively self-contained, so it should be accessible and useful to scientists and engineers interested in an introduction to Markov Logic Net technology. The mathematical prerequisites are minimal, but you might want to do some supplemental background reading on first-order logic, Markov random fields, and stochastic optimization methods.

Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast.

You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear”  link.

If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time. You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on LinkedIn.

From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on LinkedIn and do my best to talk about topics of interest to the members of this group!

And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”!

Also please visit us on ITUNES and leave a review. You can do this by going to the website: www.learningmachines101.com and then clicking on the ITUNES icon. This will be very helpful to this podcast! Thank you so much.  Your feedback and encouragement are greatly valued!

Keywords:  Logic, logical rules, first-order logic, Markov random fields, probabilistic logic, common-sense knowledge, CYCL, CYC, Markov logic, Markov Logic Net, MLN, MRF

Free Software:

ResearchCyc (includes 630,000 concepts, 7 million facts and rules). The world’s largest and most complete general knowledge base and commonsense reasoning engine, for free!!!

(http://www.cyc.com/platform/researchcyc/)

EnterpriseCYC (version of ResearchCyc for commercial applications) (may not be free)

Markov Logic Net Software and Example Applications (Free!) (http://alchemy.cs.washington.edu/)

Related Episodes of Learning Machines 101

Episode 3: How to represent knowledge using logical rules

Episode 8: How to represent beliefs using probability theory

Episode 66: How to solve large complex constraint satisfaction problems

Episode 67: How to learn to solve large constraint satisfaction problems

Episode 55: How to learn statistical regularities using MAP and maximum likelihood estimation

Episode 65: How to design gradient descent learning machines

 

This Week in Machine Learning Podcast on Statistical Relational Databases

Statistical Relational Artificial Intelligence with Sriraam Natarajan

Further Reading:

First-Order Logic (Wikipedia article)

Markov Logic Network (Wikipedia article)

Markov Random Field (https://en.wikipedia.org/wiki/Markov_random_field)

Cyc (Common Sense Knowledge Project) (Wikipedia Article)

Event Modeling and Recognition using Markov Logic Networks by Tran and Davis (2008). ECCV 2008, Part II, LNCS 5303, pp. 610-623.

Complex Events Recognition under Uncertainty in a Sensor Network by Kanaujia, Choe, and Deng (arXiv:1411.0085) (2014)

Statistical Relational Artificial Intelligence (2016) by Luc De Raedt, Kristian Kersting, and Sriraam Natarajan. Morgan & Claypool Publishers.

Markov Logic Networks by Matthew Richardson and Pedro Domingos. Machine Learning, 62, pp. 107-136 (2006).

Book Review — Markov Logic: An Interface Layer for Artificial Intelligence
