LM101-058: How to Identify Hallucinating Learning Machines using Specification Analysis
In this 58th episode of Learning Machines 101, I’ll be discussing an important new scientific breakthrough published just last week in the journal Econometrics, in the special issue on Recent Developments of Specification Testing, titled “Generalized Information Matrix Tests for Detecting Model Misspecification”. The article provides a unified theoretical framework for the development of a wide range of specification analysis methods for determining if a learning machine is capable of learning its statistical environment. The article is co-authored by myself, Steven Henley, Halbert White, and Michael Kashner. It is an open-access article so the complete article can be downloaded for free! The download link can be found in the show notes of this episode at: www.learningmachines101.com. In 30 years everyone will be using these methods so you might as well start using them now!
Hello everyone! Welcome to the fifty-eighth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner. Today, I’ll be discussing an important scientific breakthrough which was published just last week for the first time in the journal Econometrics in the special issue on Recent Developments of Specification Testing which is titled “Generalized Information Matrix Tests for Detecting Model Misspecification”. The article provides a unified theoretical framework for the development of a wide range of methods for determining if a learning machine is capable of learning its statistical environment. The article is co-authored by myself, Steven Henley, Halbert White, and Michael Kashner. This is an open-access article so the complete article can be accessed in the show notes of this episode at: www.learningmachines101.com.
Before discussing the approach, it is important to introduce the concept of model specification analysis. When we develop a machine learning algorithm, the algorithm is often designed to be successful within specific types of statistical environments. This means that the learning machine is often designed to represent a limited class of statistical environments. If the actual statistical environment within which the learning machine lives cannot be represented by the learning machine’s model of reality, then the learning machine can never obtain absolute knowledge of its statistical environment.
For example, suppose that we are interested in predicting the probability that cancer is present in a patient given specific characteristics of the patient such as the patient’s gender, age, white blood cell count, and recent weight loss. Assume that the learning machine assumes that, as an input variable such as white blood cell count increases with the other inputs held fixed, the probability that cancer is present either always increases or always decreases. This monotonicity assumption is, for example, a common assumption in logistic regression modeling.
In the real world, however, having either a very low white blood cell count or a very high white blood cell count tends to increase the probability that cancer is present. This suggests that the representation of white blood cell count as a numerical input variable for the cancer presence prediction task can sometimes result in a learning machine whose probabilistic assumptions about its statistical environment are intrinsically wrong. If a learning machine’s probabilistic assumptions about its statistical environment are intrinsically wrong, then regardless of what learning algorithm one uses and regardless of the amount of training data provided to the learning machine, the learning machine will never be able to acquire absolute knowledge of its statistical environment.
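As a rough illustration of this point (a sketch under made-up assumptions, not an analysis from the paper), the following Python code simulates a hypothetical U-shaped environment and fits a one-input logistic regression to it. The function `true_cancer_probability`, its coefficients, and the standardized white blood cell scale are all invented for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def true_cancer_probability(wbc):
    # Hypothetical U-shaped environment: risk is lowest at a standardized
    # white blood cell count of 0 and rises for very low or very high counts.
    return sigmoid(-2.0 + 1.5 * wbc ** 2)

def fit_logistic(xs, ys, steps=25):
    """Maximum likelihood fit of p(y=1|x) = sigmoid(a + b*x) by Newton's method."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w               # negative Hessian of the log-likelihood
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

random.seed(0)
xs = [random.uniform(-2.0, 2.0) for _ in range(5000)]
ys = [1 if random.random() < true_cancer_probability(x) else 0 for x in xs]
a_hat, b_hat = fit_logistic(xs, ys)

for wbc in (-2.0, 0.0, 2.0):
    fitted = sigmoid(a_hat + b_hat * wbc)
    print(f"wbc={wbc:+.1f}  true={true_cancer_probability(wbc):.3f}  fitted={fitted:.3f}")
```

Whatever values of `a_hat` and `b_hat` the fit returns, the fitted probability can only move in one direction as the input grows, so it cannot reproduce a risk that is high at both extremes and low in the middle. More training data does not help, because the limitation is in the model's assumptions rather than in the estimation.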
One solution to this issue which has often been proposed in the machine learning literature is the use of “non-parametric” learning machines. Such algorithms make minimal assumptions about their statistical environment. Nevertheless, every learning algorithm MUST have certain biases regarding the nature of its statistical environment. If this were not the case, then the optimal representation of the training data would be the training data itself and the learning machine would be incapable of exhibiting generalization performance. Thus, non-parametric learning machine methods are limited to situations where one has large amounts of training data which are sufficient to support extraction of critical statistical regularities.
In contrast to non-parametric learning machines, parametric learning machines tend to make stronger assumptions about their statistical environment. By making these stronger assumptions, a parametric learning machine has the potential to learn much more rapidly and make better inductive inferences provided that the hints given to the parametric learning machine are “good hints”. If the parametric learning machine is provided “bad hints”, then the parametric learning machine will have more difficulty learning and its inductive inferences will be worse. In today’s discussion, we introduce a new important class of tools for determining if the hints provided to a given parametric learning machine contain “bad hints”.
Parametric learning machines are also widely used in the machine learning literature and include algorithms such as “linear regression” and “logistic regression”. In addition, given a sufficient amount of training data and appropriate regularization terms, more complicated multilayer learning machines as well as unsupervised learning machines can also be viewed within a parametric statistical modeling framework.
If one cannot find a set of parameter values for the learning machine so that it is capable of exactly representing its statistical environment, then the learning machine’s probability model is said to be “misspecified”. On the other hand, if there exists a set of parameter values for the learning machine’s probabilistic model such that the learning machine’s probability model can represent its statistical environment perfectly, then the learning machine’s probability model is said to be “correctly specified”.
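These two definitions can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the episode: p_e denotes the probability law of the statistical environment, and the model is the family of distributions the learning machine can represent as its parameter vector ranges over a parameter space.

```latex
\begin{align*}
\mathcal{M} &= \{\, p(x \mid \theta) : \theta \in \Theta \,\}
  && \text{(the learning machine's probability model)} \\
\text{correctly specified:} &\quad \exists\, \theta^{*} \in \Theta \ \text{such that}\ p(x \mid \theta^{*}) = p_{e}(x) \ \text{for all } x \\
\text{misspecified:} &\quad p(\cdot \mid \theta) \neq p_{e} \ \text{for every } \theta \in \Theta
\end{align*}
```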
Clearly it is very desirable to have a model which is correctly specified and capable of representing its statistical environment. Nevertheless, we need to think carefully about what this really means. The first point is that if a probability model is correctly specified, this simply means that there exists a set of parameter values such that the model can represent its statistical environment. In other words, correct specification is an assertion that it is “possible” for the learning machine to learn how to correctly generalize in its environment, but it is not a guarantee that it will learn such a solution.
The second point is that although it is true that correctly specified probability models tend to be more predictive than misspecified probability models, predictive model fit and correct specification are not necessarily correlated. For example, suppose we have a model which predicts tomorrow’s closing price for a stock given today’s closing price using a linear regression type random walk model. Specifically, one assumes a probability model where tomorrow’s closing price is equal to today’s closing price plus a constant drift parameter value plus additive zero-mean Gaussian noise. The variance of the Gaussian noise might be quite large but this random walk model might be an excellent but relatively non-predictive model of a stock’s closing price.
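The random walk example can be made concrete with a small simulation. This is only a sketch: the drift, noise level, starting price, and number of days are hypothetical values chosen for illustration. The environment here really is a random walk with drift, so the model is correctly specified, yet its one-day-ahead prediction interval remains wide.

```python
import math
import random

random.seed(1)
drift, sigma, n_days = 0.05, 2.0, 5000  # hypothetical values

# Environment: tomorrow's price = today's price + drift + Gaussian noise.
prices = [1000.0]
for _ in range(n_days):
    prices.append(prices[-1] + drift + random.gauss(0.0, sigma))

# Maximum likelihood estimates of the drift and the noise standard deviation.
changes = [b - a for a, b in zip(prices, prices[1:])]
drift_hat = sum(changes) / n_days
sigma_hat = math.sqrt(sum((c - drift_hat) ** 2 for c in changes) / n_days)

# Nominal 95% prediction interval: today's price + drift_hat +/- 1.96 * sigma_hat.
covered = sum(abs(c - drift_hat) <= 1.96 * sigma_hat for c in changes)
coverage = covered / n_days

print(f"estimated drift = {drift_hat:.3f}, estimated noise sd = {sigma_hat:.3f}")
print(f"fraction of days inside the 95% interval = {coverage:.3f}")
```

The interval covers roughly the advertised fraction of days, which is what one expects from a correctly specified model, but the interval is about eight price units wide while the predictable daily drift is a small fraction of one unit. The model describes the environment well and still says very little about tomorrow's price.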
In contrast, suppose one had a nonlinear regression model which uses a complicated nonlinear combination of a stock’s recent history of closing prices. It might turn out that the model’s predictive performance is quite good but the residual prediction errors obey a probability law quite different from the one assumed by the learning machine’s probability model. This is an example of a case where we have a learning machine with high predictive performance but this successful performance is based upon a flawed probability model, although the flaw is relatively unimportant. If one could accurately predict the closing price of a security with some very small error perturbation where the distribution of the small error perturbation was not accurately characterized, this would not cause one to trash the stock market prediction probability model! Thus, this latter case illustrates a situation where the end-user would not have a problem using a misspecified model.
The purpose of the discussion is not to minimize the importance of identifying misspecified models which are not capable of representing their probabilistic environments but rather to help understand the meaning of the concept of model misspecification.
We assume that the learning machine’s goal is to estimate the parameter values of a probability model of its statistical environment such that the parameter values make the observed training data as likely as possible. This approach is called Maximum Likelihood Estimation and was described in detail in Episode 55. To apply this approach, three assumptions are required. First, assume that each training stimulus is generated by sampling with replacement from a large finite set of potential training stimuli. Second, assume a probability model where the predicted probabilities are sufficiently smooth functions of the model’s parameter values. A third important requirement is that in the special case where one can find a set of “true” parameter values for the probability model such that the model correctly generates the probabilities of actual events in the environment, then any sufficiently small perturbation to those parameter values should decrease the likelihood of the observed data. These are the three most critical “regularity” assumptions which are required for the theory of Maximum Likelihood Estimation to hold.
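A minimal sketch of Maximum Likelihood Estimation for a one-parameter model may help here. The finite `population` of stimuli and the Bernoulli model are hypothetical choices made for illustration. The sample is drawn with replacement, the log-likelihood is a smooth function of the parameter, and the printed values show the log-likelihood dropping when the estimate is nudged in either direction, which is in the same spirit as the third requirement.

```python
import math
import random

random.seed(2)

# Hypothetical finite set of potential training stimuli: 30% are "1" events.
population = [1] * 30 + [0] * 70
# First assumption: sample with replacement from that finite set.
sample = random.choices(population, k=500)

def log_likelihood(theta, data):
    """Log-likelihood of a Bernoulli model with P(x = 1) = theta."""
    return sum(math.log(theta) if x == 1 else math.log(1.0 - theta) for x in data)

# For this model the maximum likelihood estimate is the observed frequency.
theta_hat = sum(sample) / len(sample)

for theta in (theta_hat - 0.05, theta_hat, theta_hat + 0.05):
    print(f"theta={theta:.3f}  log-likelihood={log_likelihood(theta, sample):.2f}")
```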
Note that given a data sample consisting of N training stimuli, the estimated parameter values will generally not be exactly equal to the true parameter values. This is because the estimated parameter values will be different for each unique batch of N training stimuli. For example, suppose N training stimuli are generated by the environment and used to estimate the parameters of the probability model. In addition, assume another N training stimuli are generated from the statistical environment to estimate the parameters of the same probability model. The intrinsic sampling error in the training data will make the parameter estimates obtained from the two distinct data samples different. However, as the number of training stimuli N becomes large, this sampling error will tend to decrease and the parameter estimates associated with the two different data sets of N training stimuli will become similar to one another. If the probability model is correctly specified, then the parameter estimates associated with the two different data sets will converge to the “true parameter values” as the number of training stimuli N in a data set becomes large.
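The two-batch thought experiment is easy to simulate. In this sketch the Bernoulli environment and the value of `TRUE_THETA` are assumptions made for illustration. To keep the comparison from depending on one lucky draw, the gap between the two estimates is averaged over many repetitions.

```python
import random

random.seed(3)
TRUE_THETA = 0.3  # hypothetical "true" probability of a 1 event

def estimate(n):
    """Maximum likelihood estimate of a Bernoulli probability from n fresh stimuli."""
    return sum(random.random() < TRUE_THETA for _ in range(n)) / n

def average_gap(n, repetitions=100):
    """Average absolute difference between estimates from two independent batches of size n."""
    return sum(abs(estimate(n) - estimate(n)) for _ in range(repetitions)) / repetitions

gap_small = average_gap(50)
gap_large = average_gap(5000)
print(f"N=50:   average gap between the two estimates = {gap_small:.4f}")
print(f"N=5000: average gap between the two estimates = {gap_large:.4f}")
```

Because the model is correctly specified in this toy environment, both estimates also settle near `TRUE_THETA` as N grows.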
An additional bonus result of the theory of Maximum Likelihood Estimation is that when the probability model is correctly specified, the sampling error of the parameter estimates can be estimated in two different ways. The first method estimates the sampling error using the first derivatives of the log-likelihood of the observed data with respect to the model’s parameter values. The second method estimates the sampling error using the second derivatives of the log-likelihood of the observed data with respect to the model’s parameter values. These formulas for estimating the sampling error have been known in the statistics literature for over 100 years, and the result that they agree under correct specification is sometimes referred to as Fisher’s Information Matrix Equality. In fact, many statistical software packages give the user the option of calculating standard errors for the parameter estimates using these two different methods.
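In symbols, the two formulas and the equality can be sketched as follows. The notation is an assumption introduced for illustration: expectations are taken with respect to the statistical environment, the starred parameter vector denotes the true parameter values, and hats denote sample estimates computed from N training stimuli.

```latex
\begin{align*}
A &= -\,E\!\left[\nabla^{2}_{\theta} \log p(x \mid \theta^{*})\right]
  && \text{(second derivatives)} \\
B &= E\!\left[\nabla_{\theta} \log p(x \mid \theta^{*})\,
        \nabla_{\theta} \log p(x \mid \theta^{*})^{T}\right]
  && \text{(first derivatives)} \\
\text{correct specification} &\;\Longrightarrow\; A = B \\
\widehat{\mathrm{Cov}}(\hat{\theta}_{N}) &\approx \tfrac{1}{N}\,\hat{A}^{-1}
  \quad\text{or}\quad \tfrac{1}{N}\,\hat{B}^{-1}
\end{align*}
```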
Note that when you have a probability model with more than one free parameter the sampling error associated with the parameter estimates is not characterized simply by obtaining the standard errors of all of the parameter estimates. It is also necessary to calculate the covariance matrix of the parameter estimates whose on-diagonal elements are the squares of the standard errors and whose off-diagonal elements specify how the sampling error in one estimated parameter value co-varies with the sampling error in another estimated parameter value. In other words, when we have more than one parameter in a probability model, we characterize the sampling error by an array of numbers which is called the sampling error covariance matrix.
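Here is a minimal sketch of both covariance matrix formulas for a two-parameter model. The Gaussian environment, its mean and standard deviation, and the sample size are hypothetical. The model's parameters are the mean and the variance, and the standard errors are the square roots of the on-diagonal elements of each covariance matrix.

```python
import math
import random

random.seed(4)
n = 20000
data = [random.gauss(5.0, 2.0) for _ in range(n)]  # hypothetical Gaussian environment

# Maximum likelihood estimates for a Gaussian model with parameters (mean, variance).
mu = sum(data) / n
v = sum((x - mu) ** 2 for x in data) / n

def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# A: negative average Hessian of the per-observation log-likelihood (second derivatives).
# B: average outer product of the per-observation gradient (first derivatives).
A = [[0.0, 0.0], [0.0, 0.0]]
B = [[0.0, 0.0], [0.0, 0.0]]
for x in data:
    r = x - mu
    grad = (r / v, -0.5 / v + r * r / (2.0 * v * v))
    hess = ((-1.0 / v, -r / (v * v)),
            (-r / (v * v), 0.5 / (v * v) - r * r / v ** 3))
    for i in range(2):
        for j in range(2):
            A[i][j] -= hess[i][j] / n
            B[i][j] += grad[i] * grad[j] / n

# Sampling error covariance matrices of (mean, variance) under the two formulas.
cov_hessian = [[e / n for e in row] for row in invert_2x2(A)]
cov_gradient = [[e / n for e in row] for row in invert_2x2(B)]

# Standard errors are the square roots of the on-diagonal elements.
se_hessian = [math.sqrt(cov_hessian[i][i]) for i in range(2)]
se_gradient = [math.sqrt(cov_gradient[i][i]) for i in range(2)]
print("standard errors (second derivatives):", se_hessian)
print("standard errors (first derivatives): ", se_gradient)
```

Since the data in this sketch really are Gaussian, the model is correctly specified and the two sets of standard errors should be close to one another. The off-diagonal elements of `cov_hessian` and `cov_gradient` describe how the sampling error in the estimated mean co-varies with the sampling error in the estimated variance.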
So the new idea which was introduced into the literature by Professor Halbert White in 1982 is to explore the contrapositive of the Information Matrix Equality. That is, we know from Maximum Likelihood Estimation Theory that: If the probability model is correctly specified, then the two different formulas for estimating covariance matrices will give similar results. The contrapositive of this statement, which follows from the theory of deductive logic, is then given as follows. If the covariance matrices obtained using these two different formulas are different, then this implies that the probability model is not correctly specified. Professor White showed how to design a statistical test to test the null hypothesis that the covariance matrices of the parameter estimates using these two different methods are identical. Thus, rejecting that null hypothesis becomes a method for detecting the presence of model misspecification. Our recently published paper extends Professor White’s work in a variety of important ways and provides for the first time a general theoretical framework for developing a wide range of novel tests for misspecification for a wide range of smooth parametric probability models.
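The logic of this argument can be sketched with a one-parameter model. The exponential waiting-time model and the two simulated environments below are hypothetical choices for illustration. This is only the raw comparison of the two information estimates, not Professor White's test statistic, which also has to account for sampling variability.

```python
import random

random.seed(5)
n = 20000

def information_estimates(data):
    """Return (A, B) for the exponential model p(x | rate) = rate * exp(-rate * x).

    A uses second derivatives of the log-likelihood, B uses first derivatives,
    both evaluated at the maximum likelihood estimate of the rate.
    """
    rate = len(data) / sum(data)                  # maximum likelihood estimate
    A = 1.0 / rate ** 2                           # -(d^2 / d rate^2) log p = 1 / rate^2
    B = sum((1.0 / rate - x) ** 2 for x in data) / len(data)  # mean squared score
    return A, B

# Environment 1: waiting times really are exponential (model correctly specified).
exponential_data = [random.expovariate(1.0) for _ in range(n)]
# Environment 2: gamma waiting times with the same mean (model misspecified).
gamma_data = [random.gammavariate(4.0, 0.25) for _ in range(n)]

for name, data in (("exponential", exponential_data), ("gamma", gamma_data)):
    A, B = information_estimates(data)
    print(f"{name:12s} A={A:.3f}  B={B:.3f}  B/A={B / A:.3f}")
```

In the correctly specified environment the ratio B/A should be near 1. In the gamma environment it should be far from 1, and by the contrapositive argument that disagreement is evidence that the exponential model is misspecified.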
In particular, our key idea is to compare a transformation or function of one covariance matrix with the same transformation or function of the other covariance matrix. These functions or transformations can be chosen to emphasize or de-emphasize different properties of the two covariance matrices. Different choices of functions yield different ways of measuring the similarity between the two covariance matrices. Thus, this provides us with a method for generating many different types of tests for model misspecification.
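To see why the choice of function matters, consider this small sketch. The two matrices are made-up numbers, and the trace and log-determinant are simply two convenient example functions, not necessarily the ones studied in the paper.

```python
import math

def trace(m):
    return m[0][0] + m[1][1]

def log_determinant(m):
    return math.log(m[0][0] * m[1][1] - m[0][1] * m[1][0])

# Two hypothetical 2x2 covariance matrix estimates for the same pair of parameters.
cov_from_first_derivatives = [[2.0, 0.0], [0.0, 0.5]]
cov_from_second_derivatives = [[1.0, 0.0], [0.0, 1.0]]

for name, f in (("trace", trace), ("log-determinant", log_determinant)):
    difference = f(cov_from_first_derivatives) - f(cov_from_second_derivatives)
    print(f"{name:16s} difference = {difference:.3f}")
```

These two matrices have the same determinant, so a comparison based on the log-determinant sees no discrepancy, while a comparison based on the trace does. A test built on one function can therefore be sensitive to a kind of misspecification that a test built on another function misses.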
So how could we use this approach in practice? Many statistical software packages provide the user the option of estimating the covariance matrices of the parameter estimates using either the first or second derivatives of the likelihood function. One can check whether the covariance matrices computed using these two methods are similar by comparing functions of the two covariance matrices. If the two transformed covariance matrices are dissimilar, then this indicates the presence of model misspecification. Our recently published paper reviews the relevant literature in this area and shows how to construct statistical tests for determining if the covariance matrices computed using these different methods are significantly different from one another. We discuss both analytic formulas which don’t require simulation methods as well as simulation-based approaches. An important focus of the paper is to clearly identify the statistical assumptions which allow one to conclude that model misspecification is present. The paper also provides explicit general mathematical formulas which can be used to derive customized misspecification tests for different types of smooth probability models.
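As one hedged illustration of a simulation-based approach, the sketch below uses a parametric bootstrap for a one-parameter exponential model. It is a stand-in chosen for simplicity, not the analytic tests or the specific simulation procedures described in the paper. The function names, sample sizes, and environments are hypothetical.

```python
import math
import random

random.seed(6)

def discrepancy(data):
    """log(B / A) for the exponential model; near zero when the two formulas agree."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    # At the maximum likelihood estimate, A = mean^2 and B = variance.
    return math.log(variance / mean ** 2)

def bootstrap_p_value(data, replications=200):
    """Compare the observed discrepancy with discrepancies simulated from the fitted model."""
    n = len(data)
    rate_hat = n / sum(data)
    observed = abs(discrepancy(data))
    exceed = 0
    for _ in range(replications):
        simulated = [random.expovariate(rate_hat) for _ in range(n)]
        if abs(discrepancy(simulated)) >= observed:
            exceed += 1
    return (1 + exceed) / (1 + replications)

correct_data = [random.expovariate(1.0) for _ in range(300)]
misspecified_data = [random.gammavariate(4.0, 0.25) for _ in range(300)]

p_correct = bootstrap_p_value(correct_data)
p_misspecified = bootstrap_p_value(misspecified_data)
print(f"p-value, exponential environment: {p_correct:.3f}")
print(f"p-value, gamma environment:       {p_misspecified:.3f}")
```

A small p-value means that the observed disagreement between the two formulas is larger than what the fitted model itself would typically produce, so the null hypothesis of correct specification is rejected. A large p-value does not prove that the model is correctly specified. It only means that this particular comparison found no evidence against it.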
In the show notes of this episode, we provide a hyperlink to our article which you can download for free since this is an open-access article. The article provides all technical details of our approach as well as a review of the relevant statistical literature on this topic. In 30 years, everybody will be using these new powerful methods for the detection of model misspecification so you might as well start using them now!
Before leaving today, I would like to note that John Sonmez, who is the founder of the “Simple Programmer” blog and author of “Soft Skills: The Software Developer’s Life Manual”, has constructed “The Ultimate List of Developer Podcasts”. I was quite pleased to be included on the list of “Data and Machine Learning” podcasts which includes not only my podcast “Learning Machines 101” but also other great Data Science and Machine Learning podcasts such as: Partially Derivative, the Data Skeptic, Linear Digressions, the O’Reilly Data Show, Data Stories, and Talking Machines.
I encourage you to visit John’s website to check out the “Data and Machine Learning” podcast list as well as the other podcast lists he has constructed for his “Ultimate List of Developer Podcasts”. Again, the link to his “Ultimate List of Developer Podcasts” can be found by visiting the website: www.learningmachines101.com.
Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast.
You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time. You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on LinkedIn.
From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on LinkedIn and do my best to talk about topics of interest to the members of this group!
And don’t forget to follow us on Twitter. The Twitter handle for Learning Machines 101 is “lm101talk”!
And finally, I noticed I have been getting some nice reviews on iTunes. Thank you so much. Your feedback and encouragement are greatly valued!
Keywords: misspecification analysis, correct specification, model fit, information matrix test, goodness-of-fit
Special Issue of Econometrics “Recent Developments of Specification Testing”
Golden, R. M., Henley, S. S., White, H., and Kashner, T. M. (2016). Generalized Information Matrix Tests for Detecting Model Misspecification. Econometrics, 4(4), 46.
http://www.mdpi.com/2225-1146/4/4/46 (this is a free open-access article!)
Golden, R. M., Henley, S. S., White, H., and Kashner, T. M. (2013). New Directions in Information Matrix Testing: Eigenspectrum Tests. In Recent Advances and Future Directions in Causality, Prediction, and Specification Analysis, edited by X. Chen and N. R. Swanson.
LM101-055: How to Learn Statistical Regularities using MAP and Maximum Likelihood Estimation.
Fisher Information https://en.wikipedia.org/wiki/Fisher_information
The Ultimate List of Developer Podcasts https://simpleprogrammer.com/2016/10/29/ultimate-list-developer-podcasts/
Copyright © 2016 by Richard M. Golden. All rights reserved.