LM101-076: How To Choose the Best Model using AIC or GAIC

January 22, 2019
The Model Selection Problem.

Episode Summary:

In this episode, we explain the proper semantic interpretation of the Akaike Information Criterion (AIC) and the Generalized Akaike Information Criterion (GAIC) for the purpose of picking the best model for a given set of training data. We state the precise semantic interpretation of these model selection criteria, the explicit assumptions required for the AIC and GAIC to be valid, and explicit formulas for the AIC and GAIC so they can be used in practice. Briefly, AIC and GAIC provide a way of estimating the average prediction error of your learning machine on test data without using test data or cross-validation methods. The GAIC is also called the Takeuchi Information Criterion (TIC).

Show Notes:

Hello everyone! Welcome to the 76th podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.

In this podcast, we will discuss the Akaike Information Criterion which is a commonly used method for solving the “model selection problem”. In the model selection problem you have two or more different models. For example, suppose you are trying to predict whether or not it will rain tomorrow based upon the temperature, humidity, and rainfall for the past seven days. This could be one model. Another model might try to predict whether or not it will rain tomorrow based upon only the temperature and rainfall measurements for the past month. You would like to figure out which of these two models will be more useful for predicting whether or not it will rain tomorrow. To solve this problem using the Akaike Information Criterion we do the following.

First, estimate the performance of your learning machine for one of the models using some training data. Second, take the number of free parameters of the model, divide it by the sample size, and add this to the estimated average prediction error on the training data. Third, multiply the result by twice the number of training stimuli; that gives you the Akaike Information Criterion, or AIC, for the model. Fourth, choose the model which has the smallest AIC as the best model. Note that since all of the models are fit to the same data set, the step in which you multiply the average prediction error plus the penalty for the number of parameters by twice the number of training stimuli can be omitted without changing the results of the model selection process. I usually prefer to omit this step since I find the average prediction error plus the number of free model parameters divided by the sample size easier to interpret.
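
To make these four steps concrete, here is a minimal Python sketch of the procedure. The candidate models, their training errors, parameter counts, and the sample size are all made-up placeholder numbers; only the arithmetic of the criterion comes from the steps above, and the training error is assumed to be the average prediction error (for example, the normalized negative log-likelihood) of each fitted model.

```python
def aic(avg_training_error, num_free_parameters, num_training_stimuli):
    """Official AIC: twice the sample size times (in-sample error + q/n).

    avg_training_error is assumed to be the average prediction error of the
    fitted model on the training data, e.g. the normalized negative
    log-likelihood.
    """
    penalized_error = avg_training_error + num_free_parameters / num_training_stimuli
    return 2.0 * num_training_stimuli * penalized_error

# Hypothetical candidates: (average training error, number of free parameters),
# each fitted to the SAME training data set of n stimuli.
candidate_models = {
    "seven-day model": (0.42, 21),
    "one-month model": (0.40, 60),
}
n = 500  # number of training stimuli (made up for illustration)

scores = {name: aic(err, q, n) for name, (err, q) in candidate_models.items()}
best_model = min(scores, key=scores.get)  # smallest AIC wins
print(scores, "-> choose:", best_model)
```

Because every candidate is scored on the same data set, dropping the final multiplication by twice the number of training stimuli rescales all of the scores by the same factor and leaves the ranking unchanged, which is why the normalized form preferred in this episode selects the same model.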

The key qualitative intuition behind the Akaike Information Criterion is that if you have two models which have good performance using the training data then you should pick the model which has fewer parameters. However, if you have a lot of training data then the difference in the number of free parameters between the two models should make less of a difference.

So the Akaike Information Criterion is relatively easy to compute because you are basically just adding the number of free parameters divided by the number of training stimuli to the estimated prediction error. Compute the AIC for each model and choose the model with the smallest AIC. Simple, right? The problem is that because this is so simple to implement and use, there is a lot of potential for using the Akaike Information Criterion incorrectly, resulting in wrong solutions to the model selection problem. In fact, I would not be surprised if the majority of scientists and engineers are incorrectly using the Akaike Information Criterion because they do not have a clear understanding of what the AIC is designed to estimate and the specific conditions under which the AIC estimates are valid.

In this podcast, we will explain the concepts of the Akaike Information Criterion carefully to better understand its strengths and limitations. We will explicitly state the assumptions required for the AIC to hold. In the show notes of this podcast, I will provide references to the literature which describe the explicit assumptions required for the AIC to be valid and the key theorems which reveal the specific semantic interpretation of the AIC. Then we will introduce some advanced improvements of the Akaike Information Criterion which relax some of its more problematic assumptions. By explaining these concepts, my goal is to reduce incorrect applications of the AIC and to help you use this powerful tool for model selection more effectively.

So the first step of this process is to understand in a precise way the semantic interpretation of the AIC. Some researchers state that the semantic interpretation is to find a parsimonious model that fits the data. That’s why you add 2 times the sample size multiplied by the model’s prediction error on the training data to twice the number of free parameters in the model to get the AIC. A model with a lower prediction error and fewer free parameters will be a more parsimonious model that fits the data. But what is the logic behind the magic of this ritual? Why don’t we multiply the model’s prediction error by 7, which is a much more magical number, and then add the number of free parameters? Why don’t we square the number of free parameters of the model and then add that to the square root of the model’s prediction error? All of these formulas are equally consistent with the intuitive qualitative concept that we are trying to find a model which fits the data using as few parameters as possible. Also, how do we count free parameters? Suppose I have a Gaussian probability model with two parameters: a mean and a variance. I then decide I want to rewrite the mean as the product of 8 parameters, which yields a Gaussian probability model with 9 parameters. Does that change the AIC? And is that even a legitimate way of counting parameters? In order to understand the answers to these questions, which will be revealed shortly, it is necessary to re-examine the fundamental semantic interpretation of the AIC.

When a model’s parameters are fitted to a data set and the model’s average prediction error is computed using the same data set then the average prediction error is called the “in-sample prediction error” or “training error”. The problem with using the “training error” as a measure of performance is that in many important machine learning applications we are interested in evaluating the “generalization performance” of a learning machine. The “training error” does not evaluate generalization performance. It only evaluates the ability of the learning machine to “memorize” the training data set.

A method for assessing generalization performance rather than memorization performance is an approach called “cross-validation”. Cross-validation methods are widely used in machine learning. The basic concept of cross-validation was discussed in Episode LM101-028 of Learning Machines 101 at the website: www.learningmachines101.com. Using the method of cross-validation, the goal is to estimate the expected value of the prediction error on a test data set when the parameters have been estimated on a training data set. The simplest way to do this is to use a train-test method where one divides the original data set into a training data set and a testing data set. One then estimates the parameters of the model on the training data, resulting in the “fitted model”. The average prediction error of the fitted model is then evaluated using the test data set. This is called the out-of-sample prediction error or the “test error”.
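
As a small illustration of the train-test approach, here is a sketch using a toy least-squares model built with numpy. The simulated data, the linear model, and the squared-error measure of prediction error are my own stand-ins rather than anything specified in the episode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: feature vectors x and targets y from a noisy linear rule.
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Divide the original data set into a training set and a testing set.
train_x, test_x = x[:150], x[150:]
train_y, test_y = y[:150], y[150:]

# Fit the model using the training data only (ordinary least squares here).
w, *_ = np.linalg.lstsq(train_x, train_y, rcond=None)

train_error = np.mean((train_x @ w - train_y) ** 2)  # in-sample ("training") error
test_error = np.mean((test_x @ w - test_y) ** 2)     # out-of-sample ("test") error
print(train_error, test_error)
```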

The potential difficulty with this approach is that if there are important statistical regularities in the test data set but not the training data set, then the model will not be trained on those important statistical regularities yet will be tested on them. Another potential problem is that the training data set or the test data set may have some fluke statistical regularities which are peculiar to that particular data set, and these also make it difficult to estimate the average prediction error on the test data when the model is trained on the training data.

An improved method for doing this is to implement 2-fold cross-validation. This works by first dividing the data set into two data sets. Second, the model’s parameters are estimated using the first data set and then the resulting fitted model’s average prediction error is computed using the second data set. Third, the model’s parameters are re-estimated using the SECOND data set and then the resulting fitted model’s average prediction error is computed using the FIRST data set. This gives us two different estimates of out-of-sample prediction error when the model parameters are fitted to the training data and then the resulting fitted model’s prediction error is estimated using test data. These two different estimates of the out-of-sample prediction error are then averaged to obtain an improved out-of-sample prediction error estimator.

However, this method estimates the out-of-sample prediction error using only two different training data sets, and these data sets have fewer training stimuli than the original full data set. Suppose we have a training data set with N training stimuli. We could fit the model to N-1 training stimuli and test the model on the remaining training stimulus to get an out-of-sample prediction error for that remaining training stimulus. Then we could repeat this process using a different subset of N-1 training stimuli to get another out-of-sample prediction error estimate. Proceeding in this manner, one can obtain N out-of-sample prediction errors which can then be averaged to obtain the final out-of-sample prediction error estimator. A nice feature of this approach, which is called “leave one out” cross-validation, is that not only does one average N estimators rather than just 2, but each training data set has N-1 training stimuli, so hopefully most of the important statistical regularities will be present in most of the N training data sets.
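
Here is a sketch of this refit-and-average idea using the same kind of toy least-squares setup. Setting k equal to 2 reproduces the two-fold scheme described two paragraphs above, and setting k equal to the number of training stimuli reproduces leave-one-out cross-validation. Again, the simulated data and the squared-error loss are illustrative choices of mine, not the episode's.

```python
import numpy as np

def kfold_cv_error(x, y, k):
    """K-fold cross-validation error for a toy least-squares model.

    k = 2 gives the two-fold scheme; k = len(y) gives leave-one-out.
    """
    n = len(y)
    folds = np.array_split(np.arange(n), k)          # disjoint held-out index sets
    errors = []
    for held_out in folds:
        keep = np.setdiff1d(np.arange(n), held_out)  # fit on all the other stimuli
        w, *_ = np.linalg.lstsq(x[keep], y[keep], rcond=None)
        errors.append(np.mean((x[held_out] @ w - y[held_out]) ** 2))
    return float(np.mean(errors))                    # average the k estimates

rng = np.random.default_rng(0)
x = rng.normal(size=(60, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=60)
print(kfold_cv_error(x, y, k=2), kfold_cv_error(x, y, k=len(y)))
```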

The difficulty with this approach, however, is that it is computationally intensive. Suppose one has a large data set with a million training stimuli. Also assume that it takes 5 seconds to fit the model and compute its prediction error for a training data set with about a million stimuli. Then it would take 5 seconds multiplied by a million, or about 1400 hours of computing time, to compute the leave-one-out cross-validation error.

So here is the main point of this podcast! There is a simple formula we can use to estimate the leave-one-out cross-validation error using 5 seconds of computing time rather than 1400 hours of computing time! It works for essentially any learning machine whose parameter estimates are computed by minimizing an average prediction error, obtained by averaging the prediction error for each training stimulus across a set of training data.

The simple formula is called the Akaike Information Criterion or AIC which was proposed in 1973 by Akaike. The AIC is widely used in the field of machine learning but not as widely used as cross-validation. If you type cross-validation into Google Scholar you will get about 1.5 million search results.  If you type AIC into Google Scholar you will get about 900,000 search results. The AIC is very easy to implement in software.

[pause]

Here is the basic formula for estimating the out-of-sample prediction error using the AIC method. Step 1: Take the prediction error on the training data which is your in-sample prediction error. Step 2: Then add the number of free parameters in your model divided by the number of training stimuli to the in-sample prediction error. The resulting sum is an estimator of the out-of-sample prediction error!
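
Written out in symbols (my notation, not a formula quoted verbatim from the episode): if n is the number of training stimuli, q is the number of free parameters, and the in-sample error is the average prediction error on the training data, then the estimate is

```latex
\hat{e}_{\mathrm{out}} \;\approx\; \hat{e}_{\mathrm{in}} + \frac{q}{n}
```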

[pause]

The Akaike Information Criterion thus allows us to estimate the out-of-sample generalization error without refitting the model multiple times with different subsets of the training data and then averaging!

Note that the official definition of the AIC requires an additional multiplication by 2 times the number of training stimuli. I personally do not like the official definition of the AIC because it is not normalized and is more difficult to semantically interpret.
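
For reference, the relationship between the normalized estimate above and the official definition can be written out as follows. When the prediction error per training stimulus is the normalized negative log-likelihood, so that n times the in-sample error equals minus the log of the maximized likelihood, the official AIC reduces to the familiar textbook form (again, this is my notation rather than a formula read out in the episode):

```latex
\mathrm{AIC} \;=\; 2n\left(\hat{e}_{\mathrm{in}} + \frac{q}{n}\right) \;=\; -2\log\hat{L} + 2q
```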

In order to correctly apply the Akaike Information Criterion, certain assumptions must hold. Fortunately, these assumptions hold in most situations, but unfortunately most engineers and scientists do not carefully check these assumptions before applying the AIC in practice.

Let’s briefly go over a set of sufficient conditions for the AIC to be applicable.

  1. The N training stimuli are the outcomes of sampling N times with replacement from a large but finite set of feature vectors.
  2. The measure of prediction error computed for each feature vector is a smooth function of the parameter values, so that the second derivatives of the prediction error per training stimulus with respect to the parameter values are continuous. In addition, the prediction error should be a continuous or piecewise continuous function of each feature vector.
  3. The parameter estimates are computed by minimizing the average prediction error for a data set.
  4. The parameter estimates converge to a unique set of q parameter values as the number of training stimuli N gets larger and larger.
  5. The derivative of the prediction error per training stimulus, evaluated at the parameter values that minimize the training prediction error, is called the gradient stimulus prediction error vector and has q elements. Compute the outer product of each gradient stimulus prediction error vector and then average all N outer products; the resulting matrix is called the B matrix. The eigenvalues of B should be checked to make sure that they are finite positive numbers. Intuitively, this condition is satisfied provided that at least q of the N gradient stimulus prediction error vectors span a q-dimensional space and are non-redundant information sources.
  6. The second derivative of the prediction error per training stimulus, which is a q by q matrix, is computed and evaluated at the parameter values that minimize the training prediction error. These N second-derivative matrices are averaged to obtain a matrix called the A matrix. We also need to check that the eigenvalues of A are finite positive numbers.
  7. The prediction error for the Akaike Information Criterion is a normalized negative log-likelihood function, so the optimal parameters that minimize the prediction error are maximum likelihood estimates. Maximum likelihood estimation is discussed in Episode LM101-055 of Learning Machines 101.
  8. There exists a set of parameter values for the probability model which specify the exact probability distribution that generated the observed data. This last assumption is called the assumption that the probability model is correctly specified.
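
To make the fifth and sixth conditions more concrete, here is a numerical sketch of the A and B matrices for a toy least-squares model, where the per-stimulus prediction error (x_i · theta - y_i)^2 has a simple closed-form gradient and Hessian. The simulated data and the squared-error loss are my own illustrative choices, not the episode's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 400, 3
x = rng.normal(size=(n, q))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Parameter estimates that minimize the average per-stimulus prediction error
# c_i(theta) = (x_i . theta - y_i)^2 for this toy linear model.
theta, *_ = np.linalg.lstsq(x, y, rcond=None)
residuals = x @ theta - y

# Per-stimulus gradient vectors (each has q elements): 2 * residual_i * x_i.
grads = 2.0 * residuals[:, None] * x

# B matrix: average of the N outer products of the gradient vectors.
B = (grads[:, :, None] * grads[:, None, :]).mean(axis=0)

# A matrix: average of the N per-stimulus Hessians, which here are 2 * x_i x_i^T.
A = (2.0 * x[:, :, None] * x[:, None, :]).mean(axis=0)

# The sufficient conditions require the eigenvalues of A and B to be
# finite positive numbers; check them before trusting AIC or GAIC.
print(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B))
```

If any of the printed eigenvalues is zero, negative, or numerically indistinguishable from zero, the corresponding condition fails and the resulting AIC or GAIC estimate should not be trusted.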

The above assumptions are applicable to a large class of important probability models including linear regression models and logistic regression models. They are also applicable to highly nonlinear regression models and deep learning models which satisfy these assumptions.

In practical applications of the Akaike Information Criterion, some of these assumptions may be violated. When that happens, even though the computer program dutifully calculates an Akaike Information Criterion, the resulting estimate of the out-of-sample prediction error is not valid. These are crucial assumptions which are often satisfied in practice when the parameter estimates converge to a locally unique stable solution. However, even though these assumptions might often be satisfied in practice, it is important to understand when they fail and thus invalidate the methodology.

The first major problem in common usage of the Akaike Information Criterion is that I have rarely seen engineers or scientists report that the eigenvalues of the A and B matrices are converging to positive numbers, which suggests that these important conditions are not properly checked. These assumptions are closely related to the issue of having “stable” parameter estimates with respect to the parameters in your model. For example, suppose you had a linear regression model with 1000 free parameters and you had two data points. You will find in this case that the assumption that the eigenvalues of A and B are all positive is violated. As another example, consider the previously stated example where we have a Gaussian probability distribution with some mean and some variance and we rewrite the mean of the Gaussian distribution as the product of eight free parameters. The resulting Gaussian probability model has 9 free parameters, but this particular way of defining the parameters of the Gaussian probability model violates the assumption that the eigenvalues of the A matrix are positive. Therefore, this is an illegal parameterization.

Another challenge in applying the Akaike Information Criterion occurs with highly nonlinear smooth regression models. For example, consider a deep learning network such as discussed in Episode LM101-023 where the prediction error is constructed so that it is a smooth function of the parameter values. Highly nonlinear regression models and deep learning networks often have multiple saddlepoints, flat regions, and strict local minimizers. In order for the AIC to work, the parameter estimates must converge to a particular strict local minimizer of the average prediction error as the number of training stimuli becomes large. The AIC analysis fails if we converge to a saddlepoint or a flat plateau on the prediction error surface. If we are converging to a particular strict local minimizer, then the analysis is valid, and the semantic interpretation in this case is simply that the AIC is estimating the out-of-sample prediction error associated with that particular strict local minimizer. Another way to think of the condition that we are at a strict local minimizer is that if we perturb the parameter estimates slightly in any direction, then, when the sample size is really large, the average prediction error must increase. It is also worth commenting that when we have a very complicated model with multiple strict local minimizers, we are really using the AIC to compare strict local minimizers rather than models! And these strict local minimizers do not have to belong to different models…they can belong to the same model!

The second major problem in common usage of the Akaike Information Criterion is that, when comparing competing fitted probability models, the formula for the Akaike Information Criterion actually assumes that each of the fitted models is capable of representing the probability distribution which generated the data perfectly. In many practical applications, most of the models which one wishes to compare are not likely to be correctly specified. The Akaike Information Criterion cannot be used to estimate the out-of-sample prediction error for a probability model which is not correctly specified.

In 1976, Takeuchi proposed the Takeuchi Information Criterion which is also known as the Generalized Akaike Information Criterion. The Generalized Akaike Information Criterion is a generalization of the Akaike Information Criterion which does not require the assumption that the probability model is correctly specified. The formula for the Generalized Akaike Information Criterion is given as follows.

[pause]

Here is the formula for estimating the out-of-sample prediction error using GAIC which works regardless of whether or not the model is correctly specified.

Step 1: Compute the inverse of the A matrix and multiply it by the B matrix. Step 2: Add up the diagonal elements of the resulting matrix product and divide by the number of training stimuli. Step 3: Add the average prediction error to that quantity. This gives an estimate of the out-of-sample prediction error which is valid even in the presence of model misspecification.
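
Here is a minimal sketch of that three-step computation, assuming the A and B matrices have already been estimated in the way sketched earlier; the matrices and the error value in the example call are made up purely for illustration.

```python
import numpy as np

def gaic(avg_prediction_error, A, B, n):
    """Normalized GAIC/TIC: in-sample error plus trace(A^{-1} B) / n.

    A is the averaged per-stimulus Hessian of the prediction error and B is
    the averaged outer product of the per-stimulus gradient vectors, both
    evaluated at the parameter estimates that minimize the training error.
    """
    penalty = np.trace(np.linalg.solve(A, B)) / n  # sum of diagonal elements of A^{-1} B
    return avg_prediction_error + penalty

# Illustrative call with made-up 2x2 matrices and a made-up training error:
A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = np.array([[2.2, 0.1], [0.1, 0.9]])
print(gaic(avg_prediction_error=0.41, A=A, B=B, n=500))
```

When the model is correctly specified and the prediction error is the normalized negative log-likelihood, the A and B matrices converge to the same matrix, the trace of A-inverse times B approaches the number of free parameters q, and the GAIC reduces to the normalized AIC described earlier in the episode.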

[pause]

The official version of the GAIC, like AIC, requires that you multiply the final result by twice the number of training stimuli but I personally find the official version of the GAIC to be harder to interpret because it is not normalized.

Both the AIC developed in 1973 and the GAIC developed in 1976 assume that the minimizer of the estimated prediction error corresponds to the maximum likelihood estimator. That is, both AIC and GAIC assume the prediction error is the normalized negative log-likelihood function. Episode LM101-055 of Learning Machines 101 discusses maximum likelihood estimation. Fortunately, however, the Linhart and Volkers article in 1984 and the Model Selection book by Linhart and Zucchini published in 1986 showed that the exact same formula used for GAIC is not limited to the case where the prediction error is a negative normalized log-likelihood function. That is, we can use the GAIC formula even for estimating out-of-sample prediction error when the prediction error is not interpretable as a negative normalized log-likelihood function! I refer to the model selection criterion proposed by Linhart and Volkers as the Cross Validation Risk Criterion.

If you visit the Show Notes at the end of this episode on www.learningmachines101.com  you will find a copy of these show notes as well as useful references to the literature regarding AIC, GAIC, and the Cross Validation Risk Criterion!

Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast. You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear”  link! As a member of the Learning Machines 101 community you will also receive an email approximately once a month designed to let you know when this month’s podcast has been released.

If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time.  You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on Linked In.

From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on Linked In and do my best to talk about topics of interest to the members of this group!

And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”!

Also please visit us on ITUNES and leave a review. You can do this by going to the website: www.learningmachines101.com and then clicking on the ITUNES icon. This will be very helpful to this podcast! Thank you so much.  Your feedback and encouragement are greatly valued!

Keywords: AIC, GAIC, TIC, Akaike Information Criterion, Generalized Akaike Information Criterion, Takeuchi Information Criterion, Cross-Validation, in-sample estimator, out-of-sample estimator, generalization performance evaluation

Related Podcasts:

  1. LM101-028: How to Evaluate the Ability to Generalize from Experience (Cross-Validation Methods)
  2. LM101-055: How to Learn Statistical Regularities using MAP and Maximum Likelihood Estimation
  3. LM101-023: How to Build a Deep Learning Machine (Function Approximation)
  4. LM101-041: What Happened at the 2015 Neural Information Processing Systems Deep Learning Tutorial?

Further Reading:

  1. Model Selection by Linhart and Zucchini (1986). This is probably the best reference for the material covered in this podcast. The Appendix of this book provides the relevant detailed assumptions and technical theorems which are the basis for the discussion in this podcast. The main text explains how to use these results with practical examples.
  2. “An Introduction to Model Selection” (2000) by Linhart and Zucchini. This is a good introduction to the key ideas in their book. This review was published in the Journal of Mathematical Psychology, 44, pp. 41-61.
  3. Linhart and Volkers (1984). Asymptotic Criteria for Model Selection. OR Spektrum, 6: 161-165. This is a great paper covering the core material in the Appendix of Linhart and Zucchini (1986).
  4. Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In B. N. Petrov & F. Csaki (Eds.), Proceedings of the 2nd International Symposium on Information Theory (pp. 267-281). Budapest: Akademiai Kiado. [The classic paper on the Akaike Information Criterion!] Included in this reading list for historical reasons.
  5. Akaike, H. (1974), “A new look at the statistical model identification”, IEEE Transactions on Automatic Control, 19 (6): 716–723, doi:10.1109/TAC.1974.1100705, MR 0423716. [More technical details on the AIC!] Included in this reading list for historical reasons.
  6. Claeskens, G. & Hjort, N. L. (2008), Model Selection and Model Averaging, Cambridge University Press. [Note: the AIC defined by Claeskens & Hjort is the negative of the standard definition, as originally given by Akaike and followed by other authors.] This book has a discussion of the relationship between cross-validation and GAIC/TIC.
  7. Takeuchi, K. (1976), “Distribution of informational statistics and a criterion of model fitting” (in Japanese), Suri-Kagaku [Mathematical Sciences], 153: 12–18. Included in this reading list for historical reasons.
  8. Stone, M. (1977), “An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion”, Journal of the Royal Statistical Society, Series B, 39 (1): 44–47, JSTOR 2984877. Included in this reading list for historical reasons.
  9. Konishi and Kitagawa (2008). Information Criteria and Statistical Modeling. Springer. https://www.springer.com/us/book/9780387718866
