LM101-060: How to Monitor Machine Learning Algorithms using Anomaly Detection Machine Learning Algorithms
This 60th episode of Learning Machines 101 discusses how one can use novelty detection or anomaly detection machine learning algorithms to monitor the performance of other machine learning algorithms deployed in real world environments. The episode is based upon a review of a talk by Chief Data Scientist Ira Cohen of Anodot presented at the 2016 Berlin Buzzwords Data Science Conference.
Hello everyone! Welcome to the sixtieth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.
Today we will review a talk presented at last year’s Berlin Buzzwords Conference, which was held in June. Berlin Buzzwords is a data science conference which focuses on storing, processing, streaming, and searching large amounts of digital data. Many of the software projects presented at the conference are open source. If you go to the website: www.learningmachines101.com you can learn more about this conference, and the link for submitting a talk to the conference is provided in the show notes of this episode. The deadline for submitting a talk to the conference is February 14, 2017.
The talk was presented by Ira Cohen, the Chief Data Scientist for Anodot, a startup company based in Israel which develops machine learning algorithms for anomaly detection. Ira’s talk focused on the challenges of deploying machine learning algorithms in real-world environments and how some of these challenges might be addressed using additional machine learning algorithms. In particular, Ira’s talk was titled “Learning the learner: Using machine learning to track performance of machine learning algorithms”. The discussion today will roughly follow the main points of Ira’s talk, but I will take the liberty of explaining some of the details in greater depth while omitting others. A video of Ira’s talk at last year’s Berlin Buzzwords Conference may be found in the show notes at: www.learningmachines101.com and I encourage you to watch that video so you can see how today’s discussion is based upon Ira’s talk.
So what are the standard steps for developing a machine learning algorithm from a researcher’s perspective? The expert in machine learning never begins by selecting a machine learning algorithm but always begins by first attempting to understand the machine learning problem, which, in turn, begins with understanding the learning machine’s statistical environment.
Here is a sequence of steps which may be helpful for machine learning algorithm development. The first step is titled “Understand the Learning Machine’s Statistical Environment”. This step involves understanding what types of events exist in the learning machine’s environment which are observable by the learning machine. For example, does the learning machine observe images or process audio signals over time? This step also involves characterizing what types of actions can be generated by the learning machine. For example, is the learning machine’s action simply to identify if an incoming email message is spam, or must the learning machine make a numerical prediction such as predicting tomorrow morning’s temperature? Can the actions of the learning machine alter its statistical environment? These are fundamental issues which need to be considered carefully for the purposes of obtaining a good understanding of a learning machine’s statistical environment.
Understanding a learning machine’s statistical environment also involves understanding the conditions under which the learning machine’s environment is “stationary”. The concept of a stationary environment is a standard assumption in machine learning which means that the statistical environment is not time-varying. Or, in other words, a statistical environment E is stationary if the statistical regularities one observes over a time interval of fixed length which begins today are also present in the environment E over a time interval of the same length which begins tomorrow or at some other point in the future. For example, in a non-stationary environment, the statistical regularities present in the training data can be quite different from the statistical regularities present in the test data since the environment’s statistical regularities are changing over time. Stationarity should not be assumed merely for convenience. For example, a spam email detection system is likely to make classification errors if a period of no spam is immediately followed by a large onslaught of spam email.
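As a rough illustration of what a non-stationarity check might look like, here is a minimal sketch (my own, not from the episode) that compares the statistical regularities of two time windows. The `drift_score` helper and any threshold you apply to it are purely illustrative assumptions:

```python
import random
import statistics

def drift_score(window_a, window_b):
    """Absolute difference in window means, scaled by the pooled standard
    deviation. A large score suggests the statistical regularities have
    shifted between the two time windows (possible non-stationarity)."""
    pooled_sd = statistics.pstdev(window_a + window_b)
    if pooled_sd == 0:
        return 0.0
    return abs(statistics.mean(window_a) - statistics.mean(window_b)) / pooled_sd

random.seed(0)
stable = [random.gauss(5.0, 1.0) for _ in range(200)]
shifted = [random.gauss(9.0, 1.0) for _ in range(200)]  # the environment changed

print(drift_score(stable[:100], stable[100:]))   # small: same regularities
print(drift_score(stable[:100], shifted[:100]))  # large: regularities changed
```

In practice one would use a proper statistical test rather than an ad hoc threshold, but the idea is the same: compare the regularities of today's window with tomorrow's.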
Another important issue associated with understanding the learning machine’s statistical environment is the nature of the feedback. When the machine makes a decision by taking an action, what type of feedback is provided by the statistical environment? If the statistical environment provides no feedback to the learning machine, then this suggests that an “unsupervised machine learning algorithm” is required. If the statistical environment provides regular feedback to the learning machine indicating the correct desired response for a given response of the learning machine to an input pattern, then one should consider using a “supervised learning algorithm”. If the statistical environment provides feedback sometimes but not always and the feedback is relatively non-specific and vague, then this suggests that one requires a “reinforcement learning algorithm”.
Another important issue associated with modeling the learning machine’s statistical environment is representing the statistical environment in terms of “feature vectors”. The representation of events as feature vectors is a critical step since the wrong representation can make a very easy machine learning problem virtually impossible to solve while a good representation can make an impossibly hard machine learning problem very easy to solve.
So this first step is fundamental to machine learning algorithm development, and obtaining a good understanding of this first step paves the way for choosing an appropriate machine learning algorithm architecture, which is the second step. This is, of course, a complicated step which depends upon the answers acquired in the first step as well as the experience of the researcher in implementing and applying various types of machine learning algorithms. This step involves developing methods for estimating the parameters of the learning machine based upon the data sampled from the learning machine’s environment. This step also involves the major task of collecting and preparing the data sets for processing by the machine learning algorithm.
Once the first two steps have been accomplished, the third step is to evaluate the machine learning algorithm’s generalization performance. This is typically done using a training data set and a test data set, although it should also be done using a cross-validation methodology as described in Episode 28 of Learning Machines 101.
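As a concrete illustration of the cross-validation idea, here is a minimal sketch of a k-fold split (the `k_fold_splits` helper and its parameters are my own illustrative choices, not something specified in the episode). Every sample lands in exactly one test fold, so every data point contributes to the generalization estimate exactly once:

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    The samples are shuffled once, then partitioned into k disjoint test
    folds; the training set for each fold is everything else."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# Each fold gives one estimate of generalization performance;
# the k estimates are then averaged.
for train_idx, test_idx in k_fold_splits(100, k=5):
    print(len(train_idx), len(test_idx))  # 80 training, 20 test per fold
```

One would evaluate the trained model on each test fold and average the resulting scores; that average is typically a less variable estimate of generalization performance than a single train/test split.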
These three major steps: (i) understanding the statistical environment, (ii) choosing an appropriate machine learning algorithm, and (iii) evaluating the machine learning algorithm’s generalization performance are the basic principles required to develop a machine learning algorithm, but as noted by Data Scientist Ira Cohen in his talk at Berlin Buzzwords, two additional steps are required in order to deploy a machine learning algorithm in a real world setting.
The fourth major step is what is called “Deploy to Production”. There is a substantial difference between software which is used by a machine learning researcher to investigate and develop a machine learning algorithm and software which can be used to deploy a machine learning algorithm in a real world environment. All phases of the project must be streamlined and debugged extensively since the software will often not be used by a machine learning expert but rather by someone without extensive detailed knowledge of the implemented machine learning algorithms.
A fifth major step is what is called “Track and Monitor”. One cannot simply place a machine learning algorithm into production and distribute it to customers (or simply have customers use the algorithm) without tracking and monitoring the successes and failures of the algorithm. Statistical regularities in the learning machine’s environment might change due to non-stationary factors, or the system might be in a software environment which changes and introduces a bug. Unlike typical algorithms in computer science, machine learning algorithms tend to be relatively robust and might superficially appear to be functioning effectively even in the presence of fatal software or data problems, because they are capable of learning to compensate for those problems. Thus, careful monitoring of a machine learning algorithm deployed in a production environment is essential.
With respect to the problem of monitoring shifts in the nature of the learning machine’s statistical environment, this can be accomplished by using specification tests for determining if the probability model used by the learning machine is an adequate model for representing its statistical environment. This is a complicated problem, and it is not possible to develop a statistical test indicating that your model provides an adequate fit to the data generating process. Instead, one can only develop statistical tests which flag the presence of unusual situations which might require an updating or revision of the probability model. My colleagues and I have developed a scientific breakthrough in this area which provides an entirely new approach for developing statistical tests for checking if your probability model can represent the data generating process for a wide range of probability models. The details of this approach are provided in Episode 58 of Learning Machines 101, which also references our recently published scientific paper.
In his talk at last year’s Berlin Buzzwords Conference, Ira Cohen noted that when a machine learning algorithm is deployed in a production setting, it is crucial that the performance of the machine learning algorithm be continually monitored because it is making decisions in real-time. However, in many practical situations, each machine learning algorithm must be monitored with respect to a large number of algorithm performance metrics over time. Moreover, in many cases, multiple machine learning algorithms must be monitored. Ira Cohen mentions an extreme example for his company Anodot, which specializes in anomaly detection for other companies. They have found themselves in situations where they have had to monitor millions of different models which are being updated using many different types of learning algorithms with respect to hundreds of millions of different data samples which are changing and evolving in real-time.
In such a situation, the tracking and monitoring of the performance metrics associated with evaluating whether a particular machine learning algorithm is properly making decisions with respect to a specific model of its statistical environment cannot be accomplished without automating the tracking and monitoring process.
To address this problem, let’s consider the following scenario. Note that this scenario is not explicitly discussed in Ira Cohen’s talk. I have simply devised it for expository reasons. In this scenario, we have a machine learning algorithm that is designed to make real-time predictions and decisions. For example, the machine learning algorithm might be responsible for driving a car or making stock market predictions to support real-time trading. Clearly, it is important in such situations to track and monitor the performance of the machine learning algorithm. We will call this machine learning algorithm the “primary machine learning algorithm”. Now, to evaluate the “primary machine learning algorithm’s” performance we might have multiple performance metrics. Some of these metrics will be used to evaluate the degree to which the “primary machine learning algorithm’s” model of its statistical environment adequately represents the observed data samples. Some of these metrics might evaluate how efficiently the primary machine learning algorithm can estimate and update its parameters during the learning process. One might imagine possibly dozens or even hundreds of performance metrics which are based upon measuring the performance of the “primary machine learning algorithm” as it drives a car or makes stock market trading decisions. For simplicity, assume that we have only 10 performance metrics which are designed to characterize the performance of the “primary machine learning algorithm” over a particular time interval.
So now we develop a secondary machine learning algorithm which is intended to solve the “anomaly detection” problem. We will call this secondary machine learning algorithm the “monitoring machine learning algorithm”. The input to the monitoring machine learning algorithm is a time-series of feature vectors, where a particular feature vector represents a snapshot of the performance of the “primary machine learning algorithm” over a particular time interval. The “monitoring machine learning algorithm” is designed to detect anomalies in the data and generates a signal when it sees patterns of performance metrics which are highly unusual.
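To make the notion of a "snapshot" concrete, here is a minimal sketch (my own illustration, not from Ira's talk; the metric names, window length, and `metric_snapshots` helper are all hypothetical) of how a stream of per-time-step performance readings might be grouped into feature vectors for the monitoring machine learning algorithm:

```python
def metric_snapshots(metric_stream, window=5):
    """Group a stream of per-step metric readings into fixed-length windows.

    Each reading is a dict mapping metric name -> value; each completed
    window is summarized (here, by averaging) into one feature vector
    for the monitoring machine learning algorithm."""
    buffer = []
    for reading in metric_stream:
        buffer.append(reading)
        if len(buffer) == window:
            yield {name: sum(r[name] for r in buffer) / window
                   for name in buffer[0]}
            buffer = []

# A toy stream of readings from the "primary machine learning algorithm".
stream = [{"accuracy": 0.9 + 0.01 * (i % 3), "latency_ms": 20 + i % 5}
          for i in range(10)]
for vec in metric_snapshots(stream, window=5):
    print(vec)  # one averaged feature vector per 5-step window
```

A real deployment would likely use richer window summaries (variances, quantiles, error counts) and overlapping windows, but the shape of the input to the monitoring algorithm is the same: one feature vector per time interval.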
Note that this approach has several advantages. First, automation of the detection of anomalous patterns without human intervention is clearly advantageous when dealing with large amounts of data in response-critical real-world environments. Second, a human might tend to look at each performance metric individually and, if each performance metric seems to be within its acceptable range, might conclude that there is no evidence of an anomaly. However, it is possible that subtle patterns of values of performance metrics might signal the presence of an anomaly. So, for example, suppose we had three performance metrics each of which had a normal operating range between 0 and 10. If a performance measure takes on a value less than 0 or greater than 10, then one might conclude that an anomaly is present and generate a message to a human user that the “primary machine learning algorithm” may not be functioning in an optimal manner. One could also imagine a situation where all three performance measures take on values between 0 and 10 but the PATTERN of performance measure values is anomalous. A machine learning algorithm could be capable of detecting such anomalous patterns.
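Here is a minimal sketch of this idea (my own illustration, not from Ira's talk): two metrics that normally track one another closely, where an observation can be anomalous as a pattern even though each metric alone stays inside its normal range. The `pattern_anomaly` helper and its 4-standard-deviation threshold are illustrative assumptions:

```python
import random
import statistics

# Historically, metric B tracks metric A closely; both stay within [0, 10].
random.seed(1)
history = [(a, a + random.gauss(0, 0.3))
           for a in [random.uniform(2, 8) for _ in range(500)]]

# Learn the normal relationship between the two metrics from history.
residuals = [b - a for a, b in history]
mu = statistics.mean(residuals)
sigma = statistics.pstdev(residuals)

def pattern_anomaly(a, b, threshold=4.0):
    """Flag an anomalous PATTERN: B is far from where A predicts it should
    be, even though each metric alone may lie in its normal [0, 10] range."""
    return abs((b - a) - mu) / sigma > threshold

print(pattern_anomaly(5.0, 5.2))  # both in range, pattern normal -> False
print(pattern_anomaly(2.0, 9.0))  # both in range, pattern anomalous -> True
```

Simple per-metric range checks would pass the point (2.0, 9.0), but the learned relationship between the metrics exposes it as anomalous; this is exactly the kind of pattern a monitoring machine learning algorithm can catch that individual thresholds cannot.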
So the next question is how can we develop machine learning algorithms to detect patterns which they have never seen before? This is essentially an example of an unsupervised learning problem. In a standard unsupervised learning problem, one shows the machine learning algorithm examples of patterns. For example, the machine learning algorithm is presented with various scribbles, some of which are handwritten English letter characters, handwritten Russian letter characters, handwritten Japanese letter characters, and handwritten Hebrew letter characters. Each feature vector that the unsupervised learning machine observes is just a handwritten scribble. The goal of the unsupervised learning machine is simply to decide whether or not a handwritten scribble is a letter from one of these four alphabets.
As the learning process progresses, the unsupervised learning machine learns to generate a positive response if the next feature vector it observes looks similar to the feature vectors which it had previously learned. This same unsupervised learning machine, however, can be easily used as an anomaly detector. To see this, note that after learning, if one presents the machine with a feature vector representing a handwritten scribble and the system generates a negative response indicating that it is not similar to the examples of English, Russian, Japanese, and Hebrew letters, then the unsupervised learning algorithm is acting as an anomaly detector. Its negative responses indicate the presence of an anomaly.
To get this type of machine learning architecture to work correctly, one may have to rethink concepts of regularization and penalties. In a standard unsupervised learning machine, one sets up regularization constraints and penalties in such a manner so that the system does not generate a positive response in the initial stages of learning and as the learning process progresses it generates a stronger positive response to familiar feature vectors.
For building novelty or anomaly detection learning machines, it is sometimes more appropriate to set up regularization constraints and penalties such that the system generates a positive response to all feature vectors in the initial stages of learning and, as the learning process progresses, it generates a stronger negative response to unfamiliar feature vectors.
Commonly used machine learning algorithms for novelty detection include singular value decomposition and autoencoders. The singular value decomposition approach, in which I include approaches such as orthogonalizing linear filters, was worked out in a very nice way in a 1976 Biological Cybernetics article by Teuvo Kohonen and E. Oja titled “Fast adaptive formation of orthogonalizing filters and associative memory in recurrent networks of neuron-like elements”. Nonlinear autoencoders based upon deep learning methods were initially explored in the 1990s. I have provided a link to the original Kohonen and Oja article as well as a review of novelty detection using nonlinear autoencoders in the show notes of this episode of Learning Machines 101.
The basic idea behind an autoencoder can be understood in terms of a multi-layer feedforward network architecture as described in Episode 23 of Learning Machines 101. In a multi-layer feedforward network architecture, an input pattern is used to generate a pattern of activity over a group of hidden units which in turn generate a pattern of activity over a group of output units. The goal of the learning process is to adjust the parameters of the network architecture so that the generated pattern of activity over the output units is similar to the desired output activity pattern. In an autoencoder, the desired activity pattern over the output units is defined as the activity pattern over the input units. Typically, the number of hidden units is relatively small compared to the number of input units, and this forces the learning machine to extract the critical statistical regularities. Thus, an autoencoder tends to reconstruct familiar input patterns well, so it can “filter out” the familiar and respond strongly only to what it has never seen before. In a similar manner, a linear autoencoder consisting of linear hidden units can be set up to implement a singular value decomposition and use linear filters to “filter out” familiar input patterns and only respond to patterns it has never seen before.
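As a rough illustration, here is a minimal sketch of a one-hidden-unit linear autoencoder, which is equivalent to projecting onto the leading principal direction of the data, used as a novelty detector via reconstruction error. This is my own illustrative code, not an implementation from the episode; the power-iteration helper and the two-dimensional toy data are assumptions chosen for brevity:

```python
import math
import random

def top_direction(data, iters=100):
    """Power iteration on the mean-centered covariance matrix to find the
    leading principal direction -- the one-hidden-unit linear autoencoder."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    vx, vy = 1.0, 0.0
    for _ in range(iters):
        wx, wy = cxx * vx + cxy * vy, cxy * vx + cyy * vy
        norm = math.hypot(wx, wy)
        vx, vy = wx / norm, wy / norm
    return (mx, my), (vx, vy)

def reconstruction_error(point, mean, direction):
    """Distance between a point and its projection onto the learned subspace.

    Familiar patterns reconstruct well (small error); novel patterns do not."""
    px, py = point[0] - mean[0], point[1] - mean[1]
    proj = px * direction[0] + py * direction[1]
    return math.hypot(px - proj * direction[0], py - proj * direction[1])

# Familiar patterns lie near the line y = x; learn that subspace.
random.seed(2)
familiar = [(t, t + random.gauss(0, 0.1))
            for t in [random.uniform(-5, 5) for _ in range(300)]]
mean, direction = top_direction(familiar)

print(reconstruction_error((3.0, 3.0), mean, direction))   # familiar: near 0
print(reconstruction_error((3.0, -3.0), mean, direction))  # novel: large
```

Thresholding the reconstruction error gives the novelty signal: anything the learned subspace cannot reproduce is flagged as a pattern the machine has never seen before. A nonlinear autoencoder generalizes this by letting the hidden units learn a curved, rather than linear, subspace.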
In summary, we have emphasized that in real world applications of machine learning algorithms, it is necessary to continually monitor and track the performance of these algorithms. One idea is that such monitoring and tracking can be accomplished by using a novelty detection “monitoring machine learning algorithm” which examines performance measures characterizing the performance of the “primary machine learning algorithm” as it processes incoming information. Novelty detection or anomaly detection algorithms can be implemented using nonlinear autoencoders or linear methods such as singular value decomposition. And finally, just a reminder: you can learn more about how to submit a talk to the 2017 Berlin Buzzwords Conference, watch Ira Cohen’s 2016 Berlin Buzzwords presentation, and learn about the data science company Anodot by visiting the website: www.learningmachines101.com !
Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast.
You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!
If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time. You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on Linked In.
From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on Linked In and do my best to talk about topics of interest to the members of this group!
And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”!
And finally, I noticed I have been getting some nice reviews on ITUNES. Thank you so much. Your feedback and encouragement is greatly valued!
Keywords: Novelty Detection, Anomaly Detection, Unsupervised Learning, Berlin Buzzwords, Anodot, Monitoring Machine Learning, Tracking Machine Learning, Deploying Machine Learning
Further Reading and Videos:
Recent review of novelty detection in machine learning and signal processing, by M. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko (2014), in the journal Signal Processing.
Golden, R. M., Henley, S. S., White, H., and Kashner, T. M. (2016). Generalized Information Matrix Tests for Detecting Model Misspecification. Econometrics, 46.
http://www.mdpi.com/2225-1146/4/4/46 (this is a free open-access article!)
Kohonen and Oja (1976). Fast adaptive formation of orthogonalizing filters and associative memory in recurrent networks of neuron-like elements. Biological Cybernetics. [Perhaps the first systematic discussion of novelty detection in the pattern recognition literature.]
Copyright © 2017 by Richard M. Golden.