
# LM101-070: How to Identify Facial Emotion Expressions in Images Using Stochastic Neighborhood Embedding

## Episode Summary:

In this 70th episode of Learning Machines 101 we discuss how to identify facial emotion expressions in images using an advanced clustering technique called Stochastic Neighborhood Embedding. We discuss the concept of recognizing facial emotions in images, including applications such as improving online communication quality, identifying suspicious individuals such as terrorists using video cameras, improving lie detector tests, improving athletic performance by providing emotion feedback, and designing smart advertising which can look at a customer’s face, determine whether they are bored or interested, and dynamically adapt the advertising accordingly. To address this problem we review clustering methods including K-means clustering, Linear Discriminant Analysis, Spectral Clustering, and the relatively new technique of Stochastic Neighborhood Embedding (SNE). At the end of this podcast we provide a brief review of the classic machine learning text by Christopher Bishop titled “Pattern Recognition and Machine Learning”.

## Show Notes:

Hello everyone! Welcome to the 70th podcast in the podcast series *Learning Machines 101*. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.

In this podcast, we discuss how to identify facial emotion expressions in images using an advanced clustering technique called Stochastic Neighborhood Embedding as well as more well-known clustering techniques such as K-means clustering. We discuss the concept of recognizing facial emotions in images including applications to problems such as: improving online communication quality, identifying suspicious individuals such as terrorists using video cameras, improving lie detector tests, improving athletic performance by providing emotion feedback, and designing smart advertising which can look at the customer’s face to determine if they are bored or interested and dynamically adapt the advertising accordingly. At the end of this podcast we provide a brief review of the classic machine learning text by Christopher Bishop titled “Pattern Recognition and Machine Learning”.

A number of research studies by behavioral scientists have demonstrated strong evidence for agreement across cultures all over the world that essentially the same facial expressions are used to express seven emotions: anger, contempt, disgust, fear, joy, sadness, and surprise. This is a rather amazing finding. One might imagine that in some cultures a big grin might signify that someone is sad rather than happy. It has also been shown that individuals who were born without eyesight produce the same facial expressions as individuals with normal eyesight. These naturally occurring emotional facial expressions typically last from about half a second to about four seconds and are called macroexpressions.

Another important discovery by behavioral scientists is the microexpression. Microexpressions are facial expressions which pop on or off the face in a fraction of a second. Typically a microexpression lasts about 1/30 of a second and may not even be noticed by someone watching for facial expressions. Some scientists believe such microexpressions cannot be directly controlled by an individual. It is also likely that microexpressions may reveal unconscious emotional states or emotional states an individual wishes to conceal.

The experimental data thus shows that specific statistical regularities in the environment exist in facial images which may be used to identify specific emotional states. If face recognition machine learning algorithms could be developed to reliably detect both emotional facial macroexpressions and emotional facial microexpressions this would be an important technological breakthrough.

For example, the detection of emotional facial macroexpressions could be useful for enhancing the quality of digital communications which are naturally biased towards the assumption of a neutral facial expression. Emojis in tweets and emails could automatically be generated in your digital message as your computer or mobile phone looks at your facial expression. You would, of course, have the option of turning off this feature or deleting the generated emoji before sending. I personally know of several friends who have had experiences of sending misinterpreted emails or tweets because the emotions associated with those emails or tweets were misinterpreted on the other end of the digital communication channel.

A second great application for the detection of emotional facial microexpressions would be for security cameras. Imagine an airport where the scanner takes a short 3-second video clip of someone’s face. At a typical video rate of 30 frames per second, this would yield about 90 frames which could then be analyzed to determine if the person seemed nervous about having their picture taken. This could be used to help identify persons who may be potential security risks.

A third application of the detection of emotional facial microexpressions would be to enhance lie detection systems which typically focus on changes in skin resistance, changes in heart rate, blood pressure, and breathing rate. The ability to identify emotional states directly from facial expressions would clearly enhance lie detection technology.

A fourth interesting application for the detection of both facial macroexpressions and facial microexpressions is to enhance athletic performance. Suppose one could monitor the facial expressions of an athlete by analyzing videos of the athlete’s face. It might be possible to determine whether the athlete is attentive, relaxed, or anxious at different stages of a competition. This information could provide the athlete with valuable feedback regarding their mental state during competition for the purposes of improving mental focus and ultimately enhancing athletic performance.

And finally a fifth big potential money-maker is the use of automatic facial image emotion recognition technology to recognize someone’s reaction to a display of some product or advertisement. As the potential customer or consumer looks at the product or display, a camera could look at their face and figure out their emotion. Based upon the analysis of the person’s emotional disposition, the machine learning algorithm could record whether a person was interested in the product or display. In fact, going one step further, if the system determined a person was not interested in the product, then the computer-driven advertising might shift to a different product. But if the system determined the potential consumer was interested, the electronic display might adapt and show additional details of the product. So, in principle, personalized advertisements could be automatically generated in real-time and customized to an individual by essentially reading the emotions off someone’s facial expressions in real-time!

In the show notes of this episode of Learning Machines 101, I provide references to some of this research. However, some of the above ideas I just invented while I was doing this podcast and I don’t know if those ideas have been developed yet! I also provide hyperlinks to various databases of human facial expressions which you can use to train up your own machine learning algorithms to recognize human facial expressions. These references and hyperlinks are located at the end of this episode which is posted on the website: www.learningmachines101.com

How might one build a learning machine for classifying human facial expressions? One approach which has been explored is to first map an image of a human facial expression into a vector. For example, a 32 by 32 grey-scale pixel image of a human facial expression can be represented by a matrix of numbers with 32 rows and 32 columns, where the number in the 4^{th} row and 10^{th} column specifies how darkly one should color the pixel in the 4^{th} row and 10^{th} column of the image. Since 32 multiplied by 32 is 1024, this 32 by 32 matrix of numbers can be rearranged into a 1024-dimensional vector and plotted as a point in a 1024-dimensional hyperspace. Suppose we had 100 facial expressions corresponding to 100 such 32 by 32 grey-scale pixel images. We can map each image into a 1024-dimensional vector and plot it as a point in the 1024-dimensional hyperspace.
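The mapping from image to vector can be sketched in a few lines. This is a minimal illustration, assuming the image arrives as a 32 by 32 grid of pixel intensities (here filled with random values standing in for a real face image):

```python
import random

# Made-up stand-in for a 32x32 grey-scale face image: a grid of intensities.
image = [[random.random() for _ in range(32)] for _ in range(32)]

def image_to_vector(image):
    """Flatten a 2-D grid of pixel intensities into a single vector,
    row by row, so a 32x32 image becomes a 1024-dimensional point."""
    return [pixel for row in image for pixel in row]

vector = image_to_vector(image)
print(len(vector))  # 1024
```

With this convention, the pixel in row r and column c of the image ends up at position 32*r + c of the vector, so each image corresponds to exactly one point in the 1024-dimensional hyperspace.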

Suppose we pick 25 of the 100 facial expression images at random and have a human classify each of these 25 images into one of the seven emotion categories identified by behavioral scientists. Then we plot all 100 facial expressions in the 1024-dimensional hyperspace, placing a red point at the location corresponding to a face which is angry, a green point at the location corresponding to a face which is joyful, and so on. We color points black if we do not know their category. When we are done, we will have a 1024-dimensional hyperspace containing 25 colored points and 75 black points.

Now we can examine this hyperspace and notice that a black point lies next to a point which is colored red. We might conclude that the black point corresponds to a facial expression which is “angry” because it is close to the red point. The situation is more complicated, however, if the black point is near several red points and also near several green points. One approach to this problem is to compute the hyperspace location which is the “center” of all of the red points and the location which is the “center” of all of the green points. Mathematically, the center of the red points might be defined as the average value of all of the red points, and similarly for the green points. With this strategy, we can take a black point and decide whether it is closer to the center of the red points or the center of the green points. Based upon this decision, we classify the black point as either red (angry) or green (joyful). We continue in this manner for all of the black points. However, after we have done this, the “center” of the red points will have shifted and the “center” of the green points will have shifted, so we must repeat the process for all of the black points to check whether the classification of each black point into one of the seven face emotion categories has changed. If no classifications change, the algorithm stops. If some of the black points previously classified as red should now be classified as green, we recompute the centers of the red and green categories and then repeat the process of checking our classifications.

It can be shown that this procedure is searching for a set of face emotion category assignments such that the Euclidean distance from each point, black or colored, to its assigned category center is as small as possible. The algorithm we have just described is called the K-means algorithm, which was also discussed in Episode 57 of Learning Machines 101. The K-means algorithm attempts to construct clusters of points which are as compact as possible.
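The loop described above can be sketched as follows. This is a minimal illustration of the seeded K-means idea, assuming 2-dimensional points instead of 1024-dimensional ones for readability, with made-up coordinates; the labeled points seed the initial category centers:

```python
import math

# Labeled ("red"/"green") points and unlabeled ("black") points, made up.
labeled = {"red": [(0.0, 0.0), (1.0, 0.0)], "green": [(9.0, 9.0), (10.0, 9.0)]}
black = [(0.5, 1.0), (9.5, 8.0), (8.0, 9.5), (1.0, 1.0)]

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Initialize each category center as the average of its labeled points.
centers = {c: mean(pts) for c, pts in labeled.items()}
assignment = {}
changed = True
while changed:
    changed = False
    # Assign every black point to its nearest category center.
    for p in black:
        nearest = min(centers, key=lambda c: dist(p, centers[c]))
        if assignment.get(p) != nearest:
            assignment[p] = nearest
            changed = True
    # Recompute each center from its labeled points plus its assigned black points.
    for c in centers:
        members = labeled[c] + [p for p in black if assignment[p] == c]
        centers[c] = mean(members)

print(assignment)
```

The loop terminates when a full pass leaves every assignment unchanged, exactly as in the verbal description above.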

A more sophisticated approach is spectral clustering, which is a lot more effective yet only slightly more complicated. The core idea of spectral clustering goes back to an idea proposed in 1936 by Fisher called Linear Discriminant Analysis. Basically, in K-means clustering we try to find clusters such that each cluster is as compact as possible. In spectral clustering and Linear Discriminant Analysis we try to find clusters which are not only as compact as possible but which also simultaneously maximize the distance between clusters. Thus, K-means cluster analysis tries to assign the unknown facial expressions (that is, the black dots) to clusters such that the distance between all points in a cluster of red points is as small as possible and the distance between all points in a cluster of green points is as small as possible. More concisely, K-means cluster analysis minimizes the within-cluster distance. Linear Discriminant Analysis and Spectral Clustering, on the other hand, not only attempt to minimize the within-cluster distance but also try to maximize the distance between clusters. That is, the goal is to create clusters of facial expressions so that all of the facial expressions in a particular cluster (such as “anger”) are as similar as possible within the cluster but as different as possible from facial expressions assigned to other clusters (such as “joy” or “disgust”). The details of Spectral Clustering are discussed in Episode 57 of Learning Machines 101.
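The contrast can be made concrete with Fisher's criterion, which rewards cluster arrangements that are both compact and well separated. This is a minimal sketch with made-up one-dimensional data standing in for projected facial expression points:

```python
# Made-up one-dimensional stand-ins for projected facial expression points.
cluster_a = [1.0, 1.2, 0.8, 1.1]   # e.g., "anger" faces
cluster_b = [5.0, 5.3, 4.7, 5.1]   # e.g., "joy" faces

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def fisher_score(a, b):
    """Squared distance between the cluster centers divided by the total
    within-cluster variance: large when clusters are compact AND far apart."""
    return (mean(a) - mean(b)) ** 2 / (variance(a) + variance(b))

print(fisher_score(cluster_a, cluster_b))
```

K-means only cares about the denominator (within-cluster compactness); Fisher's criterion also cares about the numerator (between-cluster separation), which is the extra ingredient shared by Linear Discriminant Analysis and spectral clustering.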

Suppose we have a facial expression which we wish to classify, but the facial expression is ambiguous. It is not clearly a facial expression of “anger” and it is not clearly a facial expression of “disgust”; it is a facial expression of “anger-disgust”. We can’t complain about this data point! It is data and we are stuck with it! If we use standard Linear Discriminant Analysis methods, this means we need to place this point in either the anger category or the disgust category, and it also means we need to move the anger and disgust categories closer to one another. This might be acceptable, but now we find another data point which is an ambiguous facial expression corresponding to “joy-surprise”, and then another corresponding to “anger-surprise”. If there is a lot of ambiguity, and the points in the hyperspace do not naturally fall into compact non-overlapping clusters after the clustering process, our clustering algorithm will not be very effective. So when the clusters are compact and non-overlapping, approaches such as K-means or Linear Discriminant Analysis are very effective, but when the clusters are not compact and possibly overlapping, we need a more sophisticated approach to constructing clusters.

Today we discuss a relatively new approach to clustering called Stochastic Neighborhood Embedding, or SNE, which was introduced by Hinton and Roweis in 2002 at the Neural Information Processing Systems conference. Please see the show notes located at www.learningmachines101.com for a more detailed description of SNE; here, however, is the basic idea. We have a set of 100 points in a 1024-dimensional hyperspace corresponding to 100 facial expression images. We then pick a lower-dimensional hyperspace, say 50-dimensional, and randomly place 100 points in it. Each point in the low-dimensional hyperspace can be interpreted as a “projection” of a particular 1024-dimensional point in the original high-dimensional hyperspace. Next, we need some method for computing the distance from a point in the original hyperspace to every other point in the original hyperspace. Then we construct a conditional probability which is a decreasing function of this distance measure.

Specifically, given some point X in the original hyperspace, we ask: what is the conditional probability that another point Y in the original hyperspace is a neighbor of X belonging to the same category? This is basically the probability of Y given X, and it is a known number computed from the points in the original hyperspace for each pair of points. In many applications the distance measure is a Euclidean distance, so the conditional probability is based upon a Gaussian density function.
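These neighbor probabilities can be sketched in a few lines. This is a minimal illustration assuming a Euclidean distance and a single fixed Gaussian width `SIGMA` for every point (the original paper instead picks a per-point width via a perplexity heuristic), with made-up 2-dimensional points standing in for the 1024-dimensional ones:

```python
import math

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]  # made-up 2-D stand-ins
SIGMA = 1.0  # assumed fixed Gaussian width for all points

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def conditional_probs(points, i):
    """P(j | i): probability that point j is the neighbor of point i,
    a decreasing (Gaussian) function of their distance, normalized over
    all candidate neighbors j != i."""
    weights = [0.0 if j == i else
               math.exp(-sq_dist(points[i], p) / (2 * SIGMA ** 2))
               for j, p in enumerate(points)]
    total = sum(weights)
    return [w / total for w in weights]

probs = conditional_probs(points, 0)
print(probs)  # the nearby point gets almost all the probability mass
```

Because the weights decrease exponentially with squared distance, nearby points receive almost all the conditional probability mass and distant points receive almost none.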

The next step is to use either the same or a different distance measure in the low-dimensional hyperspace to compute the probability of a point Y-prime in the low-dimensional hyperspace given a point X-prime in the low-dimensional hyperspace. Here is where the fun begins. We are going to treat the black points (faces with unknown emotional states) in the low-dimensional hyperspace as free parameters which we can move around relative to the points marked as red (angry) or green (joyful). We will try to move them around in the low-dimensional hyperspace so that the conditional probabilities computed in the low-dimensional hyperspace are as similar as possible to the conditional probabilities computed in the high-dimensional hyperspace. This is achieved by minimizing the Kullback-Leibler divergence (cross-entropy) of the conditional probability distribution in the low-dimensional space relative to the conditional probability distribution in the high-dimensional space. Details are provided in the original Hinton and Roweis 2002 paper. The method of gradient descent, which is described in Episodes 65 and 68 of Learning Machines 101, is used to solve this nonlinear optimization problem. The initial guess of the black points in the low-dimensional hyperspace has a large impact on the success of this algorithm. So either: (1) follow the recommended heuristics in the paper, (2) use K-means clustering in the high-dimensional space to obtain initial locations of the points and then use a linear projection algorithm to map that initial K-means clustering guess into an initial guess for the black points in the low-dimensional space, or (3) choose the initial points in the low-dimensional space at random but choose their magnitudes to be very small and clustered around the origin.
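The objective being minimized can be sketched directly. This is a minimal illustration of the Kullback-Leibler divergence between one high-dimensional neighbor distribution (p) and its low-dimensional counterpart (q); the probability values are made up:

```python
import math

p = [0.7, 0.2, 0.1]   # neighbor probabilities in the high-dimensional space
q = [0.5, 0.3, 0.2]   # neighbor probabilities from current low-dim positions

def kl_divergence(p, q):
    """KL(p || q): zero when the two distributions match, positive otherwise.
    Gradient descent moves the low-dimensional points to shrink this value."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence(p, q))
print(kl_divergence(p, p))  # matching distributions give 0.0
```

The full SNE cost sums this divergence over every point's neighbor distribution, and the gradient of that sum with respect to the low-dimensional coordinates tells gradient descent which way to move each point.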

Huang, Wang, and Ying, in a paper published in the Proceedings of the 2011 International Conference on System Science and Engineering titled “Facial Expression Recognition using Stochastic Neighborhood Embedding and SVMs”, empirically compared a Stochastic Neighborhood Embedding clustering algorithm with a Linear Discriminant Analysis clustering algorithm for clustering Japanese facial expressions using the Japanese Female Facial Expression (JAFFE) database. In their study, they used a database of 213 face images in which the posers provided 3 or 4 samples of each of six basic facial expressions (happiness, sadness, surprise, anger, disgust, and fear) as well as a neutral face. The ten-fold cross-validation method, as described in Episode 28 of Learning Machines 101, was used to estimate classification performance. They found that the Stochastic Neighborhood Embedding algorithm they used showed a facial expression recognition rate of 66%, while the Linear Discriminant Analysis clustering algorithm showed a facial expression recognition rate of only 56%.
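The ten-fold cross-validation idea can be sketched as follows. This is a minimal illustration, with integer indices standing in for the 213 JAFFE images: the data set is split into 10 folds, and each fold takes one turn as the test set while the remaining 9 folds are used for training:

```python
indices = list(range(213))  # stand-ins for the 213 images
K = 10

def k_fold_splits(indices, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(indices, K))
print(len(splits))  # 10 train/test splits
```

The recognition rate reported for each algorithm is then the classification accuracy averaged over the 10 held-out test folds.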

Details on the topics of using facial expressions to detect emotions, recognizing emotions from facial images, K-means clustering, spectral clustering, Stochastic Neighborhood Embedding can be found in the show notes of this episode at: www.learningmachines101.com .

It’s now time for the new Book Review Segment of this Learning Machines 101 Podcast!

Today, we will review a classic machine learning text titled *Pattern Recognition and Machine Learning* by Christopher Bishop.

Dr. Christopher Bishop is Laboratory Director at Microsoft Research Cambridge and Professor of Computer Science at the University of Edinburgh. Dr. Bishop is an expert in the field of Artificial Intelligence and Artificial Neural Networks and he has a PhD in Theoretical Physics. In 2017, he was elected as a Fellow of the Royal Society.

This book is a collection of topics which are loosely organized but the discussion of the topics is extremely clear. The flexible organization of topics has the advantage that one can flip around the book and read different sections without having to read earlier sections. A machine learning beginner might start by reading Chapters 1, 2, 3, and 4 very carefully and then read the initial sections of the remaining chapters to get an idea about what types of topics they cover.

The choice of topics hits most of the major areas of machine learning, and the pedagogical and writing styles are quite clear. There are lots of great exercises, great color illustrations, intuitive explanations, relevant but not excessive mathematical notation, and numerous comments which are extremely relevant for applying these ideas in practice. This is a handy reference which I like to keep by my side at all times! Indeed, this is one of the most popular graduate-level textbooks on machine learning, but it is very accessible to advanced undergraduate students in computer science, math, and engineering as well.

Chapters 1 and 2 provide a brief overview of relevant topics in probability theory. Chapter 3 discusses linear models for regression and Chapter 4 discusses linear models for classification. Chapter 5 discusses parameter estimation for feedforward neural network models. Chapters 6 and 7 discuss kernel methods and Support Vector Machines (SVMs). Chapter 8 discusses Bayesian networks and Markov random fields. Chapter 9 discusses mixture models and Expectation Maximization (EM) methods. Variational inference methods are discussed in Chapter 10. Chapter 11 discusses sampling algorithms which are useful for seeking global minima as well as for numerically evaluating high-dimensional integrals. Chapter 12 discusses various types of Principal Component Analysis (PCA), including standard PCA, Probabilistic PCA, and Kernel PCA. Chapter 13 discusses Hidden Markov Models. And Chapter 14 discusses methods for combining models, including Bayesian Model Averaging and mixtures of experts.

In order to read this textbook, a student should have taken the standard lower-division course in linear algebra, a lower-division course in calculus (although multivariate calculus is recommended), and a calculus-based probability theory course (typically an upper-division course). With this background, the book may be a little challenging to read, but it is certainly accessible to students with this relatively minimal math background. For example, undergraduate students in computer science, engineering, or math should not have difficulty reading this book. Also, if you have a Bachelor’s or Master’s degree in these areas but you forgot everything you learned because you received your degree decades ago, you should still have no problem understanding this book. If you have a PhD in Statistics, Computer Science, Engineering, or Physics, you will still find this book extremely useful because it will help you understand the machine learning literature in terms of topics with which you are already familiar. Genius high school students will also find this book very useful and of great interest…there will be a lot of stuff you will understand and a lot of stuff in the book that you won’t understand, but the struggle will help pre-adapt the synaptic junctions which interconnect neurons in your brain to facilitate future learning!!!

Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast.

You can update your user profile when you receive the email newsletter by simply clicking on the: *“Let us know what you want to hear”* link!

If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time. You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on Linked In.

From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on Linked In and do my best to talk about topics of interest to the members of this group!

And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “*lm101talk*”!

Also please visit us on ITUNES and leave a review. You can do this by going to the website: www.learningmachines101.com and then clicking on the ITUNES icon. This will be very helpful to this podcast! Thank you so much. Your feedback and encouragement are greatly valued!

**Keywords:** Clustering, Face Recognition, Emotions, K-Means, SNE, Stochastic Neighborhood Embedding, LDA, Linear Discriminant Analysis, Spectral Clustering

## Further Reading:

Matsumoto, D. and Hwang, H. S. (2011). Reading Facial Expressions of Emotion. Psychological Science Agenda, May 2011, No. 5. American Psychological Association.

Performance Comparisons of Facial Expression Recognition in JAFFE Database. International Journal of Pattern Recognition and Artificial Intelligence, May 2008.

Face Recognition Databases (http://web.mit.edu/emeyers/www/face_databases.html)

Hinton, G. and Roweis, S. (2002). Stochastic Neighbor Embedding. Advances in Neural Information Processing Systems (NIPS 2002).

Huang, Wang, and Ying (2011). Facial Expression Recognition using Stochastic Neighborhood Embedding and SVMs. Proceedings of the 2011 International Conference on System Science and Engineering, Macau, China, June 2011.

Fisher Linear Discriminant Analysis https://en.wikipedia.org/wiki/Linear_discriminant_analysis

Original Paper on Fisher Linear Discriminant Analysis

Fisher, R. A. (1936). “The Use of Multiple Measurements in Taxonomic Problems”. *Annals of Eugenics*, 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x. hdl:2440/15227.

Microsoft’s Training Facial Recognition Apps to Recognize our emotions

https://nakedsecurity.sophos.com/2015/12/04/microsofts-training-facial-recognition-apps-to-recognize-our-emotions/

Emotion Recognition Technologies – Next Step in Security and Marketing

January 2017 (IHLS).

Emotion API Microsoft Azure

https://azure.microsoft.com/en-us/services/cognitive-services/emotion/

Good Discussion of K Means Clustering https://en.wikipedia.org/wiki/K-means_clustering

von Luxburg, U. (2007). A Tutorial on Spectral Clustering. Statistics and Computing, vol. 17.

https://arxiv.org/abs/0711.0189

Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer-Verlag, New York. 738 pp.