LM101-057: How to Catch Spammers using Spectral Clustering
In this 57th episode, we explain how to use spectral cluster analysis unsupervised machine learning algorithms to catch internet criminals who try to steal your money electronically!
Hello everyone! Welcome to the fifty-seventh podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner. In this episode we will explain how to use spectral cluster analysis unsupervised machine learning algorithms to catch internet criminals who try to steal your money electronically!
I’m sure you may have received an email such as the following. Here is the first example:
Dear Banking Customer, We need to confirm that your or someone authorized to use your Debit Card made the following transaction(s) on your Debit Card: View All Transactions Thank you for being a valued customer. Customer Service Center
And here is the second example:
Your mailbox has exceeded the storage limit is 1 GB, which is defined by the administrator, are running at 99.8 gigabytes, you can not send or receive new messages until you re-validate your mailbox. To renew the mailbox Click Here Thank you! Web mail system administrator! WARNING! Protect your privacy. Logout when you are done and completely exit your browser.
And here is a third example:
Dear Customer, Due to recent activity on your bank account, we have temporarily suspended activity due to security precautions. Please verify your identity by clicking here to confirm you are the owner of your bank account. Best Regards, John Smith Chief Financial Security Bank Officer Badge #45123
Although sometimes such emails are legitimate, they usually are scams designed to steal your login information. The scammer will generate a “fake portal” with two windows where you type in your username and password respectively to login to your bank account or email account. You think you are just logging into your bank account but you are actually providing the criminals with your login information. Once you have typed in this information, they have immediate access to your bank account or email account, can transfer money to themselves, and can even change the password so you can’t get back into your account!
The best way to deal with these cases when you do not know if they are legitimate, is first do not click on any of the links in the email. Instead, if it is supposed to be an email from your bank, either call your bank directly or logon to your bank account using the official legitimate website for your bank account with a different browser. If the email comes from your email provider, login to your email system and double check if you really have exceeded the storage limit.
The above examples of spam illustrate a new approach to crime called “spamming” which allows criminals to rob you while they are safely located far away from you and possibly in another country and you may be at home, or shopping, or studying in the library.
So how do spammers get email addresses? Email addresses are sometimes obtained through “email harvesting spiders” also known as “harvesters” or “harvesting bots” which crawl around websites looking for email addresses. This process of crawling around the web looking for email addresses is called “scraping addresses from websites.” Note that it is useful to make a distinction between spamming and harvesting. The email harvesting spiders to collect email addresses for evil purposes but are not necessarily going to use the email addresses for spamming. Instead, the harvester might sell the email addresses to a spammer and then the spammer sends out emails which are designed to trick computer users into revealing their usernames and passwords for the purpose of accessing on-line accounts such as on-line banking accounts or online email accounts. Of course, sometimes the harvester and spammer could be the same individual or group of individuals.
Note that in the United States the CAN-SPAM act of 2003 is a law specifically designed to prohibit predatory and abusive commercial email and includes specific discussions of both spamming and harvesting. Check out the show notes for more information about the CAN-SPAM act!
“Project Honey Pot” which is described at: www.projecthoneypot.org is designed to catch both spammers and harvesters. I suppose it could have been called “Project Cookie Jar” with the idea that you catch the person’s hand in the cookie jar but instead it’s called “Project Honey Pot” probably because the honey “sticks” to the person’s hand when they try stealing the honey. Before explaining how Project Honey Pot works, however, recall that an IP address is essentially the name of a particular computer or particular computer network expressed as a long number. The terminology IP address refers to the phrase Internet Protocol address. So, for example, if you have a wireless home computer network, then there is an IP address associated with your home computer network and the different devices such as computers and printers attached to the home computer network have IP addresses as well which are often constructed by extending the IP address of your computer network.
Project Honey Pot works as follows, anyone with a website can participate by simply going to the website: www.projecthoneypot.org and then installing the Project Honey Pot software somewhere on their website. This software is used to install “fake email addresses” on your website that are custom-tagged with the time and IP address of the visitor to your site. This sets up a trap to catch harvesters and spammers in the following way. The email harvesting spiders wander around the worldwide web and stumble upon your website. They then search around and find your email list which they steal. These email lists can then be sold to Spammers. However, inside the email list they have stolen are these fake email addresses and when they try to use these fake email addresses this is where the trap is activated because the email is send to Project Honey Pot to track down and prosecute harvesters and spammers. The email contains information regarding the IP address associated with the harvesting spider and the time it was harvested as well as the IP address of the computer which sent the spam. Note that the data collected by Project Honey Pot is also shared with law enforcement agencies as well as the website owner.
One possible use of this data is to identify harvesters which might be related to one another. Suppose we have identified 30,000 different harvesters, it would be of great value to law enforcement agencies if one could identify the presence of groups of harvesters and how those groups interact because the distribution of harvester computers has been found to be useful for providing clues regarding the structure of spammer social networks.
Xu, Kliger, Woolf, and Hero (2009) have published the article “Revealing Social Networks of Spammers Through Spectral Clustering” in the Proceedings of the IEEE International Conference on Communications. A recent 2013 ARXIV paper on this topic is available if you visit the website: www.learningmachines101.com and take a look at the show notes for Episode 57. In today’s podcast, we will summarize and discuss some highlights of this article which provides a nice introduction to this area. I would like to emphasize that the following discussion concerned with analyzing data from Project Honey Pot is based upon the Xu et al. ARXIV paper although I have taken the liberty to be selective regarding how I describe this procedure leaving out some details while explaining others in greater depth.
Xu et al. note in their ARXIV article that as of February 2009, over 35 million trap email addresses, 39 million spam computers, and 59,000 harvesting spiders have been identified by Project Honey Pot. So this yields an extremely large data set and as we have previously mentioned this data set includes the IP address of the harvesting spider, the time it was harvested, and the IP address of the spamming computer. In addition, to this data we have other important data about spammers. We have knowledge of how frequently a particular harvester computer used a particular spamming computer. This is a measure of similarity which can be used to determine if a group of harvester computers are working together to collect email addresses if the members of the group tend to associated with a particular group of spamming computers.
In particular, the harvester–spam server usage similarity measure is defined using the following formula. Let hjk indicate the number of emails sent by harvester j using spam server k divided by the total number of all emails sent by all harvesters and additionally divided by the total number of emails harvester j has acquired. In other words, hjk is the percentage of times that harvester computer j uses a spam computer k to send an email address.
Given the harvester-spam server usage similarity measure, we would like to calculate a measure of similarity sjq between harvester j and harvester q. Suppose that all of the harvester computers were using the same SPAM server to send out SPAM messages. Under the assumption of statistical independence, the probability that harvester computers j and q use spam computer k to send an email address is simply estimated by multiplying the percentage of times that harvester computer j uses spam computer k to send an email by the percentage of times that harvester computer q uses spam computer k to send an email.
The probability that harvester computer j and harvester computer q send emails using spam computers 1 and 2 is given simply by adding up the probabilities that the two harvest computers both use both spam computers and the probabilities that a harvest computer uses either of the spam computers. This probability which we will call the harvester pair similarity measure and defines a special unnormalized similarity matrix whose element in the jth row and qth column specifies the probability that harvestor computer j and harvestor computer q would both send a spam email.
The next step of the analysis is to use the harvester pair similarity measure to compute a measure of the quality of a particular strategy for clustering the data. Specifically, we construct an objective function from the harvester pair similarity measure which will be called the harvester clustering measure. The harvester clustering measure assumes that there are K distinct clusters and assigns a number for each possible way of grouping N harvesters into K distinct clusters. If the number assigned to a particular way of clustering N harvesters into K distinct clusters is small, this indicates that we have a good way of clustering the harvesters. If the number assigned by the harvester clustering measure to a particular way of clustering N harvesters into K distinct clusters is large, this indicates our method of clustering the harvesters is poor. More specifically, the harvester clustering measure is defined as the sum of the pair-wise harvester similarities between two harvesters in different clusters minus the sum of the pair-wise harvester similarities of each harvester within another harvester in the same cluster. Finding a clustering scheme which minimizes the harvester clustering measure thus is equivalent to finding a clustering scheme which simultaneously increases within-cluster similarity while decreasing between-cluster similarity using the harvester pair similarity measure as a formal definition of similarity. This basic principle of clustering is well-known in the statistics literature and perhaps first appeared in the literature about 80 years ago and is the basis of the well-known Fisher Linear Discriminant Analysis method.
By formulating the problem in terms of a harvester clustering measure, this provides a great conceptual simplification to the clustering problem. The goal of the clustering problem is to simply minimize the harvester clustering measure over all possible ways of clustering N harvesters into K distinct clusters. Finally, note that the problem of clustering a group of N objects into K categories is equivalent to the unsupervised learning problem where one is provided N objects and asked to learn without supervision how to assign an object to a category.
We now digress briefly to explain how to solve this clustering problem can be viewed as an unsupervised learning problem.
Note that minimizing a harvester clustering measure is non-trivial since if one has 60,000 harvesters then the total number of different ways of assigning 60,000 harvesters into K different groups is astronomical. The problem is further complicated because the number of groups denoted by K is not known. There are several general approaches to figuring out the number of clusters which is denoted by K. First, one simply plots the data and estimates the number of clusters from a plot of the data. Second, one just picks an arbitrary choice of K because you might be interested in what happens for a specific number of clusters. Third, one might try several different values of K and see which choice of K gives rise to the best-fitting clustering assignment based upon the harvester clustering measure. Fourth, one can might try choosing K to be approximately equal to the logarithm of the number of harvesters which is a heuristic suggested by Xu et al. And fifth, the “eigengap heuristic” is sometimes used to choose the value of K which is computed using the spectral cluster analysis method which will be described briefly and is closely related to the method where one looks for the K which minimizes the harvester clustering method.
In any case, assume for now that the number of clusters K has been chosen using one of the above five methods or some other method. So now the problem is to minimize the harvester clustering measure for a fixed K…where as we noted early K is the number of clusters.
This is where the spectral clustering approximation method is used. It is possible to rewrite the harvester clustering measure as a Rayleigh Quotient which allows one to solve for an initial approximate guess for assigning the harvesters to K groups by computing the eigenvector with the second smallest eigenvalue of a high-dimensional matrix called the normalized Laplacian. In practice, iterative methods such as Rayleigh Quotient iteration methods may also be required to compute the initial approximate guess since direct computation of eigenvectors of a 60,000 dimensional matrix may be prohibitive. This initial guess can then be used to assign harvesters to K groups and then K-means clustering is used to generate the final assignment. K-means clustering can be interpreted as a type of deterministic descent algorithm which works by picking a harvester and assigning it to the cluster whose within-cluster mean is most similar to that harvester and then repeating this process until convergence. It is typically difficult to apply the K-means clustering algorithm directly without some sort of good initial guess so that is why the spectral clustering approximation method is very helpful for initializing the K-means algorithm.
Xu et al have used the techniques described here to successfully identify distinct communities of harvesters involved in spamming. In addition, they obtained empirical evidence that these communities tend to be smaller communities which extensively share spammer computer resources in contrast to larger communities which exhibit less sharing of spammer computer resources.For more details and some cool color plots of the social networks of harvesters discovered by this algorithm, please visit the show notes of this episode at: www.learningmachines101.com where you will also find references to the various algorithms and techniques mentioned in this episode.
Thank you again for listening to this episode of Learning Machines 101! I would like to remind you also that if you are a member of the Learning Machines 101 community, please update your user profile and let me know what topics you would like me to cover in this podcast.
You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link! If you are not a member of the Learning Machines 101 community, you can join the community by visiting our website at: www.learningmachines101.com and you will have the opportunity to update your user profile at that time.
Also, I haven’t received many new reviews on ITUNES in a while but I know that I have a lot of loyal listeners. It’s really important that you visit the ITUNES store and rate the show because this boosts the ranking of the show in the ITUNES search engine and makes it easier for other podcast listeners to discover the show. This really helps me a lot because I’m trying to reach a broad audience with this podcast and by boosting my ITUNES ranking this greatly supports the show and ensures that you will receive many more outstanding future episodes of Learning Machines 101.
And don’t forget to follow us on TWITTER. The twitter handle for Learning Machines 101 is “lm101talk”!
You can also post requests for specific topics or comments about the show in the Statistical Machine Learning Forum on Linked In.
From time to time, I will review the profiles of members of the Learning Machines 101 community and comments posted in the Statistical Machine Learning Forum on Linked In and do my best to talk about topics of interest to the members of this group!
Keywords: Spam, Spamming, Spammers, Spectral Clustering, Harvester, Harvester Bot, Harvester Spider, Unsupervised Learning, Clustering, K-Means Clustering
Xu KS, Kliger MM, Yilun C, Woolf PJ, Hero A (2009) Revealing social networks of spammers through spectral clustering. In: Proceedings of the IEEE International Conference on Communications. Dresden, Germany. doi:10.1109/ICC.2009.5199418
PDF File from https://arxiv.org/abs/1305.0051
Luxberg (2007). A Tutorial on Spectral Clustering. Statistics and Computing, vol. 17.
Good Discussion of what is an IP Address
Good Discussion of Email Address Harvesting
Good Discussion of K Means Clustering
Fisher Linear Discriminant Analysis
Original Paper on Fisher Linear Discriminant Analysis
Fisher, R. A. (1936). “The Use of Multiple Measurements in Taxonomic Problems”. Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x. hdl:2440/15227.
Kuleshov (2013). Method of Computing Eigenvectors for Extremely Large Sparse Matrices
Rayleigh Quotient Iteration Method for Eigenvector Calculations
International Conference on Machine Learning (2013).
Copyright © 2016 by Richard M. Golden. All rights reserved.