Mother and android looking at a park scene and learning about "dogs".

Podcast: Play in new window | Download | Embed

LM101-054: How to Build Search Engine and Recommender Systems using Latent Semantic Analysis (RERUN)

Episode Summary:

In this episode we explain how to build a search engine, automatically grade essays, and identify synonyms using Latent Semantic Analysis.

Preamble:

Welcome to the 54th Episode of Learning Machines 101 titled “How to Build a Search Engine, Automatically Grade Essays, and Identify Synonyms using Latent Semantic Analysis”

This is a rerun of Episode 40 which originally appeared in November 2015. In this episode documents are modeled as unordered sets of words. Then, we can consider a subset of key words as a type of document and retrieve the documents which are most similar to these key words.

I would like to comment that the principles in this episode are also applicable to the problem of “Market Basket Analysis” where a document corresponds to a “market transaction” (e.g., buying some food items at the grocery store) and and the words in a document correspond to the items which were purchased. With these relationships, one can see how the mathematics of identifying structural relations for document retrieval in search engines is equivalent to the mathematics of the problem of identifying structural relations for analyzing market transactions. This information can then be used, for example, in the design of “recommender systems” which try to anticipate what types of products tend to be purchased together for the purpose of making recommendations and suggestions to consumers. So…perhaps this episode should have been titled: “How to Build a Search Engine, Automatically Grade Essays,Identify Synonyms, and develop Recommender Systems using Latent Semantic Analysis”!

Show Notes:

Hello everyone! Welcome to the fortieth podcast in the podcast series Learning Machines 101. In this series of podcasts my goal is to discuss important concepts of artificial intelligence and machine learning in hopefully an entertaining and educational manner.

In this episode we introduce a very powerful approach for computing semantic similarity between documents. Here, the terminology “document” could refer to a web-page, a word document, a paragraph of text, an essay, a sentence, or even just a single word. Two semantically similar documents, therefore, will discuss many of the same topics while two semantically different documents will not have many topics in common. Machine learning methods are described which can take as input large collections of documents and use those documents to automatically learn semantic similarity relations. This approach is called Latent Semantic Indexing (LSI).

Assume we had a machine that could compute the semantic similarity between documents.

If this could be accomplished, then it could be used in a large variety of applications. For example, suppose we wanted to build a search engine. The search engine works by the user typing in a few words into the search window and then a list of web pages are generated which are rank-ordered according to their semantic similarity with the words which were originally typed into the search window. If we knew how to compute semantic similarity with some algorithm, then we could use that algorithm to rank-order the web-pages so that the URL or address of the web-page which is most semantically similar to the key words into the search engine would be presented first, then the URL of the web-page which was second-most semantically similar to the key words would be presented second and so on.

Another useful application of an approach for computing semantic similarity between documents would be useful in the area of educational technology for the purpose of automatically grading essays. Suppose one desires to design a system which is capable of automatically grading essays. One approach might be to have a human manually grade a very small number of really excellent essays. We will refer to this small number of essays which correspond to “good answers” as the “gold-standard essays” for the essay grading problem. Then the computer would process a new essay by computing the semantic similarity of a new essay with each of the different “gold-standard” essays. If the new essay is semantically similar to ANY of the gold-standard essays, then the computer might give the student who generated the new essay a good grade. While if a student writes an essay which is not semantically similar to any of the “gold-standard” essays, then the student might receive a bad grade.

A third useful application of computing semantic similarity might be in the design of an artificially intelligent thesaurus which when, upon request, will generate a semantically similar word or word phrase given a particular word or word phrase is provided to the thesaurus. Thus, semantic similarity could be used for synonym generation.

So how could one compute semantic similarity between documents? One idea is that we could have humans identify how one word is semantically related to another word. This network of semantic relationships between words could then be used to compute the similarity between two documents consisting of words. For example, the WORDNET project (wordnet.princeton.edu) which was initiated in 1985 was a project which involved identifying semantic relationships among pairs of words using human semantic annotators. The resulting semantic network has been used in many projects in artificial intelligence requiring the computation of similarity between documents or similarity between words. Related projects include: Concept Net (conceptnet5.media.mit.edu) , OpenCyc (sw.opencyc.org ), and the Unified Medical Language System (UMLS) sponsored by the NIH National Library of Medicine (www.nlm.nih.gov/research/umls). Specific links to these related projects can be found in the show notes at: www.learningmachines101.com .

Still, although such approaches have been relatively successful and deployed in various types of applications, they have the following difficulties. First, it takes many years (sometimes decades) to construct a useful semantic network using humans who provide semantic similarity ratings. And second, it is equally expensive to maintain such a semantic network and keep it updated and correctly functioning.

Ideally, an automated method for learning semantic similarity would be very desirable and this is the approach which will be discussed in today’s podcast. To approach this problem, consider how a child (or even an adult) acquires knowledge of words and language. The child may be observing a complex situation at a park. Birds are flying. The wind is blowing. There are bicyclists in the park. There are noises coming from the streets and shops. People are playing Frisbee in the park. Other people are having picnics in the park. In the midst of this complex situation, the mother points in the general direction of a boy walking his dog and the mother says “dog”. Now, it is a few weeks later, the child and his mother are visiting a neighbor. They are playing in the living room when the family dog enters the room. Again, the mother says “dog”. In both cases, the word or tag “dog” is attached to a different complex situation which is feature-rich. Through experience, however, invariant properties of these complex situations are identified and associated with the verbal code: “dog”. A large class of machine learning algorithms learn the “semantic meaning” of a verbal code such as “dog” in a similar manner. However, the key idea for this specific class of machine learning algorithms is that a “situation” is defined in terms of the frequency in which the “verbal code” is used across a collection of documents. This collection of documents is called a “corpus” and can be interpreted in this context as an abstract model of the collection of situations that a language learner might experience.

In particular, suppose we have a large number of documents. For example, in one TREC [Text Retrieval Conference] (trec.nist.gov) application we might have 70,000 documents and 90,000 terms in those documents. The concept of a “term” refers to either a particular “word” in a document such as “dog” or a particular word phrase which is best treated in a unitary manner such as “hot dog”. Words which are expected to have poor discriminative performance in terms of distinguishing documents such as: “the”, “a”, “of”, “and”, and “is” are often not included in the list of terms. We then construct a large array of numbers which has 90,000 rows where each row corresponds to a different term and 70,000 columns where each column refers to a particular document. This large array of numbers is called a “term by document” matrix. A row of numbers in the “term by document” matrix is called a “row vector”. A column of numbers in the “term by document” matrix is called a “column vector”.

Suppose that document #1000 contains the term “dog” and assume that the term “dog” is the 29^th term in the dictionary of 90,000 terms. Let’s place the number “1” at the location corresponding to row #29 and column #1000 in the term by document matrix. This means that the 29^th term occurred one or more times in document #1000. This is essentially a very simple “learning algorithm”. We present the learning machine with example “situations” which correspond to documents and the learning machine then updates its term by document matrix with each document experience. Eventually the term by document matrix will contain 90,000 rows of 70,000 numbers or equivalently 6,300 million numbers where each number in the term by document matrix is either a zero or a one.

We can use this term by document matrix to build a search engine. This is done as follows. Each document in the matrix corresponds to a particular web page so that the similarity between two web pages or equivalently two columns of the term by document matrix can be computed by simply counting the number of times the number “1” or the number “0” occurs in one column and not the other indicating a “mismatch”. This will be called the Hamming Distance Similarity Measure. For example, if one column consists of the numbers: 1,0,0,0,0 and the other column consists of the numbers 0,1,0,0,0 then the Hamming distance between these two columns is equal to 2. If the columns are exactly the same, then the Hamming distance is equal to zero.

Now we are ready to build a search engine using the concept of Hamming Distance. Construct a new document representing a search engine query by the user. Specifically, if the user mentions terms: 10, 11, and 12 in the search engine search window then we construct a new document which has the number “1” in rows 10, 11, and 12 and the number “0” in the remaining 87,997 rows of the 90,000 row matrix. We then compute the Hamming distance between this new document and all of the 70,000 documents in the database. The documents with the smallest Hamming distance to the query by the user are presented first and the documents with the largest Hamming distance to the query by the user are presented last.

This same technique could be used to automatically grade essays. For automatic essay grading, a document or equivalently a column of the term by document matrix corresponds to an essay. If the term by document matrix has 90,000 terms and 70,000 documents this means that we have 70,000 essays to grade. We also have a small number of documents corresponding to essays which are classified as “gold standard essays”. These essays might have been written by the professor or the teaching assistants for the course. Perhaps we have 100 gold standard essays. Each gold standard essay is then translated into a document or column vector to obtain a set of 100 column vectors. Next, the Hamming Distance of each of the 70,000 essays is compared to each of the 100 gold-standard column vectors. If an ungraded essay has a low Hamming Distance to ANY of the 100 gold-standard column vectors, then we say that the ungraded essay is SIMILAR to the SET of gold-standard essays using the Hamming Distance.

And finally, we can use this technique to generate synonyms. For the essay grading and search engine problems, similarity was computed in a “document space”. Each document could be interpreted as a point in a 70,000 dimensional document space. For the synonym generation problem, similarity is computed in a term space. Instead of comparing columns of the term by document matrix, we compare rows of the term by document matrix. In other words, a term is represented as a list of ones and zeros indicating which documents that term was used and which documents that term was not used. We can then again compute the Hamming Distance similarity measure to compare the similarity between rows of the term by document matrix.

Although this approach might seem as if it solves an important class of machine learning problems, it is fundamentally flawed. Indeed, the approach which has just been described is really equivalent to simple “keyword matching” and it is not very sophisticated at all. The Hamming Distance is simply measuring the number of times that there is a “mismatch” of keywords.

So the first step to improving the quality of inference in this problem is to allow for entries in the term by document matrix which are different from one. In other words, we would like to take into account how “frequently” a term appears in a document rather than simply stating that the term appears at least once in the document.

For example, we might place the number 100 in the 10^th row and 12^th column of the term by document matrix in order to indicate that term #10 occurred 100 times in document #12. An analysis of an actual corpus will show that some terms will occur many times more frequently than other terms so, in practice, it is convenient to apply an appropriate logarithmic transformation such as placing the number log(1 + 100) rather than the number 100 in the term by document matrix for a term which occurs 100 times in a particular document. This has the effect of constraining the range of numbers in the term by document matrix to be relatively small.

Once this is done, the similarity between a pair of columns of the term by document matrix is computed using the dot product of normalized versions of the two column vectors. This is done by the following simple operation. To normalize a vector which has at least some non-zero elements, simply take the square root of the sum of the squares of all elements in that vector to obtain the vector’s length. Then divide each element of the original vector by the length of the original vector to obtain the normalized vector. If all elements in the vector are zero, then we will refer to that vector of zeros as normalized. Now compute the dot product of two normalized vectors, first normalize each vector as we have just described. Now we compute the dot product of the two normalized vectors. This is accomplished by multiplying the first element of normalized vector 1 by the first element of normalized vector 2 and then adding that to the multiplication of the second element of normalized vector 1 and the second element of normalized vector 2 and then adding that to the multiplication of the third element of normalized vector 1 and the third element of normalized vector 3. The resulting number is called the dot product of the two normalized vectors and we will use this instead of the Hamming Distance Measure. This new dot product of the two normalized vectors distance measure is called the Cosine Angular Separation Measure since it can be shown that the dot product of two normalized vectors is exactly equal to the cosine of the angular separation between the two vectors. Two document column vectors which tend to use the same terms with similar frequencies will tend to have a Cosine Angular Separation Measure which is close to one while two document vectors which do not use the same terms with similar frequencies will tend to have a Cosine Angular Separation measure which is close to zero.

This new approach generalizes the Hamming Distance approach by taking into account the frequency of a term in a document rather than simply looking at whether or not the term is present in a document. This is a big improvement over the Hamming Distance approach but it still doesn’t address a fundamental problem which is shared by the Hamming Distance and the Cosine Angular Separation Measure approach.

In the real world, many words have multiple meanings and different words can refer to the same meaning. For example, the word “dog” can refer to: a domestic mammal closely related to the gray wolf, a worthless person (e.g., you “dog”), a good buddy (e.g., “yo dog”) , a constellation of stars such as Canis Major, something inferior (e.g., “the investment was a dog”), or a “hot dog” such as a frankfurter. In some sense, this problem is closely related to the topics discussed in Episode 27 which was concerned with the problem of learning rare and unseen events. Indeed, if we look at our term by document matrix which has 70,000 row and 90,000 columns typically we would find only a tiny fraction (e.g., 0.002%) of those entries in the matrix will be non-zero. This is consistent with the idea that lots of documents do not have words in common and lots of words don’t have documents in common. This fundamental problem is shared by both the Hamming Distance and Cosine Angular Separation Similarity Measures.

Our approach here to deal with this problem involves a technique which is called “Latent Semantic Analysis” or “Latent Semantic Indexing” sometimes called LSA or LSI. The basic idea is that we can use a technique called Singular Value Decomposition which rewrites the original “term by document” matrix as a weighted sum of a bunch of other “term by document” matrices. This is an example of what is called a “matrix decomposition”. It is a way of “dissecting” a “term by document” matrix and it works for rectangular matrices. The computer program MATLAB, for example, has Singular Value Decomposition implemented in a single function called SVD so you can do a Singular Value Decomposition in MATLAB using only one line of MATLAB code! Now suppose that we do a Singular Value Decomposition of our Term by Document Matrix and we find that we can rewrite the original Term by Document matrix as the number 10 multiplied by the another “term by document” matrix plus the number 5 multiplied by another “term by document matrix” plus the number 0.0001 multiplied by another “term by document matrix”. The weighting factors: 10, 5, and 0.0001 are called the “singular values”. Now the key assumption of Latent Semantic Indexing is that contribution of the terms with the smaller singular values are basically “noise” and do not reflect the important structural features of the data, while the terms associated with the larger singular values do correspond to the important structure features of the data. Thus, according to this assumption, one should be able to improve the inference performance of the learning machine by deleting the components of the original term by document matrix associated with small singular values. In other words, going back to our example, we simply use a “modified” term by document matrix constructed as a weighted sum of the two term by document matrix components which have singular values of 10 and 5. Interestingly enough, when this is done, the resulting “modified” term by document matrix now has many non-zero entries thus revealing “latent” associations between words and documents which were not present in the original “term by document” matrix before the matrix decomposition was implemented.

These latent associations provide non-zero entries where zeros were present before in the original Term by Document matrix and consequently two documents which might not have any key words in common might, in fact, be identified as very similar using this Latent Semantic Analysis methodology.

Finally, from a computational perspective, one might think it is computationally problematic to have a non-sparse matrix representation. However, it can be shown that an equally effective sparse representation of the modified term by document matrix can be constructed but it’s a little difficult to get into those technical details here. Briefly, however, the trick is to never actually construct the “modified” Term by Document matrix explicitly and just work with the “left” and “right” eigenvectors of the original matrix which have the largest singular values.

This approach has been shown to be computationally very effective at solving a lot of important problems in document retrieval, automatic essay grading, and synonym identification. However, it still has a few undesirable features. First, how does one decide which singular values are the important ones? How small should a singular value be before it is ignored? And second, what types of statistical modeling assumptions are being assumed? That is, for which types of statistical environments will this approach be successful and for which types of statistical environments will this approach fail miserably? The solution to these questions requires that a probabilistic approach to latent semantic analysis be developed. We will discuss probabilistic approaches to latent semantic analysis in a future episode of Learning Machines 101!

If you are a member of the Learning Machines 101 community, please update your user profile.

You can update your user profile when you receive the email newsletter by simply clicking on the: “Let us know what you want to hear” link!

Or if you are not a member of the Learning Machines 101 community, when you join the community by visiting our website at: www.learningmachines101.com you will have the opportunity to update your user profile at that time. Also feel free to tweet about this episode using the handle: @lm101talk!

From time to time, I will review the profiles of members of the Learning Machines 101 community and do my best to talk about topics of interest to the members of this group!

Keywords: Latent Semantic Analysis, Latent Semantic Indexing, LSA, LSI, Singular Value Decomposition, SVD, Automatic Essay Grading, Search Engine Optimization, Synonyms, Market Basket Analysis

LM101-054: How to Build Search Engine and Recommender Systems using Latent Semantic Analysis (RERUN)

LM101-054: How to Build Search Engine and Recommender Systems using Latent Semantic Analysis (RERUN)

Episode Summary:

Preamble:

Show Notes:

Further Reading:

Leave a Reply Cancel reply