Singular Value Decomposition and LSI
According to MIT Professor Gilbert Strang, the most beautiful thing to have happened in linear algebra in his lifetime is the Singular Value Decomposition: the process by which a rectangular piece of data (a matrix) is decomposed into multiple parts, with the most important part first, then the next most important part, and so on. If a data scientist can find those most important pieces, stop there, and approximate the whole data with just those parts, that's it. You can do the magic.
The basic idea behind SVD is that every matrix can be decomposed into a product of 3 matrices: a "rotation", followed by a "stretch", followed by another "rotation". All three of these operations are easy to visualize.
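As a minimal sketch of this idea, assuming NumPy (my choice of library, not something prescribed by the paper discussed below), the decomposition and a low-rank approximation look like this:

```python
import numpy as np

# Any rectangular matrix: 6 rows x 4 columns of made-up counts
W = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 3, 1, 1],
    [0, 1, 0, 2],
    [1, 0, 2, 0],
    [0, 0, 1, 3],
], dtype=float)

# SVD: W = U @ diag(s) @ Vt  (rotation, stretch, rotation)
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Keep only the k most important parts (largest singular values)
k = 2
W_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("singular values:", np.round(s, 3))
print("rank-%d approximation error: %.3f" % (k, np.linalg.norm(W - W_approx)))
```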
Let's understand how we can use SVD in our models for similarity. We can take a paper for this demonstration: here I am trying to explain the paper by Scott Deerwester et al., "Indexing by Latent Semantic Analysis".
Problem Statement:
Two major problems exist with the Vector Space Model representation of any document.
Synonymy: there are many ways to refer to the same object, e.g., car and automobile, which leads to poor recall.
Polysemy: most words have more than one distinct meaning, e.g., model, python, chip, which leads to poor precision.
The basic observation is that terms are an unreliable means of assessing the relevance of a document with respect to a query, because of synonymy and polysemy. Thus, one would like to represent documents in a more semantically accurate way, i.e., in terms of "concepts".
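To make the synonymy problem concrete, here is a minimal sketch with a made-up four-word vocabulary and two toy documents of my own: they describe the same thing with different words, so their raw vector-space similarity is zero.

```python
import numpy as np

# A tiny, hypothetical vocabulary and two documents about the same topic
vocab = ["car", "automobile", "engine", "motor"]
doc_a = "car engine"          # uses one set of words
doc_b = "automobile motor"    # uses synonyms of the same words

def term_vector(text, vocab):
    """Raw term-count vector in the Vector Space Model."""
    words = text.split()
    return np.array([words.count(t) for t in vocab], dtype=float)

va, vb = term_vector(doc_a, vocab), term_vector(doc_b, vocab)

# Cosine similarity is 0 even though the documents mean the same thing
cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
print(cos)  # 0.0 -> poor recall caused by synonymy
```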
Solution:
Represent documents in a more semantically accurate way, i.e., in terms of "concepts". LSI (Latent Semantic Indexing) achieves this by analyzing the whole term-document matrix W and projecting it into a lower-dimensional "latent" space spanned by the relevant "concepts". More precisely, LSI uses a linear algebra technique, Singular Value Decomposition (SVD), before performing dimensionality reduction.
Every document can be represented as a vector. Every word can also be represented as a vector (for example through CBOW or Skip-Gram), such that semantically related words end up with similar vectors.
A document contains thousands of words, which means a vector with thousands of dimensions. Handling such high dimensionality needs a lot of computing power, and is sometimes nearly impossible. Here comes the importance of SVD: we need to identify the parts that matter most and reduce the high-dimensional vector space to a lower-dimensional one.
Through SVD, we can represent the terms (words) and the documents in a space of latent (hidden) concepts.
Mathematically, this is represented as a product of 3 matrices. This is just to say that the weight of a word (ti) in a document (dj) is represented as a linear combination of term-concept and document-concept weights.
Consider an MxN term-document matrix, which is very high-dimensional, represented as a matrix W. We can factorize it into 3 different matrices.
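Written out in LaTeX, with r denoting the number of latent concepts (a symbol I introduce here for clarity), the factorization and the element-wise form described in the previous paragraph are:

$$W_{M \times N} = T_{M \times r}\,\Delta_{r \times r}\,D_{N \times r}^{\top}, \qquad w_{ij} = \sum_{k=1}^{r} t_{ik}\,\delta_{kk}\,d_{jk}$$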
Here T and D are new bases, which are orthogonal (orthonormal columns); they represent the term-concept similarity matrix and the document-concept similarity matrix respectively.
Δ is a diagonal matrix whose entries are the singular values, i.e., the square roots of the eigenvalues of WW′ (equivalently of W′W). We can also call it the "concept matrix". We can later reduce the entire document collection to a smaller set of concepts, based on their importance to our context.
Referring to examples given in Deerwester’s paper,
Consider a set of 9 documents, c1 to c5 and m1 to m4.
These 9 documents contain 12 important terms that represent the contents of each document.
This collection of documents can be decomposed into its SVD representation as below.
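As a sketch of that decomposition (the tiny count matrix below is a hypothetical stand-in, not the actual 12-term by 9-document matrix from the paper):

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix standing in for the
# 12-term x 9-document matrix used in Deerwester's example
terms = ["human", "interface", "system", "graph", "trees"]
docs = ["c1", "c2", "m1", "m2"]
W = np.array([
    [1, 0, 0, 0],   # human
    [1, 1, 0, 0],   # interface
    [0, 1, 0, 0],   # system
    [0, 0, 1, 1],   # graph
    [0, 0, 1, 1],   # trees
], dtype=float)

# W = T @ diag(delta) @ D.T
T, delta, Dt = np.linalg.svd(W, full_matrices=False)
D = Dt.T

print(T.shape)       # (5, 4)  term-concept weights
print(delta)         # singular values, strongest concept first
print(D.shape)       # (4, 4)  document-concept weights
```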
Representation of the documents by term-concepts or document-concepts alone.
By expanding the SVD, we can represent a document with its term-concept space or its document-concept space alone. This transformation will help us do the prediction. The mathematics behind this transformation comes from the product of the matrix with its own transpose, and can be written as below.
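Substituting W = TΔD′ into each product and using the orthonormality of T and D, this gives, in LaTeX notation:

$$W W^{\top} = T \Delta D^{\top} D \Delta T^{\top} = T \Delta^{2} T^{\top}, \qquad W^{\top} W = D \Delta T^{\top} T \Delta D^{\top} = D \Delta^{2} D^{\top}$$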
WW′ is real and symmetric (the term-term similarity matrix); the columns of T are its eigenvectors.
W′W is real and symmetric (the document-document similarity matrix); the columns of D are the eigenvectors of that matrix.
Δ² holds the eigenvalues of both W′W and WW′.
Rotation + Stretch + Rotation
• Every document can be represented in its "term-concept" plane.
• Every document can be represented in its "document-concept" plane.
• The term plane and the document plane are orthogonal.
• The concept matrix is the one that enables the transformation between them.
What this means is that when any new document comes in for a similarity check, it can first be represented by its term-concept vector; this vector can then be transformed into a document-concept vector, and the similarity can be found within that space.
So any query document Wq can be transformed using its term-concept vector and the concept matrix, and further represented as a vector Dq in the reduced document-concept space as follows.
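In LaTeX, the fold-in rule from the paper, adapted to the symbols used here, reads:

$$D_q = W_q^{\top}\, T\, \Delta^{-1}$$

Here Wq is the raw term vector of the query; after the dimensionality reduction described next, T and Δ are simply replaced by their truncated, top-k versions, which places the query in the same reduced space as the original documents.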
Dimensionality Reduction:
Our next aim is to reduce the dimensionality by keeping only the most significant (largest) singular values, which capture the highest variability and can approximately represent the original data.
From Deerwester's paper,
For this example, we have chosen the value of k as 2.
Now the entire document collection is represented by 2 concepts.
A new query vector Q can be represented in the 2-dimensional document-concept space.
Through cosine similarity, we can then identify the documents most similar to the query vector (document).
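Putting the last few steps together as a minimal sketch, again with a hypothetical term-document matrix and query rather than the exact numbers from the paper: truncate the SVD to k = 2 concepts, fold the query into the document-concept space, and rank the documents by cosine similarity.

```python
import numpy as np

# Hypothetical term-document count matrix (terms x documents)
W = np.array([
    [1, 1, 0, 0, 0],   # human
    [1, 0, 1, 0, 0],   # interface
    [0, 1, 1, 0, 0],   # system
    [0, 0, 0, 1, 1],   # graph
    [0, 0, 0, 0, 1],   # trees
], dtype=float)

T, delta, Dt = np.linalg.svd(W, full_matrices=False)

# Keep only the k = 2 strongest concepts
k = 2
T_k = T[:, :k]                  # term-concept weights
delta_k = delta[:k]             # concept strengths
D_k = Dt[:k, :].T               # document-concept weights (one row per document)

# Query "human system" as a raw term vector, folded into concept space:
# Dq = Wq' T_k inv(delta_k)
Wq = np.array([1, 0, 1, 0, 0], dtype=float)
Dq = Wq @ T_k / delta_k

# Cosine similarity between the query and every document in concept space
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

scores = [cosine(Dq, D_k[j]) for j in range(D_k.shape[0])]
# The first three documents (same vocabulary block as the query) rank highest
print(np.round(scores, 3))
```

Note that the comparison happens in the reduced concept space rather than in the raw term space, which is what allows LSI to surface documents that use synonymous vocabulary.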