I always used to wonder how Google differentiated between Java the programming language, Java the island in Southeast Asia, and Java the coffee bean. After a lot of contemplation I decided to personify Google, and then everything seemed clear to me. Google was somebody who understood which ‘Java’ I was talking about, based on the context I was referring to.
Imagine how many such words exist with different meanings, what we call homonyms. Mr. Google knew all the homonyms in the English language. Not only that, he also understood whatever I said, and even if I misspelled something he would come up to me and correct what I was asking him. Well, how did Mr. Google know ALL this? After pondering and reading some stuff that Mr. Google gave me when I asked him, I got my answer.
Mr. Google was a very well-read person. He would read, read, read the whole damn day, 24x7, and the best part is that he never forgot anything he read (I wish I had such a brain). Apart from reading, he would go around telling people what they wanted, and I guess he did both simultaneously. I bet Mr. Google had 10 heads like Ravana, or maybe even more. Whoa!
OK! Let us now try to understand how Mr. Google read and how he stored all that data in one of his many heads. For any piece of data that Mr. Google read, he would first remove all the articles (a, an, the), prepositions (in, on, above), connectors (since, because, while) and other common words. He would keep only the remaining words (those specific to the data). He would then continue to read, and whenever he came across similar stuff he would do some kind of calculation, which is a secret, and I guess Mr. Google’s son would be the only guy who would get to know of it. These calculations would give a relevance factor to words. He would repeat this process for eternity, and whenever somebody asked him about anything he would dig into his head and hand over those pieces of information with the greatest degree of relevance to what the person asked. The technique for working out this degree of relevance is called Latent Semantic Indexing (LSI).
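For those who like to see things in code, here is a minimal sketch of that first clean-up step. The stop-word list and the function name `content_words` are made up for illustration; Mr. Google's real list is surely far longer and nobody outside knows exactly what is on it.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {
    "a", "an", "the",              # articles
    "in", "on", "above", "of",     # prepositions
    "since", "because", "while",   # connectors
    "is", "and", "to",             # other common words
}

def content_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

print(content_words("The coffee is on the island of Java"))
# ['coffee', 'island', 'java']
```

Only the words that say something about the data survive, which is exactly what gets counted in the steps that follow.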
I got a peep into how Mr. Google went about doing his calculations. If you don’t want to get into the math, you can skip ahead to the part about PPC professionals. For those who love math, let’s go on... After removing all the unnecessary words, he built something called a Term Document Matrix (TDM). Mr. Google generated his TDM by arranging a list of all content words along the vertical axis and a similar list of all documents along the horizontal axis. These need not be in any particular order, as long as he kept track of which row and column corresponded to which word and document. He would then fill in the number of times each word occurred in each document. Since any given document contains only a tiny subset of the content-word vocabulary, the matrix is very sparse (that is, it consists almost entirely of zeros). This is how he went about it at the lowest level; now imagine Mr. Google doing this for every web page!!! Each of his heads would be assigned a specific task and would do it meticulously. These matrices would run into a huge number of dimensions.
He then gave weights to the words: some were local weights and some were global weights. A word that appears more often in a page gets a higher local weight. Global weights depend on how a word is spread across the whole collection, and the exact formula depends on the school of math Mr. Google’s teachers came from; a common choice gives a higher weight to words that appear in only a few documents. There is one last step in the weighting process: normalization. This is a scaling step designed to keep large documents with many keywords from overwhelming smaller documents in the result set. It is similar to handicapping in golf: smaller documents are given more importance and larger documents are penalized, so that every document has equal significance.
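Here is a rough sketch of the matrix and the weighting on three tiny made-up documents. The raw count as local weight, the inverse-document-frequency formula as global weight, and scaling each document to length 1 as normalization are assumptions picked for illustration; they are one common combination, not necessarily what Mr. Google does.

```python
import math
from collections import Counter

# Three toy "documents", already stripped of stop words (made up for illustration).
docs = [
    ["java", "coffee", "bean", "coffee"],
    ["java", "island", "indonesia"],
    ["java", "programming", "language", "programming"],
]

terms = sorted({w for d in docs for w in d})
counts = [Counter(d) for d in docs]

# Term document matrix: one row per term, one column per document.
tdm = [[c[t] for c in counts] for t in terms]

def global_weight(row):
    """Assumed scheme: terms found in fewer documents get a higher weight."""
    docs_with_term = sum(1 for n in row if n > 0)
    return math.log(len(row) / docs_with_term) + 1.0

# Local weight (the raw count) multiplied by the global weight of the term.
weighted = [[n * global_weight(row) for n in row] for row in tdm]

# Normalization: scale every document column to length 1, so that a long
# document cannot win simply by containing more words.
for j in range(len(docs)):
    length = math.sqrt(sum(weighted[i][j] ** 2 for i in range(len(terms))))
    for i in range(len(terms)):
        weighted[i][j] /= length

for term, row in zip(terms, weighted):
    print(f"{term:12s}", [round(v, 2) for v in row])
```

Notice that ‘java’ turns up in every document, so it earns the lowest global weight, while words like ‘coffee’ or ‘island’ do most of the work of telling the documents apart.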
These three values multiplied together (local weight, global weight, and normalization factor) determine the actual numerical value that appears in each non-zero position of our term document matrix. This is followed by running a Singular Value Decomposition (SVD) algorithm. The procedure is like shining a torch on a ball: what would you see? You would see a circular shadow on the wall; in effect, we have reduced a 3D object to a 2D one. This is what SVD is about. The SVD algorithm breaks the matrix into a set of smaller components. The algorithm alters one of these components (this is where the number of dimensions gets reduced) and then recombines them into a matrix of the same shape as the original, so Mr. Google can again use it as a lookup grid. This new matrix is an approximation of the term document matrix and looks quite different from the original.
This finished matrix is what Mr. Google would actually use to search. Given one or more terms in a search query, Mr. Google would look up the values for each search term/document combination, calculate a cumulative score for every document, and rank the documents by that score, which is a measure of their similarity to the search query. In practice, Mr. Google would probably pick an empirically determined threshold value to serve as a cutoff between relevant and irrelevant documents, so that the query does not return every document in his collection. I guess you have now had a peep into a fraction of a second of Mr. Google’s life.
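The decompose, reduce, recombine and look-up steps can be sketched as follows. It assumes NumPy is installed; the little matrix, the choice of `k = 2` dimensions, the function name `rank_documents` and the threshold of 0.0 are all made up for illustration, and this is only my reading of the procedure, not Mr. Google's actual code.

```python
import numpy as np

# Toy weighted term document matrix: rows are terms, columns are documents.
# The numbers are invented for this sketch.
terms = ["bean", "coffee", "island", "java", "programming"]
A = np.array([
    [1.0, 0.0, 0.0, 1.0],   # bean
    [2.0, 0.0, 0.0, 1.0],   # coffee
    [0.0, 1.0, 0.0, 0.0],   # island
    [1.0, 1.0, 1.0, 0.0],   # java
    [0.0, 0.0, 2.0, 0.0],   # programming
])

# Break the matrix into its three components.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Alter one component: keep only the k largest singular values.
# This is the step where the number of dimensions gets reduced.
k = 2
s_reduced = np.zeros_like(s)
s_reduced[:k] = s[:k]

# Recombine into a matrix of the same shape as the original.
A_k = U @ np.diag(s_reduced) @ Vt

def rank_documents(query_terms, threshold=0.0):
    """Sum the looked-up values per document and rank by that score."""
    rows = [terms.index(t) for t in query_terms if t in terms]
    if not rows:
        return []
    scores = A_k[rows].sum(axis=0)
    order = np.argsort(-scores)
    # Drop documents at or below the cutoff between relevant and irrelevant.
    return [(int(j), float(scores[j])) for j in order if scores[j] > threshold]

print(np.round(A_k, 2))
print(rank_documents(["coffee"]))
```

Compare the printed `A_k` with `A`: positions that were zero in the original can pick up non-zero values, which is how a document can score for a word it never actually contains.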
Well, in what way is LSI useful to us PPC professionals? When our ads appear on the content network, it is because of the LSI relevance between the cluster of keywords in the ad group and the content on the web page. LSI also affects your Quality Score: the greater the degree of relevance between your ad text, your keyword inventory and the content on your landing page, the better the Quality Score for your keywords. Put on your thinking hats and see what else LSI could affect. So guys! We have one more thing to keep in mind while we create our accounts: make sure you have high LSI relevance.