How do we compare the similarity of documents?
We compare the tf-idf vector of one document to another document's vector. What is tf-idf? It stands for term frequency-inverse document frequency. Still confused? Let’s break it down into simple terms.
The tf part, in relation to text-based documents, represents how often each word appears in a document. In the search engine world, this is used to pull common keywords and phrases from websites to help users find the information they’re searching for. tf can also help identify “stop words,” which are connecting words such as “a,” “the,” “an,” and “is.” In most cases, these are not the keywords you want to pull out, but rather the words you want the search engine to ignore when performing its search. There are some exceptions to this rule, which we can get into later.
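To make the idea concrete, here is a minimal sketch of term frequency in Python. It assumes one common definition (a word's count divided by the total number of words in the document) and a naive whitespace tokenizer; the function name `term_frequency` is made up for this illustration.

```python
from collections import Counter

def term_frequency(text):
    """Return each word's count divided by the total word count.

    Assumption: lowercase + whitespace splitting stands in for a real tokenizer.
    """
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

tf = term_frequency("the cat sat on the mat")
print(tf["the"])  # "the" is 2 of the 6 words
print(tf["cat"])  # "cat" is 1 of the 6 words
```

Notice that the stop word “the” already scores highest here, which is why tf alone is not enough.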
In many cases, finding documents with commonly searched words is helpful, but not always.
That brings us to the idf part of the acronym. If tf is about finding similarities, idf is about finding the differences. While it’s important to know what’s similar in two documents, it’s sometimes equally important to learn what sets them apart.
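As a rough sketch, idf can be computed by counting how many documents contain each word. The version below assumes the plain `log(N / df)` formula, where `N` is the number of documents and `df` is the number of documents containing the word; real libraries often use smoothed variants, and the function name is hypothetical.

```python
import math

def inverse_document_frequency(documents):
    """Map each word to log(N / df), where df counts the documents containing it.

    Assumption: unsmoothed idf with a naive whitespace tokenizer.
    """
    n_docs = len(documents)
    doc_freq = {}
    for text in documents:
        # set() so a word counts once per document, however often it repeats
        for word in set(text.lower().split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
idf = inverse_document_frequency(docs)
print(idf["the"])   # 0.0, because "the" appears in every document
print(idf["bird"] > idf["cat"])  # True: "bird" is in 1 document, "cat" in 2
```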
To return once again to “stop words,” the idf part is important for offsetting these common, “throwaway” terms that, while essential to the English language, add little to the substance of a document yet appear more often than any other words. Inverse document frequency gives more weight to words that appear in few documents and less weight to words that appear in nearly all of them, so even if the word “the” appears 100 times in a document, its high count is offset by a low idf, and the rarer terms stand out.
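Putting tf and idf together, each document becomes a vector of weights, and two documents can be compared by comparing their vectors. The sketch below assumes unsmoothed `tf * log(N / df)` weights and uses cosine similarity as the comparison; cosine similarity is a common choice but is an assumption here, as are the function names.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Return one {word: tf * idf} dict per document (unsmoothed, naive tokenizer)."""
    tokenized = [text.lower().split() for text in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "the stock market fell sharply",
]
vectors = tfidf_vectors(docs)
print(vectors[0]["the"])                          # 0.0: "the" is in every document
print(cosine_similarity(vectors[0], vectors[2]))  # 0.0: the only shared word is "the"
print(cosine_similarity(vectors[0], vectors[1]) > 0)  # True: "cat" and "on" are shared
```

In this toy corpus, “the” ends up with zero weight no matter how often it occurs, so the two cat sentences look alike while the finance sentence shares nothing with them.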
Stop words explained
As promised, we will elaborate on the concept of “stop words.” While in many cases you want tf-idf to exclude stop words from the search, there are cases where you want them included. It’s all about context. It should also be noted that there is no universal, definitively accepted list of “stop words.” For example, some searches look for key phrases rather than keywords, so “stop words” are deliberately kept so that the sentence or phrase can be recognized as a whole. See the figure below.
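A small sketch shows why context matters. The stop list and both function names below are hypothetical, chosen only for illustration; a keyword search that drops stop words loses a phrase like “to be or not to be” entirely, while a phrase search that keeps every word can still match it.

```python
# Hypothetical stop list for illustration only; no universal list exists.
STOP_WORDS = {"a", "an", "the", "is", "to", "be", "or", "not"}

def keywords(query):
    """Keyword-search view: drop stop words from the query."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

def contains_phrase(document, phrase):
    """Phrase-search view: keep every word and match the sequence in order."""
    doc_words = document.lower().split()
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    return any(
        doc_words[i:i + n] == phrase_words
        for i in range(len(doc_words) - n + 1)
    )

query = "to be or not to be"
print(keywords(query))  # [] because every word is on the stop list
print(contains_phrase("hamlet asks to be or not to be", query))  # True
print(keywords("the cat is on a mat"))  # ['cat', 'on', 'mat']
```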