Chapter 71 TF-IDF Theory

UniGrams are sequences of length 1, that is, single words. BiGrams are sequences of length 2, that is, pairs of consecutive words.
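To make the definitions concrete, here is a minimal sketch of extracting UniGrams and BiGrams from a piece of text. The function name ngrams, the whitespace-based splitting, and the sample sentence are assumptions made for illustration; a real tokenizer would also handle punctuation.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text (lower-cased, split on whitespace)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Hypothetical violation text, invented for this example.
sample = "food not stored properly"

unigrams = ngrams(sample, 1)  # single words
bigrams = ngrams(sample, 2)   # pairs of consecutive words

print(unigrams)
print(bigrams)
```

A text of four words gives four UniGrams and three BiGrams, since each BiGram needs a following word.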

We wish to find out the important UniGrams and BiGrams which are present in the Violations Text.

We will explore this using a fascinating concept known as Term Frequency - Inverse Document Frequency (TF-IDF). Quite a mouthful, but we will unpack it and clarify each term.

A document in this case is the set of all Violation Text belonging to one Results Type.

E.g. the Violation Text present in Results Type Pass is a single document. The Violation Text present in Results Type Fail is a single document.

Therefore we have different documents for each Results Type.
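A rough sketch of how such documents could be assembled is shown here. The rows, the function name build_documents, and the choice to join texts with a space are all assumptions for illustration and are not taken from the actual data set.

```python
from collections import defaultdict

# Hypothetical (results_type, violation_text) rows, invented for this example.
rows = [
    ("Pass", "floors clean and maintained"),
    ("Fail", "food not stored properly"),
    ("Pass", "walls clean"),
    ("Fail", "no hot water at hand sink"),
]


def build_documents(rows):
    """Concatenate all violation text sharing a results type into one document."""
    grouped = defaultdict(list)
    for results_type, text in rows:
        grouped[results_type].append(text)
    return {results_type: " ".join(texts) for results_type, texts in grouped.items()}


documents = build_documents(rows)
for results_type, text in documents.items():
    print(results_type, "->", text)
```

Each key of documents is one Results Type, and its value is the single document built from every violation recorded under that type.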

From the book 5 Algorithms Every Web Developer Can Use and Understand:

TF-IDF computes a weight which represents the importance of a term inside a document.

It does this by comparing the frequency of usage inside an individual document as opposed to the entire data set (a collection of documents). The importance increases proportionally to the number of times a word appears in the individual document itself–this is called Term Frequency. However, if multiple documents contain the same word many times then you run into a problem. That’s why TF-IDF also offsets this value by the frequency of the term in the entire document set, a value called Inverse Document Frequency.

71.1 The Math

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents containing term t)
TF-IDF(t) = TF(t) * IDF(t)
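
The formulas above can be sketched directly in code. This is a minimal illustration, not the implementation used by any particular library: the function name tf_idf and the two tiny documents are assumptions, and library implementations often apply smoothing or normalisation that is omitted here.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Map each document name to {term: TF * IDF}, using the natural logarithm.

    documents: dict of document name -> list of terms.
    """
    n_docs = len(documents)

    # Number of documents that contain each term at least once.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))

    weights = {}
    for name, terms in documents.items():
        counts = Counter(terms)
        total = len(terms)
        weights[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Hypothetical documents, one per results type, invented for this example.
docs = {
    "Pass": "floors clean walls clean".split(),
    "Fail": "floors dirty food uncovered".split(),
}

weights = tf_idf(docs)
for name, term_weights in weights.items():
    print(name, term_weights)
```

In this toy example the term "floors" appears in both documents, so its IDF is log_e(2 / 2) = 0 and its weight is zero in each. The term "clean" appears twice among the four terms of the Pass document and in no other document, so its weight is 0.5 * log_e(2).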