Chapter 57 TF-IDF
We want to find the words that matter most to each character. For a young child, for example, the most important word is probably "mom". For a bartender, the important words would relate to drinks.
We will explore this using a concept known as Term Frequency - Inverse Document Frequency (TF-IDF). It is quite a mouthful, but we will unpack it and clarify each term.
A document in this case is the set of lines associated with a Sentiment. Therefore, we have one document for each Sentiment.
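To make the idea of "one document per Sentiment" concrete, here is a minimal base R sketch. The `toy` data frame and its contents are hypothetical and stand in for the real training data; only the column names `text` and `Sentiment` follow the chapter.

```r
# Hypothetical toy data: each row is one spoken line with a Sentiment label.
toy <- data.frame(
  text = c("I love this", "love it so much", "I hate this", "this is awful"),
  Sentiment = c("positive", "positive", "negative", "negative"),
  stringsAsFactors = FALSE
)

# Collapse all lines that share a Sentiment into a single document.
documents <- sapply(split(toy$text, toy$Sentiment), paste, collapse = " ")

# Split each document into lower-case words and count them.
word_counts <- lapply(documents, function(doc) {
  table(strsplit(tolower(doc), "\\s+")[[1]])
})

word_counts
```

Each element of `word_counts` is the bag of words for one Sentiment, which is the unit that TF-IDF treats as a document.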
From the book 5 Algorithms Every Web Developer Can Use and Understand:
TF-IDF computes a weight which represents the importance of a term inside a document.
It does this by comparing the frequency of usage inside an individual document as opposed to the entire data set (a collection of documents). The importance increases proportionally to the number of times a word appears in the individual document itself–this is called Term Frequency. However, if multiple documents contain the same word many times then you run into a problem. That’s why TF-IDF also offsets this value by the frequency of the term in the entire document set, a value called Inverse Document Frequency.
57.1 The Math
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
TF-IDF(t) = TF(t) * IDF(t)
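The formulas above can be worked through by hand on a small example. The sketch below uses base R only. The `counts` table is made up for illustration and assumes one row per (document, word) pair.

```r
# Hypothetical word counts for two documents (one per Sentiment).
counts <- data.frame(
  document = c("positive", "positive", "positive",
               "negative", "negative", "negative"),
  word     = c("love", "this", "much", "hate", "this", "awful"),
  n        = c(2, 1, 1, 1, 2, 1),
  stringsAsFactors = FALSE
)

# TF: count of the term divided by the total number of terms in its document.
counts$tf <- counts$n / ave(counts$n, counts$document, FUN = sum)

# IDF: natural log of (number of documents / number of documents with the term).
# Because each row is a unique (document, word) pair, the number of rows per
# word equals the number of documents containing that word.
n_docs <- length(unique(counts$document))
docs_with_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_docs / docs_with_word)

# TF-IDF is the product of the two.
counts$tf_idf <- counts$tf * counts$idf

counts[order(-counts$tf_idf), ]
```

Note that "this" occurs in both documents, so its IDF is log(2 / 2) = 0 and its TF-IDF is 0 no matter how often it is used. Words that appear in only one document, such as "love", keep a positive weight.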
57.2 Twenty Most Important Words
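library(dplyr)
library(tidytext)
library(ggplot2)

# Split every line into words and count each word within each Sentiment
trainWords <- train %>%
  unnest_tokens(word, text) %>%
  dplyr::count(Sentiment, word, sort = TRUE) %>%
  ungroup()

# Total number of words per Sentiment (the denominator of TF)
total_words <- trainWords %>%
  group_by(Sentiment) %>%
  dplyr::summarize(total = sum(n))

trainWords <- dplyr::left_join(trainWords, total_words, by = "Sentiment")

# Now we are ready to use bind_tf_idf(), which computes the tf-idf for each term,
# treating each Sentiment as one document.
trainWords <- trainWords %>%
  filter(!is.na(Sentiment)) %>%
  bind_tf_idf(word, Sentiment, n)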
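As a sanity check on what bind_tf_idf() adds, the sketch below runs it on a tiny table and compares the `tf` column with a hand computation. The `toy_counts` data are hypothetical. The code assumes the dplyr and tidytext packages are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts: one row per (Sentiment, word) pair.
toy_counts <- tibble(
  Sentiment = c("positive", "positive", "negative", "negative"),
  word      = c("love", "this", "hate", "this"),
  n         = c(3, 1, 2, 2)
)

# bind_tf_idf(tbl, term, document, n) appends tf, idf and tf_idf columns.
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, Sentiment, n)

# Recompute TF by hand and compare it with the column bind_tf_idf() added.
manual_tf <- toy_counts$n / ave(toy_counts$n, toy_counts$Sentiment, FUN = sum)
all.equal(toy_tf_idf$tf, manual_tf)

toy_tf_idf
```

The word "this" appears under both Sentiments, so its `idf` and `tf_idf` come out as 0, which is why such shared words drop out of the ranking plotted next.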
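# Order words by tf-idf so the most important ones are plotted first
plot_trainWords <- trainWords %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))

# fillColor is assumed to be defined earlier (any valid colour string works)
plot_trainWords %>%
  top_n(20, tf_idf) %>%
  ggplot(aes(word, tf_idf)) +
  geom_col(fill = fillColor) +
  labs(x = NULL, y = "tf-idf") +
  coord_flip() +
  theme_bw()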