Chapter 45 TF-IDF
We want to find the words that matter most to each character. For example, a young child's most important word is probably mom, while a bartender's important words are likely to relate to drinks.
We will explore this using a concept known as Term Frequency - Inverse Document Frequency (TF-IDF). It is quite a mouthful, but we will unpack it and clarify each term.
A document in this case is the set of lines spoken by a character.
E.g. the words spoken by Bart form a single document. The words spoken by Marge form another document.
Therefore we have a different document for each character.
From the book 5 Algorithms Every Web Developer Can Use and Understand
TF-IDF computes a weight which represents the importance of a term inside a document.
e.g. the term mom might be important for both Lisa and Bart, since they say it to Marge many times while few other characters use it.
It does this by comparing the frequency of usage inside an individual document as opposed to the entire data set (a collection of documents). The importance increases proportionally to the number of times a word appears in the individual document itself–this is called Term Frequency. However, if multiple documents contain the same word many times then you run into a problem. That’s why TF-IDF also offsets this value by the frequency of the term in the entire document set, a value called Inverse Document Frequency.
45.1 The Math
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
Value = TF * IDF
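To make the formulas concrete, here is a minimal sketch in base R that applies them by hand to a tiny corpus. The three word lists are made up for illustration and are not taken from the Simpsons data set, and the helper name `tf_idf_by_hand` is hypothetical.

```r
# Toy corpus (made-up data): one "document" per character.
docs <- list(
  bart  = c("mom", "eat", "the", "shorts", "mom"),
  lisa  = c("mom", "the", "saxophone", "the"),
  homer = c("the", "donut", "the", "beer")
)

# Hypothetical helper: TF, IDF and TF-IDF of `term` in document `doc_name`.
# Assumes `term` occurs in at least one document.
tf_idf_by_hand <- function(term, doc_name, docs) {
  words <- docs[[doc_name]]
  # TF: occurrences of the term / total terms in this document
  tf <- sum(words == term) / length(words)
  # IDF: natural log of (number of documents / documents containing the term)
  n_docs_with_term <- sum(vapply(docs, function(d) term %in% d, logical(1)))
  idf <- log(length(docs) / n_docs_with_term)
  c(tf = tf, idf = idf, tf_idf = tf * idf)
}

tf_idf_by_hand("mom", "bart", docs)     # tf = 2/5, idf = log(3/2)
tf_idf_by_hand("shorts", "bart", docs)  # tf = 1/5, idf = log(3/1)
tf_idf_by_hand("the", "lisa", docs)     # idf = log(3/3) = 0, so tf_idf = 0
```

In this toy corpus "shorts" scores higher than "mom" for Bart even though he says it less often, because no other document contains it. The word "the" scores zero for everyone because every document contains it.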
45.2 Twenty Most Important words for the Twenty Most Active Characters
Here, using TF-IDF, we investigate the twenty highest-scoring words across the twenty most active characters.
# Get the top 20 characters (head() returns only 6 rows unless n is given)
Top20Characters <- head(TopCharacters, 20)$name
##################################################################################
# Prepare for the bind_tf_idf function
##################################################################################
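# Count how often each character (name) says each word
SCWords <- SC %>%
  unnest_tokens(word, normalized_text) %>%
  dplyr::count(name, word, sort = TRUE) %>%
  ungroup()
# Total number of words spoken by each character
total_words <- SCWords %>%
  group_by(name) %>%
  summarize(total = sum(n))
SCWords <- left_join(SCWords, total_words, by = "name")
# tf-idf over all characters, treating each character as one document
SCWordsFull <- SCWords %>%
  filter(!is.na(name)) %>%
  bind_tf_idf(word, name, n)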
# Now restrict to the top 20 characters and use bind_tf_idf, which computes the tf-idf for each term.
# The idf is therefore computed over these 20 documents only.
SCWords <- SCWords %>%
  filter(name %in% Top20Characters) %>%
  bind_tf_idf(word, name, n)
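# Order the words by tf-idf so the bars are sorted in the plot
plot_SCWords <- SCWords %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))
# Plot the 20 rows with the highest tf-idf, coloured by character
plot_SCWords %>%
  top_n(20, tf_idf) %>%
  ggplot(aes(word, tf_idf, fill = name)) +
  geom_col() +
  labs(x = NULL, y = "tf-idf") +
  coord_flip() +
  theme_bw()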
# Choose words with low IDF (words used by most of the characters)
SCWords2 <- SCWords %>%
  bind_tf_idf(word, name, n)
LowIDF <- SCWords2 %>%
  arrange(idf) %>%
  select(word, idf)
# Get the unique words, ordered from lowest IDF
UniqueLowIDF <- unique(LowIDF$word)
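Low-IDF words are the ones that most characters use. The sketch below tabulates the IDF formula from section 45.1 for a corpus of twenty documents, assuming the natural logarithm given there. The document counts in `docs_with_word` are illustrative values, not results from the data.

```r
# IDF as a function of how many of the 20 documents contain a word.
n_docs <- 20
docs_with_word <- c(1, 5, 10, 19, 20)  # illustrative counts, not real data
idf <- log(n_docs / docs_with_word)
data.frame(docs_with_word, idf = round(idf, 3))
```

A word that all twenty characters say has an IDF of log(20/20) = 0, so its TF-IDF is zero however often it is spoken.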
We observe that the most important word for Bart and Lisa is mom. This is expected, since both are children of Marge and Homer.
45.3 Word Cloud for the Twenty Most Active Characters
We show the hundred most important words for the twenty most active characters. This word cloud is based on the TF-IDF scores: the higher the score, the larger the text.
plot_SCWords %>%
  with(wordcloud(word, tf_idf, max.words = 100, colors = brewer.pal(8, "Dark2")))
45.4 Marge Important Words
Here we investigate the most important words spoken by Marge. The lines in which Marge uses the word homie are shown below, each followed by the next line in the script. We observe that in every case she is addressing her husband, Homer.
keywordHomie <- 'homie'
# Lines in which Marge says the keyword
ScriptsCharactersMarge <- ScriptsCharacters %>%
  filter(name == 'Marge Simpson') %>%
  filter(str_detect(normalized_text, keywordHomie))
# Start with an empty data frame that has the same columns as the rows bound below
MargeAdressesTo <- data.frame(name = character(), raw_text = character())
# For up to five matches, keep the matching line and the line that follows it
for (i in seq_len(min(5, nrow(ScriptsCharactersMarge)))) {
  currentId <- ScriptsCharactersMarge$id[i]
  MargeNextSenceAfterHomie <- ScriptsCharacters %>%
    filter(id >= currentId, id <= currentId + 1) %>%
    select(name, raw_text)
  MargeAdressesTo <- rbind(MargeAdressesTo, MargeNextSenceAfterHomie)
}
name | raw_text |
---|---|
Marge Simpson | Marge Simpson: Homie, did you straighten everything out…? |
Homer Simpson | Homer Simpson: (HAPPILY) Up… up… up… up… up… up. Don’t say anything, Marge. Let’s just go to bed. I’m on the biggest roll of my life. |
Marge Simpson | Marge Simpson: Oh, Homie, you have lots of hair… Why did you want to know your blood type? |
Homer Simpson | Homer Simpson: Aw, old man Burns is gonna kick off if he doesn’t get some Double-O-Negative blood, but nobody at the plant has it. |
Marge Simpson | Marge Simpson: Please, Homie? For me? |
Homer Simpson | Homer Simpson: (SOFTENING) Oh, all right. (GRUMBLING) You always do that hand thing! And it usually works. |
Marge Simpson | Marge Simpson: Oh, Homie. |
Psychiatrist | Psychiatrist: Mr. Simpson, after talking to your wife, we believe you’re no threat to yourself or others. |
Marge Simpson | Marge Simpson: (AMOROUSLY) Homie… Put down your magazine for a minute… |
Homer Simpson | Homer Simpson: Hmph? |
45.5 Moe Important Words
We investigate here the most important words spoken by Moe.
45.5.1 Word focus - “Midge”
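keywordMidge <- 'midge'
# Lines in which Moe says the keyword
ScriptsCharactersMoe <- ScriptsCharacters %>%
  filter(name == 'Moe Szyslak') %>%
  filter(str_detect(normalized_text, keywordMidge))
# Start with an empty data frame that has the same columns as the rows bound below
MoeAdressesTo <- data.frame(name = character(), raw_text = character())
# For up to five matches, keep the matching line and the line that follows it
for (i in seq_len(min(5, nrow(ScriptsCharactersMoe)))) {
  currentId <- ScriptsCharactersMoe$id[i]
  MoeNextSenceAfterMidge <- ScriptsCharacters %>%
    filter(id >= currentId, id <= currentId + 1) %>%
    select(name, raw_text)
  MoeAdressesTo <- rbind(MoeAdressesTo, MoeNextSenceAfterMidge)
}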
name | raw_text |
---|---|
Moe Szyslak | Moe Szyslak: What? What? A bartender can’t come by and say “hi” to his best customer? (TO MARGE) Hey, hey there, Midge! Oh gee I like what you done to your hair. |
Marge Simpson | Marge Simpson: You caught me at a real bad time, Moe. I hope you understand I’m too tense to pretend I like you. |
Moe Szyslak | Moe Szyslak: Can somebody tell me what the hell is goin’ on? Midge, help me out here. |
Homer Simpson | Homer Simpson: Quiet! You’re missing the jokes! |
Moe Szyslak | Moe Szyslak: Outta the way, Midge! |
Marge Simpson | Marge Simpson: (PHONY) Oh, am I in the way? |
Moe Szyslak | Moe Szyslak: Ah, remember, Midge. You feel the need to rage, you call me, right? I won’t even get sexual or nothin’. Unless that’s what you want. (SHORT AWKWARD BEAT) That, that’s not what you want, right? |
Marge Simpson | Marge Simpson: (FIRMLY) No thanks. (UPBEAT) But thanks. |
Moe Szyslak | Moe Szyslak: Okay, Midge. You made us feel bad about what we done to your boy. But what can we do about it now? It’s not like we can play the game over again. |
Lisa Simpson | Lisa Simpson: (SLY) Can’t we? |
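The Marge and Moe lookups follow the same pattern: find the lines where a character says a keyword, then show each line together with the one that follows it. Below is a minimal sketch of a reusable helper in base R. The function name `lines_with_reply` and the toy data frame are hypothetical; the sketch assumes integer `id` values in script order and the columns `name`, `normalized_text` and `raw_text` used in the code above.

```r
# Hypothetical helper: for the first `n` lines in which `speaker` says
# `keyword`, return each line together with the line that follows it.
lines_with_reply <- function(scripts, speaker, keyword, n = 5) {
  # which() drops rows where name is NA
  hit_rows <- which(scripts$name == speaker &
                      grepl(keyword, scripts$normalized_text, fixed = TRUE))
  hits <- head(scripts$id[hit_rows], n)
  # Keep each matching id and the id right after it, without duplicates
  wanted <- unique(c(hits, hits + 1))
  scripts[scripts$id %in% wanted, c("name", "raw_text")]
}

# Made-up toy data in the assumed layout
toy <- data.frame(
  id = 1:5,
  name = c("Marge Simpson", "Homer Simpson", "Moe Szyslak",
           "Marge Simpson", "Lisa Simpson"),
  normalized_text = c("oh homie", "hmph", "hey midge", "no thanks", "cant we"),
  raw_text = c("Oh, Homie.", "Hmph?", "Hey, Midge!", "No thanks.", "Can't we?"),
  stringsAsFactors = FALSE
)

lines_with_reply(toy, "Moe Szyslak", "midge")
```

Unlike the loops above, this version returns rows in script order and lists a line only once when two matches are adjacent.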