Chapter 18 He or She Analysis
We examine the words that follow “he” or “she”. This section draws inspiration from a blog post by David Robinson.
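library(dplyr)
library(tidyr)
library(tidytext)
# Split each text into bigrams and keep those that begin with "he" or "she"
train %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("he", "she"))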
## # A tibble: 5,680 x 5
## id author len word1 word2
## <chr> <chr> <int> <chr> <chr>
## 1 id00004 EAP 134 he might
## 2 id00004 EAP 134 he necessarily
## 3 id00017 EAP 469 he makes
## 4 id00029 MWS 115 he entered
## 5 id00035 HPL 75 he was
## 6 id00036 HPL 201 he had
## 7 id00037 MWS 274 he owned
## 8 id00037 MWS 274 he the
## 9 id00043 HPL 167 he absorbed
## 10 id00045 MWS 237 he found
## # ... with 5,670 more rows
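To see what this pipeline does on a small scale, here is a minimal sketch on a made-up sentence. The `toy` tibble and its contents are assumptions for illustration only and are not part of the data set. `unnest_tokens()` with `token = "ngrams"` lowercases the text by default and emits overlapping two-word windows; `separate()` then splits each window into its two words.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical one-row input standing in for `train`
toy <- tibble(id = "toy01", text = "She cried and he told her so")

toy_pairs <- toy %>%
  # Overlapping word pairs: "she cried", "cried and", "and he", ...
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # One column per word of the pair
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # Keep only the pairs whose first word is "he" or "she"
  filter(word1 %in% c("he", "she"))

toy_pairs
```

For this sentence the filter should leave two pairs, “she cried” and “he told”. Pairs such as “told her” are dropped because only the first word of the pair is tested.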
18.1 Gender associated verbs
Which words were most shifted towards occurring after “he” or “she”? We’ll filter for words that appeared at least 20 times.
library(ggplot2)
train %>%
  # Split each text into bigrams and keep those that begin with "he" or "she"
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 %in% c("he", "she")) %>%
  # One row per following word, with its count after "he" and after "she"
  count(word1, word2) %>%
  spread(word1, n, fill = 0) %>%
  # Add-one smoothing, then the log2 ratio of the "she" share to the "he" share
  mutate(total = he + she,
         he = (he + 1) / sum(he + 1),
         she = (she + 1) / sum(she + 1),
         log_ratio = log2(she / he),
         abs_ratio = abs(log_ratio)) %>%
  arrange(desc(log_ratio)) %>%
  filter(!word2 %in% c("himself", "herself"),
         !word2 %in% stop_words$word,
         total >= 20) %>%
  # Keep the 15 most shifted words in each direction
  group_by(direction = ifelse(log_ratio > 0, 'More "she"', "More 'he'")) %>%
  top_n(15, abs_ratio) %>%
  ungroup() %>%
  mutate(word2 = reorder(word2, log_ratio)) %>%
  ggplot(aes(word2, log_ratio, fill = direction)) +
  geom_col() +
  coord_flip() +
  labs(x = "",
       y = 'Relative appearance after "she" compared to "he"',
       fill = "",
       title = "Gender associated with Verbs") +
  # Breaks are log2 ratios: -2 is 4X towards "he", 1 is 2X towards "she"
  scale_y_continuous(labels = c("4X", "2X", "Same", "2X"),
                     breaks = seq(-2, 1)) +
  guides(fill = guide_legend(reverse = TRUE)) +
  theme_bw()
Verbs such as “cried”, “loved”, “died”, and “heard” are relatively more common after “she”, while “told”, “spoke”, “sat”, “wished”, and “found” are relatively more common after “he”.
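The log ratio behind the plot can be checked by hand. The sketch below repeats the `mutate()` arithmetic in base R on made-up counts. The `counts` data frame and its numbers are assumptions chosen so the result is easy to verify; they are not taken from the data.

```r
# Hypothetical counts of three words after "he" and after "she"
counts <- data.frame(
  word2 = c("cried", "told", "looked"),
  he    = c(0, 3, 2),
  she   = c(3, 0, 2)
)

# Add-one smoothing, then each column becomes a share of its own total
he_share  <- (counts$he + 1) / sum(counts$he + 1)
she_share <- (counts$she + 1) / sum(counts$she + 1)

# Positive values lean towards "she", negative values towards "he"
counts$log_ratio <- log2(she_share / he_share)
counts$log_ratio
# 2 -2 0
```

A value of 2 means the word's share after “she” is 2^2 = 4 times its share after “he”, which is why the axis labels read “4X”, “2X”, and “Same” at breaks of -2, -1 (or 1), and 0. The add-one smoothing keeps the ratio finite for words that never follow one of the two pronouns.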