Chapter 55 Tokenisation
We break the text into individual tokens, which here are simply individual words; this process is called tokenisation. It is accomplished with the unnest_tokens
function from the tidytext package.
library(dplyr)
library(tidytext)

# Rename the Phrase column to text, the default input column for unnest_tokens
train <- train %>%
  dplyr::rename(text = Phrase)
test <- test %>%
  dplyr::rename(text = Phrase)

# Split each phrase into one lower-cased word per row
train %>%
  unnest_tokens(word, text) %>%
  head(10)
## # A tibble: 10 x 4
## PhraseId SentenceId Sentiment word
## <int> <int> <int> <chr>
## 1 1 1 1 a
## 2 1 1 1 series
## 3 1 1 1 of
## 4 1 1 1 escapades
## 5 1 1 1 demonstrating
## 6 1 1 1 the
## 7 1 1 1 adage
## 8 1 1 1 that
## 9 1 1 1 what
## 10 1 1 1 is
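As a self-contained sketch of the same pipeline, the tokenised output can be piped straight into count() to find the most frequent words. The tiny data frame below is a made-up stand-in for the real train set, purely for illustration:

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the competition data
toy <- tibble::tibble(
  PhraseId = 1:2,
  text = c("A series of escapades", "The series occasionally amuses")
)

toy %>%
  unnest_tokens(word, text) %>%  # one row per word, lower-cased by default
  count(word, sort = TRUE)       # token frequencies, most frequent first
```

Note that unnest_tokens lower-cases tokens by default (to_lower = TRUE), so "A" and "a" are counted as the same word.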
55.1 Removing Stop Words
Common words such as "the", "of", and "a" carry little information for sentiment analysis, so we remove them using the stop_words lexicon supplied by tidytext. Since every review is about a film, the words "movie" and "film" are similarly uninformative, and we treat them as domain-specific stop words.
movie_stopwords <- c("movie", "film")
train %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
filter(!word %in% movie_stopwords) %>%
head(10)
## # A tibble: 10 x 4
## PhraseId SentenceId Sentiment word
## <int> <int> <int> <chr>
## 1 1 1 1 series
## 2 1 1 1 escapades
## 3 1 1 1 demonstrating
## 4 1 1 1 adage
## 5 1 1 1 goose
## 6 1 1 1 gander
## 7 1 1 1 occasionally
## 8 1 1 1 amuses
## 9 1 1 1 amounts
## 10 1 1 1 story
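An equivalent, and arguably more idiomatic, tidytext pattern is to drop stop words with anti_join rather than filter(). This sketch assumes the same renamed train data as above; the combined stop-word table is an assumption for illustration:

```r
library(dplyr)
library(tidytext)

# Domain-specific stop words as a data frame, so they can be row-bound
# onto tidytext's stop_words lexicon
movie_stopwords <- tibble::tibble(word = c("movie", "film"))

train %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%       # drop common English stop words
  anti_join(movie_stopwords, by = "word") %>%  # drop domain-specific words
  head(10)
```

anti_join keeps only the rows of the tokenised data with no match in the stop-word table, which is exactly what the two filter() calls do, but it scales more naturally if the stop-word lists grow.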