Chapter 7 Tokenisation
We break the text into individual tokens, which here are simply individual words; this process is called tokenisation. It is accomplished with the unnest_tokens() function from the tidytext package, which by default also converts tokens to lower case and strips punctuation.
train %>%
unnest_tokens(word, text) %>%
head(10)
## # A tibble: 10 x 4
## id author len word
## <chr> <chr> <int> <chr>
## 1 id26305 EAP 231 this
## 2 id26305 EAP 231 process
## 3 id26305 EAP 231 however
## 4 id26305 EAP 231 afforded
## 5 id26305 EAP 231 me
## 6 id26305 EAP 231 no
## 7 id26305 EAP 231 means
## 8 id26305 EAP 231 of
## 9 id26305 EAP 231 ascertaining
## 10 id26305 EAP 231 the
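Notice that "this" appears in lower case even though the sentence it came from begins with "This": unnest_tokens() lowercases and strips punctuation by default. A minimal self-contained sketch of the same call, using a tiny hypothetical tibble in place of the train data:

```r
library(dplyr)
library(tidytext)

# A toy stand-in for the train tibble (hypothetical rows, not the real data)
toy <- tibble(id = "id00001",
              text = "This process, however, afforded me no means.")

# One row per word; punctuation is dropped and tokens are lowercased
toy %>%
  unnest_tokens(word, text)
```

The output keeps the id column alongside the new word column, so every token stays linked to the document it came from.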
7.1 Removing Stop Words
We tokenise the text in the train dataset and remove the most commonly occurring words, known as stop words, using the stop_words dataset from tidytext.
train %>%
unnest_tokens(word, text) %>%
filter(!word %in% stop_words$word) %>%
head(10)
## # A tibble: 10 x 4
## id author len word
## <chr> <chr> <int> <chr>
## 1 id26305 EAP 231 process
## 2 id26305 EAP 231 afforded
## 3 id26305 EAP 231 means
## 4 id26305 EAP 231 ascertaining
## 5 id26305 EAP 231 dimensions
## 6 id26305 EAP 231 dungeon
## 7 id26305 EAP 231 circuit
## 8 id26305 EAP 231 return
## 9 id26305 EAP 231 set
## 10 id26305 EAP 231 aware
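An equivalent, arguably more idiomatic tidytext pattern is anti_join() against stop_words, which drops every row whose word matches the lexicon via the shared word column. A hedged sketch on a hypothetical one-row tibble (not the real train data):

```r
library(dplyr)
library(tidytext)

# Toy input echoing the example sentence above (hypothetical)
toy <- tibble(id = "id00001",
              text = "This process afforded me no means of ascertaining.")

# anti_join() keeps only tokens NOT present in the stop_words lexicon
toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```

Both approaches give the same result here; anti_join() can be faster on large data and makes the join column explicit via the by argument.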