Chapter 7 Tokenisation

We break the text into individual tokens, which in this case are simply single words. This process is called tokenisation, and it is accomplished with the unnest_tokens function from the tidytext package.

train %>%
  unnest_tokens(word, text) %>%
  head(10)
## # A tibble: 10 x 4
##         id author   len         word
##      <chr>  <chr> <int>        <chr>
##  1 id26305    EAP   231         this
##  2 id26305    EAP   231      process
##  3 id26305    EAP   231      however
##  4 id26305    EAP   231     afforded
##  5 id26305    EAP   231           me
##  6 id26305    EAP   231           no
##  7 id26305    EAP   231        means
##  8 id26305    EAP   231           of
##  9 id26305    EAP   231 ascertaining
## 10 id26305    EAP   231          the

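Since the train tibble is defined elsewhere, here is a minimal, self-contained sketch of the same call on a hypothetical two-row corpus (assuming only the dplyr and tidytext packages). Note that unnest_tokens lowercases tokens and strips punctuation by default:

```r
library(dplyr)
library(tidytext)

# A tiny stand-in for the train tibble (hypothetical data)
toy <- tibble(
  id   = c("id1", "id2"),
  text = c("This process, however, afforded me no means.",
           "The dungeon was dark.")
)

# unnest_tokens(output_column, input_column): one row per word,
# lowercased with punctuation removed
toy %>%
  unnest_tokens(word, text)
```

The id column is carried along for every token, which is what lets us later group word counts by document or author.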
7.1 Removing Stop Words

We tokenise the words in the train dataset and remove the most commonly occurring words (stop words), using the stop_words lexicon supplied by tidytext.

train %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>% head(10)
## # A tibble: 10 x 4
##         id author   len         word
##      <chr>  <chr> <int>        <chr>
##  1 id26305    EAP   231      process
##  2 id26305    EAP   231     afforded
##  3 id26305    EAP   231        means
##  4 id26305    EAP   231 ascertaining
##  5 id26305    EAP   231   dimensions
##  6 id26305    EAP   231      dungeon
##  7 id26305    EAP   231      circuit
##  8 id26305    EAP   231       return
##  9 id26305    EAP   231          set
## 10 id26305    EAP   231        aware
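The filter(!word %in% stop_words$word) step above can equivalently be written with anti_join, which is the more idiomatic tidytext pattern. A minimal sketch on a hypothetical one-row corpus (assuming dplyr and tidytext are installed):

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for a single train row
toy <- tibble(
  id   = "id1",
  text = "This process afforded me no means of ascertaining the dimensions."
)

# anti_join keeps only the rows whose word does NOT appear in
# stop_words, matching on the shared "word" column
toy %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```

Either form drops filler words like "this", "of", and "the", leaving content words such as "process" and "dimensions" for downstream analysis.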