Chapter 55 Tokenisation

We break the text into individual tokens, which here are simply individual words. This process is called tokenisation, and it is accomplished through the unnest_tokens function.

# Rename the Phrase column to "text" in both data sets, since
# unnest_tokens() below is pointed at a column called `text`.
train <- dplyr::rename(train, text = Phrase)
test <- dplyr::rename(test, text = Phrase)

# Split each phrase into one-word-per-row tokens and preview the
# first ten rows of the result.
head(unnest_tokens(train, word, text), n = 10)
## # A tibble: 10 x 4
##    PhraseId SentenceId Sentiment          word
##       <int>      <int>     <int>         <chr>
##  1        1          1         1             a
##  2        1          1         1        series
##  3        1          1         1            of
##  4        1          1         1     escapades
##  5        1          1         1 demonstrating
##  6        1          1         1           the
##  7        1          1         1         adage
##  8        1          1         1          that
##  9        1          1         1          what
## 10        1          1         1            is

55.1 Removing the Stop Words

movie_stopwords = c("movie","film")

# Tokenise, then drop both the standard English stop words and the
# movie-specific ones in a single filter (conditions are ANDed),
# and preview the first ten remaining tokens.
train %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word,
         !word %in% movie_stopwords) %>%
  head(10)
## # A tibble: 10 x 4
##    PhraseId SentenceId Sentiment          word
##       <int>      <int>     <int>         <chr>
##  1        1          1         1        series
##  2        1          1         1     escapades
##  3        1          1         1 demonstrating
##  4        1          1         1         adage
##  5        1          1         1         goose
##  6        1          1         1        gander
##  7        1          1         1  occasionally
##  8        1          1         1        amuses
##  9        1          1         1       amounts
## 10        1          1         1         story