Chapter 39 Topic Modelling

Topic modelling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.

Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
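
To make the two mixtures concrete, here is a minimal sketch, assuming the tm and topicmodels packages are installed. The toy documents and the choice of k = 2 are illustrative assumptions, not data from this chapter. posterior() returns the per-document topic proportions and the per-topic word probabilities, and each row of either matrix sums to 1.

```r
library(tm)          # corpus and document-term matrix
library(topicmodels) # LDA() and posterior()

# Toy documents (hypothetical, for illustration only)
docs <- c("steak frites wine steak dessert",
          "waiter service slow waiter rude",
          "wine dessert steak service slow",
          "rude waiter slow service wine")

dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Fit a two-topic model; the seed only makes the run repeatable
lda <- LDA(dtm, k = 2, control = list(seed = 1234))

post <- posterior(lda)
doc_topics  <- post$topics # documents x topics: each document is a mixture of topics
topic_words <- post$terms  # topics x terms: each topic is a mixture of words

round(doc_topics, 2)
rowSums(doc_topics)  # every document's topic proportions sum to 1
rowSums(topic_words) # every topic's word probabilities sum to 1
```

A document that mentions both food and service ends up with non-trivial weight on both topics, which is the "overlap" described above.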

39.1 LDA Function

We borrow a handy function from Rachael’s notebook:

library(tm)          # Corpus(), VectorSource(), DocumentTermMatrix()
library(topicmodels) # LDA()
library(tidytext)    # tidy() methods for LDA models and document term matrices
library(dplyr)       # %>%, group_by(), top_n(), arrange()
library(ggplot2)     # plotting

# function to get & plot the most informative terms by a specified number
# of topics, using LDA
top_terms_by_topic_LDA <- function(input_text, # should be a column from a dataframe
                                   plot = TRUE, # return a plot? TRUE by default
                                   number_of_topics = 4) # number of topics (4 by default)
{
    # create a corpus (type of object expected by tm) and document term matrix
    corpus <- Corpus(VectorSource(input_text)) # make a corpus object
    dtm <- DocumentTermMatrix(corpus) # get the count of words/document

    # remove any empty rows in our document term matrix (if there are any
    # we'll get an error when we try to run our LDA)
    unique_indexes <- unique(dtm$i) # get the index of each non-empty row
    dtm <- dtm[unique_indexes, ] # get a subset of only those indexes

    # perform LDA & get the words/topic in a tidy text format
    lda <- LDA(dtm, k = number_of_topics, control = list(seed = 1234))
    topics <- tidy(lda, matrix = "beta")

    # get the top ten terms for each topic
    top_terms <- topics %>% # take the topics data frame and..
      group_by(topic) %>% # treat each topic as a different group
      top_n(10, beta) %>% # get the top 10 most informative words
      ungroup() %>% # ungroup
      arrange(topic, -beta) # arrange words in descending informativeness

    # if the user asks for a plot (TRUE by default)
    if (plot) {
        # plot the top ten terms for each topic in order
        top_terms %>% # take the top terms
          mutate(term = reorder(term, beta)) %>% # sort terms by beta value
          ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by term
          geom_col(show.legend = FALSE) + # as a bar plot
          facet_wrap(~ topic, scales = "free") + # with each topic in a separate plot
          labs(x = NULL, y = "Beta") + # no x label, change y label
          coord_flip() # turn bars sideways
    } else {
        # if the user does not request a plot
        # return a list of sorted terms instead
        return(top_terms)
    }
}
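
The top-terms step can be tried in isolation. The sketch below uses a small hand-made table in place of the output of tidy(lda, matrix = "beta") (the beta values are made up for illustration) and uses slice_max(), which superseded top_n() in dplyr 1.0.0. It is offered as an alternative to the step in the function above, not as a change to it.

```r
library(dplyr)

# Hand-made stand-in for tidy(lda, matrix = "beta"); values are illustrative
topics <- tibble(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("steak", "wine", "frites", "service", "waiter", "slow"),
  beta  = c(0.50, 0.30, 0.20, 0.60, 0.25, 0.15)
)

# Keep the two highest-beta terms within each topic
top_terms <- topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 2) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

top_terms
```

Like top_n(), slice_max() keeps tied rows by default, so a topic can return more than n terms when beta values are equal.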

39.2 Topic Modelling for Mon Ami Gabi

4 topics for Mon Ami Gabi

create_LDA_topics <- function(business_text, custom_stop_words) {
  # create a document term matrix to clean
  reviewsCorpus <- Corpus(VectorSource(business_text$text))
  reviewsDTM <- DocumentTermMatrix(reviewsCorpus)

  # convert the document term matrix to a tidytext corpus
  reviewsDTM_tidy <- tidy(reviewsDTM)

  # remove stopwords
  reviewsDTM_tidy_cleaned <- reviewsDTM_tidy %>% # take our tidy dtm and...
    anti_join(stop_words, by = c("term" = "word")) %>% # remove English stopwords and...
    anti_join(custom_stop_words, by = c("term" = "word")) # remove my custom stopwords

  # fit and plot the LDA on the cleaned terms (each term row is passed as its own document)
  top_terms_by_topic_LDA(reviewsDTM_tidy_cleaned$term, number_of_topics = 4)
}


monamigabi <- reviews %>%
  filter(business_id == "4JNXUYY8wbaaDmk3BPzlWw")

custom_stop_words <- tibble(word = c("mon","ami","gabi","restaurant","food","vegas"))

create_LDA_topics(monamigabi, custom_stop_words)
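
The stop-word step inside create_LDA_topics can be checked on its own. A minimal sketch, assuming only dplyr: the tidy_dtm table below is a hand-made stand-in for tidy(reviewsDTM), with made-up rows, and the custom stop words are the ones used for Mon Ami Gabi.

```r
library(dplyr)

# Hand-made stand-in for tidy(reviewsDTM): one row per document-term pair
tidy_dtm <- tibble(
  document = c("1", "1", "1", "2", "2"),
  term     = c("gabi", "steak", "vegas", "service", "food"),
  count    = c(1, 2, 1, 1, 3)
)

custom_stop_words <- tibble(word = c("mon", "ami", "gabi", "restaurant", "food", "vegas"))

# Drop every row whose term matches a stop word
cleaned <- tidy_dtm %>%
  anti_join(custom_stop_words, by = c("term" = "word"))

cleaned$term # "steak" "service"
```

The named vector in by is needed because the term column in the tidy document term matrix and the word column in the stop-word table hold the same kind of value under different names.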


39.3 Topic Modelling for Bacchanal Buffet

4 topics for the Bacchanal Buffet

custom_stop_words <- tibble(word = c("restaurant","food"))


39.4 Topic Modelling for Pai Northern Thai Kitchen

4 topics for Pai Northern Thai Kitchen

custom_stop_words <- tibble(word = c("thai","restaurant","food"))


We observe that service is a common theme across the topics for all three restaurants. Service complaints were also very evident in the sentiment analysis.