Chapter 39 Topic Modelling
Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds natural groups of items even when we’re not sure what we’re looking for.
Latent Dirichlet allocation (LDA) is a particularly popular method for fitting a topic model. It treats each document as a mixture of topics, and each topic as a mixture of words. This allows documents to “overlap” each other in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
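The "mixture" idea can be made concrete with a little arithmetic. The sketch below is base R only and uses a made-up model (two hypothetical topics, four words, invented probabilities) rather than anything fitted to the review data. Each topic is a probability distribution over words, each document is a set of weights over topics, and a document's expected word distribution is the weighted combination of the two.

```r
# Hypothetical toy model: 2 topics over a 4-word vocabulary (values invented)
vocab <- c("steak", "frites", "service", "wait")

# beta: one row per topic, giving P(word | topic); each row sums to 1
beta <- rbind(
  food    = c(0.60, 0.30, 0.05, 0.05),
  service = c(0.05, 0.05, 0.50, 0.40)
)
colnames(beta) <- vocab

# theta: one row per document, giving P(topic | document); each row sums to 1
theta <- rbind(
  review_1 = c(food = 0.9, service = 0.1),
  review_2 = c(food = 0.3, service = 0.7)
)

# expected word distribution of each document: a mixture of the topic rows
word_probs <- theta %*% beta
round(word_probs, 3)
```

Both reviews draw on both topics, only in different proportions, which is the sense in which documents "overlap" instead of falling into discrete groups.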
39.1 LDA Function
The function below is borrowed from Rachael’s notebook. It fits an LDA model to a column of text and returns, or plots, the top terms for each topic.
# packages the function relies on
library(tm)          # Corpus(), VectorSource(), DocumentTermMatrix()
library(topicmodels) # LDA()
library(tidytext)    # tidy() methods for LDA models and document term matrices
library(dplyr)       # %>%, group_by(), top_n(), arrange()
library(ggplot2)     # plotting

# function to get & plot the most informative terms for a specified number
# of topics, using LDA
top_terms_by_topic_LDA <- function(input_text, # should be a column from a data frame
                                   plot = TRUE, # return a plot? TRUE by default
                                   number_of_topics = 4) # number of topics (4 by default)
{
  # create a corpus (type of object expected by tm) and document term matrix
  corpus <- Corpus(VectorSource(input_text)) # make a corpus object
  DTM <- DocumentTermMatrix(corpus) # get the count of words/document

  # remove any empty rows in our document term matrix (if there are any
  # we'll get an error when we try to run our LDA)
  unique_indexes <- unique(DTM$i) # row indexes of documents with at least one term
  DTM <- DTM[unique_indexes, ] # keep only those rows

  # perform LDA & get the words/topic in a tidy text format
  lda <- LDA(DTM, k = number_of_topics, control = list(seed = 1234))
  topics <- tidy(lda, matrix = "beta") # beta = per-topic word probability

  # get the top ten terms for each topic
  top_terms <- topics %>% # take the topics data frame and...
    group_by(topic) %>% # treat each topic as a different group
    top_n(10, beta) %>% # keep the 10 terms with the highest beta (ties are kept)
    ungroup() %>% # ungroup
    arrange(topic, -beta) # sort terms within each topic by descending beta

  # if the user asks for a plot (TRUE by default)
  if (isTRUE(plot)) {
    # plot the top ten terms for each topic in order
    top_terms %>% # take the top terms
      mutate(term = reorder(term, beta)) %>% # sort terms by beta value
      ggplot(aes(term, beta, fill = factor(topic))) + # plot beta by term
      geom_col(show.legend = FALSE) + # as a bar plot
      facet_wrap(~ topic, scales = "free") + # with each topic in a separate panel
      labs(x = NULL, y = "Beta") + # no x label, change y label
      coord_flip() # turn bars sideways
  } else {
    # if the user does not request a plot
    # return a data frame of sorted terms instead
    return(top_terms)
  }
}
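To see what the "top terms per topic" step does without fitting a model, here is a small base R sketch on a hand-made table shaped like the output of tidy(lda, matrix = "beta"). The beta values are invented and top_n_terms is a hypothetical helper, not part of the function above. It also differs from top_n() in one respect: it keeps exactly n rows per topic, whereas top_n() keeps all tied rows.

```r
# Hand-made stand-in for tidy(lda, matrix = "beta"); values are invented
topics <- data.frame(
  topic = c(1, 1, 1, 2, 2, 2),
  term  = c("steak", "frites", "wait", "service", "slow", "steak"),
  beta  = c(0.50, 0.30, 0.20, 0.45, 0.35, 0.20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the n highest-beta terms within each topic
top_n_terms <- function(topics, n) {
  # sort by topic, then by descending beta within each topic
  ordered <- topics[order(topics$topic, -topics$beta), ]
  # position of each row within its topic (1 = highest beta)
  rank_in_topic <- ave(ordered$beta, ordered$topic, FUN = seq_along)
  ordered[rank_in_topic <= n, ]
}

top_two <- top_n_terms(topics, 2)
top_two
```

Note that the same word ("steak" here) can appear under more than one topic, because every topic assigns some probability to every word.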
39.2 Topic Modelling for Mon Ami Gabi
We fit four topics to the Mon Ami Gabi reviews.
create_LDA_topics <- function(business_text, custom_stop_words)
{
  # create a document term matrix to clean
  reviewsCorpus <- Corpus(VectorSource(business_text$text))
  reviewsDTM <- DocumentTermMatrix(reviewsCorpus)

  # convert the document term matrix to a tidy data frame
  # (one row per document-term pair, with columns document, term, count)
  reviewsDTM_tidy <- tidy(reviewsDTM)

  # remove stopwords
  reviewsDTM_tidy_cleaned <- reviewsDTM_tidy %>% # take our tidy dtm and...
    anti_join(stop_words, by = c("term" = "word")) %>% # remove English stopwords and...
    anti_join(custom_stop_words, by = c("term" = "word")) # remove my custom stopwords

  # fit and plot 4 topics on the cleaned terms
  # (note: this passes a vector of single terms, and VectorSource() treats
  # each element of the vector as its own document)
  top_terms_by_topic_LDA(reviewsDTM_tidy_cleaned$term, number_of_topics = 4)
}
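# `reviews` is assumed to be the review data frame loaded earlier
monamigabi <- reviews %>%
  filter(business_id == "4JNXUYY8wbaaDmk3BPzlWw")

# words that appear in almost every review of this restaurant
custom_stop_words <- tibble(word = c("mon", "ami", "gabi", "restaurant", "food", "vegas"))
create_LDA_topics(monamigabi, custom_stop_words)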
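In create_LDA_topics, the cleaned terms reach top_terms_by_topic_LDA as a character vector of single words, and VectorSource() treats each element of a vector as a separate document. One possible variant, which is not part of the analysis in this chapter, is to rebuild one string per review from the tidy (document, term, count) rows first, so that the model can see which words occur together in a review. A minimal base R sketch on made-up data follows; rebuild_documents is a hypothetical helper and the toy table only mimics the column layout of a tidied document term matrix.

```r
# Made-up stand-in for a tidied, stopword-filtered document term matrix
tidy_dtm <- data.frame(
  document = c("1", "1", "2", "2", "2"),
  term     = c("steak", "frites", "service", "slow", "steak"),
  count    = c(2, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one space-separated string per document,
# with each term repeated as many times as it was counted
rebuild_documents <- function(tidy_dtm) {
  words <- rep(tidy_dtm$term, times = tidy_dtm$count)
  docs  <- rep(tidy_dtm$document, times = tidy_dtm$count)
  vapply(split(words, docs), paste, character(1), collapse = " ")
}

cleaned_documents <- rebuild_documents(tidy_dtm)
cleaned_documents
```

The resulting vector could then be passed as input_text in place of reviewsDTM_tidy_cleaned$term. The fitted topics would likely differ from the ones discussed in this chapter.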
39.3 Topic Modelling for Bacchanal Buffet
We fit four topics to the Bacchanal Buffet reviews.
custom_stop_words <- tibble(word = c("restaurant","food"))
create_LDA_topics(bacchanal,custom_stop_words)
39.4 Topic Modelling for Pai Northern Thai Kitchen
We fit four topics to the Pai Northern Thai Kitchen reviews.
custom_stop_words <- tibble(word = c("thai","restaurant","food"))
create_LDA_topics(pai_thai,custom_stop_words)
A common theme that appears across the topics for all three restaurants is service. Service complaints were also very evident in the sentiment analysis.