Little Book on Text Mining

Chapter 29 Most Popular Categories

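The bar plot below shows the ten most common business categories. Each business lists its categories in a single semicolon-separated string, so the strings are split on ";" before the categories are counted.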

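# Each business stores its categories as one ";"-separated string:
# split the strings, then flatten them into a one-column data frame.
categories = str_split(business$categories, ";")
categories = as.data.frame(unlist(categories))
colnames(categories) = c("Name")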

# Count each category, keep the ten most frequent, and order the bars by count.
categories %>%
  group_by(Name) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  arrange(desc(Count)) %>%
  head(10) %>%
  mutate(Name = reorder(Name, Count)) %>%
  ggplot(aes(x = Name,y = Count)) +
  geom_bar(stat = 'identity', colour = "white", fill = fillColor2) +
  geom_text(aes(x = Name, y = 1, label = paste0("(", Count, ")")),
            hjust=0, vjust=.5, size = 4, colour = 'black',
            fontface = 'bold') +
  labs(x = 'Name of Category', y = 'Count', 
       title = 'Top 10 Categories of Business') +
  coord_flip() + 
  theme_bw()
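To see what the split-and-count step does, here is a minimal sketch on made-up data. The data frame `toy_business` and its three rows are hypothetical stand-ins for `business$categories`, and the sketch assumes only that `stringr` and `dplyr` are installed.

```r
library(stringr)
library(dplyr)

# Hypothetical toy data: three businesses, each with a ";"-separated category string
toy_business <- data.frame(
  categories = c("Restaurants;Mexican", "Restaurants;Bars;Nightlife", "Shopping"),
  stringsAsFactors = FALSE
)

# str_split() returns one character vector per business; unlist() flattens them
toy_categories <- data.frame(
  Name = unlist(str_split(toy_business$categories, ";")),
  stringsAsFactors = FALSE
)

# Same counting pattern as in the chapter: one row per category, most frequent first
toy_counts <- toy_categories %>%
  group_by(Name) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  arrange(desc(Count))

print(toy_counts)
```

In this toy data "Restaurants" appears in two businesses and so ranks first; every other category appears once. A business with several categories contributes one count to each of them, so the counts add up to more than the number of businesses.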