1
Introduction
I Spooky Authors Text Mining
1.1
Spooky Author Identification Dataset
2
Read the Data
3
Add Feature Number of Words
4
Peek into the Data
5
Length Comparison
6
Words Length Distribution
6.1
Words Length Distribution Plot 2
7
Tokenisation
7.1
Removing the Stop words
8
Top Twenty most Common Words
8.1
WordCloud of the Common Words
8.2
WordCloud of HPL
8.3
BarPlot for words of HPL
8.4
WordCloud of MWS
8.5
BarPlot for words of MWS
8.6
WordCloud of EAP
8.7
BarPlot for words of EAP
9
TF-IDF
9.1
The Math
9.2
Twenty Most Important words
9.2.1
Twenty most important words HPL
9.2.2
Twenty most important words EAP
9.2.3
Twenty most important words MWS
9.3
Word Cloud for the Most Important Words
10
Most Common Bigrams
11
Most Common Trigrams
12
Relationship among words
13
Sentiment Analysis
13.1
Positive Authors and Not So Positive Authors
13.2
Positive and Not So Positive Words of Authors
13.3
Positive and Not So Positive Words of Author HPL
13.4
Positive and Not So Positive Words of Author EAP
13.5
Positive and Not So Positive Words of Author MWS
14
Sentiment Analysis using NRC Sentiment lexicon
14.1
Sentiment Analysis Words - Fear
14.2
Fear Word Cloud - MWS
14.3
Sentiment Analysis Words - Surprise
14.4
Surprise Word Cloud - MWS
14.5
Sentiment Analysis Words - Joy
14.6
Joy Word Cloud - MWS
15
Positive and Not So Positive Lines
16
Feature Sentiment Score
17
Feature NRC Sentiments
18
He or She Analysis
18.1
Gender associated verbs
19
Document Term Matrix
20
Modelling with XGBoost
20.1
Add features
20.2
Creating the XGBoost Model
21
Predictions using the XGB Model
22
Predictions using glmnet Model
23
Modelling using the text2vec package
23.1
Inspect the vocabulary
23.2
Inspect the Document Term Matrix
23.3
Build the Multinomial Logistic Regression Model
23.4
Predict using the Multinomial Logistic Regression Model
II Yelp Data Reviews Text Mining
24
Introduction
25
Preparation
25.1
Load Libraries
25.2
Read the data
26
Business data
27
Reviews data
28
Detecting the language of the reviews
29
Most Popular Categories
30
Top Ten Cities with the most Business parties mentioned in Yelp
31
Map of the business parties in Las Vegas
32
Business with most Five Star Reviews from Users
33
“Mon Ami Gabi”
33.1
Useful, funny, cool reviews
33.2
Word Cloud of Mon Ami Gabi
33.3
Top Ten most common Words of the business “Mon Ami Gabi”
33.4
Sentiment Analysis - Positive and Not So Positive Words of “Mon Ami Gabi”
33.5
Calculate Sentiment for the reviews
33.6
Negative Reviews
33.7
Positive Reviews
33.8
Most Common Bigrams of “Mon Ami Gabi”
33.9
Relationship among words
33.9.1
Relationship of words with steak
33.9.2
Relationship of words with french
34
Bacchanal Buffet
34.1
Word Cloud of Bacchanal Buffet
34.2
Top Ten most common Words of the business “Bacchanal Buffet”
34.3
Sentiment Analysis - Positive and Not So Positive Words of Bacchanal Buffet
34.4
Calculate Sentiment for the reviews
34.5
Negative Reviews
34.6
Positive Reviews
34.7
Relationship among words in Bacchanal Buffet
34.7.1
Relationship of words with crab
34.7.2
Relationship of words with food
35
Top Ten Businesses in Toronto
36
Pai Northern Thai Kitchen
36.1
Word Cloud of business Pai Northern Thai Kitchen
36.2
Ten most common words used in reviews of business Pai Northern Thai Kitchen
36.3
Sentiment Analysis - Positive and Not So Positive Words of Pai Northern Thai Kitchen
36.4
Calculate Sentiment for the reviews
36.5
Negative Reviews
36.6
Positive Reviews
36.7
Relationship among words in Pai Northern Thai Kitchen
36.7.1
Relationship of words with thai
37
Chipotle business
38
Chipotle Business in Yonge Street Toronto
38.1
Word Cloud of business Chipotle Business in Yonge Street Toronto
38.2
Top Ten most common Words of the business “Chipotle Business in Yonge Street Toronto”
38.3
Sentiment Analysis - Positive and Not So Positive Words of Chipotle Business in Yonge Street Toronto
38.4
Calculate Sentiment for the reviews
38.5
Negative Reviews
38.6
Positive Reviews
38.7
Relationship among words in Chipotle Business in Yonge Street Toronto
39
Topic Modelling
39.1
LDA Function
39.2
Topic Modelling for Mon Ami Gabi
39.3
Topic Modelling for Bacchanal Buffet
39.4
Topic Modelling for Pai Northern Thai Kitchen
40
Phoenix City Analysis
40.1
Top Ten Businesses in Phoenix
40.2
Topic Modelling for Phoenix City
40.3
Word Cloud of Phoenix City
40.4
Top Ten most common Words of the business Phoenix City
40.5
Sentiment Analysis - Positive and Not So Positive Words of Phoenix City
40.6
Calculate Sentiment for the reviews
40.7
Negative Reviews
40.8
Positive Reviews
III Simpsons Text Mining
41
Introduction
42
Ten Most Active Characters
43
Next Ten Most Active Characters
44
Top Twenty most Common Words
44.1
WordCloud of the Common Words
45
TF-IDF
45.1
The Math
45.2
Twenty Most Important words for the Twenty Most Active Characters
45.3
Word Cloud for the Twenty most important characters
45.4
Marge Important Words
45.5
Moe Important Words
45.5.1
Word focus - “Midge”
46
Relationship among words
46.1
“Don’t” word network graph
47
Sentiment Analysis
47.1
Positive Characters and Not So Positive Characters
47.2
Positive and Not So Positive Words
47.3
Positive and Not So Positive Script Lines
48
Topic Modelling
49
Location of Characters
49.1
Homer’s Location
49.2
Marge’s Location
49.3
Bart’s Location
49.4
Lisa’s Location
50
Homer Simpson
50.1
Word Cloud
50.2
Positive and Not So Positive Words of Homer Simpson
50.3
Sentiment Analysis based on location
51
Best and the Worst Episodes
51.1
Best Episode
51.2
Worst Episode
51.3
Positive and Not So Positive Characters of the Best Episode
51.4
Positive and Not So Positive Characters of the Worst Episode
51.5
Positive and Not So Positive Words of Best Episode
51.6
Positive and Not So Positive Words of Worst Episode
52
Modelling with XGBoost
IV Movies Sentiment Analysis
53
Introduction
54
Read the data
55
Tokenisation
55.1
Removing the Stop words
56
Top Ten most Common Words
56.1
WordCloud of the Common Words
56.2
Word Cloud of Negative Sentiments
56.3
Word Cloud of somewhat negative Sentiments
56.4
Word Cloud of neutral Sentiments
56.5
Word Cloud of somewhat positive Sentiments
56.6
Word Cloud of positive Sentiments
57
TF-IDF
57.1
The Math
57.2
Twenty Most Important Words
58
Most Common Bigrams
59
Most Common Trigrams
60
Relationship among words
61
Modelling using the text2vec package
61.1
Inspect the vocabulary
61.2
Inspect the Document Term Matrix
61.3
TF-IDF
61.4
Build the Multinomial Logistic Regression Model
61.5
Predict using the Multinomial Logistic Regression Model
V Chicago Food Inspections
62
Introduction
63
Analysis of Data (Having Values Percentage)
64
Top Twenty Facility Types
65
Inspection Results for Top Twenty Facility Types
66
Trend of Inspections Yearwise
66.1
All Results
66.2
Each Result Category
67
Trend of Inspections Monthwise
67.1
All Results
67.2
Each Result Category
67.3
Out of Business Result
68
Trend of Inspections Daywise
69
Maps of Food Places
69.1
Pass or Fail Spots
69.2
Out of Business Spots
70
Top Twenty most Common Words
70.1
WordCloud of the Common Words
71
TF-IDF Theory
71.1
The Math
72
Term Frequency of Words (TF)
73
TF-IDF of Unigrams (One Word)
74
Word Cloud for Unigrams
75
TF-IDF Bigrams
76
Verification of important words for “Out of Business”
77
Word Cloud for Bigrams
78
Relationship among words
79
Sentiment Analysis for Results
80
Sentiment analysis by word
81
Sentiment analysis by word for Each Result Type
82
Sentiment analysis by Inspection Text
83
Plot of Results and Risks
84
Modelling with XGBoost
85
References
Little Book on Text Mining
Chapter 53
Introduction