Chapter 1 Introduction

This Little Book on Text Mining provides a gentle, hands-on introduction to text mining. If you are tired of reading through pages of text and would rather get your hands dirty with quick but detailed text mining, you are in the right place. This book carries out detailed text mining and modelling on the following datasets:

Spooky Author Identification dataset from Kaggle
Yelp Data Reviews dataset from Kaggle
Simpsons dataset from Kaggle
Chicago Inspections dataset from Kaggle

The book focuses on three main areas:

Exploratory data analysis: the TF-IDF concept and its application, bigrams, trigrams, and relationships among words (word clouds and bar plots)
Detailed sentiment analysis, and the insights drawn from it, using different sentiment lexicons such as AFINN and NRC
Modelling using feature engineering and supervised learning techniques such as XGBoost and multinomial logistic regression, and using unsupervised learning techniques such as topic modelling
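
To make the first area concrete, here is a minimal sketch of bigram extraction and TF-IDF scoring in plain Python. It is an illustration only, not the code used in the book: the three documents are invented, the helper names (`tokenize`, `ngrams`, `tf_idf`) are hypothetical, and the weighting assumes the common definition tf = term count / document length and idf = ln(number of documents / number of documents containing the term).

```python
import math
from collections import Counter

# Toy documents, invented for illustration (not taken from any of the datasets).
docs = {
    "doc_a": "the raven sat upon the door",
    "doc_b": "the monster fled into the night",
    "doc_c": "the night was dark and the door was shut",
}

def tokenize(text):
    """Lowercase and split on whitespace; real text also needs punctuation handling."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return consecutive n-token tuples (n=2 gives bigrams, n=3 gives trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Score each (document, term) pair with tf * idf under the assumed definitions."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores

scores = tf_idf(docs)
bigrams_a = ngrams(tokenize(docs["doc_a"]), 2)
```

A word such as "the", which occurs in every document, receives a score of zero, while a word confined to one document, such as "raven", scores highest there. That is the property that makes TF-IDF useful for finding words characteristic of a particular author or source.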

1.1 Spooky Author Identification Dataset

The Spooky Author Identification dataset from Kaggle contains excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. The competition challenges participants to predict the author of each excerpt.

The dataset can be found on Kaggle.

This chapter focuses on the following topics:

Word length comparison among the authors
Common words used by the authors
TF-IDF concept and its application
Bigrams
Trigrams
Relationship among various words
Sentiment Analysis
NRC Sentiment Analysis
Building features using the Sentiment Score
Building features using the NRC Sentiment Score
Words commonly used by males and females in the authors’ text
Evaluation of the Document Term Matrix
Topic Modelling
Predictions using the XGBoost Model
Predictions using the GLMnet Model
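
Several of the topics above (sentiment analysis and building features from the sentiment score) rest on a simple lexicon lookup. The sketch below shows the idea under stated assumptions: `TOY_LEXICON` is a hypothetical mini-lexicon in the style of AFINN with invented values, the excerpts and author labels are made up, and `sentiment_score` is an illustrative helper rather than the book's implementation.

```python
# Hypothetical mini-lexicon in the style of AFINN (word -> integer score).
# These values are invented for illustration; they are not real AFINN entries.
TOY_LEXICON = {"horror": -3, "fear": -2, "dark": -1, "joy": 3, "love": 3, "calm": 1}

def sentiment_score(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return sum(lexicon.get(w, 0) for w in words)

# Invented excerpts with placeholder author labels.
excerpts = [
    ("author_1", "The dark hall filled me with fear."),
    ("author_2", "Her love brought joy and calm."),
]

# One numeric feature per excerpt, usable as an extra input column for a classifier.
features = [(author, sentiment_score(text)) for author, text in excerpts]
```

Each excerpt is reduced to a single number that can sit alongside word-count features when training a model. A category-based lexicon such as NRC would instead yield one count per emotion category.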

Little Book on Text Mining

Ambarish Ganguly
