Chapter 1 Introduction

This Little Book on Text Mining provides a gentle, hands-on introduction to text mining. If you are tired of reading through pages of text and would rather get your hands dirty and experience how to do quick yet detailed text mining, you are in the right place. This book performs detailed text mining and modelling on the following datasets:

  • Spooky Author Identification dataset from Kaggle

  • Yelp Data Reviews dataset from Kaggle

  • Simpsons dataset from Kaggle

  • Chicago Inspections dataset from Kaggle

The book focuses on three main areas:

  • Exploratory Data Analysis, the TF-IDF concept and its application, bigrams, trigrams, and the relationships among words (word clouds and bar plots)

  • Detailed Sentiment Analysis, and the insights drawn from it, using different sentiment lexicons such as AFINN and NRC

  • Modelling using feature engineering and supervised learning techniques such as XGBoost and Multinomial Logistic Regression, and modelling using unsupervised learning techniques such as Topic Modelling
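
To make the TF-IDF idea in the first area concrete, here is a minimal Python sketch under one common definition: term frequency is a term's share of the tokens in a document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the term. The function name `tf_idf`, the whitespace tokenizer, and the three toy sentences are assumptions made for illustration; they are not taken from the book or from any of the datasets.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: tf-idf score} dict per document.

    tf  = count of term in the document / total tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    """
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy documents, invented for this example.
docs = [
    "the raven sat above the door",
    "the creature fled into the night",
    "the old ones sleep beneath the sea",
]
scores = tf_idf(docs)
```

A word such as "the", which occurs in every toy document, gets a score of zero, while a word that appears in only one document scores higher. That is the property that makes TF-IDF useful for finding the words that characterise a particular author or source.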

1.1 Spooky Author Identification Dataset

The Spooky Author Identification dataset from Kaggle contains excerpts from horror stories by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. The competition challenges participants to predict the author of each excerpt.

The dataset can be found on Kaggle.

This chapter focuses on the following topics:

  • Word Length comparison among the various authors

  • Common words used by the authors

  • TF IDF concept and the application of it

  • Bigrams

  • Trigrams

  • Relationship among various words

  • Sentiment Analysis

  • NRC Sentiment Analysis

  • Building features using the Sentiment Score

  • Building features using the NRC Sentiment Score

  • Words commonly used by males and females in the authors’ texts

  • Evaluation of the Document Term Matrix

  • Topic Modelling

  • Predictions using the XGBoost Model

  • Predictions using the GLMnet Model
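
Several of the topics above (word length comparison, bigrams, trigrams) rest on simple token-level operations. The following Python sketch shows one way they could be computed. The helper names `ngrams` and `mean_word_length`, the whitespace tokenizer, and the sample sentence are all assumptions made for illustration; the sentence is not an excerpt from the dataset, and punctuation handling is deliberately left out.

```python
from collections import Counter


def ngrams(text, n):
    """Return the list of n-word sequences (as strings) found in text."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mean_word_length(text):
    """Return the average number of characters per whitespace-separated word."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


# Hypothetical sample sentence, invented for this example.
sample = "It was a dark night and it was a cold night"

bigram_counts = Counter(ngrams(sample, 2))    # counts of two-word sequences
trigram_counts = Counter(ngrams(sample, 3))   # counts of three-word sequences
average_length = mean_word_length(sample)
```

Calling `bigram_counts.most_common(3)` lists the most frequent two-word sequences. Computing the same counts separately for each author's excerpts would give the kind of per-author comparison this chapter describes.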