Chapter 2: Hidden Gems

Heads or Tails [Martin Henze] has compiled a list of 300 kernels over a period of 100 weeks which he believes are Hidden Gems: kernels that are gems but did not get their due recognition. Thanks to Heads or Tails for this wonderful effort for the Kaggle community.

We wanted to find out what makes a Kernel a Hidden Gem. For this we looked through two lenses:

  • 8 perspectives

  • Text Mining the Hidden Gem Titles and Hidden Gem Reviews


8 perspectives

We looked at 8 perspectives: Most Popular Gem Authors, Performance Tier of the Gem Authors, Total Votes, Total Comments, Total Views, Medals for the Gems, Maximum Version of the Kernel, and whether the Notebook is a Competition Notebook or not.

We then performed a dimension reduction on the Hidden Gems to see which factors contribute the most to making a Kernel a Hidden Gem. We found that Total Votes, Total Comments, Total Views and Medal contribute the most. Other influential factors are whether the Notebook is a Competition Notebook and the Maximum Version Number of the Notebook.
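The dimension-reduction step described above can be sketched with a plain PCA in base R. The data frame below is illustrative only: the column names and values stand in for the real perspective features, and any medal/competition encoding here is an assumption.

```r
# Illustrative feature table; values and encodings are assumptions, not the
# real Hidden Gems data.
gem_features <- data.frame(
  TotalVotes    = c(45, 120, 30, 80, 15),
  TotalComments = c(12, 40, 5, 22, 3),
  TotalViews    = c(3500, 9000, 1200, 5600, 800),
  Medal         = c(2, 3, 1, 2, 0),
  MaxVersion    = c(10, 25, 4, 18, 2),
  IsCompetition = c(0, 0, 1, 0, 1)
)

# PCA on the scaled features; the absolute loadings on the first principal
# component indicate which factors contribute most to the variation.
pca <- prcomp(gem_features, center = TRUE, scale. = TRUE)
loadings_pc1 <- sort(abs(pca$rotation[, 1]), decreasing = TRUE)
print(loadings_pc1)
```

Reading off the largest loadings of the first component is one simple way to rank the contributing factors; the original analysis may have used a different dimension-reduction technique.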

More than 84% of the Hidden Gems are Non-Competition Notebooks.

Using these factors, we compiled a very simple rules-based recommender for finding Hidden Gems among Notebooks created between June 2021 and December 2021 [this window was chosen only to reduce the dataset for analysis purposes].

We chose the following criteria:

  • Medal: Silver

  • The Kernel is NOT a Competition Notebook

  • The Performance Tier of the author is Expert or Master

  • Total Votes greater than 40, Total Comments greater than 10, and Total Views greater than 3100

  • We removed Kernels that use common data sources such as Titanic, Breast Cancer, Heart and Diabetes
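The criteria above translate directly into a dplyr filter. This is a minimal sketch on toy data: the column names (`IsCompetition`, `DataSource`) and the numeric encodings (2 = Silver medal, 3 = Expert, 4 = Master) are assumptions for illustration, not confirmed Meta Kaggle schema.

```r
library(dplyr)

# Toy notebook table; columns and encodings are assumptions for illustration.
notebooks <- tibble(
  Title           = c("EDA on sales", "Titanic walkthrough", "Survey deep dive"),
  Medal           = c(2, 2, 2),            # assuming 2 encodes a Silver medal
  IsCompetition   = c(FALSE, FALSE, FALSE),
  PerformanceTier = c(3, 4, 4),            # assuming 3 = Expert, 4 = Master
  TotalVotes      = c(55, 90, 35),
  TotalComments   = c(12, 20, 15),
  TotalViews      = c(4000, 8000, 5000),
  DataSource      = c("Sales", "Titanic", "Kaggle Survey")
)

# Apply the rules-based criteria listed above.
hidden_gem_candidates <- notebooks %>%
  filter(Medal == 2,
         !IsCompetition,
         PerformanceTier %in% c(3, 4),
         TotalVotes > 40,
         TotalComments > 10,
         TotalViews > 3100,
         !DataSource %in% c("Titanic", "Breast Cancer", "Heart", "Diabetes"))
```

In this toy table only "EDA on sales" survives: the Titanic notebook is excluded by data source, and the third falls below the vote threshold.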


Text Mining the Hidden Gem Titles and Hidden Gem Reviews


In our quest to find out what makes a Hidden Gem, we dived deeper into the Hidden Gem Titles and the Hidden Gem Reviews. Following are the observations.

We found that the most commonly occurring two-word phrases in Hidden Gem Titles include data analysis, data science and time series.

We also found that the most commonly occurring two-word phrases in Hidden Gem Reviews include data visuals, visuals analysis, detailed exploration, kaggle survey and exploratory analysis.
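Two-word phrases like these can be mined with tidytext bigram tokenisation. A minimal sketch, assuming the review text sits in a `review` column; the sample sentences below are illustrative, not real reviews.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Illustrative review text standing in for the real Hidden Gem reviews.
reviews <- tibble(review = c("a detailed exploration of the kaggle survey",
                             "beautiful data visuals and exploratory analysis"))

# Tokenise into bigrams, drop pairs containing stop words, and count.
review_bigrams <- reviews %>%
  unnest_tokens(bigram, review, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE)
```

The same pipeline applied to the title column yields the title bigrams reported above.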

We also wanted to find out which topics are prominent for Hidden Gems. The most prominent topics include detailed exploration, competition, data analysis, and data visuals.
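One common way to surface such topics is an LDA topic model on a document-term matrix of the reviews. This is a hedged sketch on toy text with k = 2 topics purely for illustration; the source does not state which topic-modelling method or number of topics was used.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Toy review documents; the real analysis runs on the full review corpus.
docs <- tibble(doc  = c("r1", "r2"),
               text = c("detailed exploration and detailed data analysis",
                        "competition notebook with beautiful data visuals"))

# Build a document-term matrix of word counts.
review_dtm <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(doc, word, n)

# Fit a 2-topic LDA model and pull the top terms per topic.
lda_model <- LDA(review_dtm, k = 2, control = list(seed = 1234))
top_terms <- tidy(lda_model, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 3)
```

Inspecting the highest-probability terms per topic is how labels such as "detailed exploration" or "data visuals" would be read off.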

Mining the Hidden Gem Titles and Reviews, it seems that notebooks with detailed exploratory data analysis catch the eye of the Hidden Gem series creator.

We also dived deeper into the reviews of the leading Hidden Gem authors [Jonathan Bouchet and Ramshankar Yadunath] and found that their notebooks are detailed, with beautiful visuals and exploratory data analysis.

We also looked at another genre of reviews, those of Vopani, kxx and Bojan Tunguz, and saw that these reviews appreciate the various techniques used on structured datasets in competitions. They also focus on model building using Keras as well as XGBoost [Bojan's reviews].

This genre coincides with our observations of Grandmaster reviews, which concentrate on competition solutions. The reviews mention clean code, clean techniques and clean analysis.


Highest Votes after Hidden Gem declaration


Tutorial on reading large datasets, Dive into dplyr (tutorial #1), Writing Hamilton Lyrics with Tensorflow/R, Petfinder Pawpularity EDA & fastai starter, and Recommendation engine with networkx got the highest votes after the Hidden Gem declaration.

Police Policy and the Use of Deadly Force, CTDS - Subtitles exploration, MOA Recipe, Advanced EDA: New Inferences from an old dataset, Leaf doctoR: EDA, Evaluating defender ability to limit YAC, and Do Left Handed Pitchers Make More Money? got no votes after the Hidden Gem declaration.

  • Median Upvotes after Hidden Gem declaration is 9

  • Highest Upvotes after Hidden Gem declaration is 1152


Similar Authors

We find similar authors using the Hidden Gems reviews, applying the TF-IDF technique together with cosine similarity. Below is the list of similar authors along with a cosine similarity score; the higher the score, the better the similarity.

options(warn = -1)
options(scipen = 10000)
options(repr.plot.width = 20.0, repr.plot.height = 13.3)


library(gt)
library(stringr)
library(knitr)
library(tidyverse)
library(broom)
library(vroom)
library(ggplot2)
library(widyr)
library(igraph)
library(ggraph)
library(tidytext)
library(wordcloud)
library(ggthemes)
rm(list=ls())

fillColor = "#FFA07A"
fillColor2 = "#F1C40F"

# Load the Hidden Gems list and the Meta Kaggle tables
gems = vroom("../input/notebooks-of-the-week-hidden-gems/kaggle_hidden_gems.csv")
kernels = vroom("../input/meta-kaggle/Kernels.csv")
users = vroom("../input/meta-kaggle/Users.csv")
kernel_version_competition = vroom("../input/meta-kaggle/KernelVersionCompetitionSources.csv")
kernel_versions = vroom("../input/meta-kaggle/KernelVersions.csv")
kernel_tags = vroom("../input/meta-kaggle/KernelTags.csv")
tags = vroom("../input/meta-kaggle/Tags.csv")
kernel_votes = vroom("../input/meta-kaggle/KernelVotes.csv")

# Derive the kernel URL slug from the notebook URL so we can join with Meta Kaggle
gems = gems %>% 
  mutate(CurrentUrlSlug = str_remove(notebook, str_c("https://www.kaggle.com/", author_kaggle, "/")))

# Attach author metadata (display name, register date, performance tier)
gems_users = gems %>% 
  left_join(users %>% select(AuthorUserId = Id, 
                             author_kaggle = UserName,
                             DisplayName,
                             RegisterDate,
                             PerformanceTier), by = "author_kaggle")

kernels_gems <- gems_users %>% 
  left_join(kernels ,  by = c("CurrentUrlSlug","AuthorUserId"))

kernels_gems <- kernels_gems %>%
  rename(KernelId = Id)

kvcs <- kernels_gems %>%
  left_join(kernel_version_competition ,  
            by = c("CurrentKernelVersionId" = "KernelVersionId"))

kernels_gems_tags <- inner_join(kernels_gems,kernel_tags)
kernels_gems_tags <- inner_join(kernels_gems_tags,tags,by = c("TagId" = "Id"))


# Tokenise the reviews, drop stop words and count words per author
trainWords <- gems_users %>%
  unnest_tokens(word, review) %>%
  filter(!word %in% stop_words$word) %>%
  count(DisplayName, word, sort = TRUE) %>%
  ungroup()

total_words <- trainWords %>% 
  group_by(DisplayName) %>% 
  summarize(total = sum(n))

trainWords <- left_join(trainWords, total_words, by = "DisplayName")

#Now we are ready to use the bind_tf_idf which computes the tf-idf for each term. 
trainWords <- trainWords %>%
  filter(!is.na(DisplayName)) %>%
  bind_tf_idf(word, DisplayName, n)

   
tfidf_trainWords <- trainWords %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word))))


# Cast the word counts into a document-term matrix (one row per author)
tfidf_trainWords_dtm <- tfidf_trainWords %>%
  cast_dtm(DisplayName, word, n)
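The cosine-similarity step itself is not shown above; a minimal base-R sketch on a small author-by-term matrix (values purely illustrative) looks like this. The same computation applies to the document-term matrix built from the reviews.

```r
# Toy author-by-term TF-IDF matrix; values are illustrative only.
m <- rbind(author_a = c(0.5, 0.1, 0.0),
           author_b = c(0.4, 0.2, 0.1),
           author_c = c(0.0, 0.0, 0.9))

# Cosine similarity: dot products divided by the product of vector norms.
row_norms  <- sqrt(rowSums(m^2))
cosine_sim <- (m %*% t(m)) / outer(row_norms, row_norms)

# cosine_sim[i, j] is the similarity between author i and author j;
# the higher the score, the more similar the review vocabulary.
round(cosine_sim, 2)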