Hidden Gems Book

Chapter 3 Most Popular Hidden Gems Authors

Jonathan Bouchet has the highest number of gems (9).

# Count the gems per author and keep authors with at least three
gems %>%
  group_by(author_name) %>%
  summarise(Count = n()) %>%
  filter(Count >= 3) %>%
  arrange(desc(Count)) %>%
  # Order the bars by number of gems
  mutate(author_name = reorder(author_name, Count)) %>%
  ggplot(aes(x = author_name, y = Count)) +
  geom_col(colour = "white", fill = fillColor2) +
  # Print the count inside each bar, e.g. "(9)"
  geom_text(aes(x = author_name, y = 1, label = paste0("(", Count, ")")),
            hjust = 0, vjust = .5, size = 6, colour = 'black',
            fontface = 'bold') +
  labs(x = 'Author',
       y = 'Count',
       title = 'Number of Hidden Gems per author') +
  coord_flip() +
  theme_fivethirtyeight(base_size = 15)

3.1 Jonathan Bouchet Notebooks - Top Hidden Gem Author

jb_gems = gems %>%
  filter(author_name == "Jonathan Bouchet") %>%
  select(title, review)

jb_gems %>%
  gt() %>%
  tab_header(
    title = "Jonathan Bouchet Notebooks")

Jonathan Bouchet Notebooks

| title | review |
|---|---|
| U.S. Commercial Flights Tracker Map | Stunning maps are accompanied by lots of other fantastic visuals in this outstanding Notebook by one of my favourite Kagglers. Tons of dataviz inspiration in this criminally underrated work. |
| Airlines Route Tracker | Beautiful maps of airline frequencies and routes from one of Kaggle's most prolific Notebook authors. Great attention to detail and exemplary engagement with comments from the community. |
| Cities Transportation system visualization | Another dataviz master class that explores the public transport networks of major cities. Also see the associated interactive [R shiny app](https://jonathanbouchet.shinyapps.io/transport_visualization/). |
| F1 Data analysis | An insightful exploration of Formula 1 history that showcases the author's trademark style of combining creative and thoughtful visuals with detailed interpretation and context. |
| Pokemon Battles | Another fantastic end-to-end analysis by one of Kaggle's most prolific Notebook authors. Note in particular the radar charts, consistent narrative flow, and how every question in the comments is answered in detail. |
| Beware of trolls | An expertly crafted analysis of troll tweets; featuring a range of diverse visuals, a concise setup, and careful notes and interpretations. Note in particular the heatmap and annotated time series. |
| A closer look at the FIFA Ranking | Another classic Notebook by Jonathan that showcases the tailored exploration of a dataset in great detail with detailed explanations, beautiful graphs, and insightful interpretation. |
| 2017 German Elections : some results | A somewhat topical entry, this notebook produces expert visuals to analyse the previous German voting patterns in 2017. Lots of inspiration to study and compare the recent 2021 election. |
| NBA player of the week ... he's on fire !!! | A detailed exploration of a basketball award category and the characteristics of winning players and teams. Check out the strong narration, the clean data wrangling, and the fantastic visuals. |

For each of the authors, we show a word cloud as well as a network graph.

A word cloud is a graphical representation of the most frequently used words in a text. The size of each word indicates how often it occurs in the text.
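As a minimal sketch of the counts behind a word cloud, the base R snippet below tokenises a short string and tabulates the word frequencies. The `review_text` string is made up for illustration and is not taken from the gems data, and the simple regular-expression tokeniser is only a stand-in for `unnest_tokens()`.

```r
# Hypothetical review text, for illustration only
review_text <- "Great visuals and great maps with detailed maps and great notes"

# Lower-case the text and split on anything that is not a letter
words <- strsplit(tolower(review_text), "[^a-z]+")[[1]]
words <- words[words != ""]

# Count how often each word occurs, most frequent first
word_freq <- sort(table(words), decreasing = TRUE)
word_freq
```

A word cloud would draw the words at the top of `word_freq` in the largest type.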

We also want to find the important words in the Hidden Gem reviews, that is, the words that distinguish one author's reviews from the others. For a young child, for example, the most important word is "mom"; for a bartender, the important words relate to drinks. For each of the following Hidden Gem authors, we plot the most important, or most distinguishing, words, ranked by tf-idf.
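The important words in this chapter are read from a precomputed tf-idf table whose construction is not shown here. The base R sketch below illustrates how such a score could be computed, assuming the common definition (term frequency times the natural log of the inverse document frequency) and treating each author as one document. The `word_counts` data frame and its values are hypothetical.

```r
# Hypothetical word counts per author, for illustration only
word_counts <- data.frame(
  author = c("A", "A", "A", "B", "B", "B"),
  word   = c("maps", "visuals", "detailed", "gpu", "xgboost", "detailed"),
  n      = c(3, 2, 1, 4, 1, 1)
)

# Term frequency: share of an author's words taken up by each word
total_words <- ave(word_counts$n, word_counts$author, FUN = sum)
word_counts$tf <- word_counts$n / total_words

# Inverse document frequency: ln(number of authors / authors using the word)
n_authors <- length(unique(word_counts$author))
authors_per_word <- ave(word_counts$n, word_counts$word, FUN = length)
word_counts$idf <- log(n_authors / authors_per_word)

# tf-idf is the product; a word used by every author scores zero
word_counts$tf_idf <- word_counts$tf * word_counts$idf
word_counts[order(-word_counts$tf_idf), ]
```

In this toy data "detailed" appears for both authors, so its tf-idf is zero, while "gpu" and "maps" stand out as distinguishing words.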


3.2 Jonathan Bouchet - Leading Hidden Gem Author reviews

The reviews reveal that these notebooks are detailed and explore their data in depth. They also promise beautiful and insightful visualizations.

# Standard stop words plus terms specific to the Hidden Gems reviews
my_stop_words <- bind_rows(stop_words,
                           tibble(word = c("kaggle", "survey", "https", "2021",
                                           "2020", "www.kaggle.com")))

# Draw a word cloud and a word co-occurrence network for one author's reviews.
# occur: minimum number of notebooks in which a word pair must appear together
drawNetworkGraph <- function(author, occur = 2) {
  # One row per (notebook, word), with stop words removed
  gem_review <- gems %>%
    filter(author_name == author) %>%
    select(notebook, review) %>%
    unnest_tokens(word, review) %>%
    anti_join(my_stop_words, by = "word")

  # Word cloud of the 30 most frequent words
  gem_review %>%
    count(word, sort = TRUE) %>%
    head(30) %>%
    with(wordcloud(word, n, max.words = 30, colors = brewer.pal(8, "Dark2")))

  # Number of notebooks in which each pair of words appears together.
  # "https" and "www.kaggle.com" are already excluded by my_stop_words.
  review_word_pairs <- gem_review %>%
    pairwise_count(word, notebook, sort = TRUE, upper = FALSE)

  # Network graph of the word pairs that co-occur at least `occur` times
  set.seed(1234)
  review_word_pairs %>%
    filter(n >= occur) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "darkred") +
    geom_node_point(size = 5) +
    geom_node_text(aes(label = name), repel = TRUE,
                   point.padding = unit(0.2, "lines")) +
    theme_void(base_size = 15)
}






drawNetworkGraph("Jonathan Bouchet")
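The edges of the network graph come from counting, for every pair of words, the number of notebooks whose reviews contain both words. The base R sketch below shows roughly what that pairwise count amounts to, under the assumption that each word is counted at most once per notebook. The `tokens` data frame is hypothetical, and the ordering of `item1` and `item2` may differ from what `pairwise_count()` returns.

```r
# Hypothetical tokenised reviews: one row per (notebook, word) pair
tokens <- data.frame(
  notebook = c("nb1", "nb1", "nb1", "nb2", "nb2", "nb3", "nb3"),
  word     = c("maps", "visuals", "detailed", "maps", "visuals", "maps", "detailed")
)

# Binary notebook-by-word incidence matrix (1 = word appears in the notebook)
incidence <- 1 * (unclass(table(tokens$notebook, tokens$word)) > 0)

# Word-by-word matrix: entry (i, j) is the number of notebooks sharing both words
co_occurrence <- crossprod(incidence)

# Keep each unordered pair once (upper triangle) as an edge list
idx <- which(upper.tri(co_occurrence) & co_occurrence > 0, arr.ind = TRUE)
pairs <- data.frame(
  item1 = rownames(co_occurrence)[idx[, "row"]],
  item2 = colnames(co_occurrence)[idx[, "col"]],
  n     = co_occurrence[idx]
)
pairs[order(-pairs$n), ]
```

Filtering `pairs` on `n >= occur` then decides which edges are drawn, which is why authors with only three gems are plotted with `occur = 1`.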

# Plot the six words with the highest tf-idf score for one author.
# Relies on tfidf_trainWords (columns DisplayName, word, tf_idf),
# which is not defined in this chapter.
getImportantWords <- function(author) {
  tfidf_trainWords %>%
    filter(DisplayName == author) %>%
    arrange(desc(tf_idf)) %>%
    mutate(word = reorder(word, tf_idf)) %>%
    head() %>%
    ggplot(aes(x = word, y = tf_idf)) +
    geom_col(colour = "white", fill = "orange") +
    labs(x = 'word',
         y = 'tf-idf',
         title = 'Most Important words') +
    coord_flip() +
    theme_fivethirtyeight(base_size = 15) +
    theme(legend.position = "none")
}

getImportantWords("Jonathan Bouchet")

3.3 Ramshankar Yadhunath - Leading Hidden Gem Author reviews

The reviews highlight the visual analysis and visual interpretations in these notebooks.

drawNetworkGraph("Ramshankar Yadhunath")

getImportantWords("Ramshankar Yadhunath")

3.4 Parul Pandey - Leading Hidden Gem Author reviews

The reviews show a strong association with COVID-19 topics. This is expected, since two of the Hidden Gem notebooks study how pollution and the Indian stock index Nifty were affected by COVID-19.

a = gems %>%
  filter(author_name == "Parul Pandey") %>%
  select(title, review)

a %>%
  gt() %>%
  tab_header(
    title = "Parul Pandey Hidden Gems")

Parul Pandey Hidden Gems

| title | review |
|---|---|
| Breathe India: COVID-19 effect on Pollution | A detailed work studying the interaction between the big topics of COVID-19 and air pollution in past and recent data from India. |
| Nifty data EDA | A well-structured exploration of Indian stockmarket data with annotated time series visuals; including the impact of the first Covid-19 lockdowns on the economy. |
| Recreating Gapminder visualisation with Bokeh | A step by step guide on how to reproduce the dataviz techniques pioneered by the late, great Hans Rosling using the Python Bokeh library. Clean code and much attention to detail. |

drawNetworkGraph("Parul Pandey", occur = 1)

getImportantWords("Parul Pandey")

3.5 Laura Fink - Leading Hidden Gem Author reviews

The reviews of Laura Fink's notebooks mention detailed visuals and detailed interpretations.

a = gems %>%
  filter(author_name == "Laura Fink") %>%
  select(title, review)

a %>%
  gt() %>%
  tab_header(
    title = "Laura Fink Hidden Gems")

Laura Fink Hidden Gems

| title | review |
|---|---|
| How good does your chocolate taste? | A flavourful composition of layers of data wrangling and exploratory visuals; sprinkled with inspiration for modelling. |
| Computer vision with seedlings | An exceptionally well narrated Notebook on image classification, which features a custom segmentation strategy for preprocessing as well as detailed interpretations and documentation. |
| Patterns of colorectal cancer - image clustering | A clustering analysis with great visuals and detailed interpretations that explore a medical dataset. Note how the clean narrative structure makes the approach and results accessible. |

drawNetworkGraph("Laura Fink", occur = 1)

getImportantWords("Laura Fink")

3.6 Vopani - Leading Hidden Gem Author reviews

The reviews form three different clusters. The first cluster focuses on read table, cudf, and pandas. The second focuses on Hugging Face and TensorFlow. The third covers the 2019 and 2020 surveys.

a = gems %>%
  filter(author_name == "Vopani") %>%
  select(title, review)

a %>%
  gt() %>%
  tab_header(
    title = "Vopani Hidden Gems")

Vopani Hidden Gems

| title | review |
|---|---|
| A deep learning of Deep Learning | A deep meta-look into the deep learning preferences of our deeply fascinating community. Narrated and illustrated based on data from the [2019 Kaggle Survey](https://www.kaggle.com/c/kaggle-survey-2019). It will be interesting to see how the numbers change in 2020 and beyond. |
| TPU Sherlocked: One-stop for HuggingFace with TF | A clean NLP starter framework for the [Contradiction Competition](https://www.kaggle.com/c/contradictory-my-dear-watson/) with Tensorflow and Huggingface models. Well documented and extendable for fast experimentation (including TPU configuration). |
| Tutorial on reading large datasets | An impressively clean and accessible primer on Python tools to read, and formats to store, large datasets. Brief and to the point; featuring Pandas, Dask, Datable, and Rapids cudf. |

drawNetworkGraph("Vopani", occur = 1)

getImportantWords("Vopani")

3.7 kxx - Leading Hidden Gem Author reviews

Three clusters are visible. The first cluster is about keras and tidymodels. The second is about the sparse tabular data in the Santander competition. The third is about image classification of leaf disease. Note that the first and second clusters are connected through the word "model".

a = gems %>%
  filter(author_name == "kxx") %>%
  select(title, review)

a %>%
  gt() %>%
  tab_header(
    title = "kxx Hidden Gems")

kxx Hidden Gems

| title | review |
|---|---|
| Santander: EDA + features | A great example for an end-to-end framework to model anonymised & sparse tabular data. The next Santander competition is never far away ;-) |
| MOA Recipe | An elegant R tidymodels + Keras approach to building a neural network starter model. Concise and well explained |
| Leaf doctoR: EDA | A well-structured template showcasing the step-by-step usage of the new [torch for R framework](https://torch.mlverse.org/) on the ongoing [Leaf Disease Image classification competition](https://www.kaggle.com/c/cassava-leaf-disease-classification). |

drawNetworkGraph("kxx", occur = 1)

getImportantWords("kxx")

3.8 Bojan Tunguz - Leading Hidden Gem Author reviews

The words gpu, rapids, and xgboost feature prominently in the reviews.

a = gems %>%
  filter(author_name == "Bojan Tunguz") %>%
  select(title, review)

a %>%
  gt() %>%
  tab_header(
    title = "Bojan Tunguz Hidden Gems")

Bojan Tunguz Hidden Gems

| title | review |
|---|---|
| MNIST 2D t-SNE with Rapids | One of the first Notebooks on Kaggle demonstrating the game-changing speed up provided by Nvidia's GPU-magic tools. An exhibit as concise and powerful as the code itself. |
| Adversarial Rainforest | A compact work providing adversarial validation of the rainforest competition data together with interpretable Shapely values via GPU-powered XGBoost in the Rapids framework. |
| TPS 01-21 Feature Importance with XGBoost and SHAP | A notebook showcasing a fast method for computing explanatory SHAP values through GPU-powered Rapids XGBoost. This provides feature importances and thereby model interpretability. |

drawNetworkGraph("Bojan Tunguz", occur = 1)

getImportantWords("Bojan Tunguz")