
DrivenData Contest: Building the Best Naive Bees Classifier

By Emily Wilson • April 21, 2016

This piece was written and originally published by DrivenData. We sponsored and hosted its recent Naive Bees Classifier contest, and these are the exciting results.


Wild bees are important pollinators, and the spread of colony collapse disorder has only made their role more critical. Right now it takes a lot of time and effort for researchers to gather data on wild bees. Using data submitted by citizen scientists, BeeSpotter is making this process easier. However, it still requires that experts examine and identify the bee in each image. When we challenged our community to build an algorithm that identifies the genus of a bee from its image, we were shocked by the results: the winners achieved a 0.99 AUC (out of 1.00) on the held-out data!

We caught up with the top three finishers to learn about their backgrounds and how they tackled this problem. In true open data fashion, all three stood on the shoulders of giants by leveraging the pre-trained GoogLeNet model, which won the classification task of the 2014 ImageNet competition, and tuning it to this task. Here's a little bit about the winners and their unique approaches.


___________

Meet the winners!

1st Place - E.A.

Name: Eben Olson and Abhishek Thakur

Home base: New Haven, CT and Berlin, Germany

Eben's Background: I work as a research scientist at Yale University School of Medicine. My research involves building hardware and software for volumetric multiphoton microscopy. I also develop image analysis/machine learning approaches for segmentation of tissue images.

Abhishek's Background: I am a Senior Data Scientist at Searchmetrics. My interests lie in machine learning, data mining, computer vision, image analysis and retrieval and pattern recognition.

Method overview: We applied the standard technique of fine-tuning a convolutional neural network pretrained on the ImageNet dataset. This is often effective in situations like this one, where the dataset is a small collection of natural images, because the ImageNet networks have already learned general features that transfer to the new data. The pretraining regularizes the network, which has a large capacity and would quickly overfit without learning useful features if trained directly on the small number of images available. This allows a much larger (more powerful) network to be used than would otherwise be possible.
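The winning pipeline used Caffe-era tooling; purely as an illustration of the general technique (not the winners' actual code), here is a minimal fine-tuning sketch in PyTorch. The data path, batch size, learning rate, and loop structure are all illustrative assumptions:

```python
# Minimal fine-tuning sketch (illustrative, not the winners' actual code).
# Assumes an ImageFolder layout with one subfolder per genus: data/train/<genus>/*.jpg
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Start from an ImageNet-pretrained GoogLeNet and swap the final
# classifier layer for the two bee genera in this contest.
model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

# Standard ImageNet preprocessing so inputs match the pretrained features.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.ImageFolder("data/train", transform=preprocess)  # hypothetical path
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# A small learning rate nudges the pretrained features toward the new task
# instead of overwriting them (the regularization effect described above).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```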

For more details, make sure to check out Abhishek's fantastic write-up of the competition, which includes some truly terrifying deepdream images of bees!

2nd Place - L.V.S.

Name: Vitaly Lavrukhin

Home base: Moscow, Russia

Background: I am a researcher with 9 years of experience in both industry and academia. Currently, I am working at Samsung, developing intelligent data-processing algorithms using machine learning. My previous experience was in the field of digital signal processing and fuzzy logic systems.

Method overview: I employed convolutional neural networks, since nowadays they are the best tool for computer vision tasks [1]. The provided dataset contains only two classes and it is relatively small. So to get higher accuracy, I decided to fine-tune a model pre-trained on ImageNet data. Fine-tuning almost always produces better results [2].

There are many publicly available pre-trained models, but some of them have licenses restricted to non-commercial academic research only (e.g., the models from the Oxford VGG group), which is incompatible with the challenge rules. That is why I decided to use the open GoogLeNet model pre-trained by Sergio Guadarrama from BVLC [3].

One can fine-tune the whole model as-is, but I tried to modify the pre-trained model in a way that could improve its performance. Specifically, I considered the parametric rectified linear units (PReLUs) proposed by Kaiming He et al. [4]: I replaced all regular ReLUs in the pre-trained model with PReLUs. After fine-tuning, the model showed higher accuracy and AUC than the original ReLU-based model.
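Lavrukhin made this change within the Caffe model definition; as a hedged illustration of the same substitution, here is a small PyTorch sketch that recursively swaps every ReLU module for a PReLU (the toy network exists only for demonstration):

```python
import torch.nn as nn

def relu_to_prelu(module: nn.Module) -> None:
    """Recursively replace every nn.ReLU submodule with a learnable nn.PReLU."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            # PReLU starts with a negative slope of 0.25 and learns it during fine-tuning.
            setattr(module, name, nn.PReLU())
        else:
            relu_to_prelu(child)

# Demo on a toy network; the same call works on any model composed of nn.ReLU modules.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 2, 3), nn.ReLU())
relu_to_prelu(net)
print(net)  # both ReLUs are now PReLU modules with their own learnable parameters
```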

In order to evaluate my solution and tune hyperparameters, I employed 10-fold cross-validation. Then I checked on the leaderboard which model was better: the one trained on the whole training data with hyperparameters taken from the cross-validation models, or the averaged ensemble of the cross-validation models. It turned out the ensemble yields a higher AUC. To improve the solution further, I evaluated different sets of hyperparameters and various pre-processing techniques (including multiple image scales and resizing methods). I ended up with three groups of 10-fold cross-validation models.
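As a rough, framework-agnostic sketch of that ensembling step (not his actual code; `train_model` and `predict_proba` are hypothetical stand-ins for fold training and inference):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_val_ensemble(X, y, X_test, train_model, n_splits=10, seed=0):
    """Train one model per fold, then average their test-set predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    test_preds = []
    for train_idx, val_idx in skf.split(X, y):
        # train_model is a hypothetical helper that fits one network on a fold.
        model = train_model(X[train_idx], y[train_idx], X[val_idx], y[val_idx])
        test_preds.append(model.predict_proba(X_test))
    # Equal-weight average across the fold models.
    return np.mean(test_preds, axis=0)
```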

3rd Place - loweew

Name: Edward W. Lowe

Home base: Boston, MA

Background: As a chemistry graduate student in 2007, I was drawn to GPU computing by the release of CUDA and its utility in popular molecular dynamics packages. After finishing my Ph.D. in 2008, I did a two-year postdoctoral fellowship at Vanderbilt University, where I implemented the first GPU-accelerated machine learning framework specifically optimized for computer-aided drug design (bcl::ChemInfo), which included deep learning. I was awarded an NSF CyberInfrastructure Fellowship for Transformative Computational Science (CI-TraCS) in 2011 and continued at Vanderbilt as a Research Assistant Professor. I left Vanderbilt in 2014 to join FitNow, Inc. in Boston, MA (makers of the Lose It! mobile app), where I direct data science and predictive modeling efforts. Prior to this competition, I had no experience with anything image-related. This was a very fruitful experience for me.

Method overview: Because of the variable positioning of the bees and the quality of the photos, I oversampled the training sets using random perturbations of the images. I used a ~90/10 training/validation split and only oversampled the training sets. The splits were randomly generated. This was performed 16 times (I originally intended to do 20-30 but ran out of time).
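The write-up doesn't list the exact perturbations used, so the sketch below assumes common choices (mirroring, small rotations, random crops); it is illustrative rather than a reconstruction of his pipeline:

```python
import random
from PIL import Image, ImageOps

def perturb(img: Image.Image) -> Image.Image:
    """Apply one random perturbation pass; the specific operations are assumptions."""
    if random.random() < 0.5:
        img = ImageOps.mirror(img)                            # horizontal flip
    img = img.rotate(random.uniform(-15, 15))                 # small random rotation
    w, h = img.size
    dx, dy = random.randint(0, w // 10), random.randint(0, h // 10)
    img = img.crop((dx, dy, w - dx, h - dy)).resize((w, h))   # random crop, resized back
    return img

def oversample(images, copies=4):
    """Yield each training image plus `copies` randomly perturbed versions of it."""
    for img in images:
        yield img
        for _ in range(copies):
            yield perturb(img)
```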

I used the pre-trained GoogLeNet model provided with Caffe as a starting point and fine-tuned it on the data sets. Using the last recorded accuracy for each training run, I took the top 75% of models (12 of 16) by accuracy on the validation set. These models were used to predict on the test set, and the predictions were averaged with equal weighting.
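A short NumPy sketch of that select-and-average step (`val_accs` and `test_probs` are hypothetical arrays holding each run's validation accuracy and test-set predictions):

```python
import numpy as np

def top_k_average(val_accs, test_probs, keep_frac=0.75):
    """Keep the top keep_frac of runs by validation accuracy; average their predictions."""
    k = int(len(val_accs) * keep_frac)                 # top 75% of 16 runs -> 12
    best = np.argsort(val_accs)[-k:]                   # indices of the k best runs
    return np.asarray(test_probs)[best].mean(axis=0)   # equal-weight average
```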

