
Impact of Sample Size on Transfer Learning

By Roberto Reif • July 05, 2018

This post was written by Metis Sr. Data Scientist Roberto Reif and it was originally published on his blog here.

Deep Learning (DL) models have had great success in recent years, especially in the field of image classification.  But one of the challenges of working with these models is that they require large amounts of data to train.  Many problems, such as those involving medical images, offer only small amounts of data, making the use of DL models challenging.  Transfer learning is a method of taking a deep learning model that has already been trained to solve one problem with large amounts of data, and applying it (with some minor modifications) to solve a different problem that has only small amounts of data.  In this post, I analyze how small a dataset can be while still successfully applying this technique.


Optical Coherence Tomography (OCT) is a non-invasive imaging technique that uses light waves to obtain cross-sectional images of biological tissues with micrometer resolution.  OCT is commonly used to image the retina, and allows ophthalmologists to diagnose several diseases such as glaucoma, age-related macular degeneration, and diabetic retinopathy.  In this post I classify OCT images into four categories: choroidal neovascularization, diabetic macular edema, drusen, and normal, with the help of a Deep Learning architecture.  Given that my sample size is too small to train a whole Deep Learning architecture from scratch, I applied a transfer learning technique and explored the limits of the sample size needed to obtain classification results with high accuracy.  Specifically, a VGG16 architecture pre-trained on the ImageNet dataset is used to extract features from the OCT images, and the last layer is replaced with a new Softmax layer with four outputs.  I tested different amounts of training data and determined that fairly small datasets (400 images, 100 per category) produce accuracies of over 85%.


Optical Coherence Tomography (OCT) is a non-invasive and non-contact imaging technique.  OCT detects the interference formed by the signal from a broadband laser beam reflected from a reference mirror and a biological sample.  OCT is capable of generating in vivo cross-sectional volumetric images of the anatomical structures of biological tissues with microscopic resolution (1-10μm) in real-time.  OCT has been used to understand different disease pathogenesis and is commonly used in the field of ophthalmology.  

The Convolutional Neural Network (CNN) is a Deep Learning technique that has gained popularity in the last few years and has been used successfully in image classification tasks.  Several architectures have been popularized, and one of the simpler ones is the VGG16 model.  Like most CNNs, it requires large amounts of data to train from scratch.

Transfer learning is a method that consists of taking a Deep Learning model that was originally trained with large amounts of data to solve a specific problem, and applying it to a different problem whose dataset contains only small amounts of data.

In this study, I use the VGG16 Convolutional Neural Network architecture, originally trained on the ImageNet dataset, and apply transfer learning to classify OCT images of the retina into four groups.  The purpose of the study is to determine the minimum number of images required to obtain high accuracy.


For this project, I decided to use OCT images obtained from the retina of human subjects.  The data can be found on Kaggle and was originally used for the following publication.  The dataset contains images from four types of patients: normal, diabetic macular edema (DME), choroidal neovascularization (CNV), and drusen.  An example of each type of OCT image can be observed in Figure 1.

Fig. 1: From left to right: Choroidal Neovascularization (CNV) with neovascular membrane (white arrowheads) and associated subretinal fluid (arrows). Diabetic Macular Edema (DME) with retinal-thickening-associated intraretinal fluid (arrows). Multiple drusen (arrowheads) present in early AMD.  Normal retina with preserved foveal contour and absence of any retinal fluid/edema. Image obtained from the following publication.

To train the model I used a maximum of 20,000 images (5,000 for each class) so that the data would be balanced across all classes.  Additionally, I had 1,000 images (250 for each class) that were separated and used as a testing set to determine the accuracy of the model. 


For this project, I used a VGG16 architecture, as shown below in Figure 2.  This architecture consists of several blocks of convolutional layers, whose spatial dimensions are reduced by max pooling.  After the convolutional layers, two fully connected layers are applied, terminating in a Softmax layer that classifies the images into one of 1,000 categories.  In this project, I use the weights in the architecture that were pre-trained on the ImageNet dataset.  The model was built in Keras with a TensorFlow backend in Python. 
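As a minimal sketch of how such a model can be instantiated in Keras (the `weights="imagenet"` option, used in this post, downloads the pre-trained parameters; `weights=None` is shown here only to keep the sketch offline-friendly):

```python
# Sketch: instantiating the VGG16 architecture in Keras.
# In practice, weights="imagenet" loads the ImageNet pre-trained
# parameters; weights=None builds the same architecture with
# random initialization (no download needed).
from tensorflow.keras.applications import VGG16

model = VGG16(weights=None, include_top=True, input_shape=(224, 224, 3))

# The full model ends in a 1,000-way Softmax over the ImageNet classes.
print(model.output_shape)  # (None, 1000)
```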

Fig. 2: VGG16 Convolutional Neural Network architecture displaying the convolutional, fully connected and softmax layers.  After each convolutional block there was a max pooling layer.  

Given that the objective is to classify the images into 4 groups instead of 1,000, the top layers of the architecture were removed and replaced with a Softmax layer with 4 classes, using a categorical crossentropy loss function, an Adam optimizer, and a dropout of 0.5 to avoid overfitting.  The models were trained for 20 epochs. 
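A hedged sketch of this modification in Keras: the convolutional base is frozen and a new 4-class Softmax head with dropout is compiled with the Adam optimizer and a categorical cross-entropy loss (again, `weights=None` keeps the sketch offline; the post uses the ImageNet weights):

```python
# Sketch: transfer-learning head on a frozen VGG16 base.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dropout, Dense

base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze all pre-trained convolutional blocks

model = Sequential([
    base,
    Flatten(),             # Block 5 output is 7 x 7 x 512 = 25,088 features
    Dropout(0.5),          # regularization against overfitting
    Dense(4, activation="softmax"),  # new 4-class Softmax output layer
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Only the new Dense layer is trainable:
# a (25088, 4) kernel plus a (4,) bias.
```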

Each image was grayscale, so the values for the Red, Green, and Blue channels are identical.  Images were resized to 224 x 224 x 3 pixels to fit the VGG16 input.  
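The channel replication can be sketched with numpy (the array here is a random stand-in for a resized OCT scan; the 224 x 224 resizing itself is assumed to happen separately, e.g. with PIL's `Image.resize`):

```python
import numpy as np

# Sketch: replicate a single grayscale channel into the three
# RGB channels that the VGG16 input expects.
gray = np.random.rand(224, 224)                  # stand-in for a resized scan
rgb = np.repeat(gray[..., np.newaxis], 3, axis=-1)

assert rgb.shape == (224, 224, 3)
# All three channels are identical copies of the grayscale values.
assert np.array_equal(rgb[..., 0], rgb[..., 2])
```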

A) Determining the Optimal Feature Layer

The first part of the study consisted of determining the layer within the architecture that produced the best features for the classification problem.  Seven locations were tested, indicated in Figure 2 as Block 1, Block 2, Block 3, Block 4, Block 5, FC1, and FC2.  I tested the algorithm at each location by modifying the architecture at that point.  All the parameters in the layers before the tested location were frozen (keeping the parameters originally trained on the ImageNet dataset).  Then I added a Softmax layer with 4 classes and trained only the parameters of that last layer.  An example of the modified architecture at the Block 5 location is presented in Figure 3.  This location has 100,356 trainable parameters.  Similar architecture modifications were created for the other 6 locations (images not shown).

Fig. 3: VGG16 Convolutional Neural Network architecture displaying a replacement of the top layer at the location of Block 5, where a Softmax layer with 4 classes was added, and the 100,356 parameters were trained.  
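The 100,356 figure can be checked by hand: Block 5's max-pooled output is 7 x 7 x 512 = 25,088 values, and a 4-class Softmax layer on top of it needs one weight per input per class plus one bias per class:

```python
# Checking the trainable-parameter count of the new Softmax layer
# at the Block 5 location.
features = 7 * 7 * 512                    # flattened Block 5 output: 25,088
classes = 4
trainable = features * classes + classes  # weights + biases

print(trainable)  # 100356
```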

For each of the seven modified architectures, I trained the parameters of the Softmax layer using all 20,000 training samples.  Then I tested the model on the 1,000 testing samples that the model had not seen before.  The accuracy on the test data at each location is presented in Figure 4.  The best result was obtained at the Block 5 location, with an accuracy of 94.21%.  

Fig. 4: Accuracy of the model as a function of the different layers where the Softmax layer was placed within the VGG16 architecture.

Table 1 presents the percentages obtained from the confusion matrix.  Ideally, we would obtain 25% in all four values of the main diagonal.

Table 1: Confusion matrix indicating the percentage of samples
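As a hedged sketch (with made-up labels, not the actual study data), a confusion matrix expressed as percentages of all test samples can be computed as follows; since the test set is balanced at 250 images per class, a perfect classifier would place 25% in each diagonal cell:

```python
import numpy as np

# Sketch: confusion matrix as a percentage of all test samples.
# y_true / y_pred are illustrative stand-ins, not the study's data.
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 0, 1, 2, 2, 2, 3, 3])

n_classes = 4
counts = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    counts[t, p] += 1            # row = true class, column = predicted class
percent = 100.0 * counts / len(y_true)

# With balanced classes, a perfect classifier puts 25% in each
# diagonal cell; here class 1 loses half its samples to class 2.
print(percent.diagonal())  # [25.  12.5 25.  25. ]
```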

B) Determining the Minimum Number of Samples

Using the modified architecture at the Block 5 location, which had previously provided the best results with the full dataset of 20,000 images, I trained the model with different sample sizes from 4 to 20,000 (with an equal number of samples per class).  The results are shown in Figure 5.  If the model were randomly guessing, it would have an accuracy of 25%.  However, with as few as 40 training samples the accuracy was above 50%, and by 400 samples it had reached more than 85%.
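Drawing these class-balanced subsets can be sketched with a small helper like the following (the function name and the toy labels are illustrative, not from the original code): for a target size n, it takes the first n/4 available samples of each class.

```python
# Sketch: draw a class-balanced training subset of a given size.
# `labels` maps sample index -> class id; names are illustrative.
def balanced_subset(labels, n_total, n_classes=4):
    per_class = n_total // n_classes
    chosen = []
    taken = {c: 0 for c in range(n_classes)}
    for idx, c in enumerate(labels):
        if taken[c] < per_class:
            chosen.append(idx)
            taken[c] += 1
    return chosen

labels = [0, 1, 2, 3] * 10            # 40 toy samples, 10 per class
subset = balanced_subset(labels, n_total=8)
# 8 samples requested -> 2 per class
print(len(subset))  # 8
```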


In this study, I explored the use of transfer learning for a classification problem using medical images of the retina obtained with OCT.  I determined that extracting features at the Block 5 location of a VGG16 architecture pre-trained on the ImageNet dataset produced the highest accuracy.  Finally, I demonstrated that with a small sample size (400 images) I was able to obtain an accuracy higher than 85%.  This approach is a viable method for classifying images where the sample size is small, such as in medical applications.
