Using Deep Learning to Classify Cassava Leaf Diseases (Part I)
by Annie Phan, Drew Solomon, and Roma Coffin
Cassava is a key crop for food security across Sub-Saharan Africa. However, viral diseases threaten cassava yields and are costly to detect manually. Image recognition with deep learning may therefore help identify diseased cassava plants efficiently and prevent crop loss.
Our goal is to use deep learning to classify cassava leaf images into five categories, using a dataset of 21,367 labeled images. Four of the categories each correspond to a specific disease, and the fifth represents a healthy plant. By correctly identifying diseased plants, farmers in Uganda may be able to control outbreaks more efficiently before they destroy a substantial food source.
This first blog post covers the data collection process, exploratory data analysis, baseline model methodology, and next steps for this project. Please note that our analysis draws on information from the Kaggle competition site, with our own changes.
Exploratory Data Analysis:
To better understand the dataset, we performed exploratory data analysis (EDA). EDA familiarized us with the dataset by surfacing common patterns and helped us visualize the data.
The sample images below are from the provided dataset. They represent the five categories: the four diseases and healthy leaves.
The Cassava Disease Class Balance bar plot shows the distribution of classes in the dataset. It reveals that the dataset is imbalanced, with Cassava Mosaic Disease (CMD) as the dominant class. This matters because we can later stratify our splits (e.g. using stratified K-fold) to ensure that all classes are adequately represented across training, validation, and testing.
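To make the idea of stratification concrete, here is a minimal pure-Python sketch of stratified K-fold assignment. It is not the code used in this project: the function name `stratified_k_fold`, the placeholder class names other than CMD and Healthy, and the toy label counts are all assumptions for illustration. In practice a library routine such as scikit-learn's `StratifiedKFold` would likely be used instead.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Assign every sample index to one of k folds, one class at a time."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of this class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical toy labels that mimic an imbalanced dataset (not the real counts).
labels = (
    ["CMD"] * 60
    + ["Healthy"] * 15
    + ["Disease A"] * 10
    + ["Disease B"] * 10
    + ["Disease C"] * 5
)
folds = stratified_k_fold(labels, k=5)

# Each fold serves once as the validation set; the other folds form the training set.
for fold_id, val_idx in enumerate(folds):
    print(fold_id, dict(Counter(labels[i] for i in val_idx)))
```

Because indices are dealt out class by class, every fold in this toy example keeps the same class mix as the full label list, which is the property a plain random split does not guarantee for rare classes.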
The Cassava Disease Class Balance Normalized pie chart shows the proportion of images in each category. CMD images make up 61.5% of the dataset, a clear majority, while healthy plant images make up only 12%. A simple majority classifier would therefore achieve 61.5% accuracy by predicting CMD for every image. Like the bar plot above, the pie chart shows that the data is imbalanced, so stratified K-fold is worth exploring as a method of splitting the data.
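The majority-classifier baseline can be computed directly from the label counts. The sketch below uses a hypothetical list of 1,000 labels built to match the proportions quoted in this post (61.5% CMD, 12% healthy); the remaining labels are lumped into a placeholder class, and the helper name `majority_baseline` is our own.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)


# Hypothetical labels: proportions for CMD and Healthy follow the post,
# the rest are grouped under a placeholder name for illustration only.
labels = ["CMD"] * 615 + ["Healthy"] * 120 + ["Other disease"] * 265

majority_class, baseline_accuracy = majority_baseline(labels)
print(majority_class, baseline_accuracy)
```

Any trained model should be judged against this number rather than against zero, since a model that learns nothing but the class prior already reaches it.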
Baseline Model Methodology & Results:
Because the original test set contained only one image, we split the training images to train and validate the model, using 80% of the data for training and 20% for validation. We then used Keras’ ImageDataGenerator to augment the images and feed the data to the model.
To improve the efficiency and accuracy of our baseline model, we used transfer learning, building on VGG16, a neural network pre-trained on ImageNet data. For this model, all weights remained trainable, for a total of 14,717,253 parameters. This resulted in longer training times but greater flexibility relative to freezing the base layers.
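As a sanity check on the 14,717,253 figure, the parameter count can be reproduced by hand. The sketch below assumes the standard VGG16 convolutional configuration from Simonyan and Zisserman (thirteen 3x3 convolutions, without the original fully connected layers) and a classification head consisting of a single 5-unit dense layer fed by the 512 pooled channels; the constant and helper names are our own.

```python
# (input_channels, output_channels) for the 13 convolutional layers of VGG16.
VGG16_CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(c_in, c_out, kernel=3):
    """Weights (kernel x kernel x c_in x c_out) plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights (n_in x n_out) plus one bias per output unit."""
    return n_in * n_out + n_out


base_params = sum(conv_params(c_in, c_out) for c_in, c_out in VGG16_CONV_CHANNELS)
head_params = dense_params(512, 5)  # pooling layers add no parameters
total_params = base_params + head_params

print(base_params, head_params, total_params)  # 14714688 2565 14717253
```

Nearly all of the parameters sit in the pre-trained base, which is why leaving it trainable lengthens training so noticeably compared with training the small head alone.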
On top of the VGG16 base, our model has a global average pooling layer and a dense layer with 5 neurons (one for each class), with a softmax activation to convert the outputs into class probabilities.
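To illustrate what this head computes, here is a small pure-Python sketch of global average pooling, a dense layer, and softmax. It is not the Keras code used for the model: the tiny 2x2 feature map with 3 channels and the weights are hypothetical toy values (the real base would output 512 channels), chosen only so the numbers are easy to follow.

```python
import math


def global_average_pool(feature_map):
    """Average an H x W x C feature map over its spatial positions, giving C values."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    return [
        sum(feature_map[h][w][c] for h in range(height) for w in range(width))
        / (height * width)
        for c in range(channels)
    ]


def dense(inputs, weights, biases):
    """Compute one raw score per class: a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    largest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 2 x 2 feature map with 3 channels.
feature_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 0.0, 2.0]],
]
# Hypothetical weights: one row per class (5 classes), one column per channel.
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0],
]
biases = [0.0] * 5

pooled = global_average_pool(feature_map)
probabilities = softmax(dense(pooled, weights, biases))
print(pooled)
print([round(p, 3) for p in probabilities])
```

The predicted class is simply the index of the largest probability, and because softmax preserves the ordering of the raw scores, it is also the index of the largest score.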
After 16 epochs, the validation accuracy was 0.663 (66.3%) and the training accuracy was 0.654 (65.4%). The validation accuracy was only slightly above the majority classifier baseline of 0.615 (i.e. the proportion of the most populous class, Cassava Mosaic Disease). Moreover, we did not observe a large gap between training and validation accuracy, which suggests that this model is not overfitting. Since performance remains relatively low on both the training and validation data, the model may be underfitting or may require more training epochs.
Our next step is to improve on the results from the baseline model. We will develop more sophisticated deep learning models to classify cassava leaves more accurately, refining the model structure with the aim of achieving a higher accuracy score. We will also explore different base models for transfer learning and tune hyperparameters via cross-validation.
Team Name: RAD
Baseline Kaggle Submission:
Cassava Leaf Disease: Keras CNN baseline
Makerere University AI Lab. (2020). Cassava Leaf Disease Classification. Retrieved from https://www.kaggle.com/c/cassava-leaf-disease-classification
Aleksandradeis. (2020, November 20). Cassava Leaf Disease Classification EDA. Retrieved from https://www.kaggle.com/aleksandradeis/cassava-leaf-disease-classification-eda
Ramjib. (2020, December 23). Cassava Leaf Disease: EDA and Outliers. Retrieved from https://www.kaggle.com/ramjib/cassava-leaf-disease-eda-and-outliers
Maksymshkliarevskyi. (2020, December 20). Cassava Leaf Disease: Keras CNN baseline. Retrieved from https://www.kaggle.com/maksymshkliarevskyi/cassava-leaf-disease-keras-cnn-baseline/notebook
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Retrieved from https://arxiv.org/abs/1409.1556