It is said that in order for a machine learning model to be successful, you need to have good data. While this is true (and pretty much obvious), it is extremely difficult to define, build, and sustain good data. Let me share with you the unique processes that I have learned over several years building an ever-growing image classification system and how you can apply these techniques to your own application.
With persistence and diligence, you can avoid the classic “garbage in, garbage out”, maximize your model accuracy, and demonstrate real business value.
In this series of articles, I will dive into the care and feeding of a multi-class, single-label image classification app and what it takes to reach the highest level of performance. I won’t get into any coding or specific user interfaces, just the main concepts that you can incorporate to suit your needs with the tools at your disposal.
Throughout this series, you will notice that the model itself comes last in the discussion, since we need to focus on curating the data first and foremost.
Background
Over the past six years, I have been primarily focused on building and maintaining an image classification application for a manufacturing company. Back when I started, most of the software tools did not exist or were too expensive, so I created my own from scratch. In that time, I have deployed two identifier applications; the largest handles 1,500 classes and achieves 97–98% accuracy.
It was about eight years ago that I started online studies for Data Science and machine learning. So, when the exciting opportunity to create an AI application presented itself, I was prepared to build the tools I needed to leverage the latest advancements. I jumped in with both feet!
I quickly found that building and deploying a model is probably the easiest part of the job. Feeding high quality data into the model is the best way to improve performance, and that requires focus and patience. Attention to detail is what I do best, so this was a perfect fit.
It all starts with the data
I feel that so much attention is given to the model selection (deciding which neural network is best) and that the data is just an afterthought. I have found the hard way that even one or two pieces of bad data can significantly impact model performance, so that is where we need to focus.
For example, let’s say you train the classic cat-versus-dog image classifier. You have 50 pictures of cats and 50 pictures of dogs; however, one of the “cats” is clearly (objectively) a picture of a dog. The computer doesn’t have the luxury of ignoring the mislabelled image, and instead adjusts the model weights to make it fit. Square peg meets round hole.
Another example would be a picture of a cat that climbed up into a tree. But when you take a holistic view of it, you would describe it as a picture of a tree (first) with a cat (second). Again, the computer doesn’t know to ignore the big tree and focus on the cat; it will start to identify trees as cats, even when the tree has a dog in it. You can think of these pictures as outliers that should be removed.
It doesn’t matter if you have the best neural network in the world, you can count on the model making poor predictions when it is trained on “bad” data. I’ve learned that any time I see the model make mistakes, it’s time to review the data.
Example Application — Zoo animals
For the rest of this write-up, I will use an example of identifying zoo animals. Let’s assume your goal is to create a mobile app where guests at the zoo can take pictures of the animals they see and have the app identify them. Specifically, this is a multi-class, single-label application.
Here is your challenge:
- Variety — There are a lot of different animals at the zoo and many of them look very similar.
- Quality — Guests using the app don’t always take good pictures (zoomed out, blurry, too dark), so we don’t want to provide an answer if the image is poor.
- Growth — The zoo keeps expanding and adding new species all the time.
- Out-of-scope — Occasionally you might find that people take pictures of the sparrows near the food court grabbing some dropped popcorn.
- Pranksters — Just for fun, guests may take a picture of the bag of popcorn just to see what it comes back with.
These are all real challenges — being able to tell the subtle differences between animals, handling out-of-scope cases, and just plain poor images.
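We will dig into these challenges throughout the series, but as a quick taste of out-of-scope handling, here is a minimal sketch of the abstention pattern: only return an answer when the model is confident. The threshold value here is an illustrative assumption, not a tuned number from my application.

```python
# Minimal abstention sketch: refuse to answer when confidence is low.
# The 0.80 threshold is illustrative; tune it on your own validation data.
CONFIDENCE_THRESHOLD = 0.80

def answer_or_abstain(probs: list[float], class_names: list[str]) -> str | None:
    """Return the best class name, or None if the model is not confident."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < CONFIDENCE_THRESHOLD:
        return None  # e.g. show "Sorry, I could not identify this animal"
    return class_names[best]
```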
Before we get there, let’s start from the beginning.
Collecting and Labelling
There are a lot of tools these days to help you with this part of the process, but the challenge remains the same — collecting, labelling, and curating the data.
Having data to collect is challenge #1. Without images, you have nothing to train. You may need to get creative on sourcing the data, or even creating synthetic data. More on that later.
A quick note about image pre-processing. I convert all my images to the input size of my neural network and save them as PNG. Inside this square PNG, I preserve the aspect ratio of the original picture and fill the background black. I don’t stretch the image nor crop any features out. This also helps center the subject.
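If it helps to see this in code, here is a minimal sketch of that letterboxing step using Pillow. The 224-pixel size is an assumption for illustration; use whatever input size your network expects.

```python
from PIL import Image

def letterbox(path: str, size: int = 224) -> Image.Image:
    """Fit an image onto a square black canvas, preserving aspect ratio."""
    img = Image.open(path).convert("RGB")
    scale = size / max(img.width, img.height)
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    img = img.resize((new_w, new_h), Image.LANCZOS)

    canvas = Image.new("RGB", (size, size), (0, 0, 0))  # black background
    # Paste centered so the subject stays in the middle of the square frame
    canvas.paste(img, ((size - new_w) // 2, (size - new_h) // 2))
    return canvas

# letterbox("zebra.jpg").save("zebra.png")
```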
Challenge #2 is to establish standards for data quality…and ensure that these standards are followed! These standards will guide you toward that “good” data. And this assumes, of course, correct labels. Having both is much easier said than done!
I hope to show how “good” and “correct” actually go hand-in-hand, and how important it is to apply these standards to every image.
Good Data
First, I want to point out that the image data discussed here is for the training set. What qualifies as a good image for training is a bit different than what qualifies as a good image for evaluation. More on that in Part 3.
So, what is “good” data when talking about images? “A picture is worth a thousand words”, and if the first words you would use to describe the picture do not include the subject you are trying to label, then it is not a good image, and you need to remove it from your training set.
For example, let’s say you are shown a picture of a zebra and (removing bias toward your application) you describe it as an “open field with a zebra in the distance”. In other words, if “open field” is the first thing you notice, then you likely do not want to use that image. The opposite is also true — if the picture is way too close, you would describe it as “zebra pattern”.
What you want is a description like, “a zebra, front and center”. This would have your subject taking up about 80–90% of the total frame. Sometimes I will take the time to crop the original image so the subject is framed properly.
Keep in mind how image augmentation will be used at training time. Having that buffer around the edges will allow “zoom in” augmentation, while “zoom out” augmentation will simulate smaller subjects. Just don’t start with your subject at less than 50% of the total frame, since you lose detail.
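As one example of what that zoom augmentation might look like, here is a torchvision sketch. The specific transforms and ranges are assumptions chosen to illustrate the idea, not my production settings.

```python
import torchvision.transforms as T

# Illustrative zoom augmentation. RandomResizedCrop acts as a mild "zoom in"
# (crop a region, rescale to full size), while RandomAffine with scale < 1.0
# shrinks the subject and fills the border with black, simulating a more
# distant shot. The ranges below are not tuned values.
train_transforms = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),           # "zoom in"
    T.RandomAffine(degrees=0, scale=(0.7, 1.0), fill=0),  # "zoom out"
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```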
Another aspect of a “good” image relates to the label. If you can only see the back side of your zoo animal, can you really tell, for example, that it is a cheetah versus a leopard? The key identifying features need to be visible. If a human struggles to identify it, you can’t expect the computer to learn anything.
What does a “bad” image look like? Here is what I frequently watch out for:
- Wide angle lens stretching
- Back-lit or silhouette
- High contrast or dark shadows
- Blurry or hazy
- Obscured features
- Multiple subjects
- “Doctored” images with drawn lines and arrows
- “Unusual” angles or situations
- A picture of a mobile device displaying a picture of your subject
Correct Labels
If you have a team of subject matter experts (SMEs) on hand to label the images, you are in a good starting position. Animal trainers at the zoo know the various species, and can spot the differences between, for example, a chimpanzee and a bonobo.
As a machine learning engineer, it is easy to assume all labels from your SMEs are correct and move right on to training the model. However, even experts make mistakes, so if you can get a second opinion on the labels, your error rate should go down.
In reality, it can be prohibitively expensive to get one, let alone two, subject matter experts to review image labels. The SME usually has years of experience that make them more valuable to the business in other areas of work. My experience is that the machine learning engineer (that’s you and me) becomes the second opinion, and often the first opinion as well.
Over time, you can become pretty adept at labelling, but certainly not an SME. If you do have the luxury of access to an expert, explain to them the labelling standards and how these are required for the application to be successful. Emphasize “quality over quantity”.
It goes without saying that correct labels are important. However, all it takes is one or two mislabelled images to degrade performance, and these can easily slip into your data set through careless or hasty labelling. So, take the time to get it right.
Ultimately, we as ML engineers are responsible for model performance. If we take the approach of only working on model training and deployment, we will find ourselves wondering why performance is falling short.
Unknown Labels
A lot of times, you will come across a really good picture of a very interesting subject, but have no idea what it is! It would be a shame to simply dispose of it. Instead, assign it a generic label, like “Unknown Bird” or “Random Plant”, that is not included in your training set. In Part 4, you’ll see how to come back to these images once you have a better idea of what they are, and you’ll be glad you saved them.
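The bookkeeping for this is simple. Here is a tiny sketch, assuming a naming convention where generic labels start with a known prefix (the convention itself is just an illustration):

```python
# Keep "Unknown ..." images on disk, but exclude them from the training set.
# The prefix convention is illustrative; use whatever generic labels you like.
GENERIC_PREFIXES = ("Unknown", "Random")

def is_trainable(label: str) -> bool:
    """True if the label is a real class, not a generic placeholder."""
    return not label.startswith(GENERIC_PREFIXES)

# all_items: (image_path, label) pairs from your labelling tool
# train_items = [(p, lbl) for p, lbl in all_items if is_trainable(lbl)]
```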
Model Assistance
If you have done any image labelling, then you know how time consuming and difficult it can be. But this is where having a model, even a less-than-perfect model, can help you.
Typically, you have a large collection of unlabelled images and you need to go through them one at a time to assign labels. Simply having the model offer a best guess and display the top 3 results lets you step through each image in a matter of seconds!
Even if the top 3 results are wrong, this can help you narrow down your search. Over time, newer models will get better, and the labelling process can even be somewhat fun!
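To make this concrete, here is a rough PyTorch sketch of that “best guess” helper. The model, class_names, and load_tensor pieces are placeholders standing in for your own trained classifier, label list, and preprocessing.

```python
import torch
import torch.nn.functional as F

def suggest_labels(image_path, model, class_names, load_tensor, k=3):
    """Return the model's top-k (label, probability) guesses for one image.

    `model`, `class_names`, and `load_tensor` stand in for your own trained
    classifier, label list, and preprocessing (e.g. the letterboxing shown
    earlier); they are assumptions for this sketch.
    """
    model.eval()
    with torch.no_grad():
        logits = model(load_tensor(image_path).unsqueeze(0))
        probs = F.softmax(logits, dim=1).squeeze(0)
    top = torch.topk(probs, k)
    return [(class_names[int(i)], float(p)) for p, i in zip(top.values, top.indices)]
```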
In Part 4, I will show how you can bulk identify images and take this to the next level for faster labelling.
Classes and Sub-Classes
I mentioned the example above of two species that look very similar, the chimpanzee and the bonobo. When you start out building your data set, you may have very sparse coverage of one or both of these species. In machine learning terms, we call these “classes”. One option is to roll with what you have and hope that the model picks up on the differences with only a handful of example images.
The option that I have used is to merge two or more classes into one, at least temporarily. So, in this case I would create a class called “chimp-bonobo”, composed of the limited example pictures from both the chimpanzee and bonobo classes. Combined, these may give me enough to train the model on “chimp-bonobo”, with the trade-off that it’s a more generic identification.
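One lightweight way to implement the merge (a sketch; the mapping is hypothetical) is to keep the original fine-grained labels on file and collapse them only when building the training set, which makes the later split painless:

```python
# Keep the original labels on disk; collapse sparse classes only at training
# time, so you can split them back apart once you have enough images.
MERGE_MAP = {
    "chimpanzee": "chimp-bonobo",
    "bonobo": "chimp-bonobo",
}

def training_label(raw_label: str) -> str:
    """Return the (possibly merged) label used for training."""
    return MERGE_MAP.get(raw_label, raw_label)
```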
Sub-classes can even be normal variations. For example, juvenile pink flamingos are grey instead of pink, and male and female orangutans have distinct facial features. You want to have a fairly balanced number of images across these normal variations, and keeping sub-classes will allow you to accomplish this.
Don’t be concerned that you are merging completely different-looking classes — the neural network does a nice job of applying the “OR” operator. This works both ways — it can help you identify male or female variations as one species, but it can hurt you when “bad” outlier images sneak in, like the earlier example of an “open field with a zebra in the distance.”
Over time, you will (hopefully) be able to collect more images of the sub-classes and then be able to successfully split them apart (if necessary) and train the model to identify them separately. This process has worked very well for me. Just be sure to double-check all the images when you split them to ensure the labels didn’t get accidentally mixed up — it will be time well spent.
All of this certainly depends on your user requirements. You can handle it in different ways: either create a unique class label like “chimp-bonobo”, or handle it at the front-end presentation layer, where you notify the user that you have intentionally merged these classes and provide guidance on further refining the results. Even after you decide to split the two classes, you may want to caution the user that the model could be wrong, since the two classes are so similar.
Up next…
I realize this was a long write-up for something that on the surface seems intuitive, but these are all areas that have tripped me up in the past because I didn’t give them enough attention. Once you have a solid understanding of these principles, you can go on to build a successful application.
In Part 2, we will take the curated data we collected here to create the classic data sets, with a custom benchmark set that will further enhance your data. Then we will see how best to evaluate our trained model using a specific “training mindset”, and switch to a “production mindset” when evaluating a deployed model.