Big data may be analyzed to uncover hidden patterns using a machine learning technique called deep learning. It is based on the artificial neural network (ANN) algorithm: a simple ANN has a single hidden layer, whereas deep neural networks have multiple hidden layers, and convolutional neural networks (CNNs) are one such deep architecture.
In this article, I have chosen to apply deep learning techniques to a “shoe vs. sandal vs. boot” image dataset. The unstructured image dataset was put together by Hasib Al Muzdad and is readily available on Kaggle. The dataset has three classes, and the data comes unstructured, so structuring it and splitting each class into train, test, and validation sets is done manually. Restructuring the data simply means rearranging the images of each class in a 60/20/20 ratio and creating a folder for every class inside local training, validation, and testing directories. Finally, I pushed the whole dataset to my GitHub repository and cloned it on the cloud.
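Here is a minimal sketch of that restructuring step, assuming the raw Kaggle images sit in a data/raw/<class> folder; the paths and exact class folder names are illustrative, not the ones from my repository:

```python
import os
import random
import shutil

# Split each class folder into train/validation/test at a 60/20/20 ratio.
random.seed(42)
for cls in ["shoe", "sandal", "boot"]:
    images = os.listdir(f"data/raw/{cls}")
    random.shuffle(images)
    n = len(images)
    parts = {
        "train": images[: int(0.6 * n)],
        "validation": images[int(0.6 * n): int(0.8 * n)],
        "test": images[int(0.8 * n):],
    }
    for split, subset in parts.items():
        os.makedirs(f"data/{split}/{cls}", exist_ok=True)
        for name in subset:
            shutil.copy(f"data/raw/{cls}/{name}", f"data/{split}/{cls}/{name}")
```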
For this project, I used Saturn Cloud to train the deep learning model. Why the cloud? Deep learning is generally used for unstructured data such as images, videos, and NLP text, and training can take days or weeks to complete. It is also used to analyze large amounts of data, whereas classical machine learning typically handles small and medium-sized datasets, so a project like this calls for powerful and efficient computers.
Deep learning analysis necessitates complex mathematical operations, and carrying them out manually is exceedingly challenging. Thankfully, excellent libraries have been created in recent years to support deep learning projects; with these frameworks, you can train powerful deep learning models and track their performance.
In this project, I used two frameworks, TensorFlow and Keras, because they work hand in hand. TensorFlow, developed by Google, is the most widely used deep learning library. With it, you can create end-to-end deep learning applications, which means you can use the framework for data preparation, modeling, and deploying the model into production. A variety of programming languages, including Python, C++, and Java, can be used with TensorFlow.
Also, look at the Keras library. Keras used to be a standalone library that ran on top of various frameworks; however, in 2019 it was integrated into TensorFlow as its official high-level API. Thanks to this high-level API, deep learning tasks can readily be completed even by non-AI experts.
Loading images
Keras provides a dedicated function for loading images, load_img, which I imported.
I resize and normalize the image before using it in a neural network because these models require images to be consistent and of a specific size. For instance, the network I use in this article needs an image that is either 150 × 150 or 299 × 299 in size. The larger image size turns out to give higher accuracy than the smaller one: a bigger image preserves more of its features, which helps the pre-trained CNN learn the image patterns faster.
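A small example of loading and resizing one image with load_img; the file path here is only an illustration:

```python
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Load a single image and resize it to the 299 x 299 input that Xception expects.
img = load_img("data/train/sandal/example.jpg", target_size=(299, 299))
x = img_to_array(img)  # NumPy array of shape (299, 299, 3)
```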
Image processing has been transformed by CNN architecture. This algorithm gave machines a vision comparable to that of humans. This architecture extracts patterns from images using layers and filters. This architecture’s method of operation makes it popular for applications like face recognition, object recognition, and image categorization.
A neural network is one type of machine learning model for handling regression and classification problems. My task in this project, identifying an image’s category, is a classification problem, and what makes it special is that I am working with images. Because of this, a particular kind of neural network is required: the convolutional neural network. These networks can recognize patterns in images and use them to make predictions. Training a convolutional neural network from scratch takes a lot of time, large amounts of data, and sophisticated hardware, but many pre-trained neural networks are available online for various purposes. Luckily, I could use a model pre-trained on ImageNet called Xception.
Xception expects images of size 299 × 299 converted into an array. Before I can apply the model to an image, I need to prepare it with the preprocess_input function:
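A sketch of that preparation step, continuing from the image loaded above:

```python
import numpy as np
from tensorflow.keras.applications.xception import preprocess_input

# Turn the single image array into a batch of one and apply Xception's
# preprocessing, which rescales pixel values from [0, 255] to [-1, 1].
X = np.array([x])        # shape: (1, 299, 299, 3)
X = preprocess_input(X)
```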
This function preprocesses input images before they are fed into the Xception model for prediction or feature extraction; here, it prepares images for predicting which of the three classes they belong to.
Convolutional neural networks demand large amounts of data and take a long time to train. However, there is a quick fix: transfer learning, a method that adapts an already-trained model to a new task.
Transfer Learning
The Xception model is a deep convolutional neural network architecture pre-trained on a sizable dataset. Instead of loading each image one by one, I used a data generator, ImageDataGenerator; Keras handles loading the images and pre-processing them.
Training becomes challenging mostly because of convolutional layers. The filters must acquire good patterns to be able to extract a high-quality vector representation from an image. The network needs to see a wide variety of images for that to happen; the more, the better.
However, it is not too difficult to train the dense layers, given a suitable vector representation. This implies that I can utilize a neural network that has been pre-trained on ImageNet to solve the problem: such a model has already learned good filters. Thus, I take the model, retain the convolutional layers, and train new dense layers in place of the original ones.
Loading the Data
To obtain the feature matrix X, I would normally load the complete dataset into memory. That is more challenging with images because there might not be enough memory to hold them all. ImageDataGenerator, included with Keras, is the solution: it loads the images in small batches rather than loading the complete dataset into memory.
As I already know, the preprocess_input function must be used to preprocess the images, so I must instruct ImageDataGenerator on how the data should be prepared. Once I have the generator, all I need to do is point it at the data directory using the flow_from_directory method:
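A sketch of what that looks like, assuming the training images live in a data/train directory after the restructuring step:

```python
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stream training images from disk in batches instead of loading
# the entire dataset into memory.
train_gen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_ds = train_gen.flow_from_directory(
    "data/train",
    target_size=(299, 299),
    batch_size=32,
)
```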
This approach lets the model train more quickly, and it is even feasible to train on a laptop when dealing with small images. In this model, I worked with 299 × 299 pixel images. The dataset comprises three classes, with the images belonging to each class kept in a separate directory; for instance, the sandal folder houses all of the sandals. The generator can use this folder structure to deduce each image’s label.
When executed, the flow_from_directory cell reports how many images and classes were found in the training dataset; Keras prints a line such as “Found … images belonging to 3 classes.”
I repeated the process for the validation dataset: the training dataset is used for training the model, and the validation dataset for selecting the best parameters.
Adjusting the learning rate is a crucial aspect of training neural networks, including when using models like Xception in TensorFlow. The learning rate determines the size of the steps taken during optimization and can significantly impact both the training process and the final model’s performance, so I tried different values and checked which parameters were best for the model.
Creating the model
Firstly, I load the base model. This is the pre-trained model that I’ll be using for extracting the vector representation from images. As previously stated, I use Xception, but this time I include only the part with the pre-trained convolutional layers and add dense layers after it. So, the base model is:
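A sketch of loading the base model, using the standard Keras Xception API:

```python
from tensorflow.keras.applications.xception import Xception

# Load Xception pre-trained on ImageNet, keep only the convolutional part
# (include_top=False), and freeze its weights so only the new layers train.
base_model = Xception(
    weights="imagenet",
    include_top=False,
    input_shape=(299, 299, 3),
)
base_model.trainable = False
```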
Now let me build the classification model. The input images are 299 × 299 and there are three classes. I use base_model to extract the high-level features, convert the output of base_model into a vector, add a dense layer of size three with one element for each class, and combine the inputs and outputs into a Keras model. After training the model with different parameters, a learning rate of 0.01 turned out best for this model, and the other candidate values were discarded. This way of building the model is called the “functional style.”
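Here is a sketch of that functional-style model; the Adam optimizer is my assumption, while 0.01 is the learning rate that worked best:

```python
from tensorflow import keras

# The frozen base model extracts features, pooling turns them into a vector,
# and a dense layer of size three produces one raw score (logit) per class.
inputs = keras.Input(shape=(299, 299, 3))
base = base_model(inputs, training=False)
vectors = keras.layers.GlobalAveragePooling2D()(base)
outputs = keras.layers.Dense(3)(vectors)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.01),
    loss=keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```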
In this section, I will show the process I undertook to shortlist the most accurate model, which significantly impacts the overall accuracy of the final result. Once the model is trained, I can save it using the save_weights method:
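A minimal example, with an illustrative filename:

```python
# Save the trained weights to disk.
model.save_weights("xception_v1_weights.h5")
```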
You may have observed fluctuations in the model’s performance on the validation set throughout training. In this sense, after ten iterations I might not have the optimal model; the best results might instead have been obtained on iteration five or six. I could save the model at every iteration, but that produces an excessive amount of data, and when renting a server on the cloud it can easily fill up all of the available space. Instead, I save the model only when its validation score surpasses the prior best score: for instance, if the prior best accuracy was 0.976 and it increases to 0.977, I preserve the model; if not, I carry on training without saving.
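This is what Keras’s ModelCheckpoint callback does; a sketch with a filename template that mirrors the saved files shown later (e.g. xception_v1_08_0.976.h5, meaning epoch 08 with validation accuracy 0.976):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Write a new model file only when validation accuracy beats the best so far.
checkpoint = ModelCheckpoint(
    "xception_v1_{epoch:02d}_{val_accuracy:.3f}.h5",
    monitor="val_accuracy",
    mode="max",
    save_best_only=True,
)
```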
Setting include_top to False removes Xception’s final dense layers, basically saying I can add my own classifier layers. I don’t have to restrict the model to just the convolutional layers plus the prediction layer, so I add another layer between the base model and the final predictions. There’s no particular reason for selecting a size of 100 for this inner dense layer; I treat it as a tunable parameter. Having added the dense layer of size 100 in lieu of connecting the pooled vectors directly to the outputs, I connected the predictions to this inner layer. By putting together multiple logistic regressions, I get a neural network, and in logistic regression the sigmoid is used for converting a raw score into a probability.
However, probabilities are not necessary for the inner layers, and alternative functions can be used in place of the sigmoid. These are called activation functions. Among them is ReLU (Rectified Linear Unit), a preferable option to sigmoid for inner layers because it does not saturate easily: with sigmoid, training deep networks becomes infeasible due to the vanishing gradient problem, and ReLU solves this difficulty.
A unique method for combating overfitting in neural networks is dropout. The primary concept behind dropout is that during training, a portion of a dense layer remains frozen. In every iteration, a randomly selected portion is frozen. The frozen portion is not handled at all; only the unfrozen portion is trained. The likelihood of the model overfitting overall is reduced if some network components are disregarded.
When the network passes over a batch of 32 images, the frozen portion of the layer is switched off and does not see this data, making it harder for the network to commit the images to memory. The portion to freeze is chosen at random for each batch, so the network learns to identify patterns even with parts of it missing, which makes it more robust and less likely to overfit. To do this in Keras, we add a dropout layer after the first dense layer and set the dropout rate:
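A sketch of the extended model with the inner layer and dropout, reusing base_model from the earlier snippet:

```python
from tensorflow import keras

# Same functional model, now with a size-100 inner dense layer (ReLU)
# and a dropout layer that randomly freezes part of it during training.
inputs = keras.Input(shape=(299, 299, 3))
base = base_model(inputs, training=False)
vectors = keras.layers.GlobalAveragePooling2D()(base)
inner = keras.layers.Dense(100, activation="relu")(vectors)
drop = keras.layers.Dropout(0.2)(inner)  # freeze 20% of the inner layer per batch
outputs = keras.layers.Dense(3)(drop)
model = keras.Model(inputs, outputs)
```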
Obtaining additional data is usually the best course of action for enhancing the quality of a model. Unfortunately, obtaining more data isn’t always possible, but with images I can extract additional information from already-existing ones: rotate an image, zoom in or out somewhat, flip an image horizontally or vertically, and alter it in numerous other ways. I can also combine multiple data-augmentation strategies; for example, I can take an image, flip it horizontally, zoom out, and then rotate it.
Keras offers an integrated method for dataset augmentation, built on the same ImageDataGenerator I previously used to read the images.
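A sketch of an augmenting generator; the specific transformation values are illustrative, not my final choices:

```python
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation happens on the fly as batches are generated, so every epoch
# the network sees a slightly different variant of each training image.
train_gen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=30,
    zoom_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,
)
```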
Multiple arguments are fed into the generator; before now, I only used the preprocessing function required to prepare the images. Training this model requires many more epochs than before. Data augmentation is another regularization technique: every epoch, the network sees a distinct variant of the same image rather than repeatedly training on an identical one, which makes it harder for the model to memorize the data and reduces the likelihood of overfitting. This improvement is significantly important for combating overfitting. I experimented a lot, and I could do this relatively quickly because I used images of size 299 × 299. Time to apply everything the model has learned so far to the larger model.
It is noteworthy that even for people, it may be challenging to recognize what kind of item is in an image when a small image size is used. It is also difficult for a computer: the important details are not easy to see, so the model may confuse, say, sandals and shoes. That is the reason I set the image size to 299 × 299 right from the beginning of training; it is much easier for the network to see more features and therefore achieve greater accuracy. It takes about four times longer to train a model on larger images than on smaller ones, so if you do not have access to a computer with a GPU, you are not required to run the code in this section. From a conceptual standpoint, the input size is the sole variable that differs in the procedure.
The dropout rate is 0.2, and the number of epochs was increased to 50. Every other hyperparameter was selected based on its accuracy during training and validation.
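Putting it together, a sketch of the final training call; train_ds and val_ds stand for the training and validation generators built with flow_from_directory, and checkpoint is the callback shown above:

```python
# Train the final model for 50 epochs, keeping only checkpoints
# that improve validation accuracy.
history = model.fit(
    train_ds,
    epochs=50,
    validation_data=val_ds,
    callbacks=[checkpoint],
)
```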
Finally, I load the models, evaluate them, and get the predictions. Previously, I trained multiple models and deleted some of them because of the space they would occupy on Saturn Cloud, and to avoid constraints when pushing to the GitHub repository from the cloud; this saves space, time, and data charges. The best one is the larger model, with 97.7% accuracy; the second-best model has an accuracy of 97.6%. I’ll now use these models to make predictions, and to use a model, I first need to load it.
Loading the Model
The models that were preserved for use can be found in this repository directory, as shown below in the image.
To use a model, load it with the load_model function from the keras.models package:
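A minimal example, using the filename of the second-best checkpoint mentioned above:

```python
from tensorflow.keras.models import load_model

# Load one of the preserved checkpoints from the repository directory.
model = load_model("xception_v1_08_0.976.h5")
```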
Evaluating the model
The training and validation datasets have already been used above. Now that the training procedure is complete, it is time to evaluate the model on the test dataset. I again use ImageDataGenerator, but point it at the test directory. Evaluating a model in Keras is as simple as invoking the evaluate method: it applies the model to all the data in the test folder and reports the evaluation metrics, loss and accuracy:
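A sketch, assuming the test images live in a data/test directory:

```python
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stream the test images and report loss and accuracy.
test_gen = ImageDataGenerator(preprocessing_function=preprocess_input)
test_ds = test_gen.flow_from_directory(
    "data/test",
    target_size=(299, 299),
    batch_size=32,
    shuffle=False,
)
model.evaluate(test_ds)
```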
However, let’s see how I can apply the model to individual images to get predictions.
Getting the Prediction
Suppose I want to apply the model to a single image that has already been loaded:
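A sketch of scoring one image; the path is illustrative, and the class order assumes flow_from_directory’s alphabetical assignment of class indices:

```python
import numpy as np
from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Load, resize, and preprocess a single image, then score it with the model.
img = load_img("data/test/boot/example.jpg", target_size=(299, 299))
X = preprocess_input(np.array([img_to_array(img)]))

pred = model.predict(X)
classes = ["boot", "sandal", "shoe"]  # alphabetical, matching the generator
print(dict(zip(classes, pred[0])))
```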
The image is preprocessed into an array, and the result of the prediction is an array of three elements, where each element contains a score; the higher the score, the more likely the image belongs to the respective class.
I now have a trained convolutional neural network for classifying images of footwear. I can save it, load it, and use it inside a Jupyter Notebook, just as I did for the second model, xception_v1_08_0.976.h5, in this repository directory. This is the link to the project code repository:
Check out my repository for the model code here. Thank you for reading.