Learnings from a Machine Learning Engineer — Part 5: The Training


In this fifth part of my series, I will outline the steps for creating a Docker container for training your image classification model, evaluating performance, and preparing for deployment.

AI/ML engineers would prefer to focus on model training and data engineering, but the reality is that we also need to understand the infrastructure and mechanics behind the scenes.

I hope to share some tips, not only to get your training run up and running, but also to streamline the process in a cost-efficient manner on cloud resources such as Kubernetes.

I will reference elements from my previous articles for getting the best model performance, so be sure to check out Part 1 and Part 2 on the data sets, as well as Part 3 and Part 4 on model evaluation.

Here are the learnings that I will share with you, once we lay the groundwork on the infrastructure:

  • Building your Docker container
  • Executing your training run
  • Deploying your model

Infrastructure overview

First, let me provide a brief description of the setup that I created, specifically around Kubernetes. Your setup may be entirely different, and that is just fine. I simply want to set the stage on the infrastructure so that the rest of the discussion makes sense.

Image management system

This is a server you deploy that provides a user interface for your subject matter experts to label and evaluate images for the image classification application. The server can run as a pod on your Kubernetes cluster, but you may find that a dedicated server with faster disk works better.

Image files are stored in a directory structure like the following, which is self-documenting and easily modified.

Image_Library/
  - cats/
    - image1001.png
  - dogs/
    - image2001.png

Ideally, these files would reside on local server storage (instead of cloud or cluster storage) for better performance. The reason for this will become clear as we see what happens as the image library grows.

Cloud storage

Cloud storage allows for a virtually limitless and convenient way to share files between systems. In this case, the image library on your management system can be made accessible to your Kubernetes cluster or Docker engine.

However, the downside of cloud storage is the latency to open a file. Your image library will have thousands and thousands of images, and the latency to read each file will have a significant impact on your training run time. Longer training runs mean more cost for using the expensive GPU processors!

The way that I found to speed things up is to create tar files of your image library on your management system and copy them to cloud storage. Even better is to create multiple tar files in parallel, each containing 10,000 to 20,000 images.

This way you only have network latency on a handful of files (which contain thousands, once extracted) and you start your training run much sooner.

Kubernetes or Docker engine

A Kubernetes cluster, with proper configuration, will allow you to dynamically scale up/down nodes, so you can perform your model training on GPU hardware as needed. Kubernetes is a rather heavy setup, and there are other container engines that will work.

The technology options change constantly!

The main idea is that you want to spin up the resources you need — for only as long as you need them — then scale down to reduce your time (and therefore cost) of running expensive GPU resources.

Once your GPU node is started and your Docker container is running, you can extract the tar files above to local storage, such as an emptyDir, on your node. The node typically has high-speed SSD disk, ideal for this type of workload. There is one caveat — the storage capacity on your node must be able to handle your image library.

Assuming we are good, let’s talk about building your Docker container so that you can train your model on your image library.

Building your Docker container

Being able to execute a training run in a consistent manner lends itself perfectly to building a Docker container. You can “pin” the version of libraries so you know exactly how your scripts will run every time. You can version control your containers as well, and revert to a known good image in a pinch. What is really nice about Docker is you can run the container pretty much anywhere.

The tradeoff when running in a container, especially with an Image Classification model, is the speed of file storage. You can attach any number of volumes to your container, but they are usually network attached, so there is latency on each file read. This may not be a problem if you have a small number of files. But when dealing with hundreds of thousands of files like image data, that latency adds up!

This is why using the tar file method outlined above can be beneficial.

Also, keep in mind that Docker containers could be terminated unexpectedly, so you should make sure to store important information outside the container, on cloud storage or a database. I’ll show you how below.

Dockerfile

Knowing that you will need to run on GPU hardware (here I will assume Nvidia), be sure to select the right base image for your Dockerfile, such as nvidia/cuda with the “devel” flavor, which includes the CUDA libraries and tools you will need.

Next, you will add the script files to your container, along with a “batch” script to coordinate the execution. Here is an example Dockerfile, and then I’ll describe what each of the scripts will be doing.

#####   Dockerfile   #####
FROM nvidia/cuda:12.8.0-devel-ubuntu24.04

# Install system software
RUN apt-get -y update && apt-get -y upgrade
RUN apt-get install -y python3-pip python3-dev

# Setup python
WORKDIR /app
COPY requirements.txt .
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install -r requirements.txt

# Python and batch scripts
COPY ExtractImageLibrary.py .
COPY Training.py .
COPY Evaluation.py .
COPY ScorePerformance.py .
COPY ExportModel.py .
COPY BulkIdentification.py .
COPY BatchControl.sh .

# Allow for interactive shell
CMD tail -f /dev/null

Dockerfiles are declarative, almost like a cookbook for building a small server — you know what you’ll get every time. Python libraries benefit, too, from this declarative approach. Here is a sample requirements.txt file that loads the TensorFlow libraries with CUDA support for GPU acceleration.

#####   requirements.txt   #####
numpy==1.26.3
pandas==2.1.4
scipy==1.11.4
keras==2.15.0
tensorflow[and-cuda]

Extract Image Library script

In Kubernetes, the Docker container can access local, high speed storage on the physical node. This can be achieved via the emptyDir volume type. As mentioned before, this will only work if the local storage on your node can handle the size of your library.

#####   sample 25Gi emptyDir volume in Kubernetes   #####
containers:
  - name: training-container
    volumeMounts:
      - name: image-library
        mountPath: /mnt/image-library
volumes:
  - name: image-library
    emptyDir:
      sizeLimit: 25Gi

You would want to have another volumeMount to your cloud storage where you have the tar files. What this looks like will depend on your provider, or if you are using a persistent volume claim, so I won’t go into detail here.

Now you can extract the tar files — ideally in parallel for an added performance boost — to the local mount point.
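As a sketch of what ExtractImageLibrary.py might do, assuming the tar shards are visible on a cloud storage mount and the emptyDir is mounted at /mnt/image-library (both paths are illustrative, not my exact setup):

#####   ExtractImageLibrary.py (simplified sketch)   #####
import glob
import tarfile
from concurrent.futures import ProcessPoolExecutor

TAR_DIR = "/mnt/cloud-storage/tar_shards"   # cloud storage mount (illustrative)
EXTRACT_DIR = "/mnt/image-library"          # local emptyDir mount (illustrative)

def extract_one(tar_path):
    # Each shard holds a slice of the class folders (cats/, dogs/, ...)
    with tarfile.open(tar_path, "r") as tar:
        tar.extractall(path=EXTRACT_DIR)
    return tar_path

if __name__ == "__main__":
    shards = sorted(glob.glob(f"{TAR_DIR}/*.tar"))
    # Extract shards in parallel; each worker handles one tar file
    with ProcessPoolExecutor(max_workers=4) as pool:
        for done in pool.map(extract_one, shards):
            print(f"Extracted {done}")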

Training script

As AI/ML engineers, the model training is where we want to spend most of our time.

This is where the magic happens!

With your image library now extracted, we can create our train-validation-test sets, load a pre-trained model or build a new one, fit the model, and save the results.

One key technique that has served me well is to load the most recently trained model as my base. I discuss this in more detail in Part 4 under “Fine tuning”; it results in faster training time and significantly improved model performance.

Be sure to take advantage of the local storage to checkpoint your model during training since the models are quite large and you are paying for the GPU even while it sits idle writing to disk.
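Here is a minimal Keras sketch of those two ideas: load the most recently trained model as the base and checkpoint to the fast local disk. The paths, image size, and hyperparameter values are illustrative, and the test/benchmark splits are omitted for brevity.

#####   Training.py (simplified sketch)   #####
import tensorflow as tf

IMAGE_DIR = "/mnt/image-library"                             # local emptyDir mount
PREVIOUS_MODEL = "/mnt/cloud-storage/models/current.keras"   # most recent model (illustrative path)
CHECKPOINT_PATH = "/mnt/image-library/checkpoint.keras"      # fast local SSD

# Build train/validation sets from the extracted class folders
train_ds = tf.keras.utils.image_dataset_from_directory(
    IMAGE_DIR, validation_split=0.2, subset="training",
    seed=42, image_size=(224, 224), batch_size=50)
val_ds = tf.keras.utils.image_dataset_from_directory(
    IMAGE_DIR, validation_split=0.2, subset="validation",
    seed=42, image_size=(224, 224), batch_size=50)

# Load the most recently trained model as the base (see Part 4, "Fine tuning")
model = tf.keras.models.load_model(PREVIOUS_MODEL)

# Checkpoint to local SSD so the GPU spends less time waiting on disk
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    CHECKPOINT_PATH, monitor="val_loss", save_best_only=True)

model.fit(train_ds, validation_data=val_ds, epochs=30, callbacks=[checkpoint])
model.save("/mnt/image-library/final_model.keras")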

This of course raises a concern about what happens if the Docker container dies part-way through the training. The risk is (hopefully) low from a cloud provider, and you may not want an incomplete training anyway. But if that does happen, you will at least want to understand why, and this is where saving the main log file to cloud storage (described below) or to a package like MLflow comes in handy.

Evaluation script

After your training run has completed and you have taken proper precautions to save your work, it is time to see how well the model performed.

Normally this evaluation script will pick up the model that just finished training. But you may decide to point it at a previous model version through an interactive session. This is why I keep the script as a stand-alone.

With it being a separate script, that means it will need to read the completed model from disk — ideally local disk for speed. I like having two separate scripts (training and evaluation), but you might find it better to combine these to avoid reloading the model.

Now that the model is loaded, the evaluation script should generate predictions on every image in the training, validation, test, and benchmark sets. I save the results as a huge matrix with the softmax confidence score for each class label. So, if there are 1,000 classes and 100,000 images, that’s a table with 100 million scores!

I save these results in pickle files that are then used in the score generation next.
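Here is a rough sketch of that step for a single dataset, assuming the model and extracted images are on the local mount described earlier; the paths, image size, and batch size are illustrative.

#####   Evaluation.py (simplified sketch)   #####
import pickle
import numpy as np
import tensorflow as tf

MODEL_PATH = "/mnt/image-library/final_model.keras"   # illustrative local path
DATASET_DIR = "/mnt/image-library"                    # one of train/val/test/benchmark

model = tf.keras.models.load_model(MODEL_PATH)
dataset = tf.keras.utils.image_dataset_from_directory(
    DATASET_DIR, shuffle=False, image_size=(224, 224), batch_size=64)

# One row per image, one softmax confidence score per class label
scores = model.predict(dataset)                            # (num_images, num_classes)
labels = np.concatenate([y.numpy() for _, y in dataset])   # ground truth class indices

with open("/mnt/image-library/eval_scores.pkl", "wb") as f:
    pickle.dump({"scores": scores, "labels": labels,
                 "class_names": dataset.class_names}, f)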

Score generation script

Taking the matrix of scores produced by the evaluation script above, we can now create various metrics of model performance. Again, this process could be combined with the evaluation script above, but my preference is for independent scripts. For example, I might want to regenerate scores on previous training runs. See what works for you.

Here are some of the sklearn functions that produce useful insights such as F1, log loss, AUC-ROC, and the Matthews correlation coefficient.

from sklearn.metrics import average_precision_score, classification_report
from sklearn.metrics import log_loss, matthews_corrcoef, roc_auc_score
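As a sketch, here is how these might be applied to the pickled score matrix from the evaluation script; the file path and variable names follow the earlier example.

#####   ScorePerformance.py (simplified sketch)   #####
import pickle
import numpy as np
from sklearn.metrics import (classification_report, log_loss,
                             matthews_corrcoef, roc_auc_score)

# Load the score matrix written by the evaluation script
with open("/mnt/image-library/eval_scores.pkl", "rb") as f:
    results = pickle.load(f)

scores = results["scores"]      # (num_images, num_classes) softmax matrix
y_true = results["labels"]      # integer ground truth labels
y_pred = np.argmax(scores, axis=1)

print(classification_report(y_true, y_pred))     # per-class precision, recall, F1
print("Log loss:", log_loss(y_true, scores))
print("MCC     :", matthews_corrcoef(y_true, y_pred))
print("AUC-ROC :", roc_auc_score(y_true, scores, multi_class="ovr"))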

Aside from these basic statistical analyses for each dataset (train, validation, test, and benchmark), it is also useful to identify:

  • Which ground truth labels produce the most errors?
  • Which predicted labels receive the most incorrect guesses?
  • How many ground-truth-to-predicted label pairs are there? In other words, which classes are easily confused?
  • What is the accuracy when applying a minimum softmax confidence score threshold? (see the sketch after this list)
  • What is the error rate above that softmax threshold?
  • For the “difficult” benchmark sets, do you get a sufficiently high score?
  • For the “out-of-scope” benchmark sets, do you get a sufficiently low score?
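For the threshold questions above, a small helper like the following (continuing with the scores and y_true arrays from the previous sketch) gives the coverage, accuracy, and error rate above a chosen cutoff; the 0.90 default is just an example.

# Accuracy and error rate above a minimum softmax confidence threshold
import numpy as np

def threshold_report(scores: np.ndarray, y_true: np.ndarray, threshold: float = 0.90):
    y_pred = scores.argmax(axis=1)            # predicted class per image
    top_score = scores.max(axis=1)            # highest softmax score per image
    above = top_score >= threshold
    coverage = above.mean()                   # fraction of images at/above the cutoff
    accuracy = (y_pred[above] == y_true[above]).mean()
    return coverage, accuracy, 1.0 - accuracy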

As you can see, there are multiple calculations and it’s not easy to come up with a single evaluation to decide if the trained model is good enough to be moved to production.

In fact, for an image classification model, it is helpful to manually review the images that the model got wrong, as well as the ones that got a low softmax confidence score. Use the scores from this script to create a list of images to manually review, and then get a gut-feel for how well the model performs.

Check out Part 3 for more in-depth discussion on evaluation and scoring.

Export script

All of the heavy lifting is done by this point. Since your Docker container will be shut down soon, now is the time to copy the model artifacts to cloud storage and prepare them for use.

The example Python code snippet below is more geared to Keras and TensorFlow. This will take the trained model and export it as a saved_model. Later, I will show how this is used by TensorFlow Serving in the Deploy section below.

import os
import tensorflow as tf

# Increment current version of model and create new directory
next_version_dir, version_number = create_new_version_folder()

# Copy model artifacts to the new directory
copy_model_artifacts(next_version_dir)

# Create the directory to save the model export
saved_model_dir = os.path.join(next_version_dir, str(version_number))

# Save the model export for use with TensorFlow Serving
tf.keras.backend.set_learning_phase(0)
model = tf.keras.models.load_model(keras_model_file)
tf.saved_model.save(model, export_dir=saved_model_dir)

This script also copies the other training run artifacts such as the model evaluation results, score summaries, and log files generated from model training. Don’t forget about your label map so you can give human readable names to your classes!
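For the label map, something as simple as a JSON file next to the export works well; here class_names and next_version_dir are carried over from the earlier snippets and are illustrative.

# Save the label map next to the exported model (illustrative)
import json
import os

label_map = {index: name for index, name in enumerate(class_names)}
with open(os.path.join(next_version_dir, "label_map.json"), "w") as f:
    json.dump(label_map, f, indent=2)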

Bulk identification script

Your training run is complete, your model has been scored, and a new version is exported and ready to be served. Now is the time to use this latest model to assist you on trying to identify unlabeled images.

As I described in Part 4, you may have a collection of “unknowns” — really good pictures, but no idea what they are. Let your new model provide a best guess on these and record the results to a file or a database. Now you can create filters based on closest match and by high/low scores. This allows your subject matter experts to leverage these filters to find new image classes, add to existing classes, or to remove images that have very low scores and are no good.

By the way, I put this step inside the GPU container since you may have thousands of “unknown” images to process and the accelerated hardware will make light work of it. However, if you are not in a hurry, you could perform this step on a separate CPU node, and shutdown your GPU node sooner to save cost. This would especially make sense if your “unknowns” folder is on slower cloud storage.
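Here is a sketch of what BulkIdentification.py might do, assuming the unknowns sit in a single folder and the trained model is still on the local mount; the paths and output format are illustrative.

#####   BulkIdentification.py (simplified sketch)   #####
import csv
import numpy as np
import tensorflow as tf

UNKNOWN_DIR = "/mnt/cloud-storage/unknowns"           # unlabeled images (illustrative)
MODEL_PATH = "/mnt/image-library/final_model.keras"   # trained model on local disk

model = tf.keras.models.load_model(MODEL_PATH)
dataset = tf.keras.utils.image_dataset_from_directory(
    UNKNOWN_DIR, labels=None, shuffle=False,
    image_size=(224, 224), batch_size=64)

scores = model.predict(dataset)     # (num_images, num_classes) softmax matrix

# Record the best guess and its confidence for filtering by the experts
with open("/mnt/cloud-storage/unknown_guesses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "best_guess_index", "confidence"])
    for path, row in zip(dataset.file_paths, scores):
        writer.writerow([path, int(np.argmax(row)), float(row.max())])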

Batch script

All of the scripts described above perform a specific task — from extracting your image library and executing model training, to performing evaluation and scoring, exporting the model artifacts for deployment, and perhaps even bulk identification.

One script to rule them all

To coordinate the entire show, this batch script gives you the entry point for your container and an easy way to trigger everything. Be sure to produce a log file in case you need to analyze any failures along the way. Also, be sure to write the log to your cloud storage in case the container dies unexpectedly.

#!/bin/bash
# Main batch control script

# Redirect standard output and standard error to a log file
exec > /cloud_storage/batch-logfile.txt 2>&1

python3 /app/ExtractImageLibrary.py
python3 /app/Training.py
python3 /app/Evaluation.py
python3 /app/ScorePerformance.py
python3 /app/ExportModel.py
python3 /app/BulkIdentification.py

Executing your training run

So, now it’s time to put everything in motion…

Start your engines!

Let’s go through the steps to prepare your image library, fire up your Docker container to train your model, and then examine the results.

Image library ‘tar’ files

Your image management system should now create a tar file backup of your data. Since tar is a single-threaded function, you will get a significant speed improvement by creating multiple tar files in parallel, each with a portion of your data.
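Here is one way to shard the library into parallel tar files using Python's tarfile module and a process pool; the shard size, folder names, and worker count are illustrative.

#####   create tar shards (simplified sketch)   #####
import glob
import os
import tarfile
from concurrent.futures import ProcessPoolExecutor

LIBRARY_DIR = "Image_Library"    # library root on the management server
OUTPUT_DIR = "tar_shards"        # staging area before the copy to cloud storage
IMAGES_PER_SHARD = 15_000        # roughly 10,000 to 20,000 images per tar file

def build_shard(job):
    shard_index, files = job
    shard_path = os.path.join(OUTPUT_DIR, f"image_library_{shard_index:03d}.tar")
    with tarfile.open(shard_path, "w") as tar:
        for path in files:
            # Preserve the class-folder structure inside the archive
            tar.add(path, arcname=os.path.relpath(path, LIBRARY_DIR))
    return shard_path

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    all_images = sorted(glob.glob(os.path.join(LIBRARY_DIR, "*", "*")))
    chunks = [all_images[i:i + IMAGES_PER_SHARD]
              for i in range(0, len(all_images), IMAGES_PER_SHARD)]
    with ProcessPoolExecutor(max_workers=4) as pool:
        for shard in pool.map(build_shard, enumerate(chunks)):
            print(f"Created {shard}")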

Now these files can be copied to your shared cloud storage for the next step.

Start Docker container

All the hard work you put into creating your container (described above) will be put to the test. If you are running Kubernetes, you can create a Job that will execute the BatchControl.sh script.

Inside the Kubernetes Job definition, you can pass environment variables to adjust the execution of your script. For example, the batch size and number of epochs are set here and then pulled into your Python scripts, so you can alter the behavior without changing your code.

#####   sample Job in Kubernetes   #####
containers:
  - name: training-job
    env:
      - name: BATCH_SIZE
        value: "50"
      - name: NUM_EPOCHS
        value: "30"
    command: ["/app/BatchControl.sh"]

Once the Job is completed, be sure to verify that the GPU node properly scales back down to zero according to your scaling configuration in Kubernetes — you don’t want to be saddled with a huge bill over a simple configuration error.

Manually review results

With the training run complete, you should now have model artifacts saved and can examine the performance. Look through the metrics, such as F1 and log loss, as well as the benchmark accuracy at high softmax confidence scores.

As mentioned earlier, the reports only tell part of the story. It is worth the time and effort to manually review the images that the model got wrong or where it produced a low confidence score.

Don’t forget about the bulk identification results. Be sure to leverage them to locate new images to fill out your data set, or to find new classes.

Deploying your model

Once you have reviewed your model performance and are satisfied with the results, it is time to modify your TensorFlow Serving container to put the new model into production.

TensorFlow Serving is available as a Docker container and provides a very quick and convenient way to serve your model. This container can listen and respond to API calls for your model.

Let’s say your new model is version 7, and your Export script (see above) has saved the model in your cloud share as /image_application/models/007. You can start the TensorFlow Serving container with that volume mount. In this example, the shareName points to the folder for version 007.

#####   sample TensorFlow pod in Kubernetes   #####
containers:
  - name: tensorflow-serving
    image: bitnami/tensorflow-serving:2.18.0
    ports:
      - containerPort: 8501
    env:
      - name: TENSORFLOW_SERVING_MODEL_NAME
        value: "image_application"
    volumeMounts:
      - name: models-subfolder
        mountPath: "/bitnami/model-data"

volumes:
  - name: models-subfolder
    azureFile:
      shareName: "image_application/models/007"

A subtle note here — the export script should create a sub-folder, named 007 (same as the base folder), with the saved model export. This may seem a little confusing, but TensorFlow Serving will mount this share folder as /bitnami/model-data and detect the numbered sub-folder inside it for the version to serve. This will allow you to query the API for the model version as well as the identification.
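Once the pod is running, a quick sanity check might look like the following; the service host name is an assumption based on the pod definition above, and the instances payload must match your model's input signature.

# Query TensorFlow Serving for the model version and a prediction (sketch)
import numpy as np
import requests

SERVING_URL = "http://tensorflow-serving:8501/v1/models/image_application"

# Model version status
print(requests.get(SERVING_URL).json())

# Prediction request; here a single 224x224 RGB placeholder image
image_array = np.zeros((224, 224, 3), dtype=np.float32)
payload = {"instances": [image_array.tolist()]}
response = requests.post(f"{SERVING_URL}:predict", json=payload)
print(response.json()["predictions"])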

Conclusion

As I mentioned at the start of this article, this setup has worked for my situation. This is certainly not the only way to approach this challenge, and I invite you to customize your own solution.

I wanted to share my hard-fought learnings as I embraced cloud services in Kubernetes, with the desire to keep costs under control. Of course, doing all this while maintaining a high level of model performance is an added challenge, but one that you can achieve.

I hope I have provided enough information here to help you with your own endeavors. Happy learnings!

