Introduction to AutoML: Automating Machine Learning Workflows


AutoML is designed for both technical and non-technical users. It simplifies the process of training machine learning models: you provide a dataset, and it searches for the best-performing model for your use case. You don’t have to code for long hours or experiment with various techniques by hand; the search is automated.

In this tutorial, we will learn about AutoML and TPOT, a Python AutoML tool for building machine learning pipelines. We will also learn to build a machine learning classifier, save the model, and use it for model inference.

What is AutoML?

AutoML, or Automated Machine Learning, refers to tools that take a dataset and handle the remaining steps behind the scenes to produce a high-performing machine learning model. These steps include data preprocessing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. Even a non-technical user can build a complex machine learning model with AutoML tools.

By automatically searching over candidate models and configurations, AutoML systems can find strong models for a given dataset, reducing the time and effort required to develop machine learning models.

1. Getting Started with TPOT

TPOT (Tree-based Pipeline Optimization Tool) is a simple and popular AutoML tool that uses genetic programming to optimize machine learning pipelines. It automatically explores hundreds of potential pipelines to identify the most effective one for a given dataset.

You can install TPOT on your system with pip by running the command: pip install tpot

Next, import the Python libraries needed to load and process the data and to train the classification model.

2. Loading the Data

For this tutorial, we are using the Mushroom Dataset from Kaggle, which contains 9 features for determining whether a mushroom is poisonous.

We will load the dataset using Pandas and randomly select 1000 samples from the dataset. 
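
The loading and sampling step can be sketched as follows. This is a minimal illustration, not the tutorial's original code: the in-memory CSV, the feature column names, the helper name load_sample, and the seed are all assumptions made so the snippet runs without the Kaggle file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the Kaggle CSV. The feature column names are
# made up; only "class" is named by the tutorial as the target column.
csv_text = "cap_diameter,stem_height,class\n" + "\n".join(
    f"{100 + i},{0.5 + i / 100:.2f},{i % 2}" for i in range(50)
)


def load_sample(source, n_samples, seed=42):
    """Read a CSV and return n_samples randomly chosen rows."""
    df = pd.read_csv(source)
    # random_state makes the random draw repeatable between runs
    return df.sample(n=n_samples, random_state=seed).reset_index(drop=True)


# With the real file you would pass its path and n_samples=1000.
sample = load_sample(io.StringIO(csv_text), n_samples=10)
print(sample.shape)  # (10, 3)
```

Fixing the seed is a design choice: it keeps the 1000-row subset identical across runs, which makes later scores comparable.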

3. Data Processing

The “class” column is our target variable. It contains two values, 0 and 1, where 0 means non-poisonous and 1 means poisonous. We will use it to separate the independent variables (features) from the dependent variable (target), and then split the data into training and test sets.
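
A minimal sketch of this step, using a synthetic stand-in for the sampled data. The feature names, the 80/20 split ratio, the seed, and the use of stratification are assumptions; the tutorial only states that “class” is the target and that the data is split into training and test sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sampled mushroom data (feature names are hypothetical).
df = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": [i % 7 for i in range(100)],
    "class": [i % 2 for i in range(100)],
})

X = df.drop("class", axis=1)  # independent variables (features)
y = df["class"]               # dependent variable (target)

# Hold out 20% of the rows for testing (assumed ratio and seed).
# stratify=y keeps the 0/1 balance the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```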

4. Building and Fitting TPOT Classifier

We will initialize the TPOT classifier and train it on the training set. TPOT will experiment with various models and techniques and return the best-performing pipeline.

The training output reports a score for each generation, followed by the best pipeline.

Let’s evaluate our best pipeline on the test dataset by using the .score method.

I think we have a pretty stable and accurate model. 
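
The fit step above can be sketched as a small helper. This assumes the classic TPOT API (the releases that provide TPOTClassifier with a verbosity argument and an export method); newer releases may differ. The search budget values, the seed, and the helper name fit_tpot are assumptions, since the tutorial does not state the settings it used.

```python
# Assumed search settings; the tutorial does not state the values it used.
TPOT_SETTINGS = {
    "generations": 5,       # number of evolutionary rounds
    "population_size": 20,  # candidate pipelines kept in each round
    "verbosity": 2,         # classic-API flag that prints per-generation scores
    "random_state": 42,     # makes the search repeatable
}


def fit_tpot(X_train, y_train, **overrides):
    """Run a TPOT search and return the fitted TPOTClassifier."""
    # Imported here so the helper can be defined even where TPOT is not installed.
    from tpot import TPOTClassifier

    settings = {**TPOT_SETTINGS, **overrides}
    tpot = TPOTClassifier(**settings)
    tpot.fit(X_train, y_train)
    return tpot
```

With the training and test splits from the previous step, calling fit_tpot(X_train, y_train) and then the returned object's score(X_test, y_test) follows the fit-and-evaluate flow described here. For a classifier in the classic API, the score is accuracy by default. Smaller values for generations and population_size shorten the search at the cost of exploring fewer pipelines.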

5. Saving the TPOT Pipeline and Model

To save the TPOT pipeline, we will use the .export method and give it a file name ending in .py.

The pipeline will be saved as a Python file containing the code for the best pipeline. To run it, you have to edit the dataset path, the column separator, and the target column name in that file.
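
The edits described above can be made by hand, but a short sketch shows what they amount to. It assumes the exported script uses the placeholder strings PATH/TO/DATA/FILE and COLUMN_SEPARATOR and a column named target, as classic TPOT exports typically do; check your own file before relying on this. The helper name and the mushroom.csv file name are hypothetical.

```python
def adapt_exported_script(script_text, csv_path, separator, target_column):
    """Fill in the data path, separator, and target column of an exported script."""
    return (
        script_text
        .replace("PATH/TO/DATA/FILE", csv_path)
        .replace("COLUMN_SEPARATOR", separator)
        .replace("'target'", repr(target_column))
    )


# Two lines in the style of an exported script (assumed template, not real output).
template = (
    "tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)\n"
    "features = tpot_data.drop('target', axis=1)\n"
)

patched = adapt_exported_script(template, "mushroom.csv", ",", "class")
print(patched)
```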

tpot_mashroom_pipeline.py:

You can also save the model as a pickle file using the joblib library. This file contains the serialized fitted pipeline, including its learned parameters, so it can be loaded later for inference without retraining.

6. Loading the TPOT Pipeline and Model Inference

We will load the saved model using the joblib.load function and predict labels for the first 10 samples of the test dataset.

The predicted labels closely match the actual labels for these samples, which suggests the model is accurate.
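
The save-and-reload round trip can be sketched as follows. To keep the snippet runnable without a TPOT search, a plain scikit-learn pipeline on synthetic data stands in for the best pipeline found by TPOT. The data, the stand-in model, and the tpot_pipeline.pkl file name are all assumptions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 9 features (stand-in for the mushroom data).
X, y = make_classification(n_samples=200, n_features=9, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Stand-in for the fitted pipeline that TPOT would return.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

# Save the fitted pipeline to disk (hypothetical file name, temporary directory).
model_path = os.path.join(tempfile.mkdtemp(), "tpot_pipeline.pkl")
joblib.dump(pipeline, model_path)

# Load it back and predict the first 10 test samples.
loaded = joblib.load(model_path)
predicted = loaded.predict(X_test[:10])
print("predicted:", predicted.tolist())
print("actual:   ", y_test[:10].tolist())
```

In the tutorial's flow, the object passed to joblib.dump would be the fitted TPOT pipeline; the classic API exposes it as the fitted_pipeline_ attribute. Loading the file requires the same libraries to be installed, because the pickle stores the fitted objects rather than their source code.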

Summary

In this tutorial, we have learned about AutoML and how it can be used by anyone, even non-technical users. We have also learned to use TPOT, an AutoML Python tool that automatically performs data processing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. At the end of model training, we get the best-performing model and the pipeline by running two lines of code. We can even save the model and use it to build an AI application.
