AutoML tools are designed for both technical and non-technical users. They simplify the process of training machine learning models: you provide a dataset, and in return you get the best-performing model for your use case. You don't have to code for long hours or experiment with various techniques; the tool does everything for you.
In this tutorial, we will learn about AutoML and TPOT, a Python AutoML tool for building machine learning pipelines. We will also learn to build a machine learning classifier, save the model, and use it for model inference.
What is AutoML?
AutoML, or Automated Machine Learning, takes a dataset you provide and handles all of the work on the back end to deliver a high-performing machine learning model. AutoML performs various tasks such as data preprocessing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. Even a non-technical user can build a highly complex machine learning model using AutoML tools.
By using advanced machine learning algorithms and techniques, AutoML systems can automatically discover the best models and configurations for a given dataset, thus reducing the time and effort required to develop machine learning models.
1. Getting Started with TPOT
TPOT (Tree-based Pipeline Optimization Tool) is one of the simplest and most popular AutoML tools. It uses genetic programming to optimize machine learning pipelines, automatically exploring hundreds of candidate pipelines to identify the most effective model for a given dataset.
You can install TPOT using the following command on your system.
pip install tpot==0.12.2
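To confirm the installation, you can check the package version from Python (TPOT exposes a standard __version__ attribute); this should print 0.12.2 if the pinned install succeeded:

import tpot
print(tpot.__version__)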
Import the necessary Python libraries to load and process the data and train the classification model.
import numpy as np
import pandas as pd
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
2. Loading the Data
For this tutorial, we are using the Mushroom Dataset from Kaggle, which contains 9 features for determining whether a mushroom is poisonous or not.
We will load the dataset using Pandas and randomly select 1000 samples from it.
data = pd.read_csv('mushroom_cleaned.csv')
data = data.sample(n=1000, random_state=55)
data.head()
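If you don't have the Kaggle CSV handy, the load_breast_cancer import shown earlier works as a drop-in substitute for experimentation. Here is a minimal sketch that builds an equivalent DataFrame; the rest of the tutorial still assumes the mushroom data:

# Optional: use a built-in scikit-learn dataset instead of the Kaggle CSV
cancer = load_breast_cancer(as_frame=True)
data = cancer.frame.rename(columns={'target': 'class'})  # reuse the 'class' column name
data = data.sample(n=500, random_state=55)
data.head()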
3. Data Processing
The "class" column is our target variable. It contains two values, 0 or 1, where 0 refers to non-poisonous and 1 refers to poisonous. We will use it to create the independent and dependent variables, and then split them into train and test datasets.
X = data.drop('class', axis=1)
y = data['class'].values

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=55)
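Before training, it's worth confirming that both classes are reasonably represented in the split; a quick check, assuming the split above:

# Check the class balance of the train and test targets
print(pd.Series(y_train).value_counts(normalize=True))
print(pd.Series(y_test).value_counts(normalize=True))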
4. Building and Fitting the TPOT Classifier
We will initialize the TPOT classifier and train it on the training set. The model will experiment with various algorithms and techniques and return the best-performing model and pipeline.
# Initialize TPOTClassifier
tpot = TPOTClassifier(verbosity=2, generations=5, population_size=20, random_state=55)

# Fit the classifier to the training data
tpot.fit(X_train, y_train)
TPOT prints the best cross-validation score for each generation and, at the end, the best pipeline it found.
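You can also inspect the winning pipeline directly; TPOT stores it in the fitted_pipeline_ attribute as a regular scikit-learn Pipeline object:

# Inspect the best pipeline found by the genetic search
print(tpot.fitted_pipeline_)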
Let's evaluate our best pipeline on the test dataset using the .score function.
# Evaluate the model on the test set
print(tpot.score(X_test, y_test))
I think we have a pretty stable and accurate model.
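If you want more detail than a single accuracy number, you can pass the pipeline's predictions to scikit-learn's classification_report; a small sketch reusing the test split from earlier:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the best pipeline
y_pred = tpot.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['non-poisonous', 'poisonous']))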
5. Saving the TPOT Pipeline and Model
To save the TPOT pipeline, we will use the .export function and provide it with a file name with the .py extension.
tpot.export('tpot_mushroom_pipeline.py')
The file will be saved as a Python script containing the code for the best pipeline. To run the pipeline, you have to make a few changes to the dataset path, separator, and target column name, as shown after the exported script below.
tpot_mushroom_pipeline.py:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=55)

# Average CV score on the training set was: 0.8800000000000001
exported_pipeline = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(criterion="entropy", max_features=0.9000000000000001, n_estimators=100), threshold=0.1),
    ExtraTreesClassifier(bootstrap=False, criterion="gini", max_features=0.9500000000000001, min_samples_leaf=4, min_samples_split=2, n_estimators=100)
)

# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 55)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
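For our mushroom dataset, the changes mentioned above would look roughly like this, assuming mushroom_cleaned.csv sits in the working directory and that you keep 'class' as the target column instead of renaming it to 'target':

# Adjusted data-loading lines for the mushroom dataset
tpot_data = pd.read_csv('mushroom_cleaned.csv', sep=',', dtype=np.float64)
features = tpot_data.drop('class', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['class'], random_state=55)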
You can also save the model using the joblib library as a pickle file. This file stores the fitted pipeline with its learned parameters, so you can load it later for model inference.
import joblib

joblib.dump(tpot.fitted_pipeline_, 'tpot_mushroom_pipeline.pkl')
6. Loading the TPOT Pipeline and Model Inference
We will load the saved model using the joblib.load function and predict the labels of the first 10 samples from the testing dataset.
model = joblib.load('tpot_mushroom_pipeline.pkl')

print(y_test[0:10])
print(model.predict(X_test[0:10]))
Our model is accurate, as the actual labels match the predicted labels.
[1 1 1 1 1 1 0 1 0 1]
[1 1 1 1 1 1 0 1 0 1]
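Beyond eyeballing ten samples, you can score the reloaded pipeline on the full test set with scikit-learn's accuracy_score; a quick sketch reusing the earlier split:

from sklearn.metrics import accuracy_score

# Accuracy of the reloaded pipeline on the full test set
print(accuracy_score(y_test, model.predict(X_test)))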
Summary
In this tutorial, we have learned about AutoML and how it can be used by anyone, even non-technical users. We have also learned to use TPOT, an AutoML Python tool that automatically performs data processing, feature selection, model selection, hyperparameter tuning, model ensembling, and model evaluation. At the end of model training, we get the best-performing model and the pipeline by running two lines of code. We can even save the model and use it to build an AI application.