Introduction
Decision trees are considered a fundamental tool in machine learning. They provide logical insights into complex datasets. A decision tree is a non-parametric supervised learning algorithm used for both classification and regression problems. It has a hierarchical tree structure with a root node, branches, internal nodes, and leaf nodes.
Decision tree terminology
Before going deeper into the topic of decision trees, let’s familiarize ourselves with the following terms:
Root node: The root node is the beginning point of a decision tree where the whole dataset starts to divide based on different features present in the dataset.
Decision nodes: Nodes with child nodes represent a decision to be made. The root node, if it has children, is also a decision node.
Leaf nodes: Nodes that indicate the final categorization or result when additional splitting is not possible. Terminal nodes are another name for leaf nodes.
Branches or subtrees: A branch or subtree is a section of the larger tree. It represents a particular path of decisions and outcomes within the tree.
Pruning: The practice of removing particular nodes from a decision tree to simplify the model and avoid overfitting (a minimal scikit-learn sketch follows this list).
Parent and child nodes: In a decision tree, a node that can be divided is called a parent node, and nodes that emerge from it are called its child nodes. The parent node represents a decision or circumstance, and the child nodes represent possible outcomes or additional decisions based on that situation.
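To make pruning concrete, here is a minimal sketch, not part of the original example, showing two common ways scikit-learn limits tree growth: a max_depth cap and cost-complexity pruning via ccp_alpha. The parameter values (max_depth=3, ccp_alpha=0.02) are placeholders chosen for illustration.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train) # Unrestricted tree: grows until leaves are pure, prone to overfitting
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train) # Pre-pruning: stop growth early with a depth cap
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42).fit(X_train, y_train) # Post-pruning: cost-complexity pruning removes weak branches

for name, model in [("full", full_tree), ("max_depth=3", shallow_tree), ("ccp_alpha=0.02", pruned_tree)]:
    print(name, "leaves:", model.get_n_leaves(), "test accuracy:", round(model.score(X_test, y_test), 3)) # Fewer leaves usually means a simpler, more interpretable model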
Decision trees in scikit-learn
Now that we have covered the basic concepts, let’s explore how decision trees work in practice with scikit-learn’s help.
The dataset
We use the wine dataset, a classic for multi-class classification. Let’s explore the dataset:
import pandas as pd
from sklearn.datasets import load_wine #loading the dataset from sklearn
data = load_wine() # Loading dataset
wine = pd.DataFrame(data['data'], columns = data['feature_names']) # Converting the data to a DataFrame for easier viewing
wine['target'] = pd.Series(data['target'], name = 'target_values') # Adding the target column
pd.set_option("display.max_rows", None, "display.max_columns", None) # Configuring pandas to show all columns
print(wine.head()) #printing the top 5 records from the dataset
print("Total number of observations: {}".format(len(wine)))
Output:
The target
Let’s explore the target values to find how many classes we have in this dataset:
print(wine['target'].head()) #first five observations of target variable.
shuffled = wine.sample(frac=1, random_state=1).reset_index() # Shuffling the dataset so the class labels are mixed rather than ordered by class
print(shuffled['target'].head())
Output:
We now observe three classes: 0, 1, and 2.
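To see how many observations fall into each class, a quick pandas check (using the wine DataFrame created above) can be run:
print(wine['target'].value_counts()) # Number of observations per class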
Properties of the wine dataset
The wine dataset has the following properties (according to the official scikit-learn documentation):
Classes: 3
Samples per class: 59 (class_0), 71 (class_1), and 48 (class_2)
Samples total: 178
Dimensionality: 13 numeric features
Features: real, positive values
Step-by-step guide to decision trees
Let’s break down the decision tree algorithm into simple steps for the wine dataset.
We will predict the wine class based on its given features. The root node represents all the instances in the dataset. At the root, we have the color_intensity feature. Based on the decision taken at the root, the algorithm follows the corresponding branch and advances to the next node. At this level, we have two different features, proline and flavanoids. At each node, the algorithm compares the instance’s attribute value against the node’s split condition to decide which sub-node to move to next. It keeps doing this until it reaches a leaf node of the tree.
The following algorithm can help you better understand the entire process:
Begin with the root node:
The root node symbolizes the whole dataset of wines — this is where the algorithm starts.
Find the best attribute:
We have several wine characteristics, such as acidity, alcohol percentage, and color intensity. The algorithm must determine which of these is most useful for dividing the wines into their appropriate groups, that is, the wine varieties. We determine the best attribute to split the dataset on using attribute selection measures (ASM) like information gain or the Gini index. The chosen attribute should maximize the information gain or minimize the impurity. (A short sketch after this list shows how to inspect which attribute a fitted tree actually picks at the root.)
Attribute selection measure:
The primary problem while implementing a decision tree is figuring out which attribute is ideal for the root node and its child nodes. Attribute selection measures (ASM) address this issue and let us quickly choose the best attribute for each node of the tree. Two widely used measures are:
Information gain (IG): This measures the effectiveness of a particular attribute in classifying data. It quantifies the reduction in entropy or uncertainty about the classification of data points after splitting them based on the attribute.
Gini index (GI): This measures the impurity or homogeneity of a dataset. It is the likelihood that a randomly selected element in the dataset would be erroneously classified if its label were assigned at random according to the distribution of labels in the dataset.
Divide the dataset:
The algorithm separates the dataset into smaller subsets, each comprising wines with comparable qualities based on the selected attribute’s possible values.
Generate decision tree nodes:
The algorithm adds a new node to the tree at each stage to represent the selected attribute. These nodes act as decision points that direct the algorithm to the next stage.
Recursive tree building:
The algorithm repeats this process recursively, adding new branches and nodes, until it cannot divide the dataset any further. These final nodes, the leaf nodes, stand for the anticipated wine categories.
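To connect these steps to scikit-learn, here is a small sketch (using the wine dataset and a default DecisionTreeClassifier) that inspects which attribute the fitted tree selected at the root node and the threshold it uses for that first split:
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()
clf = DecisionTreeClassifier(random_state=42).fit(wine.data, wine.target)

root_feature = clf.tree_.feature[0] # Index of the attribute chosen at the root
root_threshold = clf.tree_.threshold[0] # Threshold used for the root split
print("Root split: {} <= {:.3f}".format(wine.feature_names[root_feature], root_threshold))
print("Samples reaching the root:", clf.tree_.n_node_samples[0]) # The root always sees the whole dataset
Depending on the criterion and random_state, the attribute reported here may differ from the color_intensity example described above.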
Implementation
Let’s apply this algorithm to the wine dataset, which contains attributes of wine samples categorized into three classes [class_0, class_1, class_2]. We’ll use Python’s scikit-learn library for implementing the decision tree classifier. The decision rule for classifying wines into particular classes using decision trees is determined based on the attribute values of the wine characteristics. For example, a decision rule could be that wines with certain levels of acidity, alcohol percentage, and color intensity belong to class_0, while wines with different attribute values belong to class_1 or class_2. These decision rules are learned during the training process of the decision tree algorithm based on the patterns and relationships found in the dataset.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

wine = load_wine() # Loading the wine dataset into a variable called wine
X = wine.data #independent variable
y = wine.target #dependent variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Splitting the dataset into training and testing sets
clf = DecisionTreeClassifier() # Initialize the decision tree classifier
clf.fit(X_train, y_train) # Fitting the classifier on the training data
# Plot the decision tree
plt.figure(figsize=(15, 10))
plot_tree(clf, filled=True, feature_names=wine.feature_names, class_names=wine.target_names)
plt.savefig("output/plot.png", bbox_inches='tight')
plt.show()
Output:
Interpretation
The decision tree model classifies instances into different classes based on the selected attributes and decision rules learned during training. At each node of the tree, the model evaluates the value of a specific attribute and decides to split the data into two or more subsets. This splitting continues recursively until the algorithm determines that further division is not beneficial or until certain stopping criteria are met. Each leaf node represents a final classification or outcome, indicating the predicted class for instances that reach that node.
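As a quick sketch of how these leaf-node predictions are used (assuming the clf, X_test, and y_test objects from the implementation above; the exact accuracy depends on the train/test split and is not part of the original example):
from sklearn.metrics import accuracy_score

y_pred = clf.predict(X_test) # Each test sample is routed from the root to a leaf, whose class becomes the prediction
print("Predicted classes for the first five test samples:", y_pred[:5])
print("Test accuracy: {:.2f}".format(accuracy_score(y_test, y_pred)))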
Information gain (IG) or Gini index
Information gain (IG) and the Gini index play crucial roles in the decision-making process of the decision tree algorithm. IG measures the effectiveness of a particular attribute in classifying data by quantifying the reduction in entropy (uncertainty) about the classification of data points after splitting them on that attribute. The Gini index, on the other hand, measures the impurity or homogeneity of a dataset: the likelihood that a randomly selected element would be erroneously classified if its label were assigned at random according to the distribution of labels in the dataset. These metrics help the algorithm decide which attribute to split on at each node, aiming to maximize the information gain or minimize the impurity of the resulting subsets.
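For concreteness, here is a minimal sketch of how entropy, information gain, and the Gini index are computed for a candidate split. The label arrays are hypothetical and are not taken from the wine data:
import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G(S) = 1 - sum(p_i^2)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

parent = np.array([0, 0, 0, 1, 1, 1, 2, 2]) # Hypothetical labels before the split
left, right = parent[:4], parent[4:] # Hypothetical left/right subsets after a split

weights = np.array([len(left), len(right)]) / len(parent)
info_gain = entropy(parent) - (weights[0] * entropy(left) + weights[1] * entropy(right)) # Weighted reduction in entropy
print("Entropy of parent: {:.3f}".format(entropy(parent)))
print("Information gain of this split: {:.3f}".format(info_gain))
print("Gini of parent: {:.3f}, left: {:.3f}, right: {:.3f}".format(gini(parent), gini(left), gini(right)))
In scikit-learn, this choice is exposed through the criterion parameter of DecisionTreeClassifier, which defaults to 'gini' and can be set to 'entropy' to use information gain.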
Decision rule
The decision tree algorithm selects the attribute with the highest IG or lowest Gini index at each node to make splitting decisions. This process involves evaluating all available attributes and calculating their IG or Gini index. The highest IG or lowest Gini index attribute is the best attribute for splitting the dataset at that node. By selecting attributes that maximize IG or minimize impurity, the algorithm aims to create subsets that are as pure and informative as possible, facilitating accurate classification. This iterative process helps the decision tree algorithm learn decision rules that effectively partition the data and classify instances into the correct classes based on the attributes’ values.
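The learned decision rules can also be printed directly from a fitted tree. The following sketch assumes the clf and wine objects from the implementation section and uses scikit-learn’s export_text helper; the depth limit is only there to keep the printout readable:
from sklearn.tree import export_text

rules = export_text(clf, feature_names=list(wine.feature_names), max_depth=2) # Text version of the learned decision rules
print(rules)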
Advantages of decision trees
Simplicity: Decision trees are easy to comprehend because they closely resemble how humans make decisions. Even non-experts can use them because of this simplicity.
Flexible problem-solving: Decision trees are adaptable tools that can be applied to a wide range of decision-related problems across industries, including healthcare and finance.
Easy outcome analysis: Decision trees make it possible to methodically investigate every conceivable outcome of a given situation and its ramifications.
Less data cleaning: Decision trees usually require less preprocessing and data cleaning than other machine learning algorithms, saving time and effort in the data preparation process.
Disadvantages of decision trees
Layering complexity: Decision trees can branch out into many levels as they grow larger. This complexity can make the model’s judgments difficult to understand and interpret.
Risk of overfitting: Decision trees can overfit, picking up noise or unimportant patterns in the training set, which impairs their ability to generalize to new data. This problem can be lessened using strategies like random forests, which combine several decision trees.
Computational complexity: Working with datasets with many class labels can lead to computationally expensive decision trees. This complexity can impact training and prediction times, requiring additional computational resources.
Conclusion
In this blog, we delved into the scikit-learn library to create and understand decision trees, highlighting their logical structure and effectiveness in solving classification problems. Decision trees offer clear insights and are adaptable and straightforward, making them valuable across various industries. Despite their benefits, challenges like overfitting and computational complexity need careful management. By working through a practical example using the wine dataset, we demonstrated how to implement and interpret decision trees with scikit-learn, showcasing their practical application in multi-class classification.