We will use a synthetic dataset for this exercise. The dataset contains the following columns:
- CustomerID: A unique identifier for each customer.
- Age: The age of the customer.
- MonthlyCharge: The monthly bill amount for the customer.
- CustomerServiceCalls: The number of times the customer contacted customer service.
- Churn: The target variable, indicating whether the customer churned (Yes) or not (No).
Supervised Learning Code
Predicting churn from labeled examples is a supervised learning task. Below is the Python code to set it up and execute it:
import pandas as pd
import matplotlib.pyplot as plt
import warnings
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree

warnings.filterwarnings('ignore')
We create a synthetic dataset:
data = {
'CustomerID': range(1, 101),
'Age': [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]*10,
'MonthlyCharge': [50, 60, 70, 80, 90, 100, 110, 120, 130, 140]*10,
'CustomerServiceCalls': [1, 2, 3, 4, 0, 1, 2, 3, 4, 0]*10,
'Churn': ['No', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes']*10
}
df = pd.DataFrame(data)
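Before modeling, it is worth sanity-checking the synthetic data. A minimal sketch using plain Python on the same repeated lists (no pandas required) confirms the row count and that the churn classes are evenly balanced:

```python
# Rebuild the repeating Churn pattern used in the synthetic dataset
churn = ['No', 'No', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes'] * 10

n_rows = len(churn)          # total rows
n_yes = churn.count('Yes')   # churned customers
n_no = churn.count('No')     # retained customers
print(n_rows, n_yes, n_no)   # 100 50 50
```

A balanced target like this means a trivial majority-class guesser would score only 50% accuracy, giving a useful baseline for the model.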
We split the data into features (X) and the target variable (y):
X = df[['Age', 'MonthlyCharge', 'CustomerServiceCalls']]
y = df['Churn']
We then split the dataset into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
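With `test_size=0.3` on 100 rows, the split holds out 30 rows for testing and leaves 70 for training. A quick sketch of the arithmetic (scikit-learn rounds the test count up when the fraction is not exact):

```python
import math

n_samples = 100   # rows in the synthetic dataset
test_size = 0.3   # fraction held out for evaluation

n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
print(n_train, n_test)  # 70 30
```

The `random_state=42` argument fixes the shuffle seed, so the same rows land in the same split on every run.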
We use Scikit-learn to create and train a DecisionTreeClassifier:
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
We make predictions on the test set and calculate the accuracy of the model:
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Model Accuracy: {accuracy}')
Using Matplotlib, we visualize how the decision tree makes decisions:
plt.figure(figsize=(12,8))
tree.plot_tree(clf, filled=True, feature_names=['Age', 'MonthlyCharge', 'CustomerServiceCalls'], class_names=['No Churn', 'Churn'])
plt.title('Decision Tree for Predicting Customer Churn')
plt.show()
Model Accuracy
The accuracy score tells us the fraction of test customers the model classified correctly. Because our synthetic dataset follows a simple repeating pattern, the score here may not reflect how the model would perform on real-world data.
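Accuracy is simply correct predictions divided by total predictions. A small sketch with hypothetical label lists (standing in for `y_test` and `y_pred`) shows the calculation `accuracy_score` performs:

```python
# Hypothetical true and predicted labels for five test customers
y_true = ['No', 'Yes', 'No', 'No', 'Yes']
y_hat  = ['No', 'Yes', 'Yes', 'No', 'Yes']

# Count matching positions, then divide by the total
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)
print(accuracy)  # 0.8 — four of five predictions match
```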
Decision Tree Interpretation
The decision tree visualization helps us understand the rules used by the model to make predictions. For example, it might show that customers with a high number of service calls and high monthly charges are more likely to churn.
- Gini: A measure of impurity. Lower values indicate higher purity.
- Samples: The number of samples reaching the node.
- Value: The distribution of samples in different classes at the node.
- Class: The predicted class at the node.
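The Gini value shown at each node can be computed by hand: it is 1 minus the sum of the squared class proportions. A short sketch, using hypothetical node counts for illustration:

```python
def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions at a node."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical node with 30 'No' and 10 'Yes' samples
print(gini([30, 10]))  # 0.375 — mixed node
print(gini([40, 0]))   # 0.0   — pure node, all one class
```

A pure node scores 0, and for two classes the worst case is 0.5 (a 50/50 split); the tree chooses splits that drive these values down.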
Results
In this exercise, we built a decision tree model to predict customer churn for a telecom company using AWS SageMaker. We generated synthetic data, trained the model, evaluated its performance, and visualized the decision tree.
Final Steps
After completing the exercise, remember to delete the notebook instance from AWS SageMaker to avoid unnecessary charges.
Summary
By following the steps outlined in this blog post, you have gained hands-on experience in building and evaluating a decision tree model for churn prediction. This approach provides actionable insights that can help businesses retain customers and improve their services.
For further details, you can check out the complete project documentation, demonstration video, and the source code available on GitHub.
🎥 Watch the demonstration video: https://youtu.be/fNhWejM7EqY
📂 Check out the GitHub documentation: https://github.com/Pratik-Khose/AWS-Machine-Mearning-projects
Let’s connect and discuss more about decision trees, customer churn prediction, and their applications in various industries. Always eager to learn and collaborate on innovative projects! 🌟
#MachineLearning #DecisionTree #CustomerChurn #AWS #SageMaker #DataScience #Telecom #PredictiveModeling