Demystifying Decision Trees for the Real World


Image by Author

Decision trees break down complex decisions into simple, easy-to-follow steps, much like the way people reason through a problem.

In data science, these powerful tools are widely used to analyze data and guide decision-making.

In this article, I will walk through how decision trees work, show real-world examples, and share some tips for improving them.

 

Structure of Decision Trees

 

Fundamentally, decision trees are simple and transparent tools. They break complex choices down into a sequence of simpler ones, mirroring how people make decisions. Let's look at the main elements that make up a decision tree.

 

Nodes, Branches, and Leaves

A decision tree is built from three basic components: nodes, branches, and leaves. Each plays an essential role in the decision-making process.

  • Nodes: Nodes are decision points where the tree asks a question about the input data. The root node is the starting point and represents the entire dataset.
  • Branches: Branches connect nodes and represent the outcomes of a decision. Each branch corresponds to a possible value or result of a decision node.
  • Leaves: Leaves, also known as leaf nodes, are the ends of the decision tree. They represent the final decision or classification, with each leaf node providing a specific outcome or label.

 

Conceptual Example

Suppose you are deciding whether to go outside based on the weather. The root node would ask, “Is it raining?” If yes, one branch leads to “Take an umbrella.” If not, another branch might say, “Wear sunglasses.”

These structures make decision trees easy to interpret and visualize, so they are popular in various fields.
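To make this concrete, here is a minimal sketch of that weather decision written as plain Python conditionals. The function and flag names (what_to_bring, is_raining, is_sunny) are made up purely for illustration.

def what_to_bring(is_raining: bool, is_sunny: bool) -> str:
    # Root node: "Is it raining?"
    if is_raining:
        # Leaf: one possible final decision
        return "Take an umbrella"
    # Internal node: a second question on the "no" branch
    if is_sunny:
        return "Wear sunglasses"
    # Leaf for the remaining case
    return "Just head outside"

print(what_to_bring(is_raining=False, is_sunny=True))  # "Wear sunglasses"

Each if statement plays the role of a node, each return value is a leaf, and the paths between them are the branches.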

 

Real-World Example: The Loan Approval Adventure

Picture this: You’re a wizard at Gringotts Bank, deciding who gets a loan for their new broomstick.

  • Root Node: “Is their credit score magical?”
  • If yes → Branch to “Approve faster than you can say Quidditch!”
  • If no → Branch to “Check their goblin gold reserves.”
    • If high → “Approve, but keep an eye on them.”
    • If low → “Deny faster than a Nimbus 2000.”

Here is the code.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

# Toy loan-approval dataset
data = {
    'Credit_Score': [700, 650, 600, 580, 720],
    'Income': [50000, 45000, 40000, 38000, 52000],
    'Approved': ['Yes', 'No', 'No', 'No', 'Yes']
}

df = pd.DataFrame(data)

# Features and target
X = df[['Credit_Score', 'Income']]
y = df['Approved']

# Train the decision tree classifier
clf = DecisionTreeClassifier()
clf = clf.fit(X, y)

# Visualize the trained tree
plt.figure(figsize=(10, 8))
tree.plot_tree(clf, feature_names=['Credit_Score', 'Income'], class_names=['No', 'Yes'], filled=True)
plt.show()

 

Here is the output.

Structure of Decision Trees in Machine Learning

When you run this spell, you’ll see a tree appear! It’s like the Marauder’s Map of loan approvals:

  • The root node splits on Credit_Score
  • If it’s ≤ 675, we venture left
  • If it’s > 675, we journey right
  • The leaves show our final decisions: “Yes” for approved, “No” for denied

Voila! You’ve just created a decision-making crystal ball!

Mind Bender: If your life were a decision tree, what would be the root node question? “Did I have coffee this morning?” might lead to some interesting branches!

 

Decision Trees: Behind the Branches

 

Decision trees work like a flowchart or tree structure made up of a series of decision points. They are built by repeatedly splitting a dataset into smaller subsets, growing the tree as they go. Let's look at how these trees handle data splitting and different types of variables.

 

Splitting Criteria: Gini Impurity and Information Gain

Choosing the best attribute to split the data on is the central task when building a decision tree. Criteria such as Gini Impurity and Information Gain are used to decide which split is best.

  • Gini Impurity: Picture yourself in the middle of a guessing game. If you randomly picked a label, how often would you be wrong? That’s what Gini Impurity measures. A lower Gini impurity means better guesses and a happier tree.
  • Information Gain: Think of this as the “aha!” moment in a mystery story. It measures how much a clue (attribute) helps solve the case. A bigger “aha!” means more gain, which means an ecstatic tree!

For example, to predict whether a customer will buy a product, you might start with basic information like age, income, and purchase history. The algorithm considers each of these attributes and picks the split that best separates the buyers from the non-buyers.
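To make these two measures concrete, here is a rough, hand-rolled sketch that computes Gini impurity and information gain for a toy “will they buy?” split. The labels and the split itself (income above some threshold) are invented for the example.

import numpy as np

def gini(labels):
    # Gini impurity: chance of mislabeling a randomly picked sample
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy data: did the customer buy? (invented values)
parent = np.array(["No", "No", "No", "Yes", "Yes", "Yes"])
left   = np.array(["No", "No", "No"])       # e.g. income below the threshold
right  = np.array(["Yes", "Yes", "Yes"])    # e.g. income above the threshold

# Information gain = parent entropy minus weighted child entropy
w_left = len(left) / len(parent)
w_right = len(right) / len(parent)
info_gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))

print(f"Parent Gini: {gini(parent):.3f}")                        # 0.500 (perfectly mixed)
print(f"Child Gini: {gini(left):.3f}, {gini(right):.3f}")        # 0.000 each (pure)
print(f"Information gain: {info_gain:.3f}")                      # 1.000 (a perfect split)

A split that produces purer children lowers the impurity and raises the information gain, which is exactly what the tree is hunting for.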

 

Handling Continuous and Categorical Data

There’s no type of data our tree detectives can’t investigate.

For continuous features, like age or income, the tree sets up a speed trap: “Anyone over 30, this way!”

For categorical data, like gender or product type, it’s more of a lineup: “Smartphones stand on the left; laptops on the right!”
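Here is a small sketch of both ideas using an invented dataset: a numeric Age column that the tree can threshold directly, and a categorical Device column encoded as integer codes before fitting (scikit-learn’s trees expect numeric input).

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented example data
df = pd.DataFrame({
    'Age': [22, 35, 41, 28, 53, 19],                      # continuous: split by threshold
    'Device': ['Smartphone', 'Laptop', 'Laptop',
               'Smartphone', 'Laptop', 'Smartphone'],      # categorical: needs encoding
    'Bought': ['No', 'Yes', 'Yes', 'No', 'Yes', 'No']
})

# Encode the categorical column as integer codes
df['Device_code'] = df['Device'].astype('category').cat.codes

clf = DecisionTreeClassifier(random_state=42)
clf.fit(df[['Age', 'Device_code']], df['Bought'])

# The printed rules show threshold-style splits such as "Age <= 31.50"
print(export_text(clf, feature_names=['Age', 'Device_code']))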

 

Real-World Cold Case: The Customer Purchase Predictor

To better understand how decision trees work, let’s look at a real-world example: using a customer’s age and income to predict whether they will buy a product.

We’ll build a small dataset and train a decision tree to make that prediction.

A description of the code:

  • Import Libraries: We import pandas to work with the data, DecisionTreeClassifier from scikit-learn to build the tree, and matplotlib to show the results.
  • Create Dataset: A sample dataset is created with age, income, and purchase status.
  • Prepare Features and Target: The features (Age, Income) and the target variable (Purchased) are set up.
  • Train the Model: The decision tree classifier is initialized and trained on the data.
  • Visualize the Tree: Finally, we plot the decision tree to see how decisions are made.

Here is the code.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import matplotlib.pyplot as plt

data = {
    'Age': [25, 45, 35, 50, 23],
    'Income': [50000, 100000, 75000, 120000, 60000],
    'Purchased': ['No', 'Yes', 'No', 'Yes', 'No']
}

df = pd.DataFrame(data)

X = df[['Age', 'Income']]
y = df['Purchased']

clf = DecisionTreeClassifier()
clf = clf.fit(X, y)

plt.figure(figsize=(10, 8))
tree.plot_tree(clf, feature_names=['Age', 'Income'], class_names=['No', 'Yes'], filled=True)
plt.show()

 

Here is the output.

Behind the Branches of Decision Trees in Machine Learning

The final decision tree will show how the tree splits up based on age and income to figure out if a customer is likely to buy a product. Each node is a decision point, and the branches show different outcomes. The final decision is shown by the leaf nodes.

Now, let’s look at how decision trees are used in a real-world data science project!

 

Real-World Applications

 

Real World Applications for Decision Trees

This project is designed as a take-home assignment for Meta (Facebook) data science positions. The objective is to build a classification algorithm that predicts whether a movie on Rotten Tomatoes is labeled ‘Rotten’, ‘Fresh’, or ‘Certified Fresh.’

Here is the link to this project: https://platform.stratascratch.com/data-projects/rotten-tomatoes-movies-rating-prediction

Now, let’s break down the solution into codeable steps.

 

Step-by-Step Solution

  1. Data Preparation: We will merge the two datasets on the rotten_tomatoes_link column. This will give us a comprehensive dataset with movie information and critic reviews.
  2. Feature Selection and Engineering: We will select relevant features and perform necessary transformations. This includes converting categorical variables to numerical ones, handling missing values, and normalizing the feature values.
  3. Model Training: We will train a decision tree classifier on the processed dataset and use cross-validation to evaluate the model’s robustness.
  4. Evaluation: Finally, we will evaluate the model’s performance using metrics like accuracy, precision, recall, and F1-score.

Here is the code.

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

movies_df = pd.read_csv('rotten_tomatoes_movies.csv')
reviews_df = pd.read_csv('rotten_tomatoes_critic_reviews_50k.csv')

merged_df = pd.merge(movies_df, reviews_df, on='rotten_tomatoes_link')

features = ['content_rating', 'genres', 'directors', 'runtime', 'tomatometer_rating', 'audience_rating']
target = 'tomatometer_status'

merged_df['content_rating'] = merged_df['content_rating'].astype('category').cat.codes
merged_df['genres'] = merged_df['genres'].astype('category').cat.codes
merged_df['directors'] = merged_df['directors'].astype('category').cat.codes

merged_df = merged_df.dropna(subset=features + [target])

X = merged_df[features]
y_cat = merged_df[target].astype('category')
y = y_cat.cat.codes

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(max_depth=10, min_samples_split=10, min_samples_leaf=5)
scores = cross_val_score(clf, X_train, y_train, cv=5)
print("Cross-validation scores:", scores)
print("Average cross-validation score:", scores.mean())

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

# Use the encoded category order so the report labels match the integer codes
classification_report_output = classification_report(y_test, y_pred, target_names=list(y_cat.cat.categories))
print(classification_report_output)

 

Here is the output.

Real World Applications for Decision Trees

The model shows high accuracy and F1 scores across the classes, indicating good performance. Let’s see the key takeaways.

Key Takeaways

  1. Feature selection is crucial for model performance. Content rating, genres, directors, runtime, and ratings proved to be valuable predictors.
  2. A decision tree classifier effectively captures complex relationships in movie data.
  3. Cross-validation ensures model reliability across different data subsets.
  4. High performance in the “Certified-Fresh” class warrants further investigation into potential class imbalance.
  5. The model shows promise for real-world application in predicting movie ratings and enhancing user experience on platforms like Rotten Tomatoes.

 

Enhancing Decision Trees: Turning Your Sapling into a Mighty Oak

 

So, you’ve grown your first decision tree. Impressive! But why stop there? Let’s turn that sapling into a forest giant that would make even Groot jealous. Ready to beef up your tree? Let’s dive in!

 

Pruning Techniques

Pruning is a technique for reducing the size of a decision tree by removing parts that contribute little to predicting the target variable. Its main purpose is to reduce overfitting.

  • Pre-pruning: Often referred to as early stopping, this halts the tree’s growth early. Parameters such as maximum depth (max_depth), minimum samples required to split a node (min_samples_split), and minimum samples required at a leaf node (min_samples_leaf) are specified before training. This keeps the tree from becoming overly complex.
  • Post-pruning: This method grows the tree to its full depth and then removes nodes that add little predictive power. Post-pruning is more computationally expensive than pre-pruning, but it can be more effective. Both approaches are sketched in the code right after this list.
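Here is a brief sketch of both approaches on a generic scikit-learn dataset; the pre-pruning values and the ccp_alpha picked for post-pruning are arbitrary choices for illustration, not tuned settings.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: cap the tree's growth up front
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_split=10,
                                    min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow fully, then prune via cost-complexity pruning (ccp_alpha)
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
mid_alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # an arbitrary middle value for the demo
post_pruned = DecisionTreeClassifier(ccp_alpha=mid_alpha, random_state=42)
post_pruned.fit(X_train, y_train)

print("Pre-pruned test accuracy: ", pre_pruned.score(X_test, y_test))
print("Post-pruned test accuracy:", post_pruned.score(X_test, y_test))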

 

Ensemble Methods

Ensemble techniques combine several models to achieve better performance than any single model could on its own. The two main types of ensemble techniques used with decision trees are bagging and boosting.

  • Bagging (Bootstrap Aggregating): This method trains several decision trees on different subsets of the data (created by sampling with replacement) and then averages their predictions. Random Forest is a widely used bagging technique; it reduces variance and helps prevent overfitting. Check out “Decision Tree and Random Forest Algorithm” for a deeper look at the decision tree algorithm and its extension, the random forest algorithm.
  • Boosting: Boosting builds trees sequentially, with each tree trying to correct the mistakes of the previous one. Algorithms such as AdaBoost and Gradient Boosting take this approach. By focusing on hard-to-predict examples, they often produce more accurate models. A quick comparison of both approaches appears in the sketch after this list.
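As a quick illustration, the sketch below compares a single tree with bagging (Random Forest) and boosting (Gradient Boosting) on a synthetic dataset; the hyperparameters are defaults or arbitrary picks, not tuned values.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data just for the comparison
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "Single tree": DecisionTreeClassifier(random_state=42),
    "Bagging (Random Forest)": RandomForestClassifier(n_estimators=200, random_state=42),
    "Boosting (Gradient Boosting)": GradientBoostingClassifier(random_state=42),
}

# Cross-validated accuracy for each model
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")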

 

Hyperparameter Tuning

Hyperparameter tuning is the process of finding the hyperparameter settings that give a decision tree model its best performance. This is typically done with methods like Grid Search or Random Search, which evaluate many combinations of hyperparameters to identify the best configuration.
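For example, here is a minimal Grid Search sketch over a few common decision tree hyperparameters; the grid values are illustrative rather than a recommended set.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# A small, illustrative grid of hyperparameters to search over
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy'],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)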

 

Conclusion

 

In this article, we’ve discussed the structure, working mechanism, real-world applications, and methods for enhancing decision tree performance.

Practicing decision trees is crucial to mastering their use and understanding their nuances. Working on real-world data projects can also provide valuable experience and improve problem-solving skills.

 
 

Nate Rosidi is a data scientist and works in product strategy. He’s also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.


