You have got a massive dataset, and now you are wondering how to make sense of it.
You are also thinking about the ways you can break that data into some logical structure.
You can always create a logical tree (you can also call it a KPI Tree) manually, but what if we say that there is a machine learning algorithm that can actually take care of it on its own?
Alright! So first, let’s show how you might create a KPI Tree out of a given dataset manually.
We are looking at sample data of some customers who visit an e-commerce website on a frequent basis. Using some of their prominent web-navigation attributes (the device category they use, the browser they use, the channel they visited the website from, and the geo-network region they accessed the website from), we want to figure out whether they will convert or not.
I have created sample data of 100 records (customers who visited the website, along with their web-navigation / behavioural attributes) :-
import pandas as pd
import random
# Sample data for each column
device_categories = ['Desktop', 'Android Mobile', 'Apple Mobile', 'Tablet', 'Laptop']
browsers = ['Chrome', 'Opera', 'Safari', 'Firefox', 'Edge']
visit_sources = ['Paid Social', 'Referral', 'Organic Search', 'Direct', 'Email Campaign']
geonetwork_regions = ['Mumbai', 'Delhi', 'Bangalore', 'Hyderabad', 'Chennai', 'Kolkata', 'Pune', 'Ahmedabad', 'Surat', 'Jaipur']
conversion_status = [0, 1]
# Generate random data
data = {
    'device_category': [random.choice(device_categories) for _ in range(100)],
    'browser': [random.choice(browsers) for _ in range(100)],
    'visit_source': [random.choice(visit_sources) for _ in range(100)],
    'geonetwork_region': [random.choice(geonetwork_regions) for _ in range(100)],
    'will_convert': [random.choice(conversion_status) for _ in range(100)]
}
# Create DataFrame
df = pd.DataFrame(data)
# Display the records
df
Here’s what the data might typically look like :-
Now, the moment we say “whether they will convert or not”, it naturally becomes a classification / prediction problem.
However, the idea of a KPI Tree that I got used to in my previous organization requires some sort of metric to be maintained across the different nodes of the tree. This metric stays the same at every node; what varies is, obviously, its value.
Now, I understand that we are diverging from the idea of a quintessential Decision Tree Classifier, which gives the predicted class for a particular record in your dataset at the leaf nodes, and which settles on the best logical structure to break the data into by looking at measures such as entropy / information gain.
Yea – we will come to that.
But for now, let’s come back to the idea of a KPI Tree. It is more of a logical structure that simply dissects your data into the format of a typical decision tree, but the order / hierarchy of the nodes can be arbitrary and depends purely on which dimension you want to analyse at which level, based on your business requirements.
For instance, looking at the e-commerce example I mentioned above, the Data Scientist / Analyst (or anyone else, really) might want to have geo-network region as the first level of the KPI Tree, followed by device category, followed by browser, followed by visit source, etc.
This may spin up a KPI Tree that looks something like the one below :-
Ok, we haven’t fixed a KPI yet. Let’s call it Conversion Rate(%).
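To pin the KPI down, here is a minimal sketch of how Conversion Rate(%) could be computed, assuming it means the share of records with will_convert equal to 1, expressed as a percentage. The tiny hand-made DataFrame and the helper name conversion_rate are hypothetical and only there to keep the numbers easy to check by hand.

```python
import pandas as pd

# Hypothetical hand-made sample (not the random data generated above)
df = pd.DataFrame({
    'geonetwork_region': ['Mumbai', 'Mumbai', 'Delhi', 'Delhi', 'Delhi'],
    'will_convert': [1, 0, 1, 1, 0],
})

def conversion_rate(frame):
    """Assumed definition: percentage of rows whose will_convert flag is 1."""
    return frame['will_convert'].mean() * 100

# Same metric at the root of the tree and at the first level (per region)
overall = conversion_rate(df)
by_region = df.groupby('geonetwork_region')['will_convert'].mean() * 100

print(f"Overall: {overall:.1f}%")
print(by_region.round(1))
```

Because will_convert only holds 0s and 1s, its mean is the fraction of converting visitors, so the same one-liner works at any node of the tree.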
Looks good, aye?
So the KPI Tree shown below depicts the drill-down of Conversion Rate(%) observed for all the customers across the different geo-network regions, followed by the device categories, followed by the browsers, followed by the various visit sources.
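A drill-down like this could be built in code roughly as follows. This is only a sketch under assumptions: the helper names build_kpi_tree and print_tree are made up for illustration, the sample rows are hand-made, and only two levels are used to keep the output short.

```python
import pandas as pd

# Hypothetical hand-made sample reusing the article's column names
df = pd.DataFrame({
    'geonetwork_region': ['Mumbai', 'Mumbai', 'Mumbai', 'Delhi', 'Delhi', 'Delhi'],
    'device_category': ['Desktop', 'Desktop', 'Tablet', 'Desktop', 'Laptop', 'Laptop'],
    'will_convert': [1, 0, 1, 1, 0, 0],
})

def build_kpi_tree(frame, levels):
    """Nest the data by `levels`, in the given order, storing the KPI at every node."""
    node = {
        'visitors': len(frame),
        'conversion_rate': frame['will_convert'].mean() * 100,
        'children': {},
    }
    if levels:
        first, rest = levels[0], levels[1:]
        # One child per distinct value of the current level
        for value, subset in frame.groupby(first):
            node['children'][value] = build_kpi_tree(subset, rest)
    return node

def print_tree(node, label='All visitors', depth=0):
    """Print the tree with two spaces of indentation per level."""
    print(f"{'  ' * depth}{label}: {node['conversion_rate']:.1f}% "
          f"({node['visitors']} visitors)")
    for value, child in node['children'].items():
        print_tree(child, value, depth + 1)

# The order of this list is the business decision; reorder it for a different tree
tree = build_kpi_tree(df, ['geonetwork_region', 'device_category'])
print_tree(tree)
```

Swapping the two entries in the levels list gives a tree that starts from device category instead, which is exactly the freedom a manually defined KPI Tree offers.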
So the way you interpret it is: let’s say Mumbai has a Conversion Rate(%) of X%; within Mumbai, Desktop as a device category has a Conversion Rate(%) of Y%; within Desktop, Chrome has a Conversion Rate(%) of Z%; and within Chrome, Paid Social has a Conversion Rate(%) of P%.
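Reading one such path can be sketched as a sequence of progressively narrower filters. The helper drill_down and the eight sample rows below are hypothetical, chosen so that X, Y, Z and P can be verified by hand.

```python
import pandas as pd

# Hypothetical hand-made sample reusing the article's column names
df = pd.DataFrame({
    'geonetwork_region': ['Mumbai', 'Mumbai', 'Mumbai', 'Mumbai',
                          'Mumbai', 'Delhi', 'Delhi', 'Mumbai'],
    'device_category': ['Desktop', 'Desktop', 'Desktop', 'Desktop',
                        'Tablet', 'Desktop', 'Laptop', 'Desktop'],
    'browser': ['Chrome', 'Chrome', 'Chrome', 'Safari',
                'Chrome', 'Chrome', 'Edge', 'Chrome'],
    'visit_source': ['Paid Social', 'Paid Social', 'Direct', 'Direct',
                     'Referral', 'Paid Social', 'Direct', 'Paid Social'],
    'will_convert': [1, 0, 1, 0, 0, 1, 0, 1],
})

def drill_down(frame, path):
    """Return (column, value, visitors, Conversion Rate %) for each step of `path`."""
    steps = []
    subset = frame
    for column, value in path:
        # Each step keeps only the rows that also match the current level
        subset = subset[subset[column] == value]
        rate = subset['will_convert'].mean() * 100 if len(subset) else float('nan')
        steps.append((column, value, len(subset), rate))
    return steps

path = [
    ('geonetwork_region', 'Mumbai'),
    ('device_category', 'Desktop'),
    ('browser', 'Chrome'),
    ('visit_source', 'Paid Social'),
]
steps = drill_down(df, path)
for column, value, visitors, rate in steps:
    print(f"{value}: {rate:.1f}% ({visitors} visitors)")
```

On this sample, X is 50.0% (6 visitors), Y is 60.0% (5), Z is 75.0% (4) and P is 66.7% (3). Each rate is computed only over the visitors who already matched every level above it.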
Similarly, you could read it for any other geo-network region and, under it, any device category (Desktop again or a different one). It all depends on which path you want to traverse, as there are umpteen combinations / paths to follow.
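To put a number on "umpteen", the sketch below counts the possible root-to-leaf paths for the category lists used in the sample data and compares that with the paths that actually occur in 100 random records. The fixed seed and the variable names are assumptions added for repeatability; they are not part of the original data-generation code.

```python
import math
import random
import pandas as pd

# Category lists copied from the sample-data code, in the tree's level order
levels = {
    'geonetwork_region': ['Mumbai', 'Delhi', 'Bangalore', 'Hyderabad', 'Chennai',
                          'Kolkata', 'Pune', 'Ahmedabad', 'Surat', 'Jaipur'],
    'device_category': ['Desktop', 'Android Mobile', 'Apple Mobile', 'Tablet', 'Laptop'],
    'browser': ['Chrome', 'Opera', 'Safari', 'Firefox', 'Edge'],
    'visit_source': ['Paid Social', 'Referral', 'Organic Search', 'Direct', 'Email Campaign'],
}

# Every full path picks one value per level: 10 * 5 * 5 * 5
possible_paths = math.prod(len(values) for values in levels.values())

random.seed(0)  # assumption: seed added only so the sketch is repeatable
df = pd.DataFrame({
    column: [random.choice(values) for _ in range(100)]
    for column, values in levels.items()
})
df['will_convert'] = [random.choice([0, 1]) for _ in range(100)]

# Paths that actually occur in the data, with their size and Conversion Rate(%)
observed = df.groupby(list(levels))['will_convert'].agg(['size', 'mean']).reset_index()
observed['conversion_rate'] = observed['mean'] * 100

print(f"Possible paths: {possible_paths}")
print(f"Paths present in the data: {len(observed)}")
```

With 1250 possible paths and only 100 records, most leaf-level paths are necessarily empty and the rest hold very few visitors, so the deeper levels of a tree built on this sample should be read with caution.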