Because we are using an unsupervised learning algorithm, there is not a widely available measure of accuracy. However, we can use domain knowledge to validate our groups.

Visually inspecting the groups, we can see some benchmarking groups have a mix of Economy and Luxury hotels, which doesn’t make business sense as the demand for hotels is fundamentally different.

**We can scroll through the data and note some of those differences, but can we come up with our own accuracy measure?**

We want to create a function to measure the consistency of the recommended Benchmarking sets across each feature. One way of doing this is by calculating the variance in each feature for each set. For each cluster, we can compute an average of each feature variance, and we can then average each hotel cluster variance to get a total model score.

From our domain knowledge, we know that in order to set up a comparable benchmark set, we need to prioritize hotels in the same Brand, possibly the same market, and the same country, and if we use different markets or countries, then the market tier should be the same.

With that in mind, we want our measure to have a higher penalty for variance in those features. To do so, we will use a weighted average to calculate each benchmark set variance. We will also print the variance of the key features and secondary features separately.
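To make the weighting concrete, the score can be written out as below. The notation is an assumption introduced here for illustration only: $\tilde{v}_{c,f}$ is the normalized variance (or entropy) of feature $f$ within benchmark set $c$, $w_f$ is the weight given to that feature, $F$ is the set of features, and $C$ is the number of benchmark sets.

$$
s_c = \frac{\sum_{f \in F} w_f \, \tilde{v}_{c,f}}{\sum_{f \in F} w_f},
\qquad
S = \frac{1}{C} \sum_{c=1}^{C} s_c
$$

When the weights sum to 1, the denominator drops out and $s_c$ is simply the weighted sum of the normalized variances. A lower $S$ means more consistent benchmark sets.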

To sum up, to create our accuracy measure, we need to:

- **Calculate variance for categorical variables**: One common approach is to use an “entropy-based” measure, where higher diversity in categories indicates higher entropy (variance).
- **Calculate variance for numerical variables**: We can compute the standard deviation or the range (difference between maximum and minimum values). This measures the spread of numerical data within each cluster.
- **Normalize the data**: Normalize the variance scores for each category before applying weights to ensure that no single feature dominates the weighted average due to scale differences alone.
- **Apply weights for different metrics**: Weight each type of variance based on its importance to the clustering logic.
- **Calculate weighted averages**: Compute the weighted average of these variance scores for each cluster.
- **Aggregate scores across clusters**: The total score is the average of these weighted variance scores across all clusters or rows. A lower average score would indicate that our model effectively groups similar hotels together, minimizing intra-cluster variance.

```python
from collections import Counter

import numpy as np
import pandas as pd
from scipy.stats import entropy
from sklearn.preprocessing import MinMaxScaler


def categorical_variance(data):
    """
    Calculate entropy for a categorical variable from a list.
    A higher entropy value indicates data with diverse classes.
    A lower entropy value indicates a more homogeneous subset of data.
    """
    # Count frequency of each unique value
    value_counts = Counter(data)
    total_count = sum(value_counts.values())
    probabilities = [count / total_count for count in value_counts.values()]
    return entropy(probabilities)


# Set scoring weights, giving higher weights to the most important features
scoring_weights = {"BRAND": 0.3,
                   "Room_count": 0.025,
                   "Market": 0.25,
                   "Country": 0.15,
                   "Market Tier": 0.15,
                   "HCLASS": 0.05,
                   "Demand": 0.025,
                   "Price range": 0.025,
                   "distance_to_airport": 0.025}


def calculate_weighted_variance(df, weights):
    """
    Calculate the weighted variance score for clusters in the dataset.
    Each cell of df is expected to hold the list of values for one benchmark set.
    Note: the two score columns are added to df in place.
    """
    # Initialize a DataFrame to store the variances
    variance_df = pd.DataFrame()

    # 1. Calculate variances for numerical features
    numerical_features = ['Room_count', 'Demand', 'Price range', 'distance_to_airport']
    for feature in numerical_features:
        variance_df[feature] = df[feature].apply(np.var)

    # 2. Calculate entropy for categorical features
    categorical_features = ['BRAND', 'Market', 'Country', 'Market Tier', 'HCLASS']
    for feature in categorical_features:
        variance_df[feature] = df[feature].apply(categorical_variance)

    # 3. Normalize the variance and entropy values
    scaler = MinMaxScaler()
    normalized_variances = pd.DataFrame(scaler.fit_transform(variance_df),
                                        columns=variance_df.columns,
                                        index=variance_df.index)

    # 4. Compute weighted average
    cat_weights = pd.Series({feature: weights[feature] for feature in categorical_features})
    num_weights = pd.Series({feature: weights[feature] for feature in numerical_features})

    cat_weighted_scores = normalized_variances[categorical_features].mul(cat_weights)
    df['cat_weighted_variance_score'] = cat_weighted_scores.sum(axis=1)

    num_weighted_scores = normalized_variances[numerical_features].mul(num_weights)
    df['num_weighted_variance_score'] = num_weighted_scores.sum(axis=1)

    return df['cat_weighted_variance_score'].mean(), df['num_weighted_variance_score'].mean()
```
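To build some intuition for the entropy-based measure, here is a minimal sketch on three made-up benchmark sets described only by hotel class. The set contents and the helper name `set_entropy` are assumptions for illustration; the helper mirrors the logic of `categorical_variance`.

```python
from collections import Counter

from scipy.stats import entropy


def set_entropy(values):
    """Entropy (natural log) of the category frequencies in one benchmark set."""
    counts = Counter(values)
    total = sum(counts.values())
    return entropy([count / total for count in counts.values()])


# Hypothetical benchmark sets of four hotels each
uniform_set = ["Luxury", "Luxury", "Luxury", "Luxury"]
mixed_set = ["Luxury", "Luxury", "Economy", "Economy"]
diverse_set = ["Luxury", "Upscale", "Midscale", "Economy"]

uniform_score = set_entropy(uniform_set)   # one class only: no diversity
mixed_score = set_entropy(mixed_set)       # two equally frequent classes
diverse_score = set_entropy(diverse_set)   # four equally frequent classes

print(uniform_score, mixed_score, diverse_score)
```

The homogeneous set scores zero, and the score grows as the classes become more mixed, which is the behavior we want to penalize.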

To keep our code clean and track our experiments, let’s also define a function to store the results of our experiments.

```python
# Define a function to store the results of our experiments
def model_score(data: pd.DataFrame,
                weights: dict = scoring_weights,
                model_name: str = "model_0"):
    cat_score, num_score = calculate_weighted_variance(data, weights)
    results = {"Model": model_name,
               "Primary features score": cat_score,
               "Secondary features score": num_score}
    return results


model_0_score = model_score(results_model_0, scoring_weights)
model_0_score
```

Now that we have a baseline, let’s see if we can improve our model.

## Improving our Model Through Experimentation

Up until now, we did not have to know what was going on under the hood when we ran this code:

```python
nns = NearestNeighbors()
nns.fit(data_scaled)
nns_results_model_0 = nns.kneighbors(data_scaled)[1]
```
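As a small illustration of what `kneighbors` returns, the sketch below runs the same calls on a handful of made-up 2-D points (the array `toy_points` is an assumption standing in for the scaled hotel features). The `[1]` in the code above selects the second element of the returned tuple, the neighbor indices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 2-D points standing in for the scaled hotel features (made-up values)
toy_points = np.array([[0.0, 0.0],
                       [0.0, 1.0],
                       [5.0, 5.0],
                       [5.0, 6.0],
                       [10.0, 10.0]])

toy_nns = NearestNeighbors(n_neighbors=2)
toy_nns.fit(toy_points)

# kneighbors returns a tuple (distances, indices),
# each of shape (n_samples, n_neighbors)
distances, indices = toy_nns.kneighbors(toy_points)

# Because we query with the same points we fitted on, column 0 is each
# point itself (distance 0); the actual peers start at column 1.
peers = indices[:, 1:]
print(indices)
print(peers.ravel())
```

This is why a benchmark set of k peers requires asking for k + 1 neighbors when the query data is the same as the fitted data.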

To improve our model, we will need to understand the model parameters and how we can interact with them to get better benchmark sets.

Let’s start by looking at the scikit-learn documentation and source code:

```python
# The below is taken directly from the scikit-learn source
from sklearn.neighbors._base import KNeighborsMixin, NeighborsBase, RadiusNeighborsMixin


class NearestNeighbors_(KNeighborsMixin, RadiusNeighborsMixin, NeighborsBase):
    """Unsupervised learner for implementing neighbor searches.

    Parameters
    ----------
    n_neighbors : int, default=5
        Number of neighbors to use by default for :meth:`kneighbors` queries.

    radius : float, default=1.0
        Range of parameter space to use by default for :meth:`radius_neighbors`
        queries.

    algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
        Algorithm used to compute the nearest neighbors:

        - 'ball_tree' will use :class:`BallTree`
        - 'kd_tree' will use :class:`KDTree`
        - 'brute' will use a brute-force search.
        - 'auto' will attempt to decide the most appropriate algorithm
          based on the values passed to :meth:`fit` method.

        Note: fitting on sparse input will override the setting of
        this parameter, using brute force.

    leaf_size : int, default=30
        Leaf size passed to BallTree or KDTree. This can affect the
        speed of the construction and query, as well as the memory
        required to store the tree. The optimal value depends on the
        nature of the problem.

    metric : str or callable, default='minkowski'
        Metric to use for distance computation. Default is "minkowski", which
        results in the standard Euclidean distance when p = 2. See the
        documentation of `scipy.spatial.distance
        <https://docs.scipy.org/doc/scipy/reference/spatial.distance.html>`_ and
        the metrics listed in
        :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric
        values.

    p : float (positive), default=2
        Parameter for the Minkowski metric from
        sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is
        equivalent to using manhattan_distance (l1), and euclidean_distance
        (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

    metric_params : dict, default=None
        Additional keyword arguments for the metric function.
    """

    def __init__(
        self,
        *,
        n_neighbors=5,
        radius=1.0,
        algorithm="auto",
        leaf_size=30,
        metric="minkowski",
        p=2,
        metric_params=None,
        n_jobs=None,
    ):
        super().__init__(
            n_neighbors=n_neighbors,
            radius=radius,
            algorithm=algorithm,
            leaf_size=leaf_size,
            metric=metric,
            p=p,
            metric_params=metric_params,
            n_jobs=n_jobs,
        )
```

There are quite a few things going on here.

The `NearestNeighbors` class inherits from `NeighborsBase`, which is the base class for nearest neighbor estimators. This class handles the common parameters required for nearest-neighbor searches, such as:

- n_neighbors (the number of neighbors to use)
- radius (the radius for radius-based neighbor searches)
- algorithm (the algorithm used to compute the nearest neighbors, such as ‘ball_tree’, ‘kd_tree’, or ‘brute’)
- metric (the distance metric to use)
- metric_params (additional keyword arguments for the metric function)
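A quick way to confirm these defaults without reading the source is to ask an instance for its parameters. The sketch below uses the standard scikit-learn estimator method `get_params`; the variable name `default_params` is just for illustration.

```python
from sklearn.neighbors import NearestNeighbors

# Instantiate with defaults and read back the constructor parameters
default_params = NearestNeighbors().get_params()

for name in ["n_neighbors", "radius", "algorithm", "metric", "p", "metric_params"]:
    print(f"{name}: {default_params[name]}")
```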

The `NearestNeighbors` class also inherits from the `KNeighborsMixin` and `RadiusNeighborsMixin` classes. These mixin classes add specific neighbor-search functionalities to `NearestNeighbors`:

- `KNeighborsMixin` provides functionality to find a fixed number k of nearest neighbors to a point. It does that by finding the distances to the neighbors and their indices, and it can also construct a graph of connections between points based on the k-nearest neighbors of each point.
- `RadiusNeighborsMixin` is based on the radius neighbors algorithm, which finds all neighbors within a given radius of a point. This method is useful in scenarios where the focus is on capturing all points within a meaningful distance threshold rather than a fixed number of points.

Based on our scenario, KNeighborsMixin provides the functionality we need.
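The difference between the two mixins is easy to see on a few made-up points. In this sketch (the data and the radius of 1.5 are arbitrary assumptions), `kneighbors` always returns the same number of neighbors per point, while `radius_neighbors` returns however many fall within the radius.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up 2-D points: two tight pairs and one isolated point
toy_points = np.array([[0.0, 0.0],
                       [0.0, 1.0],
                       [5.0, 5.0],
                       [5.0, 6.0],
                       [10.0, 10.0]])

toy_nns = NearestNeighbors(n_neighbors=3, radius=1.5).fit(toy_points)

# Fixed k: every point gets exactly 3 neighbors (itself included),
# however far away they are
_, k_indices = toy_nns.kneighbors(toy_points)

# Fixed radius: every point gets only the neighbors within 1.5 units
# (itself included)
_, radius_indices = toy_nns.radius_neighbors(toy_points)

k_sizes = [len(row) for row in k_indices]
radius_sizes = [len(row) for row in radius_indices]
print(k_sizes)       # same size for every point
print(radius_sizes)  # size depends on how crowded the neighborhood is
```

Since every hotel needs a benchmark set of the same size, the fixed-k behavior is the better fit here; with a radius, the isolated point would end up with no peers at all.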

We need to understand one key parameter before we can improve our model; this is the distance metric.

The documentation mentions that `NearestNeighbors` uses the “Minkowski” distance by default and gives us a reference to the SciPy API.

In `scipy.spatial.distance`, we can see two mathematical representations of the “Minkowski” distance:

$$
\|u - v\|_p = \left( \sum_i |u_i - v_i|^p \right)^{1/p}
$$

This formula calculates the p-th root of the sum of the absolute differences, each raised to the power p, across all elements.

The second mathematical representation of “Minkowski” distance is:

$$
\|u - v\|_p = \left( \sum_i w_i \, |u_i - v_i|^p \right)^{1/p}
$$

This is very similar to the first one, but it introduces weights `w_i` to the differences, emphasizing or de-emphasizing specific dimensions. This is useful where certain features are more relevant than others. By default, the setting is None, which gives all features the same weight of 1.0.

**This is a great option for improving our model as it allows us to pass domain knowledge to our model and emphasize similarities that are most relevant to users.**
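As a sanity check on the weighted formula, the sketch below computes it by hand with NumPy and compares it with `scipy.spatial.distance.minkowski`, which accepts an optional `w` argument. The vectors and weights are made-up values for illustration.

```python
import numpy as np
from scipy.spatial.distance import minkowski

# Two made-up feature vectors and one weight per feature
u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])
w = np.array([4.0, 1.0])  # emphasize the first feature
p = 2

# Direct translation of the weighted formula:
# (sum_i w_i * |u_i - v_i|^p)^(1/p)
manual_weighted = np.sum(w * np.abs(u - v) ** p) ** (1 / p)

unweighted = minkowski(u, v, p=p)           # all weights implicitly 1.0
scipy_weighted = minkowski(u, v, p=p, w=w)  # same formula, computed by SciPy

print(unweighted, manual_weighted, scipy_weighted)
```

Raising the weight on the first feature makes a gap in that feature count for more, so hotels that differ there are pushed further apart.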

If we look at the formulas, we see the parameter `p`. This parameter affects the “path” the algorithm takes to calculate the distance. **By default, p=2, which represents the Euclidean distance.**

You can think of the Euclidean distance as the length of a straight line drawn between two points. This is the shortest path between them; however, it is not always the most desirable way of calculating distance, especially in higher-dimensional spaces. For more information on why this is the case, see this paper: https://bib.dbvis.de/uploadedFiles/155.pdf

**Another common value for p is 1. This represents the Manhattan distance.** You can think of it as the distance between two points measured along a grid-like path.

**On the other hand, if we increase p towards infinity, we end up with the Chebyshev distance, defined as the maximum absolute difference between any corresponding elements of the vectors**. It essentially measures the worst-case difference, making it useful in scenarios where you want to ensure that no single feature varies too much.
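The three special cases can be compared side by side on a pair of made-up vectors. This sketch uses the `cityblock`, `euclidean`, `chebyshev`, and `minkowski` functions from `scipy.spatial.distance`; the variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean, minkowski

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])

manhattan = cityblock(u, v)      # p = 1: |3| + |4|
straight_line = euclidean(u, v)  # p = 2: sqrt(3^2 + 4^2)
worst_case = chebyshev(u, v)     # p -> infinity: max(|3|, |4|)

# Minkowski distance for increasing p moves from Manhattan towards Chebyshev
for p in [1, 2, 5, 20, 50]:
    print(p, round(minkowski(u, v, p=p), 4))
```

As p grows, the largest single difference increasingly dominates the sum, which is why the result converges to the Chebyshev distance.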

By reading and getting familiar with the documentation, we have uncovered a few possible options to improve our model.

By default, n_neighbors is 5; however, for our benchmark set, we want to compare each hotel to the 3 most similar hotels. To do so, we need to set n_neighbors = 4 (subject hotel + 3 peers).

```python
nns_1 = NearestNeighbors(n_neighbors=4)
nns_1.fit(data_scaled)
nns_1_results_model_1 = nns_1.kneighbors(data_scaled)[1]

results_model_1 = clean_results(nns_results=nns_1_results_model_1,
                                encoders=encoders,
                                data=data_clean)

model_1_score = model_score(results_model_1, scoring_weights, model_name="baseline_k_4")
model_1_score
```

Based on the documentation, we can pass weights to the distance calculation to emphasize the relationship across some features. Based on our domain knowledge, we have identified the features we want to emphasize, in this case, Brand, Market, Country, and Market Tier.

```python
# Set up weights for distance calculation
weights_dict = {"BRAND": 5,
                "Room_count": 2,
                "Market": 4,
                "Country": 3,
                "Market Tier": 3,
                "HCLASS": 1.5,
                "Demand": 1,
                "Price range": 1,
                "distance_to_airport": 1}

# Transform the weights dictionary into a list, keeping the scaled data column order
weights = [weights_dict[idx] for idx in list(scaler.get_feature_names_out())]

nns_2 = NearestNeighbors(n_neighbors=4, metric_params={'w': weights})
nns_2.fit(data_scaled)
nns_2_results_model_2 = nns_2.kneighbors(data_scaled)[1]

results_model_2 = clean_results(nns_results=nns_2_results_model_2,
                                encoders=encoders,
                                data=data_clean)

model_2_score = model_score(results_model_2, scoring_weights, model_name="baseline_with_weights")
model_2_score
```
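One way to reason about what the weights do: from the weighted formula, multiplying each squared difference by `w_i` is the same as stretching column `i` by `w_i ** (1 / p)` and then using the plain metric. The sketch below checks this on random made-up data; `toy_scaled` and `toy_weights` are hypothetical stand-ins for `data_scaled` and `weights`, and the brute-force search is only there for comparison.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
toy_scaled = rng.random((20, 3))         # stand-in for data_scaled (made-up values)
toy_weights = np.array([5.0, 2.0, 1.0])  # stand-in for the weights list
p = 2

# Brute-force weighted Minkowski distance between every pair of rows
diffs = np.abs(toy_scaled[:, None, :] - toy_scaled[None, :, :])
weighted_dist = np.sum(toy_weights * diffs ** p, axis=2) ** (1 / p)
brute_neighbors = np.argsort(weighted_dist, axis=1)[:, :4]

# Same search on rescaled columns with the plain (unweighted) metric
rescaled = toy_scaled * toy_weights ** (1 / p)
nns_rescaled = NearestNeighbors(n_neighbors=4, p=p).fit(rescaled)
rescaled_neighbors = nns_rescaled.kneighbors(rescaled)[1]

same_neighbors = np.array_equal(brute_neighbors, rescaled_neighbors)
print(same_neighbors)
```

Seen this way, a weight of 5 on Brand means a Brand mismatch contributes five times as much to the squared distance as it would unweighted.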

Passing domain knowledge to the model via weights improved the score significantly. Next, let’s test the impact of the distance measure.

So far, we have been using the Euclidean distance. Let’s see what happens if we use the Manhattan distance instead.

```python
nns_3 = NearestNeighbors(n_neighbors=4, p=1, metric_params={'w': weights})
nns_3.fit(data_scaled)
nns_3_results_model_3 = nns_3.kneighbors(data_scaled)[1]

results_model_3 = clean_results(nns_results=nns_3_results_model_3,
                                encoders=encoders,
                                data=data_clean)

model_3_score = model_score(results_model_3, scoring_weights, model_name="Manhattan_with_weights")
model_3_score
```

Decreasing p to 1 resulted in some good improvements. Let’s see what happens as p approaches infinity.

To use the Chebyshev distance, we will change the metric parameter to `chebyshev`.

The default sklearn Chebyshev metric doesn’t have a weight parameter. To get around this, we will define a custom `weighted_chebyshev` metric.

```python
# Define the custom weighted Chebyshev distance function
def weighted_chebyshev(u, v, w):
    """Calculate the weighted Chebyshev distance between two points."""
    return np.max(w * np.abs(u - v))


nns_4 = NearestNeighbors(n_neighbors=4, metric=weighted_chebyshev, metric_params={'w': weights})
nns_4.fit(data_scaled)
nns_4_results_model_4 = nns_4.kneighbors(data_scaled)[1]

results_model_4 = clean_results(nns_results=nns_4_results_model_4,
                                encoders=encoders,
                                data=data_clean)

model_4_score = model_score(results_model_4, scoring_weights, model_name="Chebyshev_with_weights")
model_4_score
```
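To see how the weights change which feature drives a Chebyshev distance, here is a small sketch on two made-up vectors. The feature order and weight values are hypothetical; the function is the same one-liner as above, repeated so the example runs on its own.

```python
import numpy as np


def weighted_chebyshev(u, v, w):
    """Largest weighted absolute difference between two points."""
    return np.max(w * np.abs(u - v))


# Made-up scaled feature vectors for two hotels
u = np.array([0.0, 0.0, 0.0])
v = np.array([0.2, 0.5, 0.1])

equal_weights = np.array([1.0, 1.0, 1.0])
first_heavy = np.array([5.0, 1.0, 1.0])  # first feature weighted 5x

equal_distance = weighted_chebyshev(u, v, equal_weights)
heavy_distance = weighted_chebyshev(u, v, first_heavy)

# Index of the feature that determines each distance
dominant_equal = int(np.argmax(equal_weights * np.abs(u - v)))
dominant_heavy = int(np.argmax(first_heavy * np.abs(u - v)))

print(equal_distance, dominant_equal)
print(heavy_distance, dominant_heavy)
```

With equal weights the second feature sets the distance; once the first feature is weighted more heavily, its smaller raw gap becomes the deciding one. Only one feature ever contributes to a Chebyshev distance, so differences in all other features are ignored for that pair.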

We managed to decrease the primary feature variance scores through experimentation.

Let’s visualize the results.

```python
results_df = pd.DataFrame([model_0_score, model_1_score, model_2_score, model_3_score, model_4_score]).set_index("Model")
results_df.plot(kind='barh')
```

Using Manhattan distance with weights seems to give the most accurate benchmark sets according to our needs.

The last step before implementing the benchmark sets would be to examine the sets with the highest Primary features scores and identify what steps to take with them.

```python
# Histogram of Primary features score
results_model_3["cat_weighted_variance_score"].plot(kind="hist")
```

```python
exceptions = results_model_3[results_model_3["cat_weighted_variance_score"] >= 0.4]
print(f" There are {exceptions.shape[0]} benchmark sets with significant variance across the primary features")
```

These 18 cases will need to be reviewed to ensure the benchmark sets are relevant.

As you can see, with a few lines of code and some understanding of Nearest neighbor search, we managed to set internal benchmark sets. We can now distribute the sets and start measuring hotels’ KPIs against their benchmark sets.

You don’t always have to focus on the most cutting-edge machine learning methods to deliver value. Very often, simple machine learning can deliver great value.

What are some low-hanging fruits in your business that you could easily tackle with Machine learning?

World Bank. “World Development Indicators.” Retrieved June 11, 2024, from https://datacatalog.worldbank.org/search/dataset/0038117

Aggarwal, C. C., Hinneburg, A., & Keim, D. A. (n.d.). On the Surprising Behavior of Distance Metrics in High Dimensional Space. IBM T. J. Watson Research Center and Institute of Computer Science, University of Halle. Retrieved from https://bib.dbvis.de/uploadedFiles/155.pdf

SciPy v1.10.1 Manual. `scipy.spatial.distance.minkowski`. Retrieved June 11, 2024, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.minkowski.html

GeeksforGeeks. Haversine formula to find distance between two points on a sphere. Retrieved June 11, 2024, from https://www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/

scikit-learn. Neighbors Module. Retrieved June 11, 2024, from https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors