Feast or Famine: Predicting Restaurants’ Aggregate Rating with Machine Learning and Python


Have you ever wondered what makes a restaurant truly great? If you’re a foodie, a restaurant owner, or someone aspiring to own one, you have probably asked yourself what makes one restaurant stand out from the rest. Is it the cuisines and their quality, the restaurant’s ambience, or its service features? Just like you, I set out to answer these questions. As a data enthusiast, I explored them using a restaurant rating dataset, and the findings were fascinating.

The overarching goal of this project is to perform an exploratory data analysis on the dataset, build predictive models, and derive actionable insights that could improve the rating of any restaurant. In this article I will walk you through the key steps of the project: exploring the data itself, visualising it (including, but not limited to, geospatial analysis and customers’ preferences), and building predictive models.

Whether you’re a restaurant owner or an aspiring one, a data scientist, or simply a lover of food, this article is for you. So, take a seat, and let’s dive in.

This work was done primarily in Python and uses several libraries, including Pandas, NumPy, Seaborn, Matplotlib, Folium, Plotly, and scikit-learn.

First, the dataset was downloaded in .csv format from Oyeniran’s GitHub page and loaded using the following Python code.

## Importing the important libraries for data understanding
import pandas as pd
import numpy as np

## data loading and exploration
restaurant_df = pd.read_csv('https://raw.githubusercontent.com/Oyeniran20/axia_class_cohort_7/refs/heads/main/Dataset%20.csv')
restaurant_df.head()


Next, I performed some data exploration to better understand the dataset. This included checking for missing values, data types, and duplicated observations.

## checking for missing values
restaurant_df.isnull().sum()

## dataset information
restaurant_df.info()

## checking for duplicate values
restaurant_df.duplicated().sum()

I computed summary statistics for the aggregate rating and found that zero (0) is the lowest rating, 4.9 is the highest, and the mean rating is 2.67.

## Analysing aggregate rating distribution
Aggregate_distribution= restaurant_df['Aggregate rating'].describe().T
Aggregate_distribution

The distribution is close to normal but slightly left-skewed, with a skewness of -0.95. A rating of zero (0) is the single most frequent value, accounting for 22.5% of all ratings, as shown below.
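As a quick sanity check of that figure, the share of zero ratings can be computed directly. The snippet below is a minimal sketch that only assumes the DataFrame loaded earlier.

## share of zero ratings (sanity check of the 22.5% figure)
zero_share = (restaurant_df['Aggregate rating'] == 0).mean() * 100
print(f'Zero ratings make up {zero_share:.1f}% of all ratings')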

## Visualising the distribution of aggregate rating and understand the distribution better
## To do that, we need to import the necessary visualisation libraries
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## looking at the distribution proper using histogram
sns.histplot(restaurant_df['Aggregate rating'], kde=True)
plt.title('Distribution of Aggregate Ratings')
plt.xlabel('Aggregate Rating')
skewness = restaurant_df['Aggregate rating'].skew()
plt.savefig('Distribution of Aggregate Ratings.png')

print(f'Skewness: {skewness}')

Further investigation reveals that an aggregate rating of zero (0) is likely just a placeholder for ‘Not rated’, as described by the ‘Rating text’ column, and not necessarily a sign that the cuisines are rated very poorly, as one might think.

## visualise the outcome
restaurant_df.groupby('Aggregate rating')['Rating text'].value_counts().plot(kind='bar')
plt.title('Distribution of Aggregate Ratings')
plt.xlabel('Aggregate Rating')
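A quick way to check this claim directly is to list the rating texts that co-occur with a zero rating; a minimal sketch, using the same DataFrame:

## which 'Rating text' values appear when the aggregate rating is zero?
print(restaurant_df.loc[restaurant_df['Aggregate rating'] == 0, 'Rating text'].unique())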

I used the treemap from the Plotly library to visualise the top cuisines using the following code.

## identify and visualise the top 5 cuisines using a treemap
import plotly.express as px

top_cuisines = restaurant_df.Cuisines.value_counts().head(5).index.tolist()

filtered_df = restaurant_df[restaurant_df['Cuisines'].isin(top_cuisines)]

px.treemap(filtered_df, path=['Cuisines'], values='Aggregate rating', title='Top 5 Cuisines')

To compare aggregate ratings across different cities and cuisines, I used the following code.

Cuisines_clean = restaurant_df[['Cuisines', 'Aggregate rating', 'City']].dropna()

px.sunburst(Cuisines_clean, path=['City','Cuisines'], values='Aggregate rating', title='Aggregate Rating across different Cities and Cuisines')

Restaurant distribution across cities

## To map restaurants location using their coordinates, we need to import folium library
import folium

## Initialising the map centred on latitude 14.017900, longitude -91.633600
## (folium.Map expects location as [latitude, longitude])

restaurant_map = folium.Map(location=[14.017900, -91.633600], zoom_start=1)

# Looping through the rows of the DataFrame using a 'for' loop
for index, row in restaurant_df.iterrows():
    latitude = row['Latitude']
    longitude = row['Longitude']
    coords = [latitude, longitude]
    # Add a marker for this restaurant using folium.Marker and attach it to the map using add_to
    folium.Marker(coords, popup=row['Restaurant Name'], tooltip='view more restaurants').add_to(restaurant_map)

# Display the map
restaurant_map

This second code chunk makes the map more interactive.

## to make the map more interactive, we need to import and use fast marker cluster from folium plugins.
from folium.plugins import FastMarkerCluster
# Create a list to store all coordinates
locations = []

# Iterate through rows to collect locations
for index, row in restaurant_df.iterrows():
    latitude = row['Latitude']
    longitude = row['Longitude']
    locations.append([latitude, longitude])

# Create FastMarkerCluster using the collected locations
FastMarkerCluster(data=locations).add_to(restaurant_map)

# Display the map
restaurant_map

Another way to do this is to use Plotly’s scatter_geo.

## putting the restaurants on the map using Plotly

px.scatter_geo(restaurant_df, lat='Latitude', lon='Longitude', projection='natural earth', hover_name='Restaurant Name')

Now, let’s look at the correlation between the continuous variables.

## correlation of the numerical columns and visualising the relationship using heatmap
num_corr = restaurant_df.select_dtypes(exclude='object').corr()
num_corr

## visualising the correlation results

sns.heatmap(num_corr, annot=True, fmt='.2f')

Feature selection

The following columns were dropped, as they were not considered important features for the prediction (a sketch of the drop step follows the list).

1. Switch to Order Menu: Not informative, since every restaurant in the dataset has the value “No”.
2. Rating text and Rating color: These are derived from the aggregate rating itself and are not independent features.
3. Restaurant Name: Believed to have no impact on the rating.
4. Restaurant ID: Believed to be randomly assigned identifiers with no impact on the rating.
5. City, Address, Locality, and Locality Verbose: Although the country may have an effect, these finer-grained location fields likely have none.
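A minimal sketch of the drop step might look like the following; the exact column names are assumptions based on the list above, and errors='ignore' guards against any naming mismatches with the actual dataset.

## dropping the columns listed above (column names assumed; adjust to match the dataset)
cols_to_drop = ['Switch to Order Menu', 'Rating text', 'Rating color',
                'Restaurant Name', 'Restaurant ID',
                'City', 'Address', 'Locality', 'Locality Verbose']
restaurant_df = restaurant_df.drop(columns=cols_to_drop, errors='ignore')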

Since this is a regression problem, several regression models were trained, and their R² scores, RMSE, and MAE were compared to identify the best-performing model. To avoid data leakage when deployed and to keep the workflow clean, pipelines were used throughout.

The first thing here in the modelling was to handle the missing values in the dataset.

## handling the missing values
## Since the missing values amount to well under 1% of the data, and given the good size of the dataset, we are going to drop the affected rows.
restaurant_df.dropna(inplace=True)
restaurant_df.shape

Next was to import libraries and perform data splitting and features encoding.

## separating the dataframe into numerical and categorical variables
numeric_cols = restaurant_df.select_dtypes(include=np.number).columns.drop('Aggregate rating') ## dropped the 'Aggregate rating' column because it's the target
categoric_cols = restaurant_df.select_dtypes(exclude=np.number).columns
categoric_cols
from sklearn.model_selection import train_test_split
x = restaurant_df.drop('Aggregate rating', axis=1)
y = restaurant_df['Aggregate rating']

## splitting the features and target into train and test samples (80/20) using the imported function
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=42)

Data preprocessing and modelling

## importing the necessary libraries

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

# 1. Create a ColumnTransformer to apply different preprocessing to different columns
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical_columns', StandardScaler(), numeric_cols),
        ('categorical_columns', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), categoric_cols),
    ])

# 2. Define a list of regression models to evaluate
models = [
    ('Linear Regression', LinearRegression()),
    ('Ridge Regression', Ridge()),
    ('Lasso Regression', Lasso()),
    ('Decision Tree Regression', DecisionTreeRegressor()),
    ('Random Forest Regression', RandomForestRegressor()),
    ('K-Nearest Neighbors Regression', KNeighborsRegressor()),
    ('Support Vector Regression', SVR())
]

# 3. Iterate through the models and evaluate performance
## Creating an empty list to store the fitted pipeline for each model.

pipelines = []
for model_name, model in models:
    pipeline = Pipeline(
        steps=[
            ('preprocessor', preprocessor),
            ('regressor', model),
        ])
    pipelines.append(pipeline)
    pipeline.fit(x_train, y_train)

    train_score = pipeline.score(x_train, y_train)
    test_score = pipeline.score(x_test, y_test)
    print(f'{model_name}: train R2 = {train_score:.3f}, test R2 = {test_score:.3f}')

Model Evaluation

The following code was used to evaluate the models.

## importing the necessary metrics
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error

# Predictions for training and testing data
## initialising with an empty list to store the metric results
results =[]

for (model_name, model), pipeline in zip(models, pipelines):
    train_pred = pipeline.predict(x_train)
    test_pred = pipeline.predict(x_test)

    # Calculate metrics for training data
    train_r2 = r2_score(y_train, train_pred)
    train_mae = mean_absolute_error(y_train, train_pred)
    train_rmse = root_mean_squared_error(y_train, train_pred)

    # Calculate metrics for testing data
    test_r2 = r2_score(y_test, test_pred)
    test_mae = mean_absolute_error(y_test, test_pred)
    test_rmse = root_mean_squared_error(y_test, test_pred)

    results.append({
        'Model': model_name,
        'Train R2': train_r2,
        'Test R2': test_r2,
        'Train MAE': train_mae,
        'Test MAE': test_mae,
        'Train RMSE': train_rmse,
        'Test RMSE': test_rmse
    })

pd.DataFrame(results)

From the metrics, Random Forest regression performed better on this dataset than any of the other tested models, making it the best model for this task, followed by Decision Tree regression. The worst-performing model was Lasso regression.
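To make that comparison explicit, the results table can be sorted by test R². This is a small sketch that assumes the results list built above; results_df is a name introduced here for illustration.

## ranking the models by their test R2 score
results_df = pd.DataFrame(results).sort_values('Test R2', ascending=False)
print(results_df[['Model', 'Test R2', 'Test RMSE', 'Test MAE']])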

Target column transformation

Since the target column is not normally distributed, I applied some transformations to it to see if they improve the models.

## Transforming the target column (Aggregate rating) using the logarithm of 1 plus the rating

log_skew = np.log(restaurant_df['Aggregate rating'] + 1).skew()
print(f"Skewness of the target column after transformation is {log_skew}")

## comparing the skewness before the log transformation
print(f"Skewness of Aggregate rating before transformation is {restaurant_df['Aggregate rating'].skew()}")

Transforming the target using the natural logarithm of the aggregate rating plus 1 did not have any positive effect on the skewness of the distribution; rather, it shifted the distribution further to the negative side.

Another approach was to use the PowerTransformer from the scikit-learn module.

## Importing and using the PowerTransformer from sklearn.preprocessing
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
Y = pt.fit_transform(restaurant_df[['Aggregate rating']])

The transformed target column then went through the same preprocessing and modelling steps described earlier.
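As a minimal sketch of how such a refit might look, re-using the preprocessor and the train/test split from earlier: note that, unlike the fit on the full column above, the transformer here is fit on the training target only, which avoids leaking test information into the transformation.

## refitting one pipeline on the power-transformed target (illustrative sketch)
pt = PowerTransformer()
y_train_t = pt.fit_transform(y_train.to_frame()).ravel()

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', LinearRegression())])
pipeline.fit(x_train, y_train_t)

## invert the transformation before scoring so the metric is on the original rating scale
test_pred = pt.inverse_transform(pipeline.predict(x_test).reshape(-1, 1)).ravel()
print(f'Test R2 on the original scale: {r2_score(y_test, test_pred):.3f}')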

The target column transformation using the PowerTransformer visibly improved the R² scores of the linear models (Linear and Ridge regression) and of the Support Vector regression, while having little or no effect on the score of the Lasso regression. However, it reduced the R² scores of the Decision Tree, Random Forest, and K-Nearest Neighbors regressions, which were the top three performing models on this dataset. The transformation also appears to introduce some overfitting in the Decision Tree model; even so, with a test R² of 0.96, the model still generalises well.
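One way to quantify such overfitting is the gap between train and test R². The sketch below assumes a results table like the hypothetical results_df built earlier, re-created from whichever run is being inspected.

## flagging possible overfitting via the train-test R2 gap
results_df['R2 gap'] = results_df['Train R2'] - results_df['Test R2']
print(results_df[['Model', 'Train R2', 'Test R2', 'R2 gap']].sort_values('R2 gap', ascending=False))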

The complete code is available here
