100 Days of Machine Learning, Day 5,6: Project One | by Munish Prasad Lohani | Jul, 2024


A California House Price Predictor!


Introduction

It is finally time to code! For my first project, I followed the book “Hands-On Machine Learning with Scikit-Learn and TensorFlow.” The dataset used here gives an idea of housing prices based on location, rooms, bedrooms, and so on. In this project, I will walk through a basic end-to-end machine learning task. Before writing any code, let's work through a few preliminary steps.

The Big Picture

Our dataset contains metrics such as population, median income, and median house value for each district in California.

Frame The Problem

Pipelines

In simple words, a data pipeline is a sequence of data processing components. Pipelines are crucial in machine learning, especially when managing complex workflows: they chain together the steps that transform our raw data into a trained model that can solve the problem.
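As a tiny, generic illustration (a minimal sketch, not the pipeline we will build for this project), scikit-learn expresses this idea with its Pipeline class, which chains preprocessing components and a final estimator into a single object:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Toy data with a perfectly linear relationship (y = 2x)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

pipe = Pipeline([
    ("scaler", StandardScaler()),    # data processing component
    ("model", LinearRegression()),   # final estimator
])
pipe.fit(X, y)                       # runs every component in sequence
print(pipe.predict([[5.0]]))         # roughly [10.]

Calling fit() on the pipeline runs each component in order, so the raw data flows through the whole chain without any manual bookkeeping.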

The next step is to check whether there are any existing solutions to this problem. Knowing this gives us a performance baseline and insight into how to approach the problem. For this project, let's assume the work was previously done manually: one team collected the data, another performed complex calculations to estimate prices, and so on.

Based on this current solution, we can see that the prediction error can be high. Suppose we don't have price data for some districts and our manual calculations estimate a certain price, but the actual price turns out to be 10% or 20% off; that certainly hurts the business. With a model, we can instead predict prices from the available features. Now that we know what our problem is, let's frame it:

  • Our data is labelled, so this is supervised learning.
  • We are predicting a numeric price, so it is a regression task, specifically multiple regression (several input features are used).
  • We do not have a continuous stream of new data and the model does not need to adapt to constantly changing data, so batch learning is sufficient.

Select Performance Measure

Now that we have framed the problem, let’s select a performance measure for our model. In most regression tasks, we use Root Mean Square Error (RMSE) to measure our model’s performance.


Another performance measure is Mean Absolute Error.


Both RMSE and MAE are ways to measure the distance between the vector of actual values and the vector of predicted values.

RMSE

RMSE is based on the Euclidean norm, which measures the “straight-line” distance between two points in Euclidean space.


MAE

MAE is based on the Manhattan norm: it measures the average absolute difference between the actual values and the predicted values.

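For reference, here are the standard definitions of both measures, where m is the number of districts, x^(i) and y^(i) are the features and label of the i-th district, and h is our prediction function:

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$$

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(\mathbf{x}^{(i)}) - y^{(i)}\right|$$

RMSE corresponds to the Euclidean (ℓ2) norm of the error vector and MAE to the Manhattan (ℓ1) norm, which is why RMSE is more sensitive to outliers than MAE.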

Check Assumptions

Finally, it is good practice to check what the system consuming our predictions actually needs: the predicted prices themselves, or just the prices bucketed into categories. If categories are enough (say, cheap, medium, expensive), the task should be framed as classification instead of regression, since we would be categorizing districts rather than predicting a price. For this project, let's say it is a regression task.

For this project, let’s say it is a regression task.

The Project

first, let’s load our data set.

import pandas as pd
import numpy as np

housing=pd.read_csv("E://books to read//100 Days of Machine Learning//Code//Project_1_Cali_House_Dataset//housing.csv")
housing.head()

Output
housing.info()
Output

We see that total_bedrooms has some missing values. We will resolve this issue later. Here, we also find that every other feature is a number (float64) but ocean_proximity is an object.

housing["ocean_proximity"].value_counts()
Output

So, it seems that there are five categories of ocean_proximity. Since our ML model only understands numbers, we will later convert this column into a numerical representation via one-hot encoding.
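As a quick aside (a small sketch; the actual encoding will happen inside the preprocessing pipeline later on), this is roughly what OneHotEncoder does to ocean_proximity: each of the five categories becomes its own binary column.

from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the single categorical column (it expects 2-D input)
encoder = OneHotEncoder()
ocean_1hot = encoder.fit_transform(housing[["ocean_proximity"]])

print(encoder.categories_)        # the five category labels
print(ocean_1hot.toarray()[:3])   # first few rows as 0/1 indicator columns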

For now, let’s visualize the data inform of a histogram.

import matplotlib.pyplot as plt
%matplotlib inline

housing.hist(bins=50,figsize=(20,15))
plt.show()

Output

From the figure, we can infer that many of the attributes are tail-heavy: they stretch much farther on one side of the median than the other. This might cause problems for our ML model later, so it may help to transform them into more bell-shaped distributions. The next thing to notice is the scale of the attributes. For example, median_income is clearly not expressed in raw dollars; it has already been scaled and capped.

Now, let’s work on our data. In order to train our model, it is necessary to divide it into a training set and a testing set. A general rule is to allow 20% of data as testing and the rest as training. But, before we proceed, let’s work on median_income. We see that most of the incomes are clustered around 1.5 and 6. But, there are also some beyond 6. So, it’s important for us to have enough instances for each stratum.

housing["income_cat"]=pd.cut(housing['median_income'],bins=[0.,1.5,3.0,4.5,6,np.inf],labels=[1,2,3,4,5])
housing["income_cat"].hist()
plt.show()
Output

Now, let's create the training and test sets.

from sklearn.model_selection import StratifiedShuffleSplit

split=StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index,test_index in split.split(housing,housing['income_cat']):
    strat_train=housing.loc[train_index]
    strat_test=housing.loc[test_index]
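As a quick sanity check (my own addition, not part of the original code), we can compare the income-category proportions in the stratified test set with those in the full dataset; if the stratification worked, they should be nearly identical.

# Proportion of each income category: stratified test set vs. full dataset
print(strat_test["income_cat"].value_counts() / len(strat_test))
print(housing["income_cat"].value_counts() / len(housing))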

In order for the model to be effective and unbiased, it is necessary to have a good proportion of every category in both the training and test sets. While we could use sklearn's train_test_split to divide our data, StratifiedShuffleSplit is a more effective way to split it by income category, since it preserves the proportion of each stratum. Finally, let's drop income_cat to get the data back into its original form.

for set_ in (strat_test,strat_train):
    set_.drop("income_cat",axis=1,inplace=True)

Until now, we’ve just went through the data without digging deep into it. But now, let’s explore the data further. One way of doing so is via visualizing.

copy_housing=strat_train.copy()
copy_housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.1)
Output

Well, this definitely looks like California, with the higher-density zones clearly visible. Now, let's add population and income to the visualization.

copy_housing.plot(kind="scatter",x="longitude",y="latitude",s=copy_housing["population"]/100,label="population",c="median_income",cmap=plt.get_cmap("jet"),alpha=0.3)
plt.legend()
plt.show()
Output

Well, another way of exploring the data is by computing correlations between attributes. Since correlations can only be computed between numerical attributes, we drop ocean_proximity first.

#Corr
corr=copy_housing.drop("ocean_proximity",axis="columns").corr()

from pandas.plotting import scatter_matrix

attributes=['median_house_value','median_income','total_rooms','housing_median_age']
scatter_matrix(copy_housing[attributes],figsize=(22,18))

Output

From this scatter matrix, we can see that median_income and median_house_value have a fairly linear relationship, whereas the other relationships are much less clear. Now, let's experiment with combining some attributes.

copy_housing["rooms_per_household"]=copy_housing["total_rooms"]/copy_housing["households"]
copy_housing["bedrooms_per_room"]=copy_housing["total_bedrooms"]/copy_housing["total_rooms"]
copy_housing["population_per_household"]=copy_housing["population"]/copy_housing["households"]

corr=copy_housing.drop("ocean_proximity",axis="columns").corr()
corr["median_house_value"].sort_values(ascending=False)

Output

Well, combining attributes indeed helped us: rooms_per_household correlates with median_house_value more strongly than the raw total_rooms count does. Now that we have performed some feature engineering, let's split our training set into features and labels.

housing_features=strat_train.drop("median_house_value",axis=1)
housing_label=strat_train['median_house_value'].copy()

Now we have reached an important point in our project: it's time to create a pipeline. Remember that total_bedrooms had some missing values? We will fix that via sklearn's SimpleImputer.

A SimpleImputer is used in the preprocessing stage of data preparation to handle missing values in a dataset. The goal is to replace null or missing values so that they do not negatively affect the performance of machine learning models. SimpleImputer does this by filling in values based on one of several strategies: mean, median, most frequent, or a constant. Next, we will also use StandardScaler to standardize the numerical features, since they are on very different scales.
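As a quick illustration (an optional side check, not required for the pipeline below), you can fit a SimpleImputer directly on the numerical columns and inspect the medians it learns:

from sklearn.impute import SimpleImputer

housing_num = housing_features.drop("ocean_proximity", axis=1)   # numerical columns only

imputer = SimpleImputer(strategy="median")
imputer.fit(housing_num)
print(imputer.statistics_)   # per-column medians that will replace missing values

With both steps in mind, let's chain them into a single Pipeline.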

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

pipeline=Pipeline([
    ("imputer",SimpleImputer(strategy="median")),   # replace missing values with the column median
    ("std_Scaler",StandardScaler())                 # standardize each numerical feature
])

Now that we’ve created a pipeline to fix our data issues, let’s create a ColumnTransformer to specify the transformation for different columns.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_features=housing_features.drop("ocean_proximity",axis=1)
num_attrs=list(num_features)      # names of the numerical columns
cat_attrs=["ocean_proximity"]     # the single categorical column

full_pipeline=ColumnTransformer([
    ("num",pipeline,num_attrs),
    ("cat",OneHotEncoder(),cat_attrs)
])

housing_prepares=full_pipeline.fit_transform(housing_features)

This concludes our preprocessing. First, we imputed the missing values and standardized the numerical features. Then, we converted the categorical data into numerical data via one-hot encoding.

Now, we will start building our models.

Linear Regression

from sklearn.linear_model import LinearRegression

lin_reg=LinearRegression()
lin_reg.fit(housing_prepares,housing_label)

from sklearn.metrics import mean_squared_error

predictions=lin_reg.predict(housing_prepares)
lin_mse=mean_squared_error(housing_label,predictions)
lin_rmse=np.sqrt(lin_mse)
lin_rmse

RMSE output

The value is better than nothing, but it is far from great. Most of our housing prices range from $128,000 to $225,000, so a typical prediction error of around $68,000 is not very satisfying. This suggests the model is underfitting: it isn't powerful enough to capture the information in our features.

Decision Tree

from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(housing_prepares,housing_label)
tree_predictions=tree_reg.predict(housing_prepares)
tree_mse=mean_squared_error(housing_label,tree_predictions)
tree_rmse=np.sqrt(tree_mse)

tree_rmse

#Output:0.0

Well, our model appears to be error free. But here is the problem: it shouldn't actually be perfect. From this we can infer that the model is badly overfitting the training data. So, how do we get a more honest estimate of its performance? One way is to use sklearn's cross_val_score.

cross_val_score divides the training set into K subsets, called folds. It then trains the model K times, each time on K-1 folds, and evaluates it on the remaining fold.

from sklearn.model_selection import cross_val_score

cross_val_scores=cross_val_score(tree_reg,housing_prepares,housing_label,scoring="neg_mean_squared_error",cv=10)
tree_rmse_scores=np.sqrt(-cross_val_scores)

tree_rmse_scores

Output

Then, we can aggregate these scores using np.mean(), which comes out to 71857.76227885179.
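To avoid repeating np.mean() and np.std() by hand for every model, a small helper (my own addition, in the spirit of the display_scores function the book uses) can summarize any array of cross-validation scores:

def display_scores(scores):
    # Print the individual fold RMSEs along with their mean and spread
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)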

Well, once properly evaluated with cross-validation, the Decision Tree actually performs worse than Linear Regression.

Random Forest Regressor

Finally, let’s use RandomForestRegressor. A RandomForestRegressor is an ensemble learning method that constructs numerous DecisionTrees during training and outputs mean result.

from sklearn.ensemble import RandomForestRegressor

forest_reg=RandomForestRegressor()
forest_reg.fit(housing_prepares,housing_label)
forest_pred=forest_reg.predict(housing_prepares)
forest_mse=mean_squared_error(housing_label,forest_pred)
forest_rmse=np.sqrt(forest_mse)
forest_rmse

#Output: 18694.91813388159

forest_cross_val_scores=cross_val_score(forest_reg,housing_prepares,housing_label,scoring="neg_mean_squared_error",cv=10)
forest_rmse_scores=np.sqrt(-forest_cross_val_scores)

forest_rmse_scores

#Output: 68321.7118618

Well, the Random Forest looks much more promising, but the gap between the training error and the cross-validation error shows that the model is still overfitting. One way to improve it is to tune the hyperparameters.

GridSearchCV

sklearn's GridSearchCV can help us automate hyperparameter tuning: it tries every combination of the hyperparameter values we specify and evaluates each candidate using cross-validation.

from sklearn.model_selection import GridSearchCV

param_grid=[
    {"n_estimators":[3,10,30,60],"max_features":[2,4,6,8,10]},
    {"bootstrap":[False],"n_estimators":[3,10],"max_features":[2,3,4]}
]

forest_reg=RandomForestRegressor()
grid_search=GridSearchCV(forest_reg,param_grid,cv=5,scoring="neg_mean_squared_error",return_train_score=True)

grid_search.fit(housing_prepares,housing_label)

grid_search.best_params_

#Output: {'max_features': 6, 'n_estimators': 60}

The .best_params_ attribute gives the best combination of hyperparameters found by the search.
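Besides the single best combination, grid_search.cv_results_ stores the evaluation score of every combination tried. A quick way (my own addition) to print the RMSE for each one:

# Each mean_test_score is a negative MSE, so negate it and take the square root
cv_res = grid_search.cv_results_
for mean_score, params in zip(cv_res["mean_test_score"], cv_res["params"]):
    print(np.sqrt(-mean_score), params)

Finally, let's take the best estimator found by the search and evaluate it on the held-out test set.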

final_model=grid_search.best_estimator_

x_test=strat_test.drop("median_house_value",axis=1)
y_test=strat_test["median_house_value"].copy()

x_test["rooms_per_household"]=x_test["total_rooms"]/x_test["households"]
x_test["bedrooms_per_room"]=x_test["total_bedrooms"]/x_test["total_rooms"]
x_test["population_per_household"]=x_test["population"]/x_test["households"]

x_test_prepared=full_pipeline.transform(x_test)

final_pred=final_model.predict(x_test_prepared)

final_mse=mean_squared_error(y_test,final_pred)
final_rmse=np.sqrt(final_mse)

final_rmse

#Output: 47297.496275008285

The root mean square error of our final model is the best among the models we tried earlier. But our task doesn't end here. In some cases, we may want to know the likely range of this generalization error to be convinced. This can be done by computing a confidence interval for the RMSE.

from scipy import stats

confidence=0.95
error=(final_pred-y_test)**2
ci=np.sqrt(stats.t.interval(confidence,len(error)-1,loc=error.mean(),scale=stats.sem(error)))

interval=ci[1]-ci[0]
mean_rmse=(ci[0]+ci[1])/2
proportion=(ci[1]-ci[0])/mean_rmse

print(ci)

#Output:[45333.94519671 49182.71770318]

Reflection

By the end of day 6, not only was I able to learn the concepts behind many machine learning techniques, but I was also able to implement them.

While this project comes straight from the book, I will also be working on a separate project to avoid “tutorial hell”.
