A California House Price Predictor!
Introduction
It is finally time to code! For my first project, I followed the book “Hands-On Machine Learning with Scikit-Learn and TensorFlow.” The dataset used here gives an idea of housing prices based on location, number of rooms, bedrooms, and so on. In this project, I will work through a basic end-to-end machine learning task. To begin with, let’s complete a few preliminary steps.
The Big Picture
Our dataset contains metrics such as population, median income, median house value, etc., for different districts.
Frame The Problem
Pipelines
In simple words, a data pipeline is a sequence of data processing components. Pipelines are crucial in machine learning, especially when managing complex workflows. This involves steps designed to transform our raw data into a model that can solve problems.
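As a tiny illustration (a hedged sketch only, not the pipeline we will build for this project; the step names and models here are just placeholders), scikit-learn lets us chain processing components with its Pipeline class:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# A minimal two-step pipeline: scale the features, then fit a model.
toy_pipeline=Pipeline([
    ("scaler",StandardScaler()),    # transformer: standardizes each feature
    ("model",LinearRegression())    # final estimator
])
# toy_pipeline.fit(X,y) would run both steps in sequence.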
The next step is to check whether there are any existing solutions to this problem. Knowing this gives us a performance baseline and insight into how to approach it. For this problem, let’s assume that previously the work was done manually: one department for data collection, another for the complex calculations, and so on.
From this current solution, we can already see that the prediction error can be high. Suppose we have no price data for some regions and must estimate a price from our complex manual calculations; what if the actual price then comes out 10% or 20% off? It would certainly affect our business. With a model, however, we can predict prices from a variety of features. Now that we know what our problem is, let’s frame it:
- Our data is labelled, so this is supervised learning.
- We are predicting a price, so it is a regression task, specifically multiple regression.
- We do not have a continuous stream of data and the model does not need to adapt to constantly changing data, so batch learning is sufficient.
Select Performance Measure
Now that we have framed the problem, let’s select a performance measure for our model. In most regression tasks, we use Root Mean Square Error (RMSE) to measure our model’s performance.
Another performance measure is the Mean Absolute Error (MAE).
Both RMSE and MAE are ways to calculate distance between the actual value vector and the predicted value vector.
RMSE
RMSE is based on the Euclidean norm, which measures the “straight-line” distance between two points in Euclidean space.
MAE
MAE is based on the Manhattan norm; it measures the average absolute difference between the actual values and the predicted values.
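As a quick illustration (a minimal sketch with made-up numbers, not part of the project itself), both measures can be computed directly with NumPy:
import numpy as np

# Hypothetical actual and predicted house prices, purely for illustration.
y_true=np.array([200000.,150000.,320000.])
y_pred=np.array([210000.,140000.,300000.])

rmse=np.sqrt(np.mean((y_true-y_pred)**2))   # Euclidean-norm based
mae=np.mean(np.abs(y_true-y_pred))          # Manhattan-norm based
print(rmse,mae)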
Check Assumptions
Finally, it is good practice to check our assumptions. For example, if the downstream system that consumes our predictions converts the prices into categories (cheap, medium, expensive), then the task is really a classification problem rather than a regression one: we should be predicting the category, not the exact price. For this project, let’s say it is a regression task.
The Project
First, let’s load our dataset.
import pandas as pd
import numpy as np
housing=pd.read_csv("E://books to read//100 Days of Machine Learning//Code//Project_1_Cali_House_Dataset//housing.csv")
housing.head()
housing.info()
We see that total_bedrooms has some missing values. We will resolve this issue later. Here, we also find that every other feature is a number (float64), but ocean_proximity is an object.
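As a quick optional sanity check (not part of the book’s flow), we can count the missing values per column directly:
# Count missing values in each column; only total_bedrooms should have any.
housing.isna().sum()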
housing["ocean_proximity"].value_counts()
So, it seems that there are five categories of ocean_proximity. Since our ML model only understands numbers, we will later convert this feature into numerical form using OneHotEncoder. For now, let’s visualize the data in the form of a histogram.
import matplotlib.pyplot as plt
%matplotlib inline
housing.hist(bins=50,figsize=(20,15))
plt.show()
From the figure, we can infer that most of the attributes are tail-heavy: they stretch much farther on one side of the median than on the other. This might cause problems for our ML model later, so it may be better to transform them into more bell-shaped distributions. The next thing to notice is the scale: the attributes have very different ranges, and median_income, for example, appears to have been scaled and capped rather than expressed in plain dollars.
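As an aside, a simple way to reduce such skew is a log transform. The snippet below is only an illustrative sketch applied to total_rooms; it is not used anywhere else in this project:
# Illustrative only: a log transform often makes a tail-heavy attribute more bell-shaped.
np.log1p(housing["total_rooms"]).hist(bins=50)
plt.show()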
Now, let’s work on our data. In order to train our model, it is necessary to divide it into a training set and a testing set. A general rule of thumb is to keep 20% of the data for testing and use the rest for training. But before we proceed, let’s look more closely at median_income. We see that most incomes are clustered between 1.5 and 6, but some go well beyond 6. Since we will stratify the split by income, it is important to have enough instances in each stratum.
housing["income_cat"]=pd.cut(housing['median_income'],bins=[0.,1.5,3.0,4.5,6,np.inf],labels=[1,2,3,4,5])
housing["income_cat"].hist()
plt.show()
Now, let’s create the training and testing sets.
from sklearn.model_selection import StratifiedShuffleSplit

split=StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index,test_index in split.split(housing,housing['income_cat']):
    strat_train=housing.loc[train_index]
    strat_test=housing.loc[test_index]
In order for the model to be effective and unbiased, it is necessary to have a good proportion of every category in our training set. While we could use sklearn's train_test_split to divide our data, a more effective way to split it according to the income category is StratifiedShuffleSplit.
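To convince ourselves that the split preserves the income categories, we can compare the category proportions in the stratified test set against those in the full dataset (a quick optional check):
# Income-category proportions: stratified test set vs. the full dataset.
print(strat_test["income_cat"].value_counts()/len(strat_test))
print(housing["income_cat"].value_counts()/len(housing))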
Finally, let’s get the data back into its original form by dropping the income_cat column.
for set_ in (strat_test,strat_train):
    set_.drop("income_cat",axis=1,inplace=True)
Until now, we’ve only gone through the data without digging deep into it. Now, let’s explore the data further. One way of doing so is through visualization.
copy_housing=strat_train.copy()
copy_housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.1)
Well, this definitely looks like California, with the high-density zones clearly visible. Now, let’s visualize the income.
copy_housing.plot(kind="scatter",x="longitude",y="latitude",s=copy_housing["population"]/100,label="population",c="median_income",cmap=plt.get_cmap("jet"),alpha=0.3)
plt.legend()
plt.show()
Well, another way of exploring the data is by computing correlations between attributes.
#Corr
corr=copy_housing.drop("ocean_proximity",axis="columns").corr()

from pandas.plotting import scatter_matrix
attributes=['median_house_value','median_income','total_rooms','housing_median_age']
scatter_matrix(copy_housing[attributes],figsize=(22,18))
From this plot, we can see that median_income and median_house_value have quite a linear relationship, whereas the other relationships are less clear. Now, let’s experiment by combining some attributes.
copy_housing["rooms_per_household"]=copy_housing["total_rooms"]/copy_housing["households"]
copy_housing["bedrooms_per_room"]=copy_housing["total_bedrooms"]/copy_housing["total_rooms"]
copy_housing["population_per_household"]=copy_housing["population"]/copy_housing["households"]corr=copy_housing.drop("ocean_proximity",axis="columns").corr()
corr["median_house_value"].sort_values(ascending=False)
Well, combining attributes indeed helped us: rooms_per_household shows a stronger correlation with median_house_value than the raw attributes it was derived from. Now that we have performed some feature engineering, let’s prepare our dataset by separating the features from the labels.
housing_features=strat_train.drop("median_house_value",axis=1)
housing_label=strat_train['median_house_value'].copy()
Now, we have reached an important point in our project: it’s time to create a pipeline. Remember that we had some missing values in total_bedrooms? We will now fix that via sklearn's SimpleImputer.
A SimpleImputer is used in the preprocessing stage of data preparation to handle missing values in a dataset. The goal is to replace null or missing values so that they do not negatively affect the performance of machine learning models. SimpleImputer does this by imputing values based on one of several strategies: mean, median, most frequent, or a constant. Next, we will also fix the problem of the very different feature scales via StandardScaler.
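Before wiring both steps into a pipeline, we can sanity-check the imputer on its own (a minimal optional sketch; the imputer only handles numerical columns, so ocean_proximity is dropped first):
from sklearn.impute import SimpleImputer

# Fit the imputer on the numerical columns and inspect the learned medians.
imputer=SimpleImputer(strategy="median")
housing_num=housing_features.drop("ocean_proximity",axis=1)
imputer.fit(housing_num)
print(imputer.statistics_)          # the median of each numerical column
print(housing_num.median().values)  # should match the values above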
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

pipeline=Pipeline([
    ("imputer",SimpleImputer(strategy="median")),
    ("std_Scaler",StandardScaler())
])
Now that we’ve created a pipeline to fix our data issues, let’s create a ColumnTransformer to specify the transformations for different columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_features=housing_features.drop("ocean_proximity",axis=1)
num_attrs=list(num_features)
cat_attrs=["ocean_proximity"]

full_pipeline=ColumnTransformer([
    ("num",pipeline,num_attrs),
    ("cat",OneHotEncoder(),cat_attrs)
])

housing_prepares=full_pipeline.fit_transform(housing_features)
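As a quick optional check, the transformed output now contains the scaled numerical features plus the one-hot encoded categories:
# Expect 8 scaled numerical columns + 5 one-hot columns for ocean_proximity = 13 columns.
print(housing_prepares.shape)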
This concludes our preprocessing. First, we fixed the missing values and standardized the scales of the numerical features; then, we converted the categorical data into numerical form. Now, we will start building our models.
Linear Regression
from sklearn.linear_model import LinearRegression

lin_reg=LinearRegression()
lin_reg.fit(housing_prepares,housing_label)
from sklearn.metrics import mean_squared_error

predictions=lin_reg.predict(housing_prepares)
lin_mse=mean_squared_error(housing_label,predictions)
lin_rmse=np.sqrt(lin_mse)
lin_rmse
The value is better than nothing, but it is far from perfect. Most of our housing prices range from $128,000 to $225,000, so a typical prediction error of around $68,000 is rather poor. This means our model isn’t powerful enough to capture the information in our features; in other words, it is underfitting the data.
Decision Tree
from sklearn.tree import DecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(housing_prepares,housing_label)
tree_predictions=tree_reg.predict(housing_prepares)
tree_mse=mean_squared_error(housing_label,tree_predictions)
tree_rmse=np.sqrt(tree_mse)
tree_rmse
#Output: 0.0
Well, our model turns out to be error free. But here is a problem: it shouldn’t actually be perfect. From this, we can infer that the model is badly overfitting the training data. So, how do we get a more honest evaluation? One way is to use sklearn's cross_val_score.
cross_val_score divides the training set into K subsets, called folds. It then trains and evaluates the model K times, each time training on K-1 folds and validating on the remaining fold.
from sklearn.model_selection import cross_val_score

cross_val_scores=cross_val_score(tree_reg,housing_prepares,housing_label,scoring="neg_mean_squared_error",cv=10)
tree_rmse_scores=np.sqrt(-cross_val_scores)
tree_rmse_scores
Then, we can aggregate all these values using np.mean(), which comes out to 71857.76227885179.
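A small helper (a hypothetical convenience function, called display_scores here) makes it easier to look at both the mean and the spread of the cross-validation scores:
def display_scores(scores):
    # Summarize an array of cross-validation RMSE scores.
    print("Scores:",scores)
    print("Mean:",scores.mean())
    print("Standard deviation:",scores.std())

display_scores(tree_rmse_scores)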
Well, under cross-validation, the Decision Tree actually performs worse than Linear Regression.
Random Forest Regressor
Finally, let’s use RandomForestRegressor. A RandomForestRegressor is an ensemble learning method that constructs numerous decision trees during training and outputs the mean of their predictions.
from sklearn.ensemble import RandomForestRegressor

forest_reg=RandomForestRegressor()
forest_reg.fit(housing_prepares,housing_label)
forest_pred=forest_reg.predict(housing_prepares)
forest_mse=mean_squared_error(housing_label,forest_pred)
forest_rmse=np.sqrt(forest_mse)
forest_rmse
#Output: 18694.91813388159
forest_cross_val_scores=cross_val_score(forest_reg,housing_prepares,housing_label,scoring="neg_mean_squared_error",cv=10)
forest_rmse_scores=np.sqrt(-forest_cross_val_scores)
forest_rmse_scores
#Output: 68321.7118618
Well, RandomForest looks much more promising, but the model is still overfitting: the training RMSE is far lower than the cross-validation RMSE. One way to tackle this is by fiddling with the hyperparameters.
GridSearchCV
sklearn's GridSearchCV can help us automate the process of hyperparameter tuning. It also evaluates the performance of our models using cross-validation.
from sklearn.model_selection import GridSearchCV

param_grid=[{"n_estimators":[3,10,30,60],"max_features":[2,4,6,8,10]},
            {"bootstrap":[False],"n_estimators":[3,10],"max_features":[2,3,4]}]

forest_reg=RandomForestRegressor()
grid_search=GridSearchCV(forest_reg,param_grid,cv=5,scoring="neg_mean_squared_error",return_train_score=True)
grid_search.fit(housing_prepares,housing_label)

grid_search.best_params_
#Output: {'max_features': 6, 'n_estimators': 60}
The .best_params_ attribute gives the most effective hyperparameters found for our model.
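We can also inspect how every combination performed during the search (a quick optional look at cv_results_):
# RMSE for each hyperparameter combination evaluated by the grid search.
cvres=grid_search.cv_results_
for mean_score,params in zip(cvres["mean_test_score"],cvres["params"]):
    print(np.sqrt(-mean_score),params)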
final_model=grid_search.best_estimator_

x_test=strat_test.drop("median_house_value",axis=1)
y_test=strat_test["median_house_value"].copy()
x_test["rooms_per_household"]=x_test["total_rooms"]/x_test["households"]
x_test["bedrooms_per_room"]=x_test["total_bedrooms"]/x_test["total_rooms"]
x_test["population_per_household"]=x_test["population"]/x_test["households"]
x_test_prepared=full_pipeline.transform(x_test)
final_pred=final_model.predict(x_test_prepared)
final_mse=mean_squared_error(y_test,final_pred)
final_rmse=np.sqrt(final_mse)
final_rmse
#Output: 47297.496275008285
The root mean square error of our final model is the best among the models we tried earlier. But our task doesn’t end here: in some cases, we may want to know the range of the generalization error to be convinced. This can be done by computing a confidence interval for the RMSE.
from scipy import stats

confidence=0.95
error=(final_pred-y_test)**2
ci=np.sqrt(stats.t.interval(confidence,len(error)-1,loc=error.mean(),scale=stats.sem(error)))
interval=ci[1]-ci[0]
mean_rmse=(ci[0]+ci[1])/2
proportion=(ci[1]-ci[0])/mean_rmse
print(ci)
#Output:[45333.94519671 49182.71770318]
Reflection
By the end of Day 6, I was not only able to learn the concepts behind many machine learning techniques, but also to implement them.
While this project comes from the book itself, I will also be working on a separate project to avoid “tutorial hell”.