Dimensionality refers to the number of attributes or columns in a dataset. The higher the number of attributes, the greater the dataset’s dimensionality. Dimensionality reduction techniques help reduce the number of columns or attributes to an optimal level, decreasing the dataset’s complexity.
Dimensionality reduction aims to represent numerical input data in a lower-dimensional space while preserving the essential relationships in the data. Many dimensionality reduction algorithms exist, and no single one is ideal for every dataset.
Reducing input dimensions often results in fewer degrees of freedom (or parameters) and a simpler structure in the machine learning model. Too many degrees of freedom can lead to overfitting on the training dataset, causing the model to underperform on new data.
Some popular methods of dimensionality reduction include the following (a short scikit-learn sketch of the last two is shown after the list):
- Principal Component Analysis
- Singular Value Decomposition
- Non-negative matrix factorization
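While this post focuses on PCA, the other two methods are also available in scikit-learn. The snippet below is only a rough sketch: the 20-component setting and the use of the digits data (which is non-negative, as NMF requires) are assumptions made here for illustration, not part of the original post.
#A minimal sketch of SVD and NMF in scikit-learn (assumptions noted above)
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD, NMF
X = load_digits().data                     #64 pixel features, all non-negative
svd = TruncatedSVD(n_components=20)        #Singular Value Decomposition
X_svd = svd.fit_transform(X)
nmf = NMF(n_components=20, max_iter=500)   #Non-negative matrix factorization
X_nmf = nmf.fit_transform(X)
X_svd.shape, X_nmf.shape                   #(1797, 20) and (1797, 20)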
In this blog, we will go into more detail about how Principal Component Analysis (PCA) works.
Principal component analysis (PCA) is one of the most popular unsupervised learning methods for reducing data dimensionality. PCA increases interpretability while minimizing information loss: it summarizes the information contained in large datasets using a smaller set of ‘summary indices’ (the principal components) that are easier to visualize and interpret.
A short code example of how it is used is outlined below:
#Importing the needed libraries
from sklearn.datasets import load_digits
import pandas as pd
#Loading the dataset
dataset = load_digits()
dataset.keys()
#Showing the data shape
dataset.data.shape
#Loading the dataset as a pandas dataframe
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df.head()
#Splitting the data into X and y
X = df
y = dataset.target
#Scaling the dataset
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled
#Splitting the data into test set and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=30)
#Training and scoring model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
#Using components such that 95% of variance is retained
from sklearn.decomposition import PCA
pca = PCA(0.95)
X_pca = pca.fit_transform(X)
X_pca.shape
Output:
From the code above, which has its complete outputs in the GitHub link attached at the end of the blog, the number of columns or features is reduced from 64 to 29.
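As a quick, optional sanity check (not part of the original snippet; the standalone re-fit below is only an illustration), you can confirm how many components a 95%-variance PCA keeps and how much variance they explain:
#Checking the number of retained components and the variance they explain
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
X = load_digits().data
pca = PCA(0.95).fit(X)
print(pca.n_components_)                    #number of components kept (29 here)
print(pca.explained_variance_ratio_.sum())  #total variance retained, at least 0.95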
Data preprocessing comes after you have cleaned the data and performed some exploratory data analysis. It entails prepping your data for modelling. It sometimes involves converting categorical columns into numerical ones.
Steps to preprocess your data:
- Removing missing data, which could be rows or columns of the dataset
- Converting columns to the correct data types
- Standardizing the dataset
- Feature engineering: creating new features from existing ones
- Splitting the data into train and test sets
In this tutorial, we will go through the first three; a brief sketch of the last two is shown below for completeness.
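The sketch below uses a small made-up dataframe; the column names and the derived temp_wind_ratio feature are purely illustrative and not taken from the original post.
#A toy illustration of feature engineering and train/test splitting
import pandas as pd
from sklearn.model_selection import train_test_split
toy = pd.DataFrame({
    'temperature': [32, 35, 28, 24, 31],
    'windspeed': [6, 7, 2, 7, 4],
    'rained': [0, 0, 1, 1, 0]
})
#Feature engineering: creating a new feature from existing ones
toy['temp_wind_ratio'] = toy['temperature'] / toy['windspeed']
#Splitting the data into train and test sets
X = toy.drop(columns='rained')
y = toy['rained']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)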
#Importing the needed libraries
import pandas as pd
import matplotlib.pyplot as plt
#Uploading the dataset onto google colab
from google.colab import files
files.upload()
#Loading the dataset as a pandas dataframe
df = pd.read_csv('/content/weather_data.csv', parse_dates=['day'])
df.set_index('day', inplace=True)
df
#Checking for missing values
df.isna().sum()
#Method 1: Filling the missing values with 0s and creating a new dataset
new_df = df.fillna(0)
new_df
#Method 2: Filling the missing values with the appropriate values
new_df = df.fillna({
    'temperature': 0,
    'windspeed': 0,
    'event': 'no event'
})
new_df
Output:
1. Dataset
2. Method 1: Filling NaN with 0s
3. Method 2: Filling the numerical missing values with 0s and the categorical ones with “no event”
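The first step in the list above mentions removing missing data entirely; as a brief aside (the toy dataframe below is assumed for illustration, not the weather dataset), pandas’ dropna handles that case:
#Dropping rows or columns that contain missing values
import numpy as np
import pandas as pd
toy = pd.DataFrame({
    'temperature': [32, np.nan, 28],
    'windspeed': [6, 7, np.nan],
    'event': ['Rain', None, 'Sunny']
})
toy.dropna()                 #drops every row containing a NaN
toy.dropna(axis='columns')   #drops every column containing a NaN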
Continuing with the new_df dataframe created above, the next step is to convert the columns to the right data types:
#Checking the data types of the various columns
new_df.dtypes
#Changing temperature column from float into integer
new_df['temperature'] = new_df['temperature'].astype('int64')
#Checking the data types once more
new_df.dtypes
Output:
1. Data types before
2. Data types after
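As mentioned earlier, preprocessing sometimes also means converting categorical columns into numerical ones. A minimal sketch using one-hot encoding (the toy 'event' column below is assumed, not part of the original post) could look like this:
#Turning a categorical column into indicator columns with one-hot encoding
import pandas as pd
toy = pd.DataFrame({'event': ['Rain', 'Sunny', 'Snow', 'Sunny']})
pd.get_dummies(toy, columns=['event'])   #one indicator column per category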
Standardization entails rescaling data so that it has a mean of 0 and a standard deviation of 1, the parameters of a standard normal distribution. Each value x is rescaled to (x − mean) / (standard deviation).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import random
# set seed
random.seed(42)
# thousand random numbers
num = [[random.randint(0,1000)] for _ in range(1000)]
# standardize values
ss = StandardScaler()
num_ss = ss.fit_transform(num)
# plot histograms
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
ax[0].hist(list(np.concatenate(num).flat), ec='black')
ax[0].set_xlabel('Value')
ax[0].set_ylabel('Frequency')
ax[0].set_title('Before Standardization')
ax[1].hist(list(np.concatenate(num_ss).flat), ec='black')
ax[1].set_xlabel('Value')
ax[1].set_ylabel('Frequency')
ax[1].set_title('After Standardization')
plt.show()
Output: side-by-side histograms of the values before and after standardization.
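One practical note, offered here as an aside rather than anything from the original post: when a model will be evaluated on held-out data, the scaler is usually fit on the training split only and then reused to transform the test split, so that no information from the test set leaks into the scaling parameters.
#Fitting the scaler on the training split only, then reusing it on the test split
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = load_digits().data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=30)
scaler = StandardScaler().fit(X_train)   #mean and std learned from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)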
GitHub Link with all code:
https://github.com/Jegge2003/dimension_preparation