Dimensionality refers to the number of attributes or columns in a dataset. The higher the number of attributes, the greater the dataset’s dimensionality. Dimensionality reduction techniques help reduce the number of columns or attributes to an optimal level, decreasing the dataset’s complexity.
Dimensionality reduction aims to represent numerical input data in a lower-dimensional space while preserving the essential relationships in the data. Many dimensionality reduction algorithms exist, and no single one is ideal for every dataset.
Reducing input dimensions often results in fewer degrees of freedom (or parameters) and a simpler structure in the machine learning model. Too many degrees of freedom can lead to overfitting on the training dataset, causing the model to underperform on new data.
Some popular methods of dimensionality reduction include the following (a short scikit-learn sketch of the last two is shown after the list):
- Principal Component Analysis
- Singular Value Decomposition
- Non-negative matrix factorization
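While this post focuses on PCA, the other two methods are also available in scikit-learn. The snippet below is only a rough sketch: the 20-component setting and the use of the digits data (which is non-negative, as NMF requires) are assumptions made here for illustration, not part of the original post.
#A minimal sketch of SVD and NMF in scikit-learn (assumptions noted above)
from sklearn.datasets import load_digits
from sklearn.decomposition import TruncatedSVD, NMF
X = load_digits().data                     #64 pixel features, all non-negative
svd = TruncatedSVD(n_components=20)        #Singular Value Decomposition
X_svd = svd.fit_transform(X)
nmf = NMF(n_components=20, max_iter=500)   #Non-negative matrix factorization
X_nmf = nmf.fit_transform(X)
X_svd.shape, X_nmf.shape                   #(1797, 20) and (1797, 20)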
In this blog, we will go into more detail about how Principal Component Analysis (PCA) works.
Principal component analysis (PCA) is one of the most popular unsupervised learning methods for reducing data dimensionality. PCA increases interpretability while minimizing information loss: it summarizes the information contained in large datasets using a smaller set of ‘summary indices’ (the principal components) that are easier to visualize and interpret.
A short code example of how it is used is outlined below:
#Importing the needed libraries
from sklearn.datasets import load_digits
import pandas as pd
#Loading the dataset
dataset = load_digits()
dataset.keys()
#Showing the data shape
dataset.data.shape
#Loading the dataset as a pandas dataframe
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df.head()
#Splitting the data into X and y
X = df
y = dataset.target
#Scaling the dataset
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled
#Splitting the data into test set and train set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=30)
#Training and scoring model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)
#Using components such that 95% of variance is retained
from sklearn.decomposition import PCA
pca = PCA(0.95)
X_pca = pca.fit_transform(X)
X_pca.shape
Output:
From the code above, which has its complete outputs in the GitHub link attached at the end of the blog, the number of columns or features is reduced from 64 to 29.
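As a quick, optional sanity check (not part of the original snippet; the standalone re-fit below is only an illustration), you can confirm how many components a 95%-variance PCA keeps and how much variance they explain:
#Checking the number of retained components and the variance they explain
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
X = load_digits().data
pca = PCA(0.95).fit(X)
print(pca.n_components_)                    #number of components kept (29 here)
print(pca.explained_variance_ratio_.sum())  #total variance retained, at least 0.95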
Data preprocessing comes after you have cleaned the data and performed some exploratory data analysis. It entails prepping your data for modelling. It sometimes involves converting categorical columns into numerical ones.
Steps to preprocess your data:
- Removing missing data, which could be rows or columns of the dataset
- Converting columns to the correct data types
- Standardizing the dataset
- Feature engineering: creating new features from existing ones
- Splitting the data into train and test sets
In this tutorial, we will go through the first three; a brief sketch of the last two is shown below for completeness.
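The sketch below uses a small made-up dataframe; the column names and the derived temp_wind_ratio feature are purely illustrative and not taken from the original post.
#A toy illustration of feature engineering and train/test splitting
import pandas as pd
from sklearn.model_selection import train_test_split
toy = pd.DataFrame({
    'temperature': [32, 35, 28, 24, 31],
    'windspeed': [6, 7, 2, 7, 4],
    'rained': [0, 0, 1, 1, 0]
})
#Feature engineering: creating a new feature from existing ones
toy['temp_wind_ratio'] = toy['temperature'] / toy['windspeed']
#Splitting the data into train and test sets
X = toy.drop(columns='rained')
y = toy['rained']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)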
#Importing the needed libraries
import pandas as pd
import matplotlib.pyplot as plt
#Uploading the dataset onto google colab
from google.colab import files
files.upload()
#Loading the dataset as a pandas dataframe
df = pd.read_csv('/content/weather_data.csv', parse_dates=['day'])
df.set_index('day', inplace=True)
df
#Checking for missing values
df.isna().sum()
#Method 1: Filling the missing values with 0s and creating a new dataset
new_df = df.fillna(0)
new_df
#Method 2: Filling the missing values with the appropriate values
new_df = df.fillna({
    'temperature': 0,
    'windspeed': 0,
    'event': 'no event'
})
new_df
Output:
1. Dataset
2. Method 1: Filling NaN with 0s
3. Method 2: Filling the numerical missing values with 0s and the categorical ones with “no event”
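The first step in the list above mentions removing missing data entirely; as a brief aside (the toy dataframe below is assumed for illustration, not the weather dataset), pandas’ dropna handles that case:
#Dropping rows or columns that contain missing values
import numpy as np
import pandas as pd
toy = pd.DataFrame({
    'temperature': [32, np.nan, 28],
    'windspeed': [6, 7, np.nan],
    'event': ['Rain', None, 'Sunny']
})
toy.dropna()                 #drops every row containing a NaN
toy.dropna(axis='columns')   #drops every column containing a NaN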
Continuing with the new_df dataframe created above, the next step is to convert the columns to the right data types:
#Checking the data types of the various columns
new_df.dtypes
#Changing temperature column from float into integer
new_df['temperature'] = new_df['temperature'].astype('int64')
#Checking the data types once more
new_df.dtypes
Output:
1. Data types before
2. Data types after
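As mentioned earlier, preprocessing sometimes also means converting categorical columns into numerical ones. A minimal sketch using one-hot encoding (the toy 'event' column below is assumed, not part of the original post) could look like this:
#Turning a categorical column into indicator columns with one-hot encoding
import pandas as pd
toy = pd.DataFrame({'event': ['Rain', 'Sunny', 'Snow', 'Sunny']})
pd.get_dummies(toy, columns=['event'])   #one indicator column per category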
Standardization entails rescaling data so that it has a mean of 0 and a standard deviation of 1, the parameters of a standard normal distribution. Each value x is rescaled to (x − mean) / (standard deviation).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import random
# set seed
random.seed(42)
# thousand random numbers
num = [[random.randint(0,1000)] for _ in range(1000)]
# standardize values
ss = StandardScaler()
num_ss = ss.fit_transform(num)
# plot histograms
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10,5))
ax[0].hist(list(np.concatenate(num).flat), ec='black')
ax[0].set_xlabel('Value')
ax[0].set_ylabel('Frequency')
ax[0].set_title('Before Standardization')
ax[1].hist(list(np.concatenate(num_ss).flat), ec='black')
ax[1].set_xlabel('Value')
ax[1].set_ylabel('Frequency')
ax[1].set_title('After Standardization')
plt.show()
Output: side-by-side histograms of the values before and after standardization.
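One practical note, offered here as an aside rather than anything from the original post: when a model will be evaluated on held-out data, the scaler is usually fit on the training split only and then reused to transform the test split, so that no information from the test set leaks into the scaling parameters.
#Fitting the scaler on the training split only, then reusing it on the test split
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = load_digits().data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=30)
scaler = StandardScaler().fit(X_train)   #mean and std learned from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)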
GitHub Link with all code:
https://github.com/Jegge2003/dimension_preparation