Crack the Code: Mastering Category Encoders for Data Scientists




 

In data science, handling different types of data is a daily challenge. One of the most common data types is categorical data, which represents attributes or labels such as colors, gender, or types of vehicles. These characteristics or names can be divided into distinct groups or categories, facilitating classification and analysis. However, most machine learning algorithms work best with numbers, not words or labels. So, how do we make categorical data usable? This is where category encoders come in.

Category encoders are tools that transform these labels into numbers while keeping their original meaning intact. In this article, we’ll take a closer look at various encoders available in the category_encoders library, which follows scikit-learn’s fit/transform conventions, and how they can be used to improve the performance of machine learning models.

The category_encoders package in Python provides a variety of encoding techniques that cater to different types of data and scenarios. You can install it using pip:

pip install category_encoders

 

Common Types of Category Encoders

 

 

1. One-Hot Encoding

One-hot encoding converts each category into a binary vector. For example, if you have a ‘gender’ column with two categories (“male” and “female”), each will be represented as a separate column with binary values (1 or 0). The drawback is that having many distinct categories can significantly increase the number of features, resulting in increased computational complexity and memory usage.

Code Example:

import pandas as pd
from category_encoders import OneHotEncoder

# Sample data
data = {
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green'],
    'Size': ['S', 'M', 'L', 'S', 'L', 'M']
}

# Create a DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize the OneHotEncoder
encoder = OneHotEncoder(cols=['Color', 'Size'], use_cat_names=True)

# Fit and transform the DataFrame
df_encoded = encoder.fit_transform(df)

print("\nOne-Hot Encoded DataFrame:")
print(df_encoded)

 

Output:

Original DataFrame:
   Color Size
0    Red    S
1   Blue    M
2  Green    L
3   Blue    S
4    Red    L
5  Green    M

One-Hot Encoded DataFrame:
   Color_Red  Color_Blue  Color_Green  Size_S  Size_M  Size_L
0          1           0            0       1       0       0
1          0           1            0       0       1       0
2          0           0            1       0       0       1
3          0           1            0       1       0       0
4          1           0            0       0       0       1
5          0           0            1       0       1       0

 

One-hot encoding becomes impractical with categorical variables that have many distinct values. In the above example, the ‘Color’ variable has only 3 distinct values, and the encoder creates one binary column per unique color. A large number of unique colors would create a correspondingly large number of columns, making the dataset very sparse, increasing memory use, and making it harder for a model to learn from any single column.
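To make the growth in column count concrete, here is a minimal plain-Python sketch of the idea. It is an illustration only, not how category_encoders implements it; the `one_hot` helper and the 1,000 made-up IDs are assumptions for the example.

```python
def one_hot(values):
    """Return (categories, rows) for a plain-Python one-hot encoding."""
    # Keep categories in order of first appearance
    categories = list(dict.fromkeys(values))
    # Each row has a 1 in the position of its own category and 0 elsewhere
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


colors = ['Red', 'Blue', 'Green', 'Blue', 'Red', 'Green']
columns, rows = one_hot(colors)
print(columns)   # ['Red', 'Blue', 'Green']
print(rows[0])   # [1, 0, 0]

# Hypothetical high-cardinality column: 1,000 distinct IDs
ids = [f"id_{i}" for i in range(1000)]
wide_columns, wide_rows = one_hot(ids)
print(len(wide_columns))  # 1000 columns, one per distinct value
print(sum(wide_rows[0]))  # 1: every row is almost entirely zeros
```

The number of columns always equals the number of distinct values, and each row contains exactly one 1, which is why the result becomes sparse so quickly.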

 

2. Ordinal Encoding

Categorical variables can be categorized into two main types: ordinal and nominal. Ordinal variables have a meaningful order or ranking among the values, such as categories like “low,” “medium,” and “high,” where the categories are arranged in a sequence. In contrast, nominal variables consist of categories without any order or ranking, such as colors (e.g., “Red,” “Blue,” “Green”).

Ordinal encoding is a way to convert ordinal data into numbers. For example, if you have categories like “mild,” “moderate,” and “severe,” ordinal encoding would assign them numbers like 1, 2, and 3, respectively, preserving the order.

Code Example:

import pandas as pd
from category_encoders import OrdinalEncoder

# Sample data
data = {
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
}

# Create a DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Define the order of categories for the 'Education' column
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

# Initialize the OrdinalEncoder
encoder = OrdinalEncoder(cols=['Education'], mapping=[{'col': 'Education', 'mapping': {category: i for i, category in enumerate(education_order)}}])

# Fit and transform the DataFrame
df_encoded = encoder.fit_transform(df)

print("\nOrdinal Encoded DataFrame:")
print(df_encoded)

 

Output:

Original DataFrame:
     Education
0  High School
1     Bachelor
2       Master
3          PhD
4     Bachelor
5       Master

Ordinal Encoded DataFrame:
   Education
0          0
1          1
2          2
3          3
4          1
5          2
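The mapping passed to the encoder above is just a dictionary from category to rank. The following plain-Python sketch shows the same idea without the library; returning -1 for a category that is not in the list is a choice made for this sketch, and the `encode_education` helper is hypothetical.

```python
education_order = ['High School', 'Bachelor', 'Master', 'PhD']

# Rank each category by its position in the explicitly ordered list
rank = {category: i for i, category in enumerate(education_order)}


def encode_education(values):
    """Map each value to its rank; unknown values get -1 (sketch-only choice)."""
    return [rank.get(value, -1) for value in values]


values = ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'Master']
encoded = encode_education(values)
print(encoded)  # [0, 1, 2, 3, 1, 2]

# The numeric order mirrors the category order
print(rank['Bachelor'] < rank['PhD'])  # True

# A value outside the defined order is flagged rather than silently ranked
print(encode_education(['Diploma']))  # [-1]
```

Because the order is supplied explicitly, the result does not depend on the order in which categories happen to appear in the data.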

 

3. Binary Encoding

Binary encoding converts categories into a series of binary digits (0s and 1s). It differs from one-hot encoding in that each category is first assigned an integer, and that integer is then written in binary, with one column per bit. This provides a more compact representation and works well with tree-based algorithms, which can partition the data by splitting on the individual bit columns.

Code Example:

import pandas as pd
from category_encoders import BinaryEncoder

# Sample data
data = {
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
}


# Create a DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize the BinaryEncoder
encoder = BinaryEncoder(cols=['City'])

# Fit and transform the DataFrame
df_encoded = encoder.fit_transform(df)

print("\nBinary Encoded DataFrame:")
print(df_encoded)

 

Output:

Original DataFrame:
           City
0      New York
1   Los Angeles
2       Chicago
3       Houston
4       Phoenix
5  Philadelphia
6   San Antonio
7     San Diego
8        Dallas
9      San Jose

Binary Encoded DataFrame:
   City_0  City_1  City_2  City_3
0       0       0       0       1
1       0       0       1       0
2       0       0       1       1
3       0       1       0       0
4       0       1       0       1
5       0       1       1       0
6       0       1       1       1
7       1       0       0       0
8       1       0       0       1
9       1       0       1       0

 

In the example above, the binary encoding is more compact than one-hot encoding. For 10 cities, binary encoding creates only 4 columns, while one-hot encoding would create 10 columns (one for each city).
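The two steps behind that output (assign an integer, then write it in binary) can be sketched in plain Python. This is an illustration under the assumption that codes start at 1 in order of first appearance, which is consistent with the output above; the `binary_encode` helper is hypothetical and not the library's implementation.

```python
def binary_encode(values):
    """Encode values as lists of bits, one integer code per distinct value."""
    # Step 1: assign integer codes 1, 2, 3, ... in order of first appearance
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Step 2: find how many bits are needed for the largest code
    width = len(codes).bit_length()
    # Step 3: write each code as a zero-padded binary string, one bit per column
    return [[int(bit) for bit in format(codes[value], f'0{width}b')]
            for value in values]


cities = ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',
          'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']
bits = binary_encode(cities)
print(len(bits[0]))  # 4 columns for 10 cities
print(bits[0])       # [0, 0, 0, 1] for 'New York'
print(bits[9])       # [1, 0, 1, 0] for 'San Jose'
```

The column count grows with the logarithm of the number of categories, so even 1,000 distinct values would need only 10 bit columns in this sketch.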

 

4. Count Encoding

Count encoding replaces each distinct value of a categorical variable with its frequency in the dataset. It is effective in scenarios where the prevalence of a category correlates with the outcome.

Suppose we analyze a dataset that includes information about different car models.

Code Example:

import pandas as pd
from category_encoders import CountEncoder

# Sample dataset with car models
data = {'Car_Model': ['A', 'A', 'C', 'B', 'A', 'A', 'B', 'A', 'A', 'A']}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Initialize the CountEncoder
encoder = CountEncoder(cols=['Car_Model'])

# Fit and transform the DataFrame
df_encoded = encoder.fit_transform(df)

print("\nCount Encoded DataFrame:")
print(df_encoded)

 

Output:

Original DataFrame:
  Car_Model
0         A
1         A
2         C
3         B
4         A
5         A
6         B
7         A
8         A
9         A

Count Encoded DataFrame:
   Car_Model
0          7
1          7
2          1
3          2
4          7
5          7
6          2
7          7
8          7
9          7

 

The count encoder replaced each car model with the number of times it occurs in the dataset. If each row represents one sale, the encoded value indicates how often that model is sold, so it reflects the model’s popularity. This is useful when the frequency is correlated with the target, such as total sales revenue.
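The same counts can be reproduced with the standard library, which also makes one limitation easy to see: categories that occur equally often receive the same code. This is a minimal sketch, not the library's implementation, and the `count_encode` helper and the second list are made up for illustration.

```python
from collections import Counter


def count_encode(values):
    """Replace each value with the number of times it occurs in the list."""
    counts = Counter(values)
    return [counts[value] for value in values]


car_models = ['A', 'A', 'C', 'B', 'A', 'A', 'B', 'A', 'A', 'A']
encoded = count_encode(car_models)
print(encoded)  # [7, 7, 1, 2, 7, 7, 2, 7, 7, 7]

# Hypothetical data where two categories are equally frequent:
# both 'X' and 'Y' become 2, so the model can no longer tell them apart
tied = count_encode(['X', 'Y', 'X', 'Y'])
print(tied)  # [2, 2, 2, 2]
```

When such ties matter, count encoding is often paired with another encoder rather than used alone.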

 

5. BaseN Encoding

The BaseN encoder represents each category as a number written in base N, with one column per digit. The ‘N’ refers to the base: the decimal system we commonly use is base 10 (digits 0 to 9), octal is base 8 (digits 0 to 7), and hexadecimal is base 16 (digits 0 to 9 and letters A to F). With base 2, the result is the same as binary encoding. A larger base needs fewer columns, so it significantly reduces the dimensionality of the dataset, especially for high-cardinality features.

Code Example:

import pandas as pd
from category_encoders import BaseNEncoder

# Sample data
data = {
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Apple', 'Banana', 'Elderberry']
}

# Create a DataFrame
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Initialize the BaseNEncoder with base 3
encoder = BaseNEncoder(cols=['Fruit'], base=3)

# Fit and transform the DataFrame
df_encoded = encoder.fit_transform(df)

print("\nBase-3 Encoded DataFrame:")
print(df_encoded)

 

Output:

Original DataFrame:
        Fruit
0       Apple
1      Banana
2      Cherry
3        Date
4       Apple
5      Banana
6  Elderberry

Base-3 Encoded DataFrame:
   Fruit_0  Fruit_1
0        0        1
1        0        2
2        1        0
3        1        1
4        0        1
5        0        2
6        1        2
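To see where those digits come from, the sketch below assigns each fruit an integer and writes it in base 3 using repeated division. It assumes codes start at 1 in order of first appearance, which matches the output above; `to_base_n` and `base_n_encode` are hypothetical helpers, not the library's implementation.

```python
def to_base_n(number, base, width):
    """Return the digits of `number` in the given base, zero-padded to `width`."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first


def base_n_encode(values, base):
    """Encode values as base-N digit lists, one integer code per distinct value."""
    if base < 2:
        raise ValueError("this sketch only supports base >= 2")
    # Assign integer codes 1, 2, 3, ... in order of first appearance
    codes = {}
    for value in values:
        codes.setdefault(value, len(codes) + 1)
    # Use the fewest digits that can represent the largest code
    width = 1
    while base ** width <= len(codes):
        width += 1
    return [to_base_n(codes[value], base, width) for value in values]


fruits = ['Apple', 'Banana', 'Cherry', 'Date', 'Apple', 'Banana', 'Elderberry']
digits = base_n_encode(fruits, base=3)
print(digits[0])  # [0, 1] for 'Apple' (code 1)
print(digits[2])  # [1, 0] for 'Cherry' (code 3)
print(digits[6])  # [1, 2] for 'Elderberry' (code 5)
```

Raising the base trades column count for more distinct values per column: five fruits need three columns in base 2 but only two in base 3.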

 

Conclusion

 
In essence, category encoders are key tools in a data scientist’s toolkit. They transform raw categorical data into numerical formats that machine learning models can digest. By choosing the right encoding technique, data scientists can boost model performance and extract meaningful insights. As the field advances, smart use of these encoders will continue to play a crucial role in developing effective data-driven solutions.

 
 

Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
