🎓 Data Cleaning/Transformation Techniques in Python

by Giovanni Melo, Aug 2024


1 — Importing the required libraries 📖

This step just imports the required libraries: pandas and numpy for data manipulation, plus seaborn and matplotlib for visualisation. These are staple libraries when it comes to data science.

import pandas as pd               # data manipulation and analysis
import numpy as np                # numerical operations
import seaborn as sns             # statistical visualisation
import matplotlib.pyplot as plt   # plotting

2 — Loading and displaying the dataset

url = 'https://github.com/gcesarmelo7/drugs_side-effects_medical-conditions/raw/main/data/drugs_side_effects_drugs_com.csv'

df = pd.read_csv(url)

df.head()
There are 17 columns in this dataset in total, but they do not all fit on the screen.
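If you want the preview to show every column, one option (a small convenience on top of the original notebook, not part of it) is to raise pandas' display limit before calling df.head():

pd.set_option('display.max_columns', None)  # show all columns when printing

df.head()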

You can see the whole Colab notebook here.

3 — Removing the unused or irrelevant columns 🗑️

For example, I can drop the brand_names and medical_condition_url columns if I don’t want to use them. The axis=1 argument tells pandas to drop columns; axis=0 (the default) would drop rows instead.

df = df.drop(['brand_names', 'medical_condition_url'], axis=1)

To check that this command has worked, you can use this code to see all the columns in the dataset.

df.columns
Index(['drug_name', 'medical_condition', 'side_effects', 'generic_name',
       'drug_classes', 'activity', 'rx_otc', 'pregnancy_category', 'csa',
       'alcohol', 'related_drugs', 'medical_condition_description', 'rating',
       'no_of_reviews', 'drug_link'],
      dtype='object')

4 — Renaming the columns ✏️

Renaming columns correctly is not only about data ‘beauty’: clear names improve the comprehensibility of the dataset and help to minimise errors in data analysis, since unclear or ambiguous column names can lead to coding errors or misinterpretation of the data.

new = {'rx_otc': 'over_the_counter',
       'csa': 'controlled_substances_act',
       'pregnancy_category': 'pregnancy'}

df.rename(columns=new, inplace=True)

As you can see, the columns are renamed correctly.
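To double-check (assuming the rename above ran without error), re-running df.columns should now show the three new names in place of rx_otc, pregnancy_category and csa:

df.columns
Index(['drug_name', 'medical_condition', 'side_effects', 'generic_name',
       'drug_classes', 'activity', 'over_the_counter', 'pregnancy',
       'controlled_substances_act', 'alcohol', 'related_drugs',
       'medical_condition_description', 'rating', 'no_of_reviews',
       'drug_link'],
      dtype='object')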

5 — Missing Data ⛔

Sometimes, we encounter instances where observations (rows) have missing data, and occasionally entire columns are completely devoid of values. What should we do in such situations? Should we disregard these gaps? NO!

Missing values are depicted differently depending on the tool used, such as NULL, empty strings, NaN, NA, #NA, among others. Understanding the context of the data is essential when handling missing values, as it sheds light on why the data is absent. There are two main approaches to managing missing data:

✅ Removal of data;

✅ Imputation of data.

To see whether there are missing values, the code below counts the null values in each column. Only 6 of the columns in this df have no missing values.

df.isna().sum()
drug_name                           0
medical_condition                   0
side_effects                      124
generic_name                       43
drug_classes                       82
activity                            0
over_the_counter                    1
pregnancy                         229
controlled_substances_act           0
alcohol                          1554
related_drugs                    1469
medical_condition_description       0
rating                           1345
no_of_reviews                    1345
drug_link                           0
dtype: int64

Without going further here, since the right approach depends on each case, for today it is enough to know that these two methods exist (removal or imputation of values). Before making a decision, first examine the patterns of missing values and think about the modelling and deployment plan, to make sure you make the right decision about the nulls.

In the future we can discuss in more depth which decision is more appropriate; for now, the sketch below illustrates what each option can look like.
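As a rough sketch only (the column choices here are illustrative assumptions, not a recommendation for this dataset), removal and imputation in pandas could look like this:

# Inspect the proportion of missing values per column before deciding
df.isna().mean().sort_values(ascending=False)

# Option 1 - removal: drop the rows that are missing a given column
df_removed = df.dropna(subset=['generic_name'])

# Option 2 - imputation: fill the gaps with a constant or a statistic
df_imputed = df.copy()
df_imputed['drug_classes'] = df_imputed['drug_classes'].fillna('unknown')
df_imputed['rating'] = df_imputed['rating'].fillna(df_imputed['rating'].median())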

6 — Duplicates ✌️

Occasionally, we encounter situations where multiple observations exist for one drug due to double form submissions or data entry errors. Despite the cause, it’s crucial to consolidate these into a unified, accurate representation of the truth.

✅ Time is not wasted targeting the same drug twice;

✅ A trusted and complete version of the truth is developed, and reliable business decisions can be made based on the analysis results.

As an example, see the code below:

df.duplicated().sum()

0

Fortunately, this dataset doesn’t have any duplicate rows. Another way of doing the check is to use this other code:

df.duplicated().any()

False
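If duplicates had appeared, a minimal sketch for consolidating them (assuming the duplicates are exact copies, with no conflicting values to reconcile) would be:

# Keep only the first occurrence of each fully duplicated row
df = df.drop_duplicates(keep='first')

# Or, stricter: treat rows sharing the same drug_name as duplicates
df = df.drop_duplicates(subset=['drug_name'], keep='first')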

7 — Conclusion

This article provides a basic overview of the data cleaning/transformation process from a beginner’s perspective. While this is a simplified approach rather than a comprehensive industry standard, it provides a solid starting point: beginning with smaller datasets allows for a manageable introduction before moving on to larger datasets that require more extensive cleaning.
