🎓 Data Cleaning/Transformation Techniques in Python

by Giovanni Melo, Aug 2024


1 — Importing the required libraries 📖

This step just imports the required libraries: pandas and numpy for data manipulation, plus seaborn and matplotlib for visualisation. These are staple libraries when it comes to data science.

import pandas as pd               # data manipulation and analysis
import numpy as np                # numerical operations
import seaborn as sns             # statistical visualisation
import matplotlib.pyplot as plt   # plotting

2 — Loading and displaying the dataset

url = 'https://github.com/gcesarmelo7/drugs_side-effects_medical-conditions/raw/main/data/drugs_side_effects_drugs_com.csv'

df = pd.read_csv(url)

df.head()
There are 17 columns in this dataset in total, but they do not all fit on the screen.
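If you want the preview to show every column, one option (a small convenience on top of the original notebook, not part of it) is to raise pandas' display limit before calling df.head():

pd.set_option('display.max_columns', None)  # show all columns when printing

df.head()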

You can see the whole Colab notebook here.

3 — Removing the unused or irrelevant columns 🗑️

For example, I can drop the brand_names and medical_condition_url columns if I don’t want to use them. The axis=1 argument tells pandas to drop columns; axis=0 (the default) would drop rows instead.

df = df.drop(['brand_names', 'medical_condition_url'], axis=1)

To check that this command has worked, you can use this code to see all the columns in the dataset.

df.columns
Index(['drug_name', 'medical_condition', 'side_effects', 'generic_name',
       'drug_classes', 'activity', 'rx_otc', 'pregnancy_category', 'csa',
       'alcohol', 'related_drugs', 'medical_condition_description', 'rating',
       'no_of_reviews', 'drug_link'],
      dtype='object')

4 — Renaming the columns ✏️

Renaming columns correctly is not only about data ‘beauty’: clear names improve the comprehensibility of the dataset and help to minimise errors in data analysis, since unclear or ambiguous column names can lead to coding errors or misinterpretation of the data.

new = {'rx_otc': 'over_the_counter',
       'csa': 'controlled_substances_act',
       'pregnancy_category': 'pregnancy'}

df.rename(columns=new, inplace=True)

As you can see, the columns are renamed correctly.
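To double-check (assuming the rename above ran without error), re-running df.columns should now show the three new names in place of rx_otc, pregnancy_category and csa:

df.columns
Index(['drug_name', 'medical_condition', 'side_effects', 'generic_name',
       'drug_classes', 'activity', 'over_the_counter', 'pregnancy',
       'controlled_substances_act', 'alcohol', 'related_drugs',
       'medical_condition_description', 'rating', 'no_of_reviews',
       'drug_link'],
      dtype='object')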

5 — Missing Data ⛔

Sometimes, we encounter instances where observations (rows) have missing data, and occasionally entire columns are completely devoid of values. What should we do in such situations? Should we disregard these gaps? NO!

Missing values are depicted differently depending on the tool used, such as NULL, empty strings, NaN, NA, #NA, among others. Understanding the context of the data is essential when handling missing values, as it sheds light on why the data is absent. There are two main approaches to managing missing data:

✅ Removal of data;

✅ Imputation of data.

To see whether there are missing values, the code below counts the null values in each column. Only 6 of the columns in this df have no missing values.

df.isna().sum()
drug_name                           0
medical_condition                   0
side_effects                      124
generic_name                       43
drug_classes                       82
activity                            0
over_the_counter                    1
pregnancy                         229
controlled_substances_act           0
alcohol                          1554
related_drugs                    1469
medical_condition_description       0
rating                           1345
no_of_reviews                    1345
drug_link                           0
dtype: int64

Without going further here, since the right approach depends on each case, for today it is enough to know that these two methods exist (removal or imputation of values). Before making a decision, first examine the patterns of missing values and think about the modelling and deployment plan, to make sure you make the right decision about the nulls.

In the future we can discuss in more depth which decision is more appropriate; for now, the sketch below illustrates what each option can look like.
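As a rough sketch only (the column choices here are illustrative assumptions, not a recommendation for this dataset), removal and imputation in pandas could look like this:

# Inspect the proportion of missing values per column before deciding
df.isna().mean().sort_values(ascending=False)

# Option 1 - removal: drop the rows that are missing a given column
df_removed = df.dropna(subset=['generic_name'])

# Option 2 - imputation: fill the gaps with a constant or a statistic
df_imputed = df.copy()
df_imputed['drug_classes'] = df_imputed['drug_classes'].fillna('unknown')
df_imputed['rating'] = df_imputed['rating'].fillna(df_imputed['rating'].median())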

6 — Duplicates ✌️

Occasionally, we encounter situations where multiple observations exist for one drug due to double form submissions or data entry errors. Despite the cause, it’s crucial to consolidate these into a unified, accurate representation of the truth.

✅ Time is not wasted targeting the same drug twice;

✅ A trusted and complete version of the truth is developed, and reliable business decisions can be made based on the analysis results.

As an example, see the code below:

df.duplicated().sum()

0

Fortunately, this dataset doesn’t have any duplicate rows. Another way of doing the check is to use this other code:

df.duplicated().any()

False
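If duplicates had appeared, a minimal sketch for consolidating them (assuming the duplicates are exact copies, with no conflicting values to reconcile) would be:

# Keep only the first occurrence of each fully duplicated row
df = df.drop_duplicates(keep='first')

# Or, stricter: treat rows sharing the same drug_name as duplicates
df = df.drop_duplicates(subset=['drug_name'], keep='first')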

7 — Conclusion

This article provides a basic overview of the data cleaning/transformation process from a beginner’s perspective. While this is a simplified approach rather than a comprehensive industry standard, it provides a solid starting point: beginning with smaller datasets allows for a manageable introduction before moving on to larger datasets that require more extensive cleaning.
