Data processing and analysis are a major part of software and data engineering jobs. Pandas is the go-to library in Python, widely used in the industry for processing and cleaning tabular data. In this article, we will cover the basics of Pandas by exploring 10 essential commands you should know for any data preprocessing task.
We will use a dummy dataset manually generated with Python as a running example throughout the article. We will explore the dataset, cleaning it along the way with Pandas to familiarize you with some important concepts.
Setup
For this article, we only require the Pandas library. You can install it using the pip package manager, and you are all set to follow along.
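A single command in the terminal does the job:
pip install pandas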
1. read_csv()
Our dataset is in a CSV file, and we need to load it into Pandas first. Pandas provides simple helper functions for various file types. We can load the CSV dataset into a Pandas data frame using the below method call:
import pandas as pd
df = pd.read_csv("dummy_data.csv")
The data frame is a Pandas object that structures your tabular data into an appropriate format. read_csv() loads the complete data into memory, so it is now ready for preprocessing.
2. head() & tail()
It is important to get a high-level look at the dataset so we know what we are working with. Printing the complete data might be impossible for large-scale datasets, where the rows can run into the thousands or even millions. For sanity checks, it is sufficient to look at the start or end of our data.
We can use the head method call to look at the first 5 rows of our dataset:
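df.head()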
This shows our data and gives us a better sense of what we are working with. The tail method works similarly but prints the last 5 rows of your data:
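df.tail()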
3. info()
A good starting point, but is it enough? Information about only 5 rows might be insufficient when processing the complete dataset. Let's get a summarized view of our data:
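df.info()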
The above method call prints a summary of each column, giving us more information about the specific data types, the total number of rows, null values, and memory usage. We can then start preprocessing the data based on the results.
4. describe()
We have three different numerical columns. For a better understanding, it helps to know some basic statistics, like the mean and spread of the data. Let's look into these columns in greater detail using the describe method call:
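df.describe()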
This gives some important information about the numerical columns. It is useful for spotting outliers and understanding the range of data we are dealing with.
5. isnull()
Dealing with null values can be challenging. Null values during data analysis can cause runtime errors and unexpected results. We need to know whether we have null values and deal with them appropriately beforehand. Let's extract this information from our data and see if and where the null values are.
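A common pattern is to chain isnull() with sum() to count the missing values per column:
df.isnull().sum()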
This simple method call shows us that we have some null values in the name, age, and salary columns. We need to fix these values before running any analysis on our data. Let's do this next.
6. dropna()
We have three columns with missing values. There are numerous ways to deal with them, and we will use two here.
The first is the simplest: remove any row with a null value. Let's use this for the name and age columns.
df = df.dropna(subset=['name', 'age'])
We replace our data frame, removing any rows with null values in the name or age column. This is the simplest choice, but it can drastically reduce the dataset size. Use it sparingly, as it may not be the right approach for your data.
7. fillna()
An alternative is to fill in the missing values with another value. Generally, we use the mean for numerical columns, since it causes minimal change to your mathematical analysis while maintaining the original size of the data. Let's use it for the salary column, replacing any missing values with the mean salary.
df['salary'] = df['salary'].fillna(df['salary'].mean())
Now, if we run the isnull() method again, we can verify that the null values are removed.
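df.isnull().sum()  # every count should now be 0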
For categorical columns, you can replace the null values with the most frequently occurring label or use a custom placeholder label to represent missing values.
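As an illustrative sketch (the department column in our data has no missing values, so this is for demonstration only):
df['department'] = df['department'].fillna(df['department'].mode()[0])  # most frequent label
# or: df['department'] = df['department'].fillna('Unknown')  # custom placeholder label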
8. Filter Your Data
Filtering is like applying a WHERE clause in a database. It is widely used and helps when you need to work on a specific subset of your data. For our use case, let us filter the data to only include rows where the department is Engineering. There is no dedicated method call for this; we can use conditional indexing to fulfill our purpose.
To filter on the department, we can use the below syntax:
df = df[df['department'] == 'Engineering']
df.head()
To summarize, the inner comparison produces a boolean mask marking the rows where the department matches Engineering, and indexing the data frame with that mask keeps only those rows.
9. apply()
We can run a lambda function on a column to modify its values. As a simple example, let's convert the names to lowercase. To run a function over a complete column, we can use the apply method, which iterates over each value and modifies it.
df['name'] = df['name'].apply(lambda x: x.lower())
We run the apply method on the name column and convert each value to lowercase. The x parameter is populated with the name column values one at a time, and we can modify them as we wish with Python. Every name in the resulting data frame is now lowercase.
10. quantile()
Outliers can skew your analysis of numerical columns, so it is important to remove them. We can use the 25th and 75th percentiles of the numerical data to get the interquartile range (IQR). Mathematically, outliers are values falling below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles. This gives us an acceptable range, and we can filter out any values outside it.
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
The above calls compute the interquartile range of the salary column, and we can now filter out outliers using conditional indexing as shown before.
# Filter salaries within the acceptable range
df = df[(df['salary'] >= Q1 - 1.5 * IQR) & (df['salary'] <= Q3 + 1.5 * IQR)]
This removes the outliers, leaving only rows with salaries inside the acceptable range.
A final call to the head method shows how our data looks at the end:
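df.head()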
Wrapping Up
From the original 100 rows, we have removed the null values, processed the names, removed outliers, and filtered to a specific department. While this is not an exhaustive list of preprocessing commands in Pandas, all these methods are commonly used for data preprocessing in real-world analysis. This puts you in a good spot to start your first data analysis project in Python, and it should make it easier to learn more advanced analysis tools and methods.
Kanwal Mehreen
Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.