Image by Author
Data wrangling takes up a significant portion of the time in several popular data roles, such as data analyst, data scientist, and data engineer.
Often, their main tool for data wrangling is Python due to its flexible nature and data processing libraries, such as pandas and NumPy.
If you’re aspiring to one of the roles I mentioned, learning data wrangling in Python is strongly advisable. Also strongly advisable is knowing what the heck wrangling actually is.
What Is Data Wrangling?
Data wrangling, or data munging, is the process of taking raw data and preparing it for analysis.
It includes several distinct tasks.
1. Data Collection
Here, you gather necessary raw data – structured, semi-structured, or unstructured – from sources such as:
- Databases
- APIs
- Spreadsheets
- CSV files
- Web scraping
- Manual entry
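To make the collection step concrete, here is a minimal sketch of loading raw data into pandas. The CSV content, the column names, and the `orders` variable are all made up for illustration; an in-memory string stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
raw_csv = io.StringIO(
    "order_id,customer,amount\n"
    "1,Ana,25.50\n"
    "2,Ben,\n"
    "3,Cleo,40.00\n"
)

# read_csv accepts a file path, a URL, or any file-like object.
orders = pd.read_csv(raw_csv)

# A quick first look: size of the table and missing values per column.
print(orders.shape)
print(orders.isna().sum())
```

The same pattern applies to other sources: pandas also provides `read_excel()`, `read_json()`, and `read_sql()` for spreadsheets, JSON payloads, and databases.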
2. Data Cleaning
The goal here is to ensure the dataset is reliable by:
- Imputing or removing missing values
- Correcting errors
- Standardizing formats
- Removing outliers
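The cleaning tasks above could look something like this in pandas. It's a sketch on invented data; median imputation and the 1.5 * IQR rule are just one common choice among several, not the only way to do it.

```python
import pandas as pd

# Invented data with messy text, a missing value, and an extreme value.
df = pd.DataFrame({
    "city": [" new york", "Boston ", "boston", "Chicago", "Chicago"],
    "sales": [120.0, None, 100.0, 110.0, 10_000.0],
})

# Standardize formats: trim whitespace and use consistent capitalization.
df["city"] = df["city"].str.strip().str.title()

# Impute missing values with the column median.
df["sales"] = df["sales"].fillna(df["sales"].median())

# Remove outliers: keep rows within 1.5 * IQR of the quartiles.
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = df[df["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(cleaned)
```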
3. Data Structuring
This task improves analysis efficiency by organizing data into a consistent format, e.g., by pivoting or melting tables.
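As a small illustration of melting and pivoting, the sketch below reshapes an invented store-by-month table from wide to long and back again. The column names are assumptions made for the example.

```python
import pandas as pd

# Wide format: one column per month.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "jan": [100, 80],
    "feb": [120, 90],
})

# melt: wide -> long, one row per (store, month) pair.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# pivot: long -> wide again, one column per month.
back = long.pivot(index="store", columns="month", values="sales")

print(long)
print(back)
```

The long format is usually easier to filter, group, and plot, while the wide format is easier to read as a report.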
4. Data Transformation
This task involves making data more suitable for a particular analysis by:
- Applying mathematical operations
- Converting data types (e.g., from strings to integers)
- Normalizing values
- Creating new features
- Combining or splitting columns
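Here is a rough sketch of those transformations on a toy table. All names and values are invented, and min-max scaling is used as one possible way to normalize.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "full_name": ["Ana Lopez", "Ben Kim", "Cleo Park"],
    "price": ["10", "20", "40"],  # numbers stored as strings
    "quantity": [3, 1, 2],
})

# Convert data types: strings to integers.
df["price"] = df["price"].astype(int)

# Create a new feature with a mathematical operation.
df["revenue"] = df["price"] * df["quantity"]
df["log_revenue"] = np.log(df["revenue"])

# Normalize values to the 0-1 range (min-max scaling).
value_range = df["revenue"].max() - df["revenue"].min()
df["revenue_scaled"] = (df["revenue"] - df["revenue"].min()) / value_range

# Split one column into two.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)

print(df)
```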
5. Data Enrichment
With this task, you bring in external data to add value and context to the existing data. For example, you can add demographic data to customer purchase data for better insights into buying patterns.
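The customer example could be sketched with a left join, as below. Both tables and their columns are hypothetical; the point is that `how="left"` keeps every purchase even when no demographic record exists.

```python
import pandas as pd

# Hypothetical purchase data.
purchases = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [20.0, 35.0, 15.0, 50.0],
})

# Hypothetical external demographic data (customer 3 is not covered).
demographics = pd.DataFrame({
    "customer_id": [1, 2],
    "age_group": ["18-25", "26-35"],
})

# Left join: keep all purchases, add demographics where available.
enriched = purchases.merge(demographics, on="customer_id", how="left")

print(enriched)
```

Purchases without a match get a missing `age_group`, which is itself something to handle in the cleaning step.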
6. Data Validation
This is where you check the data for accuracy and consistency, e.g., checking that data transformations have been applied correctly or that no anomalies are left in the data.
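One lightweight way to do this is a handful of named checks, as in the sketch below. The rules and the `checks` dictionary are assumptions made for illustration; real validation rules depend on your data.

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [25.5, 40.0, 12.0],
})

# Each check is a name mapped to True (passed) or False (failed).
checks = {
    "no_missing_values": not df.isna().any().any(),
    "unique_order_ids": df["order_id"].is_unique,
    "non_negative_amounts": bool((df["amount"] >= 0).all()),
    "amount_is_float": pd.api.types.is_float_dtype(df["amount"]),
}

failed = [name for name, passed in checks.items() if not passed]
print("Failed checks:", failed)
```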
Now that we have a clear idea of what we’re looking for in the syllabuses, let’s look at those five free courses.
1. Basics of Python Data Wrangling
Course link: Basics of Python Data Wrangling
Provider: Great Learning
Difficulty: Beginner
Description: This is a beginner’s course teaching you concepts of data wrangling with Python, focusing on pandas and NumPy. The syllabus covers:
- Exploring a webpage using inspect()
- Introduction to RegEx
- Finding characters in a text
- Using quantifiers to match patterns
- Matching groups of characters
- Introduction to web scraping
- Reading, scraping, and saving the data
- Wrangling text data using RegEx
- Data exploration
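To give a taste of the RegEx topics in that list, here is a small sketch using Python's built-in `re` module. The sample text and patterns are my own, not taken from the course.

```python
import re

text = "Order 1042 shipped on 2024-03-15; order 1043 shipped on 2024-03-18."

# Quantifiers: \d{4} means exactly four digits, \d{2} exactly two.
# Parentheses create groups, so each match is split into year, month, day.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
dates = date_pattern.findall(text)

# A character set [Oo] matches either case; \d+ means one or more digits.
order_ids = re.findall(r"[Oo]rder (\d+)", text)

print(dates)
print(order_ids)
```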
2. Python Pandas For Your Grandpa
Course link: Python Pandas For Your Grandpa
Provider: GormAnalysis
Difficulty: Beginner to intermediate
Description: This course was created for beginners learning to use pandas for data wrangling. However, the material extends from beginner into intermediate territory.
In the Series section, you will first learn series creation, indexing, basic operations, missing values, vectorization, and the apply() method.
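If those Series topics are new to you, this short sketch shows roughly what they look like in practice. The data is invented and the example is mine, not the course's.

```python
import pandas as pd

# Series creation with a custom index and one missing value.
prices = pd.Series([10.0, None, 30.0], index=["a", "b", "c"])

# Indexing by label.
first = prices["a"]

# Vectorization: the operation applies to every element at once.
doubled = prices * 2

# apply(): run a Python function on each element.
labels = prices.apply(lambda p: "missing" if pd.isna(p) else "ok")

# Handling missing values.
filled = prices.fillna(0)

print(doubled)
print(labels)
```

Vectorized operations are generally faster than `apply()`, so `apply()` is best kept for logic that has no vectorized equivalent.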
The DataFrame section that follows teaches you similar things, but this time on DataFrames, with the addition of the merge() and groupby() methods.
The Advanced section covers the following topics:
- Strings
- Dates and times
- Categorical
- MultiIndex
- DataFrame reshaping
Each section ends with several challenges, and the final section, Final Boss, has five additional challenges.
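A few of the Advanced topics can be previewed in a handful of lines. This is my own sketch on invented data, covering an ordered categorical, a MultiIndex, and reshaping with `unstack()`.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "year": [2023, 2024, 2023, 2024],
    # Ordered categorical: "small" < "large", regardless of the alphabet.
    "size": pd.Categorical(
        ["small", "large", "small", "large"],
        categories=["small", "large"],
        ordered=True,
    ),
    "sales": [10, 30, 20, 40],
})

# Sorting follows the category order, not alphabetical order.
largest_first = df.sort_values("size", ascending=False)

# MultiIndex: index rows by (region, year) and select with a tuple.
indexed = df.set_index(["region", "year"]).sort_index()
east_2024 = indexed.loc[("East", 2024), "sales"]

# Reshaping: move the "year" index level into columns.
wide = indexed["sales"].unstack("year")

print(wide)
```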
3. Data Analysis with Python
Course link: Data Analysis with Python
Provider: freeCodeCamp
Difficulty: Beginner to intermediate
Description: With this course, you will learn about the whole data analysis process. That may seem like more than you need, but a good portion of the course is dedicated to data wrangling and related topics. The course covers reading data from sources like CSV, SQL, and Excel, cleaning and transforming data with pandas and NumPy, and visualizing it with Matplotlib and seaborn.
The course’s final section has five Python data analysis projects, which are great for practice. (And necessary for claiming your certification.)
4. Data Wrangling With Python Pandas
Course link: Data Wrangling With Python Pandas
Provider: The Analytics Professor
Difficulty: Intermediate
Description: This intermediate YouTube playlist requires some familiarity with Python. It teaches data wrangling using pandas by covering topics like filtering, sorting, and handling missing values.
Here’s the list of topics:
- Using Python pandas series
- Using Python pandas DataFrames
- Selecting, filtering, and sorting Python pandas DataFrames
- Data wrangling with Python and pandas
- Working with dates and times in Python pandas
- Removing duplicate records in Python pandas
- Grouping and aggregating data in Python pandas
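Several of those topics fit naturally into one small pipeline. The sketch below is my own illustration on made-up order data, not material from the playlist.

```python
import pandas as pd

raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_date": ["2024-01-05", "2024-01-20", "2024-01-20", "2024-02-03"],
    "amount": [10.0, 20.0, 20.0, 5.0],
})

df = (
    raw.drop_duplicates()  # remove duplicate records
    # Working with dates: parse strings, then extract the month.
    .assign(order_date=lambda d: pd.to_datetime(d["order_date"]))
    .assign(month=lambda d: d["order_date"].dt.month)
)

# Selecting, filtering, and sorting.
january = df[df["month"] == 1].sort_values("amount", ascending=False)

# Grouping and aggregating.
monthly = df.groupby("month")["amount"].agg(["sum", "count"])

print(january)
print(monthly)
```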
5. Machine Learning Data Pre-Processing & Data Wrangling Using Python
Course link: Machine Learning Data Pre-Processing & Data Wrangling Using Python
Provider: The AI University
Difficulty: Intermediate
Description: One more intermediate YouTube playlist. It teaches you data wrangling with Python in the context of machine learning. While it employs some advanced techniques, most of them also apply to simpler data wrangling tasks. In other words, you don’t have to be building an ML model to benefit from it.
It teaches these topics:
- Dataset missing values & imputation
- One-hot encoding to process categorical variables
- Splitting data into training and test sets
- Feature scaling in ML
- Outlier detection and treatment
- Log transformation for outliers
- Outlier treatment with square root transformation
- Adding & dropping columns
- Creating pivot tables
- RegEx for splitting a string into DataFrame columns
- Using map(), apply(), and applymap()
- Merging DataFrames
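Here is a compact sketch of a few of these preprocessing steps using only pandas and NumPy. The data is invented, and the playlist may well use other tools (scikit-learn is a common choice for encoding, scaling, and splitting), so treat this as one possible approach.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue"],
    "income": [30_000, 45_000, 52_000, 1_000_000, 38_000],
})

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])

# Log transformation to reduce the influence of the extreme income.
encoded["log_income"] = np.log1p(encoded["income"])

# Feature scaling: standardize to zero mean and unit variance.
income = encoded["income"]
encoded["income_std"] = (income - income.mean()) / income.std()

# Simple random split: 80% training rows, the rest for testing.
train_set = encoded.sample(frac=0.8, random_state=0)
test_set = encoded.drop(train_set.index)

print(train_set.shape, test_set.shape)
```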
Conclusion
The five courses I mentioned above are a great gateway to learning data wrangling in Python. Along the way, you’ll improve your general Python skills and, especially, your knowledge of the pandas and NumPy libraries.
Some courses can also lead you in new directions, such as data analysis or machine learning. This is good because data wrangling is rarely the point in itself. So, learning about the other stages where the wrangled data is used can only benefit you.
Nate Rosidi is a data scientist who also works in product strategy. He’s also an adjunct professor teaching analytics and the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Nate writes on the latest trends in the career market, gives interview advice, shares data science projects, and covers everything SQL.