Image by Author | Ideogram
Today, we’ll look at data cleaning using Bash, an essential skill for anyone working with data. While many of us reach for Python or R when it comes to data work, you can use Bash for quick and efficient data manipulation—right from your command line.
In this article, I’ll not only show you the commands but also walk you through the thinking process behind each step, explaining why we’re using specific commands and how they work together to solve common data cleaning problems.
Let’s dive right in!
Understanding the Data Cleaning Process
Before we jump into the code, let’s understand what data cleaning actually entails. Data cleaning involves going from raw and messy data to data that is accurate, complete, and ready for analysis. This can include:
- Handling missing values
- Fixing formatting issues
- Removing duplicates
- Standardizing inconsistent data
- Validating data against business rules
Clean data is, therefore, necessary for accurate analysis and reliable results. The “garbage in, garbage out” principle applies strongly to data science (no surprises there!).
🔗 We’ll use a sample dataset, the users.csv file, in this tutorial.
Understanding Your Data
Before making any changes, it’s important to understand what you’re working with. Let’s explore some basic inspection commands.
To view the contents of the file, you can use the cat command.
$ cat users.csv
Output:
1,John,Smith,john.smith@example.com,2023-01-15,2023-03-20,125.99
2,Jane,Doe,jane.doe@example.com,2023-01-16,2023-03-21,210.50
3,Bob,Johnson,bob@example.com,2023-01-17,2023-03-22,0
4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
5,,Brown,mike.brown@example.com,2023-01-19,2023-03-24,150.75
6,Sarah,Miller,sarah.miller@example.com,invalid_date,2023-03-25,95.00
7,David,Jones,david.jones@example.com,2023-01-21,2023-03-26,300.00
8,Lisa,Garcia,lisa.garcia@example.com,2023-01-22,2023-03-27,-50.00
9,James,Martinez,mymail@example.com,2023-01-23,2023-03-28,125.00
For large files, this might not be practical, so we have alternatives.
The head command, by default, displays the first 10 lines of a file. This is useful for quickly checking the structure and headers of your data.
$ head users.csv
You can specify a different number of lines with the -n option.
$ head -n 5 users.csv
This gives:
id,first_name,last_name,email,signup_date,last_login,purchase_amount
1,John,Smith,john.smith@example.com,2023-01-15,2023-03-20,125.99
2,Jane,Doe,jane.doe@example.com,2023-01-16,2023-03-21,210.50
3,Bob,Johnson,bob@example.com,2023-01-17,2023-03-22,0
4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
Similarly, tail shows the last 10 lines by default. This is helpful for checking the most recent entries in log files or time-series data.
$ tail users.csv
The wc (word count) command with the -l flag counts the number of lines in a file. This gives you a quick understanding of the dataset size.
$ wc -l users.csv
Output:
11 users.csv
You can view the file structure using:
$ column -t -s, users.csv | less -S
This command is particularly useful for CSV files. Let me break it down:
- column formats the output into columns
- -t tells it to create a table
- -s, specifies the comma as the delimiter
- less -S allows scrolling through the output without line wrapping
This gives you a nicely formatted view of your data, making it much easier to spot inconsistencies.
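If you just want a quick look at the structure without opening a pager, you can pipe the formatted output through head instead:
$ column -t -s, users.csv | head -n 5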
Finding and Analyzing Issues in the Data
Now that we have a basic understanding of our data, let’s identify specific issues.
Let’s start with missing values:
$ grep -c ",," users.csv
This command uses grep to search for consecutive commas (which indicate an empty field) and the -c flag to count the occurrences. This gives us a quick way to count rows with missing values.
Adding the -n flag to grep shows the line numbers along with the matches, helping us pinpoint exactly where the missing values are.
$ grep -n ",," users.csv
Output:
5:4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
6:5,,Brown,mike.brown@example.com,2023-01-19,2023-03-24,150.75
The following command identifies rows with invalid date formats, specifically looking for the text “invalid_date” in our sample data.
$ grep -v -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' users.csv | grep "invalid_date"
Output:
6,Sarah,Miller,sarah.miller@example.com,invalid_date,2023-03-25,95.00
Let’s find records with negative purchase amounts:
$ awk -F, '$7 < 0 {print $0}' users.csv
This awk command:
- Uses -F, to specify the comma as the field separator
- Checks if the 7th field (purchase_amount) is less than 0
- Prints the entire line if the condition is true
Output:
8,Lisa,Garcia,lisa.garcia@example.com,2023-01-22,2023-03-27,-50.00
Fixing the Issues with Data Cleaning Techniques
Now that we’ve identified the issues, let’s fix them one by one.
Handling Missing Values
First, we replace empty fields with “NULL”.
$ sed 's/,,/,NULL,/g; s/,$/,NULL/g' users.csv > users_cleaned.csv
This sed command:
- Replaces consecutive commas (,,) with ,NULL, to make missing values explicit
- Handles trailing commas at the end of lines with s/,$/,NULL/g
- Writes the result to a new file called users_cleaned.csv
The > operator redirects the output to a new file instead of the screen.
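If you want to sanity-check the substitution, a quick grep on the new file counts how many lines now contain the NULL placeholder:
$ grep -c "NULL" users_cleaned.csv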
We can fill missing first names with “Unknown”:
$ awk -F, 'BEGIN {OFS=","} {if ($2 == "") $2 = "Unknown"; print}' users.csv > users_cleaned.csv
This awk command:
- Sets the input field separator to comma with -F,
- Uses BEGIN {OFS=","} to set the output field separator to a comma as well
- Checks if the second field (first_name) is empty
- Replaces empty values with “Unknown”
- Prints each line (modified or not)
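To double-check the result, you can print only the rows where the first name was filled in; this should show the previously empty entries:
$ awk -F, '$2 == "Unknown"' users_cleaned.csv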
Correcting Date Formats
This sed command replaces the text “invalid_date” with a properly formatted date.
$ sed 's/invalid_date/2023-01-20/g' users.csv > users_cleaned.csv
In practice, you might want to use a date that makes sense in your context or mark these entries for further review.
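If you’d rather flag these rows than overwrite them, one option is to write the offending lines, with their line numbers, to a separate file for review (the file name dates_to_review.txt here is just an example):
$ grep -n "invalid_date" users.csv > dates_to_review.txt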
Handling Negative Values
You can replace negative values with zero.
$ awk -F, 'BEGIN {OFS=","} {if ($7 < 0) $7 = 0; print}' users.csv > users_cleaned.csv
This command:
- Checks if the seventh field (purchase_amount) is less than 0
- Sets negative values to 0
- Preserves the original value if it’s not negative
Combining Multiple Data Cleaning Steps
In practice, you’ll often need to apply multiple fixes to your data. Here’s how to combine them into a single command:
$ awk -F, 'BEGIN {OFS=","} {
    # Fix missing first names
    if ($2 == "") $2 = "Unknown";
    # Fix invalid dates
    if ($5 == "invalid_date" || $5 == "") $5 = "2023-01-20";
    if ($6 == "") $6 = "2023-03-23";
    # Fix negative values
    if ($7 < 0) $7 = 0;
    print
}' users.csv > users_cleaned.csv
This command works like so:
- Processes each field according to its specific validation rules
- Handles multiple issues in a single pass through the data
This approach is more efficient than running separate commands for each issue, especially for large datasets.
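You can also chain tools in a single pipeline if that reads more naturally to you. As a rough sketch that skips the empty-date checks from the command above, sed could handle the date fix while awk takes care of the missing names and negative amounts:
$ sed 's/invalid_date/2023-01-20/g' users.csv | awk -F, 'BEGIN {OFS=","} {if ($2 == "") $2 = "Unknown"; if ($7 < 0) $7 = 0; print}' > users_cleaned.csv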
Validating the Data Cleaning Steps
After cleaning your data, it’s important to verify that all issues have been resolved.
Check that there are no more empty fields:
$ grep -c ",," users_cleaned.csv
This should return 0 if all empty fields have been properly filled.
Next, ensure that all the dates are in the correct format:
$ grep -v -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' users_cleaned.csv | grep -v "signup_date"
This should return no results if all dates are in the correct format.
Check that all purchase amounts are non-negative:
$ awk -F, '$7 < 0 {print $0}' users_cleaned.csv
This should return no results if all purchase amounts are non-negative.
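If you run these checks often, you can wrap the counts into a short summary; here’s a minimal sketch:
$ echo "Rows with empty fields: $(grep -c ',,' users_cleaned.csv)"
$ echo "Rows with negative amounts: $(awk -F, '$7 < 0' users_cleaned.csv | wc -l)"
Both counts should be 0 on a fully cleaned file.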
Transforming the Data
Once your data is clean, you might need to transform certain fields into different formats or extract specific info for analysis.
Extracting Specific Columns
Let’s extract just the names and emails (contact info):
$ cut -d, -f2,3,4 users_cleaned.csv > users_contact_info.csv
The cut command:
- Uses -d, to specify the comma as the delimiter
- Uses -f2,3,4 to extract the 2nd, 3rd, and 4th fields (first_name, last_name, email)
- Creates a new file with just these columns
This is useful when you only need specific information for a particular analysis.
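Along the same lines, if you only need a single column, say a plain list of email addresses without the header row, you can combine cut with tail:
$ cut -d, -f4 users_cleaned.csv | tail -n +2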
Next, let’s extract users with purchase amounts over $100:
$ awk -F, '$7 > 100 {print $0}' users_cleaned.csv > users_high_value.csv
This awk command filters rows based on a condition, creating a subset of your data that meets specific criteria.
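One detail to keep in mind: the header row isn’t numeric, so depending on your awk implementation a plain $7 > 100 test may compare it as a string and let it slip through. To keep the header on purpose and force a numeric comparison, you could write:
$ awk -F, 'NR == 1 || $7+0 > 100' users_cleaned.csv > users_high_value.csv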
Sorting Data
Let’s sort the records by last name, skipping the header row:
$ (head -n 1 users_cleaned.csv && tail -n +2 users_cleaned.csv | sort -t, -k3) > users_sorted_by_name.csv
This command:
- Preserves the header row with head -n 1
- Takes all rows except the header with tail -n +2
- Sorts the data by the third field (last_name) using sort -t, -k3
- Combines the header and sorted data into a new file
Now let’s sort by purchase amount in descending order:
$ (head -n 1 users_cleaned.csv && tail -n +2 users_cleaned.csv | sort -t, -k7 -n -r) > users_sorted_by_purchase.csv
This command sorts the data by purchase amount in descending order:
- -t, specifies the comma as the delimiter
- -k7 sorts by the 7th field (purchase_amount)
- -n ensures numeric sorting (not alphabetical)
- -r sorts in reverse (descending) order
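Since the file is now sorted in descending order, pulling out the top spenders is just a matter of taking the first few rows, for example the header plus the top three:
$ head -n 4 users_sorted_by_purchase.csv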
Aggregating and Analyzing Data
Bash can also be used for basic data analysis. Let’s calculate the total purchase amount:
$ awk -F, 'NR>1 {sum += $7} END {print "Total purchases: $" sum}' users_cleaned.csv
This awk command:
- Skips the header row with NR>1
- Adds up all values in the 7th field (purchase_amount)
- Prints the total at the end
Total purchases: $1282.49
Now let’s calculate the average purchase amount:
$ awk -F, 'NR>1 {sum += $7; count++} END {print "Average purchase: $" sum/count}' users_cleaned.csv
This command calculates the average by:
- Summing all purchase amounts
- Counting the number of records
- Dividing the sum by the count at the end
Average purchase: $128.249
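If you’d like the average rounded to two decimal places, awk’s printf does the trick:
$ awk -F, 'NR>1 {sum += $7; count++} END {printf "Average purchase: $%.2f\n", sum/count}' users_cleaned.csv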
Next, let’s count users by signup month:
$ awk -F, 'NR>1 {
    split($5, date, "-");
    months[date[2]]++;
}
END {
    for (month in months)
        print "Month " month ": " months[month] " users"
}' users_cleaned.csv
This more complex awk command:
- Extracts the month from the signup date
- Uses an associative array (months) to count occurrences
- Prints a summary of users by month
Creating Bash Scripts for Data Cleaning
Often it’s convenient to create a reusable Bash script for data cleaning. Here’s how you can do it:
- Save the data cleaning commands in a file, say clean_user_data.sh (The script contains a couple of additional data cleaning steps)
- Make it executable with chmod +x clean_user_data.sh
- Run it with ./clean_user_data.sh
The script, clean_user_data.sh:
- Processes each cleaning step separately, but writes all changes to the same output file
- Includes progress messages to track execution
- Performs validation checks after cleaning
- Can easily be modified to add more cleaning steps or change existing ones
- Removes the temporary file at the end if it exists
Running the commands one at a time and running the script both accomplish the same goal, but the script provides better visibility into the cleaning process and makes troubleshooting easier.
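The full script has a few extra steps, but a minimal sketch of the same idea, cleaning into a temporary working file, validating, and then writing the final output, might look like this (the temporary file name is arbitrary):
#!/bin/bash
# clean_user_data.sh (sketch): clean users.csv into users_cleaned.csv

TMP=users_cleaned.tmp   # temporary working file (example name)

echo "Cleaning users.csv..."
awk -F, 'BEGIN {OFS=","} {
    if ($2 == "") $2 = "Unknown";                              # missing first names
    if ($5 == "invalid_date" || $5 == "") $5 = "2023-01-20";   # bad or missing signup dates
    if ($7 < 0) $7 = 0;                                        # negative purchase amounts
    print
}' users.csv > "$TMP"

echo "Validating cleaned data..."
echo "  Rows with empty fields:     $(grep -c ',,' "$TMP")"
echo "  Rows with negative amounts: $(awk -F, '$7 < 0' "$TMP" | wc -l)"

mv "$TMP" users_cleaned.csv
echo "Done: users_cleaned.csv"

# Remove the temporary file if an earlier run left one behind
rm -f "$TMP"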
Wrapping Up
I hope this tutorial has given you a solid foundation in using Bash for data cleaning.
Key takeaways:
- Bash is ideal for quick, efficient data cleaning of small to medium-sized datasets
- The combination of commands like grep, awk, sed, and sort provides a flexible toolkit
- You can use Bash scripts for data cleaning as well as for documenting your cleaning process
Remember, while Bash is powerful, it’s just one tool in your data toolbox. For very large datasets or more complex transformations, you might want to consider Python, R, or dedicated ETL tools. However, for many everyday data cleaning tasks, Bash can be surprisingly effective and efficient!
Happy cleaning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.