Image by Author | Ideogram
Today, we’ll look at data cleaning using Bash, an essential skill for anyone working with data. While many of us reach for Python or R when it comes to data work, you can use Bash for quick and efficient data manipulation—right from your command line.
In this article, I’ll not only show you the commands but also walk you through the thinking process behind each step, explaining why we’re using specific commands and how they work together to solve common data cleaning problems.
Let’s dive right in!
Understanding the Data Cleaning Process
Before we jump into the code, let’s understand what data cleaning actually entails. Data cleaning involves going from raw and messy data to data that is accurate, complete, and ready for analysis. This can include:
- Handling missing values
- Fixing formatting issues
- Removing duplicates
- Standardizing inconsistent data
- Validating data against business rules
Clean data is, therefore, necessary for accurate analysis and reliable results. The “garbage in, garbage out” principle applies strongly to data science (no surprises there!).
🔗 We’ll use a sample dataset, the users.csv file, in this tutorial.
Understanding Your Data
Before making any changes, it’s important to understand what you’re working with. Let’s explore some basic inspection commands.
To view the contents of the file, you can use the cat command.
$ cat users.csv
Output:
1,John,Smith,john.smith@example.com,2023-01-15,2023-03-20,125.99
2,Jane,Doe,jane.doe@example.com,2023-01-16,2023-03-21,210.50
3,Bob,Johnson,bob@example.com,2023-01-17,2023-03-22,0
4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
5,,Brown,mike.brown@example.com,2023-01-19,2023-03-24,150.75
6,Sarah,Miller,sarah.miller@example.com,invalid_date,2023-03-25,95.00
7,David,Jones,david.jones@example.com,2023-01-21,2023-03-26,300.00
8,Lisa,Garcia,lisa.garcia@example.com,2023-01-22,2023-03-27,-50.00
9,James,Martinez,mymail@example.com,2023-01-23,2023-03-28,125.00
For large files, this might not be practical, so we have alternatives.
The head command, by default, displays the first 10 lines of a file. This is useful for quickly checking the structure and headers of your data.
$ head users.csv
You can specify a different number of lines with the -n option.
$ head -n 5 users.csv
This gives:
id,first_name,last_name,email,signup_date,last_login,purchase_amount
1,John,Smith,john.smith@example.com,2023-01-15,2023-03-20,125.99
2,Jane,Doe,jane.doe@example.com,2023-01-16,2023-03-21,210.50
3,Bob,Johnson,bob@example.com,2023-01-17,2023-03-22,0
4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
Similarly, tail shows the last 10 lines by default. This is helpful for checking the most recent entries in log files or time-series data.
$ tail users.csv
The wc (word count) command with the -l flag counts the number of lines in a file. This gives you a quick understanding of the dataset size.
$ wc -l users.csv
Output:
11 users.csv
You can view the file structure using:
$ column -t -s, users.csv | less -S
This command is particularly useful for CSV files. Let me break it down:
- column formats the output into columns
- -t tells it to create a table
- -s, specifies the comma as the delimiter
- less -S allows scrolling through the output without line wrapping
This gives you a nicely formatted view of your data, making it much easier to spot inconsistencies.
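If you just want a quick look at the structure without opening a pager, you can pipe the formatted output through head instead:
$ column -t -s, users.csv | head -n 5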
Finding and Analyzing Issues in the Data
Now that we have a basic understanding of our data, let’s identify specific issues.
Let’s start with missing values:
$ grep -c ",," users.csv
This command uses grep to search for consecutive commas (which indicate an empty field) and the -c flag to count the occurrences. This gives us a quick way to count rows with missing values.
Adding the -n flag to grep shows the line numbers along with the matches, helping us pinpoint exactly where the missing values are.
$ grep -n ",," users.csv
Output:
5:4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
6:5,,Brown,mike.brown@example.com,2023-01-19,2023-03-24,150.75
The following command identifies rows with invalid date formats, specifically looking for the text “invalid_date” in our sample data.
$ grep -v -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' users.csv | grep "invalid_date"
Output:
6,Sarah,Miller,sarah.miller@example.com,invalid_date,2023-03-25,95.00
Let’s find records with negative purchase amounts:
$ awk -F, '$7 < 0 {print $0}' users.csv
This awk command:
- Uses -F, to specify the comma as the field separator
- Checks if the 7th field (purchase_amount) is less than 0
- Prints the entire line if the condition is true
Output:
8,Lisa,Garcia,lisa.garcia@example.com,2023-01-22,2023-03-27,-50.00
Fixing the Issues with Data Cleaning Techniques
Now that we’ve identified the issues, let’s fix them one by one.
Handling Missing Values
First, we replace empty fields with “NULL”.
$ sed 's/,,/,NULL,/g; s/,$/,NULL/g' users.csv > users_cleaned.csv
This sed command:
- Replaces consecutive commas (,,) with ,NULL, to make missing values explicit
- Handles trailing commas at the end of lines with s/,$/,NULL/g
- Writes the result to a new file called users_cleaned.csv
The > operator redirects the output to a new file instead of the screen.
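If you want to sanity-check the substitution, a quick grep on the new file counts how many lines now contain the NULL placeholder:
$ grep -c "NULL" users_cleaned.csv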
We can fill missing first names with “Unknown”:
$ awk -F, 'BEGIN {OFS=","} {if ($2 == "") $2 = "Unknown"; print}' users.csv > users_cleaned.csv
This awk command:
- Sets the input field separator to comma with -F,
- Uses BEGIN {OFS=","} to set the output field separator to a comma as well
- Checks if the second field (first_name) is empty
- Replaces empty values with “Unknown”
- Prints each line (modified or not)
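To double-check the result, you can print only the rows where the first name was filled in; this should show the previously empty entries:
$ awk -F, '$2 == "Unknown"' users_cleaned.csv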
Correcting Date Formats
This sed command replaces the text “invalid_date” with a properly formatted date.
$ sed 's/invalid_date/2023-01-20/g' users.csv > users_cleaned.csv
In practice, you might want to use a date that makes sense in your context or mark these entries for further review.
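If you’d rather flag these rows than overwrite them, one option is to write the offending lines, with their line numbers, to a separate file for review (the file name dates_to_review.txt here is just an example):
$ grep -n "invalid_date" users.csv > dates_to_review.txt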
Handling Negative Values
You can replace negative values with zero.
$ awk -F, 'BEGIN {OFS=","} {if ($7 < 0) $7 = 0; print}' users.csv > users_cleaned.csv
This command:
- Checks if the seventh field (purchase_amount) is less than 0
- Sets negative values to 0
- Preserves the original value if it’s not negative
Combining Multiple Data Cleaning Steps
In practice, you’ll often need to apply multiple fixes to your data. Here’s how to combine them into a single command:
$ awk -F, 'BEGIN {OFS=","} {
    # Fix missing first names
    if ($2 == "") $2 = "Unknown";
    # Fix invalid dates
    if ($5 == "invalid_date" || $5 == "") $5 = "2023-01-20";
    if ($6 == "") $6 = "2023-03-23";
    # Fix negative values
    if ($7 < 0) $7 = 0;
    print
}' users.csv > users_cleaned.csv
This command works like so:
- Processes each field according to its specific validation rules
- Handles multiple issues in a single pass through the data
This approach is more efficient than running separate commands for each issue, especially for large datasets.
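You can also chain tools in a single pipeline if that reads more naturally to you. As a rough sketch that skips the empty-date checks from the command above, sed could handle the date fix while awk takes care of the missing names and negative amounts:
$ sed 's/invalid_date/2023-01-20/g' users.csv | awk -F, 'BEGIN {OFS=","} {if ($2 == "") $2 = "Unknown"; if ($7 < 0) $7 = 0; print}' > users_cleaned.csv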
Validating the Data Cleaning Steps
After cleaning your data, it’s important to verify that all issues have been resolved.
Check that there are no more empty fields:
$ grep -c ",," users_cleaned.csv
This should return 0 if all empty fields have been properly filled.
Next, ensure that all the dates are in the correct format:
$ grep -v -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' users_cleaned.csv | grep -v "signup_date"
This should return no results if all dates are in the correct format.
Check that all purchase amounts are non-negative:
$ awk -F, '$7 < 0 {print $0}' users_cleaned.csv
This should return no results if all purchase amounts are non-negative.
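If you run these checks often, you can wrap the counts into a short summary; here’s a minimal sketch:
$ echo "Rows with empty fields: $(grep -c ',,' users_cleaned.csv)"
$ echo "Rows with negative amounts: $(awk -F, '$7 < 0' users_cleaned.csv | wc -l)"
Both counts should be 0 on a fully cleaned file.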
Transforming the Data
Once your data is clean, you might need to transform certain fields into different formats or extract specific info for analysis.
Extracting Specific Columns
Let’s extract just the names and emails (contact info):
$ cut -d, -f2,3,4 users_cleaned.csv > users_contact_info.csv
The cut command:
- Uses -d, to specify the comma as the delimiter
- Uses -f2,3,4 to extract the 2nd, 3rd, and 4th fields (first_name, last_name, email)
- Creates a new file with just these columns
This is useful when you only need specific information for a particular analysis.
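Along the same lines, if you only need a single column, say a plain list of email addresses without the header row, you can combine cut with tail:
$ cut -d, -f4 users_cleaned.csv | tail -n +2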
Next, let’s extract users with purchase amounts over $100:
$ awk -F, '$7 > 100 {print $0}' users_cleaned.csv > users_high_value.csv
This awk command filters rows based on a condition, creating a subset of your data that meets specific criteria.
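One detail to keep in mind: the header row isn’t numeric, so depending on your awk implementation a plain $7 > 100 test may compare it as a string and let it slip through. To keep the header on purpose and force a numeric comparison, you could write:
$ awk -F, 'NR == 1 || $7+0 > 100' users_cleaned.csv > users_high_value.csv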
Sorting Data
Let’s sort the records by last name, skipping the header row:
$ (head -n 1 users_cleaned.csv && tail -n +2 users_cleaned.csv | sort -t, -k3) > users_sorted_by_name.csv
This command:
- Preserves the header row with head -n 1
- Takes all rows except the header with tail -n +2
- Sorts the data by the third field (last_name) using sort -t, -k3
- Combines the header and sorted data into a new file
Now let’s sort by purchase amount in descending order:
$ (head -n 1 users_cleaned.csv && tail -n +2 users_cleaned.csv | sort -t, -k7 -n -r) > users_sorted_by_purchase.csv
This command sorts the data by purchase amount in descending order:
- -t, specifies the comma as the delimiter
- -k7 sorts by the 7th field (purchase_amount)
- -n ensures numeric sorting (not alphabetical)
- -r sorts in reverse (descending) order
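Since the file is now sorted in descending order, pulling out the top spenders is just a matter of taking the first few rows, for example the header plus the top three:
$ head -n 4 users_sorted_by_purchase.csv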
Aggregating and Analyzing Data
Bash can also be used for basic data analysis. Let’s calculate the total purchase amount:
$ awk -F, 'NR>1 {sum += $7} END {print "Total purchases: $" sum}' users_cleaned.csv
This awk command:
- Skips the header row with NR>1
- Adds up all values in the 7th field (purchase_amount)
- Prints the total at the end
Total purchases: $1282.49
Now let’s calculate the average purchase amount:
$ awk -F, 'NR>1 {sum += $7; count++} END {print "Average purchase: $" sum/count}' users_cleaned.csv
This command calculates the average by:
- Summing all purchase amounts
- Counting the number of records
- Dividing the sum by the count at the end
Average purchase: $128.249
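If you’d like the average rounded to two decimal places, awk’s printf does the trick:
$ awk -F, 'NR>1 {sum += $7; count++} END {printf "Average purchase: $%.2f\n", sum/count}' users_cleaned.csv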
Next, let’s count users by signup month:
$ awk -F, 'NR>1 {
    split($5, date, "-");
    months[date[2]]++;
}
END {
    for (month in months)
        print "Month " month ": " months[month] " users"
}' users_cleaned.csv
This more complex awk command:
- Extracts the month from the signup date
- Uses an associative array (months) to count occurrences
- Prints a summary of users by month
Creating Bash Scripts for Data Cleaning
Often it’s convenient to create a reusable Bash script for data cleaning. Here’s how you can do it:
- Save the data cleaning commands in a file, say clean_user_data.sh (The script contains a couple of additional data cleaning steps)
- Make it executable with chmod +x clean_user_data.sh
- Run it with ./clean_user_data.sh
The script, clean_user_data.sh:
- Processes each cleaning step separately, but writes all changes to the same output file
- Includes progress messages to track execution
- Performs validation checks after cleaning
- Can easily be modified to add more cleaning steps or change existing ones
- Removes the temporary file at the end if it exists
Running the commands one at a time and running the script both accomplish the same goal, but the script provides better visibility into the cleaning process and makes troubleshooting easier.
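The full script has a few extra steps, but a minimal sketch of the same idea, cleaning into a temporary working file, validating, and then writing the final output, might look like this (the temporary file name is arbitrary):
#!/bin/bash
# clean_user_data.sh (sketch): clean users.csv into users_cleaned.csv

TMP=users_cleaned.tmp   # temporary working file (example name)

echo "Cleaning users.csv..."
awk -F, 'BEGIN {OFS=","} {
    if ($2 == "") $2 = "Unknown";                              # missing first names
    if ($5 == "invalid_date" || $5 == "") $5 = "2023-01-20";   # bad or missing signup dates
    if ($7 < 0) $7 = 0;                                        # negative purchase amounts
    print
}' users.csv > "$TMP"

echo "Validating cleaned data..."
echo "  Rows with empty fields:     $(grep -c ',,' "$TMP")"
echo "  Rows with negative amounts: $(awk -F, '$7 < 0' "$TMP" | wc -l)"

mv "$TMP" users_cleaned.csv
echo "Done: users_cleaned.csv"

# Remove the temporary file if an earlier run left one behind
rm -f "$TMP"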
Wrapping Up
I hope this tutorial has given you a solid foundation in using Bash for data cleaning.
Key takeaways:
- Bash is ideal for quick, efficient data cleaning of small to medium-sized datasets
- The combination of commands like grep, awk, sed, and sort provides a flexible toolkit
- You can use Bash scripts for data cleaning as well as for documenting your cleaning process
Remember, while Bash is powerful, it’s just one tool in your data toolbox. For very large datasets or more complex transformations, you might want to consider Python, R, or dedicated ETL tools. However, for many everyday data cleaning tasks, Bash can be surprisingly effective and efficient!
Happy cleaning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.