Data Cleaning with Bash: A Handbook for Developers




 

Today, we’ll look at data cleaning using Bash, an essential skill for anyone working with data. While many of us reach for Python or R when it comes to data work, you can use Bash for quick and efficient data manipulation—right from your command line.

In this article, I’ll not only show you the commands but also walk you through the thinking process behind each step, explaining why we’re using specific commands and how they work together to solve common data cleaning problems.

Let’s dive right in!

 

Understanding the Data Cleaning Process

 

Before we jump into the code, let’s understand what data cleaning actually entails. Data cleaning involves going from raw and messy data to data that is accurate, complete, and ready for analysis. This can include:

  • Handling missing values
  • Fixing formatting issues
  • Removing duplicates
  • Standardizing inconsistent data
  • Validating data against business rules

Clean data is, therefore, necessary for accurate analysis and reliable results. The “garbage in, garbage out” principle applies strongly to data science (no surprises there!).

🔗 We’ll use a sample dataset, the users.csv file, in this tutorial.

 

Understanding Your Data

 

Before making any changes, it’s important to understand what you’re working with. Let’s explore some basic inspection commands.

To view the contents of the file, you can use the cat command.
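
$ cat users.csv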

 

Output:

1,John,Smith,john.smith@example.com,2023-01-15,2023-03-20,125.99
2,Jane,Doe,jane.doe@example.com,2023-01-16,2023-03-21,210.50
3,Bob,Johnson,bob@example.com,2023-01-17,2023-03-22,0
4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
5,,Brown,mike.brown@example.com,2023-01-19,2023-03-24,150.75
6,Sarah,Miller,sarah.miller@example.com,invalid_date,2023-03-25,95.00
7,David,Jones,david.jones@example.com,2023-01-21,2023-03-26,300.00
8,Lisa,Garcia,lisa.garcia@example.com,2023-01-22,2023-03-27,-50.00
9,James,Martinez,mymail@example.com,2023-01-23,2023-03-28,125.00

 

For large files, this might not be practical, so we have alternatives.

The head command, by default, displays the first 10 lines of a file. This is useful for quickly checking the structure and headers of your data.
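
$ head users.csv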

 

You can specify a different number of lines with the -n option.
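
$ head -n 5 users.csv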

 

This gives:

id,first_name,last_name,email,signup_date,last_login,purchase_amount
1,John,Smith,john.smith@example.com,2023-01-15,2023-03-20,125.99
2,Jane,Doe,jane.doe@example.com,2023-01-16,2023-03-21,210.50
3,Bob,Johnson,bob@example.com,2023-01-17,2023-03-22,0
4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25

 

Similarly, tail shows the last 10 lines by default. This is helpful for checking the most recent entries in log files or time-series data.
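
$ tail users.csv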

 

The wc (word count) command with the -l flag counts the number of lines in a file. This gives you a quick understanding of the dataset size.
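
$ wc -l users.csv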

 

The output reports the total number of lines in the file, including the header row.

 

You can view the file structure using:

$ column -t -s, users.csv | less -S

 

This command is particularly useful for CSV files. Let me break it down:

  • column formats the output into columns
  • -t tells it to create a table
  • -s, specifies the comma as the delimiter
  • less -S allows scrolling through the output without line wrapping

This gives you a nicely formatted view of your data, making it much easier to spot inconsistencies.

 

Finding and Analyzing Issues in the Data

 

Now that we have a basic understanding of our data, let’s identify specific issues.
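
First, let’s count the lines that contain empty fields:

$ grep -c ",," users.csv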

This command uses grep to search for consecutive commas (which indicate an empty field) and -c to count the occurrences.

 
This gives us a quick way to count rows with missing values.

 

Adding the -n flag to grep shows the line numbers along with the matches, helping us pinpoint exactly where the missing values are.
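
$ grep -n ",," users.csv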

 

Output:

5:4,Alice,Williams,alice.williams@example.com,2023-01-18,,75.25
6:5,,Brown,mike.brown@example.com,2023-01-19,2023-03-24,150.75

 

The following command identifies rows with invalid date formats, specifically looking for the text “invalid_date” in our sample data.

$ grep -v -E '^[0-9]{4}-[0-9]{2}-[0-9]{2}$' users.csv | grep "invalid_date"

 

Output:

6,Sarah,Miller,sarah.miller@example.com,invalid_date,2023-03-25,95.00

 

Let’s find records with negative purchase amounts:

$ awk -F, '$7 < 0 {print $0}' users.csv

 

This awk command:

  • Uses -F, to specify the comma as the field separator
  • Checks if the 7th field (purchase_amount) is less than 0
  • Prints the entire line if the condition is true

Output:

8,Lisa,Garcia,lisa.garcia@example.com,2023-01-22,2023-03-27,-50.00

 

Fixing the Issues with Data Cleaning Techniques

 

Now that we’ve identified the issues, let’s fix them one by one.

Handling Missing Values

First, we replace empty fields with “NULL”.

$ sed 's/,,/,NULL,/g; s/,$/,NULL/g' users.csv > users_cleaned.csv

 

This sed command:

  • Replaces consecutive commas (,,) with ,NULL, to make missing values explicit
  • Handles trailing commas at the end of lines with s/,$/,NULL/g
  • Writes the result to a new file called users_cleaned.csv

The > operator redirects the output to a new file instead of the screen.

We can fill missing first names with “Unknown”:

$ awk -F, 'BEGIN {OFS=","} {if ($2 == "") $2 = "Unknown"; print}' users.csv > users_cleaned.csv

 

This awk command:

  • Sets the input field separator to comma with -F,
  • Uses BEGIN {OFS=","} to set the output field separator to comma as well
  • Checks if the second field (first_name) is empty
  • Replaces empty values with “Unknown”
  • Prints each line (modified or not)

Correcting Date Formats

This sed command replaces the text “invalid_date” with a properly formatted date.

$ sed 's/invalid_date/2023-01-20/g' users.csv > users_cleaned.csv

 

In practice, you might want to use a date that makes sense in your context or mark these entries for further review.
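
If you’d rather flag these rows for review than silently substitute a date, a small variation works too (the NEEDS_REVIEW marker and the users_flagged.csv file name below are just placeholders):

$ sed 's/invalid_date/NEEDS_REVIEW/g' users.csv > users_flagged.csv

You can then grep for the marker later to inspect those records by hand.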

Handling Negative Values

You can replace negative values with zero.

$ awk -F, 'BEGIN {OFS=","} {if ($7 < 0) $7 = 0; print}' users.csv > users_cleaned.csv

 
This command:

  • Checks if the seventh field (purchase_amount) is less than 0
  • Sets negative values to 0
  • Preserves the original value if it’s not negative

 

Combining Multiple Data Cleaning Steps

 

In practice, you’ll often need to apply multiple fixes to your data. Here’s how to combine them into a single command:

$ awk -F, 'BEGIN {OFS=","} {
    # Fix missing first names
    if ($2 == "") $2 = "Unknown";

    # Fix invalid dates
    if ($5 == "invalid_date" || $5 == "") $5 = "2023-01-20";
    if ($6 == "") $6 = "2023-03-23";

    # Fix negative values
    if ($7 < 0) $7 = 0;

    print
}' users.csv > users_cleaned.csv

 

This command works like so:

  • Processes each field according to its specific validation rules
  • Handles multiple issues in a single pass through the data

This approach is more efficient than running separate commands for each issue, especially for large datasets.

 

Validating the Data Cleaning Steps

 

After cleaning your data, it’s important to verify that all issues have been resolved.
Check that there are no more empty fields:

$ grep -c ",," users_cleaned.csv

 

This should return 0 if all empty fields have been properly filled.

Next, ensure that all the dates are in the correct format:

$ grep -v -E '[0-9]{4}-[0-9]{2}-[0-9]{2}' users_cleaned.csv | grep -v "signup_date"

 

This should return no results if all dates are in the correct format.

Check that all purchase amounts are non-negative:

$ awk -F, '$7 < 0 {print $0}' users_cleaned.csv

 

This should return no results if all purchase amounts are non-negative.

 

Transforming the Data

 

Once your data is clean, you might need to transform certain fields into different formats or extract specific info for analysis.

Extracting Specific Columns

Let’s extract just the names and emails (contact info):

$ cut -d, -f2,3,4 users_cleaned.csv > users_contact_info.csv

 

The cut command:

  • Uses -d, to specify the comma as the delimiter
  • Uses -f2,3,4 to extract the 2nd, 3rd, and 4th fields (first_name, last_name, email)
  • Creates a new file with just these columns

This is useful when you only need specific information for a particular analysis.

Next, let’s extract users with purchase amounts over $100:

$ awk -F, 'NR > 1 && $7 > 100 {print $0}' users_cleaned.csv > users_high_value.csv

 

This awk command skips the header row with NR > 1 (otherwise the header would be compared as a string and slip through) and keeps only the rows that meet the condition, creating a subset of your data that matches specific criteria.

Sorting Data

Let’s sort the records by last name, skipping the header row.

$ (head -n 1 users_cleaned.csv && tail -n +2 users_cleaned.csv | sort -t, -k3) > users_sorted_by_name.csv

 

This command:

  • Preserves the header row with head -n 1
  • Takes all rows except the header with tail -n +2
  • Sorts the data by the third field (last_name) using sort -t, -k3
  • Combines the header and sorted data into a new file

Next, let’s sort by purchase amount in descending order:

$ (head -n 1 users_cleaned.csv && tail -n +2 users_cleaned.csv | sort -t, -k7 -n -r) > users_sorted_by_purchase.csv

 

This command sorts the data by purchase amount in descending order:

  • -t, specifies the comma as the delimiter
  • -k7 sorts by the 7th field (purchase_amount)
  • -n ensures numeric sorting (not alphabetical)
  • -r sorts in reverse (descending) order

 

Aggregating and Analyzing Data

 

Bash can also be used for basic data analysis. Let’s calculate the total purchase amount:

$ awk -F, 'NR>1 {sum += $7} END {print "Total purchases: $" sum}' users_cleaned.csv

 

This awk command:

  • Skips the header row with NR>1
  • Adds up all values in the 7th field (purchase_amount)
  • Prints the total at the end

Output:

Total purchases: $1282.49

 

Now let’s calculate average purchase amount:

$ awk -F, 'NR>1 {sum += $7; count++} END {print "Average purchase: $" sum/count}' users_cleaned.csv

 

This command calculates the average by:

  • Summing all purchase amounts
  • Counting the number of records
  • Dividing the sum by the count at the end

Output:

Average purchase: $128.249

 

Next, let’s count users by signup month:

$ awk -F, 'NR>1 {
    split($5, date, "-");
    months[date[2]]++;
}
END {
    for (month in months)
        print "Month " month ": " months[month] " users"
}' users_cleaned.csv

 

This more complex awk command:

  • Extracts the month from the signup date
  • Uses an associative array (months) to count occurrences
  • Prints a summary of users by month

 

Creating Bash Scripts for Data Cleaning

 

Often it’s convenient to create a reusable Bash script for data cleaning. Here’s how you can do it:

  1. Save the data cleaning commands in a file, say clean_user_data.sh (The script contains a couple of additional data cleaning steps)
  2. Make it executable with chmod +x clean_user_data.sh
  3. Run it with ./clean_user_data.sh

The script, clean_user_data.sh:

  • Processes each cleaning step separately, but writes all changes to the same output file
  • Includes progress messages to track execution
  • Performs validation checks after cleaning
  • Can easily be modified to add more cleaning steps or change existing ones
  • Removes the temporary file at the end if it exists

Running the commands one at a time and running the script both accomplish the same goal, but the script provides better visibility into the cleaning process and makes troubleshooting easier.
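
The full clean_user_data.sh isn’t reproduced here, but a minimal sketch along these lines gives you the idea (the file names, messages, and exact set of steps are illustrative assumptions, not the original script):

#!/usr/bin/env bash
# clean_user_data.sh -- minimal sketch; the real script may include more cleaning steps

INPUT="users.csv"          # assumed input file
OUTPUT="users_cleaned.csv" # cleaned output
TEMP="users_temp.csv"      # temporary working file

echo "Cleaning $INPUT..."

# Apply all cleaning rules in a single awk pass and write to a temp file
awk -F, 'BEGIN {OFS=","} {
    if ($2 == "") $2 = "Unknown";                              # missing first names
    if ($5 == "invalid_date" || $5 == "") $5 = "2023-01-20";   # bad signup dates
    if ($6 == "") $6 = "2023-03-23";                           # missing last logins
    if ($7 < 0) $7 = 0;                                        # negative purchase amounts
    print
}' "$INPUT" > "$TEMP"

mv "$TEMP" "$OUTPUT"

echo "Validating $OUTPUT..."
echo "Rows with empty fields: $(grep -c ',,' "$OUTPUT")"
awk -F, 'NR > 1 && $7 < 0 {count++} END {print "Rows with negative amounts: " count+0}' "$OUTPUT"

# Remove the temporary file if an earlier failed run left one behind
rm -f "$TEMP"

echo "Done."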

 

Wrapping Up

 

I hope this tutorial has given you a solid foundation in using Bash for data cleaning.

Key takeaways:

  • Bash is ideal for quick, efficient data cleaning of small to medium-sized datasets
  • The combination of commands like grep, awk, sed, and sort provides a flexible toolkit
  • You can use Bash scripts both to automate data cleaning and to document your cleaning process

Remember, while Bash is powerful, it’s just one tool in your data toolbox. For very large datasets or more complex transformations, you might want to consider Python, R, or dedicated ETL tools. However, for many everyday data cleaning tasks, Bash can be surprisingly effective and efficient!

Happy cleaning!
 
 

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


