Implementing Data Quality Assurance in Data Science Pipelines with Great Expectations



In data science, data quality is critical. Accurate models and useful analysis depend on good data. Problems like missing values and outliers can harm your model’s performance and make insights less reliable.

In this article, we will learn how to use Great Expectations (GX) by validating an example dataset. The example walks through setting expectations, running them against the data, and handling failures. GX helps you find data problems early, which improves data quality throughout your pipeline.

 

What is Great Expectations?

 
Great Expectations is a free, open-source tool for data quality checks. Data teams use it to make sure data is accurate and reliable. GX lets you create “expectations” for your data. These expectations are rules that describe what the data should look like. For example, you can set rules to check for missing values or to ensure numbers stay within a certain range.

GX ships with many built-in checks that cover common data issues, like data types, unique values, and value ranges. It also supports custom checks for more complex rules. Great Expectations works well in data pipelines, which means it can check data quality automatically as data moves through the pipeline. Data teams can run these checks regularly to spot issues early.

 

Data Quality Dimensions

 
Data quality assurance looks at several key dimensions of data quality (a short code sketch mapping them to GX expectations follows this list):

  • Completeness: Checks if any data is missing. All necessary values should be present.
  • Uniqueness: Ensures no duplicate values. Each record should appear only once.
  • Consistency: Confirms data is the same across sources. The data should match in all parts of the system.
  • Validity: Checks if data follows specific rules. For example, dates should be in a certain format.
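
As a rough sketch, here is how three of these dimensions map to GX’s built-in expectations. The column names (EmployeeID, hire_date) are hypothetical placeholders, and consistency is omitted because it usually means comparing data across systems rather than applying a single rule:

import great_expectations as gx

# Completeness: required values should be present
completeness = gx.expectations.ExpectColumnValuesToNotBeNull(column="EmployeeID")

# Uniqueness: each record should appear only once
uniqueness = gx.expectations.ExpectColumnValuesToBeUnique(column="EmployeeID")

# Validity: values should follow a specific format, e.g. ISO-style dates
validity = gx.expectations.ExpectColumnValuesToMatchStrftimeFormat(
    column="hire_date", strftime_format="%Y-%m-%d"
)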

 

Setting Up the Environment

 
First, install Great Expectations and import the necessary libraries.

pip install great_expectations

 

After installation, create a Data Context. It is the main entry point for Great Expectations and manages your project’s configuration.

# Import required modules from Great Expectations
import great_expectations as gx
import pandas as pd

# Initialize the Data Context
context = gx.get_context()

 

In this example, let’s assume you have an employee dataset in CSV format. We’ll load it into a Pandas DataFrame.

# Import employee data into a Pandas DataFrame
df = pd.read_csv("employee_data.csv")
df.head()

 
[Output: preview of the first rows of the employee dataset]

 

Connecting to Data

 
Now we need to connect the DataFrame to Great Expectations. First, we create a data source. Next, we create a data asset. Then, we define a batch for the data. These components structure how Great Expectations interacts with our data.

# Create Data Source, Data Asset, Batch Definition, and Batch
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")

batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

 

Defining Expectations

 
Once the data is connected, we’ll define expectations for specific data columns. These expectations are like rules that the data should adhere to. For the employee dataset, let’s define the following expectations:

  1. EmployeeID: The employee ID should not be null.
  2. Age: Each employee’s age should be between 18 and 65.
  3. Department: Department names should match a predefined set.

# EmployeeID Expectation: Should not be null
expectation_employee_id_not_null = gx.expectations.ExpectColumnValuesToNotBeNull(
    column="EmployeeID"
)

# Age Expectation: Age should be between 18 and 65
expectation_age = gx.expectations.ExpectColumnValuesToBeBetween(
    column="age", min_value=18, max_value=65
)

# Department Expectation: Department should be one of the predefined values
expectation_department = gx.expectations.ExpectColumnValuesToBeInSet(
    column="department", value_set=["Human Resources", "Engineering", "Marketing", "Finance"]
)
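
Alternatively, the three expectations can be grouped into an Expectation Suite and validated in one call. This is a minimal sketch of that approach; the suite name is our own choice:

# Group the expectations into a reusable Expectation Suite
suite = context.suites.add(gx.ExpectationSuite(name="employee_quality_suite"))
suite.add_expectation(expectation_employee_id_not_null)
suite.add_expectation(expectation_age)
suite.add_expectation(expectation_department)

# Validate the whole suite against the batch at once
suite_results = batch.validate(suite)

In the next section, we validate each expectation individually instead, which keeps the output easier to read for a small example.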

 

 

Running Validations

 
With expectations in place, we can now validate the data. Validation checks each record against the rules and reports whether the data passes or fails them.

# Validate Batch using Expectations for all columns
validation_result_employee_id_not_null = batch.validate(expectation_employee_id_not_null)
validation_result_age = batch.validate(expectation_age)
validation_result_department = batch.validate(expectation_department)

# Print validation results for each expectation
print("EmployeeID Not Null Validation Result:", validation_result_employee_id_not_null)
print("Age Validation Result:", validation_result_age)
print("Department Validation Result:", validation_result_department)

 
[Output: validation results for the EmployeeID, Age, and Department expectations]
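
Each call to batch.validate() returns a result object that can also be inspected programmatically. A minimal sketch using the success flag and the summary statistics in the result field:

# Check whether the age expectation passed
print(validation_result_age.success)  # True if every value passed

# The result field holds summary statistics for the check
stats = validation_result_age.result
print(stats.get("element_count"))     # number of values checked
print(stats.get("unexpected_count"))  # number of failing values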
 

Handling Data Quality Failures

 
If the data doesn’t meet the expectations, you need to decide what to do (a sketch combining these strategies follows the list). Possible actions include:

  • Alerting: Send notifications to relevant stakeholders about data quality issues.
  • Data Rejection: Reject the current batch of data and reprocess it.
  • Log and Continue: Log the failure for future analysis but continue processing other valid data.
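
Here is a minimal sketch that combines these strategies, reusing the validation results from the previous section. The stakeholder notification hook is a hypothetical placeholder:

import logging

logger = logging.getLogger("data_quality")

# Collect the validation results by check name
results = {
    "EmployeeID not null": validation_result_employee_id_not_null,
    "Age in range": validation_result_age,
    "Department in set": validation_result_department,
}
failed = [name for name, result in results.items() if not result.success]

if failed:
    # Log and continue: record the failures for later analysis
    logger.warning("Data quality checks failed: %s", ", ".join(failed))

    # Alerting (hypothetical hook): plug in email or Slack notifications here
    # notify_stakeholders(failed)

    # Data rejection: stop the pipeline instead of processing bad data
    raise ValueError(f"Rejecting batch; failed checks: {failed}")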

 

Benefits of Using Great Expectations

 
Great Expectations provides several benefits for data science and engineering teams:

  • Automated Data Quality Checks: It validates data automatically before it is used, so only good data reaches analysis or modeling.
  • Standardized Expectations: You can reuse the same rules across projects, which keeps data quality consistent.
  • Detailed Reporting: It produces reports that track data quality over time, which helps you find and fix problems.
  • Flexibility: Great Expectations works with many data sources and formats, and it integrates with tools like Pandas, SQL, and Spark.

 

Conclusion

 
Data quality is essential in data science. Great Expectations automates data checks and fits naturally into data pipelines, so you can validate data continuously and keep it accurate, consistent, and reliable. That reliability is the basis of good decision-making. Whether you need simple rules like non-null values or more complex custom checks, Great Expectations is a flexible and powerful tool for your data quality needs.
 
 

Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master’s degree in Computer Science from the University of Liverpool.
