In data science, accurate models and useful analysis depend on good data. Problems such as missing values and outliers can degrade your model’s performance and make insights less reliable.
In this article, we will learn how to use Great Expectations (GE) by walking through an example that validates a dataset: we set expectations and then check the data against them. Catching data problems early in this way improves data quality throughout your pipeline.
What is Great Expectations?
Great Expectations is a tool for data quality checks. It is free and open-source. Data teams use it to make sure data is accurate and reliable. GE lets you create “expectations” for your data. These expectations are rules that describe what the data should look like. For example, you can set rules to check for missing values or to ensure numbers stay within a certain range.
GE has many built-in checks. These checks cover common data issues, like data types, unique values, and value ranges. It also supports custom checks for more complex rules. Great Expectations works well in data pipelines. This means it can check data quality automatically as data moves through the pipeline. Data teams can run these checks regularly to spot issues early.
Data Quality Dimensions
Data quality is usually assessed along several dimensions. Common ones include:
- Completeness: Checks if any data is missing. All necessary values should be present.
- Uniqueness: Ensures no duplicate values. Each record should appear only once.
- Consistency: Confirms data is the same across sources. The data should match in all parts of the system.
- Validity: Checks if data follows specific rules. For example, dates should be in a certain format.
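To make these dimensions concrete, here is a minimal plain-Python sketch of three of them (completeness, uniqueness, validity). It does not use Great Expectations; the sample records, field names, and helper functions are hypothetical and exist only for illustration. Consistency is left out because checking it requires a second data source to compare against.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "email": "a@example.com", "hired": "2021-03-01"},
    {"id": 2, "email": None, "hired": "2020-11-15"},
    {"id": 2, "email": "c@example.com", "hired": "15/11/2020"},
]

def missing_count(rows, field):
    """Completeness: number of rows where the field is absent or None."""
    return sum(1 for row in rows if row.get(field) is None)

def duplicate_values(rows, field):
    """Uniqueness: values of the field that occur more than once."""
    counts = Counter(row.get(field) for row in rows)
    return [value for value, n in counts.items() if n > 1]

def invalid_dates(rows, field, fmt="%Y-%m-%d"):
    """Validity: values that do not parse with the expected date format."""
    bad = []
    for row in rows:
        value = row.get(field)
        try:
            datetime.strptime(value, fmt)
        except (TypeError, ValueError):
            bad.append(value)
    return bad

print(missing_count(records, "email"))   # one email is missing
print(duplicate_values(records, "id"))   # id 2 appears twice
print(invalid_dates(records, "hired"))   # one date uses the wrong format
```

A data quality tool packages checks like these as reusable, declarative rules, so they do not have to be rewritten for every dataset.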
Setting Up the Environment
First, install Great Expectations and import the necessary libraries.
pip install great_expectations
After installation, create a Data Context. It is the main entry point for configuring and running Great Expectations.
# Import required modules from Great Expectations
import great_expectations as gx
import pandas as pd
# Initialize the Data Context
context = gx.get_context()
In this example, let’s assume you have an employee dataset in CSV format. We’ll load it into a Pandas DataFrame.
# Import employee data into a Pandas DataFrame
df = pd.read_csv("employee_data.csv")
df.head()
Connecting to Data
Now, we need to connect the DataFrame to Great Expectations. First, we create a data source. Next, we create a data asset. Then, we define a batch for the data. These components determine how Great Expectations accesses and organizes our data.
# Create Data Source, Data Asset, Batch Definition, and Batch
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
# batch_parameters must be a dictionary that maps "dataframe" to the DataFrame
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
Defining Expectations
Once the data is connected, we’ll define expectations for specific data columns. These expectations are like rules that the data should adhere to. For the employee dataset, let’s define the following expectations:
- ID: Employee ID should not be null.
- Age: Each employee’s age should be between 18 and 65.
- Department: Department names should match a predefined set.
# EmployeeID Expectation: Should not be null
expectation_employee_id_not_null = gx.expectations.ExpectColumnValuesToNotBeNull(
column="EmployeeID"
)
# Age Expectation: Age should be between 18 and 65
expectation_age = gx.expectations.ExpectColumnValuesToBeBetween(
column="age", min_value=18, max_value=65
)
# Department Expectation: Department should be one of the predefined values
expectation_department = gx.expectations.ExpectColumnValuesToBeInSet(
column="department", value_set=["Human Resources", "Engineering", "Marketing", "Finance"]
)
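The column names passed to each expectation must match the CSV headers exactly, including case. The expectations above reference "EmployeeID", "age", and "department", so it can help to confirm those headers exist before validating. The following is a small standard-library sketch of such a check; the helper name `missing_columns` and the sample file contents are hypothetical, not part of Great Expectations.

```python
import csv
import io

# Columns referenced by the expectations in this article.
REQUIRED_COLUMNS = {"EmployeeID", "age", "department"}

def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return sorted(set(required) - set(header))

# Hypothetical file contents standing in for employee_data.csv.
sample = io.StringIO("EmployeeID,Age,department\n1,34,Finance\n")
print(missing_columns(sample))  # "Age" does not match "age"
```

With a real file, the same function could be called on `open("employee_data.csv", newline="")`.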
Running Validations
With expectations in place, we can now validate the data. Validation checks each record against the rules and shows whether the data follows or breaks them.
# Validate Batch using Expectations for all columns
validation_result_employee_id_not_null = batch.validate(expectation_employee_id_not_null)
validation_result_age = batch.validate(expectation_age)
validation_result_department = batch.validate(expectation_department)
# Print validation results for each expectation
print("EmployeeID Not Null Validation Result:", validation_result_employee_id_not_null)
print("Age Validation Result:", validation_result_age)
print("Department Validation Result:", validation_result_department)
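The printed results are verbose, so a short summary per expectation can be easier to scan. The sketch below assumes each validation result has first been converted to a plain dictionary (for example with its `to_json_dict()` method) and that it carries a top-level `success` flag plus `element_count` and `unexpected_count` under `result`, as column-level expectations typically report. The `summarize` helper and the sample dictionary are hypothetical; the sample is hand-written and is not real output from the employee dataset.

```python
def summarize(label, result):
    """Build a one-line summary from a validation result given as a dict."""
    details = result.get("result", {})
    status = "PASS" if result.get("success") else "FAIL"
    unexpected = details.get("unexpected_count", 0)
    total = details.get("element_count", 0)
    return f"{label}: {status} ({unexpected} unexpected of {total} values)"

# Hand-written illustration, not real output from the employee dataset.
sample_result = {
    "success": False,
    "result": {"element_count": 200, "unexpected_count": 3},
}
print(summarize("Age", sample_result))
```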
Handling Data Quality Failures
If the data doesn’t meet the expectations, you need to decide what to do. Possible actions include:
- Alerting: Send notifications to relevant stakeholders about data quality issues.
- Data Rejection: Reject the current batch of data and reprocess it.
- Log and Continue: Log the failure for future analysis but continue processing other valid data.
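These options can be combined into a simple policy. The sketch below is one possible arrangement, not a Great Expectations feature: every failure is logged (which could also feed an alert), failures in checks marked critical reject the batch, and all other failures let processing continue. The names `handle_results` and `BatchRejected`, and the check names in the usage example, are hypothetical.

```python
import logging

logger = logging.getLogger("data_quality")

class BatchRejected(Exception):
    """Raised when a failed check should stop the pipeline."""

def handle_results(results, critical=()):
    """Apply a simple policy to {check_name: passed} validation outcomes.

    Failed checks listed in `critical` reject the batch; other failures
    are logged and processing continues.
    """
    failed = [name for name, passed in results.items() if not passed]
    for name in failed:
        # Log every failure; a real pipeline might also send an alert here.
        logger.warning("Data quality check failed: %s", name)
    blocking = [name for name in failed if name in critical]
    if blocking:
        raise BatchRejected(f"Rejected batch, failed checks: {blocking}")
    return failed

# Usage: a non-critical failure is logged and returned, not raised.
outcomes = {"employee_id_not_null": True, "age_between": False}
print(handle_results(outcomes, critical={"employee_id_not_null"}))
```

Which checks count as critical is a design decision for each pipeline. Here a missing employee ID would block the batch, while an out-of-range age would only be logged.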
Benefits of Using Great Expectations
Great Expectations provides several benefits for data science and engineering teams:
- Automated Data Quality Checks: It checks the data automatically before the data is used. This helps ensure that only data meeting your expectations reaches analysis or modeling.
- Standardized Expectations: You can use the same rules for data in different projects. This helps keep data quality consistent.
- Detailed Reporting: It gives reports that show how good the data is over time. This helps find and fix problems.
- Flexibility: Great Expectations works with many data sources and formats. It also works well with tools like Pandas, SQL, and Spark.
Conclusion
Data quality is very important in data science. Great Expectations automates data checks and fits well into data pipelines, so you can validate data continuously rather than only once. This helps keep your data accurate, consistent, and reliable, which is important for good decision-making. It handles simple rules such as non-null values as well as more complex custom checks, which makes it a flexible option for many data quality needs.
Jayita Gulati is a machine learning enthusiast and technical writer driven by her passion for building machine learning models. She holds a Master’s degree in Computer Science from the University of Liverpool.