# Data Validation with Great Expectations
Data validation is a critical component of production ML pipelines. This page explains how we use Great Expectations to ensure data quality throughout the DataOps workflow.
## What is Great Expectations?
Great Expectations is an open-source Python library for data validation, documentation, and profiling. It allows you to:
- Define expectations (assertions about your data)
- Validate data against those expectations
- Generate data quality reports
- Fail pipelines when data doesn't meet quality standards
Think of it as unit tests for your data.
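To make the analogy concrete, here is a minimal sketch using the same fluent pandas API shown later on this page (the file path follows this project's layout; adjust as needed):

```python
import great_expectations as gx

# Get a data context and wrap a CSV file in a validator
context = gx.get_context()
validator = context.sources.pandas_default.read_csv("data/raw/train.csv")

# Each expectation runs immediately, like a unit-test assertion
result = validator.expect_column_values_to_not_be_null(column="Sales")
print(result.success)  # True if no nulls were found
```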
## Why Great Expectations?
Without data validation:
- ❌ Bad data silently enters your pipeline
- ❌ Model performance degrades mysteriously
- ❌ Debugging requires manual data inspection
- ❌ Data quality assumptions are undocumented
With Great Expectations:
- ✅ Bad data is caught immediately
- ✅ Pipeline fails fast with clear error messages
- ✅ Data quality checks are automated
- ✅ Expectations serve as executable documentation
## How It Works
Great Expectations operates in three stages:
### 1. Define Expectations
Create a suite of expectations (validation rules) for your data:
```python
# Example: Raw train data expectations
validator.expect_table_columns_to_match_set(
    column_set=["Store", "DayOfWeek", "Date", "Sales", "Customers", ...]
)
validator.expect_column_values_to_not_be_null(
    column="Store"
)
validator.expect_column_values_to_be_between(
    column="Sales",
    min_value=0,
    max_value=1000000
)
```
### 2. Run Validation
Execute expectations against your data:
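A minimal sketch, assuming the `validator` from step 1 (`validate()` evaluates every expectation in the suite against the loaded batch):

```python
# Run all expectations in the suite against the current batch
results = validator.validate()

print(results.success)     # overall pass/fail
print(results.statistics)  # evaluated/successful expectation counts
```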
### 3. Review Results
Get immediate feedback on data quality:

```text
============================================================
Validating raw training data: data/raw/train.csv
============================================================
✓ PASSED
Total expectations: 13
Successful: 13
Failed: 0
Success rate: 100.0%
```
## Our Validation Strategy
We validate data at three critical stages in the DataOps pipeline:
### Stage 1: Raw Data Validation
**Location:** `great_expectations/expectations/raw_train_suite.json`
**Purpose:** Catch issues in incoming data before any processing
**Key Expectations:**

| Expectation | What It Checks |
|---|---|
| `expect_table_columns_to_match_set` | All required columns are present |
| `expect_column_values_to_not_be_null` | Critical fields have no missing values |
| `expect_column_values_to_be_of_type` | Data types are correct (int, float, string) |
| `expect_column_values_to_be_between` | Numeric values are in valid ranges |
| `expect_column_values_to_be_in_set` | Categorical values match expected set |
**Example Expectations:**

```json
{
  "expectation_type": "expect_column_values_to_be_between",
  "kwargs": {
    "column": "Sales",
    "min_value": 0,
    "max_value": 1000000
  }
},
{
  "expectation_type": "expect_column_values_to_be_in_set",
  "kwargs": {
    "column": "DayOfWeek",
    "value_set": [1, 2, 3, 4, 5, 6, 7]
  }
}
```
### Stage 2: Processed Data Validation
**Location:** `great_expectations/expectations/processed_data_suite.json`
**Purpose:** Verify processing didn't introduce errors
**Additional Expectations:**

| Expectation | What It Checks |
|---|---|
| `expect_table_row_count_to_be_between` | No rows were lost during processing |
| `expect_column_values_to_not_be_null` | Processing didn't create unexpected nulls |
| `expect_column_mean_to_be_between` | Statistical properties remain reasonable |
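For instance, a row-count guard can be declared like this sketch (the bounds below are illustrative; the real values live in the suite file):

```python
# Processing should not silently drop or duplicate rows
validator.expect_table_row_count_to_be_between(
    min_value=1_000_000,  # illustrative lower bound
    max_value=1_100_000,  # illustrative upper bound
)
```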
### Stage 3: Feature Validation
**Location:** `great_expectations/expectations/features_suite.json`
**Purpose:** Ensure feature engineering produced valid features
**Feature-Specific Expectations:**

| Expectation | What It Checks |
|---|---|
| `expect_column_values_to_not_contain_inf` | No infinite values from log transforms |
| `expect_column_values_to_not_be_null` | Features don't have unexpected NaN values |
| `expect_column_values_to_be_between` | Feature ranges are reasonable |
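As a sketch, a finite bounded-range check also catches `inf` values produced by a log transform, since infinity falls outside any finite range (the feature name `Sales_log` is hypothetical):

```python
# A finite range implicitly rejects inf/-inf from log transforms
validator.expect_column_values_to_be_between(
    column="Sales_log",  # hypothetical log-transformed feature
    min_value=0,
    max_value=20,        # generous ceiling for log1p of daily sales
)
```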
## Common Expectation Types
Great Expectations provides dozens of built-in expectations. Here are the most commonly used:
### Schema Expectations

```python
# Column presence
validator.expect_table_columns_to_match_set(
    column_set=["Store", "Date", "Sales"]
)

# Data types
validator.expect_column_values_to_be_of_type(
    column="Date",
    type_="datetime64[ns]"
)

# Unique values
validator.expect_column_values_to_be_unique(
    column="Store"
)
```
### Value Range Expectations

```python
# Numeric ranges
validator.expect_column_values_to_be_between(
    column="Sales",
    min_value=0,
    max_value=1000000
)

# Categorical sets
validator.expect_column_values_to_be_in_set(
    column="StateHoliday",
    value_set=["0", "a", "b", "c"]
)

# No nulls
validator.expect_column_values_to_not_be_null(
    column="Store"
)
```
### Statistical Expectations

```python
# Mean value
validator.expect_column_mean_to_be_between(
    column="Sales",
    min_value=5000,
    max_value=7000
)

# Standard deviation
validator.expect_column_stdev_to_be_between(
    column="Sales",
    min_value=3000,
    max_value=5000
)

# Quantile ranges
validator.expect_column_quantile_values_to_be_between(
    column="Sales",
    quantile_ranges={
        "quantiles": [0.5, 0.95],
        "value_ranges": [[5000, 7000], [15000, 20000]]
    }
)
```
## Validation Workflow
### Running Validations
Our validation script (`src/data/validate_data.py`) provides a unified interface:

```bash
# Validate raw data
python src/data/validate_data.py --stage raw --fail-on-error

# Validate processed data
python src/data/validate_data.py --stage processed --fail-on-error

# Validate features
python src/data/validate_data.py --stage features --fail-on-error
```

Key options:
- `--stage` - Which validation suite to run (`raw`, `processed`, or `features`)
- `--fail-on-error` - Exit with an error code if validation fails (recommended for CI/CD)
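The script's internals aren't reproduced here, but the `--fail-on-error` behavior typically reduces to a pattern like this sketch (`run_suite_for_stage` is a hypothetical stand-in for the script's suite-loading and validation logic):

```python
import argparse
import sys

def run_suite_for_stage(stage: str):
    """Hypothetical helper: load and validate the suite for `stage`."""
    raise NotImplementedError

parser = argparse.ArgumentParser()
parser.add_argument("--stage", choices=["raw", "processed", "features"], required=True)
parser.add_argument("--fail-on-error", action="store_true")
args = parser.parse_args()

results = run_suite_for_stage(args.stage)
if args.fail_on_error and not results.success:
    sys.exit(1)  # a non-zero exit code fails the CI step
```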
### Understanding Validation Results
**Success output:**

```text
============================================================
Validating processed data: data/processed/train_clean.parquet
============================================================
✓ PASSED
Total expectations: 15
Successful: 15
Failed: 0
Success rate: 100.0%
```

**Failure output:**

```text
============================================================
Validating raw data: data/raw/train.csv
============================================================
❌ FAILED
Total expectations: 13
Successful: 11
Failed: 2
Success rate: 84.6%

Failed expectations:
❌ expect_column_values_to_be_between (Sales)
   - Found 3 values outside range [0, 1000000]
   - Example failures: [-10, -5, -2]
❌ expect_column_values_to_not_be_null (Store)
   - Found 5 null values in Store column
```
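If you're working in Python rather than through the script, the same failure details can be pulled out of the validation result; a sketch, assuming the `results` object returned by `validator.validate()`:

```python
# Inspect each expectation's result and report failures
for r in results.results:
    if not r.success:
        cfg = r.expectation_config
        print(cfg.expectation_type, cfg.kwargs.get("column"))
        print(r.result.get("partial_unexpected_list", []))  # sample failing values
```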
## Creating Custom Expectations
You can add new expectations to suit your project's needs:
### 1. Interactive Exploration
Use the Great Expectations CLI to explore your data and generate expectations:

```bash
# Initialize Great Expectations
great_expectations init

# Create a checkpoint for validation
great_expectations checkpoint new my_checkpoint
```
### 2. Programmatic Creation
Add expectations directly in Python:

```python
import great_expectations as gx

# Load data context
context = gx.get_context()

# Create validator
validator = context.sources.pandas_default.read_csv(
    "data/raw/train.csv"
)

# Add custom expectations
validator.expect_column_unique_value_count_to_be_between(
    column="Store",
    min_value=1000,
    max_value=1200
)

# Save suite
validator.save_expectation_suite()
```
### 3. Modify Existing Suites
Edit expectation suite JSON files directly:

```json
{
  "expectation_type": "expect_column_values_to_match_regex",
  "kwargs": {
    "column": "Date",
    "regex": "^\\d{4}-\\d{2}-\\d{2}$"
  }
}
```
## Best Practices
### 1. Start Simple, Add Incrementally
Don't try to validate everything at once:

```python
# ✅ Good: Start with critical checks
validator.expect_column_values_to_not_be_null("Store")
validator.expect_column_values_to_not_be_null("Date")
validator.expect_column_values_to_not_be_null("Sales")

# ❌ Bad: 100 expectations on day 1
# (overwhelming, hard to maintain)
```
### 2. Focus on Business Logic
Validate assumptions that reflect business requirements:

```python
# ✅ Good: Business rule
validator.expect_column_values_to_be_between(
    column="Sales",
    min_value=0,        # Sales can't be negative
    max_value=1000000   # No single-day sales > $1M
)

# ❌ Less useful: Overly specific (pinning the mean to one exact value)
validator.expect_column_mean_to_be_between(
    column="Sales",
    min_value=5912.34567,  # Too precise, likely to break
    max_value=5912.34567
)
```
### 3. Use Tolerances for Statistical Checks
Allow for natural variation:

```python
# ✅ Good: Reasonable tolerance
validator.expect_column_mean_to_be_between(
    column="Sales",
    min_value=5000,
    max_value=7000,
    meta={"notes": "Historical mean ~6000, allow ±1000 variation"}
)
```
### 4. Document Expectations
Add metadata to explain why expectations exist:

```python
validator.expect_column_values_to_be_in_set(
    column="StateHoliday",
    value_set=["0", "a", "b", "c"],
    meta={
        "notes": "StateHoliday codes from Rossmann spec",
        "reference": "https://www.kaggle.com/c/rossmann-store-sales/data"
    }
)
```
### 5. Version Control Expectation Suites
Track changes to expectations like code:

```bash
# Commit expectation changes
git add great_expectations/expectations/
git commit -m "expectations: tighten sales range validation

- Reduce max_value from 2M to 1M
- Add outlier detection for sales > 100K
- Rationale: historical data shows 1M is realistic max"
```
## Troubleshooting
### Validation Fails After Data Update
**Symptom:** Validation that previously passed now fails
**Diagnosis:** Re-run the failing suite, review which expectations failed and their sample failing values, and decide whether the data is actually wrong or your assumptions about it have changed.
**Solutions:**
1. **Fix data at source** (preferred)
    - Contact data provider
    - Correct upstream ETL process
2. **Update expectations** (if assumptions changed) - see the sketch below
3. **Add data cleaning** (if the pattern is expected)
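If an assumption genuinely changed, one way to update an expectation is to re-state it against the current data and persist the suite; a sketch using the programmatic API from earlier (the new ceiling is illustrative):

```python
import great_expectations as gx

context = gx.get_context()
validator = context.sources.pandas_default.read_csv("data/raw/train.csv")

# Re-state the expectation with the revised assumption...
validator.expect_column_values_to_be_between(
    column="Sales",
    min_value=0,
    max_value=2000000  # illustrative: raised ceiling after reviewing new data
)

# ...and persist it to the suite
validator.save_expectation_suite(discard_failed_expectations=False)
```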
### Performance Issues with Large Datasets
**Symptom:** Validation takes too long
**Solutions:**
1. **Sample data for validation** - see the sketch after this list
2. **Reduce expectation complexity**
    - Remove statistical expectations on large columns
    - Focus on schema and critical value checks
3. **Parallelize validation**
    - Run different suites concurrently
    - Use Great Expectations' batch mode
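A minimal sampling sketch, assuming the fluent pandas API used elsewhere on this page (`read_dataframe` wraps an in-memory DataFrame; the sample size and expectations shown are illustrative):

```python
import great_expectations as gx
import pandas as pd

# Validate a random sample instead of the full dataset
df = pd.read_csv("data/raw/train.csv")
sample = df.sample(n=100_000, random_state=42)

context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(sample)
validator.expect_column_values_to_not_be_null(column="Store")
validator.expect_column_values_to_be_between(column="Sales", min_value=0, max_value=1000000)

print(validator.validate().statistics)
```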
## Next Steps
- **Individual Steps** - See validation in the context of the full pipeline
- **Automation** - Integrate validation into automated workflows
- **Best Practices** - Production-grade validation strategies

**External Resources:**
- Great Expectations Documentation
- Expectation Gallery - Full list of built-in expectations
- Great Expectations GitHub