Real-World Scenarios

This page walks through practical scenarios showing how the DataOps pipeline handles different situations in production.


Scenario 1: Initial Setup

Context: Setting up the DataOps pipeline for the first time with historical data.

Week 1: Baseline Establishment

# Run DataOps pipeline on historical data (2013-2015)
bash scripts/dataops_workflow.sh

# Expected output:
# ✓ Raw data validation: PASSED
# ✓ Processed 1,017,209 records
# ✓ Created 32 standard features
# ✓ All validations: PASSED

Results:

  • Clean processed data: data/processed/train_clean.parquet
  • Feature dataset: data/processed/train_features.parquet
  • DVC metadata: *.dvc files committed to Git
  • Ready for model training

Train Initial Model

# Train baseline and advanced models
python -m src.models.train_baselines
python -m src.models.train_advanced

# Model achieves RMSPE = 0.098 on validation
# Deploy to production
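
RMSPE here is the root mean squared percentage error used to score the Rossmann competition. A minimal sketch of the metric (the function name and the zero-sales handling are illustrative, not the project's code):

# Minimal RMSPE sketch; zero-sales days are excluded, as in the
# competition definition. Names here are illustrative assumptions.
import numpy as np

def rmspe(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    mask = y_true != 0
    pct_err = (y_true[mask] - y_pred[mask]) / y_true[mask]
    return float(np.sqrt(np.mean(pct_err ** 2)))

# rmspe(np.array([100.0, 200.0]), np.array([110.0, 190.0])) ≈ 0.079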

Scenario 2: New Data Arrives (Success Path)

Context: New week of sales data arrives from Rossmann's internal systems.

Week 2: Routine Data Update

# New data file arrives
cp rossmann_sales_2015_week32.csv data/raw/train.csv

# Run DataOps workflow (automated via cron or Airflow)
bash scripts/dataops_workflow.sh

The pipeline automatically:

  1. Validates new data - Catches any schema changes
  2. Merges with historical data - Combines with existing records
  3. Re-engineers all features - Includes new lags/rolling stats (see the sketch after this list)
  4. Versions new dataset - Creates new DVC snapshot
  5. Triggers model retraining - Signals ModelOps workflow
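
Step 3 recomputes all time-based features over the merged history. A minimal pandas sketch of the kind of lag and rolling statistics involved (window sizes and feature names are assumptions for illustration):

# Illustrative lag/rolling features recomputed over the merged history.
import pandas as pd

df = pd.read_parquet("data/processed/train_clean.parquet")
df = df.sort_values(["Store", "Date"])

# Lag: the same store's sales one week earlier
df["sales_lag_7"] = df.groupby("Store")["Sales"].shift(7)

# Rolling stat: 30-day mean, shifted one day so the current day's
# target never leaks into its own feature
df["sales_roll_mean_30"] = df.groupby("Store")["Sales"].transform(
    lambda s: s.shift(1).rolling(30, min_periods=1).mean()
)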

Model Retraining

# Automatic trigger (via Airflow/GitHub Actions)
python -m src.models.train_advanced

# New model trained on expanded dataset
# Achieves RMSPE = 0.097 (improvement!)
# Automatically deployed to production

Key Benefits:

  • ✅ Fully automated end-to-end
  • ✅ Quality gates ensure data integrity
  • ✅ Model continuously improves with more data
  • ✅ Zero manual intervention required

Scenario 3: Bad Data Arrives (Failure Path)

Context: New data file has quality issues that violate expectations.

Week 3: Schema Change Detected

# New data file has a schema issue (missing 'Store' column)
cp bad_data_missing_column.csv data/raw/train.csv

# Run DataOps workflow
bash scripts/dataops_workflow.sh

Pipeline Response

============================================================
Step 1: Validating raw data...
============================================================
❌ FAILED
Total expectations: 13
Successful: 12
Failed: 1

Failed expectations:
  ❌ expect_table_columns_to_match_set
     - Missing required column: Store
     - Found columns: [Date, Sales, Customers, ...]

Pipeline STOPPED at validation gate.
Exit code: 1

What Happens:

  1. ❌ Validation fails at Step 1
  2. 🛑 Pipeline stops immediately
  3. 📧 Alert sent to data engineering team
  4. ✅ Production model continues running on last known good data
  5. 👥 Team investigates and fixes data source

Critical Point: Bad data never enters the pipeline. This fail-safe behavior prevents broken models from reaching production.
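
A minimal sketch of this fail-fast gate in plain pandas (the pipeline itself uses Great Expectations; the column list and exit handling below are assumptions for illustration):

# Simplified schema gate: a non-zero exit code stops the workflow.
import sys
import pandas as pd

REQUIRED = {"Store", "Date", "Sales", "Customers"}  # assumed subset

df = pd.read_csv("data/raw/train.csv")
missing = REQUIRED - set(df.columns)
if missing:
    print(f"❌ FAILED - missing required columns: {sorted(missing)}")
    sys.exit(1)  # pipeline stops here; production keeps the old data
print("✓ Schema check passed")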


Scenario 4: Data Quality Degradation

Context: Data passes schema validation but has statistical anomalies.

Week 4: Outlier Detection

# New data has unusual sales patterns
cp data_with_outliers.csv data/raw/train.csv

# Run DataOps workflow
bash scripts/dataops_workflow.sh

Validation Catches Outliers

============================================================
Step 1: Validating raw data...
============================================================
⚠️  WARNING
Total expectations: 13
Successful: 12
Failed: 1

Failed expectations:
  ❌ expect_column_values_to_be_between (Sales)
     - Found 3 values outside expected range [0, 1000000]
     - Outliers: [1250000, 1100000, 1500000]
     - Possible cause: Holiday sales spike or data error

Recommend manual review before proceeding.

Response Options:

  1. Investigate outliers

    # Review outlier records
    python -c "
    import pandas as pd
    df = pd.read_csv('data/raw/train.csv')
    print(df[df['Sales'] > 1000000])
    "
    
  2. Update expectations (if legitimate)

    # If Black Friday sales spike is real
    validator.expect_column_values_to_be_between(
        column="Sales",
        min_value=0,
        max_value=2000000  # Increased limit
    )
    
  3. Cap outliers (if data errors)

    # In make_dataset.py
    df["Sales"] = df["Sales"].clip(upper=1000000)
    

Scenario 5: Feature Engineering Update

Context: Data scientist wants to add new experimental features.

Developer Iteration

# Modify feature engineering code
vim src/features/build_features.py

# Add new experimental features (e.g., holiday interactions)
# Test locally
python -m src.features.build_features

# Run DVC pipeline (only reruns changed stages)
dvc repro
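
As one concrete example, a holiday-interaction feature might look like this (the column name and encoding are hypothetical, not the project's actual code):

# Hypothetical holiday-interaction feature for build_features.py.
import pandas as pd

df = pd.read_parquet("data/processed/train_clean.parquet")

# Interaction: promotion running during a state holiday
df["promo_x_state_holiday"] = (
    df["Promo"] * (df["StateHoliday"] != "0").astype(int)
)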

DVC Smart Caching

✅ Skipping validate_raw (unchanged)
✅ Skipping process_data (unchanged)
✅ Skipping validate_processed (unchanged)
⚙️  Re-running build_features (code changed)
⚙️  Re-running validate_features (input changed)

Completed in 45 seconds (vs. 5 minutes for a full rebuild)

Benefits:

  • ⚡ Fast iteration (only reruns what changed)
  • ✅ Validation ensures new features are valid
  • 🔄 Easy rollback if features don't improve model

Scenario 6: Rollback to Previous Data Version

Context: New data processing introduced a bug. Need to revert.

Problem Detection

# Week 5: Model performance suddenly drops
# RMSPE increased from 0.097 to 0.125
# Suspect data processing bug in latest run
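
A minimal sketch of the kind of check that would surface this drop in a monitoring job (the 10% threshold and metrics paths are hypothetical):

# Hypothetical performance-regression check; paths and threshold
# are assumptions, not the project's monitoring code.
import json

baseline = json.load(open("reports/metrics_baseline.json"))["rmspe"]  # e.g. 0.097
current = json.load(open("reports/metrics_latest.json"))["rmspe"]     # e.g. 0.125

if current > 1.10 * baseline:  # more than 10% relative degradation
    print(f"⚠️ RMSPE regressed: {baseline:.3f} → {current:.3f}")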

Rollback with DVC

# List DVC commits
git log --oneline data/processed/train_features.parquet.dvc

# Output:
# a1b2c3d data: version features with new holiday interactions
# e4f5a6b data: version processed data and features (← this one worked!)
# c7d8e9f data: initial feature engineering

# Checkout previous version
git checkout e4f5a6b data/processed/train_features.parquet.dvc

# Pull previous data from DVC cache
dvc checkout

# Retrain model on previous data
python -m src.models.train_advanced

# Model performance restored: RMSPE = 0.097 ✅

Next Steps:

  • Debug feature engineering code
  • Fix bug
  • Re-run pipeline
  • Validate performance before deploying

Scenario 7: Scheduled Pipeline Execution

Context: Production setup with weekly automated data refreshes.

Cron Schedule

# crontab -e
# Run DataOps pipeline every Monday at 2 AM
0 2 * * 1 cd /path/to/rossmann-forecasting && bash scripts/dataops_workflow.sh >> logs/dataops.log 2>&1

Typical Week

Monday 2:00 AM:

  • Cron triggers DataOps pipeline
  • New sales data processed
  • Features engineered
  • Data versioned with DVC

Monday 3:00 AM:

  • Model retraining triggered
  • New model evaluated
  • If better: deployed to production
  • If worse: old model retained (see the sketch below)
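
A minimal sketch of that promote-if-better gate (the directory layout, metric files, and copy-based "deploy" are hypothetical):

# Hypothetical promote-if-better gate for the ModelOps step.
import json
import shutil
from pathlib import Path

NEW, PROD = Path("models/candidate"), Path("models/production")

new_rmspe = json.loads((NEW / "metrics.json").read_text())["rmspe"]
prod_rmspe = json.loads((PROD / "metrics.json").read_text())["rmspe"]

if new_rmspe < prod_rmspe:  # lower RMSPE is better
    shutil.copytree(NEW, PROD, dirs_exist_ok=True)  # "deploy"
    print(f"Deployed: RMSPE {prod_rmspe:.3f} → {new_rmspe:.3f}")
else:
    print(f"Retained previous model (candidate RMSPE {new_rmspe:.3f})")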

Monday 4:00 AM:

  • Monitoring dashboards updated
  • Team receives summary email
  • System ready for the week

Scenario 8: Multi-Environment Setup

Context: Separate pipelines for development, staging, and production.

Environment Structure

# Development: Experiment with new features
dev/
├── data/raw/train_sample.csv  # 10% sample for fast iteration
└── .dvc/config  # Points to dev S3 bucket

# Staging: Test full pipeline
staging/
├── data/raw/train.csv  # Full dataset
└── .dvc/config  # Points to staging S3 bucket

# Production: Automated runs
production/
├── data/raw/train.csv  # Latest production data
└── .dvc/config  # Points to prod S3 bucket

Workflow

  1. Develop - Iterate on sample data
  2. Test - Run on full dataset in staging
  3. Deploy - Promote to production after validation

Key Takeaways

  1. Validation gates prevent bad data from propagating

    • Fail fast, fail safely
    • Alert humans when intervention is needed

  2. Automation enables hands-off operation

    • Scheduled runs work reliably
    • Human intervention only for exceptions

  3. Versioning enables easy rollback

    • DVC + Git track the exact data state
    • Return to any previous version quickly

  4. Smart caching speeds up iteration

    • DVC only reruns changed stages
    • Developers iterate faster

  5. Quality over speed

    • Better to run on slightly outdated data than on broken data
    • Reliability > freshness

Next Steps