Model Monitoring Overview¶

Introduction¶

Model monitoring is a critical component of production ML systems that ensures models continue to perform well after deployment. Once a model is serving predictions in production, its performance can degrade over time due to changes in the underlying data distribution, business environment, or data quality issues.

This project implements a prediction logging and drift detection system to monitor the Rossmann sales forecasting model in production. The monitoring infrastructure captures prediction metadata, tracks feature distributions, and alerts when data drift is detected.

Why Monitor ML Models?¶

Machine learning models are trained on historical data, but production environments are dynamic:

Data drift: Feature distributions change over time (e.g., seasonal patterns, new promotions)
Concept drift: The relationship between features and target changes (e.g., customer behavior shifts)
Data quality issues: Missing values, schema changes, or data pipeline failures
Business changes: New stores, products, or market conditions not seen during training

Without monitoring, these issues can silently degrade model performance, leading to poor business decisions.

Monitoring Components¶

Our monitoring system consists of three key components:

1. Prediction Logging¶

Every prediction made by the deployed model is logged to a SQLite database with:

Metadata: Timestamp, batch ID, model version, model stage
Raw inputs: Store, date, day of week, open, promo, holidays
Key engineered features: Month, year, store type, assortment, competition distance, promo2 status
Prediction: The forecasted sales value
Performance metrics: API response time

Database location: data/monitoring/predictions.db

Key features tracked (~10 most important):

promo: Daily promotion active (0/1)
day_of_week: Day of week (1-7)
month: Month of year (1-12)
state_holiday: State holiday type (0, a, b, c)
school_holiday: School holiday indicator (0/1)
store_type: Store format (a, b, c, d)
assortment: Product assortment level (a, b, c)
competition_distance: Distance to nearest competitor
promo2: Long-running promo participation (0/1)
is_promo2_active: Whether Promo2 is currently active (0/1)

2. Data Drift Detection¶

The drift detection system compares production data distributions against training data using:

Statistical tests: Kolmogorov-Smirnov for numerical features, total variation distance for categorical features
Reference data: Full training feature set (data/processed/train_features.parquet)
Detection window: Configurable (default: 7 days of production data)
Alerting: Visual dashboard warnings when drift exceeds thresholds

See: Data Drift Detection for detailed methodology

3. Monitoring Dashboard¶

An interactive Streamlit dashboard provides:

Usage statistics: Total predictions, daily volume, model version distribution
Drift reports: Feature-level drift scores, distribution comparisons, trend analysis
Visualizations: Time-series charts, histograms, comparison plots
Recommendations: When to retrain, best practices, next steps

Access: Integrated into main Streamlit deployment at http://localhost:8501 → Monitoring page

Monitoring Workflow¶

graph LR
    A[User Request] --> B[FastAPI Endpoint]
    B --> C[Feature Engineering]
    C --> D[Model Prediction]
    D --> E[Log to Database]
    E --> F[predictions.db]

    G[Monitoring Dashboard] --> H[Query Database]
    H --> F
    H --> I[Load Reference Data]
    I --> J[train_features.parquet]

    H --> K[Generate Drift Report]
    I --> K
    K --> L[Statistical Tests]
    L --> M[Drift Summary + Plots]
    M --> G

Flow:

User makes prediction request via API or Streamlit
FastAPI processes request and generates prediction
Prediction logger captures inputs, features, and prediction
Data stored in SQLite database with timestamp
Monitoring dashboard queries recent predictions
Drift detector compares production vs. training distributions
Statistical tests identify drifted features
Dashboard displays drift summary and visualizations

Design Decisions¶

Why SQLite?¶

Simplicity: File-based, no server setup required
Portability: Easy to backup, version, and share
Sufficient scale: Handles millions of predictions efficiently
Production note: Replace with PostgreSQL/MySQL for high-throughput production systems

Why ~10 Key Features?¶

Focus: Monitor most important features rather than all 46
Performance: Faster drift detection on subset
Interpretability: Easier to understand what's driving drift
Coverage: Selected features represent different types (temporal, promotional, store characteristics)

Why Full Training Data as Reference?¶

Accuracy: Most complete representation of expected distribution
Consistency: Avoids sampling bias from using subset
Trade-off: Larger file size (~200MB parquet) vs. 10MB sample
Note: Can switch to sampled reference if performance is an issue

Monitoring Best Practices¶

When to Check for Drift¶

Weekly: Run drift detection on past 7 days vs. training data
After events: Check after major promotions, holidays, or business changes
Before retraining: Understand what has changed since last training
Continuous: Consider automated daily/weekly drift monitoring jobs

When to Retrain¶

Consider retraining if you observe:

Significant drift: >20% of key features showing drift
Target drift: Prediction distribution has shifted substantially
Business changes: New stores, products, or market conditions
Regular schedule: Monthly or quarterly retraining even without drift
Performance degradation: Actual sales vs. predictions show increasing errors

What to Do When Drift is Detected¶

Investigate: Review drift report to identify which features are drifting
Validate: Check if drift is expected (seasonal patterns) or concerning (data quality)
Assess impact: Does drift affect prediction quality? Check actual vs. predicted performance
Root cause analysis: Understand why drift occurred (business change, data pipeline issue, etc.)
Decide action:
- Retrain if drift is substantial and affecting performance
- Monitor if drift is minor or expected seasonal variation
- Fix data if drift is due to data quality issues

Current Limitations¶

What This Monitoring System Does¶

✅ Logs all predictions with key features
✅ Detects distribution changes in features (data drift)
✅ Provides visual comparison of reference vs. production distributions
✅ Alerts when drift thresholds are exceeded
✅ Tracks prediction volume and model version usage

What This Monitoring System Does NOT Do¶

❌ Performance monitoring: Does not track prediction accuracy (requires ground truth labels)
❌ Root cause analysis: Flags drift but doesn't explain why it occurred
❌ Automated retraining: Does not automatically trigger model retraining
❌ Target drift detection: Cannot detect concept drift without actual sales data
❌ Advanced drift analysis: No temporal drift trends, multivariate drift, or sensitivity analysis

Scope and Production Considerations¶

This implementation provides a lightweight monitoring foundation focused on the essentials: prediction logging, data drift detection, and basic visualization. It's suitable for demonstration purposes and small-scale deployments where monitoring is integrated directly into the Streamlit app for simplicity. However, a comprehensive production system would extend this foundation with additional capabilities including performance tracking with ground truth labels, temporal drift trend analysis, multivariate drift detection, automated retraining workflows, scheduled monitoring jobs, alerting integration (Slack, email), time-series databases (InfluxDB, Prometheus), real-time dashboards (Grafana), and deployment as a separate monitoring service decoupled from the prediction API. This separation is particularly important in production environments as it provides better security (monitoring internals not exposed to customers), performance (monitoring queries don't impact prediction latency), scalability (independent scaling), and maintenance flexibility (updates without redeploying prediction services).

Next Steps¶

Review: Data Drift Detection for statistical methodology
Explore: Dashboard Guide for using the monitoring interface
Implement: Performance tracking when ground truth labels become available
Plan: Production monitoring architecture based on team needs