Rossmann Sales Forecasting

Production MLOps System

This repository accompanies the ML/AI System Design book, providing a hands-on implementation of the concepts covered in the course.

This project demonstrates end-to-end MLOps best practices for traditional machine learning, using the Kaggle Rossmann Store Sales competition as a real-world example. Kaggle was chosen because many students and newcomers to MLOps have already worked with its datasets, so a familiar dataset lets them focus on how to build a production ML system.

Overview

The Business Problem

This project is based on the Kaggle Rossmann Store Sales competition, which challenges participants to forecast daily sales for stores of the Rossmann drugstore chain. Rossmann operates over 3,000 stores in 7 European countries, and accurate sales forecasts are critical for inventory management, staffing decisions, and financial planning.

The competition dataset includes:

Historical sales data for 1,115 stores from 2013 to 2015
Store metadata, including type, assortment, and competition distance
Promotional information for regular and extended promotions
External factors such as holidays, school closures, and seasonality

The forecasting challenge requires predicting sales for a 6-week period, with performance measured by RMSPE (Root Mean Square Percentage Error).
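To make the metric concrete, RMSPE is the square root of the mean of ((actual - predicted) / actual)^2 over the scored rows. The following is a minimal sketch in plain Python, not code taken from this repository. It assumes the competition's convention that rows with zero actual sales are excluded from scoring, since the percentage error is undefined there. The function name `rmspe` and the sample numbers are made up for illustration.

```python
import math


def rmspe(actual, predicted):
    """Root Mean Square Percentage Error over rows with non-zero actual sales."""
    # Skip zero-sales rows: the percentage error is undefined when actual == 0.
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("RMSPE needs at least one non-zero actual value")
    squared_pct_errors = [((a - p) / a) ** 2 for a, p in pairs]
    return math.sqrt(sum(squared_pct_errors) / len(squared_pct_errors))


# Illustrative values only: two rows are off by 10%, one is exact,
# and the zero-sales row is ignored.
actual = [100.0, 200.0, 0.0, 400.0]
predicted = [110.0, 180.0, 50.0, 400.0]
score = rmspe(actual, predicted)
print(f"RMSPE = {score:.4f}")
```

Because the error is relative, a miss of 10 units on a 100-unit day counts as much as a miss of 100 units on a 1,000-unit day, which suits a chain whose stores differ widely in volume.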

The Hypothetical Scenario

Imagine you work as a data scientist at Rossmann. You've successfully developed a forecasting model that achieves strong performance (RMSPE < 0.10) on historical data. Now, your director has tasked you with productionizing this system to support ongoing business operations.

The requirements:

Automated daily/weekly updates: New sales data arrives regularly and must be processed automatically
Data quality guarantees: Invalid data must be caught before it corrupts the system
Model retraining: The model should retrain automatically as new data becomes available
Production deployment: Forecasts must be accessible via API and dashboard for business stakeholders
Monitoring and alerts: The system must detect data drift and performance degradation
Reproducibility: All data, models, and predictions must be versioned and auditable
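As an illustration of the data quality requirement, here is a minimal validation gate sketched in plain Python. It is not the project's Great Expectations suite and does not use that library's API. The column names (`Store`, `Date`, `Sales`, `Open`) follow the Kaggle `train.csv` file, while the specific rules, the helper names `validate_row` and `validate_batch`, and the sample records are assumptions made for this example.

```python
from datetime import date

# Assumption for illustration: store IDs run from 1 to 1115, as in the Kaggle data.
MAX_STORE_ID = 1115


def validate_row(row):
    """Return a list of human-readable problems found in one daily sales record."""
    problems = []
    if not 1 <= row["Store"] <= MAX_STORE_ID:
        problems.append(f"Store {row['Store']} outside 1..{MAX_STORE_ID}")
    if not isinstance(row["Date"], date):
        problems.append("Date must be a datetime.date")
    if row["Open"] not in (0, 1):
        problems.append(f"Open must be 0 or 1, got {row['Open']}")
    if row["Sales"] < 0:
        problems.append(f"Sales must be non-negative, got {row['Sales']}")
    if row["Open"] == 0 and row["Sales"] > 0:
        problems.append("Closed store reports non-zero sales")
    return problems


def validate_batch(rows):
    """Split a batch into clean rows and (row_index, problems) rejections."""
    clean, rejected = [], []
    for index, row in enumerate(rows):
        problems = validate_row(row)
        if problems:
            rejected.append((index, problems))
        else:
            clean.append(row)
    return clean, rejected


# Made-up records: the second has negative sales, the third an unknown store ID.
batch = [
    {"Store": 1, "Date": date(2015, 7, 31), "Sales": 5000, "Open": 1},
    {"Store": 2, "Date": date(2015, 7, 31), "Sales": -10, "Open": 1},
    {"Store": 9999, "Date": date(2015, 7, 31), "Sales": 0, "Open": 0},
]
clean, rejected = validate_batch(batch)
for index, problems in rejected:
    print(f"row {index}: {'; '.join(problems)}")
```

Returning the rejections instead of raising on the first bad row lets a pipeline decide whether to quarantine a few records or fail the whole run.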

This repository provides an exemplar approach to productionizing a Kaggle-style model, assuming:

New actual sales data arrives on a daily or weekly basis
Historical data grows continuously, improving model performance over time
The system must run reliably with minimal human intervention
Quality and reproducibility are more important than speed
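Under these assumptions, each retraining run needs an evaluation that respects time order, because a random split would let the model see the future. The sketch below holds out the most recent 42 days, mirroring the competition's 6-week forecast horizon. It is one plausible approach and is not confirmed to be the split this repository uses. The function name `split_by_time`, the `holdout_days` parameter, and the generated rows are hypothetical.

```python
from datetime import date, timedelta


def split_by_time(rows, holdout_days=42):
    """Hold out the most recent `holdout_days` days of data for validation."""
    latest = max(row["Date"] for row in rows)
    cutoff = latest - timedelta(days=holdout_days)
    # Everything up to and including the cutoff trains; later rows validate.
    train = [row for row in rows if row["Date"] <= cutoff]
    valid = [row for row in rows if row["Date"] > cutoff]
    return train, valid


# Made-up history: 100 consecutive days for a single store.
start = date(2015, 1, 1)
rows = [
    {"Store": 1, "Date": start + timedelta(days=i), "Sales": 5000 + i}
    for i in range(100)
]
train, valid = split_by_time(rows)
print(len(train), len(valid))
```

As new data arrives, rerunning the same split moves the cutoff forward automatically, so the validation window always covers the latest six weeks.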

What This Project Demonstrates

The Rossmann Sales Forecasting project is a production-ready ML system that showcases modern MLOps practices:

📊 DataOps: Data versioning (DVC), validation (Great Expectations), reproducible pipelines
🤖 ModelOps: Experiment tracking (MLflow), model registry, hyperparameter tuning
🚀 Deployment: REST API (FastAPI), interactive dashboard (Streamlit), containerization (Docker)
📈 Monitoring: Data drift detection (Evidently), model performance tracking
✅ CI/CD: Automated testing (pytest), GitHub Actions workflows

This is more than a Kaggle competition submission: it is a blueprint for building production ML systems that handle the messy realities of real-world data and operational constraints.

Key Features

Production-Grade Architecture

Modular code structure with clear separation of concerns
Comprehensive test suite with >80% coverage
Type hints and documentation throughout
Modern Python packaging with uv

MLOps Infrastructure

Experiment Tracking: All runs logged to MLflow
Data Validation: Automated quality checks with Great Expectations
Version Control: Data and models versioned with DVC
Containerization: Docker for reproducible deployments
Monitoring: Drift detection and performance tracking
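To give a feel for what drift detection measures, the sketch below computes the Population Stability Index (PSI) between a reference sample and a current sample of one numeric feature. This is a generic technique written in plain Python for illustration. It is not Evidently's API and is not confirmed to be the statistic this project uses. The function name `psi`, the bin count, and the synthetic data are assumptions.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index of `current` against `reference`.

    Bins are equal-width over the reference range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample must contain more than one distinct value")
    width = (hi - lo) / n_bins

    def bin_fractions(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(count / len(values), eps) for count in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic feature values: the "shifted" sample moves every value up by 300.
reference = [float(x) for x in range(1000)]
shifted = [x + 300.0 for x in reference]
print(f"no drift: {psi(reference, reference):.3f}")
print(f"shifted:  {psi(reference, shifted):.3f}")
```

A common rule of thumb treats a PSI above roughly 0.25 as a significant shift, though any alerting threshold should be tuned to the feature and the business tolerance.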

Quick Links

Getting Started: Install and set up the project in minutes (see Quick Start)
DataOps: Learn data processing, validation, and versioning (see DataOps Guide)
ModelOps: Train, track, and register models with MLflow (see ModelOps Guide)
Deployment: Deploy models via API and dashboard (Coming Soon)

Technology Stack

Core ML

Python 3.10+ with modern packaging (uv, pyproject.toml)
Pandas for data manipulation
Scikit-learn for preprocessing and metrics
LightGBM, XGBoost, CatBoost for gradient boosting

MLOps Tools

MLflow - Experiment tracking and model registry
DVC - Data and pipeline versioning
Great Expectations - Data validation
Evidently - Drift detection

Deployment

FastAPI - REST API for predictions
Streamlit - Interactive dashboard
Docker - Containerization
GitHub Actions - CI/CD automation

Development

pytest - Testing framework
black & ruff - Code formatting and linting
mypy - Type checking
MkDocs Material - Documentation