Detailed Setup Guide¶
Complete installation and setup instructions for the Rossmann forecasting project. This guide provides step-by-step explanations for users new to MLOps projects.
Already Familiar with Python Projects?
If you're comfortable with Python development, try the Quick Start Guide for a faster setup.
Prerequisites¶
- Python 3.10 or higher
- Git
- 8GB+ RAM (recommended for processing full dataset)
- ~5GB disk space (for data and models)
Installation¶
Step 1: Install uv Package Manager¶
uv is a fast Python package manager that we use for dependency management.
Verify installation:
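A minimal sketch of the check, assuming uv was installed with its official standalone installer (the installer line is shown as a comment for reference):

```shell
# Install uv with the official installer (see https://docs.astral.sh/uv/):
#   curl -LsSf https://astral.sh/uv/install.sh | sh
# Then confirm uv is on your PATH:
if command -v uv >/dev/null 2>&1; then
  uv --version
else
  echo "uv not found -- restart your shell or check your PATH"
fi
```

If the version number prints, uv is ready to use; otherwise open a new terminal so your shell picks up the updated PATH.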
Step 2: Clone Repository¶
Step 3: Create Virtual Environment¶
# Create virtual environment
uv venv
# Activate environment
source .venv/bin/activate # macOS/Linux
# or
.venv\Scripts\activate # Windows
You should see (.venv) in your terminal prompt.
Step 4: Install Dependencies¶
# Install production dependencies
uv pip install -e .
# Or install with dev dependencies (recommended)
uv pip install -e ".[dev]"
# Or install everything (dev + docs)
uv pip install -e ".[dev,docs]"
Step 5: Verify Installation¶
# Run tests to verify setup
pytest tests/ -v
# Check Python packages
python -c "
import pandas
import lightgbm
import xgboost
import catboost
print('✓ All packages installed successfully')
"
Data Setup¶
Verify Data Files¶
Good news! The raw data files are already included in this repository, so you can start working immediately.
Verify that train.csv and store.csv are present in the data/raw/ directory before continuing.
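As a quick sanity check, the following small Python sketch (assuming the repository layout described above, with raw files under data/raw/) reports any required files that are missing:

```python
from pathlib import Path


def check_data_files(raw_dir="data/raw", required=("train.csv", "store.csv")):
    """Return a list of the required file names missing from raw_dir."""
    raw = Path(raw_dir)
    return [name for name in required if not (raw / name).exists()]


if __name__ == "__main__":
    missing = check_data_files()
    if missing:
        print(f"✗ Missing files: {', '.join(missing)}")
    else:
        print("✓ All raw data files present")
```

Run it from the repository root; a clean checkout should report that all raw data files are present.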
Download Fresh Data (Optional)¶
If you want to download the latest data directly from Kaggle:
- Visit the Kaggle competition page
- Download train.csv and store.csv
- Place them in the data/raw/ directory
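The manual download can also be scripted. This is a hedged sketch that assumes the Kaggle CLI is installed, credentials are configured in ~/.kaggle/kaggle.json, and the competition slug is rossmann-store-sales (the standard slug for this competition):

```shell
# Download competition data into data/raw/ via the Kaggle CLI
# (assumes the CLI is installed and API credentials are configured)
if command -v kaggle >/dev/null 2>&1; then
  kaggle competitions download -c rossmann-store-sales -p data/raw
  unzip -o data/raw/rossmann-store-sales.zip -d data/raw
else
  echo "kaggle CLI not found -- download the files manually from kaggle.com"
fi
```

Note that Kaggle requires you to accept the competition rules on the website before the API will serve the files.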
Data Source
This project uses data from the Kaggle Rossmann Store Sales competition. The included data is for educational purposes.
Exploring MLOps Workflows¶
Now that setup is complete, you can explore different MLOps workflows:
1. DataOps Workflow¶
Process and validate data using production-grade practices:
# Run complete DataOps pipeline
bash scripts/dataops_workflow.sh
# Or run individual steps:
python src/data/validate_data.py --stage raw
python -m src.data.make_dataset
python -m src.features.build_features
Learn more: DataOps Workflow Guide
2. Model Experimentation¶
Explore different modeling approaches:
# Option A: Jupyter Notebooks (recommended for learning)
jupyter lab
# Navigate to notebooks/ and run in order:
# - 01-eda-and-cleaning.ipynb
# - 02-feature-engineering.ipynb
# - 03-baseline-models.ipynb
# - 04-advanced-models-and-ensembles.ipynb
# Once you have run the notebooks above, you can run the
# Python scripts to mimic production model training:
python -m src.models.train_ensemble
python -m src.models.validate_model
Learn more: Model Training Guide
3. Reproducible Pipelines with DVC¶
Run automated, cacheable pipelines:
# Run entire pipeline (only reruns changed stages)
dvc repro
# Run specific stage
dvc repro build_features
# Visualize pipeline
dvc dag
Learn more: DVC Pipeline Guide
4. Deployment (Optional)¶
Deploy models via API and dashboard:
# Start all services with Docker
docker-compose up --build
# Access services:
# - FastAPI: http://localhost:8000
# - Streamlit: http://localhost:8501
# - MLflow: http://localhost:5000
Learn more: Deployment guide coming soon
Next Steps¶
After setup is complete:
- ✅ Quick Start Guide - Run your first workflow
- ✅ Project Structure - Understand the codebase
- ✅ DataOps Workflow - Process and validate data
- ✅ Model Training - Train your first model
Additional Resources¶
- MkDocs Material - Documentation theme
- uv documentation - Package manager
- DVC documentation - Data version control
- MLflow documentation - Experiment tracking