
Detailed Setup Guide

Complete installation and setup instructions for the Rossmann forecasting project. This guide provides step-by-step explanations for users new to MLOps projects.

Already Familiar with Python Projects?

If you're comfortable with Python development, try the Quick Start Guide for a faster setup.

Prerequisites

  • Python 3.10 or higher
  • Git
  • 8GB+ RAM (recommended for processing full dataset)
  • ~5GB disk space (for data and models)
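To confirm the Python-version and disk-space requirements above before installing anything, you can run a small stdlib-only check (a sketch; the 5 GB figure mirrors the estimate in the list):

```python
import shutil
import sys

# Check the Python version against the project's minimum (3.10)
ok = sys.version_info >= (3, 10)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}:",
      "OK" if ok else "too old, need 3.10+")

# Check free disk space in the current directory (~5 GB needed for data and models)
free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk space: {free_gb:.1f} GB")
```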

Installation

Step 1: Install uv Package Manager

uv is a fast Python package manager that we use for dependency management.

# Option 1: Install with pip
pip install uv

# Option 2: Standalone installer (macOS/Linux)
curl -LsSf https://astral.sh/uv/install.sh | sh

# After installation, restart your terminal or run:
source $HOME/.cargo/env

# Option 3: Standalone installer (Windows PowerShell)
irm https://astral.sh/uv/install.ps1 | iex

Verify installation:

uv --version

Step 2: Clone Repository

git clone https://github.com/bradleyboehmke/rossmann-forecasting.git
cd rossmann-forecasting

Step 3: Create Virtual Environment

# Create virtual environment
uv venv

# Activate environment
source .venv/bin/activate  # macOS/Linux
# or
.venv\Scripts\activate  # Windows

You should see (.venv) in your terminal prompt.

Step 4: Install Dependencies

# Install production dependencies
uv pip install -e .

# Or install with dev dependencies (recommended)
uv pip install -e ".[dev]"

# Or install everything (dev + docs)
uv pip install -e ".[dev,docs]"

Step 5: Verify Installation

# Run tests to verify setup
pytest tests/ -v

# Check Python packages
python -c "
import pandas
import lightgbm
import xgboost
import catboost
print('✓ All packages installed successfully')
"

Data Setup

Verify Data Files

Good news! The raw data files are already included in this repository, so you can start working immediately.

Verify data files are present:

ls -lh data/raw/train.csv data/raw/store.csv

Expected output:

-rw-r--r--  1 user  staff   17M  train.csv
-rw-r--r--  1 user  staff   45K  store.csv
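Beyond checking that the files exist, a quick pandas sanity check can confirm they parse correctly. This is a minimal sketch; the column names (`Store`, `Date`, `Sales`) come from the Kaggle Rossmann dataset:

```python
import pandas as pd

def check_raw_data(train_path="data/raw/train.csv"):
    """Load the raw training file and report its shape and key columns."""
    df = pd.read_csv(train_path, low_memory=False)
    # The Kaggle Rossmann training data should include these columns
    expected = {"Store", "Date", "Sales"}
    missing = expected - set(df.columns)
    print(f"{len(df):,} rows, {len(df.columns)} columns")
    if missing:
        print(f"Missing expected columns: {missing}")
    return df

if __name__ == "__main__":
    check_raw_data()
```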

Download Fresh Data (Optional)

If you want to download the latest data directly from Kaggle, use either method below.

Option A: Kaggle CLI

# Install Kaggle CLI
pip install kaggle

# Configure API credentials (~/.kaggle/kaggle.json)
# Create an API token at: https://www.kaggle.com/settings

# Download competition data
kaggle competitions download -c rossmann-store-sales

# Extract to data/raw/
unzip rossmann-store-sales.zip -d data/raw/

Option B: Manual download

  1. Visit the Kaggle Rossmann Store Sales competition page
  2. Download train.csv and store.csv
  3. Place them in the data/raw/ directory

Data Source

This project uses data from the Kaggle Rossmann Store Sales competition. The included data is for educational purposes.

Exploring MLOps Workflows

Now that setup is complete, you can explore different MLOps workflows:

1. DataOps Workflow

Process and validate data using production-grade practices:

# Run complete DataOps pipeline
bash scripts/dataops_workflow.sh

# Or run individual steps:
python src/data/validate_data.py --stage raw
python -m src.data.make_dataset
python -m src.features.build_features
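The validation step's actual checks live in src/data/validate_data.py; as an illustrative sketch (not the project's implementation) of the kind of rules such a raw-data stage typically enforces with plain pandas:

```python
import pandas as pd

def validate_raw(df: pd.DataFrame) -> list[str]:
    """Return a list of rule violations found in raw sales data.

    Illustrative rules only -- the real pipeline may check more or differently.
    """
    errors = []
    if df["Sales"].lt(0).any():
        errors.append("negative Sales values")
    if df["Store"].isna().any():
        errors.append("missing Store ids")
    if df.duplicated(subset=["Store", "Date"]).any():
        errors.append("duplicate (Store, Date) rows")
    return errors

# Example usage with a tiny in-memory frame containing two violations
sample = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2015-01-01", "2015-01-01"],
    "Sales": [100, -5],
})
print(validate_raw(sample))
```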

Learn more: DataOps Workflow Guide

2. Model Experimentation

Explore different modeling approaches:

# Option A: Jupyter Notebooks (recommended for learning)
jupyter lab

# Navigate to notebooks/ and run in order:
# - 01-eda-and-cleaning.ipynb
# - 02-feature-engineering.ipynb
# - 03-baseline-models.ipynb
# - 04-advanced-models-and-ensembles.ipynb
# After working through the notebooks, run the Python scripts
# that mirror production model training:
python -m src.models.train_ensemble
python -m src.models.validate_model
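The internals of train_ensemble aren't shown here, but weighted averaging of per-model predictions is the typical ensembling pattern; as a toy numpy illustration (the model names match the project's libraries, but the predictions and weights are made up):

```python
import numpy as np

# Hypothetical validation predictions from three models for the same rows
preds = {
    "lightgbm": np.array([5200.0, 4800.0, 6100.0]),
    "xgboost":  np.array([5100.0, 4900.0, 6000.0]),
    "catboost": np.array([5300.0, 4700.0, 6200.0]),
}

# Illustrative weights, e.g. chosen from validation error; they must sum to 1
weights = {"lightgbm": 0.4, "xgboost": 0.35, "catboost": 0.25}

# Weighted average across models, elementwise over the prediction vectors
ensemble = sum(weights[name] * p for name, p in preds.items())
print(ensemble)
```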

Learn more: Model Training Guide

3. Reproducible Pipelines with DVC

Run automated, cacheable pipelines:

# Run entire pipeline (only reruns changed stages)
dvc repro

# Run specific stage
dvc repro build_features

# Visualize pipeline
dvc dag
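dvc repro reads its stage definitions from dvc.yaml. A stage entry for the build_features step above might look like the following sketch; the dependency and output paths are assumptions, not the project's actual file:

```yaml
stages:
  build_features:
    cmd: python -m src.features.build_features
    deps:
      - src/features/build_features.py
      - data/processed/train_clean.csv   # assumed output of make_dataset
    outs:
      - data/processed/features.csv      # assumed feature output path
```

With this definition, dvc repro reruns the stage only when a listed dependency changes, which is what makes the pipeline cacheable.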

Learn more: DVC Pipeline Guide

4. Deployment (Optional)

Deploy models via API and dashboard:

# Start all services with Docker
docker-compose up --build

# Access services:
# - FastAPI: http://localhost:8000
# - Streamlit: http://localhost:8501
# - MLflow: http://localhost:5000

Learn more: Deployment guide coming soon

Next Steps

After setup is complete:

  1. Quick Start Guide - Run your first workflow
  2. Project Structure - Understand the codebase
  3. DataOps Workflow - Process and validate data
  4. Model Training - Train your first model

Additional Resources