Detailed Setup Guide¶
Complete installation and setup instructions for the Rossmann forecasting project. This guide provides step-by-step explanations for users new to MLOps projects.
Already Familiar with Python Projects?
If you're comfortable with Python development, try the Quick Start Guide for a faster setup.
Prerequisites¶
- Python 3.10 or higher
- Git
- 8GB+ RAM (recommended for processing full dataset)
- ~5GB disk space (for data and models)
Installation¶
Step 1: Install uv Package Manager¶
uv is a fast Python package manager that we use for dependency management.
Verify installation:
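A minimal sketch of the check, assuming uv was installed with its official standalone installer (the installer line is shown as a comment for reference):

```shell
# Install uv with the official installer (see https://docs.astral.sh/uv/):
#   curl -LsSf https://astral.sh/uv/install.sh | sh
# Then confirm uv is on your PATH:
if command -v uv >/dev/null 2>&1; then
  uv --version
else
  echo "uv not found -- restart your shell or check your PATH"
fi
```

If the version number prints, uv is ready to use; otherwise open a new terminal so your shell picks up the updated PATH.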
Step 2: Clone Repository¶
Step 3: Create Virtual Environment¶
# Create virtual environment
uv venv
# Activate environment
source .venv/bin/activate # macOS/Linux
# or
.venv\Scripts\activate # Windows
You should see (.venv) in your terminal prompt.
Step 4: Install Dependencies¶
# Install production dependencies
uv pip install -e .
# Or install with dev dependencies (recommended)
uv pip install -e ".[dev]"
# Or install everything (dev + docs)
uv pip install -e ".[dev,docs]"
Step 5: Verify Installation¶
# Run tests to verify setup
pytest tests/ -v
# Check Python packages
python -c "
import pandas
import lightgbm
import xgboost
import catboost
print('✓ All packages installed successfully')
"
Data Setup¶
Verify Data Files¶
Good news! The raw data files are already included in this repository, so you can start working immediately.
Verify that train.csv and store.csv are present in the data/raw/ directory before continuing.
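As a quick sanity check, the following small Python sketch (assuming the repository layout described above, with raw files under data/raw/) reports any required files that are missing:

```python
from pathlib import Path


def check_data_files(raw_dir="data/raw", required=("train.csv", "store.csv")):
    """Return a list of the required file names missing from raw_dir."""
    raw = Path(raw_dir)
    return [name for name in required if not (raw / name).exists()]


if __name__ == "__main__":
    missing = check_data_files()
    if missing:
        print(f"✗ Missing files: {', '.join(missing)}")
    else:
        print("✓ All raw data files present")
```

Run it from the repository root; a clean checkout should report that all raw data files are present.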
Download Fresh Data (Optional)¶
If you want to download the latest data directly from Kaggle:
- Visit the Kaggle competition page
- Download train.csv and store.csv
- Place them in the data/raw/ directory
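The manual download can also be scripted. This is a hedged sketch that assumes the Kaggle CLI is installed, credentials are configured in ~/.kaggle/kaggle.json, and the competition slug is rossmann-store-sales (the standard slug for this competition):

```shell
# Download competition data into data/raw/ via the Kaggle CLI
# (assumes the CLI is installed and API credentials are configured)
if command -v kaggle >/dev/null 2>&1; then
  kaggle competitions download -c rossmann-store-sales -p data/raw
  unzip -o data/raw/rossmann-store-sales.zip -d data/raw
else
  echo "kaggle CLI not found -- download the files manually from kaggle.com"
fi
```

Note that Kaggle requires you to accept the competition rules on the website before the API will serve the files.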
Data Source
This project uses data from the Kaggle Rossmann Store Sales competition. The included data is for educational purposes.
Exploring MLOps Workflows¶
Now that setup is complete, you can explore different MLOps workflows:
1. DataOps Workflow¶
Process and validate data using production-grade practices:
# Run complete DataOps pipeline
bash scripts/dataops_workflow.sh
# Or run individual steps:
python src/data/validate_data.py --stage raw
python -m src.data.make_dataset
python -m src.features.build_features
Learn more: DataOps Workflow Guide
2. Model Experimentation¶
Explore different modeling approaches:
# Option A: Jupyter Notebooks (recommended for learning)
jupyter lab
# Navigate to notebooks/ and run in order:
# - 01-eda-and-cleaning.ipynb
# - 02-feature-engineering.ipynb
# - 03-baseline-models.ipynb
# - 04-advanced-models-and-ensembles.ipynb
# Once you have run the notebooks above, you can run the
# Python scripts to mimic production model training:
python -m src.models.train_ensemble
python -m src.models.validate_model
Learn more: Model Training Guide
3. Reproducible Pipelines with DVC¶
Run automated, cacheable pipelines:
# Run entire pipeline (only reruns changed stages)
dvc repro
# Run specific stage
dvc repro build_features
# Visualize pipeline
dvc dag
Learn more: DVC Pipeline Guide
4. Deployment (Optional)¶
Deploy models via API and dashboard:
# Start all services with Docker
docker-compose up --build
# Access services:
# - FastAPI: http://localhost:8000
# - Streamlit: http://localhost:8501
# - MLflow: http://localhost:5000
Learn more: Deployment guide coming soon
Next Steps¶
After setup is complete:
- ✅ Quick Start Guide - Run your first workflow
- ✅ Project Structure - Understand the codebase
- ✅ DataOps Workflow - Process and validate data
- ✅ Model Training - Train your first model
Additional Resources¶
- MkDocs Material - Documentation theme
- uv documentation - Package manager
- DVC documentation - Data version control
- MLflow documentation - Experiment tracking