Feature Engineering API
This module contains functions for creating features from cleaned Rossmann sales data.
Overview
The feature engineering pipeline creates 32+ standard features organized into categories:
- Calendar features - Year, month, week, day-of-week, seasonality
- Promotion features - Promo, Promo2, durations, intervals
- Competition features - Distance, age, opened flags
- Lag features - Store-level lags (1, 7, 14, and 28 days by default)
- Rolling features - Rolling means and standard deviations
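The figure of 32 can be reproduced from the column names used in the source listings further down this page. The sketch below is only an illustration: `expected_feature_columns` is a hypothetical helper, not part of the module, and it assumes the default lags and windows.

```python
def expected_feature_columns(lags=(1, 7, 14, 28), windows=(7, 14, 28, 60), target_col="Sales"):
    """Hypothetical helper: list the columns the default pipeline is expected to add."""
    calendar = [
        "Year", "Month", "Week", "Day", "DayOfMonth", "Quarter",
        "IsMonthStart", "IsMonthEnd", "IsQuarterStart", "IsQuarterEnd",
        "IsWeekend", "Season",
    ]
    promo = [
        "Promo2Active", "Promo2Duration",
        "PromoInterval_JAJO", "PromoInterval_FMAN", "PromoInterval_MJSD",
    ]
    competition = ["CompetitionDistance_log", "HasCompetition", "CompetitionAge"]
    lag_cols = [f"{target_col}_Lag_{lag}" for lag in lags]
    rolling_cols = [
        f"{target_col}_Rolling{stat}_{window}"
        for window in windows
        for stat in ("Mean", "Std")
    ]
    return calendar + promo + competition + lag_cols + rolling_cols


columns = expected_feature_columns()
print(len(columns))  # 12 calendar + 5 promo + 3 competition + 4 lag + 8 rolling
```

Passing longer `lags` or `windows` lists raises the count, which is presumably where the "+" in "32+" comes from.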
Module Reference
Feature engineering functions for the Rossmann forecasting project.
This module creates STANDARD/PROVEN features that are part of the DataOps workflow.
These features have been validated as beneficial for the organization and are:
- Automated in the dataops_workflow.sh script
- Tested in tests/test_features.py
- Always created for all modeling efforts
Examples of standard features:
- Calendar features: year, month, quarter, season, weekend flags
- Temporal patterns: is_month_start, is_quarter_end
- Business context: promo features, competition features
- Historical patterns: lags and rolling statistics
IMPORTANT: Model-specific transformations (scaling, normalization) or experimental
feature engineering should be done separately in model training pipelines (ModelOps).
CRITICAL: All lag and rolling features MUST be computed per store with groupby to prevent data leakage across stores.
- Lag features: df.groupby("Store")["column"].shift(lag)
- Rolling features: df.groupby("Store")["column"].rolling(window).mean() (or .std())
add_calendar_features(df)
Add calendar-based features derived from Date.
Features created:
- Year, Month, Week, Day, DayOfMonth
- Quarter, IsMonthStart, IsMonthEnd, IsQuarterStart, IsQuarterEnd
- Season (meteorological: Winter, Spring, Summer, Fall)
- IsWeekend
Parameters
df : pd.DataFrame
Dataframe with Date column (datetime) and DayOfWeek column (1=Mon, 7=Sun)
Returns
pd.DataFrame
Dataframe with added calendar features
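As a rough illustration of what these columns look like, the sketch below derives a few of them on four hand-picked dates. It is not the module's code: the season is computed with an arithmetic shortcut instead of the lookup table, and DayOfWeek is derived from Date here on the assumption that it follows the 1=Mon, 7=Sun convention.

```python
import pandas as pd

dates = pd.to_datetime(["2015-01-01", "2015-03-31", "2015-07-04", "2015-12-06"])
df = pd.DataFrame({"Date": dates})

# Season codes: 0=Winter (Dec-Feb), 1=Spring, 2=Summer, 3=Fall.
# (month % 12) // 3 maps 12, 1, 2 -> 0; 3, 4, 5 -> 1; 6, 7, 8 -> 2; 9, 10, 11 -> 3.
df["Season"] = ((df["Date"].dt.month % 12) // 3).astype("int8")

# pandas dayofweek is 0=Mon..6=Sun, so +1 gives the 1=Mon..7=Sun convention.
df["DayOfWeek"] = df["Date"].dt.dayofweek + 1
df["IsWeekend"] = (df["DayOfWeek"] >= 6).astype("int8")
df["IsQuarterEnd"] = df["Date"].dt.is_quarter_end.astype("int8")

print(df)
```

Season is stored as an integer code, not as a season name, so tree-based models can use it directly while linear models may need it one-hot encoded downstream.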
Source code in src/features/build_features.py
def add_calendar_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar-based features derived from Date.

    Features created:
    - Year, Month, Week, Day, DayOfMonth
    - Quarter, IsMonthStart, IsMonthEnd, IsQuarterStart, IsQuarterEnd
    - Season (meteorological: Winter, Spring, Summer, Fall)
    - IsWeekend

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe with Date column (datetime) and DayOfWeek column (1=Mon, 7=Sun)

    Returns
    -------
    pd.DataFrame
        Dataframe with added calendar features
    """
    logger.info("Adding calendar features")
    df = df.copy()

    # Basic date components
    df["Year"] = df["Date"].dt.year
    df["Month"] = df["Date"].dt.month
    df["Week"] = df["Date"].dt.isocalendar().week
    df["Day"] = df["Date"].dt.day
    df["DayOfMonth"] = df["Date"].dt.day
    df["Quarter"] = df["Date"].dt.quarter

    # Month and quarter boundary flags
    df["IsMonthStart"] = df["Date"].dt.is_month_start.astype("int8")
    df["IsMonthEnd"] = df["Date"].dt.is_month_end.astype("int8")
    df["IsQuarterStart"] = df["Date"].dt.is_quarter_start.astype("int8")
    df["IsQuarterEnd"] = df["Date"].dt.is_quarter_end.astype("int8")

    # Weekend flag (DayOfWeek: 1=Mon, 7=Sun)
    df["IsWeekend"] = (df["DayOfWeek"] >= 6).astype("int8")

    # Season (meteorological), stored as an integer code
    df["Season"] = (
        df["Month"]
        .map(
            {
                12: 0, 1: 0, 2: 0,  # Winter
                3: 1, 4: 1, 5: 1,  # Spring
                6: 2, 7: 2, 8: 2,  # Summer
                9: 3, 10: 3, 11: 3,  # Fall
            }
        )
        .astype("int8")
    )

    # Convert to memory-efficient dtypes
    int_cols = ["Year", "Month", "Week", "Day", "DayOfMonth", "Quarter"]
    for col in int_cols:
        df[col] = df[col].astype("int16")

    # 6 date components + 4 boundary flags + IsWeekend + Season
    logger.info("Added 12 calendar features")
    return df
add_competition_features(df)
Add competition-related features.
Features created:
- CompetitionDistance_log: Log-scaled competition distance
- CompetitionAge: Days since competition opened
- HasCompetition: Binary flag for presence of competition
Parameters
df : pd.DataFrame
Dataframe with Date, CompetitionDistance, CompetitionOpenSinceYear, and CompetitionOpenSinceMonth columns
Returns
pd.DataFrame
Dataframe with added competition features
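The three features can be tried out on a toy frame. The sketch below is a vectorized variant written for illustration, not the module's row-wise implementation. It assumes, as the source listing does, that 0 in the year or month column means "unknown" and that 100000 is the fill value for a missing distance; the 1900 placeholder date is an arbitrary choice made only for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2015-06-15", "2015-06-15", "2015-06-15"]),
        "CompetitionDistance": [0.0, 570.0, 100000.0],
        "CompetitionOpenSinceYear": [2015, 0, 2016],
        "CompetitionOpenSinceMonth": [6, 0, 1],
    }
)

# log1p(0) == 0, so zero distances do not produce -inf.
df["CompetitionDistance_log"] = np.log1p(df["CompetitionDistance"]).astype("float32")
df["HasCompetition"] = (df["CompetitionDistance"] < 100000).astype("int8")

# Rows with an unknown opening date get a placeholder date, then are zeroed out below.
known = (df["CompetitionOpenSinceYear"] > 0) & (df["CompetitionOpenSinceMonth"] > 0)
parts = pd.DataFrame(
    {
        "year": df["CompetitionOpenSinceYear"].where(known, 1900),
        "month": df["CompetitionOpenSinceMonth"].where(known, 1),
        "day": 1,
    }
)
opened = pd.to_datetime(parts)

# Negative ages (competitor opens in the future) are clipped to 0.
age_days = (df["Date"] - opened).dt.days.clip(lower=0)
df["CompetitionAge"] = age_days.where(known, 0).astype("int32")

print(df[["CompetitionDistance_log", "HasCompetition", "CompetitionAge"]])
```

On a frame of about a million rows, a vectorized form like this is likely to be much faster than `df.apply(..., axis=1)`, although that has not been measured here.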
Source code in src/features/build_features.py
def add_competition_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add competition-related features.

    Features created:
    - CompetitionDistance_log: Log-scaled competition distance
    - CompetitionAge: Days since competition opened
    - HasCompetition: Binary flag for presence of competition

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe with Date, CompetitionDistance, CompetitionOpenSinceYear,
        and CompetitionOpenSinceMonth columns

    Returns
    -------
    pd.DataFrame
        Dataframe with added competition features
    """
    logger.info("Adding competition features")
    df = df.copy()

    # Log-scaled competition distance
    # Use log1p to handle zeros and avoid -inf
    df["CompetitionDistance_log"] = np.log1p(df["CompetitionDistance"]).astype("float32")

    # Has competition flag (distance < 100000, which was our fill value)
    df["HasCompetition"] = (df["CompetitionDistance"] < 100000).astype("int8")

    # Competition age (days since competition opened)
    def calc_competition_age(row):
        # A year or month of 0 marks an unknown opening date
        if row["CompetitionOpenSinceYear"] == 0 or row["CompetitionOpenSinceMonth"] == 0:
            return 0
        try:
            comp_open_date = pd.to_datetime(
                f"{int(row['CompetitionOpenSinceYear'])}-{int(row['CompetitionOpenSinceMonth'])}-01"
            )
            age = (row["Date"] - comp_open_date).days
            return max(0, age)  # Return 0 if negative
        except Exception:
            return 0

    df["CompetitionAge"] = df.apply(calc_competition_age, axis=1).astype("int32")

    logger.info("Added 3 competition features")
    return df
add_lag_features(df, lags=None, target_col='Sales')
Add lag features at the store level.
CRITICAL: Uses groupby("Store").shift(lag) to prevent data leakage.
Features created:
- Sales_Lag_{lag}: Sales from {lag} days ago for each store
Parameters
df : pd.DataFrame
Dataframe sorted by Store and Date
lags : list of int, default=[1, 7, 14, 28]
Lag periods in days
target_col : str, default='Sales'
Column to create lags for
Returns
pd.DataFrame
Dataframe with added lag features
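The "sorted by Store and Date" requirement matters because `shift` moves values by row position, not by calendar date. A minimal sketch with made-up sales figures, sorting first and then applying the same grouped shift:

```python
import pandas as pd

# Rows arrive in arbitrary order, as they might after a merge.
raw = pd.DataFrame(
    {
        "Store": [1, 1, 1, 2, 2, 2],
        "Date": pd.to_datetime(
            ["2015-01-03", "2015-01-01", "2015-01-02",
             "2015-01-02", "2015-01-03", "2015-01-01"]
        ),
        "Sales": [30, 10, 20, 200, 300, 100],
    }
)

# Sort first so that "previous row within the store" means "previous day".
ordered = raw.sort_values(["Store", "Date"]).reset_index(drop=True)
ordered["Sales_Lag_1"] = ordered.groupby("Store")["Sales"].shift(1).astype("float32")

print(ordered)
```

The first row of each store has no predecessor, so its lag is NaN. Because the shift counts rows, a lag of 7 equals "7 days ago" only if every store has one row per calendar day; stores with missing days would need a date-based approach instead.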
Source code in src/features/build_features.py
def add_lag_features(
    df: pd.DataFrame, lags: list[int] | None = None, target_col: str = "Sales"
) -> pd.DataFrame:
    """Add lag features at the store level.

    CRITICAL: Uses groupby("Store").shift(lag) to prevent data leakage.

    Features created:
    - Sales_Lag_{lag}: Sales from {lag} days ago for each store

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe sorted by Store and Date
    lags : list of int, default=[1, 7, 14, 28]
        Lag periods in days
    target_col : str, default='Sales'
        Column to create lags for

    Returns
    -------
    pd.DataFrame
        Dataframe with added lag features
    """
    if lags is None:
        lags = [1, 7, 14, 28]

    logger.info(f"Adding lag features for {target_col} with lags: {lags}")
    df = df.copy()

    # CRITICAL: Must group by Store to prevent leakage across stores
    for lag in lags:
        col_name = f"{target_col}_Lag_{lag}"
        df[col_name] = df.groupby("Store")[target_col].shift(lag).astype("float32")
        logger.info(f"  Created {col_name}")

    logger.info(f"Added {len(lags)} lag features")
    return df
add_promo_features(df)
Add promotion-related features.
Features created:
- Promo2Active: Whether Promo2 is active in current month
- Promo2Duration: How long store has been in Promo2 (in days)
- PromoInterval_JAJO, PromoInterval_FMAN, PromoInterval_MJSD: One-hot encoding of the PromoInterval pattern
Parameters
df : pd.DataFrame
Dataframe with Date, Promo, Promo2, Promo2SinceYear, Promo2SinceWeek, and PromoInterval columns
Returns
pd.DataFrame
Dataframe with added promo features
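The Promo2Active rule (Promo2 enabled, and the current month listed in PromoInterval) can be sketched as a small standalone function. `promo2_active` is a hypothetical helper written for this page; it splits the interval into a set of month abbreviations instead of using a substring test, and it normalizes the "Sept" spelling that appears in the data.

```python
import pandas as pd

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def promo2_active(date: pd.Timestamp, promo2: int, promo_interval) -> int:
    """Hypothetical helper: 1 if the month of `date` is listed in PromoInterval."""
    if promo2 == 0 or pd.isna(promo_interval) or promo_interval == "":
        return 0
    # Normalize "Sept" to "Sep" so it matches the three-letter abbreviations.
    months = {m.strip().replace("Sept", "Sep") for m in promo_interval.split(",")}
    return int(MONTH_ABBR[date.month - 1] in months)


print(promo2_active(pd.Timestamp("2015-09-10"), 1, "Mar,Jun,Sept,Dec"))
print(promo2_active(pd.Timestamp("2015-08-10"), 1, "Mar,Jun,Sept,Dec"))
```

Using a fixed abbreviation list avoids relying on `strftime("%b")`, whose output depends on the process locale.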
Source code in src/features/build_features.py
def add_promo_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add promotion-related features.

    Features created:
    - Promo2Active: Whether Promo2 is active in current month
    - Promo2Duration: How long store has been in Promo2 (in days)
    - PromoInterval_JAJO, PromoInterval_FMAN, PromoInterval_MJSD:
      One-hot encoding of the PromoInterval pattern

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe with Date, Promo, Promo2, Promo2SinceYear, Promo2SinceWeek,
        and PromoInterval columns

    Returns
    -------
    pd.DataFrame
        Dataframe with added promo features
    """
    logger.info("Adding promotion features")
    df = df.copy()

    # Promo2 active in current month
    # PromoInterval format: "Jan,Apr,Jul,Oct" or "Feb,May,Aug,Nov" or "Mar,Jun,Sept,Dec"
    month_abbr = df["Date"].dt.strftime("%b")

    # Check if current month is in PromoInterval
    def is_promo2_active(row):
        if pd.isna(row["PromoInterval"]) or row["PromoInterval"] == "" or row["Promo2"] == 0:
            return 0
        # Handle "Sept" vs "Sep" inconsistency: strftime("%b") yields "Sep",
        # so only the PromoInterval string needs normalizing
        promo_months = row["PromoInterval"].replace("Sept", "Sep")
        current_month = row["Month_Abbr"]
        return 1 if current_month in promo_months else 0

    # Month_Abbr is a temporary helper column, dropped after use
    df["Month_Abbr"] = month_abbr
    df["Promo2Active"] = df.apply(is_promo2_active, axis=1).astype("int8")
    df = df.drop("Month_Abbr", axis=1)

    # Promo2 duration (days since Promo2 started)
    def calc_promo2_duration(row):
        if row["Promo2"] == 0 or row["Promo2SinceYear"] == 0:
            return 0
        # Convert Promo2SinceWeek and Promo2SinceYear to date (Monday of that week)
        try:
            promo2_start = pd.to_datetime(
                f"{int(row['Promo2SinceYear'])}-W{int(row['Promo2SinceWeek'])}-1",
                format="%Y-W%W-%w",
            )
            duration = (row["Date"] - promo2_start).days
            return max(0, duration)  # Return 0 if negative
        except Exception:
            return 0

    df["Promo2Duration"] = df.apply(calc_promo2_duration, axis=1).astype("int32")

    # One-hot encode PromoInterval
    # Common patterns: "Jan,Apr,Jul,Oct", "Feb,May,Aug,Nov", "Mar,Jun,Sept,Dec"
    promo_patterns = {
        "Jan,Apr,Jul,Oct": "PromoInterval_JAJO",
        "Feb,May,Aug,Nov": "PromoInterval_FMAN",
        "Mar,Jun,Sept,Dec": "PromoInterval_MJSD",
    }
    for pattern, col_name in promo_patterns.items():
        df[col_name] = (df["PromoInterval"] == pattern).astype("int8")

    logger.info("Added 5 promotion features")
    return df
add_rolling_features(df, windows=None, target_col='Sales')
Add rolling window features at the store level.
CRITICAL: Uses groupby("Store").rolling(window) to prevent data leakage.
Features created:
- Sales_RollingMean_{window}: Rolling mean over {window} days
- Sales_RollingStd_{window}: Rolling std over {window} days
Parameters
df : pd.DataFrame
Dataframe sorted by Store and Date
windows : list of int, default=[7, 14, 28, 60]
Rolling window sizes in days
target_col : str, default='Sales'
Column to create rolling features for
Returns
pd.DataFrame
Dataframe with added rolling features
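Grouping by store keeps windows from spanning two stores, but by default a pandas rolling window ends at the current row, so each statistic includes that row's own value. The sketch below contrasts this pattern with a variant that shifts by one row first. The shifted variant and the `_prev` column name are assumptions made for illustration, not something the module provides.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Store": [1, 1, 1, 1, 2, 2, 2, 2],
        "Sales": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0, 300.0, 400.0],
    }
)
window = 2
grouped = df.groupby("Store")["Sales"]

# Grouped rolling mean: the window ends at (and includes) the current row.
df["Sales_RollingMean_2"] = (
    grouped.rolling(window=window, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)

# Hypothetical variant: shift within each store first, so the window
# ends at the previous row and never sees the current row's Sales.
df["Sales_RollingMean_2_prev"] = (
    grouped.shift(1)
    .groupby(df["Store"])
    .rolling(window=window, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
)

print(df)
```

If the model predicts same-day Sales, the shifted form is the safer choice, since the unshifted statistic is partly built from the value being predicted.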
Source code in src/features/build_features.py
def add_rolling_features(
    df: pd.DataFrame, windows: list[int] | None = None, target_col: str = "Sales"
) -> pd.DataFrame:
    """Add rolling window features at the store level.

    CRITICAL: Uses groupby("Store").rolling(window) to prevent data leakage.

    Features created:
    - Sales_RollingMean_{window}: Rolling mean over {window} days
    - Sales_RollingStd_{window}: Rolling std over {window} days

    Parameters
    ----------
    df : pd.DataFrame
        Dataframe sorted by Store and Date
    windows : list of int, default=[7, 14, 28, 60]
        Rolling window sizes in days
    target_col : str, default='Sales'
        Column to create rolling features for

    Returns
    -------
    pd.DataFrame
        Dataframe with added rolling features
    """
    if windows is None:
        windows = [7, 14, 28, 60]

    logger.info(f"Adding rolling features for {target_col} with windows: {windows}")
    df = df.copy()

    # CRITICAL: Must group by Store to prevent leakage across stores
    # Note: each window ends at the current row, so it includes that row's value
    for window in windows:
        # Rolling mean
        col_mean = f"{target_col}_RollingMean_{window}"
        df[col_mean] = (
            df.groupby("Store")[target_col]
            .rolling(window=window, min_periods=1)
            .mean()
            .reset_index(level=0, drop=True)  # drop the Store level to realign with df
            .astype("float32")
        )
        logger.info(f"  Created {col_mean}")

        # Rolling std (NaN for the first row of each store, where only one value exists)
        col_std = f"{target_col}_RollingStd_{window}"
        df[col_std] = (
            df.groupby("Store")[target_col]
            .rolling(window=window, min_periods=1)
            .std()
            .reset_index(level=0, drop=True)
            .astype("float32")
        )
        logger.info(f"  Created {col_std}")

    logger.info(f"Added {len(windows) * 2} rolling features")
    return df
build_all_features(df, config=None)
Orchestrate all feature engineering steps.
Parameters
df : pd.DataFrame
Cleaned dataframe (from Phase 1)
config : dict, optional
Configuration with feature parameters
Expected keys:
- lags: list of lag periods
- rolling_windows: list of window sizes
- include_promo_features: bool
- include_competition_features: bool
Returns
pd.DataFrame
Dataframe with all engineered features
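Each key is read with its own default, so a partial config appears to be enough: omitted keys fall back to the standard values. The sketch below mimics that behavior in isolation. `resolve_feature_config` and `DEFAULTS` are hypothetical names used only here; the default values are copied from the source listing.

```python
from typing import Any, Dict, Optional

# Defaults as they appear in the build_all_features source listing.
DEFAULTS: Dict[str, Any] = {
    "lags": [1, 7, 14, 28],
    "rolling_windows": [7, 14, 28, 60],
    "include_promo_features": True,
    "include_competition_features": True,
}


def resolve_feature_config(config: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: overlay a (possibly partial) config on the defaults."""
    return {**DEFAULTS, **(config or {})}


# Only override what differs from the defaults.
resolved = resolve_feature_config({"lags": [7, 14], "include_promo_features": False})
print(resolved)
```

A call such as `build_all_features(df, config={"lags": [7, 14]})` would then be expected to keep the default rolling windows and both optional feature groups.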
Source code in src/features/build_features.py
def build_all_features(df: pd.DataFrame, config: dict[str, Any] | None = None) -> pd.DataFrame:
    """Orchestrate all feature engineering steps.

    Parameters
    ----------
    df : pd.DataFrame
        Cleaned dataframe (from Phase 1)
    config : dict, optional
        Configuration with feature parameters
        Expected keys:
        - lags: list of lag periods
        - rolling_windows: list of window sizes
        - include_promo_features: bool
        - include_competition_features: bool

    Returns
    -------
    pd.DataFrame
        Dataframe with all engineered features
    """
    logger.info("=" * 60)
    logger.info("Starting feature engineering")
    logger.info("=" * 60)

    # Default config
    if config is None:
        config = {
            "lags": [1, 7, 14, 28],
            "rolling_windows": [7, 14, 28, 60],
            "include_promo_features": True,
            "include_competition_features": True,
        }

    df = df.copy()
    initial_cols = len(df.columns)

    # Ensure data is sorted by Store and Date (required by lag and rolling features)
    logger.info("Ensuring data is sorted by Store and Date")
    df = df.sort_values(["Store", "Date"]).reset_index(drop=True)

    # Add calendar features
    df = add_calendar_features(df)

    # Add promotion features
    if config.get("include_promo_features", True):
        df = add_promo_features(df)
    else:
        logger.info("Skipping promotion features (disabled in config)")

    # Add competition features
    if config.get("include_competition_features", True):
        df = add_competition_features(df)
    else:
        logger.info("Skipping competition features (disabled in config)")

    # Add lag features
    lags = config.get("lags", [1, 7, 14, 28])
    df = add_lag_features(df, lags=lags)

    # Add rolling features
    windows = config.get("rolling_windows", [7, 14, 28, 60])
    df = add_rolling_features(df, windows=windows)

    final_cols = len(df.columns)
    added_cols = final_cols - initial_cols

    logger.info("=" * 60)
    logger.info("Feature engineering complete!")
    logger.info(f"Added {added_cols} new features")
    logger.info(f"Total columns: {final_cols}")
    logger.info("=" * 60)
    return df
main()
Main function to run the feature engineering pipeline.
Source code in src/features/build_features.py
def main():
    """Main function to run the feature engineering pipeline."""
    import yaml

    logger.info("=" * 60)
    logger.info("Starting feature engineering pipeline")
    logger.info("=" * 60)

    # Load configuration
    config_path = Path("config/params.yaml")
    if config_path.exists():
        with open(config_path) as f:
            params = yaml.safe_load(f)
        feature_config = params.get("features", {})
    else:
        logger.warning("Config file not found, using defaults")
        feature_config = None

    # Load cleaned data
    logger.info("Loading cleaned data from data/processed/train_clean.parquet")
    df = read_parquet("data/processed/train_clean.parquet")
    logger.info(f"Loaded {len(df):,} rows, {len(df.columns)} columns")

    # Build features
    df_featured = build_all_features(df, config=feature_config)

    # Save featured data
    output_path = "data/processed/train_features.parquet"
    logger.info(f"Saving featured data to {output_path}")
    save_parquet(df_featured, output_path)

    # Report file size
    file_size_mb = Path(output_path).stat().st_size / (1024 * 1024)
    logger.info(f"Saved {len(df_featured):,} rows to {output_path} ({file_size_mb:.2f} MB)")

    logger.info("=" * 60)
    logger.info("Feature engineering pipeline complete!")
    logger.info("=" * 60)
Usage Examples
Basic Usage
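from src.features.build_features import (
    add_calendar_features,
    add_promo_features,
    add_competition_features,
    add_lag_features,
    add_rolling_features,
    build_all_features,
)
# clean_df is the cleaned dataframe produced by the data processing step
# Build all features at once (sorts by Store and Date internally)
df_with_features = build_all_features(clean_df)
# Or build features individually; sort first, because the lag and rolling
# functions expect data ordered by Store and Date
df = clean_df.sort_values(["Store", "Date"]).reset_index(drop=True)
df = add_calendar_features(df)
df = add_promo_features(df)
df = add_competition_features(df)
df = add_lag_features(df)
df = add_rolling_features(df)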
Running the Full Pipeline
# From command line
python -m src.features.build_features
# This will:
# 1. Load data/processed/train_clean.parquet
# 2. Build all standard features
# 3. Save to data/processed/train_features.parquet
Feature Categories
Calendar Features
df = add_calendar_features(df)
Creates:
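Year, Month, Quarter - Date components
Day, DayOfMonth, Week - Temporal position (Week is the ISO week number)
IsMonthStart, IsMonthEnd, IsQuarterStart, IsQuarterEnd - Month and quarter boundary flags
IsWeekend - 1 when DayOfWeek is 6 or 7 (Saturday or Sunday)
Season - Meteorological season, encoded 0-3 (Winter, Spring, Summer, Fall)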
Promotion Features
df = add_promo_features(df)
Creates:
Promo2Active - Whether Promo2 is active in the current month, based on PromoInterval
Promo2Duration - Days since Promo2 started
PromoInterval_JAJO, PromoInterval_FMAN, PromoInterval_MJSD - One-hot flags for the three PromoInterval patterns
Competition Features
df = add_competition_features(df)
Creates:
CompetitionDistance_log - log1p-transformed distance (zero distances map to 0 instead of -inf)
CompetitionAge - Days since competition opened
HasCompetition - Binary flag for competition presence
Lag Features
df = add_lag_features(df, lags=[1, 7, 14, 28])
Creates store-level lags:
Sales_Lag_1 - Sales from 1 day ago
Sales_Lag_7 - Sales from 7 days ago (last week)
Sales_Lag_14 - Sales from 14 days ago
Sales_Lag_28 - Sales from 28 days ago (4 weeks)
IMPORTANT: Uses groupby("Store").shift(lag) to prevent data leakage across stores.
Rolling Features
df = add_rolling_features(df, windows=[7, 14, 28, 60])
Creates store-level rolling statistics:
Sales_RollingMean_7 - 7-day average sales
Sales_RollingStd_7 - 7-day sales volatility
Similar features for the 14-, 28-, and 60-day windows
IMPORTANT: Uses groupby("Store").rolling(window) to prevent data leakage across stores.
Data Flow
flowchart TD
A[train_clean.parquet] --> B[add_calendar_features]
B --> C[add_promo_features]
C --> D[add_competition_features]
D --> E[add_lag_features]
E --> F[add_rolling_features]
F --> G[train_features.parquet]
Time-Series Safety
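All lag and rolling features are computed per store so that values never cross store boundaries:
# ✅ CORRECT: Store-level lag only looks back within the same store
df["Sales_Lag_7"] = df.groupby("Store")["Sales"].shift(7)
# ❌ WRONG: Global lag pulls in rows from the previous store at each store boundary
df["Sales_Lag_7"] = df["Sales"].shift(7)
This ensures:
- Each store's features only use that store's data
- Lag features only look backward, so no future information leaks into them
- Time-series cross-validation remains valid for the lag features
Note that the rolling windows in add_rolling_features end at the current row, so each rolling statistic includes the same day's Sales value. Keep this in mind when same-day Sales is the prediction target.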
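The difference between the two shifts is easiest to see at a store boundary. A minimal sketch on made-up numbers, where `lag_global` and `lag_by_store` are illustrative column names:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Store": [1, 1, 1, 2, 2, 2],
        "Sales": [10, 20, 30, 100, 200, 300],
    }
)

df["lag_global"] = df["Sales"].shift(1)                      # ignores store boundaries
df["lag_by_store"] = df.groupby("Store")["Sales"].shift(1)   # restarts for each store

# Rows where the global lag holds a value that the per-store lag does not.
leaked = df[df["lag_global"].notna() & df["lag_global"].ne(df["lag_by_store"])]
print(leaked)
```

The only row flagged is the first row of store 2, whose global lag is store 1's last sales value (30). With a lag of 7, the first seven rows of every store after the first would be contaminated in the same way.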
Key Functions
build_all_features()
Orchestrates all feature engineering functions in the correct order.
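Parameters:
- df: Cleaned dataframe from data processing
- config: Optional dict of feature parameters (lags, rolling_windows, include_promo_features, include_competition_features)
Returns:
- Dataframe with 32+ engineered features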
Example:
features_df = build_all_features(clean_df)
print(features_df.shape) # (1,017,209, 50+)
See Also
- Data - Data loading and cleaning
- Models - Model training using features
- Evaluation - Feature importance analysis