The Complete Journey

completejourney provides access to data sets characterizing household level transactions over one year from a group of 2,469 households who are frequent shoppers at a grocery store. It contains all of each household’s purchases, not just those from a limited number of categories. For certain households, demographic information as well as direct marketing contact history are captured.

Due to the number of tables and the overall complexity of The Complete Journey, it is suggested that this database be used in more advanced classroom settings. Further, The Complete Journey would be ideal for academic research as it should enable one to study the effects of direct marketing to customers.

The following are examples of questions that could be submitted to students or considered for academic research:

  • How many customers are spending more over time? Less over time? Describe these customers.
  • Of those customers who are spending more over time, which categories are growing at a faster rate?
  • Of those customers who are spending less over time, with which categories are they becoming less engaged?
  • Which demographic factors (e.g. household size, presence of children, income) appear to affect customer spend? Engagement with certain categories?
  • Is there evidence to suggest that direct marketing improves overall engagement?

Accessing the Data

library(completejourney)

There are eight built-in data sets available in this package (see data(package = 'completejourney'). The data sets include:

  • campaigns: campaigns received by each household
  • campaign_descriptions: campaign metadata (length of time active)
  • coupons: coupon metadata (UPC code, campaign, etc.)
  • coupon_redemptions: coupon redemptions (household, day, UPC code, campaign)
  • demographics: household demographic data (age, income, family size, etc.)
  • products: product metadata (brand, description, etc.)
  • promotions_sample: a sampling of the product placement in mailers and in stores corresponding to advertising campaigns
  • transactions_sample: a sampling of the products purchased by households

Due to the size of the transactions and promotions data, the package provides a sampling of the data built-in with transactions_sample and promotions_sample. However, you can access the full promotions and transactions data sets from the source GitHub repository with the following:

# a convenience function to get both
c(promotions, transactions) %<-% get_data(which = 'both', verbose = FALSE)
dim(promotions)
## [1] 20940529        5

dim(transactions)
## [1] 1469307      11

Dataset Details

demographics

This table contains demographic information for a portion of households. Due to nature of the data, the demographic information is not available for all households.

Variable Description
household_id Uniquely identifies each household
age Estimated age range
income Household income
home_ownership Homeowner, renter, etc.
marital_status Marital status (Married, Single, Unknown)
household_size Size of household up to 5+
household_comp Household composition
kid_count Number of children present up to 3+

transactions

Contains transaction-level product purchases by households (what you would normally see on a receipt). Use a sample of all transactions (transactions_sample) or get the entire data set with get_transactions().

Variable Definition
household_id Uniquely identifies each household
store_id Uniquely identifies each store
basket_id Uniquely identifies each purchase occasion
product_id Uniquely identifies each product
quantity Number of the product purchased during the trip
sales_value Amount of dollars the retailer receives from sale
retail_disc Discount applied due to the retailer’s loyalty card program
coupon_disc Discount applied due to a manufacturer coupon
coupon_match_disc Discount applied due to retailer’s match of manufacturer coupon
week Week of the transaction; Ranges 1-53
transaction_timestamp Date and time of day when the transaction occurred

The variable sales_value in this table is the amount of dollars received by the retailer on the sale of the specific product, taking the coupon match and loyalty card discount into account. It is not the actual price paid by the customer. If a customer uses a coupon, the actual price paid will be less than the sales_value because the manufacturer issuing the coupon will reimburse the retailer for the amount of the coupon.

To calculate the actual product prices, use the formulas below:

  • Loyalty card price = (sales_value – (retail_disc + coupon_match_disc)) / quantity
  • Non-loyalty card price = (sales_value – (coupon_match_disc)) / quantity

The example below demonstrates how to calculate the actual shelf price of the product:

  • Line 1 – When this product was purchased the retail_disc and coupon_disc were both zero, meaning the price of the product is the same as the amount received by the retailer.
  • Line 2 – Two items of this product were purchased, and there was a retail discount applied due to a loyalty card. To determine the regular shelf price of the product (exclusive of loyalty card discount) we take the sum of the amount paid and the discount, then divide it by the quantity (\(\$3.34 + \$0.36) \div 2 = \$1.85\). The shelf price of the product including loyalty card discount is \(\$3.34 \div 2 = \$1.67\). In this case, the customer paid $3.34 for both of these products, which is the same amount the retailer received.
  • Line 3 – The actual shelf price of each product here is (\(\$2.89 + \$0.45) \div 2 = \$1.67\). In this case, the customer paid $2.34 (\(\$2.89 - \$0.55\)) for these products, but the retailer will receive $2.89 due to reimbursement of the manufacturer discount.

campaigns

Contains identifying information for the marketing campaigns each households participated in.

Variable Definition
campaign_id Uniquely identifies each campaign; Ranges 1-27
household_id Uniquely identifies each household

campaign_descriptions

Contains campaign metatdata, specifically the time a campaign was active.

Variable Definition
campaign_id Uniquely identifies each campaign; Ranges 1-27
campaign_type Type of campaign (Type A, Type B, Type C)
start_date Start date of campaign
end_date End date of campaign

products

Contains metatdata regarding the products purchased (brand, description, etc.).

Variable Definition
product_id Uniquely identifies each product
manufacturer_id Uniquely identifies each manufacturer
department Groups similar products together
brand Indicates private or national label brand
product_category Groups similar products together at lower level
product_type Groups similar products together at lowest level
package_size Indicates package size (not available for all products)

coupons

This table lists all the coupons sent to customers as part of a campaign, as well as the products for which each coupon is redeemable. Some coupons are redeemable for multiple products. One example is a coupon for any private label frozen vegetable. There are a large number of products where this coupon could be redeemed.

For campaign TypeA, this table provides the pool of possible coupons. Each customer participating in a Type A campaign received 16 coupons out of the pool. The 16 coupons were selected based on the customer’s prior purchase behavior. Identifying the specific 16 coupons that each customer received is outside the scope of this database.

For campaign Type B and Type C, all customers participating in a campaign receives all coupons pertaining to that campaign.

Variable Definition
coupon_upc Uniquely identifies each coupon (unique to household and campaign)
product_id Uniquely identifies each product
campaign_id Uniquely identifies each campaign

coupon_redemptions

Contains individual coupon redemptions (household, day, UPC code, campaign) that can be used to measure campaign efficacy and coupon usage on specific products.

Variable Definition
household_id Uniquely identifies each household
coupon_upc Uniquely identifies each coupon (unique to household and campaign)
campaign_id Uniquely identifies each campaign
redemption_date Date when the coupon was redeemed

promotions

Contains product placement information (e.g. mailer page placement and in-store display placement) corresponding to campaigns. Use a sample of all promotions (promotions_sample) or get the entire data set with get_promotions().

Variable Definition
product_id Uniquely identifies each product
store_id Uniquely identifies each store
display_location Display location (0-no display, 1-store front, 2-store rear, 3-front end cap, 4-mid-aisle end cap, 5-rear end cap, 6-side aisle end cap, 7-in-aisle, 9-secondary location display, A-in-shelf)
mailer_location Mailer location (0-not on ad, A-interior page feature, C-interior page line item, D-front page feature, F-back page feature, H-wrap front feature, J-wrap interior coupon, L-wrap back feature, P-interior page coupon, X-free on interior page, Z-free on front page, back page, or wrap)
week Week of the transaction; Ranges 1-53

Example Case Study

John Smith is a valued customer at a national grocery retailer for which we have detailed transaction data. Throughout all the tables in the database, he is identified with a household_id of “208”.

To learn a little about John, we can obtain his demographic information by looking at the record in the demographics table where household_id = "208". The table below shows the information we receive, and tells us that he is a homeowner, who makes between $50,000 and $74,000 a year and is between 45 and 54 years old.

If we look at John’s records from campaigns, we can see that he received 7 different campaigns.

These campaigns were spread out over the one year period of the study. To understand the time periods and types of these campaigns, you can join with the campaign_descriptions data. In this example, we see the type of campaign along with the start and end dates of the campaign.

Let us take a closer look at campaign 18 When we look at all the distinct coupon_upc’s from the coupon table where campaign_id = "18", we see that there were 209 distinct coupons sent out as part of that campaign.

Let us take an even deeper look at one of the specific coupons offered as part of the campaign. If we print out all records from the coupon table where campaign_id = "18" and coupon_upc = "55410000076", we see that this coupon could actually be redeemed for a number of products.

Although all the products are not displayed above, we find that this coupon is actually valid on 50 distinct products. If we go to the product table and print out all records for the product_id’s above (“1004458”, “1011841”, etc.), we see that this coupon is valid for pickles from a national brand.

As we’ve seen, John received a number of campaigns over the past year that contained many coupons. Chances are, he did not redeem every coupon he received. So, let us take a look to see what coupons he did redeem. To do this, we need to view all records from the coupon_redemptions table where household_id = "208". This shows us that he redeemed five coupons from two of the campaigns (“8” and “18”).

John’s coupon redemptions are only part of the overall picture of his purchasing behavior. If we look at the records from the transaction_data table where household_id = "208", we can view everything that John purchased.

This gets a bit complicated, but we can combine the transaction data with the other tables to understand John’s behavior when he was redeeming a coupon (and when he wasn’t redeeming a coupon).

  • John received offers as part of campaign 18, which occurred between 10/30/2017 and 12/24/2017
  • We know he redeemed coupon_upc = "55410000076" on day 11/13/2017
  • Through the coupon table, we know that the coupon is actually valid for a number of products, including product_id = "896292"
  • From the table below, we can see where John purchased this item and received a $1 discount from using the coupon.

Knowing when John redeemed a coupon can help us learn a lot about him, and how the receipt of certain campaigns affected his behavior. Does the receipt of campaigns cause him to purchase more items than he did previously? Is John more likely to redeem coupons for products he already purchases, or does it entice him to try products he has never purchased before?

There is one bit of information we have not talked about yet – what is happening in the rest of the store? Is it possible that John purchased the item above because of other events occurring in the store in addition to his coupon?

We obviously do not know a customer’s reason for purchasing an item, but we do know whether an item was featured during the time of the purchase. If we view all records from the promotions table where product_id = "896292", we see the weeks and stores where this product was featured in the weekly mailer and whether it was featured as part of an in-store display. If we look at the first line, we can tell that in store 327 and week 47, the product was on the shelf (i.e. display_location="A") so it was not part of a special display. We can assume this was not an impulse buy because of a display or advertisement. He most likely went to the store with the coupon intent upon purchasing the item.

We hope that this quick look at John Smith’s behavior provides clarity around The Complete Journey database, and inspires your own investigation into the purchasing behavior of these customers.