16 Lesson 4c: Handling dates & times

Real world data are often associated with dates and time; however, dealing with dates accurately can appear to be a complicated task due to the variety in formats and accounting for time-zone differences and leap years. R has a range of functions that allow you to work with dates and times. Furthermore, packages such as lubridate make it easier to work with dates and times.

16.1 Learning objectives

By the end of this lesson you’ll be able to:

  • Create date and time data.
  • Extract & manipulate parts of dates and times.
  • Perform calculations with dates and times.

16.2 Prerequisites

For this lesson, we will use the packages supplied by tidyverse for supporting roles; however, the bulk of the emphasis will be on using the lubridate package which provides useful functions for working with dates and times.

library(tidyverse)  # provides supporting data wrangling functions
library(lubridate)  # provides functions for working with dates & times

To illustrate various transformation tasks on dates we will use the transactions data from completejourney. Note how transaction_timestamp is a <dttm> type, which stands for a date-time data type.

library(completejourney)

glimpse(transactions_sample)
## Rows: 75,000
## Columns: 11
## $ household_id          <chr> "2261", "2131", "511", "400", "918", "7…
## $ store_id              <chr> "309", "368", "316", "388", "340", "324…
## $ basket_id             <chr> "31625220889", "32053127496", "32445856…
## $ product_id            <chr> "940996", "873902", "847901", "13094913…
## $ quantity              <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ sales_value           <dbl> 3.86, 1.59, 1.00, 11.87, 1.29, 2.50, 3.…
## $ retail_disc           <dbl> 0.43, 0.90, 0.69, 2.90, 0.00, 0.49, 0.0…
## $ coupon_disc           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week                  <int> 5, 10, 13, 8, 10, 15, 10, 33, 2, 16, 47…
## $ transaction_timestamp <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…


16.3 Getting current date & time

To get current date and time information:

Sys.timezone()
## [1] "America/New_York"

Sys.Date()
## [1] "2024-07-15"

Sys.time()
## [1] "2024-07-15 14:49:12 EDT"

If using the lubridate package:

today()
## [1] "2024-07-15"

now()
## [1] "2024-07-15 14:49:12 EDT"

16.4 Creating dates

When date and time data are imported into R they will often default to a character string. This requires us to convert strings to dates. We may also have multiple strings that we want to merge to create a date variable.

16.4.1 Convert strings to dates

To convert a character string that is already in a date format (YYYY-MM-DD) into a date object, we can use one of the lubridate parsing functions. One of the many benefits of the lubridate package is that it automatically recognizes the common separators used when recording dates (“-“, “/”, “.”, and “”). As a result, you only need to focus on specifying the order of the date elements to determine the parsing function applied. The different parsing functions available in lubridate include:

Date-time parsing functions.
Order of elements in date-time Parsing function
year, month, day ymd()
year, day, month ydm()
month, day, year mdy()
day, month, year dmy()
hour, minute hm()
hour, minute, second hms()
year, month, day, hour, minute, second ymd_hms()

For example, consider the three different character strings that follow. We can convert each one to a date or date-time object with the relevant lubridate parsing function.

x <- c("2015-07-01", "2015-08-01", "2015-09-01")
y <- c("07/01/2015", "08/01/2015", "09/01/2015")
z <- c("2015-07-01 12:59:59", "2015-08-01 02:45:22", "2015-09-01 15:05:12")

ymd(x)
## [1] "2015-07-01" "2015-08-01" "2015-09-01"

mdy(y)
## [1] "2015-07-01" "2015-08-01" "2015-09-01"

ymd_hms(z)
## [1] "2015-07-01 12:59:59 UTC" "2015-08-01 02:45:22 UTC"
## [3] "2015-09-01 15:05:12 UTC"

You can also use lubridate functions within dplyr functions. Say, for example, that our transaction timestamp in our data was not already a dttm date type.

non_dates <- transactions_sample %>% 
   mutate(date = as.character(transaction_timestamp))

glimpse(non_dates)
## Rows: 75,000
## Columns: 12
## $ household_id          <chr> "2261", "2131", "511", "400", "918", "7…
## $ store_id              <chr> "309", "368", "316", "388", "340", "324…
## $ basket_id             <chr> "31625220889", "32053127496", "32445856…
## $ product_id            <chr> "940996", "873902", "847901", "13094913…
## $ quantity              <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ sales_value           <dbl> 3.86, 1.59, 1.00, 11.87, 1.29, 2.50, 3.…
## $ retail_disc           <dbl> 0.43, 0.90, 0.69, 2.90, 0.00, 0.49, 0.0…
## $ coupon_disc           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week                  <int> 5, 10, 13, 8, 10, 15, 10, 33, 2, 16, 47…
## $ transaction_timestamp <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…
## $ date                  <chr> "2017-01-28 14:06:53", "2017-02-28 22:3…

We can easily convert this to a date-time by using the appropriate parsing function:

non_dates %>%
   mutate(date = ymd_hms(date)) %>%
   glimpse()
## Rows: 75,000
## Columns: 12
## $ household_id          <chr> "2261", "2131", "511", "400", "918", "7…
## $ store_id              <chr> "309", "368", "316", "388", "340", "324…
## $ basket_id             <chr> "31625220889", "32053127496", "32445856…
## $ product_id            <chr> "940996", "873902", "847901", "13094913…
## $ quantity              <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ sales_value           <dbl> 3.86, 1.59, 1.00, 11.87, 1.29, 2.50, 3.…
## $ retail_disc           <dbl> 0.43, 0.90, 0.69, 2.90, 0.00, 0.49, 0.0…
## $ coupon_disc           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week                  <int> 5, 10, 13, 8, 10, 15, 10, 33, 2, 16, 47…
## $ transaction_timestamp <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…
## $ date                  <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…

16.4.2 Create dates by merging data

Sometimes parts of your date-time are collected in separate elements/columns. For example, in the following data that comes from a data frame supplied by the nycflights13 package, the year, month, day, hour, and minute values are all separated into their own columns.

library(nycflights13)

flights %>% 
  select(year, month, day, hour, minute)
## # A tibble: 336,776 × 5
##     year month   day  hour minute
##    <int> <int> <int> <dbl>  <dbl>
##  1  2013     1     1     5     15
##  2  2013     1     1     5     29
##  3  2013     1     1     5     40
##  4  2013     1     1     5     45
##  5  2013     1     1     6      0
##  6  2013     1     1     5     58
##  7  2013     1     1     6      0
##  8  2013     1     1     6      0
##  9  2013     1     1     6      0
## 10  2013     1     1     6      0
## # ℹ 336,766 more rows

To create a date-time value from this sort of input, use make_date() for dates and make_datetime() for date-times:

flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 × 6
##     year month   day  hour minute departure          
##    <int> <int> <int> <dbl>  <dbl> <dttm>             
##  1  2013     1     1     5     15 2013-01-01 05:15:00
##  2  2013     1     1     5     29 2013-01-01 05:29:00
##  3  2013     1     1     5     40 2013-01-01 05:40:00
##  4  2013     1     1     5     45 2013-01-01 05:45:00
##  5  2013     1     1     6      0 2013-01-01 06:00:00
##  6  2013     1     1     5     58 2013-01-01 05:58:00
##  7  2013     1     1     6      0 2013-01-01 06:00:00
##  8  2013     1     1     6      0 2013-01-01 06:00:00
##  9  2013     1     1     6      0 2013-01-01 06:00:00
## 10  2013     1     1     6      0 2013-01-01 06:00:00
## # ℹ 336,766 more rows


16.4.3 Knowledge check

  1. Using any of the parsing functions to convert your birthday to a date object.
  2. Now save the date of your birthday into 3 separate character string variables: year, month, day and convert these three separate variables into a single date with make_datetime()


16.5 Extract & manipulate parts of dates

lubridate also provides several accessor function that allow you to easily extract and manipulate individual elements of a date. The available accessor functions include:

Date-time component accessor functions.
Date component Accessor function
Year year()
Month month()
Week week()
Day of year yday()
Day of month mday()
Day of week wday()
Hour hour()
Minute minute()
Second second()
Time zone tz()

To extract an individual element of the date variable you simply use the accessor function desired. Note that the accessor variables have additional arguments that can be used to show the name of the date element in full or abbreviated form.

x <- c("2015-07-01", "2015-08-01", "2015-09-01")

year(x)
## [1] 2015 2015 2015

# default is numerical value
month(x)
## [1] 7 8 9

# show abbreviated name
month(x, label = TRUE)
## [1] Jul Aug Sep
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < ... < Dec

# show unabbreviated name
month(x, label = TRUE, abbr = FALSE)
## [1] July      August    September
## 12 Levels: January < February < March < April < May < ... < December

wday(x, label = TRUE, abbr = FALSE)
## [1] Wednesday Saturday  Tuesday  
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < ... < Saturday

This can become handy when you want to perform quick data mining on certain aspects of data. Say we want to get the number of transactions by day for the month of December. By combining lubridate and dplyr functions we can easily filter for December transactions and count the number of transactions for each day.

transactions_sample %>%
   filter(month(transaction_timestamp) == 12) %>%
   count(day(transaction_timestamp))
## # A tibble: 30 × 2
##    `day(transaction_timestamp)`     n
##                           <int> <int>
##  1                            1   207
##  2                            2   258
##  3                            3   327
##  4                            4   201
##  5                            5   218
##  6                            6   195
##  7                            7   181
##  8                            8   189
##  9                            9   228
## 10                           10   272
## # ℹ 20 more rows


16.5.1 Knowledge check

  1. Get the day of the week that you were born on. The default will print out a number but you can use the label parameter to get the name of the day (i.e. Sun, Mon, …, Sat).
  2. Using transactions_sample compute the total sales by the day of the week. Which day produces the largest sales? Which day produces the smallest sales?


16.6 Calculations with dates

Since R stores date and time objects as numbers, this allows you to perform various calculations such as logical comparisons, computing the duration between different dates, and adding/subtracting periods (months, days, hours, minutes, etc.) to dates and times.

x <- today()
x
## [1] "2024-07-15"

y <- ymd("2018-06-01")

x > y
## [1] TRUE

x - y
## Time difference of 2236 days

16.6.1 Durations

In R, when you subtract two dates, you get a difftime object. A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds as the duration but will summarize that duration to the nearest “useful” period (years, weeks, days).

x <- ymd_hm("2018-06-01 12:10")
y <- ymd_hm("2016-03-21 13:54")
z <- ymd_hm("2018-05-21 13:54")

# base R difftime results
x - y
## Time difference of 801.93 days
x - z
## Time difference of 10.928 days

# lubridate::as.duration difftime results
as.duration(x - y)
## [1] "69286560s (~2.2 years)"
as.duration(x - z)
## [1] "944160s (~1.56 weeks)"

Working with durations can be useful when performing data mining as we can perform normal statistical procedures with durations. For example, we can compute the duration between the first and last transaction date for each household to find those households that have been shopping with us the longest.

transactions_sample %>%
   group_by(household_id) %>%
   summarize(
      first_date = min(transaction_timestamp),
      last_date = max(transaction_timestamp)
   ) %>%
   mutate(difference = as.duration(last_date - first_date)) %>%
   arrange(desc(difference))
## # A tibble: 2,377 × 4
##    household_id first_date          last_date          
##    <chr>        <dttm>              <dttm>             
##  1 1378         2017-01-01 14:04:55 2017-12-31 18:07:18
##  2 58           2017-01-01 10:14:16 2017-12-31 12:46:33
##  3 1899         2017-01-01 10:48:12 2017-12-31 11:01:59
##  4 2337         2017-01-01 13:33:43 2017-12-31 13:32:48
##  5 1130         2017-01-01 10:38:18 2017-12-31 10:35:42
##  6 2445         2017-01-01 17:36:15 2017-12-31 17:27:35
##  7 290          2017-01-01 16:30:07 2017-12-31 16:14:42
##  8 304          2017-01-01 17:45:07 2017-12-31 17:24:25
##  9 1563         2017-01-01 16:09:32 2017-12-31 15:23:24
## 10 2194         2017-01-01 11:34:53 2017-12-31 09:22:43
## # ℹ 2,367 more rows
## # ℹ 1 more variable: difference <Duration>

In addition to as.duration(), lubridate provides several duration functions (dseconds, dhours, dyears, etc.) that will convert a value to the number of seconds for the duration period of interest.

dseconds(55)
## [1] "55s"
dminutes(25)
## [1] "1500s (~25 minutes)"
ddays(30)
## [1] "2592000s (~4.29 weeks)"
dweeks(3)
## [1] "1814400s (~3 weeks)"
dyears(4)
## [1] "126230400s (~4 years)"

This makes for efficient data mining as we do not need to worry about spending time to correctly compute duration periods, we simply use the right duration function of interest. For example, we can build upon the example above and filter our transactions for only those customers that have been shopping with us for at least 25 weeks.

transactions_sample %>%
   group_by(household_id) %>%
   summarize(
      first_date = min(transaction_timestamp),
      last_date = max(transaction_timestamp)
   ) %>%
   mutate(difference = as.duration(last_date - first_date)) %>%
   filter(difference >= dweeks(25))
## # A tibble: 1,996 × 4
##    household_id first_date          last_date          
##    <chr>        <dttm>              <dttm>             
##  1 1            2017-01-07 13:55:24 2017-12-14 10:13:38
##  2 100          2017-02-15 18:38:13 2017-11-17 21:12:00
##  3 1000         2017-01-04 18:30:59 2017-12-26 17:50:21
##  4 1001         2017-02-09 18:02:42 2017-12-28 13:44:31
##  5 1002         2017-03-16 12:39:10 2017-12-24 13:09:28
##  6 1003         2017-03-05 17:17:40 2017-12-02 09:37:27
##  7 1004         2017-01-21 01:50:10 2017-12-26 20:51:39
##  8 1005         2017-01-04 18:18:18 2017-12-17 14:27:52
##  9 1007         2017-01-14 20:33:42 2017-10-17 22:59:46
## 10 1008         2017-03-15 06:40:12 2017-12-03 08:56:03
## # ℹ 1,986 more rows
## # ℹ 1 more variable: difference <Duration>

The duration functions (dxxx()) represent an exact number of seconds and does not consider the differences due to leap year and daylight savings time adjustments.

It’s important to note that the duration functions do not capture differences in time due to leap year or daylight savings time adjustments. For example, if we add two hours to the given date-time provided in one_am we go from 1am to 4am. Why? Because this is the date-time that daylight savings occurred.

one_am <- ymd_hms("2018-03-11 01:00:00", tz = "America/New_York")

one_am
## [1] "2018-03-11 01:00:00 EST"

one_am + dhours(2)
## [1] "2018-03-11 04:00:00 EDT"

This is where working with periods can help us.

16.6.2 Periods

Periods are time spans but don’t have a fixed length in seconds, instead they work with calendar friendly time periods that account for unique instances like daylight savings time and leap years. That allows them work in a more intuitive way.

Like durations, periods can be created with a number of friendly constructor functions.

seconds(55)
## [1] "55S"
minutes(25)
## [1] "25M 0S"
days(30)
## [1] "30d 0H 0M 0S"
weeks(3)
## [1] "21d 0H 0M 0S"
years(4)
## [1] "4y 0m 0d 0H 0M 0S"

When we add periods to any existing date-time, it will add calendar and/or clock “correct” time periods. For example, if we add two hours to our daylight savings time we created previously, it will result in two “normal” clock hours being added to our time even though daylight savings occurred at 2am.

# daylight savings date
one_am
## [1] "2018-03-11 01:00:00 EST"

# add two hours
one_am + hours(2)
## [1] "2018-03-11 03:00:00 EDT"

16.6.3 Knowledge check

  1. Compute the last (most recent) transaction date in our transactions_sample data.
  2. Now identify shoppers that haven’t made a transaction within 3 months of this date.


16.7 Exercises

  1. Compute the total sales_value of all transactions made Thanksgiving Day during the year of our available transactions (November 23, 2017).
  2. What is the average time of date across all transactions?
  3. Which household shops the earliest in the day (on average) and which household shops the latest?

16.8 Additional resources

For additional resources on learning and dealing with dates I recommend the following: