16 Lesson 4c: Handling dates & times
Real world data are often associated with dates and time; however, dealing with dates accurately can appear to be a complicated task due to the variety in formats and accounting for time-zone differences and leap years. R has a range of functions that allow you to work with dates and times. Furthermore, packages such as lubridate
make it easier to work with dates and times.
16.1 Learning objectives
By the end of this lesson you’ll be able to:
- Create date and time data.
- Extract & manipulate parts of dates and times.
- Perform calculations with dates and times.
16.2 Prerequisites
For this lesson, we will use the packages supplied by tidyverse
for supporting roles; however, the bulk of the emphasis will be on using the lubridate
package which provides useful functions for working with dates and times.
library(tidyverse) # provides supporting data wrangling functions
library(lubridate) # provides functions for working with dates & times
To illustrate various transformation tasks on dates we will use the transactions data from completejourney. Note how transaction_timestamp
is a <dttm>
type, which stands for a date-time data type.
library(completejourney)
glimpse(transactions_sample)
## Rows: 75,000
## Columns: 11
## $ household_id <chr> "2261", "2131", "511", "400", "918", "7…
## $ store_id <chr> "309", "368", "316", "388", "340", "324…
## $ basket_id <chr> "31625220889", "32053127496", "32445856…
## $ product_id <chr> "940996", "873902", "847901", "13094913…
## $ quantity <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ sales_value <dbl> 3.86, 1.59, 1.00, 11.87, 1.29, 2.50, 3.…
## $ retail_disc <dbl> 0.43, 0.90, 0.69, 2.90, 0.00, 0.49, 0.0…
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week <int> 5, 10, 13, 8, 10, 15, 10, 33, 2, 16, 47…
## $ transaction_timestamp <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…
16.3 Getting current date & time
To get current date and time information:
Sys.timezone()
## [1] "America/New_York"
Sys.Date()
## [1] "2024-07-15"
Sys.time()
## [1] "2024-07-15 14:49:12 EDT"
If using the lubridate package:
16.4 Creating dates
When date and time data are imported into R they will often default to a character string. This requires us to convert strings to dates. We may also have multiple strings that we want to merge to create a date variable.
16.4.1 Convert strings to dates
To convert a character string that is already in a date format (YYYY-MM-DD) into a date object, we can use one of the lubridate
parsing functions. One of the many benefits of the lubridate
package is that it automatically recognizes the common separators used when recording dates (“-“, “/”, “.”, and “”). As a result, you only need to focus on specifying the order of the date elements to determine the parsing function applied. The different parsing functions available in lubridate
include:
Order of elements in date-time | Parsing function |
---|---|
year, month, day | ymd() |
year, day, month | ydm() |
month, day, year | mdy() |
day, month, year | dmy() |
hour, minute | hm() |
hour, minute, second | hms() |
year, month, day, hour, minute, second | ymd_hms() |
For example, consider the three different character strings that follow. We can convert each one to a date or date-time object with the relevant lubridate
parsing function.
x <- c("2015-07-01", "2015-08-01", "2015-09-01")
y <- c("07/01/2015", "08/01/2015", "09/01/2015")
z <- c("2015-07-01 12:59:59", "2015-08-01 02:45:22", "2015-09-01 15:05:12")
ymd(x)
## [1] "2015-07-01" "2015-08-01" "2015-09-01"
mdy(y)
## [1] "2015-07-01" "2015-08-01" "2015-09-01"
ymd_hms(z)
## [1] "2015-07-01 12:59:59 UTC" "2015-08-01 02:45:22 UTC"
## [3] "2015-09-01 15:05:12 UTC"
You can also use lubridate
functions within dplyr
functions. Say, for example, that our transaction timestamp in our data was not already a dttm
date type.
non_dates <- transactions_sample %>%
mutate(date = as.character(transaction_timestamp))
glimpse(non_dates)
## Rows: 75,000
## Columns: 12
## $ household_id <chr> "2261", "2131", "511", "400", "918", "7…
## $ store_id <chr> "309", "368", "316", "388", "340", "324…
## $ basket_id <chr> "31625220889", "32053127496", "32445856…
## $ product_id <chr> "940996", "873902", "847901", "13094913…
## $ quantity <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ sales_value <dbl> 3.86, 1.59, 1.00, 11.87, 1.29, 2.50, 3.…
## $ retail_disc <dbl> 0.43, 0.90, 0.69, 2.90, 0.00, 0.49, 0.0…
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week <int> 5, 10, 13, 8, 10, 15, 10, 33, 2, 16, 47…
## $ transaction_timestamp <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…
## $ date <chr> "2017-01-28 14:06:53", "2017-02-28 22:3…
We can easily convert this to a date-time by using the appropriate parsing function:
non_dates %>%
mutate(date = ymd_hms(date)) %>%
glimpse()
## Rows: 75,000
## Columns: 12
## $ household_id <chr> "2261", "2131", "511", "400", "918", "7…
## $ store_id <chr> "309", "368", "316", "388", "340", "324…
## $ basket_id <chr> "31625220889", "32053127496", "32445856…
## $ product_id <chr> "940996", "873902", "847901", "13094913…
## $ quantity <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, …
## $ sales_value <dbl> 3.86, 1.59, 1.00, 11.87, 1.29, 2.50, 3.…
## $ retail_disc <dbl> 0.43, 0.90, 0.69, 2.90, 0.00, 0.49, 0.0…
## $ coupon_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ coupon_match_disc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ week <int> 5, 10, 13, 8, 10, 15, 10, 33, 2, 16, 47…
## $ transaction_timestamp <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…
## $ date <dttm> 2017-01-28 14:06:53, 2017-02-28 22:31:…
16.4.2 Create dates by merging data
Sometimes parts of your date-time are collected in separate elements/columns. For example, in the following data that comes from a data frame supplied by the nycflights13
package, the year, month, day, hour, and minute values are all separated into their own columns.
library(nycflights13)
flights %>%
select(year, month, day, hour, minute)
## # A tibble: 336,776 × 5
## year month day hour minute
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 5 15
## 2 2013 1 1 5 29
## 3 2013 1 1 5 40
## 4 2013 1 1 5 45
## 5 2013 1 1 6 0
## 6 2013 1 1 5 58
## 7 2013 1 1 6 0
## 8 2013 1 1 6 0
## 9 2013 1 1 6 0
## 10 2013 1 1 6 0
## # ℹ 336,766 more rows
To create a date-time value from this sort of input, use make_date()
for dates and make_datetime()
for date-times:
flights %>%
select(year, month, day, hour, minute) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
## # A tibble: 336,776 × 6
## year month day hour minute departure
## <int> <int> <int> <dbl> <dbl> <dttm>
## 1 2013 1 1 5 15 2013-01-01 05:15:00
## 2 2013 1 1 5 29 2013-01-01 05:29:00
## 3 2013 1 1 5 40 2013-01-01 05:40:00
## 4 2013 1 1 5 45 2013-01-01 05:45:00
## 5 2013 1 1 6 0 2013-01-01 06:00:00
## 6 2013 1 1 5 58 2013-01-01 05:58:00
## 7 2013 1 1 6 0 2013-01-01 06:00:00
## 8 2013 1 1 6 0 2013-01-01 06:00:00
## 9 2013 1 1 6 0 2013-01-01 06:00:00
## 10 2013 1 1 6 0 2013-01-01 06:00:00
## # ℹ 336,766 more rows
16.5 Extract & manipulate parts of dates
lubridate
also provides several accessor function that allow you to easily extract and manipulate individual elements of a date. The available accessor functions include:
Date component | Accessor function |
---|---|
Year | year() |
Month | month() |
Week | week() |
Day of year | yday() |
Day of month | mday() |
Day of week | wday() |
Hour | hour() |
Minute | minute() |
Second | second() |
Time zone | tz() |
To extract an individual element of the date variable you simply use the accessor function desired. Note that the accessor variables have additional arguments that can be used to show the name of the date element in full or abbreviated form.
x <- c("2015-07-01", "2015-08-01", "2015-09-01")
year(x)
## [1] 2015 2015 2015
# default is numerical value
month(x)
## [1] 7 8 9
# show abbreviated name
month(x, label = TRUE)
## [1] Jul Aug Sep
## 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < ... < Dec
# show unabbreviated name
month(x, label = TRUE, abbr = FALSE)
## [1] July August September
## 12 Levels: January < February < March < April < May < ... < December
wday(x, label = TRUE, abbr = FALSE)
## [1] Wednesday Saturday Tuesday
## 7 Levels: Sunday < Monday < Tuesday < Wednesday < ... < Saturday
This can become handy when you want to perform quick data mining on certain aspects of data. Say we want to get the number of transactions by day for the month of December. By combining lubridate
and dplyr
functions we can easily filter for December transactions and count the number of transactions for each day.
transactions_sample %>%
filter(month(transaction_timestamp) == 12) %>%
count(day(transaction_timestamp))
## # A tibble: 30 × 2
## `day(transaction_timestamp)` n
## <int> <int>
## 1 1 207
## 2 2 258
## 3 3 327
## 4 4 201
## 5 5 218
## 6 6 195
## 7 7 181
## 8 8 189
## 9 9 228
## 10 10 272
## # ℹ 20 more rows
16.5.1 Knowledge check
-
Get the day of the week that you were born on. The default will
print out a number but you can use the
label
parameter to get the name of the day (i.e. Sun, Mon, …, Sat). -
Using
transactions_sample
compute the total sales by the day of the week. Which day produces the largest sales? Which day produces the smallest sales?
16.6 Calculations with dates
Since R stores date and time objects as numbers, this allows you to perform various calculations such as logical comparisons, computing the duration between different dates, and adding/subtracting periods (months, days, hours, minutes, etc.) to dates and times.
x <- today()
x
## [1] "2024-07-15"
y <- ymd("2018-06-01")
x > y
## [1] TRUE
x - y
## Time difference of 2236 days
16.6.1 Durations
In R, when you subtract two dates, you get a difftime object. A difftime class object records a time span of seconds, minutes, hours, days, or weeks. This ambiguity can make difftimes a little painful to work with, so lubridate provides an alternative which always uses seconds as the duration but will summarize that duration to the nearest “useful” period (years, weeks, days).
x <- ymd_hm("2018-06-01 12:10")
y <- ymd_hm("2016-03-21 13:54")
z <- ymd_hm("2018-05-21 13:54")
# base R difftime results
x - y
## Time difference of 801.93 days
x - z
## Time difference of 10.928 days
# lubridate::as.duration difftime results
as.duration(x - y)
## [1] "69286560s (~2.2 years)"
as.duration(x - z)
## [1] "944160s (~1.56 weeks)"
Working with durations can be useful when performing data mining as we can perform normal statistical procedures with durations. For example, we can compute the duration between the first and last transaction date for each household to find those households that have been shopping with us the longest.
transactions_sample %>%
group_by(household_id) %>%
summarize(
first_date = min(transaction_timestamp),
last_date = max(transaction_timestamp)
) %>%
mutate(difference = as.duration(last_date - first_date)) %>%
arrange(desc(difference))
## # A tibble: 2,377 × 4
## household_id first_date last_date
## <chr> <dttm> <dttm>
## 1 1378 2017-01-01 14:04:55 2017-12-31 18:07:18
## 2 58 2017-01-01 10:14:16 2017-12-31 12:46:33
## 3 1899 2017-01-01 10:48:12 2017-12-31 11:01:59
## 4 2337 2017-01-01 13:33:43 2017-12-31 13:32:48
## 5 1130 2017-01-01 10:38:18 2017-12-31 10:35:42
## 6 2445 2017-01-01 17:36:15 2017-12-31 17:27:35
## 7 290 2017-01-01 16:30:07 2017-12-31 16:14:42
## 8 304 2017-01-01 17:45:07 2017-12-31 17:24:25
## 9 1563 2017-01-01 16:09:32 2017-12-31 15:23:24
## 10 2194 2017-01-01 11:34:53 2017-12-31 09:22:43
## # ℹ 2,367 more rows
## # ℹ 1 more variable: difference <Duration>
In addition to as.duration()
, lubridate
provides several duration functions (dseconds
, dhours
, dyears
, etc.) that will convert a value to the number of seconds for the duration period of interest.
dseconds(55)
## [1] "55s"
dminutes(25)
## [1] "1500s (~25 minutes)"
ddays(30)
## [1] "2592000s (~4.29 weeks)"
dweeks(3)
## [1] "1814400s (~3 weeks)"
dyears(4)
## [1] "126230400s (~4 years)"
This makes for efficient data mining as we do not need to worry about spending time to correctly compute duration periods, we simply use the right duration function of interest. For example, we can build upon the example above and filter our transactions for only those customers that have been shopping with us for at least 25 weeks.
transactions_sample %>%
group_by(household_id) %>%
summarize(
first_date = min(transaction_timestamp),
last_date = max(transaction_timestamp)
) %>%
mutate(difference = as.duration(last_date - first_date)) %>%
filter(difference >= dweeks(25))
## # A tibble: 1,996 × 4
## household_id first_date last_date
## <chr> <dttm> <dttm>
## 1 1 2017-01-07 13:55:24 2017-12-14 10:13:38
## 2 100 2017-02-15 18:38:13 2017-11-17 21:12:00
## 3 1000 2017-01-04 18:30:59 2017-12-26 17:50:21
## 4 1001 2017-02-09 18:02:42 2017-12-28 13:44:31
## 5 1002 2017-03-16 12:39:10 2017-12-24 13:09:28
## 6 1003 2017-03-05 17:17:40 2017-12-02 09:37:27
## 7 1004 2017-01-21 01:50:10 2017-12-26 20:51:39
## 8 1005 2017-01-04 18:18:18 2017-12-17 14:27:52
## 9 1007 2017-01-14 20:33:42 2017-10-17 22:59:46
## 10 1008 2017-03-15 06:40:12 2017-12-03 08:56:03
## # ℹ 1,986 more rows
## # ℹ 1 more variable: difference <Duration>
The duration functions (dxxx()
) represent an exact
number of seconds and does not consider the differences due to leap year
and daylight savings time adjustments.
It’s important to note that the duration functions do not capture differences in time due to leap year or daylight savings time adjustments. For example, if we add two hours to the given date-time provided in one_am
we go from 1am to 4am. Why? Because this is the date-time that daylight savings occurred.
one_am <- ymd_hms("2018-03-11 01:00:00", tz = "America/New_York")
one_am
## [1] "2018-03-11 01:00:00 EST"
one_am + dhours(2)
## [1] "2018-03-11 04:00:00 EDT"
This is where working with periods can help us.
16.6.2 Periods
Periods are time spans but don’t have a fixed length in seconds, instead they work with calendar friendly time periods that account for unique instances like daylight savings time and leap years. That allows them work in a more intuitive way.
Like durations, periods can be created with a number of friendly constructor functions.
seconds(55)
## [1] "55S"
minutes(25)
## [1] "25M 0S"
days(30)
## [1] "30d 0H 0M 0S"
weeks(3)
## [1] "21d 0H 0M 0S"
years(4)
## [1] "4y 0m 0d 0H 0M 0S"
When we add periods to any existing date-time, it will add calendar and/or clock “correct” time periods. For example, if we add two hours to our daylight savings time we created previously, it will result in two “normal” clock hours being added to our time even though daylight savings occurred at 2am.
16.7 Exercises
-
Compute the total
sales_value
of all transactions made Thanksgiving Day during the year of our available transactions (November 23, 2017). - What is the average time of date across all transactions?
- Which household shops the earliest in the day (on average) and which household shops the latest?