23 Lesson 6b: Iteration statements
Often, we need to execute repetitive code statements a particular number of times. Or, we may even need to execute code for an undetermined number of times until a certain condition no longer holds. There are multiple ways we can achieve this and in this lesson we will cover several of the more common approaches to perform iteration.
23.1 Learning objectives
By the end of this module you will be able to:
- Apply for, while, and repeat to execute repetitive code statements.
- Incorporate break and next to control looping statements.
- Use functional programming to perform repetitive code statements.
23.2 for loops
The for loop is used to execute repetitive code statements a particular number of times. In the general syntax, i is the counter: as i assumes each sequential value, the code in the body is performed for that ith value.
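The general form described above can be sketched as follows. The vector values is a placeholder chosen purely for illustration; it is not part of the lesson's data.

```r
# a sequence of values to iterate over (placeholder data)
values <- c(10, 20, 30)

for (i in values) {
  # body: this code runs once for each value that i takes on
  print(i * 2)
}
```

After the loop finishes, i still holds the last value it was assigned (here, 30).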
There are three main components of a for loop to consider:
- Sequence: the sequence of values to iterate over.
- Body: apply some function(s) to the object we are iterating over.
- Output: You must specify what to do with the result. This may include printing out a result or modifying the object in place.
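A minimal sketch of how these three components fit together is shown below. The temperature readings and the names temps_f and temps_c are hypothetical, used only to make each component visible.

```r
# hypothetical Fahrenheit readings to process
temps_f <- c(32, 50, 212)

# output: pre-allocate a vector to hold one result per element
temps_c <- vector(mode = "double", length = length(temps_f))

# sequence: seq_along() gives the positions 1, 2, 3 (covered later in this lesson)
for (i in seq_along(temps_f)) {
  # body: convert the i-th reading and store it in the output vector
  temps_c[i] <- (temps_f[i] - 32) * 5 / 9
}

temps_c
## [1]   0  10 100
```

Pre-allocating the output and filling it by position avoids growing the vector on every iteration.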
For example, the following for loop iterates through each value (2018, 2019, …, 2022) and performs the paste and print functions inside the curly brackets.
Note how we use year in place of i. It's always good to be more descriptive with your iteration terms so it's clearer to others what you are referring to.
years <- c(2018:2022)
for (year in years) {
  output <- paste("The year is", year)
  print(output)
}
## [1] "The year is 2018"
## [1] "The year is 2019"
## [1] "The year is 2020"
## [1] "The year is 2021"
## [1] "The year is 2022"
In the above example, year refers directly to the value of the element in years. However, sometimes we want to refer to the index value rather than the value. To do this, we primarily use seq_along(). seq_along() is a function that creates a vector containing a sequence of numbers from 1 to the length of the object.
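For instance, applied to the years vector from the previous example (redefined here so the snippet runs on its own), seq_along() returns the positions 1 through 5:

```r
years <- c(2018:2022)

# one index per element of years
seq_along(years)
## [1] 1 2 3 4 5
```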
To use seq_along() with a for loop, we just specify it in the sequence portion of the for loop:
for (index in seq_along(years)) {
  output <- paste0("Element ", index, ": ", years[index])
  print(output)
}
## [1] "Element 1: 2018"
## [1] "Element 2: 2019"
## [1] "Element 3: 2020"
## [1] "Element 4: 2021"
## [1] "Element 5: 2022"
23.2.1 Knowledge check
Download the monthly_data data sets used throughout this lesson from the supplementary files provided via Canvas.
We can see all data sets that we have in the “data” folder with list.files():
library(here)
monthly_data_files <- here("data/monthly_data")
list.files(monthly_data_files)
## [1] "Month-01.csv" "Month-02.csv" "Month-03.csv" "Month-04.csv"
## [5] "Month-05.csv" "Month-06.csv" "Month-07.csv" "Month-08.csv"
## [9] "Month-09.csv" "Month-10.csv" "Month-11.csv"
Say we wanted to import one of these files into R:
# here's a single file
(first_df <- list.files(monthly_data_files)[1])
## [1] "Month-01.csv"
# create path and import this single file
df <- readr::read_csv(here(monthly_data_files, first_df))
# create a new name for file
(new_name <- stringr::str_sub(first_df, end = -5))
## [1] "Month-01"
# dynamically rename file
assign(new_name, df)
Can you incorporate these procedures into a for loop to import all the data files?
for (data_file in _____) {
  # 1: import data
  df <- readr::read_csv(_____)
  # 2: remove ".csv" from file name
  new_name <- _____
  # 3: dynamically rename file
  assign(_____, _____)
}
23.3 Controlling sequences
There are two ways to control the progression of a loop:
- next: terminates the current iteration and advances to the next.
- break: exits the entire for loop.
Both are used in conjunction with if statements. For example, a for loop can iterate over each element in years and, when it gets to the element that equals covid (2020), break out and end the for loop process.
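A minimal sketch of such a loop is given below, reusing the years vector from earlier and a covid variable set to 2020 as described in the text; the exact code is a reconstruction, not necessarily the lesson's own.

```r
years <- c(2018:2022)
covid <- 2020

for (year in years) {
  # stop the whole loop as soon as we reach the covid year
  if (year == covid) break
  print(year)
}
## [1] 2018
## [1] 2019
```

Because break exits the loop entirely, 2020 and every year after it are never printed.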
The next statement is useful when we want to skip the current iteration of a loop without terminating it. On encountering next, R skips the rest of the loop body and starts the next iteration of the loop. In this example, the for loop will iterate over each element in years; however, when it gets to the element that equals covid it will skip the rest of the code in the body and simply jump to the next iteration.
covid <- 2020
for (year in years) {
  if (year == covid) next
  print(year)
}
## [1] 2018
## [1] 2019
## [1] 2021
## [1] 2022
23.3.1 Knowledge check
The following code identifies the month of the data set from our monthly data files in the last knowledge check:
# data files
(data_files <- list.files(monthly_data_files))
## [1] "Month-01.csv" "Month-02.csv" "Month-03.csv" "Month-04.csv"
## [5] "Month-05.csv" "Month-06.csv" "Month-07.csv" "Month-08.csv"
## [9] "Month-09.csv" "Month-10.csv" "Month-11.csv"
# extract month number
as.numeric(stringr::str_extract(data_files, "\\d+"))
## [1] 1 2 3 4 5 6 7 8 9 10 11
Modify the following for loop with a next or break statement to:
- only import Month-01 through Month-07
- only import Month-08 through Month-10
# Modify this code chunk with your next/break statement
for (data_file in data_files) {
  # steps to import each data set
  # (the files live in the monthly_data folder defined earlier)
  df <- readr::read_csv(file.path(monthly_data_files, data_file))
  new_name <- stringr::str_sub(data_file, end = -5)
  assign(new_name, df)
  rm(df)
}
23.4 Repeating code for undefined number of iterations
Sometimes we need to execute code for an undetermined number of times until a certain condition no longer holds.
There are two very similar options to do this:
- while loop
- repeat loop
23.4.1 while loop
With a while loop we:
- Test condition first
- Then execute code
For example, the probability of flipping 10 coins and getting all heads or all tails is \(2 \times (\frac{1}{2})^{10} = 0.001953125\) (1 in 512 tries). Let’s implement this and see how many tries it takes to accomplish this feat.
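As a quick arithmetic check, all heads and all tails are mutually exclusive outcomes, each with probability \((\frac{1}{2})^{10}\), so the chance of getting either one is their sum. The variable names below are illustrative only.

```r
# probability that all 10 flips land heads, and likewise for tails
p_all_heads <- 0.5^10
p_all_tails <- 0.5^10

# the two outcomes cannot happen together, so we add them
p_either <- p_all_heads + p_all_tails
p_either
## [1] 0.001953125

# average number of tries needed before the first success
1 / p_either
## [1] 512
```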
The following while statement checks whether the number of unique values in flip differs from 1. As long as it does, we have not yet flipped all heads or all tails, so we repeat the process of flipping 10 coins and incrementing the number of tries. When our condition statement length(unique(flip)) != 1 is FALSE, we exit the while loop.
# create a coin
coin <- c("heads", "tails")
# set number of tries to zero
n_tries <- 0
# this will be used to imitate a flip of 10 coins
flip <- NULL
while (length(unique(flip)) != 1) {
  # flip coin 10x
  flip <- sample(coin, 10, replace = TRUE)
  # add to the number of tries
  n_tries <- n_tries + 1
}
n_tries
## [1] 553
23.4.2 repeat loop
With a repeat loop we:
- Execute code first
- Then test condition
We can perform the same exercise as above to assess how many tries it takes to flip 10 coins and get all heads or tails. The main difference here is that repeat performs the action first, and then we incorporate a conditional statement to check whether the flip is all heads or tails (length(unique(flip)) == 1); if so, we break out of the loop.
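A sketch of the repeat version, mirroring the while example above, might look like this. The number of tries depends on the random draws, so no particular output is shown.

```r
# create a coin
coin <- c("heads", "tails")
# set number of tries to zero
n_tries <- 0

repeat {
  # flip coin 10x
  flip <- sample(coin, 10, replace = TRUE)
  # add to the number of tries
  n_tries <- n_tries + 1
  # exit once all 10 flips show the same face
  if (length(unique(flip)) == 1) {
    break
  }
}

n_tries
```

Unlike the while version, flip does not need to be initialized before the loop, because the body always runs at least once before the condition is tested.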
Notice that you need to incorporate the conditional statement in repeat, otherwise it will continue looping indefinitely!
23.4.3 Knowledge check
An elementary example of a random walk is the random walk on the integer number line, \(\mathbb{Z}\), which starts at 0 and at each step moves +1 or −1 with equal probability.
Fill in the incomplete code chunk below to perform a random walk starting at value 0, with each step either adding or subtracting 1. Have your random walk stop if the value reaches 100 or if the number of steps taken exceeds 10,000.
value <- 0
step <- 0
repeat {
  # randomly add or subtract 1
  random_step <- sample(c(-1, 1), 1)
  value <- value + _______
  # count step
  step <- step + __
  # break once our walk reaches 100 or we exceed 10,000 steps
  if (______ == 100 || _____ > 10000) {
    print(step)
    break
  }
}
23.5 Iteration with functional programming
Iteration can be summed up as FOR EACH ____ DO ____. For example, in the previous knowledge checks we saw this code:
data_files <- list.files(monthly_data_files)
for (data_file in data_files) {
  # steps to import each data set
  # (the files live in the monthly_data folder defined earlier)
  df <- readr::read_csv(file.path(monthly_data_files, data_file))
  new_name <- stringr::str_sub(data_file, end = -5)
  assign(new_name, df)
  rm(df)
}
The intent of this approach is FOR EACH file DO importing procedure. However, sometimes when reading for loops it’s tough to focus on the primary intent.
Functional programming turns this idea into a function which, as we’ll see in later examples, can help to make iteration more efficient, strict, and explicit!
The purrr package provides functional programming tools that:
- for each element of x
- apply function f and
- provide consistent output
The most commonly used purrr functions are the family of map() functions. The family of map functions provided by purrr consists of vectorized functions that minimize your need to explicitly create loops. The initial functions we’ll explore include:
- map() outputs a list.
- map_lgl() outputs a logical vector.
- map_int() outputs an integer vector.
- map_dbl() outputs a double vector.
- map_chr() outputs a character vector.
- map_df() outputs a data frame.
These functions all behave in a similar manner: each takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length as the input. The primary difference is in the object they return.
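To make the difference in return types concrete, here is a small sketch on a made-up list. The scores object is hypothetical and exists only for this illustration.

```r
library(purrr)

# a small, made-up list to iterate over
scores <- list(a = c(1, 2, 3), b = c(4, 5), c = 6)

# list output: one list element per input element
map(scores, length)

# integer vector output: a = 3, b = 2, c = 1
map_int(scores, length)

# double vector output: a = 2, b = 4.5, c = 6
map_dbl(scores, mean)

# logical vector output: TRUE for every element
map_lgl(scores, is.numeric)

# character vector output: "numeric" for every element
map_chr(scores, class)
```

In each case the result has one entry per element of scores and keeps the element names; only the type of the returned object changes.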
For example, say we wanted to iterate over the Complete Journey demographics data frame and compute the number of unique values for each column. We could do this with a for loop, which would look like this:
cols <- colnames(completejourney::demographics)
# pre-allocate one slot per column
distinct_values <- vector(mode = 'integer', length = length(cols))
for (column in seq_along(cols)) {
  # n_distinct() comes from dplyr, so we call it explicitly
  n_dist <- dplyr::n_distinct(completejourney::demographics[[column]])
  distinct_values[column] <- n_dist
}
distinct_values
## [1] 801 6 12 5 3 5 4 4
However, admittedly, this is a bit busy and it's tough to see what the primary intent of this code is. Alternatively, we could apply purrr::map(), which in this example will iterate over each column of completejourney::demographics and apply the n_distinct function.
# tidyverse automatically loads purrr
library(tidyverse)
completejourney::demographics %>%
map(n_distinct)
## $household_id
## [1] 801
##
## $age
## [1] 6
##
## $income
## [1] 12
##
## $home_ownership
## [1] 5
##
## $marital_status
## [1] 3
##
## $household_size
## [1] 5
##
## $household_comp
## [1] 4
##
## $kids_count
## [1] 4
Notice how the above results in a named list. If we instead want a vector to be returned, we can use map_int():
completejourney::demographics %>%
map_int(n_distinct)
## household_id age income home_ownership
## 801 6 12 5
## marital_status household_size household_comp kids_count
## 3 5 4 4
We can apply other map functions to our input as well; we simply need to think about what we expect the output to be and that directs us to use the relevant map function:
# logical output
completejourney::demographics %>% map_lgl(is.factor)
## household_id age income home_ownership
## FALSE TRUE TRUE TRUE
## marital_status household_size household_comp kids_count
## TRUE TRUE TRUE TRUE
# integer output
completejourney::demographics %>% map_int(~ length(unique(.)))
## household_id age income home_ownership
## 801 6 12 5
## marital_status household_size household_comp kids_count
## 3 5 4 4
Notice the last function applied looks different. The map functions provide a few shortcuts that you can use with the .f argument in order to save a little typing. The syntax for creating an anonymous function in R is quite verbose, so purrr provides a convenient shortcut: a one-sided formula in which . (here, inside of unique()) marks where each element of the data should be evaluated.
# traditional anonymous function approach
completejourney::demographics %>% map_int(function(x) length(unique(x)))
## household_id age income home_ownership
## 801 6 12 5
## marital_status household_size household_comp kids_count
## 3 5 4 4
# one-sided formula approach
completejourney::demographics %>% map_int(~ length(unique(.)))
## household_id age income home_ownership
## 801 6 12 5
## marital_status household_size household_comp kids_count
## 3 5 4 4
This, along with chaining multiple map functions together, can make for very efficient data mining. To provide a toy example, the following uses the diamonds data set provided by ggplot2 and:
- breaks the data set into separate data frames based on the cut variable,
- applies a linear regression model to each data frame,
- extracts the summary of each linear regression model, and
- computes the \(R^2\) for each model.
Don’t worry, in the next module you’ll learn more about applying statistical models to your data. For now, just work through each line of code to get insight into the input and output of each map step.
diamonds %>%
  split(.$cut) %>%
  map(~ lm(price ~ carat, data = .)) %>%
  map(summary) %>%
  map_dbl(~ .$r.squared)
## Fair Good Very Good Premium Ideal
## 0.73839 0.85095 0.85816 0.85563 0.86709
23.5.1 Knowledge check
With the built-in airquality data set, use the most appropriate map functions to answer these three questions:
- How many n_distinct values are in each column?
- Are there any missing values in this data set?
- What is the standard deviation for each variable?
If you want to get deeper into functional programming with purrr check out Section 21.5 of R for Data Science.
23.6 Exercises
- Use purrr::map_dfr to import each of the monthly_data/Month-XX.csv data files and combine into one single data frame.
- Check the current class of each column (i.e., class(df$Account_ID)). Since the time stamp variable has two classes, you can’t condense this down to an atomic vector.
- How many unique values exist in each column?