4 Lesson 1c: Vectors

In the last module we started to explore how to use R as a calculator. This is great; however, we were only working with individual values. Often, we want to work on several values at once so we need a structure that will hold multiple pieces of data. We will discuss data structures more in the next module but in this lesson we introduce the vector, which is the fundamental data structure in R. Once you have a good grasp of working with vectors, then working with other data structures because much easier.

4.1 Learning objectives

By the end of this lesson you will be able to:

Create vectors.
Extract and replace elements within a vector.
Perform basic operations on a vector (i.e. compute the mean).
Work with missing data in a vector.
Explain and take advantage of vectorization.

4.2 Creating vectors

There are multiple ways to create vectors but the first one you’ll be introduced to is by using c(). The c() function is short for concatenate and we use it to join together a series of values and store them in a vector.

my_vec <- c(2, 3, 1, 6, 4, 3, 3, 7)

To examine the value of our new object we can simply type out the name of the object as we did before:

my_vec
## [1] 2 3 1 6 4 3 3 7

4.2.1 Creating sequences

Sometimes it can be useful to create a vector that contains a regular sequence of values in steps of one. Here we can make use of a shortcut using the : symbol.

my_seq <- 1:10     # create regular sequence
my_seq
##  [1]  1  2  3  4  5  6  7  8  9 10

my_seq2 <- 10:1    # in decending order
my_seq2
##  [1] 10  9  8  7  6  5  4  3  2  1

Other useful functions for generating vectors of sequences include the seq() and rep() functions. For example, to generate a sequence from 1 to 5 in steps of 0.5

my_seq2 <- seq(from = 1, to = 5, by = 0.5)
my_seq2
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Here we’ve used the arguments from = and to = to define the limits of the sequence and the by = argument to specify the increment of the sequence. Play around with other values for these arguments to see their effect.

The rep() function allows you to replicate (repeat) values a specified number of times. To repeat the value 2, 10 times

# repeats 2, 10 times
my_seq3 <- rep(2, times = 10)
my_seq3
##  [1] 2 2 2 2 2 2 2 2 2 2

You can also repeat non-numeric values

# repeats ‘abc’ 3 times
my_seq4 <- rep("abc", times = 3)     
my_seq4
## [1] "abc" "abc" "abc"

or each element of a series

# repeats the series 1 to 5, 3 times
my_seq5 <- rep(1:5, times = 3) 
my_seq5
##  [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5

or elements of a series

# repeats each element of the series 3 times
my_seq6 <- rep(1:5, each = 3)
my_seq6
##  [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5

We can also repeat a non-sequential series

# repeats each element of the series 3 times
my_seq7 <- rep(c(3, 1, 10, 7), each = 3)
my_seq7
##  [1]  3  3  3  1  1  1 10 10 10  7  7  7

Note in the code above how we’ve used the c() function inside the seq() function. Nesting functions allows us to build quite complex commands within a single line of code and is a very common practice when using R. However, care needs to be taken as too many nested functions can make your code quite difficult for others to understand (or yourself some time in the future!). We could rewrite the code above to explicitly separate the two different steps to generate our vector. Either approach will give the same result, you just need to use your own judgement as to which is more readable.

in_vec <- c(3, 1, 10, 7)
my_seq7 <- rep(in_vec, each = 3) 
my_seq7
##  [1]  3  3  3  1  1  1 10 10 10  7  7  7

4.2.2 Knowledge check

Use c() to create a vector called weight containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99.
Use the seq() function to create a sequence of numbers ranging from 0 to 1 in steps of 0.1.
Generate the following sequences. You will need to experiment with the arguments to the rep() function to generate these sequences:
- 1 2 3 1 2 3 1 2 3
- “a” “a” “a” “c” “c” “c” “e” “e” “e” “g” “g” “g”
- “a” “c” “e” “g” “a” “c” “e” “g” “a” “c” “e” “g”
- 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
- 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5

4.3 Extracting elements

To extract (also known as indexing) one or more values (more generally known as elements) from a vector we use the square bracket [ ] notation. The general approach is to name the object you wish to extract from, then a set of square brackets with an index of the element you wish to extract contained within the square brackets.

This index can be a position or the result of a logical test.

4.3.1 Positional indexing

# extract the 3rd value
my_vec[3]
## [1] 1

# if you want to assign this value in another object
val_3 <- my_vec[3]
val_3
## [1] 1

Note that the positional index starts at 1 rather than 0 like some other other programming languages (i.e. Python).

We can also extract more than one value by using the c() function inside the square brackets. Here we extract the 1^st, 5^th, 6^th and 8^th element from the my_vec object

my_vec[c(1, 5, 6, 8)]
## [1] 2 4 3 7

Or we can extract a range of values using the : notation. To extract the values from the 3^rd to the 8^th elements

my_vec[3:8]
## [1] 1 6 4 3 3 7

4.3.2 Logical indexing

Another really useful way to extract data from a vector is to use a logical expression as an index. For example, to extract all elements with a value greater than 4 in the vector my_vec

my_vec[my_vec > 4]
## [1] 6 7

Here, the logical expression is my_vec > 4 and R will only extract those elements that satisfy this logical condition. So how does this actually work? If we look at the output of just the logical expression without the square brackets you can see that R returns a vector containing either TRUE or FALSE which correspond to whether the logical condition is satisfied for each element. In this case only the 4^th and 8^th elements return a TRUE as their value is greater than 4.

my_vec > 4
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

So what R is actually doing under the hood is equivalent to

my_vec[c(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)]
## [1] 6 7

and only those element that are TRUE will be extracted.

In addition to the < and > operators you can also use composite operators to increase the complexity of your expressions. For example the expression for ‘greater or equal to’ is >=. To test whether a value is equal to a value we need to use a double equals symbol == and for ‘not equal to’ we use != (the ! symbol means ‘not’).

my_vec[my_vec >= 4]        # values greater or equal to 4
## [1] 6 4 7
my_vec[my_vec < 4]         # values less than 4
## [1] 2 3 1 3 3
my_vec[my_vec <= 4]        # values less than or equal to 4
## [1] 2 3 1 4 3 3
my_vec[my_vec == 4]        # values equal to 4
## [1] 4
my_vec[my_vec != 4]        # values not equal to 4
## [1] 2 3 1 6 3 3 7

We can also combine multiple logical expressions using Boolean expressions. In R the & symbol means AND and the | symbol means OR. For example, to extract values in my_vec which are less than 6 AND greater than 2

val26 <- my_vec[my_vec < 6 & my_vec > 2]
val26
## [1] 3 4 3 3

or extract values in my_vec that are greater than 6 OR less than 3

val63 <- my_vec[my_vec > 6 | my_vec < 3]
val63
## [1] 2 1 7

4.3.3 Knowledge check

Use c() to create a vector called weight containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99.
Extract the weights for the first and last child.
Extract the weights for the first five children.
Extract the weights for those children that weigh over 90lbs.

4.4 Replacing elements

We can change the values of some elements in a vector using our [ ] notation in combination with the assignment operator <-. For example, to replace the 4^th value of our my_vec object from 6 to 500

my_vec[4] <- 500
my_vec
## [1]   2   3   1 500   4   3   3   7

We can also replace more than one value or even replace values based on a logical expression

# replace the 6th and 7th element with 100
my_vec[c(6, 7)] <- 100
my_vec
## [1]   2   3   1 500   4 100 100   7

# replace element that are less than or equal to 4 with 1000
my_vec[my_vec <= 4] <- 1000
my_vec
## [1] 1000 1000 1000  500 1000  100  100    7

4.4.1 Knowledge check

Use c() to create a vector called weight containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99.
Say we made a mistake and the third child actually weighs 96, change the value for that element.
Say the minimum weight allowed to be recorded is 80lbs. Change the value for all elements that are less than 80 to be equal to 80.

4.5 Operations

We can use other functions to perform useful operations on vectors. For example, we can calculate the mean, variance, standard deviation and number of elements in our vector by using the mean(), var(), sd() and length() functions

mean(my_vec)    # returns the mean of my_vec
## [1] 588.38
var(my_vec)     # returns the variance of my_vec
## [1] 214367
sd(my_vec)      # returns the standard deviation of my_vec
## [1] 463
length(my_vec)  # returns the number of elements in my_vec
## [1] 8

If we wanted to use any of these values later on in our analysis we can just assign the resulting value to another object

vec_mean <- mean(my_vec)
vec_mean
## [1] 588.38

We can also sort values in our element:

# ascending order
sort(my_vec)
## [1]    7  100  100  500 1000 1000 1000 1000

# decending order
sort(my_vec, decreasing = TRUE)
## [1] 1000 1000 1000 1000  500  100  100    7

There are a lot operations that we can perform on vectors. As we progress through this course we’ll see many of these in action.

4.5.1 Knowledge check

Use c() to create a vector called weight containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99.
Identify functions that will compute the minimum and maximum values in this vector.
What is the mean, median, and standard deviation of child weights?
Apply the summary() function on this vector. What does this return?

4.6 Missing data

In R, missing data is usually represented by an NA symbol meaning ‘Not Available’. Data may be missing for a whole bunch of reasons, maybe your machine broke down, maybe the weather was too bad to collect data on a particular day, etc. Missing data can be a pain in the proverbial both from an R perspective and also a statistical perspective. From an R perspective missing data can be problematic as different functions deal with missing data in different ways. For example, let’s say we collected air temperature readings over 10 days, but our thermometer broke on day 2 and again on day 9 so we have no data for those days

temp  <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)
temp
##  [1] 7.2  NA 7.1 6.9 6.5 5.8 5.8 5.5  NA 5.5

We now want to calculate the mean temperature over these days using the mean() function

mean_temp <- mean(temp)
mean_temp
## [1] NA

Flippin heck, what’s happened here? Why does the mean() function return an NA? Actually, R is doing something very sensible (at least in our opinion!). If a vector has a missing value then the only possible value to return when calculating a mean is NA. R doesn’t know that you perhaps want to ignore the NA values. Happily, if we look at the help file via help("mean") we can see there is an argument na.rm = which is set to FALSE by default.

Most statistical operators will have an na.rm parameter that takes a TRUE or FALSE argument indicating whether NA values should be stripped before the computation proceeds.

If we change this argument to na.rm = TRUE when we use the mean() function this will allow us to ignore the NA values when calculating the mean

mean_temp <- mean(temp, na.rm = TRUE)
mean_temp
## [1] 6.2875

It’s important to note that the NA values have not been removed from our temp object (that would be bad practice), rather the mean() function has just ignored them. The point of the above is to highlight how we can change the default behaviour of a function using an appropriate argument.

4.6.1 Knowledge check

Use c() to create a vector called weight containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, NA, 79, 92, 70, 99.
Compute the min, max, mean, median, and standard deviation of child weights?
Apply the summary() function on this vector. How does this differ from before?

4.7 Vectorization

4.7.1 Looping versus Vectorization

A key difference between R and many other languages is a topic known as vectorization. What does this mean? It means that many functions that are to be applied individually to each element in a vector of numbers require a loop assessment to evaluate; however, in R many of these functions have been coded in C to perform much faster than a for loop would perform. For example, let’s say you want to add the elements of two separate vectors of numbers (x and y).

x <- c(1, 3, 4)
y <- c(1, 2, 4)

x
## [1] 1 3 4
y
## [1] 1 2 4

In other languages you might have to run a loop to add two vectors together. In this for loop I print each iteration to show that the loop calculates the sum for the first elements in each vector, then performs the sum for the second elements, etc.

Don’t worry if you don’t understand each piece of this code. This is just for illustration purposes! You will learn about looping procedures in a later module.

# empty vector to store results
z <- as.vector(NULL)

# `for` loop to add corresponding elements in each vector
for (i in seq_along(x)) {
        z[i] <- x[i] + y[i]
        print(z)
}
## [1] 2
## [1] 2 5
## [1] 2 5 8

Instead, in R, + is a vectorized function which can operate on entire vectors at once. So rather than creating for loops for many functions, you can just use simple syntax to perform element-wise operations with both vectors:

# add each element in x and y
x + y
## [1] 2 5 8

# multiply each element in x and y
x * y
## [1]  1  6 16

# compare each element in x to y
x > y
## [1] FALSE  TRUE FALSE

4.7.2 Recycling

When performing vector operations in R, it is important to know about recycling. When performing an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:

long <- 1:10
short <- 1:5

long
##  [1]  1  2  3  4  5  6  7  8  9 10
short
## [1] 1 2 3 4 5

# R will recycle (reuse) the short vector until it reaches 
# the end of the long vector
long + short
##  [1]  2  4  6  8 10  7  9 11 13 15

The elements of long and short are added together starting from the first element of both vectors. When R reaches the end of the short vector, it starts again at the first element of short and continues until it reaches the last element of the long vector. This functionality is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector long by 3:

long <- 1:10
c <- 3

long * c
##  [1]  3  6  9 12 15 18 21 24 27 30

There are no scalars in R, so c is actually a vector of length 1; in order to add its value to every element of long, it is recycled to match the length of long.

Don’t get hung up with some of the verbiage used here (i.e. vectors vs. scalars), we will cover what this means in later a module.

When the length of the longer object is a multiple of the shorter object length, the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:

even_length <- 1:10
odd_length <- 1:3

even_length + odd_length
## Warning in even_length + odd_length: longer object length is not a
## multiple of shorter object length
##  [1]  2  4  6  5  7  9  8 10 12 11

4.7.3 Knowledge check

Create this vector my_vec <- 1:10.
Add 1 to every element in my_vec.
Divide every element in my_vec by 2.
Create a second vector my_vec2 <- 10:18 and add my_vec to my_vec2.

4.8 Exercises

Create a vector called weight containing the weight (in kg) of 10 children: 69, 62, 57, 59, 59, 64, 56, 66, 67, 66.
Create a vector called height containing the height (in cm) of the same 10 children: 112, 102, 83, 84, 99, 90, 77, 112, 133, 112.
Use the summary() function to summarize these data.
Extract the height of the 2nd, 3rd, 9th and 10th child and assign these heights to a variable called some_child.
Also extract all the heights of children less than or equal to 99 cm and assign to a variable called shorter_child.
Use the information in your weight and height variables to calculate the body mass index (BMI) for each child. The BMI is calculated as weight (in kg) divided by the square of the height (in meters). Store the results of this calculation in a variable called bmi. Note: you don’t need to do this calculation for each child individually, you can leverage vectorization to do this!