4 Lesson 1c: Vectors
In the last module we started to explore how to use R as a calculator. This is great; however, we were only working with individual values. Often, we want to work on several values at once so we need a structure that will hold multiple pieces of data. We will discuss data structures more in the next module but in this lesson we introduce the vector, which is the fundamental data structure in R. Once you have a good grasp of working with vectors, then working with other data structures because much easier.
4.1 Learning objectives
By the end of this lesson you will be able to:
- Create vectors.
- Extract and replace elements within a vector.
- Perform basic operations on a vector (i.e. compute the mean).
- Work with missing data in a vector.
- Explain and take advantage of vectorization.
4.2 Creating vectors
There are multiple ways to create vectors but the first one you’ll be introduced to is by using c()
. The c()
function is short for concatenate and we use it to join together a series of values and store them in a vector.
To examine the value of our new object we can simply type out the name of the object as we did before:
4.2.1 Creating sequences
Sometimes it can be useful to create a vector that contains a regular sequence of values in steps of one. Here we can make use of a shortcut using the :
symbol.
my_seq <- 1:10 # create regular sequence
my_seq
## [1] 1 2 3 4 5 6 7 8 9 10
my_seq2 <- 10:1 # in decending order
my_seq2
## [1] 10 9 8 7 6 5 4 3 2 1
Other useful functions for generating vectors of sequences include the seq()
and rep()
functions. For example, to generate a sequence from 1 to 5 in steps of 0.5
Here we’ve used the arguments from =
and to =
to define the limits of the sequence and the by =
argument to specify the increment of the sequence. Play around with other values for these arguments to see their effect.
The rep()
function allows you to replicate (repeat) values a specified number of times. To repeat the value 2, 10 times
You can also repeat non-numeric values
or each element of a series
# repeats the series 1 to 5, 3 times
my_seq5 <- rep(1:5, times = 3)
my_seq5
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
or elements of a series
# repeats each element of the series 3 times
my_seq6 <- rep(1:5, each = 3)
my_seq6
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
We can also repeat a non-sequential series
# repeats each element of the series 3 times
my_seq7 <- rep(c(3, 1, 10, 7), each = 3)
my_seq7
## [1] 3 3 3 1 1 1 10 10 10 7 7 7
Note in the code above how we’ve used the c()
function inside the seq()
function. Nesting functions allows us to build quite complex commands within a single line of code and is a very common practice when using R. However, care needs to be taken as too many nested functions can make your code quite difficult for others to understand (or yourself some time in the future!). We could rewrite the code above to explicitly separate the two different steps to generate our vector. Either approach will give the same result, you just need to use your own judgement as to which is more readable.
4.2.2 Knowledge check
-
Use
c()
to create a vector calledweight
containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99. -
Use the
seq()
function to create a sequence of numbers ranging from 0 to 1 in steps of 0.1. -
Generate the following sequences. You will need to experiment with
the arguments to the
rep()
function to generate these sequences:- 1 2 3 1 2 3 1 2 3
- “a” “a” “a” “c” “c” “c” “e” “e” “e” “g” “g” “g”
- “a” “c” “e” “g” “a” “c” “e” “g” “a” “c” “e” “g”
- 1 1 1 2 2 2 3 3 3 1 1 1 2 2 2 3 3 3
- 1 1 1 1 1 2 2 2 2 3 3 3 4 4 5
4.3 Extracting elements
To extract (also known as indexing) one or more values (more generally known as elements) from a vector we use the square bracket [ ]
notation. The general approach is to name the object you wish to extract from, then a set of square brackets with an index of the element you wish to extract contained within the square brackets.
This index can be a position or the result of a logical test.
4.3.1 Positional indexing
# extract the 3rd value
my_vec[3]
## [1] 1
# if you want to assign this value in another object
val_3 <- my_vec[3]
val_3
## [1] 1
Note that the positional index starts at 1 rather than 0 like some other other programming languages (i.e. Python).
We can also extract more than one value by using the c()
function inside the square brackets. Here we extract the 1st, 5th, 6th and 8th element from the my_vec
object
Or we can extract a range of values using the :
notation. To extract the values from the 3rd to the 8th elements
4.3.2 Logical indexing
Another really useful way to extract data from a vector is to use a logical expression as an index. For example, to extract all elements with a value greater than 4 in the vector my_vec
Here, the logical expression is my_vec > 4
and R will only extract those elements that satisfy this logical condition. So how does this actually work? If we look at the output of just the logical expression without the square brackets you can see that R returns a vector containing either TRUE
or FALSE
which correspond to whether the logical condition is satisfied for each element. In this case only the 4th and 8th elements return a TRUE
as their value is greater than 4.
So what R is actually doing under the hood is equivalent to
and only those element that are TRUE
will be extracted.
In addition to the <
and >
operators you can also use composite operators to increase the complexity of your expressions. For example the expression for ‘greater or equal to’ is >=
. To test whether a value is equal to a value we need to use a double equals symbol ==
and for ‘not equal to’ we use !=
(the !
symbol means ‘not’).
my_vec[my_vec >= 4] # values greater or equal to 4
## [1] 6 4 7
my_vec[my_vec < 4] # values less than 4
## [1] 2 3 1 3 3
my_vec[my_vec <= 4] # values less than or equal to 4
## [1] 2 3 1 4 3 3
my_vec[my_vec == 4] # values equal to 4
## [1] 4
my_vec[my_vec != 4] # values not equal to 4
## [1] 2 3 1 6 3 3 7
We can also combine multiple logical expressions using Boolean expressions. In R the &
symbol means AND and the |
symbol means OR. For example, to extract values in my_vec
which are less than 6 AND greater than 2
or extract values in my_vec
that are greater than 6 OR less than 3
4.3.3 Knowledge check
-
Use
c()
to create a vector calledweight
containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99. - Extract the weights for the first and last child.
- Extract the weights for the first five children.
- Extract the weights for those children that weigh over 90lbs.
4.4 Replacing elements
We can change the values of some elements in a vector using our [ ]
notation in combination with the assignment operator <-
. For example, to replace the 4th value of our my_vec
object from 6
to 500
We can also replace more than one value or even replace values based on a logical expression
# replace the 6th and 7th element with 100
my_vec[c(6, 7)] <- 100
my_vec
## [1] 2 3 1 500 4 100 100 7
# replace element that are less than or equal to 4 with 1000
my_vec[my_vec <= 4] <- 1000
my_vec
## [1] 1000 1000 1000 500 1000 100 100 7
4.4.1 Knowledge check
-
Use
c()
to create a vector calledweight
containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99. - Say we made a mistake and the third child actually weighs 96, change the value for that element.
- Say the minimum weight allowed to be recorded is 80lbs. Change the value for all elements that are less than 80 to be equal to 80.
4.5 Operations
We can use other functions to perform useful operations on vectors. For example, we can calculate the mean, variance, standard deviation and number of elements in our vector by using the mean()
, var()
, sd()
and length()
functions
mean(my_vec) # returns the mean of my_vec
## [1] 588.38
var(my_vec) # returns the variance of my_vec
## [1] 214367
sd(my_vec) # returns the standard deviation of my_vec
## [1] 463
length(my_vec) # returns the number of elements in my_vec
## [1] 8
If we wanted to use any of these values later on in our analysis we can just assign the resulting value to another object
We can also sort values in our element:
# ascending order
sort(my_vec)
## [1] 7 100 100 500 1000 1000 1000 1000
# decending order
sort(my_vec, decreasing = TRUE)
## [1] 1000 1000 1000 1000 500 100 100 7
There are a lot operations that we can perform on vectors. As we progress through this course we’ll see many of these in action.
4.5.1 Knowledge check
-
Use
c()
to create a vector calledweight
containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, 87, 79, 92, 70, 99. - Identify functions that will compute the minimum and maximum values in this vector.
- What is the mean, median, and standard deviation of child weights?
-
Apply the
summary()
function on this vector. What does this return?
4.6 Missing data
In R, missing data is usually represented by an NA
symbol meaning ‘Not Available’. Data may be missing for a whole bunch of reasons, maybe your machine broke down, maybe the weather was too bad to collect data on a particular day, etc. Missing data can be a pain in the proverbial both from an R perspective and also a statistical perspective. From an R perspective missing data can be problematic as different functions deal with missing data in different ways. For example, let’s say we collected air temperature readings over 10 days, but our thermometer broke on day 2 and again on day 9 so we have no data for those days
temp <- c(7.2, NA, 7.1, 6.9, 6.5, 5.8, 5.8, 5.5, NA, 5.5)
temp
## [1] 7.2 NA 7.1 6.9 6.5 5.8 5.8 5.5 NA 5.5
We now want to calculate the mean temperature over these days using the mean()
function
Flippin heck, what’s happened here? Why does the mean()
function return an NA
? Actually, R is doing something very sensible (at least in our opinion!). If a vector has a missing value then the only possible value to return when calculating a mean is NA
. R doesn’t know that you perhaps want to ignore the NA
values. Happily, if we look at the help file via help("mean")
we can see there is an argument na.rm =
which is set to FALSE
by default.
Most statistical operators will have an na.rm
parameter
that takes a TRUE
or FALSE
argument indicating
whether NA values should be stripped before the computation
proceeds.
If we change this argument to na.rm = TRUE
when we use the mean()
function this will allow us to ignore the NA
values when calculating the mean
It’s important to note that the NA
values have not been removed from our temp
object (that would be bad practice), rather the mean()
function has just ignored them. The point of the above is to highlight how we can change the default behaviour of a function using an appropriate argument.
4.6.1 Knowledge check
-
Use
c()
to create a vector calledweight
containing the weight (in lbs) of 10 children: 75, 95, 92, 89, 101, NA, 79, 92, 70, 99. - Compute the min, max, mean, median, and standard deviation of child weights?
-
Apply the
summary()
function on this vector. How does this differ from before?
4.7 Vectorization
4.7.1 Looping versus Vectorization
A key difference between R and many other languages is a topic known as vectorization. What does this mean? It means that many functions that are to be applied individually to each element in a vector of numbers require a loop assessment to evaluate; however, in R many of these functions have been coded in C to perform much faster than a for loop would perform. For example, let’s say you want to add the elements of two separate vectors of numbers (x
and y
).
In other languages you might have to run a loop to add two vectors together. In this for
loop I print each iteration to show that the loop calculates the sum for the first elements in each vector, then performs the sum for the second elements, etc.
Don’t worry if you don’t understand each piece of this code. This is just for illustration purposes! You will learn about looping procedures in a later module.
# empty vector to store results
z <- as.vector(NULL)
# `for` loop to add corresponding elements in each vector
for (i in seq_along(x)) {
z[i] <- x[i] + y[i]
print(z)
}
## [1] 2
## [1] 2 5
## [1] 2 5 8
Instead, in R, +
is a vectorized function which can operate on entire vectors at once. So rather than creating for
loops for many functions, you can just use simple syntax to perform element-wise operations with both vectors:
4.7.2 Recycling
When performing vector operations in R, it is important to know about recycling. When performing an operation on two or more vectors of unequal length, R will recycle elements of the shorter vector(s) to match the longest vector. For example:
long <- 1:10
short <- 1:5
long
## [1] 1 2 3 4 5 6 7 8 9 10
short
## [1] 1 2 3 4 5
# R will recycle (reuse) the short vector until it reaches
# the end of the long vector
long + short
## [1] 2 4 6 8 10 7 9 11 13 15
The elements of long
and short
are added together starting from the first element of both vectors. When R reaches the end of the short vector, it starts again at the first element of short
and continues until it reaches the last element of the long
vector. This functionality is very useful when you want to perform the same operation on every element of a vector. For example, say we want to multiply every element of our vector long by 3:
There are no scalars in R, so c
is actually a vector of length 1; in order to add its value to every element of long
, it is recycled to match the length of long
.
Don’t get hung up with some of the verbiage used here (i.e. vectors vs. scalars), we will cover what this means in later a module.
When the length of the longer object is a multiple of the shorter object length, the recycling occurs silently. When the longer object length is not a multiple of the shorter object length, a warning is given:
even_length <- 1:10
odd_length <- 1:3
even_length + odd_length
## Warning in even_length + odd_length: longer object length is not a
## multiple of shorter object length
## [1] 2 4 6 5 7 9 8 10 12 11
4.8 Exercises
- Create a vector called weight containing the weight (in kg) of 10 children: 69, 62, 57, 59, 59, 64, 56, 66, 67, 66.
- Create a vector called height containing the height (in cm) of the same 10 children: 112, 102, 83, 84, 99, 90, 77, 112, 133, 112.
-
Use the
summary()
function to summarize these data. -
Extract the height of the 2nd, 3rd, 9th and 10th child and assign
these heights to a variable called
some_child
. -
Also extract all the heights of children less than or equal to 99 cm
and assign to a variable called
shorter_child
. -
Use the information in your weight and height variables to calculate
the body mass index (BMI) for each child. The BMI is calculated as
weight (in kg) divided by the square of the height (in meters). Store
the results of this calculation in a variable called
bmi
. Note: you don’t need to do this calculation for each child individually, you can leverage vectorization to do this!