15 Lesson 4b: Handling text data

Dealing with character strings is often under-emphasized in data analysis training. The focus typically remains on numeric values; however, the growth in data collection is also resulting in greater bits of information embedded in text. Consequently, handling, cleaning and processing character strings is becoming a prerequisite in daily data analysis. This lesson is meant to give you the foundation of working with character strings.

15.1 Learning objectives

By the end of this lesson you’ll be able to:

  • Perform basic character string manipulations.
  • Use regular expressions to identify and manipulate patterns in character strings.

15.2 Prerequisites

The following packages will be used throughout this lesson. Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr for text manipulation and we’ll also combine this with the use of dplyr for data frame manipulation.

library(dplyr)
library(stringr)

For data, we’ll leverage the completejourney data. For example, the products data within completejourney provides various variables (e.g. product_category and product_type) that contain character strings we may need to clean, normalize, or identify patterns within.

library(completejourney)

products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand    product_category  
##    <chr>      <chr>           <chr>         <fct>    <chr>             
##  1 25671      2               GROCERY       National FRZN ICE          
##  2 26081      2               MISCELLANEOUS National <NA>              
##  3 26093      69              PASTRY        Private  BREAD             
##  4 26190      69              GROCERY       Private  FRUIT - SHELF STA…
##  5 26355      69              GROCERY       Private  COOKIES/CONES     
##  6 26426      69              GROCERY       Private  SPICES & EXTRACTS 
##  7 26540      69              GROCERY       Private  COOKIES/CONES     
##  8 26601      69              DRUG GM       Private  VITAMINS          
##  9 26636      69              PASTRY        Private  BREAKFAST SWEETS  
## 10 26691      16              GROCERY       Private  PNT BTR/JELLY/JAMS
## # ℹ 92,321 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>


15.3 String basics

Basic string manipulation typically includes case conversion, counting characters, and extracting parts of a string. These operations can all be performed with base R functions; however, some operations (or at least their syntax) are greatly simplified with the stringr package. Moreover, stringr is typically very useful when combined with dplyr data manipulation functions.

15.3.1 Case conversion

To change the case of a character string we can use str_to_xxx:

# we'll focus on the product_category column
head(products$product_category)
## [1] "FRZN ICE"             NA                    
## [3] "BREAD"                "FRUIT - SHELF STABLE"
## [5] "COOKIES/CONES"        "SPICES & EXTRACTS"

# force everything to lower case
str_to_lower(products$product_category) %>% head()
## [1] "frzn ice"             NA                    
## [3] "bread"                "fruit - shelf stable"
## [5] "cookies/cones"        "spices & extracts"

# force everything to upper case
str_to_upper(products$product_category) %>% head()
## [1] "FRZN ICE"             NA                    
## [3] "BREAD"                "FRUIT - SHELF STABLE"
## [5] "COOKIES/CONES"        "SPICES & EXTRACTS"

Typically we want to normalize text by converting everything to lower case but keep the data in our original data frame. We can combine str_ functions with dplyr::mutate to accomplish this:

products %>% mutate(product_category = str_to_lower(product_category))
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand    product_category  
##    <chr>      <chr>           <chr>         <fct>    <chr>             
##  1 25671      2               GROCERY       National frzn ice          
##  2 26081      2               MISCELLANEOUS National <NA>              
##  3 26093      69              PASTRY        Private  bread             
##  4 26190      69              GROCERY       Private  fruit - shelf sta…
##  5 26355      69              GROCERY       Private  cookies/cones     
##  6 26426      69              GROCERY       Private  spices & extracts 
##  7 26540      69              GROCERY       Private  cookies/cones     
##  8 26601      69              DRUG GM       Private  vitamins          
##  9 26636      69              PASTRY        Private  breakfast sweets  
## 10 26691      16              GROCERY       Private  pnt btr/jelly/jams
## # ℹ 92,321 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

When we apply mutate() to an existing column we do not add a new column, we simply overwrite the existing column.

15.3.2 Counting characters

There may be times where we are interested in how many characters exist in a string. Whether you need to extract this information or filter for observations that exceed a certain character length you can use str_count to accomplish this:

# count number of characters in product_category
str_count(products$product_category) %>% head()
## [1]  8 NA  5 20 13 17

# create a new variable with character length
products %>% 
  mutate(category_length = str_count(product_category)) %>%
  select(product_category, category_length)
## # A tibble: 92,331 × 2
##    product_category     category_length
##    <chr>                          <int>
##  1 FRZN ICE                           8
##  2 <NA>                              NA
##  3 BREAD                              5
##  4 FRUIT - SHELF STABLE              20
##  5 COOKIES/CONES                     13
##  6 SPICES & EXTRACTS                 17
##  7 COOKIES/CONES                     13
##  8 VITAMINS                           8
##  9 BREAKFAST SWEETS                  16
## 10 PNT BTR/JELLY/JAMS                18
## # ℹ 92,321 more rows

# filter for product categories that are greater than the mean character length
products %>% 
   filter(str_count(product_category) > mean(str_count(product_category), na.rm = TRUE))
## # A tibble: 45,192 × 7
##    product_id manufacturer_id department brand   product_category      
##    <chr>      <chr>           <chr>      <fct>   <chr>                 
##  1 26190      69              GROCERY    Private FRUIT - SHELF STABLE  
##  2 26426      69              GROCERY    Private SPICES & EXTRACTS     
##  3 26636      69              PASTRY     Private BREAKFAST SWEETS      
##  4 26691      16              GROCERY    Private PNT BTR/JELLY/JAMS    
##  5 26738      69              GROCERY    Private ICE CREAM/MILK/SHERBTS
##  6 26941      69              GROCERY    Private ICE CREAM/MILK/SHERBTS
##  7 27030      69              GROCERY    Private ICE CREAM/MILK/SHERBTS
##  8 27152      69              GROCERY    Private SPICES & EXTRACTS     
##  9 27158      69              GROCERY    Private ICE CREAM/MILK/SHERBTS
## 10 27334      69              DRUG GM    Private DIETARY AID PRODUCTS  
## # ℹ 45,182 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

15.3.3 Extracting parts of character strings

If you need to extract parts of a character string you can use str_sub. By default start = 1 and end = 3 will tell the function to extract the first through third characters. If you want to start from the right side of the string then negative numbers such as start = -1 and end = -5 will tell the function to extract the last five characters.

# extract parts of a string (example: extract first three characters)
str_sub(products$product_category, start = 1, end = 3) %>% head()
## [1] "FRZ" NA    "BRE" "FRU" "COO" "SPI"

# create new variable for last five characters
products %>%
  select(product_category) %>%
  mutate(last_five = str_sub(product_category, start = -5, end = -1))
## # A tibble: 92,331 × 2
##    product_category     last_five
##    <chr>                <chr>    
##  1 FRZN ICE             N ICE    
##  2 <NA>                 <NA>     
##  3 BREAD                BREAD    
##  4 FRUIT - SHELF STABLE TABLE    
##  5 COOKIES/CONES        CONES    
##  6 SPICES & EXTRACTS    RACTS    
##  7 COOKIES/CONES        CONES    
##  8 VITAMINS             AMINS    
##  9 BREAKFAST SWEETS     WEETS    
## 10 PNT BTR/JELLY/JAMS   /JAMS    
## # ℹ 92,321 more rows


15.3.4 Knowledge check

  1. Using product_type, which product has the longest description? How about the shortest?
  2. Using product_id, get all products that start with “222”.


15.4 Regular expressions

A regular expression (aka regex) is a sequence of characters that define a search pattern, mainly for use in pattern matching with text strings. Typically, regex patterns consist of a combination of alphanumeric characters as well as special characters. The pattern can also be as simple as a single character or it can be more complex and include several characters.

To understand how to work with regular expressions in R, we need to consider two primary features of regular expressions. One has to do with the syntax, or the way regex patterns are expressed in R. The other has to do with the functions used for regex matching in R. You will be exposed to both of these in the following sections.

15.4.1 Regex basics

The stringr package provides us a convenient approach to regular expressions. The common function call will consist of

str_*(string, pattern)

where:

  • str_* represents a wide variety of regex functions depending on what you want to do,
  • string represents the character string of concern (i.e. products$product_category),
  • pattern represents the regex pattern you are looking to match.

For example let’s say you are looking for observations where the word “FRUIT” was used in the product_category description. str_detect detects the presence or absence of a pattern (“FRUIT” in this example) and returns a logical vector. Since the output is TRUE or FALSE, this is a handy function to combine with filter to filter for observations that have that pattern.

# detect if the word "FRUIT" is used in each product category description
str_detect(products$product_category, "FRUIT") %>% head()
## [1] FALSE    NA FALSE  TRUE FALSE FALSE

# filter for those observations that include the use of "FRUIT"
products %>% filter(str_detect(product_category, "FRUIT"))
## # A tibble: 1,153 × 7
##    product_id manufacturer_id department brand    product_category     
##    <chr>      <chr>           <chr>      <fct>    <chr>                
##  1 26190      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  2 27503      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  3 27960      69              GROCERY    Private  FRZN FRUITS          
##  4 33048      5820            PRODUCE    National ORGANICS FRUIT & VEG…
##  5 37543      876             NUTRITION  National DRIED FRUIT          
##  6 44599      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  7 45218      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  8 45607      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  9 49748      69              GROCERY    Private  FRUIT - SHELF STABLE 
## 10 49930      2               PRODUCE    National STONE FRUIT          
## # ℹ 1,143 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

There are wide variety of str_ functions (check them out by typing str_ + tab in the console). For example, check out the following functions that all look for the pattern “FRZN” but provide different outputs:

# count number of instances in each observation
str_count(products$product_category, "FRZN") %>% head()
## [1]  1 NA  0  0  0  0

# extract the first instance in each observation (NA results when no instance exists)
str_extract(products$product_category, "FRZN") %>% head()
## [1] "FRZN" NA     NA     NA     NA     NA

# locate the start/stop position of the first instance in each observation
str_locate(products$product_category, "FRZN") %>% head()
##      start end
## [1,]     1   4
## [2,]    NA  NA
## [3,]    NA  NA
## [4,]    NA  NA
## [5,]    NA  NA
## [6,]    NA  NA

# replace all instances of the word "FRZN" with "FROZEN"
str_replace_all(products$product_category, pattern = "FRZN", replacement = "FROZEN") %>% head()
## [1] "FROZEN ICE"           NA                    
## [3] "BREAD"                "FRUIT - SHELF STABLE"
## [5] "COOKIES/CONES"        "SPICES & EXTRACTS"

The patterns used in the str_ functions are case sensitive but you can use regex to control options (i.e. str_count(products$product_category, regex(“frzn”, ignore_case = TRUE))).

15.4.2 Multiple Words

In the previous section you saw that you can search for a given word using any of the str_ functions. We can build onto this and search for multiple words. For example, you can search for multiple phrases to identify seasonal products. Say you wanted to identify all products targeting summer or fall, rather than search for these words in separate str_detect() calls…

# filter for summer or fall items
products %>% 
   filter(
      str_detect(product_category, regex("summer", ignore_case = TRUE)) |
      str_detect(product_category, regex("fall", ignore_case = TRUE))   
   )
## # A tibble: 651 × 7
##    product_id manufacturer_id department brand    product_category     
##    <chr>      <chr>           <chr>      <fct>    <chr>                
##  1 37073      69              DRUG GM    Private  SPRING/SUMMER SEASON…
##  2 91614      1277            DRUG GM    National SPRING/SUMMER SEASON…
##  3 99090      1277            DRUG GM    National SPRING/SUMMER SEASON…
##  4 99172      1277            DRUG GM    National SPRING/SUMMER SEASON…
##  5 106668     1277            DRUG GM    National SPRING/SUMMER SEASON…
##  6 182402     2160            DRUG GM    National SPRING/SUMMER SEASON…
##  7 276046     1277            DRUG GM    National SPRING/SUMMER SEASON…
##  8 461689     5759            DRUG GM    National SPRING/SUMMER SEASON…
##  9 504050     1277            DRUG GM    National SPRING/SUMMER SEASON…
## 10 555735     1277            DRUG GM    National SPRING/SUMMER SEASON…
## # ℹ 641 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

We can simply search for observations that include the words “summer” or “fall” within a single str_detect() call by using | between the search terms, which is equivalent to an or statement.

products %>% 
   filter(str_detect(product_category, regex("summer|fall", ignore_case = TRUE)))
## # A tibble: 651 × 7
##    product_id manufacturer_id department brand    product_category     
##    <chr>      <chr>           <chr>      <fct>    <chr>                
##  1 37073      69              DRUG GM    Private  SPRING/SUMMER SEASON…
##  2 91614      1277            DRUG GM    National SPRING/SUMMER SEASON…
##  3 99090      1277            DRUG GM    National SPRING/SUMMER SEASON…
##  4 99172      1277            DRUG GM    National SPRING/SUMMER SEASON…
##  5 106668     1277            DRUG GM    National SPRING/SUMMER SEASON…
##  6 182402     2160            DRUG GM    National SPRING/SUMMER SEASON…
##  7 276046     1277            DRUG GM    National SPRING/SUMMER SEASON…
##  8 461689     5759            DRUG GM    National SPRING/SUMMER SEASON…
##  9 504050     1277            DRUG GM    National SPRING/SUMMER SEASON…
## 10 555735     1277            DRUG GM    National SPRING/SUMMER SEASON…
## # ℹ 641 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

15.4.3 Line anchors

Line anchors are used to identify patterns at the beginning or end of an element. To find a pattern at the beginning of the element we use ^ and a pattern at the end of the element is found with $. For example, if you wanted to find any observations with the word “fruit” in the product_category column we can use the following:

# contains "juice"
products %>% 
   filter(str_detect(product_category, regex("fruit", ignore_case = TRUE)))
## # A tibble: 1,153 × 7
##    product_id manufacturer_id department brand    product_category     
##    <chr>      <chr>           <chr>      <fct>    <chr>                
##  1 26190      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  2 27503      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  3 27960      69              GROCERY    Private  FRZN FRUITS          
##  4 33048      5820            PRODUCE    National ORGANICS FRUIT & VEG…
##  5 37543      876             NUTRITION  National DRIED FRUIT          
##  6 44599      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  7 45218      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  8 45607      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  9 49748      69              GROCERY    Private  FRUIT - SHELF STABLE 
## 10 49930      2               PRODUCE    National STONE FRUIT          
## # ℹ 1,143 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

However, if we only wanted those products where the category starts with “fruit” than we can use the ^ anchor:

# starts with "fruit"
products %>% 
   filter(str_detect(product_category, regex("^fruit", ignore_case = TRUE)))
## # A tibble: 396 × 7
##    product_id manufacturer_id department brand    product_category    
##    <chr>      <chr>           <chr>      <fct>    <chr>               
##  1 26190      69              GROCERY    Private  FRUIT - SHELF STABLE
##  2 27503      69              GROCERY    Private  FRUIT - SHELF STABLE
##  3 44599      69              GROCERY    Private  FRUIT - SHELF STABLE
##  4 45218      69              GROCERY    Private  FRUIT - SHELF STABLE
##  5 45607      69              GROCERY    Private  FRUIT - SHELF STABLE
##  6 49748      69              GROCERY    Private  FRUIT - SHELF STABLE
##  7 52240      69              GROCERY    Private  FRUIT - SHELF STABLE
##  8 53854      170             GROCERY    National FRUIT - SHELF STABLE
##  9 54776      69              GROCERY    Private  FRUIT - SHELF STABLE
## 10 58782      170             GROCERY    National FRUIT - SHELF STABLE
## # ℹ 386 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

Alternatively, if we only wanted those products where the category ends with “fruit” than we can use the $ anchor:

# starts with "fruit"
products %>% 
   filter(str_detect(product_category, regex("fruit$", ignore_case = TRUE)))
## # A tibble: 398 × 7
##    product_id manufacturer_id department brand    product_category
##    <chr>      <chr>           <chr>      <fct>    <chr>           
##  1 37543      876             NUTRITION  National DRIED FRUIT     
##  2 49930      2               PRODUCE    National STONE FRUIT     
##  3 54460      69              NUTRITION  Private  DRIED FRUIT     
##  4 64583      2               PRODUCE    National TROPICAL FRUIT  
##  5 71432      876             NUTRITION  National DRIED FRUIT     
##  6 71792      876             NUTRITION  National DRIED FRUIT     
##  7 78306      612             NUTRITION  National DRIED FRUIT     
##  8 84804      612             NUTRITION  National DRIED FRUIT     
##  9 119875     1587            NUTRITION  National DRIED FRUIT     
## 10 136217     2               PRODUCE    National STONE FRUIT     
## # ℹ 388 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

And we can combine the two if we only wanted those products where the category starts or ends with “fruit”:

# starts with "fruit"
products %>% 
   filter(str_detect(product_category, regex("^fruit|fruit$", ignore_case = TRUE)))
## # A tibble: 794 × 7
##    product_id manufacturer_id department brand    product_category    
##    <chr>      <chr>           <chr>      <fct>    <chr>               
##  1 26190      69              GROCERY    Private  FRUIT - SHELF STABLE
##  2 27503      69              GROCERY    Private  FRUIT - SHELF STABLE
##  3 37543      876             NUTRITION  National DRIED FRUIT         
##  4 44599      69              GROCERY    Private  FRUIT - SHELF STABLE
##  5 45218      69              GROCERY    Private  FRUIT - SHELF STABLE
##  6 45607      69              GROCERY    Private  FRUIT - SHELF STABLE
##  7 49748      69              GROCERY    Private  FRUIT - SHELF STABLE
##  8 49930      2               PRODUCE    National STONE FRUIT         
##  9 52240      69              GROCERY    Private  FRUIT - SHELF STABLE
## 10 53854      170             GROCERY    National FRUIT - SHELF STABLE
## # ℹ 784 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

15.4.4 Metacharacters

Metacharacters consist of non-alphanumeric symbols such as:

.    \    |    (    )    [    {    $    *    +   ?

To match metacharacters in regex you need to escape. In R, we escape them with a double backslash “\”. The following displays the general escape syntax for the most common metacharacters:

Escaping metacharacters.
Metacharacter Literal Meaning Escape Syntax
. period or dot \\.
$ dollar sign \\$
* asterisk \\*
+ plus sign \\+
? question mark \\?
| vertical bar \\\|
^ caret \\^
[ square bracket \\[
{ curly brace \\{
( parenthesis \\(

The reason we need to escape these characters is because most of these actually have meaning when declaring regular expressions. For example, say we wanted to identify any product_category that contains a period (“.”). If we simply use the following we actually get all our records back.

This is because in regex expressions “.” is used to match any character.

products %>% 
    filter(str_detect(product_category, regex(".", ignore_case = TRUE)))
## # A tibble: 91,791 × 7
##    product_id manufacturer_id department brand    product_category     
##    <chr>      <chr>           <chr>      <fct>    <chr>                
##  1 25671      2               GROCERY    National FRZN ICE             
##  2 26093      69              PASTRY     Private  BREAD                
##  3 26190      69              GROCERY    Private  FRUIT - SHELF STABLE 
##  4 26355      69              GROCERY    Private  COOKIES/CONES        
##  5 26426      69              GROCERY    Private  SPICES & EXTRACTS    
##  6 26540      69              GROCERY    Private  COOKIES/CONES        
##  7 26601      69              DRUG GM    Private  VITAMINS             
##  8 26636      69              PASTRY     Private  BREAKFAST SWEETS     
##  9 26691      16              GROCERY    Private  PNT BTR/JELLY/JAMS   
## 10 26738      69              GROCERY    Private  ICE CREAM/MILK/SHERB…
## # ℹ 91,781 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

So, we need to use an escape (“\”) to tell the regular expression you want to match a literal metacharacter.

products %>% 
    filter(str_detect(product_category, regex("\\.", ignore_case = TRUE)))
## # A tibble: 247 × 7
##    product_id manufacturer_id department brand    product_category
##    <chr>      <chr>           <chr>      <fct>    <chr>           
##  1 43361      1916            GROCERY    National MISC. DAIRY     
##  2 50776      69              GROCERY    Private  MISC. DAIRY     
##  3 53070      69              GROCERY    Private  MISC. DAIRY     
##  4 60924      435             GROCERY    National MISC. DAIRY     
##  5 64623      289             GROCERY    National MISC. DAIRY     
##  6 68011      908             GROCERY    National MISC. DAIRY     
##  7 68924      289             GROCERY    National MISC. DAIRY     
##  8 79411      1046            GROCERY    National MISC. DAIRY     
##  9 81194      69              GROCERY    Private  MISC. DAIRY     
## 10 82250      69              GROCERY    Private  MISC. DAIRY     
## # ℹ 237 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

15.4.5 Character classes

To match one of several characters in a specified set we can enclose the characters of concern with square brackets [ ]. In addition, to matching any characters not in a specified character set we can include the caret ^ at the beginning of the set within the brackets. The following displays the general syntax for common character classes but these can be altered easily as shown in the examples that follow:

Common character classes.
Character class Description
[aeiou] match any specified lower case vowel
[AEIOU] match any specified upper case vowel
[0123456789] match any specified numeric values
[0-9] match any range specified numeric values
[a-z] match any range of lowercase letters
[A-Z] match any range of uppercase letters
[a-zA-Z0-9] match any of the above
[^aeiou] match anything other than a lowercase vowel
[^0-9] match anything other than the specified numeric values

For example, say we wanted to find any products where the package_size is not a round numeric size in ounces. The following identifies any rows where package_size contains a dot (remember, we need to escape that character with \\.) followed by “oz”.

products %>% 
   select(product_type, package_size) %>%
   filter(str_detect(package_size, regex("\\.[0-9] oz", ignore_case = TRUE)))
## # A tibble: 10,454 × 2
##    product_type        package_size
##    <chr>               <chr>       
##  1 SPICES & SEASONINGS 2.5 OZ      
##  2 SKILLET DINNERS     5.8 OZ      
##  3 SKILLET DINNERS     6.5 OZ      
##  4 GROUND COFFEE       34.5 OZ     
##  5 SKILLET DINNERS     6.5 OZ      
##  6 SW GDS: MUFFINS     10.8 OZ     
##  7 GELATIN             .3 OZ       
##  8 GELATIN             .3 OZ       
##  9 GELATIN             .3 OZ       
## 10 SKILLET DINNERS     6.2 OZ      
## # ℹ 10,444 more rows

Now, say we wanted to do the same but we are interested in any packages that are in ounces (“OZ”) or pounds (“LB”). Your first reaction is probably to do something like:

products %>% 
   select(product_type, package_size) %>%
   filter(str_detect(package_size, regex("\\.[0-9] oz|lb", ignore_case = TRUE)))
## # A tibble: 13,012 × 2
##    product_type        package_size
##    <chr>               <chr>       
##  1 ICE - CRUSHED/CUBED 22 LB       
##  2 SPICES & SEASONINGS 2.5 OZ      
##  3 SKILLET DINNERS     5.8 OZ      
##  4 SKILLET DINNERS     6.5 OZ      
##  5 GROUND COFFEE       34.5 OZ     
##  6 SKILLET DINNERS     6.5 OZ      
##  7 SW GDS: MUFFINS     10.8 OZ     
##  8 GELATIN             .3 OZ       
##  9 GELATIN             .3 OZ       
## 10 GELATIN             .3 OZ       
## # ℹ 13,002 more rows

Wait! The first observation is in pounds but its a round number and not a decimal. This is because our regex (\\.[0-9] oz|lb) is actually looking for any package size where its a decimal of ounces (\\.[0-9] oz) or in pounds (lb).

We need to modify our regex just a tad. If we change it to \\.[0-9] (oz|lb) (note that oz|lb is now in parenthesis), we are now specifying to search for \\.[0-9] followed by “oz” or “lb”.

products %>% 
   select(product_type, package_size) %>%
   filter(str_detect(package_size, regex("\\.[0-9] (oz|lb)", ignore_case = TRUE)))
## # A tibble: 10,712 × 2
##    product_type        package_size
##    <chr>               <chr>       
##  1 SPICES & SEASONINGS 2.5 OZ      
##  2 SKILLET DINNERS     5.8 OZ      
##  3 SKILLET DINNERS     6.5 OZ      
##  4 GROUND COFFEE       34.5 OZ     
##  5 SKILLET DINNERS     6.5 OZ      
##  6 SW GDS: MUFFINS     10.8 OZ     
##  7 GELATIN             .3 OZ       
##  8 GELATIN             .3 OZ       
##  9 GELATIN             .3 OZ       
## 10 SKILLET DINNERS     6.2 OZ      
## # ℹ 10,702 more rows

Now, say we wanted to find any package size that contains a decimal between 0-.4:

products %>% 
   select(product_type, package_size) %>%
   filter(str_detect(package_size, regex("\\.[0-4] (oz|lb)", ignore_case = TRUE)))
## # A tibble: 3,092 × 2
##    product_type                   package_size
##    <chr>                          <chr>       
##  1 GELATIN                        .3 OZ       
##  2 GELATIN                        .3 OZ       
##  3 GELATIN                        .3 OZ       
##  4 SKILLET DINNERS                6.2 OZ      
##  5 SKILLET DINNERS                6.4 OZ      
##  6 RICE SIDE DISH MIXES DRY       4.3 OZ      
##  7 ADULT CEREAL                   17.3 OZ     
##  8 NOODLE SIDE DISH MIXES         6.2 OZ      
##  9 CANDY BARS (SINGLES)(INCLUDING 1.4 OZ      
## 10 DRINKS - CARB JUICE (OVER 50%  25.4 OZ     
## # ℹ 3,082 more rows

15.4.6 Shorthand character classes

Since certain character classes are used often, a series of shorthand character classes are available. For example, rather than use [0-9] every time we are searching for a number in a regex, we can use just use \\d to match any digit. The following are a few of the commonly used shorthand character classes:

Common shorthand character classes.
Syntax Description
\\d match any digit
\\D match any non-digit
\\s match a space character
\\S match a non-space character
\\w match a word
\\W match a non-word
\\b match a word boundary
\\B match a non-word boundary

We can use these to find patterns such as…

Find all products where the package_size starts with a numeric digit:

products %>% 
   select(product_type, package_size) %>%
   filter(str_detect(package_size, regex("^\\d")))
## # A tibble: 59,790 × 2
##    product_type                package_size
##    <chr>                       <chr>       
##  1 ICE - CRUSHED/CUBED         22 LB       
##  2 APPLE SAUCE                 50 OZ       
##  3 SPECIALTY COOKIES           14 OZ       
##  4 SPICES & SEASONINGS         2.5 OZ      
##  5 TRAY PACK/CHOC CHIP COOKIES 16 OZ       
##  6 VITAMIN - MINERALS          300 CT(1)   
##  7 HONEY                       12 OZ       
##  8 TRADITIONAL                 56 OZ       
##  9 TRADITIONAL                 56 OZ       
## 10 TRADITIONAL                 56 OZ       
## # ℹ 59,780 more rows

Or all products where the package_size starts with a non-digit:

products %>% 
   select(product_type, package_size) %>%
   filter(str_detect(package_size, regex("^\\D")))
## # A tibble: 1,955 × 2
##    product_type                   package_size
##    <chr>                          <chr>       
##  1 NATURAL CHEESE EXACT WT CHUNKS A B D  8 OZ 
##  2 GELATIN                        .3 OZ       
##  3 GELATIN                        .3 OZ       
##  4 GELATIN                        .3 OZ       
##  5 AMMONIA                        .5 GAL      
##  6 EGGS - X-LARGE                 A D   1 DZ  
##  7 STRING CHEESE                  AB    8 OZ  
##  8 EGGS - MEDIUM                  ABD   1 DZ  
##  9 SPICES & SEASONINGS            .56 OZ      
## 10 STRING CHEESE                  AB    8 OZ  
## # ℹ 1,945 more rows

15.4.7 POSIX character classes

Closely related to regex character classes are POSIX character classes which are expressed in double brackets [[ ]]. Some commonly used POSIX character classes include:

Common POSIX character classes.
Syntax Description
[[:lower:]] lower case letters
[[:upper:]] upper case letters
[[:alpha:]] alphabetic characters [[:lower:]] + [[:alpha:]]
[[:digit:]] numeric values
[[:alnum:]] alpha numeric values [[:alpha:]] + [[:digit:]]
[[:blank:]] blank characters (e.g. space, tab)
[[:punct:]] punctuation characters (e.g. ! . , ? & / ( [ )

For example, if we wanted to find any products where the package_size includes a punctuation in the description:

products %>% 
   select(product_type, package_size) %>%
   filter(str_detect(package_size, regex("[[:punct:]]")))
## # A tibble: 17,891 × 2
##    product_type        package_size
##    <chr>               <chr>       
##  1 SPICES & SEASONINGS 2.5 OZ      
##  2 VITAMIN - MINERALS  300 CT(1)   
##  3 TRADITIONAL         1/2 GAL     
##  4 SKILLET DINNERS     5.8 OZ      
##  5 SKILLET DINNERS     6.5 OZ      
##  6 GROUND COFFEE       34.5 OZ     
##  7 SKILLET DINNERS     6.5 OZ      
##  8 AMMONIA             1/2 GAL     
##  9 SW GDS: MUFFINS     10.8 OZ     
## 10 GELATIN             .3 OZ       
## # ℹ 17,881 more rows

15.4.8 Repetition

When we want to match a certain number of characters that meet a certain criteria we can apply repetition operators to our pattern searches. Common repetition operators include:

Common repetition operators.
Syntax Description
. wildcard to match any character once
? the preceding item is optional and will be matched at most once
* the preceding item will be matched zero or more times
+ the preceding item will be matched one or more times
{n} the preceding item will be matched exactly n times
{n,} the preceding item will be matched n or more times
{n,m} the preceding item will be matched at least n times but not more than m times

For example, say we want to find all products where the package_size contains at least 3 digits:

products %>% 
   select(product_category, package_size) %>%
   filter(str_detect(package_size, regex("\\d{3,}")))
## # A tibble: 3,313 × 2
##    product_category       package_size
##    <chr>                  <chr>       
##  1 VITAMINS               300 CT(1)   
##  2 FACIAL TISS/DNR NAPKIN 250 CT      
##  3 CANDY - CHECKLANE      1.625 OZ    
##  4 TEAS                   100 CT      
##  5 ANALGESICS             100 CT      
##  6 COFFEE FILTERS         200 CT      
##  7 CANDY - PACKAGED       10.125 OZ   
##  8 TEAS                   100 CT      
##  9 TEAS                   100 CT      
## 10 ISOTONIC DRINKS        128 OZ      
## # ℹ 3,303 more rows

One thing you probably notice is that the above syntax will match three or more digits within the entire character string. But what if we wanted to identify repetition of a pattern sequence. For example, say we wanted to find product_ids where the number “8” is repeated. We can use a backreference to do so. A backreference will match the same text as previously matched within parentheses. So, in this example, we look for any repeated sequences of the number 8 in product_id.

products %>% 
   filter(str_detect(product_id, regex("(8)\\1")))
## # A tibble: 4,750 × 7
##    product_id manufacturer_id department    brand    product_category  
##    <chr>      <chr>           <chr>         <fct>    <chr>             
##  1 26889      32              DRUG GM       National MAGAZINE          
##  2 28889      69              SEAFOOD-PCKGD Private  SEAFOOD - FROZEN  
##  3 28892      69              GROCERY       Private  FROZEN PIZZA      
##  4 28897      69              GROCERY       Private  EGGS              
##  5 32888      1075            GROCERY       National CRACKERS/MISC BKD…
##  6 33882      544             GROCERY       National BAG SNACKS        
##  7 33883      764             GROCERY       National LAUNDRY ADDITIVES 
##  8 36880      69              GROCERY       Private  CRACKERS/MISC BKD…
##  9 37889      317             GROCERY       National SALD DRSNG/SNDWCH…
## 10 38877      1884            GROCERY       National MILK BY-PRODUCTS  
## # ℹ 4,740 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>

What if we wanted to look for product_ids that contain three “8”s in a row then we need to repeat that pattern:

products %>% 
   filter(str_detect(product_id, regex("(8)\\1{2}")))
## # A tibble: 343 × 7
##    product_id manufacturer_id department    brand    product_category  
##    <chr>      <chr>           <chr>         <fct>    <chr>             
##  1 28889      69              SEAFOOD-PCKGD Private  SEAFOOD - FROZEN  
##  2 32888      1075            GROCERY       National CRACKERS/MISC BKD…
##  3 62888      69              GROCERY       Private  LAUNDRY ADDITIVES 
##  4 75888      1448            GROCERY       National CAT LITTER        
##  5 78888      1051            DRUG GM       National ELECTRICAL SUPPPL…
##  6 88832      348             DRUG GM       National CANDY - CHECKLANE 
##  7 108889     69              MEAT-PCKGD    Private  LUNCHMEAT         
##  8 124888     857             DRUG GM       National CANDY - CHECKLANE 
##  9 238885     3830            SALAD BAR     National SALAD BAR         
## 10 255888     5474            DRUG GM       National NEWSPAPER         
## # ℹ 333 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>


15.4.9 Putting it altogether

Ok, let’s use a few tools we’ve learned to answer a question we may get asked by our boss. Say we were asked to identify the top 5 products that have the most total sales; however, we only want to focus on those products that weigh 10lbs or more. We can applly the following steps:

  1. filter for regex "^\\d{2,}(\\.)?.*lb" which means:
    • ^\\d{2,}: starts with at least 2 numeric digits
    • (\\.)?.: followed by an optional decimal
    • .*lb: followed by a character zero or more times
  2. take the resulting product list and inner join with transactions so we only retain those transactions and products that have a matching product ID in both tables,
  3. compute total sales grouped by product (here we use product type just to provide us more context over the product ID),
  4. and then use slice_max to get the top 5 total_sales values (you could also just use arrange(desc(total_sales))).
products %>%
   filter(str_detect(package_size, regex("^\\d{2,}(\\.)?.*lb", ignore_case = TRUE))) %>%
   inner_join(transactions_sample, by = "product_id") %>%
   group_by(product_type) %>%
   summarize(total_sales = sum(sales_value)) %>%
   slice_max(total_sales, n = 5)
## # A tibble: 5 × 2
##   product_type                  total_sales
##   <chr>                               <dbl>
## 1 BANANAS                              887.
## 2 GRAPES RED                           527.
## 3 GRAPES WHITE                         507.
## 4 TOMATOES HOTHOUSE ON THE VINE        364.
## 5 POTATOES RUSSET (BULK&BAG)           346.

15.4.10 Knowledge check

  1. How many products contain the word “bulk” in product_type?
  2. How many products do not contain punctuation in their package_size?
  3. Find all frozen pizza products. Be careful, this is not straight forward!


15.5 Exercises

To answer these questions you’ll need to use the products and transactions_sample data frames.

  1. Identify all different products that contain “pizza” in their product_type description. Which product produces the greatest amount of total sales?
  2. Identify all products that are categorized (product_category) as pizza but are considered a snack or appetizer (product_type). Which of these products have the most sales (measured by quantity)?
  3. How many products contain package_sizes that do not contain a numeric value.

15.6 Additional resources

Character string data are often considered semi-structured data. Text can be structured in a specified field; however, the quality and consistency of the text input can be far from structured. Consequently, managing and manipulating character strings can be extremely tedious and unique to each data wrangling process. As a result, taking the time to learn the nuances of dealing with character strings and regex functions can provide a great return on investment; however, the functions and techniques required will likely be greater than what I could offer here. So here are additional resources that are worth reading and learning from: