15 Lesson 4b: Handling text data
Dealing with character strings is often under-emphasized in data analysis training. The focus typically remains on numeric values; however, the growth in data collection is also resulting in greater bits of information embedded in text. Consequently, handling, cleaning and processing character strings is becoming a prerequisite in daily data analysis. This lesson is meant to give you the foundation of working with character strings.
15.1 Learning objectives
By the end of this lesson you’ll be able to:
- Perform basic character string manipulations.
- Use regular expressions to identify and manipulate patterns in character strings.
15.2 Prerequisites
The following packages will be used throughout this lesson. Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember. Instead we’ll use functions from stringr for text manipulation and we’ll also combine this with the use of dplyr for data frame manipulation.
For data, we’ll leverage the completejourney data. For example, the products
data within completejourney provides various variables (e.g. product_category
and product_type
) that contain character strings we may need to clean, normalize, or identify patterns within.
library(completejourney)
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 25671 2 GROCERY National FRZN ICE
## 2 26081 2 MISCELLANEOUS National <NA>
## 3 26093 69 PASTRY Private BREAD
## 4 26190 69 GROCERY Private FRUIT - SHELF STA…
## 5 26355 69 GROCERY Private COOKIES/CONES
## 6 26426 69 GROCERY Private SPICES & EXTRACTS
## 7 26540 69 GROCERY Private COOKIES/CONES
## 8 26601 69 DRUG GM Private VITAMINS
## 9 26636 69 PASTRY Private BREAKFAST SWEETS
## 10 26691 16 GROCERY Private PNT BTR/JELLY/JAMS
## # ℹ 92,321 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
15.3 String basics
Basic string manipulation typically includes case conversion, counting characters, and extracting parts of a string. These operations can all be performed with base R functions; however, some operations (or at least their syntax) are greatly simplified with the stringr package. Moreover, stringr is typically very useful when combined with dplyr
data manipulation functions.
15.3.1 Case conversion
To change the case of a character string we can use str_to_xxx
:
# we'll focus on the product_category column
head(products$product_category)
## [1] "FRZN ICE" NA
## [3] "BREAD" "FRUIT - SHELF STABLE"
## [5] "COOKIES/CONES" "SPICES & EXTRACTS"
# force everything to lower case
str_to_lower(products$product_category) %>% head()
## [1] "frzn ice" NA
## [3] "bread" "fruit - shelf stable"
## [5] "cookies/cones" "spices & extracts"
# force everything to upper case
str_to_upper(products$product_category) %>% head()
## [1] "FRZN ICE" NA
## [3] "BREAD" "FRUIT - SHELF STABLE"
## [5] "COOKIES/CONES" "SPICES & EXTRACTS"
Typically we want to normalize text by converting everything to lower case but keep the data in our original data frame. We can combine str_
functions with dplyr::mutate
to accomplish this:
products %>% mutate(product_category = str_to_lower(product_category))
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 25671 2 GROCERY National frzn ice
## 2 26081 2 MISCELLANEOUS National <NA>
## 3 26093 69 PASTRY Private bread
## 4 26190 69 GROCERY Private fruit - shelf sta…
## 5 26355 69 GROCERY Private cookies/cones
## 6 26426 69 GROCERY Private spices & extracts
## 7 26540 69 GROCERY Private cookies/cones
## 8 26601 69 DRUG GM Private vitamins
## 9 26636 69 PASTRY Private breakfast sweets
## 10 26691 16 GROCERY Private pnt btr/jelly/jams
## # ℹ 92,321 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
When we apply mutate()
to an existing column we do not
add a new column, we simply overwrite the existing column.
15.3.2 Counting characters
There may be times where we are interested in how many characters exist in a string. Whether you need to extract this information or filter for observations that exceed a certain character length you can use str_count
to accomplish this:
# count number of characters in product_category
str_count(products$product_category) %>% head()
## [1] 8 NA 5 20 13 17
# create a new variable with character length
products %>%
mutate(category_length = str_count(product_category)) %>%
select(product_category, category_length)
## # A tibble: 92,331 × 2
## product_category category_length
## <chr> <int>
## 1 FRZN ICE 8
## 2 <NA> NA
## 3 BREAD 5
## 4 FRUIT - SHELF STABLE 20
## 5 COOKIES/CONES 13
## 6 SPICES & EXTRACTS 17
## 7 COOKIES/CONES 13
## 8 VITAMINS 8
## 9 BREAKFAST SWEETS 16
## 10 PNT BTR/JELLY/JAMS 18
## # ℹ 92,321 more rows
# filter for product categories that are greater than the mean character length
products %>%
filter(str_count(product_category) > mean(str_count(product_category), na.rm = TRUE))
## # A tibble: 45,192 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 26190 69 GROCERY Private FRUIT - SHELF STABLE
## 2 26426 69 GROCERY Private SPICES & EXTRACTS
## 3 26636 69 PASTRY Private BREAKFAST SWEETS
## 4 26691 16 GROCERY Private PNT BTR/JELLY/JAMS
## 5 26738 69 GROCERY Private ICE CREAM/MILK/SHERBTS
## 6 26941 69 GROCERY Private ICE CREAM/MILK/SHERBTS
## 7 27030 69 GROCERY Private ICE CREAM/MILK/SHERBTS
## 8 27152 69 GROCERY Private SPICES & EXTRACTS
## 9 27158 69 GROCERY Private ICE CREAM/MILK/SHERBTS
## 10 27334 69 DRUG GM Private DIETARY AID PRODUCTS
## # ℹ 45,182 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
15.3.3 Extracting parts of character strings
If you need to extract parts of a character string you can use str_sub
. By default start = 1
and end = 3
will tell the function to extract the first through third characters. If you want to start from the right side of the string then negative numbers such as start = -1
and end = -5
will tell the function to extract the last five characters.
# extract parts of a string (example: extract first three characters)
str_sub(products$product_category, start = 1, end = 3) %>% head()
## [1] "FRZ" NA "BRE" "FRU" "COO" "SPI"
# create new variable for last five characters
products %>%
select(product_category) %>%
mutate(last_five = str_sub(product_category, start = -5, end = -1))
## # A tibble: 92,331 × 2
## product_category last_five
## <chr> <chr>
## 1 FRZN ICE N ICE
## 2 <NA> <NA>
## 3 BREAD BREAD
## 4 FRUIT - SHELF STABLE TABLE
## 5 COOKIES/CONES CONES
## 6 SPICES & EXTRACTS RACTS
## 7 COOKIES/CONES CONES
## 8 VITAMINS AMINS
## 9 BREAKFAST SWEETS WEETS
## 10 PNT BTR/JELLY/JAMS /JAMS
## # ℹ 92,321 more rows
15.4 Regular expressions
A regular expression (aka regex) is a sequence of characters that define a search pattern, mainly for use in pattern matching with text strings. Typically, regex patterns consist of a combination of alphanumeric characters as well as special characters. The pattern can also be as simple as a single character or it can be more complex and include several characters.
To understand how to work with regular expressions in R, we need to consider two primary features of regular expressions. One has to do with the syntax, or the way regex patterns are expressed in R. The other has to do with the functions used for regex matching in R. You will be exposed to both of these in the following sections.
15.4.1 Regex basics
The stringr package provides us a convenient approach to regular expressions. The common function call will consist of
str_*(string, pattern)
where:
str_*
represents a wide variety of regex functions depending on what you want to do,string
represents the character string of concern (i.e.products$product_category
),pattern
represents the regex pattern you are looking to match.
For example let’s say you are looking for observations where the word “FRUIT” was used in the product_category
description. str_detect
detects the presence or absence of a pattern (“FRUIT” in this example) and returns a logical vector. Since the output is TRUE
or FALSE
, this is a handy function to combine with filter
to filter for observations that have that pattern.
# detect if the word "FRUIT" is used in each product category description
str_detect(products$product_category, "FRUIT") %>% head()
## [1] FALSE NA FALSE TRUE FALSE FALSE
# filter for those observations that include the use of "FRUIT"
products %>% filter(str_detect(product_category, "FRUIT"))
## # A tibble: 1,153 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 26190 69 GROCERY Private FRUIT - SHELF STABLE
## 2 27503 69 GROCERY Private FRUIT - SHELF STABLE
## 3 27960 69 GROCERY Private FRZN FRUITS
## 4 33048 5820 PRODUCE National ORGANICS FRUIT & VEG…
## 5 37543 876 NUTRITION National DRIED FRUIT
## 6 44599 69 GROCERY Private FRUIT - SHELF STABLE
## 7 45218 69 GROCERY Private FRUIT - SHELF STABLE
## 8 45607 69 GROCERY Private FRUIT - SHELF STABLE
## 9 49748 69 GROCERY Private FRUIT - SHELF STABLE
## 10 49930 2 PRODUCE National STONE FRUIT
## # ℹ 1,143 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
There are wide variety of str_
functions (check them out by typing str_
+ tab in the console). For example, check out the following functions that all look for the pattern “FRZN” but provide different outputs:
# count number of instances in each observation
str_count(products$product_category, "FRZN") %>% head()
## [1] 1 NA 0 0 0 0
# extract the first instance in each observation (NA results when no instance exists)
str_extract(products$product_category, "FRZN") %>% head()
## [1] "FRZN" NA NA NA NA NA
# locate the start/stop position of the first instance in each observation
str_locate(products$product_category, "FRZN") %>% head()
## start end
## [1,] 1 4
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
# replace all instances of the word "FRZN" with "FROZEN"
str_replace_all(products$product_category, pattern = "FRZN", replacement = "FROZEN") %>% head()
## [1] "FROZEN ICE" NA
## [3] "BREAD" "FRUIT - SHELF STABLE"
## [5] "COOKIES/CONES" "SPICES & EXTRACTS"
The patterns used in the str_
functions are case
sensitive but you can use regex
to control options
(i.e. str_count(products$product_category, regex(“frzn”, ignore_case = TRUE))
).
15.4.2 Multiple Words
In the previous section you saw that you can search for a given word using any of the str_
functions. We can build onto this and search for multiple words. For example, you can search for multiple phrases to identify seasonal products. Say you wanted to identify all products targeting summer or fall, rather than search for these words in separate str_detect()
calls…
# filter for summer or fall items
products %>%
filter(
str_detect(product_category, regex("summer", ignore_case = TRUE)) |
str_detect(product_category, regex("fall", ignore_case = TRUE))
)
## # A tibble: 651 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 37073 69 DRUG GM Private SPRING/SUMMER SEASON…
## 2 91614 1277 DRUG GM National SPRING/SUMMER SEASON…
## 3 99090 1277 DRUG GM National SPRING/SUMMER SEASON…
## 4 99172 1277 DRUG GM National SPRING/SUMMER SEASON…
## 5 106668 1277 DRUG GM National SPRING/SUMMER SEASON…
## 6 182402 2160 DRUG GM National SPRING/SUMMER SEASON…
## 7 276046 1277 DRUG GM National SPRING/SUMMER SEASON…
## 8 461689 5759 DRUG GM National SPRING/SUMMER SEASON…
## 9 504050 1277 DRUG GM National SPRING/SUMMER SEASON…
## 10 555735 1277 DRUG GM National SPRING/SUMMER SEASON…
## # ℹ 641 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
We can simply search for observations that include the words “summer” or “fall” within a single str_detect()
call by using |
between the search terms, which is equivalent to an or statement.
products %>%
filter(str_detect(product_category, regex("summer|fall", ignore_case = TRUE)))
## # A tibble: 651 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 37073 69 DRUG GM Private SPRING/SUMMER SEASON…
## 2 91614 1277 DRUG GM National SPRING/SUMMER SEASON…
## 3 99090 1277 DRUG GM National SPRING/SUMMER SEASON…
## 4 99172 1277 DRUG GM National SPRING/SUMMER SEASON…
## 5 106668 1277 DRUG GM National SPRING/SUMMER SEASON…
## 6 182402 2160 DRUG GM National SPRING/SUMMER SEASON…
## 7 276046 1277 DRUG GM National SPRING/SUMMER SEASON…
## 8 461689 5759 DRUG GM National SPRING/SUMMER SEASON…
## 9 504050 1277 DRUG GM National SPRING/SUMMER SEASON…
## 10 555735 1277 DRUG GM National SPRING/SUMMER SEASON…
## # ℹ 641 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
15.4.3 Line anchors
Line anchors are used to identify patterns at the beginning or end of an element. To find a pattern at the beginning of the element we use ^
and a pattern at the end of the element is found with $
. For example, if you wanted to find any observations with the word “fruit” in the product_category
column we can use the following:
# contains "juice"
products %>%
filter(str_detect(product_category, regex("fruit", ignore_case = TRUE)))
## # A tibble: 1,153 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 26190 69 GROCERY Private FRUIT - SHELF STABLE
## 2 27503 69 GROCERY Private FRUIT - SHELF STABLE
## 3 27960 69 GROCERY Private FRZN FRUITS
## 4 33048 5820 PRODUCE National ORGANICS FRUIT & VEG…
## 5 37543 876 NUTRITION National DRIED FRUIT
## 6 44599 69 GROCERY Private FRUIT - SHELF STABLE
## 7 45218 69 GROCERY Private FRUIT - SHELF STABLE
## 8 45607 69 GROCERY Private FRUIT - SHELF STABLE
## 9 49748 69 GROCERY Private FRUIT - SHELF STABLE
## 10 49930 2 PRODUCE National STONE FRUIT
## # ℹ 1,143 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
However, if we only wanted those products where the category starts with “fruit” than we can use the ^
anchor:
# starts with "fruit"
products %>%
filter(str_detect(product_category, regex("^fruit", ignore_case = TRUE)))
## # A tibble: 396 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 26190 69 GROCERY Private FRUIT - SHELF STABLE
## 2 27503 69 GROCERY Private FRUIT - SHELF STABLE
## 3 44599 69 GROCERY Private FRUIT - SHELF STABLE
## 4 45218 69 GROCERY Private FRUIT - SHELF STABLE
## 5 45607 69 GROCERY Private FRUIT - SHELF STABLE
## 6 49748 69 GROCERY Private FRUIT - SHELF STABLE
## 7 52240 69 GROCERY Private FRUIT - SHELF STABLE
## 8 53854 170 GROCERY National FRUIT - SHELF STABLE
## 9 54776 69 GROCERY Private FRUIT - SHELF STABLE
## 10 58782 170 GROCERY National FRUIT - SHELF STABLE
## # ℹ 386 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
Alternatively, if we only wanted those products where the category ends with “fruit” than we can use the $
anchor:
# starts with "fruit"
products %>%
filter(str_detect(product_category, regex("fruit$", ignore_case = TRUE)))
## # A tibble: 398 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 37543 876 NUTRITION National DRIED FRUIT
## 2 49930 2 PRODUCE National STONE FRUIT
## 3 54460 69 NUTRITION Private DRIED FRUIT
## 4 64583 2 PRODUCE National TROPICAL FRUIT
## 5 71432 876 NUTRITION National DRIED FRUIT
## 6 71792 876 NUTRITION National DRIED FRUIT
## 7 78306 612 NUTRITION National DRIED FRUIT
## 8 84804 612 NUTRITION National DRIED FRUIT
## 9 119875 1587 NUTRITION National DRIED FRUIT
## 10 136217 2 PRODUCE National STONE FRUIT
## # ℹ 388 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
And we can combine the two if we only wanted those products where the category starts or ends with “fruit”:
# starts with "fruit"
products %>%
filter(str_detect(product_category, regex("^fruit|fruit$", ignore_case = TRUE)))
## # A tibble: 794 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 26190 69 GROCERY Private FRUIT - SHELF STABLE
## 2 27503 69 GROCERY Private FRUIT - SHELF STABLE
## 3 37543 876 NUTRITION National DRIED FRUIT
## 4 44599 69 GROCERY Private FRUIT - SHELF STABLE
## 5 45218 69 GROCERY Private FRUIT - SHELF STABLE
## 6 45607 69 GROCERY Private FRUIT - SHELF STABLE
## 7 49748 69 GROCERY Private FRUIT - SHELF STABLE
## 8 49930 2 PRODUCE National STONE FRUIT
## 9 52240 69 GROCERY Private FRUIT - SHELF STABLE
## 10 53854 170 GROCERY National FRUIT - SHELF STABLE
## # ℹ 784 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
15.4.4 Metacharacters
Metacharacters consist of non-alphanumeric symbols such as:
To match metacharacters in regex you need to escape. In R, we escape them with a double backslash “\”. The following displays the general escape syntax for the most common metacharacters:
Metacharacter | Literal Meaning | Escape Syntax |
---|---|---|
. | period or dot | \\. |
$ | dollar sign | \\$ |
* | asterisk | \\* |
+ | plus sign | \\+ |
? | question mark | \\? |
| | vertical bar | \\\| |
^ | caret | \\^ |
[ | square bracket | \\[ |
{ | curly brace | \\{ |
( | parenthesis | \\( |
The reason we need to escape these characters is because most of these actually have meaning when declaring regular expressions. For example, say we wanted to identify any product_category
that contains a period (“.”). If we simply use the following we actually get all our records back.
This is because in regex expressions “.” is used to match any character.
products %>%
filter(str_detect(product_category, regex(".", ignore_case = TRUE)))
## # A tibble: 91,791 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 25671 2 GROCERY National FRZN ICE
## 2 26093 69 PASTRY Private BREAD
## 3 26190 69 GROCERY Private FRUIT - SHELF STABLE
## 4 26355 69 GROCERY Private COOKIES/CONES
## 5 26426 69 GROCERY Private SPICES & EXTRACTS
## 6 26540 69 GROCERY Private COOKIES/CONES
## 7 26601 69 DRUG GM Private VITAMINS
## 8 26636 69 PASTRY Private BREAKFAST SWEETS
## 9 26691 16 GROCERY Private PNT BTR/JELLY/JAMS
## 10 26738 69 GROCERY Private ICE CREAM/MILK/SHERB…
## # ℹ 91,781 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
So, we need to use an escape (“\”) to tell the regular expression you want to match a literal metacharacter.
products %>%
filter(str_detect(product_category, regex("\\.", ignore_case = TRUE)))
## # A tibble: 247 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 43361 1916 GROCERY National MISC. DAIRY
## 2 50776 69 GROCERY Private MISC. DAIRY
## 3 53070 69 GROCERY Private MISC. DAIRY
## 4 60924 435 GROCERY National MISC. DAIRY
## 5 64623 289 GROCERY National MISC. DAIRY
## 6 68011 908 GROCERY National MISC. DAIRY
## 7 68924 289 GROCERY National MISC. DAIRY
## 8 79411 1046 GROCERY National MISC. DAIRY
## 9 81194 69 GROCERY Private MISC. DAIRY
## 10 82250 69 GROCERY Private MISC. DAIRY
## # ℹ 237 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
15.4.5 Character classes
To match one of several characters in a specified set we can enclose the characters of concern with square brackets [ ]
. In addition, to matching any characters not in a specified character set we can include the caret ^
at the beginning of the set within the brackets. The following displays the general syntax for common character classes but these can be altered easily as shown in the examples that follow:
Character class | Description |
---|---|
[aeiou] | match any specified lower case vowel |
[AEIOU] | match any specified upper case vowel |
[0123456789] | match any specified numeric values |
[0-9] | match any range specified numeric values |
[a-z] | match any range of lowercase letters |
[A-Z] | match any range of uppercase letters |
[a-zA-Z0-9] | match any of the above |
[^aeiou] | match anything other than a lowercase vowel |
[^0-9] | match anything other than the specified numeric values |
For example, say we wanted to find any products where the package_size
is not a round numeric size in ounces. The following identifies any rows where package_size
contains a dot (remember, we need to escape that character with \\.
) followed by “oz”.
products %>%
select(product_type, package_size) %>%
filter(str_detect(package_size, regex("\\.[0-9] oz", ignore_case = TRUE)))
## # A tibble: 10,454 × 2
## product_type package_size
## <chr> <chr>
## 1 SPICES & SEASONINGS 2.5 OZ
## 2 SKILLET DINNERS 5.8 OZ
## 3 SKILLET DINNERS 6.5 OZ
## 4 GROUND COFFEE 34.5 OZ
## 5 SKILLET DINNERS 6.5 OZ
## 6 SW GDS: MUFFINS 10.8 OZ
## 7 GELATIN .3 OZ
## 8 GELATIN .3 OZ
## 9 GELATIN .3 OZ
## 10 SKILLET DINNERS 6.2 OZ
## # ℹ 10,444 more rows
Now, say we wanted to do the same but we are interested in any packages that are in ounces (“OZ”) or pounds (“LB”). Your first reaction is probably to do something like:
products %>%
select(product_type, package_size) %>%
filter(str_detect(package_size, regex("\\.[0-9] oz|lb", ignore_case = TRUE)))
## # A tibble: 13,012 × 2
## product_type package_size
## <chr> <chr>
## 1 ICE - CRUSHED/CUBED 22 LB
## 2 SPICES & SEASONINGS 2.5 OZ
## 3 SKILLET DINNERS 5.8 OZ
## 4 SKILLET DINNERS 6.5 OZ
## 5 GROUND COFFEE 34.5 OZ
## 6 SKILLET DINNERS 6.5 OZ
## 7 SW GDS: MUFFINS 10.8 OZ
## 8 GELATIN .3 OZ
## 9 GELATIN .3 OZ
## 10 GELATIN .3 OZ
## # ℹ 13,002 more rows
Wait! The first observation is in pounds but its a round number and not a decimal. This is because our regex (\\.[0-9] oz|lb
) is actually looking for any package size where its a decimal of ounces (\\.[0-9] oz
) or in pounds (lb
).
We need to modify our regex just a tad. If we change it to \\.[0-9] (oz|lb)
(note that oz|lb
is now in parenthesis), we are now specifying to search for \\.[0-9]
followed by “oz” or “lb”.
products %>%
select(product_type, package_size) %>%
filter(str_detect(package_size, regex("\\.[0-9] (oz|lb)", ignore_case = TRUE)))
## # A tibble: 10,712 × 2
## product_type package_size
## <chr> <chr>
## 1 SPICES & SEASONINGS 2.5 OZ
## 2 SKILLET DINNERS 5.8 OZ
## 3 SKILLET DINNERS 6.5 OZ
## 4 GROUND COFFEE 34.5 OZ
## 5 SKILLET DINNERS 6.5 OZ
## 6 SW GDS: MUFFINS 10.8 OZ
## 7 GELATIN .3 OZ
## 8 GELATIN .3 OZ
## 9 GELATIN .3 OZ
## 10 SKILLET DINNERS 6.2 OZ
## # ℹ 10,702 more rows
Now, say we wanted to find any package size that contains a decimal between 0-.4:
products %>%
select(product_type, package_size) %>%
filter(str_detect(package_size, regex("\\.[0-4] (oz|lb)", ignore_case = TRUE)))
## # A tibble: 3,092 × 2
## product_type package_size
## <chr> <chr>
## 1 GELATIN .3 OZ
## 2 GELATIN .3 OZ
## 3 GELATIN .3 OZ
## 4 SKILLET DINNERS 6.2 OZ
## 5 SKILLET DINNERS 6.4 OZ
## 6 RICE SIDE DISH MIXES DRY 4.3 OZ
## 7 ADULT CEREAL 17.3 OZ
## 8 NOODLE SIDE DISH MIXES 6.2 OZ
## 9 CANDY BARS (SINGLES)(INCLUDING 1.4 OZ
## 10 DRINKS - CARB JUICE (OVER 50% 25.4 OZ
## # ℹ 3,082 more rows
15.4.6 Shorthand character classes
Since certain character classes are used often, a series of shorthand character classes are available. For example, rather than use [0-9]
every time we are searching for a number in a regex, we can use just use \\d
to match any digit. The following are a few of the commonly used shorthand character classes:
Syntax | Description |
---|---|
\\d |
match any digit |
\\D |
match any non-digit |
\\s |
match a space character |
\\S |
match a non-space character |
\\w |
match a word |
\\W |
match a non-word |
\\b |
match a word boundary |
\\B |
match a non-word boundary |
We can use these to find patterns such as…
Find all products where the package_size
starts with a numeric digit:
products %>%
select(product_type, package_size) %>%
filter(str_detect(package_size, regex("^\\d")))
## # A tibble: 59,790 × 2
## product_type package_size
## <chr> <chr>
## 1 ICE - CRUSHED/CUBED 22 LB
## 2 APPLE SAUCE 50 OZ
## 3 SPECIALTY COOKIES 14 OZ
## 4 SPICES & SEASONINGS 2.5 OZ
## 5 TRAY PACK/CHOC CHIP COOKIES 16 OZ
## 6 VITAMIN - MINERALS 300 CT(1)
## 7 HONEY 12 OZ
## 8 TRADITIONAL 56 OZ
## 9 TRADITIONAL 56 OZ
## 10 TRADITIONAL 56 OZ
## # ℹ 59,780 more rows
Or all products where the package_size
starts with a non-digit:
products %>%
select(product_type, package_size) %>%
filter(str_detect(package_size, regex("^\\D")))
## # A tibble: 1,955 × 2
## product_type package_size
## <chr> <chr>
## 1 NATURAL CHEESE EXACT WT CHUNKS A B D 8 OZ
## 2 GELATIN .3 OZ
## 3 GELATIN .3 OZ
## 4 GELATIN .3 OZ
## 5 AMMONIA .5 GAL
## 6 EGGS - X-LARGE A D 1 DZ
## 7 STRING CHEESE AB 8 OZ
## 8 EGGS - MEDIUM ABD 1 DZ
## 9 SPICES & SEASONINGS .56 OZ
## 10 STRING CHEESE AB 8 OZ
## # ℹ 1,945 more rows
15.4.7 POSIX character classes
Closely related to regex character classes are POSIX character classes which are expressed in double brackets [[ ]]
. Some commonly used POSIX character classes include:
Syntax | Description |
---|---|
[[:lower:]] |
lower case letters |
[[:upper:]] |
upper case letters |
[[:alpha:]] |
alphabetic characters [[:lower:]] + [[:alpha:]] |
[[:digit:]] |
numeric values |
[[:alnum:]] |
alpha numeric values [[:alpha:]] + [[:digit:]] |
[[:blank:]] |
blank characters (e.g. space, tab) |
[[:punct:]] |
punctuation characters (e.g. ! . , ? & / ( [ ) |
For example, if we wanted to find any products where the package_size
includes a punctuation in the description:
products %>%
select(product_type, package_size) %>%
filter(str_detect(package_size, regex("[[:punct:]]")))
## # A tibble: 17,891 × 2
## product_type package_size
## <chr> <chr>
## 1 SPICES & SEASONINGS 2.5 OZ
## 2 VITAMIN - MINERALS 300 CT(1)
## 3 TRADITIONAL 1/2 GAL
## 4 SKILLET DINNERS 5.8 OZ
## 5 SKILLET DINNERS 6.5 OZ
## 6 GROUND COFFEE 34.5 OZ
## 7 SKILLET DINNERS 6.5 OZ
## 8 AMMONIA 1/2 GAL
## 9 SW GDS: MUFFINS 10.8 OZ
## 10 GELATIN .3 OZ
## # ℹ 17,881 more rows
15.4.8 Repetition
When we want to match a certain number of characters that meet a certain criteria we can apply repetition operators to our pattern searches. Common repetition operators include:
Syntax | Description |
---|---|
. |
wildcard to match any character once |
? |
the preceding item is optional and will be matched at most once |
* |
the preceding item will be matched zero or more times |
+ |
the preceding item will be matched one or more times |
{n} |
the preceding item will be matched exactly n times |
{n,} |
the preceding item will be matched n or more times |
{n,m} |
the preceding item will be matched at least n times but not more than m times |
For example, say we want to find all products where the package_size
contains at least 3 digits:
products %>%
select(product_category, package_size) %>%
filter(str_detect(package_size, regex("\\d{3,}")))
## # A tibble: 3,313 × 2
## product_category package_size
## <chr> <chr>
## 1 VITAMINS 300 CT(1)
## 2 FACIAL TISS/DNR NAPKIN 250 CT
## 3 CANDY - CHECKLANE 1.625 OZ
## 4 TEAS 100 CT
## 5 ANALGESICS 100 CT
## 6 COFFEE FILTERS 200 CT
## 7 CANDY - PACKAGED 10.125 OZ
## 8 TEAS 100 CT
## 9 TEAS 100 CT
## 10 ISOTONIC DRINKS 128 OZ
## # ℹ 3,303 more rows
One thing you probably notice is that the above syntax will match three or more digits within the entire character string. But what if we wanted to identify repetition of a pattern sequence. For example, say we wanted to find product_id
s where the number “8” is repeated. We can use a backreference to do so. A backreference will match the same text as previously matched within parentheses. So, in this example, we look for any repeated sequences of the number 8 in product_id
.
products %>%
filter(str_detect(product_id, regex("(8)\\1")))
## # A tibble: 4,750 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 26889 32 DRUG GM National MAGAZINE
## 2 28889 69 SEAFOOD-PCKGD Private SEAFOOD - FROZEN
## 3 28892 69 GROCERY Private FROZEN PIZZA
## 4 28897 69 GROCERY Private EGGS
## 5 32888 1075 GROCERY National CRACKERS/MISC BKD…
## 6 33882 544 GROCERY National BAG SNACKS
## 7 33883 764 GROCERY National LAUNDRY ADDITIVES
## 8 36880 69 GROCERY Private CRACKERS/MISC BKD…
## 9 37889 317 GROCERY National SALD DRSNG/SNDWCH…
## 10 38877 1884 GROCERY National MILK BY-PRODUCTS
## # ℹ 4,740 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
What if we wanted to look for product_id
s that contain three “8”s in a row then we need to repeat that pattern:
products %>%
filter(str_detect(product_id, regex("(8)\\1{2}")))
## # A tibble: 343 × 7
## product_id manufacturer_id department brand product_category
## <chr> <chr> <chr> <fct> <chr>
## 1 28889 69 SEAFOOD-PCKGD Private SEAFOOD - FROZEN
## 2 32888 1075 GROCERY National CRACKERS/MISC BKD…
## 3 62888 69 GROCERY Private LAUNDRY ADDITIVES
## 4 75888 1448 GROCERY National CAT LITTER
## 5 78888 1051 DRUG GM National ELECTRICAL SUPPPL…
## 6 88832 348 DRUG GM National CANDY - CHECKLANE
## 7 108889 69 MEAT-PCKGD Private LUNCHMEAT
## 8 124888 857 DRUG GM National CANDY - CHECKLANE
## 9 238885 3830 SALAD BAR National SALAD BAR
## 10 255888 5474 DRUG GM National NEWSPAPER
## # ℹ 333 more rows
## # ℹ 2 more variables: product_type <chr>, package_size <chr>
15.4.9 Putting it altogether
Ok, let’s use a few tools we’ve learned to answer a question we may get asked by our boss. Say we were asked to identify the top 5 products that have the most total sales; however, we only want to focus on those products that weigh 10lbs or more. We can applly the following steps:
- filter for regex
"^\\d{2,}(\\.)?.*lb"
which means:^\\d{2,}
: starts with at least 2 numeric digits(\\.)?.
: followed by an optional decimal.*lb
: followed by a character zero or more times
- take the resulting product list and inner join with transactions so we only retain those transactions and products that have a matching product ID in both tables,
- compute total sales grouped by product (here we use product type just to provide us more context over the product ID),
- and then use
slice_max
to get the top 5total_sales
values (you could also just usearrange(desc(total_sales))
).
products %>%
filter(str_detect(package_size, regex("^\\d{2,}(\\.)?.*lb", ignore_case = TRUE))) %>%
inner_join(transactions_sample, by = "product_id") %>%
group_by(product_type) %>%
summarize(total_sales = sum(sales_value)) %>%
slice_max(total_sales, n = 5)
## # A tibble: 5 × 2
## product_type total_sales
## <chr> <dbl>
## 1 BANANAS 887.
## 2 GRAPES RED 527.
## 3 GRAPES WHITE 507.
## 4 TOMATOES HOTHOUSE ON THE VINE 364.
## 5 POTATOES RUSSET (BULK&BAG) 346.
15.5 Exercises
To answer these questions you’ll need to use the
products
and transactions_sample
data
frames.
-
Identify all different products that contain “pizza” in their
product_type
description. Which product produces the greatest amount of total sales? -
Identify all products that are categorized
(
product_category
) as pizza but are considered a snack or appetizer (product_type
). Which of these products have the most sales (measured by quantity)? -
How many products contain
package_size
s that do not contain a numeric value.
15.6 Additional resources
Character string data are often considered semi-structured data. Text can be structured in a specified field; however, the quality and consistency of the text input can be far from structured. Consequently, managing and manipulating character strings can be extremely tedious and unique to each data wrangling process. As a result, taking the time to learn the nuances of dealing with character strings and regex functions can provide a great return on investment; however, the functions and techniques required will likely be greater than what I could offer here. So here are additional resources that are worth reading and learning from: