Lesson 4c: Handling text data

Lesson 4c: Handling text data#

Dealing with character strings is often under-emphasized in data analysis training. The focus typically remains on numeric values; however, the growth in data collection is also resulting in greater bits of information embedded in text. Consequently, handling, cleaning and processing character strings is becoming a prerequisite in daily data analysis. This lesson is meant to give you the foundation of working with character strings.

Learning objectives#

By the end of this lesson you’ll be able to:

Perform basic character string manipulations.
Use regular expressions to identify and manipulate patterns in character strings.

Prerequisites#

The Python standard library has lots of built-in capabilities to manipulate character strings. Pandas has incorporated many of these capabilities and even expanded upon them. In this lesson, we will demonstrate different character string capabilities with the .str.xxx() methods provided by Pandas.

Note

Most of the Pandas string methods have a similar standard library method that can be applied to character strings outside of DataFrames. See the standard library documentation for help.

import pandas as pd

For data, we’ll leverage the completejourney data. For example, the products data within completejourney provides various variables (e.g. product_category and product_type) that contain character strings we may need to clean, normalize, or identify patterns within.

from completejourney_py import get_data

cj_data = get_data()
products = cj_data['products']
products

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
0	25671	2	GROCERY	National	FRZN ICE	ICE - CRUSHED/CUBED	22 LB
1	26081	2	MISCELLANEOUS	National	None	None	None
2	26093	69	PASTRY	Private	BREAD	BREAD:ITALIAN/FRENCH	None
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
4	26355	69	GROCERY	Private	COOKIES/CONES	SPECIALTY COOKIES	14 OZ
...	...	...	...	...	...	...	...
92326	18293142	6384	DRUG GM	National	BOOKSTORE	PAPERBACK BOOKS	None
92327	18293439	6393	DRUG GM	National	BOOKSTORE	CHILDRENS LOW END	None
92328	18293696	6406	DRUG GM	National	BOOKSTORE	PAPERBACK BEST SELLER	None
92329	18294080	6442	DRUG GM	National	BOOKSTORE	PAPERBACK BOOKS	None
92330	18316298	764	GROCERY	National	PAPER TOWELS	PAPER TOWELS & HOLDERS	None

92331 rows × 7 columns

String basics#

Basic string manipulation typically includes case conversion, counting characters, and extracting parts of a string. The following will demonstrate some of these basic string methods on a Pandas DataFrame but realize that Pandas provides many more string methods. Refer to the docs to learn more about the many string methods.

Video 🎥:

Tip

The string methods provided by Pandas are accessed by appending the object with .str followed by another dot and then the specific string method. Think of the term ‘accessor’ as giving the object access to specialized string methods.

Case conversion#

To change the case of a character string we can use .str.lower() and .str.upper():

products['product_category'].str.lower().head()

              frzn ice
                  None
                 bread
  fruit - shelf stable
         cookies/cones
Name: product_category, dtype: object

products['product_category'].str.upper().head()

              FRZN ICE
                  None
                 BREAD
  FRUIT - SHELF STABLE
         COOKIES/CONES
Name: product_category, dtype: object

Value counts#

We’ve already seen this method in a previous lesson but it’s worth discussing more. The value_counts() method is one of the most valuable methods for string columns. It returns the count of each unique value in the Series and sorts it from most to least common.

# raw value counts
products['product_category'].value_counts()

product_category
GREETING CARDS/WRAP/PARTY SPLY    2785
CANDY - PACKAGED                  2475
MAKEUP AND TREATMENT              2467
HAIR CARE PRODUCTS                1744
SOFT DRINKS                       1704
                                  ... 
BOUQUET (NON ROSE)                   1
MISCELLANEOUS CROUTONS               1
EASTER LILY                          1
PKG.SEAFOOD MISC                     1
FROZEN PACKAGE MEAT                  1
Name: count, Length: 303, dtype: int64

# normalized value counts
products['product_category'].value_counts(normalize=True)

product_category
GREETING CARDS/WRAP/PARTY SPLY    0.030341
CANDY - PACKAGED                  0.026963
MAKEUP AND TREATMENT              0.026876
HAIR CARE PRODUCTS                0.019000
SOFT DRINKS                       0.018564
                                    ...   
BOUQUET (NON ROSE)                0.000011
MISCELLANEOUS CROUTONS            0.000011
EASTER LILY                       0.000011
PKG.SEAFOOD MISC                  0.000011
FROZEN PACKAGE MEAT               0.000011
Name: proportion, Length: 303, dtype: float64

Note how value_counts() does not start with .str. This is not a string method but is most commonly used for strings. In fact we can use value_counts() for any data type.

# value counts on date-time data
cj_data['transactions']['transaction_timestamp'].value_counts().head()

transaction_timestamp
2017-05-28 15:19:55    161
2017-06-11 19:07:33    155
2017-01-04 00:55:24    153
2017-05-09 20:41:28    146
2017-12-05 17:58:13    144
Name: count, dtype: int64

Character counts#

We can get the count of the entire character string with len(). This returns the total number of characters in a string to include white space and other non-alphanumeric characters.

products['product_category'].str.len().head()

   8.0
   NaN
   5.0
  20.0
  13.0
Name: product_category, dtype: float64

Sometimes we only want to understand how many times a particular word, statement, or even particular letters are used. We can do this with the count() method. This method becomes more flexible as we introduce regular expressions, which we will get to shortly.

The below illustrates that we can easily count how many times ‘meat’ is used in the product_category column:

products['product_category'].str.lower().str.count('meat').sum()

np.float64(4875.0)

Detecting words#

The contains() method returns a boolean whether or not the passed string is contained somewhere within the string. We can use this to filter for particular words or expressions. For example, the following finds all products where product_category contains ‘meat’.

Tip

We can use case=False to ignore case sensitivity. We also use na=False to fill in all missing product_category values with False. If we didn’t do this then the subsetting would throw an error. However, use this wisely as it is basically treating missing product_category values as non-meat products!

meat_products = products['product_category'].str.contains('meat', case=False, na=False)
products[meat_products]

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
123	30003	397	MEAT-PCKGD	National	FROZEN MEAT	FRZN BREADED PREPARED CHICK	9 OZ
156	31493	151	GROCERY	National	FRZN MEAT/MEAT DINNERS	FRZN SS PREMIUM ENTREES/DNRS/T	10 OZ
274	34997	1329	GROCERY	National	FRZN MEAT/MEAT DINNERS	FRZN SS PREMIUM ENTREES/DNRS/T	12 OZ
319	36406	1329	GROCERY	National	FRZN MEAT/MEAT DINNERS	FRZN SS PREMIUM ENTREES/DNRS/T	12 OZ
322	36561	165	GROCERY	National	FRZN MEAT/MEAT DINNERS	FRZN MULTI SERVE ENTREES ALL	22 OZ
...	...	...	...	...	...	...	...
92199	18107338	1094	MEAT-PCKGD	National	LUNCHMEAT	LUNCH COMBO	6.8 OZ
92201	18107424	1094	MEAT-PCKGD	National	LUNCHMEAT	LUNCH COMBO	7.1 OZ
92237	18147446	69	MEAT-PCKGD	Private	LUNCHMEAT	VARIETY PACK	10 OZ
92253	18148186	1251	GROCERY	National	FRZN MEAT/MEAT DINNERS	SS ECONOMY ENTREES/DINNERS ALL	14 OZ
92261	18148548	1251	GROCERY	National	FRZN MEAT/MEAT DINNERS	SS ECONOMY ENTREES/DINNERS ALL	14 OZ

3607 rows × 7 columns

Extracting parts of strings#

We can use traditional index-styling to extract or slice parts of a character string. For example, the following gets the first and last 5 characters in each string.

products['product_category'].str[:5].head()

  FRZN 
   None
  BREAD
  FRUIT
  COOKI
Name: product_category, dtype: object

products['product_category'].str[-5:].head()

  N ICE
   None
  BREAD
  TABLE
  CONES
Name: product_category, dtype: object

Replacing parts of strings#

Sometimes we may want to clean up character strings by replacing certain words or phrases. For example, this data contains the word ‘frzn’ in place of ‘frozen’. If we wanted to fix this we could use the replace() method.

# product_category observations containing 'frzn'
frzn_products = products['product_category'].str.contains('frzn', case=False, na=False)
products[frzn_products].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
0	25671	2	GROCERY	National	FRZN ICE	ICE - CRUSHED/CUBED	22 LB
39	27754	69	GROCERY	Private	FRZN VEGETABLE/VEG DSH	FRZN BAGGED VEGETABLES - PLAIN	16 OZ
54	27960	69	GROCERY	Private	FRZN FRUITS	FROZEN FRUIT	16 OZ
99	29132	69	GROCERY	Private	FRZN VEGETABLE/VEG DSH	FRZN BAGGED VEGETABLES - PLAIN	16 OZ
127	30179	69	GROCERY	Private	FRZN VEGETABLE/VEG DSH	FRZN BOXED VEGETABLES - PLAIN	10 OZ

# replacing 'frzn' with 'frozen' in `product_category` column
products['product_category'] = products['product_category'].str.replace('frzn', 'frozen', case=False)
products[frzn_products].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
0	25671	2	GROCERY	National	frozen ICE	ICE - CRUSHED/CUBED	22 LB
39	27754	69	GROCERY	Private	frozen VEGETABLE/VEG DSH	FRZN BAGGED VEGETABLES - PLAIN	16 OZ
54	27960	69	GROCERY	Private	frozen FRUITS	FROZEN FRUIT	16 OZ
99	29132	69	GROCERY	Private	frozen VEGETABLE/VEG DSH	FRZN BAGGED VEGETABLES - PLAIN	16 OZ
127	30179	69	GROCERY	Private	frozen VEGETABLE/VEG DSH	FRZN BOXED VEGETABLES - PLAIN	10 OZ

Knowledge check#

Questions

Using the product_type column, which product has the longest description? How about the shortest?
Which package_size values are most common in our data? Which is the least common?
Replace all instances of ‘DSH’ with ‘dish’ in the product_category column. How many products does this impact?

Video 🎥:

Regular expressions#

A regular expression (aka regex) is a sequence of characters that define a search pattern, mainly for use in pattern matching with text strings. Typically, regex patterns consist of a combination of alphanumeric characters as well as special characters. The pattern can also be as simple as a single character or it can be more complex and include several combinations of characters.

To understand how to work with regular expressions in Python, we need to consider two primary features of regular expressions. One has to do with the syntax, or the way regex patterns are expressed in Python. The other has to do with the functions used for regex matching in Python. You will be exposed to both of these in the following sections.

Regex basics#

Python’s standard library, and Pandas adoption of these string methods provides us a convenient approach to regular expressions. Many of the methods used in the previous sections can be used with regex. The common pattern for our method calls consist of:

string_object.str.*(pattern)

where:

string_object represents the character string input (i.e. products[‘product_category’]),
.str.* represents a wide variety of regex methods depending on what you want to do,
pattern represents the regex pattern you are looking to match.

For example let’s say you are looking for observations where the word “FRUIT” was used in the product_category description. As we saw earlier .str.contains() detects the presence or absence of a pattern (“FRUIT” in this example) and returns a boolean response. Since the output is TRUE or FALSE, this is a handy function to combine with indexing to filter for observations that have that pattern.

# detect if the word "fruit" is used in each comment
fruit_products = products['product_category'].str.contains('fruit', case=False, na=False)
products[fruit_products]

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
25	27503	69	GROCERY	Private	FRUIT - SHELF STABLE	CRANBERRY SAUCE	16 OZ
54	27960	69	GROCERY	Private	frozen FRUITS	FROZEN FRUIT	16 OZ
204	33048	5820	PRODUCE	National	ORGANICS FRUIT & VEGETABLES	ORGANIC CITRUS	2 LB BAG
344	37543	876	NUTRITION	National	DRIED FRUIT	RAISINS	24 OZ
...	...	...	...	...	...	...	...
92140	18056453	1378	NUTRITION	National	DRIED FRUIT	DRIED FRUIT - OTHER	.71 OZ
92287	18185766	5819	PRODUCE	National	ORGANICS FRUIT & VEGETABLES	ORGANIC HERBS	.75 OZ
92289	18186504	5819	PRODUCE	National	ORGANICS FRUIT & VEGETABLES	ORGANIC HERBS	.75 OZ
92290	18186637	5819	PRODUCE	National	ORGANICS FRUIT & VEGETABLES	ORGANIC HERBS	.75 OZ
92295	18193059	5819	PRODUCE	National	ORGANICS FRUIT & VEGETABLES	ORGANIC HERBS	.75 OZ

1153 rows × 7 columns

There are wide variety of .str. methods. See the previous sections for some of the more common ones or peruse the Pandas Working with Text Data documentation for examples of more.

Multiple words#

In the previous section you saw that you can search for a given word using some of the .str. methods. We can build onto this and search for multiple words. For example, you can search for the phrase “summer” or “fall”. One may initially think about performing this search with the following in which case we use traditional word matching and then subsetting. This is known as a ‘literal’ search.

summer_products = products['product_category'].str.contains('summer', case=False, na=False)
fall_products = products['product_category'].str.contains('fall', case=False, na=False)

products[summer_products | fall_products].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
335	37073	69	DRUG GM	Private	SPRING/SUMMER SEASONAL	GRILL ACCESSORIES	None
1747	91614	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	3 CT
1943	99090	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	2 CT
1947	99172	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	2 CT
2114	106668	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	None

This works, but as our searches get more complex this approach becomes quite tedious. Instead, we can start using regex search patterns to simplify this process. The below approach will search for observations that include the word “summer”, “fall”, or a combination of the two. This is equivalent to the above statement but more succinct. This is truly a regex, as instead of searching for a literal statement, we are looking for a pattern of letter/word combinations.

summer_or_fall = products['product_category'].str.contains('summer|fall', case=False, na=False)
products[summer_or_fall].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
335	37073	69	DRUG GM	Private	SPRING/SUMMER SEASONAL	GRILL ACCESSORIES	None
1747	91614	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	3 CT
1943	99090	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	2 CT
1947	99172	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	2 CT
2114	106668	1277	DRUG GM	National	SPRING/SUMMER SEASONAL	DISPOSABLE FOILWARE	None

In this example, we can see that our regex of ‘summer|fall’ returns the same number of observations as our literal search.

literal_count = products[summer_products | fall_products].shape[0]
regex_count = products[summer_or_fall].shape[0]

print(f'Literal approach: {literal_count} rows')
print(f'Regex approach: {regex_count} rows')

Literal approach: 651 rows
Regex approach: 651 rows

Line anchors#

Line anchors are used to identify patterns at the beginning or end of an element. To find a pattern at the beginning of the element we use ^ and a pattern at the end of the element is found with $. For example, if you wanted to find any observations with the word “fruit” in the product_category column we can use the following as we saw earlier:

# detect if the word "fruit" is used in each comment
fruit_products = products['product_category'].str.contains('fruit', case=False, na=False)
products[fruit_products].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
25	27503	69	GROCERY	Private	FRUIT - SHELF STABLE	CRANBERRY SAUCE	16 OZ
54	27960	69	GROCERY	Private	frozen FRUITS	FROZEN FRUIT	16 OZ
204	33048	5820	PRODUCE	National	ORGANICS FRUIT & VEGETABLES	ORGANIC CITRUS	2 LB BAG
344	37543	876	NUTRITION	National	DRIED FRUIT	RAISINS	24 OZ

However, if we only wanted those products where the category starts with “fruit” than we can use the ^ anchor:

starts_with_fruit = products['product_category'].str.contains('^fruit', case=False, na=False)
products[starts_with_fruit].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
25	27503	69	GROCERY	Private	FRUIT - SHELF STABLE	CRANBERRY SAUCE	16 OZ
590	44599	69	GROCERY	Private	FRUIT - SHELF STABLE	FRUIT COCKTAIL FRUIT SALAD	15.25 OZ
610	45218	69	GROCERY	Private	FRUIT - SHELF STABLE	PEARS	15.25 OZ
633	45607	69	GROCERY	Private	FRUIT - SHELF STABLE	PINEAPPLE	8 OZ

Alternatively, if we only wanted those products where the category ends with “fruit” than we can use the $ anchor:

ends_with_fruit = products['product_category'].str.contains('fruit$', case=False, na=False)
products[ends_with_fruit].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
344	37543	876	NUTRITION	National	DRIED FRUIT	RAISINS	24 OZ
772	49930	2	PRODUCE	National	STONE FRUIT	PEACHES YELLOW FLESH	None
896	54460	69	NUTRITION	Private	DRIED FRUIT	RAISINS	24 OZ BAG
1113	64583	2	PRODUCE	National	TROPICAL FRUIT	TROPICAL FRUIT - OTHER	CTN
1267	71432	876	NUTRITION	National	DRIED FRUIT	DRIED FRUIT - OTHER	7 OZ

And we can combine the two if we only wanted those products where the category starts or ends with “fruit”:

starts_or_ends_with_fruit = (
    products['product_category']
    .str.contains('^fruit|fruit$', case=False, na=False)
)

products[starts_or_ends_with_fruit].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
25	27503	69	GROCERY	Private	FRUIT - SHELF STABLE	CRANBERRY SAUCE	16 OZ
344	37543	876	NUTRITION	National	DRIED FRUIT	RAISINS	24 OZ
590	44599	69	GROCERY	Private	FRUIT - SHELF STABLE	FRUIT COCKTAIL FRUIT SALAD	15.25 OZ
610	45218	69	GROCERY	Private	FRUIT - SHELF STABLE	PEARS	15.25 OZ

Metacharacters#

Metacharacters consist of non-alphanumeric symbols such as:

. \ | ( ) [ { $ * + ?

To match metacharacters in regex you need to escape. In Python, we escape them with a single or double backslash “\”. The following displays the general escape syntax for the most common metacharacters:

Table 2 Escaping metacharacters.#
Metacharacter	Literal Meaning	Escape Syntax
.	period or dot	`\\.`
$	dollar sign	`\$`
*	asterisk	`\*`
+	plus sign	`\+`
?	question mark	`\?`
\|	vertical bar	`\\|`
^	caret	`\^`
[	square bracket	`\[`
{	curly brace	`\{`
(	parenthesis	`\(`

The reason we need to escape these characters is because most of these actually have meaning when declaring regular expressions. For example, say we wanted to identify any product_category that contains a period (“.”). If we simply use the following we actually get all our records back.

Warning

Actually, this returns almost all our records. Since we use na=False, any product_category that has a missing value will be dropped.

wildcard = products['product_category'].str.contains('.', case=False, na=False)
products[wildcard].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
0	25671	2	GROCERY	National	frozen ICE	ICE - CRUSHED/CUBED	22 LB
2	26093	69	PASTRY	Private	BREAD	BREAD:ITALIAN/FRENCH	None
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
4	26355	69	GROCERY	Private	COOKIES/CONES	SPECIALTY COOKIES	14 OZ
5	26426	69	GROCERY	Private	SPICES & EXTRACTS	SPICES & SEASONINGS	2.5 OZ

So, we need to use an escape (“”) to tell the regular expression you want to match a literal metacharacter.

includes_dot = products['product_category'].str.contains('\\.', case=False, na=False)
products[includes_dot].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
527	43361	1916	GROCERY	National	MISC. DAIRY	MEXICAN SOFT TORTILLAS AND WRA	33 OZ
802	50776	69	GROCERY	Private	MISC. DAIRY	REFRIGERATED PUDDING	24 OZ
866	53070	69	GROCERY	Private	MISC. DAIRY	REFRIGERATED PUDDING	24 OZ
1019	60924	435	GROCERY	National	MISC. DAIRY	REFRIGERATED PASTA	9 OZ
1114	64623	289	GROCERY	National	MISC. DAIRY	MISC DAIRY REFIGERATED	20 OZ

Character classes#

To match one of several characters in a specified set we can enclose the characters of concern with square brackets [ ]. In addition, to matching any characters not in a specified character set we can include the caret ^ at the beginning of the set within the brackets. The following displays the general syntax for common character classes but these can be altered easily as shown in the examples that follow:

Table 3 Common character classes.#
Character class	Description
`[aeiou]`	match any specified lower case vowel
`[AEIOU]`	match any specified upper case vowel
`[0123456789]`	match any specified numeric values
`[0-9]`	match any range specified numeric values
`[a-z]`	match any range of lowercase letters
`[A-Z]`	match any range of uppercase letters
`[a-zA-Z0-9]`	match any of the above
`[^aeiou]`	match anything other than a lowercase vowel
`[^0-9]`	match anything other than the specified numeric values

For example, say we wanted to find any products where the package_size is not a round numeric size in ounces. The following identifies any rows where package_size contains a dot (remember, we need to escape that character with \\.) followed by “oz”.

decimal_ounces = products['package_size'].str.contains('\\.[0-9] oz', case=False, na=False)
products[decimal_ounces].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
5	26426	69	GROCERY	Private	SPICES & EXTRACTS	SPICES & SEASONINGS	2.5 OZ
34	27695	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	5.8 OZ
43	27767	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	6.5 OZ
44	27812	69	GROCERY	Private	COFFEE	GROUND COFFEE	34.5 OZ
50	27925	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	6.5 OZ

Now, say we wanted to do the same but we are interested in any packages that are in ounces (“OZ”) or pounds (“LB”). Your first reaction is probably to do something like:

decimal_ounces_or_lbs = (
    products['package_size']
    .str.contains('\\.[0-9] oz|lb', case=False, na=False)
)

products[decimal_ounces_or_lbs].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
0	25671	2	GROCERY	National	frozen ICE	ICE - CRUSHED/CUBED	22 LB
5	26426	69	GROCERY	Private	SPICES & EXTRACTS	SPICES & SEASONINGS	2.5 OZ
34	27695	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	5.8 OZ
43	27767	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	6.5 OZ
44	27812	69	GROCERY	Private	COFFEE	GROUND COFFEE	34.5 OZ

Wait! The first observation is in pounds but its a round number and not a decimal. This is because our regex (\\.[0-9] oz|lb) is actually looking for any package size where its a decimal of ounces (\\.[0-9] oz) or in pounds (lb).

We need to modify our regex just a tad. If we change it to \\.[0-9] (oz|lb) (note that oz|lb is now in parenthesis), we are now specifying to search for \\.[0-9] followed by “oz” or “lb”.

Tip

Using parentheses such as (oz|lb) is known as grouping. If we simply use (oz|lb) then it will be treated as a capture grouping and Pandas will throw a warning letting you know that you could extract the matching groups if you prefer. If this warning bothers you then you can add ?: at the begining ((?:oz|lb))to tell Pandas that you just care about matching this pattern and not capturing each group to extract.

decimal_ounces_or_lbs = (
    products['package_size']
    .str.contains('\\.[0-9] (oz|lb)', case=False, na=False)
)

products[decimal_ounces_or_lbs].head()

/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_27980/923952225.py:3: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.
  .str.contains('\\.[0-9] (oz|lb)', case=False, na=False)

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
5	26426	69	GROCERY	Private	SPICES & EXTRACTS	SPICES & SEASONINGS	2.5 OZ
34	27695	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	5.8 OZ
43	27767	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	6.5 OZ
44	27812	69	GROCERY	Private	COFFEE	GROUND COFFEE	34.5 OZ
50	27925	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	6.5 OZ

Now, say we wanted to find any package size that contains a decimal between 0-.4:

smaller_sizes = (
    products['package_size']
    .str.contains('\\.[0-4] (?:oz|lb)', case=False, na=False)
)

products[smaller_sizes].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
60	28116	69	GROCERY	Private	DRY MIX DESSERTS	GELATIN	.3 OZ
61	28117	69	GROCERY	Private	DRY MIX DESSERTS	GELATIN	.3 OZ
62	28143	69	GROCERY	Private	DRY MIX DESSERTS	GELATIN	.3 OZ
69	28208	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	6.2 OZ
79	28377	69	GROCERY	Private	DINNER MXS:DRY	SKILLET DINNERS	6.4 OZ

Shorthand character classes#

Since certain character classes are used often, a series of shorthand character classes are available. For example, rather than use [0-9] every time we are searching for a number in a regex, we can use just use \\d to match any digit. The following are a few of the commonly used shorthand character classes:

Table 4 Common shorthand character classes.#
Syntax	Description
`\\d`	match any digit
`\\D`	match any non-digit
`\\s`	match a space character
`\\S`	match a non-space character
`\\w`	match a word
`\\W`	match a non-word
`\\b`	match a word boundary
`\\B`	match a non-word boundary

We can use these to find patterns such as…

Find all products where the package_size starts with a numeric digit:

starts_with_digit = products['package_size'].str.contains('^\\d', case=False, na=False)
products[starts_with_digit].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
0	25671	2	GROCERY	National	frozen ICE	ICE - CRUSHED/CUBED	22 LB
3	26190	69	GROCERY	Private	FRUIT - SHELF STABLE	APPLE SAUCE	50 OZ
4	26355	69	GROCERY	Private	COOKIES/CONES	SPECIALTY COOKIES	14 OZ
5	26426	69	GROCERY	Private	SPICES & EXTRACTS	SPICES & SEASONINGS	2.5 OZ
6	26540	69	GROCERY	Private	COOKIES/CONES	TRAY PACK/CHOC CHIP COOKIES	16 OZ

Or all products where the package_size starts with a non-digit:

starts_with_nondigit = products['package_size'].str.contains('^\\D', case=False, na=False)
products[starts_with_nondigit].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
31	27657	69	GROCERY	Private	CHEESE	NATURAL CHEESE EXACT WT CHUNKS	A B D 8 OZ
60	28116	69	GROCERY	Private	DRY MIX DESSERTS	GELATIN	.3 OZ
61	28117	69	GROCERY	Private	DRY MIX DESSERTS	GELATIN	.3 OZ
62	28143	69	GROCERY	Private	DRY MIX DESSERTS	GELATIN	.3 OZ
67	28192	69	GROCERY	Private	HOUSEHOLD CLEANG NEEDS	AMMONIA	.5 GAL

Repetition#

When we want to match a certain number of characters that meet a certain criteria we can apply repetition operators to our pattern searches. Common repetition operators include:

Table 5 Common repetition operators.#
Syntax	Description
`.`	wildcard to match any character once
`?`	the preceding item is optional and will be matched at most once
`*`	the preceding item will be matched zero or more times
`+`	the preceding item will be matched one or more times
`{n}`	the preceding item will be matched exactly n times
`{n,}`	the preceding item will be matched n or more times
`{n,m}`	the preceding item will be matched at least n times but not more than m times

For example, say we want to find all products where the package_size contains at least 3 digits:

three_digits = products['package_size'].str.contains('\\d{3,}', na=False)
products[three_digits].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
7	26601	69	DRUG GM	Private	VITAMINS	VITAMIN - MINERALS	300 CT(1)
85	28603	69	GROCERY	Private	FACIAL TISS/DNR NAPKIN	PAPER NAPKINS	250 CT
171	32265	693	DRUG GM	National	CANDY - CHECKLANE	CANDY BARS (SINGLES)(INCLUDING	1.625 OZ
243	34214	869	GROCERY	National	TEAS	TEA BAGS & BULK TEA	100 CT
281	35215	69	DRUG GM	Private	ANALGESICS	ADULT ANALGESICS	100 CT

One thing you probably notice is that the above syntax will match three or more digits within the entire character string. But what if we wanted to identify repetition of a pattern sequence. For example, say we wanted to find product_ids where the number “8” is repeated. We simply combine 8 with the corresponding repitition operator from above.

prod_id_8s = products['product_id'].astype(str).str.contains('8{2,}', na=False)
products[prod_id_8s].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
11	26889	32	DRUG GM	National	MAGAZINE	TV/MOVIE-MAGAZINE	None
88	28889	69	SEAFOOD-PCKGD	Private	SEAFOOD - FROZEN	SEAFOOD-FRZ-RAW FILLETS	12 OZ
89	28892	69	GROCERY	Private	FROZEN PIZZA	SNACKS/APPETIZERS	20 OZ
90	28897	69	GROCERY	Private	EGGS	EGGS - X-LARGE	A D 1 DZ
199	32888	1075	GROCERY	National	CRACKERS/MISC BKD FD	SNACK CRACKERS	8.5 OZ

What if we wanted to look for product_ids that contain two, but no more than three “8”s in a row:

prod_id_8s = products['product_id'].astype(str).str.contains('8{2,3}', na=False)
products[prod_id_8s].head()

	product_id	manufacturer_id	department	brand	product_category	product_type	package_size
11	26889	32	DRUG GM	National	MAGAZINE	TV/MOVIE-MAGAZINE	None
88	28889	69	SEAFOOD-PCKGD	Private	SEAFOOD - FROZEN	SEAFOOD-FRZ-RAW FILLETS	12 OZ
89	28892	69	GROCERY	Private	FROZEN PIZZA	SNACKS/APPETIZERS	20 OZ
90	28897	69	GROCERY	Private	EGGS	EGGS - X-LARGE	A D 1 DZ
199	32888	1075	GROCERY	National	CRACKERS/MISC BKD FD	SNACK CRACKERS	8.5 OZ

Putting it altogether#

Ok, let’s use a few tools we’ve learned to answer a question we may get asked by our boss. Say we were asked to identify the top 5 products that have the most total sales; however, we only want to focus on those products that weigh 10lbs or more. We can apply the following steps:

filter for regex "^\\d{2,}\\.?.*lb" which means:
- ^\\d{2,}: starts with at least 2 numeric digits
- \\.?.: followed by an optional decimal
- .*lb: followed by a character zero or more times and then the identifier for pounds (lb).
take the resulting product list and inner join with transactions so we only retain those transactions and products that have a matching product ID in both tables,
compute total sales grouped by product (here we use product type just to provide us more context over the product ID),
and then use nlargest to get the top 5 total_sales values (you could’ve also used sort_values() to get to the same conclusion).

size_filter = products['package_size'].str.contains('^\\d{2,}\\.?.*lb', case=False, na=False)
(
    products[size_filter]
    .merge(cj_data['transactions'], how='inner', on='product_id')
    .groupby('product_type', as_index=False)
    .agg({'sales_value': 'sum'})
    .nlargest(5, 'sales_value')
)

	product_type	sales_value
3	BANANAS	17480.66
28	GRAPES RED	9999.96
29	GRAPES WHITE	8467.86
61	POTATOES RUSSET (BULK&BAG)	7805.49
22	DRY DOG FOOD PREMIUM (ALPO/PUR	7146.87

Knowledge check#

Questions:

How many products contain the word “bulk” in product_type?
How many products do not contain punctuation in their package_size?
Find all frozen pizza products. Be careful, this is not straight forward!

Video 🎥:

Exercises#

Questions:

To answer these questions you’ll need to use the products and transactions data frames.

Identify all different products that contain “pizza” in their product_type description. Which product produces the greatest amount of total sales?
Identify all products that are categorized (product_category) as pizza but are considered a snack or appetizer (product_type). Which of these products have the most sales (measured by quantity)?
How many products contain package_sizes that do not contain a numeric value.

Computing environment#

Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.26.0

jupyterlab        : 4.2.3
pandas            : 2.2.2
completejourney_py: 0.0.3

Lesson 4c: Handling text data

Contents

Lesson 4c: Handling text data#

Learning objectives#

Prerequisites#

String basics#

Case conversion#

Value counts#

Character counts#

Detecting words#

Extracting parts of strings#

Replacing parts of strings#

Knowledge check#

Regular expressions#

Regex basics#

Multiple words#

Line anchors#

Metacharacters#

Character classes#

Shorthand character classes#

Repetition#

Putting it altogether#

Knowledge check#

Exercises#

Computing environment#