{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lesson 4c: Handling text data\n", "\n", "Dealing with character strings is often under-emphasized in data analysis training. The focus typically remains on numeric values; however, the growth in data collection is also resulting in greater bits of information embedded in text. Consequently, handling, cleaning and processing character strings is becoming a prerequisite in daily data analysis. This lesson is meant to give you the foundation of working with character strings.\n", "\n", "## Learning objectives\n", "\n", "By the end of this lesson you'll be able to:\n", "\n", "* Perform basic character string manipulations.\n", "* Use regular expressions to identify and manipulate patterns in character strings." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prerequisites\n", "\n", "The Python standard library has lots of built-in capabilities to manipulate character strings. Pandas has incorporated many of these capabilities and even expanded upon them. In this lesson, we will demonstrate different character string capabilities with the `.str.xxx()` methods provided by Pandas.\n", "\n", "```{note}\n", "Most of the Pandas string methods have a similar standard library method that can be applied to character strings outside of DataFrames. See the [standard library documentation](https://docs.python.org/3/library/string.html) for help.\n", "```" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For data, we'll leverage the **completejourney** data. For example, the `products` data within **completejourney** provides various variables (e.g. `product_category` and `product_type`) that contain character strings we may need to clean, normalize, or identify patterns within." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
0256712GROCERYNationalFRZN ICEICE - CRUSHED/CUBED22 LB
1260812MISCELLANEOUSNationalNoneNoneNone
22609369PASTRYPrivateBREADBREAD:ITALIAN/FRENCHNone
32619069GROCERYPrivateFRUIT - SHELF STABLEAPPLE SAUCE50 OZ
42635569GROCERYPrivateCOOKIES/CONESSPECIALTY COOKIES14 OZ
........................
92326182931426384DRUG GMNationalBOOKSTOREPAPERBACK BOOKSNone
92327182934396393DRUG GMNationalBOOKSTORECHILDRENS LOW ENDNone
92328182936966406DRUG GMNationalBOOKSTOREPAPERBACK BEST SELLERNone
92329182940806442DRUG GMNationalBOOKSTOREPAPERBACK BOOKSNone
9233018316298764GROCERYNationalPAPER TOWELSPAPER TOWELS & HOLDERSNone
\n", "

92331 rows × 7 columns

\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "0 25671 2 GROCERY National \n", "1 26081 2 MISCELLANEOUS National \n", "2 26093 69 PASTRY Private \n", "3 26190 69 GROCERY Private \n", "4 26355 69 GROCERY Private \n", "... ... ... ... ... \n", "92326 18293142 6384 DRUG GM National \n", "92327 18293439 6393 DRUG GM National \n", "92328 18293696 6406 DRUG GM National \n", "92329 18294080 6442 DRUG GM National \n", "92330 18316298 764 GROCERY National \n", "\n", " product_category product_type package_size \n", "0 FRZN ICE ICE - CRUSHED/CUBED 22 LB \n", "1 None None None \n", "2 BREAD BREAD:ITALIAN/FRENCH None \n", "3 FRUIT - SHELF STABLE APPLE SAUCE 50 OZ \n", "4 COOKIES/CONES SPECIALTY COOKIES 14 OZ \n", "... ... ... ... \n", "92326 BOOKSTORE PAPERBACK BOOKS None \n", "92327 BOOKSTORE CHILDRENS LOW END None \n", "92328 BOOKSTORE PAPERBACK BEST SELLER None \n", "92329 BOOKSTORE PAPERBACK BOOKS None \n", "92330 PAPER TOWELS PAPER TOWELS & HOLDERS None \n", "\n", "[92331 rows x 7 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from completejourney_py import get_data\n", "\n", "cj_data = get_data()\n", "products = cj_data['products']\n", "products" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String basics\n", "\n", "Basic string manipulation typically includes case conversion, counting characters, and extracting parts of a string. The following will demonstrate some of these basic string methods on a Pandas DataFrame but realize that Pandas provides many more string methods. Refer to the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) to learn more about the many string methods.\n", "\n", "```{admonition} Video 🎥:\n", "\n", "```\n", "\n", "```{tip}\n", "The string methods provided by Pandas are accessed by appending the object with `.str` followed by another dot and then the specific string method. Think of the term ‘accessor’ as giving the object access to specialized string methods.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Case conversion\n", "\n", "To change the case of a character string we can use `.str.lower()` and `.str.upper()`:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 frzn ice\n", "1 None\n", "2 bread\n", "3 fruit - shelf stable\n", "4 cookies/cones\n", "Name: product_category, dtype: object" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products['product_category'].str.lower().head()" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 FRZN ICE\n", "1 None\n", "2 BREAD\n", "3 FRUIT - SHELF STABLE\n", "4 COOKIES/CONES\n", "Name: product_category, dtype: object" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products['product_category'].str.upper().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Value counts\n", "\n", "We've already seen this method in a previous lesson but it's worth discussing more. The `value_counts()` method is one of the most valuable methods for string columns. It returns the count of each unique value in the Series and sorts it from most to least common." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GREETING CARDS/WRAP/PARTY SPLY 2785\n", "CANDY - PACKAGED 2475\n", "MAKEUP AND TREATMENT 2467\n", "HAIR CARE PRODUCTS 1744\n", "SOFT DRINKS 1704\n", " ... \n", "EASTER LILY 1\n", "BOUQUET (NON ROSE) 1\n", "MISCELLANEOUS CROUTONS 1\n", "PKG.SEAFOOD MISC 1\n", "FROZEN PACKAGE MEAT 1\n", "Name: product_category, Length: 303, dtype: int64" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# raw value counts\n", "products['product_category'].value_counts()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GREETING CARDS/WRAP/PARTY SPLY 0.030341\n", "CANDY - PACKAGED 0.026963\n", "MAKEUP AND TREATMENT 0.026876\n", "HAIR CARE PRODUCTS 0.019000\n", "SOFT DRINKS 0.018564\n", " ... \n", "EASTER LILY 0.000011\n", "BOUQUET (NON ROSE) 0.000011\n", "MISCELLANEOUS CROUTONS 0.000011\n", "PKG.SEAFOOD MISC 0.000011\n", "FROZEN PACKAGE MEAT 0.000011\n", "Name: product_category, Length: 303, dtype: float64" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# normalized value counts\n", "products['product_category'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how `value_counts()` does not start with `.str`. This is not a string method but is most commonly used for strings. In fact we can use `value_counts()` for any data type." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2017-05-28 15:19:55 161\n", "2017-06-11 19:07:33 155\n", "2017-01-04 00:55:24 153\n", "2017-05-09 20:41:28 146\n", "2017-12-05 17:58:13 144\n", "Name: transaction_timestamp, dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# value counts on date-time data\n", "cj_data['transactions']['transaction_timestamp'].value_counts().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Character counts\n", "\n", "We can get the count of the entire character string with `len()`. This returns the total number of characters in a string to include white space and other non-alphanumeric characters." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 8.0\n", "1 NaN\n", "2 5.0\n", "3 20.0\n", "4 13.0\n", "Name: product_category, dtype: float64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products['product_category'].str.len().head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes we only want to understand how many times a particular word, statement, or even particular letters are used. We can do this with the `count()` method. This method becomes more flexible as we introduce regular expressions, which we will get to shortly.\n", "\n", "The below illustrates that we can easily count how many times 'meat' is used in the `product_category` column:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4875.0" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products['product_category'].str.lower().str.count('meat').sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Detecting words\n", "\n", "The `contains()` method returns a boolean whether or not the passed string is contained somewhere within the string. We can use this to filter for particular words or expressions. For example, the following finds all products where `product_category` contains 'meat'.\n", "\n", "```{tip}\n", "We can use `case=False` to ignore case sensitivity. We also use `na=False` to fill in all missing `product_category` values with `False`. If we didn't do this then the subsetting would throw an error. However, use this wisely as it is basically treating missing `product_category` values as non-meat products!\n", "```" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
12330003397MEAT-PCKGDNationalFROZEN MEATFRZN BREADED PREPARED CHICK9 OZ
15631493151GROCERYNationalFRZN MEAT/MEAT DINNERSFRZN SS PREMIUM ENTREES/DNRS/T10 OZ
274349971329GROCERYNationalFRZN MEAT/MEAT DINNERSFRZN SS PREMIUM ENTREES/DNRS/T12 OZ
319364061329GROCERYNationalFRZN MEAT/MEAT DINNERSFRZN SS PREMIUM ENTREES/DNRS/T12 OZ
32236561165GROCERYNationalFRZN MEAT/MEAT DINNERSFRZN MULTI SERVE ENTREES ALL22 OZ
........................
92199181073381094MEAT-PCKGDNationalLUNCHMEATLUNCH COMBO6.8 OZ
92201181074241094MEAT-PCKGDNationalLUNCHMEATLUNCH COMBO7.1 OZ
922371814744669MEAT-PCKGDPrivateLUNCHMEATVARIETY PACK10 OZ
92253181481861251GROCERYNationalFRZN MEAT/MEAT DINNERSSS ECONOMY ENTREES/DINNERS ALL14 OZ
92261181485481251GROCERYNationalFRZN MEAT/MEAT DINNERSSS ECONOMY ENTREES/DINNERS ALL14 OZ
\n", "

3607 rows × 7 columns

\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "123 30003 397 MEAT-PCKGD National \n", "156 31493 151 GROCERY National \n", "274 34997 1329 GROCERY National \n", "319 36406 1329 GROCERY National \n", "322 36561 165 GROCERY National \n", "... ... ... ... ... \n", "92199 18107338 1094 MEAT-PCKGD National \n", "92201 18107424 1094 MEAT-PCKGD National \n", "92237 18147446 69 MEAT-PCKGD Private \n", "92253 18148186 1251 GROCERY National \n", "92261 18148548 1251 GROCERY National \n", "\n", " product_category product_type package_size \n", "123 FROZEN MEAT FRZN BREADED PREPARED CHICK 9 OZ \n", "156 FRZN MEAT/MEAT DINNERS FRZN SS PREMIUM ENTREES/DNRS/T 10 OZ \n", "274 FRZN MEAT/MEAT DINNERS FRZN SS PREMIUM ENTREES/DNRS/T 12 OZ \n", "319 FRZN MEAT/MEAT DINNERS FRZN SS PREMIUM ENTREES/DNRS/T 12 OZ \n", "322 FRZN MEAT/MEAT DINNERS FRZN MULTI SERVE ENTREES ALL 22 OZ \n", "... ... ... ... \n", "92199 LUNCHMEAT LUNCH COMBO 6.8 OZ \n", "92201 LUNCHMEAT LUNCH COMBO 7.1 OZ \n", "92237 LUNCHMEAT VARIETY PACK 10 OZ \n", "92253 FRZN MEAT/MEAT DINNERS SS ECONOMY ENTREES/DINNERS ALL 14 OZ \n", "92261 FRZN MEAT/MEAT DINNERS SS ECONOMY ENTREES/DINNERS ALL 14 OZ \n", "\n", "[3607 rows x 7 columns]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "meat_products = products['product_category'].str.contains('meat', case=False, na=False)\n", "products[meat_products]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extracting parts of strings\n", "\n", "We can use traditional index-styling to extract or slice parts of a character string. For example, the following gets the first and last 5 characters in each string." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 FRZN \n", "1 None\n", "2 BREAD\n", "3 FRUIT\n", "4 COOKI\n", "Name: product_category, dtype: object" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products['product_category'].str[:5].head()" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 N ICE\n", "1 None\n", "2 BREAD\n", "3 TABLE\n", "4 CONES\n", "Name: product_category, dtype: object" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "products['product_category'].str[-5:].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Replacing parts of strings\n", "\n", "Sometimes we may want to clean up character strings by replacing certain words or phrases. For example, this data contains the word ‘frzn’ in place of 'frozen'. If we wanted to fix this we could use the `replace()` method." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
0256712GROCERYNationalFRZN ICEICE - CRUSHED/CUBED22 LB
392775469GROCERYPrivateFRZN VEGETABLE/VEG DSHFRZN BAGGED VEGETABLES - PLAIN16 OZ
542796069GROCERYPrivateFRZN FRUITSFROZEN FRUIT16 OZ
992913269GROCERYPrivateFRZN VEGETABLE/VEG DSHFRZN BAGGED VEGETABLES - PLAIN16 OZ
1273017969GROCERYPrivateFRZN VEGETABLE/VEG DSHFRZN BOXED VEGETABLES - PLAIN10 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "0 25671 2 GROCERY National FRZN ICE \n", "39 27754 69 GROCERY Private FRZN VEGETABLE/VEG DSH \n", "54 27960 69 GROCERY Private FRZN FRUITS \n", "99 29132 69 GROCERY Private FRZN VEGETABLE/VEG DSH \n", "127 30179 69 GROCERY Private FRZN VEGETABLE/VEG DSH \n", "\n", " product_type package_size \n", "0 ICE - CRUSHED/CUBED 22 LB \n", "39 FRZN BAGGED VEGETABLES - PLAIN 16 OZ \n", "54 FROZEN FRUIT 16 OZ \n", "99 FRZN BAGGED VEGETABLES - PLAIN 16 OZ \n", "127 FRZN BOXED VEGETABLES - PLAIN 10 OZ " ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# product_category observations containing 'frzn'\n", "frzn_products = products['product_category'].str.contains('frzn', case=False, na=False)\n", "products[frzn_products].head()" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
0256712GROCERYNationalfrozen ICEICE - CRUSHED/CUBED22 LB
392775469GROCERYPrivatefrozen VEGETABLE/VEG DSHFRZN BAGGED VEGETABLES - PLAIN16 OZ
542796069GROCERYPrivatefrozen FRUITSFROZEN FRUIT16 OZ
992913269GROCERYPrivatefrozen VEGETABLE/VEG DSHFRZN BAGGED VEGETABLES - PLAIN16 OZ
1273017969GROCERYPrivatefrozen VEGETABLE/VEG DSHFRZN BOXED VEGETABLES - PLAIN10 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "0 25671 2 GROCERY National \n", "39 27754 69 GROCERY Private \n", "54 27960 69 GROCERY Private \n", "99 29132 69 GROCERY Private \n", "127 30179 69 GROCERY Private \n", "\n", " product_category product_type package_size \n", "0 frozen ICE ICE - CRUSHED/CUBED 22 LB \n", "39 frozen VEGETABLE/VEG DSH FRZN BAGGED VEGETABLES - PLAIN 16 OZ \n", "54 frozen FRUITS FROZEN FRUIT 16 OZ \n", "99 frozen VEGETABLE/VEG DSH FRZN BAGGED VEGETABLES - PLAIN 16 OZ \n", "127 frozen VEGETABLE/VEG DSH FRZN BOXED VEGETABLES - PLAIN 10 OZ " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# replacing 'frzn' with 'frozen' in `product_category` column\n", "products['product_category'] = products['product_category'].str.replace('frzn', 'frozen', case=False)\n", "products[frzn_products].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Knowledge check\n", "\n", "```{admonition} Questions\n", ":class: attention\n", "1. Using the `product_type` column, which product has the longest description? How about the shortest?\n", "2. Which `package_size` values are most common in our data? Which is the least common?\n", "2. Replace all instances of 'DSH' with 'dish' in the `product_category` column. How many products does this impact?\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Video 🎥:\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regular expressions\n", "\n", "A regular expression (aka regex) is a sequence of characters that define a search pattern, mainly for use in pattern matching with text strings. Typically, regex patterns consist of a combination of alphanumeric characters as well as special characters. The pattern can also be as simple as a single character or it can be more complex and include several combinations of characters.\n", "\n", "To understand how to work with regular expressions in Python, we need to consider two primary features of regular expressions. One has to do with the syntax, or the way regex patterns are expressed in Python. The other has to do with the functions used for regex matching in Python. You will be exposed to both of these in the following sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regex basics\n", "\n", "Python’s standard library, and Pandas adoption of these string methods provides us a convenient approach to regular expressions. Many of the methods used in the previous sections can be used with regex. The common pattern for our method calls consist of:\n", "\n", "
\n", "string_object.str.*(pattern)\n", "
\n", "\n", "where:\n", "\n", "* `string_object` represents the character string input (i.e. products['product_category']),\n", "* `.str.*` represents a wide variety of regex methods depending on what you want to do,\n", "* `pattern` represents the regex pattern you are looking to match.\n", "\n", "For example let’s say you are looking for observations where the word “FRUIT” was used in the `product_category` description. As we saw earlier `.str.contains()` detects the presence or absence of a pattern (“FRUIT” in this example) and returns a boolean response. Since the output is TRUE or FALSE, this is a handy function to combine with indexing to filter for observations that have that pattern." ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
32619069GROCERYPrivateFRUIT - SHELF STABLEAPPLE SAUCE50 OZ
252750369GROCERYPrivateFRUIT - SHELF STABLECRANBERRY SAUCE16 OZ
542796069GROCERYPrivatefrozen FRUITSFROZEN FRUIT16 OZ
204330485820PRODUCENationalORGANICS FRUIT & VEGETABLESORGANIC CITRUS2 LB BAG
34437543876NUTRITIONNationalDRIED FRUITRAISINS24 OZ
........................
92140180564531378NUTRITIONNationalDRIED FRUITDRIED FRUIT - OTHER.71 OZ
92287181857665819PRODUCENationalORGANICS FRUIT & VEGETABLESORGANIC HERBS.75 OZ
92289181865045819PRODUCENationalORGANICS FRUIT & VEGETABLESORGANIC HERBS.75 OZ
92290181866375819PRODUCENationalORGANICS FRUIT & VEGETABLESORGANIC HERBS.75 OZ
92295181930595819PRODUCENationalORGANICS FRUIT & VEGETABLESORGANIC HERBS.75 OZ
\n", "

1153 rows × 7 columns

\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "3 26190 69 GROCERY Private \n", "25 27503 69 GROCERY Private \n", "54 27960 69 GROCERY Private \n", "204 33048 5820 PRODUCE National \n", "344 37543 876 NUTRITION National \n", "... ... ... ... ... \n", "92140 18056453 1378 NUTRITION National \n", "92287 18185766 5819 PRODUCE National \n", "92289 18186504 5819 PRODUCE National \n", "92290 18186637 5819 PRODUCE National \n", "92295 18193059 5819 PRODUCE National \n", "\n", " product_category product_type package_size \n", "3 FRUIT - SHELF STABLE APPLE SAUCE 50 OZ \n", "25 FRUIT - SHELF STABLE CRANBERRY SAUCE 16 OZ \n", "54 frozen FRUITS FROZEN FRUIT 16 OZ \n", "204 ORGANICS FRUIT & VEGETABLES ORGANIC CITRUS 2 LB BAG \n", "344 DRIED FRUIT RAISINS 24 OZ \n", "... ... ... ... \n", "92140 DRIED FRUIT DRIED FRUIT - OTHER .71 OZ \n", "92287 ORGANICS FRUIT & VEGETABLES ORGANIC HERBS .75 OZ \n", "92289 ORGANICS FRUIT & VEGETABLES ORGANIC HERBS .75 OZ \n", "92290 ORGANICS FRUIT & VEGETABLES ORGANIC HERBS .75 OZ \n", "92295 ORGANICS FRUIT & VEGETABLES ORGANIC HERBS .75 OZ \n", "\n", "[1153 rows x 7 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# detect if the word \"fruit\" is used in each comment\n", "fruit_products = products['product_category'].str.contains('fruit', case=False, na=False)\n", "products[fruit_products]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are wide variety of `.str.` methods. See the previous sections for some of the more common ones or peruse the Pandas [Working with Text Data documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) for examples of more." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple words\n", "\n", "In the previous section you saw that you can search for a given word using some of the `.str.` methods. We can build onto this and search for multiple words. For example, you can search for the phrase “summer” or \"fall\". One may initially think about performing this search with the following in which case we use traditional word matching and then subsetting. This is known as a ‘literal’ search." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
3353707369DRUG GMPrivateSPRING/SUMMER SEASONALGRILL ACCESSORIESNone
1747916141277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARE3 CT
1943990901277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARE2 CT
1947991721277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARE2 CT
21141066681277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARENone
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "335 37073 69 DRUG GM Private \n", "1747 91614 1277 DRUG GM National \n", "1943 99090 1277 DRUG GM National \n", "1947 99172 1277 DRUG GM National \n", "2114 106668 1277 DRUG GM National \n", "\n", " product_category product_type package_size \n", "335 SPRING/SUMMER SEASONAL GRILL ACCESSORIES None \n", "1747 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE 3 CT \n", "1943 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE 2 CT \n", "1947 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE 2 CT \n", "2114 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE None " ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summer_products = products['product_category'].str.contains('summer', case=False, na=False)\n", "fall_products = products['product_category'].str.contains('fall', case=False, na=False)\n", "\n", "products[summer_products | fall_products].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This works, but as our searches get more complex this approach becomes quite tedious. Instead, we can start using regex search patterns to simplify this process. The below approach will search for observations that include the word “summer”, “fall”, or a combination of the two. This is equivalent to the above statement but more succinct. This is truly a regex, as instead of searching for a literal statement, we are looking for a pattern of letter/word combinations." ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
3353707369DRUG GMPrivateSPRING/SUMMER SEASONALGRILL ACCESSORIESNone
1747916141277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARE3 CT
1943990901277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARE2 CT
1947991721277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARE2 CT
21141066681277DRUG GMNationalSPRING/SUMMER SEASONALDISPOSABLE FOILWARENone
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "335 37073 69 DRUG GM Private \n", "1747 91614 1277 DRUG GM National \n", "1943 99090 1277 DRUG GM National \n", "1947 99172 1277 DRUG GM National \n", "2114 106668 1277 DRUG GM National \n", "\n", " product_category product_type package_size \n", "335 SPRING/SUMMER SEASONAL GRILL ACCESSORIES None \n", "1747 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE 3 CT \n", "1943 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE 2 CT \n", "1947 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE 2 CT \n", "2114 SPRING/SUMMER SEASONAL DISPOSABLE FOILWARE None " ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summer_or_fall = products['product_category'].str.contains('summer|fall', case=False, na=False)\n", "products[summer_or_fall].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we can see that our regex of ‘summer|fall’ returns the same number of observations as our literal search." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Literal approach: 651 rows\n", "Regex approach: 651 rows\n" ] } ], "source": [ "literal_count = products[summer_products | fall_products].shape[0]\n", "regex_count = products[summer_or_fall].shape[0]\n", "\n", "print(f'Literal approach: {literal_count} rows')\n", "print(f'Regex approach: {regex_count} rows')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Line anchors\n", "\n", "Line anchors are used to identify patterns at the beginning or end of an element. To find a pattern at the beginning of the element we use `^` and a pattern at the end of the element is found with `$`. For example, if you wanted to find any observations with the word \"fruit\" in the `product_category` column we can use the following as we saw earlier:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
32619069GROCERYPrivateFRUIT - SHELF STABLEAPPLE SAUCE50 OZ
252750369GROCERYPrivateFRUIT - SHELF STABLECRANBERRY SAUCE16 OZ
542796069GROCERYPrivatefrozen FRUITSFROZEN FRUIT16 OZ
204330485820PRODUCENationalORGANICS FRUIT & VEGETABLESORGANIC CITRUS2 LB BAG
34437543876NUTRITIONNationalDRIED FRUITRAISINS24 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "3 26190 69 GROCERY Private \n", "25 27503 69 GROCERY Private \n", "54 27960 69 GROCERY Private \n", "204 33048 5820 PRODUCE National \n", "344 37543 876 NUTRITION National \n", "\n", " product_category product_type package_size \n", "3 FRUIT - SHELF STABLE APPLE SAUCE 50 OZ \n", "25 FRUIT - SHELF STABLE CRANBERRY SAUCE 16 OZ \n", "54 frozen FRUITS FROZEN FRUIT 16 OZ \n", "204 ORGANICS FRUIT & VEGETABLES ORGANIC CITRUS 2 LB BAG \n", "344 DRIED FRUIT RAISINS 24 OZ " ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# detect if the word \"fruit\" is used in each comment\n", "fruit_products = products['product_category'].str.contains('fruit', case=False, na=False)\n", "products[fruit_products].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, if we only wanted those products where the category starts with “fruit” than we can use the `^` anchor:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
32619069GROCERYPrivateFRUIT - SHELF STABLEAPPLE SAUCE50 OZ
252750369GROCERYPrivateFRUIT - SHELF STABLECRANBERRY SAUCE16 OZ
5904459969GROCERYPrivateFRUIT - SHELF STABLEFRUIT COCKTAIL FRUIT SALAD15.25 OZ
6104521869GROCERYPrivateFRUIT - SHELF STABLEPEARS15.25 OZ
6334560769GROCERYPrivateFRUIT - SHELF STABLEPINEAPPLE8 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "3 26190 69 GROCERY Private FRUIT - SHELF STABLE \n", "25 27503 69 GROCERY Private FRUIT - SHELF STABLE \n", "590 44599 69 GROCERY Private FRUIT - SHELF STABLE \n", "610 45218 69 GROCERY Private FRUIT - SHELF STABLE \n", "633 45607 69 GROCERY Private FRUIT - SHELF STABLE \n", "\n", " product_type package_size \n", "3 APPLE SAUCE 50 OZ \n", "25 CRANBERRY SAUCE 16 OZ \n", "590 FRUIT COCKTAIL FRUIT SALAD 15.25 OZ \n", "610 PEARS 15.25 OZ \n", "633 PINEAPPLE 8 OZ " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "starts_with_fruit = products['product_category'].str.contains('^fruit', case=False, na=False)\n", "products[starts_with_fruit].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, if we only wanted those products where the category *ends with* \"fruit\" than we can use the `$` anchor:\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
34437543876NUTRITIONNationalDRIED FRUITRAISINS24 OZ
772499302PRODUCENationalSTONE FRUITPEACHES YELLOW FLESHNone
8965446069NUTRITIONPrivateDRIED FRUITRAISINS24 OZ BAG
1113645832PRODUCENationalTROPICAL FRUITTROPICAL FRUIT - OTHERCTN
126771432876NUTRITIONNationalDRIED FRUITDRIED FRUIT - OTHER7 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "344 37543 876 NUTRITION National DRIED FRUIT \n", "772 49930 2 PRODUCE National STONE FRUIT \n", "896 54460 69 NUTRITION Private DRIED FRUIT \n", "1113 64583 2 PRODUCE National TROPICAL FRUIT \n", "1267 71432 876 NUTRITION National DRIED FRUIT \n", "\n", " product_type package_size \n", "344 RAISINS 24 OZ \n", "772 PEACHES YELLOW FLESH None \n", "896 RAISINS 24 OZ BAG \n", "1113 TROPICAL FRUIT - OTHER CTN \n", "1267 DRIED FRUIT - OTHER 7 OZ " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ends_with_fruit = products['product_category'].str.contains('fruit$', case=False, na=False)\n", "products[ends_with_fruit].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can combine the two if we only wanted those products where the category *starts or ends with* \"fruit\":" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
32619069GROCERYPrivateFRUIT - SHELF STABLEAPPLE SAUCE50 OZ
252750369GROCERYPrivateFRUIT - SHELF STABLECRANBERRY SAUCE16 OZ
34437543876NUTRITIONNationalDRIED FRUITRAISINS24 OZ
5904459969GROCERYPrivateFRUIT - SHELF STABLEFRUIT COCKTAIL FRUIT SALAD15.25 OZ
6104521869GROCERYPrivateFRUIT - SHELF STABLEPEARS15.25 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "3 26190 69 GROCERY Private FRUIT - SHELF STABLE \n", "25 27503 69 GROCERY Private FRUIT - SHELF STABLE \n", "344 37543 876 NUTRITION National DRIED FRUIT \n", "590 44599 69 GROCERY Private FRUIT - SHELF STABLE \n", "610 45218 69 GROCERY Private FRUIT - SHELF STABLE \n", "\n", " product_type package_size \n", "3 APPLE SAUCE 50 OZ \n", "25 CRANBERRY SAUCE 16 OZ \n", "344 RAISINS 24 OZ \n", "590 FRUIT COCKTAIL FRUIT SALAD 15.25 OZ \n", "610 PEARS 15.25 OZ " ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "starts_or_ends_with_fruit = (\n", " products['product_category']\n", " .str.contains('^fruit|fruit$', case=False, na=False)\n", ")\n", "\n", "products[starts_or_ends_with_fruit].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Metacharacters\n", "\n", "Metacharacters consist of non-alphanumeric symbols such as: \n", "\n", "

\n", "

\n", ".    \    |    (    )    [    {    $    *    +   ?\n", "
\n", "

\n", "\n", "To match metacharacters in regex you need to *escape*. In Python, we escape them with a single or double backslash \"`\\`\". The following displays the general escape syntax for the most common metacharacters:\n", "\n", "```{list-table} Escaping metacharacters.\n", ":header-rows: 1\n", ":name: escaping-characters\n", "\n", "* - Metacharacter\n", " - Literal Meaning\n", " - Escape Syntax\n", "* - .\n", " - period or dot\n", " - `\\\\.`\n", "* - $\n", " - dollar sign\n", " - `\\$`\n", "* - *\n", " - asterisk\n", " - `\\*`\n", "* - +\n", " - plus sign\n", " - `\\+`\n", "* - ?\n", " - question mark\n", " - `\\?`\n", "* - \\|\n", " - vertical bar\n", " - `\\|`\n", "* - ^\n", " - caret\n", " - `\\^`\n", "* - [\n", " - square bracket\n", " - `\\[`\n", "* - {\n", " - curly brace\n", " - `\\{`\n", "* - (\n", " - parenthesis\n", " - `\\(`\n", "```\n", "The reason we need to escape these characters is because most of these actually have meaning when declaring regular expressions. For example, say we wanted to identify any `product_category` that contains a period (\".\"). If we simply use the following we actually get ***all*** our records back. \n", "\n", "```{warning}\n", "Actually, this returns ***almost*** all our records. Since we use `na=False`, any `product_category` that has a missing value will be dropped.\n", "```" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
0256712GROCERYNationalfrozen ICEICE - CRUSHED/CUBED22 LB
22609369PASTRYPrivateBREADBREAD:ITALIAN/FRENCHNone
32619069GROCERYPrivateFRUIT - SHELF STABLEAPPLE SAUCE50 OZ
42635569GROCERYPrivateCOOKIES/CONESSPECIALTY COOKIES14 OZ
52642669GROCERYPrivateSPICES & EXTRACTSSPICES & SEASONINGS2.5 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "0 25671 2 GROCERY National frozen ICE \n", "2 26093 69 PASTRY Private BREAD \n", "3 26190 69 GROCERY Private FRUIT - SHELF STABLE \n", "4 26355 69 GROCERY Private COOKIES/CONES \n", "5 26426 69 GROCERY Private SPICES & EXTRACTS \n", "\n", " product_type package_size \n", "0 ICE - CRUSHED/CUBED 22 LB \n", "2 BREAD:ITALIAN/FRENCH None \n", "3 APPLE SAUCE 50 OZ \n", "4 SPECIALTY COOKIES 14 OZ \n", "5 SPICES & SEASONINGS 2.5 OZ " ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wildcard = products['product_category'].str.contains('.', case=False, na=False)\n", "products[wildcard].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we need to use an **escape** (\"\\\") to tell the regular expression you want to match a literal metacharacter." ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
527433611916GROCERYNationalMISC. DAIRYMEXICAN SOFT TORTILLAS AND WRA33 OZ
8025077669GROCERYPrivateMISC. DAIRYREFRIGERATED PUDDING24 OZ
8665307069GROCERYPrivateMISC. DAIRYREFRIGERATED PUDDING24 OZ
101960924435GROCERYNationalMISC. DAIRYREFRIGERATED PASTA9 OZ
111464623289GROCERYNationalMISC. DAIRYMISC DAIRY REFIGERATED20 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "527 43361 1916 GROCERY National MISC. DAIRY \n", "802 50776 69 GROCERY Private MISC. DAIRY \n", "866 53070 69 GROCERY Private MISC. DAIRY \n", "1019 60924 435 GROCERY National MISC. DAIRY \n", "1114 64623 289 GROCERY National MISC. DAIRY \n", "\n", " product_type package_size \n", "527 MEXICAN SOFT TORTILLAS AND WRA 33 OZ \n", "802 REFRIGERATED PUDDING 24 OZ \n", "866 REFRIGERATED PUDDING 24 OZ \n", "1019 REFRIGERATED PASTA 9 OZ \n", "1114 MISC DAIRY REFIGERATED 20 OZ " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "includes_dot = products['product_category'].str.contains('\\\\.', case=False, na=False)\n", "products[includes_dot].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Character classes\n", "\n", "To match one of several characters in a specified set we can enclose the characters of concern with square brackets `[ ]`. In addition, to matching any characters __not__ in a specified character set we can include the caret `^` at the beginning of the set within the brackets. The following displays the general syntax for common character classes but these can be altered easily as shown in the examples that follow:\n", "\n", "```{list-table} Common character classes.\n", ":header-rows: 1\n", ":name: common-characters-classes\n", "\n", "* - Character class\n", " - Description\n", "* - `[aeiou]`\n", " - match any specified lower case vowel \n", "* - `[AEIOU]`\n", " - match any specified upper case vowel \n", "* - `[0123456789]`\n", " - match any specified numeric values \n", "* - `[0-9]`\n", " - match any range specified numeric values \n", "* - `[a-z]`\n", " - match any range of lowercase letters \n", "* - `[A-Z]`\n", " - match any range of uppercase letters \n", "* - `[a-zA-Z0-9]`\n", " - match any of the above \n", "* - `[^aeiou]`\n", " - match anything other than a lowercase vowel \n", "* - `[^0-9]`\n", " - match anything other than the specified numeric values\n", "```\n", "\n", "For example, say we wanted to find any products where the `package_size` is not a round numeric size in ounces. The following identifies any rows where `package_size` contains a dot (remember, we need to escape that character with `\\\\.`) followed by \"oz\"." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
52642669GROCERYPrivateSPICES & EXTRACTSSPICES & SEASONINGS2.5 OZ
342769569GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS5.8 OZ
432776769GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS6.5 OZ
442781269GROCERYPrivateCOFFEEGROUND COFFEE34.5 OZ
502792569GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS6.5 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "5 26426 69 GROCERY Private SPICES & EXTRACTS \n", "34 27695 69 GROCERY Private DINNER MXS:DRY \n", "43 27767 69 GROCERY Private DINNER MXS:DRY \n", "44 27812 69 GROCERY Private COFFEE \n", "50 27925 69 GROCERY Private DINNER MXS:DRY \n", "\n", " product_type package_size \n", "5 SPICES & SEASONINGS 2.5 OZ \n", "34 SKILLET DINNERS 5.8 OZ \n", "43 SKILLET DINNERS 6.5 OZ \n", "44 GROUND COFFEE 34.5 OZ \n", "50 SKILLET DINNERS 6.5 OZ " ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "decimal_ounces = products['package_size'].str.contains('\\\\.[0-9] oz', case=False, na=False)\n", "products[decimal_ounces].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, say we wanted to do the same but we are interested in any packages that are in ounces (“OZ”) or pounds (“LB”). Your first reaction is probably to do something like:" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
0256712GROCERYNationalfrozen ICEICE - CRUSHED/CUBED22 LB
52642669GROCERYPrivateSPICES & EXTRACTSSPICES & SEASONINGS2.5 OZ
342769569GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS5.8 OZ
432776769GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS6.5 OZ
442781269GROCERYPrivateCOFFEEGROUND COFFEE34.5 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "0 25671 2 GROCERY National frozen ICE \n", "5 26426 69 GROCERY Private SPICES & EXTRACTS \n", "34 27695 69 GROCERY Private DINNER MXS:DRY \n", "43 27767 69 GROCERY Private DINNER MXS:DRY \n", "44 27812 69 GROCERY Private COFFEE \n", "\n", " product_type package_size \n", "0 ICE - CRUSHED/CUBED 22 LB \n", "5 SPICES & SEASONINGS 2.5 OZ \n", "34 SKILLET DINNERS 5.8 OZ \n", "43 SKILLET DINNERS 6.5 OZ \n", "44 GROUND COFFEE 34.5 OZ " ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "decimal_ounces_or_lbs = (\n", " products['package_size']\n", " .str.contains('\\\\.[0-9] oz|lb', case=False, na=False)\n", ")\n", "\n", "products[decimal_ounces_or_lbs].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait! The first observation is in pounds but its a round number and not a decimal. This is because our regex (`\\\\.[0-9] oz|lb`) is actually looking for any package size where its a decimal of ounces (`\\\\.[0-9] oz`) ***or*** in pounds (`lb`). \n", "\n", "We need to modify our regex just a tad. If we change it to `\\\\.[0-9] (oz|lb)` (note that `oz|lb` is now in parenthesis), we are now specifying to search for `\\\\.[0-9]` followed by \"oz\" or \"lb\".\n", "\n", "```{tip}\n", "Using parentheses such as `(oz|lb)` is known as grouping. If we simply use `(oz|lb)` then it will be treated as a [capture grouping](https://www.regular-expressions.info/brackets.html) and Pandas will throw a warning letting you know that you could extract the matching groups if you prefer. If this warning bothers you then you can add `?:` at the begining (`(?:oz|lb)`)to tell Pandas that you just care about matching this pattern and not capturing each group to extract.\n", "```" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/2514522656.py:2: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n", " products['package_size']\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
52642669GROCERYPrivateSPICES & EXTRACTSSPICES & SEASONINGS2.5 OZ
342769569GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS5.8 OZ
432776769GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS6.5 OZ
442781269GROCERYPrivateCOFFEEGROUND COFFEE34.5 OZ
502792569GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS6.5 OZ
........................
922391814750769GROCERYPrivateSPICES & EXTRACTSSPICES & SEASONINGS3.1 OZ
9226618148654194GROCERYNationalfrozen BREAKFAST FOODSWAFFLES/PANCAKES/FRENCH TOAST11.1 OZ
92269181487262082MEAT-PCKGDNationalHEAT/SERVEENTREES14.5 OZ
92320182730182223GROCERYNationalBAG SNACKSBAGGED CHEESE SNACKS11.5 OZ
92321182730192223GROCERYNationalBAG SNACKSBAGGED CHEESE SNACKS11.5 OZ
\n", "

10712 rows × 7 columns

\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "5 26426 69 GROCERY Private \n", "34 27695 69 GROCERY Private \n", "43 27767 69 GROCERY Private \n", "44 27812 69 GROCERY Private \n", "50 27925 69 GROCERY Private \n", "... ... ... ... ... \n", "92239 18147507 69 GROCERY Private \n", "92266 18148654 194 GROCERY National \n", "92269 18148726 2082 MEAT-PCKGD National \n", "92320 18273018 2223 GROCERY National \n", "92321 18273019 2223 GROCERY National \n", "\n", " product_category product_type package_size \n", "5 SPICES & EXTRACTS SPICES & SEASONINGS 2.5 OZ \n", "34 DINNER MXS:DRY SKILLET DINNERS 5.8 OZ \n", "43 DINNER MXS:DRY SKILLET DINNERS 6.5 OZ \n", "44 COFFEE GROUND COFFEE 34.5 OZ \n", "50 DINNER MXS:DRY SKILLET DINNERS 6.5 OZ \n", "... ... ... ... \n", "92239 SPICES & EXTRACTS SPICES & SEASONINGS 3.1 OZ \n", "92266 frozen BREAKFAST FOODS WAFFLES/PANCAKES/FRENCH TOAST 11.1 OZ \n", "92269 HEAT/SERVE ENTREES 14.5 OZ \n", "92320 BAG SNACKS BAGGED CHEESE SNACKS 11.5 OZ \n", "92321 BAG SNACKS BAGGED CHEESE SNACKS 11.5 OZ \n", "\n", "[10712 rows x 7 columns]" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "decimal_ounces_or_lbs = (\n", " products['package_size']\n", " .str.contains('\\\\.[0-9] (oz|lb)', case=False, na=False)\n", ")\n", "\n", "products[decimal_ounces_or_lbs].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, say we wanted to find any package size that contains a decimal between 0-.4:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
602811669GROCERYPrivateDRY MIX DESSERTSGELATIN.3 OZ
612811769GROCERYPrivateDRY MIX DESSERTSGELATIN.3 OZ
622814369GROCERYPrivateDRY MIX DESSERTSGELATIN.3 OZ
692820869GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS6.2 OZ
792837769GROCERYPrivateDINNER MXS:DRYSKILLET DINNERS6.4 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "60 28116 69 GROCERY Private DRY MIX DESSERTS \n", "61 28117 69 GROCERY Private DRY MIX DESSERTS \n", "62 28143 69 GROCERY Private DRY MIX DESSERTS \n", "69 28208 69 GROCERY Private DINNER MXS:DRY \n", "79 28377 69 GROCERY Private DINNER MXS:DRY \n", "\n", " product_type package_size \n", "60 GELATIN .3 OZ \n", "61 GELATIN .3 OZ \n", "62 GELATIN .3 OZ \n", "69 SKILLET DINNERS 6.2 OZ \n", "79 SKILLET DINNERS 6.4 OZ " ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "smaller_sizes = (\n", " products['package_size']\n", " .str.contains('\\\\.[0-4] (?:oz|lb)', case=False, na=False)\n", ")\n", "\n", "products[smaller_sizes].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Shorthand character classes\n", "\n", "Since certain character classes are used often, a series of shorthand character classes are available. For example, rather than use `[0-9]` every time we are searching for a number in a regex, we can use just use `\\\\d` to match any digit. The following are a few of the commonly used shorthand character classes:\n", "\n", "```{list-table} Common shorthand character classes.\n", ":header-rows: 1\n", ":name: common-shorthand-characters\n", "\n", "* - Syntax\n", " - Description\n", "* - `\\\\d`\n", " - match any digit\n", "* - `\\\\D`\n", " - match any non-digit\n", "* - `\\\\s`\n", " - match a space character\n", "* - `\\\\S`\n", " - match a non-space character\n", "* - `\\\\w`\n", " - match a word\n", "* - `\\\\W`\n", " - match a non-word\n", "* - `\\\\b`\n", " - match a word boundary\n", "* - `\\\\B`\n", " - match a non-word boundary\n", "```\n", "\n", "We can use these to find patterns such as...\n", "\n", "Find all products where the `package_size` starts with a numeric digit:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
0256712GROCERYNationalfrozen ICEICE - CRUSHED/CUBED22 LB
32619069GROCERYPrivateFRUIT - SHELF STABLEAPPLE SAUCE50 OZ
42635569GROCERYPrivateCOOKIES/CONESSPECIALTY COOKIES14 OZ
52642669GROCERYPrivateSPICES & EXTRACTSSPICES & SEASONINGS2.5 OZ
62654069GROCERYPrivateCOOKIES/CONESTRAY PACK/CHOC CHIP COOKIES16 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "0 25671 2 GROCERY National frozen ICE \n", "3 26190 69 GROCERY Private FRUIT - SHELF STABLE \n", "4 26355 69 GROCERY Private COOKIES/CONES \n", "5 26426 69 GROCERY Private SPICES & EXTRACTS \n", "6 26540 69 GROCERY Private COOKIES/CONES \n", "\n", " product_type package_size \n", "0 ICE - CRUSHED/CUBED 22 LB \n", "3 APPLE SAUCE 50 OZ \n", "4 SPECIALTY COOKIES 14 OZ \n", "5 SPICES & SEASONINGS 2.5 OZ \n", "6 TRAY PACK/CHOC CHIP COOKIES 16 OZ " ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "starts_with_digit = products['package_size'].str.contains('^\\\\d', case=False, na=False)\n", "products[starts_with_digit].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or all products where the `package_size` starts with a non-digit:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
312765769GROCERYPrivateCHEESENATURAL CHEESE EXACT WT CHUNKSA B D 8 OZ
602811669GROCERYPrivateDRY MIX DESSERTSGELATIN.3 OZ
612811769GROCERYPrivateDRY MIX DESSERTSGELATIN.3 OZ
622814369GROCERYPrivateDRY MIX DESSERTSGELATIN.3 OZ
672819269GROCERYPrivateHOUSEHOLD CLEANG NEEDSAMMONIA.5 GAL
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "31 27657 69 GROCERY Private CHEESE \n", "60 28116 69 GROCERY Private DRY MIX DESSERTS \n", "61 28117 69 GROCERY Private DRY MIX DESSERTS \n", "62 28143 69 GROCERY Private DRY MIX DESSERTS \n", "67 28192 69 GROCERY Private HOUSEHOLD CLEANG NEEDS \n", "\n", " product_type package_size \n", "31 NATURAL CHEESE EXACT WT CHUNKS A B D 8 OZ \n", "60 GELATIN .3 OZ \n", "61 GELATIN .3 OZ \n", "62 GELATIN .3 OZ \n", "67 AMMONIA .5 GAL " ] }, "execution_count": 88, "metadata": {}, "output_type": "execute_result" } ], "source": [ "starts_with_nondigit = products['package_size'].str.contains('^\\\\D', case=False, na=False)\n", "products[starts_with_nondigit].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Repetition\n", "\n", "When we want to match a __certain number__ of characters that meet a certain criteria we can apply repetition operators to our pattern searches. Common repetition operators include:\n", "\n", "```{list-table} Common repetition operators.\n", ":header-rows: 1\n", ":name: common-repitions\n", "\n", "* - Syntax\n", " - Description\n", "* - `.`\n", " - wildcard to match **any character once**\n", "* - `?`\n", " - the preceding item is optional and will be matched **at most once**\n", "* - `*`\n", " - the preceding item will be matched **zero or more times**\n", "* - `+`\n", " - the preceding item will be matched **one or more times**\n", "* - `{n}`\n", " - the preceding item will be matched **exactly n times**\n", "* - `{n,}`\n", " - the preceding item will be matched **n or more times**\n", "* - `{n,m}`\n", " - the preceding item will be matched **at least n times but not more than m times**\n", "```\n", "\n", "For example, say we want to find all products where the `package_size` contains at least 3 digits:" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
72660169DRUG GMPrivateVITAMINSVITAMIN - MINERALS300 CT(1)
852860369GROCERYPrivateFACIAL TISS/DNR NAPKINPAPER NAPKINS250 CT
17132265693DRUG GMNationalCANDY - CHECKLANECANDY BARS (SINGLES)(INCLUDING1.625 OZ
24334214869GROCERYNationalTEASTEA BAGS & BULK TEA100 CT
2813521569DRUG GMPrivateANALGESICSADULT ANALGESICS100 CT
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand product_category \\\n", "7 26601 69 DRUG GM Private VITAMINS \n", "85 28603 69 GROCERY Private FACIAL TISS/DNR NAPKIN \n", "171 32265 693 DRUG GM National CANDY - CHECKLANE \n", "243 34214 869 GROCERY National TEAS \n", "281 35215 69 DRUG GM Private ANALGESICS \n", "\n", " product_type package_size \n", "7 VITAMIN - MINERALS 300 CT(1) \n", "85 PAPER NAPKINS 250 CT \n", "171 CANDY BARS (SINGLES)(INCLUDING 1.625 OZ \n", "243 TEA BAGS & BULK TEA 100 CT \n", "281 ADULT ANALGESICS 100 CT " ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "three_digits = products['package_size'].str.contains('\\\\d{3,}', na=False)\n", "products[three_digits].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One thing you probably notice is that the above syntax will match three or more digits within the entire character string. But what if we wanted to identify repetition of a pattern sequence. For example, say we wanted to find `product_id`s where the number \"8\" is repeated. We simply combine 8 with the corresponding repitition operator from above." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/3233621946.py:1: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n", " prod_id_8s = products['product_id'].astype(str).str.contains('(8)\\\\1', na=False)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
112688932DRUG GMNationalMAGAZINETV/MOVIE-MAGAZINENone
882888969SEAFOOD-PCKGDPrivateSEAFOOD - FROZENSEAFOOD-FRZ-RAW FILLETS12 OZ
892889269GROCERYPrivateFROZEN PIZZASNACKS/APPETIZERS20 OZ
902889769GROCERYPrivateEGGSEGGS - X-LARGEA D 1 DZ
199328881075GROCERYNationalCRACKERS/MISC BKD FDSNACK CRACKERS8.5 OZ
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "11 26889 32 DRUG GM National \n", "88 28889 69 SEAFOOD-PCKGD Private \n", "89 28892 69 GROCERY Private \n", "90 28897 69 GROCERY Private \n", "199 32888 1075 GROCERY National \n", "\n", " product_category product_type package_size \n", "11 MAGAZINE TV/MOVIE-MAGAZINE None \n", "88 SEAFOOD - FROZEN SEAFOOD-FRZ-RAW FILLETS 12 OZ \n", "89 FROZEN PIZZA SNACKS/APPETIZERS 20 OZ \n", "90 EGGS EGGS - X-LARGE A D 1 DZ \n", "199 CRACKERS/MISC BKD FD SNACK CRACKERS 8.5 OZ " ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod_id_8s = products['product_id'].astype(str).str.contains('8{2,}', na=False)\n", "products[prod_id_8s].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we wanted to look for `product_id`s that contain two, but no more than three \"8\"s in a row:" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/1949681077.py:1: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n", " prod_id_8s = products['product_id'].astype(str).str.contains('(8)\\\\1{2}', na=False)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_idmanufacturer_iddepartmentbrandproduct_categoryproduct_typepackage_size
882888969SEAFOOD-PCKGDPrivateSEAFOOD - FROZENSEAFOOD-FRZ-RAW FILLETS12 OZ
199328881075GROCERYNationalCRACKERS/MISC BKD FDSNACK CRACKERS8.5 OZ
10656288869GROCERYPrivateLAUNDRY ADDITIVESFABRIC SOFTENER LIQUID60 LOADS
1368758881448GROCERYNationalCAT LITTERLITTER ACCESSORIES12 CT
1436788881051DRUG GMNationalELECTRICAL SUPPPLIESDECOR BULBS2 CT
\n", "
" ], "text/plain": [ " product_id manufacturer_id department brand \\\n", "88 28889 69 SEAFOOD-PCKGD Private \n", "199 32888 1075 GROCERY National \n", "1065 62888 69 GROCERY Private \n", "1368 75888 1448 GROCERY National \n", "1436 78888 1051 DRUG GM National \n", "\n", " product_category product_type package_size \n", "88 SEAFOOD - FROZEN SEAFOOD-FRZ-RAW FILLETS 12 OZ \n", "199 CRACKERS/MISC BKD FD SNACK CRACKERS 8.5 OZ \n", "1065 LAUNDRY ADDITIVES FABRIC SOFTENER LIQUID 60 LOADS \n", "1368 CAT LITTER LITTER ACCESSORIES 12 CT \n", "1436 ELECTRICAL SUPPPLIES DECOR BULBS 2 CT " ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prod_id_8s = products['product_id'].astype(str).str.contains('8{2,3}', na=False)\n", "products[prod_id_8s].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting it altogether\n", "\n", "Ok, let's use a few tools we've learned to answer a question we may get asked by our boss. Say we were asked to identify the top 5 products that have the most total sales; however, we only want to focus on those products that weigh 10lbs or more. We can apply the following steps:\n", "\n", "1. filter for regex `\"^\\\\d{2,}\\\\.?.*lb\"` which means:\n", " - `^\\\\d{2,}`: starts with at least 2 numeric digits\n", " - `\\\\.?.`: followed by an optional decimal\n", " - `.*lb`: followed by a character zero or more times and then the identifier for pounds (lb).\n", "2. take the resulting product list and inner join with transactions so we only retain those transactions and products that have a matching product ID in both tables,\n", "3. compute total sales grouped by product (here we use product type just to provide us more context over the product ID),\n", "4. and then use `nlargest` to get the top 5 `total_sales` values (you could've also used `sort_values()` to get to the same conclusion)." ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/3801605409.py:1: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n", " size_filter = products['package_size'].str.contains('^\\d{2,}(\\.)?.*lb', case=False, na=False)\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
product_typesales_value
3BANANAS17480.66
28GRAPES RED9999.96
29GRAPES WHITE8467.86
61POTATOES RUSSET (BULK&BAG)7805.49
22DRY DOG FOOD PREMIUM (ALPO/PUR7146.87
\n", "
" ], "text/plain": [ " product_type sales_value\n", "3 BANANAS 17480.66\n", "28 GRAPES RED 9999.96\n", "29 GRAPES WHITE 8467.86\n", "61 POTATOES RUSSET (BULK&BAG) 7805.49\n", "22 DRY DOG FOOD PREMIUM (ALPO/PUR 7146.87" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "size_filter = products['package_size'].str.contains('^\\\\d{2,}\\\\.?.*lb', case=False, na=False)\n", "(\n", " products[size_filter]\n", " .merge(cj_data['transactions'], how='inner', on='product_id')\n", " .groupby('product_type', as_index=False)\n", " .agg({'sales_value': 'sum'})\n", " .nlargest(5, 'sales_value')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Knowledge check\n", "\n", "```{admonition} Questions:\n", ":class: attention\n", "1. How many `products` contain the word \"bulk\" in `product_type`?\n", "2. How many `products` do not contain punctuation in their `package_size`?\n", "3. Find all frozen pizza products. Be careful, this is not straight forward!\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Video 🎥:\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "```{admonition} Questions:\n", ":class: attention\n", "To answer these questions you'll need to use the `products` and `transactions` data frames.\n", "\n", "1. Identify all different products that contain \"pizza\" in their `product_type` description. Which product produces the greatest amount of total sales?\n", "2. Identify all products that are categorized (`product_category`) as pizza but are considered a snack or appetizer (`product_type`). Which of these products have the most sales (measured by quantity)?\n", "3. How many products contain `package_size`s that do not contain a numeric value.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.9.12\n", "IPython version : 8.2.0\n", "\n", "jupyterlab : 3.3.2\n", "pandas : 1.4.2\n", "completejourney_py: 0.0.3\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p jupyterlab,pandas,completejourney_py" ] } ], "metadata": { "interpreter": { "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }