{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Lesson 4c: Handling text data\n",
    "\n",
    "Dealing with character strings is often under-emphasized in data analysis training. The focus typically remains on numeric values; however, the growth in data collection is also resulting in greater bits of information embedded in text. Consequently, handling, cleaning and processing character strings is becoming a prerequisite in daily data analysis. This lesson is meant to give you the foundation of working with character strings.\n",
    "\n",
    "## Learning objectives\n",
    "\n",
    "By the end of this lesson you'll be able to:\n",
    "\n",
    "* Perform basic character string manipulations.\n",
    "* Use regular expressions to identify and manipulate patterns in character strings."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prerequisites\n",
    "\n",
    "The Python standard library has lots of built-in capabilities to manipulate character strings. Pandas has incorporated many of these capabilities and even expanded upon them. In this lesson, we will demonstrate different character string capabilities with the `.str.xxx()` methods provided by Pandas.\n",
    "\n",
    "```{note}\n",
    "Most of the Pandas string methods have a similar standard library method that can be applied to character strings outside of DataFrames. See the [standard library documentation](https://docs.python.org/3/library/string.html) for help.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For data, we'll leverage the **completejourney** data. For example, the `products` data within **completejourney** provides various variables (e.g. `product_category` and `product_type`) that contain character strings we may need to clean, normalize, or identify patterns within."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25671</td>\n",
       "      <td>2</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN ICE</td>\n",
       "      <td>ICE - CRUSHED/CUBED</td>\n",
       "      <td>22 LB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>26081</td>\n",
       "      <td>2</td>\n",
       "      <td>MISCELLANEOUS</td>\n",
       "      <td>National</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>26093</td>\n",
       "      <td>69</td>\n",
       "      <td>PASTRY</td>\n",
       "      <td>Private</td>\n",
       "      <td>BREAD</td>\n",
       "      <td>BREAD:ITALIAN/FRENCH</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>26190</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>APPLE SAUCE</td>\n",
       "      <td>50 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>26355</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>COOKIES/CONES</td>\n",
       "      <td>SPECIALTY COOKIES</td>\n",
       "      <td>14 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92326</th>\n",
       "      <td>18293142</td>\n",
       "      <td>6384</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>BOOKSTORE</td>\n",
       "      <td>PAPERBACK BOOKS</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92327</th>\n",
       "      <td>18293439</td>\n",
       "      <td>6393</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>BOOKSTORE</td>\n",
       "      <td>CHILDRENS LOW END</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92328</th>\n",
       "      <td>18293696</td>\n",
       "      <td>6406</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>BOOKSTORE</td>\n",
       "      <td>PAPERBACK BEST SELLER</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92329</th>\n",
       "      <td>18294080</td>\n",
       "      <td>6442</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>BOOKSTORE</td>\n",
       "      <td>PAPERBACK BOOKS</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92330</th>\n",
       "      <td>18316298</td>\n",
       "      <td>764</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>PAPER TOWELS</td>\n",
       "      <td>PAPER TOWELS &amp; HOLDERS</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>92331 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       product_id  manufacturer_id     department     brand  \\\n",
       "0           25671                2        GROCERY  National   \n",
       "1           26081                2  MISCELLANEOUS  National   \n",
       "2           26093               69         PASTRY   Private   \n",
       "3           26190               69        GROCERY   Private   \n",
       "4           26355               69        GROCERY   Private   \n",
       "...           ...              ...            ...       ...   \n",
       "92326    18293142             6384        DRUG GM  National   \n",
       "92327    18293439             6393        DRUG GM  National   \n",
       "92328    18293696             6406        DRUG GM  National   \n",
       "92329    18294080             6442        DRUG GM  National   \n",
       "92330    18316298              764        GROCERY  National   \n",
       "\n",
       "           product_category            product_type package_size  \n",
       "0                  FRZN ICE     ICE - CRUSHED/CUBED        22 LB  \n",
       "1                      None                    None         None  \n",
       "2                     BREAD    BREAD:ITALIAN/FRENCH         None  \n",
       "3      FRUIT - SHELF STABLE             APPLE SAUCE        50 OZ  \n",
       "4             COOKIES/CONES       SPECIALTY COOKIES        14 OZ  \n",
       "...                     ...                     ...          ...  \n",
       "92326             BOOKSTORE         PAPERBACK BOOKS         None  \n",
       "92327             BOOKSTORE       CHILDRENS LOW END         None  \n",
       "92328             BOOKSTORE   PAPERBACK BEST SELLER         None  \n",
       "92329             BOOKSTORE         PAPERBACK BOOKS         None  \n",
       "92330          PAPER TOWELS  PAPER TOWELS & HOLDERS         None  \n",
       "\n",
       "[92331 rows x 7 columns]"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from completejourney_py import get_data\n",
    "\n",
    "cj_data = get_data()\n",
    "products = cj_data['products']\n",
    "products"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String basics\n",
    "\n",
    "Basic string manipulation typically includes case conversion, counting characters, and extracting parts of a string. The following will demonstrate some of these basic string methods on a Pandas DataFrame but realize that Pandas provides many more string methods. Refer to the [docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) to learn more about the many string methods.\n",
    "\n",
    "```{admonition} Video 🎥:\n",
    "<iframe id=\"kaltura_player\" src=\"https://cdnapisec.kaltura.com/p/1492301/sp/149230100/embedIframeJs/uiconf_id/49148882/partner_id/1492301?iframeembed=true&playerId=kaltura_player&entry_id=1_n5969nsw&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en_US&amp;flashvars[forceMobileHTML5]=true&amp;flashvars[scrubber.sliderPreview]=false&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=1_birkvk0f\" width=\"640\" height=\"610\" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow=\"autoplay *; fullscreen *; encrypted-media *\" sandbox=\"allow-downloads allow-forms allow-same-origin allow-scripts allow-top-navigation allow-pointer-lock allow-popups allow-modals allow-orientation-lock allow-popups-to-escape-sandbox allow-presentation allow-top-navigation-by-user-activation\" frameborder=\"0\" title=\"BANA 6043: How do I use string methods in pandas?\"></iframe>\n",
    "```\n",
    "\n",
    "```{tip}\n",
    "The string methods provided by Pandas are accessed by appending the object with `.str` followed by another dot and then the specific string method. Think of the term ‘accessor’ as giving the object access to specialized string methods.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Case conversion\n",
    "\n",
    "To change the case of a character string we can use `.str.lower()` and `.str.upper()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                frzn ice\n",
       "1                    None\n",
       "2                   bread\n",
       "3    fruit - shelf stable\n",
       "4           cookies/cones\n",
       "Name: product_category, dtype: object"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products['product_category'].str.lower().head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                FRZN ICE\n",
       "1                    None\n",
       "2                   BREAD\n",
       "3    FRUIT - SHELF STABLE\n",
       "4           COOKIES/CONES\n",
       "Name: product_category, dtype: object"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products['product_category'].str.upper().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Value counts\n",
    "\n",
    "We've already seen this method in a previous lesson but it's worth discussing more. The `value_counts()` method is one of the most valuable methods for string columns. It returns the count of each unique value in the Series and sorts it from most to least common."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GREETING CARDS/WRAP/PARTY SPLY    2785\n",
       "CANDY - PACKAGED                  2475\n",
       "MAKEUP AND TREATMENT              2467\n",
       "HAIR CARE PRODUCTS                1744\n",
       "SOFT DRINKS                       1704\n",
       "                                  ... \n",
       "EASTER LILY                          1\n",
       "BOUQUET (NON ROSE)                   1\n",
       "MISCELLANEOUS CROUTONS               1\n",
       "PKG.SEAFOOD MISC                     1\n",
       "FROZEN PACKAGE MEAT                  1\n",
       "Name: product_category, Length: 303, dtype: int64"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# raw value counts\n",
    "products['product_category'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GREETING CARDS/WRAP/PARTY SPLY    0.030341\n",
       "CANDY - PACKAGED                  0.026963\n",
       "MAKEUP AND TREATMENT              0.026876\n",
       "HAIR CARE PRODUCTS                0.019000\n",
       "SOFT DRINKS                       0.018564\n",
       "                                    ...   \n",
       "EASTER LILY                       0.000011\n",
       "BOUQUET (NON ROSE)                0.000011\n",
       "MISCELLANEOUS CROUTONS            0.000011\n",
       "PKG.SEAFOOD MISC                  0.000011\n",
       "FROZEN PACKAGE MEAT               0.000011\n",
       "Name: product_category, Length: 303, dtype: float64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# normalized value counts\n",
    "products['product_category'].value_counts(normalize=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note how `value_counts()` does not start with `.str`. This is not a string method but is most commonly used for strings. In fact we can use `value_counts()` for any data type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2017-05-28 15:19:55    161\n",
       "2017-06-11 19:07:33    155\n",
       "2017-01-04 00:55:24    153\n",
       "2017-05-09 20:41:28    146\n",
       "2017-12-05 17:58:13    144\n",
       "Name: transaction_timestamp, dtype: int64"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# value counts on date-time data\n",
    "cj_data['transactions']['transaction_timestamp'].value_counts().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Character counts\n",
    "\n",
    "We can get the count of the entire character string with `len()`. This returns the total number of characters in a string to include white space and other non-alphanumeric characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     8.0\n",
       "1     NaN\n",
       "2     5.0\n",
       "3    20.0\n",
       "4    13.0\n",
       "Name: product_category, dtype: float64"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products['product_category'].str.len().head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sometimes we only want to understand how many times a particular word, statement, or even particular letters are used. We can do this with the `count()` method. This method becomes more flexible as we introduce regular expressions, which we will get to shortly.\n",
    "\n",
    "The below illustrates that we can easily count how many times 'meat' is used in the `product_category` column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4875.0"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products['product_category'].str.lower().str.count('meat').sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detecting words\n",
    "\n",
    "The `contains()` method returns a boolean whether or not the passed string is contained somewhere within the string. We can use this to filter for particular words or expressions. For example, the following finds all products where `product_category` contains 'meat'.\n",
    "\n",
    "```{tip}\n",
    "We can use `case=False` to ignore case sensitivity. We also use `na=False` to fill in all missing `product_category` values with `False`. If we didn't do this then the subsetting would throw an error. However, use this wisely as it is basically treating missing `product_category` values as non-meat products!\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>123</th>\n",
       "      <td>30003</td>\n",
       "      <td>397</td>\n",
       "      <td>MEAT-PCKGD</td>\n",
       "      <td>National</td>\n",
       "      <td>FROZEN MEAT</td>\n",
       "      <td>FRZN BREADED PREPARED CHICK</td>\n",
       "      <td>9 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>156</th>\n",
       "      <td>31493</td>\n",
       "      <td>151</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN MEAT/MEAT DINNERS</td>\n",
       "      <td>FRZN SS PREMIUM ENTREES/DNRS/T</td>\n",
       "      <td>10 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>274</th>\n",
       "      <td>34997</td>\n",
       "      <td>1329</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN MEAT/MEAT DINNERS</td>\n",
       "      <td>FRZN SS PREMIUM ENTREES/DNRS/T</td>\n",
       "      <td>12 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>319</th>\n",
       "      <td>36406</td>\n",
       "      <td>1329</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN MEAT/MEAT DINNERS</td>\n",
       "      <td>FRZN SS PREMIUM ENTREES/DNRS/T</td>\n",
       "      <td>12 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>322</th>\n",
       "      <td>36561</td>\n",
       "      <td>165</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN MEAT/MEAT DINNERS</td>\n",
       "      <td>FRZN MULTI SERVE ENTREES ALL</td>\n",
       "      <td>22 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92199</th>\n",
       "      <td>18107338</td>\n",
       "      <td>1094</td>\n",
       "      <td>MEAT-PCKGD</td>\n",
       "      <td>National</td>\n",
       "      <td>LUNCHMEAT</td>\n",
       "      <td>LUNCH COMBO</td>\n",
       "      <td>6.8 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92201</th>\n",
       "      <td>18107424</td>\n",
       "      <td>1094</td>\n",
       "      <td>MEAT-PCKGD</td>\n",
       "      <td>National</td>\n",
       "      <td>LUNCHMEAT</td>\n",
       "      <td>LUNCH COMBO</td>\n",
       "      <td>7.1 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92237</th>\n",
       "      <td>18147446</td>\n",
       "      <td>69</td>\n",
       "      <td>MEAT-PCKGD</td>\n",
       "      <td>Private</td>\n",
       "      <td>LUNCHMEAT</td>\n",
       "      <td>VARIETY PACK</td>\n",
       "      <td>10 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92253</th>\n",
       "      <td>18148186</td>\n",
       "      <td>1251</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN MEAT/MEAT DINNERS</td>\n",
       "      <td>SS ECONOMY ENTREES/DINNERS ALL</td>\n",
       "      <td>14 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92261</th>\n",
       "      <td>18148548</td>\n",
       "      <td>1251</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN MEAT/MEAT DINNERS</td>\n",
       "      <td>SS ECONOMY ENTREES/DINNERS ALL</td>\n",
       "      <td>14 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3607 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       product_id  manufacturer_id  department     brand  \\\n",
       "123         30003              397  MEAT-PCKGD  National   \n",
       "156         31493              151     GROCERY  National   \n",
       "274         34997             1329     GROCERY  National   \n",
       "319         36406             1329     GROCERY  National   \n",
       "322         36561              165     GROCERY  National   \n",
       "...           ...              ...         ...       ...   \n",
       "92199    18107338             1094  MEAT-PCKGD  National   \n",
       "92201    18107424             1094  MEAT-PCKGD  National   \n",
       "92237    18147446               69  MEAT-PCKGD   Private   \n",
       "92253    18148186             1251     GROCERY  National   \n",
       "92261    18148548             1251     GROCERY  National   \n",
       "\n",
       "             product_category                    product_type package_size  \n",
       "123               FROZEN MEAT     FRZN BREADED PREPARED CHICK         9 OZ  \n",
       "156    FRZN MEAT/MEAT DINNERS  FRZN SS PREMIUM ENTREES/DNRS/T        10 OZ  \n",
       "274    FRZN MEAT/MEAT DINNERS  FRZN SS PREMIUM ENTREES/DNRS/T        12 OZ  \n",
       "319    FRZN MEAT/MEAT DINNERS  FRZN SS PREMIUM ENTREES/DNRS/T        12 OZ  \n",
       "322    FRZN MEAT/MEAT DINNERS    FRZN MULTI SERVE ENTREES ALL        22 OZ  \n",
       "...                       ...                             ...          ...  \n",
       "92199               LUNCHMEAT                     LUNCH COMBO       6.8 OZ  \n",
       "92201               LUNCHMEAT                     LUNCH COMBO       7.1 OZ  \n",
       "92237               LUNCHMEAT                    VARIETY PACK        10 OZ  \n",
       "92253  FRZN MEAT/MEAT DINNERS  SS ECONOMY ENTREES/DINNERS ALL        14 OZ  \n",
       "92261  FRZN MEAT/MEAT DINNERS  SS ECONOMY ENTREES/DINNERS ALL        14 OZ  \n",
       "\n",
       "[3607 rows x 7 columns]"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "meat_products = products['product_category'].str.contains('meat', case=False, na=False)\n",
    "products[meat_products]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting parts of strings\n",
    "\n",
    "We can use traditional index-styling to extract or slice parts of a character string. For example, the following gets the first and last 5 characters in each string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    FRZN \n",
       "1     None\n",
       "2    BREAD\n",
       "3    FRUIT\n",
       "4    COOKI\n",
       "Name: product_category, dtype: object"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products['product_category'].str[:5].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    N ICE\n",
       "1     None\n",
       "2    BREAD\n",
       "3    TABLE\n",
       "4    CONES\n",
       "Name: product_category, dtype: object"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "products['product_category'].str[-5:].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Replacing parts of strings\n",
    "\n",
    "Sometimes we may want to clean up character strings by replacing certain words or phrases. For example, this data contains the word ‘frzn’ in place of 'frozen'. If we wanted to fix this we could use the `replace()` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25671</td>\n",
       "      <td>2</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>FRZN ICE</td>\n",
       "      <td>ICE - CRUSHED/CUBED</td>\n",
       "      <td>22 LB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>27754</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRZN VEGETABLE/VEG DSH</td>\n",
       "      <td>FRZN BAGGED VEGETABLES - PLAIN</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>27960</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRZN FRUITS</td>\n",
       "      <td>FROZEN FRUIT</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>29132</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRZN VEGETABLE/VEG DSH</td>\n",
       "      <td>FRZN BAGGED VEGETABLES - PLAIN</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>30179</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRZN VEGETABLE/VEG DSH</td>\n",
       "      <td>FRZN BOXED VEGETABLES - PLAIN</td>\n",
       "      <td>10 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     product_id  manufacturer_id department     brand        product_category  \\\n",
       "0         25671                2    GROCERY  National                FRZN ICE   \n",
       "39        27754               69    GROCERY   Private  FRZN VEGETABLE/VEG DSH   \n",
       "54        27960               69    GROCERY   Private             FRZN FRUITS   \n",
       "99        29132               69    GROCERY   Private  FRZN VEGETABLE/VEG DSH   \n",
       "127       30179               69    GROCERY   Private  FRZN VEGETABLE/VEG DSH   \n",
       "\n",
       "                       product_type package_size  \n",
       "0               ICE - CRUSHED/CUBED        22 LB  \n",
       "39   FRZN BAGGED VEGETABLES - PLAIN        16 OZ  \n",
       "54                     FROZEN FRUIT        16 OZ  \n",
       "99   FRZN BAGGED VEGETABLES - PLAIN        16 OZ  \n",
       "127   FRZN BOXED VEGETABLES - PLAIN        10 OZ  "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# product_category observations containing 'frzn'\n",
    "frzn_products = products['product_category'].str.contains('frzn', case=False, na=False)\n",
    "products[frzn_products].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25671</td>\n",
       "      <td>2</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>frozen ICE</td>\n",
       "      <td>ICE - CRUSHED/CUBED</td>\n",
       "      <td>22 LB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>27754</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>frozen VEGETABLE/VEG DSH</td>\n",
       "      <td>FRZN BAGGED VEGETABLES - PLAIN</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>27960</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>frozen FRUITS</td>\n",
       "      <td>FROZEN FRUIT</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>99</th>\n",
       "      <td>29132</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>frozen VEGETABLE/VEG DSH</td>\n",
       "      <td>FRZN BAGGED VEGETABLES - PLAIN</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>127</th>\n",
       "      <td>30179</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>frozen VEGETABLE/VEG DSH</td>\n",
       "      <td>FRZN BOXED VEGETABLES - PLAIN</td>\n",
       "      <td>10 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     product_id  manufacturer_id department     brand  \\\n",
       "0         25671                2    GROCERY  National   \n",
       "39        27754               69    GROCERY   Private   \n",
       "54        27960               69    GROCERY   Private   \n",
       "99        29132               69    GROCERY   Private   \n",
       "127       30179               69    GROCERY   Private   \n",
       "\n",
       "             product_category                    product_type package_size  \n",
       "0                  frozen ICE             ICE - CRUSHED/CUBED        22 LB  \n",
       "39   frozen VEGETABLE/VEG DSH  FRZN BAGGED VEGETABLES - PLAIN        16 OZ  \n",
       "54              frozen FRUITS                    FROZEN FRUIT        16 OZ  \n",
       "99   frozen VEGETABLE/VEG DSH  FRZN BAGGED VEGETABLES - PLAIN        16 OZ  \n",
       "127  frozen VEGETABLE/VEG DSH   FRZN BOXED VEGETABLES - PLAIN        10 OZ  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# replacing 'frzn' with 'frozen' in `product_category` column\n",
    "products['product_category'] = products['product_category'].str.replace('frzn', 'frozen', case=False)\n",
    "products[frzn_products].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Knowledge check\n",
    "\n",
    "```{admonition} Questions\n",
    ":class: attention\n",
    "1. Using the `product_type` column, which product has the longest description? How about the shortest?\n",
    "2. Which `package_size` values are most common in our data? Which is the least common?\n",
    "2. Replace all instances of 'DSH' with 'dish' in the `product_category` column. How many products does this impact?\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Video 🎥:\n",
    "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/xuncB4e2QkI?si=Jf2cOKX5cUkjsb0Q\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular expressions\n",
    "\n",
    "A regular expression (aka regex) is a sequence of characters that define a search pattern, mainly for use in pattern matching with text strings. Typically, regex patterns consist of a combination of alphanumeric characters as well as special characters. The pattern can also be as simple as a single character or it can be more complex and include several combinations of characters.\n",
    "\n",
    "To understand how to work with regular expressions in Python, we need to consider two primary features of regular expressions. One has to do with the syntax, or the way regex patterns are expressed in Python. The other has to do with the functions used for regex matching in Python. You will be exposed to both of these in the following sections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regex basics\n",
    "\n",
    "Python’s standard library, and Pandas adoption of these string methods provides us a convenient approach to regular expressions. Many of the methods used in the previous sections can be used with regex. The common pattern for our method calls consist of:\n",
    "\n",
    "<center>\n",
    "<code>string_object.str.*(pattern)</code>\n",
    "</center>\n",
    "\n",
    "where:\n",
    "\n",
    "* `string_object` represents the character string input (i.e. products['product_category']),\n",
    "* `.str.*` represents a wide variety of regex methods depending on what you want to do,\n",
    "* `pattern` represents the regex pattern you are looking to match.\n",
    "\n",
    "For example let’s say you are looking for observations where the word “FRUIT” was used in the `product_category` description. As we saw earlier `.str.contains()` detects the presence or absence of a pattern (“FRUIT” in this example) and returns a boolean response. Since the output is TRUE or FALSE, this is a handy function to combine with indexing to filter for observations that have that pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>26190</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>APPLE SAUCE</td>\n",
       "      <td>50 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>27503</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>CRANBERRY SAUCE</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>27960</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>frozen FRUITS</td>\n",
       "      <td>FROZEN FRUIT</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>204</th>\n",
       "      <td>33048</td>\n",
       "      <td>5820</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>ORGANICS FRUIT &amp; VEGETABLES</td>\n",
       "      <td>ORGANIC CITRUS</td>\n",
       "      <td>2 LB BAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>344</th>\n",
       "      <td>37543</td>\n",
       "      <td>876</td>\n",
       "      <td>NUTRITION</td>\n",
       "      <td>National</td>\n",
       "      <td>DRIED FRUIT</td>\n",
       "      <td>RAISINS</td>\n",
       "      <td>24 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92140</th>\n",
       "      <td>18056453</td>\n",
       "      <td>1378</td>\n",
       "      <td>NUTRITION</td>\n",
       "      <td>National</td>\n",
       "      <td>DRIED FRUIT</td>\n",
       "      <td>DRIED FRUIT - OTHER</td>\n",
       "      <td>.71 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92287</th>\n",
       "      <td>18185766</td>\n",
       "      <td>5819</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>ORGANICS FRUIT &amp; VEGETABLES</td>\n",
       "      <td>ORGANIC HERBS</td>\n",
       "      <td>.75 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92289</th>\n",
       "      <td>18186504</td>\n",
       "      <td>5819</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>ORGANICS FRUIT &amp; VEGETABLES</td>\n",
       "      <td>ORGANIC HERBS</td>\n",
       "      <td>.75 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92290</th>\n",
       "      <td>18186637</td>\n",
       "      <td>5819</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>ORGANICS FRUIT &amp; VEGETABLES</td>\n",
       "      <td>ORGANIC HERBS</td>\n",
       "      <td>.75 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92295</th>\n",
       "      <td>18193059</td>\n",
       "      <td>5819</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>ORGANICS FRUIT &amp; VEGETABLES</td>\n",
       "      <td>ORGANIC HERBS</td>\n",
       "      <td>.75 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1153 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       product_id  manufacturer_id department     brand  \\\n",
       "3           26190               69    GROCERY   Private   \n",
       "25          27503               69    GROCERY   Private   \n",
       "54          27960               69    GROCERY   Private   \n",
       "204         33048             5820    PRODUCE  National   \n",
       "344         37543              876  NUTRITION  National   \n",
       "...           ...              ...        ...       ...   \n",
       "92140    18056453             1378  NUTRITION  National   \n",
       "92287    18185766             5819    PRODUCE  National   \n",
       "92289    18186504             5819    PRODUCE  National   \n",
       "92290    18186637             5819    PRODUCE  National   \n",
       "92295    18193059             5819    PRODUCE  National   \n",
       "\n",
       "                  product_category         product_type package_size  \n",
       "3             FRUIT - SHELF STABLE          APPLE SAUCE        50 OZ  \n",
       "25            FRUIT - SHELF STABLE      CRANBERRY SAUCE        16 OZ  \n",
       "54                   frozen FRUITS         FROZEN FRUIT        16 OZ  \n",
       "204    ORGANICS FRUIT & VEGETABLES       ORGANIC CITRUS     2 LB BAG  \n",
       "344                    DRIED FRUIT              RAISINS        24 OZ  \n",
       "...                            ...                  ...          ...  \n",
       "92140                  DRIED FRUIT  DRIED FRUIT - OTHER       .71 OZ  \n",
       "92287  ORGANICS FRUIT & VEGETABLES        ORGANIC HERBS       .75 OZ  \n",
       "92289  ORGANICS FRUIT & VEGETABLES        ORGANIC HERBS       .75 OZ  \n",
       "92290  ORGANICS FRUIT & VEGETABLES        ORGANIC HERBS       .75 OZ  \n",
       "92295  ORGANICS FRUIT & VEGETABLES        ORGANIC HERBS       .75 OZ  \n",
       "\n",
       "[1153 rows x 7 columns]"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# detect if the word \"fruit\" is used in each comment\n",
    "fruit_products = products['product_category'].str.contains('fruit', case=False, na=False)\n",
    "products[fruit_products]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are wide variety of `.str.` methods. See the previous sections for some of the more common ones or peruse the Pandas [Working with Text Data documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) for examples of more."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple words\n",
    "\n",
    "In the previous section you saw that you can search for a given word using some of the `.str.` methods. We can build onto this and search for multiple words. For example, you can search for the phrase “summer” or \"fall\". One may initially think about performing this search with the following in which case we use traditional word matching and then subsetting. This is known as a ‘literal’ search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>335</th>\n",
       "      <td>37073</td>\n",
       "      <td>69</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>GRILL ACCESSORIES</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1747</th>\n",
       "      <td>91614</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>3 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1943</th>\n",
       "      <td>99090</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>2 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1947</th>\n",
       "      <td>99172</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>2 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2114</th>\n",
       "      <td>106668</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      product_id  manufacturer_id department     brand  \\\n",
       "335        37073               69    DRUG GM   Private   \n",
       "1747       91614             1277    DRUG GM  National   \n",
       "1943       99090             1277    DRUG GM  National   \n",
       "1947       99172             1277    DRUG GM  National   \n",
       "2114      106668             1277    DRUG GM  National   \n",
       "\n",
       "            product_category         product_type package_size  \n",
       "335   SPRING/SUMMER SEASONAL    GRILL ACCESSORIES         None  \n",
       "1747  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         3 CT  \n",
       "1943  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         2 CT  \n",
       "1947  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         2 CT  \n",
       "2114  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         None  "
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summer_products = products['product_category'].str.contains('summer', case=False, na=False)\n",
    "fall_products = products['product_category'].str.contains('fall', case=False, na=False)\n",
    "\n",
    "products[summer_products | fall_products].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This works, but as our searches get more complex this approach becomes quite tedious. Instead, we can start using regex search patterns to simplify this process.  The below approach will search for observations that include the word “summer”, “fall”, or a combination of the two. This is equivalent to the above statement but more succinct. This is truly a regex, as instead of searching for a literal statement, we are looking for a pattern of letter/word combinations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>335</th>\n",
       "      <td>37073</td>\n",
       "      <td>69</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>GRILL ACCESSORIES</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1747</th>\n",
       "      <td>91614</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>3 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1943</th>\n",
       "      <td>99090</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>2 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1947</th>\n",
       "      <td>99172</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>2 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2114</th>\n",
       "      <td>106668</td>\n",
       "      <td>1277</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>SPRING/SUMMER SEASONAL</td>\n",
       "      <td>DISPOSABLE FOILWARE</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      product_id  manufacturer_id department     brand  \\\n",
       "335        37073               69    DRUG GM   Private   \n",
       "1747       91614             1277    DRUG GM  National   \n",
       "1943       99090             1277    DRUG GM  National   \n",
       "1947       99172             1277    DRUG GM  National   \n",
       "2114      106668             1277    DRUG GM  National   \n",
       "\n",
       "            product_category         product_type package_size  \n",
       "335   SPRING/SUMMER SEASONAL    GRILL ACCESSORIES         None  \n",
       "1747  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         3 CT  \n",
       "1943  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         2 CT  \n",
       "1947  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         2 CT  \n",
       "2114  SPRING/SUMMER SEASONAL  DISPOSABLE FOILWARE         None  "
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summer_or_fall = products['product_category'].str.contains('summer|fall', case=False, na=False)\n",
    "products[summer_or_fall].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we can see that our regex of ‘summer|fall’ returns the same number of observations as our literal search."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Literal approach: 651 rows\n",
      "Regex approach: 651 rows\n"
     ]
    }
   ],
   "source": [
    "literal_count = products[summer_products | fall_products].shape[0]\n",
    "regex_count = products[summer_or_fall].shape[0]\n",
    "\n",
    "print(f'Literal approach: {literal_count} rows')\n",
    "print(f'Regex approach: {regex_count} rows')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Line anchors\n",
    "\n",
    "Line anchors are used to identify patterns at the beginning or end of an element.  To find a pattern at the beginning of the element we use `^` and a pattern at the end of the element is found with `$`.  For example, if you wanted to find any observations with the word \"fruit\" in the `product_category` column we can use the following as we saw earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>26190</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>APPLE SAUCE</td>\n",
       "      <td>50 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>27503</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>CRANBERRY SAUCE</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>54</th>\n",
       "      <td>27960</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>frozen FRUITS</td>\n",
       "      <td>FROZEN FRUIT</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>204</th>\n",
       "      <td>33048</td>\n",
       "      <td>5820</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>ORGANICS FRUIT &amp; VEGETABLES</td>\n",
       "      <td>ORGANIC CITRUS</td>\n",
       "      <td>2 LB BAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>344</th>\n",
       "      <td>37543</td>\n",
       "      <td>876</td>\n",
       "      <td>NUTRITION</td>\n",
       "      <td>National</td>\n",
       "      <td>DRIED FRUIT</td>\n",
       "      <td>RAISINS</td>\n",
       "      <td>24 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     product_id  manufacturer_id department     brand  \\\n",
       "3         26190               69    GROCERY   Private   \n",
       "25        27503               69    GROCERY   Private   \n",
       "54        27960               69    GROCERY   Private   \n",
       "204       33048             5820    PRODUCE  National   \n",
       "344       37543              876  NUTRITION  National   \n",
       "\n",
       "                product_category     product_type package_size  \n",
       "3           FRUIT - SHELF STABLE      APPLE SAUCE        50 OZ  \n",
       "25          FRUIT - SHELF STABLE  CRANBERRY SAUCE        16 OZ  \n",
       "54                 frozen FRUITS     FROZEN FRUIT        16 OZ  \n",
       "204  ORGANICS FRUIT & VEGETABLES   ORGANIC CITRUS     2 LB BAG  \n",
       "344                  DRIED FRUIT          RAISINS        24 OZ  "
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# detect if the word \"fruit\" is used in each comment\n",
    "fruit_products = products['product_category'].str.contains('fruit', case=False, na=False)\n",
    "products[fruit_products].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, if we only wanted those products where the category starts with “fruit” than we can use the `^` anchor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>26190</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>APPLE SAUCE</td>\n",
       "      <td>50 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>27503</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>CRANBERRY SAUCE</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>590</th>\n",
       "      <td>44599</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>FRUIT COCKTAIL FRUIT SALAD</td>\n",
       "      <td>15.25 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>610</th>\n",
       "      <td>45218</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>PEARS</td>\n",
       "      <td>15.25 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>633</th>\n",
       "      <td>45607</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>PINEAPPLE</td>\n",
       "      <td>8 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     product_id  manufacturer_id department    brand      product_category  \\\n",
       "3         26190               69    GROCERY  Private  FRUIT - SHELF STABLE   \n",
       "25        27503               69    GROCERY  Private  FRUIT - SHELF STABLE   \n",
       "590       44599               69    GROCERY  Private  FRUIT - SHELF STABLE   \n",
       "610       45218               69    GROCERY  Private  FRUIT - SHELF STABLE   \n",
       "633       45607               69    GROCERY  Private  FRUIT - SHELF STABLE   \n",
       "\n",
       "                   product_type package_size  \n",
       "3                   APPLE SAUCE        50 OZ  \n",
       "25              CRANBERRY SAUCE        16 OZ  \n",
       "590  FRUIT COCKTAIL FRUIT SALAD     15.25 OZ  \n",
       "610                       PEARS     15.25 OZ  \n",
       "633                   PINEAPPLE         8 OZ  "
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "starts_with_fruit = products['product_category'].str.contains('^fruit', case=False, na=False)\n",
    "products[starts_with_fruit].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, if we only wanted those products where the category *ends with* \"fruit\" than we can use the `$` anchor:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>344</th>\n",
       "      <td>37543</td>\n",
       "      <td>876</td>\n",
       "      <td>NUTRITION</td>\n",
       "      <td>National</td>\n",
       "      <td>DRIED FRUIT</td>\n",
       "      <td>RAISINS</td>\n",
       "      <td>24 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>772</th>\n",
       "      <td>49930</td>\n",
       "      <td>2</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>STONE FRUIT</td>\n",
       "      <td>PEACHES YELLOW FLESH</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>896</th>\n",
       "      <td>54460</td>\n",
       "      <td>69</td>\n",
       "      <td>NUTRITION</td>\n",
       "      <td>Private</td>\n",
       "      <td>DRIED FRUIT</td>\n",
       "      <td>RAISINS</td>\n",
       "      <td>24 OZ BAG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1113</th>\n",
       "      <td>64583</td>\n",
       "      <td>2</td>\n",
       "      <td>PRODUCE</td>\n",
       "      <td>National</td>\n",
       "      <td>TROPICAL FRUIT</td>\n",
       "      <td>TROPICAL FRUIT - OTHER</td>\n",
       "      <td>CTN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1267</th>\n",
       "      <td>71432</td>\n",
       "      <td>876</td>\n",
       "      <td>NUTRITION</td>\n",
       "      <td>National</td>\n",
       "      <td>DRIED FRUIT</td>\n",
       "      <td>DRIED FRUIT - OTHER</td>\n",
       "      <td>7 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      product_id  manufacturer_id department     brand product_category  \\\n",
       "344        37543              876  NUTRITION  National      DRIED FRUIT   \n",
       "772        49930                2    PRODUCE  National      STONE FRUIT   \n",
       "896        54460               69  NUTRITION   Private      DRIED FRUIT   \n",
       "1113       64583                2    PRODUCE  National   TROPICAL FRUIT   \n",
       "1267       71432              876  NUTRITION  National      DRIED FRUIT   \n",
       "\n",
       "                product_type package_size  \n",
       "344                  RAISINS        24 OZ  \n",
       "772     PEACHES YELLOW FLESH         None  \n",
       "896                  RAISINS    24 OZ BAG  \n",
       "1113  TROPICAL FRUIT - OTHER          CTN  \n",
       "1267     DRIED FRUIT - OTHER         7 OZ  "
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ends_with_fruit = products['product_category'].str.contains('fruit$', case=False, na=False)\n",
    "products[ends_with_fruit].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we can combine the two if we only wanted those products where the category *starts or ends with* \"fruit\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>26190</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>APPLE SAUCE</td>\n",
       "      <td>50 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>27503</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>CRANBERRY SAUCE</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>344</th>\n",
       "      <td>37543</td>\n",
       "      <td>876</td>\n",
       "      <td>NUTRITION</td>\n",
       "      <td>National</td>\n",
       "      <td>DRIED FRUIT</td>\n",
       "      <td>RAISINS</td>\n",
       "      <td>24 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>590</th>\n",
       "      <td>44599</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>FRUIT COCKTAIL FRUIT SALAD</td>\n",
       "      <td>15.25 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>610</th>\n",
       "      <td>45218</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>PEARS</td>\n",
       "      <td>15.25 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     product_id  manufacturer_id department     brand      product_category  \\\n",
       "3         26190               69    GROCERY   Private  FRUIT - SHELF STABLE   \n",
       "25        27503               69    GROCERY   Private  FRUIT - SHELF STABLE   \n",
       "344       37543              876  NUTRITION  National           DRIED FRUIT   \n",
       "590       44599               69    GROCERY   Private  FRUIT - SHELF STABLE   \n",
       "610       45218               69    GROCERY   Private  FRUIT - SHELF STABLE   \n",
       "\n",
       "                   product_type package_size  \n",
       "3                   APPLE SAUCE        50 OZ  \n",
       "25              CRANBERRY SAUCE        16 OZ  \n",
       "344                     RAISINS        24 OZ  \n",
       "590  FRUIT COCKTAIL FRUIT SALAD     15.25 OZ  \n",
       "610                       PEARS     15.25 OZ  "
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "starts_or_ends_with_fruit = (\n",
    "    products['product_category']\n",
    "    .str.contains('^fruit|fruit$', case=False, na=False)\n",
    ")\n",
    "\n",
    "products[starts_or_ends_with_fruit].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Metacharacters\n",
    "\n",
    "Metacharacters consist of non-alphanumeric symbols such as: \n",
    "\n",
    "<p>\n",
    "<center>\n",
    ". &nbsp;&nbsp; &#92; &nbsp;&nbsp; | &nbsp;&nbsp; ( &nbsp;&nbsp; ) &nbsp;&nbsp; [ &nbsp;&nbsp; { &nbsp;&nbsp; $ &nbsp;&nbsp; &#x2A; &nbsp;&nbsp; + &nbsp;&nbsp;?\n",
    "</center>\n",
    "</p>\n",
    "\n",
    "To match metacharacters in regex you need to *escape*. In Python, we escape them with a single or double backslash \"`\\`\".  The following displays the general escape syntax for the most common metacharacters:\n",
    "\n",
    "```{list-table} Escaping metacharacters.\n",
    ":header-rows: 1\n",
    ":name: escaping-characters\n",
    "\n",
    "* - Metacharacter\n",
    "  - Literal Meaning\n",
    "  - Escape Syntax\n",
    "* - .\n",
    "  - period or dot\n",
    "  - `\\\\.`\n",
    "* - $\n",
    "  - dollar sign\n",
    "  - `\\$`\n",
    "* - &#x2A;\n",
    "  - asterisk\n",
    "  - `\\*`\n",
    "* - &plus;\n",
    "  - plus sign\n",
    "  - `\\+`\n",
    "* - ?\n",
    "  - question mark\n",
    "  - `\\?`\n",
    "* - \\|\n",
    "  - vertical bar\n",
    "  - `\\|`\n",
    "* - ^\n",
    "  - caret\n",
    "  - `\\^`\n",
    "* - [\n",
    "  - square bracket\n",
    "  - `\\[`\n",
    "* - {\n",
    "  - curly brace\n",
    "  - `\\{`\n",
    "* - (\n",
    "  - parenthesis\n",
    "  - `\\(`\n",
    "```\n",
    "The reason we need to escape these characters is because most of these actually have meaning when declaring regular expressions. For example, say we wanted to identify any `product_category` that contains a period (\".\"). If we simply use the following we actually get ***all*** our records back. \n",
    "\n",
    "```{warning}\n",
    "Actually, this returns ***almost*** all our records. Since we use `na=False`, any `product_category` that has a missing value will be dropped.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25671</td>\n",
       "      <td>2</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>frozen ICE</td>\n",
       "      <td>ICE - CRUSHED/CUBED</td>\n",
       "      <td>22 LB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>26093</td>\n",
       "      <td>69</td>\n",
       "      <td>PASTRY</td>\n",
       "      <td>Private</td>\n",
       "      <td>BREAD</td>\n",
       "      <td>BREAD:ITALIAN/FRENCH</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>26190</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>APPLE SAUCE</td>\n",
       "      <td>50 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>26355</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>COOKIES/CONES</td>\n",
       "      <td>SPECIALTY COOKIES</td>\n",
       "      <td>14 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>26426</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPICES &amp; EXTRACTS</td>\n",
       "      <td>SPICES &amp; SEASONINGS</td>\n",
       "      <td>2.5 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   product_id  manufacturer_id department     brand      product_category  \\\n",
       "0       25671                2    GROCERY  National            frozen ICE   \n",
       "2       26093               69     PASTRY   Private                 BREAD   \n",
       "3       26190               69    GROCERY   Private  FRUIT - SHELF STABLE   \n",
       "4       26355               69    GROCERY   Private         COOKIES/CONES   \n",
       "5       26426               69    GROCERY   Private     SPICES & EXTRACTS   \n",
       "\n",
       "           product_type package_size  \n",
       "0   ICE - CRUSHED/CUBED        22 LB  \n",
       "2  BREAD:ITALIAN/FRENCH         None  \n",
       "3           APPLE SAUCE        50 OZ  \n",
       "4     SPECIALTY COOKIES        14 OZ  \n",
       "5   SPICES & SEASONINGS       2.5 OZ  "
      ]
     },
     "execution_count": 63,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "wildcard = products['product_category'].str.contains('.', case=False, na=False)\n",
    "products[wildcard].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we need to use an **escape** (\"\\\") to tell the regular expression you want to match a literal metacharacter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>527</th>\n",
       "      <td>43361</td>\n",
       "      <td>1916</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>MISC. DAIRY</td>\n",
       "      <td>MEXICAN SOFT TORTILLAS AND WRA</td>\n",
       "      <td>33 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>802</th>\n",
       "      <td>50776</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>MISC. DAIRY</td>\n",
       "      <td>REFRIGERATED PUDDING</td>\n",
       "      <td>24 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>866</th>\n",
       "      <td>53070</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>MISC. DAIRY</td>\n",
       "      <td>REFRIGERATED PUDDING</td>\n",
       "      <td>24 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1019</th>\n",
       "      <td>60924</td>\n",
       "      <td>435</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>MISC. DAIRY</td>\n",
       "      <td>REFRIGERATED PASTA</td>\n",
       "      <td>9 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1114</th>\n",
       "      <td>64623</td>\n",
       "      <td>289</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>MISC. DAIRY</td>\n",
       "      <td>MISC DAIRY REFIGERATED</td>\n",
       "      <td>20 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      product_id  manufacturer_id department     brand product_category  \\\n",
       "527        43361             1916    GROCERY  National      MISC. DAIRY   \n",
       "802        50776               69    GROCERY   Private      MISC. DAIRY   \n",
       "866        53070               69    GROCERY   Private      MISC. DAIRY   \n",
       "1019       60924              435    GROCERY  National      MISC. DAIRY   \n",
       "1114       64623              289    GROCERY  National      MISC. DAIRY   \n",
       "\n",
       "                        product_type package_size  \n",
       "527   MEXICAN SOFT TORTILLAS AND WRA        33 OZ  \n",
       "802             REFRIGERATED PUDDING        24 OZ  \n",
       "866             REFRIGERATED PUDDING        24 OZ  \n",
       "1019              REFRIGERATED PASTA         9 OZ  \n",
       "1114          MISC DAIRY REFIGERATED        20 OZ  "
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "includes_dot = products['product_category'].str.contains('\\\\.', case=False, na=False)\n",
    "products[includes_dot].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Character classes\n",
    "\n",
    "To match one of several characters in a specified set we can enclose the characters of concern with square brackets `[ ]`. In addition, to matching any characters __not__ in a specified character set we can include the caret `^` at the beginning of the set within the brackets. The following displays the general syntax for common character classes but these can be altered easily as shown in the examples that follow:\n",
    "\n",
    "```{list-table} Common character classes.\n",
    ":header-rows: 1\n",
    ":name: common-characters-classes\n",
    "\n",
    "* - Character class\n",
    "  - Description\n",
    "* - `[aeiou]`\n",
    "  - match any specified lower case vowel \n",
    "* - `[AEIOU]`\n",
    "  - match any specified upper case vowel \n",
    "* - `[0123456789]`\n",
    "  - match any specified numeric values \n",
    "* - `[0-9]`\n",
    "  - match any range specified numeric values \n",
    "* - `[a-z]`\n",
    "  - match any range of lowercase letters \n",
    "* - `[A-Z]`\n",
    "  - match any range of uppercase letters \n",
    "* - `[a-zA-Z0-9]`\n",
    "  - match any of the above \n",
    "* - `[^aeiou]`\n",
    "  - match anything other than a lowercase vowel \n",
    "* - `[^0-9]`\n",
    "  - match anything other than the specified numeric values\n",
    "```\n",
    "\n",
    "For example, say we wanted to find any products where the `package_size` is not a round numeric size in ounces. The following identifies any rows where `package_size` contains a dot (remember, we need to escape that character with `\\\\.`) followed by \"oz\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>26426</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPICES &amp; EXTRACTS</td>\n",
       "      <td>SPICES &amp; SEASONINGS</td>\n",
       "      <td>2.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>27695</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>5.8 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>27767</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>6.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>27812</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>COFFEE</td>\n",
       "      <td>GROUND COFFEE</td>\n",
       "      <td>34.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50</th>\n",
       "      <td>27925</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>6.5 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    product_id  manufacturer_id department    brand   product_category  \\\n",
       "5        26426               69    GROCERY  Private  SPICES & EXTRACTS   \n",
       "34       27695               69    GROCERY  Private     DINNER MXS:DRY   \n",
       "43       27767               69    GROCERY  Private     DINNER MXS:DRY   \n",
       "44       27812               69    GROCERY  Private             COFFEE   \n",
       "50       27925               69    GROCERY  Private     DINNER MXS:DRY   \n",
       "\n",
       "           product_type package_size  \n",
       "5   SPICES & SEASONINGS       2.5 OZ  \n",
       "34      SKILLET DINNERS       5.8 OZ  \n",
       "43      SKILLET DINNERS       6.5 OZ  \n",
       "44        GROUND COFFEE      34.5 OZ  \n",
       "50      SKILLET DINNERS       6.5 OZ  "
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "decimal_ounces = products['package_size'].str.contains('\\\\.[0-9] oz', case=False, na=False)\n",
    "products[decimal_ounces].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, say we wanted to do the same but we are interested in any packages that are in ounces (“OZ”) or pounds (“LB”). Your first reaction is probably to do something like:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25671</td>\n",
       "      <td>2</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>frozen ICE</td>\n",
       "      <td>ICE - CRUSHED/CUBED</td>\n",
       "      <td>22 LB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>26426</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPICES &amp; EXTRACTS</td>\n",
       "      <td>SPICES &amp; SEASONINGS</td>\n",
       "      <td>2.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>27695</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>5.8 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>27767</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>6.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>27812</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>COFFEE</td>\n",
       "      <td>GROUND COFFEE</td>\n",
       "      <td>34.5 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    product_id  manufacturer_id department     brand   product_category  \\\n",
       "0        25671                2    GROCERY  National         frozen ICE   \n",
       "5        26426               69    GROCERY   Private  SPICES & EXTRACTS   \n",
       "34       27695               69    GROCERY   Private     DINNER MXS:DRY   \n",
       "43       27767               69    GROCERY   Private     DINNER MXS:DRY   \n",
       "44       27812               69    GROCERY   Private             COFFEE   \n",
       "\n",
       "           product_type package_size  \n",
       "0   ICE - CRUSHED/CUBED        22 LB  \n",
       "5   SPICES & SEASONINGS       2.5 OZ  \n",
       "34      SKILLET DINNERS       5.8 OZ  \n",
       "43      SKILLET DINNERS       6.5 OZ  \n",
       "44        GROUND COFFEE      34.5 OZ  "
      ]
     },
     "execution_count": 73,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "decimal_ounces_or_lbs = (\n",
    "    products['package_size']\n",
    "    .str.contains('\\\\.[0-9] oz|lb', case=False, na=False)\n",
    ")\n",
    "\n",
    "products[decimal_ounces_or_lbs].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wait! The first observation is in pounds but its a round number and not a decimal.  This is because our regex (`\\\\.[0-9] oz|lb`) is actually looking for any package size where its a decimal of ounces (`\\\\.[0-9] oz`) ***or*** in pounds (`lb`). \n",
    "\n",
    "We need to modify our regex just a tad. If we change it to `\\\\.[0-9] (oz|lb)` (note that `oz|lb` is now in parenthesis), we are now specifying to search for `\\\\.[0-9]` followed by \"oz\" or \"lb\".\n",
    "\n",
    "```{tip}\n",
    "Using parentheses such as `(oz|lb)` is known as grouping. If we simply use `(oz|lb)` then it will be treated as a [capture grouping](https://www.regular-expressions.info/brackets.html) and Pandas will throw a warning letting you know that you could extract the matching groups if you prefer. If this warning bothers you then you can add `?:` at the begining (`(?:oz|lb)`)to tell Pandas that you just care about matching this pattern and not capturing each group to extract.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/2514522656.py:2: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n",
      "  products['package_size']\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>26426</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPICES &amp; EXTRACTS</td>\n",
       "      <td>SPICES &amp; SEASONINGS</td>\n",
       "      <td>2.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>27695</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>5.8 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>27767</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>6.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>27812</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>COFFEE</td>\n",
       "      <td>GROUND COFFEE</td>\n",
       "      <td>34.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50</th>\n",
       "      <td>27925</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>6.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92239</th>\n",
       "      <td>18147507</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPICES &amp; EXTRACTS</td>\n",
       "      <td>SPICES &amp; SEASONINGS</td>\n",
       "      <td>3.1 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92266</th>\n",
       "      <td>18148654</td>\n",
       "      <td>194</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>frozen BREAKFAST FOODS</td>\n",
       "      <td>WAFFLES/PANCAKES/FRENCH TOAST</td>\n",
       "      <td>11.1 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92269</th>\n",
       "      <td>18148726</td>\n",
       "      <td>2082</td>\n",
       "      <td>MEAT-PCKGD</td>\n",
       "      <td>National</td>\n",
       "      <td>HEAT/SERVE</td>\n",
       "      <td>ENTREES</td>\n",
       "      <td>14.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92320</th>\n",
       "      <td>18273018</td>\n",
       "      <td>2223</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>BAG SNACKS</td>\n",
       "      <td>BAGGED CHEESE SNACKS</td>\n",
       "      <td>11.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92321</th>\n",
       "      <td>18273019</td>\n",
       "      <td>2223</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>BAG SNACKS</td>\n",
       "      <td>BAGGED CHEESE SNACKS</td>\n",
       "      <td>11.5 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10712 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       product_id  manufacturer_id  department     brand  \\\n",
       "5           26426               69     GROCERY   Private   \n",
       "34          27695               69     GROCERY   Private   \n",
       "43          27767               69     GROCERY   Private   \n",
       "44          27812               69     GROCERY   Private   \n",
       "50          27925               69     GROCERY   Private   \n",
       "...           ...              ...         ...       ...   \n",
       "92239    18147507               69     GROCERY   Private   \n",
       "92266    18148654              194     GROCERY  National   \n",
       "92269    18148726             2082  MEAT-PCKGD  National   \n",
       "92320    18273018             2223     GROCERY  National   \n",
       "92321    18273019             2223     GROCERY  National   \n",
       "\n",
       "             product_category                   product_type package_size  \n",
       "5           SPICES & EXTRACTS            SPICES & SEASONINGS       2.5 OZ  \n",
       "34             DINNER MXS:DRY                SKILLET DINNERS       5.8 OZ  \n",
       "43             DINNER MXS:DRY                SKILLET DINNERS       6.5 OZ  \n",
       "44                     COFFEE                  GROUND COFFEE      34.5 OZ  \n",
       "50             DINNER MXS:DRY                SKILLET DINNERS       6.5 OZ  \n",
       "...                       ...                            ...          ...  \n",
       "92239       SPICES & EXTRACTS            SPICES & SEASONINGS       3.1 OZ  \n",
       "92266  frozen BREAKFAST FOODS  WAFFLES/PANCAKES/FRENCH TOAST      11.1 OZ  \n",
       "92269              HEAT/SERVE                        ENTREES      14.5 OZ  \n",
       "92320              BAG SNACKS           BAGGED CHEESE SNACKS      11.5 OZ  \n",
       "92321              BAG SNACKS           BAGGED CHEESE SNACKS      11.5 OZ  \n",
       "\n",
       "[10712 rows x 7 columns]"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "decimal_ounces_or_lbs = (\n",
    "    products['package_size']\n",
    "    .str.contains('\\\\.[0-9] (oz|lb)', case=False, na=False)\n",
    ")\n",
    "\n",
    "products[decimal_ounces_or_lbs].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, say we wanted to find any package size that contains a decimal between 0-.4:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>60</th>\n",
       "      <td>28116</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DRY MIX DESSERTS</td>\n",
       "      <td>GELATIN</td>\n",
       "      <td>.3 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>28117</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DRY MIX DESSERTS</td>\n",
       "      <td>GELATIN</td>\n",
       "      <td>.3 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>62</th>\n",
       "      <td>28143</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DRY MIX DESSERTS</td>\n",
       "      <td>GELATIN</td>\n",
       "      <td>.3 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>69</th>\n",
       "      <td>28208</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>6.2 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>79</th>\n",
       "      <td>28377</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DINNER MXS:DRY</td>\n",
       "      <td>SKILLET DINNERS</td>\n",
       "      <td>6.4 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    product_id  manufacturer_id department    brand  product_category  \\\n",
       "60       28116               69    GROCERY  Private  DRY MIX DESSERTS   \n",
       "61       28117               69    GROCERY  Private  DRY MIX DESSERTS   \n",
       "62       28143               69    GROCERY  Private  DRY MIX DESSERTS   \n",
       "69       28208               69    GROCERY  Private    DINNER MXS:DRY   \n",
       "79       28377               69    GROCERY  Private    DINNER MXS:DRY   \n",
       "\n",
       "       product_type package_size  \n",
       "60          GELATIN        .3 OZ  \n",
       "61          GELATIN        .3 OZ  \n",
       "62          GELATIN        .3 OZ  \n",
       "69  SKILLET DINNERS       6.2 OZ  \n",
       "79  SKILLET DINNERS       6.4 OZ  "
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "smaller_sizes = (\n",
    "    products['package_size']\n",
    "    .str.contains('\\\\.[0-4] (?:oz|lb)', case=False, na=False)\n",
    ")\n",
    "\n",
    "products[smaller_sizes].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Shorthand character classes\n",
    "\n",
    "Since certain character classes are used often, a series of shorthand character classes are available. For example, rather than use `[0-9]` every time we are searching for a number in a regex, we can use just use `\\\\d` to match any digit. The following are a few of the commonly used shorthand character classes:\n",
    "\n",
    "```{list-table} Common shorthand character classes.\n",
    ":header-rows: 1\n",
    ":name: common-shorthand-characters\n",
    "\n",
    "* - Syntax\n",
    "  - Description\n",
    "* - `\\\\d`\n",
    "  - match any digit\n",
    "* - `\\\\D`\n",
    "  - match any non-digit\n",
    "* - `\\\\s`\n",
    "  - match a space character\n",
    "* - `\\\\S`\n",
    "  - match a non-space character\n",
    "* - `\\\\w`\n",
    "  - match a word\n",
    "* - `\\\\W`\n",
    "  - match a non-word\n",
    "* - `\\\\b`\n",
    "  - match a word boundary\n",
    "* - `\\\\B`\n",
    "  - match a non-word boundary\n",
    "```\n",
    "\n",
    "We can use these to find patterns such as...\n",
    "\n",
    "Find all products where the `package_size` starts with a numeric digit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>25671</td>\n",
       "      <td>2</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>frozen ICE</td>\n",
       "      <td>ICE - CRUSHED/CUBED</td>\n",
       "      <td>22 LB</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>26190</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FRUIT - SHELF STABLE</td>\n",
       "      <td>APPLE SAUCE</td>\n",
       "      <td>50 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>26355</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>COOKIES/CONES</td>\n",
       "      <td>SPECIALTY COOKIES</td>\n",
       "      <td>14 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>26426</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>SPICES &amp; EXTRACTS</td>\n",
       "      <td>SPICES &amp; SEASONINGS</td>\n",
       "      <td>2.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>26540</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>COOKIES/CONES</td>\n",
       "      <td>TRAY PACK/CHOC CHIP COOKIES</td>\n",
       "      <td>16 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   product_id  manufacturer_id department     brand      product_category  \\\n",
       "0       25671                2    GROCERY  National            frozen ICE   \n",
       "3       26190               69    GROCERY   Private  FRUIT - SHELF STABLE   \n",
       "4       26355               69    GROCERY   Private         COOKIES/CONES   \n",
       "5       26426               69    GROCERY   Private     SPICES & EXTRACTS   \n",
       "6       26540               69    GROCERY   Private         COOKIES/CONES   \n",
       "\n",
       "                  product_type package_size  \n",
       "0          ICE - CRUSHED/CUBED        22 LB  \n",
       "3                  APPLE SAUCE        50 OZ  \n",
       "4            SPECIALTY COOKIES        14 OZ  \n",
       "5          SPICES & SEASONINGS       2.5 OZ  \n",
       "6  TRAY PACK/CHOC CHIP COOKIES        16 OZ  "
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "starts_with_digit = products['package_size'].str.contains('^\\\\d', case=False, na=False)\n",
    "products[starts_with_digit].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or all products where the `package_size` starts with a non-digit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>27657</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>CHEESE</td>\n",
       "      <td>NATURAL CHEESE EXACT WT CHUNKS</td>\n",
       "      <td>A B D  8 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>60</th>\n",
       "      <td>28116</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DRY MIX DESSERTS</td>\n",
       "      <td>GELATIN</td>\n",
       "      <td>.3 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>28117</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DRY MIX DESSERTS</td>\n",
       "      <td>GELATIN</td>\n",
       "      <td>.3 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>62</th>\n",
       "      <td>28143</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>DRY MIX DESSERTS</td>\n",
       "      <td>GELATIN</td>\n",
       "      <td>.3 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>67</th>\n",
       "      <td>28192</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>HOUSEHOLD CLEANG NEEDS</td>\n",
       "      <td>AMMONIA</td>\n",
       "      <td>.5 GAL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    product_id  manufacturer_id department    brand        product_category  \\\n",
       "31       27657               69    GROCERY  Private                  CHEESE   \n",
       "60       28116               69    GROCERY  Private        DRY MIX DESSERTS   \n",
       "61       28117               69    GROCERY  Private        DRY MIX DESSERTS   \n",
       "62       28143               69    GROCERY  Private        DRY MIX DESSERTS   \n",
       "67       28192               69    GROCERY  Private  HOUSEHOLD CLEANG NEEDS   \n",
       "\n",
       "                      product_type package_size  \n",
       "31  NATURAL CHEESE EXACT WT CHUNKS  A B D  8 OZ  \n",
       "60                         GELATIN        .3 OZ  \n",
       "61                         GELATIN        .3 OZ  \n",
       "62                         GELATIN        .3 OZ  \n",
       "67                         AMMONIA       .5 GAL  "
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "starts_with_nondigit = products['package_size'].str.contains('^\\\\D', case=False, na=False)\n",
    "products[starts_with_nondigit].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Repetition\n",
    "\n",
    "When we want to match a __certain number__ of characters that meet a certain criteria we can apply repetition operators to our pattern searches. Common repetition operators include:\n",
    "\n",
    "```{list-table} Common repetition operators.\n",
    ":header-rows: 1\n",
    ":name: common-repitions\n",
    "\n",
    "* - Syntax\n",
    "  - Description\n",
    "* - `.`\n",
    "  - wildcard to match **any character once**\n",
    "* - `?`\n",
    "  - the preceding item is optional and will be matched **at most once**\n",
    "* - `*`\n",
    "  - the preceding item will be matched **zero or more times**\n",
    "* - `+`\n",
    "  - the preceding item will be matched **one or more times**\n",
    "* - `{n}`\n",
    "  - the preceding item will be matched **exactly n times**\n",
    "* - `{n,}`\n",
    "  - the preceding item will be matched **n or more times**\n",
    "* - `{n,m}`\n",
    "  - the preceding item will be matched **at least n times but not more than m times**\n",
    "```\n",
    "\n",
    "For example, say we want to find all products where the `package_size` contains at least 3 digits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>26601</td>\n",
       "      <td>69</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>Private</td>\n",
       "      <td>VITAMINS</td>\n",
       "      <td>VITAMIN - MINERALS</td>\n",
       "      <td>300 CT(1)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85</th>\n",
       "      <td>28603</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FACIAL TISS/DNR NAPKIN</td>\n",
       "      <td>PAPER NAPKINS</td>\n",
       "      <td>250 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>171</th>\n",
       "      <td>32265</td>\n",
       "      <td>693</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>CANDY - CHECKLANE</td>\n",
       "      <td>CANDY BARS (SINGLES)(INCLUDING</td>\n",
       "      <td>1.625 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>243</th>\n",
       "      <td>34214</td>\n",
       "      <td>869</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>TEAS</td>\n",
       "      <td>TEA BAGS &amp; BULK TEA</td>\n",
       "      <td>100 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>281</th>\n",
       "      <td>35215</td>\n",
       "      <td>69</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>Private</td>\n",
       "      <td>ANALGESICS</td>\n",
       "      <td>ADULT ANALGESICS</td>\n",
       "      <td>100 CT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     product_id  manufacturer_id department     brand        product_category  \\\n",
       "7         26601               69    DRUG GM   Private                VITAMINS   \n",
       "85        28603               69    GROCERY   Private  FACIAL TISS/DNR NAPKIN   \n",
       "171       32265              693    DRUG GM  National       CANDY - CHECKLANE   \n",
       "243       34214              869    GROCERY  National                    TEAS   \n",
       "281       35215               69    DRUG GM   Private              ANALGESICS   \n",
       "\n",
       "                       product_type package_size  \n",
       "7                VITAMIN - MINERALS    300 CT(1)  \n",
       "85                    PAPER NAPKINS       250 CT  \n",
       "171  CANDY BARS (SINGLES)(INCLUDING     1.625 OZ  \n",
       "243             TEA BAGS & BULK TEA       100 CT  \n",
       "281                ADULT ANALGESICS       100 CT  "
      ]
     },
     "execution_count": 91,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "three_digits = products['package_size'].str.contains('\\\\d{3,}', na=False)\n",
    "products[three_digits].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One thing you probably notice is that the above syntax will match three or more digits within the entire character string. But what if we wanted to identify repetition of a pattern sequence. For example, say we wanted to find `product_id`s where the number \"8\" is repeated. We simply combine 8 with the corresponding repitition operator from above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/3233621946.py:1: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n",
      "  prod_id_8s = products['product_id'].astype(str).str.contains('(8)\\\\1', na=False)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>26889</td>\n",
       "      <td>32</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>MAGAZINE</td>\n",
       "      <td>TV/MOVIE-MAGAZINE</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>88</th>\n",
       "      <td>28889</td>\n",
       "      <td>69</td>\n",
       "      <td>SEAFOOD-PCKGD</td>\n",
       "      <td>Private</td>\n",
       "      <td>SEAFOOD - FROZEN</td>\n",
       "      <td>SEAFOOD-FRZ-RAW FILLETS</td>\n",
       "      <td>12 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>89</th>\n",
       "      <td>28892</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>FROZEN PIZZA</td>\n",
       "      <td>SNACKS/APPETIZERS</td>\n",
       "      <td>20 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>90</th>\n",
       "      <td>28897</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>EGGS</td>\n",
       "      <td>EGGS - X-LARGE</td>\n",
       "      <td>A D   1 DZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>32888</td>\n",
       "      <td>1075</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>CRACKERS/MISC BKD FD</td>\n",
       "      <td>SNACK CRACKERS</td>\n",
       "      <td>8.5 OZ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     product_id  manufacturer_id     department     brand  \\\n",
       "11        26889               32        DRUG GM  National   \n",
       "88        28889               69  SEAFOOD-PCKGD   Private   \n",
       "89        28892               69        GROCERY   Private   \n",
       "90        28897               69        GROCERY   Private   \n",
       "199       32888             1075        GROCERY  National   \n",
       "\n",
       "         product_category             product_type package_size  \n",
       "11               MAGAZINE        TV/MOVIE-MAGAZINE         None  \n",
       "88       SEAFOOD - FROZEN  SEAFOOD-FRZ-RAW FILLETS        12 OZ  \n",
       "89           FROZEN PIZZA        SNACKS/APPETIZERS        20 OZ  \n",
       "90                   EGGS           EGGS - X-LARGE   A D   1 DZ  \n",
       "199  CRACKERS/MISC BKD FD           SNACK CRACKERS       8.5 OZ  "
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prod_id_8s = products['product_id'].astype(str).str.contains('8{2,}', na=False)\n",
    "products[prod_id_8s].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if we wanted to look for `product_id`s that contain two, but no more than three \"8\"s in a row:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/1949681077.py:1: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n",
      "  prod_id_8s = products['product_id'].astype(str).str.contains('(8)\\\\1{2}', na=False)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_id</th>\n",
       "      <th>manufacturer_id</th>\n",
       "      <th>department</th>\n",
       "      <th>brand</th>\n",
       "      <th>product_category</th>\n",
       "      <th>product_type</th>\n",
       "      <th>package_size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>88</th>\n",
       "      <td>28889</td>\n",
       "      <td>69</td>\n",
       "      <td>SEAFOOD-PCKGD</td>\n",
       "      <td>Private</td>\n",
       "      <td>SEAFOOD - FROZEN</td>\n",
       "      <td>SEAFOOD-FRZ-RAW FILLETS</td>\n",
       "      <td>12 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>32888</td>\n",
       "      <td>1075</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>CRACKERS/MISC BKD FD</td>\n",
       "      <td>SNACK CRACKERS</td>\n",
       "      <td>8.5 OZ</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1065</th>\n",
       "      <td>62888</td>\n",
       "      <td>69</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>Private</td>\n",
       "      <td>LAUNDRY ADDITIVES</td>\n",
       "      <td>FABRIC SOFTENER LIQUID</td>\n",
       "      <td>60 LOADS</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1368</th>\n",
       "      <td>75888</td>\n",
       "      <td>1448</td>\n",
       "      <td>GROCERY</td>\n",
       "      <td>National</td>\n",
       "      <td>CAT LITTER</td>\n",
       "      <td>LITTER ACCESSORIES</td>\n",
       "      <td>12 CT</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1436</th>\n",
       "      <td>78888</td>\n",
       "      <td>1051</td>\n",
       "      <td>DRUG GM</td>\n",
       "      <td>National</td>\n",
       "      <td>ELECTRICAL SUPPPLIES</td>\n",
       "      <td>DECOR BULBS</td>\n",
       "      <td>2 CT</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      product_id  manufacturer_id     department     brand  \\\n",
       "88         28889               69  SEAFOOD-PCKGD   Private   \n",
       "199        32888             1075        GROCERY  National   \n",
       "1065       62888               69        GROCERY   Private   \n",
       "1368       75888             1448        GROCERY  National   \n",
       "1436       78888             1051        DRUG GM  National   \n",
       "\n",
       "          product_category             product_type package_size  \n",
       "88        SEAFOOD - FROZEN  SEAFOOD-FRZ-RAW FILLETS        12 OZ  \n",
       "199   CRACKERS/MISC BKD FD           SNACK CRACKERS       8.5 OZ  \n",
       "1065     LAUNDRY ADDITIVES   FABRIC SOFTENER LIQUID     60 LOADS  \n",
       "1368            CAT LITTER       LITTER ACCESSORIES        12 CT  \n",
       "1436  ELECTRICAL SUPPPLIES              DECOR BULBS         2 CT  "
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prod_id_8s = products['product_id'].astype(str).str.contains('8{2,3}', na=False)\n",
    "products[prod_id_8s].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Putting it altogether\n",
    "\n",
    "Ok, let's use a few tools we've learned to answer a question we may get asked by our boss. Say we were asked to identify the top 5 products that have the most total sales; however, we only want to focus on those products that weigh 10lbs or more. We can apply the following steps:\n",
    "\n",
    "1. filter for regex `\"^\\\\d{2,}\\\\.?.*lb\"` which means:\n",
    "   - `^\\\\d{2,}`: starts with at least 2 numeric digits\n",
    "   - `\\\\.?.`: followed by an optional decimal\n",
    "   - `.*lb`: followed by a character zero or more times and then the identifier for pounds (lb).\n",
    "2. take the resulting product list and inner join with transactions so we only retain those transactions and products that have a matching product ID in both tables,\n",
    "3. compute total sales grouped by product (here we use product type just to provide us more context over the product ID),\n",
    "4. and then use `nlargest` to get the top 5 `total_sales` values (you could've also used `sort_values()` to get to the same conclusion)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_64719/3801605409.py:1: UserWarning: This pattern is interpreted as a regular expression, and has match groups. To actually get the groups, use str.extract.\n",
      "  size_filter = products['package_size'].str.contains('^\\d{2,}(\\.)?.*lb', case=False, na=False)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>product_type</th>\n",
       "      <th>sales_value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>BANANAS</td>\n",
       "      <td>17480.66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>GRAPES RED</td>\n",
       "      <td>9999.96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>GRAPES WHITE</td>\n",
       "      <td>8467.86</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61</th>\n",
       "      <td>POTATOES RUSSET (BULK&amp;BAG)</td>\n",
       "      <td>7805.49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>DRY DOG FOOD PREMIUM (ALPO/PUR</td>\n",
       "      <td>7146.87</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                      product_type  sales_value\n",
       "3                          BANANAS     17480.66\n",
       "28                      GRAPES RED      9999.96\n",
       "29                    GRAPES WHITE      8467.86\n",
       "61      POTATOES RUSSET (BULK&BAG)      7805.49\n",
       "22  DRY DOG FOOD PREMIUM (ALPO/PUR      7146.87"
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "size_filter = products['package_size'].str.contains('^\\\\d{2,}\\\\.?.*lb', case=False, na=False)\n",
    "(\n",
    "    products[size_filter]\n",
    "    .merge(cj_data['transactions'], how='inner', on='product_id')\n",
    "    .groupby('product_type', as_index=False)\n",
    "    .agg({'sales_value': 'sum'})\n",
    "    .nlargest(5, 'sales_value')\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Knowledge check\n",
    "\n",
    "```{admonition} Questions:\n",
    ":class: attention\n",
    "1. How many `products` contain the word \"bulk\" in `product_type`?\n",
    "2. How many `products` do not contain punctuation in their `package_size`?\n",
    "3. Find all frozen pizza products. Be careful, this is not straight forward!\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} Video 🎥:\n",
    "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/Xg8y3UkQDpo?si=egVqYMvYj5mf3YVT\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" allowfullscreen></iframe>\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "```{admonition} Questions:\n",
    ":class: attention\n",
    "To answer these questions you'll need to use the `products` and `transactions` data frames.\n",
    "\n",
    "1. Identify all different products that contain \"pizza\" in their `product_type` description. Which product produces the greatest amount of total sales?\n",
    "2. Identify all products that are categorized (`product_category`) as pizza but are considered a snack or appetizer (`product_type`). Which of these products have the most sales (measured by quantity)?\n",
    "3. How many products contain `package_size`s that do not contain a numeric value.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python implementation: CPython\n",
      "Python version       : 3.9.12\n",
      "IPython version      : 8.2.0\n",
      "\n",
      "jupyterlab        : 3.3.2\n",
      "pandas            : 1.4.2\n",
      "completejourney_py: 0.0.3\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -v -p jupyterlab,pandas,completejourney_py"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "40d3a090f54c6569ab1632332b64b2c03c39dcf918b08424e98f38b5ae0af88f"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}