{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lesson 2c: Deeper dive on DataFrames\n", "\n", "Now that we know how to import data as a DataFrame, let's spend a little time discussing DataFrames and what they are made up of. This should create a stronger foundation of Pandas and DataFrames.\n", "\n", "## Learning objectives\n", "\n", "At the end of this lesson you should be able to:\n", "\n", "* Explain the difference between DataFrames and Series\n", "* Set and manipulate index values\n", "\n", "To illustrate our points throughout this lesson, we'll use the following airlines data which simply includes the name of the airline carrier and the airline carrier code:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>carrier</th><th>name</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>9E</td><td>Endeavor Air Inc.</td></tr>\n", "    <tr><th>1</th><td>AA</td><td>American Airlines Inc.</td></tr>\n", "    <tr><th>2</th><td>AS</td><td>Alaska Airlines Inc.</td></tr>\n", "    <tr><th>3</th><td>B6</td><td>JetBlue Airways</td></tr>\n", "    <tr><th>4</th><td>DL</td><td>Delta Air Lines Inc.</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "df = pd.read_csv('../data/airlines.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What Are DataFrames Made of?\n", "\n", "```{admonition} Video 🎥:\n", "\n", "```\n", "\n", "Accessing an individual column of a DataFrame can be done by passing the column name as a string, in brackets (`[]`)." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 9E\n", "1 AA\n", "2 AS\n", "3 B6\n", "4 DL\n", "5 EV\n", "6 F9\n", "7 FL\n", "8 HA\n", "9 MQ\n", "10 OO\n", "11 UA\n", "12 US\n", "13 VX\n", "14 WN\n", "15 YV\n", "Name: carrier, dtype: object" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "carrier_column = df['carrier']\n", "carrier_column" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Individual columns are pandas `Series` objects." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(carrier_column)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "A Series is like a NumPy array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc.), including a mix of them. 
Series can be created from a scalar, a list, an ndarray, or a dictionary using `pd.Series()` (note the capital “S”). Here are some example Series:\n", "\n", "
\n", "\"import-framework.png\"\n", "
\n", "\n", "Source: [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science)\n", "\n", "How are Series different from DataFrames?\n", "\n", "- They're always 1-dimensional\n", "- They have different attributes & methods than DataFrames\n", " - For example, Series have a `to_list` method -- which doesn't make sense to have on DataFrames\n", "- They don't print in the pretty format of DataFrames, but in plain text (see above)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "(16,)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A Series only has one dimension (number of elements)\n", "carrier_column.shape" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(16, 2)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Whereas a DataFrame has two (rows and columns)\n", "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned above, Series will have different attributes & methods than DataFrames. For example, Series have a `to_list` method which converts the single array of values into a list. This is a common practice in a lot of workflows. However, a DataFrame does not have this method." 
] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['9E',\n", " 'AA',\n", " 'AS',\n", " 'B6',\n", " 'DL',\n", " 'EV',\n", " 'F9',\n", " 'FL',\n", " 'HA',\n", " 'MQ',\n", " 'OO',\n", " 'UA',\n", " 'US',\n", " 'VX',\n", " 'WN',\n", " 'YV']" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A unique method to Series\n", "carrier_column.to_list()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [ "ci-skip" ] }, "outputs": [ { "ename": "AttributeError", "evalue": "'DataFrame' object has no attribute 'to_list'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mto_list\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/Desktop/Workspace/Training/intro-python-datasci/venv/lib/python3.8/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m__getattr__\u001b[0;34m(self, name)\u001b[0m\n\u001b[1;32m 5271\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_info_axis\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_can_hold_identifiers_and_holds_name\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5272\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 5273\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mobject\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m__getattribute__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5274\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5275\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__setattr__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mstr\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalue\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m->\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mAttributeError\u001b[0m: 'DataFrame' object has no attribute 'to_list'" ] } ], "source": [ "# DataFrames don't have this method so we'll get an error!\n", "df.to_list()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "It's important to be familiar with Series because they are fundamentally the core of DataFrames.\n", "\n", "
\n", "\"import-framework.png\"\n", "
\n", "\n", "Source: [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science)\n", "\n", "Not only are columns represented as Series, but so are rows!" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>carrier</th><th>name</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>9E</td><td>Endeavor Air Inc.</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc." ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(1)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "carrier 9E\n", "name Endeavor Air Inc.\n", "Name: 0, dtype: object" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fetch the first row of the DataFrame\n", "first_row = df.loc[0]\n", "first_row" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "pandas.core.series.Series" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(first_row)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```{note}\n", "Whenever you select a single column or a single row, you'll get a Series object.\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What Can You Do with a Series?\n", "\n", "```{admonition} Video 🎥:\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "First, let's create our own Series object from scratch -- they don't need to come from a DataFrame. Here, we pass a list in as an argument and it will be converted to a Series." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 20\n", "2 30\n", "3 40\n", "4 50\n", "dtype: int64" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = pd.Series([10, 20, 30, 40, 50])\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There are 3 things to notice about this Series:\n", "\n", "- The values (10, 20, 30...)\n", "- The *dtype*, short for data type.\n", "- The *index* (0, 1, 2...)\n", "\n", "**Values** are fairly self-explanatory; we chose them in our input list. \n", "\n", "**Data types** are also straightforward. Series are often homogeneous, holding only integers, floats, or generic Python objects (called just `object`); however, they actually can hold a mix of data types (though this is quite rare and we'll stay away from doing this for the most part). As we mentioned in the last lesson, string elements in DataFrames are often labeled as having an _object_ dtype. Because a Python object is general enough to contain any other type, any Series holding strings or other non-numeric data will typically default to be of type `object`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "For example, going back to our carriers DataFrame, note that the carrier column is of type `object`." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 9E\n", "1 AA\n", "2 AS\n", "3 B6\n", "4 DL\n", "5 EV\n", "6 F9\n", "7 FL\n", "8 HA\n", "9 MQ\n", "10 OO\n", "11 UA\n", "12 US\n", "13 VX\n", "14 WN\n", "15 YV\n", "Name: carrier, dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['carrier']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Indexes** are more interesting. Every Series has an index, a way to reference each element. The index of a Series is a lot like the keys of a dictionary: each index element corresponds to a value in the Series, and can be used to look up that element." ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=5, step=1)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Our index is a range from 0 (inclusive) to 5 (exclusive).\n", "s.index" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "0 10\n", "1 20\n", "2 30\n", "3 40\n", "4 50\n", "dtype: int64" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just as we indexed other data structures in the previous module, we can use `[]` to index a Series. Below, `s[3]` retrieves the 4th element of our Series, which is located at index value 3." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s[3]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "In our example, the index is just the integers 0-4, so right now it looks no different than referencing elements of a regular Python list.\n", "*But* indexes can be changed to something different -- like the letters a-e, for example." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "a 10\n", "b 20\n", "c 30\n", "d 40\n", "e 50\n", "dtype: int64" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s.index = ['a', 'b', 'c', 'd', 'e']\n", "s" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Now to look up the value 40, we reference `'d'`." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "40" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s['d']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We saw earlier that rows of a DataFrame are Series.\n", "In such cases, the flexibility of Series indexes comes in handy;\n", "the index is set to the DataFrame column names." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>carrier</th><th>name</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>9E</td><td>Endeavor Air Inc.</td></tr>\n", "    <tr><th>1</th><td>AA</td><td>American Airlines Inc.</td></tr>\n", "    <tr><th>2</th><td>AS</td><td>Alaska Airlines Inc.</td></tr>\n", "    <tr><th>3</th><td>B6</td><td>JetBlue Airways</td></tr>\n", "    <tr><th>4</th><td>DL</td><td>Delta Air Lines Inc.</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "carrier 9E\n", "name Endeavor Air Inc.\n", "Name: 0, dtype: object" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that the index is ['carrier', 'name']\n", "first_row = df.loc[0]\n", "first_row" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "This is particularly handy because it means you can extract individual elements based on a column name." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'9E'" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "first_row['carrier']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## DataFrame Indexes\n", "\n", "```{admonition} Video 🎥:\n", "\n", "```\n", "\n", "It's not just Series that have indexes! DataFrames have them too. Take a look at the carrier DataFrame again and note the bold numbers on the left." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>carrier</th><th>name</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>9E</td><td>Endeavor Air Inc.</td></tr>\n", "    <tr><th>1</th><td>AA</td><td>American Airlines Inc.</td></tr>\n", "    <tr><th>2</th><td>AS</td><td>Alaska Airlines Inc.</td></tr>\n", "    <tr><th>3</th><td>B6</td><td>JetBlue Airways</td></tr>\n", "    <tr><th>4</th><td>DL</td><td>Delta Air Lines Inc.</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "These numbers are an index, just like the one we saw on our example Series.\n", "And DataFrame indexes support similar functionality." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "RangeIndex(start=0, stop=16, step=1)" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Our index is a range from 0 (inclusive) to 16 (exclusive).\n", "df.index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When loading in a DataFrame, the default index will always be 0 to N-1, where N is the number of rows in your DataFrame.\n", "This is called a `RangeIndex`. Selecting individual rows by their index is done with the `.loc` accessor.\n", "\n", "```{tip}\n", "An **accessor** is an attribute designed specifically to help users reference something else (like rows within a DataFrame).\n", "```" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "carrier DL\n", "name Delta Air Lines Inc.\n", "Name: 4, dtype: object" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the row at index 4 (the fifth row).\n", "df.loc[4]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "As with Series, DataFrames support reassigning their index. 
However, with DataFrames it often makes sense to use one of your columns as the index. This is analogous to a primary key in relational databases: a way to rapidly look up rows within a table.\n", "\n", "In our case, maybe we will often use the carrier code (`carrier`) to look up the full name of the airline. In that case, it would make sense to set the carrier column as our index." ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>name</th></tr>\n", "    <tr><th>carrier</th><th></th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>9E</th><td>Endeavor Air Inc.</td></tr>\n", "    <tr><th>AA</th><td>American Airlines Inc.</td></tr>\n", "    <tr><th>AS</th><td>Alaska Airlines Inc.</td></tr>\n", "    <tr><th>B6</th><td>JetBlue Airways</td></tr>\n", "    <tr><th>DL</th><td>Delta Air Lines Inc.</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " name\n", "carrier \n", "9E Endeavor Air Inc.\n", "AA American Airlines Inc.\n", "AS Alaska Airlines Inc.\n", "B6 JetBlue Airways\n", "DL Delta Air Lines Inc." ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.set_index('carrier')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Now the RangeIndex has been replaced with a more meaningful index, and it's possible to look up rows of the table by passing carrier code to the `.loc` accessor." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "name American Airlines Inc.\n", "Name: AA, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc['AA']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "```{warning}\n", "Pandas does not require that indexes have unique values (that is, no duplicates) although many relational databases do have that requirement of a primary key. This means that it is *possible* to create a non-unique index, but highly inadvisable. Having duplicate values in your index can cause unexpected results when you refer to rows by index -- but multiple rows have that index. Don't do it if you can help it!\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "When starting to work with a DataFrame, it's often a good idea to determine what column makes sense as your index and to set it immediately. This will make your code nicer -- by letting you directly look up values with the index -- and also make your selections and filters faster, because Pandas is optimized for operations by index. If you want to change the index of your DataFrame later, you can always `reset_index` (and then assign a new one)." 
] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>name</th></tr>\n", "    <tr><th>carrier</th><th></th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>9E</th><td>Endeavor Air Inc.</td></tr>\n", "    <tr><th>AA</th><td>American Airlines Inc.</td></tr>\n", "    <tr><th>AS</th><td>Alaska Airlines Inc.</td></tr>\n", "    <tr><th>B6</th><td>JetBlue Airways</td></tr>\n", "    <tr><th>DL</th><td>Delta Air Lines Inc.</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " name\n", "carrier \n", "9E Endeavor Air Inc.\n", "AA American Airlines Inc.\n", "AS Alaska Airlines Inc.\n", "B6 JetBlue Airways\n", "DL Delta Air Lines Inc." ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>carrier</th><th>name</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>9E</td><td>Endeavor Air Inc.</td></tr>\n", "    <tr><th>1</th><td>AA</td><td>American Airlines Inc.</td></tr>\n", "    <tr><th>2</th><td>AS</td><td>Alaska Airlines Inc.</td></tr>\n", "    <tr><th>3</th><td>B6</td><td>JetBlue Airways</td></tr>\n", "    <tr><th>4</th><td>DL</td><td>Delta Air Lines Inc.</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " carrier name\n", "0 9E Endeavor Air Inc.\n", "1 AA American Airlines Inc.\n", "2 AS Alaska Airlines Inc.\n", "3 B6 JetBlue Airways\n", "4 DL Delta Air Lines Inc." ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.reset_index()\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercises\n", "\n", "```{admonition} Questions:\n", ":class: attention\n", "Import the [airports data](https://github.com/bradleyboehmke/uc-bana-6043/blob/main/instructor-material/module-2/data/airports.csv) as `airports`. The data contains the airport code, airport name, and some basic facts about the airport location.\n", "\n", "1. What kind of index is the current index of `airports`? \n", "2. Is this a good choice for the DataFrame's index? If not, what column or columns would be a better candidate?\n", "3. Set the `faa` column as your new index.\n", "4. Using your new index, look up \"Pittsburgh-Monroeville Airport\", which has FAA code 4G0. What is its altitude?\n", "5. 
Reset your index in case you want to make a different column your index in the future.\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python implementation: CPython\n", "Python version : 3.9.4\n", "IPython version : 7.26.0\n", "\n", "jupyterlab: 3.1.4\n", "pandas : 1.2.4\n", "\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -v -p jupyterlab,pandas" ] } ], "metadata": { "interpreter": { "hash": "8e346f37fddfdb8857cb357a73c60b4cb7ac7bbe3e050d88264656a45d51c4e3" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "rise": { "autolaunch": true, "transition": "none" } }, "nbformat": 4, "nbformat_minor": 4 }