Lesson 2c: Deeper dive on DataFrames#

Now that we know how to import data as a DataFrame, let’s spend a little time discussing DataFrames and what they are made up of. This should create a stronger foundation of Pandas and DataFrames.

Learning objectives#

At the end of this lesson you should be able to:

  • Explain the difference between DataFrames and Series

  • Set and manipulate index values

To illustrate our points throughout this lesson, we’ll use the following airlines data which simply includes the name of the airline carrier and the airline carrier code:

import pandas as pd

df = pd.read_csv('../data/airlines.csv')
df.head()
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.

What Are DataFrames Made of?#

Video 🎥:

Accessing an individual column of a DataFrame can be done by passing the column name as a string, in brackets ([]).

carrier_column = df['carrier']
carrier_column
0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object

Individual columns are pandas Series objects.

type(carrier_column)
pandas.core.series.Series

A Series is like a NumPy array but with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, objects, etc), including a mix of them. Series can be created from a scalar, a list, ndarray or dictionary using pd.Series() (note the captial “S”). Here are some example series:

import-framework.png

Source: Python Programming for Data Science

How are Series different from DataFrames?

  • They’re always 1-dimensional

  • They have different attributes & methods than DataFrames

    • For example, Series have a to_list method – which doesn’t make sense to have on DataFrames

  • They don’t print in the pretty format of DataFrames, but in plain text (see above)

# A Series only has one dimension (number of elements)
carrier_column.shape
(16,)
# Whereas a DataFrame has two (rows and columns)
df.shape
(16, 2)

As mentioned above, Series will have different attributes & methods than DataFrames. For example, Series have a to_list method which converts the single array of values into a list. This is a common practice in a lot of workflows. However, a DataFrame does not have this method.

# A unique method to Series
carrier_column.to_list()
['9E',
 'AA',
 'AS',
 'B6',
 'DL',
 'EV',
 'F9',
 'FL',
 'HA',
 'MQ',
 'OO',
 'UA',
 'US',
 'VX',
 'WN',
 'YV']
# DataFrames don't have this method so we'll get an error!
df.to_list()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_4260/33797969.py in ?()
      1 # DataFrames don't have this method so we'll get an error!
----> 2 df.to_list()

/opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/generic.py in ?(self, name)
   6295             and name not in self._accessors
   6296             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6297         ):
   6298             return self[name]
-> 6299         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'to_list'

It’s important to be familiar with Series because they are fundamentally the core of DataFrames.

import-framework.png

Source: Python Programming for Data Science

Not only are columns represented as Series, but so are rows!

df.head(1)
carrier name
0 9E Endeavor Air Inc.
# Fetch the first row of the DataFrame
first_row = df.loc[0]
first_row
carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object
type(first_row)
pandas.core.series.Series

Note

Whenever you select individual columns or rows, you’ll get Series objects.

What Can You Do with a Series?#

Video 🎥:

First, let’s create our own Series object from scratch – they don’t need to come from a DataFrame. Here, we pass a list in as an argument and it will be converted to a Series.

s = pd.Series([10, 20, 30, 40, 50])
s
0    10
1    20
2    30
3    40
4    50
dtype: int64

There are 3 things to notice about this Series:

  • The values (10, 20, 30…)

  • The dtype, short for data type.

  • The index (0, 1, 2…)

Values are fairly self-explanatory; we chose them in our input list.

Data types are also straightforward. Series are often homogeneous, holding only integers, floats, or generic Python objects (called just object); however, they actually can hold a mix of data types (though this is quite rare and we’ll stay away from doing this for the most part). As we mentioned in the last lesson, string elements in DataFrames are often labeled as having an object dtype. Because a Python object is general enough to contain any other type, any Series holding strings or other non-numeric data will typically default to be of type object.

For example, going back to our carriers DataFrame, note that the carrier column is of type object.

df['carrier']
0     9E
1     AA
2     AS
3     B6
4     DL
5     EV
6     F9
7     FL
8     HA
9     MQ
10    OO
11    UA
12    US
13    VX
14    WN
15    YV
Name: carrier, dtype: object

Indexes are more interesting. Every Series has an index, a way to reference each element. The index of a Series is a lot like the keys of a dictionary: each index element corresponds to a value in the Series, and can be used to look up that element.

# Our index is a range from 0 (inclusive) to 5 (exclusive).
s.index
RangeIndex(start=0, stop=5, step=1)
s
0    10
1    20
2    30
3    40
4    50
dtype: int64

As we learned to index other data structures in the previous module, we can continue to use [] for indexing Series. When we specify below, we are stating to get the 4th element in our Series located at index value 3.

s[3]
40

In our example, the index is just the integers 0-4, so right now it looks no different that referencing elements of a regular Python list. But indexes can be changed to something different – like the letters a-e, for example.

s.index = ['a', 'b', 'c', 'd', 'e']
s
a    10
b    20
c    30
d    40
e    50
dtype: int64

Now to look up the value 40, we reference 'd'.

s['d']
40

We saw earlier that rows of a DataFrame are Series. In such cases, the flexibility of Series indexes comes in handy; the index is set to the DataFrame column names.

df.head()
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.
# Note that the index is ['carrier', 'name']
first_row = df.loc[0]
first_row
carrier                   9E
name       Endeavor Air Inc.
Name: 0, dtype: object

This is particularly handy because it means you can extract individual elements based on a column name.

first_row['carrier']
'9E'

DataFrame Indexes#

Video 🎥:

It’s not just Series that have indexes! DataFrames have them too. Take a look at the carrier DataFrame again and note the bold numbers on the left.

df.head()
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.

These numbers are an index, just like the one we saw on our example Series. And DataFrame indexes support similar functionality.

# Our index is a range from 0 (inclusive) to 16 (exclusive).
df.index
RangeIndex(start=0, stop=16, step=1)

When loading in a DataFrame, the default index will always be 0 to N-1, where N is the number of rows in your DataFrame. This is called a RangeIndex. Selecting individual rows by their index is done with the .loc accessor.

Tip

An accessor is an attribute designed specifically to help users reference something else (like rows within a DataFrame).

# Get the row at index 4 (the fifth row).
df.loc[4]
carrier                      DL
name       Delta Air Lines Inc.
Name: 4, dtype: object

As with Series, DataFrames support reassigning their index. However, with DataFrames it often makes sense to change one of your columns into the index. This is analogous to a primary key in relational databases: a way to rapidly look up rows within a table.

In our case, maybe we will often use the carrier code (carrier) to look up the full name of the airline. In that case, it would make sense set the carrier column as our index.

df = df.set_index('carrier')
df.head()
name
carrier
9E Endeavor Air Inc.
AA American Airlines Inc.
AS Alaska Airlines Inc.
B6 JetBlue Airways
DL Delta Air Lines Inc.

Now the RangeIndex has been replaced with a more meaningful index, and it’s possible to look up rows of the table by passing carrier code to the .loc accessor.

df.loc['AA']
name    American Airlines Inc.
Name: AA, dtype: object

Warning

Pandas does not require that indexes have unique values (that is, no duplicates) although many relational databases do have that requirement of a primary key. This means that it is possible to create a non-unique index, but highly inadvisable. Having duplicate values in your index can cause unexpected results when you refer to rows by index – but multiple rows have that index. Don’t do it if you can help it!

When starting to work with a DataFrame, it’s often a good idea to determine what column makes sense as your index and to set it immediately. This will make your code nicer – by letting you directly look up values with the index – and also make your selections and filters faster, because Pandas is optimized for operations by index. If you want to change the index of your DataFrame later, you can always reset_index (and then assign a new one).

df.head()
name
carrier
9E Endeavor Air Inc.
AA American Airlines Inc.
AS Alaska Airlines Inc.
B6 JetBlue Airways
DL Delta Air Lines Inc.
df = df.reset_index()
df.head()
carrier name
0 9E Endeavor Air Inc.
1 AA American Airlines Inc.
2 AS Alaska Airlines Inc.
3 B6 JetBlue Airways
4 DL Delta Air Lines Inc.

Exercises#

Questions:

Import the airports data as airports. The data contains the airport code, airport name, and some basic facts about the airport location.

  1. What kind of index is the current index of airports?

  2. Is this a good choice for the DataFrame’s index? If not, what column or columns would be a better candidate?

  3. Set the faa column as your new index.

  4. Using your new index, look up “Pittsburgh-Monroeville Airport”, which has FAA code 4G0. What is its altitude?

  5. Reset your index in case you want to make a different column your index in the future.

Computing environment#

Hide code cell source
%load_ext watermark
%watermark -v -p jupyterlab,pandas
Python implementation: CPython
Python version       : 3.9.4
IPython version      : 7.26.0

jupyterlab: 3.1.4
pandas    : 1.2.4