Lesson 3b: Manipulating data#

During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time.

- Wes McKinney, the creator of Pandas, in his book Python for Data Analysis

We’ve learned how to subset our DataFrames, in this lesson we’ll focus on how to manipulate, create, drop, even identify missing value patterns across our DataFrame’s columns.

Learning objectives#

By the end of this lesson you will be able to:

  • Rename columns

  • Perform calculations and operations with one or more columns

  • Add, drop, and overwrite columns in your DataFrame

  • Identify missing values and replace these (and even non-missing) values.

Renaming columns#

Often, one of the first things we want to do with a new data set is clean up the column names. We can do this a few different ways and to illustrate, let’s look at the Ames housing data:

import pandas as pd

ames = pd.read_csv('../data/ames_raw.csv')
ames.head()
Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice
0 1 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 2 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 3 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 4 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 5 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

5 rows × 82 columns

Say we want to rename the “MS SubClass” and “MS Zoning” columns. We can do so with the rename method and passing a dictionary that maps old names to new names: df.rename(columns={'old_name1': 'new_name1', 'old_name2': 'new_name2'}).

ames.rename(columns={'MS SubClass': 'ms_subclass', 'MS Zoning': 'ms_zoning'})
ames.head()
Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice
0 1 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 2 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 3 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 4 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 5 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

5 rows × 82 columns

Wait? What happened? Nothing changed? In the code above we did actually rename columns of our DataFrame but we didn’t modify the DataFrame inplace, we made a copy of it. There are generally two options for making permanent DataFrame changes:

  1. Use the argument inplace=True, e.g., df.rename(..., inplace=True), available in most Pandas functions/methods

  2. Re-assign, e.g., df = df.rename(...) The Pandas team recommends Method 2 (re-assign), for a few reasons (mostly to do with how memory is allocated under the hood).

Warning

Be sure to include the columns= when providing the argument dictionary. rename can be used to rename index values as well, which is actually the default behavior. So if you don’t specify columns= it’ll behave differently then expected and no error/warning messages will be provided.

ames = ames.rename(columns={'MS SubClass': 'ms_subclass', 'MS Zoning': 'ms_zoning'})
ames.head()
Order PID ms_subclass ms_zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice
0 1 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 2 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 3 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 4 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 5 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

5 rows × 82 columns

Using rename is great for renaming a single or even a handful of columns but can be tedious for renaming many columns. For this we can use the .columns attribute, which just returns all the column names.

ames.columns
Index(['Order', 'PID', 'ms_subclass', 'ms_zoning', 'Lot Frontage', 'Lot Area',
       'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities',
       'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style',
       'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF',
       'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath',
       'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr',
       'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt',
       'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual',
       'Garage Cond', 'Paved Drive', 'Wood Deck SF', 'Open Porch SF',
       'Enclosed Porch', '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Pool QC',
       'Fence', 'Misc Feature', 'Misc Val', 'Mo Sold', 'Yr Sold', 'Sale Type',
       'Sale Condition', 'SalePrice'],
      dtype='object')

Pandas offers a lot of string methods that we can apply to string objects.

Tip

Check out some of the more common string methods here.

We can manipulate these column name values by using string methods that you can access via .str.xxxx(). For example, we can coerce all the column names to lower case with:

ames.columns.str.lower()
Index(['order', 'pid', 'ms_subclass', 'ms_zoning', 'lot frontage', 'lot area',
       'street', 'alley', 'lot shape', 'land contour', 'utilities',
       'lot config', 'land slope', 'neighborhood', 'condition 1',
       'condition 2', 'bldg type', 'house style', 'overall qual',
       'overall cond', 'year built', 'year remod/add', 'roof style',
       'roof matl', 'exterior 1st', 'exterior 2nd', 'mas vnr type',
       'mas vnr area', 'exter qual', 'exter cond', 'foundation', 'bsmt qual',
       'bsmt cond', 'bsmt exposure', 'bsmtfin type 1', 'bsmtfin sf 1',
       'bsmtfin type 2', 'bsmtfin sf 2', 'bsmt unf sf', 'total bsmt sf',
       'heating', 'heating qc', 'central air', 'electrical', '1st flr sf',
       '2nd flr sf', 'low qual fin sf', 'gr liv area', 'bsmt full bath',
       'bsmt half bath', 'full bath', 'half bath', 'bedroom abvgr',
       'kitchen abvgr', 'kitchen qual', 'totrms abvgrd', 'functional',
       'fireplaces', 'fireplace qu', 'garage type', 'garage yr blt',
       'garage finish', 'garage cars', 'garage area', 'garage qual',
       'garage cond', 'paved drive', 'wood deck sf', 'open porch sf',
       'enclosed porch', '3ssn porch', 'screen porch', 'pool area', 'pool qc',
       'fence', 'misc feature', 'misc val', 'mo sold', 'yr sold', 'sale type',
       'sale condition', 'saleprice'],
      dtype='object')

We can even chain multiple string methods together. For example the following coerces the column names to lower case, replaces all white space in the column names with an underscore, and then assigns these converted values back to the .columns attribute.

ames.columns = ames.columns.str.lower().str.replace(" ", "_")
ames.head()
order pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour ... pool_area pool_qc fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice
0 1 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 2 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 3 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 4 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 5 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

5 rows × 82 columns

Video 🎥:

Calculations using columns#

It’s common to want to modify a column of a DataFrame, or sometimes even to create a new column. For example, let’s look at the saleprice column in our data.

sale_price = ames['saleprice']
sale_price
0       215000
1       105000
2       172000
3       244000
4       189900
         ...  
2925    142500
2926    131000
2927    132000
2928    170000
2929    188000
Name: saleprice, Length: 2930, dtype: int64

Say we wanted to convert the sales price of our homes to be represented as thousands; so rather than “215000” we want to represent it as “215”? To do this we can simply divide by 1,000.

sale_price_k = sale_price / 1000
sale_price_k
0       215.0
1       105.0
2       172.0
3       244.0
4       189.9
        ...  
2925    142.5
2926    131.0
2927    132.0
2928    170.0
2929    188.0
Name: saleprice, Length: 2930, dtype: float64

Adding & removing columns#

So we’ve create a new series, sale_price_k. Right now it’s totally separate from our original ames DataFrame, but we can make it a column of ames using the assignment syntax with the column reference syntax.

df['new_column_name'] = new_column_series
ames['sale_price_k'] = sale_price_k
ames.head()
order pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour ... pool_qc fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice sale_price_k
0 1 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl ... NaN NaN NaN 0 5 2010 WD Normal 215000 215.0
1 2 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl ... NaN MnPrv NaN 0 6 2010 WD Normal 105000 105.0
2 3 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl ... NaN NaN Gar2 12500 6 2010 WD Normal 172000 172.0
3 4 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl ... NaN NaN NaN 0 4 2010 WD Normal 244000 244.0
4 5 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl ... NaN MnPrv NaN 0 3 2010 WD Normal 189900 189.9

5 rows × 83 columns

Note that ames now has a “sale_price_k” column at the end.

Note

Also note that in the code above, the column name goes in quotes within the bracket syntax, while the values that will become the column – the Series we’re using – are on the right side of the statement, without any brackets or quotes.

This sequence of operations can be expressed as a single line:

ames['sale_price_k'] = ames['saleprice'] / 1000

From a mathematical perspective, what we’re doing here is adding a scalar – a single value – to a vector – a series of values (aka a Series). Other vector-scalar math is supported as well.

# Subtraction
(ames['saleprice']- 12).head()
0    214988
1    104988
2    171988
3    243988
4    189888
Name: saleprice, dtype: int64
# Multiplication
(ames['saleprice'] * 10).head()
0    2150000
1    1050000
2    1720000
3    2440000
4    1899000
Name: saleprice, dtype: int64
# Exponentiation
(ames['saleprice'] ** 2).head()
0    46225000000
1    11025000000
2    29584000000
3    59536000000
4    36062010000
Name: saleprice, dtype: int64

We may want to drop columns as well. For this we can use the .drop() method:

ames = ames.drop(columns=['order', 'sale_price_k'])
ames.head()
pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities ... pool_area pool_qc fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice
0 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

5 rows × 81 columns

Knowledge check#

Question:

  1. Create a new column utility_space that is 1/5 of the above ground living space (gr_liv_area).

  2. You will get fractional output with step #1. See if you can figure out how to round this output to the nearest integer.

  3. Now remove this column from your DataFrame

Video 🎥:

Overwriting columns#

What if we discovered a systematic error in our data? Perhaps we find out that the “lot_area” column is not entirely accurate because the recording process includes an extra 50 square feet for every property. We could create a new column, “real_lot_area” but we’re not going to need the original “lot_area” column, and leaving it could cause confusion for others looking at our data.

A better solution would be to replace the original column with the new, recalculated, values. We can do so using the same syntax as for creating a new column.

ames.head()
pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities ... pool_area pool_qc fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice
0 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

5 rows × 81 columns

# Subtract 50 from lot area, and then overwrite the original data.
ames['lot_area'] = ames['lot_area'] - 50
ames.head()
pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities ... pool_area pool_qc fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice
0 526301100 20 RL 141.0 31720 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 526350040 20 RH 80.0 11572 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 526351010 20 RL 81.0 14217 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 526353030 20 RL 93.0 11110 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 527105010 60 RL 74.0 13780 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

5 rows × 81 columns

Calculating based on multiple columns#

So far we’ve only seen vector-scalar math. But vector-vector math is supported as well. Let’s look at a toy example of creating a column that contains the price per square foot.

price_per_sqft = ames['saleprice'] / ames['gr_liv_area']
price_per_sqft.head()
0    129.830918
1    117.187500
2    129.420617
3    115.639810
4    116.574586
dtype: float64
ames['price_per_sqft'] = price_per_sqft
ames.head()
pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities ... pool_qc fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice price_per_sqft
0 526301100 20 RL 141.0 31720 Pave NaN IR1 Lvl AllPub ... NaN NaN NaN 0 5 2010 WD Normal 215000 129.830918
1 526350040 20 RH 80.0 11572 Pave NaN Reg Lvl AllPub ... NaN MnPrv NaN 0 6 2010 WD Normal 105000 117.187500
2 526351010 20 RL 81.0 14217 Pave NaN IR1 Lvl AllPub ... NaN NaN Gar2 12500 6 2010 WD Normal 172000 129.420617
3 526353030 20 RL 93.0 11110 Pave NaN Reg Lvl AllPub ... NaN NaN NaN 0 4 2010 WD Normal 244000 115.639810
4 527105010 60 RL 74.0 13780 Pave NaN IR1 Lvl AllPub ... NaN MnPrv NaN 0 3 2010 WD Normal 189900 116.574586

5 rows × 82 columns

You can combine vector-vector and vector-scalar calculations in arbitrarily complex ways.

ames['nonsense'] = (ames['yr_sold'] + 12) * ames['gr_liv_area'] + ames['lot_area'] - 50
ames.head()
pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities ... fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice price_per_sqft nonsense
0 526301100 20 RL 141.0 31720 Pave NaN IR1 Lvl AllPub ... NaN NaN 0 5 2010 WD Normal 215000 129.830918 3380102
1 526350040 20 RH 80.0 11572 Pave NaN Reg Lvl AllPub ... MnPrv NaN 0 6 2010 WD Normal 105000 117.187500 1823234
2 526351010 20 RL 81.0 14217 Pave NaN IR1 Lvl AllPub ... NaN Gar2 12500 6 2010 WD Normal 172000 129.420617 2701405
3 526353030 20 RL 93.0 11110 Pave NaN Reg Lvl AllPub ... NaN NaN 0 4 2010 WD Normal 244000 115.639810 4277480
4 527105010 60 RL 74.0 13780 Pave NaN IR1 Lvl AllPub ... MnPrv NaN 0 3 2010 WD Normal 189900 116.574586 3307568

5 rows × 83 columns

Knowledge check#

Question:

Create a new column price_per_total_sqft that is saleprice divided by the sum of gr_liv_area, total_bsmt_sf, wood_deck_sf, open_porch_sf.

Video 🎥:

Non-numeric column operations#

For simplicity, we started with mathematical operations. However, pandas supports string operations as well. We can use + to concatenate strings, with both vectors and scalars.

'Home in ' + ames['neighborhood'] + ' neighborhood sold under ' + ames['sale_condition'] + ' condition'
0       Home in NAmes neighborhood sold under Normal c...
1       Home in NAmes neighborhood sold under Normal c...
2       Home in NAmes neighborhood sold under Normal c...
3       Home in NAmes neighborhood sold under Normal c...
4       Home in Gilbert neighborhood sold under Normal...
                              ...                        
2925    Home in Mitchel neighborhood sold under Normal...
2926    Home in Mitchel neighborhood sold under Normal...
2927    Home in Mitchel neighborhood sold under Normal...
2928    Home in Mitchel neighborhood sold under Normal...
2929    Home in Mitchel neighborhood sold under Normal...
Length: 2930, dtype: object

More complex string operations are possible using methods available through the .str accessor.

Tip

We won’t cover them in detail, so refer to the documentation if you’re interested. But realize that we can do many different manipulations with string columns and its worth taking time to familiarize yourself with Pandas string capabilities.

# number of characters in string
ames['neighborhood'].str.len()
0       5
1       5
2       5
3       5
4       7
       ..
2925    7
2926    7
2927    7
2928    7
2929    7
Name: neighborhood, Length: 2930, dtype: int64
ames['garage_type'].str.lower().str.replace('tchd', 'tached')
0       attached
1       attached
2       attached
3       attached
4       attached
          ...   
2925    detached
2926    attached
2927         NaN
2928    attached
2929    attached
Name: garage_type, Length: 2930, dtype: object

More Complex Column Manipulation#

Replacing Values#

One fairly common situation in data wrangling is needing to convert one set of values to another, where there is a one-to-one correspondence between the values currently in the column and the new values that should replace them. This operation can be described as “mapping one set of values to another”.

Let’s look at an example of this. In our Ames data the month sold is represented numerically:

ames['mo_sold'].head()
0    5
1    6
2    6
3    4
4    3
Name: mo_sold, dtype: int64

Suppose we want to change this so that values are represented by the month name:

  • 1 = ‘Jan’

  • 2 = ‘Feb’

  • 12 = ‘Dec’

We can express this mapping of old values to new values using a Python dictionary.

# Only specify the values we want to replace; don't include the ones that should stay the same.
value_mapping = {
    1: 'Jan',
    2: 'Feb',
    3: 'Mar',
    4: 'Apr',
    5: 'May',
    6: 'Jun',
    7: 'Jul',
    8: 'Aug',
    9: 'Sep',
    10: 'Oct',
    11: 'Nov',
    12: 'Dec'
    }

Pandas provides a handy method on Series, .replace, that accepts this value mapping and updates the Series accordingly. We can use it to recode our values.

ames['mo_sold'].replace(value_mapping).head()
0    May
1    Jun
2    Jun
3    Apr
4    Mar
Name: mo_sold, dtype: object

If you are a SQL user, this workflow may look familiar to you; it’s quite similar to a CASE WHEN statement in SQL.

Missing values#

Missing values are typically denoted with NaN. We can use df.isnull() to find missing values in a dataframe. It returns a boolean for each element in the dataframe:

ames.isnull()
pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities ... fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice price_per_sqft nonsense
0 False False False False False False True False False False ... True True False False False False False False False False
1 False False False False False False True False False False ... False True False False False False False False False False
2 False False False False False False True False False False ... True False False False False False False False False False
3 False False False False False False True False False False ... True True False False False False False False False False
4 False False False False False False True False False False ... False True False False False False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2925 False False False False False False True False False False ... False True False False False False False False False False
2926 False False False True False False True False False False ... False True False False False False False False False False
2927 False False False False False False True False False False ... False False False False False False False False False False
2928 False False False False False False True False False False ... True True False False False False False False False False
2929 False False False False False False True False False False ... True True False False False False False False False False

2930 rows × 83 columns

We can use this to easily compute the total number of missing values in each column:

ames.isnull().sum()
pid                 0
ms_subclass         0
ms_zoning           0
lot_frontage      490
lot_area            0
                 ... 
sale_type           0
sale_condition      0
saleprice           0
price_per_sqft      0
nonsense            0
Length: 83, dtype: int64

Recall we also get this information with .info(). Actually, we get the inverse as .info() tells us how many non-null values exist in each column.

ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 83 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pid              2930 non-null   int64  
 1   ms_subclass      2930 non-null   int64  
 2   ms_zoning        2930 non-null   object 
 3   lot_frontage     2440 non-null   float64
 4   lot_area         2930 non-null   int64  
 5   street           2930 non-null   object 
 6   alley            198 non-null    object 
 7   lot_shape        2930 non-null   object 
 8   land_contour     2930 non-null   object 
 9   utilities        2930 non-null   object 
 10  lot_config       2930 non-null   object 
 11  land_slope       2930 non-null   object 
 12  neighborhood     2930 non-null   object 
 13  condition_1      2930 non-null   object 
 14  condition_2      2930 non-null   object 
 15  bldg_type        2930 non-null   object 
 16  house_style      2930 non-null   object 
 17  overall_qual     2930 non-null   int64  
 18  overall_cond     2930 non-null   int64  
 19  year_built       2930 non-null   int64  
 20  year_remod/add   2930 non-null   int64  
 21  roof_style       2930 non-null   object 
 22  roof_matl        2930 non-null   object 
 23  exterior_1st     2930 non-null   object 
 24  exterior_2nd     2930 non-null   object 
 25  mas_vnr_type     1155 non-null   object 
 26  mas_vnr_area     2907 non-null   float64
 27  exter_qual       2930 non-null   object 
 28  exter_cond       2930 non-null   object 
 29  foundation       2930 non-null   object 
 30  bsmt_qual        2850 non-null   object 
 31  bsmt_cond        2850 non-null   object 
 32  bsmt_exposure    2847 non-null   object 
 33  bsmtfin_type_1   2850 non-null   object 
 34  bsmtfin_sf_1     2929 non-null   float64
 35  bsmtfin_type_2   2849 non-null   object 
 36  bsmtfin_sf_2     2929 non-null   float64
 37  bsmt_unf_sf      2929 non-null   float64
 38  total_bsmt_sf    2929 non-null   float64
 39  heating          2930 non-null   object 
 40  heating_qc       2930 non-null   object 
 41  central_air      2930 non-null   object 
 42  electrical       2929 non-null   object 
 43  1st_flr_sf       2930 non-null   int64  
 44  2nd_flr_sf       2930 non-null   int64  
 45  low_qual_fin_sf  2930 non-null   int64  
 46  gr_liv_area      2930 non-null   int64  
 47  bsmt_full_bath   2928 non-null   float64
 48  bsmt_half_bath   2928 non-null   float64
 49  full_bath        2930 non-null   int64  
 50  half_bath        2930 non-null   int64  
 51  bedroom_abvgr    2930 non-null   int64  
 52  kitchen_abvgr    2930 non-null   int64  
 53  kitchen_qual     2930 non-null   object 
 54  totrms_abvgrd    2930 non-null   int64  
 55  functional       2930 non-null   object 
 56  fireplaces       2930 non-null   int64  
 57  fireplace_qu     1508 non-null   object 
 58  garage_type      2773 non-null   object 
 59  garage_yr_blt    2771 non-null   float64
 60  garage_finish    2771 non-null   object 
 61  garage_cars      2929 non-null   float64
 62  garage_area      2929 non-null   float64
 63  garage_qual      2771 non-null   object 
 64  garage_cond      2771 non-null   object 
 65  paved_drive      2930 non-null   object 
 66  wood_deck_sf     2930 non-null   int64  
 67  open_porch_sf    2930 non-null   int64  
 68  enclosed_porch   2930 non-null   int64  
 69  3ssn_porch       2930 non-null   int64  
 70  screen_porch     2930 non-null   int64  
 71  pool_area        2930 non-null   int64  
 72  pool_qc          13 non-null     object 
 73  fence            572 non-null    object 
 74  misc_feature     106 non-null    object 
 75  misc_val         2930 non-null   int64  
 76  mo_sold          2930 non-null   int64  
 77  yr_sold          2930 non-null   int64  
 78  sale_type        2930 non-null   object 
 79  sale_condition   2930 non-null   object 
 80  saleprice        2930 non-null   int64  
 81  price_per_sqft   2930 non-null   float64
 82  nonsense         2930 non-null   int64  
dtypes: float64(12), int64(28), object(43)
memory usage: 1.9+ MB

We can use any() to identify which columns have missing values. We can use this information for various reasons such as subsetting for just those columns that have missing values.

missing = ames.isnull().any() # identify if missing values exist in each column
ames[missing[missing].index]  # subset for just those columns that have missing values
lot_frontage alley mas_vnr_type mas_vnr_area bsmt_qual bsmt_cond bsmt_exposure bsmtfin_type_1 bsmtfin_sf_1 bsmtfin_type_2 ... garage_type garage_yr_blt garage_finish garage_cars garage_area garage_qual garage_cond pool_qc fence misc_feature
0 141.0 NaN Stone 112.0 TA Gd Gd BLQ 639.0 Unf ... Attchd 1960.0 Fin 2.0 528.0 TA TA NaN NaN NaN
1 80.0 NaN NaN 0.0 TA TA No Rec 468.0 LwQ ... Attchd 1961.0 Unf 1.0 730.0 TA TA NaN MnPrv NaN
2 81.0 NaN BrkFace 108.0 TA TA No ALQ 923.0 Unf ... Attchd 1958.0 Unf 1.0 312.0 TA TA NaN NaN Gar2
3 93.0 NaN NaN 0.0 TA TA No ALQ 1065.0 Unf ... Attchd 1968.0 Fin 2.0 522.0 TA TA NaN NaN NaN
4 74.0 NaN NaN 0.0 Gd TA No GLQ 791.0 Unf ... Attchd 1997.0 Fin 2.0 482.0 TA TA NaN MnPrv NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2925 37.0 NaN NaN 0.0 TA TA Av GLQ 819.0 Unf ... Detchd 1984.0 Unf 2.0 588.0 TA TA NaN GdPrv NaN
2926 NaN NaN NaN 0.0 Gd TA Av BLQ 301.0 ALQ ... Attchd 1983.0 Unf 2.0 484.0 TA TA NaN MnPrv NaN
2927 62.0 NaN NaN 0.0 Gd TA Av GLQ 337.0 Unf ... NaN NaN NaN 0.0 0.0 NaN NaN NaN MnPrv Shed
2928 77.0 NaN NaN 0.0 Gd TA Av ALQ 1071.0 LwQ ... Attchd 1975.0 RFn 2.0 418.0 TA TA NaN NaN NaN
2929 74.0 NaN BrkFace 94.0 Gd TA Av LwQ 758.0 Unf ... Attchd 1993.0 Fin 3.0 650.0 TA TA NaN NaN NaN

2930 rows × 27 columns

When you have missing values, we usually either drop them or impute them.You can drop missing values with .dropna():

ames.dropna()
pid ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape land_contour utilities ... fence misc_feature misc_val mo_sold yr_sold sale_type sale_condition saleprice price_per_sqft nonsense

0 rows × 83 columns

Whoa! What just happened? Well, this data set actually has a missing value in every single row. .dropna() drops every row that contains a missing value so we end up dropping all observations. Consequently, we probably want to figure out what’s going on with these missing values and isolate the column causing the problem and imputing the values if possible.

Tip

Another “drop” method is .drop_duplcates() which will drop duplicated rows in your DataFrame.

Sometimes visualizations help identify patterns in missing values. One thing I often do is print a heatmap of my dataframe to get a feel for where my missing values are. We’ll get into data visualization in future lessons but for now here is an example using the searborn library. We can see that several variables have a lot of missing values (alley, fireplace_qu, pool_qc, fence, misc_feature).

import seaborn as sns
sns.set(rc={'figure.figsize':(12, 8)})
ames_missing = ames[missing[missing].index]
sns.heatmap(ames_missing.isnull(), cmap='viridis', cbar=False);
../_images/d61f8a98e429661b09b4840ec21d62d92677581c5bd73ac1085e4e312e030dd3.png

Since we can’t drop all missing values in this data set (since it leaves us with no rows), we need to impute (“fill”) them in. There are several approaches we can use to do this; one of which uses the .fillna() method. This method has various options for filling, you can use a fixed value, the mean of the column, the previous non-nan value, etc:

import numpy as np

# example DataFrame with missing values
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                   [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5],
                   [np.nan, 3, np.nan, 4]],
                  columns=list('ABCD'))
df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
df.fillna(0)  # fill with 0
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
df.fillna(df.mean())  # fill with the mean
A B C D
0 3.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 3.0 NaN 5
3 3.0 3.0 NaN 4
df.bfill()  # backward (upwards) fill from non-nan values
A B C D
0 3.0 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN 3.0 NaN 5
3 NaN 3.0 NaN 4
df.ffill()  # forward (downward) fill from non-nan values
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4

Applying custom functions#

There will be times when you want to apply a function that is not built-in to Pandas. For this, we have methods:

  • df.apply(), applies a function column-wise or row-wise across a dataframe (the function must be able to accept/return an array)

  • df.applymap(), applies a function element-wise (for functions that accept/return single values at a time)

  • series.apply()/series.map(), same as above but for Pandas series

For example, say you had the following custom function that defines if a home is considered a luxery home simply based on the price sold.

Note

Don’t worry, you’ll learn more about writing your own functions in future lessons!

def is_luxery_home(x):
    if x > 500000:
        return 'Luxery'
    else:
        return 'Non-luxery'

ames['saleprice'].apply(is_luxery_home)
0       Non-luxery
1       Non-luxery
2       Non-luxery
3       Non-luxery
4       Non-luxery
           ...    
2925    Non-luxery
2926    Non-luxery
2927    Non-luxery
2928    Non-luxery
2929    Non-luxery
Name: saleprice, Length: 2930, dtype: object

This may have been better as a lambda function, which is just a shorter approach to writing functions. This may be a bit confusing but we’ll talk more about lambda functions in the writing functions lesson. For now, just think of it as being able to write a function for single use application on the fly.

ames['saleprice'].apply(lambda x: 'Luxery' if x > 500000 else 'Non-luxery')
0       Non-luxery
1       Non-luxery
2       Non-luxery
3       Non-luxery
4       Non-luxery
           ...    
2925    Non-luxery
2926    Non-luxery
2927    Non-luxery
2928    Non-luxery
2929    Non-luxery
Name: saleprice, Length: 2930, dtype: object

You can even use functions that require additional arguments. Just specify the arguments in .apply():

def is_luxery_home(x, price):
    if x > price:
        return 'Luxery'
    else:
        return 'Non-luxery'

ames['saleprice'].apply(is_luxery_home, price=200000)
0           Luxery
1       Non-luxery
2       Non-luxery
3           Luxery
4       Non-luxery
           ...    
2925    Non-luxery
2926    Non-luxery
2927    Non-luxery
2928    Non-luxery
2929    Non-luxery
Name: saleprice, Length: 2930, dtype: object

Sometimes we may have a function that we want to apply to every element across multiple columns. For example, say we wanted to convert several of the square footage variables to be represented as square meters. For this we can use the .applymap() method.

def convert_to_sq_meters(x):
    return x*0.092903

ames[['gr_liv_area', 'garage_area', 'lot_area']].map(convert_to_sq_meters)
gr_liv_area garage_area lot_area
0 153.847368 49.052784 2946.883160
1 83.241088 67.819190 1075.073516
2 123.468087 28.985736 1320.801951
3 196.025330 48.495366 1032.152330
4 151.338987 44.779246 1280.203340
... ... ... ...
2925 93.181709 54.626964 732.725961
2926 83.798506 44.965052 820.798005
2927 90.115910 0.000000 965.355073
2928 129.042267 38.833454 925.313880
2929 185.806000 60.386950 889.732031

2930 rows × 3 columns

Exercises#

Questions:

  1. Import the heart.csv dataset.

  2. Are there any missing values in this data? If so, which columns? For these columns, fill the missing values with the value that appears most often (aka “mode”). This is a multi-step process and it would be worth reviewing the .fillna() docs.

  3. Create a new column called risk that is equal to \( \frac{age}{\text{rest_bp} + chol + \text{max_hr}} \)

  4. Replace the values in the rest_ecg column so that:

    • normal = normal

    • left ventricular hypertrophy = lvh

    • ST-T wave abnormality = stt_wav_abn

Computing environment#

Hide code cell source
%load_ext watermark
%watermark -v -p jupyterlab,pandas,numpy,seaborn
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.26.0

jupyterlab: 4.2.3
pandas    : 2.2.2
numpy     : 2.0.0
seaborn   : 0.13.2