Lesson 3c: Summarizing data#

What we have is a data glut.

- Vernor Vinge, Professor Emeritus of Mathematics, San Diego State University

Computing summary statistics across columns and various groupings within your dataset is a fundamental task in creating insights from your data. This lesson will focus on how we can compute various aggregations with Pandas Series and DataFrames.

Learning objectives#

By the end of this lesson you will be able to:

  • Compute summary statistics across an entire Series

  • Compute summary statistics across one or more columns in a DataFrame

  • Compute grouped-level summary statistics across one or more columns in a DataFrame

Simple aggregation#

In the previous lesson we learned how to manipulate data across one or more variables within each row:

series-plus-series.png

Note

We return the same number of elements that we started with. This is known as a window function, but you can also think of it as summarizing at the row-level.

We could achieve this result with the following code:

DataFrame['A'] + DataFrame['B']

We subset the two Series and then add them together using the + operator to achieve the sum. Note that we could also use some other operation on DataFrame['B'] as long as it returns the same number of elements.

However, sometimes we want to work with data across rows within a variable – that is, collapse the many values of a variable into a single summary value.

aggregate-series.png

Note

We return a single value representing some aggregation of the elements we started with. This is known as a summary function, but you can think of it as aggregating values across rows.

Summarizing a series#

The easiest way to summarize a specific Series is to use bracket subsetting notation together with the built-in Series methods (e.g. Series.sum()):

import pandas as pd

ames = pd.read_csv('../data/ames_raw.csv')

ames['SalePrice'].sum()
np.int64(529732456)

Note that a single value was returned because this is a summary operation – we are summing the SalePrice variable across all rows.

There are other summary methods with a series:

ames['SalePrice'].mean()
np.float64(180796.0600682594)
ames['SalePrice'].median()
np.float64(160000.0)
ames['SalePrice'].std()
np.float64(79886.692356665)
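Series offer many more built-in numeric summaries beyond mean(), median(), and std(). A few examples on a toy Series (the values here are illustrative, not from the Ames data):

```python
import pandas as pd

# Toy Series standing in for a column like ames['SalePrice']
prices = pd.Series([100, 150, 200, 250, 300])

prices.min()          # smallest value -> 100
prices.max()          # largest value -> 300
prices.quantile(0.5)  # 50th percentile (same as the median) -> 200.0
prices.var()          # sample variance -> 6250.0
```

Each of these follows the same pattern: subset the Series, call the method, get back a single summarized value.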

All of the above methods work on quantitative variables but not on string variables. However, there are summary methods that will work on all types of variables:

# Number of unique values in the neighborhood variable
ames['Neighborhood'].nunique()
28
# Most frequent value observed in the neighborhood variable
ames['Neighborhood'].mode()
0    NAmes
Name: Neighborhood, dtype: object

Knowledge check#

Questions:

  1. What is the difference between a window operation and a summary operation?

  2. What is the mean, median, and standard deviation of the above ground square footage (Gr Liv Area variable)?

  3. Find the count of each value observed in the Neighborhood column. This may take a Google search. Would you consider the output a summary operation?

Video 🎥:

Describe method#

There is also a method describe() that provides a lot of this summary information – this is especially useful in initial exploratory data analysis.

ames['SalePrice'].describe()
count      2930.000000
mean     180796.060068
std       79886.692357
min       12789.000000
25%      129500.000000
50%      160000.000000
75%      213500.000000
max      755000.000000
Name: SalePrice, dtype: float64

Note

The describe() method will return different results depending on the type of the Series.

ames['Neighborhood'].describe()
count      2930
unique       28
top       NAmes
freq        443
Name: Neighborhood, dtype: object
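The describe() method also accepts a percentiles parameter if you want quantiles other than the default quartiles. A quick sketch on toy data (illustrative values, not the Ames data):

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50])

# Request the 10th and 90th percentiles; the median (50%) is always included
summary = values.describe(percentiles=[0.1, 0.9])
# The index now contains '10%' and '90%' rows alongside count/mean/std/min/50%/max
```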

Summarizing a DataFrame#

The above methods and operations are nice, but sometimes we want to work with multiple variables rather than just one. Recall how we select variables from a DataFrame:

  • Single-bracket subset notation

  • Pass a list of quoted variable names inside the brackets

ames[['SalePrice', 'Gr Liv Area']]

We can use the same summary methods from the Series on the DataFrame to summarize data:

ames[['SalePrice', 'Gr Liv Area']].mean()
SalePrice      180796.060068
Gr Liv Area      1499.690444
dtype: float64
ames[['SalePrice', 'Gr Liv Area']].median()
SalePrice      160000.0
Gr Liv Area      1442.0
dtype: float64

This returns a pandas.core.series.Series object – the Index is the variable name and the values are the summarized values.
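Because the result is a Series indexed by variable name, individual summary values can be pulled out by label. A small sketch on toy data (illustrative values):

```python
import pandas as pd

df = pd.DataFrame({'SalePrice': [100, 200, 300],
                   'Gr Liv Area': [900, 1100, 1300]})

# Summarizing two columns returns a Series indexed by variable name
means = df[['SalePrice', 'Gr Liv Area']].mean()

# Individual summaries can then be accessed by label
means['SalePrice']    # -> 200.0
means['Gr Liv Area']  # -> 1100.0
```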

The Aggregation method#

While summary methods can be convenient, there are a few drawbacks to using them on DataFrames:

  1. You can only apply one summary method at a time

  2. You have to apply the same summary method to all variables

  3. A Series is returned rather than a DataFrame – this makes it difficult to use the values in our analysis later

In order to get around these problems, the DataFrame has a powerful method .agg():

ames.agg({
    'SalePrice': ['mean']
})
           SalePrice
mean   180796.060068

There are a few things to notice about the agg() method:

  1. A dict is passed to the method with variable names as keys and a list of quoted summaries as values

  2. A DataFrame is returned with variable names as columns and summaries as rows

Tip

The .agg() method is just shorthand for .aggregate().

# I'm feeling quite verbose today!
ames.aggregate({
    'SalePrice': ['mean']
})
           SalePrice
mean   180796.060068
# I don't have that kind of time!
ames.agg({
    'SalePrice': ['mean']
})
           SalePrice
mean   180796.060068

We can extend this to multiple variables by adding elements to the dict:

ames.agg({
    'SalePrice': ['mean'],
    'Gr Liv Area': ['mean']
})
          SalePrice  Gr Liv Area
mean  180796.060068  1499.690444

And because the values of the dict are lists, we can do additional aggregations at the same time:

ames.agg({
    'SalePrice': ['mean', 'median'],
    'Gr Liv Area': ['mean', 'min']
})
            SalePrice  Gr Liv Area
mean    180796.060068  1499.690444
median  160000.000000          NaN
min               NaN   334.000000

Note

Not all variables have to have the same list of summaries. Note how NaN values fill in for those summary statistics not computed on a given variable.
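The lists passed to .agg() aren't limited to quoted method names – callables work too. A minimal sketch on toy data (value_range is a made-up helper for illustration, not part of pandas):

```python
import pandas as pd

df = pd.DataFrame({'SalePrice': [100, 200, 300],
                   'Gr Liv Area': [900, 1100, 1300]})

# A custom summary function: the spread between max and min
def value_range(s):
    return s.max() - s.min()

# Mix built-in names and callables in the same agg() call;
# the callable's name becomes the row label in the result
result = df.agg({'SalePrice': ['mean', value_range],
                 'Gr Liv Area': ['min']})
```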

Knowledge check#

Questions:

  1. Fill in the blanks to compute the average number of rooms above ground (TotRms AbvGrd) and the average number of bedrooms above ground (Bedroom AbvGr). What type of object is returned?

    ames[['______', '______']].______()
    
    
  2. Use the .agg() method to complete the same computation as above. How does the output differ?

  3. Fill in the blanks in the below code to calculate the minimum and maximum year built (Year Built) and the mean and median number of garage stalls (Garage Cars):

    ames.agg({
        '_____': ['min', '_____'],
        '_____': ['_____', 'median']
    })
    

Video 🎥:

Describe method#

While agg() is a powerful method, the describe() method – similar to the Series describe() method – is a great choice during exploratory data analysis:

ames.describe()
Order PID MS SubClass Lot Frontage Lot Area Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area ... Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold SalePrice
count 2930.00000 2.930000e+03 2930.000000 2440.000000 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000 2907.000000 ... 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000 2930.000000
mean 1465.50000 7.144645e+08 57.387372 69.224590 10147.921843 6.094881 5.563140 1971.356314 1984.266553 101.896801 ... 93.751877 47.533447 23.011604 2.592491 16.002048 2.243345 50.635154 6.216041 2007.790444 180796.060068
std 845.96247 1.887308e+08 42.638025 23.365335 7880.017759 1.411026 1.111537 30.245361 20.860286 179.112611 ... 126.361562 67.483400 64.139059 25.141331 56.087370 35.597181 566.344288 2.714492 1.316613 79886.692357
min 1.00000 5.263011e+08 20.000000 21.000000 1300.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 12789.000000
25% 733.25000 5.284770e+08 20.000000 58.000000 7440.250000 5.000000 5.000000 1954.000000 1965.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 4.000000 2007.000000 129500.000000
50% 1465.50000 5.354536e+08 50.000000 68.000000 9436.500000 6.000000 5.000000 1973.000000 1993.000000 0.000000 ... 0.000000 27.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2008.000000 160000.000000
75% 2197.75000 9.071811e+08 70.000000 80.000000 11555.250000 7.000000 6.000000 2001.000000 2004.000000 164.000000 ... 168.000000 70.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2009.000000 213500.000000
max 2930.00000 1.007100e+09 190.000000 313.000000 215245.000000 10.000000 9.000000 2010.000000 2010.000000 1600.000000 ... 1424.000000 742.000000 1012.000000 508.000000 576.000000 800.000000 17000.000000 12.000000 2010.000000 755000.000000

8 rows × 39 columns

Warning

What is missing from the above result?

The string variables are missing! We can make describe() compute on all variable types using the include parameter and passing a list of data types to include:

ames.describe(include = ['int', 'float', 'object'])
Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice
count 2930.00000 2.930000e+03 2930.000000 2930 2440.000000 2930.000000 2930 198 2930 2930 ... 2930.000000 13 572 106 2930.000000 2930.000000 2930.000000 2930 2930 2930.000000
unique NaN NaN NaN 7 NaN NaN 2 2 4 4 ... NaN 4 4 5 NaN NaN NaN 10 6 NaN
top NaN NaN NaN RL NaN NaN Pave Grvl Reg Lvl ... NaN Ex MnPrv Shed NaN NaN NaN WD Normal NaN
freq NaN NaN NaN 2273 NaN NaN 2918 120 1859 2633 ... NaN 4 330 95 NaN NaN NaN 2536 2413 NaN
mean 1465.50000 7.144645e+08 57.387372 NaN 69.224590 10147.921843 NaN NaN NaN NaN ... 2.243345 NaN NaN NaN 50.635154 6.216041 2007.790444 NaN NaN 180796.060068
std 845.96247 1.887308e+08 42.638025 NaN 23.365335 7880.017759 NaN NaN NaN NaN ... 35.597181 NaN NaN NaN 566.344288 2.714492 1.316613 NaN NaN 79886.692357
min 1.00000 5.263011e+08 20.000000 NaN 21.000000 1300.000000 NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 1.000000 2006.000000 NaN NaN 12789.000000
25% 733.25000 5.284770e+08 20.000000 NaN 58.000000 7440.250000 NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 4.000000 2007.000000 NaN NaN 129500.000000
50% 1465.50000 5.354536e+08 50.000000 NaN 68.000000 9436.500000 NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 6.000000 2008.000000 NaN NaN 160000.000000
75% 2197.75000 9.071811e+08 70.000000 NaN 80.000000 11555.250000 NaN NaN NaN NaN ... 0.000000 NaN NaN NaN 0.000000 8.000000 2009.000000 NaN NaN 213500.000000
max 2930.00000 1.007100e+09 190.000000 NaN 313.000000 215245.000000 NaN NaN NaN NaN ... 800.000000 NaN NaN NaN 17000.000000 12.000000 2010.000000 NaN NaN 755000.000000

11 rows × 82 columns

Grouped aggregation#

In the section above, we talked about summary operations in the context of collapsing a DataFrame to a single row. This is not always the case – often we are interested in examining specific groups in our data and we want to perform summary operations for these groups. Thus, we are interested in collapsing to a single row per group. This is known as a grouped aggregation.

For example, in the following illustration we are interested in finding the sum of variable B for each category/value in variable A.

summarizing-by-groups.png

This can be useful when we want to aggregate by category:

  • Maximum temperature by month

  • Total home runs by team

  • Total sales by neighborhood

  • Average number of seats by plane manufacturer

When we summarize by groups, we can use the same aggregation methods we used previously:

  • Summary methods for a specific summary operation: DataFrame.sum()

  • Describe method for a collection of summary operations: DataFrame.describe()

  • Agg method for flexibility in summary operations: DataFrame.agg({'VariableName': ['sum', 'mean']})

The only difference is the need to set the DataFrame group prior to aggregating. We can set the DataFrame group by calling the DataFrame.groupby() method and passing a variable name:

ames_grp = ames.groupby('Neighborhood')
ames_grp
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x108c694f0>

Notice that a DataFrame doesn’t print when it’s grouped. The groupby() method just sets the group – you can see the class has changed:

type(ames_grp)
pandas.core.groupby.generic.DataFrameGroupBy

The groupby object stores a dictionary mapping each group label to its row indices, which we can inspect via the .groups attribute:

ames_grp.groups
{'Blmngtn': [52, 53, 468, 469, 470, 471, 472, 473, 1080, 1081, 1082, 1083, 1084, 1741, 1742, 1743, 1744, 2419, 2420, 2421, 2422, 2423, 2424, 2425, 2426, 2427, 2428, 2429], 'Blueste': [298, 299, 932, 933, 934, 935, 1542, 1543, 2225, 2227], 'BrDale': [29, 30, 31, 402, 403, 404, 405, 406, 407, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1675, 1676, 1677, 1678, 1679, 2364, 2365, 2366, 2367, 2368, 2369, 2370, 2371, 2372], 'BrkSide': [129, 130, 191, 192, 193, 194, 195, 196, 197, 198, 199, 614, 615, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 1219, 1220, 1221, 1222, 1327, 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1901, 1902, 1903, 1904, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2022, 2023, 2024, 2025, 2554, 2555, 2556, 2557, 2670, 2671, 2672, 2673, 2675, 2676, 2677, ...], 'ClearCr': [208, 209, 210, 228, 229, 232, 233, 255, 779, 781, 782, 785, 833, 834, 1374, 1392, 1395, 1398, 1399, 1400, 1401, 1402, 1406, 1429, 1430, 2045, 2071, 2072, 2073, 2077, 2115, 2116, 2117, 2118, 2701, 2725, 2726, 2727, 2730, 2731, 2764, 2765, 2766, 2767], 'CollgCr': [249, 250, 251, 252, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 1423, 1424, 1425, 1426, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, ...], 'Crawfor': [293, 294, 295, 296, 297, 300, 308, 912, 915, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 936, 937, 938, 944, 1523, 1524, 1525, 1526, 1528, 1530, 1531, 1532, 
1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1559, 1560, 1561, 1562, 2196, 2197, 2198, 2199, 2201, 2202, 2203, 2204, 2205, 2206, 2207, 2208, 2209, 2210, 2211, 2212, 2213, 2214, 2215, 2216, 2217, 2218, 2219, 2220, 2221, 2222, 2223, 2224, 2226, 2228, 2229, 2230, 2245, 2246, 2850, 2853, 2854, 2857, 2858, 2859, 2860, 2861, 2862, 2863, 2864, 2865, 2866, 2867, 2868, 2869, 2870, ...], 'Edwards': [234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 287, 762, 763, 764, 765, 766, 767, 783, 784, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 1375, 1403, 1404, 1405, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, ...], 'Gilbert': [4, 5, 9, 10, 11, 12, 13, 16, 18, 51, 54, 55, 56, 57, 58, 344, 345, 346, 347, 348, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 369, 464, 465, 466, 467, 474, 475, 476, 477, 478, 479, 480, 481, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 992, 993, 994, 995, 996, 997, 998, 1003, 1004, 1005, 1006, 1007, 1013, 1015, 1085, 1086, 1087, 1088, 1089, 1090, 1092, 1093, 1094, 1095, 1096, 1097, 1615, 1616, 1617, 1618, 1619, 1621, 1622, 1623, 1624, 1625, 1626, 1627, 1628, 1629, 1630, 1638, 1727, 1728, ...], 'Greens': [106, 107, 575, 1857, 2518, 2519, 2520, 2521], 'GrnHill': [2256, 2892], 'IDOTRR': [205, 206, 301, 302, 303, 304, 305, 306, 307, 726, 727, 754, 755, 758, 759, 760, 939, 940, 941, 942, 943, 945, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1368, 1369, 1370, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 1610, 2030, 2031, 2032, 2033, 2034, 2035, 2036, 2037, 2038, 2039, 2043, 2231, 2232, 2233, 2234, 2235, 2236, 2237, 2238, 2239, 2240, 2241, 2242, 2243, 2244, 2669, 
2690, 2691, 2692, 2693, 2694, 2695, 2696, 2697, 2872, 2873, 2874, 2875, 2876, 2877, 2878, 2879, 2880, 2881, 2882], 'Landmrk': [2788], 'MeadowV': [326, 327, 328, 329, 330, 331, 332, 973, 975, 977, 978, 979, 1593, 1594, 1595, 1596, 1597, 1599, 1600, 2283, 2284, 2285, 2286, 2287, 2288, 2289, 2290, 2908, 2909, 2910, 2913, 2914, 2916, 2917, 2918, 2919, 2920], 'Mitchel': [309, 310, 311, 312, 313, 322, 323, 324, 325, 333, 334, 335, 336, 337, 338, 339, 340, 946, 947, 948, 949, 950, 951, 952, 953, 954, 969, 970, 971, 972, 974, 976, 980, 981, 982, 983, 984, 985, 986, 987, 988, 1563, 1564, 1565, 1566, 1567, 1568, 1569, 1588, 1589, 1590, 1591, 1592, 1598, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608, 1609, 2247, 2248, 2249, 2250, 2251, 2252, 2253, 2254, 2277, 2278, 2279, 2280, 2281, 2282, 2291, 2292, 2293, 2294, 2295, 2296, 2297, 2298, 2299, 2300, 2301, 2302, 2303, 2304, 2885, 2886, 2887, 2888, 2889, 2890, 2903, 2904, 2905, ...], 'NAmes': [0, 1, 2, 3, 23, 24, 25, 26, 27, 28, 117, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 341, 342, 343, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 418, 419, 593, 594, 595, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 616, 617, 618, 619, 620, 621, 622, 623, 624, ...], 'NPkVill': [32, 33, 34, 35, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 1046, 1047, 1048, 1680, 1681, 1682, 2373, 2376, 2377], 'NWAmes': [19, 20, 21, 110, 111, 112, 113, 114, 115, 116, 118, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 387, 388, 389, 390, 391, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 596, 597, 598, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1185, 1186, 1187, 1188, 1189, 1190, 1191, 1197, 1198, 1199, 1200, 1201, 1203, 1643, 1644, 
1645, 1646, 1647, 1648, 1649, 1650, 1651, 1652, 1653, 1654, 1655, 1656, 1657, 1658, 1659, 1660, 1661, 1863, 1864, 1865, 1866, 1867, ...], 'NoRidge': [59, 60, 61, 62, 63, 64, 65, 90, 91, 92, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 564, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109, 1157, 1158, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1832, 1833, 2442, 2443, 2444, 2445, 2446, 2447, 2448, 2449, 2450, 2451, 2452, 2453, 2499, 2500, 2501, 2502, 2503], 'NridgHt': [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 482, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1091, 1684, 1685, 1686, 1687, 1688, 1689, 1690, 1691, 1692, ...], 'OldTown': [158, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 200, 201, 202, 203, 204, 207, 650, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 749, 750, 751, 752, 753, 756, 757, 1254, 1255, 1257, 1258, 1259, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, ...], 'SWISU': [211, 212, 213, 214, 285, 286, 288, 289, 290, 291, 292, 905, 906, 907, 908, 909, 910, 911, 913, 914, 916, 1515, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1527, 1529, 2047, 2080, 2194, 2195, 2200, 2703, 2704, 2842, 2845, 2846, 2847, 2848, 2849, 2851, 2852, 2855, 2856], 'Sawyer': [83, 84, 85, 86, 87, 88, 89, 108, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 
226, 227, 230, 231, 554, 555, 556, 557, 558, 559, 560, 561, 562, 578, 761, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 780, 1149, 1150, 1151, 1152, 1153, 1154, 1156, 1371, 1372, 1373, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1393, 1394, 1396, 1397, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1829, 1830, 1831, 1861, 2044, 2048, 2049, 2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057, ...], 'SawyerW': [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 244, 245, 246, 247, 248, 253, 254, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 832, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1418, 1419, 1420, 1421, 1422, 1427, 1428, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 2089, 2090, 2091, 2092, ...], 'Somerst': [22, 66, 67, 68, 69, 70, 71, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 109, 383, 384, 385, 386, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 565, 566, 567, 568, 569, 570, 571, 572, 573, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174, 1175, 1176, ...], 'StoneBr': [6, 7, 8, 14, 15, 17, 349, 350, 351, 352, 365, 366, 367, 368, 999, 1000, 1001, 1002, 1008, 1009, 1010, 1011, 1012, 1014, 1620, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1639, 1640, 1641, 1642, 2322, 2326, 2327, 2328, 2329, 2330, 2331, 2332, 2333, 2334, 2335, 2336, 2339, 2340, 2341], 'Timber': [314, 315, 316, 317, 318, 319, 320, 321, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 1570, 1571, 1572, 1573, 
1574, 1575, 1576, 1577, 1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587, 2255, 2257, 2258, 2259, 2260, 2261, 2262, 2263, 2264, 2265, 2266, 2267, 2268, 2269, 2270, 2271, 2272, 2273, 2274, 2275, 2276, 2891, 2893, 2894, 2895, 2896, 2897, 2898, 2899, 2900, 2901, 2902], 'Veenker': [563, 574, 576, 577, 1155, 1178, 1179, 1180, 1181, 1182, 1183, 1827, 1828, 1853, 1854, 1855, 1856, 1858, 1859, 1860, 2498, 2516, 2517, 2522]}

We can also access a group using the .get_group() method:

# get the Bloomington neighborhood group
ames_grp.get_group('Blmngtn').head()
Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour ... Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice
52 53 528228285 120 RL 43.0 3203 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 1 2010 WD Normal 160000
53 54 528228440 120 RL 43.0 3182 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 4 2010 WD Normal 192000
468 469 528228290 120 RL 53.0 3684 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 6 2009 WD Normal 174000
469 470 528228295 120 RL 51.0 3635 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 5 2009 WD Normal 175900
470 471 528228435 120 RL 43.0 3182 Pave NaN Reg Lvl ... 0 NaN NaN NaN 0 5 2009 WD Normal 192500

5 rows × 82 columns
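A DataFrameGroupBy object is also iterable – each iteration yields a (group name, sub-DataFrame) pair. A small sketch on toy data (illustrative values, not the Ames data):

```python
import pandas as pd

df = pd.DataFrame({'Neighborhood': ['a', 'a', 'b'],
                   'SalePrice': [100, 200, 300]})

# Iterating a groupby yields (group_name, sub-DataFrame) pairs;
# here we count the rows in each group
sizes = {name: len(group) for name, group in df.groupby('Neighborhood')}
# -> {'a': 2, 'b': 1}
```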

If we then call an aggregation method after our groupby() call, we will see the DataFrame returned with group-level aggregations:

ames.groupby('Neighborhood').agg({'SalePrice': ['mean', 'median']}).head()
                 SalePrice
                      mean    median
Neighborhood
Blmngtn      196661.678571  191500.0
Blueste      143590.000000  130500.0
BrDale       105608.333333  106000.0
BrkSide      124756.250000  126750.0
ClearCr      208662.090909  197500.0

This process always follows this model:

model-for-grouped-aggs.png
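That model (split into groups, apply a summary, combine the results) can be sketched on a toy DataFrame mirroring the earlier illustration – summing variable B within each category of variable A:

```python
import pandas as pd

# Toy data: two categories in A, values to sum in B
df = pd.DataFrame({'A': ['x', 'x', 'y', 'y', 'y'],
                   'B': [1, 2, 3, 4, 5]})

# Set the group, then aggregate: one row per category
grouped = df.groupby('A').agg({'B': ['sum']})
#       B
#     sum
# A
# x     3
# y    12
```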

Groups as index vs. variables#

Note

Notice that the grouped variable becomes the Index in our example!

ames.groupby('Neighborhood').agg({'SalePrice': ['mean', 'median']}).index
Index(['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr',
       'Crawfor', 'Edwards', 'Gilbert', 'Greens', 'GrnHill', 'IDOTRR',
       'Landmrk', 'MeadowV', 'Mitchel', 'NAmes', 'NPkVill', 'NWAmes',
       'NoRidge', 'NridgHt', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW',
       'Somerst', 'StoneBr', 'Timber', 'Veenker'],
      dtype='object', name='Neighborhood')

This is pandas’ default behavior, and grouping by the index is marginally faster – though the difference is typically less than a millisecond. However, you won’t always see people group by the index. Instead of setting the group as the index, we can keep the group as a variable.

Tip

The grouped variable can remain a Series/variable by adding the as_index = False parameter/argument to groupby().

ames.groupby('Neighborhood', as_index=False).agg({'SalePrice': ['mean', 'median']}).head()
  Neighborhood      SalePrice
                         mean    median
0      Blmngtn  196661.678571  191500.0
1      Blueste  143590.000000  130500.0
2       BrDale  105608.333333  106000.0
3      BrkSide  124756.250000  126750.0
4      ClearCr  208662.090909  197500.0
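An equivalent approach you will also see in the wild is to keep the default index-based grouping and call .reset_index() afterwards to promote the group back to a variable. A sketch on toy data (illustrative values):

```python
import pandas as pd

df = pd.DataFrame({'Neighborhood': ['a', 'a', 'b'],
                   'SalePrice': [100, 200, 300]})

# Group by the index (the default), then move the group label
# back into a regular column with reset_index()
out = df.groupby('Neighborhood').agg({'SalePrice': 'mean'}).reset_index()
# Columns: ['Neighborhood', 'SalePrice']
```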

Grouping by multiple variables#

Sometimes we have multiple categories by which we’d like to group. To extend our example, assume we want to find the average sale price by neighborhood AND year sold. We can pass a list of variable names to the groupby() method:

ames.groupby(['Neighborhood', 'Yr Sold'], as_index=False).agg({'SalePrice': 'mean'})
    Neighborhood  Yr Sold      SalePrice
0        Blmngtn     2006  214424.454545
1        Blmngtn     2007  194671.500000
2        Blmngtn     2008  190714.400000
3        Blmngtn     2009  177266.666667
4        Blmngtn     2010  176000.000000
..           ...      ...            ...
125       Timber     2010  224947.625000
126      Veenker     2006  270000.000000
127      Veenker     2007  253577.777778
128      Veenker     2008  225928.571429
129      Veenker     2009  253962.500000

130 rows × 3 columns
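When grouping by multiple variables with the default index behavior, the result carries a MultiIndex; calling .unstack() pivots the inner level into columns, which can make the table easier to scan. A sketch on toy data (illustrative values, not the Ames data):

```python
import pandas as pd

df = pd.DataFrame({'Neighborhood': ['a', 'a', 'b', 'b'],
                   'Yr Sold': [2006, 2007, 2006, 2007],
                   'SalePrice': [100, 200, 300, 400]})

# Two-variable groupby -> MultiIndex result; unstack() pivots
# the inner level (Yr Sold) into columns
wide = df.groupby(['Neighborhood', 'Yr Sold'])['SalePrice'].mean().unstack()
# Rows: 'a', 'b'; columns: 2006, 2007
```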

Knowledge check#

Questions:

  1. How would you convert the following statement into a grouped aggregation syntax: “what is the average above ground square footage of homes based on neighborhood and bedroom count”?

  2. Compute the above statement (variable hints: Gr Liv Area = above ground square footage, Neighborhood = neighborhood, Bedroom AbvGr = bedroom count).

  3. Using the results from #2, find out which neighborhoods have 1-bedroom homes that average more than 1,500 above-ground square feet.

Video 🎥:

Exercises#

Questions:

Using the Ames housing data…

  1. Which neighborhood has the largest median sales price? (hint: check out sort_values())

  2. What is the mean and median sales price based on the number of bedrooms (Bedroom AbvGr)?

  3. Which neighborhood has the largest median sales price for 3-bedroom homes? Which neighborhood has the smallest median sales price for 3-bedroom homes?

  4. Compute the sale price per square foot of above ground living area (Gr Liv Area) for each home. Call this variable price_per_sqft. Now compute the median price_per_sqft per neighborhood. Which neighborhood has the largest median price_per_sqft? Does this differ from the neighborhood identified in #1? What information does this provide you?

Computing environment#

Hide code cell source
%load_ext watermark
%watermark -v -p jupyterlab,pandas
Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.26.0

jupyterlab: 4.2.3
pandas    : 2.2.2