Lesson 3c: Summarizing data

Lesson 3c: Summarizing data#

What we have is a data glut.

- Vernor Vinge, Professor Emeritus of Mathematics, San Diego State University

Computing summary statistics across columns and various groupings within your dataset is a fundamental task in creating insights from your data. This lesson will focus on how we can compute various aggregations with Pandas Series and DataFrames.

Learning objectives#

By the end of this lesson you will be able to:

Compute summary statistics across an entire Series
Compute summary statistics across one or more columns in a DataFrame
Compute grouped-level summary statistics across one or more columns in a DataFrame

Simple aggregation#

In the previous lesson we learned how to manipulate data across one or more variables within the row(s):

Note

We return the same number of elements that we started with. This is known as a window function, but you can also think of it as summarizing at the row-level.

We could achieve this result with the following code:

DataFrame['A'] + DataFrame['B']

We subset the two Series and then add them together using the + operator to achieve the sum. Note that we could also use some other operation on DataFrame['B'] as long as it returns the same number of elements.

However, sometimes we want to work with data across rows within a variable – that is, aggregate/summarize values rowwise rather than columnwise.

Note

We return a single value representing some aggregation of the elements we started with. This is known as a summary function, but you can think of it as aggregating values across rows.

Summarizing a series#

The easiest way to summarize a specific series is by using bracket subsetting notation and the built-in Series methods (i.e. col.sum()):

import pandas as pd

ames = pd.read_csv('../data/ames_raw.csv')

ames['SalePrice'].sum()

np.int64(529732456)

Note that a single value was returned because this is a summary operation – we are summing the SalePrice variable across all rows.

There are other summary methods with a series:

ames['SalePrice'].mean()

np.float64(180796.0600682594)

ames['SalePrice'].median()

np.float64(160000.0)

ames['SalePrice'].std()

np.float64(79886.692356665)

All of the above methods work on quantitative variables but not character variables. However, there are summary methods that will work on all types of variables:

# Number of unique values in the neighborhood variable
ames['Neighborhood'].nunique()

# Most frequent value observed in the neighborhood variable
ames['Neighborhood'].mode()

0    NAmes
Name: Neighborhood, dtype: object

Knowledge check#

Questions:

What is the difference between a window operation and a summary operation?
What is the mean, median, and standard deviation of the above ground square footage (Gr Liv Area variable)?
Find the count of each value observed in the Neighborhood column. This may take a Google search. Would you consider the output a summary operation?

Video 🎥:

Describe method#

There is also a method describe() that provides a lot of this summary information – this is especially useful in initial exploratory data analysis.

ames['SalePrice'].describe()

count      2930.000000
mean     180796.060068
std       79886.692357
min       12789.000000
25%      129500.000000
50%      160000.000000
75%      213500.000000
max      755000.000000
Name: SalePrice, dtype: float64

Note

The describe() method will return different results depending on the type of the Series.

ames['Neighborhood'].describe()

count      2930
unique       28
top       NAmes
freq        443
Name: Neighborhood, dtype: object

Summarizing a DataFrame#

The above methods and operations are nice, but sometimes we want to work with multiple variables rather than just one. Recall how we select variables from a DataFrame:

Single-bracket subset notation
Pass a list of quoted variable names into the list

ames[['SalePrice', 'Gr Liv Area']]

We can use the same summary methods from the Series on the DataFrame to summarize data:

ames[['SalePrice', 'Gr Liv Area']].mean()

SalePrice      180796.060068
Gr Liv Area      1499.690444
dtype: float64

ames[['SalePrice', 'Gr Liv Area']].median()

SalePrice      160000.0
Gr Liv Area      1442.0
dtype: float64

This returns a pandas.core.series.Series object – the Index is the variable name and the values are the summarized values.

The Aggregation method#

While summary methods can be convenient, there are a few drawbacks to using them on DataFrames:

You can only apply one summary method at a time
You have to apply the same summary method to all variables
A Series is returned rather than a DataFrame – this makes it difficult to use the values in our analysis later

In order to get around these problems, the DataFrame has a powerful method .agg():

ames.agg({
    'SalePrice': ['mean']
})

	SalePrice
mean	180796.060068

There are a few things to notice about the agg() method:

A dict is passed to the method with variable names as keys and a list of quoted summaries as values
A DataFrame is returned with variable names as variables and summaries as rows

Tip

The .agg() method is just shorthand for .aggregate().

# I'm feeling quite verbose today!
ames.aggregate({
    'SalePrice': ['mean']
})

	SalePrice
mean	180796.060068

# I don't have that kind of time!
ames.agg({
    'SalePrice': ['mean']
})

	SalePrice
mean	180796.060068

We can extend this to multiple variables by adding elements to the dict:

ames.agg({
    'SalePrice': ['mean'],
    'Gr Liv Area': ['mean']
})

	SalePrice	Gr Liv Area
mean	180796.060068	1499.690444

And because the values of the dict are lists, we can do additional aggregations at the same time:

ames.agg({
    'SalePrice': ['mean', 'median'],
    'Gr Liv Area': ['mean', 'min']
})

	SalePrice	Gr Liv Area
mean	180796.060068	1499.690444
median	160000.000000	NaN
min	NaN	334.000000

Note

Not all variables have to have the same list of summaries. Note how NaN values fill in for those summary statistics not computed on a given variable.

Knowledge check#

Questions:

Fill in the blanks to compute the average number of rooms above ground (TotRms AbvGrd) and the average number of bedrooms above ground (Bedroom AbvGr). What type of object is returned?
```
ames[['______', '______']].______()
```
Use the .agg() method to complete the same computation as above. How does the output differ?
Fill in the blanks in the below code to calculate the minimum and maximum year built (Year Built) and the mean and median number of garage stalls (Garage Cars):
```
ames.agg({
    '_____': ['min', '_____'],
    '_____': ['_____', 'median']
})
```

Video 🎥:

Describe method#

While agg() is a powerful method, the describe() method – similar to the Series describe() method – is a great choice during exploratory data analysis:

ames.describe()

	Order	PID	MS SubClass	Lot Frontage	Lot Area	Overall Qual	Overall Cond	Year Built	Year Remod/Add	Mas Vnr Area	...	Wood Deck SF	Open Porch SF	Enclosed Porch	3Ssn Porch	Screen Porch	Pool Area	Misc Val	Mo Sold	Yr Sold	SalePrice
count	2930.00000	2.930000e+03	2930.000000	2440.000000	2930.000000	2930.000000	2930.000000	2930.000000	2930.000000	2907.000000	...	2930.000000	2930.000000	2930.000000	2930.000000	2930.000000	2930.000000	2930.000000	2930.000000	2930.000000	2930.000000
mean	1465.50000	7.144645e+08	57.387372	69.224590	10147.921843	6.094881	5.563140	1971.356314	1984.266553	101.896801	...	93.751877	47.533447	23.011604	2.592491	16.002048	2.243345	50.635154	6.216041	2007.790444	180796.060068
std	845.96247	1.887308e+08	42.638025	23.365335	7880.017759	1.411026	1.111537	30.245361	20.860286	179.112611	...	126.361562	67.483400	64.139059	25.141331	56.087370	35.597181	566.344288	2.714492	1.316613	79886.692357
min	1.00000	5.263011e+08	20.000000	21.000000	1300.000000	1.000000	1.000000	1872.000000	1950.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	2006.000000	12789.000000
25%	733.25000	5.284770e+08	20.000000	58.000000	7440.250000	5.000000	5.000000	1954.000000	1965.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	4.000000	2007.000000	129500.000000
50%	1465.50000	5.354536e+08	50.000000	68.000000	9436.500000	6.000000	5.000000	1973.000000	1993.000000	0.000000	...	0.000000	27.000000	0.000000	0.000000	0.000000	0.000000	0.000000	6.000000	2008.000000	160000.000000
75%	2197.75000	9.071811e+08	70.000000	80.000000	11555.250000	7.000000	6.000000	2001.000000	2004.000000	164.000000	...	168.000000	70.000000	0.000000	0.000000	0.000000	0.000000	0.000000	8.000000	2009.000000	213500.000000
max	2930.00000	1.007100e+09	190.000000	313.000000	215245.000000	10.000000	9.000000	2010.000000	2010.000000	1600.000000	...	1424.000000	742.000000	1012.000000	508.000000	576.000000	800.000000	17000.000000	12.000000	2010.000000	755000.000000

8 rows × 39 columns

Warning

What is missing from the above result?

The string variables are missing! We can make describe() compute on all variable types using the include parameter and passing a list of data types to include:

ames.describe(include = ['int', 'float', 'object'])

	Order	PID	MS SubClass	MS Zoning	Lot Frontage	Lot Area	Street	Alley	Lot Shape	Land Contour	...	Pool Area	Pool QC	Fence	Misc Feature	Misc Val	Mo Sold	Yr Sold	Sale Type	Sale Condition	SalePrice
count	2930.00000	2.930000e+03	2930.000000	2930	2440.000000	2930.000000	2930	198	2930	2930	...	2930.000000	13	572	106	2930.000000	2930.000000	2930.000000	2930	2930	2930.000000
unique	NaN	NaN	NaN	7	NaN	NaN	2	2	4	4	...	NaN	4	4	5	NaN	NaN	NaN	10	6	NaN
top	NaN	NaN	NaN	RL	NaN	NaN	Pave	Grvl	Reg	Lvl	...	NaN	Ex	MnPrv	Shed	NaN	NaN	NaN	WD	Normal	NaN
freq	NaN	NaN	NaN	2273	NaN	NaN	2918	120	1859	2633	...	NaN	4	330	95	NaN	NaN	NaN	2536	2413	NaN
mean	1465.50000	7.144645e+08	57.387372	NaN	69.224590	10147.921843	NaN	NaN	NaN	NaN	...	2.243345	NaN	NaN	NaN	50.635154	6.216041	2007.790444	NaN	NaN	180796.060068
std	845.96247	1.887308e+08	42.638025	NaN	23.365335	7880.017759	NaN	NaN	NaN	NaN	...	35.597181	NaN	NaN	NaN	566.344288	2.714492	1.316613	NaN	NaN	79886.692357
min	1.00000	5.263011e+08	20.000000	NaN	21.000000	1300.000000	NaN	NaN	NaN	NaN	...	0.000000	NaN	NaN	NaN	0.000000	1.000000	2006.000000	NaN	NaN	12789.000000
25%	733.25000	5.284770e+08	20.000000	NaN	58.000000	7440.250000	NaN	NaN	NaN	NaN	...	0.000000	NaN	NaN	NaN	0.000000	4.000000	2007.000000	NaN	NaN	129500.000000
50%	1465.50000	5.354536e+08	50.000000	NaN	68.000000	9436.500000	NaN	NaN	NaN	NaN	...	0.000000	NaN	NaN	NaN	0.000000	6.000000	2008.000000	NaN	NaN	160000.000000
75%	2197.75000	9.071811e+08	70.000000	NaN	80.000000	11555.250000	NaN	NaN	NaN	NaN	...	0.000000	NaN	NaN	NaN	0.000000	8.000000	2009.000000	NaN	NaN	213500.000000
max	2930.00000	1.007100e+09	190.000000	NaN	313.000000	215245.000000	NaN	NaN	NaN	NaN	...	800.000000	NaN	NaN	NaN	17000.000000	12.000000	2010.000000	NaN	NaN	755000.000000

11 rows × 82 columns

Grouped aggregation#

In the section above, we talked about summary operations in the context of collapsing a DataFrame to a single row. This is not always the case – often we are interested in examining specific groups in our data and we want to perform summary operations for these groups. Thus, we are interested in collapsing to a single row per group. This is known as a grouped aggregation.

For example, in the following illustration we are interested in finding the sum of variable B for each category/value in variable A.

This can be useful when we want to aggregate by category:

Maximum temperature by month
Total home runs by team
Total sales by neighborhood
Average number of seats by plane manufacturer

When we summarize by groups, we can use the same aggregation methods we previously did

Summary methods for a specific summary operation: DataFrame.sum()
Describe method for a collection of summary operations: DataFrame.describe()
Agg method for flexibility in summary operations: DataFrame.agg({'VariableName': ['sum', 'mean']})

The only difference is the need to set the DataFrame group prior to aggregating. We can set the DataFrame group by calling the DataFrame.groupby() method and passing a variable name:

ames_grp = ames.groupby('Neighborhood')
ames_grp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x108c694f0>

Notice that a DataFrame doesn’t print when it’s grouped. The groupby() method is just setting the group - you can see the changed DataFrame class:

type(ames_grp)

pandas.core.groupby.generic.DataFrameGroupBy

The groupby object is really just a dictionary of index-mappings, which we could look at if we wanted to:

ames_grp.groups

{'Blmngtn': [52, 53, 468, 469, 470, 471, 472, 473, 1080, 1081, 1082, 1083, 1084, 1741, 1742, 1743, 1744, 2419, 2420, 2421, 2422, 2423, 2424, 2425, 2426, 2427, 2428, 2429], 'Blueste': [298, 299, 932, 933, 934, 935, 1542, 1543, 2225, 2227], 'BrDale': [29, 30, 31, 402, 403, 404, 405, 406, 407, 1039, 1040, 1041, 1042, 1043, 1044, 1045, 1675, 1676, 1677, 1678, 1679, 2364, 2365, 2366, 2367, 2368, 2369, 2370, 2371, 2372], 'BrkSide': [129, 130, 191, 192, 193, 194, 195, 196, 197, 198, 199, 614, 615, 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 744, 745, 746, 747, 748, 1219, 1220, 1221, 1222, 1327, 1328, 1329, 1330, 1331, 1332, 1333, 1334, 1335, 1336, 1337, 1338, 1339, 1340, 1341, 1342, 1343, 1344, 1345, 1348, 1349, 1350, 1351, 1352, 1353, 1354, 1355, 1901, 1902, 1903, 1904, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2022, 2023, 2024, 2025, 2554, 2555, 2556, 2557, 2670, 2671, 2672, 2673, 2675, 2676, 2677, ...], 'ClearCr': [208, 209, 210, 228, 229, 232, 233, 255, 779, 781, 782, 785, 833, 834, 1374, 1392, 1395, 1398, 1399, 1400, 1401, 1402, 1406, 1429, 1430, 2045, 2071, 2072, 2073, 2077, 2115, 2116, 2117, 2118, 2701, 2725, 2726, 2727, 2730, 2731, 2764, 2765, 2766, 2767], 'CollgCr': [249, 250, 251, 252, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 1423, 1424, 1425, 1426, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, ...], 'Crawfor': [293, 294, 295, 296, 297, 300, 308, 912, 915, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 936, 937, 938, 944, 1523, 1524, 1525, 1526, 1528, 1530, 1531, 1532, 1533, 1534, 1535, 1536, 1537, 1538, 1539, 1540, 1541, 1559, 1560, 1561, 1562, 2196, 2197, 2198, 2199, 2201, 2202, 2203, 2204, 2205, 2206, 2207, 2208, 2209, 2210, 2211, 2212, 2213, 2214, 2215, 2216, 2217, 2218, 2219, 2220, 2221, 2222, 2223, 2224, 2226, 2228, 2229, 2230, 2245, 2246, 2850, 2853, 2854, 2857, 2858, 2859, 2860, 2861, 2862, 2863, 2864, 2865, 2866, 2867, 2868, 2869, 2870, ...], 'Edwards': [234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 287, 762, 763, 764, 765, 766, 767, 783, 784, 786, 787, 788, 789, 790, 791, 792, 793, 794, 795, 796, 797, 798, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 1375, 1403, 1404, 1405, 1407, 1408, 1409, 1410, 1411, 1412, 1413, 1414, 1415, 1416, 1417, 1485, 1486, 1487, 1488, 1489, 1490, 1491, 1492, 1493, 1494, 1495, 1496, 1497, 1498, 1499, 1500, 1501, 1502, 1503, 1504, ...], 'Gilbert': [4, 5, 9, 10, 11, 12, 13, 16, 18, 51, 54, 55, 56, 57, 58, 344, 345, 346, 347, 348, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 369, 464, 465, 466, 467, 474, 475, 476, 477, 478, 479, 480, 481, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 992, 993, 994, 995, 996, 997, 998, 1003, 1004, 1005, 1006, 1007, 1013, 1015, 1085, 1086, 1087, 1088, 1089, 1090, 1092, 1093, 1094, 1095, 1096, 1097, 1615, 1616, 1617, 1618, 1619, 1621, 1622, 1623, 1624, 1625, 1626, 1627, 1628, 1629, 1630, 1638, 1727, 1728, ...], 'Greens': [106, 107, 575, 1857, 2518, 2519, 2520, 2521], 'GrnHill': [2256, 2892], 'IDOTRR': [205, 206, 301, 302, 303, 304, 305, 306, 307, 726, 727, 754, 755, 758, 759, 760, 939, 940, 941, 942, 943, 945, 1359, 1360, 1361, 1362, 1363, 1364, 1365, 1368, 1369, 1370, 1544, 1545, 1546, 1547, 1548, 1549, 1550, 1551, 1552, 1553, 1554, 1555, 1556, 1557, 1558, 1610, 2030, 2031, 2032, 2033, 2034, 2035, 2036, 2037, 2038, 2039, 2043, 2231, 2232, 2233, 2234, 2235, 2236, 2237, 2238, 2239, 2240, 2241, 2242, 2243, 2244, 2669, 2690, 2691, 2692, 2693, 2694, 2695, 2696, 2697, 2872, 2873, 2874, 2875, 2876, 2877, 2878, 2879, 2880, 2881, 2882], 'Landmrk': [2788], 'MeadowV': [326, 327, 328, 329, 330, 331, 332, 973, 975, 977, 978, 979, 1593, 1594, 1595, 1596, 1597, 1599, 1600, 2283, 2284, 2285, 2286, 2287, 2288, 2289, 2290, 2908, 2909, 2910, 2913, 2914, 2916, 2917, 2918, 2919, 2920], 'Mitchel': [309, 310, 311, 312, 313, 322, 323, 324, 325, 333, 334, 335, 336, 337, 338, 339, 340, 946, 947, 948, 949, 950, 951, 952, 953, 954, 969, 970, 971, 972, 974, 976, 980, 981, 982, 983, 984, 985, 986, 987, 988, 1563, 1564, 1565, 1566, 1567, 1568, 1569, 1588, 1589, 1590, 1591, 1592, 1598, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608, 1609, 2247, 2248, 2249, 2250, 2251, 2252, 2253, 2254, 2277, 2278, 2279, 2280, 2281, 2282, 2291, 2292, 2293, 2294, 2295, 2296, 2297, 2298, 2299, 2300, 2301, 2302, 2303, 2304, 2885, 2886, 2887, 2888, 2889, 2890, 2903, 2904, 2905, ...], 'NAmes': [0, 1, 2, 3, 23, 24, 25, 26, 27, 28, 117, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 341, 342, 343, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 418, 419, 593, 594, 595, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610, 611, 612, 613, 616, 617, 618, 619, 620, 621, 622, 623, 624, ...], 'NPkVill': [32, 33, 34, 35, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 1046, 1047, 1048, 1680, 1681, 1682, 2373, 2376, 2377], 'NWAmes': [19, 20, 21, 110, 111, 112, 113, 114, 115, 116, 118, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 387, 388, 389, 390, 391, 579, 580, 581, 582, 583, 584, 585, 586, 587, 588, 589, 590, 591, 592, 596, 597, 598, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026, 1027, 1028, 1029, 1030, 1031, 1032, 1185, 1186, 1187, 1188, 1189, 1190, 1191, 1197, 1198, 1199, 1200, 1201, 1203, 1643, 1644, 1645, 1646, 1647, 1648, 1649, 1650, 1651, 1652, 1653, 1654, 1655, 1656, 1657, 1658, 1659, 1660, 1661, 1863, 1864, 1865, 1866, 1867, ...], 'NoRidge': [59, 60, 61, 62, 63, 64, 65, 90, 91, 92, 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 564, 1098, 1099, 1100, 1101, 1102, 1103, 1104, 1105, 1106, 1107, 1108, 1109, 1157, 1158, 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768, 1769, 1770, 1771, 1772, 1773, 1774, 1832, 1833, 2442, 2443, 2444, 2445, 2446, 2447, 2448, 2449, 2450, 2451, 2452, 2453, 2499, 2500, 2501, 2502, 2503], 'NridgHt': [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 482, 1050, 1051, 1052, 1053, 1054, 1055, 1056, 1057, 1058, 1059, 1060, 1061, 1062, 1063, 1064, 1065, 1066, 1067, 1068, 1069, 1070, 1071, 1072, 1073, 1074, 1075, 1076, 1077, 1078, 1079, 1091, 1684, 1685, 1686, 1687, 1688, 1689, 1690, 1691, 1692, ...], 'OldTown': [158, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 200, 201, 202, 203, 204, 207, 650, 654, 655, 656, 657, 658, 659, 660, 661, 662, 663, 664, 665, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714, 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 749, 750, 751, 752, 753, 756, 757, 1254, 1255, 1257, 1258, 1259, 1283, 1284, 1285, 1286, 1287, 1288, 1289, 1290, 1291, 1292, 1293, ...], 'SWISU': [211, 212, 213, 214, 285, 286, 288, 289, 290, 291, 292, 905, 906, 907, 908, 909, 910, 911, 913, 914, 916, 1515, 1516, 1517, 1518, 1519, 1520, 1521, 1522, 1527, 1529, 2047, 2080, 2194, 2195, 2200, 2703, 2704, 2842, 2845, 2846, 2847, 2848, 2849, 2851, 2852, 2855, 2856], 'Sawyer': [83, 84, 85, 86, 87, 88, 89, 108, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 230, 231, 554, 555, 556, 557, 558, 559, 560, 561, 562, 578, 761, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 780, 1149, 1150, 1151, 1152, 1153, 1154, 1156, 1371, 1372, 1373, 1376, 1377, 1378, 1379, 1380, 1381, 1382, 1383, 1384, 1385, 1386, 1387, 1388, 1389, 1390, 1391, 1393, 1394, 1396, 1397, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1829, 1830, 1831, 1861, 2044, 2048, 2049, 2050, 2051, 2052, 2053, 2054, 2055, 2056, 2057, ...], 'SawyerW': [72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 244, 245, 246, 247, 248, 253, 254, 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 799, 800, 801, 802, 803, 804, 805, 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 832, 1131, 1132, 1133, 1134, 1135, 1136, 1137, 1138, 1139, 1140, 1141, 1142, 1143, 1144, 1145, 1146, 1147, 1148, 1418, 1419, 1420, 1421, 1422, 1427, 1428, 1806, 1807, 1808, 1809, 1810, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 2089, 2090, 2091, 2092, ...], 'Somerst': [22, 66, 67, 68, 69, 70, 71, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 109, 383, 384, 385, 386, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532, 565, 566, 567, 568, 569, 570, 571, 572, 573, 1110, 1111, 1112, 1113, 1114, 1115, 1116, 1117, 1118, 1119, 1120, 1121, 1122, 1123, 1124, 1125, 1126, 1127, 1128, 1129, 1130, 1159, 1160, 1161, 1162, 1163, 1164, 1165, 1166, 1167, 1168, 1169, 1170, 1171, 1172, 1173, 1174, 1175, 1176, ...], 'StoneBr': [6, 7, 8, 14, 15, 17, 349, 350, 351, 352, 365, 366, 367, 368, 999, 1000, 1001, 1002, 1008, 1009, 1010, 1011, 1012, 1014, 1620, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 1639, 1640, 1641, 1642, 2322, 2326, 2327, 2328, 2329, 2330, 2331, 2332, 2333, 2334, 2335, 2336, 2339, 2340, 2341], 'Timber': [314, 315, 316, 317, 318, 319, 320, 321, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 1570, 1571, 1572, 1573, 1574, 1575, 1576, 1577, 1578, 1579, 1580, 1581, 1582, 1583, 1584, 1585, 1586, 1587, 2255, 2257, 2258, 2259, 2260, 2261, 2262, 2263, 2264, 2265, 2266, 2267, 2268, 2269, 2270, 2271, 2272, 2273, 2274, 2275, 2276, 2891, 2893, 2894, 2895, 2896, 2897, 2898, 2899, 2900, 2901, 2902], 'Veenker': [563, 574, 576, 577, 1155, 1178, 1179, 1180, 1181, 1182, 1183, 1827, 1828, 1853, 1854, 1855, 1856, 1858, 1859, 1860, 2498, 2516, 2517, 2522]}

We can also access a group using the .get_group() method:

# get the Bloomington neighborhood group
ames_grp.get_group('Blmngtn').head()

	Order	PID	MS SubClass	MS Zoning	Lot Frontage	Lot Area	Street	Alley	Lot Shape	Land Contour	...	Pool QC	Fence	Misc Feature	Mo Sold	Yr Sold	Sale Type	Sale Condition	SalePrice
52	53	528228285	120	RL	43.0	3203	Pave	NaN	Reg	Lvl	...	NaN	NaN	NaN	1	2010	WD	Normal	160000
53	54	528228440	120	RL	43.0	3182	Pave	NaN	Reg	Lvl	...	NaN	NaN	NaN	4	2010	WD	Normal	192000
468	469	528228290	120	RL	53.0	3684	Pave	NaN	Reg	Lvl	...	NaN	NaN	NaN	6	2009	WD	Normal	174000
469	470	528228295	120	RL	51.0	3635	Pave	NaN	Reg	Lvl	...	NaN	NaN	NaN	5	2009	WD	Normal	175900
470	471	528228435	120	RL	43.0	3182	Pave	NaN	Reg	Lvl	...	NaN	NaN	NaN	5	2009	WD	Normal	192500

5 rows × 82 columns

If we then call an aggregation method after our groupby() call, we will see the DataFrame returned with group-level aggregations:

ames.groupby('Neighborhood').agg({'SalePrice': ['mean', 'median']}).head()

	SalePrice
	mean	median
Neighborhood
Blmngtn	196661.678571	191500.0
Blueste	143590.000000	130500.0
BrDale	105608.333333	106000.0
BrkSide	124756.250000	126750.0
ClearCr	208662.090909	197500.0

This process always follows this model:

Groups as index vs. variables#

Note

Notice that the grouped variable becomes the Index in our example!

ames.groupby('Neighborhood').agg({'SalePrice': ['mean', 'median']}).index

Index(['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr',
       'Crawfor', 'Edwards', 'Gilbert', 'Greens', 'GrnHill', 'IDOTRR',
       'Landmrk', 'MeadowV', 'Mitchel', 'NAmes', 'NPkVill', 'NWAmes',
       'NoRidge', 'NridgHt', 'OldTown', 'SWISU', 'Sawyer', 'SawyerW',
       'Somerst', 'StoneBr', 'Timber', 'Veenker'],
      dtype='object', name='Neighborhood')

This is the default behavior of pandas, and probably how pandas wants to be used. In fact, this is the fastest way to do it, but it’s a matter of less than a millisecond. However, you aren’t always going to see people group by the index. Instead of setting the group as the index, we can set the group as a variable.

Tip

The grouped variable can remain a Series/variable by adding the as_index = False parameter/argument to groupby().

ames.groupby('Neighborhood', as_index=False).agg({'SalePrice': ['mean', 'median']}).head()

	Neighborhood	SalePrice
		mean	median
0	Blmngtn	196661.678571	191500.0
1	Blueste	143590.000000	130500.0
2	BrDale	105608.333333	106000.0
3	BrkSide	124756.250000	126750.0
4	ClearCr	208662.090909	197500.0

Grouping by multiple variables#

Sometimes we have multiple categories by which we’d like to group. To extend our example, assume we want to find the average sale price by neighborhood AND year sold. We can pass a list of variable names to the groupby() method:

ames.groupby(['Neighborhood', 'Yr Sold'], as_index=False).agg({'SalePrice': 'mean'})

	Neighborhood	Yr Sold	SalePrice
0	Blmngtn	2006	214424.454545
1	Blmngtn	2007	194671.500000
2	Blmngtn	2008	190714.400000
3	Blmngtn	2009	177266.666667
4	Blmngtn	2010	176000.000000
...	...	...	...
125	Timber	2010	224947.625000
126	Veenker	2006	270000.000000
127	Veenker	2007	253577.777778
128	Veenker	2008	225928.571429
129	Veenker	2009	253962.500000

130 rows × 3 columns

Knowledge check#

Questions:

How would you convert the following statement into a grouped aggregation syntax: “what is the average above ground square footage of homes based on neighbhorhood and bedroom count”?
Compute the above statement (variable hints: Gr Liv Area = above ground square footage, Neighborhood = neighborhood, Bedroom AbvGr = bedroom count).
Using the results from #2, find out which neighborhoods have 1 bedrooms homes that average more than 1500 above ground square feet.

Video 🎥:

Exercises#

Questions:

Using the Ames housing data…

What neighbhorhood has the largest median sales price? (hint: check out sort_values())
What is the mean and median sales price based on the number of bedrooms (Bedroom AbvGr)?
Which neighbhorhood has the largest median sales price for 3 bedroom homes? Which neighborhood has the smallest median sales price for 3 bedroom homes?
Compute the sales price per square footage (Gr Liv Area) per home. Call this variable price_per_sqft. Now compute the median price_per_sqft per neighborhood. Which neighborhood has the largest median price_per_sqft? Does this differ from the neighborhood identified in #1? What information does this provide you?

Computing environment#

Python implementation: CPython
Python version       : 3.12.4
IPython version      : 8.26.0

jupyterlab: 4.2.3
pandas    : 2.2.2

Lesson 3c: Summarizing data

Contents

Lesson 3c: Summarizing data#

Learning objectives#

Simple aggregation#

Summarizing a series#

Knowledge check#

Describe method#

Summarizing a DataFrame#

The Aggregation method#

Knowledge check#

Describe method#

Grouped aggregation#

Groups as index vs. variables#

Grouping by multiple variables#

Knowledge check#

Exercises#

Computing environment#