{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lession 4a: Tidy data\n", "\n", "[Tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) is about “linking the structure of a dataset with its semantics (its meaning)”. It is defined by:\n", "\n", "1. Each variable forms a column\n", "2. Each observation forms a row\n", "3. Each type of observational unit forms a table\n", "\n", "Often you’ll need to reshape a dataframe to make it tidy (or for some other purpose).\n", "\n", "
\n", "\"tidy-data\"\n", "
\n", "\n", "Source: [R4DS](https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure)\n", "\n", "Once a DataFrame is tidy, it becomes much easier to compute summary statistics, join with other datasets, visualize, apply machine learning models, etc. In this lesson we will focus on ways to reshape DataFrames so that they meet the tidy guidelines.\n", "\n", "```{admonition} Video 🎥:\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning objectives\n", "\n", "By the end of this lesson you will be able to:\n", "\n", "- Reshape data from wide to long\n", "- Reshape data from long to wide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tools for reshaping\n", "\n", "Pandas provides multiple methods that help to reshape DataFrames:\n", "\n", "* `.melt()`: make wide data long.\n", "* `.pivot()`: make long data width.\n", "* `.pivot_table()`: same as `.pivot()` but can handle multiple indexes.\n", "\n", "
\n", "\"melt-pivot\"\n", "
\n", "\n", "Source: [Garrick Aden-Buie](https://github.com/gadenbuie/tidyexplain#spread-and-gather)\n", "\n", "The following will illustrate each of these for their unique purpose." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Melting wide data\n", "\n", "The below data shows how many homes were sold within each neighborhood across each year. Is this considered 'tidy data'? No because we have a variable (year sold) that is represented as the columns and another variable (number of homes sold) filled in as the element values. \n", "\n", "If you wanted to answer questions like: “Does the number of homes sold vary depending on year?” then the below data is not in the appropriate form to answer this question." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
neighborhood20062007200820092010
0Blmngtn11.04.05.06.02.0
1BluesteNaN2.02.04.02.0
2BrDale9.05.07.06.03.0
3BrkSide19.024.031.023.011.0
4ClearCr10.09.011.06.08.0
\n", "
" ], "text/plain": [ " neighborhood 2006 2007 2008 2009 2010\n", "0 Blmngtn 11.0 4.0 5.0 6.0 2.0\n", "1 Blueste NaN 2.0 2.0 4.0 2.0\n", "2 BrDale 9.0 5.0 7.0 6.0 3.0\n", "3 BrkSide 19.0 24.0 31.0 23.0 11.0\n", "4 ClearCr 10.0 9.0 11.0 6.0 8.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "ames_wide = pd.read_csv('../data/ames_wide.csv')\n", "ames_wide.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example we would consider this data \"wide\" and our objective is to convert it into a DataFrame with three variables:\n", "\n", "1. neighborhood\n", "2. year\n", "3. homes_sold\n", "\n", "To do so we'll use the `.melt()` method. `.melt()` arguments include:\n", "\n", "- `id_vars`: Identifier column\n", "- `var_name`: Name to give the new variable represented by the old column headers\n", "- `value_name`: Name to give the new variable represented by the old element values" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
neighborhoodyearhomes_sold
0Blmngtn200611.0
1Blueste2006NaN
2BrDale20069.0
3BrkSide200619.0
4ClearCr200610.0
............
135SawyerW201018.0
136Somerst201021.0
137StoneBr20106.0
138Timber20108.0
139Veenker2010NaN
\n", "

140 rows × 3 columns

\n", "
" ], "text/plain": [ " neighborhood year homes_sold\n", "0 Blmngtn 2006 11.0\n", "1 Blueste 2006 NaN\n", "2 BrDale 2006 9.0\n", "3 BrkSide 2006 19.0\n", "4 ClearCr 2006 10.0\n", ".. ... ... ...\n", "135 SawyerW 2010 18.0\n", "136 Somerst 2010 21.0\n", "137 StoneBr 2010 6.0\n", "138 Timber 2010 8.0\n", "139 Veenker 2010 NaN\n", "\n", "[140 rows x 3 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames_melt = ames_wide.melt(id_vars='neighborhood', var_name='year', value_name='homes_sold')\n", "ames_melt\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `value_vars` argument allows us to select which specific variables we want to “melt” (if you don’t specify `value_vars`, all non-identifier columns will be used). For example, below I’m omitting the 2006 column:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
neighborhoodyearhomes_sold
0Blmngtn20074.0
1Blueste20072.0
2BrDale20075.0
3BrkSide200724.0
4ClearCr20079.0
............
107SawyerW201018.0
108Somerst201021.0
109StoneBr20106.0
110Timber20108.0
111Veenker2010NaN
\n", "

112 rows × 3 columns

\n", "
" ], "text/plain": [ " neighborhood year homes_sold\n", "0 Blmngtn 2007 4.0\n", "1 Blueste 2007 2.0\n", "2 BrDale 2007 5.0\n", "3 BrkSide 2007 24.0\n", "4 ClearCr 2007 9.0\n", ".. ... ... ...\n", "107 SawyerW 2010 18.0\n", "108 Somerst 2010 21.0\n", "109 StoneBr 2010 6.0\n", "110 Timber 2010 8.0\n", "111 Veenker 2010 NaN\n", "\n", "[112 rows x 3 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames_wide.melt(\n", " id_vars='neighborhood',\n", " value_vars=['2007', '2008', '2009', '2010'],\n", " var_name='year',\n", " value_name='homes_sold'\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Knowledge check\n", "\n", "```{admonition} Questions:\n", ":class: attention\n", "Given the following DataFrame, reshape the DataFrame from the current \"wide\" format to a \"longer\" format made up of the following variables:\n", "\n", "- `Name`: will contain the same values in the current `Name` column, \n", "- `Year`: will contain the year values which are currently column names, and \n", "- `Courses`: will contain the values that are currently listed under each year variable.\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({\"Name\": [\"Tom\", \"Mike\", \"Tiffany\", \"Varada\", \"Joel\"],\n", " \"2018\": [1, 3, 4, 5, 3],\n", " \"2019\": [2, 4, 3, 2, 1],\n", " \"2020\": [5, 2, 4, 4, 3]})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Video 🎥:\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Pivoting long data\n", "\n", "Sometimes, you want to make long data wide, which we can do with `.pivot()`. When using `.pivot()` we need to specify the index to pivot on, and the columns that will be used to make the new columns of the wider dataframe. Let's convert our `ames_melt` DataFrame back to the wide format:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year20062007200820092010
neighborhood
Blmngtn11.04.05.06.02.0
BluesteNaN2.02.04.02.0
BrDale9.05.07.06.03.0
BrkSide19.024.031.023.011.0
ClearCr10.09.011.06.08.0
CollgCr60.065.058.063.021.0
Crawfor20.034.021.021.07.0
Edwards43.041.045.042.023.0
Gilbert38.045.026.041.015.0
Greens4.01.0NaN1.02.0
GrnHill1.01.0NaNNaNNaN
IDOTRR20.025.026.013.09.0
Landmrk1.0NaNNaNNaNNaN
MeadowV10.08.07.05.07.0
Mitchel23.028.022.024.017.0
NAmes99.0105.086.095.058.0
NPkVill3.03.03.010.04.0
NWAmes25.030.030.035.011.0
NoRidge17.017.014.013.010.0
NridgHt32.043.031.045.015.0
OldTown51.048.056.055.029.0
SWISU12.05.010.010.011.0
Sawyer38.037.030.023.023.0
SawyerW20.021.025.041.018.0
Somerst29.051.041.040.021.0
StoneBr15.012.010.08.06.0
Timber11.021.018.014.08.0
Veenker4.09.07.04.0NaN
\n", "
" ], "text/plain": [ "year 2006 2007 2008 2009 2010\n", "neighborhood \n", "Blmngtn 11.0 4.0 5.0 6.0 2.0\n", "Blueste NaN 2.0 2.0 4.0 2.0\n", "BrDale 9.0 5.0 7.0 6.0 3.0\n", "BrkSide 19.0 24.0 31.0 23.0 11.0\n", "ClearCr 10.0 9.0 11.0 6.0 8.0\n", "CollgCr 60.0 65.0 58.0 63.0 21.0\n", "Crawfor 20.0 34.0 21.0 21.0 7.0\n", "Edwards 43.0 41.0 45.0 42.0 23.0\n", "Gilbert 38.0 45.0 26.0 41.0 15.0\n", "Greens 4.0 1.0 NaN 1.0 2.0\n", "GrnHill 1.0 1.0 NaN NaN NaN\n", "IDOTRR 20.0 25.0 26.0 13.0 9.0\n", "Landmrk 1.0 NaN NaN NaN NaN\n", "MeadowV 10.0 8.0 7.0 5.0 7.0\n", "Mitchel 23.0 28.0 22.0 24.0 17.0\n", "NAmes 99.0 105.0 86.0 95.0 58.0\n", "NPkVill 3.0 3.0 3.0 10.0 4.0\n", "NWAmes 25.0 30.0 30.0 35.0 11.0\n", "NoRidge 17.0 17.0 14.0 13.0 10.0\n", "NridgHt 32.0 43.0 31.0 45.0 15.0\n", "OldTown 51.0 48.0 56.0 55.0 29.0\n", "SWISU 12.0 5.0 10.0 10.0 11.0\n", "Sawyer 38.0 37.0 30.0 23.0 23.0\n", "SawyerW 20.0 21.0 25.0 41.0 18.0\n", "Somerst 29.0 51.0 41.0 40.0 21.0\n", "StoneBr 15.0 12.0 10.0 8.0 6.0\n", "Timber 11.0 21.0 18.0 14.0 8.0\n", "Veenker 4.0 9.0 7.0 4.0 NaN" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames_pivot = ames_melt.pivot(index='neighborhood', columns='year', values='homes_sold')\n", "ames_pivot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You’ll notice that Pandas set our specified index as the index of the new DataFrame and preserved the label of the columns. We can easily remove these names and reset the index to make our DataFrame look like it originally did:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
neighborhood20062007200820092010
0Blmngtn11.04.05.06.02.0
1BluesteNaN2.02.04.02.0
2BrDale9.05.07.06.03.0
3BrkSide19.024.031.023.011.0
4ClearCr10.09.011.06.08.0
5CollgCr60.065.058.063.021.0
6Crawfor20.034.021.021.07.0
7Edwards43.041.045.042.023.0
8Gilbert38.045.026.041.015.0
9Greens4.01.0NaN1.02.0
10GrnHill1.01.0NaNNaNNaN
11IDOTRR20.025.026.013.09.0
12Landmrk1.0NaNNaNNaNNaN
13MeadowV10.08.07.05.07.0
14Mitchel23.028.022.024.017.0
15NAmes99.0105.086.095.058.0
16NPkVill3.03.03.010.04.0
17NWAmes25.030.030.035.011.0
18NoRidge17.017.014.013.010.0
19NridgHt32.043.031.045.015.0
20OldTown51.048.056.055.029.0
21SWISU12.05.010.010.011.0
22Sawyer38.037.030.023.023.0
23SawyerW20.021.025.041.018.0
24Somerst29.051.041.040.021.0
25StoneBr15.012.010.08.06.0
26Timber11.021.018.014.08.0
27Veenker4.09.07.04.0NaN
\n", "
" ], "text/plain": [ " neighborhood 2006 2007 2008 2009 2010\n", "0 Blmngtn 11.0 4.0 5.0 6.0 2.0\n", "1 Blueste NaN 2.0 2.0 4.0 2.0\n", "2 BrDale 9.0 5.0 7.0 6.0 3.0\n", "3 BrkSide 19.0 24.0 31.0 23.0 11.0\n", "4 ClearCr 10.0 9.0 11.0 6.0 8.0\n", "5 CollgCr 60.0 65.0 58.0 63.0 21.0\n", "6 Crawfor 20.0 34.0 21.0 21.0 7.0\n", "7 Edwards 43.0 41.0 45.0 42.0 23.0\n", "8 Gilbert 38.0 45.0 26.0 41.0 15.0\n", "9 Greens 4.0 1.0 NaN 1.0 2.0\n", "10 GrnHill 1.0 1.0 NaN NaN NaN\n", "11 IDOTRR 20.0 25.0 26.0 13.0 9.0\n", "12 Landmrk 1.0 NaN NaN NaN NaN\n", "13 MeadowV 10.0 8.0 7.0 5.0 7.0\n", "14 Mitchel 23.0 28.0 22.0 24.0 17.0\n", "15 NAmes 99.0 105.0 86.0 95.0 58.0\n", "16 NPkVill 3.0 3.0 3.0 10.0 4.0\n", "17 NWAmes 25.0 30.0 30.0 35.0 11.0\n", "18 NoRidge 17.0 17.0 14.0 13.0 10.0\n", "19 NridgHt 32.0 43.0 31.0 45.0 15.0\n", "20 OldTown 51.0 48.0 56.0 55.0 29.0\n", "21 SWISU 12.0 5.0 10.0 10.0 11.0\n", "22 Sawyer 38.0 37.0 30.0 23.0 23.0\n", "23 SawyerW 20.0 21.0 25.0 41.0 18.0\n", "24 Somerst 29.0 51.0 41.0 40.0 21.0\n", "25 StoneBr 15.0 12.0 10.0 8.0 6.0\n", "26 Timber 11.0 21.0 18.0 14.0 8.0\n", "27 Veenker 4.0 9.0 7.0 4.0 NaN" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames_pivot = ames_pivot.reset_index()\n", "ames_pivot.columns.name = None\n", "ames_pivot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Knowledge check\n", "\n", "```{admonition} Questions:\n", ":class: attention\n", "Given the following DataFrame, reshape the DataFrame from the current \"long\" format to a \"wider\" format made up of the following variables:\n", "\n", "- `Name`: will contain the same values in the current `Name` column, \n", "- `Year`: will contain the year values which are currently column names, and \n", "- `Courses`: will contain the values that are currently listed under each year variable.\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame({\n", " \"Name\": [\"Tom\", \"Mike\", \"Tiffany\", \"Tom\", \"Mike\", \"Tiffany\"],\n", " \"Variable\": [\"Year\", \"Year\", \"Year\", \"Courses\", \"Courses\", \"Courses\"],\n", " \"Value\": [2018, 2018, 2018, 1, 3, 4]\n", "})\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{admonition} Video 🎥:\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pivoting with special needs\n", "\n", "`.pivot()` will often get you what you want, but it won’t work if you want to:\n", "\n", "* Use multiple indexes or\n", "* Have duplicate index/column labels\n", "\n", "For example, let's look at pivoting the below data:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
neighborhoodyear_soldbedroomshomes_sold
0Blmngtn200611
1Blmngtn2006210
2Blmngtn200724
3Blmngtn200825
4Blmngtn200912
...............
430Veenker200736
431Veenker200813
432Veenker200834
433Veenker200923
434Veenker200941
\n", "

435 rows × 4 columns

\n", "
" ], "text/plain": [ " neighborhood year_sold bedrooms homes_sold\n", "0 Blmngtn 2006 1 1\n", "1 Blmngtn 2006 2 10\n", "2 Blmngtn 2007 2 4\n", "3 Blmngtn 2008 2 5\n", "4 Blmngtn 2009 1 2\n", ".. ... ... ... ...\n", "430 Veenker 2007 3 6\n", "431 Veenker 2008 1 3\n", "432 Veenker 2008 3 4\n", "433 Veenker 2009 2 3\n", "434 Veenker 2009 4 1\n", "\n", "[435 rows x 4 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames2 = pd.read_csv('../data/ames_wide2.csv')\n", "ames2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, say you wanted to pivot `ames_wide2` so that the `year_sold` is represented as columns and `homes_sold` values are the elements. If we try to do this similar to the last section's example we get an error stating `ValueError: Index contains duplicate entries, cannot reshape`." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "Index contains duplicate entries, cannot reshape", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/8f/c06lv6q17tjbyjv2nkt0_s4s1sh0tg/T/ipykernel_5256/2062246961.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mames2\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpivot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'neighborhood'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'year_sold'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'homes_sold'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36mpivot\u001b[0;34m(self, index, columns, values)\u001b[0m\n\u001b[1;32m 6877\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mpandas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpivot\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mpivot\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6878\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 6879\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mpivot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcolumns\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvalues\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 6880\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6881\u001b[0m _shared_docs[\n", "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/core/reshape/pivot.py\u001b[0m in \u001b[0;36mpivot\u001b[0;34m(data, index, columns, values)\u001b[0m\n\u001b[1;32m 459\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 460\u001b[0m \u001b[0mindexed\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdata\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_constructor_sliced\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_values\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindex\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 461\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mindexed\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munstack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 462\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 463\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36munstack\u001b[0;34m(self, level, fill_value)\u001b[0m\n\u001b[1;32m 3827\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mpandas\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcore\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0munstack\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3828\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3829\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0munstack\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfill_value\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3830\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3831\u001b[0m \u001b[0;31m# ----------------------------------------------------------------------\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/core/reshape/reshape.py\u001b[0m in \u001b[0;36munstack\u001b[0;34m(obj, level, fill_value)\u001b[0m\n\u001b[1;32m 428\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_extension_array_dtype\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdtype\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 429\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0m_unstack_extension_series\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfill_value\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 430\u001b[0;31m unstacker = _Unstacker(\n\u001b[0m\u001b[1;32m 431\u001b[0m \u001b[0mobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mconstructor\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_constructor_expanddim\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 432\u001b[0m )\n", "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/core/reshape/reshape.py\u001b[0m in \u001b[0;36m__init__\u001b[0;34m(self, index, level, constructor)\u001b[0m\n\u001b[1;32m 116\u001b[0m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Unstacked DataFrame is too big, causing int32 overflow\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 117\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 118\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_make_selectors\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 119\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 120\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0mcache_readonly\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/.pyenv/versions/3.9.4/lib/python3.9/site-packages/pandas/core/reshape/reshape.py\u001b[0m in \u001b[0;36m_make_selectors\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 165\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 166\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m<\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 167\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"Index contains duplicate entries, cannot reshape\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 168\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 169\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroup_index\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcomp_index\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: Index contains duplicate entries, cannot reshape" ] } ], "source": [ "ames2.pivot(index='neighborhood', columns='year_sold', values='homes_sold')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason is we have duplicate values in our neighborhood column and Pandas doesn't know how to isolate the index values to properly align the pivoted data. In such a case, we’d use `.pivot_table()`. It will apply an aggregation function to our duplicates, in this case, we’ll `sum()` them up:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year_sold20062007200820092010
neighborhood
Blmngtn11.04.05.06.02.0
BluesteNaN2.02.04.02.0
BrDale9.05.07.06.03.0
BrkSide19.024.031.023.011.0
ClearCr10.09.011.06.08.0
CollgCr60.065.058.063.021.0
Crawfor20.034.021.021.07.0
Edwards43.041.045.042.023.0
Gilbert38.045.026.041.015.0
Greens4.01.0NaN1.02.0
GrnHill1.01.0NaNNaNNaN
IDOTRR20.025.026.013.09.0
Landmrk1.0NaNNaNNaNNaN
MeadowV10.08.07.05.07.0
Mitchel23.028.022.024.017.0
NAmes99.0105.086.095.058.0
NPkVill3.03.03.010.04.0
NWAmes25.030.030.035.011.0
NoRidge17.017.014.013.010.0
NridgHt32.043.031.045.015.0
OldTown51.048.056.055.029.0
SWISU12.05.010.010.011.0
Sawyer38.037.030.023.023.0
SawyerW20.021.025.041.018.0
Somerst29.051.041.040.021.0
StoneBr15.012.010.08.06.0
Timber11.021.018.014.08.0
Veenker4.09.07.04.0NaN
\n", "
" ], "text/plain": [ "year_sold 2006 2007 2008 2009 2010\n", "neighborhood \n", "Blmngtn 11.0 4.0 5.0 6.0 2.0\n", "Blueste NaN 2.0 2.0 4.0 2.0\n", "BrDale 9.0 5.0 7.0 6.0 3.0\n", "BrkSide 19.0 24.0 31.0 23.0 11.0\n", "ClearCr 10.0 9.0 11.0 6.0 8.0\n", "CollgCr 60.0 65.0 58.0 63.0 21.0\n", "Crawfor 20.0 34.0 21.0 21.0 7.0\n", "Edwards 43.0 41.0 45.0 42.0 23.0\n", "Gilbert 38.0 45.0 26.0 41.0 15.0\n", "Greens 4.0 1.0 NaN 1.0 2.0\n", "GrnHill 1.0 1.0 NaN NaN NaN\n", "IDOTRR 20.0 25.0 26.0 13.0 9.0\n", "Landmrk 1.0 NaN NaN NaN NaN\n", "MeadowV 10.0 8.0 7.0 5.0 7.0\n", "Mitchel 23.0 28.0 22.0 24.0 17.0\n", "NAmes 99.0 105.0 86.0 95.0 58.0\n", "NPkVill 3.0 3.0 3.0 10.0 4.0\n", "NWAmes 25.0 30.0 30.0 35.0 11.0\n", "NoRidge 17.0 17.0 14.0 13.0 10.0\n", "NridgHt 32.0 43.0 31.0 45.0 15.0\n", "OldTown 51.0 48.0 56.0 55.0 29.0\n", "SWISU 12.0 5.0 10.0 10.0 11.0\n", "Sawyer 38.0 37.0 30.0 23.0 23.0\n", "SawyerW 20.0 21.0 25.0 41.0 18.0\n", "Somerst 29.0 51.0 41.0 40.0 21.0\n", "StoneBr 15.0 12.0 10.0 8.0 6.0\n", "Timber 11.0 21.0 18.0 14.0 8.0\n", "Veenker 4.0 9.0 7.0 4.0 NaN" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames2.pivot_table(index='neighborhood', columns='year_sold', values='homes_sold', aggfunc='sum')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wanted to keep the numbers per bedroom, we could specify both `neighborhood` and `bedrooms` as multiple indexes:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
year_sold20062007200820092010
neighborhoodbedrooms
Blmngtn11.0NaNNaN2.0NaN
210.04.05.04.02.0
Blueste1NaN1.0NaNNaN1.0
2NaN1.01.04.01.0
3NaNNaN1.0NaNNaN
.....................
Veenker01.0NaNNaNNaNNaN
1NaN1.03.0NaNNaN
21.02.0NaN3.0NaN
32.06.04.0NaNNaN
4NaNNaNNaN1.0NaN
\n", "

125 rows × 5 columns

\n", "
" ], "text/plain": [ "year_sold 2006 2007 2008 2009 2010\n", "neighborhood bedrooms \n", "Blmngtn 1 1.0 NaN NaN 2.0 NaN\n", " 2 10.0 4.0 5.0 4.0 2.0\n", "Blueste 1 NaN 1.0 NaN NaN 1.0\n", " 2 NaN 1.0 1.0 4.0 1.0\n", " 3 NaN NaN 1.0 NaN NaN\n", "... ... ... ... ... ...\n", "Veenker 0 1.0 NaN NaN NaN NaN\n", " 1 NaN 1.0 3.0 NaN NaN\n", " 2 1.0 2.0 NaN 3.0 NaN\n", " 3 2.0 6.0 4.0 NaN NaN\n", " 4 NaN NaN NaN 1.0 NaN\n", "\n", "[125 rows x 5 columns]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames2.pivot(index=['neighborhood', 'bedrooms'], columns='year_sold', values='homes_sold')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result above is a mutlti-index or “hierarchically indexed” DataFrame, which we haven't really talked about up to this point. However, we can easily flatten this with `.reset_index()` and removing the column's index name." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
neighborhoodbedrooms20062007200820092010
0Blmngtn11.0NaNNaN2.0NaN
1Blmngtn210.04.05.04.02.0
2Blueste1NaN1.0NaNNaN1.0
3Blueste2NaN1.01.04.01.0
4Blueste3NaNNaN1.0NaNNaN
\n", "
" ], "text/plain": [ " neighborhood bedrooms 2006 2007 2008 2009 2010\n", "0 Blmngtn 1 1.0 NaN NaN 2.0 NaN\n", "1 Blmngtn 2 10.0 4.0 5.0 4.0 2.0\n", "2 Blueste 1 NaN 1.0 NaN NaN 1.0\n", "3 Blueste 2 NaN 1.0 1.0 4.0 1.0\n", "4 Blueste 3 NaN NaN 1.0 NaN NaN" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ames2_reshaped = (\n", " ames2\n", " .pivot(index=['neighborhood', 'bedrooms'], columns='year_sold', values='homes_sold')\n", " .reset_index()\n", ")\n", "ames2_reshaped.columns.name = None\n", "ames2_reshaped.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional video\n", "\n", "```{admonition} Video 🎥:\n", "Here's a webinar that provides a thorough discussion around tidy data principles along with illustrating examples of reshaping data with Pandas. It is longer (50 minutes) but is worth a watch if you are still trying to get your arms around the above lesson conceps.\n", "\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercises\n", "\n", "For this exercise, we're going to work with this data set from this paper by [Reeves, et al.](https://www.cell.com/developmental-cell/fulltext/S1534-5807(11)00573-9?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS1534580711005739%3Fshowall%3Dtrue) in which they measured the width of the gradient in the morphogen Dorsal in Drosophila embryos for various genotypes using different method. Don't get hung up in what this means, our object is to simply tidy this dataset." ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
wtdl1/+; dl-venus/+dl1/+; dl-gfp/+
wholemountscross-sectionsanti-Dorsalanti-VenusVenus (live)anti-Dorsalanti-GFPGFP (live)
00.12880.13270.14820.16320.16660.22480.23890.2412
10.15540.14570.15030.16710.17530.18910.20350.1942
20.13060.14470.15770.17040.17050.17050.19430.2186
30.14130.12820.17110.1779NaN0.17350.20000.2104
40.15570.14870.13420.1483NaN0.21350.25600.2463
...........................
147NaN0.1466NaNNaNNaNNaNNaNNaN
148NaN0.1671NaNNaNNaNNaNNaNNaN
149NaN0.1265NaNNaNNaNNaNNaNNaN
150NaN0.1448NaNNaNNaNNaNNaNNaN
151NaN0.1740NaNNaNNaNNaNNaNNaN
\n", "

152 rows × 8 columns

\n", "
" ], "text/plain": [ " wt dl1/+; dl-venus/+ \\\n", " wholemounts cross-sections anti-Dorsal anti-Venus Venus (live) \n", "0 0.1288 0.1327 0.1482 0.1632 0.1666 \n", "1 0.1554 0.1457 0.1503 0.1671 0.1753 \n", "2 0.1306 0.1447 0.1577 0.1704 0.1705 \n", "3 0.1413 0.1282 0.1711 0.1779 NaN \n", "4 0.1557 0.1487 0.1342 0.1483 NaN \n", ".. ... ... ... ... ... \n", "147 NaN 0.1466 NaN NaN NaN \n", "148 NaN 0.1671 NaN NaN NaN \n", "149 NaN 0.1265 NaN NaN NaN \n", "150 NaN 0.1448 NaN NaN NaN \n", "151 NaN 0.1740 NaN NaN NaN \n", "\n", " dl1/+; dl-gfp/+ \n", " anti-Dorsal anti-GFP GFP (live) \n", "0 0.2248 0.2389 0.2412 \n", "1 0.1891 0.2035 0.1942 \n", "2 0.1705 0.1943 0.2186 \n", "3 0.1735 0.2000 0.2104 \n", "4 0.2135 0.2560 0.2463 \n", ".. ... ... ... \n", "147 NaN NaN NaN \n", "148 NaN NaN NaN \n", "149 NaN NaN NaN \n", "150 NaN NaN NaN \n", "151 NaN NaN NaN \n", "\n", "[152 rows x 8 columns]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"../data/reeves_gradient_width_various_methods.csv\", comment='#', header=[0,1])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can happen with spreadsheets, we have a multiindex, where we have three main groups:\n", "\n", "* `wt` which refers to wild type\n", "* `dl1/+; dl-venus/+` which we'll refer to as simply Venus \n", "* `\tdl1/+; dl-gfp/+` which we'll refer to as simply GFP\n", "\n", "For each of these main groups we have multiple sub-columns: two for wild type (`wholemounts`, `cross-sections`), three for Venus (`anti-Dorsal`, `Anti-Venus`, `Venus (live)`), and three for GFP (`anti-Dorsal`, `anti-GFP`, `GFP (live)`). The rows here are the gradient width values recorded for each of the categories. Clearly these data are not tidy.\n", "\n", "For this exercise your objective is to:\n", "\n", "1. Reshape this data so that it looks like the following:" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
genotypemethodgradient width
0wtwholemounts0.1288
1wtwholemounts0.1554
2wtwholemounts0.1306
3wtwholemounts0.1413
4wtwholemounts0.1557
............
1211dl1/+; dl-gfp/+GFP (live)NaN
1212dl1/+; dl-gfp/+GFP (live)NaN
1213dl1/+; dl-gfp/+GFP (live)NaN
1214dl1/+; dl-gfp/+GFP (live)NaN
1215dl1/+; dl-gfp/+GFP (live)NaN
\n", "

1216 rows × 3 columns

\n", "
" ], "text/plain": [ " genotype method gradient width\n", "0 wt wholemounts 0.1288\n", "1 wt wholemounts 0.1554\n", "2 wt wholemounts 0.1306\n", "3 wt wholemounts 0.1413\n", "4 wt wholemounts 0.1557\n", "... ... ... ...\n", "1211 dl1/+; dl-gfp/+ GFP (live) NaN\n", "1212 dl1/+; dl-gfp/+ GFP (live) NaN\n", "1213 dl1/+; dl-gfp/+ GFP (live) NaN\n", "1214 dl1/+; dl-gfp/+ GFP (live) NaN\n", "1215 dl1/+; dl-gfp/+ GFP (live) NaN\n", "\n", "[1216 rows x 3 columns]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "expected_result = pd.read_csv('../data/tidy_reeves_gradients.csv')\n", "expected_result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2\\. Now that you have a tidy data frame you will notice that you have many `NaN`s in the `gradient width` column because there were many of them in the data set. Drop all observations that contain `NaN` values.\n", "\n", "3\\. Now compute summary statistics via `.describe()` for the `gradient width` variable grouped by `genotype` and `method`. Which `genotype` and `method` has the narrowest `gradient width`?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Computing environment" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-input" ] }, "outputs": [], "source": [ "%load_ext watermark\n", "%watermark -v -p jupyterlab,pandas" ] } ], "metadata": { "interpreter": { "hash": "fd8d25807323b6a73e4e6e484dd361ea1af80a43f200f9cf712ca91de587529f" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 4 }