Lession 4a: Tidy data#

Tidy data is about “linking the structure of a dataset with its semantics (its meaning)”. It is defined by:

  1. Each variable forms a column

  2. Each observation forms a row

  3. Each type of observational unit forms a table

Often you’ll need to reshape a dataframe to make it tidy (or for some other purpose).

tidy-data

Source: R4DS

Once a DataFrame is tidy, it becomes much easier to compute summary statistics, join with other datasets, visualize, apply machine learning models, etc. In this lesson we will focus on ways to reshape DataFrames so that they meet the tidy guidelines.

Video 🎥:

Learning objectives#

By the end of this lesson you will be able to:

  • Reshape data from wide to long

  • Reshape data from long to wide

Tools for reshaping#

Pandas provides multiple methods that help to reshape DataFrames:

  • .melt(): make wide data long.

  • .pivot(): make long data width.

  • .pivot_table(): same as .pivot() but can handle multiple indexes.

melt-pivot

Source: Garrick Aden-Buie

The following will illustrate each of these for their unique purpose.

Melting wide data#

The below data shows how many homes were sold within each neighborhood across each year. Is this considered ‘tidy data’? No because we have a variable (year sold) that is represented as the columns and another variable (number of homes sold) filled in as the element values.

If you wanted to answer questions like: “Does the number of homes sold vary depending on year?” then the below data is not in the appropriate form to answer this question.

import pandas as pd

ames_wide = pd.read_csv('../data/ames_wide.csv')
ames_wide.head()
neighborhood 2006 2007 2008 2009 2010
0 Blmngtn 11.0 4.0 5.0 6.0 2.0
1 Blueste NaN 2.0 2.0 4.0 2.0
2 BrDale 9.0 5.0 7.0 6.0 3.0
3 BrkSide 19.0 24.0 31.0 23.0 11.0
4 ClearCr 10.0 9.0 11.0 6.0 8.0

In this example we would consider this data “wide” and our objective is to convert it into a DataFrame with three variables:

  1. neighborhood

  2. year

  3. homes_sold

To do so we’ll use the .melt() method. .melt() arguments include:

  • id_vars: Identifier column

  • var_name: Name to give the new variable represented by the old column headers

  • value_name: Name to give the new variable represented by the old element values

ames_melt = ames_wide.melt(id_vars='neighborhood', var_name='year', value_name='homes_sold')
ames_melt
neighborhood year homes_sold
0 Blmngtn 2006 11.0
1 Blueste 2006 NaN
2 BrDale 2006 9.0
3 BrkSide 2006 19.0
4 ClearCr 2006 10.0
... ... ... ...
135 SawyerW 2010 18.0
136 Somerst 2010 21.0
137 StoneBr 2010 6.0
138 Timber 2010 8.0
139 Veenker 2010 NaN

140 rows × 3 columns

The value_vars argument allows us to select which specific variables we want to “melt” (if you don’t specify value_vars, all non-identifier columns will be used). For example, below I’m omitting the 2006 column:

ames_wide.melt(
    id_vars='neighborhood',
    value_vars=['2007', '2008', '2009', '2010'],
    var_name='year',
    value_name='homes_sold'
    )
neighborhood year homes_sold
0 Blmngtn 2007 4.0
1 Blueste 2007 2.0
2 BrDale 2007 5.0
3 BrkSide 2007 24.0
4 ClearCr 2007 9.0
... ... ... ...
107 SawyerW 2010 18.0
108 Somerst 2010 21.0
109 StoneBr 2010 6.0
110 Timber 2010 8.0
111 Veenker 2010 NaN

112 rows × 3 columns

Knowledge check#

Questions:

Given the following DataFrame, reshape the DataFrame from the current “wide” format to a “longer” format made up of the following variables:

  • Name: will contain the same values in the current Name column,

  • Year: will contain the year values which are currently column names, and

  • Courses: will contain the values that are currently listed under each year variable.

df = pd.DataFrame({"Name": ["Tom", "Mike", "Tiffany", "Varada", "Joel"],
                   "2018": [1, 3, 4, 5, 3],
                   "2019": [2, 4, 3, 2, 1],
                   "2020": [5, 2, 4, 4, 3]})
df
Name 2018 2019 2020
0 Tom 1 2 5
1 Mike 3 4 2
2 Tiffany 4 3 4
3 Varada 5 2 4
4 Joel 3 1 3

Video 🎥:

Pivoting long data#

Sometimes, you want to make long data wide, which we can do with .pivot(). When using .pivot() we need to specify the index to pivot on, and the columns that will be used to make the new columns of the wider dataframe. Let’s convert our ames_melt DataFrame back to the wide format:

ames_pivot = ames_melt.pivot(index='neighborhood', columns='year', values='homes_sold')
ames_pivot
year 2006 2007 2008 2009 2010
neighborhood
Blmngtn 11.0 4.0 5.0 6.0 2.0
Blueste NaN 2.0 2.0 4.0 2.0
BrDale 9.0 5.0 7.0 6.0 3.0
BrkSide 19.0 24.0 31.0 23.0 11.0
ClearCr 10.0 9.0 11.0 6.0 8.0
CollgCr 60.0 65.0 58.0 63.0 21.0
Crawfor 20.0 34.0 21.0 21.0 7.0
Edwards 43.0 41.0 45.0 42.0 23.0
Gilbert 38.0 45.0 26.0 41.0 15.0
Greens 4.0 1.0 NaN 1.0 2.0
GrnHill 1.0 1.0 NaN NaN NaN
IDOTRR 20.0 25.0 26.0 13.0 9.0
Landmrk 1.0 NaN NaN NaN NaN
MeadowV 10.0 8.0 7.0 5.0 7.0
Mitchel 23.0 28.0 22.0 24.0 17.0
NAmes 99.0 105.0 86.0 95.0 58.0
NPkVill 3.0 3.0 3.0 10.0 4.0
NWAmes 25.0 30.0 30.0 35.0 11.0
NoRidge 17.0 17.0 14.0 13.0 10.0
NridgHt 32.0 43.0 31.0 45.0 15.0
OldTown 51.0 48.0 56.0 55.0 29.0
SWISU 12.0 5.0 10.0 10.0 11.0
Sawyer 38.0 37.0 30.0 23.0 23.0
SawyerW 20.0 21.0 25.0 41.0 18.0
Somerst 29.0 51.0 41.0 40.0 21.0
StoneBr 15.0 12.0 10.0 8.0 6.0
Timber 11.0 21.0 18.0 14.0 8.0
Veenker 4.0 9.0 7.0 4.0 NaN

You’ll notice that Pandas set our specified index as the index of the new DataFrame and preserved the label of the columns. We can easily remove these names and reset the index to make our DataFrame look like it originally did:

ames_pivot = ames_pivot.reset_index()
ames_pivot.columns.name = None
ames_pivot
neighborhood 2006 2007 2008 2009 2010
0 Blmngtn 11.0 4.0 5.0 6.0 2.0
1 Blueste NaN 2.0 2.0 4.0 2.0
2 BrDale 9.0 5.0 7.0 6.0 3.0
3 BrkSide 19.0 24.0 31.0 23.0 11.0
4 ClearCr 10.0 9.0 11.0 6.0 8.0
5 CollgCr 60.0 65.0 58.0 63.0 21.0
6 Crawfor 20.0 34.0 21.0 21.0 7.0
7 Edwards 43.0 41.0 45.0 42.0 23.0
8 Gilbert 38.0 45.0 26.0 41.0 15.0
9 Greens 4.0 1.0 NaN 1.0 2.0
10 GrnHill 1.0 1.0 NaN NaN NaN
11 IDOTRR 20.0 25.0 26.0 13.0 9.0
12 Landmrk 1.0 NaN NaN NaN NaN
13 MeadowV 10.0 8.0 7.0 5.0 7.0
14 Mitchel 23.0 28.0 22.0 24.0 17.0
15 NAmes 99.0 105.0 86.0 95.0 58.0
16 NPkVill 3.0 3.0 3.0 10.0 4.0
17 NWAmes 25.0 30.0 30.0 35.0 11.0
18 NoRidge 17.0 17.0 14.0 13.0 10.0
19 NridgHt 32.0 43.0 31.0 45.0 15.0
20 OldTown 51.0 48.0 56.0 55.0 29.0
21 SWISU 12.0 5.0 10.0 10.0 11.0
22 Sawyer 38.0 37.0 30.0 23.0 23.0
23 SawyerW 20.0 21.0 25.0 41.0 18.0
24 Somerst 29.0 51.0 41.0 40.0 21.0
25 StoneBr 15.0 12.0 10.0 8.0 6.0
26 Timber 11.0 21.0 18.0 14.0 8.0
27 Veenker 4.0 9.0 7.0 4.0 NaN

Knowledge check#

Questions:

Given the following DataFrame, reshape the DataFrame from the current “long” format to a “wider” format made up of the following variables:

  • Name: will contain the same values in the current Name column,

  • Year: will contain the year values which are currently column names, and

  • Courses: will contain the values that are currently listed under each year variable.

df = pd.DataFrame({
    "Name": ["Tom", "Mike", "Tiffany", "Tom", "Mike", "Tiffany"],
    "Variable": ["Year", "Year", "Year", "Courses", "Courses", "Courses"],
    "Value": [2018, 2018, 2018, 1, 3, 4]
})
df
Name Variable Value
0 Tom Year 2018
1 Mike Year 2018
2 Tiffany Year 2018
3 Tom Courses 1
4 Mike Courses 3
5 Tiffany Courses 4

Video 🎥:

Pivoting with special needs#

.pivot() will often get you what you want, but it won’t work if you want to:

  • Use multiple indexes or

  • Have duplicate index/column labels

For example, let’s look at pivoting the below data:

ames2 = pd.read_csv('../data/ames_wide2.csv')
ames2
neighborhood year_sold bedrooms homes_sold
0 Blmngtn 2006 1 1
1 Blmngtn 2006 2 10
2 Blmngtn 2007 2 4
3 Blmngtn 2008 2 5
4 Blmngtn 2009 1 2
... ... ... ... ...
430 Veenker 2007 3 6
431 Veenker 2008 1 3
432 Veenker 2008 3 4
433 Veenker 2009 2 3
434 Veenker 2009 4 1

435 rows × 4 columns

In this example, say you wanted to pivot ames_wide2 so that the year_sold is represented as columns and homes_sold values are the elements. If we try to do this similar to the last section’s example we get an error stating ValueError: Index contains duplicate entries, cannot reshape.

ames2.pivot(index='neighborhood', columns='year_sold', values='homes_sold')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 1
----> 1 ames2.pivot(index='neighborhood', columns='year_sold', values='homes_sold')

File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/frame.py:9339, in DataFrame.pivot(self, columns, index, values)
   9332 @Substitution("")
   9333 @Appender(_shared_docs["pivot"])
   9334 def pivot(
   9335     self, *, columns, index=lib.no_default, values=lib.no_default
   9336 ) -> DataFrame:
   9337     from pandas.core.reshape.pivot import pivot
-> 9339     return pivot(self, index=index, columns=columns, values=values)

File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/pivot.py:570, in pivot(data, columns, index, values)
    566         indexed = data._constructor_sliced(data[values]._values, index=multiindex)
    567 # error: Argument 1 to "unstack" of "DataFrame" has incompatible type "Union
    568 # [List[Any], ExtensionArray, ndarray[Any, Any], Index, Series]"; expected
    569 # "Hashable"
--> 570 result = indexed.unstack(columns_listlike)  # type: ignore[arg-type]
    571 result.index.names = [
    572     name if name is not lib.no_default else None for name in result.index.names
    573 ]
    575 return result

File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/series.py:4615, in Series.unstack(self, level, fill_value, sort)
   4570 """
   4571 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
   4572 
   (...)
   4611 b    2    4
   4612 """
   4613 from pandas.core.reshape.reshape import unstack
-> 4615 return unstack(self, level, fill_value, sort)

File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/reshape.py:517, in unstack(obj, level, fill_value, sort)
    515 if is_1d_only_ea_dtype(obj.dtype):
    516     return _unstack_extension_series(obj, level, fill_value, sort=sort)
--> 517 unstacker = _Unstacker(
    518     obj.index, level=level, constructor=obj._constructor_expanddim, sort=sort
    519 )
    520 return unstacker.get_result(
    521     obj._values, value_columns=None, fill_value=fill_value
    522 )

File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/reshape.py:154, in _Unstacker.__init__(self, index, level, constructor, sort)
    146 if num_cells > np.iinfo(np.int32).max:
    147     warnings.warn(
    148         f"The following operation may generate {num_cells} cells "
    149         f"in the resulting pandas object.",
    150         PerformanceWarning,
    151         stacklevel=find_stack_level(),
    152     )
--> 154 self._make_selectors()

File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/reshape.py:210, in _Unstacker._make_selectors(self)
    207 mask.put(selector, True)
    209 if mask.sum() < len(self.index):
--> 210     raise ValueError("Index contains duplicate entries, cannot reshape")
    212 self.group_index = comp_index
    213 self.mask = mask

ValueError: Index contains duplicate entries, cannot reshape

The reason is we have duplicate values in our neighborhood column and Pandas doesn’t know how to isolate the index values to properly align the pivoted data. In such a case, we’d use .pivot_table(). It will apply an aggregation function to our duplicates, in this case, we’ll sum() them up:

ames2.pivot_table(index='neighborhood', columns='year_sold', values='homes_sold', aggfunc='sum')
year_sold 2006 2007 2008 2009 2010
neighborhood
Blmngtn 11.0 4.0 5.0 6.0 2.0
Blueste NaN 2.0 2.0 4.0 2.0
BrDale 9.0 5.0 7.0 6.0 3.0
BrkSide 19.0 24.0 31.0 23.0 11.0
ClearCr 10.0 9.0 11.0 6.0 8.0
CollgCr 60.0 65.0 58.0 63.0 21.0
Crawfor 20.0 34.0 21.0 21.0 7.0
Edwards 43.0 41.0 45.0 42.0 23.0
Gilbert 38.0 45.0 26.0 41.0 15.0
Greens 4.0 1.0 NaN 1.0 2.0
GrnHill 1.0 1.0 NaN NaN NaN
IDOTRR 20.0 25.0 26.0 13.0 9.0
Landmrk 1.0 NaN NaN NaN NaN
MeadowV 10.0 8.0 7.0 5.0 7.0
Mitchel 23.0 28.0 22.0 24.0 17.0
NAmes 99.0 105.0 86.0 95.0 58.0
NPkVill 3.0 3.0 3.0 10.0 4.0
NWAmes 25.0 30.0 30.0 35.0 11.0
NoRidge 17.0 17.0 14.0 13.0 10.0
NridgHt 32.0 43.0 31.0 45.0 15.0
OldTown 51.0 48.0 56.0 55.0 29.0
SWISU 12.0 5.0 10.0 10.0 11.0
Sawyer 38.0 37.0 30.0 23.0 23.0
SawyerW 20.0 21.0 25.0 41.0 18.0
Somerst 29.0 51.0 41.0 40.0 21.0
StoneBr 15.0 12.0 10.0 8.0 6.0
Timber 11.0 21.0 18.0 14.0 8.0
Veenker 4.0 9.0 7.0 4.0 NaN

If we wanted to keep the numbers per bedroom, we could specify both neighborhood and bedrooms as multiple indexes:

ames2.pivot(index=['neighborhood', 'bedrooms'], columns='year_sold', values='homes_sold')
year_sold 2006 2007 2008 2009 2010
neighborhood bedrooms
Blmngtn 1 1.0 NaN NaN 2.0 NaN
2 10.0 4.0 5.0 4.0 2.0
Blueste 1 NaN 1.0 NaN NaN 1.0
2 NaN 1.0 1.0 4.0 1.0
3 NaN NaN 1.0 NaN NaN
... ... ... ... ... ... ...
Veenker 0 1.0 NaN NaN NaN NaN
1 NaN 1.0 3.0 NaN NaN
2 1.0 2.0 NaN 3.0 NaN
3 2.0 6.0 4.0 NaN NaN
4 NaN NaN NaN 1.0 NaN

125 rows × 5 columns

The result above is a mutlti-index or “hierarchically indexed” DataFrame, which we haven’t really talked about up to this point. However, we can easily flatten this with .reset_index() and removing the column’s index name.

ames2_reshaped = (
    ames2
    .pivot(index=['neighborhood', 'bedrooms'], columns='year_sold', values='homes_sold')
    .reset_index()
)
ames2_reshaped.columns.name = None
ames2_reshaped.head()
neighborhood bedrooms 2006 2007 2008 2009 2010
0 Blmngtn 1 1.0 NaN NaN 2.0 NaN
1 Blmngtn 2 10.0 4.0 5.0 4.0 2.0
2 Blueste 1 NaN 1.0 NaN NaN 1.0
3 Blueste 2 NaN 1.0 1.0 4.0 1.0
4 Blueste 3 NaN NaN 1.0 NaN NaN

Additional video#

Video 🎥:

Here’s a webinar that provides a thorough discussion around tidy data principles along with illustrating examples of reshaping data with Pandas. It is longer (50 minutes) but is worth a watch if you are still trying to get your arms around the above lesson conceps.

Exercises#

For this exercise, we’re going to work with this data set from this paper by Reeves, et al. in which they measured the width of the gradient in the morphogen Dorsal in Drosophila embryos for various genotypes using different method. Don’t get hung up in what this means, our object is to simply tidy this dataset.

df = pd.read_csv("../data/reeves_gradient_width_various_methods.csv", comment='#', header=[0,1])
df
wt dl1/+; dl-venus/+ dl1/+; dl-gfp/+
wholemounts cross-sections anti-Dorsal anti-Venus Venus (live) anti-Dorsal anti-GFP GFP (live)
0 0.1288 0.1327 0.1482 0.1632 0.1666 0.2248 0.2389 0.2412
1 0.1554 0.1457 0.1503 0.1671 0.1753 0.1891 0.2035 0.1942
2 0.1306 0.1447 0.1577 0.1704 0.1705 0.1705 0.1943 0.2186
3 0.1413 0.1282 0.1711 0.1779 NaN 0.1735 0.2000 0.2104
4 0.1557 0.1487 0.1342 0.1483 NaN 0.2135 0.2560 0.2463
... ... ... ... ... ... ... ... ...
147 NaN 0.1466 NaN NaN NaN NaN NaN NaN
148 NaN 0.1671 NaN NaN NaN NaN NaN NaN
149 NaN 0.1265 NaN NaN NaN NaN NaN NaN
150 NaN 0.1448 NaN NaN NaN NaN NaN NaN
151 NaN 0.1740 NaN NaN NaN NaN NaN NaN

152 rows × 8 columns

As can happen with spreadsheets, we have a multiindex, where we have three main groups:

  • wt which refers to wild type

  • dl1/+; dl-venus/+ which we’ll refer to as simply Venus

  • dl1/+; dl-gfp/+ which we’ll refer to as simply GFP

For each of these main groups we have multiple sub-columns: two for wild type (wholemounts, cross-sections), three for Venus (anti-Dorsal, Anti-Venus, Venus (live)), and three for GFP (anti-Dorsal, anti-GFP, GFP (live)). The rows here are the gradient width values recorded for each of the categories. Clearly these data are not tidy.

For this exercise your objective is to:

  1. Reshape this data so that it looks like the following:

expected_result = pd.read_csv('../data/tidy_reeves_gradients.csv')
expected_result
genotype method gradient width
0 wt wholemounts 0.1288
1 wt wholemounts 0.1554
2 wt wholemounts 0.1306
3 wt wholemounts 0.1413
4 wt wholemounts 0.1557
... ... ... ...
1211 dl1/+; dl-gfp/+ GFP (live) NaN
1212 dl1/+; dl-gfp/+ GFP (live) NaN
1213 dl1/+; dl-gfp/+ GFP (live) NaN
1214 dl1/+; dl-gfp/+ GFP (live) NaN
1215 dl1/+; dl-gfp/+ GFP (live) NaN

1216 rows × 3 columns

2. Now that you have a tidy data frame you will notice that you have many NaNs in the gradient width column because there were many of them in the data set. Drop all observations that contain NaN values.

3. Now compute summary statistics via .describe() for the gradient width variable grouped by genotype and method. Which genotype and method has the narrowest gradient width?

Computing environment#

Hide code cell source
%load_ext watermark
%watermark -v -p jupyterlab,pandas