Lesson 4a: Tidy data#
Tidy data is about “linking the structure of a dataset with its semantics (its meaning)”. It is defined by:
Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table
Often you’ll need to reshape a dataframe to make it tidy (or for some other purpose).

Source: R4DS
Once a DataFrame is tidy, it becomes much easier to compute summary statistics, join with other datasets, visualize, apply machine learning models, etc. In this lesson we will focus on ways to reshape DataFrames so that they meet the tidy guidelines.
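For example, once the data are tidy (one row per neighborhood-year observation, as we'll build below), a summary like total homes sold per year is a one-line group-by. A minimal sketch using a tiny made-up frame (the values here are invented purely for illustration):
import pandas as pd

# a tiny, made-up tidy frame: one row per neighborhood-year observation
tidy = pd.DataFrame({
    'neighborhood': ['A', 'A', 'B', 'B'],
    'year': [2006, 2007, 2006, 2007],
    'homes_sold': [11, 4, 9, 5]
})

# each variable is its own column, so summarizing is straightforward
tidy.groupby('year')['homes_sold'].sum()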
Video 🎥:
Learning objectives#
By the end of this lesson you will be able to:
Reshape data from wide to long
Reshape data from long to wide
Tools for reshaping#
Pandas provides multiple methods that help to reshape DataFrames:
.melt(): make wide data long.
.pivot(): make long data wide.
.pivot_table(): like .pivot(), but it can handle multiple indexes and aggregate duplicate entries.
A quick toy example of melting and pivoting appears below.

Source: Garrick Aden-Buie
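To see how these methods fit together before we dig into real data, here is a quick toy round-trip on a small made-up frame (the column names id, x, and y are hypothetical):
import pandas as pd

wide = pd.DataFrame({'id': ['a', 'b'], 'x': [1, 2], 'y': [3, 4]})

# wide -> long: the old column headers 'x' and 'y' become values in a 'variable' column
long_df = wide.melt(id_vars='id', var_name='variable', value_name='value')

# long -> wide: the 'variable' values become column headers again
back = long_df.pivot(index='id', columns='variable', values='value').reset_index()
back.columns.name = None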
The following will illustrate each of these for their unique purpose.
Melting wide data#
The data below show how many homes were sold within each neighborhood in each year. Is this considered 'tidy data'? No, because one variable (year sold) is spread across the column headers and another variable (number of homes sold) fills the cell values.
If you wanted to answer a question like "Does the number of homes sold vary depending on year?", the data below are not in the appropriate form to answer it.
import pandas as pd
ames_wide = pd.read_csv('../data/ames_wide.csv')
ames_wide.head()
neighborhood | 2006 | 2007 | 2008 | 2009 | 2010 | |
---|---|---|---|---|---|---|
0 | Blmngtn | 11.0 | 4.0 | 5.0 | 6.0 | 2.0 |
1 | Blueste | NaN | 2.0 | 2.0 | 4.0 | 2.0 |
2 | BrDale | 9.0 | 5.0 | 7.0 | 6.0 | 3.0 |
3 | BrkSide | 19.0 | 24.0 | 31.0 | 23.0 | 11.0 |
4 | ClearCr | 10.0 | 9.0 | 11.0 | 6.0 | 8.0 |
In this example we would consider this data “wide” and our objective is to convert it into a DataFrame with three variables:
neighborhood
year
homes_sold
To do so we'll use the .melt() method, whose main arguments include:
id_vars: identifier column(s)
var_name: name to give the new variable represented by the old column headers
value_name: name to give the new variable represented by the old element values
ames_melt = ames_wide.melt(id_vars='neighborhood', var_name='year', value_name='homes_sold')
ames_melt
neighborhood | year | homes_sold | |
---|---|---|---|
0 | Blmngtn | 2006 | 11.0 |
1 | Blueste | 2006 | NaN |
2 | BrDale | 2006 | 9.0 |
3 | BrkSide | 2006 | 19.0 |
4 | ClearCr | 2006 | 10.0 |
... | ... | ... | ... |
135 | SawyerW | 2010 | 18.0 |
136 | Somerst | 2010 | 21.0 |
137 | StoneBr | 2010 | 6.0 |
138 | Timber | 2010 | 8.0 |
139 | Veenker | 2010 | NaN |
140 rows × 3 columns
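One detail worth noting (not something the lesson relies on): because the year values came from the old column labels, the new year column typically holds strings rather than integers. If you want numeric years, a possible conversion on a copy looks like:
# the melted 'year' column contains the old string column labels;
# convert to integers on a copy if numeric years are needed
ames_melt_numeric = ames_melt.assign(year=ames_melt['year'].astype(int))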
The value_vars argument allows us to select which specific variables we want to "melt" (if you don't specify value_vars, all non-identifier columns will be used). For example, below I'm omitting the 2006 column:
ames_wide.melt(
id_vars='neighborhood',
value_vars=['2007', '2008', '2009', '2010'],
var_name='year',
value_name='homes_sold'
)
neighborhood | year | homes_sold | |
---|---|---|---|
0 | Blmngtn | 2007 | 4.0 |
1 | Blueste | 2007 | 2.0 |
2 | BrDale | 2007 | 5.0 |
3 | BrkSide | 2007 | 24.0 |
4 | ClearCr | 2007 | 9.0 |
... | ... | ... | ... |
107 | SawyerW | 2010 | 18.0 |
108 | Somerst | 2010 | 21.0 |
109 | StoneBr | 2010 | 6.0 |
110 | Timber | 2010 | 8.0 |
111 | Veenker | 2010 | NaN |
112 rows × 3 columns
Knowledge check#
Questions:
Given the following DataFrame, reshape it from the current "wide" format to a "longer" format made up of the following variables:
Name: will contain the same values as the current Name column,
Year: will contain the year values which are currently column names, and
Courses: will contain the values that are currently listed under each year variable.
df = pd.DataFrame({"Name": ["Tom", "Mike", "Tiffany", "Varada", "Joel"],
"2018": [1, 3, 4, 5, 3],
"2019": [2, 4, 3, 2, 1],
"2020": [5, 2, 4, 4, 3]})
df
Name | 2018 | 2019 | 2020 | |
---|---|---|---|---|
0 | Tom | 1 | 2 | 5 |
1 | Mike | 3 | 4 | 2 |
2 | Tiffany | 4 | 3 | 4 |
3 | Varada | 5 | 2 | 4 |
4 | Joel | 3 | 1 | 3 |
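Try it yourself before reading on; one possible approach (a sketch following the same .melt() pattern shown above) is:
df_long = df.melt(id_vars='Name', var_name='Year', value_name='Courses')
df_long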
Video 🎥:
Pivoting long data#
Sometimes you want to make long data wide, which we can do with .pivot(). When using .pivot() we need to specify the index to pivot on, the column whose values will become the new column headers, and the values that will fill in the wider DataFrame. Let's convert our ames_melt DataFrame back to the wide format:
ames_pivot = ames_melt.pivot(index='neighborhood', columns='year', values='homes_sold')
ames_pivot
year | 2006 | 2007 | 2008 | 2009 | 2010 |
---|---|---|---|---|---|
neighborhood | |||||
Blmngtn | 11.0 | 4.0 | 5.0 | 6.0 | 2.0 |
Blueste | NaN | 2.0 | 2.0 | 4.0 | 2.0 |
BrDale | 9.0 | 5.0 | 7.0 | 6.0 | 3.0 |
BrkSide | 19.0 | 24.0 | 31.0 | 23.0 | 11.0 |
ClearCr | 10.0 | 9.0 | 11.0 | 6.0 | 8.0 |
CollgCr | 60.0 | 65.0 | 58.0 | 63.0 | 21.0 |
Crawfor | 20.0 | 34.0 | 21.0 | 21.0 | 7.0 |
Edwards | 43.0 | 41.0 | 45.0 | 42.0 | 23.0 |
Gilbert | 38.0 | 45.0 | 26.0 | 41.0 | 15.0 |
Greens | 4.0 | 1.0 | NaN | 1.0 | 2.0 |
GrnHill | 1.0 | 1.0 | NaN | NaN | NaN |
IDOTRR | 20.0 | 25.0 | 26.0 | 13.0 | 9.0 |
Landmrk | 1.0 | NaN | NaN | NaN | NaN |
MeadowV | 10.0 | 8.0 | 7.0 | 5.0 | 7.0 |
Mitchel | 23.0 | 28.0 | 22.0 | 24.0 | 17.0 |
NAmes | 99.0 | 105.0 | 86.0 | 95.0 | 58.0 |
NPkVill | 3.0 | 3.0 | 3.0 | 10.0 | 4.0 |
NWAmes | 25.0 | 30.0 | 30.0 | 35.0 | 11.0 |
NoRidge | 17.0 | 17.0 | 14.0 | 13.0 | 10.0 |
NridgHt | 32.0 | 43.0 | 31.0 | 45.0 | 15.0 |
OldTown | 51.0 | 48.0 | 56.0 | 55.0 | 29.0 |
SWISU | 12.0 | 5.0 | 10.0 | 10.0 | 11.0 |
Sawyer | 38.0 | 37.0 | 30.0 | 23.0 | 23.0 |
SawyerW | 20.0 | 21.0 | 25.0 | 41.0 | 18.0 |
Somerst | 29.0 | 51.0 | 41.0 | 40.0 | 21.0 |
StoneBr | 15.0 | 12.0 | 10.0 | 8.0 | 6.0 |
Timber | 11.0 | 21.0 | 18.0 | 14.0 | 8.0 |
Veenker | 4.0 | 9.0 | 7.0 | 4.0 | NaN |
You'll notice that Pandas set our specified index as the index of the new DataFrame and named the column index after the pivoted column (year). We can easily remove this name and reset the index to make our DataFrame look like it originally did:
ames_pivot = ames_pivot.reset_index()
ames_pivot.columns.name = None
ames_pivot
neighborhood | 2006 | 2007 | 2008 | 2009 | 2010 | |
---|---|---|---|---|---|---|
0 | Blmngtn | 11.0 | 4.0 | 5.0 | 6.0 | 2.0 |
1 | Blueste | NaN | 2.0 | 2.0 | 4.0 | 2.0 |
2 | BrDale | 9.0 | 5.0 | 7.0 | 6.0 | 3.0 |
3 | BrkSide | 19.0 | 24.0 | 31.0 | 23.0 | 11.0 |
4 | ClearCr | 10.0 | 9.0 | 11.0 | 6.0 | 8.0 |
5 | CollgCr | 60.0 | 65.0 | 58.0 | 63.0 | 21.0 |
6 | Crawfor | 20.0 | 34.0 | 21.0 | 21.0 | 7.0 |
7 | Edwards | 43.0 | 41.0 | 45.0 | 42.0 | 23.0 |
8 | Gilbert | 38.0 | 45.0 | 26.0 | 41.0 | 15.0 |
9 | Greens | 4.0 | 1.0 | NaN | 1.0 | 2.0 |
10 | GrnHill | 1.0 | 1.0 | NaN | NaN | NaN |
11 | IDOTRR | 20.0 | 25.0 | 26.0 | 13.0 | 9.0 |
12 | Landmrk | 1.0 | NaN | NaN | NaN | NaN |
13 | MeadowV | 10.0 | 8.0 | 7.0 | 5.0 | 7.0 |
14 | Mitchel | 23.0 | 28.0 | 22.0 | 24.0 | 17.0 |
15 | NAmes | 99.0 | 105.0 | 86.0 | 95.0 | 58.0 |
16 | NPkVill | 3.0 | 3.0 | 3.0 | 10.0 | 4.0 |
17 | NWAmes | 25.0 | 30.0 | 30.0 | 35.0 | 11.0 |
18 | NoRidge | 17.0 | 17.0 | 14.0 | 13.0 | 10.0 |
19 | NridgHt | 32.0 | 43.0 | 31.0 | 45.0 | 15.0 |
20 | OldTown | 51.0 | 48.0 | 56.0 | 55.0 | 29.0 |
21 | SWISU | 12.0 | 5.0 | 10.0 | 10.0 | 11.0 |
22 | Sawyer | 38.0 | 37.0 | 30.0 | 23.0 | 23.0 |
23 | SawyerW | 20.0 | 21.0 | 25.0 | 41.0 | 18.0 |
24 | Somerst | 29.0 | 51.0 | 41.0 | 40.0 | 21.0 |
25 | StoneBr | 15.0 | 12.0 | 10.0 | 8.0 | 6.0 |
26 | Timber | 11.0 | 21.0 | 18.0 | 14.0 | 8.0 |
27 | Veenker | 4.0 | 9.0 | 7.0 | 4.0 | NaN |
Knowledge check#
Questions:
Given the following DataFrame, reshape it from the current "long" format to a "wider" format made up of the following variables:
Name: will contain the same values as the current Name column,
Year: will contain the values currently stored in the Value column where Variable is "Year", and
Courses: will contain the values currently stored in the Value column where Variable is "Courses".
df = pd.DataFrame({
"Name": ["Tom", "Mike", "Tiffany", "Tom", "Mike", "Tiffany"],
"Variable": ["Year", "Year", "Year", "Courses", "Courses", "Courses"],
"Value": [2018, 2018, 2018, 1, 3, 4]
})
df
Name | Variable | Value | |
---|---|---|---|
0 | Tom | Year | 2018 |
1 | Mike | Year | 2018 |
2 | Tiffany | Year | 2018 |
3 | Tom | Courses | 1 |
4 | Mike | Courses | 3 |
5 | Tiffany | Courses | 4 |
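Try it yourself before reading on; one possible approach (a sketch using .pivot(), since each Name/Variable pair appears only once) is:
df_wide = df.pivot(index='Name', columns='Variable', values='Value').reset_index()
df_wide.columns.name = None
df_wide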
Video 🎥:
Pivoting with special needs#
.pivot() will often get you what you want, but it won't work when the data contain duplicate index/column combinations (i.e., more than one row for the same index and column values).
For example, let’s look at pivoting the below data:
ames2 = pd.read_csv('../data/ames_wide2.csv')
ames2
neighborhood | year_sold | bedrooms | homes_sold | |
---|---|---|---|---|
0 | Blmngtn | 2006 | 1 | 1 |
1 | Blmngtn | 2006 | 2 | 10 |
2 | Blmngtn | 2007 | 2 | 4 |
3 | Blmngtn | 2008 | 2 | 5 |
4 | Blmngtn | 2009 | 1 | 2 |
... | ... | ... | ... | ... |
430 | Veenker | 2007 | 3 | 6 |
431 | Veenker | 2008 | 1 | 3 |
432 | Veenker | 2008 | 3 | 4 |
433 | Veenker | 2009 | 2 | 3 |
434 | Veenker | 2009 | 4 | 1 |
435 rows × 4 columns
In this example, say you wanted to pivot ames2 so that year_sold is represented as columns and the homes_sold values are the elements. If we try to do this the same way as in the last section's example, we get an error stating ValueError: Index contains duplicate entries, cannot reshape.
ames2.pivot(index='neighborhood', columns='year_sold', values='homes_sold')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[9], line 1
----> 1 ames2.pivot(index='neighborhood', columns='year_sold', values='homes_sold')
File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/frame.py:9339, in DataFrame.pivot(self, columns, index, values)
9332 @Substitution("")
9333 @Appender(_shared_docs["pivot"])
9334 def pivot(
9335 self, *, columns, index=lib.no_default, values=lib.no_default
9336 ) -> DataFrame:
9337 from pandas.core.reshape.pivot import pivot
-> 9339 return pivot(self, index=index, columns=columns, values=values)
File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/pivot.py:570, in pivot(data, columns, index, values)
566 indexed = data._constructor_sliced(data[values]._values, index=multiindex)
567 # error: Argument 1 to "unstack" of "DataFrame" has incompatible type "Union
568 # [List[Any], ExtensionArray, ndarray[Any, Any], Index, Series]"; expected
569 # "Hashable"
--> 570 result = indexed.unstack(columns_listlike) # type: ignore[arg-type]
571 result.index.names = [
572 name if name is not lib.no_default else None for name in result.index.names
573 ]
575 return result
File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/series.py:4615, in Series.unstack(self, level, fill_value, sort)
4570 """
4571 Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
4572
(...)
4611 b 2 4
4612 """
4613 from pandas.core.reshape.reshape import unstack
-> 4615 return unstack(self, level, fill_value, sort)
File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/reshape.py:517, in unstack(obj, level, fill_value, sort)
515 if is_1d_only_ea_dtype(obj.dtype):
516 return _unstack_extension_series(obj, level, fill_value, sort=sort)
--> 517 unstacker = _Unstacker(
518 obj.index, level=level, constructor=obj._constructor_expanddim, sort=sort
519 )
520 return unstacker.get_result(
521 obj._values, value_columns=None, fill_value=fill_value
522 )
File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/reshape.py:154, in _Unstacker.__init__(self, index, level, constructor, sort)
146 if num_cells > np.iinfo(np.int32).max:
147 warnings.warn(
148 f"The following operation may generate {num_cells} cells "
149 f"in the resulting pandas object.",
150 PerformanceWarning,
151 stacklevel=find_stack_level(),
152 )
--> 154 self._make_selectors()
File /opt/anaconda3/envs/bana6043/lib/python3.12/site-packages/pandas/core/reshape/reshape.py:210, in _Unstacker._make_selectors(self)
207 mask.put(selector, True)
209 if mask.sum() < len(self.index):
--> 210 raise ValueError("Index contains duplicate entries, cannot reshape")
212 self.group_index = comp_index
213 self.mask = mask
ValueError: Index contains duplicate entries, cannot reshape
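Before fixing this, a quick check (not part of the original lesson output, but easy to run) shows where the problem comes from:
# True if any neighborhood/year_sold combination occurs in more than one row
ames2[['neighborhood', 'year_sold']].duplicated().any()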
The reason is that each neighborhood/year_sold combination appears multiple times (once per bedroom count), so Pandas doesn't know how to place a single value in each cell of the pivoted table. In such a case, we'd use .pivot_table(), which applies an aggregation function to the duplicates; in this case, we'll sum them up:
ames2.pivot_table(index='neighborhood', columns='year_sold', values='homes_sold', aggfunc='sum')
year_sold | 2006 | 2007 | 2008 | 2009 | 2010 |
---|---|---|---|---|---|
neighborhood | |||||
Blmngtn | 11.0 | 4.0 | 5.0 | 6.0 | 2.0 |
Blueste | NaN | 2.0 | 2.0 | 4.0 | 2.0 |
BrDale | 9.0 | 5.0 | 7.0 | 6.0 | 3.0 |
BrkSide | 19.0 | 24.0 | 31.0 | 23.0 | 11.0 |
ClearCr | 10.0 | 9.0 | 11.0 | 6.0 | 8.0 |
CollgCr | 60.0 | 65.0 | 58.0 | 63.0 | 21.0 |
Crawfor | 20.0 | 34.0 | 21.0 | 21.0 | 7.0 |
Edwards | 43.0 | 41.0 | 45.0 | 42.0 | 23.0 |
Gilbert | 38.0 | 45.0 | 26.0 | 41.0 | 15.0 |
Greens | 4.0 | 1.0 | NaN | 1.0 | 2.0 |
GrnHill | 1.0 | 1.0 | NaN | NaN | NaN |
IDOTRR | 20.0 | 25.0 | 26.0 | 13.0 | 9.0 |
Landmrk | 1.0 | NaN | NaN | NaN | NaN |
MeadowV | 10.0 | 8.0 | 7.0 | 5.0 | 7.0 |
Mitchel | 23.0 | 28.0 | 22.0 | 24.0 | 17.0 |
NAmes | 99.0 | 105.0 | 86.0 | 95.0 | 58.0 |
NPkVill | 3.0 | 3.0 | 3.0 | 10.0 | 4.0 |
NWAmes | 25.0 | 30.0 | 30.0 | 35.0 | 11.0 |
NoRidge | 17.0 | 17.0 | 14.0 | 13.0 | 10.0 |
NridgHt | 32.0 | 43.0 | 31.0 | 45.0 | 15.0 |
OldTown | 51.0 | 48.0 | 56.0 | 55.0 | 29.0 |
SWISU | 12.0 | 5.0 | 10.0 | 10.0 | 11.0 |
Sawyer | 38.0 | 37.0 | 30.0 | 23.0 | 23.0 |
SawyerW | 20.0 | 21.0 | 25.0 | 41.0 | 18.0 |
Somerst | 29.0 | 51.0 | 41.0 | 40.0 | 21.0 |
StoneBr | 15.0 | 12.0 | 10.0 | 8.0 | 6.0 |
Timber | 11.0 | 21.0 | 18.0 | 14.0 | 8.0 |
Veenker | 4.0 | 9.0 | 7.0 | 4.0 | NaN |
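A side note (not shown in the original lesson): .pivot_table() defaults to taking the mean of duplicates (aggfunc='mean'), and you can pass other reducers as well. For instance, averaging instead of summing would look like:
# aggfunc defaults to 'mean'; here we make it explicit
ames2.pivot_table(index='neighborhood', columns='year_sold',
                  values='homes_sold', aggfunc='mean')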
If we wanted to keep the numbers per bedroom, we could specify both neighborhood and bedrooms as multiple indexes:
ames2.pivot(index=['neighborhood', 'bedrooms'], columns='year_sold', values='homes_sold')
year_sold | 2006 | 2007 | 2008 | 2009 | 2010 | |
---|---|---|---|---|---|---|
neighborhood | bedrooms | |||||
Blmngtn | 1 | 1.0 | NaN | NaN | 2.0 | NaN |
2 | 10.0 | 4.0 | 5.0 | 4.0 | 2.0 | |
Blueste | 1 | NaN | 1.0 | NaN | NaN | 1.0 |
2 | NaN | 1.0 | 1.0 | 4.0 | 1.0 | |
3 | NaN | NaN | 1.0 | NaN | NaN | |
... | ... | ... | ... | ... | ... | ... |
Veenker | 0 | 1.0 | NaN | NaN | NaN | NaN |
1 | NaN | 1.0 | 3.0 | NaN | NaN | |
2 | 1.0 | 2.0 | NaN | 3.0 | NaN | |
3 | 2.0 | 6.0 | 4.0 | NaN | NaN | |
4 | NaN | NaN | NaN | 1.0 | NaN |
125 rows × 5 columns
The result above is a multi-index or "hierarchically indexed" DataFrame, which we haven't really talked about up to this point. However, we can easily flatten it with .reset_index() and by removing the column index's name.
ames2_reshaped = (
ames2
.pivot(index=['neighborhood', 'bedrooms'], columns='year_sold', values='homes_sold')
.reset_index()
)
ames2_reshaped.columns.name = None
ames2_reshaped.head()
neighborhood | bedrooms | 2006 | 2007 | 2008 | 2009 | 2010 | |
---|---|---|---|---|---|---|---|
0 | Blmngtn | 1 | 1.0 | NaN | NaN | 2.0 | NaN |
1 | Blmngtn | 2 | 10.0 | 4.0 | 5.0 | 4.0 | 2.0 |
2 | Blueste | 1 | NaN | 1.0 | NaN | NaN | 1.0 |
3 | Blueste | 2 | NaN | 1.0 | 1.0 | 4.0 | 1.0 |
4 | Blueste | 3 | NaN | NaN | 1.0 | NaN | NaN |
Additional video#
Video 🎥:
Here's a webinar that provides a thorough discussion of tidy data principles along with illustrative examples of reshaping data with Pandas. It is longer (50 minutes) but is worth a watch if you are still trying to get your arms around the above lesson concepts.
Exercises#
For this exercise, we're going to work with this data set from this paper by Reeves, et al., in which they measured the width of the gradient in the morphogen Dorsal in Drosophila embryos for various genotypes using different methods. Don't get hung up on what this means; our objective is simply to tidy this dataset.
df = pd.read_csv("../data/reeves_gradient_width_various_methods.csv", comment='#', header=[0,1])
df
wt | dl1/+; dl-venus/+ | dl1/+; dl-gfp/+ | ||||||
---|---|---|---|---|---|---|---|---|
wholemounts | cross-sections | anti-Dorsal | anti-Venus | Venus (live) | anti-Dorsal | anti-GFP | GFP (live) | |
0 | 0.1288 | 0.1327 | 0.1482 | 0.1632 | 0.1666 | 0.2248 | 0.2389 | 0.2412 |
1 | 0.1554 | 0.1457 | 0.1503 | 0.1671 | 0.1753 | 0.1891 | 0.2035 | 0.1942 |
2 | 0.1306 | 0.1447 | 0.1577 | 0.1704 | 0.1705 | 0.1705 | 0.1943 | 0.2186 |
3 | 0.1413 | 0.1282 | 0.1711 | 0.1779 | NaN | 0.1735 | 0.2000 | 0.2104 |
4 | 0.1557 | 0.1487 | 0.1342 | 0.1483 | NaN | 0.2135 | 0.2560 | 0.2463 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
147 | NaN | 0.1466 | NaN | NaN | NaN | NaN | NaN | NaN |
148 | NaN | 0.1671 | NaN | NaN | NaN | NaN | NaN | NaN |
149 | NaN | 0.1265 | NaN | NaN | NaN | NaN | NaN | NaN |
150 | NaN | 0.1448 | NaN | NaN | NaN | NaN | NaN | NaN |
151 | NaN | 0.1740 | NaN | NaN | NaN | NaN | NaN | NaN |
152 rows × 8 columns
As can happen with spreadsheets, we have a multi-index with three main column groups:
wt, which refers to wild type
dl1/+; dl-venus/+, which we'll refer to as simply Venus
dl1/+; dl-gfp/+, which we'll refer to as simply GFP
For each of these main groups we have multiple sub-columns: two for wild type (wholemounts, cross-sections), three for Venus (anti-Dorsal, anti-Venus, Venus (live)), and three for GFP (anti-Dorsal, anti-GFP, GFP (live)). The rows are the gradient width values recorded for each of these categories. Clearly these data are not tidy.
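Before reshaping, it can help to inspect the two-level column index that header=[0, 1] produced; a quick look (purely illustrative) is:
# level 0 holds the genotype group, level 1 the measurement method
print(df.columns.get_level_values(0).unique())
print(df.columns.get_level_values(1))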
For this exercise your objective is to:
Reshape this data so that it looks like the following:
expected_result = pd.read_csv('../data/tidy_reeves_gradients.csv')
expected_result
genotype | method | gradient width | |
---|---|---|---|
0 | wt | wholemounts | 0.1288 |
1 | wt | wholemounts | 0.1554 |
2 | wt | wholemounts | 0.1306 |
3 | wt | wholemounts | 0.1413 |
4 | wt | wholemounts | 0.1557 |
... | ... | ... | ... |
1211 | dl1/+; dl-gfp/+ | GFP (live) | NaN |
1212 | dl1/+; dl-gfp/+ | GFP (live) | NaN |
1213 | dl1/+; dl-gfp/+ | GFP (live) | NaN |
1214 | dl1/+; dl-gfp/+ | GFP (live) | NaN |
1215 | dl1/+; dl-gfp/+ | GFP (live) | NaN |
1216 rows × 3 columns
2. Now that you have a tidy data frame, you will notice many NaNs in the gradient width column (there were many in the original data set). Drop all observations that contain NaN values.
3. Now compute summary statistics via .describe() for the gradient width variable grouped by genotype and method. Which genotype and method has the narrowest gradient width?
Computing environment#
Show code cell source
%load_ext watermark
%watermark -v -p jupyterlab,pandas