UC BANA 7025: Data Wrangling
Syllabus
Learning Objectives
Material
Class Structure
Schedule
Conventions used in this book
Feedback
Acknowledgements
I Module 1
1
Overview
1.1
Learning objectives
1.2
Tasks
1.3
Course readings
2
Lesson 1a: Course overview
2.1
Learning objectives
2.2
Purpose of this course
2.3
Assumptions & Pre-requisites
2.4
Course Staff
2.5
Course Logistics
2.6
Grading
2.7
Getting Help
2.8
Fine Print
2.9
Exercises
3
Lesson 1b: R & RStudio
3.1
Learning objectives
3.2
R vs. RStudio
3.3
Installing R and RStudio
3.4
Understanding the RStudio IDE
3.4.1
Script Editor
3.4.2
Workspace Environment
3.4.3
Console
3.4.4
Misc. Displays
3.4.5
Workspace Options & Shortcuts
3.4.6
Knowledge check
3.5
Getting help
3.5.1
General Help
3.5.2
Getting Help From the Web
3.5.3
Knowledge check
3.6
Errors, warnings, and messages
3.7
Exercises
4
Lesson 1c: R fundamentals
4.1
Learning objectives
4.2
Assignment & evaluation
4.2.1
Assignment
4.2.2
Evaluation
4.2.3
Case Sensitivity
4.2.4
Knowledge check
4.3
R as a calculator
4.3.1
Basic Arithmetic
4.3.2
Miscellaneous Mathematical Functions
4.3.3
Infinite and NaN Numbers
4.3.4
Knowledge check
4.4
Working with packages
4.4.1
Installing Packages
4.4.2
Loading Packages
4.4.3
Getting Help on Packages
4.4.4
Useful Packages
4.4.5
Knowledge check
4.4.6
Tidyverse
4.5
Style guide
4.5.1
Notation and Naming
4.5.2
Organization
4.5.3
Syntax
4.5.4
Knowledge check
4.6
Exercises
5
Lesson 1d: Vectors
5.1
Learning objectives
5.2
Creating vectors
5.2.1
Creating sequences
5.2.2
Knowledge check
5.3
Extracting elements
5.3.1
Positional indexing
5.3.2
Logical indexing
5.3.3
Knowledge check
5.4
Replacing elements
5.4.1
Knowledge check
5.5
Operations
5.5.1
Knowledge check
5.6
Missing data
5.6.1
Knowledge check
5.7
Vectorization
5.7.1
Looping versus Vectorization
5.7.2
Recycling
5.7.3
Knowledge check
5.8
Exercises
6
Lab
II Module 2
7
Overview
7.1
Learning objectives
7.2
Tasks
7.3
Course readings
8
Lesson 2a: Workflow
8.1
Learning objectives
8.2
R Projects
8.2.1
Creating Projects
8.2.2
So What’s Different?
8.2.3
Knowledge check
8.3
R Markdown
8.3.1
Creating an R Markdown File
8.3.2
Components of an R Markdown File
8.3.3
Knitting the R Markdown File
8.3.4
Additional Resources
8.3.5
Knowledge check
8.4
R Notebooks
8.4.1
Creating an R Notebook
8.4.2
Interactiveness of an R Notebook
8.4.3
Saving, Sharing, Previewing & Knitting an R Notebook
8.4.4
Additional Resources
8.4.5
Knowledge check
8.5
Exercises
8.6
Computing environment
9
Lesson 2b: Data types & structures
9.1
Learning objectives
9.2
Data types
9.2.1
Determining the type
9.2.2
Type conversion
9.2.3
Knowledge check
9.3
Data structures
9.3.1
Scalars and vectors
9.3.2
Matrices and arrays
9.3.3
Lists
9.3.4
Data frames
9.4
Exercises
10
Lesson 2c: Importing data
10.1
Learning objectives
10.2
Data & memory
10.3
Delimited files
10.3.1
Tibbles
10.3.2
File paths
10.3.3
Metadata
10.3.4
Knowledge check
10.4
Excel files
10.5
SQL databases
10.6
Many other file types
10.6.1
JSON files
10.6.2
R object files
10.7
Exercises
11
Lab
III Module 3
12
Overview
12.1
Learning objectives
12.2
Tasks
12.3
Course readings
13
Lesson 3a: Pipe operator
13.1
Learning objectives
13.2
Pipe (%>%) operator
13.3
Additional pipe operators (optional)
13.4
Additional resources
13.5
Exercises
14
Lesson 3b: Data transformation
14.1
Learning objectives
14.2
Prerequisites
14.3
Filtering observations
14.3.1
Knowledge check
14.4
Selecting variables
14.4.1
Knowledge check
14.5
Computing summary statistics
14.5.1
Knowledge check
14.6
Sorting observations
14.6.1
Knowledge check
14.7
Creating new variables
14.7.1
Knowledge check
14.8
Putting it all together
14.9
Exercises
14.10
Additional resources
15
Lesson 3c: Tidy data
15.1
Learning objectives
15.2
Prerequisites
15.3
Making wide data longer
15.3.1
Knowledge check
15.4
Making long data wider
15.4.1
Knowledge check
15.5
Separate one variable into multiple
15.5.1
Knowledge check
15.6
Combine multiple variables into one
15.6.1
Knowledge check
15.7
Additional tidying functions
15.8
Putting it all together
15.9
Exercises
15.10
Additional resources
16
Lab
IV Module 4
17
Overview
17.1
Learning objectives
17.2
Tasks
17.3
Course readings
18
Lesson 4a: Relational data
18.1
Learning objectives
18.2
Prerequisites
18.3
Keys
18.3.1
Knowledge check
18.4
Mutating joins
18.4.1
Inner join
18.4.2
Outer joins
18.4.3
Differing keys
18.4.4
Bigger example
18.4.5
Knowledge check
18.5
Filtering joins
18.5.1
Semi join
18.5.2
Anti join
18.5.3
Bigger example
18.5.4
Knowledge check
18.6
Exercises
18.7
Additional resources
19
Lesson 4b: Handling text data
19.1
Learning objectives
19.2
Prerequisites
19.3
String basics
19.3.1
Case conversion
19.3.2
Counting characters
19.3.3
Extracting parts of character strings
19.3.4
Knowledge check
19.4
Regular expressions
19.4.1
Regex basics
19.4.2
Multiple Words
19.4.3
Line anchors
19.4.4
Metacharacters
19.4.5
Character classes
19.4.6
Shorthand character classes
19.4.7
POSIX character classes
19.4.8
Repetition
19.4.9
Putting it all together
19.4.10
Knowledge check
19.5
Exercises
19.6
Additional resources
20
Lesson 4c: Handling dates & times
20.1
Learning objectives
20.2
Prerequisites
20.3
Getting current date & time
20.4
Creating dates
20.4.1
Convert strings to dates
20.4.2
Create dates by merging data
20.4.3
Knowledge check
20.5
Extract & manipulate parts of dates
20.5.1
Knowledge check
20.6
Calculations with dates
20.6.1
Durations
20.6.2
Periods
20.6.3
Knowledge check
20.7
Exercises
20.8
Additional resources
21
Lab
V Module 5
22
Overview
22.1
Learning objectives
22.2
Tasks
22.3
Course readings
23
Lesson 5a: Introduction to ggplot2
23.1
Learning objectives
23.2
Prerequisites
23.3
Grammar of Graphics
23.4
The Basics
23.4.1
Knowledge check
23.5
Aesthetic Mappings
23.5.1
Knowledge check
23.6
Specifying Geometric Shapes
23.6.1
Knowledge check
23.7
Statistical Transformations
23.7.1
Knowledge check
23.8
Position Adjustments
23.8.1
Knowledge check
23.9
Managing Scales
23.9.1
Knowledge check
23.10
Coordinate Systems
23.10.1
Knowledge check
23.11
Facets
23.11.1
Knowledge check
23.12
Labels & Annotations
23.12.1
Knowledge check
23.13
Exercises
23.14
Additional Resources on ggplot2
24
Lesson 5b: Handling factors
24.1
Learning objectives
24.2
Prerequisites
24.3
Creating & inspecting factors
24.3.1
Some basics
24.3.2
Factors in data frames
24.3.3
Ordinal factors
24.3.4
Knowledge check
24.4
Modifying factor order
24.4.1
Knowledge check
24.5
Modifying factor levels
24.5.1
Knowledge check
24.6
Exercises
24.7
Additional Resources
25
Lesson 5c: Visual data exploration
25.1
Learning objectives
25.2
Prerequisites
25.3
Univariate Distributions
25.3.1
Histograms
25.3.2
Quantile-quantile plots
25.3.3
Box plots
25.3.4
Categorical variables
25.3.5
Bar charts
25.3.6
Dot plots
25.3.7
Pie charts
25.3.8
Knowledge check
25.4
Bivariate relationships
25.4.1
Scatter plots
25.4.2
Strip plots
25.4.3
Overlays & faceting
25.5
Multivariate relationships
25.5.1
Layering variables
25.5.2
Parallel coordinate plots
25.5.3
Mosaic plots
25.5.4
Tree maps
25.5.5
Heat maps
25.5.6
Generalized pairs plot
25.5.7
Knowledge check
25.6
Data quality
25.6.1
Knowledge check
25.7
Exercise
26
Lab
VI Module 6
27
Overview
27.1
Learning objectives
27.2
Tasks
27.3
Course readings
28
Lesson 6a: Conditional statements
28.1
Learning objectives
28.2
Prerequisites
28.3
if statement
28.3.1
Knowledge check
28.4
ifelse statement
28.4.1
Knowledge check
28.5
Multiple conditions
28.5.1
Knowledge check
28.6
Vectorized approaches
28.6.1
Knowledge check
28.7
With data frames
28.7.1
Knowledge check
28.8
Exercises
29
Lesson 6b: Iteration statements
29.1
Learning objectives
29.2
for loops
29.2.1
Knowledge check
29.3
Controlling sequences
29.3.1
Knowledge check
29.4
Repeating code for an undefined number of iterations
29.4.1
while loop
29.4.2
repeat loop
29.4.3
Knowledge check
29.5
Iteration with functional programming
29.5.1
Knowledge check
29.6
Exercises
30
Lesson 6c: Writing functions
30.1
Learning objectives
30.2
When to write functions
30.3
Function components
30.3.1
Knowledge check
30.4
Function arguments
30.4.1
Knowledge check
30.5
Checking arguments and other conditions
30.5.1
Knowledge check
30.6
Lazy evaluation
30.7
Status updates
30.7.1
Option 1: Base R
30.7.2
Option 2: progress package
30.8
Distributing your functions
30.9
Exercises
30.10
Additional resources
31
Lab
VII Module 7
32
Overview
32.1
Learning objectives
32.2
Tasks
32.3
Course readings
33
Lesson 7a: First model with Tidymodels
33.1
Learning objectives
33.2
Prerequisites
33.3
Data splitting
33.3.1
Simple random sampling
33.3.2
Stratified sampling
33.3.3
Knowledge check
33.4
Building models
33.4.1
Knowledge check
33.5
Making predictions
33.5.1
Knowledge check
33.6
Evaluating model performance
33.6.1
Regression models
33.6.2
Classification models
33.6.3
Knowledge check
33.7
Exercises
34
Lesson 7b: Feature engineering
34.1
Learning objectives
34.2
Prerequisites
34.3
Create a recipe
34.4
Feature filtering
34.5
Numeric features
34.5.1
Skewness
34.5.2
Standardization
34.6
Categorical features
34.6.1
One-hot & dummy encoding
34.6.2
Ordinal encoding
34.6.3
Lumping
34.7
Fit a model with a recipe
34.8
Exercises
35
Lesson 7c: Model evaluation & selection
35.1
Learning objectives
35.2
Prerequisites
35.3
Resampling & cross-validation
35.4
K-fold cross-validation
35.5
Hyperparameter tuning
35.5.1
Bias
35.5.2
Variance
35.5.3
Hyperparameters
35.5.4
Full Cartesian grid search
35.6
Finalizing our model
35.7
Exercises
35.8
Additional resources
36
Lab
VIII Additional Content
Computing Environment
References
University of Cincinnati
Data Wrangling with R
11
Lab
TBD