Syllabus

This is the primary “textbook” for the UC BANA 7025 Data Wrangling course. The following is a truncated syllabus; for the full syllabus and the complete course content, please visit the course site in Canvas.

Welcome to Data Wrangling with R! This course provides an intensive, hands-on introduction to Data Wrangling with the R programming language. You will learn the fundamental skills required to acquire, munge, transform, manipulate, and visualize data in a computing environment that fosters reproducibility.

Data wrangling, which is also commonly referred to as data munging, transformation, manipulation, janitor work, etc., can be a painstakingly laborious process. In fact, it has been stated that up to 80% of the time spent on a data analysis goes to cleaning and preparing data (Wickham 2014; Dasu and Johnson 2003). However, because it is a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it’s essential that you become fluent and efficient in data wrangling techniques.

Learning Objectives

This course will guide you through the data wrangling process and give you a solid foundation in the basics of working with data in R. My goal is to teach you how to easily wrangle your data so you can spend more time understanding its content through visualization, modeling, and reporting your results. Upon successfully completing this course, you will be able to:

  • Perform your data analysis in a literate programming environment
  • Manage different types of data
  • Manage different data structures
  • Import and export data
  • Index, subset, reshape and transform your data
  • Compute descriptive statistics
  • Visualize data
  • Make your code efficient by using control statements & iteration
  • Write your own functions
  • Train and evaluate predictive models

…all with R!

This course assumes no prior knowledge of R. Experience with programming concepts or another programming language will help, but is not required to understand the material.

Material

The bulk of the classroom material will be provided via this book, the recorded lectures, and any additional resources provided via Canvas. In some cases there may be additional recommended readings, all of which are readily available online.

Class Structure

Modules: For this class, each module is covered over the course of a week. In the “Overview” section for each module you will find the overall learning objectives, a short description of the learning content covered in that module, and all tasks that are required of you for that module (i.e., quizzes and a lab). Each module will have two or more primary lessons with associated quizzes, along with a lab.

Lessons: For each lesson you will read and work through the tutorial. Short videos will be sprinkled throughout the lesson to further discuss and reinforce lesson concepts. Each lesson will have various “TODO” exercises throughout, along with end-of-lesson exercises. I highly recommend you work through these exercises as they will prepare you for the quizzes, labs, and project work.

Quizzes: There will be a short quiz associated with each lesson. These quizzes will be hosted on the Canvas course website. Please check Canvas for the quiz due dates.

Labs: There will be a lab associated with each module. In these labs, students will be guided step-by-step through a case study. The aim is to provide a detailed view of how to manage a variety of complex real-world data, how to convert real problems into data wrangling and analysis problems, and how to apply R to address these problems and extract insights from the data. Labs are both provided and submitted through the course website on Canvas. Please check Canvas for the lab due dates.

Project: The final project is designed for you to put to work the tools and knowledge that you gain throughout this course. This provides you with multiple benefits:

  • It will provide you with more experience using data wrangling tools on real life data sets.
  • It helps you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to find unique and creative ways to find insights in data.
  • It starts to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.

Schedule

See the Canvas course webpage for a detailed schedule with due dates for quizzes, labs, etc.

Module  Description
1       Introduction
          • R fundamentals & the RStudio IDE
          • Deeper understanding of vectors
2       Reproducible Documents and Importing Data
          • Managing your workflow and reproducibility
          • Data structures & importing data
3       Tidy Data and Data Manipulation
          • Data manipulation & summarization
          • Tidy data
4       Relational Data and More Tidyverse Packages
          • Relational data
          • Leveraging the Tidyverse to analyze text & date-time data
5       Data Visualization & Exploration
          • Data visualization
          • Exploratory data analysis
6       Creating Efficient Code in R
          • Control statements & iteration
          • Writing functions
7       Introduction to Applied Modeling
          • Introduction to tidymodels
          • Feature engineering & model evaluation/selection

Conventions used in this book

The following typographical conventions are used in this book:

  • strong italic: indicates new terms,
  • bold: indicates package & file names,
  • inline code: monospaced highlighted text indicates functions or other commands that could be typed literally by the user,
  • code chunk: indicates commands or other text that could be typed literally by the user, for example:

1 + 2
## [1] 3
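
As one more illustration of the code chunk convention, the minimal sketch below shows typed input followed by the printed result, which appears on a line prefixed with `##`. The vector `x` and its values are made up for this example and are not taken from the course material.

```r
# A small numeric vector (hypothetical values, for illustration only)
x <- c(2, 4, 6)

# Typed input: compute the arithmetic mean of x
mean(x)
## [1] 4
```

Because `#` starts a comment in R, output lines prefixed with `##` are ignored if you copy and paste an entire chunk into the console.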

In addition to the general text used throughout, you will notice the following cells that provide additional context for improved learning:

A video that further discusses and demonstrates this topic is available.

A tip or suggestion that will likely produce better results.

A general note that could improve your understanding but is not required for the course requirements.

Warning or caution to look out for.

Knowledge check exercises to gauge your learning progress.

Feedback

To report errors or bugs that you find in this course material, please post an issue at https://github.com/bradleyboehmke/uc-bana-7025/issues. For all other communication, be sure to use Canvas or your university email.

When communicating with me via email, please always include BANA7025 in the subject line.

Acknowledgements

This course and its materials have been influenced by the following resources:

References

Dasu, Tamraparni, and Theodore Johnson. 2003. Exploratory Data Mining and Data Cleaning. Vol. 479. John Wiley & Sons.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://www.jstatsoft.org/v59/i10/.