Data Mining with R
Syllabus
This is the primary “textbook” for the Machine Learning section of the UC BANA 4080 Data Mining course. The following is a truncated syllabus; for the full syllabus along with complete course content please visit the online course content in Canvas.
Welcome to Data Mining with R! This course provides an intensive, hands-on introduction to data mining and analysis techniques. You will learn the fundamental skills required to extract informative attributes, relationships, and patterns from data sets. You will gain hands-on experience with exploratory data analysis, data visualization, unsupervised learning techniques such as clustering and dimension reduction, and supervised learning techniques such as linear regression, regularized regression, decision trees, random forests, and more! You will also be exposed to some more advanced topics such as ensembling techniques, deep learning, model stacking, and model interpretation. Together, these will give you a solid foundation in the tools and techniques that organizations apply to aid modern-day, data-driven decision making.
Learning Objectives
Upon successfully completing this course, you will be able to:
- Apply data wrangling techniques to manipulate and prepare data for analysis.
- Use exploratory data analysis and visualization to provide descriptive insights into data.
- Apply common unsupervised learning algorithms to find groupings of observations and features in a given dataset.
- Describe and apply a sound analytic modeling process.
- Apply, compare, and contrast various predictive modeling techniques.
- Have the resources and understanding to continue advancing your data mining and analysis capabilities.
…all with R!
This course assumes no prior knowledge of R. Experience with programming concepts or another programming language will help, but is not required to understand the material.
Material
This course is split into two main sections: Data Wrangling and Machine Learning. The data wrangling section will provide you with the fundamental skills required to acquire, munge, transform, manipulate, and visualize data in a computing environment that fosters reproducibility. The primary course material for this section is provided via this free online book.
The second section, focused on machine learning, will expose you to several algorithms for identifying hidden patterns and relationships within your data. The primary course material for this part of the course is provided via this free online book. There will also be recorded lectures and additional supplementary resources provided via Canvas.
Class Structure
Modules: For this class, each module is covered over the course of a week. In the “Overview” section for each module you will find the overall learning objectives, a short description of the learning content covered in that module, and all tasks that are required of you for that module (i.e., quizzes and the lab). Each module will have two or more primary lessons with associated quizzes, along with a lab.
Lessons: For each lesson you will read and work through the tutorial. Short videos will be sprinkled throughout the lesson to further discuss and reinforce lesson concepts. Each lesson will have various “TODO” exercises throughout, along with end-of-lesson exercises. I highly recommend you work through these exercises as they will prepare you for the quizzes, labs, and project work.
Quizzes: There will be a short quiz associated with each lesson. These quizzes will be hosted on the course website on Canvas. Please check Canvas for their due dates.
Labs: There will be a lab associated with each module. In these labs, students will be guided step-by-step through a case study. The aim is to show in detail how to manage a variety of complex real-world data, how to convert real problems into data wrangling and analysis problems, and how to apply R to address these problems and extract insights from the data. The labs will be both provided and submitted through the course website on Canvas. Please check Canvas for their due dates.
Projects: There will be two projects designed for you to put to work the tools and knowledge that you gain throughout this course. These projects provide you with multiple benefits:
- They give you more experience using data wrangling tools on real-life data sets.
- They help you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to find unique and creative ways to find insights in data.
- They start to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.
Schedule
See the Canvas course webpage for a detailed schedule with due dates for quizzes, labs, etc.
Module | Description |
---|---|
DATA WRANGLING | |
1 | Introduction |
 | R fundamentals & the RStudio IDE |
 | Deeper understanding of vectors |
2 | Reproducible Documents and Importing Data |
 | Managing your workflow and reproducibility |
 | Data structures & importing data |
3 | Tidy Data and Data Manipulation |
 | Data manipulation & summarization |
 | Tidy data |
4 | Relational Data and More Tidyverse Packages |
 | Relational data |
 | Leveraging the Tidyverse for text & date-time data |
5 | Data Visualization & Exploration |
 | Data visualization |
 | Exploratory data analysis |
6 | Creating Efficient Code in R |
 | Control statements & iteration |
 | Writing functions |
7 | Mid-term Project |
MACHINE LEARNING | |
8 | Introduction to Applied Modeling |
 | Introduction to machine learning |
 | First model with Tidymodels |
9 | First Regression Models |
 | Simple linear regression |
 | Multiple linear regression |
10 | More Modeling Processes |
 | Feature engineering |
 | Resampling |
11 | Classification & Regularization |
 | Logistic regression |
 | Regularized regression |
12 | Hyperparameter Tuning & Non-linearity |
 | Hyperparameter tuning |
 | Multivariate adaptive regression splines |
13 | Tree-based Models |
 | Decision trees |
 | Bagging |
 | Random forests |
14 | Unsupervised Learning |
 | Clustering |
 | Dimension reduction |
15 | Final Project |
Conventions used in this book
The following typographical conventions are used in this book:
- strong italic: indicates new terms,
- bold: indicates package & file names,
- inline code: monospaced highlighted text indicates functions or other commands that could be typed literally by the user,
- code chunk: indicates commands or other text that could be typed literally by the user.
In addition to the general text used throughout, you will notice the following cells that provide additional context for improved learning:
A video demonstrating this topic is available in Canvas.
A tip or suggestion that will likely produce better results.
A general note that could improve your understanding but is not required for the course requirements.
Warning or caution to look out for.
Knowledge check exercises to gauge your learning progress.
Feedback
To report errors or bugs that you find in this course material, please post an issue at https://github.com/bradleyboehmke/uc-bana-4080/issues. For all other communication, be sure to use Canvas or university email.
When communicating with me via email, please always include BANA4080 in the subject line.
Acknowledgements
This course and its materials have been influenced by the following resources:
- Jenny Bryan, STAT 545: Data wrangling, exploration, and analysis with R
- Garrett Grolemund & Hadley Wickham, R for Data Science
- Stephanie Hicks, Statistical Computing
- Chester Ismay & Albert Kim, ModernDive
- Alex Douglas et al., An Introduction to R
- Brandon Greenwell, Hands-on Machine Learning with R