Syllabus

This is the primary “textbook” for the Machine Learning section of the UC BANA 4080 Data Mining course. The following is a truncated syllabus; for the full syllabus along with complete course content please visit the online course content in Canvas.

Welcome to Data Mining with R! This course provides an intensive, hands-on introduction to data mining and analysis techniques. You will learn the fundamental skills required to extract informative attributes, relationships, and patterns from data sets. You will gain hands-on experience with exploratory data analysis, data visualization, unsupervised learning techniques such as clustering and dimension reduction, and supervised learning techniques such as linear regression, regularized regression, decision trees, random forests, and more! You will also be exposed to some more advanced topics such as ensembling techniques, deep learning, model stacking, and model interpretation. Together, these topics will provide you with a solid foundation in the tools and techniques that organizations apply to aid modern-day, data-driven decision making.

Learning Objectives

Upon successfully completing this course, you will be able to:

  • Apply data wrangling techniques to manipulate and prepare data for analysis.
  • Use exploratory data analysis and visualization to provide descriptive insights into data.
  • Apply common unsupervised learning algorithms to find common groupings of observations and features in a given dataset.
  • Describe and apply a sound analytic modeling process.
  • Apply, compare, and contrast various predictive modeling techniques.
  • Have the resources and understanding to continue advancing your data mining and analysis capabilities.

…all with R!

This course assumes no prior knowledge of R. Experience with programming concepts or another programming language will help, but is not required to understand the material.

Material

This course is split into two main sections: Data Wrangling and Machine Learning. The data wrangling section will provide you with the fundamental skills required to acquire, munge, transform, manipulate, and visualize data in a computing environment that fosters reproducibility. The primary course material for this section is provided via this free online book.

The second section, focused on machine learning, will expose you to several algorithms for identifying hidden patterns and relationships within your data. The primary course material for this part of the course is provided via this free online book. There will also be recorded lectures and additional supplementary resources provided via Canvas.

Class Structure

Modules: For this class, each module is covered over the course of a week. In the “Overview” section for each module you will find the overall learning objectives, a short description of the learning content covered in that module, and all tasks that are required of you for that module (i.e., quizzes and a lab). Each module will have two or more primary lessons with associated quizzes, along with a lab.

Lessons: For each lesson you will read and work through the tutorial. Short videos will be sprinkled throughout the lesson to further discuss and reinforce lesson concepts. Each lesson will have various “TODO” exercises throughout, along with end-of-lesson exercises. I highly recommend you work through these exercises as they will prepare you for the quizzes, labs, and project work.

Quizzes: There will be a short quiz associated with each lesson. These quizzes will be hosted in the course website on Canvas. Please check Canvas for due dates for these quizzes.

Labs: There will be a lab associated with each module. For these labs, students will be guided through a case study step-by-step. The aim is to provide a detailed view of how to manage a variety of complex real-world data, how to convert real problems into data wrangling and analysis problems, and how to apply R to address these problems and extract insights from the data. These labs will be provided, and submitted, via the course website on Canvas. Please check Canvas for due dates for these labs.

Projects: There will be two projects designed for you to put to work the tools and knowledge that you gain throughout this course. These projects provide you with multiple benefits:

  • They will provide you with more experience using data wrangling tools on real-life data sets.
  • They help you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to find unique and creative ways to find insights in data.
  • They start to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.

Schedule

See the Canvas course webpage for a detailed schedule with due dates for quizzes, labs, etc.

Module Description
DATA WRANGLING
1 Introduction
  • R fundamentals & the RStudio IDE
  • Deeper understanding of vectors
2 Reproducible Documents and Importing Data
  • Managing your workflow and reproducibility
  • Data structures & importing data
3 Tidy Data and Data Manipulation
  • Data manipulation & summarization
  • Tidy data
4 Relational Data and More Tidyverse Packages
  • Relational data
  • Leveraging the Tidyverse for text & date-time data
5 Data Visualization & Exploration
  • Data visualization
  • Exploratory data analysis
6 Creating Efficient Code in R
  • Control statements & iteration
  • Writing functions
7 Mid-term Project
MACHINE LEARNING
8 Introduction to Applied Modeling
  • Introduction to machine learning
  • First model with Tidymodels
9 First Regression Models
  • Simple linear regression
  • Multiple linear regression
10 More Modeling Processes
  • Feature engineering
  • Resampling
11 Classification & Regularization
  • Logistic regression
  • Regularized regression
12 Hyperparameter Tuning & Non-linearity
  • Hyperparameter tuning
  • Multivariate adaptive regression splines
13 Tree-based Models
  • Decision trees
  • Bagging
  • Random forests
14 Unsupervised Learning
  • Clustering
  • Dimension reduction
15 Final Project

Conventions used in this book

The following typographical conventions are used in this book:

  • strong italic: indicates new terms,
  • bold: indicates package & file names,
  • inline code: monospaced highlighted text indicates functions or other commands that could be typed literally by the user,
  • code chunk: indicates commands or other text that could be typed literally by the user, for example:
    1 + 2
    ## [1] 3

In addition to the general text used throughout, you will notice the following cells that provide additional context for improved learning:

A video demonstrating this topic is available in Canvas.

A tip or suggestion that will likely produce better results.

A general note that could improve your understanding but is not required for the course requirements.

Warning or caution to look out for.

Knowledge check exercises to gauge your learning progress.

Feedback

To report errors or bugs that you find in this course material please post an issue at https://github.com/bradleyboehmke/uc-bana-4080/issues. For all other communication be sure to use Canvas or the university email.

When communicating with me via email, please always include BANA4080 in the subject line.

Acknowledgements

This course and its materials have been influenced by the following resources: