Statistical Computing#
Note
This repository contains additional resources for the UC BANA 6043 Statistical Computing course. The following is a truncated syllabus; for the full syllabus and the complete course content, please visit the online course in Canvas.
Welcome to Statistical Computing with Python! This course provides an intensive, hands-on introduction to statistical computing and data science with the Python programming language. You will gain foundational skills in managing data structures, performing data wrangling, computing and visualizing statistical relationships, managing various environments conducive to statistical analysis, and performing machine learning modeling. Most importantly, since this course only has time to introduce foundational skills, much emphasis is placed on giving you a mental model of Python’s data science ecosystem so you know how, when, and where to continue advancing your statistical computing capabilities.
Learning objectives#
Upon successfully completing this course, you will:
Have a mental model of the Python data science ecosystem: libraries, capabilities, vocabulary, and widely-available Python resources.
Have the ability to use Python within both interactive (Jupyter, REPL) and non-interactive (scripts) environments.
Be able to perform core data wrangling activities: importing data, reshaping data, transforming data, and exporting data.
Be able to compute descriptive statistics and visualize key patterns and relationships with your data.
Be exposed to modeling via scikit-learn and discuss the fundamentals of building models in Python.
Have the resources and understanding to continue advancing your statistical computing capabilities.
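As a rough illustration of the second, third, and fourth objectives, the sketch below imports a small data set, transforms one column, and computes descriptive statistics. It uses only the Python standard library so that it runs anywhere. The data, the column names, and the `summarize` function are hypothetical and are not taken from the course material, which works with DataFrames instead.

```python
"""A tiny import-transform-summarize script using only the standard library."""
import csv
import io
import statistics

# Hypothetical inline data standing in for a CSV file you would import.
RAW_CSV = """city,temp_f
Cincinnati,68
Columbus,71
Cleveland,65
"""


def summarize(csv_text):
    """Parse CSV text, convert Fahrenheit to Celsius, and summarize."""
    # Import: read each row into a dict keyed by the header names.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Transform: convert the temperature column from Fahrenheit to Celsius.
    temps_c = [(float(row["temp_f"]) - 32) * 5 / 9 for row in rows]
    # Summarize: compute a few descriptive statistics.
    return {
        "n": len(temps_c),
        "mean_c": round(statistics.mean(temps_c), 2),
        "median_c": round(statistics.median(temps_c), 2),
    }


if __name__ == "__main__":
    # This branch runs when the file is executed as a script.
    print(summarize(RAW_CSV))
```

Saved to a file, this runs non-interactively as a script. The same `summarize` function can also be imported and called from Jupyter or a REPL, which is the interactive workflow the second objective refers to.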
Note
This course assumes no prior knowledge of Python. Experience with programming concepts or another programming language will help, but is not required to understand the material.
Material#
The bulk of the classroom material will be provided via this book, the recorded lectures, and class notes. In some cases there are additional recommended readings, all of which are readily available online.
Class structure#
Modules: For this class, each module is covered over the course of a week. In the “Overview” section for each module you will find the overall learning objectives, a short description of the learning content covered in that module, and all tasks required of you for that module (i.e., quizzes and a lab). Each module will have two or more primary lessons with associated quizzes, along with a lab.
Lessons: For each lesson you will read and work through the tutorial. Short videos will be sprinkled throughout the lesson to further discuss and reinforce lesson concepts. Each lesson will have various “Your Turn” exercises throughout, along with end-of-lesson exercises. I highly recommend you work through these exercises as they will prepare you for the quizzes, labs, and project work.
Quizzes: There will be a short quiz associated with each lesson. These quizzes will be hosted in the course website on Canvas. Please check Canvas for due dates for these quizzes.
Labs: There will be a lab associated with each module. For these labs, students will be guided through a case study step by step. The aim is to provide a detailed view of how to manage a variety of complex real-world data, how to convert real problems into data wrangling and analysis problems, and how to apply Python to address these problems and extract insights from the data. Submission of these labs will be done through the course website on Canvas. Please check Canvas for due dates for these labs.
Project: The final project is designed for you to put to work the tools and knowledge that you gain throughout this course. This provides you with multiple benefits:
It will provide you with more experience using data science tools on real-life data sets.
It helps you become a self-directed learner. As a data scientist, a large part of your job is to self-direct your learning and interests to discover unique and creative ways to find insights in data.
It starts to build your data science portfolio. Establishing a data science portfolio is a great way to show potential employers your ability to work with data.
Schedule#
Note
See the Canvas course webpage for a detailed schedule with due dates for quizzes, labs, etc.
Module | Description |
---|---|
1 | Starting with the Basics: Introduction to JupyterLab and the notebook environment; Python fundamentals |
2 | Python Data Science Ecosystem & DataFrames: Modules, packages, and a preview of Python’s data science ecosystem; Importing data and working with DataFrames |
3 | Data Wrangling Part 1: Subsetting and manipulating data; Computing summary statistics at different levels |
4 | Data Wrangling Part 2: Tidying and joining data; Handling text data |
5 | Data Visualization: Higher and lower level plotting APIs; Interactive visualizations |
6 | Creating Efficient Code in Python: Control statements & iteration; Writing functions |
7 | Intro to Machine Learning with Scikit-Learn: Basics of the Scikit-learn API; Feature engineering and model evaluation/selection |
Conventions used in this book#
The following typographical conventions are used in this book:
strong italic: indicates new terms,
bold: indicates package & file names,
inline code: monospaced highlighted text indicates functions or other commands that could be typed literally by the user,
code chunk: indicates commands or other text that could be typed literally by the user
1 + 2
3
In addition to the general text used throughout, you will notice the following cells:
Tip
Signifies a tip or suggestion
Note
Signifies a general note
Warning
Signifies a warning or caution
Questions:
Knowledge check exercises to gauge your learning progress.
Video 🎥:
A short video on the topic is available to watch.
Feedback#
To report errors or bugs that you find in this course material, please post an issue at bradleyboehmke/uc-bana-6043#issues. For all other communication, be sure to use Canvas or your university email.
Note
When communicating with me via email, please always include BANA6043 in the subject line.
Acknowledgements#
This course and its materials are influenced by, or leverage, resources from the following:
Ethan Swan, Python for Data Science
Justin Bois, Caltech Intro to Programming for the Biological Sciences Bootcamp
Tomas Beuzen, Python Programming for Data Science
Inria, About Machine learning in Python with scikit-learn MOOC
Corey Schafer, Various Python playlists
Kevin Markham, Data School