BANA 4080: Data Mining

Author

Brad Boehmke

Published

June 26, 2025

Welcome

This book is currently in active development.

Welcome to BANA 4080: Introduction to Data Mining with Python. This course provides an immersive, hands-on introduction to the tools and techniques used in modern data science. You’ll learn how to explore, analyze, and model data using Python and gain practical experience through labs, projects, and real-world datasets.

Along the way you will develop core skills in data wrangling, exploratory data analysis, data visualization, and even key machine learning techniques such as supervised, unsupervised, and deep learning model. We’ll even take a quick detour into generative AI and large language models (LLMs). Throughout this process we’ll use real-world data and experiential learning to guide your learning.

By the end of the course, students will be able to:

  • Use Python to read, clean, transform, and visualize data
  • Apply exploratory and statistical techniques to understand datasets
  • Build and evaluate basic machine learning models
  • Gain exposure to cutting-edge techniques like deep learning and LLMs

Who should read this?

This book is designed for upper-level undergraduate students who may have little to no prior programming experience but are eager to explore the world of data science using Python. It’s also an ideal resource for early-career professionals or students in analytics, business, or quantitative fields who are looking to upskill—whether by learning Python for the first time or by building a deeper understanding of how to explore, visualize, and model data. The content is structured to be accessible and hands-on, guiding readers step-by-step through the core tools and techniques used in modern data-driven problem solving.

How this book is structured

This book is broken into 14 modules, each aligned with a week of instruction in the BANA 4080 course. Every module introduces key concepts or techniques in data science, combining concise explanations with interactive, hands-on code examples. Whether you’re reading independently or following along with the course, the modular structure makes it easy to work through the content at your own pace, week by week.

Module & Topics Summary of Concepts Covered
1. Fundamentals I Course overview, coding environment setup, Python basics
2. Fundamentals II Using Jupyter notebooks, data structures, Python libraries
3. Pandas DataFrames Importing data, DataFrame fundamentals, subsetting DataFrames
4. Data Wrangling I Cleaning, filtering, aggregating, and merging tabular data
5. Data Wrangling II Working with datetime, text data, and joining data like SQL
6. Data Visualization Creating plots using matplotlib and seaborn, exploratory data analysis
7. Writing Efficient Python Code Control flow, defining functions, loops, list comprehensions
8. Introduction to Machine Learning Overview of ML, features/labels, train/test split, scikit-learn basics
9. Unsupervised Learning Clustering (k-means), PCA, dimensionality reduction, t-SNE visualization
10. Supervised Learning Regression and classification models: linear, logistic regression
11. Deep Learning & Neural Networks Neural networks using Keras; simple classification tasks
12. Generative AI & Prompt Engineering Working with LLMs, OpenAI API, prompt design, building AI agents
13. Final Project Kickoff Scoping and starting a capstone data science project
14. Final Project Presentations & Wrap-Up Presenting project findings, course reflection, and next steps

Conventions used in this book

The following typographical conventions are used in this book:

  • strong italic: indicates new terms,
  • bold: indicates package & file names,
  • inline code: monospaced highlighted text indicates functions or other commands that could be typed literally by the user,
  • code chunk: indicates commands or other text that could be typed literally by the user
1 + 2
3

In addition to the general text used throughout, you will notice the following code chunks with images:

Signifies a tip or suggestion

Signifies a general note

Signifies a warning or caution

Software used throughout this book

This book is built around an open-source Python-based data science ecosystem. While the list of tools evolves with the field, the examples and exercises in this book are designed to work with Python 3.x, currently using…

Code
# Display the Python version
import sys
print("Python version:", sys.version.split()[0])
Python version: 3.13.5

…and are executed within Jupyter Notebooks, which provide an interactive, beginner-friendly environment for writing and running code.

Throughout the modules, we use foundational Python libraries such as:

  • pandas and numpy for data wrangling and numerical computing,
  • matplotlib and seaborn for data visualization,
  • scikit-learn and keras for machine learning and deep learning, and
  • openai and transformers for generative AI and large language model exploration.

Each module explicitly introduces the relevant software and libraries, explains how and why they are used, and provides reproducible code so that readers can follow along and generate similar results in their own environment.

Additional resources

TBD