4 Putting DataOps into Practice
Reading Time: 45-60 minutes
In the previous chapter, we explored the foundational concepts of DataOps, focusing on the importance of processes such as data ingestion, validation, versioning, and lineage in building reliable machine learning systems. Now, it’s time to take these concepts from theory to practice by applying them to create data pipelines. Data pipelines form the backbone of any modern ML system, serving as the structured, automated pathways through which data flows from raw sources to actionable insights.
A data pipeline is a series of interconnected steps that collect, process, validate, and transform data into formats ready for machine learning workflows. Whether you’re ingesting data from multiple sources, cleaning and preparing it for modeling, or ensuring its integrity through validation and versioning, pipelines make it possible to handle these tasks efficiently and consistently. They are essential for ensuring scalability, reliability, and repeatability in ML systems, especially in environments where data is constantly changing or arriving in real-time.
This chapter will guide you through the practical implementation of DataOps principles by developing end-to-end data pipelines. We’ll begin with an introduction to the components of a data pipeline and the tools commonly used to implement them. From there, we’ll delve into the step-by-step process of building a pipeline, starting with a simple example and progressing to techniques for creating scalable workflows that can handle complex, large-scale ML applications.
By the end of this chapter, you’ll have a deeper understanding of how to translate DataOps practices into actionable pipelines, setting the stage for real-world machine learning deployments. Whether you’re working with batch or streaming data, structured or unstructured datasets, this chapter will equip you with the tools and strategies to design robust data pipelines that meet the demands of modern ML workflows.
4.1 Introduction to Data Pipelines
In the world of machine learning (ML), the success of a system hinges on the quality, reliability, and timeliness of the data that powers it. Data pipelines are the operational backbone of these systems, ensuring data flows seamlessly from its source, through various transformations, and into the hands of models or end-users. A data pipeline is a series of automated steps that enable data ingestion, processing, and delivery—integrating the concepts of DataOps to ensure scalability, consistency, and efficiency.
In the previous chapter, we explored the fundamental principles of DataOps, including data ingestion, processing, validation, versioning, and lineage. These processes establish the foundation for building robust data pipelines. Now, we turn our attention to applying these principles in practice to design end-to-end workflows that support machine learning systems.
The Importance of Data Pipelines in ML System Design
Data pipelines address critical challenges faced by ML workflows, such as managing large data volumes, handling diverse data formats, and maintaining data integrity. By automating repetitive and error-prone tasks like ingestion, cleaning, and transformation, pipelines ensure that data is prepared and delivered reliably.
From the concepts introduced earlier, pipelines incorporate:
- Data Ingestion: Automating the flow of data from various sources—whether databases, APIs, or real-time streams—into a centralized system.
- Data Processing: Applying transformations, feature engineering, and cleaning steps to structure raw data into formats ready for model consumption.
- Data Validation: Embedding quality checks to ensure the consistency and integrity of data at every stage of the pipeline.
- Data Versioning and Lineage: Providing traceability and reproducibility by maintaining historical records of datasets and documenting their transformations.
For example, in a customer churn prediction system, a data pipeline might ingest transaction data from a database, clean and preprocess it to handle missing values and inconsistencies, and validate that the data conforms to schema standards before feeding it into an ML model. Each step would leverage the DataOps principles discussed earlier to ensure that the process is efficient, scalable, and error-resistant.
The Role of DataOps in Constructing Scalable and Reliable Pipelines
As explored in the previous chapter, DataOps is not just a set of practices but a philosophy that ensures data workflows are collaborative, automated, and aligned with business objectives. When applied to pipeline construction, DataOps enables:
- Scalability: Modular design allows pipelines to handle growing data volumes and adapt to new sources or transformations without significant redesigns.
- Reliability: Continuous validation, versioning, and monitoring ensure that pipelines consistently deliver high-quality data, even in dynamic environments.
- Collaboration: DataOps principles encourage communication between data engineers, analysts, and machine learning practitioners, ensuring pipelines meet diverse needs.
- Traceability: By tracking data lineage and applying version control, DataOps ensures every step in the pipeline is documented, facilitating debugging, compliance, and reproducibility.
Consider a recommendation system that ingests real-time user behavior data from an API and combines it with historical purchase data stored in a data lake. A DataOps-driven pipeline would (see the sketch following this list):
- Ingest and Validate: Fetch data from both sources, validating schema consistency and ensuring data quality.
- Process and Transform: Standardize features such as timestamps, handle missing values, and engineer inputs like “average time between purchases.”
- Track Lineage: Document the transformations applied to both real-time and historical data to ensure traceability and compliance.
- Version Datasets: Maintain snapshots of preprocessed data for future analysis, debugging, or retraining.
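To make these stages concrete, here is a minimal Python sketch of such a pipeline. The function names, URLs, and file paths are hypothetical placeholders rather than a specific implementation, and the feature engineering is reduced to a single derived column.

```python
import pandas as pd

def ingest_and_validate() -> pd.DataFrame:
    # Hypothetical sources: real-time events from an API plus historical purchases in a data lake
    events = pd.read_json("https://api.example.com/user-events")                   # placeholder URL
    history = pd.read_parquet("s3://example-data-lake/purchases/history.parquet")  # placeholder path
    combined = events.merge(history, on="user_id", how="left")
    # Lightweight schema/quality check before anything downstream runs
    assert {"user_id", "timestamp", "amount"}.issubset(combined.columns)
    return combined

def process_and_transform(df: pd.DataFrame) -> pd.DataFrame:
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)  # standardize timestamps
    df = df.dropna(subset=["amount"])                            # handle missing values
    # Engineer an input such as "average time between purchases" per user
    df = df.sort_values(["user_id", "timestamp"])
    df["days_since_last_purchase"] = df.groupby("user_id")["timestamp"].diff().dt.days
    df["avg_days_between_purchases"] = (
        df.groupby("user_id")["days_since_last_purchase"].transform("mean")
    )
    return df

# Lineage tracking and dataset versioning would be layered on top of these steps,
# for example with DVC snapshots of the transformed output (shown later in this chapter).
```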
By applying the foundational concepts of DataOps, such pipelines are not only robust and scalable but also agile enough to adapt to evolving ML requirements.
In this chapter, we will build on the foundational knowledge from the previous chapter to design and implement data pipelines that embody the principles of DataOps. Through hands-on examples and practical guidance, you will learn how to construct workflows that transform raw data into reliable inputs for machine learning systems, setting the stage for scalable and effective solutions.
4.2 ETL vs. ELT Paradigms
When designing data pipelines, there are two primary paradigms: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT). These paradigms represent distinct approaches to organizing and processing data as it flows through pipelines, each with its own strengths and trade-offs. Choosing between ETL and ELT (or a hybrid approach) requires an understanding of your organization’s technical infrastructure, data requirements, and use cases.
Extract, Transform, Load (ETL)
The ETL paradigm embodies a traditional, structured approach to data pipelines. It follows a linear process:
- Extract data from various sources, such as relational databases, APIs, or CSV files.
- Transform the extracted data into a clean, structured format using techniques like cleaning, aggregation, and enrichment.
- Load the transformed data into a target system, such as a relational database or data warehouse.
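Visually, the ETL flow places all processing ahead of the storage layer, the mirror image of the ELT diagram shown later in this section:

```mermaid
flowchart LR
  ob1(Raw data source) --> processing[Data processing]
  ob2(Raw data source) --> processing
  ob3(Raw data source) --> processing
  processing --> id3[(Data storage)]
```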
ETL pipelines focus on preparing high-quality, ready-to-use data before it enters the storage system. In ETL workflows, the data processing phase discussed earlier plays a central role. Transformations are performed early in the pipeline to ensure data quality and structure before it reaches the storage layer.
When to Use ETL:
- When working with traditional data warehouses or legacy systems that prioritize structured and clean data.
- In use cases requiring strict data governance and pre-defined schemas, such as regulatory compliance reporting.
- When downstream applications depend on consistent, pre-processed datasets.
Strengths:
- Simplifies downstream workflows by ensuring data is pre-processed and ready for use.
- Reduces storage costs by retaining only clean and structured data.
- Aligns well with environments that require high data integrity.
Limitations:
- Upfront transformations can delay data availability.
- Scaling ETL pipelines to handle large datasets or real-time workflows can be resource-intensive.
- Rigid processes may struggle to adapt to rapidly changing data needs.
Example: Imagine a financial institution that extracts transaction data, transforms it into a consistent format by applying currency conversions and anomaly detection, and loads the cleaned data into a warehouse for fraud detection reports.
Extract, Load, Transform (ELT)
The ELT paradigm, a more modern approach, inverts the transformation and loading steps:
- Extract data from various sources, often retaining its raw format.
- Load the extracted data directly into a scalable storage system, such as a data lake or cloud-based data warehouse.
- Transform the data within the storage system using its computational resources, tailoring transformations to specific analytical needs.
```mermaid
flowchart LR
  ob1(Raw data source) --> id3[(Data storage)]
  ob2(Raw data source) --> id3
  ob3(Raw data source) --> id3
  id3 --> processing[Data processing] --> id3
```
In ELT workflows, the data ingestion phase emphasizes rapid loading of raw data into storage, enabling flexibility for later transformations. Also, by storing the data in its raw form, ELT handles structured, semi-structured, and unstructured data more gracefully and makes it easier for downstream analytics to adapt as needs change.
When to Use ELT:
- When leveraging cloud-native platforms like Snowflake, BigQuery, or AWS Redshift, which support high-speed storage and in-database transformations.
- In workflows requiring rapid ingestion and the ability to adapt transformations for different use cases.
- When storing raw data is essential for exploratory analysis or regulatory purposes.
Strengths:
- Enables faster data ingestion by deferring transformations.
- Supports diverse and flexible transformations tailored to specific analyses.
- Scales well with large data volumes, leveraging modern storage and processing systems.
Limitations:
- Requires advanced infrastructure to handle and process raw data efficiently.
- Higher storage costs due to the retention of raw, unprocessed data.
- Ad-hoc transformations can lead to inconsistencies if not governed properly.
Example: An e-commerce platform ingests raw clickstream data into a cloud data warehouse. When needed, transformations such as sessionizing user behavior or aggregating purchase trends are applied directly within the warehouse for personalized recommendations or sales forecasting.
Comparing ETL and ELT
| Aspect | ETL | ELT |
|---|---|---|
| Data Transformation | Before loading into storage | After loading into storage |
| Storage Requirements | Lower, as only processed data is stored | Higher, as raw data is retained |
| Processing Time | Slower ingestion due to upfront transformations | Faster ingestion with deferred transformations |
| Flexibility | Limited; transformations are predefined | High; transformations can be ad hoc |
| Infrastructure | Suitable for legacy systems or traditional data warehouses | Ideal for modern, scalable systems |
As highlighted in the previous chapter, the goals of DataOps — efficiency, reliability, and scalability — should guide the choice of ETL, ELT, or a hybrid approach. Consider:
- ETL is well-suited for structured environments and use cases that demand immediate access to clean, well-processed data.
- ELT shines in cloud-native or big data ecosystems where raw data flexibility and scalability are critical.
Both paradigms have their strengths, and many organizations blend elements of each. For example, a retail company might use ETL for compliance reporting while leveraging ELT for real-time inventory analysis. By understanding these paradigms within the broader DataOps framework, teams can design pipelines that meet both technical and business requirements effectively.
4.3 Tools for DataOps
Building effective and scalable data pipelines requires leveraging specialized tools at each stage of the DataOps lifecycle. From data ingestion to processing, validation, and versioning, the ecosystem of tools available is vast and continually expanding. This section provides an overview of commonly used tools, highlighting their capabilities, trade-offs, and when they might be most appropriate. Additionally, this space is dynamic, with new tools regularly introduced to address evolving challenges. Therefore, the goal is not to become tied to a specific tool but to understand the broader landscape and make informed decisions based on your pipeline’s requirements.
Data Ingestion Tools
Efficiently gathering data from diverse sources is the foundation of any pipeline. Ingestion tools vary from those tailored for real-time streaming to solutions focused on batch processing and ease of integration. Some examples include:
Apache Kafka
A distributed event-streaming platform designed for high-throughput, real-time data ingestion. Kafka is widely used in applications like fraud detection, IoT data pipelines, and real-time analytics.
- When to Use: Ideal for high-velocity, low-latency systems.
- Limitations: Steep learning curve and resource-heavy for smaller use cases.
Apache NiFi
A flow-based programming tool that automates data movement and transformation with a user-friendly interface. NiFi supports a wide range of data sources and formats.
- When to Use: Low-code scenarios requiring rapid integration of multiple data sources.
- Limitations: Less suited for complex, high-throughput pipelines.
An open-source data integration platform focused on moving data into data warehouses or lakes. Its extensive library of connectors simplifies ingesting data from various APIs and databases.
- When to Use: Batch ingestion with diverse source compatibility.
- Limitations: Primarily batch-focused and may require configuration for real-time pipelines.
A managed data integration tool that simplifies data ingestion by automating schema handling and updates. It’s widely used for moving data into analytics-ready storage solutions.
- When to Use: Enterprise scenarios requiring minimal management overhead for cloud data integration.
- Limitations: Subscription-based pricing and limited customization compared to open-source tools.
Data Processing Tools
Processing raw data into clean, structured, and feature-rich formats is critical to preparing it for machine learning workflows. Many tools exist, but the following are a few popular options that support various scales and complexities of data transformation.
Apache Spark
A powerful engine for distributed data processing. Spark supports both batch and streaming data workflows, making it suitable for big data and ML pipelines.
- When to Use: Processing large-scale data across distributed systems.
- Limitations: Requires expertise and infrastructure, which might be overkill for smaller datasets.
dbt
A transformation tool that applies SQL to prepare data for analytics workflows. dbt is particularly useful for pipelines that involve data warehouses.
- When to Use: SQL-centric environments where analytics-ready data is the end goal.
- Limitations: Focused on transformations within SQL databases; less versatile for non-SQL workflows.
An open source Python & Rust library for data manipulation and analysis, offering rich functionality for handling structured data with performance in mind.
- When to Use: Small to large datasets or for prototyping workflows. A great substitute for Pandas when performance and speed are required.
- Limitations: Not designed for real-time processing.
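To give a feel for the API, here is a minimal, self-contained sketch; the column names and thresholds are made up for illustration.

```python
import polars as pl

# Small illustrative dataset
df = pl.DataFrame({
    "video_id": ["a1", "b2", "c3", "d4"],
    "views": [1200, 85000, 300, 47000],
    "likes": [100, 4200, 12, 1800],
})

# Filter low-view videos, derive a like rate, and sort by it
result = (
    df.filter(pl.col("views") > 1000)
      .with_columns((pl.col("likes") / pl.col("views")).alias("like_rate"))
      .sort("like_rate", descending=True)
)
print(result)
```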
A workflow orchestration tool that enables robust scheduling and monitoring of data processing tasks.
- When to Use: Complex pipelines requiring orchestration of multiple processing steps.
- Limitations: Adds an orchestration layer, which might be unnecessary for simple workflows.
Data Validation Tools
Ensuring the integrity and accuracy of data is essential to building reliable pipelines. Validation tools help detect anomalies, enforce schema compliance, and improve data quality. Some popular data validation tools include:
Great Expectations
A framework for defining, executing, and documenting data validation checks. It integrates well with modern data stacks.
- When to Use: Custom validation rules with automated reporting needs.
- Limitations: May require significant customization for non-standard checks.
TensorFlow Data Validation (TFDV)
A library designed for detecting anomalies and validating schema in ML datasets.
- When to Use: TensorFlow-based workflows or pipelines requiring statistical validation.
- Limitations: Limited applicability outside of TensorFlow-centric environments.
Deequ
A library developed by AWS for validating large datasets using Spark. It focuses on automated quality checks and anomaly detection.
- When to Use: Spark-based pipelines requiring scalable data quality checks.
- Limitations: Limited compatibility with non-Spark environments.
Data Versioning Tools
Versioning ensures traceability and reproducibility by tracking changes to datasets and workflows. It is vital for debugging, compliance, and collaboration. Some common data versioning tools include:
DVC (Data Version Control)
A Git-like tool for versioning data, models, and pipelines. It integrates seamlessly with Git repositories.
- When to Use: Managing medium to large datasets in collaborative environments.
- Limitations: Can be challenging to set up for teams unfamiliar with Git.
Delta Lake
A storage layer that adds versioning and ACID transactions to data lakes. It works well with distributed systems like Apache Spark.
- When to Use: Large-scale pipelines needing robust versioning and consistency.
- Limitations: Requires Spark for full functionality, adding complexity to smaller-scale workflows.
lakeFS
A Git-like version control system for data lakes. It enables branching and snapshotting of data workflows.
- When to Use: Data lakes requiring advanced version control features.
- Limitations: Best suited for teams already working with data lakes.
4.4 Hands-On Example: A YouTube Data Pipeline
Now that you’ve learned a bit about data pipelines, common paradigms, and tooling involved, let’s apply some of these concepts to create a simplified data pipeline that demonstrates how to ingest, process, and prepare YouTube data for downstream machine learning (ML) workflows. We’ll focus on collecting data like video titles, transcripts, views, likes, and comments. This example ties together concepts from previous chapters, including data ingestion, processing, validation, and versioning.
Pipeline Design
The pipeline design we'll use resembles an ETL process more than an ELT one. This is mainly for storage simplicity, since the pipeline runs locally (whether on my machine or yours) rather than in a cloud environment where storage capacity is less of a constraint.
We’ll use a couple of APIs to ingest our data, perform a little bit of data processing to clean and prepare our data for future analyses, and then illustrate some basic data validation along with versioning and storing our final prepared data.
```mermaid
flowchart LR
  subgraph ingest[Data Ingestion]
    direction LR
    subgraph p1[Ingest Video IDs]
    end
    subgraph p2[Ingest Video Stats]
    end
    subgraph p3[Ingest Video Transcript]
    end
  end
  p1 --> p2
  p1 --> p3
  ingest --> process(Process Raw Data)
  process --> Validate --> Version --> data[(Data storage)]
```
Basic Requirements
If you’d like to follow along and execute this code then there are a few requirements you’ll need on your end.
You can find the requirements, source code for the data pipeline, and helper functions here.
First, you'll need to make sure you have the required Python libraries installed. Based on the tools used in this chapter, that includes at least pandas, great_expectations, and dvc; the full list is provided in the linked requirements file.
Next, to simplify this example I have created helper functions that abstract away many of the finer code details; these are imported from the `dataops_utils.py` module. If you want to reproduce this pipeline, you can download the `dataops_utils.py` script here.
Lastly, we'll be using the YouTube Data API to extract metadata and video information. To follow along, ensure you have:
- A Google Cloud account.
- Access to the YouTube Data API, including an API key. See here to get started.
Once you have a YouTube API key, you're ready to go! The next code chunk sets your API key and establishes the API URL and the channel ID. Note: I'm a golf junkie, so the channel ID I am using is for Grant Horvat, a YouTube golfer with nearly 1 million followers.1
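The original setup chunk isn't reproduced here, so the following is a minimal sketch. The variable names are illustrative assumptions; the base URL is the standard YouTube Data API v3 endpoint, and the channel ID comes from the footnote at the end of this chapter.

```python
import os

# YouTube Data API v3 configuration (illustrative variable names)
YT_API_KEY = os.environ["YT_API_KEY"]                  # set your API key in your environment
YT_API_URL = "https://www.googleapis.com/youtube/v3"   # base URL for the YouTube Data API
CHANNEL_ID = "UCgUueMmSpcl-aCTt5CuCKQw"                # Grant Horvat's channel ID
```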
Data Ingestion
The first step is to ingest the YouTube data. To do so, we need to follow three steps (a sketch using the helper functions appears after this list):
- Ingest all the video IDs provided by a given channel.
- Use the ingested video IDs to ingest YouTube stats for each video, such as total views, likes, and comments.
- And lastly, ingest the transcript for each video.
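The helper functions in `dataops_utils.py` wrap the actual API calls. Their exact names and signatures aren't shown here, so the sketch below uses hypothetical helpers that mirror the three steps above.

```python
# Hypothetical helper names -- the real functions live in dataops_utils.py
from dataops_utils import get_channel_video_ids, get_video_stats, get_video_transcripts

# 1. Ingest all video IDs published by the channel
video_ids = get_channel_video_ids(api_key=YT_API_KEY, channel_id=CHANNEL_ID)

# 2. Ingest per-video statistics (views, likes, comments) for those IDs
video_stats = get_video_stats(api_key=YT_API_KEY, video_ids=video_ids)

# 3. Ingest the transcript for each video
video_transcripts = get_video_transcripts(video_ids=video_ids)
```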
Data Processing
For data preprocessing, we're going to first convert our data to a Pandas DataFrame:
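Assuming the ingested stats and transcripts come back as lists of records keyed by video ID (an assumption, since the helper return types aren't shown), the conversion might look like this:

```python
import pandas as pd

# Combine stats and transcripts into a single table, one row per video (illustrative)
df = pd.DataFrame(video_stats).merge(
    pd.DataFrame(video_transcripts), on="video_id", how="left"
)
```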
And then clean our data (a condensed code sketch follows this list) by:
- Removing missing values. There are a few videos with no transcript text because the videos have no talking in them.
- Removing duplicate observations, just in case video information was duplicated during the downloading process.
- Removing inconsistent data types. This avoids errors during data processing and ensures consistency in operations applied to the data.
- Removing observations that have invalid datetime values. This ensures chronological accuracy for any time-based analysis or trends.
- Removing videos with a minimal number of views. This filters out content that may not provide enough engagement data for meaningful insights.
- Removing videos with very little transcript text. This ensures that the remaining data contains sufficient content for natural language processing or text-based analysis.
- Cleaning the title and transcript text. This removes unnecessary noise, such as non-character string values (i.e., Unicode characters), making the text suitable for analysis.
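Below is a condensed sketch of these cleaning steps in Pandas. The thresholds (500 views, 100 transcript characters) and the regex used to strip noise are illustrative choices, not the exact values from the original pipeline.

```python
import re
import pandas as pd

cleaned_data = (
    df.dropna(subset=["transcript"])         # remove videos with no transcript text
      .drop_duplicates(subset=["video_id"])  # remove duplicate observations
      .copy()
)

# Enforce consistent data types and drop rows with invalid datetimes
for col in ["views", "likes", "comments"]:
    cleaned_data[col] = pd.to_numeric(cleaned_data[col], errors="coerce")
cleaned_data["datetime"] = pd.to_datetime(cleaned_data["datetime"], errors="coerce")
cleaned_data = cleaned_data.dropna(subset=["views", "likes", "comments", "datetime"])
cleaned_data[["views", "likes", "comments"]] = cleaned_data[["views", "likes", "comments"]].astype("int64")

# Filter out low-engagement videos and very short transcripts (illustrative thresholds)
cleaned_data["transcript_length"] = cleaned_data["transcript"].str.len().astype("int64")
cleaned_data = cleaned_data[(cleaned_data["views"] >= 500) & (cleaned_data["transcript_length"] >= 100)]

# Strip non-ASCII noise from the title and transcript text
strip_noise = lambda text: re.sub(r"[^\x00-\x7F]+", " ", text).strip()
cleaned_data["title"] = cleaned_data["title"].map(strip_noise)
cleaned_data["transcript"] = cleaned_data["transcript"].map(strip_noise)
```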
While there are many advanced techniques for preprocessing text data—such as more thorough cleaning, lemmatization, or converting text into embeddings for model input—the intent of this section is to focus on building a simplified example of an end-to-end data pipeline. These additional steps can significantly enhance the quality and utility of text data, but they are outside the scope of this example, which is designed to highlight the core concepts and practical steps involved in creating a data pipeline.
Data Validation
Next, we'll validate our data. To do so, we'll use Great Expectations and, first, we need to create a data context and define our data assets.
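The original setup code isn't shown, and the Great Expectations API has changed across major releases; the sketch below is one way to do it with the fluent 1.x API, registering the cleaned Pandas DataFrame as a data asset.

```python
import great_expectations as gx

# Create a Data Context and register the cleaned DataFrame as a data asset
context = gx.get_context()
data_source = context.data_sources.add_pandas(name="youtube_pipeline")
data_asset = data_source.add_dataframe_asset(name="youtube_video_data")

# A batch definition that treats the whole DataFrame as a single batch
batch_definition = data_asset.add_batch_definition_whole_dataframe("youtube_batch")
batch = batch_definition.get_batch(batch_parameters={"dataframe": cleaned_data})
```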
Now that the infrastructure is set up, we’ll go through a process of validating that our data meets certain expectations. This code defines a data validation suite to ensure the integrity of our cleaned YouTube dataset. It consists of three primary validation steps:
- Column Existence Validation: It verifies that all required columns (`channel_id`, `video_id`, `datetime`, `title`, `views`, `likes`, `comments`, `transcript`, and `transcript_length`) are present in the dataset. This step ensures that the essential structure of the dataset is intact.
- Data Type Validation: It checks that each column contains values of the expected data type, such as `Object` for textual data (`channel_id`, `video_id`, `title`, and `transcript`), `Timestamp` for date-related fields, and `int64` for numerical fields (`views`, `likes`, `comments`, and `transcript_length`). This ensures that data types align with the intended use of each column.
- Null Value Validation: It confirms that no empty or null values exist in the critical columns. This guarantees that the dataset is complete and avoids errors caused by missing data in downstream processes.
Finally, the code executes the validation suite against a data batch and prints whether all validation checks were successful. This ensures the dataset meets the defined quality standards before being used in the data pipeline.
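Continuing the hedged 1.x sketch from above, the three validation steps can be expressed as individual expectations and run against the batch:

```python
columns = ["channel_id", "video_id", "datetime", "title", "views",
           "likes", "comments", "transcript", "transcript_length"]

expectations = (
    # 1. Column existence
    [gx.expectations.ExpectColumnToExist(column=col) for col in columns]
    # 2. Data types (a couple of representative checks)
    + [gx.expectations.ExpectColumnValuesToBeOfType(column="views", type_="int64"),
       gx.expectations.ExpectColumnValuesToBeOfType(column="title", type_="object")]
    # 3. No null values in critical columns
    + [gx.expectations.ExpectColumnValuesToNotBeNull(column=col) for col in columns]
)

results = [batch.validate(exp) for exp in expectations]
print("All validation checks passed:", all(r.success for r in results))
```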
The validation checks implemented here represent just a small subset of the capabilities available with Great Expectations. This powerful tool provides a wide array of validation checks, enabling you to ensure data quality at a much deeper level. For instance, you can validate the distribution of values to check for expected statistical patterns, ensure cardinality constraints to avoid duplicate or excessive values, or verify that data adheres to specific patterns (e.g., email formats or numeric ranges). These additional validations can further enhance the robustness and reliability of your data pipelines. Explore the full range of options here: Great Expectations Documentation - Expectations.
Data Versioning
For our purposes, we’re going to store the final cleaned_data
to our project directory and then we’re going to use DVC (Data Version Control) to version and track changes to our data.
DVC assumes you are working within a Git repository, as it relies on Git for tracking the metadata of your data files and pipelines.
- If you're familiar with Git: You can follow along with this example as long as you are working within a Git repository. Make sure you have initialized a Git repository (`git init`) in your project directory and have basic knowledge of Git commands.
- If Git is new to you: Don't worry! You can still follow along to understand the process conceptually, even if you don't execute the Git commands. Later chapters in this book will provide a deeper dive into Git, equipping you with the knowledge to implement version control effectively.
For now, focus on understanding how DVC integrates with data pipelines to manage and version datasets systematically.
So, first, we'll store our final data as a Parquet file, which is a more storage-efficient format than a CSV file.
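The write itself is a single call (the same line reappears later when the pipeline is re-run with new data):

```python
cleaned_data.to_parquet('data/youtube_video_data.parquet', index=False)
```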
Next, we need to initialize DVC in our project by running the following command in the project root.
dvc init
This initializes a .dvc directory with a few internal files to track data and pipeline versions. These should be added to Git.
git status
Changes to be committed:
new file: .dvc/.gitignore
new file: .dvc/config
...
git commit -m "Initialize DVC"
Next, you need to use `dvc add` to start tracking the dataset file.
dvc add data/youtube_video_data.parquet
This command creates a `.dvc` file (youtube_video_data.parquet.dvc) to track the file's metadata and location. It also adds a `.gitignore` file so that the actual .parquet data file is not committed to Git; instead, the .parquet.dvc metadata file is committed.
git add data/youtube_video_data.parquet.dvc data/.gitignore
Next, run the following commands to track and tag the dataset changes in Git.
git commit -m 'Initial processed Youtube data'
git tag -a "v1.0" -m "Youtube data v1.0"
If you have a remote storage location (e.g., AWS S3, Google Drive, Azure Blob Storage) configured for DVC, you can also push the data asset to that location to ensure it is stored offsite and shareable with the rest of your team or organization.
dvc remote add -d myremote s3://mybucket/myproject
dvc push
Now, say we re-run this pipeline next week and get additional video data. We can just follow this same procedure to save and version the updated data:
# Write the cleaned data to a parquet file
cleaned_data.to_parquet('data/youtube_video_data.parquet', index=False)
# Re-track the updated file with DVC (this updates the .dvc metadata file)
dvc add data/youtube_video_data.parquet
git add data/youtube_video_data.parquet.dvc
# Commit and tag the changes
git commit -m 'Updated Youtube data'
git tag -a "v2.0" -m "Youtube data v2.0"
Our example demonstrates only the basic functionality of DVC, showcasing how it can be used to version datasets within a pipeline. However, DVC offers a wide range of additional features, including:
- Tracking and managing entire pipelines.
- Handling large datasets efficiently with external storage integrations.
- Automating experiment tracking and comparisons.
To explore the full potential of DVC and learn how to leverage its advanced capabilities, check out the official documentation at https://dvc.org/doc. This resource provides comprehensive guidance and examples for incorporating DVC into robust, end-to-end workflows.
4.5 Creating Reliable & Scalable Data Pipelines
Designing reliable and scalable data pipelines requires adherence to the foundational principles discussed earlier, which promote maintainability, reproducibility, and efficiency. The hands-on example of the YouTube data pipeline demonstrates some of these design principles; however, there are areas where improvements could further align the pipeline with these principles.
Principles Incorporated in the YouTube Data Pipeline
Modularity and Abstraction
- What We Did: The pipeline uses a modular design by encapsulating reusable functions within the `dataops_utils` module. This abstraction reduces complexity and makes the codebase more manageable, enabling easier updates and debugging.
- Why It Matters: Modularity ensures that individual components, such as data cleaning, validation, and versioning, are independently testable and maintainable. Abstraction allows users to focus on higher-level logic without worrying about low-level details.
Reproducibility and Versioning
- What We Did: By incorporating Git for version control and DVC for dataset tracking, the pipeline ensures that every step—from data ingestion to cleaning—can be reproduced with consistent results.
- Why It Matters: Reproducibility is essential for debugging, auditing, and collaboration, especially when multiple team members work on the same project or when retraining models on historical data.
Data Validation
- What We Did: The pipeline integrates Great Expectations to validate the structure and quality of the dataset. This ensures that the data meets predefined criteria before it is processed further.
- Why It Matters: Validation prevents poor-quality data from contaminating downstream workflows, thereby enhancing reliability and trust in the pipeline.
Limitations and Areas for Improvement
API Quota Limits
- Current Challenge: The YouTube API imposes daily quota limits on data retrieval, constraining the volume of data that can be ingested. This makes the pipeline less scalable for large-scale applications.
- Potential Solution: To overcome this limitation, consider implementing data caching strategies, batching requests over multiple days, or leveraging additional API keys across multiple accounts to distribute the load.
Lack of Automation
- Current Challenge: The pipeline is not automated and requires manual execution for each step. This limits its reliability and scalability in production environments.
- Potential Solution: Incorporate workflow automation tools such as Apache Airflow or Prefect to schedule and monitor the pipeline. Automating these tasks would enhance reliability and reduce manual intervention.
Error Handling
- Current Challenge: The pipeline lacks mechanisms to handle failures gracefully, such as retrying failed API calls or dealing with incomplete datasets.
- Potential Solution: Introduce error-handling routines and logging mechanisms to ensure that the pipeline can recover from failures without disrupting downstream processes.
Scalability
- Current Challenge: The pipeline is designed for small-scale use and does not leverage distributed computing or scalable storage solutions.
- Potential Solution: Transition to cloud-based storage systems like AWS S3 or Google Cloud Storage and integrate distributed computing frameworks such as Apache Spark for handling large datasets.
Balancing Design Principles
While the YouTube data pipeline incorporates key design principles such as modularity, abstraction, and reproducibility, there are trade-offs due to its simplicity. The primary objective of this example was to illustrate the foundational steps in building a data pipeline, rather than creating a production-ready system. For real-world applications, enhancing scalability, automation, and fault tolerance would be critical to align fully with the design principles of reliable and scalable systems.
By continuously iterating on these principles, teams can evolve simple pipelines into robust, production-ready systems capable of handling complex, large-scale workflows.
4.6 Summary
In this chapter, we explored the practical aspects of building data pipelines to support robust machine learning workflows. By combining the principles and concepts introduced in the previous chapter, we demonstrated how to design, implement, and validate a data pipeline using tools like Pandas, Great Expectations, and DVC. The YouTube data pipeline provided a hands-on example of how data ingestion, processing, validation, and versioning can come together to create a cohesive system.
We also compared the ETL and ELT paradigms, highlighting their applications and trade-offs, and discussed how the design principles for good ML systems—such as modularity, abstraction, and reproducibility—can guide the creation of reliable and scalable pipelines. While the example illustrated foundational techniques, it also highlighted areas for improvement, such as addressing scalability and automation challenges.
This chapter aimed to provide you with the foundational knowledge and tools to build data pipelines tailored to your needs. As you advance, you'll encounter more complex requirements and additional tools to refine and scale your workflows, and the next chapter builds on these foundations.
4.7 Exercise
This exercise will help you apply the concepts covered in this chapter to design and evaluate a data pipeline. The focus is on understanding the design principles, tools, and processes involved in building reliable and scalable data pipelines.
Scenario: Imagine you are tasked with building a data pipeline for an online retail store. The pipeline should:
- Ingest daily sales data from a transactional database and inventory updates from an API.
- Process the data to calculate daily revenue and identify low-stock items.
- Validate the data for missing or inconsistent records.
- Version the processed data for auditing and historical tracking.
Task:
- Sketch a conceptual design for this pipeline. Include the steps for ingestion, processing, validation, and versioning.
- Identify which tools from this chapter (e.g., Pandas, Great Expectations, DVC) you would use for each step and justify your choice.
Using the YouTube pipeline example as a reference, answer the following:
- Modify the Pipeline:
  - Extend the provided YouTube pipeline to include additional validation checks (e.g., ensure `views` is greater than `likes` or that `transcript_length` is within a realistic range).
  - What changes did you make to the validation suite, and why?
- Experiment with Versioning:
  - Create a new version of the `cleaned_data` DataFrame by simulating a change in the input data (e.g., remove videos with fewer than 10,000 views).
  - Use DVC to version the new dataset and compare it to the original version. What do the differences reveal?
- Evaluate the Pipeline:
- Review the YouTube pipeline example and your modified pipeline. How well do they incorporate the design principles from Chapter 1 (e.g., modularity, scalability, reproducibility)?
- Identify one principle that could be improved in your pipeline design. How would you address this in a future iteration?
- Scenario-Based Discussion:
- Imagine that the YouTube API introduces stricter quota limits. How would this impact the pipeline? Propose a solution to handle such constraints while ensuring the pipeline remains functional and reliable.
Feel free to explore a different YouTube channel. Note that the channel ID is different from the YouTube handle. For example, Grant Horvat's YouTube handle is `@GrantHorvatGolfs` but his channel ID is UCgUueMmSpcl-aCTt5CuCKQw. The easiest way to find a channel's ID is to go to the channel metadata, where information is listed for "About", "Links", and "Channel details". At the bottom of the pop-up window is an option to "Share Channel" and then "Copy Channel ID".↩︎