4 Putting DataOps into Practice
Reading Time: 45-60 minutes
In the previous chapter, we explored the foundational concepts of DataOps, focusing on the importance of processes such as data ingestion, validation, versioning, and lineage in building reliable machine learning systems. Now, it’s time to take these concepts from theory to practice by applying them to create data pipelines. Data pipelines form the backbone of any modern ML system, serving as the structured, automated pathways through which data flows from raw sources to actionable insights.
A data pipeline is a series of interconnected steps that collect, process, validate, and transform data into formats ready for machine learning workflows. Whether you’re ingesting data from multiple sources, cleaning and preparing it for modeling, or ensuring its integrity through validation and versioning, pipelines make it possible to handle these tasks efficiently and consistently. They are essential for ensuring scalability, reliability, and repeatability in ML systems, especially in environments where data is constantly changing or arriving in real-time.
This chapter will guide you through the practical implementation of DataOps principles by developing end-to-end data pipelines. We’ll begin with an introduction to the components of a data pipeline and the tools commonly used to implement them. From there, we’ll delve into the step-by-step process of building a pipeline, starting with a simple example and progressing to techniques for creating scalable workflows that can handle complex, large-scale ML applications.
By the end of this chapter, you’ll have a deeper understanding of how to translate DataOps practices into actionable pipelines, setting the stage for real-world machine learning deployments. Whether you’re working with batch or streaming data, structured or unstructured datasets, this chapter will equip you with the tools and strategies to design robust data pipelines that meet the demands of modern ML workflows.
4.1 Introduction to Data Pipelines
In the world of machine learning (ML), the success of a system hinges on the quality, reliability, and timeliness of the data that powers it. Data pipelines are the operational backbone of these systems, ensuring data flows seamlessly from its source, through various transformations, and into the hands of models or end-users. A data pipeline is a series of automated steps that enable data ingestion, processing, and delivery—integrating the concepts of DataOps to ensure scalability, consistency, and efficiency.
In the previous chapter, we explored the fundamental principles of DataOps, including data ingestion, processing, validation, versioning, and lineage. These processes establish the foundation for building robust data pipelines. Now, we turn our attention to applying these principles in practice to design end-to-end workflows that support machine learning systems.
The Importance of Data Pipelines in ML System Design
Data pipelines address critical challenges faced by ML workflows, such as managing large data volumes, handling diverse data formats, and maintaining data integrity. By automating repetitive and error-prone tasks like ingestion, cleaning, and transformation, pipelines ensure that data is prepared and delivered reliably.
From the concepts introduced earlier, pipelines incorporate:
- Data Ingestion: Automating the flow of data from various sources—whether databases, APIs, or real-time streams—into a centralized system.
- Data Processing: Applying transformations, feature engineering, and cleaning steps to structure raw data into formats ready for model consumption.
- Data Validation: Embedding quality checks to ensure the consistency and integrity of data at every stage of the pipeline.
- Data Versioning and Lineage: Providing traceability and reproducibility by maintaining historical records of datasets and documenting their transformations.
For example, in a customer churn prediction system, a data pipeline might ingest transaction data from a database, clean and preprocess it to handle missing values and inconsistencies, and validate that the data conforms to schema standards before feeding it into an ML model. Each step would leverage the DataOps principles discussed earlier to ensure that the process is efficient, scalable, and error-resistant.
The Role of DataOps in Constructing Scalable and Reliable Pipelines
As explored in the previous chapter, DataOps is not just a set of practices but a philosophy that ensures data workflows are collaborative, automated, and aligned with business objectives. When applied to pipeline construction, DataOps enables:
- Scalability: Modular design allows pipelines to handle growing data volumes and adapt to new sources or transformations without significant redesigns.
- Reliability: Continuous validation, versioning, and monitoring ensure that pipelines consistently deliver high-quality data, even in dynamic environments.
- Collaboration: DataOps principles encourage communication between data engineers, analysts, and machine learning practitioners, ensuring pipelines meet diverse needs.
- Traceability: By tracking data lineage and applying version control, DataOps ensures every step in the pipeline is documented, facilitating debugging, compliance, and reproducibility.
Consider a recommendation system that ingests real-time user behavior data from an API and combines it with historical purchase data stored in a data lake. A DataOps-driven pipeline would (see the sketch following this list):
- Ingest and Validate: Fetch data from both sources, validating schema consistency and ensuring data quality.
- Process and Transform: Standardize features such as timestamps, handle missing values, and engineer inputs like “average time between purchases.”
- Track Lineage: Document the transformations applied to both real-time and historical data to ensure traceability and compliance.
- Version Datasets: Maintain snapshots of preprocessed data for future analysis, debugging, or retraining.
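To make these stages concrete, here is a minimal Python sketch of such a pipeline. The function names, URLs, and file paths are hypothetical placeholders rather than a specific implementation, and the feature engineering is reduced to a single derived column.

```python
import pandas as pd

def ingest_and_validate() -> pd.DataFrame:
    # Hypothetical sources: real-time events from an API plus historical purchases in a data lake
    events = pd.read_json("https://api.example.com/user-events")                   # placeholder URL
    history = pd.read_parquet("s3://example-data-lake/purchases/history.parquet")  # placeholder path
    combined = events.merge(history, on="user_id", how="left")
    # Lightweight schema/quality check before anything downstream runs
    assert {"user_id", "timestamp", "amount"}.issubset(combined.columns)
    return combined

def process_and_transform(df: pd.DataFrame) -> pd.DataFrame:
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)  # standardize timestamps
    df = df.dropna(subset=["amount"])                            # handle missing values
    # Engineer an input such as "average time between purchases" per user
    df = df.sort_values(["user_id", "timestamp"])
    df["days_since_last_purchase"] = df.groupby("user_id")["timestamp"].diff().dt.days
    df["avg_days_between_purchases"] = (
        df.groupby("user_id")["days_since_last_purchase"].transform("mean")
    )
    return df

# Lineage tracking and dataset versioning would be layered on top of these steps,
# for example with DVC snapshots of the transformed output (shown later in this chapter).
```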
By applying the foundational concepts of DataOps, such pipelines are not only robust and scalable but also agile enough to adapt to evolving ML requirements.
In this chapter, we will build on the foundational knowledge from the previous chapter to design and implement data pipelines that embody the principles of DataOps. Through hands-on examples and practical guidance, you will learn how to construct workflows that transform raw data into reliable inputs for machine learning systems, setting the stage for scalable and effective solutions.
4.2 ETL vs. ELT Paradigms
When designing data pipelines, there are two primary paradigms: Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT). These paradigms represent distinct approaches to organizing and processing data as it flows through pipelines, each with its own strengths and trade-offs. Choosing between ETL and ELT (or a hybrid approach) requires an understanding of your organization’s technical infrastructure, data requirements, and use cases.
Extract, Transform, Load (ETL)
The ETL paradigm embodies a traditional, structured approach to data pipelines. It follows a linear process:
- Extract data from various sources, such as relational databases, APIs, or CSV files.
- Transform the extracted data into a clean, structured format using techniques like cleaning, aggregation, and enrichment.
- Load the transformed data into a target system, such as a relational database or data warehouse.
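Visually, the ETL flow places all processing ahead of the storage layer, the mirror image of the ELT diagram shown later in this section:

```mermaid
flowchart LR
  ob1(Raw data source) --> processing[Data processing]
  ob2(Raw data source) --> processing
  ob3(Raw data source) --> processing
  processing --> id3[(Data storage)]
```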
ETL pipelines focus on preparing high-quality, ready-to-use data before it enters the storage system. In ETL workflows, the data processing phase discussed earlier plays a central role. Transformations are performed early in the pipeline to ensure data quality and structure before it reaches the storage layer.
When to Use ETL:
- When working with traditional data warehouses or legacy systems that prioritize structured and clean data.
- In use cases requiring strict data governance and pre-defined schemas, such as regulatory compliance reporting.
- When downstream applications depend on consistent, pre-processed datasets.
Strengths:
- Simplifies downstream workflows by ensuring data is pre-processed and ready for use.
- Reduces storage costs by retaining only clean and structured data.
- Aligns well with environments that require high data integrity.
Limitations:
- Upfront transformations can delay data availability.
- Scaling ETL pipelines to handle large datasets or real-time workflows can be resource-intensive.
- Rigid processes may struggle to adapt to rapidly changing data needs.
Example: Imagine a financial institution that extracts transaction data, transforms it into a consistent format by applying currency conversions and anomaly detection, and loads the cleaned data into a warehouse for fraud detection reports.
Extract, Load, Transform (ELT)
The ELT paradigm, a more modern approach, inverts the transformation and loading steps:
- Extract data from various sources, often retaining its raw format.
- Load the extracted data directly into a scalable storage system, such as a data lake or cloud-based data warehouse.
- Transform the data within the storage system using its computational resources, tailoring transformations to specific analytical needs.
```mermaid
flowchart LR
  ob1(Raw data source) --> id3[(Data storage)]
  ob2(Raw data source) --> id3
  ob3(Raw data source) --> id3
  id3 --> processing[Data processing] --> id3
```
In ELT workflows, the data ingestion phase emphasizes rapid loading of raw data into storage, enabling flexibility for later transformations. Also, by storing the data in its raw form, ELT handles structured, semi-structured, and unstructured data more gracefully and makes it easier for downstream analytics to adapt as needs change.
When to Use ELT:
- When leveraging cloud-native platforms like Snowflake, BigQuery, or AWS Redshift, which support high-speed storage and in-database transformations.
- In workflows requiring rapid ingestion and the ability to adapt transformations for different use cases.
- When storing raw data is essential for exploratory analysis or regulatory purposes.
Strengths:
- Enables faster data ingestion by deferring transformations.
- Supports diverse and flexible transformations tailored to specific analyses.
- Scales well with large data volumes, leveraging modern storage and processing systems.
Limitations:
- Requires advanced infrastructure to handle and process raw data efficiently.
- Higher storage costs due to the retention of raw, unprocessed data.
- Ad-hoc transformations can lead to inconsistencies if not governed properly.
Example: An e-commerce platform ingests raw clickstream data into a cloud data warehouse. When needed, transformations such as sessionizing user behavior or aggregating purchase trends are applied directly within the warehouse for personalized recommendations or sales forecasting.
Comparing ETL and ELT
| Aspect | ETL | ELT |
|---|---|---|
| Data Transformation | Before loading into storage | After loading into storage |
| Storage Requirements | Lower, as only processed data is stored | Higher, as raw data is retained |
| Processing Time | Slower ingestion due to upfront transformations | Faster ingestion with deferred transformations |
| Flexibility | Limited; transformations are predefined | High; transformations can be ad hoc |
| Infrastructure | Suitable for legacy systems or traditional data warehouses | Ideal for modern, scalable systems |
As highlighted in the previous chapter, the goals of DataOps — efficiency, reliability, and scalability — should guide the choice of ETL, ELT, or a hybrid approach. Consider:
- ETL is well-suited for structured environments and use cases that demand immediate access to clean, well-processed data.
- ELT shines in cloud-native or big data ecosystems where raw data flexibility and scalability are critical.
Both paradigms have their strengths, and many organizations blend elements of each. For example, a retail company might use ETL for compliance reporting while leveraging ELT for real-time inventory analysis. By understanding these paradigms within the broader DataOps framework, teams can design pipelines that meet both technical and business requirements effectively.
4.3 Tools for DataOps
Building effective and scalable data pipelines requires leveraging specialized tools at each stage of the DataOps lifecycle. From data ingestion to processing, validation, and versioning, the ecosystem of tools available is vast and continually expanding. This section provides an overview of commonly used tools, highlighting their capabilities, trade-offs, and when they might be most appropriate. Additionally, this space is dynamic, with new tools regularly introduced to address evolving challenges. Therefore, the goal is not to become tied to a specific tool but to understand the broader landscape and make informed decisions based on your pipeline’s requirements.
Data Ingestion Tools
Efficiently gathering data from diverse sources is the foundation of any pipeline. Ingestion tools vary from those tailored for real-time streaming to solutions focused on batch processing and ease of integration. Some examples include:
Apache Kafka
A distributed event-streaming platform designed for high-throughput, real-time data ingestion. Kafka is widely used in applications like fraud detection, IoT data pipelines, and real-time analytics.
- When to Use: Ideal for high-velocity, low-latency systems.
- Limitations: Steep learning curve and resource-heavy for smaller use cases.
Apache NiFi
A flow-based programming tool that automates data movement and transformation with a user-friendly interface. NiFi supports a wide range of data sources and formats.
- When to Use: Low-code scenarios requiring rapid integration of multiple data sources.
- Limitations: Less suited for complex, high-throughput pipelines.
An open-source data integration platform focused on moving data into data warehouses or lakes. Its extensive library of connectors simplifies ingesting data from various APIs and databases.
- When to Use: Batch ingestion with diverse source compatibility.
- Limitations: Primarily batch-focused and may require configuration for real-time pipelines.
A managed data integration tool that simplifies data ingestion by automating schema handling and updates. It’s widely used for moving data into analytics-ready storage solutions.
- When to Use: Enterprise scenarios requiring minimal management overhead for cloud data integration.
- Limitations: Subscription-based pricing and limited customization compared to open-source tools.
Data Processing Tools
Processing raw data into clean, structured, and feature-rich formats is critical to preparing it for machine learning workflows. Many tools exist, but the following are a few popular options that support various scales and complexities of data transformation.
Apache Spark
A powerful engine for distributed data processing. Spark supports both batch and streaming data workflows, making it suitable for big data and ML pipelines.
- When to Use: Processing large-scale data across distributed systems.
- Limitations: Requires expertise and infrastructure, which might be overkill for smaller datasets.
dbt
A transformation tool that applies SQL to prepare data for analytics workflows. dbt is particularly useful for pipelines that involve data warehouses.
- When to Use: SQL-centric environments where analytics-ready data is the end goal.
- Limitations: Focused on transformations within SQL databases; less versatile for non-SQL workflows.
An open source Python & Rust library for data manipulation and analysis, offering rich functionality for handling structured data with performance in mind.
- When to Use: Small to large datasets or for prototyping workflows. A great substitute for Pandas when performance and speed are required.
- Limitations: Not designed for real-time processing.
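To give a feel for the API, here is a minimal, self-contained sketch; the column names and thresholds are made up for illustration.

```python
import polars as pl

# Small illustrative dataset
df = pl.DataFrame({
    "video_id": ["a1", "b2", "c3", "d4"],
    "views": [1200, 85000, 300, 47000],
    "likes": [100, 4200, 12, 1800],
})

# Filter low-view videos, derive a like rate, and sort by it
result = (
    df.filter(pl.col("views") > 1000)
      .with_columns((pl.col("likes") / pl.col("views")).alias("like_rate"))
      .sort("like_rate", descending=True)
)
print(result)
```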
A workflow orchestration tool that enables robust scheduling and monitoring of data processing tasks.
- When to Use: Complex pipelines requiring orchestration of multiple processing steps.
- Limitations: Adds an orchestration layer, which might be unnecessary for simple workflows.
Data Validation Tools
Ensuring the integrity and accuracy of data is essential to building reliable pipelines. Validation tools help detect anomalies, enforce schema compliance, and improve data quality. Some popular data validation tools include:
Great Expectations
A framework for defining, executing, and documenting data validation checks. It integrates well with modern data stacks.
- When to Use: Custom validation rules with automated reporting needs.
- Limitations: May require significant customization for non-standard checks.
TensorFlow Data Validation (TFDV)
A library designed for detecting anomalies and validating schema in ML datasets.
- When to Use: TensorFlow-based workflows or pipelines requiring statistical validation.
- Limitations: Limited applicability outside of TensorFlow-centric environments.
Deequ
A library developed by AWS for validating large datasets using Spark. It focuses on automated quality checks and anomaly detection.
- When to Use: Spark-based pipelines requiring scalable data quality checks.
- Limitations: Limited compatibility with non-Spark environments.
Data Versioning Tools
Versioning ensures traceability and reproducibility by tracking changes to datasets and workflows. It is vital for debugging, compliance, and collaboration. Some common data versioning tools include:
DVC (Data Version Control)
A Git-like tool for versioning data, models, and pipelines. It integrates seamlessly with Git repositories.
- When to Use: Managing medium to large datasets in collaborative environments.
- Limitations: Can be challenging to set up for teams unfamiliar with Git.
Delta Lake
A storage layer that adds versioning and ACID transactions to data lakes. It works well with distributed systems like Apache Spark.
- When to Use: Large-scale pipelines needing robust versioning and consistency.
- Limitations: Requires Spark for full functionality, adding complexity to smaller-scale workflows.
lakeFS
A Git-like version control system for data lakes. It enables branching and snapshotting of data workflows.
- When to Use: Data lakes requiring advanced version control features.
- Limitations: Best suited for teams already working with data lakes.
4.4 Hands-On Example: A YouTube Data Pipeline
Now that you’ve learned a bit about data pipelines, common paradigms, and tooling involved, let’s apply some of these concepts to create a simplified data pipeline that demonstrates how to ingest, process, and prepare YouTube data for downstream machine learning (ML) workflows. We’ll focus on collecting data like video titles, transcripts, views, likes, and comments. This example ties together concepts from previous chapters, including data ingestion, processing, validation, and versioning.
Pipeline Design
The pipeline design we'll use resembles an ETL process more than an ELT one. This is mainly for storage simplicity, since the pipeline runs locally (whether on my machine or yours) rather than in a cloud environment where storage capacity is less of a constraint.
We’ll use a couple of APIs to ingest our data, perform a little bit of data processing to clean and prepare our data for future analyses, and then illustrate some basic data validation along with versioning and storing our final prepared data.
```mermaid
flowchart LR
  subgraph ingest[Data Ingestion]
    direction LR
    subgraph p1[Ingest Video IDs]
    end
    subgraph p2[Ingest Video Stats]
    end
    subgraph p3[Ingest Video Transcript]
    end
  end
  p1 --> p2
  p1 --> p3
  ingest --> process(Process Raw Data)
  process --> Validate --> Version --> data[(Data storage)]
```
Basic Requirements
If you’d like to follow along and execute this code then there are a few requirements you’ll need on your end.
You can find the requirements, source code for the data pipeline, and helper functions here.
First, you'll need to make sure you have the required Python libraries installed. Based on the tools used in this chapter, that includes at least pandas, great_expectations, and dvc; the full list is provided in the linked requirements file.
Next, to simplify this example I have created helper functions that abstract away many of the finer code details; these are imported from the `dataops_utils.py` module. If you want to reproduce this pipeline, you can download the `dataops_utils.py` script here.
Lastly, we'll be using the YouTube Data API to extract metadata and video information. To follow along, ensure you have:
- A Google Cloud account.
- Access to the YouTube Data API, including an API key. See here to get started.
Once you have a YouTube API key, you're ready to go! The next code chunk sets your API key and establishes the API URL and the channel ID. Note: I'm a golf junkie, so the channel ID I am using is for Grant Horvat, a YouTube golfer with nearly 1 million followers.1
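The original setup chunk isn't reproduced here, so the following is a minimal sketch. The variable names are illustrative assumptions; the base URL is the standard YouTube Data API v3 endpoint, and the channel ID comes from the footnote at the end of this chapter.

```python
import os

# YouTube Data API v3 configuration (illustrative variable names)
YT_API_KEY = os.environ["YT_API_KEY"]                  # set your API key in your environment
YT_API_URL = "https://www.googleapis.com/youtube/v3"   # base URL for the YouTube Data API
CHANNEL_ID = "UCgUueMmSpcl-aCTt5CuCKQw"                # Grant Horvat's channel ID
```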
Data Ingestion
The first step is to ingest the YouTube data. To do so, we need to follow three steps (a sketch using the helper functions appears after this list):
- Ingest all the video IDs provided by a given channel.
- Use the ingested video IDs to ingest YouTube stats for each video, such as total views, likes, and comments.
- And lastly, ingest the transcript for each video.
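The helper functions in `dataops_utils.py` wrap the actual API calls. Their exact names and signatures aren't shown here, so the sketch below uses hypothetical helpers that mirror the three steps above.

```python
# Hypothetical helper names -- the real functions live in dataops_utils.py
from dataops_utils import get_channel_video_ids, get_video_stats, get_video_transcripts

# 1. Ingest all video IDs published by the channel
video_ids = get_channel_video_ids(api_key=YT_API_KEY, channel_id=CHANNEL_ID)

# 2. Ingest per-video statistics (views, likes, comments) for those IDs
video_stats = get_video_stats(api_key=YT_API_KEY, video_ids=video_ids)

# 3. Ingest the transcript for each video
video_transcripts = get_video_transcripts(video_ids=video_ids)
```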
Data Processing
For data preprocessing, we're going to first convert our data to a Pandas DataFrame:
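Assuming the ingested stats and transcripts come back as lists of records keyed by video ID (an assumption, since the helper return types aren't shown), the conversion might look like this:

```python
import pandas as pd

# Combine stats and transcripts into a single table, one row per video (illustrative)
df = pd.DataFrame(video_stats).merge(
    pd.DataFrame(video_transcripts), on="video_id", how="left"
)
```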
And then clean our data (a condensed code sketch follows this list) by:
- Removing missing values. There are a few videos with no transcript text because the videos have no talking in them.
- Removing duplicate observations, just in case video information was duplicated during the downloading process.
- Removing inconsistent data types. This avoids errors during data processing and ensures consistency in operations applied to the data.
- Removing observations that have invalid datetime values. This ensures chronological accuracy for any time-based analysis or trends.
- Removing videos with a minimal number of views. This filters out content that may not provide enough engagement data for meaningful insights.
- Removing videos with very little transcript text. This ensures that the remaining data contains sufficient content for natural language processing or text-based analysis.
- Cleaning the title and transcript text. This removes unnecessary noise, such as non-character string values (i.e., Unicode characters), making the text suitable for analysis.
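Below is a condensed sketch of these cleaning steps in Pandas. The thresholds (500 views, 100 transcript characters) and the regex used to strip noise are illustrative choices, not the exact values from the original pipeline.

```python
import re
import pandas as pd

cleaned_data = (
    df.dropna(subset=["transcript"])         # remove videos with no transcript text
      .drop_duplicates(subset=["video_id"])  # remove duplicate observations
      .copy()
)

# Enforce consistent data types and drop rows with invalid datetimes
for col in ["views", "likes", "comments"]:
    cleaned_data[col] = pd.to_numeric(cleaned_data[col], errors="coerce")
cleaned_data["datetime"] = pd.to_datetime(cleaned_data["datetime"], errors="coerce")
cleaned_data = cleaned_data.dropna(subset=["views", "likes", "comments", "datetime"])
cleaned_data[["views", "likes", "comments"]] = cleaned_data[["views", "likes", "comments"]].astype("int64")

# Filter out low-engagement videos and very short transcripts (illustrative thresholds)
cleaned_data["transcript_length"] = cleaned_data["transcript"].str.len().astype("int64")
cleaned_data = cleaned_data[(cleaned_data["views"] >= 500) & (cleaned_data["transcript_length"] >= 100)]

# Strip non-ASCII noise from the title and transcript text
strip_noise = lambda text: re.sub(r"[^\x00-\x7F]+", " ", text).strip()
cleaned_data["title"] = cleaned_data["title"].map(strip_noise)
cleaned_data["transcript"] = cleaned_data["transcript"].map(strip_noise)
```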
While there are many advanced techniques for preprocessing text data—such as more thorough cleaning, lemmatization, or converting text into embeddings for model input—the intent of this section is to focus on building a simplified example of an end-to-end data pipeline. These additional steps can significantly enhance the quality and utility of text data, but they are outside the scope of this example, which is designed to highlight the core concepts and practical steps involved in creating a data pipeline.
Data Validation
Next, we'll validate our data. To do so, we'll use Great Expectations and, first, we need to create a data context and define our data assets.
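The original setup code isn't shown, and the Great Expectations API has changed across major releases; the sketch below is one way to do it with the fluent 1.x API, registering the cleaned Pandas DataFrame as a data asset.

```python
import great_expectations as gx

# Create a Data Context and register the cleaned DataFrame as a data asset
context = gx.get_context()
data_source = context.data_sources.add_pandas(name="youtube_pipeline")
data_asset = data_source.add_dataframe_asset(name="youtube_video_data")

# A batch definition that treats the whole DataFrame as a single batch
batch_definition = data_asset.add_batch_definition_whole_dataframe("youtube_batch")
batch = batch_definition.get_batch(batch_parameters={"dataframe": cleaned_data})
```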
Now that the infrastructure is set up, we’ll go through a process of validating that our data meets certain expectations. This code defines a data validation suite to ensure the integrity of our cleaned YouTube dataset. It consists of three primary validation steps:
- Column Existence Validation: It verifies that all required columns (`channel_id`, `video_id`, `datetime`, `title`, `views`, `likes`, `comments`, `transcript`, and `transcript_length`) are present in the dataset. This step ensures that the essential structure of the dataset is intact.
- Data Type Validation: It checks that each column contains values of the expected data type, such as `Object` for textual data (`channel_id`, `video_id`, `title`, and `transcript`), `Timestamp` for date-related fields, and `int64` for numerical fields (`views`, `likes`, `comments`, and `transcript_length`). This ensures that data types align with the intended use of each column.
- Null Value Validation: It confirms that no empty or null values exist in the critical columns. This guarantees that the dataset is complete and avoids errors caused by missing data in downstream processes.
Finally, the code executes the validation suite against a data batch and prints whether all validation checks were successful. This ensures the dataset meets the defined quality standards before being used in the data pipeline.
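Continuing the hedged 1.x sketch from above, the three validation steps can be expressed as individual expectations and run against the batch:

```python
columns = ["channel_id", "video_id", "datetime", "title", "views",
           "likes", "comments", "transcript", "transcript_length"]

expectations = (
    # 1. Column existence
    [gx.expectations.ExpectColumnToExist(column=col) for col in columns]
    # 2. Data types (a couple of representative checks)
    + [gx.expectations.ExpectColumnValuesToBeOfType(column="views", type_="int64"),
       gx.expectations.ExpectColumnValuesToBeOfType(column="title", type_="object")]
    # 3. No null values in critical columns
    + [gx.expectations.ExpectColumnValuesToNotBeNull(column=col) for col in columns]
)

results = [batch.validate(exp) for exp in expectations]
print("All validation checks passed:", all(r.success for r in results))
```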
The validation checks implemented here represent just a small subset of the capabilities available with Great Expectations. This powerful tool provides a wide array of validation checks, enabling you to ensure data quality at a much deeper level. For instance, you can validate the distribution of values to check for expected statistical patterns, ensure cardinality constraints to avoid duplicate or excessive values, or verify that data adheres to specific patterns (e.g., email formats or numeric ranges). These additional validations can further enhance the robustness and reliability of your data pipelines. Explore the full range of options here: Great Expectations Documentation - Expectations.
Data Versioning
For our purposes, we’re going to store the final cleaned_data
to our project directory and then we’re going to use DVC (Data Version Control) to version and track changes to our data.
DVC assumes you are working within a Git repository, as it relies on Git for tracking the metadata of your data files and pipelines.
- If you're familiar with Git: You can follow along with this example as long as you are working within a Git repository. Make sure you have initialized a Git repository (`git init`) in your project directory and have basic knowledge of Git commands.
- If Git is new to you: Don't worry! You can still follow along to understand the process conceptually, even if you don't execute the Git commands. Later chapters in this book will provide a deeper dive into Git, equipping you with the knowledge to implement version control effectively.
For now, focus on understanding how DVC integrates with data pipelines to manage and version datasets systematically.
So, first, we'll store our final data as a Parquet file, which is a more storage-efficient format than a CSV file.
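The write itself is a single call (the same line reappears later when the pipeline is re-run with new data):

```python
cleaned_data.to_parquet('data/youtube_video_data.parquet', index=False)
```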
Next, we need to initialize DVC in our project by running the following command in the project root.
dvc init
This initializes a .dvc directory with a few internal files to track data and pipeline versions. These should be added to Git.
git status
Changes to be committed:
new file: .dvc/.gitignore
new file: .dvc/config
...
git commit -m "Initialize DVC"
Next, you need to use `dvc add` to start tracking the dataset file.
dvc add data/youtube_video_data.parquet
This command creates a `.dvc` file (youtube_video_data.parquet.dvc) to track the file's metadata and location. It also adds a `.gitignore` file so that the actual .parquet data file is not committed to Git; instead, the .parquet.dvc metadata file is committed.
git add data/youtube_video_data.parquet.dvc data/.gitignore
Next, run the following commands to track and tag the dataset changes in Git.
git commit -m 'Initial processed Youtube data'
git tag -a "v1.0" -m "Youtube data v1.0"
If you have a remote storage location (e.g., AWS S3, Google Drive, Azure Blob Storage) configured for DVC, you can also push the data asset to that location to ensure it is stored offsite and shareable with the rest of your team or organization.
dvc remote add -d myremote s3://mybucket/myproject
dvc push
Now, say we re-run this pipeline next week and get additional video data. We can just follow this same procedure to save and version the updated data:
# Write the cleaned data to a parquet file
cleaned_data.to_parquet('data/youtube_video_data.parquet', index=False)
# Re-track the updated file with DVC (this updates the .dvc metadata file)
dvc add data/youtube_video_data.parquet
git add data/youtube_video_data.parquet.dvc
# Commit and tag the changes
git commit -m 'Updated Youtube data'
git tag -a "v2.0" -m "Youtube data v2.0"
Our example demonstrates only the basic functionality of DVC, showcasing how it can be used to version datasets within a pipeline. However, DVC offers a wide range of additional features, including:
- Tracking and managing entire pipelines.
- Handling large datasets efficiently with external storage integrations.
- Automating experiment tracking and comparisons.
To explore the full potential of DVC and learn how to leverage its advanced capabilities, check out the official documentation at https://dvc.org/doc. This resource provides comprehensive guidance and examples for incorporating DVC into robust, end-to-end workflows.
4.5 Creating Reliable & Scalable Data Pipelines
Designing reliable and scalable data pipelines requires adherence to the foundational principles discussed earlier, which promote maintainability, reproducibility, and efficiency. The hands-on example of the YouTube data pipeline demonstrates some of these design principles; however, there are areas where improvements could further align the pipeline with these principles.
Principles Incorporated in the YouTube Data Pipeline
Modularity and Abstraction
- What We Did: The pipeline uses a modular design by encapsulating reusable functions within the `dataops_utils` module. This abstraction reduces complexity and makes the codebase more manageable, enabling easier updates and debugging.
- Why It Matters: Modularity ensures that individual components, such as data cleaning, validation, and versioning, are independently testable and maintainable. Abstraction allows users to focus on higher-level logic without worrying about low-level details.
Reproducibility and Versioning
- What We Did: By incorporating Git for version control and DVC for dataset tracking, the pipeline ensures that every step—from data ingestion to cleaning—can be reproduced with consistent results.
- Why It Matters: Reproducibility is essential for debugging, auditing, and collaboration, especially when multiple team members work on the same project or when retraining models on historical data.
Data Validation
- What We Did: The pipeline integrates Great Expectations to validate the structure and quality of the dataset. This ensures that the data meets predefined criteria before it is processed further.
- Why It Matters: Validation prevents poor-quality data from contaminating downstream workflows, thereby enhancing reliability and trust in the pipeline.
Limitations and Areas for Improvement
API Quota Limits
- Current Challenge: The YouTube API imposes daily quota limits on data retrieval, constraining the volume of data that can be ingested. This makes the pipeline less scalable for large-scale applications.
- Potential Solution: To overcome this limitation, consider implementing data caching strategies, batching requests over multiple days, or leveraging additional API keys across multiple accounts to distribute the load.
Lack of Automation
- Current Challenge: The pipeline is not automated and requires manual execution for each step. This limits its reliability and scalability in production environments.
- Potential Solution: Incorporate workflow automation tools such as Apache Airflow or Prefect to schedule and monitor the pipeline. Automating these tasks would enhance reliability and reduce manual intervention.
Error Handling
- Current Challenge: The pipeline lacks mechanisms to handle failures gracefully, such as retrying failed API calls or dealing with incomplete datasets.
- Potential Solution: Introduce error-handling routines and logging mechanisms to ensure that the pipeline can recover from failures without disrupting downstream processes.
Scalability
- Current Challenge: The pipeline is designed for small-scale use and does not leverage distributed computing or scalable storage solutions.
- Potential Solution: Transition to cloud-based storage systems like AWS S3 or Google Cloud Storage and integrate distributed computing frameworks such as Apache Spark for handling large datasets.
Balancing Design Principles
While the YouTube data pipeline incorporates key design principles such as modularity, abstraction, and reproducibility, there are trade-offs due to its simplicity. The primary objective of this example was to illustrate the foundational steps in building a data pipeline, rather than creating a production-ready system. For real-world applications, enhancing scalability, automation, and fault tolerance would be critical to align fully with the design principles of reliable and scalable systems.
By continuously iterating on these principles, teams can evolve simple pipelines into robust, production-ready systems capable of handling complex, large-scale workflows.
4.6 Summary
In this chapter, we explored the practical aspects of building data pipelines to support robust machine learning workflows. By combining the principles and concepts introduced in the previous chapter, we demonstrated how to design, implement, and validate a data pipeline using tools like Pandas, Great Expectations, and DVC. The YouTube data pipeline provided a hands-on example of how data ingestion, processing, validation, and versioning can come together to create a cohesive system.
We also compared the ETL and ELT paradigms, highlighting their applications and trade-offs, and discussed how the design principles for good ML systems—such as modularity, abstraction, and reproducibility—can guide the creation of reliable and scalable pipelines. While the example illustrated foundational techniques, it also highlighted areas for improvement, such as addressing scalability and automation challenges.
This chapter aimed to provide you with the foundational knowledge and tools to build data pipelines tailored to your needs. As you advance, you'll encounter more complex requirements and additional tools to refine and scale your workflows, and the next chapter builds on these foundations.
4.7 Exercise
This exercise will help you apply the concepts covered in this chapter to design and evaluate a data pipeline. The focus is on understanding the design principles, tools, and processes involved in building reliable and scalable data pipelines.
Scenario: Imagine you are tasked with building a data pipeline for an online retail store. The pipeline should:
- Ingest daily sales data from a transactional database and inventory updates from an API.
- Process the data to calculate daily revenue and identify low-stock items.
- Validate the data for missing or inconsistent records.
- Version the processed data for auditing and historical tracking.
Task:
- Sketch a conceptual design for this pipeline. Include the steps for ingestion, processing, validation, and versioning.
- Identify which tools from this chapter (e.g., Pandas, Great Expectations, DVC) you would use for each step and justify your choice.
Using the YouTube pipeline example as a reference, answer the following:
- Modify the Pipeline:
  - Extend the provided YouTube pipeline to include additional validation checks (e.g., ensure `views` is greater than `likes` or that `transcript_length` is within a realistic range).
  - What changes did you make to the validation suite, and why?
- Experiment with Versioning:
  - Create a new version of the `cleaned_data` DataFrame by simulating a change in the input data (e.g., remove videos with fewer than 10,000 views).
  - Use DVC to version the new dataset and compare it to the original version. What do the differences reveal?
- Evaluate the Pipeline:
- Review the YouTube pipeline example and your modified pipeline. How well do they incorporate the design principles from Chapter 1 (e.g., modularity, scalability, reproducibility)?
- Identify one principle that could be improved in your pipeline design. How would you address this in a future iteration?
- Scenario-Based Discussion:
- Imagine that the YouTube API introduces stricter quota limits. How would this impact the pipeline? Propose a solution to handle such constraints while ensuring the pipeline remains functional and reliable.
Feel free to explore a different YouTube channel. Note that the channel ID is different from the YouTube handle. For example, Grant Horvat's YouTube handle is `@GrantHorvatGolfs` but his channel ID is UCgUueMmSpcl-aCTt5CuCKQw. The easiest way to find a channel's ID is to go to the channel metadata, where information is listed for "About", "Links", and "Channel details". At the bottom of the pop-up window is an option to "Share Channel" and then "Copy Channel ID".↩︎