20  Before You Build: Key Considerations

Important: Picture This

You’ve just learned about the exciting world of machine learning and you’re eager to dive in. You have a dataset, you know whether your problem is regression or classification, and you’re ready to start building models. But hold on—successful machine learning projects aren’t built in a day, and they certainly aren’t built without careful planning.

Think of machine learning like building a house. You wouldn’t start hammering nails without first checking your foundation, reviewing your blueprints, and ensuring you have the right permits. Similarly, before you write a single line of modeling code, you need to establish a solid foundation: What exactly are you trying to solve? Is your data ready? How will you know if your model is working? And what ethical considerations should guide your decisions?

This chapter focuses on the critical thinking and planning that separates successful ML projects from costly failures. You won’t find complex algorithms or extensive coding here—instead, you’ll develop the judgment to ask the right questions, spot potential pitfalls, and set your projects up for success from the very beginning.

The goal isn’t to overwhelm you with theory, but to give you the practical wisdom that experienced data scientists use every day. By the end of this chapter, you’ll have a mental checklist that will guide you through the early stages of any machine learning project, helping you avoid common mistakes and build models that actually solve real business problems.

20.1 Learning Objectives

By the end of this chapter, you should be able to:

  • Frame machine learning problems by defining clear business questions and success criteria
  • Assess data readiness including data quality and the importance of proper train/test splits
  • Identify data leakage and understand why using “future” information invalidates model results
  • Recognize ethical considerations around fairness, privacy, and interpretability in ML applications
  • Apply a systematic pre-modeling checklist to real business scenarios

Note: A Foundation for Deeper Learning

This chapter’s goal is to get you thinking about these critical concepts and developing the right mindset for successful machine learning projects. Many of the topics we introduce here—from data splitting techniques to evaluation metrics to ethical considerations—will be explored in much greater detail in later chapters where you’ll learn the practical implementation skills and hands-on techniques to apply these concepts in real projects.

20.2 Framing the Problem

The most critical step in any machine learning project happens before you even touch your data: clearly defining what you’re trying to accomplish. This might sound obvious, but it’s where many projects go astray. Without a clear problem statement and definition of success, you’ll find yourself building technically impressive models that don’t actually solve business problems.

Start with the Business Question

Machine learning should always begin with a specific, answerable business question. Vague goals like “use AI to improve our business” or “build a model to predict customer behavior” are recipes for failure. Instead, successful ML projects start with questions like:

  • “Can we predict which customers are likely to cancel their subscription in the next 30 days?”
  • “What price should we set for this product to maximize profit while remaining competitive?”
  • “Which marketing email should we send to each customer to maximize click-through rates?”

Notice how each of these questions has specific, measurable outcomes. They define exactly what you’re trying to predict, over what time horizon, and for what business purpose.

Define Success Upfront

Before building any model, you need to answer: “How will I know if this model is successful?” This isn’t just about technical metrics—it’s about business impact.

Consider these different perspectives on success:

Technical Success: “Our model achieves 85% accuracy on our test set”
Business Success: “Using our model’s predictions, we reduced customer churn by 15% this quarter”

Both matter, but business success is what justifies the time and resources invested in your project. You should establish both technical benchmarks (such as a minimum acceptable accuracy or a maximum tolerable error rate) and business benchmarks (cost savings, revenue increase, time saved) before you begin modeling.

```mermaid
flowchart LR
  subgraph ML[ML System]
    direction BT
    subgraph p0[Overall Performance]
    end
    subgraph p1[Model performance metrics]
    end
    subgraph p2[Business performance metrics]
    end

    p1 --> p0
    p2 --> p0
  end
```

Figure 20.1: Understanding an ML system’s performance requires understanding both model performance metrics and business performance metrics.

Ensuring Metric Alignment: The most critical consideration is ensuring your technical and business metrics are aligned and incentivize the same behavior. For example, if your business goal is to minimize customer complaints, optimizing purely for model accuracy might not be the right approach—a model that achieves high accuracy by being overly cautious might miss too many legitimate issues, leading to more complaints. Instead, you might prioritize recall (catching more potential problems) even if it means lower overall accuracy. The key is choosing technical metrics that, when optimized, naturally drive the business outcomes you care about.

Stakeholder Communication: It’s equally important to educate stakeholders about how models are measured technically and why those measurements matter for business outcomes. Business leaders often focus solely on bottom-line results, but understanding technical metrics helps them make informed decisions about model deployment, resource allocation, and acceptable trade-offs. When stakeholders understand that a 95% accurate fraud detection model still flags thousands of legitimate transactions for review, they can better plan operational processes and set realistic expectations for the model’s impact.

Model Performance Metrics

These metrics evaluate the accuracy and reliability of the ML model’s predictions. Many of them will be familiar if you’ve done any ML work before: mean squared error, \(R^2\), and mean absolute error for regression problems; accuracy, precision, and recall for classification problems; or BLEU, BERTScore, and perplexity for large language models. The choice of metric depends on the type of ML task (e.g., classification, regression) and the consequences of different kinds of errors.
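
As a small illustration, here is a minimal sketch of computing a few of these metrics with scikit-learn; the toy values below are placeholders rather than results from a real model:

```python
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,   # regression
    accuracy_score, precision_score, recall_score,       # classification
)

# Toy regression example: true vs. predicted values
y_true_reg = [3.0, 5.0, 2.5, 7.0]
y_pred_reg = [2.8, 5.4, 2.0, 6.5]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("R^2:", r2_score(y_true_reg, y_pred_reg))

# Toy classification example: 1 = positive class, 0 = negative class
y_true_clf = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_clf = [1, 0, 0, 1, 0, 1, 1, 0]
print("Accuracy: ", accuracy_score(y_true_clf, y_pred_clf))
print("Precision:", precision_score(y_true_clf, y_pred_clf))
print("Recall:   ", recall_score(y_true_clf, y_pred_clf))
```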

Further reading: We’ll discuss model performance metrics in more depth in later chapters, but here are some additional readings you can browse now on selecting the right metric for evaluating ML models: Part 1, Part 2

Business Performance Metrics

These metrics measure the real-world impact of your ML system on business outcomes and organizational goals. Common business metrics for ML projects include:

  • Financial Metrics: Revenue increase, cost savings, profit margin improvement, return on investment (ROI), customer lifetime value changes
  • Operational Metrics: Process automation percentage, time savings, error reduction, productivity improvements, resource utilization
  • Customer Metrics: Customer satisfaction scores, churn rate reduction, conversion rate improvement, engagement increases, retention rates
  • Risk Metrics: Fraud detection rates, compliance improvements, risk exposure reduction, safety incident decreases
  • Efficiency Metrics: Decision-making speed, manual review time reduction, processing capacity increases, workflow optimization

The key is selecting business metrics that directly connect to your organization’s strategic objectives and can be clearly attributed to your ML system’s performance. Remember that business impact often takes time to materialize and may require longer measurement periods than technical metrics.

Understand the Decision Context

Every ML model exists to support human decision-making. Understanding exactly how your predictions will be used helps you frame the problem correctly.

Why this matters so much: The consequences of model failures can range from minor inconveniences to life-changing impacts. History is filled with examples of well-intentioned ML systems that caused significant harm because their creators didn’t fully consider the decision context. From biased hiring algorithms that discriminated against qualified candidates to criminal justice risk assessment tools that perpetuated racial inequities, the stakes of getting this wrong can be enormous.

WarningReal-World Impact

Machine learning models are increasingly used in high-stakes decisions affecting people’s lives—from loan approvals and job applications to medical diagnoses and criminal sentencing. For a sobering look at what can go wrong when models are deployed without careful consideration of their decision context, see Cathy O’Neil’s “Weapons of Math Destruction,” which documents how algorithmic bias can perpetuate and amplify inequality.

Ask yourself:

  • Who will use these predictions? (Marketing team, customer service reps, automated system)
  • What action will they take? (Send targeted offers, flag for manual review, adjust pricing)
  • How quickly do they need results? (Real-time, daily batch, weekly reports)
  • What happens if the model is wrong? (Minor inconvenience, financial loss, safety risk)
  • Who is affected by these decisions? (Internal teams, customers, broader society)
  • What are the potential unintended consequences? (Bias amplification, privacy violations, safety risks)

For example, a model predicting whether a customer will purchase a product has very different requirements than a model predicting whether a medical device will fail. The first might tolerate some false positives if it leads to higher overall sales; the second requires extremely high precision to avoid potential safety hazards. Similarly, a model used for automated loan approvals carries legal and ethical responsibilities that don’t apply to a model suggesting Netflix recommendations.

Knowledge Check

Problem Framing Practice:
  • Background: You work for TechFlow, a software company with 50,000+ customers. Your customer service team is drowning in support tickets and needs help managing their workload effectively.
  • Current Situation:
    • Volume: 2,000+ tickets per day across email, chat, and phone
    • Team: 15 customer service representatives working regular business hours
    • Response Goals: Respond to all tickets within 4 hours, resolve within 24 hours
    • Current Problems: Missing response deadlines, customers complaining about slow service, team working overtime
  • Ticket Types Include:
    • Password resets (25% of tickets) - Usually quick, can be automated
    • Billing questions (20% of tickets) - Require access to account details, moderate complexity
    • Technical bugs (30% of tickets) - Range from simple to complex, may need engineering team
    • Feature requests (15% of tickets) - Need to be forwarded to product team
    • Account cancellations (10% of tickets) - High priority, need immediate attention to retain customers
  • Available Data:
    • Historical ticket data including subject lines, descriptions, categories, resolution times
    • Customer information (subscription type, tenure, previous tickets)
    • Representative performance data and specializations
  • Business Context:
    • Cost of delay: Each hour of delay costs approximately $50 in customer satisfaction
    • Staff constraints: Can’t hire more representatives immediately due to budget/training time
    • Customer retention: Quick resolution of cancellation requests can save 30% of departing customers

The Ask: The customer service manager says: “Use machine learning to help manage our workload—we’re drowning and need any help we can get!”

Your Task:

  1. Rewrite this request as a specific, measurable business question that a machine learning model could address.
  2. Define what success would look like from both technical and business perspectives.
  3. Identify the decision context: Who would use this model? What actions would they take based on the predictions?

Consider the different approaches you could take and which would provide the most business value!

20.3 Data Readiness

Once you’ve clearly defined your problem, the next step is ensuring your data is ready to support that goal. This isn’t just about having “enough” data—it’s about having the right data of sufficient quality, organized in a way that allows for valid model training and testing.

Data Quality: Garbage In, Garbage Out

The famous computer science principle “garbage in, garbage out” is especially true for machine learning. Even the most sophisticated algorithms can’t overcome fundamentally flawed data. Before you begin modeling, you need to honestly assess whether your data can support your goals.

Common data quality issues include:

  • Missing values: Are there gaps in your data? Are they random, or do they follow patterns that could bias your model?
  • Inconsistent formats: Do you have dates recorded as “2023-01-15” in some places and “Jan 15, 2023” in others?
  • Duplicate records: The same customer or transaction appearing multiple times can skew your results
  • Outliers and errors: Unrealistic values like negative ages or sales dates in the future
  • Inconsistent definitions: What exactly counts as a “customer”? An “active user”? A “conversion”?
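
As a starting point, several of these checks take only a few lines of pandas. The sketch below uses a tiny, made-up table with hypothetical column names (customer_id, signup_date, age, total_spend); with real data you would load your own table instead:

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for your real data (column names are hypothetical)
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "signup_date": ["2023-01-15", "2023-01-20", "2023-01-20", "not recorded", "2023-02-01"],
    "age": [34, -2, -2, np.nan, 51],
    "total_spend": [120.0, 89.5, 89.5, np.nan, 15000.0],
})

# Missing values: how many, and in which columns?
print(df.isna().sum())

# Duplicate records: exact duplicates and repeated customer IDs
print("Duplicate rows:", df.duplicated().sum())
print("Duplicate customer IDs:", df["customer_id"].duplicated().sum())

# Inconsistent formats: try to parse dates and count what fails
parsed_dates = pd.to_datetime(df["signup_date"], errors="coerce")
print("Unparseable dates:", parsed_dates.isna().sum())

# Outliers and obvious errors: inspect ranges of numeric columns
print(df.describe())
print("Negative ages:", (df["age"] < 0).sum())
```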

Tip: Partner with Data Owners Early

In most organizations, specific teams own the data assets you’ll be using for your analysis—whether it’s the marketing team managing customer data, the finance team handling transaction records, or the engineering team maintaining system logs. It’s extremely important to work with these data owners upfront to fully understand what you’re working with. They can explain what each field represents, how the data is collected, what business rules affect the data, and any nuances or quirks you should be aware of. This partnership can save you from making costly assumptions and help you identify potential data quality issues before they derail your project.

The Critical Importance of Train/Test Splits

One of the most fundamental concepts in machine learning is the train/test split. This is so important that getting it wrong can invalidate your entire project, regardless of how sophisticated your model is.

While later chapters will dive deep into the practical implementation of data splitting techniques and how to incorporate them into your Python workflow, for now it’s crucial to understand that how you prepare and split your data is a major factor in how well your model will perform on unseen data. The choices you make here directly impact whether your model will succeed or fail in real-world deployment.

Here’s the basic principle: You cannot fairly evaluate a model using the same data you used to train it.

Think of it like studying for an exam. If you practice using the exact same questions that will appear on the test, your practice score will be unrealistically high—it doesn’t reflect how well you’ll perform on new, unseen questions. Similarly, a model’s performance on its training data is almost always overly optimistic.

How Train/Test Splits Work

The solution is to split your data into distinct portions:

  • Training Set (typically 70-80% of data): Used to build and tune your model
  • Test Set (typically 20-30% of data): Used only to evaluate final model performance
  • Validation Set (optional, for complex projects): Used for model selection and tuning

```mermaid
flowchart TD
    A[Complete Dataset] --> B[Training Set<br/>70-80%]
    A --> C[Test Set<br/>20-30%]

    B --> D[Train Model]
    D --> E[Trained Model]

    C --> F[Evaluate Performance]
    E --> F
    F --> G[Unbiased Performance<br/>Estimate]

    style A fill:#f0f8ff
    style B fill:#e8f5e8
    style C fill:#ffe6e6
    style G fill:#fff2cc
```

Figure 20.2: Proper data splitting ensures unbiased evaluation by keeping test data completely separate from model training.

The golden rule: Once you set aside your test set, don’t touch it until you’re completely done with model development. The moment you use test data to make decisions about your model, it’s no longer a fair evaluation.
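
To make this concrete, here is a minimal sketch of a basic split using scikit-learn’s train_test_split on toy data; the array shapes and variable names are placeholders for your real features and target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for your real feature table and target
X = np.random.rand(1000, 5)          # 1,000 rows, 5 features
y = np.random.randint(0, 2, 1000)    # binary outcome (0/1)

# Hold out 20% of rows for final evaluation; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)   # (800, 5) (200, 5)

# Build and tune your model using only X_train / y_train;
# touch X_test / y_test exactly once, at the very end, for the final evaluation.
```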

Random vs. Strategic Splitting

The central principle: How you split your data is crucial and depends entirely on the type of problem you’re solving. The wrong splitting strategy can lead to overly optimistic results that don’t translate to real-world performance, while the right strategy sets your model up for success from the start. We’ll explore the technical implementation of these strategies in later chapters, but understanding when and why to use different approaches is fundamental.

Random Splitting: The Default Approach

How it works: Random splitting assigns each row in your dataset to either training or testing using pure chance—like flipping a coin for each record. This ensures that your training and test sets are representative samples of your overall data, with similar distributions of features and outcomes.

When it’s appropriate: Random splitting works well when:

  • Your data represents a single time period or snapshot
  • Each row is independent (no customer groupings, time sequences, or hierarchical relationships)
  • You’re not trying to predict future events
  • Your outcome variable is reasonably balanced

Example: Predicting whether customers will respond to a marketing email using data from a single campaign where each customer received one email.

When Random Splitting Fails: Strategic Alternatives

However, many real-world problems require more thoughtful splitting strategies:

Problem: You’re building a model to predict which customers will churn next month using 2 years of customer data.

Why random splitting fails: If you randomly split this data, your model might use information from December 2023 to predict churn that happened in January 2023—essentially using the future to predict the past!

Strategic solution: Split by time—train on data from January 2022 to December 2022, test on data from January 2023 to December 2023. This mirrors real-world deployment where you use historical data to predict future events.
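
A minimal sketch of what such a time-based split could look like in pandas, using a toy table with a hypothetical snapshot_date column:

```python
import pandas as pd

# Toy monthly snapshots; in practice this would be your 2 years of customer history
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "snapshot_date": pd.to_datetime(
        ["2022-03-01", "2023-02-01", "2022-07-01", "2023-05-01"]
    ),
    "churned_next_month": [0, 1, 0, 0],
})

# Train on 2022, evaluate on 2023 -- never the other way around
train = df[df["snapshot_date"] <= "2022-12-31"]
test = df[df["snapshot_date"] >= "2023-01-01"]

print(len(train), "training rows,", len(test), "test rows")
```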

Problem: Predicting transaction fraud using a dataset where each customer has multiple transactions.

Why random splitting fails: If Customer A’s transactions appear in both training and test sets, your model might just learn to recognize Customer A rather than learning general fraud patterns.

Strategic solution: Keep all of each customer’s transactions together—if Customer A is in training, ALL their transactions stay in training.
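
One way to implement this is scikit-learn’s GroupShuffleSplit, which assigns whole groups (here, customers) to one side of the split or the other. A minimal sketch with made-up data and hypothetical column names:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy transactions: several customers, multiple transactions each (made-up data)
df = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B", "C", "C", "D"],
    "amount": [20, 35, 500, 15, 60, 80, 90, 300],
    "is_fraud": [0, 0, 1, 0, 0, 0, 1, 0],
})

# Assign whole customers to train OR test, never both
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["customer_id"]))

train, test = df.iloc[train_idx], df.iloc[test_idx]
print("Train customers:", sorted(set(train["customer_id"])))
print("Test customers: ", sorted(set(test["customer_id"])))
```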

Problem: Detecting equipment failures where only 2% of your data represents actual failures.

Why random splitting fails: You might randomly end up with very few (or even zero) failure cases in your test set, making evaluation impossible.

Strategic solution: Use stratified splitting to ensure both training and test sets maintain the same 2% failure rate as your original dataset.
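
With scikit-learn, this is the stratify argument of train_test_split. A minimal sketch on toy data with a 2% failure rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: only 2% of rows are failures (y = 1)
X = np.random.rand(1000, 4)
y = np.array([1] * 20 + [0] * 980)

# stratify=y keeps the failure rate (roughly) identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train failure rate:", y_train.mean())   # ~0.02
print("Test failure rate: ", y_test.mean())    # ~0.02
```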

Problem: Predicting student performance using data from multiple schools and classrooms.

Why random splitting fails: Students from the same classroom might be more similar to each other than to students from other classrooms, creating hidden dependencies.

Strategic solution: Split at the school or classroom level rather than at the student level to ensure true independence.

The key insight is that your splitting strategy should mirror how your model will be used in the real world. If you’ll use historical data to predict future events, split by time. If you’ll make predictions about new customers, split by customer. The goal is always to create test conditions that simulate actual deployment as closely as possible.

Knowledge Check

Data Readiness Assessment:
  • Business Context: You work for RetailMax, an e-commerce company that wants to launch a targeted marketing campaign. The marketing team plans to send personalized discount offers to customers who are likely to make a purchase in the next 30 days, hoping to convert them before they buy from competitors. They’ve allocated a budget for 10,000 targeted emails and want to maximize return on investment by selecting the customers most likely to purchase.
  • Your Dataset: 100,000 customer records spanning the past 2 years, including purchase history, browsing behavior, demographic information, and customer service interactions. The marketing team wants to deploy this model next month to identify targets for their campaign.
  • Your Task:
    1. Identify potential data quality issues you should check for before modeling.
    2. Design your train/test split strategy: How would you split this data? Why might a random split not be appropriate for this time-sensitive prediction?
    3. Spot the problem: A colleague tells you they achieved 99% accuracy by including “days_since_last_purchase” as a feature. What might be wrong with this approach?

Consider these questions carefully—they represent some of the most common pitfalls in ML projects!

20.4 Data Leakage: The Silent Model Killer

Data leakage is one of the most insidious problems in machine learning. It makes your model look incredibly successful during development, only to fail spectacularly when deployed in the real world. Understanding and preventing data leakage is crucial for building models that actually work.

What Is Data Leakage?

Data leakage occurs when information that would not be available at prediction time somehow finds its way into your training data. Essentially, your model is “cheating” by using information from the future or information that contains the answer you’re trying to predict.

The tricky part is that leakage often leads to models with impressive performance metrics—99% accuracy, perfect predictions, results that seem too good to be true. And they are too good to be true.

Types of Data Leakage

Data leakage can creep into your models through several different pathways, each with its own characteristics and warning signs. Understanding these distinct types helps you systematically check for and prevent leakage in your own projects.

Temporal Leakage

This happens when you accidentally include information from after the point in time at which you’re trying to make predictions.

Example: You’re building a model to predict which customers will cancel their subscription in January 2024. You accidentally include a feature called “customer_satisfaction_survey_february_2024” in your training data. Your model performs amazingly well—of course it does, because customers who cancelled in January probably gave poor satisfaction ratings in February!

This occurs when you include features that are directly caused by or contain information about the outcome you’re predicting.

Example: You’re predicting whether someone will default on a loan, and you include a feature called “account_status” which has values like “current,” “late,” and “charged_off.” The “charged_off” status literally means the person defaulted—you’ve accidentally included the answer in your features!

This happens when you include features that wouldn’t be available when you actually need to make predictions.

Example: You’re building a model to approve credit applications instantly online. You include features from a detailed financial audit that takes 30 days to complete. Your model might be accurate, but it’s useless because you can’t wait 30 days to approve applications.

How to Spot Data Leakage

Warning signs that might indicate leakage:

  • Performance that’s too good: If your model achieves near-perfect accuracy on a complex real-world problem, be suspicious
  • One feature dominates: If removing a single feature causes model performance to collapse dramatically
  • Perfect correlation: If any feature correlates almost perfectly with your target variable
  • Temporal inconsistencies: Features that would only be available after the event you’re predicting
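
One quick, rough screen for the “perfect correlation” warning sign above is to rank features by their correlation with the target. A minimal sketch on a toy table where one hypothetical column secretly encodes the outcome:

```python
import pandas as pd

# Toy data: "account_status_code" secretly encodes the outcome (leakage)
df = pd.DataFrame({
    "income": [40, 55, 30, 80, 65, 25],
    "age": [25, 40, 31, 52, 47, 29],
    "account_status_code": [1, 0, 1, 0, 0, 1],   # leaky: mirrors the target
    "target": [1, 0, 1, 0, 0, 1],
})

# Rank features by absolute correlation with the target
correlations = df.corr()["target"].drop("target").abs().sort_values(ascending=False)
print(correlations)

# Anything correlating near 1.0 deserves a hard look -- it may simply be
# another way of recording the outcome itself
print("Investigate for leakage:", list(correlations[correlations > 0.95].index))
```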

Preventing Data Leakage

Successfully preventing data leakage requires developing a systematic mindset and establishing rigorous practices throughout your model development process. The most effective approach is to think like your model will actually be deployed—constantly asking yourself whether each piece of information would realistically be available at the moment you need to make a prediction. This simple question can prevent most leakage issues before they occur.

Understanding your data timeline is equally crucial. Before you begin feature engineering, create a clear timeline showing when each piece of information becomes available in your business process. Your model can only use information that exists before the prediction moment, so mapping out these temporal relationships helps you identify potential temporal leakage early. For instance, if you’re predicting monthly churn, you should only use data that would be available at the beginning of that month, not data that accumulates during the month itself.

Be especially suspicious of features that seem too good to be true. If a feature appears to be a perfect predictor—correlating almost perfectly with your target variable—investigate thoroughly. Often, these “perfect” predictors are actually just different ways of measuring the outcome you’re trying to predict, or they contain information that wouldn’t be available at prediction time. Features that dramatically improve your model’s performance deserve extra scrutiny rather than celebration.

Finally, implement time-aware validation strategies for any problem involving temporal data. Instead of randomly splitting your data, use time-based validation where you train on earlier data and test on later data. This approach mirrors real-world deployment conditions and helps catch temporal leakage that might slip through random validation splits.
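
One common way to implement time-aware validation is scikit-learn’s TimeSeriesSplit, where every fold trains on earlier rows and tests on later ones. A minimal sketch, assuming your rows are already sorted by time:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data already ordered by time (row 0 is oldest, row 99 is newest)
X = np.arange(100).reshape(-1, 1)
y = np.random.randint(0, 2, 100)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    # In every fold, the training rows come strictly before the test rows
    print(f"Fold {fold}: train rows {train_idx.min()}-{train_idx.max()}, "
          f"test rows {test_idx.min()}-{test_idx.max()}")
```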

A Real-World Example: Hospital Readmission Prediction

Consider a common healthcare analytics challenge: building a model to predict which patients will be readmitted to the hospital within 30 days of discharge. This type of model is valuable for hospitals trying to improve patient care and reduce costs, but it’s also prone to subtle leakage issues that can make the model appear more accurate than it actually is.

Imagine you’re working with a feature called “length_of_initial_stay” measured in days. At first glance, this seems like a perfectly legitimate feature—longer hospital stays might indicate more severe conditions that increase readmission risk. However, the way you calculate this feature makes all the difference between a valid model and a leaky one.

The leaky approach might seem logical: calculate length_of_stay as the total number of days from initial admission to final discharge. If a patient is readmitted within your 30-day prediction window, this calculation would include the readmission dates, artificially inflating the stay length for patients who end up being readmitted. The model might then show a strong correlation between longer stays and readmission risk—but this is circular logic since the readmission itself is contributing to the “longer stay” measurement.

The correct approach calculates length_of_stay only from admission to the initial discharge date, completely ignoring any subsequent readmissions. This ensures you’re only using information that would have been available at the moment of initial discharge when you would actually need to make your prediction. While this version might show weaker correlations, it represents genuine predictive relationships that will hold up in real-world deployment.

This example illustrates how subtle definitional choices can introduce leakage that dramatically inflates apparent model performance while rendering the model useless for its intended purpose. The leaky version might achieve impressive accuracy in testing but would fail completely when deployed because the “future information” it relies on wouldn’t be available for new patients.
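
To make the difference concrete, here is a hedged sketch of the two calculations on toy records; the column names (admission_date, initial_discharge_date, final_discharge_date) are hypothetical:

```python
import pandas as pd

# Toy patient records; all column names here are hypothetical
visits = pd.DataFrame({
    "patient_id": [101, 102],
    "admission_date": pd.to_datetime(["2024-01-01", "2024-01-03"]),
    "initial_discharge_date": pd.to_datetime(["2024-01-05", "2024-01-06"]),
    # Patient 101 was readmitted and only finally discharged weeks later
    "final_discharge_date": pd.to_datetime(["2024-01-20", "2024-01-06"]),
})

# LEAKY: includes days accumulated during the readmission we are trying to predict
visits["length_of_stay_leaky"] = (
    visits["final_discharge_date"] - visits["admission_date"]
).dt.days

# CORRECT: uses only information known at the moment of initial discharge
visits["length_of_stay"] = (
    visits["initial_discharge_date"] - visits["admission_date"]
).dt.days

print(visits[["patient_id", "length_of_stay_leaky", "length_of_stay"]])
```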

Knowledge Check

Spot the Leakage:

For each scenario below, identify whether data leakage is present and explain why:

  1. Email Spam Detection: You’re predicting whether emails are spam. One of your features is “email_moved_to_spam_folder” (yes/no). The model achieves 95% accuracy.

  2. Customer Churn Prediction: You’re predicting which customers will cancel next month. Your features include last month’s purchase amount, customer age, and account creation date.

  3. Stock Price Prediction: You’re predicting tomorrow’s stock price using today’s opening price, trading volume, and tomorrow’s closing volume.

  4. Medical Diagnosis: You’re predicting disease presence using patient symptoms, lab results from the day of diagnosis, and treatment prescribed by the doctor.

Which scenarios contain leakage? What makes them problematic?

20.5 Fairness, Privacy, and Interpretability: The Human Impact

Machine learning models don’t exist in a vacuum—they make decisions that affect real people’s lives. Whether it’s determining who gets a loan, which job candidates get interviews, or what content people see on social media, your models have ethical implications that go far beyond technical performance metrics.

Fairness: Avoiding Discriminatory Outcomes

Algorithmic fairness means ensuring your model doesn’t systematically discriminate against protected groups based on characteristics like race, gender, age, or other sensitive attributes. The challenge is that bias can creep into models in subtle ways, even when you’re trying to be fair, often through the very data we use to train them.

Bias typically enters through several pathways: historical bias reflects past discrimination embedded in your training data (like historical hiring records that favor certain groups), representation bias occurs when some groups are underrepresented in your dataset, measurement bias stems from biased ways of measuring outcomes (such as credit scores that reflect historical lending discrimination), and proxy bias arises when seemingly neutral variables correlate with protected characteristics (like zip codes that correlate with race).

Consider a hiring model trained on historical data that learns certain universities predict success. If those universities had discriminatory admission practices, the model perpetuates that bias into future hiring decisions. Promoting fairness requires proactive steps: auditing your data to understand demographics and historical outcomes, testing for disparate impact across different groups, considering algorithms that can optimize for both accuracy and fairness simultaneously, and maintaining human oversight for high-stakes decisions, especially edge cases.
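
As one very simple starting point for testing disparate impact, you can compare the model’s selection rates across groups. The sketch below uses made-up data and a common rule of thumb (the “four-fifths rule”); real fairness audits go well beyond this:

```python
import pandas as pd

# Made-up hiring-model output: 1 = advanced to interview
results = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "model_decision": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Selection rate per group
rates = results.groupby("group")["model_decision"].mean()
print(rates)

# Rule of thumb ("four-fifths rule"): flag the model for review if any group's
# selection rate falls below 80% of the highest group's rate
ratio = rates.min() / rates.max()
print("Selection-rate ratio:", round(ratio, 2))
if ratio < 0.8:
    print("Potential disparate impact -- investigate before deployment")
```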

Privacy: Protecting Sensitive Information

Machine learning models can inadvertently reveal sensitive information about individuals in your training data or make predictions that expose private details. Privacy concerns span multiple dimensions: data collection practices may gather more personal information than necessary, model inversion attacks can use outputs to infer sensitive information about training individuals, re-identification can occur when “anonymized” data is combined with other sources, and prediction privacy issues arise when models reveal sensitive information through their outputs (like predicting health conditions from seemingly unrelated data).

Protecting privacy requires systematic practices: data minimization ensures you only collect and use necessary information, proper anonymization removes or encrypts personally identifiable information, differential privacy adds carefully calibrated noise to protect individual privacy while preserving statistical patterns, and secure storage ensures data and models are properly protected and accessed only by authorized personnel.
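
As a toy illustration of the differential privacy idea, the classic Laplace mechanism adds calibrated noise to an aggregate statistic before it is released. This sketch shows only the concept, not a production-grade privacy implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

true_count = 1204       # e.g., number of customers with some sensitive attribute
sensitivity = 1         # adding/removing one person changes a count by at most 1
epsilon = 0.5           # privacy budget: smaller = more noise = more privacy

# Laplace mechanism: release the count plus noise scaled to sensitivity / epsilon
noisy_count = true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
print("Released (noisy) count:", round(noisy_count))
```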

Interpretability: Understanding Model Decisions

As models become more complex, understanding why they make specific decisions becomes increasingly difficult. This “black box” problem is particularly concerning for high-stakes applications where decisions significantly impact people’s lives. Interpretability matters for multiple reasons: it enables debugging by helping you understand and fix model mistakes, builds trust among users and stakeholders who need to understand decisions, ensures compliance with regulations that require explainable decisions (like certain credit approval requirements), and supports fairness efforts since you can’t fix bias without understanding how decisions are made.

Interpretability operates at different levels. Global interpretability helps you understand how the model works overall (such as knowing a credit model primarily relies on credit score and income), while local interpretability explains specific decisions (like understanding a particular loan was denied primarily due to high debt-to-income ratio). Common approaches include using inherently interpretable models like linear regression and decision trees, analyzing feature importance to understand which variables matter most, applying techniques like SHAP values to explain individual predictions, and providing counterfactual explanations that show how different inputs would change outcomes.
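
As a small illustration of global interpretability, the sketch below fits an inherently interpretable model on synthetic data and inspects which (hypothetical) features it leans on most:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a small credit dataset; the feature names are hypothetical
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
features = ["credit_score", "income", "debt_to_income", "tenure_months"]
X = pd.DataFrame(X, columns=features)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Coefficients give a global view of which inputs push predictions up or down
coefs = pd.Series(model.coef_[0], index=features).sort_values(key=abs, ascending=False)
print(coefs)
```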

Balancing Competing Concerns

You’ll often face trade-offs between accuracy, fairness, privacy, and interpretability, and there’s no single “right” answer. The key is making these trade-offs consciously and transparently, guided by your specific application context and stakeholder needs. Critical questions to consider include: whether a small decrease in accuracy is justified by significant fairness improvements, how much model complexity is warranted by performance gains, what level of privacy protection is appropriate for your application, and who needs to understand model decisions and at what level of detail. These decisions should align with your organization’s values, regulatory requirements, and the real-world impact of your model’s deployment.

Knowledge Check

Ethical Considerations in Practice:

Consider a model being developed to screen job applications for a tech company.

  1. Identify potential fairness issues: What sources of bias might affect this model? What groups might be unfairly disadvantaged?

  2. Privacy concerns: What personal information might this model expose? How could the company protect applicant privacy?

  3. Interpretability requirements: Who would need to understand this model’s decisions? What level of explanation would be appropriate?

  4. Design better practices: How would you modify the model development process to address these concerns?

Think about how you would balance model performance with ethical considerations in this scenario.

20.6 Summary

Building successful machine learning models requires much more than selecting the right algorithm and achieving high accuracy scores. The foundation of any ML project lies in careful planning, thoughtful problem framing, and systematic consideration of potential pitfalls before you write your first line of modeling code.

Throughout this chapter, you’ve learned that effective machine learning begins with clearly defining your business problem and establishing concrete success criteria. You’ve seen how data quality and proper train/test splits form the bedrock of reliable model evaluation, and how data leakage can make models appear deceptively successful while actually being worthless in practice. Finally, you’ve considered the human impact of machine learning through the lenses of fairness, privacy, and interpretability—recognizing that technical excellence alone isn’t sufficient when models affect people’s lives.

The key insight is that good machine learning isn’t just about algorithms; it’s about problem framing, data discipline, appropriate evaluation, and responsible use. The time you invest in these foundational considerations will pay dividends throughout your project, helping you avoid costly mistakes and build models that actually solve real business problems.

Tip: Your pre-modeling checklist should include:
  • Clear problem definition: Specific, measurable business questions with defined success criteria
  • Data quality assessment: Understanding your data’s limitations, biases, and gaps
  • Proper validation strategy: Time-aware train/test splits that reflect real-world deployment
  • Leakage prevention: Ensuring only information available at prediction time is used for training
  • Appropriate metrics: Evaluation criteria aligned with business objectives and model type
  • Ethical considerations: Proactive assessment of fairness, privacy, and interpretability requirements

As you move forward in your machine learning journey, remember that the most sophisticated algorithms in the world can’t compensate for poor foundational planning. The habits and mindset you develop now—asking the right questions, being suspicious of results that seem too good to be true, and always considering the human impact of your models—will serve you well regardless of which specific techniques you eventually master.

20.7 End of Chapter Exercise: ML Project Pitfall Analysis

You work as a consultant helping companies identify potential problems in their machine learning projects before they invest significant resources. For each scenario below, identify the key issues that could derail the project and suggest what the team should address before building their models.

Scenario 1: Streaming Service Lifetime Revenue

Company: A subscription streaming service
Goal: “Build a model to predict how much revenue each new customer will generate over their entire relationship with our company”
Proposed Approach: The data science team plans to use all available customer data, including viewing history, payment information, and customer service interactions. They want to achieve 90% accuracy to justify the project to executives.
Timeline: “We need this model deployed in 2 weeks for the next marketing campaign”

Your Analysis:

  1. Problem Framing Issues: What’s unclear or problematic about how they’ve defined their goal?
  2. Data and Methodology Concerns: What potential issues do you see with their proposed approach?
  3. Timeline and Expectations: What’s unrealistic about their timeline and success metrics?
  4. Recommendations: What should they clarify or change before proceeding?

Scenario 2: Automated Loan Approvals

Company: A regional bank
Goal: Automate loan approval decisions to reduce processing time and costs
Current Approach: The team has 10 years of historical loan data including applicant demographics, credit scores, employment history, and loan outcomes. They plan to train a model that achieves 95% accuracy, then deploy it to make instant approval decisions.
Special Note: “We included a feature called ‘loan_officer_final_decision’ because it correlates perfectly with whether loans were approved—this will make our model really accurate!”

Your Analysis:

  1. Data Leakage: What’s the obvious leakage problem in this scenario?
  2. Fairness Concerns: What bias issues might this model have?
  3. Evaluation Strategy: How should they properly evaluate this model?
  4. Ethical Considerations: What additional considerations should guide this project?

Scenario 3: Skin Condition Diagnosis App

Company: A healthcare technology startup
Goal: Build a model to help doctors diagnose skin conditions from photographs
Data: 50,000 images labeled by dermatologists, with 95% showing healthy skin and 5% showing various conditions
Approach: Random train/test split with 80/20 division. They’re excited because their model achieves 95% accuracy!
Deployment Plan: “The model will provide diagnostic suggestions directly to patients through our app”

Your Analysis:

  1. Data Issues: What problems do you see with their dataset composition?
  2. Evaluation Problems: Why might 95% accuracy be misleading here?
  3. Deployment Concerns: What’s risky about their deployment plan?
  4. Regulatory and Safety Issues: What additional considerations apply to medical applications?

Scenario 4: Hate Speech Detection

Company: A social media platform
Goal: Automatically detect and remove hate speech from user posts
Data: 1 million posts from the last 6 months, labeled by content moderators
Approach: The team wants to optimize for maximum recall (“we want to catch all hate speech”) and plans to use all available data including user profiles, posting history, and network connections.
Success Metric: “If we can catch 99% of hate speech, we’ll have solved the problem”

Your Analysis:

  1. Problem Framing: What’s oversimplified about their success metric?
  2. Privacy Concerns: What privacy issues arise from their data usage?
  3. Fairness Issues: How might this model create unfair outcomes?
  4. Metric Selection: What trade-offs are they ignoring by focusing only on recall?

Scenario 5: Predictive Maintenance

Company: A manufacturing plant
Goal: Predict equipment failures to schedule maintenance before breakdowns occur
Data: 2 years of sensor readings (temperature, vibration, pressure) and maintenance records
Approach: They plan to include a feature called “maintenance_scheduled_next_week” because it seems predictive of failures. Their validation shows perfect predictions!
Business Case: “This will save millions by preventing unexpected downtime”

Your Analysis:

  1. Leakage Detection: What’s the leakage issue here?
  2. Temporal Considerations: How should they handle the time-series nature of this data?
  3. Cost-Benefit Analysis: What additional factors should they consider beyond prediction accuracy?
  4. Practical Deployment: What challenges might they face when actually using this model?

Reflection Questions

After analyzing these scenarios, consider:

  1. Common Patterns: What types of mistakes appear across multiple scenarios?
  2. Detection Skills: How can you develop the ability to spot these issues early in your own projects?
  3. Prevention Strategies: What processes or checklists could help teams avoid these pitfalls?
  4. Stakeholder Communication: How would you explain these technical issues to non-technical business leaders?

The goal of this exercise isn’t to memorize specific problems, but to develop the critical thinking skills and systematic approach that will help you identify and address issues before they derail your machine learning projects.