20 Before You Build: Key Considerations

```mermaid
flowchart LR
    subgraph ML[ML System]
        direction BT
        subgraph p0[Overall Performance]
        end
        subgraph p1[Model performance metrics]
        end
        subgraph p2[Business performance metrics]
        end
        p1 --> p0
        p2 --> p0
    end
```
You’ve just learned about the exciting world of machine learning and you’re eager to dive in. You have a dataset, you know whether your problem is regression or classification, and you’re ready to start building models. But hold on—successful machine learning projects aren’t built in a day, and they certainly aren’t built without careful planning.
Think of machine learning like building a house. You wouldn’t start hammering nails without first checking your foundation, reviewing your blueprints, and ensuring you have the right permits. Similarly, before you write a single line of modeling code, you need to establish a solid foundation: What exactly are you trying to solve? Is your data ready? How will you know if your model is working? And what ethical considerations should guide your decisions?
This chapter focuses on the critical thinking and planning that separates successful ML projects from costly failures. You won’t find complex algorithms or extensive coding here—instead, you’ll develop the judgment to ask the right questions, spot potential pitfalls, and set your projects up for success from the very beginning.
The goal isn’t to overwhelm you with theory, but to give you the practical wisdom that experienced data scientists use every day. By the end of this chapter, you’ll have a mental checklist that will guide you through the early stages of any machine learning project, helping you avoid common mistakes and build models that actually solve real business problems.
20.1 Learning Objectives
By the end of this chapter, you should be able to:
- Frame machine learning problems by defining clear business questions and success criteria
- Assess data readiness including data quality and the importance of proper train/test splits
- Identify data leakage and understand why using “future” information invalidates model results
- Recognize ethical considerations around fairness, privacy, and interpretability in ML applications
- Apply a systematic pre-modeling checklist to real business scenarios
This chapter’s goal is to get you thinking about these critical concepts and developing the right mindset for successful machine learning projects. Many of the topics we introduce here—from data splitting techniques to evaluation metrics to ethical considerations—will be explored in much greater detail in later chapters where you’ll learn the practical implementation skills and hands-on techniques to apply these concepts in real projects.
20.2 Framing the Problem
The most critical step in any machine learning project happens before you even touch your data: clearly defining what you’re trying to accomplish. This might sound obvious, but it’s where many projects go astray. Without a clear problem statement and definition of success, you’ll find yourself building technically impressive models that don’t actually solve business problems.
Start with the Business Question
Machine learning should always begin with a specific, answerable business question. Vague goals like “use AI to improve our business” or “build a model to predict customer behavior” are recipes for failure. Instead, successful ML projects start with questions like:
- “Can we predict which customers are likely to cancel their subscription in the next 30 days?”
- “What price should we set for this product to maximize profit while remaining competitive?”
- “Which marketing email should we send to each customer to maximize click-through rates?”
Notice how each of these questions has specific, measurable outcomes. They define exactly what you’re trying to predict, over what time horizon, and for what business purpose.
Define Success Upfront
Before building any model, you need to answer: “How will I know if this model is successful?” This isn’t just about technical metrics—it’s about business impact.
Consider these different perspectives on success:
Technical Success: “Our model achieves 85% accuracy on our test set”
Business Success: “Using our model’s predictions, we reduced customer churn by 15% this quarter”
Both matter, but business success is what justifies the time and resources invested in your project. You should establish both technical benchmarks (maximum accuracy, minimum error rates) and business benchmarks (cost savings, revenue increase, time saved) before you begin modeling.
Ensuring Metric Alignment: The most critical consideration is ensuring your technical and business metrics are aligned and incentivize the same behavior. For example, if your business goal is to minimize customer complaints, optimizing purely for model accuracy might not be the right approach: a model can achieve high accuracy by rarely flagging anything, yet miss too many legitimate issues and generate more complaints. Instead, you might prioritize recall (catching more potential problems) even if it means lower overall accuracy. The key is choosing technical metrics that, when optimized, naturally drive the business outcomes you care about.
Stakeholder Communication: It’s equally important to educate stakeholders about how models are measured technically and why those measurements matter for business outcomes. Business leaders often focus solely on bottom-line results, but understanding technical metrics helps them make informed decisions about model deployment, resource allocation, and acceptable trade-offs. When stakeholders understand that a 95% accurate fraud detection model still flags thousands of legitimate transactions for review, they can better plan operational processes and set realistic expectations for the model’s impact.
Model Performance Metrics
These metrics evaluate the accuracy and reliability of the ML model's predictions. They include measures you may have encountered before if you've done any ML work: mean squared error, \(R^2\), and mean absolute error for regression problems; accuracy, precision, and recall for classification problems; or BLEU, BERTScore, and perplexity for large language models. The choice of metric depends on the type of ML task (e.g., classification, regression) and the consequences of different kinds of errors.
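To make these concrete, here's a minimal sketch of computing a few of these metrics with scikit-learn; the toy labels and predictions below are invented purely for illustration.

```python
# A minimal sketch of common model performance metrics using scikit-learn.
# The toy labels and predictions below are invented purely for illustration.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             mean_squared_error, mean_absolute_error, r2_score)

# Classification: did the customer churn? (1 = churned, 0 = stayed)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print("accuracy: ", accuracy_score(y_true, y_pred))   # share of correct predictions
print("precision:", precision_score(y_true, y_pred))  # of predicted churners, how many actually churned
print("recall:   ", recall_score(y_true, y_pred))     # of actual churners, how many we caught

# Regression: predicted vs. actual monthly spend (toy values)
y_true_reg = [120.0, 85.5, 210.0, 44.0]
y_pred_reg = [110.0, 90.0, 198.5, 50.0]
print("MSE:", mean_squared_error(y_true_reg, y_pred_reg))
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("R^2:", r2_score(y_true_reg, y_pred_reg))
```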
Further reading: We’ll discuss model performance metrics in more detail in later chapters, but here are some additional readings you can browse now on selecting the right metric for evaluating ML models - Part 1, Part 2
Business Performance Metrics
These metrics measure the real-world impact of your ML system on business outcomes and organizational goals. Common business metrics for ML projects include:
- Financial Metrics: Revenue increase, cost savings, profit margin improvement, return on investment (ROI), customer lifetime value changes
- Operational Metrics: Process automation percentage, time savings, error reduction, productivity improvements, resource utilization
- Customer Metrics: Customer satisfaction scores, churn rate reduction, conversion rate improvement, engagement increases, retention rates
- Risk Metrics: Fraud detection rates, compliance improvements, risk exposure reduction, safety incident decreases
- Efficiency Metrics: Decision-making speed, manual review time reduction, processing capacity increases, workflow optimization
The key is selecting business metrics that directly connect to your organization’s strategic objectives and can be clearly attributed to your ML system’s performance. Remember that business impact often takes time to materialize and may require longer measurement periods than technical metrics.
Understand the Decision Context
Every ML model exists to support human decision-making. Understanding exactly how your predictions will be used helps you frame the problem correctly.
Why this matters so much: The consequences of model failures can range from minor inconveniences to life-changing impacts. History is filled with examples of well-intentioned ML systems that caused significant harm because their creators didn’t fully consider the decision context. From biased hiring algorithms that discriminated against qualified candidates to criminal justice risk assessment tools that perpetuated racial inequities, the stakes of getting this wrong can be enormous.
Machine learning models are increasingly used in high-stakes decisions affecting people’s lives—from loan approvals and job applications to medical diagnoses and criminal sentencing. For a sobering look at what can go wrong when models are deployed without careful consideration of their decision context, see Cathy O’Neil’s “Weapons of Math Destruction,” which documents how algorithmic bias can perpetuate and amplify inequality.
Ask yourself:
- Who will use these predictions? (Marketing team, customer service reps, automated system)
- What action will they take? (Send targeted offers, flag for manual review, adjust pricing)
- How quickly do they need results? (Real-time, daily batch, weekly reports)
- What happens if the model is wrong? (Minor inconvenience, financial loss, safety risk)
- Who is affected by these decisions? (Internal teams, customers, broader society)
- What are the potential unintended consequences? (Bias amplification, privacy violations, safety risks)
For example, a model predicting whether a customer will purchase a product has very different requirements than a model predicting whether a medical device will fail. The first might tolerate some false positives if it leads to higher overall sales; the second requires extremely high precision to avoid potential safety hazards. Similarly, a model used for automated loan approvals carries legal and ethical responsibilities that don’t apply to a model suggesting Netflix recommendations.
20.3 Data Readiness
Once you’ve clearly defined your problem, the next step is ensuring your data is ready to support that goal. This isn’t just about having “enough” data—it’s about having the right data of sufficient quality, organized in a way that allows for valid model training and testing.
Data Quality: Garbage In, Garbage Out
The famous computer science principle “garbage in, garbage out” is especially true for machine learning. Even the most sophisticated algorithms can’t overcome fundamentally flawed data. Before you begin modeling, you need to honestly assess whether your data can support your goals.
Common data quality issues include:
- Missing values: Are there gaps in your data? Are they random, or do they follow patterns that could bias your model?
- Inconsistent formats: Do you have dates recorded as “2023-01-15” in some places and “Jan 15, 2023” in others?
- Duplicate records: The same customer or transaction appearing multiple times can skew your results
- Outliers and errors: Unrealistic values like negative ages or sales dates in the future
- Inconsistent definitions: What exactly counts as a “customer”? An “active user”? A “conversion”?
In most organizations, specific teams own the data assets you’ll be using for your analysis—whether it’s the marketing team managing customer data, the finance team handling transaction records, or the engineering team maintaining system logs. It’s extremely important to work with these data owners upfront to fully understand what you’re working with. They can explain what each field represents, how the data is collected, what business rules affect the data, and any nuances or quirks you should be aware of. This partnership can save you from making costly assumptions and help you identify potential data quality issues before they derail your project.
The Critical Importance of Train/Test Splits
One of the most fundamental concepts in machine learning is the train/test split. This is so important that getting it wrong can invalidate your entire project, regardless of how sophisticated your model is.
While later chapters will dive deep into the practical implementation of data splitting techniques and how to incorporate them into your Python workflow, for now it’s crucial to understand that how you prepare and split your data is a major factor in how well your model will perform on unseen data. The choices you make here directly impact whether your model will succeed or fail in real-world deployment.
Here’s the basic principle: You cannot fairly evaluate a model using the same data you used to train it.
Think of it like studying for an exam. If you practice using the exact same questions that will appear on the test, your practice score will be unrealistically high—it doesn’t reflect how well you’ll perform on new, unseen questions. Similarly, a model’s performance on its training data is almost always overly optimistic.
How Train/Test Splits Work
The solution is to split your data into distinct portions:
- Training Set (typically 70-80% of data): Used to build and tune your model
- Test Set (typically 20-30% of data): Used only to evaluate final model performance
- Validation Set (optional, for complex projects): Used for model selection and tuning
```mermaid
flowchart TD
    A[Complete Dataset] --> B[Training Set<br/>70-80%]
    A --> C[Test Set<br/>20-30%]
    B --> D[Train Model]
    D --> E[Trained Model]
    C --> F[Evaluate Performance]
    E --> F
    F --> G[Unbiased Performance<br/>Estimate]
    style A fill:#f0f8ff
    style B fill:#e8f5e8
    style C fill:#ffe6e6
    style G fill:#fff2cc
```
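Later chapters cover the implementation in depth, but as a preview, here's a minimal sketch of this split using scikit-learn's train_test_split; the feature matrix and labels are random placeholders.

```python
# A minimal sketch of a train/test split with scikit-learn.
# X and y are random placeholders for your feature matrix and target.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # 1,000 rows, 5 features (toy data)
y = np.random.randint(0, 2, 1000)    # binary target (toy data)

# Hold out 25% as a test set; random_state makes the split reproducible,
# and stratify=y keeps the class balance similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)   # (750, 5) (250, 5)
```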
The golden rule: Once you set aside your test set, don’t touch it until you’re completely done with model development. The moment you use test data to make decisions about your model, it’s no longer a fair evaluation.
Random vs. Strategic Splitting
The central principle: How you split your data is crucial and depends entirely on the type of problem you’re solving. The wrong splitting strategy can lead to overly optimistic results that don’t translate to real-world performance, while the right strategy sets your model up for success from the start. We’ll explore the technical implementation of these strategies in later chapters, but understanding when and why to use different approaches is fundamental.
Random Splitting: The Default Approach
How it works: Random splitting assigns each row in your dataset to either training or testing using pure chance—like flipping a coin for each record. This ensures that your training and test sets are representative samples of your overall data, with similar distributions of features and outcomes.
When it’s appropriate: Random splitting works well when:
- Your data represents a single time period or snapshot
- Each row is independent (no customer groupings, time sequences, or hierarchical relationships)
- You’re not trying to predict future events
- Your outcome variable is reasonably balanced
Example: Predicting whether customers will respond to a marketing email using data from a single campaign where each customer received one email.
When Random Splitting Fails: Strategic Alternatives
However, many real-world problems require more thoughtful splitting strategies than random assignment.
The key insight is that your splitting strategy should mirror how your model will be used in the real world. If you’ll use historical data to predict future events, split by time. If you’ll make predictions about new customers, split by customer. The goal is always to create test conditions that simulate actual deployment as closely as possible.
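To preview what these strategies look like in code, here's a rough sketch of a time-based split and a customer-level (group-based) split; the DataFrame and its column names are hypothetical.

```python
# Sketches of time-based and group-based splits with pandas and scikit-learn.
# The DataFrame `df` and its columns ("order_date", "customer_id") are hypothetical.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "order_date": pd.date_range("2023-01-01", periods=100, freq="D"),
    "customer_id": [i % 20 for i in range(100)],
    "amount": range(100),
})

# Time-based split: train on everything before a cutoff, test on what follows.
cutoff = pd.Timestamp("2023-03-15")
train = df[df["order_date"] < cutoff]
test = df[df["order_date"] >= cutoff]
print(len(train), len(test))  # rows before vs. after the cutoff

# Group-based split: every row for a given customer lands on exactly one side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["customer_id"]))
train_g, test_g = df.iloc[train_idx], df.iloc[test_idx]
print(train_g["customer_id"].nunique(), test_g["customer_id"].nunique())  # disjoint customers
```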
20.4 Data Leakage: The Silent Model Killer
Data leakage is one of the most insidious problems in machine learning. It makes your model look incredibly successful during development, only to fail spectacularly when deployed in the real world. Understanding and preventing data leakage is crucial for building models that actually work.
What Is Data Leakage?
Data leakage occurs when information that would not be available at prediction time somehow finds its way into your training data. Essentially, your model is “cheating” by using information from the future or information that contains the answer you’re trying to predict.
The tricky part is that leakage often leads to models with impressive performance metrics—99% accuracy, perfect predictions, results that seem too good to be true. And they are too good to be true.
Types of Data Leakage
Data leakage can creep into your models through several different pathways, each with its own characteristics and warning signs. Understanding these distinct types helps you systematically check for and prevent leakage in your own projects.
How to Spot Data Leakage
Warning signs that might indicate leakage (a quick programmatic screen for the correlation check follows this list):
- Performance that’s too good: If your model achieves near-perfect accuracy on a complex real-world problem, be suspicious
- One feature dominates: If removing a single feature causes model performance to collapse dramatically
- Perfect correlation: If any feature correlates almost perfectly with your target variable
- Temporal inconsistencies: Features that would only be available after the event you’re predicting
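Here's what that correlation screen might look like in practice. The DataFrame and its columns are invented, including a deliberately leaky feature that is essentially the target plus a little noise.

```python
# A rough screen for the "perfect correlation" warning sign above.
# `df` and its columns are invented stand-ins for your data.
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "target": rng.integers(0, 2, 500),
    "feature_a": rng.normal(size=500),
})
df["leaky_feature"] = df["target"] + rng.normal(scale=0.01, size=500)  # nearly the answer itself

# Correlation of each numeric feature with the target; values near +/-1 deserve scrutiny.
corr = df.corr(numeric_only=True)["target"].drop("target").abs().sort_values(ascending=False)
print(corr)
suspicious = corr[corr > 0.95]
print("Investigate these features for leakage:", list(suspicious.index))
```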
Preventing Data Leakage
Successfully preventing data leakage requires developing a systematic mindset and establishing rigorous practices throughout your model development process. The most effective approach is to think like your model will actually be deployed—constantly asking yourself whether each piece of information would realistically be available at the moment you need to make a prediction. This simple question can prevent most leakage issues before they occur.
Understanding your data timeline is equally crucial. Before you begin feature engineering, create a clear timeline showing when each piece of information becomes available in your business process. Your model can only use information that exists before the prediction moment, so mapping out these temporal relationships helps you identify potential temporal leakage early. For instance, if you’re predicting monthly churn, you should only use data that would be available at the beginning of that month, not data that accumulates during the month itself.
Be especially suspicious of features that seem too good to be true. If a feature appears to be a perfect predictor—correlating almost perfectly with your target variable—investigate thoroughly. Often, these “perfect” predictors are actually just different ways of measuring the outcome you’re trying to predict, or they contain information that wouldn’t be available at prediction time. Features that dramatically improve your model’s performance deserve extra scrutiny rather than celebration.
Finally, implement time-aware validation strategies for any problem involving temporal data. Instead of randomly splitting your data, use time-based validation where you train on earlier data and test on later data. This approach mirrors real-world deployment conditions and helps catch temporal leakage that might slip through random validation splits.
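As a preview of what time-aware validation can look like, here's a minimal sketch using scikit-learn's TimeSeriesSplit, where every fold trains on earlier observations and validates on the ones that follow.

```python
# A minimal sketch of time-aware validation with scikit-learn's TimeSeriesSplit.
# Each fold trains on earlier observations and validates on later ones,
# so the model never "sees the future" during evaluation.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(24).reshape(-1, 1)  # 24 time-ordered observations (toy data)
y = np.arange(24)

tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train up to t={train_idx.max()}, "
          f"validate on t={val_idx.min()}..{val_idx.max()}")
```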
A Real-World Example: Hospital Readmission Prediction
Consider a common healthcare analytics challenge: building a model to predict which patients will be readmitted to the hospital within 30 days of discharge. This type of model is valuable for hospitals trying to improve patient care and reduce costs, but it’s also prone to subtle leakage issues that can make the model appear more accurate than it actually is.
Imagine you’re working with a feature called “length_of_initial_stay” measured in days. At first glance, this seems like a perfectly legitimate feature—longer hospital stays might indicate more severe conditions that increase readmission risk. However, the way you calculate this feature makes all the difference between a valid model and a leaky one.
The leaky approach might seem logical: calculate length_of_stay as the total number of days from initial admission to final discharge. If a patient is readmitted within your 30-day prediction window, this calculation would include the readmission dates, artificially inflating the stay length for patients who end up being readmitted. The model might then show a strong correlation between longer stays and readmission risk—but this is circular logic since the readmission itself is contributing to the “longer stay” measurement.
The correct approach calculates length_of_stay only from admission to the initial discharge date, completely ignoring any subsequent readmissions. This ensures you’re only using information that would have been available at the moment of initial discharge when you would actually need to make your prediction. While this version might show weaker correlations, it represents genuine predictive relationships that will hold up in real-world deployment.
This example illustrates how subtle definitional choices can introduce leakage that dramatically inflates apparent model performance while rendering the model useless for its intended purpose. The leaky version might achieve impressive accuracy in testing but would fail completely when deployed because the “future information” it relies on wouldn’t be available for new patients.
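Here's a rough sketch of the two calculations side by side; the patients, dates, and column names are invented for illustration.

```python
# Contrasting the leaky and correct length_of_stay calculations described above.
# Patients, dates, and column names are invented for illustration.
import pandas as pd

stays = pd.DataFrame({
    "patient_id": [1, 2],
    "admit_date": pd.to_datetime(["2024-01-01", "2024-01-05"]),
    "initial_discharge_date": pd.to_datetime(["2024-01-06", "2024-01-09"]),
    # Patient 1's final discharge includes a readmission -- future information!
    "final_discharge_date": pd.to_datetime(["2024-01-20", "2024-01-09"]),
})

# Leaky: spans the readmission, so the feature encodes the outcome being predicted.
stays["los_leaky"] = (stays["final_discharge_date"] - stays["admit_date"]).dt.days

# Correct: uses only what is known at the moment of initial discharge.
stays["los_correct"] = (stays["initial_discharge_date"] - stays["admit_date"]).dt.days

print(stays[["patient_id", "los_leaky", "los_correct"]])
```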
20.5 Fairness, Privacy, and Interpretability: The Human Impact
Machine learning models don’t exist in a vacuum—they make decisions that affect real people’s lives. Whether it’s determining who gets a loan, which job candidates get interviews, or what content people see on social media, your models have ethical implications that go far beyond technical performance metrics.
Fairness: Avoiding Discriminatory Outcomes
Algorithmic fairness means ensuring your model doesn’t systematically discriminate against protected groups based on characteristics like race, gender, age, or other sensitive attributes. The challenge is that bias can creep into models in subtle ways, even when you’re trying to be fair, often through the very data we use to train them.
Bias typically enters through several pathways: historical bias reflects past discrimination embedded in your training data (like historical hiring records that favor certain groups), representation bias occurs when some groups are underrepresented in your dataset, measurement bias stems from biased ways of measuring outcomes (such as credit scores that reflect historical lending discrimination), and proxy bias arises when variables that seem neutral correlate with protected characteristics (like zip codes that correlate with race).
Consider a hiring model trained on historical data that learns certain universities predict success. If those universities had discriminatory admission practices, the model perpetuates that bias into future hiring decisions. Promoting fairness requires proactive steps: auditing your data to understand demographics and historical outcomes, testing for disparate impact across different groups, considering algorithms that can optimize for both accuracy and fairness simultaneously, and maintaining human oversight for high-stakes decisions, especially edge cases.
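As one concrete starting point for testing for disparate impact, the sketch below compares positive-prediction rates across two invented groups using the common four-fifths rule of thumb. Treat it as a screening heuristic, not a legal standard or a complete fairness audit.

```python
# A rough disparate-impact screen: compare positive-prediction rates across groups.
# The data and the group labels are invented for illustration.
import pandas as pd

preds = pd.DataFrame({
    "group":    ["A"] * 100 + ["B"] * 100,
    "selected": [1] * 60 + [0] * 40 + [1] * 35 + [0] * 65,
})

rates = preds.groupby("group")["selected"].mean()
print(rates)

# Four-fifths rule of thumb: flag if any group's selection rate falls below
# 80% of the highest group's rate.
ratio = rates.min() / rates.max()
print(f"disparate impact ratio: {ratio:.2f}",
      "-> review for bias" if ratio < 0.8 else "-> passes this screen")
```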
Privacy: Protecting Sensitive Information
Machine learning models can inadvertently reveal sensitive information about individuals in your training data or make predictions that expose private details. Privacy concerns span multiple dimensions: data collection practices may gather more personal information than necessary, model inversion attacks can use model outputs to infer sensitive information about individuals in the training data, re-identification can occur when “anonymized” data is combined with other sources, and prediction privacy issues arise when models reveal sensitive information through their outputs (like predicting health conditions from seemingly unrelated data).
Protecting privacy requires systematic practices: data minimization ensures you only collect and use necessary information, proper anonymization removes or encrypts personally identifiable information, differential privacy adds carefully calibrated noise to protect individual privacy while preserving statistical patterns, and secure storage ensures data and models are properly protected and accessed only by authorized personnel.
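To give one concrete flavor of differential privacy's core idea, the toy sketch below answers a count query with Laplace noise. Real systems involve privacy budgets and careful sensitivity analysis, so treat this purely as an illustration.

```python
# Toy illustration of differential privacy: answer a count query with Laplace noise.
# epsilon controls the privacy/accuracy trade-off (smaller = more private, noisier).
import numpy as np

rng = np.random.default_rng(42)
ages = rng.integers(18, 90, size=1000)   # invented dataset

def private_count(condition_met: np.ndarray, epsilon: float) -> float:
    true_count = int(condition_met.sum())
    sensitivity = 1  # adding/removing one person changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

print("true count:", int((ages > 65).sum()))
print("private count (eps=0.5):", round(private_count(ages > 65, epsilon=0.5), 1))
```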
Interpretability: Understanding Model Decisions
As models become more complex, understanding why they make specific decisions becomes increasingly difficult. This “black box” problem is particularly concerning for high-stakes applications where decisions significantly impact people’s lives. Interpretability matters for multiple reasons: it enables debugging by helping you understand and fix model mistakes, builds trust among users and stakeholders who need to understand decisions, ensures compliance with regulations that require explainable decisions (like certain credit approval requirements), and supports fairness efforts since you can’t fix bias without understanding how decisions are made.
Interpretability operates at different levels. Global interpretability helps you understand how the model works overall (such as knowing a credit model primarily relies on credit score and income), while local interpretability explains specific decisions (like understanding a particular loan was denied primarily due to high debt-to-income ratio). Common approaches include using inherently interpretable models like linear regression and decision trees, analyzing feature importance to understand which variables matter most, applying techniques like SHAP values to explain individual predictions, and providing counterfactual explanations that show how different inputs would change outcomes.
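To make the global view concrete, here's a small sketch that ranks features by importance using a shallow decision tree on scikit-learn's bundled breast cancer dataset. Local explanations like SHAP values require additional tooling, such as the separate shap package.

```python
# A small sketch of global interpretability: which features drive the model overall?
# Uses scikit-learn's bundled breast cancer dataset so it runs as-is.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y = data.data, data.target

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Global view: rank features by how much the tree relies on them.
importances = pd.Series(model.feature_importances_, index=data.feature_names)
print(importances.sort_values(ascending=False).head(5))
```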
Balancing Competing Concerns
You’ll often face trade-offs between accuracy, fairness, privacy, and interpretability, and there’s no single “right” answer. The key is making these trade-offs consciously and transparently, guided by your specific application context and stakeholder needs. Critical questions to consider include: whether a small decrease in accuracy is justified by significant fairness improvements, how much model complexity is warranted by performance gains, what level of privacy protection is appropriate for your application, and who needs to understand model decisions and at what level of detail. These decisions should align with your organization’s values, regulatory requirements, and the real-world impact of your model’s deployment.
20.6 Summary
Building successful machine learning models requires much more than selecting the right algorithm and achieving high accuracy scores. The foundation of any ML project lies in careful planning, thoughtful problem framing, and systematic consideration of potential pitfalls before you write your first line of modeling code.
Throughout this chapter, you’ve learned that effective machine learning begins with clearly defining your business problem and establishing concrete success criteria. You’ve seen how data quality and proper train/test splits form the bedrock of reliable model evaluation, and how data leakage can make models appear deceptively successful while actually being worthless in practice. Finally, you’ve considered the human impact of machine learning through the lenses of fairness, privacy, and interpretability—recognizing that technical excellence alone isn’t sufficient when models affect people’s lives.
The key insight is that good machine learning isn’t just about algorithms; it’s about problem framing, data discipline, appropriate evaluation, and responsible use. The time you invest in these foundational considerations will pay dividends throughout your project, helping you avoid costly mistakes and build models that actually solve real business problems. Your pre-modeling checklist should include:
- Clear problem definition: Specific, measurable business questions with defined success criteria
- Data quality assessment: Understanding your data’s limitations, biases, and gaps
- Proper validation strategy: Time-aware train/test splits that reflect real-world deployment
- Leakage prevention: Ensuring only information available at prediction time is used for training
- Appropriate metrics: Evaluation criteria aligned with business objectives and model type
- Ethical considerations: Proactive assessment of fairness, privacy, and interpretability requirements
As you move forward in your machine learning journey, remember that the most sophisticated algorithms in the world can’t compensate for poor foundational planning. The habits and mindset you develop now—asking the right questions, being suspicious of results that seem too good to be true, and always considering the human impact of your models—will serve you well regardless of which specific techniques you eventually master.
20.7 End of Chapter Exercise: ML Project Pitfall Analysis
You work as a consultant helping companies identify potential problems in their machine learning projects before they invest significant resources. For each scenario below, identify the key issues that could derail the project and suggest what the team should address before building their models.
Reflection Questions
After analyzing these scenarios, consider:
- Common Patterns: What types of mistakes appear across multiple scenarios?
- Detection Skills: How can you develop the ability to spot these issues early in your own projects?
- Prevention Strategies: What processes or checklists could help teams avoid these pitfalls?
- Stakeholder Communication: How would you explain these technical issues to non-technical business leaders?
The goal of this exercise isn’t to memorize specific problems, but to develop the critical thinking skills and systematic approach that will help you identify and address issues before they derail your machine learning projects.