Lesson 7b: First model with scikit-learn#
In this module, we present how to build predictive models on tabular datasets with only numerical features. In particular, we will highlight:

- the scikit-learn API: `.fit(X, y)` / `.predict(X)` / `.score(X, y)`;
- how to evaluate the generalization performance of a model with a train-test split.
Learning objectives#
By the end of this lesson you will be able to:

- Explain and implement the `.fit()`, `.predict()`, and `.score()` methods provided by the scikit-learn API.
- Evaluate the generalization performance of a model with a train-test split.
Data#
We will use the same dataset “adult_census” described in the previous lesson. For more details about the dataset see http://www.openml.org/d/1590.
import pandas as pd
adult_census = pd.read_csv("../data/adult-census.csv")
Separating features from target#
Scikit-learn expects our features (\(X\)) to be kept separate from our target (\(y\)). Consequently, it's quite common to create separate data objects to hold our feature and target data.
Note
Numerical data is the most natural type of data used in machine learning and can (often) be directly fed into predictive models. Consequently, for this lesson we will use a subset of the original data with only the numerical columns.
Here, we create a `target` data object that contains our target variable (`class`) data and a `features` data object that contains all our numeric feature data.
import numpy as np
# create column names of interest
target_col = "class"
feature_col = adult_census.drop(columns=target_col).select_dtypes(np.number).columns.values
target = adult_census[target_col]
target
0 <=50K
1 <=50K
2 >50K
3 >50K
4 <=50K
...
48837 <=50K
48838 >50K
48839 <=50K
48840 <=50K
48841 >50K
Name: class, Length: 48842, dtype: object
features = adult_census[feature_col]
features
|  | age | education-num | capital-gain | capital-loss | hours-per-week |
| --- | --- | --- | --- | --- | --- |
| 0 | 25 | 7 | 0 | 0 | 40 |
| 1 | 38 | 9 | 0 | 0 | 50 |
| 2 | 28 | 12 | 0 | 0 | 40 |
| 3 | 44 | 10 | 7688 | 0 | 40 |
| 4 | 18 | 10 | 0 | 0 | 30 |
| ... | ... | ... | ... | ... | ... |
| 48837 | 27 | 12 | 0 | 0 | 38 |
| 48838 | 40 | 9 | 0 | 0 | 40 |
| 48839 | 58 | 9 | 0 | 0 | 40 |
| 48840 | 22 | 9 | 0 | 0 | 20 |
| 48841 | 52 | 9 | 15024 | 0 | 40 |
48842 rows × 5 columns
print(
f"The dataset contains {features.shape[0]} samples and "
f"{features.shape[1]} features"
)
The dataset contains 48842 samples and 5 features
Knowledge check#
Questions:
What type of object is the target data set?
What type of object is the feature data set?
Fit a model#
We will build a classification model using the “K-nearest neighbors” strategy. To predict the target of a new sample, a k-nearest neighbors model finds the `k` closest samples in the training set and predicts the majority target among these samples.
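To make this prediction rule concrete, here is a minimal sketch in plain NumPy. This is not scikit-learn's implementation (which is heavily optimized); it assumes `X_train` and `y_train` are NumPy arrays, `new_sample` is a single row of features, and Euclidean distance is used:

```python
import numpy as np

def knn_predict_one(X_train, y_train, new_sample, k=5):
    """Predict the majority target among the k closest training samples."""
    # Euclidean distance from the new sample to every training sample
    distances = np.sqrt(((X_train - new_sample) ** 2).sum(axis=1))
    # positions of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # majority vote among the targets of those samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```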
Note
We use a K-nearest neighbors model here because it is an intuitive algorithm. However, be aware that it is seldom useful in practice. In future lessons, we will introduce alternative algorithms.
The `.fit` method is called to train the model from the input (features) and target data.
# to display nice model diagram
from sklearn import set_config
set_config(display='diagram')
from sklearn.neighbors import KNeighborsClassifier
# 1. define the algorithm
model = KNeighborsClassifier()
# 2. fit the model
model.fit(features, target)
KNeighborsClassifier()
Learning can be represented as follows:
Fig. 15: `.fit` method representation.
The `fit` method is based on two important elements: (i) a learning algorithm and (ii) the model state. The model state can be used later to either predict (for classifiers and regressors) or transform data (for transformers).
Note
Here and later, we use the explicit names `features` and `target`. In the scikit-learn documentation, the feature data is commonly named `X` and the target is commonly called `y`.
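After fitting, scikit-learn estimators expose this learned state through attributes whose names end with an underscore. As a quick illustration for the fitted `model` above (the exact attributes available vary by estimator and scikit-learn version):

```python
# classes seen in the target during fit
print(model.classes_)

# number of training samples kept in the model state
print(model.n_samples_fit_)
```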
Make predictions#
Let’s use our model to make some predictions using the same dataset. To predict, a model uses a prediction function that will use the input data together with the model states.
target_predicted = model.predict(features)
We can illustrate the prediction mechanism as follows:
Fig. 16: `.predict` method representation.
As with the learning algorithm and the model state, the prediction function is specific to each type of model.
Let's now have a look at the computed predictions. For the sake of simplicity, we will look at the first five predicted targets.
target_predicted[:5]
array([' <=50K', ' <=50K', ' <=50K', ' >50K', ' <=50K'], dtype=object)
…and we can even check if the predictions agree with the real targets:
# compare the first 5 predictions with the actual targets
target[:5] == target_predicted[:5]
0 True
1 True
2 False
3 True
4 True
Name: class, dtype: bool
Note
Here, we see that our model makes a mistake when predicting the third observation.
To get a better assessment, we can compute the average accuracy rate.
(target == target_predicted).mean()
np.float64(0.8479791982310306)
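The same value can also be obtained with scikit-learn's built-in metric:

```python
from sklearn.metrics import accuracy_score

# fraction of predictions that match the true targets
accuracy_score(target, target_predicted)
```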
This result means that the model makes a correct prediction for approximately 85 samples out of 100. Note that we used the same data to train and evaluate our model. Can this evaluation be trusted or is it too good to be true?
Train-test data split#
When building a machine learning model, it is important to evaluate the trained model on data that was not used to fit it, because generalization is our primary concern: we want a rule that generalizes to new data.
Correct evaluation is easily done by leaving out a subset of the data when training the model and using it afterwards for model evaluation.
The data used to fit a model is called training data while the data used to assess a model is called testing data.
Scikit-learn provides the helper function `sklearn.model_selection.train_test_split`, which automatically splits the dataset into two subsets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
features,
target,
random_state=123,
test_size=0.25,
stratify=target
)
Note
In scikit-learn, setting the `random_state` parameter allows us to get deterministic results when a random number generator is involved. In the `train_test_split` case, the randomness comes from shuffling the data, which determines how the dataset is split into a train and a test set.
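As a quick check of this determinism (a minimal sketch reusing the `features` and `target` objects from above), two splits with the same seed produce identical subsets:

```python
# splitting twice with the same random_state yields the same subsets
X_a, _, y_a, _ = train_test_split(features, target, random_state=123)
X_b, _, y_b, _ = train_test_split(features, target, random_state=123)

# pandas .equals compares both the values and the index
print(X_a.equals(X_b), y_a.equals(y_b))  # True True
```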
As your target becomes more imbalanced, it is important to use the `stratify` parameter so that the original class proportions are preserved in both the train and test sets.
Knowledge check#
Questions:
How many observations are in your train and test data sets?
What is the proportion of response values in your `y_train` and `y_test`?
Instead of computing the predictions and manually computing the average success rate, we can use the `.score` method. When dealing with classifiers, this method returns the mean accuracy.
We can illustrate the score mechanism as follows:
Fig. 17: `.score` method representation.
The `.score` method is very similar to the `.predict` method; however, it adds one additional step: it compares the predictions to the actual values and returns an accuracy score. Note how below we use the test data in the `.score` method, so that we are scoring our accuracy on test data (data not used to train our model).
# 1. define the algorithm
model = KNeighborsClassifier()
# 2. fit the model
model.fit(X_train, y_train)
# 3. score our model on test data
accuracy = model.score(X_test, y_test)
print(f'The test accuracy using {model.__class__.__name__} is {round(accuracy, 4) * 100}%')
The test accuracy using KNeighborsClassifier is 82.59%
Note
If we compare with the accuracy obtained by wrongly evaluating the model on the training set, we find that this evaluation was indeed optimistic compared to the score obtained on a held-out test set.
This illustrates the importance of always testing the generalization performance of predictive models on a different set than the one used to train these models.
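To see the gap directly (a small sketch reusing the fitted `model` from above), we can compare the score on the training set with the score on the held-out test set:

```python
# score the same fitted model on the data it saw during training
train_accuracy = model.score(X_train, y_train)
# ...and on data it has never seen
test_accuracy = model.score(X_test, y_test)
print(f"train: {train_accuracy:.4f} vs. test: {test_accuracy:.4f}")
```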
Exercises#
Questions:
Scikit-learn provides a logistic regression algorithm, which is another type of algorithm for making binary classification predictions. This algorithm is available as `sklearn.linear_model.LogisticRegression`. Fill in the blanks below to import the `LogisticRegression` class, define the algorithm, fit the model, and score it on the test data.
```python
# 1. import the LogisticRegression module
from sklearn.linear_model import __________
# 2. define the algorithm
model = __________
# 3. fit the model
model.fit(______, ______)
# 4. score our model on test data
model.score(______, ______)
```
How does this model's performance compare to the `KNeighborsClassifier` results?
Computing environment#
%load_ext watermark
%watermark -v -p jupyterlab,pandas,numpy,sklearn
Python implementation: CPython
Python version : 3.12.4
IPython version : 8.26.0
jupyterlab: 4.2.3
pandas : 2.2.2
numpy : 2.0.0
sklearn : 1.5.1