The last time Hackerfall tried to access this page, it returned a not found error. A cached version of the page is below, or clickhereto continue anyway

Misleading modelling: overfitting, cross-validation, and the bias-variance trade-off | Cambridge Coding Academy

Introduction

In this post you will get to grips with what is perhaps the most essential concept in machine learning: the bias-variance trade-off. The main idea here is that you want to create models that are as good at prediction as possible but that are still applicable to new data (i.e. they are generalizable). The danger is that you can easily create models that overfit to the local noise in your specific dataset, which isn’t too helpful and leads to poor generalizability since the noise is random and therefore different in each dataset. Essentially, you want to create models that capture only the useful components of a dataset. On the other hand, models that generalize very well but are too inflexible to generate good predictions are the other extreme you want to avoid (this is called underfitting).

We discuss and demonstrate these concepts using the k-nearest neighbors algorithm, which has a simple parameter k which can be varied to cleanly demonstrate these ideas of underfitting, overfitting and generalization. Together, this bundle of concepts related to the balance between underfitting and overfitting is referred to as the bias-variance trade-off. Here is a table summarizing some different but related characteristics of models which either underfit or overfit, which you can refer to throughout this post:

We will explain what all of these terms mean and how they are interrelated. We will also discuss cross-validation, which is a good way of estimating the accuracy and generalizability of your models.

You will encounter all of these concepts in future blog posts, which will cover model optimization, random forests, Naive Bayes, logistic regression and how to combine different models into an ensembled meta-model.

Generating the dataset

Let’s start off by building an artificial dataset to play with. You can do this easily with the make_classification() function from sklearn.datasets. Specifically, you will generate a relatively simple binary classification problem. To make it a bit more interesting, let’s make the data crescent-shaped and add some random noise. This should make it more realistic and increase the difficulty of classifying observations.

# Creating the dataset
# e.g. make_moons generates crescent-shaped data
# Check out make_classification, which generates linearly-separable data
from sklearn.datasets import make_moons

X, y = make_moons(
    n_samples=500,  # the number of observations
    random_state=1,
    noise=0.3
)

# Take a peek
print(X[:10,])
print(y[:10])
[[ 0.50316464  0.11135559]
 [ 1.06597837 -0.63035547]
 [ 0.95663377  0.58199637]
 [ 0.33961202  0.40713937]
 [ 2.17952333 -0.08488181]
 [ 2.00520942  0.7817976 ]
 [ 0.12531776 -0.14925731]
 [ 1.06990641  0.36447753]
 [-0.76391099 -0.6136396 ]
 [ 0.55678871  0.8810501 ]]
[1 1 0 0 1 1 1 0 0 0]

The dataset you just generated looks a bit like this:

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

%matplotlib inline  # for the plots to appear inline in jupyter notebooks

# Plot the first feature against the other, color by class
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="#EE3D34", marker="x")
plt.scatter(X[y == 0, 0], X[y == 0, 1], color="#4458A7", marker="o")

Next up, let’s split the dataset into a training set and test set. The training set will be used to develop and tune our models. The test set will be completely left alone until the very end, at which point you’ll run your finished models on it. Having a test set will allow you to get a good estimate of how well your models would perform out in the wild on previously unseen data.

from sklearn.cross_validation import train_test_split

# Split into training and test sets
XTrain, XTest, yTrain, yTest = train_test_split(X, y, random_state=1, test_size=0.5)

You are going to try to predict the classes in our dataset with a k Nearest Neighbor (kNN) classifier. Chapter 2 of the Introduction to Statistical Learning book provides a great intro to the theory behind kNN. We are huge fans of the ISLR book, so definitely check it out if you have the time. You could also have a look at this previous post that teaches you how to implement the algorithm from scratch in Python.

Introducing the k hyperparameter in kNN

The kNN algorithm works by using information about the k-nearest neighbors of a new data point in order to assign it a class label. It simply looks at the class of other data points most similar to it (its ‘nearest neighbors’) and assigns the new data point to the most common class of these neighbors. When using kNN, you have to set the value of k that you want the algorithm to use ahead of time, and it is not trivial to know which value to use.

If the value for k is high (e.g. k=99), then the model considers a large number of neighbors when making a a decision about the class of an unknown datapoint. This means that the model is quite constrained, since it has to take a large amount of information into account when classifying instances. In other words, a high number for k give rise to relatively “rigid” model behaviour.

By contrast, if the value for k is low (e.g. k=1 or k=2), then only a few neighbors are taken into account when making a classification decision. It is a very flexible model with a lot of complexity – it really fits very closely to the precise shape of the dataset. Hence, the predictions of the model are much more dependent on the local tendencies of the data (crucially, this includes the noise!).

Take a look at how the kNN algorithm separates the training cases when k=99 compared to when k=1. The green line is the decision boundary on the training data (i.e. the threshold at which the algorithm decides whether a data point belong in the blue or red class).

At the bottom of the post you learn how to generate these plots yourself, but let’s delve into some theory first.

When k=99 (on the left), it looks like the model fit might be a bit too smooth and could stand to fit the data a bit closer. The model has low flexibility and low complexity. It paints the decision boundary with a broad brush. It has relatively high bias because one can tell it is not modelling the data as well as it could – it models the underlying generative process of the data as something too simple, and this is highly biased away from the ground truth. But, the decision boundary would probably look very similar if you redrew it on a slightly different dataset. It is a stable model that won’t vary a lot – it has low variance.

When k=1 (on the right), you can see that the model is massively overfitting to the noise. It is technically generating perfectly correct predictions on the training set (the error in the bottom right hand corner is equal to 0.00!), but hopefully you can see how this fit is way too sensitive to individual data points. Keep in mind that you added noise to the dataset – it looks like this model fit is taking the noise too seriously and is fitting very closely to it. You can say that the k=1 model has high flexibility and high complexity because it tunes very tightly to the data. It also has low bias – if nothing else, the decision boundary certainly fits the trends you observe in the data. But, the fitted boundary would drastically change on even slightly different data – it would vary significantly, i.e. the k=1 model has high variance.

But how well do these models generalize, i.e. how well would they perform on new data?

You have so far only looked at the training data, but quantifying training error isn’t that useful. You are not interested in how well models can recapitulate what they just learned on the training set. Let’s take a look at how they perform on test data, since that gives a better impression of whether your models are actually good or not. Try it yourself using a few different values of k:

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

knn99 = KNeighborsClassifier(n_neighbors = 99)
knn99.fit(XTrain, yTrain)
yPredK99 = knn99.predict(XTest)
print "Overall Error of k=99 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK99), 2)

knn1 = KNeighborsClassifier(n_neighbors = 1)
knn1.fit(XTrain, yTrain)
yPredK1 = knn1.predict(XTest)
print "Overall Error of k=1 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK1), 2)
Overall Error of k=99 Model: 0.15
Overall Error of k=1 Model: 0.15

Actually, it looks like these models perform approximately equally well on the test data. Here are the decision boundaries you learned on the training set, applied to the test set. See if you can figure out where the two models are making their mistakes.

The two models are making mistakes for very different reasons. It seems that the k=99 model isn’t doing a good job at capturing the crescent shape of the data (it is underfitting), while the k=1 model is making mistakes by being horribly overfitted to the noise. Remember, the hallmark of overfitting is good training performance and bad testing performance, which is what you observe here.

Maybe intermediate values of k are where you want to be? Try a value of k between 99 and 1, e.g.:

knn50 = KNeighborsClassifier(n_neighbors = 50)
knn50.fit(XTrain, yTrain)
yPredK50 = knn50.predict(XTest)
print "Overall Error of k=50 Model:", 1 - round(metrics.accuracy_score(yTest, yPredK50), 2)
Overall Error of k=50 Model: 0.11

Looking better! Let’s check out the decision boundary for the k=50 model.

Much better – the model fit is similar to the actual trend in the dataset and this improvement is reflected in a lower test set error.

The bias-variance trade-off: concluding comments

Hopefully you now have a good intuition for what it means for models to underfit and overfit. See if all of the terms in the table at the beginning of this post now make sense. Basically, finding the right balance between overfitting and underfitting corresponds to the bias-variance trade-off.

To recap, when you train machine learning algorithms on a data set, what you are really interested in is how your model will perform on an independent data set. It is not enough to do a good job classifying instances on the training set. Essentially, you are only interested in building models that are generalizable – getting 100% accuracy on the training set is not impressive, and is simply an indicator of overfitting. Overfitting is the situation in which you have fitted your model too closely to the data, and have tuned to the noise instead of just to the signal.

To be clear: strictly speaking, you are not trying to model the trends in the dataset. You try to model the real world process that has led to us observing the data. The specific dataset you happen to be working with is just a small set of instances (i.e. a sample) of the ground truth, which brings with it its own noise and peculiarities.

Here is a summary figure showing how under-fitting (high bias, low variance), properly fitting, and over-fitting (low bias, high variance) models fare on the training compared to the test sets:

This idea of building generalizable models is the motivation behind splitting your dataset into a training set (on which models can be trained) and a test set (which is held out until the very end of your analysis, and provides an accurate measure of model performance).

But – BIG warning! It’s also possibly to overfit to the test set. If you were to try out lots of different models and keep changing them in order to chase accuracy points on the test set, then the information from the test set can inadvertently leak into our model creation phase. You need a way around this.

Estimating model performance using k-fold cross validation

Enter k-fold cross-validation, which is a handy technique for measuring a model’s performance using only the training set. Say that you want to do e.g. 10-fold cross-validation. The process is as follows: you randomly partition the training set into 10 equal sections. Then, we train an algorithm on 9/10ths (i.e. 9 out of the 10 sections) of that training set. You then evaluate its performance on the remaining 1 section. This gives you some measure of the model’s performance (e.g. overall accuracy). You then train the algorithm on a different 9/10ths of the training set, and evaluate on the other (different from before) remaining 1 section. You continue the process 10 times, get 10 different measures of model performance, and average these values to get an overall measure of performance. Of course, you could have chosen some number other than 10. To keep on with the example, the process behind 10-fold CV looks like this:

You can use k-fold cross validation to get an estimate of model accuracy, and you can use these estimates to tweak your model until you are happy. This lets you leave the test data alone until the very end, thus side-stepping the danger of overfitting to it. In other words, cross-validation provides a way to simulate having more data than you actually have so that you do not have to “spend” your test data until the very end of model building. k-fold cross validation, and its variants, are extremely popular and very useful, especially if you’re trying out lots and lots of different models (e.g. if you want to test how well a load of differently parameterized models perform).

Comparing training error, cross-validation error, and test error

So what would be the best value for k? Try out different values for k when building models with the training data and see how well the resulting models fare when predicting the classes of either the training set itself or the test set. Finally, see how well k-fold cross validation will be at indicating the best k.

Note: in practice, when scanning a parameter like this, it would be a bad idea to use the training set for testing the model. In equal measure, you would never scan a parameter using the test set multiple times (once for each parameter value tried). In the following plot you use these calculations just as an illustration to see what they would look like. In practice, only k-fold cross validation is a safe approach!

import numpy as np
from sklearn.cross_validation import train_test_split, cross_val_score

knn = KNeighborsClassifier()

# the range of number of neighbors you want to test
n_neighbors = np.arange(1, 141, 2)

# here you store the models for each dataset used
train_scores = list()
test_scores = list()
cv_scores = list()

# loop through possible n_neighbors and try them out
for n in n_neighbors:
    knn.n_neighbors = n
    knn.fit(XTrain, yTrain)
    train_scores.append(1 - metrics.accuracy_score(yTrain, knn.predict(XTrain)))  # this will over-estimate the accuracy
    test_scores.append(1 - metrics.accuracy_score(yTest, knn.predict(XTest)))
    cv_scores.append(1 - cross_val_score(knn, XTrain, yTrain, cv = 10).mean())  # you take the mean of the CV scores

So what would be the best value to pick for k? When multiple values are giving the same prediction error, you just pick the smallest value for k.

# what do these different datasets think is the best value of k?
print(
    'The best values of k are:\n' \
    '{} according to the Training Set\n' \
    '{} according to the Test Set and\n' \
    '{} according to Cross-Validation'.format(
        min(n_neighbors[train_scores == min(train_scores)]),
        min(n_neighbors[test_scores == min(test_scores)]),
        min(n_neighbors[cv_scores == min(cv_scores)])        
    )
)    
The best values of k are: 
1 according to the Training Set
23 according to the Test Set and
11 according to Cross-Validation

Rather than just collecting the best ks, have a peek at what the prediction error is over the range of ks tested.

# let's plot the error you get with different values of k
plt.figure(figsize=(10,7.5))
plt.plot(n_neighbors, train_scores, c="black", label="Training Set")
plt.plot(n_neighbors, test_scores, c="black", linestyle="--", label="Test Set")
plt.plot(n_neighbors, cv_scores, c="green", label="Cross-Validation")
plt.xlabel('Number of K Nearest Neighbors')
plt.ylabel('Classification Error')
plt.gca().invert_xaxis()
plt.legend(loc = "lower left")
plt.show()

Let’s talk about the classification error on the training set. The fewer neighbors you consider, the lower the prediction error when evaluating the models on the training set (black solid line). This makes sense, since you approach the scenario where each point only considers its own self when making new classifications, which leads to perfect “predictions”. The test data error follows a similar trajectory, but experiences an increase after a certain point because of local overfitting. This behaviour indicates that now the specific test set sample is not modelled very well by the model fit which was built on the training data.

In this plot, you see that especially for low values of k, using k-fold cross-validation highlights a region in the parameter space (i.e. very low values of k) that is very prone to overfitting. Even though cross-validation and the test set evaluation lead to somewhat different optima, they are both pretty decent and you are clearly in the right ballpark. You can also see that cross-validation is a reasonable estimator of test error. This type of plot is nice to study in order to get a good feeling for how a certain parameter influences model performance and help build an intuition about your dataset.

Just show me the code!

Here is the code for generating the above plots, and doing the training and testing of different kNN algorithms. This code is an adapted version of this scikit-learn example, and most it deals with the finicky details of calculating decision boundaries and making the plots look nice. The meaty machine learning parts of splitting the dataset, fitting the algorithm, and testing it were covered above.

def detect_plot_dimension(X, h=0.02, b=0.05):
    x_min, x_max = X[:, 0].min() - b, X[:, 0].max() + b
    y_min, y_max = X[:, 1].min() - b, X[:, 1].max() + b
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    dimension = xx, yy
    return dimension

def detect_decision_boundary(dimension, model):
    xx, yy = dimension  # unpack the dimensions
    boundary = model.predict(np.c_[xx.ravel(), yy.ravel()])
    boundary = boundary.reshape(xx.shape)  # Put the result into a color plot
    return boundary

def plot_decision_boundary(panel, dimension, boundary, colors=['#DADDED', '#FBD8D8']):
    xx, yy = dimension  # unpack the dimensions
    panel.contourf(xx, yy, boundary, cmap=ListedColormap(colors), alpha=1)
    panel.contour(xx, yy, boundary, colors="g", alpha=1, linewidths=0.5)  # the decision boundary in green

def plot_dataset(panel, X, y, colors=["#EE3D34", "#4458A7"], markers=["x", "o"]):
    panel.scatter(X[y == 1, 0], X[y == 1, 1], color=colors[0], marker=markers[0])
    panel.scatter(X[y == 0, 0], X[y == 0, 1], color=colors[1], marker=markers[1])

def calculate_prediction_error(model, X, y):
    yPred = model.predict(X)
    score = 1 - round(metrics.accuracy_score(y, yPred), 2)
    return score

def plot_prediction_error(panel, dimension, score, b=.3):
    xx, yy = dimension  # unpack the dimensions
    panel.text(xx.max() - b, yy.min() + b, ('%.2f' % score).lstrip('0'), size=15, horizontalalignment='right')

def explore_fitting_boundaries(model, n_neighbors, datasets, width):  
    # determine the height of the plot given the aspect ration of each panel should be equal
    height = float(width)/len(n_neighbors) * len(datasets.keys())

    nrows = len(datasets.keys())
    ncols = len(n_neighbors)

    # set up the plot
    figure, axes = plt.subplots(
        nrows,
        ncols,
        figsize=(width, height),
        sharex=True,
        sharey=True
    )

    dimension = detect_plot_dimension(X, h=0.02)  # the dimension each subplot based on the data

    # Plotting the dataset and decision boundaries
    i = 0
    for n in n_neighbors:
        model.n_neighbors = n
        model.fit(datasets["Training Set"][0], datasets["Training Set"][1])
        boundary = detect_decision_boundary(dimension, model)
        j = 0
        for d in datasets.keys():
            try:
                panel = axes[j, i]
            except (TypeError, IndexError):
                if (nrows * ncols) == 1:
                    panel = axes
                elif nrows == 1:  # if you only have one dataset
                    panel = axes[i]
                elif ncols == 1:  # if you only try one number of neighbors
                    panel = axes[j]
            plot_decision_boundary(panel, dimension, boundary)  # plot the decision boundary
            plot_dataset(panel, X=datasets[d][0], y=datasets[d][1])  # plot the observations
            score = calculate_prediction_error(model, X=datasets[d][0], y=datasets[d][1])
            plot_prediction_error(panel, dimension, score, b=0.2)  # plot the score

            # make compacted layout
            panel.set_frame_on(False)
            panel.set_xticks([])
            panel.set_yticks([])

            # format the axis labels
            if i == 0:
                panel.set_ylabel(d)
            if j == 0:
                panel.set_title('k={}'.format(n))
            j += 1
        i += 1    

    plt.subplots_adjust(hspace=0, wspace=0)  # make compacted layout

You can then run the code like this:

# specify the model and settings
model = KNeighborsClassifier()
n_neighbors = [200, 99, 50, 23, 11, 1]
datasets = {
    "Training Set": [XTrain, yTrain],
    "Test Set": [XTest, yTest]
}
width = 20

# explore_fitting_boundaries(model, n_neighbors, datasets, width)
explore_fitting_boundaries(model=model, n_neighbors=n_neighbors, datasets=datasets, width=width)

Conclusion

The bias-variance trade-off appears in a lot of different areas of machine learning. All algorithms can be considered to have a certain degree of flexibility and this is certainly not specific to kNN. The goal of finding the sweet spot of flexibility that describes the patterns in the data well but is still generalizable to new data applies to basically all algorithms.

ABOUT THE AUTHORS

Natasha Latysheva

Natasha is a computational biology PhD student at the MRC Laboratory of Molecular Biology. Her research is focused on cancer genomics, statistical network analysis, and protein structure. More generally, her research interests lie in data-intensive molecular biology, machine learning (especially deep learning) and data science.

Charles Ravarani

Charles is a research associate at the MRC Laboratory of Molecular Biology. Since the time of his PhD in Computational Biology, Charles has been working with large-scale genomic datasets to build molecular models of gene expression noise that ultimately improve the efficiency of current drug treatments. In his research, machine learning techniques have proven very useful and he is keen to be involved in the wider ML community to learn about new techniques.

Like this:

Like Loading...

Related

Continue reading on blog.cambridgecoding.com