# inline plotting instead of popping out
%matplotlib inline
# python 3.8.8
import os, itertools, csv
from IPython.display import Image
from IPython.display import display
# numpy 1.22.4
import numpy as np
# pandas 1.2.4
import pandas as pd
# scikit-learn 0.24.1
from sklearn import datasets
load_iris = datasets.load_iris
make_moons = datasets.make_moons
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, roc_curve, auc
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier
# matplotlib 3.3.4
import matplotlib.pyplot as plt
# load utility classes/functions e.g., plot_decision_regions()
import urllib.request
urllib.request.urlretrieve("https://nthu-datalab.github.io/ml/labs/04-1_Perceptron_Adaline/lab04lib.py", "lab04lib.py")
from lab04lib import *
# Make output directory
if not os.path.exists("output/") : os.mkdir("output/")
import warnings
warnings.filterwarnings("ignore")
In this lab, we will guide you through the cross-validation technique for hyperparameter selection. We will also practice ensemble learning techniques that combine multiple base learners for better performance.
So far, we have held out validation and testing sets for hyperparameter tuning and performance reporting. Specifically, we partition a dataset $\mathbb{X}$ into training, validation, and testing sets. We use the training set to fit a model given a set of hyperparameters, and then use the validation set to evaluate the performance of the model under those hyperparameters. We repeat these two steps with different sets of hyperparameters and pick the set that leads to the highest validation performance. We then use both the training and validation sets to train our final model, and apply it to the testing set to evaluate/report the generalization performance. The following figure illustrates the procedure:
Next, we apply this technique to evaluate the KNeighborsClassifier on the Iris dataset. For simplicity, we consider the sepal width and petal length features only. Let's split the dataset first:
iris = load_iris()
X, y = iris.data[:,[1,2]], iris.target
# hold out testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# hold out validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=0)
Then, we iterate through each value of the hyperparameter n_neighbors = 1, 15, 50, train on the training set, estimate the performance on the validation set, and record the best value:
best_k, best_score = -1, -1
clfs = {}
# hyperparameter tuning
for k in [1, 15, 50]:
pipe = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(n_neighbors=k)]])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_val)
score = accuracy_score(y_val, y_pred)
print('[{}-NN]\nValidation accuracy: {}'.format(k, score))
if score > best_score:
best_k, best_score = k, score
clfs[k] = pipe
# performance reporting
y_pred = clfs[best_k].predict(X_test)
print('\nTest accuracy: %.2f (n_neighbors=%d selected by the holdout method)' %
(accuracy_score(y_test, y_pred), best_k))
[1-NN]
Validation accuracy: 0.9375
[15-NN]
Validation accuracy: 0.90625
[50-NN]
Validation accuracy: 0.4375

Test accuracy: 0.89 (n_neighbors=1 selected by the holdout method)
One major disadvantage of the holdout method is that the validation and testing performance is sensitive to the random splits. If we get an unfortunate split such that the validation (resp. testing) set is unrepresentative, we may end up picking suboptimal hyperparameters (resp. reporting a misleading performance score).
In this case, the hyperparameter n_neighbors = 15 actually leads to better test performance:
y_pred = clfs[15].predict(X_test)
print('Test accuracy: %.2f (n_neighbors=15 selected manually)' %
accuracy_score(y_test, y_pred))
Test accuracy: 0.91 (n_neighbors=15 selected manually)
We can see that this validation split is unrepresentative: it ranks n_neighbors = 1 above n_neighbors = 15, even though the latter generalizes better to the test set.
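To see this sensitivity concretely, we can re-run the same 15-NN pipeline on a few different random train/validation splits of the Iris features and watch the validation accuracy fluctuate. This is a small illustrative sketch only; the seed values are arbitrary and the split here ignores the test-set holdout:
# sketch: the same model evaluated on different random train/validation splits
for seed in [0, 1, 2, 3, 4]:  # arbitrary seeds, for illustration only
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=seed)
    pipe = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(n_neighbors=15)]])
    pipe.fit(X_tr, y_tr)
    print('random_state=%d  validation accuracy: %.3f' % (seed, pipe.score(X_va, y_va)))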
Next, we take a look at a more robust technique called $K$-fold cross-validation.
In $K$-fold cross-validation (CV), we randomly split the training dataset into $K$ folds without replacement, where $K-1$ folds are used for model training and the remaining fold is used for testing. This procedure is repeated $K$ times so that we obtain $K$ models and $K$ performance estimates, and we take their average as the final performance estimate. The following figure illustrates 10-fold CV:
We can apply $K$-fold CV to hyperparameter tuning, performance reporting, or both. The advantage of this approach is that the performance estimate is less sensitive to unfortunate splits of the data. In addition, it utilizes the data better, since each example can be used for both training and validation/testing.
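To make the figure above concrete, here is a minimal sketch of the index splits that KFold generates: each example lands in exactly one validation fold, and averaging the per-fold scores gives the final estimate (the toy array below is just for illustration):
# sketch: inspect the train/validation indices produced by 5-fold CV on a toy array
X_toy = np.arange(10).reshape(-1, 1)  # 10 toy examples
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy)):
    print('Fold %d: train=%s validation=%s' % (fold + 1, train_idx, val_idx))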
Let's use $K$-fold CV to select the hyperparameter n_neighbors of the KNeighborsClassifier:
iris = load_iris()
X, y = iris.data[:,[1,2]], iris.target
# hold out testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
The dataset is first split into training/testing sets.
best_k, best_score = -1, -1
clfs = {}
for k in [1, 15, 50]: # try each hyperparameter value
pipe = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(n_neighbors=k)]])
pipe.fit(X_train, y_train)
# K-Fold CV
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print('[%d-NN]\nValidation accuracy: %.3f %s' % (k, scores.mean(), scores))
if scores.mean() > best_score:
best_k, best_score = k, scores.mean()
clfs[k] = pipe
[1-NN]
Validation accuracy: 0.895 [0.9047619 0.95238095 0.85714286 0.85714286 0.9047619 ]
[15-NN]
Validation accuracy: 0.895 [0.9047619 0.85714286 0.9047619 0.9047619 0.9047619 ]
[50-NN]
Validation accuracy: 0.819 [0.80952381 0.80952381 0.85714286 0.80952381 0.80952381]
Here, 5-fold CV gives n_neighbors = 1 and n_neighbors = 15 the same mean validation accuracy (0.895), and the strict comparison above keeps the first candidate, n_neighbors = 1. Once the hyperparameter value is selected, we retrain the model on the complete training set and obtain a final performance estimate on the test set:
best_clf = clfs[best_k]
best_clf.fit(X_train, y_train)
# performance reporting
y_pred = best_clf.predict(X_test)
print('Test accuracy: %.2f (n_neighbors=%d selected by 5-fold CV)' %
(accuracy_score(y_test, y_pred), best_k))
Test accuracy: 0.84 (n_neighbors=1 selected by 5-fold CV)
We can also apply $K$-fold CV to both the hyperparameter selection and the performance reporting at the same time; this is called nested CV. The following figure illustrates $5\times 2$ nested CV:
where the hyperparameter values are selected by an inner 2-fold CV and the generalization performance is estimated by an outer 5-fold CV. In the code below we use a 10-fold inner loop instead, i.e., a $5\times 10$ nested CV. Let's try this ourselves:
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = KFold(n_splits=10, shuffle=True, random_state=1)
outer_scores = []
# outer folds
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
print('[Outer fold %d/5]' % (i + 1))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
best_k, best_score = -1, -1
clfs = {}
# hyperparameter tuning
for k in [1, 15, 50]:
inner_scores = []
# inner folds
for itrain_idx, val_idx in inner_cv.split(X_train, y_train):
X_itrain, X_val = X_train[itrain_idx], X_train[val_idx]
y_itrain, y_val = y_train[itrain_idx], y_train[val_idx]
pipe = Pipeline([['sc', StandardScaler()],
['clf', KNeighborsClassifier(n_neighbors=k)]])
pipe.fit(X_itrain, y_itrain)
y_pred = pipe.predict(X_val)
inner_scores.append(accuracy_score(y_val, y_pred))
score_mean = np.mean(inner_scores)
if best_score < score_mean:
best_k, best_score = k, score_mean
clfs[k] = pipe
# evaluate performance on test fold
best_clf = clfs[best_k]
best_clf.fit(X_train, y_train)
y_pred = best_clf.predict(X_test)
outer_scores.append(accuracy_score(y_test, y_pred))
print('Test accuracy: %.2f (n_neighbors=%d selected by inner 10-fold CV)' %
(outer_scores[i], best_k))
print('\nTest accuracy: %.2f (5x10 nested CV)' % np.mean(outer_scores))
[Outer fold 1/5]
Test accuracy: 0.90 (n_neighbors=1 selected by inner 10-fold CV)
[Outer fold 2/5]
Test accuracy: 0.90 (n_neighbors=15 selected by inner 10-fold CV)
[Outer fold 3/5]
Test accuracy: 0.90 (n_neighbors=15 selected by inner 10-fold CV)
[Outer fold 4/5]
Test accuracy: 0.93 (n_neighbors=15 selected by inner 10-fold CV)
[Outer fold 5/5]
Test accuracy: 1.00 (n_neighbors=15 selected by inner 10-fold CV)

Test accuracy: 0.93 (5x10 nested CV)
As we can see, the 5 inner CVs may select different values for the hyperparameter n_neighbors. In this case, the 1st inner CV selects n_neighbors = 1 due to an unlucky split of the training and testing folds in that outer iteration. By doing nested CV, we get a more robust performance estimate.
In fact, we can simplify the above example using GridSearchCV from Scikit-learn:
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = KFold(n_splits=10, shuffle=True, random_state=1)
outer_scores = []
# outer folds
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
print('[Outer fold %d/5]' % (i + 1))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
pipe = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier()]])
# hyperparameter tuning by grid search CV
param_grid = {'clf__n_neighbors':[1, 15, 50]}
gs = GridSearchCV(estimator=pipe, param_grid=param_grid,
scoring='accuracy', cv=inner_cv)
gs.fit(X_train, y_train)
best_clf = gs.best_estimator_
best_clf.fit(X_train, y_train)
outer_scores.append(best_clf.score(X_test, y_test))
print('Test accuracy: %.2f (n_neighbors=%d selected by inner 10-fold CV)' %
(outer_scores[i], gs.best_params_['clf__n_neighbors']))
print('\nTest accuracy: %.2f (5x10 nested CV)' % np.mean(outer_scores))
[Outer fold 1/5]
Test accuracy: 0.90 (n_neighbors=1 selected by inner 10-fold CV)
[Outer fold 2/5]
Test accuracy: 0.90 (n_neighbors=15 selected by inner 10-fold CV)
[Outer fold 3/5]
Test accuracy: 0.90 (n_neighbors=15 selected by inner 10-fold CV)
[Outer fold 4/5]
Test accuracy: 0.93 (n_neighbors=15 selected by inner 10-fold CV)
[Outer fold 5/5]
Test accuracy: 1.00 (n_neighbors=15 selected by inner 10-fold CV)

Test accuracy: 0.93 (5x10 nested CV)
NOTE: if we have a dataset with imbalanced classes, we should use stratified $K$-fold CV, which preserves the class proportions in each fold so that every fold is representative of the class distribution of the training dataset. To use stratified CV, simply replace
>>> from sklearn.model_selection import KFold
>>> KFold(n_splits=...)
with
>>> from sklearn.model_selection import StratifiedKFold
>>> StratifiedKFold(n_splits=...)
(note that the class labels are passed to the splitter's split(X, y) method, not to the constructor).
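As a quick sanity check (illustrative only), we can verify on the Iris labels that each stratified fold keeps roughly the same class proportions:
from sklearn.model_selection import StratifiedKFold

# sketch: count the examples per class in each validation fold of the Iris labels
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print('Fold %d class counts in validation fold: %s' % (fold + 1, np.bincount(y[val_idx])))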
How many folds $K$ do we need? Here are some rules of thumb explained in the lecture:
To see these in practice, let's consider polynomial regression where the ground-truth data-generating distribution is known:
$$y = \sin(x) + \epsilon, \quad \epsilon\sim\mathcal{N}(0,\sigma^2),$$
i.e., $\mathrm{P}(y\,|\,x) = \mathcal{N}(\sin(x), \sigma^2)$. We can visualize the bias and variance as follows:
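The experiment below draws samples with gen_data() from lab04lib; its exact implementation is not reproduced here, but a generator consistent with the distribution above might look roughly like this (a hypothetical sketch, assuming x is drawn uniformly from a fixed interval):
# hypothetical sketch of a generator for y = sin(x) + eps with eps ~ N(0, sigma^2);
# the actual gen_data() used below comes from lab04lib and may differ in details
def gen_data_sketch(n, sigma):
    x = np.random.uniform(0, 2 * np.pi, n)   # assumed input range
    y = np.sin(x) + np.random.normal(0, sigma, n)
    return x, y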
sigma = 1
n_range = range(10, 50, 2)
k_range = [5, 10]
poly = PolynomialFeatures(degree=2)
X = np.array([])
y = np.array([])
cv5_mean = []
cv5_std = []
cv10_mean = []
cv10_std = []
exp_mean = []
for n in n_range:
# compute the bias and variance of cv5
mse_test = []
for i in range(500):
x, y = gen_data(n, sigma)
X = poly.fit_transform(x[:, np.newaxis])
cv5 = KFold(n_splits=5, random_state=1, shuffle=True)
for i, (train, test) in enumerate(cv5.split(X, y)):
lr = LinearRegression()
lr.fit(X[train], y[train])
y_test_pred = lr.predict(X[test])
mse_test.append(mean_squared_error(y[test], y_test_pred))
cv5_mean.append(np.mean(mse_test))
cv5_std.append(np.std(mse_test))
# compute the bias and variance of cv10
mse_test = []
for i in range(500):
x, y = gen_data(n, sigma)
X = poly.fit_transform(x[:, np.newaxis])
cv10 = KFold(n_splits=10, random_state=1, shuffle=True)
for i, (train, test) in enumerate(cv10.split(X, y)):
lr = LinearRegression()
lr.fit(X[train], y[train])
y_test_pred = lr.predict(X[test])
mse_test.append(mean_squared_error(y[test], y_test_pred))
cv10_mean.append(np.mean(mse_test))
cv10_std.append(np.std(mse_test))
# compute the expected generalization error of f_N
mse_test = []
for i in range(500):
x, y = gen_data(n, sigma)
X = poly.fit_transform(x[:, np.newaxis])
lr = LinearRegression()
lr.fit(X, y)
x_test, y_test = gen_data(100, sigma)
X_test = poly.transform(x_test[:, np.newaxis])
y_test_pred = lr.predict(X_test)
mse_test.append(mean_squared_error(y_test, y_test_pred))
exp_mean.append(np.mean(mse_test))
plt.plot(n_range, cv5_mean,
markersize=5, label='5-Fold CV', color='blue')
plt.fill_between(n_range,
np.add(cv5_mean, cv5_std),
np.subtract(cv5_mean, cv5_std),
alpha=0.15, color='blue')
plt.plot(n_range, cv10_mean,
markersize=5, label='10-Fold CV', color='green')
plt.fill_between(n_range,
np.add(cv10_mean, cv10_std),
np.subtract(cv10_mean, cv10_std),
alpha=0.15, color='green')
plt.plot(n_range, exp_mean,
markersize=5, label='Exp', color='red')
plt.hlines(y=sigma**2, xmin=10, xmax=48,  # the Bayes MSE equals the noise variance sigma^2
label='Bayes', color='red',
linewidth=2, linestyle='--')
plt.legend(loc='upper right')
plt.xlim([10, 48])
plt.ylim([0, 5])
plt.xlabel('N')
plt.ylabel('MSE')
plt.tight_layout()
plt.savefig('./output/fig-cv-fold.png', dpi=300)
plt.show()
Usually, we set $K=10$ in most applications, $K=5$ for larger datasets, and $K=N$ for very small datasets, where $N$ is the number of training examples. The last setting is called leave-one-out CV.
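As a small illustration of the last setting, leave-one-out CV can be run with KFold(n_splits=N) or with Scikit-learn's dedicated LeaveOneOut splitter (a minimal sketch on the full Iris data):
from sklearn.model_selection import LeaveOneOut

# sketch: leave-one-out CV (one fold per example) for a 15-NN classifier on Iris
X_iris, y_iris = load_iris(return_X_y=True)
loo = LeaveOneOut()  # equivalent to KFold(n_splits=len(X_iris))
scores = cross_val_score(KNeighborsClassifier(n_neighbors=15), X_iris, y_iris, cv=loo)
print('LOOCV accuracy: %.3f over %d folds' % (scores.mean(), len(scores)))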
The no free lunch theorem states that no machine learning algorithm is universally better than the others across all domains. The goal of ensembling is to combine multiple learners to improve applicability and obtain better performance.
NOTE: it is possible that the final ensemble performs no better than the most accurate learner among its members. However, ensembling at least reduces the probability of selecting a poor learner and increases applicability.
Voting is arguably the most straightforward way to combine multiple learners $d^{(j)}(\cdot)$. The idea is to take a linear combination of the predictions made by the learners. For example, in multiclass classification we have
$$\tilde{y}_k =\sum_{j=1}^L w_j d^{(j)}_k(\boldsymbol{x}), \text{ where } w_j\geq 0 \text{ and } \sum_{j=1}^L w_j=1,$$
for any class $k$, where $L$ is the number of voters. This can be simplified to the plurality vote, where each voter has the same weight:
$$\tilde{y}_k =\frac{1}{L}\sum_{j=1}^L d^{(j)}_k(\boldsymbol{x}).$$
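Before turning to Scikit-learn's implementation, here is a minimal sketch of this unweighted vote computed by hand: we average the per-class probability estimates of a few fitted learners and predict the class with the largest average. The toy data and the two base learners below are just for illustration:
# sketch: plurality (soft) voting computed by hand on toy two-moon data
X_toy, y_toy = make_moons(n_samples=200, noise=0.3, random_state=0)
learners = [LogisticRegression(solver='liblinear').fit(X_toy, y_toy),
            DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)]
# average the class-probability estimates d^(j)_k(x) with equal weights 1/L
avg_proba = np.mean([clf.predict_proba(X_toy) for clf in learners], axis=0)
y_vote = np.argmax(avg_proba, axis=1)  # pick the class with the highest average
print('Training accuracy of the hand-made vote: %.2f' % accuracy_score(y_toy, y_vote))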
Let's use the VotingClassifier from Scikit-learn to combine KNeighborsClassifier, LogisticRegression, and DecisionTreeClassifier together and train them on the synthetic two-moon dataset:
X, y = make_moons(n_samples=500, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=3)
plt.scatter(X[y == 0, 0], X[y == 0, 1], label='Class 0',
c='r', marker='s', alpha=0.5)
plt.scatter(X[y == 1, 0], X[y == 1, 1], label='Class 1',
c='b', marker='x', alpha=0.5)
# mark the held-out test examples with hollow circles
plt.scatter(X_test[:, 0], X_test[:, 1],
            facecolors='none', edgecolors='k', marker='o', label='Test set')
plt.legend(loc='best')
plt.show()
pipe1 = Pipeline([['sc', StandardScaler()], ['clf', LogisticRegression(C=10, random_state=0, solver="liblinear")]])
pipe2 = Pipeline([['clf', DecisionTreeClassifier(max_depth=3, random_state=0)]])
pipe3 = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(n_neighbors=5)]])