Regularization refers to techniques that improve the generalizability of a trained model. In this lab, we will guide you through some common regularization techniques such as weight decay, sparse weights, and validation.
Learning theory provides a means to understand the generalizability of a trained model. As explained in the lecture, model complexity plays a crucial role: too simple a model leads to high bias and underfitting because it cannot capture the trends or patterns in the underlying data-generating distribution, while too complex a model leads to high variance and overfitting because it captures not only the trends of the underlying data-generating distribution but also patterns that are local to the training data.
Let's look at the problems of overfitting and underfitting in a toy regression problem. Suppose we know the underlying data-generating distribution: $$\mathrm{P}(x, y)=\mathrm{P}(y\,|\,x)\,\mathrm{P}(x),$$ where $$x\sim\mathrm{Uniform}(-\pi, \pi)$$ and $$y=\sin(x)+\epsilon,\quad \epsilon\sim\mathcal{N}(0,\sigma^{2}).$$ We can generate a synthetic dataset as follows:
%matplotlib inline
from pylab import *
from sklearn.model_selection import train_test_split
import os

if not os.path.exists("output/"): os.mkdir("output/")
if not os.path.exists("data/"): os.mkdir("data/")

import warnings
warnings.filterwarnings("ignore")

def gen_data(num_data, sigma):
    # x ~ Uniform(-pi, pi), y = sin(x) + Gaussian noise
    x = 2 * np.pi * (np.random.rand(num_data) - 0.5)
    y = np.sin(x) + np.random.normal(0, sigma, num_data)
    return (x, y)

num_data = 30
sigma = 0.2

x, y = gen_data(num_data, sigma)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

plt.scatter(x_train, y_train, color='blue')
plt.scatter(x_test, y_test, color='green')

# overlay the noise-free sin(x) curve for reference
x_grid = np.linspace(-1 * np.pi, 1 * np.pi)
sin_x = np.sin(x_grid)
plt.plot(x_grid, sin_x, color='red', linewidth=2)

plt.xlabel('x')
plt.ylabel('y')
plt.xlim(-np.pi, np.pi)
plt.ylim(-2, 2)
plt.tight_layout()
plt.savefig('./output/fig-dataset-sin.png', dpi=300)
plt.show()
The blue points are training examples and the green ones are testing examples. The red curve is the $\sin$ function, i.e., the function $f^*$ with minimal generalization error $C[f^*]$ (called the Bayes error).
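In this toy problem the Bayes-optimal predictor is simply $f^*(x)=\sin(x)$, and its expected squared error equals the noise variance $\sigma^2$. As a quick empirical sanity check (a minimal sketch reusing gen_data and sigma defined above):

# Estimate the Bayes error empirically by predicting with the true function sin(x).
# The mean squared error should be close to the noise variance sigma**2 = 0.04.
x_bayes, y_bayes = gen_data(10000, sigma)
bayes_mse = np.mean((y_bayes - np.sin(x_bayes))**2)
print('Estimated Bayes error: %.4f (sigma^2 = %.4f)' % (bayes_mse, sigma**2))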
In regression, the degree $P$ of a polynomial (polynomial regression) and the depth of a decision tree (decision tree regression) are both hyperparameters that relate to model complexity. Let's consider polynomial regression here and fit polynomials of degrees $1$, $3$, and $10$, each to 20 randomly generated training sets drawn from the same distribution:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

degree = [1, 3, 10]

for d in degree:
    X_fit = np.arange(-np.pi, np.pi, .1)[:, np.newaxis]
    poly = PolynomialFeatures(degree=d)
    # fit the same degree to 20 independently generated training sets
    for i in range(20):
        x, y = gen_data(num_data, sigma)
        x_train, x_test, y_train, y_test = train_test_split(
            x, y, test_size=0.3, random_state=0)
        regr = LinearRegression()
        regr = regr.fit(poly.fit_transform(x_train[:, np.newaxis]),
                        y_train[:, np.newaxis])
        y_fit = regr.predict(poly.transform(X_fit))
        plt.plot(X_fit, y_fit, color='green', lw=1)
    # overlay the true sin(x) curve
    x_grid = np.linspace(-1 * np.pi, 1 * np.pi)
    sin_x = np.sin(x_grid)
    plt.plot(x_grid, sin_x, color='red', linewidth=2)
    plt.title('Degree: %d' % d)
    plt.xlim(-np.pi, np.pi)
    plt.ylim(-2, 2)
    plt.tight_layout()
    plt.savefig('./output/fig-polyreg-%d.png' % d, dpi=300)
    plt.show()
When $P=1$, the polynomial is too simple to capture the trend of the $\sin$ function. On the other hand, when $P=10$, the polynomial becomes so complex that it also captures the undesirable patterns of the noise.
NOTE: regression models are bad at extrapolation. You can see that the shape of a fitted polynomial differs a lot from the $\sin$ function outside the region where the training points reside.
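One way to see this numerically is a quick sketch like the following, which fits a degree-10 polynomial on data from $[-\pi, \pi]$ and evaluates it just outside that interval (the range $[\pi, 1.5\pi]$ is an arbitrary choice for illustration); the extrapolation error is typically far larger than the in-range error:

# Fit inside [-pi, pi], then evaluate on points outside the training range.
x, y = gen_data(num_data, sigma)
poly10 = PolynomialFeatures(degree=10)
regr = LinearRegression().fit(poly10.fit_transform(x[:, np.newaxis]), y)
x_out = np.linspace(np.pi, 1.5 * np.pi, 50)
y_out_pred = regr.predict(poly10.transform(x_out[:, np.newaxis]))
print('Extrapolation MSE on [pi, 1.5*pi]: %.2f' % np.mean((y_out_pred - np.sin(x_out))**2))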
One important conclusion from learning theory is that model complexity plays a key role in the generalization performance of a model. Let's plot the training and testing errors over different model complexities in our regression problem:
from sklearn.metrics import mean_squared_error

num_data = 50
x, y = gen_data(num_data, sigma)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0)

mse_train = []
mse_test = []
max_degree = 12

for d in range(1, max_degree):
    poly = PolynomialFeatures(degree=d)
    X_train_poly = poly.fit_transform(x_train[:, np.newaxis])
    X_test_poly = poly.transform(x_test[:, np.newaxis])
    regr = LinearRegression()
    regr = regr.fit(X_train_poly, y_train)
    y_train_pred = regr.predict(X_train_poly)
    y_test_pred = regr.predict(X_test_poly)
    mse_train.append(mean_squared_error(y_train, y_train_pred))
    mse_test.append(mean_squared_error(y_test, y_test_pred))

plt.plot(range(1, max_degree), mse_train, label='Training error', color='blue', linewidth=2)
plt.plot(range(1, max_degree), mse_test, label='Testing error', color='red', linewidth=2)
plt.legend(loc='upper right')
plt.xlabel('Model complexity (polynomial degree)')
plt.ylabel('$MSE$')
plt.tight_layout()
plt.savefig('./output/fig-error-curve.png', dpi=300)
plt.show()
We can see that the training error (blue curve) decreases as the model complexity increases. However, the testing error (red curve) decreases at the beginning and then increases later. This is the bias-variance tradeoff discussed in the lecture.
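If we simply pick the degree with the lowest testing error from the curve above, we recover a moderate complexity. A one-line check using the mse_test list computed above (in practice this choice should be made on a separate validation set, as hinted at the beginning of the lab):

# Degree whose testing MSE is smallest (degrees start at 1).
best_degree = np.argmin(mse_test) + 1
print('Degree with lowest testing error: %d' % best_degree)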
Although the error curve above visualizes the impact of model complexity, the bias-variance tradeoff holds only when you have sufficient training examples. The bounding methods of learning theory tell us that a model is likely to overfit, regardless of its complexity, when the training set is small. Learning curves are a useful tool for understanding how many training examples are sufficient:
from sklearn.model_selection import learning_curve

def mse(model, X, y):
    # scorer used by learning_curve: mean squared error of the fitted model
    return ((model.predict(X) - y)**2).mean()

num_data = 100
sigma = 1
degree = [1, 3, 10]

x, y = gen_data(num_data, sigma)

for d in degree:
    poly = PolynomialFeatures(degree=d)
    X = poly.fit_transform(x[:, np.newaxis])
    lr = LinearRegression()
    train_sizes, train_scores, test_scores = learning_curve(
        estimator=lr, X=X, y=y, scoring=mse)
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)

    plt.plot(train_sizes, train_mean,
             color='blue', marker='o', markersize=5,
             label='Training error')
    plt.fill_between(train_sizes,
                     train_mean + train_std,
                     train_mean - train_std,
                     alpha=0.15, color='blue')
    plt.plot(train_sizes, test_mean,
             color='green', linestyle='--',
             marker='s', markersize=5,
             label='Testing error')
    plt.fill_between(train_sizes,
                     test_mean + test_std,
                     test_mean - test_std,
                     alpha=0.15, color='green')
    # dashed red line: the noise level (Bayes error sigma**2, equal to sigma here since sigma = 1)
    plt.hlines(y=sigma, xmin=0, xmax=80, color='red', linewidth=2, linestyle='--')
    plt.title('Degree: %d' % d)
    plt.grid()
    plt.xlabel('Number of training samples')
    plt.ylabel('MSE')
    plt.legend(loc='upper right')
    plt.ylim([0, 3])
    plt.tight_layout()
    plt.savefig('./output/fig-learning-curve-%d.png' % d, dpi=300)
    plt.show()
We can see that in these regression tasks, a polynomial of any degree almost always overfits the training data when the training set is small, resulting in poor testing performance. This indicates that we should collect more data instead of sitting in front of the computer tweaking models. You may also try other models, since different models have different sample complexities (i.e., the number of samples required to train them successfully).
OK, we have verified the learning theory discussed in the lecture. Let's move on to the regularization techniques. Weight decay is a common regularization approach: the idea is to add a term to the cost function that penalizes model complexity. In regression, this leads to two well-known models:
Ridge regression: $$\arg\min_{\boldsymbol{w},b}\Vert\boldsymbol{y}-(\boldsymbol{X}\boldsymbol{w}+b\boldsymbol{1})\Vert^{2}+\alpha\Vert\boldsymbol{w}\Vert^{2}$$
LASSO: $$\arg\min_{\boldsymbol{w},b}\Vert\boldsymbol{y}-(\boldsymbol{X}\boldsymbol{w}+b\boldsymbol{1})\Vert^{2}+\alpha\Vert\boldsymbol{w}\Vert_{1}$$
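To see where the name weight decay comes from, recall that Ridge regression has a closed-form solution in which $\alpha$ shrinks $\boldsymbol{w}$ toward zero. Below is a minimal sketch of that solution for centered data; it is only for illustration, and we will use scikit-learn's Ridge and Lasso classes in the rest of the lab:

# Closed-form Ridge solution on centered data (illustrative sketch only):
#   w = (X^T X + alpha * I)^(-1) X^T (y - b), with b = mean(y)
# Assumes the columns of X have zero mean (e.g., after StandardScaler).
def ridge_fit(X, y, alpha):
    b = y.mean()
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ (y - b))
    return w, b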
Let's see how they work using the Housing dataset:
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/'
                 'housing/housing.data',
                 header=None,
                 sep='\s+')
df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS',
              'NOX', 'RM', 'AGE', 'DIS', 'RAD',
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
df.head()
|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
Remember that for weight decay to work properly, we need to ensure that all our features are on comparable scales:
from sklearn.preprocessing import StandardScaler
X = df.iloc[:, :-1].values
y = df['MEDV'].values
sc_x = StandardScaler()
X_std = sc_x.fit_transform(X)
We know that an unregularized polynomial regressor of degree $P=3$ will overfit the training data and generalize poorly. Let's add an $L^2$ penalty on its weights to see if we can get a better testing error:
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X_std)
X_train, X_test, y_train, y_test = train_test_split(
    X_poly, y, test_size=0.3, random_state=0)

for a in [0, 1, 10, 100, 1000]:
    lr_rg = Ridge(alpha=a)
    lr_rg.fit(X_train, y_train)
    y_train_pred = lr_rg.predict(X_train)
    y_test_pred = lr_rg.predict(X_test)
    print('\n[Alpha = %d]' % a)
    print('MSE train: %.2f, test: %.2f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
[Alpha = 0]
MSE train: 0.00, test: 19958.68

[Alpha = 1]
MSE train: 0.73, test: 23.05

[Alpha = 10]
MSE train: 1.66, test: 16.83

[Alpha = 100]
MSE train: 3.60, test: 15.16

[Alpha = 1000]
MSE train: 8.81, test: 19.22
We can see that even a small value of $\alpha$ drastically reduces the testing error, and $\alpha = 100$ seems to be a good decay strength. It is not a good idea to keep increasing $\alpha$, however, since that over-shrinks the coefficients $\boldsymbol{w}$ and results in underfitting.
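Note that picking $\alpha$ on the testing set, as we just did, is only for illustration. In practice the decay strength is chosen by validation, the technique mentioned at the beginning of this lab. A sketch using scikit-learn's RidgeCV with 5-fold cross-validation on the training split from the cell above:

from sklearn.linear_model import RidgeCV

# choose alpha by 5-fold cross-validation on the training set only
ridge_cv = RidgeCV(alphas=[0.1, 1, 10, 100, 1000], cv=5)
ridge_cv.fit(X_train, y_train)
print('Alpha chosen by cross-validation: %.1f' % ridge_cv.alpha_)
print('MSE test: %.2f' % mean_squared_error(y_test, ridge_cv.predict(X_test)))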
Let's see how fast the weights decay as $\alpha$ grows:
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=0)

max_alpha = 1000
coef_ = np.zeros((max_alpha, 13))

# record the Ridge coefficients for each decay strength
for a in range(1, max_alpha):
    lr_rg = Ridge(alpha=a)
    lr_rg.fit(X_train, y_train)
    coef_[a, :] = lr_rg.coef_.reshape(1, -1)

plt.hlines(y=0, xmin=0, xmax=max_alpha, color='red', linewidth=2, linestyle='--')
for i in range(13):
    plt.plot(range(max_alpha), coef_[:, i])
plt.ylabel('Coefficients')
plt.xlabel('Alpha')
plt.tight_layout()
plt.savefig('./output/fig-ridge-decay.png', dpi=300)
plt.show()
An alternative weight decay approach that can lead to a sparse $\boldsymbol{w}$ is the LASSO. Depending on the value of $\alpha$, some weights reach zero much faster than others, which also makes the LASSO useful as a supervised feature selection technique.
NOTE: since the $L^1$-norm has non-differentiable points, the solver (optimization method) differs from the one used in Ridge regression, and training the model weights takes considerably longer.
from sklearn.linear_model import Lasso

X_train, X_test, y_train, y_test = train_test_split(
    X_poly, y, test_size=0.3, random_state=0)

for a in [0, 0.001, 0.01, 0.1, 1, 10]:
    lr_rg = Lasso(alpha=a)
    lr_rg.fit(X_train, y_train)
    y_train_pred = lr_rg.predict(X_train)
    y_test_pred = lr_rg.predict(X_test)
    print('\n[Alpha = %.4f]' % a)
    print('MSE train: %.2f, test: %.2f' % (
        mean_squared_error(y_train, y_train_pred),
        mean_squared_error(y_test, y_test_pred)))
[Alpha = 0.0000]
MSE train: 0.55, test: 61.02

[Alpha = 0.0010]
MSE train: 0.64, test: 29.11

[Alpha = 0.0100]
MSE train: 1.52, test: 19.51

[Alpha = 0.1000]
MSE train: 4.34, test: 15.52

[Alpha = 1.0000]
MSE train: 14.33, test: 22.42

[Alpha = 10.0000]
MSE train: 55.79, test: 53.42
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=0)

max_alpha = 10
coef_ = np.zeros((max_alpha, 13))

# record the LASSO coefficients for alpha = 0.1, 1.1, ..., 9.1
for a in range(max_alpha):
    lr_rg = Lasso(alpha=a + 0.1)
    lr_rg.fit(X_train, y_train)
    coef_[a, :] = lr_rg.coef_.reshape(1, -1)

plt.hlines(y=0, xmin=0, xmax=max_alpha, color='red', linewidth=2, linestyle='--')
for i in range(13):
    plt.plot(range(max_alpha), coef_[:, i])
plt.ylabel('Coefficients')
plt.xlabel('Alpha')
plt.tight_layout()
plt.savefig('./output/fig-lasso-decay.png', dpi=300)
plt.show()
The result shows that as $\alpha$ increases, the coefficients shrink much faster than with Ridge, and all of them become exactly zero once $\alpha$ reaches about $8$.
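We can verify this by counting the nonzero coefficients stored in the coef_ array from the previous cell:

# number of surviving (nonzero) LASSO coefficients for each alpha tried above
for a in range(max_alpha):
    print('alpha = %.1f: %d nonzero coefficients' % (a + 0.1, np.count_nonzero(coef_[a, :])))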
Since we can choose a suitable regularization strength $\alpha$ so that only some of the coefficients become exactly zero, the LASSO can also be used as a feature selection technique:
var_num = X_train.shape[1]

lr_lasso = Lasso(alpha=1)
lr_lasso.fit(X_train, y_train)
lr_ridge = Ridge(alpha=1)
lr_ridge.fit(X_train, y_train)

plt.scatter(range(var_num), lr_lasso.coef_, label='LASSO', color='blue')
plt.scatter(range(var_num), lr_ridge.coef_, label='Ridge', color='green')
plt.hlines(y=0, xmin=0, xmax=var_num - 1, color='red', linestyle='--')
plt.xlim(0, 12)
plt.legend(loc='upper right')
plt.xlabel('Coefficient index')
plt.ylabel('Coefficients')
plt.tight_layout()
plt.show()
# keep the attributes whose LASSO coefficients are (essentially) nonzero
epsilon = 1e-4
idxs = np.where(abs(lr_lasso.coef_) > epsilon)
print('Selected attributes: {}'.format(df.columns.values[idxs]))
Selected attributes: ['RM' 'TAX' 'PTRATIO' 'LSTAT']
We can plot the pairwise distributions to see the correlation between the selected attributes and MEDV:
import seaborn as sns
print('All attributes:')
sns.set(style='whitegrid', context='notebook')
sns.pairplot(df, x_vars=df.columns[:-1], y_vars=['MEDV'], size=2.5)
plt.tight_layout()
plt.show()
print('Selected attributes:')
sns.set(style='whitegrid', context='notebook')
sns.pairplot(df,x_vars=df.columns[idxs], y_vars=['MEDV'], size=2.5)
plt.tight_layout()
plt.show()
sns.reset_orig()