Linear, Polynomial, and Decision Tree Regression

Shan-Hung Wu & DataLab
Fall 2022

In this lab, we will guide you through linear and polynomial regression using the Housing dataset (description, data). We will also extend the Decision Tree and Random Forest classifiers from our previous labs to solve regression problems.

Linear Regression

Regression models are used to predict target variables ($y$'s) in a continuous space, which makes them attractive for many practical tasks, such as predicting house prices.

Suppose $y$ can be explained by a single explanatory variable $x\in \mathbb{R}$. The linear model is defined as: $$\hat{y} = w_{0}+w_{1}x$$ Minimizing the sum of squared errors (SSE) can be understood as finding the best-fitting straight line through the example points. The best-fitting line is called the regression line (or regression hyperplane when $x\in \mathbb{R}^D$, $D>1$), and the vertical offsets from the regression line to the data points are called the residuals, i.e., the prediction errors, as shown in the following figure. Note that $w_0$ and $w_1$ control the intercept (bias) and the slope of the regression line, respectively.
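Concretely, given training examples $(x^{(i)}, y^{(i)})$, $i=1,\dots,N$, the SSE we minimize is: $$\mathrm{SSE}(w_0, w_1)=\sum_{i=1}^{N}\bigl(y^{(i)}-\hat{y}^{(i)}\bigr)^{2}=\sum_{i=1}^{N}\bigl(y^{(i)}-w_0-w_1x^{(i)}\bigr)^{2}$$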

The Housing dataset

The Housing dataset from UCI repository collects information about houses in the suburbs of Boston. Following are the attributes:

1.  CRIM      Per capita crime rate by town
2.  ZN        Proportion of residential land zoned for lots over 25,000 sq.ft.
3.  INDUS     Proportion of non-retail business acres per town
4.  CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5.  NOX       Nitric oxides concentration (parts per 10 million)
6.  RM        Average number of rooms per dwelling
7.  AGE       Proportion of owner-occupied units built prior to 1940
8.  DIS       Weighted distances to five Boston employment centres
9.  RAD       Index of accessibility to radial highways
10. TAX       Full-value property-tax rate per \$10,000
11. PTRATIO   Pupil-teacher ratio by town
12. B         1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT     % lower status of the population
14. MEDV      Median value of owner-occupied homes in $1000's

Let's load the data first and take a look at the first five rows of the dataset.
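A minimal sketch of how the data might be loaded with pandas (the local file name `housing.data` and the column ordering are assumptions; adjust them to your copy of the dataset):

```python
import pandas as pd

# Column names follow the attribute list above (assumed ordering)
cols = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
        'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# The raw file is whitespace-separated with no header row
df = pd.read_csv('housing.data', header=None, sep=r'\s+', names=cols)

# Inspect the first five rows
print(df.head())
```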

Our goal is to predict the house prices ('MEDV'), which are on a continuous scale, using the values of some other variables. To select proper explanatory variables, we plot all the pairwise joint distributions involving 'MEDV'.
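One common way to draw such a scatter-plot matrix is with seaborn; a sketch (the subset of columns shown here is an assumption):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise joint distributions of a few candidate explanatory variables and MEDV
cols_to_plot = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
sns.pairplot(df[cols_to_plot], height=2.0)
plt.tight_layout()
plt.show()
```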

Using this scatter-plot matrix, we can now quickly see how the data are distributed and whether they contain outliers. For example, we can see that there is a linear relationship between RM and the house prices MEDV. Furthermore, the histograms show that both the RM and MEDV variables seem to be normally distributed, but MEDV contains several outliers, i.e., values that deviate a lot from the majority. Let's use RM as the explanatory variable for our first linear regression task:

Fitting a Linear Regression Model via Scikit-learn

Scikit-learn already implements a LinearRegression class that we can make use of:
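A minimal sketch of fitting the model on RM (the variable names are our own):

```python
from sklearn.linear_model import LinearRegression

X = df[['RM']].values   # explanatory variable, shape (n_samples, 1)
y = df['MEDV'].values   # target variable

slr = LinearRegression()
slr.fit(X, y)

print('Slope: %.2f' % slr.coef_[0])
print('Intercept: %.2f' % slr.intercept_)
```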

We may interpret the slope 9.10 as the average increase in 'MEDV' per unit increase in 'RM'. The intercept sometimes has a physical meaning as well, but not in this case, since a house cannot have a negative price.

Next, let's visualize how well the linear regression line fits the training data:
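Continuing from the snippet above, a sketch of the plot:

```python
import matplotlib.pyplot as plt

# Scatter the training points and overlay the fitted regression line
plt.scatter(X, y, c='steelblue', edgecolor='white', s=30)
plt.plot(X, slr.predict(X), color='black', lw=2)
plt.xlabel('Average number of rooms [RM]')
plt.ylabel("Price in $1000's [MEDV]")
plt.show()
```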

As we can see, the linear regression line reflects the general trend that house prices tend to increase with the number of rooms. Interestingly, we also observe a curious line at $y=50$ , which suggests that the prices may have been clipped.

Multivariate Cases & Performance Evaluation

If we have multiple explanatory variables, we can't visualize the linear regression hyperplane in a two-dimensional plot. In this case, we need some other ways to evaluate the trained model. Let's proceed with the multivariate linear regression and evaluate the results using the mean squared error (MSE) and coefficient of determination ($R^2$):
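A sketch of such an evaluation (the split ratio and random seed are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop('MEDV', axis=1).values
y = df['MEDV'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize the explanatory variables (fit the scaler on training data only)
sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

slr = LinearRegression().fit(X_train_std, y_train)
y_train_pred = slr.predict(X_train_std)
y_test_pred = slr.predict(X_test_std)

print('MSE train: %.2f, test: %.2f' % (
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.2f, test: %.2f' % (
    r2_score(y_train, y_train_pred),
    r2_score(y_test, y_test_pred)))
```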

A normal $R^2$ value falls between 0 and 1, and the higher the $R^2$, the better. In practice, we often consider $R^2>0.8$ as good. If $R^2$ is negative, the model fits the data worse than simply predicting the mean of $y$.
NOTE: it is important to standardize the explanatory variables in multivariate regression in order to improve the conditioning of the cost function and to prevent attributes with large values from dominating.

Residual Plot

In addition, the residual plot is a commonly used graphical analysis for regression models to detect nonlinearity and outliers. In the case of perfect prediction, the residuals would be exactly zero, which we will probably never encounter in realistic and practical applications. For a good regression model, however, we expect the errors to be randomly distributed and the residuals to be randomly scattered around the centerline. If we see patterns in a residual plot, it means that our model is unable to capture some explanatory information, which leaks into the residuals (as we can slightly see below). Furthermore, we can also use residual plots to detect outliers, which appear as points with a large deviation from the centerline.
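A sketch of such a residual plot, continuing from the multivariate fit above:

```python
import matplotlib.pyplot as plt

# Residuals (prediction errors) for the training and test sets
plt.scatter(y_train_pred, y_train_pred - y_train,
            c='steelblue', marker='o', label='Training data')
plt.scatter(y_test_pred, y_test_pred - y_test,
            c='limegreen', marker='s', label='Test data')
plt.hlines(y=0, xmin=y_test_pred.min(), xmax=y_test_pred.max(),
           colors='black', linewidth=2)
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend(loc='upper left')
plt.show()
```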

Implementing the Linear Regression

Now, let's implement our own linear regression model. It is almost the same as the Adaline classifier we have implemented:
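A sketch of such a gradient-descent implementation (the class and attribute names are our own, chosen to mirror the Adaline code):

```python
import numpy as np

class LinearRegressionGD:
    """Linear regression trained by batch gradient descent on the SSE cost."""

    def __init__(self, eta=0.001, n_iter=20):
        self.eta = eta          # learning rate
        self.n_iter = n_iter    # number of passes over the training set

    def fit(self, X, y):
        self.w_ = np.zeros(1 + X.shape[1])  # w_[0] is the bias term
        self.cost_ = []
        for _ in range(self.n_iter):
            output = self.net_input(X)
            errors = y - output
            # Gradient-descent updates for the weights and the bias
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            self.cost_.append((errors ** 2).sum() / 2.0)
        return self

    def net_input(self, X):
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def predict(self, X):
        return self.net_input(X)
```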

It is always a good practice to plot the cost as a function of the number of epochs (passes over the training dataset) when we are using optimization algorithms, such as gradient descent, to check for the convergence:
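For instance, a sketch assuming we fit the class above to standardized RM values (standardizing both $x$ and $y$ helps gradient descent converge):

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

sc_x, sc_y = StandardScaler(), StandardScaler()
X_std = sc_x.fit_transform(df[['RM']].values)
y_std = sc_y.fit_transform(df[['MEDV']].values).flatten()

lr = LinearRegressionGD().fit(X_std, y_std)

# Cost as a function of the number of epochs
plt.plot(range(1, lr.n_iter + 1), lr.cost_)
plt.xlabel('Epoch')
plt.ylabel('SSE cost')
plt.show()
```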

Next, let's visualize how well our own linear regression line fits the training data:

We can see that the overall result looks almost identical to the Scikit-learn implementation. Note, however, that Scikit-learn's LinearRegression solves the least-squares problem in closed form (via SciPy's LAPACK-based solver) rather than by gradient descent, so it also works well with unstandardized variables.

Polynomial Regression

Linear regression assumes a linear relationship between the explanatory and response variables, which may not hold in the real world. For example, looking at the pairwise distribution plot again, we find that the LSTAT (% lower status of the population) attribute is clearly not linearly correlated with our target variable MEDV. Next, let's construct polynomial features and turn our linear regression models into polynomial ones.
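A sketch using scikit-learn's PolynomialFeatures to fit a quadratic model on LSTAT (the degree and variable names are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = df[['LSTAT']].values
y = df['MEDV'].values

# Expand LSTAT into polynomial features, e.g. [1, x, x^2]
quadratic = PolynomialFeatures(degree=2)
X_quad = quadratic.fit_transform(X)

# Fit a plain linear model and a quadratic model
lin_reg = LinearRegression().fit(X, y)
quad_reg = LinearRegression().fit(X_quad, y)

# Evaluate both fits on a smooth grid for plotting
X_fit = np.arange(X.min(), X.max(), 0.1)[:, np.newaxis]
y_lin_fit = lin_reg.predict(X_fit)
y_quad_fit = quad_reg.predict(quadratic.transform(X_fit))
```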

In the resulting plot, we can see that the polynomial fit captures the relationship between the response and explanatory variable much better than the linear fit.

Multivariate Cases

Next, we train polynomial regressors of different degrees using all features in the Housing dataset and compare their performance.
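One way to sketch this comparison (the degrees tried and the split parameters are assumptions):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = df.drop('MEDV', axis=1).values
y = df['MEDV'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

for degree in (1, 2, 3):
    # Standardize, expand into polynomial features, then fit a linear model
    model = make_pipeline(StandardScaler(),
                          PolynomialFeatures(degree=degree),
                          LinearRegression())
    model.fit(X_train, y_train)
    print('Degree %d | R^2 train: %.2f, test: %.2f' % (
        degree,
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test))))
```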

We notice a very interesting behavior here. As the degree of the polynomial goes up, the training errors decrease, but the test errors do not. That is, a low training error does not imply a low test error. We will discuss this further in our next lecture.

Decision Tree Regression

Polynomial regression is not the only way to capture the nonlinear relationship between the explanatory and target variables. For example, we can adapt the Decision Tree model to nonlinear regression simply by replacing entropy with the MSE as the impurity measure of a node. Let's see how it works in our task:
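A sketch with scikit-learn's DecisionTreeRegressor on the LSTAT feature (the tree depth is an assumption; in older scikit-learn versions the MSE criterion is spelled 'mse' instead of 'squared_error'):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

X = df[['LSTAT']].values
y = df['MEDV'].values

# criterion='squared_error' grows the tree by minimizing the node MSE
tree = DecisionTreeRegressor(criterion='squared_error', max_depth=3)
tree.fit(X, y)

# Plot the piecewise-constant predictions against the data
sort_idx = X.flatten().argsort()
plt.scatter(X[sort_idx], y[sort_idx], c='steelblue', edgecolor='white', s=30)
plt.plot(X[sort_idx], tree.predict(X[sort_idx]), color='black', lw=2)
plt.xlabel('% lower status of the population [LSTAT]')
plt.ylabel("Price in $1000's [MEDV]")
plt.show()
```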

As we can see from the resulting plot, the decision tree captures the general trend in the data. However, a limitation of this model is that it does not capture the continuity and differentiability of the desired prediction.

Random Forest Regression

We can also modify the Random Forest model for regression to take advantage of ensembling and obtain better generalization performance. The basic random forest algorithm for regression is almost identical to the random forest algorithm for classification. The only difference is that we use the MSE criterion to grow the individual decision trees, and the predicted target variable is calculated as the average prediction over all decision trees. Now, let's use all the features in the Housing dataset to train a random forest regression model:
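A sketch of the random forest regressor (the number of trees, split parameters, and random seed are assumptions):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = df.drop('MEDV', axis=1).values
y = df['MEDV'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# 500 trees grown with the MSE criterion; predictions are averaged over trees
forest = RandomForestRegressor(n_estimators=500,
                               criterion='squared_error',
                               random_state=1,
                               n_jobs=-1)
forest.fit(X_train, y_train)

y_train_pred = forest.predict(X_train)
y_test_pred = forest.predict(X_test)
print('MSE train: %.2f, test: %.2f' % (
    mean_squared_error(y_train, y_train_pred),
    mean_squared_error(y_test, y_test_pred)))
print('R^2 train: %.2f, test: %.2f' % (
    r2_score(y_train, y_train_pred),
    r2_score(y_test, y_test_pred)))
```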

We get better testing results ($R^2=0.83$) than those of multivariate linear regression ($R^2=0.67$) and see weaker patterns in the residual plot. However, we still observe that the testing performance is much worse than the training one. Understanding how the testing performance differs from the training performance is crucial, and we will dive into this topic in the next lecture.

NOTE: As in classification, Decision Tree and Random Forest regression have the nice property that they are not sensitive to the scaling of each explanatory variable, so we do not standardize the features this time.

Remarks

  1. Regression models are basically interpolation equations over the range of the explanatory variables. So they may give bad predictions if we extrapolate outside this range.
  2. Be careful about the outliers, which may change your regression hyperplane undesirably.

Assignment

In this assignment, you need to train regression models on the Beijing PM2.5 dataset for the winter of 2014.

  1. You have to implement
    • a Linear (Polynomial) regressor
    • a Random Forest regressor
  2. You need to show a residual plot for each of your models on both the training data and the testing data.
  3. The $R^2$ score needs to be larger than 0.72 on the testing data.

Requirements:

Later in the course, we will cover how to deal with samples that have NaN (not a number) or non-scalar features. For now, we simply remove them.
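A sketch of this cleanup with pandas (the file name and the column names 'pm2.5' and 'cbwd' follow the UCI download, but treat them as assumptions about your local copy):

```python
import pandas as pd

# Load the Beijing PM2.5 data
pm25 = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv')

# Drop rows with missing PM2.5 readings (NaN) and the non-scalar wind-direction column
pm25 = pm25.dropna(subset=['pm2.5'])
pm25 = pm25.drop(columns=['cbwd'])
```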

In the following, we select the data recorded in the winter between 2013 and 2014.
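One possible sketch, assuming "winter" means December 2013 through February 2014 (this definition is an assumption; follow the assignment spec):

```python
# Keep records from Dec 2013 and Jan-Feb 2014
winter = pm25[((pm25['year'] == 2013) & (pm25['month'] == 12)) |
              ((pm25['year'] == 2014) & (pm25['month'].isin([1, 2])))]
print(winter.shape)
```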