Logistic Regression and Evaluation Metrics

Shan-Hung Wu & DataLab
Fall 2023


In this lab, we will guide you through the practice of Logistic Regression.


Logistic Regression


Logistic Regression is a classification algorithm that, combined with a decision rule, dichotomizes the predicted probabilities of the outcome. It is currently one of the most widely used classification models in Machine Learning.

As discussed in the lecture, Logistic Regression predicts the label $\hat{y}$ of a given point $\boldsymbol{x}$ by

$$\hat{y}=\arg\max_{y}\mathrm{P}(y\,|\,\boldsymbol{x};\boldsymbol{w})$$

and the conditional probability is defined as

$$\mathrm{P}(y\,|\,\boldsymbol{x};\boldsymbol{w})=\sigma(\boldsymbol{w}^{\top}\boldsymbol{x})^{y'}[1-\sigma(\boldsymbol{w}^{\top}\boldsymbol{x})]^{(1-y')},$$

where $y'=\frac{y+1}{2}$. Let's first plot the logistic function $\sigma$ over $z=\boldsymbol{w}^{\top}\boldsymbol{x}$

$$\sigma\left(z\right)=\frac{\exp(z)}{\exp(z)+1}=\frac{1}{1+\exp(-z)}$$
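A minimal plotting sketch using NumPy and Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    """The logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-7, 7, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, color='gray', linestyle='--')  # sigma(0) = 0.5
plt.axvline(0.0, color='gray', linestyle='--')
plt.xlabel('$z$')
plt.ylabel('$\\sigma(z)$')
plt.show()
```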



We can see that $\sigma(z)$ approaches $1$ as $z \rightarrow \infty$, since $e^{-z}$ becomes very small for large values of $z$. Similarly, $\sigma(z)$ goes toward $0$ as $z \rightarrow -\infty$, as the result of an increasingly large denominator. The logistic function takes real-valued inputs and transforms them into values in the range $[0, 1]$, crossing $\sigma(0) = 0.5$ at $z = 0$.

To learn the weights $\boldsymbol{w}$ from the training set $\mathbb{X}=\{(\boldsymbol{x}^{(1)}, y^{(1)}), (\boldsymbol{x}^{(2)}, y^{(2)}), \ldots, (\boldsymbol{x}^{(N)}, y^{(N)})\}$, we can use maximum likelihood (ML) estimation:

$$\arg\max_{\boldsymbol{w}}\log\mathrm{P}(\mathbb{X}\,|\,\boldsymbol{w}).$$

This problem can be solved by the gradient ascent algorithm with the following update rule:

$$\boldsymbol{w}^{(t+1)}=\boldsymbol{w}^{(t)}+\eta\nabla_{\boldsymbol{w}}\log\mathrm{P}(\mathbb{X}\,|\,\boldsymbol{w}^{(t)}),$$

where

$$\nabla_{\boldsymbol{w}}\log\mathrm{P}(\mathbb{X}\,|\,\boldsymbol{w}^{(t)})=\sum_{i=1}^{N}[y'^{(i)}-\sigma(\boldsymbol{w}^{(t)\top}\boldsymbol{x}^{(i)})]\boldsymbol{x}^{(i)},$$

with $y'^{(i)}=\frac{y^{(i)}+1}{2}$ as defined above.

Therefore,

$$\boldsymbol{w}^{(t+1)}=\boldsymbol{w}^{(t)}+\eta\,\boldsymbol{X}^{\top}[\boldsymbol{y}'-\sigma(\boldsymbol{X}\boldsymbol{w}^{(t)})],$$

where $\boldsymbol{X}$ stacks the training points $\boldsymbol{x}^{(i)}$ as rows, $\boldsymbol{y}'$ collects the converted labels $y'^{(i)}$, and $\sigma$ is applied elementwise.

Once $\boldsymbol{w}$ is solved, we can then make predictions by

$$\hat{y}=\arg\max_{y}\mathrm{P}(y\,|\,\boldsymbol{x};\boldsymbol{w})=\arg\max_{y}\{\sigma(\boldsymbol{w}^{\top}\boldsymbol{x}),1-\sigma(\boldsymbol{w}^{\top}\boldsymbol{x})\}=\mathrm{sign}(\boldsymbol{w}^{\top}\boldsymbol{x}).$$
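A minimal NumPy sketch of this training-and-prediction procedure (assuming labels in $\{-1,+1\}$ and omitting the bias term for brevity; `fit_logreg` and `predict` are our own illustrative names):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, eta=0.05, n_iter=1000):
    """Batch gradient ascent on the log likelihood.
    X: (N, d) data matrix; y: labels in {-1, +1}."""
    y_prime = (y + 1) / 2                # map {-1, +1} -> {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w += eta * X.T @ (y_prime - sigmoid(X @ w))
    return w

def predict(X, w):
    """Hard predictions in {-1, +1}."""
    return np.sign(X @ w)
```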

Logistic Regression is very easy to implement and performs well on linearly separable classes (or classes that are close to linearly separable). Like the Perceptron and Adaline, the Logistic Regression model is a linear model for binary classification. We can relate Logistic Regression to our previous Adaline implementation: in Adaline, we used the identity function as the activation function, while in Logistic Regression this activation function simply becomes the logistic function (also called the sigmoid function), as illustrated below:


Predicting Class-Membership Probability


One benefit of using Logistic Regression is that it is able to output the class-membership probability (i.e., probability of a class to which a point $\boldsymbol{x}$ belongs) via $\sigma(\boldsymbol{w}^{\top}\boldsymbol{x})$ and $1-\sigma(\boldsymbol{w}^{\top}\boldsymbol{x})$.

In fact, there are many applications where we are interested not only in predicting class labels but also in estimating the class-membership probability. For example, in weather forecasting, we care not only about whether it will rain tomorrow but also about the chance of rain. Similarly, when diagnosing a disease, we usually care about the chance that a patient has it given certain symptoms. This is why Logistic Regression enjoys wide popularity in the field of medicine.

Training a Logistic Regression Model with Scikit-learn

Scikit-learn implements a highly optimized version of logistic regression that also supports multiclass classification off-the-shelf. Let's use it to make predictions on the standardized Iris training dataset.

NOTE: Logistic Regression, like many other binary classification models, can be easily extended to multiclass classification via One-vs-All or other similar techniques.
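A minimal sketch of this workflow, assuming (as in our previous labs) the petal length and petal width features, standardized with statistics estimated on the training split; the variable names and $C=1000.0$ (i.e., weak regularization; see the Regularization section below) are illustrative choices:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load Iris and keep petal length and petal width as features
iris = datasets.load_iris()
X, y = iris.data[:, [2, 3]], iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standardize using statistics of the training split only
sc = StandardScaler().fit(X_train)
X_train_std, X_test_std = sc.transform(X_train), sc.transform(X_test)

lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_train_std, y_train)
print('Test accuracy: %.3f' % lr.score(X_test_std, y_test))
```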


The Logistic Regression class can predict the class-membership probability via the predict_proba() method. For example, we can predict the probabilities of the first testing point:
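Continuing the sketch above, something like the following prints the class-membership probabilities of the first testing point:

```python
# Class-membership probabilities of the first testing point
prob = lr.predict_proba(X_test_std[:1])
print(prob)
```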

The prob array tells us that the model predicts a 99% chance that the sample belongs to the Iris-Virginica class, and a 1% chance that the sample is an Iris-Versicolor flower.

Regularization


One way to regularize a logistic regression classifier is to add a weight decay term in the objective (or cost function), as in Ridge regression:

$$\arg\max_{\boldsymbol{w}}\log\mathrm{P}(\mathbb{X}\,|\,\boldsymbol{w})-\frac{\alpha}{2}\Vert\boldsymbol{w}\Vert^2,$$

where $\alpha > 0$ is a hyperparameter that controls the trade-off between maximizing the log likelihood and keeping the weights small. Note that the Logistic Regression class implemented in Scikit-learn uses the hyperparameter $C=1/\alpha$ by convention.
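To see the effect of $C$, one can, for instance, fit a model per value of $C$ and watch the weight magnitudes shrink as $C$ decreases (i.e., as $\alpha$ grows); this sketch reuses the variables from above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Smaller C means larger alpha, i.e., stronger regularization
for c in [0.01, 1.0, 100.0]:
    lr = LogisticRegression(C=c, random_state=0).fit(X_train_std, y_train)
    print('C = %6.2f,  ||w|| = %.3f' % (c, np.linalg.norm(lr.coef_)))
```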

Evaluation Metrics for Binary Classifiers

So far, we have evaluated the performance of a classifier using the accuracy metric. Although accuracy is a general and common metric, several other evaluation metrics allow us to quantify the performance of a model from different aspects.

Confusion Matrix

Before we get into the details of different evaluation metrics, let's print the so-called confusion matrix, a square matrix that reports the counts of the true positive, true negative, false positive, and false negative predictions of a classifier, as shown below:


The confusion matrix of our logistic regressor over the Iris dataset is shown as follows:
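Assuming the fitted classifier `lr` and the test split from above, the matrix can be computed with Scikit-learn as follows:

```python
from sklearn.metrics import confusion_matrix

y_pred = lr.predict(X_test_std)
confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)
```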

The meaning of each entry in the above confusion matrix is straightforward. For example, the cell at $(1,0)$ means that $2$ positive testing points are misclassified as negative. The confusion matrix tells us not only how many errors a classifier makes but also what kinds of errors they are. Correct predictions count toward the diagonal entries, so a well-performing classifier should have a confusion matrix that is close to a diagonal matrix, i.e., with the entries outside the main diagonal all near zero.
The error rate (ERR) and accuracy (ACC) we have been using can be defined as follows:

$$ERR = \frac{FP+FN}{TP + TN + FP + FN},\enspace\text{ (the lower, the better)}$$

$$ACC = \frac{TP+TN}{TP + TN + FP + FN} = 1-ERR.\enspace\text{ (the higher, the better)}$$

True and False Positive Rate

The true positive rate (TPR) and false positive rate (FPR) are defined as:

$$TPR = \frac{TP}{TP + FN},\enspace\text{ (the higher, the better)}$$

$$FPR = \frac{FP}{FP + TN}.\enspace\text{ (the lower, the better)}$$

TPR and FPR are particularly useful metrics for tasks with imbalanced classes. For example, if we have 10% positive and 90% negative examples in the training set, then a dummy classifier that always gives negative predictions will achieve 90% accuracy. The accuracy metric is misleading in this case. On the other hand, by checking the TPR, which equals 0%, we learn that the dummy classifier is not performing well.

Precision, Recall, and $F_1$-Score

The precision (PRE) and recall (REC) metrics are defined as:

$$PRE = \frac{TP}{TP + FP},\enspace\text{ (the higher, the better)}$$

$$REC = \frac{TP}{TP + FN} = TPR.\enspace\text{ (the higher, the better)}$$

Basically, PRE measures how many of the points predicted as positive are indeed positive, while REC measures how many of the truly positive points are successfully identified as positive. PRE and REC are useful metrics if we care specifically about the performance of positive predictions.

In practice, we may combine PRE and REC into a single score called the $F_1$-score:

$$F_1 = 2\,\frac{PRE \cdot REC}{PRE+REC},\enspace\text{ (the higher, the better)}$$

which reaches its best value at $1$ and worst at $0$.
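Scikit-learn implements all three metrics; a minimal sketch, assuming a binary task with ground-truth labels `y_test` and hard predictions `y_pred`:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

print('PRE: %.3f' % precision_score(y_true=y_test, y_pred=y_pred))
print('REC: %.3f' % recall_score(y_true=y_test, y_pred=y_pred))
print('F1:  %.3f' % f1_score(y_true=y_test, y_pred=y_pred))
```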

Evaluation Metrics for Soft Classifiers

Many classifiers, such as Adaline and Logistic Regression, can make "soft" predictions (i.e., real values instead of the "hard" $1$ or $-1$). We may "harden" the soft predictions by defining a decision threshold $\theta$. For example, suppose a classifier makes soft predictions in the range $[-1,1]$ that are sorted as follows:

We can define a threshold $\theta=0.8$ such that points with scores larger/smaller than $0.8$ become positive/negative outputs, as the sketch below illustrates. Clearly, the performance of the classifier will vary as we use different threshold values.
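A tiny illustration of such hardening, using hypothetical soft scores:

```python
import numpy as np

# Hypothetical soft predictions, sorted from highest to lowest
scores = np.array([0.98, 0.91, 0.85, 0.77, 0.52, 0.20, -0.41, -0.83])
theta = 0.8
y_hat = np.where(scores > theta, 1, -1)  # harden the scores at threshold theta
```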

Receiver Operating Characteristic (ROC) Curve

The receiver operator characteristic (ROC) curve measures the performance of a classifier at all possible thresholds. We can draw an ROC curve by following the steps:

  1. Rank the soft predictions from highest to lowest;
  2. For each indexing threshold $\theta=1,\cdots,\vert\mathbb{X}\vert$ that makes the first $\theta$ points positive and the rest negative, calculate $TPR^{(\theta)}$ and $FPR^{(\theta)}$;
  3. Draw the points $(FPR^{(\theta)},TPR^{(\theta)})$ in a 2-D plot (FPR on the horizontal axis, TPR on the vertical axis) and connect them to get the ROC curve.

Let's plot the ROC curve of our logistic regressor:
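A minimal sketch, assuming a binary task where the soft score is the predicted probability of the positive class:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Soft scores: predicted probability of the positive class
y_score = lr.predict_proba(X_test_std)[:, 1]
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=y_score)

plt.plot(fpr, tpr, label='Logistic Regression')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guessing')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.legend(loc='lower right')
plt.show()
```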

What does the ROC curve of a "good" classifier look like?

The ROC curve of a perfect classifier would go from the bottom left to the top left and then from the top left to the top right. On the other hand, if the ROC curve is just the diagonal line, then the model is doing no better than random guessing. Any useful classifier should have an ROC curve falling between these two extremes.

Model Comparison

ROC curves are useful for comparing the performance of different classifiers over the same dataset. For example, suppose we have three classifiers $A$, $B$, and $C$ and their respective ROC curves, as shown below:

It is clear that the classifiers $B$ and $C$ are better than $A$. But how about $B$ and $C$? This can also be answered by ROC curves:

Area Under the Curve (AUC)

We can reduce an ROC curve to a single value by calculating the area under the curve (AUC). A perfect classifier has $AUC=1.0$, and random guessing results in $AUC=0.5$. It can be shown that AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

Let's compute the AUC of our logistic regressor:
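Reusing `y_test` and the soft scores `y_score` from the ROC sketch above:

```python
from sklearn.metrics import roc_auc_score

print('AUC: %.3f' % roc_auc_score(y_true=y_test, y_score=y_score))
```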

That's a pretty high score!

Evaluation Metrics for Multiclass Classification


In multiclass classification problems, we can extend the above metrics via the one-vs-all technique, where we treat one class as "positive" and the rest as "negative" and compute a score for that class. If there are $K$ classes, we compute $K$ scores, one for each class. However, if we want a single final score, we need to decide how to combine these scores.

Scikit-learn implements the macro and micro averaging methods. For example, the micro-average of $K$ precision scores is calculated as follows:

$$PRE_{micro} = \frac{TP^{(1)} + \cdots + TP^{(K)}}{P'^{(1)} + \cdots + P'^{(K)}},$$

where $P'^{(k)} = TP^{(k)} + FP^{(k)}$ is the number of points predicted as class $k$;

while the macro-average is simply the average of individual PRE's:

$$PRE_{macro} = \frac{PRE^{(1)} + \cdots + PRE^{(K)}}{K}$$

Micro-averaging is useful if we want to weight each data point or prediction equally, whereas macro-averaging weights all classes equally. In Scikit-learn, the averaging method is selected via the `average` argument of the metric functions (e.g., `average='micro'` or `average='macro'`).

Let's train a multiclass logistic regressor and see how it performs:
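A minimal sketch, reusing the standardized Iris splits from earlier; the `average` argument selects the combining method:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

lr = LogisticRegression(random_state=0).fit(X_train_std, y_train)
y_pred = lr.predict(X_test_std)

print('Micro PRE: %.3f' % precision_score(y_test, y_pred, average='micro'))
print('Macro PRE: %.3f' % precision_score(y_test, y_pred, average='macro'))
```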

We can see that the micro average reports more conservative scores. This is because it takes class sizes into account: in our testing set, the first class is smaller than the others, so its score (1.00) contributes less to the final score.

Assignment

Goal

Predict the presence or absence of cardiac arrhythmia in a patient

Read this note carefully

Dataset

The Arrhythmia dataset from the UCI repository contains 280 variables collected from 452 patients. This information helps in distinguishing between the presence and absence of cardiac arrhythmia and in classifying arrhythmia into one of 16 groups. In this homework, we will focus on building a Logistic Regression model that classifies between the presence and absence of arrhythmia.

The original class 1 refers to a 'normal' ECG, which we will regard as 'absence of arrhythmia'; the rest of the classes will be regarded as 'presence of arrhythmia'.

How big is the dataset?

The last column of the dataset is the class label. It contains the 16 ECG classifications:

Let's make that column (the class label) dichotomous: its value will be 0 if the ECG is normal and 1 otherwise.
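A one-line sketch, assuming the data has been loaded into a pandas DataFrame `df` whose last column holds the original class labels:

```python
# 0 = normal ECG (original class 1), 1 = presence of arrhythmia
df.iloc[:, -1] = (df.iloc[:, -1] != 1).astype(int)
```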

Are the groups balanced?

Some columns have missing values denoted as '?'. To make the preprocessing simpler, let's just retain the columns with numeric values.
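One possible sketch, again assuming the DataFrame `df`:

```python
import numpy as np
import pandas as pd

# Mark '?' as missing, coerce all columns to numeric, then drop any
# column that still contains a missing value
df = df.replace('?', np.nan).apply(pd.to_numeric, errors='coerce')
df = df.dropna(axis=1)
```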

Please continue working from here.