KNN, SVM, Data Preprocessing, and Scikit-learn Pipeline

Shan-Hung Wu & DataLab
Fall 2022

In this lab, we will learn how to classify nonlinearly separable data using KNN and SVM classifiers. Then, we will discuss how to handle missing values in a dataset and how to encode categorical data. Finally, we will show how to pack multiple data preprocessing steps into a single Pipeline in Scikit-learn to simplify the training workflow.

Nonlinearly Separable Data

Sometimes, the classes in a dataset may be nonlinearly separable. A famous synthetic dataset in this category, called the two-moon dataset, can be illustrated in 2-D:
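Below is a minimal sketch of how such a dataset can be generated and visualized with scikit-learn's make_moons; the sample count and noise level are illustrative assumptions, not the lab's exact settings.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# generate a 2-D two-moon dataset with two interleaving half circles
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

plt.scatter(X[y == 0, 0], X[y == 0, 1], marker='o', label='class 0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='x', label='class 1')
plt.legend()
plt.show()
```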

If we apply linear classifiers such as Perceptron or Logistic Regression to this dataset, it is not possible to get a reasonable decision boundary:
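As a quick sketch of this, assuming the (X, y) generated above and default hyperparameters, we can fit both linear models and inspect their accuracy:

```python
from sklearn.linear_model import LogisticRegression, Perceptron

# both linear models underfit the two-moon data
for clf in (Perceptron(random_state=0), LogisticRegression(random_state=0)):
    clf.fit(X, y)
    print('%s training accuracy: %.2f'
          % (clf.__class__.__name__, clf.score(X, y)))
```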

As we can see, neither of the classifiers can separate the two classes well due to underfitting. In other words, the models are too simple to capture the shape of the true boundary, introducing a large bias. It is better to use nonlinear classifiers for this dataset.

K-Nearest Neighbors Classifier

The KNN algorithm itself is fairly straightforward and can be summarized by the following steps:

  1. Choose number of neighbors, $K$, and a distance metric.
  2. Find the $K$ nearest neighbors of the data point that we want to classify.
  3. Assign the class label by majority vote.

KNN is a typical example of a lazy learner. It is called lazy because it simply memorizes the training dataset in the training phase and defers computing the discriminative function $f$ until a prediction is requested.

By executing the following code, we will now implement a KNN model in scikit-learn using a Euclidean distance metric:
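The following is a sketch of such a cell, assuming the two-moon data (X, y) from above; n_neighbors=5 and the 70/30 train/test split are illustrative choices, and the Minkowski metric with p=2 corresponds to the Euclidean distance.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Minkowski metric with p=2 is the Euclidean distance
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
knn.fit(X_train, y_train)
print('Test accuracy: %.2f' % knn.score(X_test, y_test))
```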

The KNN classifier achieves 95% accuracy. That's pretty good! Another advantage of such a memory-based approach is that the classifier immediately adapts as we collect new training data.

However, the downside is that the computational complexity of classifying new points grows linearly with the number of samples in the training dataset in the worst case, unless the dataset has very few dimensions (features) and the algorithm is implemented with efficient data structures such as KD-trees, which support nearest-neighbor search in logarithmic expected time. Furthermore, we cannot discard training samples since no training step is involved. Thus, storage space can become a challenge if we are working with large datasets.

Support Vector Classifier

Another powerful and widely used memory-based classifier is the nonlinear support vector classifier (SVC). Like KNN, nonlinear SVC makes predictions by a weighted average of the labels of similar examples (similarity being measured by a kernel function). However, only the support vectors, i.e., examples falling on or inside the margin, can have positive weights and need to be remembered. In practice, SVC usually remembers far fewer examples than KNN does. Another difference is that SVC is not a lazy learner: the weights are trained eagerly in the training phase.

Let's make predictions using SVCs:
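A minimal sketch of such a prediction, reusing the split from above; the RBF kernel with C=1.0 and gamma='scale' are assumed defaults rather than the lab's exact settings.

```python
from sklearn.svm import SVC

# an RBF-kernel SVC with default-like hyperparameters
svc = SVC(kernel='rbf', C=1.0, gamma='scale')
svc.fit(X_train, y_train)
print('Test accuracy: %.2f' % svc.score(X_test, y_test))
```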

As we can see, nonlinear SVC achieves 95% accuracy, the same as KNN. However, we haven't tuned its hyperparameters for the best performance yet. Let's try other values:
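One possible way to see the effect of the hyperparameters is to loop over a few values of $C$ and $\gamma$; the values below are illustrative assumptions.

```python
# try a few (C, gamma) combinations and compare test accuracy
for C in (0.01, 1.0, 100.0):
    for gamma in (0.01, 1.0, 100.0):
        svc = SVC(kernel='rbf', C=C, gamma=gamma)
        svc.fit(X_train, y_train)
        print('C=%-6g gamma=%-6g accuracy=%.2f'
              % (C, gamma, svc.score(X_test, y_test)))
```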

From the above example, we can see that tuning the hyperparameters is very important for nonlinear SVC. Different hyperparameter settings can make a huge difference in performance.

Tuning the hyperparameters of SVC is not as straightforward as in polynomial regression, where we can simply increase the polynomial degree from 1 and stop once the validation performance no longer improves. In SVC, there is no simple way to relate a particular hyperparameter combination $(C,\gamma)$ to the model complexity. So, we have to try out all possible (or specified) combinations exhaustively in order to pick the best one. This procedure is called grid search:
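A sketch of grid search with cross-validation using GridSearchCV; the parameter grid and cv=5 are assumptions for illustration.

```python
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 10]}
gs = GridSearchCV(SVC(kernel='rbf'), param_grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)
```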

After finding the best parameter, we can then use it to evaluate on test data:
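For example, continuing the sketch above:

```python
# with refit=True (the default), best_estimator_ has already been retrained
# on the whole training set using the best hyperparameters
best_svc = gs.best_estimator_
print('Test accuracy: %.2f' % best_svc.score(X_test, y_test))
```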

We have perfect test accuracy. That's great!

NOTE: grid search may consume a lot of time when the dataset is large. A practical approach is to start with a coarse grid, narrow down to the regions that give relatively good performance, and then perform finer-grained grid searches within those regions recursively.

Data Preprocessing

Now we have hands-on experience with many machine learning algorithms. It's time to apply them to a more realistic dataset with quality issues. Data quality and the amount of useful information in the dataset are key factors determining how well a machine learning algorithm can learn and/or predict. Therefore, it is critical for us to examine and preprocess a dataset before we feed it to a learning algorithm. The dataset we use in the following section is the Adult dataset.

The Adult dataset

The Adult dataset from the UCI repository collects information about people (attributes/features) for determining whether a person makes over 50K a year (label). The attributes are as follows:

1.  age               continuous.
2.  workclass         Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, 
                      Without-pay, Never-worked.
3.  fnlwgt            continuous.
4.  education         Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 
                      9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
5.  education-num     continuous.
6.  marital-status    Married-civ-spouse, Divorced, Never-married, Separated, Widowed, 
                      Married-spouse-absent, Married-AF-spouse.
7.  occupation        Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, 
                      Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, 
                      Priv-house-serv, Protective-serv, Armed-Forces. 
8.  relationship      Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
9.  race              White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
10. sex               Female, Male.
11. capital-gain      continuous.
12. capital-loss      continuous.
13. hours-per-week    continuous.
14. native-country    United-States, Cambodia, England, Puerto-Rico, Canada, Germany, 
                      Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, 
                      Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, 
                      Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, 
                      Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, 
                      Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
15.  label            >50K, <=50K

You can see more details about the dataset here. Let's load the data:
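A sketch of how the data might be loaded with pandas; the UCI download URL, the na_values setting, and the column names (taken from the attribute list above) are assumptions about the exact loading cell.

```python
import pandas as pd

# column names follow the attribute list above
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'label']

# '?' marks missing values in the raw file; parse it as NaN
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
df = pd.read_csv(url, header=None, names=columns,
                 na_values='?', skipinitialspace=True)
df.head(15)
```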

We can observe two things in this dataset:

  1. Many attributes are not numeric but categorical.
  2. There are missing values. For example, data point No. 14 has a missing value in native-country.

Since most machine learning algorithms can only take datasets with numeric features and without missing values, we have to preprocess this dataset.

Handling Categorical Data

Real-world datasets usually contain one or more categorical features. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, T-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order. For example, we could think of T-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue.

In the Adult dataset, there is no obviously ordinal feature, so we will focus only on nominal ones in this lab. We can use the LabelEncoder in scikit-learn to help us encode categorical values into numerical values.
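A sketch of this step: the variable name encoded_df is introduced here for illustration, and we encode only the non-missing entries of each object-typed column so that the NaNs can still be handled in the next section. (LabelEncoder is designed for target labels; applying it column by column like this follows the lab's approach, while OrdinalEncoder would be the more idiomatic choice for features.)

```python
from sklearn.preprocessing import LabelEncoder

encoded_df = df.copy()
for col in encoded_df.select_dtypes(include='object').columns:
    # encode only the non-missing entries so NaNs can be imputed later
    notna = encoded_df[col].notna()
    encoded_df.loc[notna, col] = LabelEncoder().fit_transform(
        encoded_df.loc[notna, col])
encoded_df.head()
```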

After executing the code, we successfully replaced the categorical values with numerical values.

Dealing with Missing Data

It is common in real-world applications that our samples have missing values in one or more attributes for various reasons. For example, there could have been an error in the data collection process, certain measurements may not be applicable, or particular fields could have been left blank intentionally in a survey. We typically see missing values as blank spaces, NaN, or question marks in our datasets.

Unfortunately, most computational tools are unable to handle such missing values or would produce unpredictable results if we simply ignored them. Therefore, it is crucial that we take care of those missing values before proceeding with further analysis.

First, we can use the isnull() method of Pandas' DataFrame to see how many missing values we have in the Adult dataset:

For a larger dataset, it can be tedious to look for missing values manually. In this case, we can use the isnull() method to return a DataFrame of Boolean values, where True indicates that a value is missing. Using the sum() method, we can then return the number of missing values per column.
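For example, assuming the df loaded above with '?' parsed as NaN:

```python
# True marks a missing entry; summing Booleans counts them per column
df.isnull().sum()
```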

Next, we will discuss some common strategies to handle missing data.

Eliminating Samples or Features with Missing Values

One of the easiest ways to deal with missing data is to simply remove the corresponding features (columns) or samples (rows) from the dataset entirely. We can call the dropna() method of DataFrame to eliminate rows or columns:
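A sketch of both variants; the variable name drop_row_data is introduced here and reused later when we compare against imputation.

```python
# drop every row (axis=0) or column (axis=1) that contains at least one NaN
drop_row_data = df.dropna(axis=0)
drop_col_data = df.dropna(axis=1)
print(df.shape, drop_row_data.shape, drop_col_data.shape)
```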

The dropna method supports several additional parameters that can come in handy:
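A few of them, shown as illustrative (not exhaustive) examples:

```python
# drop rows only if every column is NaN
df.dropna(how='all')

# keep only rows that have at least 13 non-NaN values
df.dropna(thresh=13)

# drop rows that have NaN in specific columns only
df.dropna(subset=['native-country'])
```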

Although removing missing data seems to be a convenient approach, it comes with several disadvantages. For example, the dataset may become too small after we remove a large number of samples, resulting in overfitting. Additionally, if we eliminate too many features, we run the risk of losing valuable relationships between features that our classifier needs to discriminate between classes.

Imputing Missing Values

If we do not have a large dataset, the removal of samples or dropping of entire feature columns may not be feasible, since we could lose valuable information. An alternative way is to use interpolation techniques to estimate the missing values from other training samples in the same dataset. There are some common interpolation techniques we can use, such as mean imputation, median imputation, and most frequent imputation.

The SimpleImputer class from scikit-learn provides a convenient way for imputation. Here we use it to perform the most frequent imputation since the missing values in the Adult dataset are all categorical features:
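A sketch of this step on the label-encoded frame from earlier; the variable names impute_data and encoded_df are assumptions carried over from the previous sketches.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
impute_data = pd.DataFrame(imputer.fit_transform(encoded_df),
                           columns=encoded_df.columns)
impute_data.isnull().sum().sum()   # expected to be 0 after imputation
```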

After executing the code, we successfully replaced the missing values with the most frequent values.

One-Hot Encoding

If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Take 'workclass' for example: the classifier will assume that 'State-gov' is larger than 'Self-emp-not-inc', and 'Self-emp-not-inc' is larger than 'Private'. This incorrect assumption can lead to degraded performance. For example, if a model uses weight decay for regularization, it may prefer categorical values that happen to be encoded closer to $0$.

A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature column for each unique value in the nominal feature. To perform this transformation, we can use the OneHotEncoder from Scikit-learn:
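A sketch of encoding both the imputed data and the drop-row data; the list of categorical columns is an assumption based on the attribute list above, and the output names match the ones discussed below.

```python
from sklearn.preprocessing import OneHotEncoder

# categorical columns of the Adult dataset (assumed from the attribute list)
categorical_cols = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'race', 'sex', 'native-country']

impute_onehot_data = OneHotEncoder().fit_transform(impute_data[categorical_cols])
drop_row_onehot_data = OneHotEncoder().fit_transform(drop_row_data[categorical_cols])
print(impute_onehot_data.shape, drop_row_onehot_data.shape)
```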

Here, we can see that the number of columns in both datasets increases significantly. Note that the numbers of columns of impute_onehot_data and drop_row_onehot_data are different, which implies that the drop-row method removes every occurrence of some value in a column, resulting in a loss of information.

NOTE: by default, the OneHotEncoder returns a sparse matrix when we use the transform() method. Sparse matrices store arrays that are mostly zeros in a memory-efficient way.

The get_dummies() Method in Pandas

An alternative, often more convenient, way to create dummy features via one-hot encoding is to use the get_dummies() method implemented in Pandas.
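For example, applied to the raw DataFrame loaded earlier:

```python
# get_dummies expands each string column into dummy columns and leaves the
# numeric columns (age, fnlwgt, ...) untouched
dummies_data = pd.get_dummies(df)
dummies_data.head()
```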

NOTE: the get_dummies() method only converts string columns and leaves all other columns unchanged. If you want to use this method, you have to ensure that the categorical data are all of string type; otherwise, they will not be encoded.

Scikit-learn Pipeline

When we applied preprocessing techniques in the previous labs, such as standardization or PCA, you might have noticed that we have to reuse the parameters obtained from fitting the training data in order to scale and compress new data, e.g., test data. Scikit-learn's Pipeline allows us to fit a model including an arbitrary number of transformation steps and apply it to make predictions on new data. The following figure summarizes how Pipeline works:

Here, we give an example of how to combine the imputer and OneHotEncoder with KNN or SVM:
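The following is a sketch of such pipelines. The restriction to the categorical feature columns, the 70/30 split, and the hyperparameters are assumptions for illustration; the lab's actual cells may differ.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# use the categorical features and a binary label, then split into train/test
X_adult = df[categorical_cols]
y_adult = (df['label'] == '>50K').astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X_adult, y_adult,
                                          test_size=0.3, random_state=0)

knn_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
svc_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('svc', SVC(kernel='rbf')),
])

for name, pipe in [('KNN', knn_pipe), ('SVC', svc_pipe)]:
    pipe.fit(X_tr, y_tr)
    print('%s test accuracy: %.3f' % (name, pipe.score(X_te, y_te)))
```

Note that all fitting (imputation statistics, one-hot categories, classifier weights) happens only on the training split; calling score() on the test split reuses those fitted parameters automatically.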

We can check whether one-hot encoding is useful or not:
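One way to sketch this comparison is to swap the one-hot encoder for an ordinal (label-style) encoding inside the same pipelines; OrdinalEncoder here stands in for the label-encoded features used earlier and is an assumption, not necessarily the lab's exact cell.

```python
from sklearn.preprocessing import OrdinalEncoder

knn_ord = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value',
                               unknown_value=-1)),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])
svc_ord = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder(handle_unknown='use_encoded_value',
                               unknown_value=-1)),
    ('svc', SVC(kernel='rbf')),
])

for name, pipe in [('KNN (ordinal)', knn_ord), ('SVC (ordinal)', svc_ord)]:
    pipe.fit(X_tr, y_tr)
    print('%s test accuracy: %.3f' % (name, pipe.score(X_te, y_te)))
```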

As we can see, the performance of KNN does not change much because the model has no preference for specific numerical values. On the other hand, the accuracy of SVC drops because its cost function includes a weight decay term that can be misled when the categorical features are not encoded as one-hot vectors.

We can also compare the performance between imputation and dropping rows:
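A sketch of this comparison, reusing the SVC pipeline and split settings from above:

```python
# keep only the samples without missing values, then refit the SVC pipeline
mask = X_adult.notna().all(axis=1)
X_tr_d, X_te_d, y_tr_d, y_te_d = train_test_split(
    X_adult[mask], y_adult[mask], test_size=0.3, random_state=0)

svc_pipe.fit(X_tr_d, y_tr_d)
print('SVC (drop rows) test accuracy: %.3f' % svc_pipe.score(X_te_d, y_te_d))
```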

Dropping rows gives only slightly worse results than imputation since we have a large enough dataset.

Finally, let's combine the SVC pipeline with grid search:
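A sketch of wrapping the SVC pipeline in GridSearchCV; pipeline parameters are addressed as <step name>__<parameter>, and the grid values and cv=3 are illustrative assumptions.

```python
from sklearn.model_selection import GridSearchCV

# 'svc__C' and 'svc__gamma' reach into the 'svc' step of the pipeline
param_grid = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}
gs_pipe = GridSearchCV(svc_pipe, param_grid, scoring='accuracy', cv=3, n_jobs=-1)
gs_pipe.fit(X_tr, y_tr)
print(gs_pipe.best_params_)
print('Test accuracy: %.3f' % gs_pipe.score(X_te, y_te))
```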

Assignment

In this assignment, you have to train models and handle quality issues on the Mushroom dataset. This dataset includes descriptions of hypothetical samples of gilled mushrooms, each described by 22 features. Please refer to the website for more information about this dataset.

Goal

Given the dataset, predict whether a mushroom is poisonous or edible.

Read this note carefully