DataLab Cup 1: Text Feature Engineering

Shan-Hung Wu & DataLab
Fall 2022


In this competition, you are provided with a supervised dataset $\mathbb{X}$ consisting of the raw content of news articles and the binary popularity (where $1$ means "popular" and $-1$ means "unpopular", calculated based on the number of shares in online social networking services) of these articles as labels. Your goal is to learn a function $f$ from $\mathbb{X}$ that is able to predict the popularity of an unseen news article.

Dataset Format

Platform: Kaggle

Feature Engineering is More Important Than You Expected

So far, we learn various machine learning techniques based on datasets where the date features are predefined. In many real-world applications, including this competition, we only get raw data and have to define the features ourself. Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work. While good modeling and training techniques help you make better predictions, feature engineering usually determines whether your task is "learnable".

To demonstrate the importance of feature engineering, let's use the IMDB review dataset to show how to extract meaningful information from a subset of these movie reviews to build a machine learning model that can predict whether a certain reviewer liked or disliked a movie.

You can download the data you will use below.

We get movie reviews in its raw content---there is only one feature called review. If the review is positive comment, then the label field sentiment equals to 1; otherwise 0. To be able to predict from text, we have to go through several preprocessing steps first.

Preprocessing: Data Cleaning

Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate pieces of information in the dataset. Let's print a review and see if we need to clean up the raw text:

As we can see here, the text contains HTML markup as well as punctuation and other non-letter characters. Since we care only about the semantics, we remove the HTML markup as it does not contain much useful semantics. Also, although punctuation marks might be useful in certain NLP contexts, we remove all punctuation marks for simplicity. One exception is the emoticon characters such as ":)" since they are certainly useful for sentiment analysis. Furthermore, we convert all text to lowercase since it doesn't matter if reviews are in upper or lower case.

In summary, we clean up the text by:

To accomplish this task, we use Python's regular expression (re) library, and a powerful HTML parsing tool BeautifulSoup4. If you don't have BeautifulSoup4 yet, you can install it via Anaconda:

> conda install beautifulsoup4

By calling BeautifulSoup(text, 'html.parser'), we constructs a BeautifulSoup object, which represents the document as a nested data structure, and you can navigate the tree easily, like selecting a tag or querying tags with some regex pattern (see more on BeautifulSoup website). For this example, we simply remove all HTML tags (including the tag properties) and keep only the raw texts between tags by calling the method get_text(). After we remove the HTML markup, we used a slightly more complex regex to find emoticons, which we temporarily stored as emoticons. Next we remove all non-word characters from the text via the regex "[\W]+", convert the text into lowercase characters, and add the temporarily stored emoticons to the end of the text. Additionally, we removed the nose character (-) from the emoticons for consistency.

Let's do a sanity check:

Our cleaning preprocessor seems to work correctly. That's great!

Now, we need to think about how to split the text corpora into individual elements. This is called tokenization. One way to tokenize documents is to split them into individual words by splitting the cleaned document at its whitespace characters.

The example sentence is now split into tokens. However, we see a problem here: the token "running" and "run" only differs in verb tense. It is not a good idea to keep them as different tokens as this introduces unnecessary redundancy in the vector representation. Let's merge them using a technique called word stemming.

Preprocessing: Word Stemming

Word stemming is a process that transforms words into their root forms and allows us to map related words to the same stem. The original stemming algorithm was developed by Martin F. Porter in 1979 and is hence known as the Porter stemming algorithm. The Natural Language Toolkit for Python implements the Porter stemming algorithm, which we use here. In order to install the NLTK, you can simply execute:

> conda install nltk

NOTE: NLTK module provides powerful tools for various NLP tasks, such as the sentiment polarity scoring, common stop words, POS tagging, etc., which you may find useful for this competition.

As we can see, the word "running" is now reduced to its root form "run".

NOTE: words stemming just heuristically strips outs prefix or suffix of words. Therefore, it'll produce strange result for some words, ex: the word "boring" will be wrongly reduced into non-existing word "bor". To overcome this, there's another technique, called lemmatization, which grammatically transforms words back to root form. Lemmatization is also implemented by NLTK in nltk.stem. Empirically, there is no much difference in performance between the two techniques.

Preprocessing: Stop-Word Removal

Stop-words are simply words that are extremely common in all sorts of texts thus contain little useful information that can be used to distinguish between different classes of documents. Example stop-words are "is", "and", "has", and "the". Removing stop-words can be useful if we are working with raw or normalized term frequencies such as BoW and Feature Hashing but not for TF-IDF which already downweight frequently occurring words. The BoW, feature hashing, and TF-IDF will be explained in the next sections.

Since machine learning models only accept numerical features, we must convert categorical features, such as tokens into a numerical form. In the next section, we introduce several commonly used models, including BoW, TF-IDF, and Feature Hashing that allows us to represent text as numerical feature vectors.

BoW (Bag-Of-Words)

The idea behind bag-of-words model is to represent each document by occurrence of words, which can be summarized as the following steps:

  1. Build vocabulary dictionary by unique token from the entire set of documents;
  2. Represent each document by a vector, where each position corresponds to the occurrence of a vocabulary in dictionary.

Each vocabulary in BoW can be a single word (1-gram) or a sequence of $n$ continuous words (n-gram). It has been shown empirically that 3-gram or 4-gram BoW models yield good performance in anti-spam email filtering application.

Here, we use Scikit-learn's implementation CountVectorizer to construct the BoW model:

The parameter ngram_range=(min-length, max-length) in CountVectorizer specifies the vocabulary to be {min-length}-gram to {max-length}-gram. For example ngram_range=(1, 2) will use both 1-gram and 2-gram as vocabularies. After constructing BoW model by calling fit(), you can access BoW vocabularies in its attribute vocubalary_, which is stored as Python dictionary that maps vocabulary to an integer index.

Let's transform the example documents into feature vectors:

Since each document contains only a small subset of vocabularies, CountVectorizer.transform() stores feature vectors as scipy.sparse matrix, where entry index is (document-index, vocabulary-index) pair, and the value is the term frequency---the number of times a vocabulary (term) occurs in a document. For example, (0,0) 2 means the 1st term "happy" appears twice in the 1st document.

Unfortunately, many Scikit-learn classifiers do not support input as sparse matrix now. We can convert doc_bag into a Numpy dense matrix:

Let's convert part of our movie review into BoW vectors and see what are the most frequent words.

To find out most frequent words among documents, we first sum up vocabulary counts in documents, where axis=0 is the document index. Then, we sort the summed vocabulary count array in ascending order and get the sorted index by argsort(). Next, we revert the sorted list by [::-1], and feed into inverse_transform() to get corresponding vocabularies. Finally, we show the 10 most frequent vocabularies with their occurrence counts.

Next, we introduce the TF-IDF model that downweights frequently occurring words among the input documents.

TF-IDF (Term-Frequency & Inverse-Document-Frequency)

TF-IDF model calculates not only the term-frequency (TF) as BoW model does, but also the document-frequency (DF) of a term, which refers to the number of documents that contain this term. The TF-IDF score for a term is defined as


where the $\log()$ term is called the inverse-document-frequency (IDF) and $N_\text{doc}$ is the total number of documents. The idea behind TF-IDF is to downweight the TF of a word if it appears in many documents. For example, if a word appears in every document, the second term become $\log(1)+1=1$, which will be smaller than any other word appearing in only a part of documents.

NOTE: we add $1$ to both the numerator and denominator inside the $\log()$ in the above definition so to avoid the numeric issue of dividing by $0$.

Let's create the TF-IDF feature representation:

Now we have a problem, the number of features that we have created in doc_tfidf is huge:

There are more than 3000 features for merely 100 documents. In practice, this may lead to too much memory consumption (even with sparse matrix representation) if we have a large number of vocabularies.

Feature Hashing

Feature hashing reduces the dimension vocabulary space by hashing each vocabulary into a hash table with a fixed number of buckets. As compared to BoW, feature hashing has the following pros and cons:

Ok, now we can transform raw text to feature vectors. Let's do the sentiment classification.

Sentiment Classification Pipeline

Let's use the LogisticRegression model to classify the movie reviews into positive and negative classes. As discussed in previous sections, there are several preprocessing steps to do before, so the workflow will be:

  1. Preprocessing: clean the text, and remove stop words;
  2. Convert words to vector: extract feature vectors from the raw review text;
  3. Classification: train a LogisticRegression model to do sentiment classification;
  4. Evaluate: we'll do 10-fold cross-validation to evaluate general performance.

In order to evaluate general performance of our model by 10-fold CV, which trains and evaluates the model 10 times, each on different split of the training and testing sets. It's a tedious task if we repeat steps 1 to 3 for each split ourself, thus we'll use the Pipeline in Scikit-learn to wrap these steps 1 to 3.

To emphasize the importance of data preprocessing, we compare the performance of pipelines with/withoud data preprocessing.

As we can see, the AUC is improved with preprocessing. Furthermore, the feature hashing reduces space consumption at the cost of degraded performance.

More Creative Features

Now, you can go create your basic set of features for the text in competition. But don't stop from here. If you do aware the power of feature engineering, use your creativity to extract more features from the raw text. The more meaningful features you create, the more likely you will get a better score and win.

Here are few examples for inspiration:

There are lots of other directions you can explore, such as NLP features, length of news, etc.

Use Out-of-Core Learning If You Don't Have Enough Memory

The size of dataset in the competition (300MB in raw text) is much larger than the example IMDB dataset (80MB in raw text). The dataset, after being represented as feature vectors, may become much larger, and you are unlikely to store all of them in memory. Next, we introduce another training technique called the Out of Core Learning to help you train a model using data streaming.

The idea of Out of Core Learning is similar to the stochastic gradient descent, which updates the model when seeing a minibatch, except that each minibatch is loaded from disk via a data stream. Since we only see a part of the dataset at a time, we can only use the HashingVectorizer to transform text into feature vectors because the HashingVectorizer does not require knowing the vocabulary space in advance.

Let's create a stream to read a chunk of CSV file at a time using the Pandas I/O API:

Good. Our stream works correctly.

For out-of core learning, we have to use models that can train and update the model's weight iteratively. Here, we use the SGDClassifier to train a LogisticRegressor using the stochastic gradient descent. We can partial update SGDClassifier by calling the partial_fit() method. Our workflow now becomes:

  1. Stream documents directly from disk to get a mini-batch (chunk) of documents;
  2. Preprocess: clean and remove stop-words in the mini-batch of documents;
  3. Convert words to vector: use HashingVectorizer to extract features from text;
  4. Update SGDClassifier and go back to step 1.

Let's do the out-of core learning:

After fitting SGDClassifier by an entire pass over training set, let's plot the learning curve:

The learning curve looks great! The validation accuracy improves as more examples are seen.

Since training SGDClassifier may take long, you can save your trained classifier to disk periodically:

Now you have the all the supporting knowledge for the competition. Happy coding and good luck!



The evaluation metric is AUC. The ranking shown on the leaderboard before the end of the competition reflects only the AUC performance over part of test.csv. However, this is not how we evaluate your final scores. After the competition, we calculate AUC over the entire test.csv and report the final ranking thereby.

There will be two baseline results, namely, Benchmark-60 and Benchmark-80. You have to outperform Benchmark-60 to get 60 points, and Benchmark-80 to get 80. Meanwhile, the higher AUC you achieve, the higher the final score you will get.


What you can do:

What you can't do:

Violation of any prohibited rule will be considered as cheating and results in 0 final score.

Honor code

In addition to the behaviors outlined by the official competition rules, "cheating" encompasses any attempt to gain an edge in accuracy by using information that is outside of the provided dataset, or an attempt to use the provided information in a way that is not intended, or attempt to copy code from other teams. Examples of cheating include (but are not limited to):

Your team will get 0 score in this competition once being found out cheating.

Important Dates


After the competition, each team has to hand in a report in Jupyter notebook format via the eeclass system. Your report should include:

The file name of your report must be: DL_comp1_{Your Team number}_report.ipynb.