Convolutional Neural Networks

Shan-Hung Wu & DataLab
Fall 2024

In this lab, we introduce two datasets, MNIST and CIFAR-10, and then discuss how to implement CNN models for them using TensorFlow. The major difference between MNIST and CIFAR-10 is their size. Because of memory and time constraints, we also offer a guide illustrating a typical TensorFlow input pipeline. Let's dive into TensorFlow!

MNIST

We start from a simple dataset. MNIST is a simple computer vision dataset. It consists of images of handwritten digits like these:

It also includes a label for each image, telling us which digit it is. For example, the labels for the above images are 5, 0, 4, and 1. Each image is 28 pixels by 28 pixels. We can interpret this as a big array of numbers:

The MNIST data is hosted on Yann LeCun's website. We can import the MNIST dataset directly from TensorFlow.
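For instance, the dataset can be loaded with a single call (a minimal sketch; the variable names are our own):

```python
import tensorflow as tf

# Load MNIST: 60,000 training and 10,000 test grayscale images of shape (28, 28)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Scale pixel values from [0, 255] to [0, 1]
x_train, x_test = x_train / 255.0, x_test / 255.0
```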

Softmax Regression on MNIST

Before jumping to Convolutional Neural Network model, we're going to start with a very simple model with a single layer and softmax regression.

We know that every image in MNIST is a handwritten digit between zero and nine, so there are only ten possible digits that a given image can be. We want the model to output the probability of the input image being each digit; that is, given an image, the model outputs a ten-dimensional vector.

This is a classic case where a softmax regression is a natural, simple model. If you want to assign probabilities to an object being one of several different things, softmax is the thing to do.
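A minimal sketch of such a model in Keras might look like this (the optimizer and number of epochs are assumptions, not necessarily the notebook's exact settings):

```python
# Softmax regression: flatten the 28x28 image and apply a single dense layer
softmax_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10, activation='softmax'),  # ten-dimensional probability vector
])

softmax_model.compile(optimizer='sgd',
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])
softmax_model.fit(x_train, y_train, epochs=5)
softmax_model.evaluate(x_test, y_test)
```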

From the above result, we get about 92% accuracy for softmax regression on MNIST. That is actually not very good, because we are using a very simple model.

Multilayer Convolutional Network on MNIST

We're now jumping from a very simple model to something moderately sophisticated: a small Convolutional Neural Network. This will get us to over 99% accuracy, not state of the art, but respectable.

Create the convolutional base

As input, a CNN takes tensors of shape (image_height, image_width, color_channels), ignoring the batch size. If you are new to color channels, MNIST has one (because the images are grayscale), whereas a color image has three (R, G, B). In this example, we will configure our CNN to process inputs of shape (28, 28, 1), which is the format of MNIST images. We do this by passing the argument input_shape to our first layer.
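A sketch of such a convolutional base, assuming the common 32-64-64 filter configuration (which is consistent with the output shapes discussed below):

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # First conv block: 32 filters of size 3x3 over the 28x28x1 input
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    # Deeper blocks: spatial size shrinks, channel count grows
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
])
```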

Let's display the architecture of our model so far.
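The Keras summary() call prints each layer's output shape and parameter count:

```python
model.summary()
```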

Above, you can see that the output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as we go deeper in the network. The number of output channels for each Conv2D layer is controlled by the first argument (e.g., 32 or 64). Typically, as the width and height shrink, we can afford (computationally) to add more output channels in each Conv2D layer.

Add Dense layers on top

To complete our model, we will feed the last output tensor from the convolutional base (of shape (3, 3, 64)) into one or more Dense layers to perform classification. Dense layers take vectors as input (which are 1D), while the current output is a 3D tensor. First, we will flatten (or unroll) the 3D output to 1D, then add one or more Dense layers on top. MNIST has 10 output classes, so we use a final Dense layer with 10 outputs and a softmax activation.

To reduce overfitting, we will apply dropout before the readout layer. The idea behind dropout is to train an ensemble of models instead of a single model. During training, we drop neurons with probability $p$, i.e., the probability of keeping a neuron is $1-p$. When a neuron is dropped, its output is set to zero, so it contributes to neither the forward pass nor the backward pass. Each training step thus trains a network slightly different from the previous one, just as if we were training a different network each time. During testing, however, we don't drop any neurons, so applying dropout is akin to averaging an ensemble. Randomly dropping units during training also prevents units from co-adapting too much. Thus, dropout is a powerful regularization technique for dealing with overfitting.
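Continuing the sketch, we flatten the convolutional output and add the dense head; the hidden width of 64 and the dropout rate of 0.5 are assumptions, not necessarily the notebook's exact values:

```python
model.add(layers.Flatten())                        # (3, 3, 64) -> (576,)
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.5))                     # drop units with p = 0.5 during training only
model.add(layers.Dense(10, activation='softmax'))  # readout layer: 10 class probabilities
```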

Here's the complete architecture of our model.

As you can see, our (3, 3, 64) outputs were flattened into vectors of shape (576) before going through two Dense layers.

Compile and train the model
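A sketch, assuming the Adam optimizer and 5 training epochs:

```python
# MNIST images need an explicit channel dimension for Conv2D
x_train_cnn = x_train.reshape(-1, 28, 28, 1)
x_test_cnn = x_test.reshape(-1, 28, 28, 1)

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train_cnn, y_train, epochs=5)
test_loss, test_acc = model.evaluate(x_test_cnn, y_test)
```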

As you can see, our simple CNN has achieved a test accuracy of 99%. Not bad for a few lines of code! For another style of writing a CNN (using the Keras Subclassing API and a GradientTape) head here.

CIFAR-10

Actually, MNIST is an easy dataset for beginners. To demonstrate the power of neural networks, we need a larger dataset: CIFAR-10.

CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. Here are the classes in the dataset, as well as 10 random images from each:

Before jumping to a complicated neural network model, we're going to start with KNN and SVM. The motivation here is to compare neural network models with traditional classifiers and highlight the performance of the neural network model.

tf.keras.datasets offers convenient facilities that automatically access some well-known datasets. Let's load CIFAR-10 from tf.keras.datasets:
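```python
import tensorflow as tf

# 50,000 training and 10,000 test color images of shape (32, 32, 3)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)
```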

For simplicity, we also convert the images to grayscale, using the luma coding that is common in video systems:
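A sketch using the standard ITU-R BT.601 luma weights (the helper name is our own):

```python
import numpy as np

def to_grayscale(images):
    """Convert RGB images to grayscale via luma coding: Y = 0.299R + 0.587G + 0.114B."""
    return np.dot(images[..., :3], [0.299, 0.587, 0.114])

x_train_gray = to_grayscale(x_train)
x_test_gray = to_grayscale(x_test)
```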

As we can see, the objects in the grayscale images are still recognizable.

Feature Selection

When it comes to object detection, HOG (histogram of oriented gradients) features are often extracted for classification. HOG first calculates the gradients of each image patch using the Sobel filter, then uses the magnitudes and orientations of the derived gradients to form a histogram per patch (a vector). After normalizing these histograms, it concatenates them into one HOG feature. For more details, read this tutorial.

Note: one could feed the raw images directly to the classifier; however, training would take much longer and yield worse performance.

Once we have our getHOGfeat function, we can extract the HOG features of all the images:
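If you don't have the lab's hand-rolled implementation at hand, a minimal substitute using scikit-image's off-the-shelf hog might look like this (the parameter values are illustrative, not necessarily the lab's):

```python
import numpy as np
from skimage.feature import hog

def getHOGfeat(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2)):
    # Compute one HOG descriptor (a 1D vector) per grayscale image
    return hog(image, orientations=orientations,
               pixels_per_cell=pixels_per_cell,
               cells_per_block=cells_per_block)

x_train_hog = np.array([getHOGfeat(img) for img in x_train_gray])
x_test_hog = np.array([getHOGfeat(img) for img in x_test_gray])
```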

K Nearest Neighbors (KNN) on CIFAR-10

scikit-learn provides off-the-shelf libraries for classification. For KNN and SVM classifiers, we can simply import them from scikit-learn:
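For example, a sketch of the KNN classifier on the HOG features (the value of n_neighbors is an assumption):

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(x_train_hog, y_train.ravel())  # labels come as shape (N, 1), so flatten them
print('KNN accuracy:', knn.score(x_test_hog, y_test.ravel()))
```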

We can observe that the accuracy of KNN on CIFAR-10 is embarrassingly bad.

Support Vector Machine (SVM) on CIFAR-10
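A sketch using a linear SVM (the lab's exact kernel and parameters may differ; a linear model keeps training tractable on ~50,000 HOG vectors):

```python
from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(x_train_hog, y_train.ravel())
print('SVM accuracy:', svm.score(x_test_hog, y_test.ravel()))
```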

From the above, SVM is slightly better than KNN, but still poor. Next, we'll design a CNN model using TensorFlow.

CNN on CIFAR-10

Although CIFAR-10 is larger than MNIST, it is still small compared to the datasets you will meet in the following lessons. For large datasets, we can't feed all the training data to the model at once due to limited GPU memory. Even if we could, we would still want the data-loading process to be efficient. An input pipeline is the common way to solve these problems.

Input Pipeline

Structure of an input pipeline

A typical TensorFlow training input pipeline can be framed as an ETL process:

  1. Extract: Read data from memory (NumPy) or persistent storage -- either local (HDD or SSD) or remote (e.g. GCS or HDFS).
  2. Transform: Use the CPU to parse and perform preprocessing operations on the data, such as shuffling, batching, and domain-specific transformations such as image decompression and augmentation, text vectorization, or video temporal sampling.
  3. Load: Load the transformed data onto the accelerator device(s) (e.g. GPU(s) or TPU(s)) that execute the machine learning model.

This pattern effectively utilizes the CPU, while reserving the accelerator for the heavy lifting of training your model. In addition, viewing input pipelines as an ETL process provides a framework that facilitates the application of performance optimizations.

tf.data API

To build a data input pipeline with tf.data, here are the steps that you can follow:

  1. Define data source and initialize your Dataset object
  2. Apply transformations on the dataset; the following are some commonly useful techniques
    • map
    • shuffle
    • batch
    • repeat
    • prefetch
  3. Create iterator

Construct your Dataset

To create an input pipeline, you must start with a data source. For example, to construct a Dataset from data in memory, you can use tf.data.Dataset.from_tensors() or tf.data.Dataset.from_tensor_slices(). Alternatively, if your input data is stored in a file in TFRecord format, you can use tf.data.TFRecordDataset().

Once you have a Dataset object, you can transform it into a new Dataset by chaining method calls on the tf.data.Dataset object. For example, you can apply per-element transformations such as Dataset.map(), and multi-element transformations such as Dataset.batch(). See the documentation for tf.data.Dataset for a complete list of transformations.

Now suppose we have simple data sources:
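(A toy example; the actual arrays in the notebook may differ.)

```python
import numpy as np

# Two toy data sources: ten scalar "examples" and a matching binary label for each
features = np.arange(10)
labels = np.arange(10) % 2
```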

We can create our TensorFlow Dataset object from these two arrays using tf.data.Dataset.from_tensor_slices, which automatically cuts the data into slices:
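```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Each element is now one (feature, label) pair
print(dataset.element_spec)
```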

Consume elements

The Dataset object is a Python iterable. This makes it possible to consume its elements using a for loop:
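```python
for feature, label in dataset:
    print(feature.numpy(), label.numpy())
```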

Or by explicitly creating a Python iterator using iter and consuming its elements using next:
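```python
it = iter(dataset)
print(next(it))  # first (feature, label) pair
print(next(it))  # second pair
```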

Apply transformations

Next, you can preprocess your data in this step according to your needs.

map

For example, Dataset.map() provides element-wise, customized data preprocessing:
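An illustrative transformation, doubling each feature while leaving the label untouched:

```python
dataset_mapped = dataset.map(lambda x, y: (x * 2, y))
```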

shuffle

Dataset.shuffle(buffer_size) maintains a fixed-size buffer and chooses the next element uniformly at random from that buffer. This way, your data comes in a different order each epoch, which prevents your model from overfitting to the order of the training data.
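For instance (the buffer size is illustrative):

```python
# Keep a buffer of 5 elements and sample the next element uniformly from it
dataset_shuffled = dataset.shuffle(buffer_size=5)
```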

batch

Now our dataset yields one example at a time. In practice, however, we usually want to read one batch at a time, so we call Dataset.batch(batch_size) to stack batch_size elements together.
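For example (the batch size is illustrative):

```python
# Stack every 4 consecutive elements into one batch
dataset_batched = dataset.batch(4)
for feature_batch, label_batch in dataset_batched:
    print(feature_batch.numpy())
```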

Note: Be careful that if you apply Dataset.shuffle after Dataset.batch, you'll shuffle the order of the batches, but the data within each batch stays the same.

repeat

Repeats this dataset count times.

Dataset.repeat(count) allows you to iterate over a dataset for multiple epochs. Setting count = None or -1 makes the dataset repeat indefinitely.

If you would like to perform a custom computation (e.g. to collect statistics) at the end of each epoch then it's simplest to restart the dataset iteration on each epoch:
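A sketch of that pattern (matching the repeat(2) and epochs = 3 discussed in the note below):

```python
epochs = 3
dataset_repeated = dataset.repeat(2)

for epoch in range(epochs):
    for feature, label in dataset_repeated:
        pass  # the training step would go here
    # end of epoch: a good place for custom computation, e.g. collecting statistics
    print('End of epoch', epoch)
```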

Note: Since we applied repeat(2) to the dataset, the code above actually iterates over each element of the dataset 6 times even though epochs = 3.

Therefore, I prefer to set a desired number of epochs rather than using repeat(), unless you want the same piece of data to potentially appear close together, e.g. dataset.repeat(n).shuffle(n).

prefetch

Creates a Dataset that prefetches elements from this dataset.

Dataset.prefetch(buffer_size) allows you to decouple the time when data is produced from the time when data is consumed.
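For example:

```python
# Produce elements ahead of consumption; AUTOTUNE picks the buffer size dynamically
dataset_prefetched = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
```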

repeat+batch / batch+repeat

The Dataset.repeat transformation concatenates its epochs without signaling the end of one epoch and the beginning of the next. Because of this, a Dataset.batch applied after Dataset.repeat will yield batches that straddle epoch boundaries:
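```python
# repeat then batch: batches may straddle epoch boundaries
ds = tf.data.Dataset.range(8)
for batch in ds.repeat(3).batch(5):
    print(batch.numpy())  # the second batch mixes the end of epoch 1 with the start of epoch 2
```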

If you need clear epoch separation, put Dataset.batch before the repeat:
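```python
# batch then repeat: every epoch ends with its own (possibly smaller) final batch
for batch in ds.batch(5).repeat(3):
    print(batch.numpy())
```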

shuffle+repeat / repeat+shuffle

As with Dataset.batch, the order relative to Dataset.repeat matters.

Dataset.shuffle doesn't signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat will show every element of one epoch before moving to the next.

But a repeat before a shuffle mixes the epoch boundaries together.
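The two orderings side by side:

```python
# shuffle then repeat: one epoch's elements are exhausted before the next epoch begins
shuffle_first = ds.shuffle(buffer_size=8).repeat(2)

# repeat then shuffle: elements from different epochs can be interleaved
repeat_first = ds.repeat(2).shuffle(buffer_size=8)
```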

Now, let's start designing our CNN model!

CNN Model for CIFAR 10

Loading Data Manually

To see how it works under the hood, let's load CIFAR-10 on our own (not using tf.keras). According to the description, the dataset is divided into five training batches and one test batch, each with 10,000 images. The test batch contains exactly 1,000 randomly selected images from each class.
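A sketch following the dataset's documented pickle format (the file paths are assumptions; each batch file is a pickled dict with keys b'data' and b'labels'):

```python
import pickle
import numpy as np

def load_batch(path):
    """Load one CIFAR-10 batch file: 10,000 rows of 3072 uint8 values plus labels."""
    with open(path, 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    return batch[b'data'], np.array(batch[b'labels'])

# Concatenate the five training batches; the directory name is an assumption
xs, ys = zip(*[load_batch(f'cifar-10-batches-py/data_batch_{i}') for i in range(1, 6)])
x_train, y_train = np.concatenate(xs), np.concatenate(ys)
x_test, y_test = load_batch('cifar-10-batches-py/test_batch')
```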

Optimization for input pipeline

We all know that GPUs can radically reduce the time required to execute a single training step; however, everything else (including data loading, data transformations, and memory copies from CPU to GPU) is done by the CPU, which sometimes becomes the bottleneck instead. We have learned above that there are lots of transformations that make datasets more complex and reusable. Now, we are going to accelerate the input pipeline for better training performance, following this guide.

The code below does roughly the same thing as in CNN Model for CIFAR 10. However, we change the dataset structure to show the time consumed during training.

Dataset with time

The block above defines our dataset, which carries not only images and labels but also steps, timings, and counters. Therefore, if we take two examples:

Above, the first block shows:

The second block shows:

Note that since 'cifar10_train.csv' is only opened once, only the first example records an Open time; subsequent examples are assigned -1 (negative values are filtered out). Also, example_idx is assigned -1, meaning that all examples are opened at the same time.

Besides, the Read duration is the same for all examples because we compute it as an average.

Map function with time

Now, each image is a flat vector of length 3072 $(=32\cdot32\cdot3)$ and each label is an integer from 0 to 9, so we have to apply a map function to each example, meanwhile recording the time cost of the map function:

Note that the @map_decorator on the map function is necessary to record the time correctly. Therefore, if we take two examples again with the map functions applied:

After the map function (and since we do not apply shuffle()), in the first block above the image is mapped from the original [59., 43., 50., ..., 140., 84., 72.] with shape=(3072,) to [[[0.794585 , 0.1525154 , -0.43435538], ...]] with shape=(24,24,3), and the original label 6 is now mapped to [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.].

or in a batch:

Although the annotation is quite hard to read, we can still roughly see that example_idx in Read, Map, and Train all run through from 0 to 127.

By the way, min_width in draw_timeline() indicates the minimum time span of the graph. Since draw_timeline() applies max() to min_width and the total execution time to decide the final time span of the graph, if you set a small value of min_width, the final time span of the graph will simply be the total execution time.

Re-train CNN with time

Here we only train for 2 epochs, since we are not pursuing performance but running experiments on a better data pipeline (shorter time cost).

Optimizing the dataset pipeline

The dataset pipeline of (dataset_train, dataset_test) is the same as in the CNN Model for CIFAR 10 part. However, if we optimize the pipeline as below, the performance improves. The optimizations include:

  1. Prefetching: overlaps the preprocessing and model execution of a training step.
  2. Interleaving (parallelizing data extraction): parallelizes the data loading step, interleaving the contents of other datasets (such as data file readers).
  3. Parallel mapping: parallelizes mapping across multiple CPU cores.
  4. Caching: caches a dataset, saving some operations (like file opening and data reading) from being executed during each epoch.
  5. Vectorized mapping: batches before mapping, so that mapping can be vectorized.

We won't explain each of them in detail. It's recommended to study the terms above in the official documentation. Here we only demonstrate the improvement.

Note that since we are vectorizing the map function, there is one extra batch dimension in each input when mapping. Therefore, we have to modify the map function first:
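A sketch of the batch-wise map function and the optimized pipeline. The file list train_files, the reader load_file_as_dataset, and the preprocessing details are assumptions; the lab's real map function also crops images to 24x24, which is omitted here:

```python
AUTOTUNE = tf.data.AUTOTUNE

@tf.function  # AutoGraph speed-up (see the note below)
def map_fun_batchwise(images, labels):
    # Inputs carry a leading batch dimension, so every op must be batch-aware
    images = tf.reshape(images, [-1, 3, 32, 32])    # flat 3072 vectors -> CHW
    images = tf.transpose(images, [0, 2, 3, 1])     # CHW -> HWC
    images = tf.image.per_image_standardization(tf.cast(images, tf.float32))
    labels = tf.one_hot(labels, depth=10)           # integer labels -> one-hot
    return images, labels

dataset_train = (
    tf.data.Dataset.from_tensor_slices(train_files)        # batch-file paths (assumed)
    .interleave(load_file_as_dataset,                      # parallel data extraction (assumed reader)
                num_parallel_calls=AUTOTUNE)
    .shuffle(10000)
    .batch(128)                                            # batch BEFORE map so mapping is vectorized
    .map(map_fun_batchwise, num_parallel_calls=AUTOTUNE)   # parallel mapping
    .cache()                                               # no Open/Read/Map cost after epoch 1
    .prefetch(AUTOTUNE)                                    # overlap preprocessing and training
)
```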

From the results above, we can see that the time consumed drops from 8268 to 261 seconds while the accuracy stays close. There is no Open, Read, or Map time at all in the 2nd epoch (because of the caching). Besides, the training and testing times in the 2nd epoch also decrease.

In this lab, we study how to optimize the data pipeline (I/O). Although the result is great, it depends heavily on the device. If you re-run exactly the same code on your own device, you may get a totally different result (if the bottleneck on your device is training speed rather than I/O). Besides, the data format may also affect the result. Here we read images from .pkl files, which are binary files with fast I/O. If we instead read images from .jpg/.png files (what you will do in the assignment below), the improvement would be even more evident.

In practical use (a simple demo)

The code above is complicated because we have to weave timing information into the dataset. In practice, the usage may look like this:
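A minimal sketch, assuming in-memory arrays x_train/y_train and reusing the batch-wise map function from above:

```python
AUTOTUNE = tf.data.AUTOTUNE

dataset_train = (
    tf.data.Dataset.from_tensor_slices((x_train, y_train))  # in-memory arrays (assumed)
    .shuffle(10000)
    .batch(128)
    .map(map_fun_batchwise, num_parallel_calls=AUTOTUNE)
    .cache()
    .prefetch(AUTOTUNE)
)

model.fit(dataset_train, epochs=10)
```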

interleave() is rarely used in my experience. Also remember that map_fun_batchwise() should include the @tf.function decorator for the AutoGraph speed-up.

Assignment

In this assignment, you have to implement the input pipeline of the CNN model and try to write/read TFRecords with the Oregon Wildlife dataset.

We provide you with the complete code for the image classification task of the CNN model, but with the input pipeline part removed. What you need to do is complete this part and train the model for at least 5 epochs.

Description of Dataset:

  1. The raw data is from Kaggle and consists of images of 20 classes of wildlife.
  2. We have filtered the raw data. You need to download the filtered images from here and use them to complete the image classification task.
  3. In the dataset we prepared for you, there are nearly 7,200 images covering 10 kinds of wildlife.

The sample image is shown below:

red_fox

Requirement:

Note:

The time.sleep(0.05) in the example is there to avoid concurrency issues that the TAs were unable to solve at short notice. However, the required duration depends on the device (perhaps the throughput between CPU and GPU). For example, on our lab servers, 0.05 is enough for one newer machine, while another machine still sometimes hits the error even when we increase it to 0.1. Therefore, if you encounter strange errors like those below, and the error does not always occur when re-running the same code, setting a higher sleep time may help, though it is slower. Errors the TAs encountered:

Notification:

The accuracy is now 9.69% on the testing set, taking 2989 seconds. Now try some data augmentation (transformations) and observe whether the accuracy and execution time increase or decrease.

After trying data augmentation (transformations), it's time to optimize what you did above for better efficiency.