Image Captioning

Shan-Hung Wu & DataLab
Fall 2022

In the last lab, you learned how to implement machine translation, where the task is to transform a sentence S written in a source language into its translation T in the target language.

The model architecture for machine translation is intuitive. An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which is in turn used as the initial hidden state of a “decoder” RNN that generates the target sentence.
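To make the encoder-decoder idea concrete, here is a minimal NumPy sketch of the two RNNs; the sizes, weight names, and the plain tanh cell are toy assumptions for illustration, not the lab's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed, vocab = 8, 4, 10           # toy sizes (assumptions)

# Encoder RNN: read the source sentence, keep only the final hidden state.
W_xh = rng.normal(size=(embed, hidden)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1

def encode(source_embeddings):
    h = np.zeros(hidden)
    for x in source_embeddings:           # one recurrent step per source token
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h                              # fixed-length sentence representation

# Decoder RNN: start from the encoder's final state and emit target tokens.
U_xh = rng.normal(size=(embed, hidden)) * 0.1
U_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_hy = rng.normal(size=(hidden, vocab)) * 0.1

def decode_step(prev_token_embedding, h):
    h = np.tanh(prev_token_embedding @ U_xh + h @ U_hh)
    logits = h @ W_hy                     # unnormalized scores over the vocab
    return h, logits

source = rng.normal(size=(5, embed))      # stand-in for a 5-token sentence
h = encode(source)                        # encoder output = decoder init state
h, logits = decode_step(rng.normal(size=embed), h)
print(h.shape, logits.shape)              # (8,) (10,)
```

The key point is that the entire source sentence is compressed into the single vector `h` before any target word is produced.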

So, what if the encoder looks at images instead of reading sentences? That is, we use a convolutional neural network to obtain a vector representation of the image and a recurrent neural network to decode that representation into a natural-language sentence. The description must capture not only the objects contained in an image, but also how these objects relate to each other, as well as their attributes and the activities they are involved in.

This is Image Captioning, a very important challenge for machine learning algorithms, as it amounts to mimicking the remarkable human ability to compress huge amounts of salient visual information into descriptive language.


This paper presents a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images.

To the best of our knowledge, this is the first work that incorporates the Recurrent Neural Network in a deep multimodal architecture.

The whole m-RNN architecture contains three parts: a language model part, an image part, and a multimodal part.

It must be emphasized that:

  1. The image part is AlexNet; the activations of its seventh layer are connected to the multimodal layer.
  2. This model feeds the image into the network at every time step.
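The multimodal layer above can be sketched in a few lines of NumPy. All sizes and weight names below are toy assumptions; the point is only that the image feature is fused with the word embedding and the recurrent state at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
embed, recur, img_dim, multi, vocab = 4, 6, 5, 7, 10  # toy sizes (assumptions)

V_w = rng.normal(size=(embed, multi)) * 0.1    # word embedding -> multimodal
V_r = rng.normal(size=(recur, multi)) * 0.1    # recurrent state -> multimodal
V_i = rng.normal(size=(img_dim, multi)) * 0.1  # image feature -> multimodal
W_my = rng.normal(size=(multi, vocab)) * 0.1   # multimodal -> vocabulary logits

def multimodal_step(word_emb, r_state, image_feature):
    # The image feature enters the fusion at EVERY time step (m-RNN style).
    m = np.tanh(word_emb @ V_w + r_state @ V_r + image_feature @ V_i)
    return m @ W_my

image_feature = rng.normal(size=img_dim)       # stand-in for AlexNet layer-7 activations
logits = multimodal_step(rng.normal(size=embed),
                         rng.normal(size=recur),
                         image_feature)
print(logits.shape)                            # (10,)
```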


This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation, and that can be used to generate natural sentences describing an image. The model follows the encoder-decoder framework of machine translation, replacing the encoder RNN with a deep convolutional neural network.

It must be emphasized that:

  1. The model uses a more powerful CNN in the encoder, which yielded the best performance in the ILSVRC 2014 classification competition at the time.
  2. To deal with vanishing and exploding gradients, an LSTM is used in the decoder to generate sentences from the fixed-length vector representation produced by the CNN.
  3. The image is input only once, at t = -1, to inform the LSTM about the image contents.
    The authors empirically verified that feeding the image at each time step as an extra input yields inferior results: the network can exploit noise in the image and overfits more easily.
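The image-once scheme can be sketched as follows. This is a toy NumPy illustration with made-up sizes, and it uses a plain tanh RNN cell instead of an LSTM for brevity; the point is that the projected image feature is the decoder's input only at the first step:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed, vocab = 8, 4, 10              # toy sizes (assumptions)

W_xh = rng.normal(size=(embed, hidden)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_hy = rng.normal(size=(hidden, vocab)) * 0.1
W_img = rng.normal(size=(16, embed)) * 0.1   # project CNN feature to embed dim

def caption(image_feature, word_embeddings):
    h = np.zeros(hidden)
    # t = -1: the projected image is the FIRST and ONLY image input.
    h = np.tanh(image_feature @ W_img @ W_xh + h @ W_hh)
    logits = []
    for x in word_embeddings:                # later steps see words only
        h = np.tanh(x @ W_xh + h @ W_hh)
        logits.append(h @ W_hy)
    return np.stack(logits)

img = rng.normal(size=16)                    # stand-in for CNN features
out = caption(img, rng.normal(size=(3, embed)))
print(out.shape)                             # (3, 10)
```

Contrast this with the m-RNN design above, where the image feature is injected at every time step.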