Image Captioning

Shan-Hung Wu & DataLab
Fall 2022

In the last lab, you learned how to implement machine translation, where the task is to transform a sentence S written in a source language into its translation T in the target language.

The model architecture for machine translation is intuitive. An “encoder” RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which is in turn used as the initial hidden state of a “decoder” RNN that generates the target sentence.
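To make the encoder-decoder idea concrete, here is a minimal NumPy sketch of the two RNNs; the sizes, weight names, and the plain tanh cell are toy assumptions for illustration, not the lab's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed, vocab = 8, 4, 10           # toy sizes (assumptions)

# Encoder RNN: read the source sentence, keep only the final hidden state.
W_xh = rng.normal(size=(embed, hidden)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1

def encode(source_embeddings):
    h = np.zeros(hidden)
    for x in source_embeddings:           # one recurrent step per source token
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h                              # fixed-length sentence representation

# Decoder RNN: start from the encoder's final state and emit target tokens.
U_xh = rng.normal(size=(embed, hidden)) * 0.1
U_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_hy = rng.normal(size=(hidden, vocab)) * 0.1

def decode_step(prev_token_embedding, h):
    h = np.tanh(prev_token_embedding @ U_xh + h @ U_hh)
    logits = h @ W_hy                     # unnormalized scores over the vocab
    return h, logits

source = rng.normal(size=(5, embed))      # stand-in for a 5-token sentence
h = encode(source)                        # encoder output = decoder init state
h, logits = decode_step(rng.normal(size=embed), h)
print(h.shape, logits.shape)              # (8,) (10,)
```

The key point is that the entire source sentence is compressed into the single vector `h` before any target word is produced.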

So, what if the encoder looks at images instead of reading sentences? That is, we use a convolutional neural network to obtain a vector representation of the image and a recurrent neural network to decode that representation into a natural-language sentence. The description must capture not only the objects contained in an image, but also how these objects relate to each other, as well as their attributes and the activities they are involved in.

This is Image Captioning, a very important challenge for machine learning algorithms, as it amounts to mimicking the remarkable human ability to compress huge amounts of salient visual information into descriptive language.


This paper presents a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images.

To the best of our knowledge, this is the first work that incorporates the Recurrent Neural Network in a deep multimodal architecture.

The whole m-RNN architecture contains three parts: a language model part, an image part, and a multimodal part.

It must be emphasized that:

  1. The image part is AlexNet; the activations of its seventh layer are connected to the multimodal layer.
  2. This model feeds the image into the network at every time step.
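The multimodal layer above can be sketched in a few lines of NumPy. All sizes and weight names below are toy assumptions; the point is only that the image feature is fused with the word embedding and the recurrent state at every time step:

```python
import numpy as np

rng = np.random.default_rng(0)
embed, recur, img_dim, multi, vocab = 4, 6, 5, 7, 10  # toy sizes (assumptions)

V_w = rng.normal(size=(embed, multi)) * 0.1    # word embedding -> multimodal
V_r = rng.normal(size=(recur, multi)) * 0.1    # recurrent state -> multimodal
V_i = rng.normal(size=(img_dim, multi)) * 0.1  # image feature -> multimodal
W_my = rng.normal(size=(multi, vocab)) * 0.1   # multimodal -> vocabulary logits

def multimodal_step(word_emb, r_state, image_feature):
    # The image feature enters the fusion at EVERY time step (m-RNN style).
    m = np.tanh(word_emb @ V_w + r_state @ V_r + image_feature @ V_i)
    return m @ W_my

image_feature = rng.normal(size=img_dim)       # stand-in for AlexNet layer-7 activations
logits = multimodal_step(rng.normal(size=embed),
                         rng.normal(size=recur),
                         image_feature)
print(logits.shape)                            # (10,)
```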


This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation, and that can be used to generate natural sentences describing an image. The model follows the encoder-decoder framework of machine translation, replacing the encoder RNN with a deep convolutional neural network.

It must be emphasized that:

  1. The model uses a more powerful CNN in the encoder, which yielded the best performance in the ILSVRC 2014 classification competition at the time.
  2. To deal with vanishing and exploding gradients, an LSTM is used in the decoder to generate sentences from the fixed-length vector representation produced by the CNN.
  3. The image is input only once, at t = -1, to inform the LSTM about the image contents.
    The authors empirically verified that feeding the image at each time step as an extra input yields inferior results: the network can exploit noise in the image and overfits more easily.
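The image-once scheme can be sketched as follows. This is a toy NumPy illustration with made-up sizes, and it uses a plain tanh RNN cell instead of an LSTM for brevity; the point is that the projected image feature is the decoder's input only at the first step:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, embed, vocab = 8, 4, 10              # toy sizes (assumptions)

W_xh = rng.normal(size=(embed, hidden)) * 0.1
W_hh = rng.normal(size=(hidden, hidden)) * 0.1
W_hy = rng.normal(size=(hidden, vocab)) * 0.1
W_img = rng.normal(size=(16, embed)) * 0.1   # project CNN feature to embed dim

def caption(image_feature, word_embeddings):
    h = np.zeros(hidden)
    # t = -1: the projected image is the FIRST and ONLY image input.
    h = np.tanh(image_feature @ W_img @ W_xh + h @ W_hh)
    logits = []
    for x in word_embeddings:                # later steps see words only
        h = np.tanh(x @ W_xh + h @ W_hh)
        logits.append(h @ W_hy)
    return np.stack(logits)

img = rng.normal(size=16)                    # stand-in for CNN features
out = caption(img, rng.normal(size=(3, embed)))
print(out.shape)                             # (3, 10)
```

Contrast this with the m-RNN design above, where the image feature is injected at every time step.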