COMP5623 Coursework on Image Caption Generation
10am 28th April 2020.
Submit work electronically via Minerva. This coursework is worth 20% of the credit on the
An indicative duration for this coursework is a total of around 17 hours.
Through this coursework, you will:
Understand the principles of text pre-processing and text embeddings
Gain experience working with an image to text model
Compare the performance and usage of RNNs versus LSTMs as sequence generators
Setup and resources
Please implement this coursework using Python and PyTorch. This coursework will use the
Flickr8k image caption dataset (Hodosh et al., 2013).
Sample image and ground truth captions
A child in a pink dress is climbing up a set of stairs in an entry
A girl going into a wooden building
A little girl climbing into a wooden playhouse
A little girl climbing the stairs to her playhouse
A little girl in a pink dress going into a wooden cabin
You are provided with a Jupyter Notebook file “COMP5623 CW2 Starter” with starter code.
The zip files containing the image data and text data can be downloaded here:
This dataset is larger than the ones we have used previously, so if you would like to work on
Colab, it is recommended to download the zip files, unzip them, and then upload all the files to
your Google Drive. This initial upload may take a while, but from then on you will only need to
mount your Drive every time you start a new Colab session and the data will be immediately
accessible. Mounting only takes a few seconds.
The first step is to build a vocabulary. The vocabulary consists of all the possible words which
can be used – both as input into the model, and as an output prediction. To build the vocabulary:
a. Parse the lines variable in the starter code, splitting the image ID from the caption.
b. Split the caption text into words and trim any trailing whitespace.
c. Convert all words into lowercase by calling word.lower().
d. Remove any punctuation (periods, commas, etc.).
e. Since the vocabulary size affects the embedding layer dimensions, it is better to remove
the very infrequently used words from the vocabulary. Remove any words which appear
3 times or less; this should leave you with a vocabulary size of roughly 3440 (plus or
minus some is fine).
f. Add all the remaining words to an instance of the Vocabulary() object provided.
Note that vocab.add_word() deals with duplicates already.
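Steps a to e can be sketched in plain Python as follows. This is a minimal illustration, not the starter code: it assumes each entry of lines has the layout "<image ID>#<caption number><TAB><caption>", and the image IDs, the helper names clean_caption() and build_word_list(), and the min_count parameter are all hypothetical.

```python
import string
from collections import Counter

# Hypothetical sample lines; the tab-separated layout is an assumption.
lines = [
    "img_001.jpg#0\tA child in a pink dress is climbing up a set of stairs .",
    "img_001.jpg#1\tA girl going into a wooden building .",
    "img_001.jpg#2\tA little girl climbing into a wooden playhouse .",
]

def clean_caption(caption):
    """Split into words, trim whitespace, lowercase, and strip punctuation."""
    table = str.maketrans("", "", string.punctuation)
    words = [w.strip().lower().translate(table) for w in caption.split()]
    return [w for w in words if w]  # drop tokens that were only punctuation

def build_word_list(lines, min_count=4):
    """Return the words that appear at least min_count times.

    min_count=4 corresponds to removing words which appear 3 times or less.
    """
    counts = Counter()
    for line in lines:
        image_id, caption = line.strip().split("\t", 1)
        image_id = image_id.split("#")[0]  # image ID without the caption number
        counts.update(clean_caption(caption))
    return sorted(w for w, c in counts.items() if c >= min_count)

# A lower threshold is used here only because the sample is tiny.
kept = build_word_list(lines, min_count=2)
print(kept)
```

Each word in the resulting list can then be passed to vocab.add_word() for step f.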
Complete the encoder network
Carefully read through the __init__() function of the EncoderCNN and complete the class
by writing the forward() function. Remember to put the part involving the ResNet in a with
BLEU for evaluation
One common way of comparing a generated text to a reference text is using BLEU, or the
Bilingual Evaluation Understudy. This article gives a good intuition to how the BLEU score is
The Python nltk package for natural language processing provides a function
sentence_bleu() which computes this score given one or more reference texts and a
hypothesis. For the following two sections, generate this score by using all five reference
captions compared against each generated caption. You can import it on Colab like this:
from nltk.translate.bleu_score import sentence_bleu
If working on your local machine you may need to install the nltk package using pip or conda
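For intuition about what BLEU measures, the sketch below computes clipped (modified) n-gram precision, which is one ingredient of the score. It is an illustration only and leaves out the brevity penalty and the averaging over n-gram orders; the function name clipped_precision() and the hypothesis caption are hypothetical. Scores reported for the coursework should come from sentence_bleu().

```python
from collections import Counter

def clipped_precision(references, hypothesis, n=1):
    """Fraction of hypothesis n-grams found in the references, where each
    n-gram counts at most as often as it occurs in any single reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp_counts = ngrams(hypothesis)
    if not hyp_counts:
        return 0.0

    # Largest count of each n-gram over all references.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref[gram] = max(max_ref[gram], count)

    clipped = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

references = [
    "a girl going into a wooden building".split(),
    "a little girl climbing into a wooden playhouse".split(),
]
hypothesis = "a girl climbing into a wooden house".split()

p1 = clipped_precision(references, hypothesis, n=1)  # unigram precision
p2 = clipped_precision(references, hypothesis, n=2)  # bigram precision
print(round(p1, 3), round(p2, 3))
```

Note that the references and the hypothesis are lists of tokens rather than raw strings, which is also the form sentence_bleu() expects.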
Train using an RNN for decoder
Add an RNN layer to the decoder where indicated, using the API as reference:
Keep all the default parameters except for batch_first, which you may set to True.
Add code to the training loop provided which generates a sample caption for each of two test set
images before any training, and after each epoch. For both sample images, display the image, the
reference captions, the generated caption, and its BLEU score after every epoch. Train for 5 epochs.
Displaying an image
To generate a prediction from your trained model, you will need to first pass the image tensor
into the encoder to get the image features. Remember to first process the image with the
data_transform. Because the EncoderCNN has a batch norm layer, and you are feeding
in single images (and we do not want to shift its running mean and standard deviation), run this line before
passing in your images:
Pass the features from the encoder into the decoder sample() function.
The decoder output is a one-dimensional vector of word IDs of length max_seq_length.
This means that if the caption is shorter than max_seq_length, you can ignore any padding
after the end-of-caption token.
Finally, use the vocab to convert the IDs into words and string the caption together.
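The conversion from word IDs to a caption string might look like the sketch below. The ID-to-word mapping idx2word, the token names "<pad>", "<start>" and "<end>", and the function name ids_to_caption() are assumptions for illustration; the Vocabulary object in the starter code may expose this mapping differently.

```python
def ids_to_caption(word_ids, idx2word, end_token="<end>", skip_tokens=("<start>",)):
    """Convert a sequence of word IDs into a caption string.

    Stops at the first end token, so any padding after it is ignored.
    """
    words = []
    for word_id in word_ids:
        word = idx2word[int(word_id)]  # int() also accepts a 0-dim tensor element
        if word == end_token:
            break
        if word not in skip_tokens:
            words.append(word)
    return " ".join(words)

# Hypothetical mapping and decoder output of length max_seq_length = 10.
idx2word = {0: "<pad>", 1: "<start>", 2: "<end>",
            3: "a", 4: "girl", 5: "climbing", 6: "stairs"}
sample_ids = [1, 3, 4, 5, 6, 2, 0, 0, 0, 0]

caption = ids_to_caption(sample_ids, idx2word)
print(caption)
```

Splitting the resulting string on whitespace gives the token list to use as the hypothesis when computing BLEU.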
Train for exactly five epochs. No need to use a validation set.
Using the fully trained model, generate captions for at least five additional test images and
display along with BLEU scores and reference captions.
Train using an LSTM for decoder
Now replace the RNN layer with an LSTM (leave the RNN layer commented out):
Again, keep all the default parameters except for batch_first, which you may set to True.
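As a rough guide to what changes when swapping the layers, the sketch below runs the same batch through nn.RNN and nn.LSTM with batch_first=True. The sizes are hypothetical and unrelated to the starter code. The main practical difference is that the LSTM returns its state as a (hidden, cell) tuple rather than a single tensor, which matters wherever the decoder passes states between steps.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
embed_size, hidden_size = 8, 16
batch_size, seq_len = 2, 5

# With batch_first=True the input is (batch, sequence, features).
inputs = torch.randn(batch_size, seq_len, embed_size)

rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

# nn.RNN returns the per-step outputs and the final hidden state.
rnn_out, rnn_hidden = rnn(inputs)

# nn.LSTM returns the per-step outputs and a (hidden, cell) state tuple.
lstm_out, (lstm_hidden, lstm_cell) = lstm(inputs)

print(rnn_out.shape, rnn_hidden.shape)
print(lstm_out.shape, lstm_hidden.shape, lstm_cell.shape)
```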
As with the RNN, display the same two sample images, generated captions and BLEU scores
before training and after each epoch. Train for five epochs. Using the fully trained model,
generate captions for at least five additional test images and display along with BLEU scores and
reference captions.
Deliverable and Assessment
You should submit a report of not more than 8 pages (excluding Python code):
1. A common practice in natural language processing is to lemmatize words before creating
tokens. While we did not do this for our vocabulary, take a look at
Flickr8k.lemma.token.txt and briefly explain what the advantages/disadvantages might be
of using lemmatized vs regular tokens. (max. 5 marks)
2. Present the sample images and generated caption for each epoch of training for both the
RNN and LSTM version of the decoder, including the BLEU scores. (max. 30 marks)
3. Compare training using an RNN vs LSTM for the decoder network (loss, BLEU scores
over test set, quality of generated captions, performance on long captions vs. short
captions, etc.). (max. 30 marks)
4. Among the text annotations files downloaded with the Flickr8k dataset are two files we
did not use: ExpertAnnotations.txt and CrowdFlowerAnnotations.txt. Read the readme.txt
to understand their contents, then consider and discuss how these might be incorporated
into evaluating your models. (max. 10 marks)
5. Python code for all your work (max. 25 marks)
Your submission should include one PDF file for your report, and one Python or .ipynb file,
zipped together into a .zip file.
M. Hodosh, P. Young and J. Hockenmaier (2013) “Framing Image Description as a Ranking
Task: Data, Models and Evaluation Metrics”, Journal of Artificial Intelligence Research, Volume
47, pages 853-899 (http://www.jair.org/papers/paper3994.html)