COMP5349: Cloud Computing Sem. 1/2020
Assignment: Text Corpus Analysis
Group Work: 20% 30.04.2020
This assignment tests your ability to design and implement a Spark application to handle
a relatively large data set. It also tests your ability to analyse the execution performance
of your own application on a cluster.
This is a group assignment; each group can have up to 2 students. You are encouraged to
form groups within the same lab. A number of self-enrolled groups have been created in
Canvas for each lab, prefixed with the lab room and day. If you prefer to form a group with
a member from a different lab, you need to enrol in a group for the lab that one of the members
regularly attends. You may be requested to move to a different group if the chosen lab is
2 Data Set
The data set used in this assignment is a public text corpus called Multi-Genre Natural
Language Inference (MultiNLI). The corpus consists of 433k sentence pairs in 10 genres
from written and spoken English. Each pair contains a premise sentence and a hypothesis
sentence. The corpus was developed for a particular NLP task called Natural Language Inference. This task involves building a model to automatically determine whether a hypothesis sentence
is true given that its premise sentence is true.
The corpus is divided into the following sets:
• a training set with sentence pairs from five genres: FICTION, GOVERNMENT, SLATE,
TELEPHONE and TRAVEL
• a matched development set with sentence pairs from the same five genres as those in
the training set
• a matched test set with sentence pairs from the same five genres as those in the
training set
• a mismatched development set with sentence pairs from another five genres: 9/11,
FACE-TO-FACE, LETTERS, OUP and VERBATIM
• a mismatched test set with sentence pairs from the same 5 genres as those in the
mismatched development set
The training set contains around 390K sentence pairs. The development and test sets contain around 10K sentence pairs each.
3 Analysis Workloads
You are asked to perform two types of analysis:
• Vocabulary Exploration
• Sentence Vector Exploration
3.1 Vocabulary Exploration
You are asked to produce a few statistics of the vocabulary used in the corpus. Here, the
vocabulary refers to the set of all the words appearing in the corpus or in a subset of the
corpus. No stemming or spelling check is needed to remove near duplicates or typos.
The first set of workloads involves the relatively small development and test data sets.
There are four sets in total, each with around 10K sentence pairs. The matched development
set and the matched test set contain sentence pairs in five genres. The mismatched development set and the mismatched test set contain sentence pairs in five different genres. You are
asked to find out:
1. the number of common words between matched and mismatched sets
2. the number of words unique to the matched sets
3. the number of words unique to the mismatched sets
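The three counts above are plain set operations on the two vocabularies. The sketch below is a small plain-Python reference that you could use to check Spark output on a tiny sample; it is not a Spark implementation. It assumes whitespace splitting and lower-casing (the lower-casing is an assumption, not a requirement stated here), and the function names are hypothetical. On RDDs of distinct words, the same idea would roughly correspond to `intersection` and `subtract`.

```python
def vocabulary(sentences):
    """Return the set of lower-cased, whitespace-separated words."""
    return {word for s in sentences for word in s.lower().split()}


def compare_vocabularies(matched_sentences, mismatched_sentences):
    """Count common words and words unique to each side."""
    matched = vocabulary(matched_sentences)
    mismatched = vocabulary(mismatched_sentences)
    return {
        "common": len(matched & mismatched),
        "matched_only": len(matched - mismatched),
        "mismatched_only": len(mismatched - matched),
    }


# Toy input standing in for the matched and mismatched dev/test sets.
stats = compare_vocabularies(
    ["the cat sat", "a dog ran"],
    ["the bird flew away", "a cat slept"],
)
print(stats)
```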
The next set of workloads deals with the training data set. It contains sentence pairs
from five genres. Among the vocabulary of the training set, there are common words appearing in all genres, and there could be words that appear in only one genre. You are asked
to find out the distribution of common and unique words. To be specific, you are asked to find:
1. the percentages of words appearing in five genres, in four genres, in three genres,
in two genres and in one genre
2. the same percentages after removing a given list of stop words
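One way to think about this distribution: collect, for every word, the set of genres it appears in, then count how many words have a genre set of size 1, 2, and so on. The sketch below does this in plain Python on a toy input, as a reference for checking results on a small sample. The function name, the lower-casing and the whitespace tokenisation are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def genre_count_distribution(rows, stop_words=frozenset()):
    """rows: iterable of (genre, sentence) pairs.

    Returns {k: percentage of vocabulary words that appear in exactly k genres}.
    """
    genres_per_word = defaultdict(set)
    for genre, sentence in rows:
        for word in sentence.lower().split():
            if word not in stop_words:
                genres_per_word[word].add(genre)
    total = len(genres_per_word)
    if total == 0:
        return {}
    counts = Counter(len(genres) for genres in genres_per_word.values())
    return {k: 100.0 * n / total for k, n in sorted(counts.items())}


rows = [
    ("fiction", "the cat sat"),
    ("travel", "the hotel"),
    ("slate", "the cat"),
]
dist = genre_count_distribution(rows)
dist_no_stop = genre_count_distribution(rows, stop_words={"the"})
print(dist)
print(dist_no_stop)
```

Removing stop words changes the denominator as well as the numerators, because the vocabulary itself shrinks.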
3.2 Sentence Vector Exploration
Most text mining and NLP tasks start by converting the input document or sentence(s) into
a vector. There are many ways to compute the vector representation of a sentence, from
traditional vector space based TFIDF representation to simple averaging of word vectors
to complex neural models trained on very large corpora. A sentence vector is expected to
embed rich semantic and syntactic information about the sentence.
In this workload, you are asked to compare two sentence vector representation methods
based on their ability to capture the genre feature of sentences. The sentence vector
representation methods are:
• TFIDF based vector representation. You should use the implementation provided by
the Spark ML library. You can decide on the dimension of the vector.
• Pre-trained sentence encoder. We recommend the Universal Sentence
Encoder released by Google, which produces a 512-dimensional vector.
For each vector representation method, you are asked to encode every sentence in the
training data set as a vector, then cluster all sentences into five clusters. Each cluster may
contain sentences belonging to different genres. Ideally, most sentences in a cluster come
from a single genre. Each cluster is labelled with the genre that most of its
sentences come from. After labelling each cluster, you are asked to compute the confusion
matrix to show the accuracy of clustering using this particular vector representation. A
confusion matrix shows, for each label, the percentage of correctly and incorrectly labelled
data. Below is a sample table showing the values in a confusion matrix and their meanings:
       a      b      c      d      e
a    60%    10%    10%    10%    10%
b     5%    70%    10%    10%     5%
c    15%     5%    70%     5%     5%
d     5%    10%     5%    70%    10%
e    15%     5%     5%     5%    70%
Each column corresponds to an actual genre and each row to a cluster label. Reading down
the first value column: of all sentences in genre a, 60% are clustered as a, 5% as b,
15% as c, 5% as d and 15% as e. Your application only needs to compute the values in
the confusion matrix. The actual table should be included in the report.
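The labelling and normalisation steps can be sketched in plain Python as follows. The sketch takes (genre, cluster id) pairs, labels each cluster with its majority genre, and expresses each cell as a percentage of the sentences of the actual genre, so that every column sums to 100% as in the sample table. The function name is hypothetical. Two behaviours are assumptions of this sketch rather than requirements: ties for the majority genre are broken arbitrarily, and if two clusters receive the same label their percentages are added into one row.

```python
from collections import Counter, defaultdict


def confusion_matrix(pairs):
    """pairs: iterable of (true_genre, cluster_id).

    Returns {cluster_label: {true_genre: percent of that genre's sentences}}.
    """
    per_cluster = defaultdict(Counter)
    genre_totals = Counter()
    for genre, cluster in pairs:
        per_cluster[cluster][genre] += 1
        genre_totals[genre] += 1

    # Label each cluster with the genre most of its sentences come from.
    labels = {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}

    matrix = defaultdict(lambda: defaultdict(float))
    for cluster, counts in per_cluster.items():
        for genre, n in counts.items():
            matrix[labels[cluster]][genre] += 100.0 * n / genre_totals[genre]
    return {row: dict(cols) for row, cols in matrix.items()}


# Toy clustering result: 4 sentences of genre "a", 4 of genre "b", 2 clusters.
pairs = [("a", 0)] * 3 + [("a", 1)] + [("b", 1)] * 3 + [("b", 0)]
cm = confusion_matrix(pairs)
print(cm)
```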
4 Performance Analysis
You are asked to find out empirically, with the same cluster capacity, which one of the
following configurations delivers better performance for your application:
• Having a small number of executors, each with larger capacity
• Having a large number of executors, each with limited capacity
You need to run one of your applications in a few resource allocation configurations to
observe the performance difference.
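As an illustration of "same cluster capacity", the sketch below builds two `spark-submit` command lines whose total cores and total executor memory are equal. The numbers, the application name `app.py` and the helper names are hypothetical; the YARN master is an assumption. Dynamic allocation is switched off in the sketch on the assumption that you want the executor count to stay fixed during a run.

```python
def spark_submit_args(num_executors, cores, memory_gb, app="app.py"):
    """Build a spark-submit command line for one resource configuration."""
    return [
        "spark-submit",
        "--master", "yarn",
        # Assumption: keep the executor count fixed for the whole run.
        "--conf", "spark.dynamicAllocation.enabled=false",
        "--num-executors", str(num_executors),
        "--executor-cores", str(cores),
        "--executor-memory", f"{memory_gb}g",
        app,
    ]


def total_capacity(cfg):
    """Return (total cores, total executor memory in GB) for a configuration."""
    return (cfg["num_executors"] * cfg["cores"],
            cfg["num_executors"] * cfg["memory_gb"])


# Two hypothetical configurations with identical total capacity.
CONFIGS = {
    "few_large": {"num_executors": 2, "cores": 4, "memory_gb": 8},
    "many_small": {"num_executors": 8, "cores": 1, "memory_gb": 2},
}

for name, cfg in CONFIGS.items():
    print(name, total_capacity(cfg), " ".join(spark_submit_args(**cfg)))
```

Keeping the totals equal means that any difference in run time can be attributed to how the capacity is divided rather than to how much of it there is.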
5 Implementation Requirement
All analysis should be implemented using the Spark API. Use your own judgement to decide
which parts of the application will use Spark RDD, SparkSQL or SparkML.
While developing your application, make sure that you also pay attention to the quality
of your application. In particular,
• You are asked to design the content and sequence of RDDs/DataFrames carefully to
minimize repetitive operations.
• You are asked to choose the operations and design their sequence carefully to minimize shuffling size.
There are two deliverables: source code and project report (up to 10 pages). Both are
due in week 12, on Wednesday 20th of May at 23:59.
There will be different links for the source code and report submission to facilitate plagiarism detection. The marker may need to run your code in their environment; make
sure you prepare one or more README.md files with details on how to configure the environment and run your code. The source code and associated README.md file(s) should
be submitted as a single zip file. Remember, only the source code and README.md file(s)
should be submitted. No data file or compiled version should be included. ANY group
that includes data or any large file in their code submission will receive penalty point
All members need to sign the group assignment cover page digitally and submit the
signed cover page together with the report. They can be submitted as a single pdf file or
as a zip file.
There will be a Zoom demo session in week 12 during normal lab time. Each group will
have a 10-minute session with the tutor. ALL members of the group must attend the
demo. The tutor will inform each group of its session time and
Zoom meeting room information. Demo details and what you need to prepare before the
demo will be released in week 12.
8 Report Structure
The report should include the following sections:
• Introduction
• Vocabulary Exploration
• Sentence Vector Exploration
• Performance Evaluation
You may add other sections or an appendix at the end.
The Introduction section should briefly cover the programming language you used to
implement the project and additional software packages used. It should also cover the
environment for debugging and running the application.
The Vocabulary Exploration section should cover the application design, implementation details and results related to vocabulary exploration workloads. To explain the design,
you should include one or more annotated data flow DAGs to show the sequence of operations you used to get certain statistics. The annotation should show the structure/content
of RDDs/DataFrames in the DAG. You should also include design decisions and implementation details to achieve the quality requirements. The results should be put together in an
easy-to-read format.
The Sentence Vector Exploration section should cover the application design, implementation details and results related to sentence vector exploration workloads. You may
include one or more annotated data flow DAGs to show the sequence of operations you
used to prepare the input for clustering and the data flow of computing the confusion matrix. The results should contain two tables showing the confusion matrix data obtained
from the two sentence vector methods.
The Performance Evaluation section should give an overview of the execution environment, including the cluster size and capacity of nodes. You should describe each
resource configuration you have experimented with. You should experiment with at least
TWO resource configurations on ONE relatively complex application with multiple jobs
and stages. For each resource configuration, provide a description of the properties used
to control the resource allocation and high-level execution statistics,
such as execution time, number of executors, number of threads, total shuffling size and
so on. This section should include a few screenshots or diagrams to highlight performance
variations under different configurations. For any diagram or screenshot, there should be
enough explanation of the performance statistics and differences observed.
9 Data Set Download Instruction and Other Materials
9.1 Data Set Download
We suggest downloading the MNLI data set from the GLUE Tasks page of the GLUE site. Download MultiNLI Matched as a zip file. After extracting the zip file, you will see five tsv
files in the top-level folder. These are the data sets you will work with.
• train.tsv: the training data set
• dev_matched.tsv: the development set with matched genre
• dev_mismatched.tsv: the development set with mismatched genre
• test_matched.tsv: the test set with matched genre
• test_mismatched.tsv: the test set with mismatched genre
All tsv files have many columns. The data we work with are in the following three
columns: genre, sentence1, sentence2.
There are many stop word lists available online. You can use any that you are familiar
with. For students with no preference, we recommend using the Stanford Stop Word
9.2 Malformed Rows
There could be malformed rows and, depending on the file reader you use to load the
data, the malformed rows may present different issues. We suggest discarding any
malformed row that standard tsv file readers cannot parse properly.
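A minimal sketch of one possible filtering rule, using Python's standard `csv` module on an in-memory sample: keep a row only if it has the same number of fields as the header, then project the three columns of interest. The sample header is a shortened, hypothetical one, not the real file's column list. Treating quote characters as ordinary text (`csv.QUOTE_NONE`) is a design choice of this sketch; a reader that interprets quotes may reject or merge different rows.

```python
import csv
import io


def read_clean_rows(lines, columns=("genre", "sentence1", "sentence2")):
    """Return (genre, sentence1, sentence2) tuples, skipping malformed rows."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    idx = [header.index(c) for c in columns]
    rows = []
    for fields in reader:
        # A row with a missing or extra tab does not line up with the header.
        if len(fields) != len(header):
            continue
        rows.append(tuple(fields[i] for i in idx))
    return rows


# Hypothetical sample: the second data row is missing a column.
sample = io.StringIO(
    "pairID\tgenre\tsentence1\tsentence2\n"
    "1\tfiction\tHe left.\tHe stayed.\n"
    "2\ttravel\tbroken row\n"
    "3\tslate\tShe said \"hi.\tShe spoke.\n"
)
clean = read_clean_rows(sample)
print(clean)
```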
9.3 Tokenizer Options
The majority of the tasks in this assignment involve extracting words from the input sentences.
There are many ways of doing it. The simplest option is to use the string split feature to
get words from a sentence; alternatively, you may use the tokenizer provided by SparkML,
which appears to simply split the string in the same way. Neither option produces a
clean set of words. For instance, punctuation marks usually stay attached to the last
word of a sentence. A better option is to use the NLTK package. The package itself is
installed on EMR, but the data sets required to run most of its functions are not.
They need to be installed at cluster start-up time. Instructions on how to configure
cluster-wide software will be given in the week 10 lab.
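The difference between plain splitting and a cleaner tokenisation can be seen on a single sentence. The sketch below compares `str.split` with a simple regular expression that keeps letters, digits and an optional apostrophe suffix. The regular expression is only an illustrative stand-in for a proper tokenizer such as the one in NLTK; it is not equivalent to it, and the lower-casing is an assumption.

```python
import re

# Letters/digits, optionally followed by an apostrophe suffix such as "'t".
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def split_tokens(sentence):
    """Whitespace splitting: punctuation stays attached to words."""
    return sentence.lower().split()


def regex_tokens(sentence):
    """Regex tokenisation: punctuation around words is dropped."""
    return WORD_RE.findall(sentence.lower())


sentence = "The hotel wasn't far, was it?"
print(split_tokens(sentence))  # ['the', 'hotel', "wasn't", 'far,', 'was', 'it?']
print(regex_tokens(sentence))  # ['the', 'hotel', "wasn't", 'far', 'was', 'it']
```

With plain splitting, "far," and "far" count as different vocabulary entries, which inflates the unique-word statistics.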