An information retrieval model is an essential component for many applications such as search,
question answering, and recommendation. In this coursework you are going to develop information
retrieval models that solve the problem of passage retrieval, i.e. given a query, return
a ranked list of short texts (passages). More specifically, you are going to build a passage re-ranking
system: given a candidate list of passages for a query, you should re-rank these candidate
passages based on an information retrieval model.
Data can be downloaded from this online repository.1 The data set consists of 3 files:
• test-queries.tsv is a tab separated file, where each row contains a test query identifier
(qid) and the actual query text.
• passage-collection.txt is a collection of passages, one per row.
• candidate-passages-top1000.tsv is a tab separated file with an initial selection of at
most 1000 passages for each of the queries in test-queries.tsv. The format of this file is
<qid pid query passage>, where pid is the identifier of the passage retrieved, query is
the query text, and passage is the passage text (all tab separated). The passages contained
in this file are the same as the ones in passage-collection.txt. However, there might be
some repetitions, i.e. the same passage could have been returned for more than one query.
Figure 1 shows some sample rows from that file.
It is important that you do not change the above filenames, i.e. your source code must use the
above filenames, otherwise we might not be able to assess your code automatically.
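As a minimal sketch of how the candidate file described above could be read, the Python function below groups passages by query identifier. The function name read_candidates and the returned dictionaries are illustrative assumptions, not a required interface; the only detail taken from the description is the column order <qid pid query passage>.

```python
import csv
from collections import defaultdict


def read_candidates(lines):
    """Group candidate passages by query id.

    `lines` is any iterable of tab-separated rows in the column order
    qid, pid, query, passage. Names and return types are illustrative.
    """
    queries = {}                      # qid -> query text
    candidates = defaultdict(dict)    # qid -> {pid: passage text}
    # QUOTE_NONE: treat quote characters inside passages as ordinary text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for qid, pid, query, passage in reader:
        queries[qid] = query
        candidates[qid][pid] = passage
    return queries, candidates
```

The function could be called on a file handle, for example one opened with open("candidate-passages-top1000.tsv", encoding="utf-8", newline=""); the UTF-8 encoding is an assumption about the data. Keying passages by pid within each query also makes repeated passages across queries easy to spot.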
Coursework tasks and deliverables
Please read carefully as all deliverables (D1 to D12) are described within the tasks. If a deliverable
requests an answer in a specific format, follow these guidelines carefully as marking may
be automated (incorrect format will be penalised and might result in 0 marks). Although we
strongly suggest using Python, you could also use Java. Please make one consistent choice for
this coursework. Do not submit your source code as a Jupyter Notebook or similar. Do not use
external functions or libraries that can solve (end-to-end) the tasks of building an inverted index,
TF-IDF, the vector space model, BM25, or any of the query likelihood language models. You are
not allowed to reuse any code that is available from someone or somewhere else (i.e. write your
own code). Use only unigram (1-gram) text representations to solve this coursework’s tasks.2
Write your report using the ACL LaTeX template,3 and submit 1 PDF file named report.pdf.
All plots in your report must be in vector format (e.g. PDF). Your report should not exceed 6
pages. In total, your submission should be made of 4 source code files, 5 output CSV files, and 1
PDF file with the report. Do not deploy any internal directory structure in your submission, i.e.
just submit the aforementioned 10 files.
Task 1 – Text statistics (30 marks). Use the data in passage-collection.txt for this
task. Extract terms (1-grams) from the raw text. In doing so, you can also perform basic text
preprocessing steps. However, you can also choose not to. In any case, you should not remove
stop words for Task 1, but you can do so in later tasks of the coursework.
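One possible way to extract unigrams is sketched below. The preprocessing shown (lower-casing and keeping runs of ASCII letters and digits) is only an assumption made for illustration; it is one of many defensible choices, and stop words are deliberately kept, as Task 1 requires. The names tokenise and count_terms are hypothetical.

```python
import re
from collections import Counter

# Assumption: a term is a maximal run of ASCII letters or digits.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenise(text):
    """Lower-case the text and split it into unigram terms."""
    return TOKEN_PATTERN.findall(text.lower())


def count_terms(passages):
    """Count term occurrences over an iterable of passage strings."""
    counts = Counter()
    for passage in passages:
        counts.update(tokenise(passage))
    return counts
```

With this sketch, len(count_terms(passages)) would give the vocabulary size under the stated preprocessing assumptions; a different tokenisation rule would yield a different vocabulary.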
D1 Describe and justify your preprocessing choice(s), if any, and report the size of the identified
index of terms (vocabulary). Then, implement a function that counts the number of occurrences
of terms in the provided data set, plot their probability of occurrence (normalised
frequency) against their frequency ranking, and qualitatively justify that these terms follow
Zipf’s law. As a reminder, Zipf’s law for text sets s = 1 in the Zipfian distribution defined
by
$$f(k; s, N) = \frac{k^{-s}}{\sum_{i=1}^{N} i^{-s}} \qquad (1)$$
where f(·) denotes the normalised frequency of a term, k denotes the term’s frequency
rank in our corpus (with k = 1 being the highest rank), N is the number of terms in
our vocabulary, and s is a distribution parameter. How does your empirical distribution
compare to the actual Zipf’s law distribution? Use Eq. 1 to explain where their difference
is coming from, and also compare the two in a log-log plot.
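The comparison in D1 could be prepared along the following lines. This is a sketch under stated assumptions, not a reference solution: it takes a mapping from terms to raw counts, and it evaluates the standard Zipfian form f(k; s, N) = k^(-s) / sum over i = 1..N of i^(-s). The function names are hypothetical.

```python
def empirical_distribution(counts):
    """Normalised term frequencies, ordered from rank k = 1 downwards.

    `counts` maps each term to its raw number of occurrences.
    """
    total = sum(counts.values())
    return [c / total for c in sorted(counts.values(), reverse=True)]


def zipf_distribution(n_terms, s=1.0):
    """Zipfian probabilities f(k; s, N) for k = 1..N, with N = n_terms."""
    weights = [k ** -s for k in range(1, n_terms + 1)]
    normaliser = sum(weights)          # sum_{i=1}^{N} i^(-s)
    return [w / normaliser for w in weights]
```

Both lists are indexed by rank, so they can be plotted against k = 1..N on the same axes. With s = 1 the Zipfian curve is a straight line of slope -1 on log-log axes, which makes deviations of the empirical curve at the highest and lowest ranks easier to see.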
D2 Submit your source code for Task 1 as a single Python or Java file titled task1.py or