[110 points] Vector-space model.
Write a Python program that implements the vector-space model. You will test this program on the
Cranfield dataset, which is a standard Information Retrieval text collection, consisting of 1400
documents from the aerodynamics field. The dataset cranfield.zip is available from the Files/
section under Canvas. The dataset is associated with 225 queries, along with human relevance
judgments. These are also provided on Canvas as two files, cranfield.queries and cranfield.reljudge.
Note on relevance judgments: They are provided as a list of document numbers associated with
queries. For instance, the following lines in the cranfield.reljudge file
indicate that for query 4, documents 236 and 166 are relevant; for query 5, documents 552, 401,
1297, and 1296 are relevant.
Write a program called vectorspace.py that indexes the collection and returns a ranked list of
documents for each query in a list of queries. The program will receive (at least) four arguments on
the command line, in the following order. Other arguments can be added, if necessary.
● Argument 1: indicating the weighting schemes to be used for the documents. For the
purpose of this assignment, your program should be able to “understand” at least two
values for the term weighting schemes specified in this argument: tfc (corresponding to the
traditional tf.idf for documents) and <your-own-weighting-scheme> (different from tfc).
● Argument 2: indicating the weighting schemes to be used for the query. For the purpose of
this assignment, your program should be able to “understand” at least two values for the
term weighting schemes specified in this argument: tfx (corresponding to the traditional
tf.idf for queries), <your-own-weighting-scheme> (different from tfx).
● Argument 3: indicating the name of the folder containing the collection of documents to be
indexed. For testing purposes, use the cranfieldDocs/ folder that you will obtain after
unpacking the cranfield.zip archive.
● Argument 4: indicating the name of the file with the test queries. For testing purposes, use
the cranfield.queries query file that you will get from the class webpage.
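The argument handling described above can be sketched as follows. This is a minimal illustration, not a required design: the helper name parse_args and the two scheme sets are assumptions, and the name of your own weighting scheme would need to be added to each set.

```python
import sys

# Scheme names the program "understands". Only the two names from the
# handout are listed; add the name of your own scheme to each set.
DOC_SCHEMES = {"tfc"}
QUERY_SCHEMES = {"tfx"}

USAGE = "usage: vectorspace.py <doc-scheme> <query-scheme> <docs-folder> <queries-file>"


def parse_args(argv):
    """Return (doc_scheme, query_scheme, docs_folder, queries_file).

    argv is expected to look like sys.argv: program name first, then the
    four required arguments in order. Extra arguments are ignored here.
    """
    if len(argv) < 5:
        raise SystemExit(USAGE)
    doc_scheme, query_scheme, docs_folder, queries_file = argv[1:5]
    if doc_scheme not in DOC_SCHEMES:
        raise SystemExit("unknown document weighting scheme: " + doc_scheme)
    if query_scheme not in QUERY_SCHEMES:
        raise SystemExit("unknown query weighting scheme: " + query_scheme)
    return doc_scheme, query_scheme, docs_folder, queries_file


if __name__ == "__main__" and len(sys.argv) > 1:
    print(parse_args(sys.argv))
```

With this sketch, running python vectorspace.py tfc tfx cranfieldDocs/ cranfield.queries would pass validation, while a missing argument or an unrecognized scheme name stops the program with a short message.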
Include the following two functions in vectorspace.py. For preprocessing in both functions, you are
encouraged to use the functions you implemented for Assignment 1. (Should you decide to use your
functions from Assignment 1, please include the relevant files in your submission):
a. Function that adds a document to the inverted index:
Name: indexDocument; input: the content of the document (string); input: weighting scheme for
documents (string); input: weighting scheme for query (string); input/output: inverted index (your
choice of data structure)
Given the content of a document, this function will:
● preprocess the content provided as input, i.e., apply removeSGML, tokenizeText,
removeStopwords, stemWords, as you did in Assignment 1.
● add the tokens to the inverted index provided as input and compile the counts necessary to
calculate the term weights for the given weighting schemes. Note: the inverted index will be
updated with the tokens from the document being processed.
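One possible shape for indexDocument is sketched below, under several assumptions. The preprocess helper is only a stand-in for the Assignment 1 pipeline (it strips tags and lowercases, with no stopword removal or stemming). The index layout, a dictionary mapping each term to a {document id: raw term frequency} dictionary, is one choice among many. The extra doc_id parameter is not part of the signature given above; it is added here so the sketch has an identifier to store.

```python
import re
from collections import Counter


def preprocess(text):
    """Stand-in for removeSGML, tokenizeText, removeStopwords, stemWords.

    Assumption: this only strips tags and lowercases; substitute the
    Assignment 1 functions in a real submission.
    """
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())


def indexDocument(content, doc_scheme, query_scheme, index, doc_id):
    """Add one document's term counts to index, which is modified in place.

    index maps term -> {doc_id: raw term frequency}. Raw counts and the
    posting-list sizes (document frequencies) are enough for tf.idf; the
    scheme names are accepted so that a different scheme could collect
    extra statistics at this point.
    """
    counts = Counter(preprocess(content))
    for term, tf in counts.items():
        index.setdefault(term, {})[doc_id] = tf
    return index


index = {}
indexDocument("<TITLE>wing flow</TITLE> flow over a wing wing", "tfc", "tfx", index, 1)
indexDocument("boundary layer flow", "tfc", "tfx", index, 2)
print(index["flow"])  # postings for "flow" across both documents
```

Storing raw counts keeps the index independent of the weighting scheme, so the same index can serve any scheme whose weights can be derived from term and document frequencies.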
b. Function that retrieves information from the index for a given query.
Name: retrieveDocuments; input: query (string); input: inverted index (your choice of data
structure); input: weighting scheme for documents (string); input: weighting scheme for query
(string); output: ids for relevant documents, along with similarity scores (dictionary)
Given a query and an inverted index, this function will:
● preprocess the query, i.e., removeSGML, tokenizeText, removeStopwords, stemWords, as you
did in Assignment 1
● determine the set of documents from the inverted index that include at least one token from the query
● calculate the cosine similarity between the query and each of the documents in this set,
using the given weighting schemes to calculate the document and the query term weights
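The retrieval steps above can be sketched for the tfc/tfx pairing as follows. Several choices here are assumptions rather than requirements: the index layout (term -> {document id: raw term frequency}), the natural-log idf log(N/df), the stand-in preprocess helper, and the decision to apply the cosine normalization in the final division instead of storing normalized document weights. Document lengths are recomputed on every call to keep the sketch self-contained; a real program would compute them once after indexing.

```python
import math
import re
from collections import Counter, defaultdict


def preprocess(text):
    """Stand-in for the Assignment 1 pipeline (tag stripping and lowercasing only)."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieveDocuments(query, index, doc_scheme, query_scheme):
    """Return {doc_id: cosine similarity} for documents sharing a term with the query.

    Only the tfc/tfx pairing is sketched; other scheme names raise ValueError.
    """
    if (doc_scheme, query_scheme) != ("tfc", "tfx"):
        raise ValueError("only the tfc/tfx pairing is sketched here")

    # Collection size and idf, derived from the posting lists.
    n_docs = len({d for postings in index.values() for d in postings})
    idf = {t: math.log(n_docs / len(p)) for t, p in index.items()}

    # Squared length of every document vector under tf*idf weights.
    sq_len = defaultdict(float)
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            sq_len[doc_id] += (tf * idf[term]) ** 2

    # Query weights: tf*idf, restricted to terms that occur in the index.
    q_counts = Counter(preprocess(query))
    q_weights = {t: tf * idf[t] for t, tf in q_counts.items() if t in index}
    q_len = math.sqrt(sum(w * w for w in q_weights.values()))

    # Dot products, accumulated only over documents that contain a query term.
    dots = defaultdict(float)
    for term, q_w in q_weights.items():
        for doc_id, tf in index[term].items():
            dots[doc_id] += q_w * tf * idf[term]

    scores = {}
    for doc_id, dot in dots.items():
        denom = math.sqrt(sq_len[doc_id]) * q_len
        scores[doc_id] = dot / denom if denom else 0.0
    return scores
```

Dividing by the query length does not change the ranking for a given query, since it is the same constant for every document; it is kept here so that the returned values are true cosines between 0 and 1.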
The main program should perform the following sequence of steps:
i. open the folder containing the data collection, provided as the third argument on the command
line (e.g., cranfieldDocs/), and read one file at a time from this folder.
ii. for each file, obtain the content of the file, and add it to the index with indexDocument
iii. if necessary for the term weighting schemes, calculate and store the length of each document
iv. open the file with queries, provided as the fourth argument on the command line (e.g.,
cranfield.queries), and read one query at a time from this file (each line is a query)
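The steps listed so far can be sketched as a small driver. This is an illustration under stated assumptions: the function name run is hypothetical, the indexing and retrieval functions are passed in as parameters only so that the sketch stays self-contained, the file name is used as the document id, and the position of a line in the query file is used as the query number. The document-length step is left to the retrieval function.

```python
import os


def run(doc_scheme, query_scheme, docs_folder, queries_file,
        index_document, retrieve_documents):
    """Index every file in docs_folder, then rank documents for each query line.

    index_document and retrieve_documents stand for indexDocument and
    retrieveDocuments. Returns a list of (query_number, ranked) pairs, where
    ranked is a list of (doc_id, score) sorted by decreasing score.
    """
    index = {}
    for name in sorted(os.listdir(docs_folder)):
        path = os.path.join(docs_folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            # Assumption: the file name serves as the document id.
            index_document(handle.read(), doc_scheme, query_scheme, index, name)

    results = []
    with open(queries_file, encoding="utf-8", errors="replace") as handle:
        # Assumption: the query number is the 1-based line number.
        for query_number, line in enumerate(handle, start=1):
            query = line.strip()
            if not query:
                continue
            scores = retrieve_documents(query, index, doc_scheme, query_scheme)
            ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
            results.append((query_number, ranked))
    return results
```

Returning the rankings as data, rather than printing them inside the loop, leaves the choice of output format to whatever the remaining steps require.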