Question 1: Load the data set (30 pts)
We will use a pre-processed version of the classic “[Cranfield collections](http://ir.dcs.gla.ac.uk/resources/test_collections/cran/)” as our data set for implementing a text retrieval system. The Cranfield collections contain 1,400 short documents from the aerodynamics field, along with a set of 225 queries and the corresponding relevance judgements (that is, what documents are deemed relevant to what queries by human experts). In what follows, we will first create three functions for loading the three basic components (documents, queries and relevance judgements), respectively.
Question 1a: Load the documents (10 pts)
Complete the function below to load all Cranfield documents. Each of the 1,400 documents is stored as a single text file under `assets/cranfield/cranfieldDocs`. While the general file-reading procedures certainly work here, if you examine a sample file you will probably notice that the format of the content bears close resemblance to that of an HTML/XML file, in that it contains multiple “fields” demarcated by pairs of “tags” like `<TEXT>` and `</TEXT>`. This allows the use of XML-parsing tools in Python, such as the built-in [`xml.etree.ElementTree`](https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree) or external packages like [`lxml`](https://lxml.de/), which may be more useful than reading the files line by line.
Regardless of the tools you will be using, we only consider anything in between `<TEXT>` and `</TEXT>` as the “content” of a document. We actually also need the identifier in between `<DOCNO>` and `</DOCNO>` to uniquely identify each document, but it is just a serial number from 1 to 1400. You can either grab that number or just keep a counter as you parse the documents in order. Once you grab the text in between `<TEXT>` and `</TEXT>` for each document, perform the following processing steps:
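As an illustration of the XML-parsing route, here is a minimal sketch using the built-in `xml.etree.ElementTree`. It assumes each file is well-formed XML with a single root element enclosing `<DOCNO>` and `<TEXT>`; the root tag `<DOC>`, the helper name `parse_document` and the sample string are hypothetical, so check a real file before relying on this. If a file turns out not to be well-formed, `ET.fromstring` will raise a `ParseError`, and reading line by line may be the safer option.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def parse_document(raw: str) -> Tuple[str, str]:
    """Return (document number, text content) from one document's raw markup.

    Assumes `raw` is well-formed XML containing <DOCNO> and <TEXT> elements.
    """
    root = ET.fromstring(raw)
    # ".//TAG" searches all descendants of the root for the first matching tag.
    docno = root.findtext(".//DOCNO", default="").strip()
    text = root.findtext(".//TEXT", default="").strip()
    return docno, text


# Hypothetical sample mimicking the tag layout described above.
sample = "<DOC>\n<DOCNO>\n1\n</DOCNO>\n<TEXT>\nexperimental investigation .\n</TEXT>\n</DOC>"
docno, text = parse_document(sample)
```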
* **Tokenise the text.** For no particular reason, just use `word_tokenize` from the `nltk` library to perform a quick tokenisation.
* **Remove stop words.** Words like “a”, “for”, “the”, etc. appear in almost every document and therefore do not help distinguish one document from another. We remove them to drastically reduce the size of the vocabulary we have to deal with. The `stopwords` utility from `nltk.corpus` can provide us with a list of stop words in English.
* **Stem each word.** Words sharing a common stem usually have related meanings, for example, “compute”, “computer” and “computation”. As far as understanding the main idea of a document is concerned (so that it can be accurately retrieved given a relevant query), it is arguably not very useful to distinguish such words. So we can further shrink our vocabulary by keeping only the word stems. We will use `PorterStemmer` from `nltk.stem.porter` for this purpose.
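The three steps above can be sketched as a single function. This is only one possible arrangement, not the required solution: the names `preprocess` and `nltk_preprocess` are hypothetical, and the stop-word comparison assumes the tokens are already lower-case. The core function takes the tokeniser, stop-word set and stemmer as arguments so it can be tried out without `nltk`; the second function wires in the `nltk` tools named above and will likely need the tokeniser and stop-word data to be downloaded first via `nltk.download`.

```python
from typing import Callable, Collection, List


def preprocess(
    text: str,
    tokenize: Callable[[str], List[str]],
    stop_words: Collection[str],
    stem: Callable[[str], str],
) -> List[str]:
    """Tokenise `text`, drop stop words, then stem the remaining tokens."""
    tokens = tokenize(text)
    # Punctuation tokens such as "." are kept; only stop words are removed.
    return [stem(tok) for tok in tokens if tok not in stop_words]


def nltk_preprocess(text: str) -> List[str]:
    """Same pipeline using the nltk tools named in the steps above."""
    # Imported lazily so the rest of this sketch runs without nltk installed.
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer

    # For many documents, build the stop-word set and stemmer once instead.
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return preprocess(text, word_tokenize, stop_words, stemmer.stem)


# Toy stand-ins (not nltk) to show the data flow only.
toy = preprocess(
    "the wings of a plane .",
    str.split,
    {"the", "of", "a"},
    lambda w: w.rstrip("s"),
)
```

With the toy stand-ins, `toy` is `['wing', 'plane', '.']`; the real `nltk` tools will generally produce different stems.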
Now that each document is a sequence of tokens, we store all documents in a `dict` indexed by the `doc_id`s, similar to the following:
    {
        'd1'   : ['experiment', 'studi', ..., 'configur', 'experi', '.'],
        'd2'   : ['studi', 'high-spe', ..., 'steadi', 'flow', '.'],
        ...
        'd1400': ['report', 'extens', ..., 'graphic', 'form', '.']
    }
where we associate each document with a `doc_id` created by prefixing the document number found in between `<DOCNO>` and `</DOCNO>` by the letter “d”. Even though in general we shouldn’t rely on the order of the keys in a `dict`, **in this case please make sure the `doc_id`s are ordered as shown above**. In Python 3.7 or higher, a `dict` preserves insertion order, so as long as you add each document to your `dict` in ascending order of the document numbers, you should be good. This obviates the need for sorting when we need all documents in a nested list in a later assignment.
**This function should return a `dict` of length 1400, where each key is a `str` representing a `doc_id` and each value is a `list` that contains the tokens resulting from processing a document as described above.**
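To illustrate the ordering requirement, here is a minimal sketch of assembling the final `dict`. It assumes the document numbers and raw texts have already been extracted as `(docno, text)` pairs; the helper name `build_doc_dict` and the `process` argument (standing in for the tokenise, stop-word and stemming steps) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def build_doc_dict(
    records: Iterable[Tuple[str, str]],
    process: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map 'd<number>' to processed tokens, inserted in ascending numeric order."""
    docs: Dict[str, List[str]] = {}
    # Sort numerically, not lexically, so that 'd10' comes after 'd9'.
    for docno, text in sorted(records, key=lambda rec: int(rec[0])):
        docs[f"d{int(docno)}"] = process(text)
    return docs


# Toy records given out of order; str.split stands in for the real processing.
docs = build_doc_dict([("10", "j k"), ("2", "b c"), ("1", "a")], str.split)
```

Because a `dict` preserves insertion order in Python 3.7 or higher, sorting once before insertion is enough to keep the keys in the required order.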