CAB431 – Assignment 2 Requirements
Weighting: 35% of the assessment for CAB431.
Items required to be submitted through CAB431 Blackboard:
1. A PDF or Word file that includes both:
• A cover page with a statement of completeness and your name(s) and student ID(s).
• Solutions to questions Q1, Q4 and Q7, and a README paragraph describing how to
execute your Python code in a terminal or in IDLE, the structure of your data
folder, and the packages you import.
2. Your source code for all other questions, containing all files (using a zip file “code.zip” to
put them together) necessary to run the solutions and perform the evaluation (source code
only, no executables); and
3. A zip file “result.zip” that contains all “result” data files (in text format).
Please note you do not need to include the dataset folder generated by “dataset101-150.zip” in
Due date of Blackboard Submission: Friday week 12 (29th May 2020)
Individual or pair work: You may work on this assignment individually or in a pair (please
note the different requirements for individuals and pairs).
Currently, a major challenge is to build communication between search engines and Web users.
Most search engines can only use queries rather than the users' actual information needs, because
it is difficult to acquire those needs automatically. The first reason for this is
that Web users may not know how to represent their topics of interest. The second reason is
that Web users may not wish to invest a great deal of effort in digging out relevant pages from
the hundreds of thousands of candidates provided by search engines.
In this assignment, you are expected to design a system, “IF Learning-Model”, to provide a
solution for this challenging issue. The system is broken up into three parts: Part I (Training Set
Discovery), Part II (IF model) and Part III (Evaluation). In Part I, the major task is to present an
approach in order to automatically discover a training set for a specified topic (we will provide
you 50 topics), which includes both positive documents (e.g., labelled as “1”) and negative
documents (e.g., labelled as “0”). You may need to use the topic title, description or narrative,
a pseudo-relevance feedback technique (or a clustering technique) and an IR model for this part to
find a training set D which includes both D+ (positive – likely relevant documents) and D-
(negative – likely irrelevant documents) in a given unlabelled document set U. Part II
selects more terms from D and discovers weights for them, then uses the selected terms and their
weights to rank the documents in U. Part III is the evaluation: you are required to prove that your
solution is better than the query-based method (“the baseline model”), which uses only the topic
titles to rank U.
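The three parts described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the required solution: all function names are hypothetical, the baseline is a plain title-term frequency count rather than a full IR model, the top-k cut-off used to split D+ from D- is an arbitrary choice, and the term weight (share of D+ documents containing a term minus the share of D- documents containing it) is one simple option among many.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphabetic tokens only (assumption: no stemming or stop-word removal).
    return re.findall(r"[a-z]+", text.lower())


def baseline_scores(title, docs):
    # Baseline model: score each document in U by how often the title terms occur in it.
    query = set(tokenize(title))
    return {
        doc_id: sum(tf for term, tf in Counter(tokenize(text)).items() if term in query)
        for doc_id, text in docs.items()
    }


def discover_training_set(scores, top_k):
    # Part I (pseudo-relevance feedback): the top_k ranked documents become D+, the rest D-.
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return set(ranked[:top_k]), set(ranked[top_k:])


def learn_term_weights(docs, d_pos, d_neg):
    # Part II: weight = share of D+ documents containing the term
    # minus share of D- documents containing it; keep positive weights only.
    weights = Counter()
    for doc_id, text in docs.items():
        terms = set(tokenize(text))
        if doc_id in d_pos:
            for term in terms:
                weights[term] += 1 / len(d_pos)
        elif doc_id in d_neg:
            for term in terms:
                weights[term] -= 1 / len(d_neg)
    return {term: w for term, w in weights.items() if w > 0}


def rank_documents(docs, weights):
    # Part II (continued): rank U by the summed weights of the terms in each document.
    scores = {
        doc_id: sum(weights.get(term, 0.0) for term in tokenize(text))
        for doc_id, text in docs.items()
    }
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranking, relevant):
    # Part III: average precision of a ranking against a set of known relevant documents.
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0
```

To compare the two models for a topic, one would compute `average_precision` once for the ranking induced by `baseline_scores` and once for the output of `rank_documents`, using the same relevance judgements in both cases.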
Example of topic101 – “economic espionage (EE)” is described as follows: