Assignment 2 for CS 6120 (Natural Language Processing)
2. Take all the articles in your training data that belong to the category “earn” and treat
them as belonging to the class “earn”.
3. Take all the documents in your training data that do not belong to the category “earn”
and treat them as belonging to the class “not_earn”. Note that for this step, you cannot
assume that because a file belongs to a different category like “acq” it automatically
belongs to the class “not_earn”. There are some articles like “0007186” that belong to
both the categories “earn” and “acq”. You should not include these articles in the class
“not_earn”.
4. Build your Naive Bayes classifier.
5. Submit your code.
Note that if you choose to implement Naïve Bayes from scratch, i.e., without using a library, you
can use add-1 smoothing to deal with zeroes.
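As a rough illustration of the from-scratch route, the sketch below shows one way multinomial Naive Bayes with add-1 smoothing could look. It assumes documents are already tokenized into lists of words. The names train_nb and classify_nb and the toy documents are made up for illustration and are not part of the assignment. Words never seen in training are simply skipped here, which is one possible choice among several.

```python
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns the model as a dict."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_docs": class_docs, "word_counts": word_counts,
            "vocab": vocab, "n_docs": len(docs)}

def classify_nb(model, tokens):
    """Return the label with the highest log posterior for the tokens."""
    v = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n in model["class_docs"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n / model["n_docs"])  # log prior
        for w in tokens:
            if w in model["vocab"]:  # assumption: skip words unseen in training
                # add-1 smoothing: (count + 1) / (total + |V|)
                score += math.log((counts[w] + 1) / (total + v))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for illustration only.
toy_docs = [
    (["profit", "dividend", "quarter"], "earn"),
    (["net", "profit", "rose"], "earn"),
    (["merger", "bid", "shares"], "not_earn"),
]
model = train_nb(toy_docs)
print(classify_nb(model, ["profit", "quarter"]))  # earn
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.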
1.2 Implement the evaluation metrics [5 points]
1. Write code to compute precision, recall, and F1. You are not allowed to use any pre-built
functions to compute these metrics. Your code should use two input files:
a. The first file containing the filenames of all articles that are marked as belonging
to the class “earn” by a human. This is the gold-set file.
b. The second file containing the filenames of all articles that are marked as
belonging to the class “earn” by your classifier. This is the system output file.
c. Use these two files to compute precision, recall, and F1.
2. Run the classifier you built in Section 1.1 on all articles in your training set.
3. Report your precision, recall, and F1 on your training set.
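The metric computation can be sketched as below, without any pre-built metric functions. This sketch assumes each input file lists one filename per line, which the assignment does not specify. The function names and the toy filenames are hypothetical.

```python
def read_filenames(path):
    """Read one filename per line into a set, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def precision_recall_f1(gold, system):
    """gold, system: sets of filenames labeled "earn"."""
    true_pos = len(gold & system)  # marked "earn" by both human and classifier
    precision = true_pos / len(system) if system else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy sets, invented for illustration. With real files one might call
# read_filenames("gold.txt") and read_filenames("system.txt") instead
# (both file names are placeholders).
gold = {"a", "b", "c", "d"}
system = {"a", "b", "x"}
p, r, f1 = precision_recall_f1(gold, system)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.5 0.571
```

The guards against empty sets and a zero denominator return 0.0 rather than raising an error. That is a convention chosen for this sketch, not something the assignment prescribes.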
2. Part-of-Speech (POS) Tagging using Hidden Markov Models (HMM) [35 points]
Your goal is to implement a POS tagger using a bigram HMM. Given an observed sequence of n
words w_1, w_2, …, w_n, choose the most probable sequence of POS tags t_1, t_2, …, t_n. Note that during
training, you might want to choose a frequency threshold (e.g., 5) and treat any words that have
a frequency below this threshold as unknown. You can choose this
frequency threshold based on your own observations.
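One possible way to apply such a threshold is sketched below. It assumes the training data has been parsed into sentences of (word, tag) pairs. The token name "<UNK>" and the function name replace_rare_words are assumptions made for this example.

```python
from collections import Counter

UNK = "<UNK>"  # hypothetical name for the unknown-word token

def replace_rare_words(sentences, threshold=5):
    """sentences: list of lists of (word, tag) pairs.
    Words whose training frequency is below `threshold` become UNK."""
    freq = Counter(word for sent in sentences for word, _ in sent)
    return [[(word if freq[word] >= threshold else UNK, tag)
             for word, tag in sent]
            for sent in sentences]

# Toy corpus, invented for illustration; a threshold of 2 keeps it small.
toy_sents = [
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("cat", "NN")],
]
print(replace_rare_words(toy_sents, threshold=2))
# [[('the', 'DT'), ('<UNK>', 'NN')], [('the', 'DT'), ('<UNK>', 'NN')]]
```

At tagging time, any word absent from the training vocabulary would be mapped to the same token, so that it has a defined emission probability.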
2.1 Frequency Counts [5 points]
Obtain frequency counts from the training set. You will need three types of frequency counts:
1. Tag-word pair counts – let’s denote this as C(t_i, w_i).
2. Tag unigram counts – let’s denote this as C(t_i).
3. Tag bigram counts – let’s denote this as C(t_{i-1}, t_i).
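The three count tables could be collected in a single pass, as in the sketch below. It assumes sentences are lists of (word, tag) pairs and wraps each tag sequence in a start and an end token. The token names "<s>" and "</s>" and the function name are assumptions of this example.

```python
from collections import Counter

START, END = "<s>", "</s>"  # hypothetical start and end token names

def count_frequencies(sentences):
    """sentences: list of lists of (word, tag) pairs.
    Returns (tag_word, tag_unigram, tag_bigram) Counters."""
    tag_word = Counter()     # C(t_i, w_i), keyed by (tag, word)
    tag_unigram = Counter()  # C(t_i)
    tag_bigram = Counter()   # C(t_{i-1}, t_i), keyed by (prev_tag, tag)
    for sent in sentences:
        tag_word.update((tag, word) for word, tag in sent)
        tags = [START] + [tag for _, tag in sent] + [END]
        tag_unigram.update(tags)
        tag_bigram.update(zip(tags, tags[1:]))
    return tag_word, tag_unigram, tag_bigram

# Toy corpus, invented for illustration.
toy_corpus = [
    [("the", "DT"), ("dog", "NN")],
    [("the", "DT"), ("cat", "NN")],
]
tag_word, tag_unigram, tag_bigram = count_frequencies(toy_corpus)
print(tag_bigram[("DT", "NN")])  # 2
```

Counting the start token as a unigram keeps the denominator available when the first tag of a sentence is conditioned on the start token.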
Note that to obtain the tag unigram and the tag bigram counts, you will need to separate out
each tag from its word. Also, for each sentence found in the training data, add a start token and
an end token to the beginning and the end of the sentence. Report frequency counts in different
2.2 Transition Probabilities [5 points]
Transition probability is the probability of a tag given its previous tag. Calculate transition
probabilities of the training set using the following equation:
P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})
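A minimal sketch of this calculation, dividing each tag bigram count by the count of its first tag. The function name and the toy counts are made up for illustration. No smoothing is applied, so unseen tag bigrams simply have no entry.

```python
from collections import Counter

def transition_probabilities(tag_unigram, tag_bigram):
    """Return {(prev_tag, tag): P(tag | prev_tag)} from raw counts."""
    return {(prev, cur): count / tag_unigram[prev]
            for (prev, cur), count in tag_bigram.items()}

# Toy counts for the tag sequences <s> DT NN </s> and <s> DT NN NN </s>
# (invented for illustration).
unigram_counts = Counter({"<s>": 2, "DT": 2, "NN": 3, "</s>": 2})
bigram_counts = Counter({("<s>", "DT"): 2, ("DT", "NN"): 2,
                         ("NN", "NN"): 1, ("NN", "</s>"): 2})
trans = transition_probabilities(unigram_counts, bigram_counts)
print(trans[("DT", "NN")])  # 1.0
```

For a fixed previous tag, the probabilities over all following tags sum to 1. This gives a quick sanity check on real counts.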
2.3 Emission Probabilities [5 points]
An emission probability is the probability of a given word being associated with a given tag.
Calculate emission probabilities of the training set using the following:
P(w_i | t_i) = C(t_i, w_i) / C(t_i)
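The emission estimate can be sketched in the same style. This assumes the denominator is the tag unigram count C(t_i), the usual maximum-likelihood estimate. The function name and the toy counts are hypothetical.

```python
from collections import Counter

def emission_probabilities(tag_unigram, tag_word):
    """Return {(tag, word): P(word | tag)} from raw counts."""
    return {(tag, word): count / tag_unigram[tag]
            for (tag, word), count in tag_word.items()}

# Toy counts, invented for illustration.
tag_counts = Counter({"DT": 2, "NN": 3})
pair_counts = Counter({("DT", "the"): 2, ("NN", "dog"): 2, ("NN", "cat"): 1})
emit = emission_probabilities(tag_counts, pair_counts)
print(emit[("DT", "the")])  # 1.0
```

As with the transition table, the probabilities for a single tag should sum to 1 across all words observed with that tag.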