1 Project Description
The aim of this project is to implement a machine learning model based on Naive Bayes for a
sentiment analysis task using the Rotten Tomatoes movie review dataset. Obstacles like sen
tence negation, sarcasm, terseness, language ambiguity, and many others make this task very
3 Data Description
The dataset is a corpus of movie reviews originally collected by Pang and Lee. This dataset
contains tab-separated files with phrases from the Rotten Tomatoes dataset. The data are split
into train/dev/test sets and the sentences are shuffled from their original order.
• Each sentence has a SentenceId.
• They all have been tokenized already.
The training, dev and test set contain respectively 7529, 1000 and 3310 sentences. The sentences
are labelled on a scale of five values:
1. somewhat negative
3. somewhat positive
In the following table you can find several sentences and their sentiment score.
Systems are evaluated according to macro-F1 score, i.e. the mean of the class-wise F1-scores:
where N is the number of classes. F1–score is calculated for each class i:
5 Project Roadmap
1. Implement some preprocessing steps:
• You are free to add any preprocessing step (e.g. lowercasing) before training your
models. Explain what you did in your report.
• Implement a function to map the 5-value sentiment scale to a 3-value sentiment scale.
Namely, the labels “negative” (value 0) and “somewhat negative” (value 1) are merged
into label “negative” (value 0). “Neutral” (value 2) will be mapped to “neutral” (value
1). And finally, “somewhat positive” (value 3) and “positive” (value 4) will be mapped
to the label “positive” (value 2).
本网站支持 Alipay WeChatPay PayPal等支付方式
E-mail: firstname.lastname@example.org 微信号:vipnxx