In this assignment, you will develop and critically analyse models for predicting the sentiment of Tweets. That is, given a Tweet, your model(s) will predict whether the sentiment of the message is positive or negative. You will be provided with a data set of Tweets that have been labelled with their sentiment. In addition, each Tweet is labelled with the variety of English in which it is written: African American English or Standard American English. This additional information allows you to investigate whether your models work equally well across different social groups of English speakers. The assessment provides you with an opportunity to reflect on concepts in machine learning in the context of an open-ended research problem, and to strengthen your skills in data analysis and problem solving.
The goal of this assignment is to critically assess the effectiveness of various machine learning algorithms on the problem of determining a Tweet's sentiment, and to express the knowledge that you have gained in a technical report. The technical side of this project will involve applying appropriate machine learning algorithms to the data to solve the task. There will be a Kaggle in-class competition where you can compare the performance of your algorithms against those of your classmates.
The focus of the project will be the report, formatted as a short research paper. In the report, you will demonstrate the knowledge that you have gained, in a manner that is accessible to a reasonably informed reader.
Stage I: Model development and testing and report writing (due May 13):
- One or more programs, written in Python, including all the code necessary to reproduce the results in your report (including model implementation, label prediction, and evaluation). You should also include a README file that briefly details your implementation. Submitted through Canvas.
- An anonymous written report, of 2000 words (±10%) excluding reference list. Your name and student ID should not appear anywhere in the report, including the metadata (filename, etc.). Submitted through Canvas/Turnitin.
- Predictions for the test set of Tweets submitted to the Kaggle in-class competition described in Sec 6.
Stage II: Peer reviews (due May 18th):
- Reviews of two reports written by your classmates, of 200-400 words each. Submitted through Canvas.
3 Data Sets
You will be provided with a training set of Tweets, labelled with a sentiment (target label) and English variety (demographic label); a development set with the same labels, which you should use for model selection and tuning; a test set with demographic labels but no target labels, which will be used for final evaluation in the Kaggle in-class competition; and an unlabelled data set providing additional Tweets with no labels at all, which you may use for semi-supervised or unsupervised learning approaches.
All data sets are provided as pickled Pandas DataFrames. Each row in the DataFrame corresponds to one instance. It contains the Tweet content, its target sentiment label (train and dev only) and its demographic label (train, dev and test only).
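As a minimal sketch of working with this format, the snippet below writes a toy DataFrame to a pickle file and reads it back with `pd.read_pickle`. The file name `train.pkl`, the column names `tweet`, `sentiment` and `demographic`, and the label strings are assumptions made for illustration only; check the provided files for the actual names and values.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the provided data. Column names and label strings are
# hypothetical; the real DataFrames may use different ones.
toy = pd.DataFrame(
    {
        "tweet": ["love this", "worst day ever", "so happy today"],
        "sentiment": ["positive", "negative", "positive"],
        "demographic": ["SAE", "AAE", "AAE"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.pkl"  # hypothetical file name
    toy.to_pickle(path)             # write a pickled DataFrame
    train = pd.read_pickle(path)    # load it the same way you would load the real file

# Quick sanity checks worth running on any newly loaded split.
print(train.shape)
print(train.columns.tolist())
print(train["sentiment"].value_counts())
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.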
Target labels are the labels that your model should predict (y). In the provided data set, each Tweet is labelled with one of two possible sentiment values, positive or negative.
Demographic labels provide additional meta information about the Tweet. They should only be used to evaluate models on specific subgroups of Twitter users, but not be predicted (and probably not used as features, although you can discuss this in your report). In the provided data set, each Tweet is labelled with one of two possible demographic labels indicating the language variety of the Tweet:
- AAE (African American English)
- SAE (Standard American English)
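One way to evaluate a model on these subgroups is to compute accuracy separately for each demographic label. The sketch below is only an illustration: the helper `accuracy_by_group`, the toy predictions and the sentiment strings `"positive"` and `"negative"` are all hypothetical, and accuracy is just one of several metrics you might compare across groups.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for three parallel sequences."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Toy gold labels, predictions and demographic labels (made up for illustration).
y_true = ["positive", "negative", "positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "negative", "positive"]
groups = ["SAE", "SAE", "SAE", "AAE", "AAE", "AAE"]

scores = accuracy_by_group(y_true, y_pred, groups)
for group, acc in sorted(scores.items()):
    print(f"{group}: {acc:.2f}")

# The gap between groups is one simple indicator of unequal performance.
gap = abs(scores["SAE"] - scores["AAE"])
print(f"accuracy gap: {gap:.2f}")
```

With real data, the three sequences could be taken from the development set's target column, your model's predictions, and the demographic column respectively.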