HW1 Amazon Review Classification
The objectives of this assignment are the following:
1. Implement the Nearest Neighbor Classification Algorithm
2. Handle Text Data (Reviews of Amazon Baby Products)
3. Design and Engineer Features from Text Data
4. Choose the Best Model, i.e., Select the Nearest Neighbor Parameters, Features, and Similarity Functions
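As an illustration of objective 3, the sketch below builds the three bag-of-words weightings this assignment asks you to compare (binary, raw frequency count, TF*IDF). The names tokenize and bag_of_words are hypothetical, the whitespace tokenizer is a deliberately naive assumption, and log(N / df) is only one common IDF variant; treat it as a starting point rather than a required design.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, then split on whitespace.
    return text.lower().split()


def bag_of_words(docs, weighting="count"):
    """Return one sparse {term: weight} dict per document.

    weighting is one of "binary", "count", or "tfidf".
    """
    counts = [Counter(tokenize(d)) for d in docs]
    if weighting == "binary":
        return [{t: 1.0 for t in c} for c in counts]
    if weighting == "count":
        return [{t: float(n) for t, n in c.items()} for c in counts]
    if weighting == "tfidf":
        n_docs = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(t for c in counts for t in c)
        # Assumed IDF variant: log(N / df); other smoothed forms exist.
        idf = {t: math.log(n_docs / df[t]) for t in df}
        return [{t: n * idf[t] for t, n in c.items()} for c in counts]
    raise ValueError(f"unknown weighting: {weighting}")
```

Note that this computes IDF from the documents passed in; when vectorizing test reviews you would normally reuse the IDF values learned from the training reviews.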
A practical task in e-commerce is to infer sentiment (or polarity) from free-form review text submitted for a range of products.
For the purposes of this assignment, you have to implement a k-Nearest Neighbor classifier to predict the sentiment of 18506 reviews of baby products provided in the test file (test.dat). Positive sentiment is represented by a review rating of +1 and negative sentiment by a review rating of -1.
In test.dat you are provided only the reviews, not the ground truth ratings; those are held back and used to evaluate your predictions.
The training data also consists of 18506 reviews and is in the file train_file.dat. Each row begins with the sentiment score, followed by the text of the review.
For evaluation purposes (leaderboard ranking), we will use the accuracy metric, comparing the predictions you submit on the test set with the ground truth.
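To make the task concrete, here is a minimal sketch of a k-Nearest Neighbor vote over sparse review vectors. It assumes each review has already been turned into a {term: weight} dict, uses cosine similarity as one possible similarity function, and breaks tied votes toward +1; the function names and the tie rule are illustrative assumptions, not requirements of the assignment.

```python
import math


def cosine_similarity(a, b):
    # a and b are sparse vectors stored as {term: weight} dicts.
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty review is treated as similar to nothing
    return dot / (norm_a * norm_b)


def knn_predict(train_vecs, train_labels, query_vec, k=5):
    # Score every training review against the query (brute force).
    scored = [
        (cosine_similarity(query_vec, vec), label)
        for vec, label in zip(train_vecs, train_labels)
    ]
    # Keep the k most similar reviews.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Labels are +1 / -1, so the sign of the sum is the majority vote.
    vote = sum(label for _, label in scored[:k])
    return 1 if vote >= 0 else -1  # assumed tie-break: +1
```

Choosing an odd k avoids ties between the two classes. The brute-force scan compares the query with every training review, which is simple but slow at this data size.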
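A training row of this shape can be parsed as sketched below. The exact delimiter between the score and the review text is not stated here, so the code assumes the two are separated by whitespace (space or tab); the UTF-8 encoding and the function names are likewise assumptions to adjust after inspecting the file.

```python
def parse_training_line(line):
    # Assumed layout: "<score><whitespace><review text>", e.g. "+1\tGreat crib".
    parts = line.strip().split(None, 1)
    label = int(parts[0])  # int() accepts both "+1" and "-1"
    text = parts[1] if len(parts) > 1 else ""  # tolerate an empty review
    return label, text


def load_training_file(path, encoding="utf-8"):
    # Returns two parallel lists: labels (+1 / -1) and review texts.
    labels, texts = [], []
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            label, text = parse_training_line(line)
            labels.append(label)
            texts.append(text)
    return labels, texts
```

A call such as load_training_file("train_file.dat") would then give the labels and texts as parallel lists.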
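Accuracy here is the fraction of reviews whose predicted label matches the true label. A small sketch (the function name is hypothetical) that you can use on held-out training data, since the test labels are not available to you:

```python
def accuracy(predicted, actual):
    # Fraction of positions where the predicted label equals the true label.
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)
```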
Some things to note:
• The public leaderboard shows results for only 50% of the test instances, chosen at random. This is standard practice in data mining challenges to avoid gaming of the system.
• The private leaderboard, released after the deadline, evaluates all the entries in the test set.
• In a 24-hour cycle you are allowed to submit a prediction file 5 times only.
• The final ranking will always be based on the last submission.
• format.dat shows an example file containing 18506 rows alternating between +1 and -1. Your prediction file should follow the same format as format.dat, with the same number of rows (18506), but containing the sentiment scores generated by your model.
• This is an individual assignment. Discussion of broad-level strategies is allowed, but any copying of prediction files or source code will result in an honor code violation.
• Feel free to use the programming language of your choice for this assignment.
• While you can use libraries and templates for dealing with text data, you are required to implement your own nearest neighbor classifier.
• Cross validation is required to determine the best parameter choices. This includes: the number of neighbors (k), the bag-of-words representation (binary vs. raw frequency count vs. TF*IDF), and distance/similarity measure (at least two). In your study, also include experiments using different features (e.g. full feature set vs. reduced feature sets using various dimensionality reduction or feature selection techniques).
• The TA should be able to run your code and reproduce the accuracy of your Miner submission. If for some reason the same accuracy cannot be reproduced, you will be asked to explain the discrepancy. So if you do any random sampling in your algorithm, make sure you save the samples in a file (and mention it in your report) so we can recreate the results.
• Valid Submissions to the Miner2.vsnet.gmu.edu website
• Blackboard Submission of Source Code and Report:
• Create a folder called HW1_LastName
• Create a subfolder called src and put all the source code there.
• Create a subfolder called Report and place a 2-3-page, single-spaced report describing details regarding the steps you followed for developing the classifier for predicting the product review sentiments. Be sure to include the following in the report:
1. User Name Registered on miner website.
2. Rank & Accuracy score for your submission (at the time of writing the report).
3. Instructions on how to run your program.
4. Your Approach. Describe your parameter choices and how you chose your best parameters and features via cross validation (see above). Report the results of all experiments. Use tables or plots to make your presentation of results clearer.
5. Efficiency of your algorithm in terms of run time. Did you do anything to improve the run time (e.g. dimensionality reduction)? If so, describe what you did and report the run times and corresponding accuracy before and after the improvement.
• Archive your parent folder (.zip or .tar.gz) and submit it via Blackboard for HW1.
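For the cross validation required above, one possible setup is plain k-fold validation with a fixed random seed, so that the folds (and therefore the reported accuracy) can be recreated. In this sketch predict_fn is a hypothetical callable wrapping your own classifier with the signature predict_fn(train_items, train_labels, query_item); the fold count and seed are arbitrary assumptions.

```python
import random


def k_fold_indices(n, n_folds=5, seed=0):
    # Shuffle indices with a fixed seed so the same folds can be recreated.
    if n < n_folds:
        raise ValueError("need at least one item per fold")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th shuffled index goes to the same fold.
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validate(items, labels, predict_fn, n_folds=5, seed=0):
    # Returns the mean held-out accuracy over the folds.
    scores = []
    for held_out in k_fold_indices(len(items), n_folds, seed):
        held = set(held_out)
        train_items = [x for i, x in enumerate(items) if i not in held]
        train_labels = [y for i, y in enumerate(labels) if i not in held]
        correct = sum(
            1
            for i in held_out
            if predict_fn(train_items, train_labels, items[i]) == labels[i]
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)
```

Running cross_validate once per candidate setting (each value of k, each representation, each similarity measure) gives the table of results the report asks for. Keeping the seed fixed, or saving the fold indices to a file, lets the TA rebuild the same splits.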
Grading for the Assignment will be split on your implementation (50%), report (20%) and ranking results (30%).