
Python Data Mining | CSE 158/258, Fall 2021: Homework 1

This is a homework assignment on data mining and predictive analytics in Python.

Instructions

Please submit your solution by the beginning of the week 3 lecture (Oct 11). Submissions should be
made on gradescope. Please complete homework individually.

This specification includes both questions from the undergraduate (CSE158) and graduate (CSE258) classes.
You are welcome to attempt questions from both classes but will only be graded on those for the class in which
you are enrolled.

You will need the following files:

GoodReads Fantasy Reviews: https://cseweb.ucsd.edu/classes/fa21/cse258-b/data/fantasy_10000.json.gz

Beer Reviews: https://cseweb.ucsd.edu/classes/fa21/cse258-b/data/beer_50000.json

The above is a JSON-formatted dataset. Data can be read using the json.loads function in Python, or by using eval.

Code examples: http://cseweb.ucsd.edu/classes/fa21/cse258-b/code/week1.py (regression) and
http://cseweb.ucsd.edu/classes/fa21/cse258-b/code/week2.py (classification)

Executing the code requires a working install of Python 2.7 or Python 3 with the scipy packages installed.

Please include the code of (the important parts of) your solutions.
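
As a rough illustration of reading the data files listed above, the following sketch parses one JSON object per line using json.loads. The helper name read_json_lines and the one-object-per-line layout are assumptions, not part of the assignment; if a line is not strict JSON, the specification mentions eval as an alternative.

```python
import gzip
import json


def read_json_lines(path):
    """Parse one JSON object per line from a plain or gzip-compressed file.

    Assumption: the file stores one record per line; blank lines are skipped.
    """
    opener = gzip.open if path.endswith(".gz") else open
    records = []
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Hypothetical usage, once the file has been downloaded locally:
# reviews = read_json_lines("fantasy_10000.json.gz")
```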

Tasks: Regression (week 1)

First, using the book review data, let’s see whether ratings can be predicted as a function of review length, or
by using temporal features associated with a review.

1. (CSE158 only) What is the distribution of ratings and review lengths in the dataset? Report the
number of 1-, 2-, 3-star (etc.) ratings, and show the relationship with length (e.g. via a scatterplot) (1
mark).
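
Counting how often each rating occurs could be sketched as below. The function name rating_histogram and the toy ratings are illustrative assumptions; the real ratings would come from the review records.

```python
from collections import Counter


def rating_histogram(ratings):
    """Return a dict mapping each rating value to its count, sorted by rating."""
    return dict(sorted(Counter(ratings).items()))


# Toy ratings, not taken from the dataset.
toy_ratings = [5, 4, 5, 3, 1, 4, 5]
histogram = rating_histogram(toy_ratings)
```

The relationship with length can then be plotted with any plotting library, for example by pairing each review length with its rating.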

2. Train a simple predictor that estimates rating from review length, i.e.,
$\text{star rating} \simeq \theta_0 + \theta_1 \times [\text{review length in characters}]$
Report the values $\theta_0$ and $\theta_1$, and the Mean Squared Error of your predictor (on the entire dataset) (1
mark).
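
One possible way to fit such a predictor is ordinary least squares on a design matrix with a constant column and a length column. This is a minimal sketch on toy data, assuming numpy is available; the function name fit_length_model is hypothetical and the toy numbers are not from the dataset.

```python
import numpy as np


def fit_length_model(lengths, ratings):
    """Fit rating ~ theta_0 + theta_1 * length by least squares.

    Returns the parameter vector [theta_0, theta_1] and the MSE on the
    same data that was used for fitting.
    """
    X = np.column_stack([np.ones(len(lengths)), lengths])
    y = np.asarray(ratings, dtype=float)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    mse = float(np.mean((X @ theta - y) ** 2))
    return theta, mse


# Toy data lying exactly on the line rating = 0.1 * length.
toy_lengths = [10, 20, 30, 40]
toy_ratings = [1.0, 2.0, 3.0, 4.0]
theta, mse = fit_length_model(toy_lengths, toy_ratings)
```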

3. Extend your model to include (in addition to the length) features based on the time of the review. You
can parse the time data as follows:

import dateutil.parser
t = dateutil.parser.parse(d['date_added'])
t.weekday(), t.year  # etc.
Using a one-hot encoding for the weekday and year, write down feature vectors for the first two examples
(1 mark).
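
A one-hot feature vector of this kind might be built as follows. This is a sketch under stated assumptions: it takes an already-parsed datetime (such as the one returned by dateutil.parser.parse), includes a constant offset term, and drops the first category of each encoding so the columns are not redundant with the offset. The function names, the list of years, and the toy values are all hypothetical.

```python
from datetime import datetime


def one_hot(value, categories):
    """One-hot encode value, dropping the first category.

    The dropped category is represented by all zeros (a design choice made
    here to avoid redundancy with the constant offset feature).
    """
    return [1 if value == c else 0 for c in categories[1:]]


def time_features(t, length, years):
    """Build [offset, length, weekday one-hot, year one-hot] for one review."""
    weekdays = list(range(7))  # datetime.weekday(): Monday is 0, Sunday is 6
    return [1, length] + one_hot(t.weekday(), weekdays) + one_hot(t.year, years)


# Toy inputs: the years and lengths are illustrative, not from the dataset.
toy_years = [2016, 2017]
sunday_2017 = time_features(datetime(2017, 7, 30), 120, toy_years)
monday_2016 = time_features(datetime(2016, 1, 4), 50, toy_years)
```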

4. Train models that

• use the weekday and year values directly as features, i.e.,
$\text{star rating} \simeq \theta_0 + \theta_1 \times [\text{review length in characters}] + \theta_2 \times [\texttt{t.weekday()}] + \theta_3 \times [\texttt{t.year}]$

• use the one-hot encoding from Question 3.
Report the MSE of each (1 mark).

5. Repeat the above question, but this time split the data into a training and test set. You should split
the data randomly into 50%/50% train/test fractions. Report the MSE of each model separately on the
training and test sets.
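
A random 50%/50% split with separate train and test MSE could look like the sketch below. It assumes numpy, a feature matrix that already contains the constant column, and a fixed seed for reproducibility; the function name split_and_evaluate and the toy data are hypothetical.

```python
import numpy as np


def split_and_evaluate(X, y, seed=0):
    """Shuffle the rows, fit least squares on the first half, and report
    (train MSE, test MSE) computed with the same fitted parameters."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    half = len(y) // 2
    train, test = order[:half], order[half:]
    theta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

    def mse(idx):
        return float(np.mean((X[idx] @ theta - y[idx]) ** 2))

    return mse(train), mse(test)


# Toy data lying exactly on y = 2 + 0.5 * x, so both errors are near zero.
toy_X = [[1, i] for i in range(20)]
toy_y = [2 + 0.5 * i for i in range(20)]
train_mse, test_mse = split_and_evaluate(toy_X, toy_y)
```

On real data the test MSE will generally differ from the training MSE, which is the comparison the question asks for.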

6. (CSE258 only) Show that for a trivial predictor, i.e., $y = \theta_0$, the best possible value of $\theta_0$ in terms of
the Mean Absolute Error is the median of the label $y$. Hint: compute the derivative of the model's MAE
and solve for $\theta_0$.
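
Before attempting the proof, the claim can be sanity-checked numerically. The sketch below is not a proof: it evaluates the MAE of a constant predictor over a grid of candidate values on toy labels and compares the best candidate with the median. The labels and the grid are illustrative assumptions.

```python
import statistics


def mae(theta0, labels):
    """Mean Absolute Error of the constant predictor y = theta0."""
    return sum(abs(y - theta0) for y in labels) / len(labels)


# Toy labels with an odd count, so the median is a single value.
toy_labels = [1, 2, 2, 3, 5, 5, 5]
median = statistics.median(toy_labels)

# Candidate values 0.0, 0.1, ..., 6.0 for the constant predictor.
candidates = [i / 10 for i in range(61)]
best = min(candidates, key=lambda c: mae(c, toy_labels))
```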

