Please submit your solution by the beginning of the week 3 lecture (Oct 11). Submissions should be
made on Gradescope. Please complete the homework individually.
This specification includes questions from both the undergraduate (CSE158) and graduate (CSE258) classes.
You are welcome to attempt questions from both classes, but you will only be graded on those for the class in
which you are enrolled.
You will need the following files:
GoodReads Fantasy Reviews :
Beer Reviews : https://cseweb.ucsd.edu/classes/fa21/cse258-b/data/beer_50000.json
The above is a JSON-formatted dataset. Data can be read using the json.loads function in Python, or by using eval.
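As a minimal sketch of the two reading options mentioned above, the helper below parses an iterable of text lines. It assumes one record per line, which is not stated in this specification, and it swaps in `ast.literal_eval` as a safer stand-in for `eval` on lines written as Python dict literals. The name `parse_lines` and the sample records are hypothetical.

```python
import ast
import json


def parse_lines(lines, use_json=True):
    """Parse an iterable of text lines, assuming one record per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if use_json:
            records.append(json.loads(line))
        else:
            # Assumption: the line is a Python dict literal; literal_eval
            # accepts only literals, unlike eval, which runs arbitrary code.
            records.append(ast.literal_eval(line))
    return records


# Hypothetical in-memory samples standing in for a downloaded file
json_sample = ['{"rating": 4, "review_text": "Great book"}']
dict_sample = ["{'rating': 2, 'review_text': 'Meh'}"]

json_records = parse_lines(json_sample)
dict_records = parse_lines(dict_sample, use_json=False)
```

With a real file, the same function could be called on an open file handle, e.g. `parse_lines(open(path))`, where `path` is wherever you saved the data.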
Code examples : http://cseweb.ucsd.edu/classes/fa21/cse258-b/code/week1.py (regression) and http:
Executing the code requires a working install of Python 2.7 or Python 3 with the scipy packages installed.
Please include the code of (the important parts of) your solutions.
Tasks | Regression (week 1):
First, using the book review data, let’s see whether ratings can be predicted as a function of review length, or
by using temporal features associated with a review.
1. (CSE158 only) What is the distribution of ratings and review lengths in the dataset? Report the
number of 1-, 2-, 3-star (etc.) ratings, and show the relationship with length (e.g. via a scatterplot) (1 mark).
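One possible way to tabulate the ratings and collect review lengths is sketched below. The field names `rating` and `review_text` are assumptions about the dataset and should be checked against the actual records; the toy reviews are made up for illustration.

```python
from collections import Counter


def rating_distribution(reviews, rating_key="rating"):
    """Count how many reviews carry each star rating."""
    return Counter(d[rating_key] for d in reviews)


def review_lengths(reviews, text_key="review_text"):
    """Length of each review in characters."""
    return [len(d[text_key]) for d in reviews]


# Hypothetical toy records; the key names are assumptions
toy = [
    {"rating": 5, "review_text": "Loved it"},
    {"rating": 5, "review_text": "Brilliant from start to finish"},
    {"rating": 3, "review_text": "Fine"},
]

counts = rating_distribution(toy)
lengths = review_lengths(toy)

# A scatterplot could then be drawn with matplotlib, if installed:
#   import matplotlib.pyplot as plt
#   plt.scatter(lengths, [d["rating"] for d in toy])
```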
2. Train a simple predictor that estimates rating from review length, i.e.,
$\text{star rating} \simeq \theta_0 + \theta_1 \times [\text{review length in characters}]$
Report the values $\theta_0$ and $\theta_1$, and the Mean Squared Error of your predictor (on the entire dataset) (1 mark).
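A minimal sketch of fitting this model by least squares with NumPy follows. It is one of several reasonable approaches, not necessarily the one used in the course code, and the toy lengths and ratings are invented so that the fit is exact.

```python
import numpy as np


def fit_length_model(lengths, ratings):
    """Least-squares fit of rating ~ theta_0 + theta_1 * length."""
    # Feature matrix: a constant column for the offset, then the length
    X = np.column_stack([np.ones(len(lengths)), np.asarray(lengths, dtype=float)])
    y = np.asarray(ratings, dtype=float)
    theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    mse = float(np.mean((X @ theta - y) ** 2))
    return theta, mse


# Hypothetical noiseless data: rating = 5 - 0.01 * length
lengths = [100, 200, 300, 400]
ratings = [4.0, 3.0, 2.0, 1.0]
theta, mse = fit_length_model(lengths, ratings)
```

On real data the MSE will not be zero; here it is only because the toy ratings were generated from the line itself.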
3. Extend your model to include (in addition to the length) features based on the time of the review. You
can parse the time data as follows:
> import dateutil.parser
> t = dateutil.parser.parse(d['date_added'])
> t.weekday(), t.year # etc.
Using a one-hot encoding for the weekday and year, write down feature vectors for the first two examples.
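The encoding could be sketched as follows. Two assumptions are made here that the specification does not state: the first category of each feature is dropped as a reference level (so the columns are not collinear with the offset term), and the list of years is hypothetical. The sketch uses `datetime.datetime` directly, which is the type `dateutil.parser.parse` returns.

```python
import datetime


def one_hot(value, categories, drop_first=True):
    """One-hot encode value over categories, optionally dropping the first."""
    vec = [1 if value == c else 0 for c in categories]
    # Assumption: drop the first category as the reference level
    return vec[1:] if drop_first else vec


def feature(length, t, years):
    """Offset, length, one-hot weekday, one-hot year."""
    weekday = one_hot(t.weekday(), list(range(7)))
    year = one_hot(t.year, years)
    return [1, length] + weekday + year


years = [2015, 2016, 2017]          # hypothetical set of years in the data
t = datetime.datetime(2016, 3, 2)   # a Wednesday, so t.weekday() == 2
x = feature(120, t, years)
```

Passing `drop_first=False` keeps all seven weekday columns and every year column, if that is the convention you prefer.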
4. Train models that
• use the weekday and year values directly as features, i.e.,
$\text{star rating} \simeq \theta_0 + \theta_1 \times [\text{review length in characters}] + \theta_2 \times [\text{t.weekday()}] + \theta_3 \times [\text{t.year}]$
• use the one-hot encoding from Question 3.
Report the MSE of each (1 mark).
5. Repeat the above question, but this time split the data into a training and test set. You should split
the data randomly into 50%/50% train/test fractions. Report the MSE of each model separately on the
training and test sets.
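A random 50%/50% split and the two MSE values might be computed along these lines. The function name, the fixed seed, and the toy data are all assumptions for illustration; any feature matrix with an offset column could be passed in.

```python
import random

import numpy as np


def split_and_evaluate(X, y, seed=0):
    """Fit on a random half of the rows; return (train MSE, test MSE)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)  # seed is an arbitrary choice
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]

    # Fit on the training rows only
    theta, _, _, _ = np.linalg.lstsq(X[train], y[train], rcond=None)

    def mse(rows):
        return float(np.mean((X[rows] @ theta - y[rows]) ** 2))

    return mse(train), mse(test)


# Hypothetical noiseless data: y = 2 + 0.5 * x
X_toy = [[1, i] for i in range(20)]
y_toy = [2 + 0.5 * i for i in range(20)]
train_mse, test_mse = split_and_evaluate(X_toy, y_toy)
```

On real data the test MSE is typically somewhat higher than the training MSE, since the parameters are fitted to the training half only.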
6. (CSE258 only) Show that for a trivial predictor, i.e., $y = \theta_0$, the best possible value of $\theta_0$ in terms of
the Mean Absolute Error is the median of the label $y$. Hint: compute the derivative of the model's MAE
and solve for $\theta_0$.
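Before attempting the proof, the claim can be sanity-checked numerically. The sketch below is not a derivation: it simply scans a grid of candidate constants on a small made-up label vector and compares the best one with the median.

```python
import numpy as np


def mae(theta0, y):
    """Mean Absolute Error of the constant predictor theta0."""
    return float(np.mean(np.abs(np.asarray(y, dtype=float) - theta0)))


y = [1, 2, 2, 3, 5, 5, 4]                # hypothetical labels
candidates = np.linspace(0, 6, 601)      # grid with step 0.01
errors = [mae(c, y) for c in candidates]
best = float(candidates[np.argmin(errors)])
median = float(np.median(y))
```

On this example the grid minimizer coincides with the median, whereas the mean of the labels gives a larger MAE.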