
CS 178: Machine Learning & Data Mining, Homework 4


This is a Python homework assignment on machine learning and data mining.

Problem 1: Decision Trees for Spam Classification (25 points)

We'll use the same data as in our earlier homework: in order to reduce my email load, I decide to implement a
machine learning algorithm to decide whether I should read an email or simply file it away. To
train my model, I obtain the following data set of binary-valued features about each email, including whether I
know the author or not, whether the email is long or short, and whether it has any of several key words, along
with my final decision about whether to read it (y = +1 for “read”, y = −1 for “discard”).

In the case of any ties where both classes have equal probability, we will prefer to predict class +1.

1. Calculate the entropy H(y), in bits, of the binary class variable y. Hint: Your answer should be a number
between 0 and 1. (5 points)
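As a sanity check, the entropy can be computed numerically. The sketch below uses a hypothetical label vector (6 read, 4 discarded); the actual counts must come from the homework's data table.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# Hypothetical class column: 6 of 10 emails read (+1), 4 discarded (-1).
y = np.array([+1, +1, +1, +1, +1, +1, -1, -1, -1, -1])
print(entropy(y))  # -(0.6*log2(0.6) + 0.4*log2(0.4)) ≈ 0.971 bits
```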

2. Calculate the information gain for each feature xi. Which feature should I split on for the root node of the
decision tree? (10 points)
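Information gain for a feature is the entropy of y minus the weighted entropy of y within each feature value. A minimal sketch, using a hypothetical binary feature x (the gains for the real features must be computed from the data table):

```python
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(x, y):
    """H(y) minus the weighted average entropy of y within each value of x."""
    gain = entropy(y)
    for v in np.unique(x):
        mask = (x == v)
        gain -= mask.mean() * entropy(y[mask])
    return float(gain)

# Hypothetical feature (e.g. "know the author") against hypothetical labels.
x = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y = np.array([+1, +1, +1, +1, +1, +1, -1, -1, -1, -1])
print(information_gain(x, y))  # ≈ 0.420 bits for this toy split
```

The root split is simply the feature whose gain is largest.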

3. Determine the complete decision tree that will be learned from these data. (The tree should perfectly classify
all training data.) Specify the tree by drawing it, or with a set of nested if-then-else statements. (10 points)
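If you choose the if-then-else form, the learned tree can be written as a nested function. The shape below is purely hypothetical (the real splits and leaf labels follow from the gains computed in part 2):

```python
def predict(know_author, is_long, has_keyword):
    """Hypothetical tree shape only; the real tree comes from the data table."""
    if know_author:
        return +1            # read mail from known authors
    elif is_long:
        return -1            # discard long mail from strangers
    elif has_keyword:
        return +1
    else:
        return -1

print(predict(know_author=0, is_long=0, has_keyword=1))  # +1
```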

Problem 2: Decision Trees in Python (50 points)

In this problem, we will use our Kaggle in-class competition data to test decision trees on real data. Kaggle is a
website designed to host data prediction competitions; we will use it to gain some experience with more realistic
machine learning problems, and have an opportunity to compare methods and ideas amongst ourselves. Our
in-class Kaggle page is https://www.kaggle.com/c/uci-cs178-f21; it is set to public / open participation this
year. Follow the instructions on the CS178 Canvas page to create an appropriate Kaggle account (if necessary),
join our in-class competition, and download the competition data. Note: although you will eventually form teams
for the project, please do not merge yet, as you will want to be able to submit individually for this homework. For
convenience, the data are also included in the HW4 code zip, in the data subdirectory.

1. The following code may be used to load the training features X and class labels Y :

import numpy as np
import mltools as ml

X = np.genfromtxt('data/X_train.txt', delimiter=',')
Y = np.genfromtxt('data/Y_train.txt', delimiter=',')
X, Y = ml.shuffleData(X, Y)
# ... and similarly for the test data features; test target values are
# withheld for the competition.

The first 41 features are numeric (real-valued features); we will restrict our attention to these:
X = X[:, :41]  # keep only the numeric features for now

Print the minimum, maximum, mean, and variance of each of the first 5 features. (5 points)
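These summary statistics are one-liners per column. The sketch below uses a random stand-in matrix of the same width; in the assignment, X comes from `np.genfromtxt('data/X_train.txt', delimiter=',')` restricted to the first 41 columns.

```python
import numpy as np

# Random stand-in for the Kaggle feature matrix (200 rows, 41 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 41))

for i in range(5):
    col = X[:, i]
    print(f"feature {i}: min={col.min():.3f}  max={col.max():.3f}  "
          f"mean={col.mean():.3f}  var={col.var():.3f}")
```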

2. To enable us to do model selection, partition your training data X into training data Xtr,Ytr and validation
sets Xva,Yva of approximately equal size. Learn a decision tree classifier from the training data using the
method implemented in the mltools package (this may take a minute):

learner = ml.dtree.treeClassify(Xtr, Ytr, maxDepth=50)

Here, we set the maximum tree depth to 50 to avoid potential recursion limits or memory issues. Compute
and report your decision tree’s training and validation error rates. (5 points)
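The split-and-evaluate workflow can be sketched on toy data as below. The trivial threshold "predictor" is only a stand-in: in the assignment, the learner is `ml.dtree.treeClassify(Xtr, Ytr, maxDepth=50)`, and the error rates come from `learner.err(Xtr, Ytr)` and `learner.err(Xva, Yva)`.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
Y = (X[:, 0] > 0).astype(int)

half = X.shape[0] // 2                 # approximately equal-sized split
Xtr, Xva = X[:half], X[half:]
Ytr, Yva = Y[:half], Y[half:]

def err_rate(y_hat, y):
    """Fraction of misclassified points (what learner.err reports)."""
    return float(np.mean(y_hat != y))

y_hat = (Xva[:, 0] > 0).astype(int)    # stand-in for learner.predict(Xva)
print(err_rate(y_hat, Yva))            # 0.0 on this separable toy data
```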

3. Now try varying the maxDepth parameter, which forces the tree learning algorithm to stop after at most that
many levels. Test maxDepth values in the range 0, 1, 2, …, 15, and plot the training and validation
error rates versus maxDepth. Do models with higher maxDepth have higher or lower complexity? What
choice of maxDepth provides the best decision tree model? (10 points)
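The sweep is a simple loop over depths. The sketch below uses scikit-learn's DecisionTreeClassifier as an assumed stand-in (its `max_depth` must be at least 1, so depth 0 is omitted here); with mltools you would instead build `ml.dtree.treeClassify(Xtr, Ytr, maxDepth=d)` for each d.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data in place of the Kaggle features.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
Y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
Xtr, Xva = X[:200], X[200:]
Ytr, Yva = Y[:200], Y[200:]

depths = range(1, 16)
tr_err, va_err = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(Xtr, Ytr)
    tr_err.append(1.0 - tree.score(Xtr, Ytr))   # training error rate
    va_err.append(1.0 - tree.score(Xva, Yva))   # validation error rate

# Deeper trees are more complex: training error is non-increasing in depth,
# while validation error typically falls and then rises (overfitting).
# Plot with: plt.plot(depths, tr_err, 'b-', depths, va_err, 'r-')
```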

4. The minParent parameter controls the complexity of decision trees by lower bounding the amount of data
required to split nodes when learning. Fixing maxDepth=50 , compute and plot the training and validation
error rates for minParent values in the range 2^[0:13] = [1, 2, 4, 8, …, 8192]. Do models with higher
minParent have higher or lower complexity? What choice of minParent provides the best decision tree
model? (10 points)
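The same loop works for the minParent sweep. The sketch below assumes scikit-learn's `min_samples_split` as an analogue of mltools' minParent (it must be at least 2, so minParent = 1 is clamped); with mltools you would pass `minParent=p` to `ml.dtree.treeClassify` with `maxDepth=50` fixed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
Y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
Xtr, Xva = X[:200], X[200:]
Ytr, Yva = Y[:200], Y[200:]

parents = [2 ** k for k in range(14)]          # [1, 2, 4, ..., 8192]
tr_err, va_err = [], []
for p in parents:
    tree = DecisionTreeClassifier(max_depth=50, min_samples_split=max(p, 2),
                                  random_state=0).fit(Xtr, Ytr)
    tr_err.append(1.0 - tree.score(Xtr, Ytr))
    va_err.append(1.0 - tree.score(Xva, Yva))

# Larger minParent forbids splitting small nodes, so it lowers complexity:
# training error is non-decreasing as minParent grows.
```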

5. (Not graded) A related control is minLeaf; how does complexity control with minParent compare to
minLeaf?

6. We discussed in class that we could understand our model’s performance as we vary our preference for false
positives compared to false negatives using the ROC curve, or summarize this curve using a scalar area under
curve (AUC) score. For the best decision tree model trained in the previous parts, use the roc function to
plot an ROC curve summarizing your classifier performance on the training points, and another ROC curve
summarizing your performance on the validation points. Then using the auc function, compute and report
the AUC scores for the training and validation data. (10 points)
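The roc and auc helpers referenced here belong to the course's mltools learner; the scikit-learn equivalents below illustrate on toy scores what those functions compute, namely the trade-off curve between false and true positive rates and the area beneath it.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and classifier scores (stand-ins for the tree's soft predictions).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresh = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))  # 0.75: 3 of 4 pos/neg pairs ranked correctly
# Plot with: plt.plot(fpr, tpr); an AUC of 1.0 would be a perfect ranking.
```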

