3 Ensemble Methods [26 pts]
We will study the performance of two ensemble methods on the Optdigits dataset, which consists of handwritten
digits. This dataset contains 5620 samples (each an 8 × 8 grayscale image with 64 features) of the digits 0
through 9, divided into 10 classes. In the following question we will use a subset of these classes as our dataset.
Within HW2_ensemble.py, the following functions have been implemented for you:
• load_optdigits(classes)
• get_median_performance(X, y, m_vals, nsplits=50)
• plot_data(bagging_scores, random_forest_scores, m_range)
It also contains the following function declarations that you will implement:
• bagging_ensemble(X_train, y_train, X_test, y_test, n_clf=10)
• random_forest(X_train, y_train, X_test, y_test, m, n_clf=10)
a) (12pt) Implement random_forest(X_train, y_train, X_test, y_test, m, n_clf=10)
based on the specification provided in the skeleton code. A random forest consists of n_clf-many
decision trees, where each decision tree is trained independently on a bootstrap sample of the training
data (for this problem, n_clf=10). At each node, we randomly select m features as candidates for
determining the best split. Here, the final prediction of the random forest classifier is determined by
a majority vote of these n_clf decision trees. In the case of ties, randomly sample among the plurality
classes (i.e., the classes that are tied) to choose a label.
You should use the sklearn.tree.DecisionTreeClassifier class. Set criterion='entropy'
to avoid the default setting. Also, see the max_features parameter of this class.
Note: Do not set the max_depth parameter. Remember that you are free to use helper functions to
keep your code organized.
Include a screenshot (or equivalent) of your implemented function as your solution to this question.
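As a non-authoritative starting point, the steps above (bootstrap sampling, m-feature splits, majority vote with random tie-breaking) might be sketched as follows. This is a hypothetical implementation, not the skeleton's specification: the fixed random seed, the assumption that the function returns test accuracy, and the exact tie-breaking helper logic are all choices made here for illustration.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def random_forest(X_train, y_train, X_test, y_test, m, n_clf=10):
    """Sketch: train n_clf trees on bootstrap samples, restricting each
    split to m randomly chosen candidate features, and return the
    majority-vote test accuracy (assumed return value)."""
    rng = np.random.default_rng(0)  # assumption: seeded for reproducibility
    n = X_train.shape[0]
    preds = []
    for _ in range(n_clf):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        clf = DecisionTreeClassifier(criterion='entropy', max_features=m)
        clf.fit(X_train[idx], y_train[idx])
        preds.append(clf.predict(X_test))
    preds = np.array(preds)               # shape (n_clf, n_test)
    y_pred = []
    for votes in preds.T:                 # majority vote per test point
        counts = Counter(votes)
        top = max(counts.values())
        tied = [c for c, v in counts.items() if v == top]
        y_pred.append(rng.choice(tied))   # break ties uniformly at random
    return np.mean(np.array(y_pred) == y_test)
```

Note that passing max_features=m lets the tree itself draw a fresh random subset of m features at every node, which matches the per-node sampling described above.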
b) (6pt) Implement bagging_ensemble(X_train, y_train, X_test, y_test, n_clf=10)
based on the specification provided in the skeleton code. Like a random forest, a bagging ensemble
classifier consists of n_clf-many decision trees, where each decision tree is trained independently on a
bootstrap sample of the training data. However, all features are considered at every split.
Again, the final prediction is determined by a majority vote of these n_clf decision trees.
Include a screenshot (or equivalent) of your implemented function as your solution to this question.
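A hypothetical sketch of this variant follows; the only substantive difference from a random forest is that max_features is left at its default, so every split may consider all features. As before, the seeded generator and the accuracy return value are assumptions, not part of the skeleton's specification.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_ensemble(X_train, y_train, X_test, y_test, n_clf=10):
    """Sketch: train n_clf trees on bootstrap samples and return the
    majority-vote test accuracy (assumed return value)."""
    rng = np.random.default_rng(0)  # assumption: seeded for reproducibility
    n = X_train.shape[0]
    preds = []
    for _ in range(n_clf):
        idx = rng.integers(0, n, size=n)  # bootstrap sample (with replacement)
        clf = DecisionTreeClassifier(criterion='entropy')  # all features per split
        clf.fit(X_train[idx], y_train[idx])
        preds.append(clf.predict(X_test))
    y_pred = []
    for votes in np.array(preds).T:       # majority vote, random tie-break
        counts = Counter(votes)
        top = max(counts.values())
        y_pred.append(rng.choice([c for c, v in counts.items() if v == top]))
    return np.mean(np.array(y_pred) == y_test)
```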
c) (4pt) Now we will compare the performance of these ensemble classifiers using get_median_performance().
Measure the median performance across 50 random splits of the digits dataset into training (80%) and
test (20%) sets, and plot the returned performance for each algorithm over the range of m values
specified in the skeleton code (use the plot_data() function). See Figure 1 for an example of
the plot (yours may differ due to randomness, but the general trend should be the same). Include
your generated plot in your solution. How does the average test performance of the two methods
compare as we vary m (the size of the randomly selected subset of features)?
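For intuition, the median-over-random-splits evaluation described here can be sketched as below. This is a hypothetical stand-in, not the provided get_median_performance(): the function name, the clf_fn callback signature, and the seed-per-split scheme are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def median_performance(clf_fn, X, y, n_splits=50):
    """Sketch: median test score of clf_fn over n_splits random 80/20 splits.
    clf_fn(X_tr, y_tr, X_te, y_te) is assumed to return a scalar score."""
    scores = []
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed)
        scores.append(clf_fn(X_tr, y_tr, X_te, y_te))
    return np.median(scores)
```

Taking the median rather than the mean makes the reported score robust to the occasional unlucky split, which is why a trend plot over m stays smooth despite the randomness in bootstrapping.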