This assessment tests your understanding of model complexity, model selection, uncertainty in prediction with bootstrapping, probabilistic machine learning, and linear models for regression and classification, covered in Modules 1, 2 and 3.
This assessment is worth 100 marks in total.
Section A. Model Complexity and Model Selection
In this section, you study the effect of model complexity on the training and testing error. You also demonstrate your programming skills by developing a regression algorithm and a cross-validation technique that will be used to select the models with the most effective complexity.
Background. A KNN regressor is similar to a KNN classifier (covered in Activity 1 of Module 1) in that it finds the K nearest neighbours and estimates the value of the given test point based on the values of its neighbours. The main difference between KNN regression and KNN classification is that the KNN classifier returns the label that has the majority vote in the neighbourhood, whilst the KNN regressor returns the average of the neighbours’ values. In Activity 1 of Module 1, we used the number of misclassifications to measure the training and testing errors of the KNN classifier. For the KNN regressor, you need to choose another error function (e.g., the sum of the squares of the errors) to measure the training and testing errors.
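The averaging rule described above can be sketched as follows. This is a minimal illustration in Python using only the standard library, not a reference solution. The names `train_data`, `train_label` and `test_data` are Python adaptations of the dotted names used in this assessment, the `sse` helper is a hypothetical name for the sum-of-squares error mentioned above, and the toy data is made up.

```python
import math

def knn(train_data, train_label, test_data, K=4):
    """Predict each test point as the mean label of its K nearest training points."""
    predictions = []
    for query in test_data:
        # Rank training indices by Euclidean distance to the query point.
        # sorted() is stable, so ties are broken by original index order.
        order = sorted(range(len(train_data)),
                       key=lambda i: math.dist(train_data[i], query))
        nearest = order[:K]
        predictions.append(sum(train_label[i] for i in nearest) / len(nearest))
    return predictions

def sse(predicted, actual):
    """Sum of squared errors (hypothetical helper; one possible error function)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

# Made-up one-dimensional data, for illustration only.
train_data = [[1.0], [2.0], [3.0], [4.0], [5.0]]
train_label = [2.0, 4.0, 6.0, 8.0, 10.0]
test_data = [[2.5], [4.5]]

predictions = knn(train_data, train_label, test_data, K=2)
print(predictions)  # [5.0, 9.0]
```

For the query 2.5 the two nearest training points are 2.0 and 3.0, so the prediction is the mean of their labels, (4.0 + 6.0) / 2 = 5.0.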
Question 1 [KNN Regressor, 20 Marks]
I. Implement the KNN regressor function:
knn(train.data, train.label, test.data, K=4), which takes the training data and their labels (continuous values), the test data, and the size of the neighbourhood (K). It should return the regressed values for the test data points. Note that you need a distance function to choose the neighbours; use the Euclidean distance to measure the distance between a pair of data points.
Hint: You are allowed to use KNN classifier code from Activity 1 of Module 1.
II. Plot the training and testing errors versus 1/K for K = 1, ..., 25 in one plot, using the Task1A_train.csv and Task1A_test.csv datasets provided for this assessment. Save the plot in your Jupyter Notebook file for Question 1. Report your chosen error function in your Jupyter Notebook file.
III. Report (in your Jupyter Notebook file) the optimum value for K in terms of the testing error. Discuss the values of K and model complexity corresponding to underfitting and overfitting based on your plot in the previous part (Part II).
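The error sweep in Part II can be sketched as below, under stated assumptions: the toy data is made up and only stands in for Task1A_train.csv and Task1A_test.csv, the mean squared error is one possible choice of error function, and plotting is left out so that the sketch needs only the Python standard library. The underscored names are adaptations, not part of the assessment.

```python
import math

def knn(train_data, train_label, test_data, K=4):
    """KNN regression: mean label of the K nearest training points (Euclidean)."""
    preds = []
    for q in test_data:
        order = sorted(range(len(train_data)),
                       key=lambda i: math.dist(train_data[i], q))
        nearest = order[:K]
        preds.append(sum(train_label[i] for i in nearest) / len(nearest))
    return preds

def mse(predicted, actual):
    """Mean squared error (an assumed choice of error function)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Made-up data standing in for the provided CSV files.
train_x = [[float(i)] for i in range(10)]
train_y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1, 8.0, 9.1]
test_x = [[i + 0.5] for i in range(9)]
test_y = [0.5, 1.6, 2.4, 3.5, 4.6, 5.4, 6.5, 7.6, 8.4]

train_err, test_err = {}, {}
for K in range(1, len(train_x) + 1):
    train_err[K] = mse(knn(train_x, train_y, train_x, K), train_y)
    test_err[K] = mse(knn(train_x, train_y, test_x, K), test_y)

# 1/K grows with model complexity, so it is the natural x-axis for the plot.
for K in sorted(train_err):
    print(f"1/K={1 / K:.3f}  train={train_err[K]:.3f}  test={test_err[K]:.3f}")
```

With distinct training points, the training error at K = 1 is exactly zero because each training point is its own nearest neighbour. When K equals the number of training points, every prediction is the mean of all training labels.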
Question 2 [L-fold Cross Validation, 15 Marks]
I. Implement an L-Fold Cross Validation (CV) function for your KNN regressor:
cv(train.data, train.label, K, numFold), which takes the training data and their labels (continuous values), the size of the neighbourhood (K), and the number of folds (numFold), and returns the errors for the different folds of the training data.
II. Using the training data from Question 1, run your L-Fold CV with numFold set to 10. Vary K = 1, ..., 15 in your KNN regressor, and for each K compute the average of the 10 fold errors. Plot the average errors versus 1/K for K = 1, ..., 15. Save the plot in your Jupyter Notebook file for Question 2.
III. Report (in your Jupyter Notebook file) the optimum value for K based on your plot for this 10-fold cross validation in the previous part (Part II).
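One possible shape for the L-fold procedure above is sketched here in Python with the standard library only. Several choices are assumptions rather than requirements of this assessment: folds are formed by interleaving indices without shuffling, each fold's error is the mean squared error on its held-out points, the data must contain at least numFold points, and the toy data is made up.

```python
import math

def knn(train_data, train_label, test_data, K=4):
    """KNN regression: mean label of the K nearest training points (Euclidean)."""
    preds = []
    for q in test_data:
        order = sorted(range(len(train_data)),
                       key=lambda i: math.dist(train_data[i], q))
        nearest = order[:K]
        preds.append(sum(train_label[i] for i in nearest) / len(nearest))
    return preds

def cv(train_data, train_label, K, numFold):
    """Return one held-out error per fold (assumes len(train_data) >= numFold)."""
    n = len(train_data)
    errors = []
    for fold in range(numFold):
        # Interleaved split: point i is held out in fold i % numFold.
        held_out = [i for i in range(n) if i % numFold == fold]
        kept = [i for i in range(n) if i % numFold != fold]
        preds = knn([train_data[i] for i in kept],
                    [train_label[i] for i in kept],
                    [train_data[i] for i in held_out], K)
        squared = [(p - train_label[i]) ** 2 for p, i in zip(preds, held_out)]
        errors.append(sum(squared) / len(squared))
    return errors

# Made-up data: 20 points on the line y = 2x.
train_data = [[float(i)] for i in range(20)]
train_label = [2.0 * i for i in range(20)]

for K in (1, 2, 4):
    fold_errors = cv(train_data, train_label, K, numFold=5)
    print(K, sum(fold_errors) / len(fold_errors))  # average error across folds
```

If the rows of the real training file are ordered in some systematic way, shuffling the indices before splitting is a common precaution.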
Section B. Prediction Uncertainty with Bootstrapping
This section is an adaptation of Activity 2 of Module 1 from KNN classification to KNN regression. You use the bootstrapping technique to quantify the uncertainty of predictions for the KNN regressor that you implemented in Section A.
Background. Please refer to the background in Section A.
Question 3 [Bootstrapping, 25 Marks]
I. Modify the code in Activity 2 of Module 1 to handle bootstrapping for KNN regression.
II. Load the Task1B_train.csv and Task1B_test.csv sets. Apply your bootstrapping for KNN regression with times = 50 (the number of subsets), size = 60 (the size of each subset), and vary K = 1, ..., 15 (the neighbourhood size). Now create a boxplot where the x-axis is K, and the y-axis is the average test error (and the uncertainty around it) corresponding to each K. Save the plot in your Jupyter Notebook file for Question 3.
Hint: You can refer to the boxplot in Activity 2 of Module 1. However, the error is measured differently from the KNN classifier.
III. Based on the plot in the previous part (Part II), how do the test error and its uncertainty behave as K increases? Explain in your Jupyter Notebook file.
IV. Load the Task1B_train.csv and Task1B_test.csv sets. Apply your bootstrapping for KNN regression with K = 10 (the neighbourhood size), size = 40 (the size of each subset), and vary times = 10, 20, 30, ..., 200 (the number of subsets). Now create a boxplot where the x-axis is ‘times’, and the y-axis is the average error (and the uncertainty around it) corresponding to each value of ‘times’. Save the plot in your Jupyter Notebook file for Question 3.
V. Based on the plot in the previous part (Part IV), how do the test error and its uncertainty behave as the number of subsets in bootstrapping increases? Explain in your Jupyter Notebook file.
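The bootstrapping loop for KNN regression can be sketched as below. This is not the code from Activity 2 of Module 1, which is not reproduced here. The function name `bootstrap_errors`, the `seed` parameter, the use of mean squared error, and the toy data are all assumptions made for illustration. The boxplot itself is omitted so that the sketch needs only the Python standard library.

```python
import math
import random
import statistics

def knn(train_data, train_label, test_data, K=4):
    """KNN regression: mean label of the K nearest training points (Euclidean)."""
    preds = []
    for q in test_data:
        order = sorted(range(len(train_data)),
                       key=lambda i: math.dist(train_data[i], q))
        nearest = order[:K]
        preds.append(sum(train_label[i] for i in nearest) / len(nearest))
    return preds

def bootstrap_errors(train_data, train_label, test_data, test_label,
                     K, times, size, seed=0):
    """Return one test error per bootstrap subset (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    n = len(train_data)
    errors = []
    for _ in range(times):
        # Draw `size` training indices uniformly at random, with replacement.
        idx = [rng.randrange(n) for _ in range(size)]
        preds = knn([train_data[i] for i in idx],
                    [train_label[i] for i in idx], test_data, K)
        squared = [(p - t) ** 2 for p, t in zip(preds, test_label)]
        errors.append(sum(squared) / len(squared))
    return errors

# Made-up data standing in for Task1B_train.csv / Task1B_test.csv.
train_data = [[float(i)] for i in range(30)]
train_label = [0.5 * i for i in range(30)]
test_data = [[2.5], [10.5], [20.5]]
test_label = [1.25, 5.25, 10.25]

for K in (1, 5, 15):
    errs = bootstrap_errors(train_data, train_label, test_data, test_label,
                            K=K, times=50, size=20)
    # These per-K error lists are what a boxplot would summarise.
    print(K, min(errs), statistics.median(errs), max(errs))
```

Each list returned by `bootstrap_errors` holds one error per subset, so passing one list per value of K (or per value of times) to a boxplot routine shows both the typical error and its spread.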
Section C. Probabilistic Machine Learning
In this section, you show your knowledge of the foundations of probabilistic machine learning (i.e., probabilistic inference and modelling) by solving a simple but fundamental statistical inference problem. Solve the following problem based on the probability concepts you have learned in Module 1, using the same mathematical conventions.