* Prerequisites
In this assignment you will implement the Naive Bayes classifier. Before starting this assignment, make sure you understand the concepts discussed in the Week 2 videos about Naive Bayes. You may also find it useful to read Chapter 1 of the textbook.
Also, make sure that you are familiar with the numpy.ndarray class of python's numpy library and that you are able to answer the following questions. Let's assume a is a numpy array.
- What is an array's shape (e.g., what is the meaning of a.shape)?
- What is numpy's reshaping operation? How much computational overhead would it induce?
- What is numpy's transpose operation, and how is it different from reshaping? Does it cause computational overhead?
- What is the meaning of the commands a.reshape(-1, 1) and a.reshape(-1)?
- What happens to the variable a after we call b = a.reshape(-1)? Do any of the attributes of a change?
- How do assignments in python and numpy work in general?
- Does the b = a statement copy by value or by reference?
- Would the answer to the previous question change depending on whether a is a numpy array or a scalar value?
You can answer all of these questions by
1. Reading numpy's documentation at https://numpy.org/doc/stable/.
2. Experimenting with dummy variables.
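As a quick way to check your answers, here is a minimal runnable sketch (the variable names are arbitrary):
import numpy as np

a = np.arange(6)          # a 1-d array with shape (6,)
print(a.shape)            # -> (6,)

col = a.reshape(-1, 1)    # a column vector with shape (6, 1)
flat = a.reshape(-1)      # a flattened array with shape (6,)

# reshape (and transpose) return views whenever possible, so they induce
# almost no computational overhead; the underlying buffer is shared.
flat[0] = 99
print(a[0])               # -> 99, because flat shares a's memory; a's own
                          # attributes (e.g., a.shape) are unchanged

# b = a copies by reference for numpy arrays: both names refer to the
# same object, so in-place edits through b are visible through a.
b = a
b[1] = -1
print(a[1])               # -> -1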
* Assignment Summary
The UC Irvine machine learning data repository hosts a famous dataset, the Pima Indians dataset, on whether a patient has diabetes. It was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. You can find it at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. This data has a set of attributes of patients, and a categorical variable telling whether each patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable. It has a total of 768 data points.
- Part 1-A) First, you will build a simple naive Bayes classifier to classify this data set. We will use 20% of the data for evaluation and the other 80% for training. You should use a normal distribution to model each of the class-conditional distributions. Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions (see the sketch after this list).
- Part 1-B) Next, you will adjust your code so that, for attributes 3 (Diastolic blood pressure), 4 (Triceps skin fold thickness), 6 (Body mass index), and 8 (Age), it regards a value of 0 as a missing value when estimating the class-conditional distributions and the posterior. Report the accuracy of the classifier on the 20% that was held out for evaluation.
- Part 1-C) Last, you will get some experience with SVMLight, an off-the-shelf implementation of Support Vector Machines (SVMs). For now, you don't need to understand much about SVMs; we will explore them in more depth in the following exercises. You will install SVMLight, which you can find at http://svmlight.joachims.org, to train and evaluate an SVM to classify this data. You should NOT substitute NA values for zeros for attributes 3, 4, 6, and 8. Report the accuracy of the classifier on the held-out 20%.
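For reference, here is a minimal sketch of the accuracy computation and of exporting the data for SVMLight. The write_svmlight_file helper and the file names are ours, not part of the assignment; SVMLight's svm_learn and svm_classify programs and its "label index:value" input format are documented at http://svmlight.joachims.org.
import numpy as np

def accuracy(predictions, labels):
    # Number of correct predictions as a fraction of total predictions.
    return float(np.mean(predictions == labels))

def write_svmlight_file(features, labels, path):
    # Write one "label index:value ..." line per sample; SVMLight expects
    # labels in {-1, +1} and 1-based feature indices.
    with open(path, 'w') as f:
        for x, y in zip(features, labels):
            label = 1 if y == 1 else -1
            feats = ' '.join('%d:%g' % (j + 1, v) for j, v in enumerate(x))
            f.write('%d %s\n' % (label, feats))

# Example usage, assuming the train/eval arrays constructed later in this notebook:
# write_svmlight_file(train_features, train_labels, 'train.dat')
# write_svmlight_file(eval_features, eval_labels, 'eval.dat')
# Then, from the shell:
#   svm_learn train.dat model
#   svm_classify eval.dat model predictions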
0. Data
0.1 Description
The UC Irvine Machine Learning Repository hosts a famous collection of data on whether a patient has diabetes (the Pima Indians dataset), originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. It is also hosted as a Kaggle dataset.
You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The Kaggle website offers valuable visualizations of the original data dimensions in its dashboard. It is quite insightful to take the time to make sense of the data using that dashboard before applying any method to it.
0.2 Information Summary
- Input/Output: This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not.
- Missing Data: For several attributes in this data set, a value of 0 may indicate a missing value of the variable.
- Final Goal: We want to build a classifier that can predict whether a patient has diabetes. To do this, we will train multiple kinds of models, and will handle the missing data with a different approach for each method (i.e., some methods will ignore the missing values, while others will do something about them).
0.3 Loading
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from aml_utils import test_case_checker
df = pd.read_csv('../BasicClassification-lib/diabetes.csv')
df.head()
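Before splitting, it is worth quantifying how common the zero placeholders are. A quick check, assuming df has been loaded as above and using the column names that appear later in this notebook:
# Count the zero entries in the columns where 0 denotes a missing value.
(df[['BloodPressure', 'SkinThickness', 'BMI', 'Age']] == 0).sum()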
0.4 Splitting The Data
First, we will shuffle the data completely, and forget about the order in the original csv file.
- The training and evaluation dataframes will be named train_df and eval_df, respectively.
- We will also create the 2-d numpy arrays train_features and eval_features, whose numbers of rows are the numbers of training and evaluation samples, respectively, and whose number of columns is 8 (i.e., the number of features).
- We will also create the 1-d numpy arrays train_labels and eval_labels, which contain the training and evaluation labels, respectively.
# Let's generate the split ourselves.
np_random = np.random.RandomState(seed=12345)
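# Assign each row a uniform random number; rows whose value falls below the
# 80th percentile go to training, the rest to evaluation (roughly an 80/20 split).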
rand_unifs = np_random.uniform(0,1,size=df.shape[0])
division_thresh = np.percentile(rand_unifs, 80)
train_indicator = rand_unifs < division_thresh
eval_indicator = rand_unifs >= division_thresh
train_df = df[train_indicator].reset_index(drop=True)
train_features = train_df.loc[:, train_df.columns != 'Outcome'].values
train_labels = train_df['Outcome'].values
train_df.head()
eval_df = df[eval_indicator].reset_index(drop=True)
eval_features = eval_df.loc[:, eval_df.columns != 'Outcome'].values
eval_labels = eval_df['Outcome'].values
eval_df.head()
train_features.shape, train_labels.shape, eval_features.shape, eval_labels.shape
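As a quick consistency check, the two splits should partition the full dataframe, with 8 feature columns each:
assert train_features.shape[0] + eval_features.shape[0] == df.shape[0]
assert train_features.shape[1] == eval_features.shape[1] == 8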
0.5 Pre-processing The Data
Some of the columns exhibit missing values. We will use a Naive Bayes Classifier later that will treat such missing values in a special way. To be specific, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), we should regard a value of 0 as a missing value.
Therefore, we will create the train_features_with_nans and eval_features_with_nans numpy arrays to be just like their train_features and eval_features counterparts, but with the zero values in those columns replaced with NaNs.
train_df_with_nans = train_df.copy(deep=True)
eval_df_with_nans = eval_df.copy(deep=True)
for col_with_nans in ['BloodPressure', 'SkinThickness', 'BMI', 'Age']:
    train_df_with_nans[col_with_nans] = train_df_with_nans[col_with_nans].replace(0, np.nan)
    eval_df_with_nans[col_with_nans] = eval_df_with_nans[col_with_nans].replace(0, np.nan)
train_features_with_nans = train_df_with_nans.loc[:, train_df_with_nans.columns != 'Outcome'].values
eval_features_with_nans = eval_df_with_nans.loc[:, eval_df_with_nans.columns != 'Outcome'].values
print('Here are the training rows with at least one missing value.')
print('')
print('You can see that such incomplete data points constitute a substantial part of the data.')
print('')
nan_training_data = train_df_with_nans[train_df_with_nans.isna().any(axis=1)]
nan_training_data
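For Part 1-B, the special treatment of missing values when fitting the Gaussian class-conditionals can be as simple as switching to numpy's NaN-aware reductions. A minimal sketch, assuming the *_with_nans arrays defined above (the function name is ours, not part of the assignment API):
def fit_gaussian_ignoring_nans(features):
    # Per-feature mean and variance, skipping the NaN entries in each column.
    mu = np.nanmean(features, axis=0)
    var = np.nanvar(features, axis=0)
    return mu, var

# Example: class-conditional parameters for the positive class.
# mu_pos, var_pos = fit_gaussian_ignoring_nans(train_features_with_nans[train_labels == 1])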
1. Part 1 (Building a simple Naive Bayes Classifier)
Consider a single sample $(x, y)$, where the feature vector is denoted by $x$, and the label is denoted by $y$. We will also denote the $j$-th feature of $x$ by $x^{(j)}$.
According to the textbook, the Naive Bayes classifier uses the following decision rule:
"Choose $y$ such that
$$\log p(y) + \sum_{j} \log p\big(x^{(j)} \mid y\big)$$
is the largest."
However, we first need to define the probabilistic models of the prior $p(y)$ and the class-conditional feature distributions $p(x^{(j)} \mid y)$ using the training data.
- Modelling the prior $p(y)$: We fit a Bernoulli distribution to the Outcome variable of train_df.
- Modelling the class-conditional feature distributions $p(x^{(j)} \mid y)$: We fit Gaussian distributions, and infer the Gaussian mean and variance parameters from train_df (a minimal sketch of how these pieces combine follows this list).
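To make the decision rule concrete, here is a minimal sketch of how the fitted pieces could combine at prediction time. It assumes per-class Gaussian parameters mu and var of shape (2, 8), the log-prior column log_py from Task 1 below, and a features array of shape (N, 8); the function names are ours, not part of the assignment API.
import numpy as np

def log_gaussian_pdf(x, mu, var):
    # Elementwise log of the normal density N(x; mu, var).
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(features, log_py, mu, var):
    # For each class c, sum the per-feature log-likelihoods and add the log-prior.
    scores = np.stack([
        log_py[c, 0] + log_gaussian_pdf(features, mu[c], var[c]).sum(axis=1)
        for c in range(2)
    ], axis=1)                    # shape (N, 2)
    return scores.argmax(axis=1)  # choose the class with the largest score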
Task 1
Write a function log_prior that takes a numpy array train_labels as input, and outputs the following vector as a column numpy array (i.e., with shape (2, 1)):
$$\begin{bmatrix} \log p(y=0) \\ \log p(y=1) \end{bmatrix}$$
Try to avoid loops as much as possible; no loops are necessary.
Hint: Make sure all the array shapes are what you need and expect. You can reshape any numpy array without any tangible computational overhead.
def log_prior(train_labels):
# your code here
raise NotImplementedError
assert log_py.shape == (2,1)
return log_py
# Performing sanity checks on your implementation
some_labels = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1])
some_log_py = log_prior(some_labels)
assert np.array_equal(some_log_py.round(3), np.array([[-0.916], [-0.511]]))
# Checking against the pre-computed test database
test_results = test_case_checker(log_prior, task_id=1)
assert test_results['passed'], test_results['message']
# This cell is left empty as a separator. You can leave this cell as it is, and you should not delete it.
log_py = log_prior(train_labels)
log_py
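For reference, here is a minimal sketch of one possible log_prior implementation that passes the sanity checks above, assuming binary labels in {0, 1}; it is one way to do it, not necessarily the intended solution.
def log_prior(train_labels):
    # Empirical log-frequencies of y = 0 and y = 1, as a (2, 1) column vector.
    p1 = train_labels.mean()  # fraction of positive labels
    log_py = np.log(np.array([1.0 - p1, p1])).reshape(-1, 1)
    assert log_py.shape == (2, 1)
    return log_py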