In this assignment you will implement the Naive Bayes Classifier. Before starting this assignment, make sure you understand the concepts discussed in the videos in Week 2 about Naive Bayes. You may also find it useful to read Chapter 1 of the textbook.
Also, make sure that you are familiar with the numpy.ndarray class of Python's numpy library and that you are able to answer the following questions:
a is a numpy array.
- What is an array’s shape (e.g., what is the meaning of
- What is numpy's reshaping operation? How much computational overhead does it induce?
- What is numpy's transpose operation, and how is it different from reshaping? Does it cause computational overhead?
- What is the meaning of the commands
- What happens to the variable
a after we call
b = a.reshape(-1)? Does any of the attributes of
- How do assignments in python and numpy work in general?
- Does the b=a statement use copying by value? Or is it copying by reference?
- Would the answer to the previous question change depending on whether a is a numpy array or a scalar value?
You can answer all of these questions by
1. Reading numpy's documentation from https://numpy.org/doc/stable/.
2. Making trials using dummy variables.
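As an example of the second approach, here is a minimal sketch of such trials. The array contents and the variable names (a, b, c, t, x, y) are arbitrary choices for illustration. Run it, then compare what is printed against what the documentation says about views, copies, and assignment.

```python
import numpy as np

# Dummy array for experimentation; the values are arbitrary.
a = np.arange(6).reshape(2, 3)

b = a.reshape(-1)   # flatten into a 1-d array
t = a.T             # transpose
c = a               # plain assignment

# Inspect the shapes, and whether the new arrays share memory with a.
print(a.shape, b.shape, t.shape)
print(np.shares_memory(a, b), np.shares_memory(a, t))
print(c is a)

# Modify one entry through c, then look at a and b again.
c[0, 0] = 100
print(a[0, 0], b[0])

# Repeat the assignment experiment with a scalar.
x = 5
y = x
y += 1
print(x, y)
```

Changing the array to something non-contiguous (for example, calling reshape(-1) on a.T) is a useful follow-up trial.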
The UC Irvine machine learning data repository hosts a famous dataset, the Pima Indians dataset, on whether a patient has diabetes. It was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. You can find it at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable. It has a total of 768 data points.
- Part 1-A) First, you will build a simple naive Bayes classifier to classify this data set. We will use 20% of the data for evaluation and the other 80% for training.
You should use a normal distribution to model each of the class-conditional distributions.
Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions.
- Part 1-B) Next, you will adjust your code so that, for attributes 3 (Diastolic blood pressure), 4 (Triceps skin fold thickness), 6 (Body mass index), and 8 (Age), it regards a value of 0 as a missing value when estimating the class-conditional distributions, and the posterior.
Report the accuracy of the classifier on the 20% that was held out for evaluation.
- Part 1-C) Last, you will gain some experience with SVMLight, an off-the-shelf implementation of Support Vector Machines (SVMs). For now, you don't need to understand much about SVMs; we will explore them in more depth in the following exercises. You will install SVMLight, which you can find at http://svmlight.joachims.org, to train and evaluate an SVM to classify this data.
You should NOT substitute NA values for zeros for attributes 3, 4, 6, and 8.
Report the accuracy of the classifier on the held-out 20%.
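The accuracy metric defined in Part 1-A (correct predictions as a fraction of total predictions) can be sketched as follows. The helper name accuracy and the toy arrays are hypothetical and are not part of the assignment's provided code.

```python
import numpy as np

def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    return np.mean(predicted_labels == true_labels)

# Invented toy labels: 4 of the 5 predictions are correct.
toy_true = np.array([0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1])
toy_acc = accuracy(toy_pred, toy_true)
print(toy_acc)
```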
The UC Irvine Machine Learning Data Repository hosts, on Kaggle, a famous collection of data on whether a patient has diabetes (the Pima Indians dataset), originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito.
You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The Kaggle website offers valuable visualizations of the original data dimensions in its dashboard. It is quite insightful to take the time and make sense of the data using their dashboard before applying any method to the data.
0.2 Information Summary
- Input/Output: This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not.
- Missing Data: For several attributes in this data set, a value of 0 may indicate a missing value of the variable.
- Final Goal: We want to build a classifier that can predict whether a patient has diabetes or not. To do this, we will train multiple kinds of models, and will handle the missing data differently for each method (i.e., some methods will ignore the missing values, while others will treat them explicitly).
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from aml_utils import test_case_checker

df = pd.read_csv('../BasicClassification-lib/diabetes.csv')
df.head()
0.1 Splitting The Data
First, we will shuffle the data completely, and forget about the order in the original csv file.
- The training and evaluation dataframes will be named train_df and eval_df, respectively.
- We will also create the 2-d numpy array train_features, whose number of rows is the number of training samples and whose number of columns is 8 (i.e., the number of features). We will define eval_features in a similar fashion.
- We will also create the 1-d numpy arrays train_labels and eval_labels, which contain the training and evaluation labels, respectively.
# Let's generate the split ourselves.
np_random = np.random.RandomState(seed=12345)
# Draw one uniform number per row, so that the indicators below are 1-d row masks.
rand_unifs = np_random.uniform(0, 1, size=df.shape[0])
division_thresh = np.percentile(rand_unifs, 80)
train_indicator = rand_unifs < division_thresh
eval_indicator = rand_unifs >= division_thresh
train_df = df[train_indicator].reset_index(drop=True)
train_features = train_df.loc[:, train_df.columns != 'Outcome'].values
train_labels = train_df['Outcome'].values
train_df.head()

eval_df = df[eval_indicator].reset_index(drop=True)
eval_features = eval_df.loc[:, eval_df.columns != 'Outcome'].values
eval_labels = eval_df['Outcome'].values
eval_df.head()

train_features.shape, train_labels.shape, eval_features.shape, eval_labels.shape
0.2 Pre-processing The Data
Some of the columns exhibit missing values. We will use a Naive Bayes Classifier later that will treat such missing values in a special way. To be specific, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), we should regard a value of 0 as a missing value.
Therefore, we will be creating the train_features_with_nans and eval_features_with_nans numpy arrays to be just like their train_features and eval_features counterparts, but with the zero values in those columns replaced with NaNs.
train_df_with_nans = train_df.copy(deep=True)
eval_df_with_nans = eval_df.copy(deep=True)
for col_with_nans in ['BloodPressure', 'SkinThickness', 'BMI', 'Age']:
    train_df_with_nans[col_with_nans] = train_df_with_nans[col_with_nans].replace(0, np.nan)
    eval_df_with_nans[col_with_nans] = eval_df_with_nans[col_with_nans].replace(0, np.nan)
train_features_with_nans = train_df_with_nans.loc[:, train_df_with_nans.columns != 'Outcome'].values
eval_features_with_nans = eval_df_with_nans.loc[:, eval_df_with_nans.columns != 'Outcome'].values
print('Here are the training rows with at least one missing value.')
print('')
print('You can see that such incomplete data points constitute a substantial part of the data.')
print('')
nan_training_data = train_df_with_nans[train_df_with_nans.isna().any(axis=1)]
nan_training_data
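One possible way to ignore missing entries when estimating per-feature statistics is to use numpy's NaN-aware reductions. The sketch below works on a small made-up array (toy_features) rather than the real data, and it is only an illustration of how NaNs can be skipped, not the assignment's prescribed treatment.

```python
import numpy as np

# Invented toy features with NaN marking the missing entries.
toy_features = np.array([[72.0, 35.0],
                         [np.nan, 29.0],
                         [64.0, np.nan],
                         [66.0, 23.0]])

# nanmean and nanvar compute each column's statistic over the non-NaN rows only.
col_means = np.nanmean(toy_features, axis=0)
col_vars = np.nanvar(toy_features, axis=0)
print(col_means, col_vars)
```

By contrast, np.mean and np.var would return NaN for any column that contains a missing entry.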
1. Part 1 (Building a simple Naive Bayes Classifier)
Consider a single sample $(\mathbf{x}, y)$, where the feature vector is denoted with $\mathbf{x}$, and the label is denoted with $y$. We will also denote the $j$-th feature of $\mathbf{x}$ with $x^{(j)}$.
According to the textbook, the Naive Bayes Classifier uses the following decision rule:
“Choose $y$ such that $\left[\log p(y) + \sum_{j} \log p(x^{(j)}|y)\right]$ is the largest”
However, we first need to define the probabilistic models of the prior $p(y)$ and the class-conditional feature distributions $p(x^{(j)}|y)$ using the training data.
- Modelling the prior $p(y)$: We fit a Bernoulli distribution to the training labels.
- Modelling the class-conditional feature distributions $p(x^{(j)}|y)$: We fit Gaussian distributions, and infer the Gaussian mean and variance parameters from the training data.
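To make the decision rule and the two modelling steps concrete, here is a minimal sketch on invented toy data. All names (toy_x, toy_y, log_prior_toy, mu, var, predict) are hypothetical, and this is not the assignment's reference solution. For readability it uses short list comprehensions over the two classes, and it assumes there are no missing values.

```python
import numpy as np

# Toy data: 6 samples, 2 features, binary labels (all values invented).
toy_x = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [3.0, 0.5], [3.2, 0.7], [2.8, 0.3]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

classes = np.array([0, 1])
# Prior: log of the fraction of training labels in each class.
log_prior_toy = np.log(np.array([np.mean(toy_y == c) for c in classes]))
# Per-class, per-feature Gaussian parameters, each of shape (2, n_features).
mu = np.stack([toy_x[toy_y == c].mean(axis=0) for c in classes])
var = np.stack([toy_x[toy_y == c].var(axis=0) for c in classes])

def predict(x):
    """Return the class maximizing log p(y) + sum_j log p(x^(j) | y)."""
    # Gaussian log-density of each feature under each class, shape (2, n_features).
    log_lik = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
    scores = log_prior_toy + log_lik.sum(axis=1)
    return classes[np.argmax(scores)]

print(predict(np.array([1.1, 2.1])), predict(np.array([2.9, 0.6])))
```

Working with log-probabilities turns the product over features into a sum, which avoids numerical underflow when many small densities are multiplied.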
Write a function log_prior that takes a numpy array train_labels as input, and outputs the following vector as a column numpy array (i.e., with shape $(2,1)$).
$$\log p_y = \begin{bmatrix} \log p(y=0) \\ \log p(y=1) \end{bmatrix}$$
Try to avoid using loops as much as possible. No loops are necessary.
Hint: Make sure all the array shapes are what you need and expect. You can reshape any numpy array without any tangible computational overhead.
def log_prior(train_labels):
    # your code here
    raise NotImplementedError

    assert log_py.shape == (2,1)
    return log_py
# Performing sanity checks on your implementation
some_labels = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1])
some_log_py = log_prior(some_labels)
assert np.array_equal(some_log_py.round(3), np.array([[-0.916], [-0.511]]))

# Checking against the pre-computed test database
test_results = test_case_checker(log_prior, task_id=1)
assert test_results['passed'], test_results['message']
# This cell is left empty as a separator. You can leave this cell as it is, and you should not delete it.
log_py = log_prior(train_labels)
log_py