# Machine Learning Assignment | Naive Bayes Classifier

## * Prerequisites

In this assignment you will implement the Naive Bayes Classifier. Before starting, make sure you understand the concepts discussed in the Week 2 videos about Naive Bayes. You may also find it useful to read Chapter 1 of the textbook.

Also, make sure that you are familiar with the numpy.ndarray class of Python's numpy library and that you are able to answer the following questions:

Let’s assume a is a numpy array.

• What is an array's shape (e.g., what is the meaning of a.shape)?
• What is numpy's reshaping operation? How much computational overhead does it induce?
• What is numpy's transpose operation, and how is it different from reshaping? Does it cause computational overhead?
• What is the meaning of the commands a.reshape(-1, 1) and a.reshape(-1)?
• What happens to the variable a after we call b = a.reshape(-1)? Do any of the attributes of a change?
• How do assignments in Python and numpy work in general?
• Does the b=a statement copy by value, or does it copy by reference?
• Would the answer to the previous question change depending on whether a is a numpy array or a scalar value?

You can answer all of these questions by:

1. Reading numpy's documentation at https://numpy.org/doc/stable/.
2. Experimenting with dummy variables.
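
As a starting point for such experiments, here is a minimal sketch that probes the questions above on small dummy arrays. The variable names are arbitrary, and the behaviour shown applies to the contiguous arrays created here; it is meant as a prompt for your own trials rather than a complete answer.

```python
import numpy as np

a = np.arange(6)            # shape (6,)
col = a.reshape(-1, 1)      # shape (6, 1); -1 lets numpy infer that dimension
flat = col.reshape(-1)      # back to shape (6,)

# Reshaping a contiguous array returns a view: no data is copied,
# and the shape of the original array a is left untouched.
print(a.shape, col.shape, flat.shape)
print(np.shares_memory(a, col))

# Transposing also returns a view, but it swaps the axes instead of
# re-chunking the elements in their existing order.
m = np.arange(6).reshape(2, 3)
print(m.T.tolist())               # [[0, 3], [1, 4], [2, 5]]
print(m.reshape(3, 2).tolist())   # [[0, 1], [2, 3], [4, 5]]

# b = a binds a second name to the same array object,
# so an in-place change through b is visible through a (and its views).
b = a
b[0] = 100
print(a[0], col[0, 0])            # 100 100

# With a scalar, rebinding one name does not affect the other.
s = 1
t = s
t += 1
print(s, t)                       # 1 2
```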

## * Assignment Summary

The UC Irvine machine learning data repository hosts a famous dataset on whether a patient has diabetes, the Pima Indians dataset. It was originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito. You can find it at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not. For several attributes in this data set, a value of 0 may indicate a missing value of the variable. It has a total of 768 data points.

• Part 1-A) First, you will build a simple naive Bayes classifier to classify this data set. We will use 20% of the data for evaluation and the other 80% for training.

You should use a normal distribution to model each of the class-conditional distributions.

Report the accuracy of the classifier on the 20% evaluation data, where accuracy is the number of correct predictions as a fraction of total predictions.

• Part 1-B) Next, you will adjust your code so that, for attributes 3 (Diastolic blood pressure), 4 (Triceps skin fold thickness), 6 (Body mass index), and 8 (Age), it regards a value of 0 as a missing value when estimating the class-conditional distributions, and the posterior.

Report the accuracy of the classifier on the 20% that was held out for evaluation.

• Part 1-C) Last, you will get some experience with SVMLight, an off-the-shelf implementation of Support Vector Machines (SVMs). For now, you don't need to understand much about SVMs; we will explore them in more depth in the following exercises. You will install SVMLight, which you can find at http://svmlight.joachims.org, and use it to train and evaluate an SVM to classify this data.

You should NOT substitute NA values for zeros for attributes 3, 4, 6, and 8.

Report the accuracy of the classifier on the held-out 20%.
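
The accuracy requested in each part is the number of correct predictions as a fraction of total predictions. A minimal sketch of that computation follows; the function name `accuracy` and the toy arrays are illustrative assumptions, not part of the assignment's provided code.

```python
import numpy as np

def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    predicted_labels = np.asarray(predicted_labels)
    true_labels = np.asarray(true_labels)
    # The mean of a boolean array is the fraction of True entries.
    return float(np.mean(predicted_labels == true_labels))

# Toy example: 4 of the 5 predictions are correct.
toy_true = np.array([0, 1, 1, 0, 1])
toy_pred = np.array([0, 1, 0, 0, 1])
print(accuracy(toy_pred, toy_true))  # 0.8
```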

## 0. Data

### 0.1 Description

UC Irvine's Machine Learning Data Repository hosts a famous collection of data on whether a patient has diabetes (the Pima Indians dataset), originally owned by the National Institute of Diabetes and Digestive and Kidney Diseases and donated by Vincent Sigillito.

You can find this data at https://www.kaggle.com/uciml/pima-indians-diabetes-database/data. The Kaggle website offers valuable visualizations of the original data dimensions in its dashboard. It is quite insightful to take the time and make sense of the data using their dashboard before applying any method to the data.

### 0.2 Information Summary

• Input/Output: This data has a set of attributes of patients, and a categorical variable telling whether the patient is diabetic or not.
• Missing Data: For several attributes in this data set, a value of 0 may indicate a missing value of the variable.
• Final Goal: We want to build a classifier that can predict whether a patient has diabetes or not. To do this, we will train multiple kinds of models, and will handle the missing data differently for each method (i.e., some methods will ignore the missing values, while others will treat them explicitly).

In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from aml_utils import test_case_checker

In [ ]:
df = pd.read_csv('../BasicClassification-lib/diabetes.csv')
df.head()

### 0.3 Splitting The Data

First, we will shuffle the data completely, and forget about the order in the original csv file.

• The training and evaluation dataframes will be named train_df and eval_df, respectively.
• We will also create the 2-d numpy array train_features whose number of rows is the number of training samples, and whose number of columns is 8 (i.e., the number of features). We will define eval_features in a similar fashion.
• We will also create the 1-d numpy arrays train_labels and eval_labels, which contain the training and evaluation labels, respectively.
In [ ]:
# Let's generate the split ourselves.
np_random = np.random.RandomState(seed=12345)
rand_unifs = np_random.uniform(0,1,size=df.shape[0])
division_thresh = np.percentile(rand_unifs, 80)
train_indicator = rand_unifs < division_thresh
eval_indicator = rand_unifs >= division_thresh

In [ ]:
train_df = df[train_indicator].reset_index(drop=True)
train_features = train_df.loc[:, train_df.columns != 'Outcome'].values
train_labels = train_df['Outcome'].values

In [ ]:
eval_df = df[eval_indicator].reset_index(drop=True)
eval_features = eval_df.loc[:, eval_df.columns != 'Outcome'].values
eval_labels = eval_df['Outcome'].values

In [ ]:
train_features.shape, train_labels.shape, eval_features.shape, eval_labels.shape
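
To see why thresholding uniform random numbers at their 80th percentile gives an 80/20 split, the same masking logic can be tried on synthetic values. This is only a sketch: the seed and the variable names below are assumptions chosen for illustration, and 768 is the number of data points stated in the assignment summary.

```python
import numpy as np

n_rows = 768                              # data points stated in the summary
rng = np.random.RandomState(seed=0)       # hypothetical seed, for illustration only
unifs = rng.uniform(0, 1, size=n_rows)    # one random number per row
thresh = np.percentile(unifs, 80)         # roughly 80% of the values fall below this

train_mask = unifs < thresh
eval_mask = unifs >= thresh

# Every row lands in exactly one of the two subsets.
print(bool(np.all(train_mask ^ eval_mask)))
# The subset sizes add up to the number of rows.
print(train_mask.sum(), eval_mask.sum())
```

Because each row draws an independent random number, the split is effectively a random shuffle followed by a cut, regardless of the row order in the csv file.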

### 0.4 Pre-processing The Data

Some of the columns exhibit missing values. We will use a Naive Bayes Classifier later that will treat such missing values in a special way. To be specific, for attribute 3 (Diastolic blood pressure), attribute 4 (Triceps skin fold thickness), attribute 6 (Body mass index), and attribute 8 (Age), we should regard a value of 0 as a missing value.

Therefore, we will create the train_features_with_nans and eval_features_with_nans numpy arrays to be just like their train_features and eval_features counterparts, but with the zero values in those columns replaced with NaNs.

In [ ]:
train_df_with_nans = train_df.copy(deep=True)
eval_df_with_nans = eval_df.copy(deep=True)
for col_with_nans in ['BloodPressure', 'SkinThickness', 'BMI', 'Age']:
    train_df_with_nans[col_with_nans] = train_df_with_nans[col_with_nans].replace(0, np.nan)
    eval_df_with_nans[col_with_nans] = eval_df_with_nans[col_with_nans].replace(0, np.nan)
train_features_with_nans = train_df_with_nans.loc[:, train_df_with_nans.columns != 'Outcome'].values
eval_features_with_nans = eval_df_with_nans.loc[:, eval_df_with_nans.columns != 'Outcome'].values
In [ ]:
print('Here are the training rows with at least one missing value.')
print('')
print('You can see that such incomplete data points constitute a substantial part of the data.')
print('')
nan_training_data = train_df_with_nans[train_df_with_nans.isna().any(axis=1)]
nan_training_data
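
Once missing entries are marked as NaN, one possible way to estimate per-class Gaussian parameters while skipping them is numpy's NaN-aware reductions. The sketch below uses a tiny made-up array with two features; the values and variable names are assumptions for illustration, and this is not necessarily the approach the assignment expects.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features, with NaN marking missing entries.
toy_features = np.array([[72.0, 35.0],
                         [np.nan, 29.0],
                         [64.0, np.nan],
                         [66.0, 23.0]])
toy_labels = np.array([1, 0, 1, 0])

# np.nanmean and np.nanstd ignore NaN entries column by column,
# so each feature is estimated only from the values that are present.
mu_0 = np.nanmean(toy_features[toy_labels == 0], axis=0)
sigma_0 = np.nanstd(toy_features[toy_labels == 0], axis=0)
mu_1 = np.nanmean(toy_features[toy_labels == 1], axis=0)
sigma_1 = np.nanstd(toy_features[toy_labels == 1], axis=0)

print(mu_0, sigma_0)   # class 0: means [66, 26], standard deviations [0, 3]
print(mu_1, sigma_1)   # class 1: means [68, 35], standard deviations [4, 0]
```

Note that a feature observed only once within a class gets a standard deviation of 0 in this toy example, which would need care before being plugged into a Gaussian density.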

## 1. Part 1 (Building a simple Naive Bayes Classifier)

Consider a single sample $(x, y)$, where the feature vector is denoted with $x$, and the label is denoted with $y$. We will also denote the $j$-th feature of $x$ with $x^{(j)}$.

According to the textbook, the Naive Bayes Classifier uses the following decision rule:

“Choose $y$ such that

$$\left[\log p(y) + \sum_{j} \log p(x^{(j)} \mid y)\right]$$

is the largest”

However, we first need to define the probabilistic models of the prior $p(y)$ and the class-conditional feature distributions $p(x^{(j)} \mid y)$ using the training data.

• Modelling the prior $p(y)$: We fit a Bernoulli distribution to the Outcome variable of train_df.
• Modelling the class-conditional feature distributions $p(x^{(j)} \mid y)$: We fit Gaussian distributions, and infer the Gaussian mean and variance parameters from train_df.
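
To make the decision rule concrete, here is a minimal sketch that scores two classes with Gaussian class-conditionals and picks the larger score. The function names `gaussian_log_pdf` and `naive_bayes_predict`, the array layouts, and the toy parameters are all assumptions made for illustration; they are not the functions the assignment asks you to write.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log-density of a univariate normal, applied elementwise."""
    return -0.5 * np.log(2.0 * np.pi * sigma ** 2) - 0.5 * ((x - mu) / sigma) ** 2

def naive_bayes_predict(features, log_py, mu, sigma):
    """
    features: (N, d) array of samples.
    log_py:   (2, 1) array of log prior probabilities.
    mu, sigma: (2, d) arrays of per-class, per-feature Gaussian parameters.
    Returns an (N,) array with the class index of the largest score.
    """
    # Broadcast to shape (2, N, d): one log-density per class, sample, feature.
    log_lik = gaussian_log_pdf(features[None, :, :], mu[:, None, :], sigma[:, None, :])
    # Sum over features and add the log prior: scores has shape (2, N).
    scores = log_py + log_lik.sum(axis=2)
    return scores.argmax(axis=0)

# Toy parameters: class 0 centred at 0, class 1 centred at 5, equal priors.
toy_log_py = np.log(np.array([[0.5], [0.5]]))
toy_mu = np.array([[0.0, 0.0], [5.0, 5.0]])
toy_sigma = np.ones((2, 2))
toy_x = np.array([[0.1, -0.2], [4.8, 5.3]])
print(naive_bayes_predict(toy_x, toy_log_py, toy_mu, toy_sigma))  # [0 1]
```

Working with sums of logs instead of products of probabilities avoids numerical underflow when many features are multiplied together.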

Write a function log_prior that takes a numpy array train_labels as input, and outputs the following vector as a column numpy array (i.e., with shape $(2,1)$).

$$\log p_y = \begin{bmatrix} \log p(y=0) \\ \log p(y=1) \end{bmatrix}$$

Try to avoid loops as much as possible; none are necessary.

Hint: Make sure all the array shapes are what you need and expect. You can reshape any numpy array without any tangible computational overhead.
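
As an illustration of the idea, and not a reference solution, the log of the empirical class frequencies for binary labels can be computed without loops. The helper name `log_class_frequencies` and the toy labels are hypothetical, and the sketch assumes the labels contain only 0s and 1s with both classes present.

```python
import numpy as np

def log_class_frequencies(labels):
    """Log of the empirical frequency of labels 0 and 1, as a (2, 1) column."""
    labels = np.asarray(labels)
    p1 = labels.mean()                    # fraction of ones
    probs = np.array([1.0 - p1, p1])      # [p(y=0), p(y=1)]
    return np.log(probs).reshape(2, 1)    # reshape to a column vector

# Toy labels: two zeros and three ones, so the frequencies are 0.4 and 0.6.
toy_labels = np.array([0, 0, 1, 1, 1])
print(log_class_frequencies(toy_labels).round(3).tolist())  # [[-0.916], [-0.511]]
```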

In [ ]:
def log_prior(train_labels):

    # your code here
    raise NotImplementedError

    assert log_py.shape == (2,1)

    return log_py
In [ ]:
# Performing sanity checks on your implementation
some_labels = np.array([0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1])
some_log_py = log_prior(some_labels)
assert np.array_equal(some_log_py.round(3), np.array([[-0.916], [-0.511]]))

# Checking against the pre-computed test database
test_results = test_case_checker(log_prior, task_id=1)
assert test_results['passed'], test_results['message']
In [ ]:
# This cell is left empty as a separator. You can leave this cell as it is, and you should not delete it.
In [ ]:
log_py = log_prior(train_labels)
log_py
