# 程序代写｜PACSLZ-501001: Statistical Inference for Big Data Exercise 5

This is the fifinal project of this course. Our task is to write our own learning and classifification program on the synthesized data set we used in Homework 4. As a reminder,these data samples x R2 are generated according to a mixed-Gaussian model nClusters

pX|Y (x|i) = X cj · N(x; µi,j , I2), i = 0, 1 j=1

where µi,j ’s are some randomly chosen cluster centers, I2 is a 2 × 2 identity covariance matrix,and cj ’s satisfy that P j cj = 1. A typical plot of the generated data, with nClusters = 10 is shown in the following fifigure. We have shown in Homework 4 that this dataset can be

learned and classifified with a simple neural network. The goal of this project is to write our own learning algorithm, based on iterative updating of the parameters, to do the same task,so that we can develop an understanding of what are the key ingredients inside a neural network.

Model Family and Parameters

To start, we defifine a model family. Of course we could just defifine the family to be mixed Gaussians, but that would appear to be cheating. A commonly used model family is given in a very general form as follows.

PY |X(i|x; θ) =eFi(x)/P i eFi(x) y = i Y

(1)for some set of functions Fi(x) for each value i that Y can take. Now a quick observation is that adding a constant to every Fi(x) does not change the ratio, so w.o.l.g. we often take a convenient constraint that

Xi Fi(x) = 0,

for all x.

In this project, we will only consider the simple case that Y = {0, 1}, i.e., we only have two classes, so we have F1(x) = F0(x) = F(x). So we will work with a slightly simpler model as

PY |X(1|x) =eF(x)/e

F(x) + eF(x) , PY |X(0|x) =eF(x)/eF(x) + eF(x)

(2)and I will drop the vector sign on X in the following whenever it does not cause confusion.

We assume that

F(x) = k1 X j=0 weights[j] · b(x, center[j]) (3)

i.e., a linear combination of some base function b(x, center) that is easy to compute and easy to fifit, which we will give an example later. All we need to worry about is to fifind these weights and the center coeffiffifficients of the base functions. We allow k of them in total, which gives the complexity of the model. We write θ = (weights[j] R, center[j] R2 , j = 0, . . . , k 1) as the parameters of the distributions. θ is what we need to learn from the data. When we are given a set of training samples (x[i], y[i]), i = 0, . . . , n 1, we need to fifind the maximum likelihood estimate of θ

θˆML = arg maxθ 1/nn1Xi=0

log PY |X(y[i]|x[i])!

(4) where PY |X depends on the parameters θ as defifined in (2) and (3).

Short Discussion 1 You might feel that the form in (2) a bit strange. Why don’t we directly write the distributions as linear combinations of simple functions? It turns out that this particular form of doing linear combinations on the exponent is convenient when we have a lot of samples, and we would like to add up their log likelihood values. This form is closely related to a concept called “exponential family”, which is very important but we will not go into details in this course. You are certainly encouraged to try with other forms of distribution families.

The base function b(x, center) can be chosen as a uni-modal function that is easy to compute. We would like the function value to be larger for x close to center and smaller otherwise. For example, a simple choice can be

b(x, center) = ( 1 if k x centerk 2 < r2 0 oherwirse

for r as global parameters. A sensible choice I take is r2 = 20, such that most samples x generated from N(center, I2) would have b(x, center) = 1. A slightly better choice that emphasizes on samples closer to the center and gradually decrease the base function values is a truncated quadratic function

b(x, center) = ( r 2 − kx centerk 2 if k x centerk 2 < r2 0 otherwise

This should be compared with neural networks, where the outputs of the last layer of nodes can be thought as such base functions, but chosen from some difffferent families of functions that can be generated by the previous layers of activation functions. The last layer combining corresponds to using the weights to form a new linear combination. It turns out that when we use the right loss function, the optimization is exactly the same as matching distributions by using the linear combination on the exponent as we do here.

Task 1. You need to implement an effiffifficient way to evaluate the base functions. In particular,with input x as an array of samples, [nSamples, 2], and the center as a 2-dimensional array,the function should return an array of base function values for all samples together.

Task 2. In order for us to debug later and monitor how the right set of parameter is learned,we should have a way to visualize the chosen function. This takes some plotting skills, and you should do this in a way of your own choice. An example code is given as follows.

p l t . f i g u r e ( )

p l t . rcParams [ ’ a xe s . f a c e c o l o r ’ ] = ’ gray ’

p l t . s c a t t e r ( c o e f [ : , 0 ] , c o e f [ : , 1 ] , s = c o e f [: , 1] 4 0 0 0 , c=w ei g h t s [ : ] ,

cmap = ’ bwr ’ , alph a = 0 . 5 )

p l t . s c a t t e r ( x t r a i n [ : , 0 ] , x t r a i n [ : , 1 ] , c = y t r a i n f l a t ,

cmap = ’ bwr ’ , alph a = 0 . 2 )

p l t . a x i s ( [ xmin , xmax , ymin , ymax ] )

p l t . show ( )

When you design your own visualization, you need several components:

• Scatter plot are probably the right choice to see all the samples as well as the centers of the base functions, which in the above code are stored in coef array (size k × 3),i.e., coef [:, 0] and coef [:, 1]. coef [:, 1] is used to set the size of the scatter.
• I used colors to indicate which class the samples belong, and how the weights corre sponding to classes 0 and 1. A colormap ‘bwr’ is used so that more blue and red colors are used to indicate the two difffferent classes. In this colormap, ‘white’ means “not sure which class”, so I had to change the background to gray.
• I made the plot somewhat transparent by the alpha parameter in the plots, so that when the samples and functions are plotted together I can still see both.
• Fix the axis of the plot is helpful.

Update the Parameters

Obviously, we start by randomly choosing all the parameters. We will use an iterative method to fifind the maximum likelihood estimate of these parameters. That is , in each step, we fifix all but one of the parameters, and update this one parameter to increase the likelihood.

Without loss of generality, let’s say that we would like to update θ0 = (center[0, :] R2, weights[0]). For convenience, let’s write

E-mail: vipdue@outlook.com  微信号:vipnxx