# CSCI-GA.2565-001 Machine Learning: Assignment 3

**1 Variational Inference and Monte Carlo Gradients**

In this question, we will review the details of variational inference (VI); in particular, we will implement the gradient estimators that make VI tractable.

We consider the latent variable model $p(\mathbf{z}, \mathbf{x}) = \prod_{i=1}^{N} p(\mathbf{x}_i \mid \mathbf{z}_i)\, p(\mathbf{z}_i)$, where $\mathbf{x}_i, \mathbf{z}_i \in \mathbb{R}^D$. Recall that in VI, we find an approximation $q_\lambda(\mathbf{z})$ to the posterior $p(\mathbf{z} \mid \mathbf{x})$.

(A) Let $V_1(\lambda)$ be the set of variational approximations $\{q_\lambda : q_\lambda(\mathbf{z}) = \prod_{i=1}^{N} q(\mathbf{z}_i; \lambda_i)\}$, where $\lambda_i$ are parameters learned for each datapoint $\mathbf{x}_i$. Now consider $f_\lambda(\mathbf{x})$, a deep neural network with *fixed* architecture, where $\lambda$ parametrizes the network. Let $V_2(\lambda) = \{q_\lambda : q_\lambda(\mathbf{z}) = \prod_{i=1}^{N} q(\mathbf{z}_i; f_\lambda(\mathbf{x}_i))\}$. Which of the two families ($V_1$ or $V_2$) is more expressive, i.e. approximates a larger set of distributions? **Prove** your answer.

Will your answer change if we let $f_\lambda$ represent a *variable* architecture, e.g. if $\lambda$ parametrizes the set of multi-layer perceptrons of all sizes? Why or why not?

*Solution.* For any choice of network, $f_\lambda(\mathbf{x}_i)$ can at best reproduce some setting of the per-datapoint parameters $\lambda_i$, so every member of $V_2$ is also a member of $V_1$; hence $V_1$ is at least as expressive as $V_2$. The answer does not change if the architecture is variable: even the set of MLPs of all sizes only produces mappings $\mathbf{x}_i \mapsto \lambda_i$, and each such mapping corresponds to some direct choice of the $\lambda_i$ in $V_1$.

(B) For variational inference to work, we need to compute unbiased estimates of the gradient of the ELBO.

In class, we learnt two such estimators: score function (REINFORCE) and pathwise (reparametrization) gradients. Let us see this in practice for a simpler inference problem.

Consider the dataset of $N = 100$ one-dimensional data points $\{x_i\}_{i=1}^{N}$ in data.csv. Suppose we want to minimize the following expectation with respect to a parameter $\mu$:

$$\min_\mu \; \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}\left[ \sum_{i=1}^{N} (x_i - z)^2 \right] \tag{1}$$

(i) Write down the score function gradient for this problem. Using a suitable reparametrization, write down the reparameterization gradient for this problem.

*Solution.* Score function gradient:

$$\nabla_\mu \, \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[ \sum_{i=1}^{N} (x_i - z)^2 \right] = \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[ \sum_{i=1}^{N} (x_i - z)^2 \, \nabla_\mu \log \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(z - \mu)^2}{2} \right) \right] = \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[ \sum_{i=1}^{N} (x_i - z)^2 \, (z - \mu) \right]$$

Reparameterization gradient, with $z = \mu + \epsilon$, $\epsilon \sim \mathcal{N}(0,1)$:

$$\nabla_\mu \, \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[ \sum_{i=1}^{N} (x_i - z)^2 \right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\left[ \nabla_\mu \sum_{i=1}^{N} (x_i - (\mu + \epsilon))^2 \right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\left[ -2 \sum_{i=1}^{N} (x_i - (\mu + \epsilon)) \right]$$
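The two estimators above can be checked numerically. The sketch below uses NumPy with synthetic data standing in for data.csv (both are assumptions for illustration); for this quadratic loss the exact gradient has the closed form $2N(\mu - \bar{x})$, which both Monte Carlo estimates should approach:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)  # stand-in for data.csv (synthetic)
N = len(x)

def loss(z):
    # f(z) = sum_i (x_i - z)^2, evaluated for each Monte Carlo sample in z
    return ((x[None, :] - z[:, None]) ** 2).sum(axis=1)

def score_function_grad(mu, M):
    # REINFORCE: E[ f(z) * d/dmu log N(z; mu, 1) ] = E[ f(z) * (z - mu) ]
    z = rng.normal(mu, 1.0, size=M)
    return (loss(z) * (z - mu)).mean()

def reparam_grad(mu, M):
    # Pathwise: z = mu + eps, so grad = E[ -2 * sum_i (x_i - (mu + eps)) ]
    eps = rng.normal(0.0, 1.0, size=M)
    z = mu + eps
    return (-2.0 * (x[None, :] - z[:, None]).sum(axis=1)).mean()

mu = 0.0
exact = 2.0 * N * (mu - x.mean())  # closed form: grad of E[f] is 2N(mu - x_bar)
print(score_function_grad(mu, 200000), reparam_grad(mu, 200000), exact)
```

Even with many samples, the score-function estimate fluctuates far more than the pathwise one, which previews the variance comparison requested in part (ii).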

(ii) Using PyTorch, and for each of these two gradient estimators, perform gradient descent using $M \in \{1, 10, 100, 1000\}$ gradient samples for $T = 10$ trials. Plot the mean and variance of the final estimate of $\mu$ for each value of $M$ across the $T$ trials.

*You should have two graphs, one for each gradient estimator. Each graph should contain two plots, one for the means and one for the variances. The $x$-axis should be $M$, hence each of these plots will have four points.*

*Solution.* (Plots of the mean and variance of the final $\mu$ for each $M$, for the score function and reparameterization estimators, are not reproduced in this excerpt.)
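The requested experiment can be sketched as follows. This version uses NumPy rather than PyTorch autograd for a dependency-free illustration, with synthetic data standing in for data.csv; the learning rate and step count are hypothetical choices, and plotting (e.g. with matplotlib) is left out:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=100)  # stand-in for data.csv (synthetic)
N = len(x)

def grad_estimate(mu, M, kind):
    # One M-sample Monte Carlo gradient estimate at the current mu.
    if kind == "score":
        z = rng.normal(mu, 1.0, size=M)
        f = ((x[None, :] - z[:, None]) ** 2).sum(axis=1)
        return (f * (z - mu)).mean()
    else:  # "reparam": z = mu + eps with eps ~ N(0, 1)
        z = mu + rng.normal(0.0, 1.0, size=M)
        return (-2.0 * (x[None, :] - z[:, None]).sum(axis=1)).mean()

def run_trial(M, kind, steps=100, lr=2e-4):
    # Plain gradient descent on mu using M-sample gradient estimates.
    mu = 0.0
    for _ in range(steps):
        mu -= lr * grad_estimate(mu, M, kind)
    return mu

results = {}  # (estimator, M) -> (mean, variance) of final mu over T trials
for kind in ("score", "reparam"):
    for M in (1, 10, 100, 1000):
        finals = [run_trial(M, kind) for _ in range(10)]  # T = 10 trials
        results[(kind, M)] = (np.mean(finals), np.var(finals))
```

Since the objective is minimized at $\mu = \bar{x}$, the trial means should cluster there for both estimators, while the trial variance should be markedly larger for the score-function estimator, especially at small $M$.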

(C) What conditions do you require on $p(z)$ and $f(z)$ (here $f(z) = \sum_{i=1}^{N} (x_i - z)^2$) for each of the two gradient estimators to be valid? Do these apply to both continuous and discrete distributions $p(z)$?

*Solution.* For the score function gradient, essentially no restriction is needed on $f(z)$, and $p(z)$ may be either continuous or discrete: we only require that $\log p(z; \lambda)$ be differentiable in the parameter $\lambda$ (plus mild regularity to interchange gradient and expectation). For the reparameterization gradient, $z$ must be expressible as a differentiable transformation of parameter-free noise, which requires $p(z)$ to be continuous, and $f(z)$ must be differentiable.
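The discrete case can be illustrated directly: for $z \sim \mathrm{Bernoulli}(\theta)$ and an arbitrary (even non-differentiable) $f$, the score-function estimator still recovers the exact gradient $f(1) - f(0)$. The values of $\theta$ and $f$ below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
f = lambda z: np.where(z == 1, 5.0, 2.0)  # arbitrary, non-differentiable f

# Score-function estimator of d/dtheta E[f(z)] for z ~ Bernoulli(theta):
# E[ f(z) * d/dtheta log p(z; theta) ], with
# log p(z; theta) = z log(theta) + (1 - z) log(1 - theta).
z = rng.binomial(1, theta, size=500000)
score = z / theta - (1 - z) / (1 - theta)
estimate = (f(z) * score).mean()

# Exact: d/dtheta [theta * f(1) + (1 - theta) * f(0)] = f(1) - f(0)
exact = 5.0 - 2.0
print(estimate, exact)
```

No pathwise estimator is available here, since a Bernoulli sample cannot be written as a differentiable function of $\theta$ and parameter-free noise.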

**2 Bayesian Parameters versus Latent Variables**

(A) Consider the model $y_i \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}_i, \sigma^2)$, where the inverse variance is distributed as $\lambda = 1/\sigma^2 \sim \mathrm{Gamma}(\alpha, \beta)$.

Show that the predictive distribution $y^\star \mid \mathbf{w}, \mathbf{x}^\star, \alpha, \beta$ for a datapoint $\mathbf{x}^\star$ follows a generalized T distribution

$$T(t; \nu, \mu, \theta) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\, \theta \sqrt{\pi\nu}} \left( 1 + \frac{1}{\nu} \left( \frac{t - \mu}{\theta} \right)^2 \right)^{-\frac{\nu+1}{2}}$$

with degrees of freedom $\nu = 2\alpha$, mean $\mu = \mathbf{w}^\top \mathbf{x}^\star$, and scale $\theta = \sqrt{\beta/\alpha}$. You may use the property $\Gamma(k) = \int_0^\infty x^{k-1} e^{-x}\, dx$.

*Solution.* Combine the Gamma prior on the precision,

$$p(\lambda) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda},$$

with the Gaussian likelihood,

$$p(y^\star \mid \mathbf{w}, \mathbf{x}^\star, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^\star - \mathbf{w}^\top \mathbf{x}^\star)^2}{2\sigma^2} \right),$$

by marginalizing out the precision: $p(y^\star \mid \mathbf{w}, \mathbf{x}^\star, \alpha, \beta) = \int_0^\infty p(y^\star \mid \mathbf{w}, \mathbf{x}^\star, \sigma^2 = 1/\lambda)\, p(\lambda)\, d\lambda$.
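Combining the two densities and integrating out $\lambda$ with the Gamma-integral property gives (a sketch of the remaining step):

$$p(y^\star \mid \mathbf{w}, \mathbf{x}^\star, \alpha, \beta) = \int_0^\infty \sqrt{\frac{\lambda}{2\pi}}\, e^{-\frac{\lambda}{2}(y^\star - \mathbf{w}^\top \mathbf{x}^\star)^2} \cdot \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda = \frac{\beta^\alpha\, \Gamma\left(\alpha + \frac{1}{2}\right)}{\Gamma(\alpha)\sqrt{2\pi}} \left( \beta + \frac{(y^\star - \mathbf{w}^\top \mathbf{x}^\star)^2}{2} \right)^{-\left(\alpha + \frac{1}{2}\right)}$$

Substituting $\nu = 2\alpha$ and $\theta = \sqrt{\beta/\alpha}$, this is exactly $T(y^\star; \nu, \mathbf{w}^\top \mathbf{x}^\star, \theta)$, since $\theta\sqrt{\pi\nu} = \sqrt{2\pi\beta}$ and $\frac{1}{\nu}\left(\frac{y^\star - \mu}{\theta}\right)^2 = \frac{(y^\star - \mu)^2}{2\beta}$.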

(B) Using your expression from (A), write down the MLE objective for $\mathbf{w}$ on $N$ arbitrary labelled datapoints $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. Do not optimize this objective.

*Solution.*

$$\prod_{i=1}^{N} p(y_i \mid \mathbf{w}, \mathbf{x}_i, \alpha, \beta) = \prod_{i=1}^{N} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\, \theta \sqrt{\pi\nu}} \left( 1 + \frac{1}{\nu} \left( \frac{y_i - \mathbf{w}^\top \mathbf{x}_i}{\theta} \right)^2 \right)^{-\frac{\nu+1}{2}}$$
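In practice one would work with the negative log of this objective. A minimal sketch, where `t_nll` is a hypothetical helper name and the heavy-tail parameters follow the substitutions $\nu = 2\alpha$, $\theta = \sqrt{\beta/\alpha}$ from part (A):

```python
import numpy as np
from math import lgamma, log, pi

def t_nll(w, X, y, alpha, beta):
    """Negative log-likelihood of the generalized-T MLE objective (sketch).

    Uses nu = 2 * alpha and theta = sqrt(beta / alpha) from part (A).
    """
    nu = 2.0 * alpha
    theta = np.sqrt(beta / alpha)
    resid = y - X @ w
    # Log normalizing constant of the generalized T density.
    log_norm = lgamma((nu + 1) / 2) - lgamma(nu / 2) - log(theta) - 0.5 * log(pi * nu)
    # Per-datapoint log-likelihood; log1p for numerical stability.
    log_lik = log_norm - (nu + 1) / 2 * np.log1p((resid / theta) ** 2 / nu)
    return -log_lik.sum()
```

As a sanity check, with $\alpha = \beta = 1/2$ we get $\nu = 1$, $\theta = 1$, and the density reduces to a standard Cauchy, $1/\left(\pi(1 + r^2)\right)$.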

(C) Now consider the model $y_i \sim \mathcal{N}\left( f(\mathbf{x}_i, \mathbf{z}_i, \mathbf{w}),\, \sigma^2 \right)$, where $\mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\sigma^2$ is known, and $f$ is a deep neural network parametrized by $\mathbf{w}$.

(i) Write down an expression for the predictive distribution $y^\star \mid \mathbf{X}, \mathbf{y}, \mathbf{x}^\star$, where $\mathbf{X}, \mathbf{y}$ denote the training datapoints. *(You may leave your answer as an integral.)*
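Because the integral over the latent $\mathbf{z}^\star$ is generally intractable for a neural network $f$, it is typically approximated by Monte Carlo. A minimal sketch, under the simplifying assumption that $\mathbf{w}$ is replaced by a point estimate `w_hat` (the full predictive would also integrate over the posterior on $\mathbf{w}$); the function name and signature are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_density(y_star, x_star, f, w_hat, sigma2, S=10000):
    """Monte Carlo estimate of
        p(y* | x*) ~= E_{z ~ N(0, I)}[ N(y*; f(x*, z, w_hat), sigma^2) ],
    a sketch that plugs in a point estimate w_hat for w."""
    D = x_star.shape[0]
    z_samples = rng.normal(size=(S, D))  # z* ~ N(0, I)
    mu = np.array([f(x_star, z, w_hat) for z in z_samples])
    return np.mean(np.exp(-(y_star - mu) ** 2 / (2.0 * sigma2))
                   / np.sqrt(2.0 * np.pi * sigma2))
```

Averaging Gaussian densities over sampled $\mathbf{z}^\star$ values gives an unbiased estimate of the marginal, since the integrand is exactly an expectation under $\mathcal{N}(\mathbf{0}, \mathbf{I})$.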
