# Python辅导 | HOMEWORK 4 – V4 CSC311 Fall 2019

HOMEWORK 4 – V4
CSC311 Fall 2019
Version history:
V0 ! V1: Please fill out the course evaluation.
V1 ! V2: add hint to q1 for the update rule for r(n)
V2 ! V3: fix indentation issue in starter code fnc km_assignment_step
clarify how to break ties in ✏-greedy q2.1
V3 ! V4: update cost function in starter code to match slides (ok to use old version)
clarify Boltzmann policy in qlearn function
• Deadline: Dec. 4, at 23:59.
• Submission: You need to submit one pdf file through MarkUs including all your answers,
plots, and your code. You can produce the file however you like (e.g. LATEX, Microsoft Word,
etc), as long as it is readable. Points will be deducted if we have a hard time reading your
solutions or understanding the structure of your code. For each question, you should append
1. Unsupervised Learning, 30 pts. In this problem, we will experiment with two clustering algorithms: (i) k-means, and (ii) EM algorithm for Gaussian mixtures. In what follows, N
denotes the number of data points, D denotes the dimension of each data point, xi 2 RD denotes
a single data point, K denotes the number of clusters, and xi ⇠ p(x|z = k) for k = 1, …, K
denotes the data generating distribution, and p(z = k) is the prior on the class (cluster in this
case).
(a) First, we will generate some data for this problem. Set the number of points N = 400,
their dimension D = 2, and the number of clusters K = 2, and generate data from the
distribution p(x|z = k) = N (µk, ⌃k). Sample 200 data points for k = 1 and 200 for k = 2,
with
µ1 =

0.1
0.1

, µ2 =

6.0
0.1

and ⌃1 = ⌃2 =

10 7
7 10
Here, N = 400. Since you generated the data, you already know which sample comes from
which class. Run the cell in the IPython notebook to generate the data.
Make a scatter plot of the data points showing the true cluster assignment of each point
either using di↵erent color codes or shape or both.
(b) Now, we assume that the true class labels are not known. Implement the k-means algorithm
for this problem. Write two functions: km_assignment_step, and km_refitting_step as
given in the lecture (Here, km_ means k-means). Identify the correct arguments, and the
order to run them. Initialize the algorithm with
µˆ 1 =

0.0
0.0

, µˆ 2 =

1.0
1.0

(1.1)
and run it until convergence. Show the resulting cluster assignments on a scatter plot either
using di↵erent color codes or shape or both. Also plot the cost vs. the number of iterations.
Report your misclassification error. Hint: you generated the data, you know the correct
labels. Your unsupervised learning algorithm doesn’t.
1
aeeoeoss
(c) Next, implement the EM algorithm for Gaussian mixtures. Write three functions: log-likelihood,
gm_e_step, and gm_m_step as given in the lecture. Identify the correct arguments, and
the order to run them. Initialize the algorithm with means as in (1.1), covariances with
⌃ˆ 1 = ⌃ˆ 2 = I, and ˆ⇡1 = ˆ⇡2.
In addition to the update equations in the lecture, for the M (Maximization) step, you need
to use this following equation to update the covariance matrix ⌃k:
⌃ˆ k = 1
Nk
X
N
n=1
r
(n)
k (x(n) µˆ k)(x(n) µˆ k)
> (1.2) ,
and for the E (Expectation) step, the update rule for r
(n)
k needs to be modified accordingly
(simply replace N (x|µˆ k, I) with N (x|µˆ k, ⌃ˆ k) in the update). Run the algorithm until convergence and show the resulting cluster assignments on a scatter plot either using di↵erent
color codes or shape or both. Also plot the log-likelihood vs. the number of iterations. Report
(d) Comment on the results:
(a) Compare the performance of k-Means and EM based on the resulting cluster assignments.
(b) Compare the performance of k-Means and EM based on their convergence rate. What
is the bottleneck for which method?
(c) Experiment with 5 di↵erent data realizations (generate new data), run your algorithms,
and summarize your findings. Does the algorithm performance depend on di↵erent
realizations of data?
2. Reinforcement Learning, 65 pts. In this portion of the assignment, you will write
software to implement a simulated environment and build a reinforcement learning agent that
discovers the optimal (shortest) path to a goal. The agent’s environment will look like:
Each cell is a state in the environment. The cell labeled “I” is the initial state of the agent.
The cell labeled “G” is the goal state of the agent. The black cells are barriers—states that are
inaccessible to the agent. At each time step, the agent may move one cell to the left, right, up, or
down. The environment does not wraparound. Thus, if the agent is in the lower left corner and
tries to move down, it will remain in the same cell. Likewise, if the agent is in the initial state
and tries to move up (into a barrier), it will remain in the same cell.
2.1. Implementation (20 points). You should implement a Q learning algorithm that selects
moves for the agent. The algorithm should perform exploration by choosing the action with the
maximum Q value 90% of the time, and choosing one of the four actions at random the remaining
10% of the time. We should ”break-ties” when the Q-values are zero for all the actions (happens
initially) by essentially choosing uniformly from the action. So now you have two conditions to
act randomly: for ✏ amount of the time, or if the Q values are all zero.
The simulation consist of a series of trials, each of which runs until the agent reaches the goal
state, or until it reaches a maximum number of steps, which you can set to 100. The reward at
the goal is 10, but at every other state is 0. You can set the parameter to 0.9.
2
I
G
2.2. Experiments.
1. Basic Q learning experiments (10 points).
(a) Run your algorithm several times on the given environment.
(b) Run your algorithm by passing in a list of 2 goal locations: (1,8) and (5,6). Note: we
are using 0-indexing, where (0,0) is top left corner. Report on the results.
2. Experiment with the exploration strategy, in the original environment (20 points).
(a) Try di↵erent ✏ values in ✏-greedy exploration: We asked you to use a rate of ✏=0.1, but
try also 0.5 and 0.01. Graph the results (for the 3 ✏-values) and discuss the costs and
benefits of higher and lower exploration rates.
(b) Try exploring with policy derived from the softmax of Q-values described in the
Q learning lecture. Use the values 2 {1, 3, 6} for your experiment, keeping fixed
throughout the training.
(c) Instead of fixing the = 0 to the initial value, we will increase the value of as the
number of episodes t increase:
(t) = 0ekt (2.1)
That is, the value is fixed for a particular episode. Run the training again for di↵erent
values of k 2 {0.05, 0.1, 0.25, 0.5}, keeping 0 = 1.0. Compare the results obtained with
this approach to those obtained with a static value.
3. Stochastic environments (15 points).
(a) Make the environment stochastic (uncertain), such that the agent only has say a 95%
chance of moving in the chosen direction, and a 5% chance of moving in a random
direction.
3
(b) Change the learning rule to handle the non-determinism, and experiment with di↵erent
values of the probability that the environment performs a random action prand 2
{0.05, 0.1, 0.25, 0.5} in this new rule. How does performance vary as the environment
becomes more stochastic?
Use the same parameters as in the first part, except change the alpha to be less than
1, e.g., ↵ = 0.5. Report your results.
2.3. Write-up. Hand in a brief summary of your experiments. For each sub-section, this summary should include a one paragraph overview of the problem and your implementation. It should
include a graph showing number of steps required to reach the goal as a function of learning trials
(one trial is one run of the agent through the environment until it reaches the goal or maximum
number of steps). You should also make a figure showing the policy of your agent for the first 2.1.1
section. The policy can be summarized by making an array of cells corresponding to the states of
the environment, and indicating the direction (up, down, left,right) that the agent is most likely
to move if it is in that state.
3. Course Evaluation, 5 pts. You will get the final 5 points on the assignment if you
confirm that you submitted your course evaluation.
4
soon E-mail: vipdue@outlook.com  微信号:vipnxx 