Python代写Jupyter | CS 369 2019 Assignment HMM & MCMC

CS 369 2019

Student: Stu Dent

UPI: upi001

Assignment 4 : HMM & MCMC (13 + 2 points)

In this assignment you will write code to implement

• The forward and backward algorithms for hidden Markov models,
• The MCMC algorithm to sample the posterior distribution.

Plus 2 extra points for well commented code.

Due: Wednesday 12th June at 23:59

Instruction

Write your submission in a Jupyter notebook and write any code in Python 3. You should submit the .ipynb file. The primary document the markers will look at is the .ipynb file.

In your report, include explanations of what you are doing and comments in the code. Where you are making mathematical derivations, show your working.

Question 1 HMM

In Assignment 2, we introduced a simple HMM to model secondary structure in proteins with states 𝐻,𝑆,𝑇 and symbols 𝐵,𝐼,𝑁. Refer to that assignment for more detail about the model and its meaning. It has :

transition probabilities a =

⎛⎝⎜⎜⎜⎜𝐻𝑆𝑇𝐻0.950.03330.05𝑆0.010.91670.05𝑇0.040.050.90⎞⎠⎟⎟⎟⎟

and emission probabilites e =

⎛⎝⎜⎜⎜⎜𝐻𝑆𝑇𝐵0.350.550.10𝐼0.550.150.10𝑁0.100.300.80⎞⎠⎟⎟⎟⎟

i. Simulate a HMM (2 points)

Write a function simulate_HMM to simulate a state and symbol sequence pair (𝜋,𝑥) of length 100. Print 𝜋 and 𝑥 to output. Make sure that 𝜋 has at least two runs of 𝐻s (you can set the random seed using numpy.random.seed() until you are happy with your sequence). You will use this simulated pair for the remainder of this question.

ii. The forward algorithm (2 points)

Write a function to implement the forward algorithm to return a forward table matrix 𝐹 and the probability 𝑃(𝑥) of the observing sequence 𝑥. You need to print 𝑃(𝑥) and make explicit if it is in a log space. 𝑥 is simulated by Question i. For forward algorithm examples, see Lecture 16 & Tutorial 6.

[ ]

iii. The backward algorithm (2 points)

Write a function to implement the backward algorithm to return a matrix 𝐵 and the probability 𝑃(𝑥) of the observing sequence 𝑥. You need to print 𝑃(𝑥) and make explicit if it is in a log space. 𝑥 is simulated by Question i. For a backwards algorithm example, see Tutorial 6.

iv. Test (1 point)

Calculate the matrix P where the (𝑘,𝑖)th entry of P is 𝑃(𝑘,𝑖)=𝑃𝑟(𝜋𝑖=𝑘|𝑥), the posterior probability of being in state 𝑘 at step 𝑖. Notice that 𝑃(𝑘,𝑖) is not a log probability in the below equation. 𝐹(𝑘,𝑖) and 𝐵(𝑘,𝑖) are from the anwsers of above questions.

𝑃(𝑘,𝑖)=𝑃𝑟(𝜋𝑖=𝑘|𝑥)=𝐹𝑖(𝑘)𝐵𝑖(𝑘)𝑃(𝑥)

Print the matrix P, and show that your method is correct by checking the column sums of P are all 1.

v. Plot (1 point)

Make a plot of 𝑃𝑟(𝜋𝑖=𝐻|𝑥) against the index 𝑖 (that is, plot 𝑖 on the horizontal axis and 𝑃𝑟(𝜋𝑖=𝐻|𝑥) on the vertical axis). To embed the plot in your report, include in the first cell of your notebook the command %matplotlib inline

vi. Report (1 point)

By comparing your plot to the true state sequence 𝜋, discuss whether or not you think you could accurately find the 𝛼-helices in a given protein sequence using this model.

Question 2 MCMC

i. Write a method (1 points)

Implement an MCMC algorithm to sample the posterior distribution of the effective population size 𝑁𝑒 given the 10-taxa tree simulated by Simulation.py (available on Canvas). The proposal kernel should be a uniform random walk on the parameter 𝑁𝑒, so that 𝛿𝑈𝑛𝑖𝑓(10,10) and 𝑁𝑒=𝑁𝑒+𝛿. The lower boundary of zero can be treated by reflection or rejection. The target distribution is the coalescent likelihood:

𝑝(𝑇|𝑁𝑒)=𝑘=2𝑛(1𝑁𝑒𝑒𝑥𝑝(𝑘(𝑘1)𝑡𝑘2𝑁𝑒))

where 𝑡𝑘 is the time duration that the tree 𝑇 has 𝑘 lineages.

You may find that you need to calculation this target distribution in log space to avoid numerical underflow:

𝑙𝑛(𝑝(𝑇|𝑁𝑒))=𝑘=2𝑛(𝑙𝑛(𝑁𝑒)𝑘(𝑘1)𝑡𝑘2𝑁𝑒)

The the acceptance probability will be: 􏰑

𝛼=𝑚𝑖𝑛��(1,𝑒𝑥𝑝(𝑝2𝑝1)), where proposal state 𝑝2=𝑙𝑛(𝑝(𝑁𝑒|𝑇)) and current state 𝑝1=𝑙𝑛(𝑝(𝑁𝑒|𝑇))

Notice above I gave you the coalescent likelihood, 𝑝(𝑇|𝑁𝑒), not the posterior 𝑝(𝑁𝑒|𝑇), so you need to also define a prior 𝑝(𝑁𝑒) to compute the posterior (up to a constant). If you only use the coalescent likelihood as the target distribution you will be effectively using an (improper) uniform prior. Something like a diffuse log-normal makes a lot more sense.

You may test your method by printing out the “important” parameters in each state with a short-length run.

If the above equations are not displayed properly, you can see them from the attached png files.

[ ]

[ ]

ii. Plot and report (3 points)

Give a starting value 𝑁𝑒=100, and run a MCMC with the chain of length 10,000 steps:

1. Plot the posterior distribution of 𝑁𝑒 simpled from your MCMC run.
2. Plot the prior distribution, and report what prior you used.
3. Compare 1 and 2 to comment on how you would get a better estimate of the effective population size.

[ ]

[ ]

[ ]

E-mail: vipdue@outlook.com  微信号:vipdue

[WPCR_SHOW HIDEREVIEWS="1" ]