- There are 7 questions in this assignment for a total of 50 points.
- Read the academic misconduct information in the ROASS document.
- Note that assignments are to be done independently unless otherwise explicitly stated and that inclusion of materials from online sites is strictly forbidden.
- Show all your work clearly in your answers to all questions for full marks.
- No handwritten submissions will be accepted. Using a simple text editor will not produce the desired output. Use a word processor of your choice for your answers, then convert it to a single PDF file before uploading.
- Make sure your name and student number are on each page of your file.
- You must hand in your assignment electronically via UMLearn using the “Assignments” tab under the “Assessments” drop-down menu. A folder for Assignment 4 has been created there. Recall that you must agree to the online honesty document before the submission folder becomes visible to you.
- You may upload as many times as you like, but only the final upload will be saved and visible to us, so make sure every upload is the correct copy.
- Start NOW! Don’t leave it to the last minute.
1. Consider the following loop.
LOOP: LDUR X10, [X1, #0]
      LDUR X11, [X1, #8]
      ADD  X12, X10, X11
      SUBI X1, X1, #16
      CBNZ X12, LOOP
Assume (i) that perfect branch prediction is used (no stalls due to control hazards); (ii) that there are no delay slots; (iii) that the pipeline has full forwarding support; and (iv) that branches are resolved in the EX (as opposed to the ID) stage.
(a) Show a pipeline execution diagram for the first two iterations of this loop.
(b) Mark pipeline stages that do not perform useful work. How often while the pipeline is full do we have a cycle in which all five pipeline stages are doing useful work? (Begin with the cycle during which the SUBI is in the IF stage and end with the cycle during which the CBNZ is in the IF stage.)
2. Consider a program with the following cache behaviors.
(a) Suppose a CPU with a write-through, write-allocate cache achieves a CPI of 2. What are the read and write bandwidths (measured in bytes per cycle) between RAM and the cache? (Assume each miss generates a request for one block.)
(b) For a write-back, write-allocate cache, assuming 30% of replaced data cache blocks are dirty, what are the read and write bandwidths needed for a CPI of 2?
(c) Do additional calculations to (separately) demonstrate the changes in the bandwidth if we
- Double the DC miss rate, and
- Reduce the IC miss rate by half.
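As a rough illustration of the bookkeeping behind part (a), the sketch below computes per-cycle read and write traffic for a write-through, write-allocate cache. It assumes that every miss fetches one block, that each store also writes one 8-byte word through to RAM, and that one instruction is fetched per instruction executed. The function name, the parameter names, and the numbers in the usage lines are hypothetical placeholders, not the assignment's parameters.

```python
def write_through_bandwidth(reads_per_instr, writes_per_instr,
                            ic_miss_rate, dc_miss_rate,
                            block_bytes, cpi, store_bytes=8):
    """Return (read_bw, write_bw) in bytes per cycle.

    Assumptions (not taken from the assignment):
    - one instruction fetch per instruction, missing at ic_miss_rate
    - data reads and data writes both miss at dc_miss_rate, and
      write-allocate makes a write miss fetch a block as well
    - write-through sends store_bytes to RAM on every store
    """
    blocks_fetched_per_instr = (ic_miss_rate
                                + dc_miss_rate * (reads_per_instr + writes_per_instr))
    read_bw = blocks_fetched_per_instr * block_bytes / cpi
    write_bw = writes_per_instr * store_bytes / cpi
    return read_bw, write_bw


# Placeholder workload: 0.2 loads and 0.1 stores per instruction,
# 1% I-cache miss rate, 5% D-cache miss rate, 32-byte blocks, CPI of 2.
read_bw, write_bw = write_through_bandwidth(0.2, 0.1, 0.01, 0.05, 32, 2)
print(read_bw, write_bw)
```

For a write-back cache the write term would instead depend on how many replaced blocks are dirty, which is what part (b) asks you to work out.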
3. Consider the following instruction sequence, running on a 5-stage pipeline datapath:
ADD  X5, X2, X1
LDUR X3, [X5, #4]
LDUR X2, [X2, #0]
ORR  X3, X5, X3
STUR X3, [X5, #0]
(a) If there is no forwarding or hazard detection, insert NOPs to ensure correct execution.
(b) Now, change and/or rearrange the code to minimize the number of NOPs needed. You can assume register X7 can be used to hold temporary values in your modified code.
(c) If the processor has forwarding, but we forgot to implement the hazard detection unit, what happens when the original code executes?
(d) If there is forwarding, specify which signals are asserted by the hazard detection and forwarding units in each of the first seven cycles of this code's execution, using the figure below:
(e) If there is no forwarding, what new input and output signals do we need for the hazard detection unit in the above figure? Using this instruction sequence as an example, explain why each signal is needed.
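The NOP-insertion idea in part (a) can be sketched as a small helper. This is only an illustration under a stated assumption: the register file is written in the first half of a cycle and read in the second half, so a consumer must sit at least two instruction slots behind its producer when there is no forwarding. The function `insert_nops`, its tuple encoding of instructions, and the three-instruction sample are hypothetical and are not the sequence from this question.

```python
def insert_nops(instrs, gap=2):
    """Insert NOPs so every consumer is at least `gap` slots behind its producer.

    instrs: list of (text, dest_register_or_None, set_of_source_registers).
    gap=2 assumes a 5-stage pipeline without forwarding whose register file
    writes in the first half of a cycle and reads in the second half.
    """
    issued = []  # (text, dest) for every slot emitted so far, NOPs included
    for text, dest, sources in instrs:
        needed = 0
        # Inspect only the last `gap` slots; older producers are already safe.
        for distance, (_, prev_dest) in enumerate(reversed(issued[-gap:]), start=1):
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, gap - distance + 1)
        issued.extend([("NOP", None)] * needed)
        issued.append((text, dest))
    return [text for text, _ in issued]


# Hypothetical sample: SUB reads X1 right after ADD writes it.
sample = [
    ("ADD X1, X2, X3", "X1", {"X2", "X3"}),
    ("SUB X4, X1, X5", "X4", {"X1", "X5"}),
    ("ORR X6, X2, X1", "X6", {"X2", "X1"}),
]
for line in insert_nops(sample):
    print(line)
```

Stores can be encoded with `None` as the destination and both registers as sources, since they read the data register and the base register.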
4. Although a cache is named, by convention, according to the amount of data it holds (e.g., a 4 KiB cache can hold 4 KiB of data), caches also require SRAM to store metadata such as tags and valid bits. In the following questions, you will examine how a cache’s configuration affects the total amount of SRAM needed to implement it as well as the performance of the cache. Assume that the caches are byte addressable, and that addresses and words are 64 bits.
(a) Calculate the total number of bits required to implement a 32 KiB cache with 2-word blocks.
(b) Calculate the total number of bits required to implement a 64 KiB cache with 16-word blocks. How much bigger is this cache than the 32 KiB cache described in the previous question? Why can the amount of data be increased by only increasing the block size?
(c) Explain why the above 64 KiB cache, despite its larger data size, might provide slower performance than the first cache.
(d) Generate a series of read requests that have a lower miss rate on a 32 KiB 2-way set associative cache than on the cache described above.
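The split between data and metadata described in this question can be sketched as a short calculation. The sketch assumes a direct-mapped cache with one valid bit per block and no dirty bit, and it assumes the block count and block size are powers of two; none of these choices is stated in the question. The function name and the 16 KiB, 4-word configuration in the usage line are hypothetical and deliberately differ from parts (a) and (b).

```python
def cache_total_bits(data_bytes, words_per_block, addr_bits=64, word_bytes=8):
    """Total SRAM bits for a direct-mapped cache (assumed organization).

    Each block stores its data, a tag, and one valid bit.
    Block size and block count are assumed to be powers of two.
    """
    block_bytes = words_per_block * word_bytes
    num_blocks = data_bytes // block_bytes
    offset_bits = block_bytes.bit_length() - 1   # log2 of bytes per block
    index_bits = num_blocks.bit_length() - 1     # log2 of number of blocks
    tag_bits = addr_bits - index_bits - offset_bits
    bits_per_block = block_bytes * 8 + tag_bits + 1  # data + tag + valid
    return num_blocks * bits_per_block


# Hypothetical configuration: 16 KiB of data, 4-word (32-byte) blocks.
# 512 blocks, 5 offset bits, 9 index bits, 50 tag bits, 307 bits per block.
print(cache_total_bits(16 * 1024, 4))  # → 157184
```

Larger blocks mean fewer blocks for the same data capacity, so fewer tags and valid bits are stored; the formula makes that trade-off visible.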