CMSC411 Fall 2013 Final Exam (University of Maryland)
Name:
Instructions
• You have 120 minutes to take this exam.
• This exam has 120 points, or 1 minute per point.
• You do not need to provide a number if you can show the appropriate fraction. E.g.,
1/13 is acceptable in place of .0769.
• This is a closed book exam. No notes or other aids are allowed.
• If you have a question, please raise your hand and wait for the instructor.
• Answer essay questions concisely using 2-3 sentences. Longer answers are not
necessary and a penalty may be applied.
• In order to be eligible for partial credit, show all of your work and clearly indicate
your answers.
• Write neatly. Credit cannot be given for illegible answers.
Problem Score Max
1 Architectures 15
2 Memory Hierarchy 15
3 Branch Prediction 20
4 Pipelining 10
5 Instruction Scheduling 8
6 Dynamic Scheduling 20
7 Multiprocessor Cache 24
8 RAID Disk Arrays 8
Total 120
1. (15 pts) Architectures
a. (3 pts) Explain how modern CPUs can use a much larger number of internal
registers (e.g., 200) than can be specified in the actual instruction format (e.g., 32).
b. (3 pts) Explain how CPUs were able to achieve higher levels of parallelism than
ILP limit studies that assumed infinite cache size, perfect branch prediction, and
perfect memory disambiguation.
c. (3 pts) What additional information is needed by directory-based coherence
protocols, compared to snooping protocols?
d. (3 pts) Explain why graphics processing units (GPUs) have been increasing in
processing power faster than standard CPUs.
e. (3 pts) List one advantage and one disadvantage of warehouse scale computers
(WSC) compared to local servers.
2. (15 pts) Memory Hierarchy
Suppose we have a byte-addressable physical memory of size 4 GB (2^32 bytes).
a. (5 pts) The quad-core ARM Cortex-A15 CPU used in the Nvidia Tegra 4 found in
various Android devices has a 2 MB L2 cache (2^21 bytes, not including tag bits)
and a cache block size of 64 (2^6) bytes. The L2 cache is 16-way (2^4) associative.
Compute for the L2 cache the length in number of bits of the tag, index, and
offset fields of a 32-bit memory address (show your calculations).
b. (2 pts) Considering the answer to part (a), circle the bits representing the
index in the following 32-bit memory address (in binary):
1 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0
c. (2 pts) Now suppose we are considering using the Cortex-A15 CPU for a new
netbook with a solid state drive and virtual memory. We design the virtual
memory system to be size 256 GB (2^38 bytes), where pages are 16 KB (2^14
bytes) each. Compute the number of page table entries needed if all the pages are
being used.
d. (4 pts) Compute the size of the page table if each page table entry also requires 4
additional bits (valid, protection, dirty, use).
e. (2 pts) The Cortex-A15 CPU has two levels of cache. If the miss rates were 2%
for L1 cache, 1% for L2 cache, and 0.1% for memory, what percent of references
require accessing the disk (paged virtual memory)?
2^n    Value
2^1        2
2^2        4
2^3        8
2^4       16
2^5       32
2^6       64
2^7      128
2^8      256
2^9      512
2^10      1K
2^20      1M
2^30      1G
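The tag/index/offset decomposition asked for in part (a) follows directly from the cache geometry; as a sketch (not an answer key), the general calculation for any set-associative cache can be written as follows, using the Cortex-A15 L2 figures quoted above:

```python
import math

def address_fields(addr_bits, cache_bytes, block_bytes, ways):
    """Split an address into (tag, index, offset) bit widths
    for a set-associative cache."""
    offset_bits = int(math.log2(block_bytes))        # select a byte within a block
    num_sets = cache_bytes // (block_bytes * ways)   # sets = capacity / (block * ways)
    index_bits = int(math.log2(num_sets))            # select a set
    tag_bits = addr_bits - index_bits - offset_bits  # remaining bits form the tag
    return tag_bits, index_bits, offset_bits

# Cortex-A15 L2 from part (a): 2 MB cache, 64 B blocks, 16-way, 32-bit addresses
print(address_fields(32, 2**21, 2**6, 2**4))
```

The same function applies to any single-level cache question of this form; only the four parameters change.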
3. (20 pts) Branch prediction
For a loop containing two branches B1 & B2 (branch actions provided) for each loop
iterations, show on each loop iterations the state of the branch predictors (and branch history
tables, if needed), the predictions. Assume that all predictors are initialized to not taken, and
that the correlation bits are initially set to not taken. When multiple predictors may be used,
circle (or underline) the predictors used to make a prediction
a. (8 pts) (2,2) predictor w/ global history, without branch address
(standard 2-bit counter)
Loop      |        Branch B1              |        Branch B2
Iteration | predictor  prediction  action | predictor  prediction  action
1         |                          T    |                          T
2         |                          NT   |                          NT
3         |                          T    |                          NT
4         |                          NT   |                          T
Exit      |                               |
b. (8 pts) (1,1) predictor w/ local history + branch address
Loop      |        Branch B1              |        Branch B2
Iteration | predictor  prediction  action | predictor  prediction  action
1         |                          T    |                          T
2         |                          NT   |                          NT
3         |                          T    |                          NT
4         |                          NT   |                          T
Exit      |                               |
c. (4 pts) tournament predictor (saturating 2-bit counter)
Loop      | predictor | predictor | tournament |            |
Iteration |     X     |     Y     | predictor  | prediction | action
1         |     T     |     T     |            |            |   T
2         |     NT    |     NT    |            |            |   T
3         |     NT    |     T     |            |            |   T
4         |     NT    |     T     |            |            |   T
5         |     NT    |     T     |            |            |   T
6         |     T     |     NT    |            |            |   T
Exit      |           |           |            |            |
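As a reference for filling in the tables above, the standard 2-bit saturating counter named in parts (a) and (c) can be sketched as follows; states 0-1 predict not taken and states 2-3 predict taken, matching the exam's "initialized to not taken" assumption:

```python
class TwoBitCounter:
    """Saturating 2-bit branch predictor: states 0,1 -> predict NT; 2,3 -> T."""
    def __init__(self, state=0):   # 0 = strongly not taken, per the exam setup
        self.state = state

    def predict(self):
        return "T" if self.state >= 2 else "NT"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)  # saturate at strongly not taken

# feed it the B1 actions from part (a): T, NT, T, NT
p = TwoBitCounter()
for action in ["T", "NT", "T", "NT"]:
    print(p.predict(), end=" ")
    p.update(action == "T")
```

A plain 2-bit counter never reaches the taken half of its state space on this alternating pattern, which is exactly the weakness the correlating and tournament predictors in parts (a)-(c) are meant to expose.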
4. (10 pts) Pipelining
a. (4 pts) Consider the code on the left. List all RAW, WAR, and WAW hazards
found in the code, and the register causing the hazard. You can specify hazards in
the form (I3→I5 for F8).
b. (4 pts) Using the classic MIPS five-stage integer pipeline, show the timing of this
instruction sequence. Assume all memory accesses take 1 clock cycle, and a
register may be read and written in the same clock cycle. Assume normal
forwarding and bypassing hardware.
c. (2 pts) For part b), list all forwarding hardware used by each instruction (e.g., I1
used MEM→EX).
Instruction          Effect
LD F1, 0(Rx)         F1 ← Mem(Rx)
ADD.D F1, F2, F3     F1 ← F2 + F3
MULT.D F1, F2, F3    F1 ← F2 * F3
                       1   2   3   4    5   6   7   8   9   10  11  12
I1: LD F1, 0(Rx)       IF  ID  EX  MEM  WB
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
I1: LD F1, 0(Rx)
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
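The hazard enumeration asked for in part (a) can be done mechanically by comparing each instruction's destination and source registers against every later instruction; a minimal sketch (with the register lists for I1-I6 transcribed by hand rather than parsed):

```python
# (destination, sources) for I1..I6 from the sequence above
insns = [
    ("F1", ["Rx"]),        # I1: LD F1, 0(Rx)
    ("F2", ["Rx"]),        # I2: LD F2, 8(Rx)
    ("F3", ["F1", "F2"]),  # I3: MULT.D F3, F1, F2
    ("F4", ["F1", "F3"]),  # I4: ADD.D F4, F1, F3
    ("F5", ["F1", "F2"]),  # I5: ADD.D F5, F1, F2
    ("F5", ["Rx"]),        # I6: LD F5, 0(Rx)
]

def hazards(insns):
    out = []
    for i, (d1, s1) in enumerate(insns):
        for j in range(i + 1, len(insns)):
            d2, s2 = insns[j]
            if d1 in s2:
                out.append(f"RAW I{i+1}->I{j+1} for {d1}")  # later insn reads earlier dest
            if d2 in s1:
                out.append(f"WAR I{i+1}->I{j+1} for {d2}")  # later insn writes earlier source
            if d1 == d2:
                out.append(f"WAW I{i+1}->I{j+1} for {d1}")  # both write the same register
    return out

for h in hazards(insns):
    print(h)
```

Note that this lists every name dependence in program order; which of them cause actual pipeline stalls depends on the forwarding hardware assumed in part (b).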
5. (8 pts) Instruction scheduling
For the following questions, assume instructions must be issued in order (i.e., an
instruction cannot be issued until all previous instructions have been issued). Assume
instructions stall only for true/flow/RAW dependences. Assume instructions are
issued the earliest clock cycle legally possible. Use the instruction latencies in the
table above right.
a. (4 pts) Show how instructions would be scheduled (with stalls) for a single-issue
processor.
b. (4 pts) Show how instructions would be scheduled (with nops & stalls) for a dual-
issue processor.
Schedule for 5a                  Schedule for 5b
Cycle  Instruction               Cycle  Instruction   Instruction
1      I1                        1      I1
2                                2
3                                3
4                                4
5                                5
6                                6
7                                7
8                                8
9                                9
10                               10
11                               11
12                               12
Instruction Latency
Memory LD +2
ADD.D +0
MULT.D +1
I1: LD F1, 0(Rx)
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
6. (20 pts) Dynamic scheduling
Now consider how the same sequence of instructions is executed on a single-issue Tomasulo-
style CPU. Assume the following:
• The CPU has 1 of each: load buffer, FP adder, FP multiplier functional unit.
• An unlimited number of reservation stations for each functional unit & load buffer.
• Functional units are not pipelined.
• No forwarding between functional units; results can only come from the CDB.
• Functional units may start executing a 2nd instruction as soon as the current
instruction is finished, even if the result has not yet been placed on the CDB.
• If multiple instructions attempt to use the CDB in the same cycle, the instruction issued earliest goes first.
Instruction execution times (latencies) are provided as +n cycles, where an instruction
executes in 1+n cycles (i.e., spends 1+n cycles in the EX stage). For example: an instruction
with latency +2 executed at cycle 4 would finish in cycle 6 (4+2) and put its result on the
CDB in cycle 7 (if CDB is not busy).
For each problem part below, show what clock cycle each instruction is issued and when it
begins execution (i.e., enters its first EX cycle). Also show when each instruction writes the
CDB. If an instruction execution or completion stalls, list the length of stall and provide a
reason for the stall(s).
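The timing convention described above can be written down directly: given the cycle an instruction enters EX and its +n latency, it finishes in cycle exec + n and, if the CDB is free, broadcasts its result one cycle later. A sketch:

```python
def finish_and_cdb(exec_cycle, latency):
    """An instruction entering EX at exec_cycle with a +latency spends
    1 + latency cycles in EX, finishing at exec_cycle + latency;
    the CDB write (if the bus is free) is the following cycle."""
    finish = exec_cycle + latency
    return finish, finish + 1

# worked example from the problem statement: latency +2, execution starts at cycle 4
print(finish_and_cdb(4, 2))
```

This only models the uncontended case; when the CDB is busy, the earlier-issued instruction wins and the later one's write slips by a cycle, per the assumptions above.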
Instruction Latency
Memory LD +2
ADD.D +0
MULT.D +1
Instruction             Stalls   Issue   Exec   Write CDB
I1: LD F1, 0(Rx)                   1       2        5
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
I1: LD F1, 0(Rx)
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
7. (24 pts) Multiprocessor cache coherency
Consider a simple bus-based symmetric multiprocessor, where each processor has a single
private cache with coherence maintained with a snooping, write-back protocol. Each cache is
direct mapped, with 3 blocks each holding 1 word. Addrs 0 & 3 are mapped to block B0,
addrs 1 & 4 are mapped to block B1, and addrs 2 & 5 are mapped to block B2.
For each question, assume the initial cache state is shown below. Show the resulting state of
the caches and memory after each action. Show only the blocks that change, and also indicate
the value returned by each read operation. For instance, for [P3 reads 2], the answer is
[P3.B2: (Shared, 2, 6)], indicating Processor 3’s block B2 now has state = Shared, addr = 2,
and data = 6. The read operation returns 6.
a. (4 pts) Processor 2 reads 2
b. (4 pts) Processor 2 writes 2 ← 0
c. (4 pts) Processor 2 reads 1
d. (4 pts) Processor 2 writes 1 ← 0
e. (4 pts) Processor 2 writes 3 ← 0
f. (4 pts) Processor 1 reads 1
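The worked example in the problem statement ([P3 reads 2] yielding [P3.B2: (Shared, 2, 6)]) can be illustrated with a toy model of one cache's read path. Everything here except mem[2] = 6 and the 3-block direct mapping is a hypothetical simplification; in particular, this sketch ignores the case where another cache holds the block in the Modified state, which in the full protocol would force a write-back first:

```python
NUM_BLOCKS = 3  # direct-mapped: address addr maps to block addr % 3

# only mem[2] = 6 is given by the exam's example; the rest is hypothetical
mem = {0: 0, 1: 0, 2: 6, 3: 0, 4: 0, 5: 0}

def read(cache, addr):
    """Read miss handling for one snooping cache: on a miss (or a conflict
    with a different address in the block), fetch from memory and install
    the block in the Shared state; return the data."""
    blk = addr % NUM_BLOCKS
    entry = cache.get(blk)
    if entry is None or entry[1] != addr:        # cold miss or conflict miss
        cache[blk] = ("Shared", addr, mem[addr])
    return cache[blk][2]

p3 = {}  # P3's cache, empty for this illustration
val = read(p3, 2)
print(f"P3.B2: {p3[2]}, read returns {val}")
```

The answer format for parts (a)-(f) is exactly the (state, addr, data) triple this sketch produces for each block that changes.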
8. (8 pts) RAID disk arrays
RAID (redundant arrays of inexpensive disks) can be used to improve both
performance and reliability of hard disks. Consider the following data:
Bit1 Bit2 Bit3 Bit4 Bit5 Bit6 Bit7 Bit8
1 1 1 0 0 0 1 0
a. (2 pts) Show how it can be stored in a RAID level 0 system (striping) with 2 disks.
Disk Data
1
2
b. (4 pts) Show how it can be stored in a RAID level 3 system (striping+parity disk)
with 3 disks, where parity information is stored in disk 3.
Disk Data
1
2
3
c. (2 pts) Consider the following RAID level 3 system, where disk 2 has failed.
Rebuild disk 2 based on the parity information in disk 5.
Disk Data
1 0 1 1
2
3 1 0 0
4 1 1 1
5 1 1 0
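The rebuild asked for in part (c) relies on the standard RAID 3 invariant that the parity disk holds the XOR of the data disks, so any one failed disk is the XOR of all surviving disks, parity included. A sketch under that assumption:

```python
def rebuild(surviving):
    """Recover a failed disk by XOR-ing the corresponding bits of all
    surviving disks (including the parity disk). XOR == sum mod 2."""
    return [sum(bits) % 2 for bits in zip(*surviving)]

# disks 1, 3, 4 and the parity disk 5 from part (c); disk 2 has failed
disk1 = [0, 1, 1]
disk3 = [1, 0, 0]
disk4 = [1, 1, 1]
disk5 = [1, 1, 0]  # parity
print(rebuild([disk1, disk3, disk4, disk5]))
```

The same function recomputes parity itself when given only the data disks, which is how part (b)'s parity column would be generated.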