CMSC411 Fall 2013 Final Exam (University of Maryland)
Name:
Instructions
• You have 120 minutes to take this exam.
• This exam has 120 points, or 1 minute per point.
• You do not need to provide a number if you can show the appropriate fraction. E.g.,
1/13 is acceptable in place of .0769.
• This is a closed book exam. No notes or other aids are allowed.
• If you have a question, please raise your hand and wait for the instructor.
• Answer essay questions concisely using 2-3 sentences. Longer answers are not
necessary and a penalty may be applied.
• In order to be eligible for partial credit, show all of your work and clearly indicate
your answers.
• Write neatly. Credit cannot be given for illegible answers.
Problem Score Max
1 Architectures 15
2 Memory Hierarchy 15
3 Branch Prediction 20
4 Pipelining 10
5 Instruction Scheduling 8
6 Dynamic Scheduling 20
7 Multiprocessor Cache 24
8 RAID Disk Arrays 8
Total 120
1. (15 pts) Architectures
a. (3 pts) Explain how modern CPUs can use a much larger number of internal
registers (e.g., 200) than can be specified in the actual instruction format (e.g., 32).
b. (3 pts) Explain how CPUs were able to achieve higher levels of parallelism than
ILP limit studies that assumed infinite cache size, perfect branch prediction, and
perfect memory disambiguation.
c. (3 pts) What additional information is needed by directory-based coherence
protocols, compared to snooping protocols?
d. (3 pts) Explain why graphics processing units (GPUs) have been increasing in
processing power faster than standard CPUs.
e. (3 pts) List one advantage and one disadvantage of warehouse scale computers
(WSC) compared to local servers.
2. (15 pts) Memory Hierarchy
Suppose we have a byte-addressable physical memory of size 4 GB (2^32 bytes).
a. (5 pts) The quad-core ARM Cortex-A15 CPU used in the Nvidia Tegra 4 found in
various Android devices has a 2 MB L2 cache (2^21 bytes, not including tag bits)
and a cache block size of 64 (2^6) bytes. The L2 cache is 16-way (2^4) associative.
Compute for the L2 cache the length in number of bits of the tag, index, and
offset fields of a 32-bit memory address (show your calculations).
b. (2 pts) Considering the answer to part (a), circle the bits representing the
index in the following 32-bit memory address (in binary):
1 0 0 1 1 1 0 1 1 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 0
c. (2 pts) Now suppose we are considering using the Cortex-A15 CPU for a new
netbook with a solid state drive and virtual memory. We design the virtual
memory system to be size 256 GB (2^38 bytes), where pages are 16 KB (2^14
bytes) each. Compute the number of page table entries needed if all the pages are
being used.
d. (4 pts) Compute the size of the page table if each page table entry also requires 4
additional bits (valid, protection, dirty, use).
e. (2 pts) The Cortex-A15 CPU has two levels of cache. If the miss rates were 2%
for L1 cache, 1% for L2 cache, and 0.1% for memory, what percent of references
require accessing the disk (paged virtual memory)?
2^n    Value
2^1        2
2^2        4
2^3        8
2^4       16
2^5       32
2^6       64
2^7      128
2^8      256
2^9      512
2^10      1K
2^20      1M
2^30      1G
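The tag/index/offset decomposition asked for in part (a) follows directly from the cache geometry; as a sketch (not an answer key), the general calculation for any set-associative cache can be written as follows, using the Cortex-A15 L2 figures quoted above:

```python
import math

def address_fields(addr_bits, cache_bytes, block_bytes, ways):
    """Split an address into (tag, index, offset) bit widths
    for a set-associative cache."""
    offset_bits = int(math.log2(block_bytes))        # select a byte within a block
    num_sets = cache_bytes // (block_bytes * ways)   # sets = capacity / (block * ways)
    index_bits = int(math.log2(num_sets))            # select a set
    tag_bits = addr_bits - index_bits - offset_bits  # remaining bits form the tag
    return tag_bits, index_bits, offset_bits

# Cortex-A15 L2 from part (a): 2 MB cache, 64 B blocks, 16-way, 32-bit addresses
print(address_fields(32, 2**21, 2**6, 2**4))
```

The same function applies to any single-level cache question of this form; only the four parameters change.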
3. (20 pts) Branch prediction
For a loop containing two branches B1 & B2 (branch actions provided) for each loop
iterations, show on each loop iterations the state of the branch predictors (and branch history
tables, if needed), the predictions. Assume that all predictors are initialized to not taken, and
that the correlation bits are initially set to not taken. When multiple predictors may be used,
circle (or underline) the predictors used to make a prediction
a. (8 pts) (2,2) predictor w/ global history, without branch address
(standard 2-bit counter)
Loop      |        Branch B1              |        Branch B2
Iteration | predictor  prediction  action | predictor  prediction  action
1         |                          T    |                          T
2         |                          NT   |                          NT
3         |                          T    |                          NT
4         |                          NT   |                          T
Exit      |                               |
b. (8 pts) (1,1) predictor w/ local history + branch address
Loop      |        Branch B1              |        Branch B2
Iteration | predictor  prediction  action | predictor  prediction  action
1         |                          T    |                          T
2         |                          NT   |                          NT
3         |                          T    |                          NT
4         |                          NT   |                          T
Exit      |                               |
c. (4 pts) tournament predictor (saturating 2-bit counter)
Loop      | predictor | predictor | tournament |            |
Iteration |     X     |     Y     | predictor  | prediction | action
1         |     T     |     T     |            |            |   T
2         |     NT    |     NT    |            |            |   T
3         |     NT    |     T     |            |            |   T
4         |     NT    |     T     |            |            |   T
5         |     NT    |     T     |            |            |   T
6         |     T     |     NT    |            |            |   T
Exit      |           |           |            |            |
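As a reference for filling in the tables above, the standard 2-bit saturating counter named in parts (a) and (c) can be sketched as follows; states 0-1 predict not taken and states 2-3 predict taken, matching the exam's "initialized to not taken" assumption:

```python
class TwoBitCounter:
    """Saturating 2-bit branch predictor: states 0,1 -> predict NT; 2,3 -> T."""
    def __init__(self, state=0):   # 0 = strongly not taken, per the exam setup
        self.state = state

    def predict(self):
        return "T" if self.state >= 2 else "NT"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)  # saturate at strongly not taken

# feed it the B1 actions from part (a): T, NT, T, NT
p = TwoBitCounter()
for action in ["T", "NT", "T", "NT"]:
    print(p.predict(), end=" ")
    p.update(action == "T")
```

A plain 2-bit counter never reaches the taken half of its state space on this alternating pattern, which is exactly the weakness the correlating and tournament predictors in parts (a)-(c) are meant to expose.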
4. (10 pts) Pipelining
a. (4 pts) Consider the code on the left. List all RAW, WAR, and WAW hazards
found in the code, and the register causing the hazard. You can specify hazards in
the form (I3→I5 for F8).
b. (4 pts) Using the classic MIPS five-stage integer pipeline, show the timing of this
instruction sequence. Assume all memory accesses take 1 clock cycle, and a
register may be read and written in the same clock cycle. Assume normal
forwarding and bypassing hardware.
c. (2 pts) For part b), list all forwarding hardware used by each instruction (e.g., I1
used MEM→EX).
Instruction          Effect
LD F1, 0(Rx)         F1 ← Mem(Rx)
ADD.D F1, F2, F3     F1 ← F2 + F3
MULT.D F1, F2, F3    F1 ← F2 * F3
                       1   2   3   4    5   6   7   8   9   10  11  12
I1: LD F1, 0(Rx)       IF  ID  EX  MEM  WB
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
I1: LD F1, 0(Rx)
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
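The hazard enumeration asked for in part (a) can be done mechanically by comparing each instruction's destination and source registers against every later instruction; a minimal sketch (with the register lists for I1-I6 transcribed by hand rather than parsed):

```python
# (destination, sources) for I1..I6 from the sequence above
insns = [
    ("F1", ["Rx"]),        # I1: LD F1, 0(Rx)
    ("F2", ["Rx"]),        # I2: LD F2, 8(Rx)
    ("F3", ["F1", "F2"]),  # I3: MULT.D F3, F1, F2
    ("F4", ["F1", "F3"]),  # I4: ADD.D F4, F1, F3
    ("F5", ["F1", "F2"]),  # I5: ADD.D F5, F1, F2
    ("F5", ["Rx"]),        # I6: LD F5, 0(Rx)
]

def hazards(insns):
    out = []
    for i, (d1, s1) in enumerate(insns):
        for j in range(i + 1, len(insns)):
            d2, s2 = insns[j]
            if d1 in s2:
                out.append(f"RAW I{i+1}->I{j+1} for {d1}")  # later insn reads earlier dest
            if d2 in s1:
                out.append(f"WAR I{i+1}->I{j+1} for {d2}")  # later insn writes earlier source
            if d1 == d2:
                out.append(f"WAW I{i+1}->I{j+1} for {d1}")  # both write the same register
    return out

for h in hazards(insns):
    print(h)
```

Note that this lists every name dependence in program order; which of them cause actual pipeline stalls depends on the forwarding hardware assumed in part (b).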
5. (8 pts) Instruction scheduling
For the following questions, assume instructions must be issued in order (i.e., an
instruction cannot be issued until all previous instructions have been issued). Assume
instructions stall only for true/flow/RAW dependences. Assume instructions are
issued the earliest clock cycle legally possible. Use the instruction latencies in the
table above right.
a. (4 pts) Show how instructions would be scheduled (with stalls) for a single-issue
processor.
b. (4 pts) Show how instructions would be scheduled (with nops & stalls) for a dual-
issue processor.
Schedule for 5a                  Schedule for 5b
Cycle  Instruction               Cycle  Instruction   Instruction
1      I1                        1      I1
2                                2
3                                3
4                                4
5                                5
6                                6
7                                7
8                                8
9                                9
10                               10
11                               11
12                               12
Instruction Latency
Memory LD +2
ADD.D +0
MULT.D +1
I1: LD F1, 0(Rx)
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
6. (20 pts) Dynamic scheduling
Now consider how the same sequence of instructions is executed on a single-issue Tomasulo-
style CPU. Assume the following:
• The CPU has 1 of each: load buffer, FP adder, FP multiplier functional unit.
• An unlimited number of reservation stations for each functional unit & load buffer.
• Functional units are not pipelined.
• No forwarding between functional units; results can only come from the CDB.
• Functional units may start executing a 2nd instruction as soon as the current
instruction is finished, even if the result has not yet been placed on the CDB.
• If multiple instructions attempt to use the CDB in the same cycle, the instruction issued earliest goes first.
Instruction execution times (latencies) are provided as +n cycles, where an instruction
executes in 1+n cycles (i.e., spends 1+n cycles in the EX stage). For example: an instruction
with latency +2 executed at cycle 4 would finish in cycle 6 (4+2) and put its result on the
CDB in cycle 7 (if CDB is not busy).
For each problem part below, show what clock cycle each instruction is issued and when it
begins execution (i.e., enters its first EX cycle). Also show when each instruction writes the
CDB. If an instruction execution or completion stalls, list the length of stall and provide a
reason for the stall(s).
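The timing convention described above can be written down directly: given the cycle an instruction enters EX and its +n latency, it finishes in cycle exec + n and, if the CDB is free, broadcasts its result one cycle later. A sketch:

```python
def finish_and_cdb(exec_cycle, latency):
    """An instruction entering EX at exec_cycle with a +latency spends
    1 + latency cycles in EX, finishing at exec_cycle + latency;
    the CDB write (if the bus is free) is the following cycle."""
    finish = exec_cycle + latency
    return finish, finish + 1

# worked example from the problem statement: latency +2, execution starts at cycle 4
print(finish_and_cdb(4, 2))
```

This only models the uncontended case; when the CDB is busy, the earlier-issued instruction wins and the later one's write slips by a cycle, per the assumptions above.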
Instruction Latency
Memory LD +2
ADD.D +0
MULT.D +1
Instruction             Stalls   Issue   Exec   Write CDB
I1: LD F1, 0(Rx)                   1       2        5
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
I1: LD F1, 0(Rx)
I2: LD F2, 8(Rx)
I3: MULT.D F3, F1, F2
I4: ADD.D F4, F1, F3
I5: ADD.D F5, F1, F2
I6: LD F5, 0(Rx)
7. (24 pts) Multiprocessor cache coherency
Consider a simple bus-based symmetric multiprocessor, where each processor has a single
private cache with coherence maintained with a snooping, write-back protocol. Each cache is
direct mapped, with 3 blocks each holding 1 word. Addrs 0 & 3 are mapped to block B0,
addrs 1 & 4 are mapped to block B1, and addrs 2 & 5 are mapped to block B2.
For each question, assume the initial cache state is shown below. Show the resulting state of
the caches and memory after each action. Show only the blocks that change, and also indicate
the value returned by each read operation. For instance, for [P3 reads 2], the answer is
[P3.B2: (Shared, 2, 6)], indicating Processor 3’s block B2 now has state = Shared, addr = 2,
and data = 6. The read operation returns 6.
a. (4 pts) Processor 2 reads 2
b. (4 pts) Processor 2 writes 2 ← 0
c. (4 pts) Processor 2 reads 1
d. (4 pts) Processor 2 writes 1 ← 0
e. (4 pts) Processor 2 writes 3 ← 0
f. (4 pts) Processor 1 reads 1
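The worked example in the problem statement ([P3 reads 2] yielding [P3.B2: (Shared, 2, 6)]) can be illustrated with a toy model of one cache's read path. Everything here except mem[2] = 6 and the 3-block direct mapping is a hypothetical simplification; in particular, this sketch ignores the case where another cache holds the block in the Modified state, which in the full protocol would force a write-back first:

```python
NUM_BLOCKS = 3  # direct-mapped: address addr maps to block addr % 3

# only mem[2] = 6 is given by the exam's example; the rest is hypothetical
mem = {0: 0, 1: 0, 2: 6, 3: 0, 4: 0, 5: 0}

def read(cache, addr):
    """Read miss handling for one snooping cache: on a miss (or a conflict
    with a different address in the block), fetch from memory and install
    the block in the Shared state; return the data."""
    blk = addr % NUM_BLOCKS
    entry = cache.get(blk)
    if entry is None or entry[1] != addr:        # cold miss or conflict miss
        cache[blk] = ("Shared", addr, mem[addr])
    return cache[blk][2]

p3 = {}  # P3's cache, empty for this illustration
val = read(p3, 2)
print(f"P3.B2: {p3[2]}, read returns {val}")
```

The answer format for parts (a)-(f) is exactly the (state, addr, data) triple this sketch produces for each block that changes.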
8. (8 pts) RAID disk arrays
RAID (redundant arrays of inexpensive disks) can be used to improve both
performance and reliability of hard disks. Consider the following data:
Bit1 Bit2 Bit3 Bit4 Bit5 Bit6 Bit7 Bit8
1 1 1 0 0 0 1 0
a. (2 pts) Show how it can be stored in a RAID level 0 system (striping) with 2 disks.
Disk Data
1
2
b. (4 pts) Show how it can be stored in a RAID level 3 system (striping+parity disk)
with 3 disks, where parity information is stored in disk 3.
Disk Data
1
2
3
c. (2 pts) Consider the following RAID level 3 system, where disk 2 has failed.
Rebuild disk 2 based on the parity information in disk 5.
Disk Data
1 0 1 1
2
3 1 0 0
4 1 1 1
5 1 1 0
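The rebuild asked for in part (c) relies on the standard RAID 3 invariant that the parity disk holds the XOR of the data disks, so any one failed disk is the XOR of all surviving disks, parity included. A sketch under that assumption:

```python
def rebuild(surviving):
    """Recover a failed disk by XOR-ing the corresponding bits of all
    surviving disks (including the parity disk). XOR == sum mod 2."""
    return [sum(bits) % 2 for bits in zip(*surviving)]

# disks 1, 3, 4 and the parity disk 5 from part (c); disk 2 has failed
disk1 = [0, 1, 1]
disk3 = [1, 0, 0]
disk4 = [1, 1, 1]
disk5 = [1, 1, 0]  # parity
print(rebuild([disk1, disk3, disk4, disk5]))
```

The same function recomputes parity itself when given only the data disks, which is how part (b)'s parity column would be generated.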