ECE 562/468 Advanced Computer Architecture
Chapter 1-2 Sampling Questions
1. Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Based on our study, explain how to reduce each type of stall.
Structural stalls: fully pipeline the functional units; where a unit is not fully pipelined, insert a null/bubble clock cycle
RAW stalls: basic pipeline scheduling
WAR stalls: register renaming (dynamic scheduling)
WAW stalls: register renaming (dynamic scheduling)
Control stalls: static and dynamic branch prediction, hardware speculation, and loop unrolling
2. Your boss is trying to decide between a single-processor system and a dual-processor system. The table below gives the performance on two sets of benchmarks: a memory benchmark and a processor benchmark. You know that your application will spend 30% of its time on memory-centric computations, and 70% of its time on processor-centric computations.
How much speedup do you anticipate if you suggest that your boss move from a Pentium 4 570 to an Athlon 64 X2 4800+ for a CPU-intensive application suite?
Speed-up from the Pentium 4 570 to the Athlon 64 X2 4800+ can be measured as the ratio of their Dhrystone performance:
Speed-up = Dhrystone performance of Athlon 64 X2 4800+ / Dhrystone performance of Pentium 4 570
         = 20718 / 11210 = 1.848
Let x be the required fraction of memory computation. Then, for equal performance, we can set the weighted benchmark scores of the two processors equal:
3501x + 11210(1 - x) = 3000x + 15220(1 - x)
4511x = 4010
x = 4010 / 4511 = 0.8889
Thus, the performance of the Pentium 4 570 equals that of the Pentium D 820 when there are 88.89% memory operations and 11.11% processor operations.
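As a sanity check, the arithmetic above can be reproduced in a few lines (a sketch; it uses the same additive weighting of benchmark scores that the handout itself uses):

```python
# Quick check of the arithmetic above. Benchmark scores are those quoted in
# the worked solution (Figure 1.26): Dhrystone and memory-benchmark results.
p4_570_dhry, p4_570_mem = 11210, 3501      # Pentium 4 570
x2_4800_dhry = 20718                       # Athlon 64 X2 4800+
pd_820_dhry, pd_820_mem = 15220, 3000      # Pentium D 820

# CPU-intensive speedup is the ratio of Dhrystone scores.
speedup = x2_4800_dhry / p4_570_dhry
print(round(speedup, 3))                   # 1.848

# Break-even memory fraction x, solved from:
#   3501x + 11210(1 - x) = 3000x + 15220(1 - x)
x = (pd_820_dhry - p4_570_dhry) / (
    (pd_820_dhry - p4_570_dhry) + (p4_570_mem - pd_820_mem))
print(round(x, 4))                         # 0.8889
```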
3. Consider the following code segment. Identify data dependencies by marking with arrows and
labeling with names (RAW, WAR, WAW).
i: R4 ← R0 + R2
j: R8 ← R0 * R4
k: R4 ← R4 - R2
Simulate the execution of the code using the basic 5-stage pipeline (F, D, E, M, & W) with a single memory port and without forwarding or cycle splitting. Add/subtract takes 2 cycles, and multiply takes 3 cycles. The first instruction has been done for you. Extend the table as needed.
Cycle #:         1  2  3  4  5  6  7  8  9  10 11 12
i: R4 ← R0 + R2  F  D  E  E  M  W
j: R8 ← R0 * R4     F  D  *  *  E  E  E  M  W
k: R4 ← R4 - R2        F  D  *  *  *  *  E  E  M  W
k cannot write R4 until j has finished, because j reads R4 during its execution (a WAR hazard on R4). The data dependencies are: i → j RAW on R4, i → k RAW and WAW on R4, and j → k WAR on R4.
Total number of cycles to complete the code: 12.
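The hazard labels asked for above can also be derived mechanically by comparing each pair of instructions' destination and source registers. A minimal sketch:

```python
# Each instruction is (name, destination register, set of source registers),
# matching the three-instruction sequence above.
seq = [("i", "R4", {"R0", "R2"}),
       ("j", "R8", {"R0", "R4"}),
       ("k", "R4", {"R4", "R2"})]

hazards = []
for a in range(len(seq)):
    for b in range(a + 1, len(seq)):
        n1, d1, s1 = seq[a]
        n2, d2, s2 = seq[b]
        if d1 in s2:              # later instruction reads an earlier result
            hazards.append((n1, n2, "RAW", d1))
        if d2 in s1:              # later instruction overwrites a source
            hazards.append((n1, n2, "WAR", d2))
        if d1 == d2:              # both instructions write the same register
            hazards.append((n1, n2, "WAW", d1))

for h in hazards:
    print(h)   # i->j RAW on R4; i->k RAW and WAW on R4; j->k WAR on R4
```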
5. Imagine that your company is trying to decide between a single-processor system and a dual-processor
system. Figure 1.26 gives the performance on two sets of benchmarks—a memory benchmark and a
processor benchmark. You know that your application will spend 30% of its time on memory-centric
computations, and 70% of its time on processor-centric computations.
You are using a dual-core Athlon processor, and you are choosing between two ways to implement the
same algorithm. The first is to create a large lookup table to store 4K words of data. When you need the
result, you look up the answer. The second method would be to calculate the result in a very tight loop.
What are the advantages and disadvantages of each implementation? (Homework)
At what ratio of memory to processor computation would the performance of the Pentium 4 570 equal that of the Pentium D 820? Setting the weighted benchmark scores equal, with x the fraction of memory computation:
3501x + 11210(1 - x) = 3000x + 15220(1 - x)
4511x = 4010
x = 4010 / 4511 = 0.8889
The performance of the Pentium 4 570 equals that of the Pentium D 820 when there are 88.89% memory operations and 11.11% processor operations.
6. Think about what latency numbers really mean—they indicate the number of cycles a given function
requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each
functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a “producer”
followed by a “consumer”) will execute correctly. But not all instruction pairs have a producer/consumer
relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles
would the loop body in the code sequence in Figure 2.35 require if the pipeline detected true data
dependences and only stalled on those, rather than blindly stalling everything just because one functional
unit is busy? Show the code with <stall> inserted where necessary to accommodate stated latencies. (Hint:
An instruction with latency “+2” needs 2 <stall> cycles to be inserted into the code sequence. Think of it
this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So latency 1 + 1 implies 1
stall cycle; latency 1 + N has N extra stall cycles.)
Loop: LD    F2, 0(Rx)
I0:   MULTD F2, F0, F2
I1:   DIVD  F8, F2, F0
I2:   LD    F4, 0(Ry)
I3:   ADDD  F4, F0, F4
I4:   ADDD  F10, F8, F2
I5:   SD    F4, 0(Ry)
I6:   ADDI  Rx, Rx, #8
I7:   ADDI  Ry, Ry, #8
I8:   SUB   R20, R4, Rx
I9:   BNZ   R20, Loop

Latencies beyond a single cycle:
Memory LD         +3
Memory SD         +1
Integer ADD, SUB  +0
Branches          +1
ADDD              +2
MULTD             +4
DIVD              +10
Solution: Latency measures the number of cycles a functional unit needs before its result is available to a consumer. Here, MULTD is dependent on the output of the LD instruction, so it can execute only when that output is ready: the LD stalls its consumer for 3 cycles, MULTD then stalls its consumer (DIVD) for 4 more cycles, and so on.
T0: Loop: LD F2, 0(Rx) 1+3
T1: stall due to load
T2: stall due to load
T3: stall due to load
T4: I0: MULTD F2, F0, F2 1+4
T5: stall due to multd
T6: stall due to multd
T7: stall due to multd
T8: stall due to multd
T9: I1: DIVD F8, F2, F0 1+10
T10: I2: LD F4, 0(Ry) 1+3
T11: stall due to load
T12: stall due to load
T13: stall due to load
T14: I3: ADDD F4, F0, F4 1+2
T15: stall due to divd
T16: stall due to divd
T17: stall due to divd
T18: stall due to divd
T19: stall due to divd
T20: I4: ADDD F10, F8, F2 1+2
T21: I5: SD F4, 0(Ry) 1+1
T22: I6: ADDI Rx, Rx, #8 1
T23: I7: ADDI Ry, Ry, #8 1
T24: I8: SUB R20, R4, Rx 1
T25: I9: BNZ R20, Loop 1+1
T26: stall due to branches
Total number of cycles = 27 (T0-T26).
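The T0-T26 schedule above can be reproduced with a small in-order issue model. This is a sketch: it tracks only register readiness and the final branch delay, ignoring structural hazards and the store's +1 latency (which no later instruction in the loop consumes):

```python
# Stall only on true data dependences: issue in program order, one
# instruction per cycle, waiting until all source registers are ready.
EXTRA = {"LD": 3, "SD": 1, "ADDI": 0, "SUB": 0, "BNZ": 1,
         "ADDD": 2, "MULTD": 4, "DIVD": 10}

# (opcode, destination register or None, source registers)
CODE = [
    ("LD",    "F2",  ["Rx"]),
    ("MULTD", "F2",  ["F0", "F2"]),
    ("DIVD",  "F8",  ["F2", "F0"]),
    ("LD",    "F4",  ["Ry"]),
    ("ADDD",  "F4",  ["F0", "F4"]),
    ("ADDD",  "F10", ["F8", "F2"]),
    ("SD",    None,  ["F4", "Ry"]),
    ("ADDI",  "Rx",  ["Rx"]),
    ("ADDI",  "Ry",  ["Ry"]),
    ("SUB",   "R20", ["R4", "Rx"]),
    ("BNZ",   None,  ["R20"]),
]

def loop_cycles(code):
    ready = {}                  # register -> cycle its value becomes available
    t = 0                       # next free issue slot
    for op, dst, srcs in code:
        issue = max([t] + [ready.get(r, 0) for r in srcs])
        if dst is not None:
            ready[dst] = issue + 1 + EXTRA[op]
        t = issue + 1
    return t + EXTRA["BNZ"]     # one branch-delay stall after the final BNZ

print(loop_cycles(CODE))        # 27, matching T0-T26 above
```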
Sample questions in the textbook:
Chapter 1:
Case Study 1: Chip Fabrication Cost
Concepts illustrated by this case study
Fabrication Cost
Fabrication Yield
Defect Tolerance through Redundancy
There are many factors involved in the price of a computer chip. New, smaller technology gives a boost in
performance and a drop in required chip area. In the smaller technology, one can either keep the small area
or place more hardware on the chip in order to get more functionality. In this case study, we explore how
different design decisions involving fabrication technology, area, and redundancy affect the cost of chips.
1.1 [10/10/Discussion] <1.5, 1.5> Figure 1.22 gives the relevant chip statistics that influence the cost of
several current chips. In the next few exercises, you will be exploring the trade-offs involved between the
AMD Opteron, a single-chip processor, and the Sun Niagara, an 8-core chip.
a. [10] <1.5> What is the yield for the AMD Opteron? (Homework)
From Figure 1.22, for the AMD Opteron:
Defects per unit area = 0.75 per cm^2
Die area = 199 mm^2 = 1.99 cm^2
Wafer yield = 1, i.e., 100%
Plugging these values into the die-yield formula gives
Die yield = (1 + 0.75 * 1.99 / 4)^-4 = 0.2816
b. [10] <1.5> What is the yield for an 8-core Sun Niagara processor?
Similarly, using the die-yield formula:
Yield for the 8-core Sun Niagara processor = (1 + 0.75 * 3.8 / 4)^-4 = 0.1163
c. [Discussion] <1.4, 1.6> Why does the Sun Niagara have a worse yield than the AMD Opteron, even
though they have the same defect rate?
The die area of the 8-core Sun processor is greater than that of the AMD Opteron. Hence, for the same defect rate, the yield of the Sun processor is lower than that of the AMD Opteron.
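The die-yield formula used in parts (a) and (b) can be sketched as a small helper (assuming the textbook's model with alpha = 4; note the handout rounds the Opteron result to 0.2816, while direct evaluation gives about 0.2813):

```python
# Die-yield model from the textbook:
#   die_yield = wafer_yield * (1 + defects_per_area * die_area / alpha)^(-alpha)
def die_yield(defects_per_cm2, die_area_cm2, wafer_yield=1.0, alpha=4):
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

print(round(die_yield(0.75, 1.99), 4))   # AMD Opteron, ~0.2813
print(round(die_yield(0.75, 3.80), 4))   # 8-core Sun Niagara, ~0.1163
```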
1.3 [20/20/10/10/20] <1.7> Your colleague at Sun suggests that, since the yield is so poor, it might make
sense to sell two sets of chips, one with 8 working processors and one with 6 working processors. We will
solve this exercise by viewing the yield as a probability of no defects occurring in a certain area given the
defect rate. For the Niagara, calculate probabilities based on each Niagara core separately (this may not be
entirely accurate, since the yield equation is based on empirical evidence rather than a mathematical
calculation relating the probabilities of finding errors in different portions of the chip).
a. [20] <1.7> Using the yield equation for the defect rate above, what is the probability that a defect will
occur on a single Niagara core (assuming the chip is divided evenly between the cores) in an 8-core chip?
b. [20] <1.7> What is the probability that a defect will occur on one or two cores (but not more than that)?
c. [10] <1.7> What is the probability that a defect will occur on none of the cores?
d. [10] <1.7> Given your answers to parts (b) and (c), what is the number of 6-core chips you will sell for
every 8-core chip?
e. [20] <1.7> If you sell your 8-core chips for $150 each, the 6-core chips for $100 each, the cost per die
sold is $80, your research and development budget was $200 million, and testing itself costs $1.50 per chip,
how many processors would you need to sell in order to recoup costs?
Case Study 2: Power Consumption in Computer Systems
Concepts illustrated by this case study
Amdahl’s Law
Redundancy
MTTF
Power Consumption
Power consumption in modern systems is dependent on a variety of factors, including the chip clock
frequency, efficiency, the disk drive speed, disk drive utilization, and DRAM. The following exercises
explore the impact on power that different design decisions and/or use scenarios have.
1.4 [20/10/20] <1.6> Figure 1.23 presents the power consumption of several computer system components.
In this exercise, we will explore how the hard drive affects power consumption for the system.
a. [20] <1.6> Assuming the maximum load for each component, and a power supply efficiency of 70%,
what wattage must the server’s power supply deliver to a system with a Sun Niagara 8-core chip, 2 GB
184-pin Kingston DRAM, and two 7200 rpm hard drives?
b. [10] <1.6> How much power will the 7200 rpm disk drive consume if it is idle roughly 40% of the time?
c. [20] <1.6> Assume that rpm is the only factor in how long a disk is not idle (which is an
oversimplification of disk performance). In other words, assume that for the same set of requests, a 5400
rpm disk will require twice as much time to read data as a 10,800 rpm disk. What percentage of the time
would the 5400 rpm disk drive be idle to perform the same transactions as in part (b)? (homework)
1.6 [10/10/Discussion] <1.2, 1.9> Figure 1.24 gives a comparison of power and performance for several
benchmarks comparing two servers: Sun Fire T2000 (which uses Niagara) and IBM x346 (using Intel Xeon
processors).
a. [10] <1.9> Calculate the performance/power ratio for each processor on each benchmark.
b. [10] <1.9> If power is your main concern, which would you choose?
c. [Discussion] <1.2> For the database benchmarks, the cheaper the system, the lower its cost per database operation. This is counterintuitive: larger systems have more throughput, so one might think that buying a larger system would mean a larger absolute cost but a lower per-operation cost. Since this is true, why do any larger server farms buy expensive servers? (Hint: Look at Exercise 1.4 for some reasons.)
Case Study 3: The Cost of Reliability (and Failure) in Web
Servers
Concepts illustrated by this case study
TPCC
Reliability of Web Servers
MTTF
This set of exercises deals with the cost of not having reliable Web servers. The data is in two sets: one
gives various statistics for Gap.com, which was down for maintenance for two weeks in 2005 [AP 2005].
The other is for Amazon.com, which was not down, but has better statistics on high-load sales days. The
exercises combine the two data sets and require estimating the economic cost to the shutdown.
1.9 [10/10] <1.8> The main reliability measure is MTTF. We will now look at different
systems and how design decisions affect their reliability. Refer to Figure 1.25
for company statistics.
a. [10] <1.8> We have a single processor with an FIT of 100. What is the MTTF for this system?
b. [10] <1.8> If it takes 1 day to get the system running again, what is the availability of the system?
1.10 [20] <1.8> Imagine that the government, to cut costs, is going to build a supercomputer out of the
cheap processor system in Exercise 1.9 rather than a special-purpose reliable system. What is the MTTF for
a system with 1000 processors? Assume that if one fails, they all fail.
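Exercises 1.9 and 1.10 follow from the standard FIT/MTTF relations (FIT = failures per 10^9 device-hours, so MTTF = 10^9 / FIT hours; availability = MTTF / (MTTF + MTTR)). A minimal sketch, assuming a 1-day (24-hour) repair time for part (b):

```python
# FIT is defined as failures per 1e9 hours of operation.
def mttf_hours(fit):
    return 1e9 / fit

def availability(mttf, mttr):
    return mttf / (mttf + mttr)

single = mttf_hours(100)            # one processor at FIT = 100 -> 1e7 hours
print(single)
print(availability(single, 24))     # ~0.9999976 with a 1-day repair time
# 1000 processors that fail together: FITs add, so MTTF shrinks 1000x.
print(mttf_hours(100 * 1000))       # 1e4 hours
```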
Case Study 4: Performance
Concepts illustrated by this case study
Arithmetic Mean
Geometric Mean
Parallelism
Amdahl’s Law
Weighted Averages
In this set of exercises, you are to make sense of Figure 1.26, which presents the performance of selected
processors and a fictional one (Processor X), as reported by www.tomshardware.com. For each system, two
benchmarks were run. One benchmark exercised the memory hierarchy, giving an indication of the speed
of the memory for that system. The other benchmark, Dhrystone, is a CPU-intensive benchmark that does
not exercise the memory system. Both benchmarks are displayed in order to distill the effects that different
design decisions have on memory and CPU performance.
1.13 [10/10/20] <1.9> Imagine that your company is trying to decide between a single-processor system
and a dual-processor system. Figure 1.26 gives the performance on two sets of benchmarks—a memory
benchmark and a processor benchmark. You know that your application will spend 40% of its time on
memory-centric computations, and 60% of its time on processor-centric computations.
a. [10] <1.9> Calculate the weighted execution time of the benchmarks.
b. [10] <1.9> How much speedup do you anticipate getting if you move from
using a Pentium 4 570 to an Athlon 64 X2 4800+ on a CPU-intensive application
suite?
c. [20] <1.9> At what ratio of memory to processor computation would the performance
of the Pentium 4 570 be equal to the Pentium D 820?
1.14 [10/10/20/20] <1.10> Your company has just bought a new dual Pentium processor, and you have
been tasked with optimizing your software for this processor. You will run two applications on this dual
Pentium, but the resource requirements are not equal. The first application needs 80% of the resources, and
the other only 20% of the resources.
a. [10] <1.10> Given that 40% of the first application is parallelizable, how much speedup would you
achieve with that application if run in isolation?
b. [10] <1.10> Given that 99% of the second application is parallelizable, how much speedup would this
application observe if run in isolation?
c. [20] <1.10> Given that 40% of the first application is parallelizable, how much overall system speedup
would you observe if you parallelized it?
d. [20] <1.10> Given that 99% of the second application is parallelizable, how much overall system
speedup would you get?
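All four parts of 1.14 are direct applications of Amdahl's Law with n = 2 processors. A sketch, under the assumption that an application's resource share translates directly into its share of total execution time:

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f/n), where f is the
# parallelizable fraction and n the number of processors.
def amdahl(parallel_fraction, n):
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / n)

# Overall system speedup when only one app (with time share `app_share`)
# gets faster by `app_speedup`; the rest of the system is unchanged.
def system_speedup(app_share, app_speedup):
    return 1.0 / ((1 - app_share) + app_share / app_speedup)

a = amdahl(0.40, 2)                  # part a: first app in isolation, 1.25
b = amdahl(0.99, 2)                  # part b: second app in isolation, ~1.98
print(a, b)
print(system_speedup(0.80, a))       # part c: ~1.19 overall
print(system_speedup(0.20, b))       # part d: ~1.11 overall
```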
Chapter 2:
Case Study 1: Exploring the Impact of Microarchitectural
Techniques
Concepts illustrated by this case study
Basic Instruction Scheduling, Reordering, Dispatch
Multiple Issue and Hazards
Register Renaming
Out-of-Order and Speculative Execution
Where to Spend Out-of-Order Resources
You are tasked with designing a new processor microarchitecture, and you are trying to figure out how best
to allocate your hardware resources. Which of the hardware and software techniques you learned in
Chapter 2 should you apply? You have a list of latencies for the functional units and for memory, as well as
some representative code. Your boss has been somewhat vague about the performance requirements of
your new design, but you know from experience that, all else being equal, faster is usually better. Start with
the basics. Figure 2.35 provides a sequence of instructions and list of latencies.
2.1 [10] <1.8, 2.1, 2.2> What would be the baseline performance (in cycles, per loop iteration) of the code
sequence in Figure 2.35 if no new instruction execution could be initiated until the previous instruction
execution had completed? Ignore front-end fetch and decode. Assume for now that execution does not stall
for lack of the next instruction, but only one instruction/cycle can be issued. Assume the branch is taken,
and that there is a 1 cycle branch delay slot.
2.2 [10] <1.8, 2.1, 2.2> Think about what latency numbers really mean—they indicate the number of cycles
a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency
cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a
“producer” followed by a “consumer”) will execute correctly. But not all instruction pairs have a
producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other.
How many cycles would the loop body in the code sequence in Figure 2.35 require if the pipeline detected
true data dependences and only stalled on those, rather than blindly stalling everything just because one
functional unit is busy? Show the code with <stall> inserted where necessary to accommodate
stated latencies. (Hint: An instruction with latency “+2” needs 2 <stall> cycles to be inserted into the code
sequence. Think of it this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So
latency 1 + 1 implies 1 stall cycle; latency 1 + N has N extra stall cycles.)
2.3 [15] <2.6, 2.7> Consider a multiple-issue design. Suppose you have two execution pipelines, each
capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the
front end so that it will not stall your execution. Assume results can be immediately forwarded from one
execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall
is to observe a true data dependence. Now how many cycles does the loop require? (ignore)
2.4 [10] <2.6, 2.7> In the multiple-issue design of Exercise 2.3, you may have recognized some subtle
issues. Even though the two pipelines have the exact same instruction repertoire, they are not identical nor
interchangeable, because there is an implicit ordering between them that must reflect the ordering of the
instructions in the original program. If instruction N + 1 begins execution in Execution Pipe 1 at the same
time that instruction N begins in Pipe 0, and N + 1 happens to require a shorter execution latency than N,
then N + 1 will complete before N (even though program ordering would have implied otherwise). Cite at
least two reasons why that could be hazardous and will require special considerations in the
microarchitecture. Give an example of two instructions from the code in Figure 2.35 that demonstrate this
hazard.
2.5 [20] <2.7> Reorder the instructions to improve performance of the code in Figure 2.35. Assume the
two-pipe machine in Exercise 2.3, and that the out-of-order completion issues of Exercise 2.4 have been
dealt with successfully. Just worry about observing true data dependences and functional unit latencies for
now. How many cycles does your reordered code take?
2.6 [10/10] <2.1, 2.2> Every cycle that does not initiate a new operation in a pipe is a lost opportunity, in
the sense that your hardware is not “living up to its potential.”
a. [10] <2.1, 2.2> In your reordered code from Exercise 2.5, what fraction of all cycles, counting both pipes,
were wasted (did not initiate a new op)?
b. [10] <2.1, 2.2> Loop unrolling is one standard compiler technique for finding more parallelism in code,
in order to minimize the lost opportunities for performance.
c. Hand-unroll two iterations of the loop in your reordered code from Exercise 2.5. What speedup did you
obtain? (For this exercise, just color the N + 1 iteration’s instructions green to distinguish them from the
Nth iteration’s; if you were actually unrolling the loop you would have to reassign registers to prevent
collisions between the iterations.)
2.8 [20] <2.4> Exercise 2.7 explored simple register renaming: when the hardware register renamer sees a
source register, it substitutes the destination T register of the last instruction to have targeted that source
register. When the rename table sees a destination register, it substitutes the next available T for it. But
superscalar designs need to handle multiple instructions per clock cycle at every stage in the machine,
including the register renaming. A simple scalar processor would therefore look up both src register
mappings for each instruction, and allocate a new destination mapping per clock cycle. Superscalar
processors must be able to do that as well, but they must also ensure that any dest-to-src relationships
between the two concurrent instructions are handled correctly. Consider the sample code sequence in
Figure 2.38. Assume that we would like to simultaneously rename the first two instructions. Further assume
that the next two available T registers to be used are known at the beginning of the clock cycle in which
these two instructions are being renamed. Conceptually, what we want is for the first instruction to
do its rename table lookups, and then update the table per its destination’s T register. Then the second
instruction would do exactly the same thing, and any interinstruction dependency would thereby be handled
correctly. But there’s not enough time to write that T register designation into the renaming table and then
look it up again for the second instruction, all in the same clock cycle. That register substitution must
instead be done live (in parallel with the register rename table update). Figure 2.39 shows a circuit diagram,
using multiplexers and comparators, that will accomplish the necessary on-the-fly register renaming. Your
task is to show the cycle-by-cycle state of the rename table for every instruction of the code. Assume the
table starts out with every entry equal to its index (T0 = 0; T1 = 1, . . .).
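The sequential-semantics renaming described here (look up each source in the table, then update the table with the destination's new T register, one instruction at a time) can be sketched as below; the mux/comparator network of Figure 2.39 exists precisely to get the same net effect for two instructions in one clock. The two-instruction example is hypothetical, not the Figure 2.38 sequence:

```python
# Scalar register renaming: for each instruction, map its architectural
# sources through the rename table, then allocate a fresh T register for
# its destination and record that mapping for later instructions.
def rename(code, table, next_t):
    out = []
    for dst, srcs in code:
        renamed_srcs = [table[s] for s in srcs]   # look up sources first
        table[dst] = next_t                       # then bind dst -> new T
        out.append((next_t, renamed_srcs))
        next_t += 1
    return out

table = {r: r for r in range(8)}       # T0 = 0, T1 = 1, ... initially
code = [(1, [2, 3]),                   # e.g. R1 <- R2 op R3
        (4, [1, 2])]                   #      R4 <- R1 op R2 (uses new R1)
print(rename(code, table, next_t=8))   # [(8, [2, 3]), (9, [8, 2])]
```

Because the table is updated between the two instructions, the second instruction's source R1 correctly picks up T8; a 2-wide renamer must reproduce this dst-to-src forwarding combinationally within a single cycle.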
2.11 [10/10/10] <2.3> Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute,
memory, write back) and the code in Figure 2.41. All ops are 1 cycle except LW and SW, which are 1 + 2
cycles, and branches, which are 1 + 1 cycles. There is no forwarding. Show the phases of each instruction
per clock cycle for one iteration of the loop.
a. [10] <2.3> How many clock cycles per loop iteration are lost to branch overhead?
b. [10] <2.3> Assume a static branch predictor, capable of recognizing a backwards branch in the decode
stage. Now how many clock cycles are wasted on branch overhead?
c. [10] <2.3> Assume a dynamic branch predictor. How many cycles are lost on a correct prediction?
2.12 [20/20/20/10/20] <2.4, 2.7, 2.10> Let’s consider what dynamic scheduling might achieve here.
Assume a microarchitecture as shown in Figure 2.42. Assume that the ALUs can do all arithmetic ops
(MULTD, DIVD, ADDD, ADDI, SUB) and branches, and that the Reservation Station (RS) can dispatch at
most one operation to each functional unit per cycle (one op to each ALU plus one memory op to the
LD/ST unit).
a. [15] <2.4> Suppose all of the instructions from the sequence in Figure 2.35 are present in the RS, with no
renaming having been done. Highlight any instructions in the code where register renaming would improve
performance.
Hint: Look for RAW and WAW hazards. Assume the same functional unit latencies as in Figure 2.35.
b. [20] <2.4> Suppose the register-renamed version of the code from part (a) is resident in the RS in clock
cycle N, with latencies as given in Figure 2.35. Show how the RS should dispatch these instructions out-of-
order, clock by clock, to obtain optimal performance on this code. (Assume the same RS restrictions as in
part (a). Also assume that results must be written into the RS before they’re available for use; i.e., no
bypassing.) How many clock cycles does the code sequence take?
Case Study 2: Modeling a Branch Predictor
Concept illustrated by this case study
Modeling a Branch Predictor
Besides studying microarchitecture techniques, to really understand computer architecture you must also
program computers. Getting your hands dirty by directly modeling various microarchitectural ideas is
better yet. Write a C or Java program to model a 2,1 branch predictor. Your program will read a series of
lines from a file named history.txt (available on the companion CD; see Figure 2.43).
Each line of that file has three data items, separated by tabs. The first datum on each line is the address of
the branch instruction in hex. The second datum is the branch target address in hex. The third datum is a 1
or a 0; 1 indicates a taken branch, and 0 indicates not taken. The total number of branches your model will
consider is, of course, equal to the number of lines in the file. Assume a direct-mapped BTB, and don't
worry about instruction lengths or alignment (i.e., if your BTB has four entries, then branch instructions at
0x0, 0x1, 0x2, and 0x3 will reside in those four entries, but a branch instruction at 0x4 will overwrite
BTB[0]). For each line in the input file, your model will read the pair of data values, adjust the various
tables per the branch predictor being modeled, and collect key performance statistics. The final output of
your program will look like that shown in Figure 2.44.
Make the number of BTB entries in your model a command-line option.
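The assignment asks for C or Java, but the core 2-bit saturating-counter logic is the same in any language. Here is a minimal sketch in Python, with a hypothetical `simulate` driver that ignores BTB tags and hit-rate bookkeeping and only scores taken/not-taken predictions:

```python
# 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken.
class TwoBitCounter:
    def __init__(self, state=0):
        self.state = state                  # 0 = strong NT ... 3 = strong T

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A direct-mapped predictor table is then a list of counters indexed by
# branch address modulo the table size (no tags, per the sketch above).
def simulate(history, btb_size=64):
    btb = [TwoBitCounter() for _ in range(btb_size)]
    correct = 0
    for addr, taken in history:
        ctr = btb[addr % btb_size]
        correct += (ctr.predict() == taken)
        ctr.update(taken)
    return correct / len(history)

# Ten executions of one always-taken branch: the counter warms up after
# two taken outcomes, so 8 of 10 predictions are correct.
print(simulate([(0x40, True)] * 10))
```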
2.13 [20/10/10/10/10/10/10] <2.3> Write a model of a simple four-state branch target buffer with 64 entries.
a. [20] <2.3> What is the overall hit rate in the BTB (the fraction of times a branch was looked up in the
BTB and found present)?
b. [10] <2.3> What is the overall branch misprediction rate on a cold start (the fraction of times a branch was not correctly predicted taken or not taken, regardless of whether that prediction “belonged to” the branch being predicted)?
c. [10] <2.3> Find the most common branch. What was its contribution to the overall number of correct
predictions? (Hint: Count the number of times that branch occurs in the history.txt file, then track how each
instance of that branch fares within the BTB model.)
d. [10] <2.3> How many capacity misses did your branch predictor suffer?
e. [10] <2.3> What is the effect of a cold start versus a warm start? To find out, run the same input data set
once to initialize the history table, and then again to collect the new set of statistics.
f. [10] <2.3> Cold-start the BTB 4 more times, with BTB sizes 16, 32, and 64. Graph the resulting five
misprediction rates. Also graph the five hit rates.
g. [10] Submit the well-written, commented source code for your branch target buffer model.