ECE 562/468 Advanced Computer Architecture
Chapter 1-2 Sampling Questions
1. Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
Based on our study, explain how to reduce each type of stall.
Structural stalls: fully pipeline the functional units; where a unit is not fully pipelined, insert a null/bubble clock cycle
RAW stalls: basic pipeline scheduling
WAR stalls: register renaming (dynamic scheduling)
WAW stalls: register renaming (dynamic scheduling)
Control stalls: static and dynamic branch prediction, hardware speculation, and loop unrolling
2. Your boss is trying to decide between a single-processor system and a dual-processor system. The table below gives the performance on two sets of benchmarks: a memory benchmark and a processor benchmark. You know that your application will spend 30% of its time on memory-centric computations, and 70% of its time on processor-centric computations.
How much speedup do you anticipate if you suggest that your boss move from a Pentium 4 570 to an Athlon 64 X2 4800+ for a CPU-intensive application suite?
Speed-up from the Pentium 4 570 to the Athlon 64 X2 4800+ can be measured as the ratio of their Dhrystone performance:
Speed-up = Dhrystone performance of Athlon 64 X2 4800+ / Dhrystone performance of Pentium 4 570
         = 20718 / 11210 = 1.848
Let x be the required fraction of memory computation. Then, for equal performance, we can set the weighted benchmark scores of the two processors equal:
3501x + 11210(1 - x) = 3000x + 15220(1 - x)
4511x = 4010
x = 4010 / 4511 = 0.8889
Thus, the performance of the Pentium 4 570 equals that of the Pentium D 820 when there are 88.89% memory operations and 11.11% processor operations.
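As a sanity check, the arithmetic above can be reproduced in a few lines (a sketch; it uses the same additive weighting of benchmark scores that the handout itself uses):

```python
# Quick check of the arithmetic above. Benchmark scores are those quoted in
# the worked solution (Figure 1.26): Dhrystone and memory-benchmark results.
p4_570_dhry, p4_570_mem = 11210, 3501      # Pentium 4 570
x2_4800_dhry = 20718                       # Athlon 64 X2 4800+
pd_820_dhry, pd_820_mem = 15220, 3000      # Pentium D 820

# CPU-intensive speedup is the ratio of Dhrystone scores.
speedup = x2_4800_dhry / p4_570_dhry
print(round(speedup, 3))                   # 1.848

# Break-even memory fraction x, solved from:
#   3501x + 11210(1 - x) = 3000x + 15220(1 - x)
x = (pd_820_dhry - p4_570_dhry) / (
    (pd_820_dhry - p4_570_dhry) + (p4_570_mem - pd_820_mem))
print(round(x, 4))                         # 0.8889
```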
3. Consider the following code segment. Identify data dependencies by marking with arrows and
labeling with names (RAW, WAR, WAW).
i: R4 ← R0 + R2
j: R8 ← R0 * R4
k: R4 ← R4 - R2
Simulate the execution of the code using the basic 5-stage pipeline (F, D, E, M, & W) with a single memory port and without forwarding or cycle splitting. Add/subtract takes 2 cycles, and multiply takes 3 cycles. The first instruction has been done for you. Extend the table as needed.
Cycle #:         1  2  3  4  5  6  7  8  9  10 11 12
i: R4 ← R0 + R2  F  D  E  E  M  W
j: R8 ← R0 * R4     F  D  *  *  E  E  E  M  W
k: R4 ← R4 - R2        F  D  *  *  *  *  E  E  M  W
k cannot write R4 until j has finished, because j reads R4 during its execution (a WAR hazard on R4). The data dependencies are: i → j RAW on R4, i → k RAW and WAW on R4, and j → k WAR on R4.
Total number of cycles to complete the code: 12.
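The hazard labels asked for above can also be derived mechanically by comparing each pair of instructions' destination and source registers. A minimal sketch:

```python
# Each instruction is (name, destination register, set of source registers),
# matching the three-instruction sequence above.
seq = [("i", "R4", {"R0", "R2"}),
       ("j", "R8", {"R0", "R4"}),
       ("k", "R4", {"R4", "R2"})]

hazards = []
for a in range(len(seq)):
    for b in range(a + 1, len(seq)):
        n1, d1, s1 = seq[a]
        n2, d2, s2 = seq[b]
        if d1 in s2:              # later instruction reads an earlier result
            hazards.append((n1, n2, "RAW", d1))
        if d2 in s1:              # later instruction overwrites a source
            hazards.append((n1, n2, "WAR", d2))
        if d1 == d2:              # both instructions write the same register
            hazards.append((n1, n2, "WAW", d1))

for h in hazards:
    print(h)   # i->j RAW on R4; i->k RAW and WAW on R4; j->k WAR on R4
```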
5. Imagine that your company is trying to decide between a single-processor system and a dual-processor
system. Figure 1.26 gives the performance on two sets of benchmarks—a memory benchmark and a
processor benchmark. You know that your application will spend 30% of its time on memory-centric
computations, and 70% of its time on processor-centric computations.
You are using a dual-core Athlon processor, and you are choosing between two ways to implement the
same algorithm. The first is to create a large lookup table to store 4K words of data. When you need the
result, you look up the answer. The second method would be to calculate the result in a very tight loop.
What are the advantages and disadvantages of each implementation? (Homework)
At what ratio of memory to processor computation would the performance of the Pentium 4 570 equal that of the Pentium D 820? Setting the weighted benchmark scores equal, with x the fraction of memory computation:
3501x + 11210(1 - x) = 3000x + 15220(1 - x)
4511x = 4010
x = 4010 / 4511 = 0.8889
The performance of the Pentium 4 570 equals that of the Pentium D 820 when there are 88.89% memory operations and 11.11% processor operations.
6. Think about what latency numbers really mean—they indicate the number of cycles a given function
requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each
functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a “producer”
followed by a “consumer”) will execute correctly. But not all instruction pairs have a producer/consumer
relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles
would the loop body in the code sequence in Figure 2.35 require if the pipeline detected true data
dependences and only stalled on those, rather than blindly stalling everything just because one functional
unit is busy? Show the code with <stall> inserted where necessary to accommodate stated latencies. (Hint:
An instruction with latency “+2” needs 2 <stall> cycles to be inserted into the code sequence. Think of it
this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So latency 1 + 1 implies 1
stall cycle; latency 1 + N has N extra stall cycles.)
Loop: LD    F2, 0(Rx)
I0:   MULTD F2, F0, F2
I1:   DIVD  F8, F2, F0
I2:   LD    F4, 0(Ry)
I3:   ADDD  F4, F0, F4
I4:   ADDD  F10, F8, F2
I5:   SD    F4, 0(Ry)
I6:   ADDI  Rx, Rx, #8
I7:   ADDI  Ry, Ry, #8
I8:   SUB   R20, R4, Rx
I9:   BNZ   R20, Loop

Latencies beyond a single cycle:
Memory LD         +3
Memory SD         +1
Integer ADD, SUB  +0
Branches          +1
ADDD              +2
MULTD             +4
DIVD              +10
Solution: Latency measures the number of cycles a functional unit needs before its result is available to a consumer. Here, MULTD is dependent on the output of the LD instruction, so it can execute only when that output is ready: the LD stalls its consumer for 3 cycles, MULTD then stalls its consumer (DIVD) for 4 more cycles, and so on.
T0: Loop: LD F2, 0(Rx) 1+3
T1: stall due to load
T2: stall due to load
T3: stall due to load
T4: I0: MULTD F2, F0, F2 1+4
T5: stall due to multd
T6: stall due to multd
T7: stall due to multd
T8: stall due to multd
T9: I1: DIVD F8, F2, F0 1+10
T10: I2: LD F4, 0(Ry) 1+3
T11: stall due to load
T12: stall due to load
T13: stall due to load
T14: I3: ADDD F4, F0, F4 1+2
T15: stall due to divd
T16: stall due to divd
T17: stall due to divd
T18: stall due to divd
T19: stall due to divd
T20: I4: ADDD F10, F8, F2 1+2
T21: I5: SD F4, 0(Ry) 1+1
T22: I6: ADDI Rx, Rx, #8 1
T23: I7: ADDI Ry, Ry, #8 1
T24: I8: SUB R20, R4, Rx 1
T25: I9: BNZ R20, Loop 1+1
T26: stall due to branches
Total number of cycles = 27 (T0-T26).
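The T0-T26 schedule above can be reproduced with a small in-order issue model. This is a sketch: it tracks only register readiness and the final branch delay, ignoring structural hazards and the store's +1 latency (which no later instruction in the loop consumes):

```python
# Stall only on true data dependences: issue in program order, one
# instruction per cycle, waiting until all source registers are ready.
EXTRA = {"LD": 3, "SD": 1, "ADDI": 0, "SUB": 0, "BNZ": 1,
         "ADDD": 2, "MULTD": 4, "DIVD": 10}

# (opcode, destination register or None, source registers)
CODE = [
    ("LD",    "F2",  ["Rx"]),
    ("MULTD", "F2",  ["F0", "F2"]),
    ("DIVD",  "F8",  ["F2", "F0"]),
    ("LD",    "F4",  ["Ry"]),
    ("ADDD",  "F4",  ["F0", "F4"]),
    ("ADDD",  "F10", ["F8", "F2"]),
    ("SD",    None,  ["F4", "Ry"]),
    ("ADDI",  "Rx",  ["Rx"]),
    ("ADDI",  "Ry",  ["Ry"]),
    ("SUB",   "R20", ["R4", "Rx"]),
    ("BNZ",   None,  ["R20"]),
]

def loop_cycles(code):
    ready = {}                  # register -> cycle its value becomes available
    t = 0                       # next free issue slot
    for op, dst, srcs in code:
        issue = max([t] + [ready.get(r, 0) for r in srcs])
        if dst is not None:
            ready[dst] = issue + 1 + EXTRA[op]
        t = issue + 1
    return t + EXTRA["BNZ"]     # one branch-delay stall after the final BNZ

print(loop_cycles(CODE))        # 27, matching T0-T26 above
```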
Sample questions in the textbook:
Chapter 1:
Case Study 1: Chip Fabrication Cost
Concepts illustrated by this case study
Fabrication Cost
Fabrication Yield
Defect Tolerance through Redundancy
There are many factors involved in the price of a computer chip. New, smaller technology gives a boost in
performance and a drop in required chip area. In the smaller technology, one can either keep the small area
or place more hardware on the chip in order to get more functionality. In this case study, we explore how
different design decisions involving fabrication technology, area, and redundancy affect the cost of chips.
1.1 [10/10/Discussion] <1.5, 1.5> Figure 1.22 gives the relevant chip statistics that influence the cost of
several current chips. In the next few exercises, you will be exploring the trade-offs involved between the
AMD Opteron, a single-chip processor, and the Sun Niagara, an 8-core chip.
a. [10] <1.5> What is the yield for the AMD Opteron? (Homework)
From Figure 1.22, for the AMD Opteron:
Defects per unit area = 0.75 per cm^2
Die area = 199 mm^2 = 1.99 cm^2
Wafer yield = 1, i.e., 100%
Plugging these values into the die-yield formula gives
Die yield = (1 + 0.75 * 1.99 / 4)^-4 = 0.2816
b. [10] <1.5> What is the yield for an 8-core Sun Niagara processor?
Similarly, using the die-yield formula:
Yield for the 8-core Sun Niagara processor = (1 + 0.75 * 3.8 / 4)^-4 = 0.1163
c. [Discussion] <1.4, 1.6> Why does the Sun Niagara have a worse yield than the AMD Opteron, even
though they have the same defect rate?
The die area of the 8-core Sun processor is greater than that of the AMD Opteron. Hence, for the same defect rate, the yield of the Sun processor is lower than that of the AMD Opteron.
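The die-yield formula used in parts (a) and (b) can be sketched as a small helper (assuming the textbook's model with alpha = 4; note the handout rounds the Opteron result to 0.2816, while direct evaluation gives about 0.2813):

```python
# Die-yield model from the textbook:
#   die_yield = wafer_yield * (1 + defects_per_area * die_area / alpha)^(-alpha)
def die_yield(defects_per_cm2, die_area_cm2, wafer_yield=1.0, alpha=4):
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / alpha) ** -alpha

print(round(die_yield(0.75, 1.99), 4))   # AMD Opteron, ~0.2813
print(round(die_yield(0.75, 3.80), 4))   # 8-core Sun Niagara, ~0.1163
```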
1.3 [20/20/10/10/20] <1.7> Your colleague at Sun suggests that, since the yield is so poor, it might make
sense to sell two sets of chips, one with 8 working processors and one with 6 working processors. We will
solve this exercise by viewing the yield as a probability of no defects occurring in a certain area given the
defect rate. For the Niagara, calculate probabilities based on each Niagara core separately (this may not be
entirely accurate, since the yield equation is based on empirical evidence rather than a mathematical
calculation relating the probabilities of finding errors in different portions of the chip).
a. [20] <1.7> Using the yield equation for the defect rate above, what is the probability that a defect will
occur on a single Niagara core (assuming the chip is divided evenly between the cores) in an 8-core chip?
b. [20] <1.7> What is the probability that a defect will occur on one or two cores (but not more than that)?
c. [10] <1.7> What is the probability that a defect will occur on none of the cores?
d. [10] <1.7> Given your answers to parts (b) and (c), what is the number of 6-core chips you will sell for
every 8-core chip?
e. [20] <1.7> If you sell your 8-core chips for $150 each, the 6-core chips for $100 each, the cost per die
sold is $80, your research and development budget was $200 million, and testing itself costs $1.50 per chip,
how many processors would you need to sell in order to recoup costs?
Case Study 2: Power Consumption in Computer Systems
Concepts illustrated by this case study
Amdahl’s Law
Redundancy
MTTF
Power Consumption
Power consumption in modern systems is dependent on a variety of factors, including the chip clock
frequency, efficiency, the disk drive speed, disk drive utilization, and DRAM. The following exercises
explore the impact on power that different design decisions and/or use scenarios have.
1.4 [20/10/20] <1.6> Figure 1.23 presents the power consumption of several computer system components.
In this exercise, we will explore how the hard drive affects power consumption for the system.
a. [20] <1.6> Assuming the maximum load for each component, and a power supply efficiency of 70%,
what wattage must the server’s power supply deliver to a system with a Sun Niagara 8-core chip, 2 GB
184-pin Kingston DRAM, and two 7200 rpm hard drives?
b. [10] <1.6> How much power will the 7200 rpm disk drive consume if it is idle roughly 40% of the time?
c. [20] <1.6> Assume that rpm is the only factor in how long a disk is not idle (which is an
oversimplification of disk performance). In other words, assume that for the same set of requests, a 5400
rpm disk will require twice as much time to read data as a 10,800 rpm disk. What percentage of the time
would the 5400 rpm disk drive be idle to perform the same transactions as in part (b)? (homework)
1.6 [10/10/Discussion] <1.2, 1.9> Figure 1.24 gives a comparison of power and performance for several
benchmarks comparing two servers: Sun Fire T2000 (which uses Niagara) and IBM x346 (using Intel Xeon
processors).
a. [10] <1.9> Calculate the performance/power ratio for each processor on each benchmark.
b. [10] <1.9> If power is your main concern, which would you choose?
c. [Discussion] <1.2> For the database benchmarks, the cheaper the system, the lower its cost per database operation. This is counterintuitive: larger systems have more throughput, so one might think that buying a larger system would mean a larger absolute cost but a lower per-operation cost. Since this is true, why do any larger server farms buy expensive servers? (Hint: Look at Exercise 1.4 for some reasons.)
Case Study 3: The Cost of Reliability (and Failure) in Web
Servers
Concepts illustrated by this case study
TPCC
Reliability of Web Servers
MTTF
This set of exercises deals with the cost of not having reliable Web servers. The data is in two sets: one
gives various statistics for Gap.com, which was down for maintenance for two weeks in 2005 [AP 2005].
The other is for Amazon.com, which was not down, but has better statistics on high-load sales days. The
exercises combine the two data sets and require estimating the economic cost to the shutdown.
1.9 [10/10] <1.8> The main reliability measure is MTTF. We will now look at different
systems and how design decisions affect their reliability. Refer to Figure 1.25
for company statistics.
a. [10] <1.8> We have a single processor with an FIT of 100. What is the MTTF for this system?
b. [10] <1.8> If it takes 1 day to get the system running again, what is the availability of the system?
1.10 [20] <1.8> Imagine that the government, to cut costs, is going to build a supercomputer out of the
cheap processor system in Exercise 1.9 rather than a special-purpose reliable system. What is the MTTF for
a system with 1000 processors? Assume that if one fails, they all fail.
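Exercises 1.9 and 1.10 follow from the standard FIT/MTTF relations (FIT = failures per 10^9 device-hours, so MTTF = 10^9 / FIT hours; availability = MTTF / (MTTF + MTTR)). A minimal sketch, assuming a 1-day (24-hour) repair time for part (b):

```python
# FIT is defined as failures per 1e9 hours of operation.
def mttf_hours(fit):
    return 1e9 / fit

def availability(mttf, mttr):
    return mttf / (mttf + mttr)

single = mttf_hours(100)            # one processor at FIT = 100 -> 1e7 hours
print(single)
print(availability(single, 24))     # ~0.9999976 with a 1-day repair time
# 1000 processors that fail together: FITs add, so MTTF shrinks 1000x.
print(mttf_hours(100 * 1000))       # 1e4 hours
```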
Case Study 4: Performance
Concepts illustrated by this case study
Arithmetic Mean
Geometric Mean
Parallelism
Amdahl’s Law
Weighted Averages
In this set of exercises, you are to make sense of Figure 1.26, which presents the performance of selected
processors and a fictional one (Processor X), as reported by www.tomshardware.com. For each system, two
benchmarks were run. One benchmark exercised the memory hierarchy, giving an indication of the speed
of the memory for that system. The other benchmark, Dhrystone, is a CPU-intensive benchmark that does
not exercise the memory system. Both benchmarks are displayed in order to distill the effects that different
design decisions have on memory and CPU performance.
1.13 [10/10/20] <1.9> Imagine that your company is trying to decide between a single-processor system
and a dual-processor system. Figure 1.26 gives the performance on two sets of benchmarks—a memory
benchmark and a processor benchmark. You know that your application will spend 40% of its time on
memory-centric computations, and 60% of its time on processor-centric computations.
a. [10] <1.9> Calculate the weighted execution time of the benchmarks.
b. [10] <1.9> How much speedup do you anticipate getting if you move from
using a Pentium 4 570 to an Athlon 64 X2 4800+ on a CPU-intensive application
suite?
c. [20] <1.9> At what ratio of memory to processor computation would the performance
of the Pentium 4 570 be equal to the Pentium D 820?
1.14 [10/10/20/20] <1.10> Your company has just bought a new dual Pentium processor, and you have
been tasked with optimizing your software for this processor. You will run two applications on this dual
Pentium, but the resource requirements are not equal. The first application needs 80% of the resources, and
the other only 20% of the resources.
a. [10] <1.10> Given that 40% of the first application is parallelizable, how much speedup would you
achieve with that application if run in isolation?
b. [10] <1.10> Given that 99% of the second application is parallelizable, how much speedup would this
application observe if run in isolation?
c. [20] <1.10> Given that 40% of the first application is parallelizable, how much overall system speedup
would you observe if you parallelized it?
d. [20] <1.10> Given that 99% of the second application is parallelizable, how much overall system
speedup would you get?
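All four parts of 1.14 are direct applications of Amdahl's Law with n = 2 processors. A sketch, under the assumption that an application's resource share translates directly into its share of total execution time:

```python
# Amdahl's Law: speedup = 1 / ((1 - f) + f/n), where f is the
# parallelizable fraction and n the number of processors.
def amdahl(parallel_fraction, n):
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / n)

# Overall system speedup when only one app (with time share `app_share`)
# gets faster by `app_speedup`; the rest of the system is unchanged.
def system_speedup(app_share, app_speedup):
    return 1.0 / ((1 - app_share) + app_share / app_speedup)

a = amdahl(0.40, 2)                  # part a: first app in isolation, 1.25
b = amdahl(0.99, 2)                  # part b: second app in isolation, ~1.98
print(a, b)
print(system_speedup(0.80, a))       # part c: ~1.19 overall
print(system_speedup(0.20, b))       # part d: ~1.11 overall
```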
Chapter 2:
Case Study 1: Exploring the Impact of Microarchitectural
Techniques
Concepts illustrated by this case study
Basic Instruction Scheduling, Reordering, Dispatch
Multiple Issue and Hazards
Register Renaming
Out-of-Order and Speculative Execution
Where to Spend Out-of-Order Resources
You are tasked with designing a new processor microarchitecture, and you are trying to figure out how best
to allocate your hardware resources. Which of the hardware and software techniques you learned in
Chapter 2 should you apply? You have a list of latencies for the functional units and for memory, as well as
some representative code. Your boss has been somewhat vague about the performance requirements of
your new design, but you know from experience that, all else being equal, faster is usually better. Start with
the basics. Figure 2.35 provides a sequence of instructions and list of latencies.
2.1 [10] <1.8, 2.1, 2.2> What would be the baseline performance (in cycles, per loop iteration) of the code
sequence in Figure 2.35 if no new instruction execution could be initiated until the previous instruction
execution had completed? Ignore front-end fetch and decode. Assume for now that execution does not stall
for lack of the next instruction, but only one instruction/cycle can be issued. Assume the branch is taken,
and that there is a 1 cycle branch delay slot.
2.2 [10] <1.8, 2.1, 2.2> Think about what latency numbers really mean—they indicate the number of cycles
a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency
cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a
“producer” followed by a “consumer”) will execute correctly. But not all instruction pairs have a
producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other.
How many cycles would the loop body in the code sequence in Figure 2.35 require if the pipeline detected
true data dependences and only stalled on those, rather than blindly stalling everything just because one
functional unit is busy? Show the code with <stall> inserted where necessary to accommodate
stated latencies. (Hint: An instruction with latency “+2” needs 2 <stall> cycles to be inserted into the code
sequence. Think of it this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So
latency 1 + 1 implies 1 stall cycle; latency 1 + N has N extra stall cycles.)
2.3 [15] <2.6, 2.7> Consider a multiple-issue design. Suppose you have two execution pipelines, each
capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the
front end so that it will not stall your execution. Assume results can be immediately forwarded from one
execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall
is to observe a true data dependence. Now how many cycles does the loop require? (ignore)
2.4 [10] <2.6, 2.7> In the multiple-issue design of Exercise 2.3, you may have recognized some subtle
issues. Even though the two pipelines have the exact same instruction repertoire, they are not identical nor
interchangeable, because there is an implicit ordering between them that must reflect the ordering of the
instructions in the original program. If instruction N + 1 begins execution in Execution Pipe 1 at the same
time that instruction N begins in Pipe 0, and N + 1 happens to require a shorter execution latency than N,
then N + 1 will complete before N (even though program ordering would have implied otherwise). Cite at
least two reasons why that could be hazardous and will require special considerations in the
microarchitecture. Give an example of two instructions from the code in Figure 2.35 that demonstrate this
hazard.
2.5 [20] <2.7> Reorder the instructions to improve performance of the code in Figure 2.35. Assume the
two-pipe machine in Exercise 2.3, and that the out-of-order completion issues of Exercise 2.4 have been
dealt with successfully. Just worry about observing true data dependences and functional unit latencies for
now. How many cycles does your reordered code take?
2.6 [10/10] <2.1, 2.2> Every cycle that does not initiate a new operation in a pipe is a lost opportunity, in
the sense that your hardware is not “living up to its potential.”
a. [10] <2.1, 2.2> In your reordered code from Exercise 2.5, what fraction of all cycles, counting both pipes,
were wasted (did not initiate a new op)?
b. [10] <2.1, 2.2> Loop unrolling is one standard compiler technique for finding more parallelism in code,
in order to minimize the lost opportunities for performance.
c. Hand-unroll two iterations of the loop in your reordered code from Exercise 2.5. What speedup did you
obtain? (For this exercise, just color the N + 1 iteration’s instructions green to distinguish them from the
Nth iteration’s; if you were actually unrolling the loop you would have to reassign registers to prevent
collisions between the iterations.)
2.8 [20] <2.4> Exercise 2.7 explored simple register renaming: when the hardware register renamer sees a
source register, it substitutes the destination T register of the last instruction to have targeted that source
register. When the rename table sees a destination register, it substitutes the next available T for it. But
superscalar designs need to handle multiple instructions per clock cycle at every stage in the machine,
including the register renaming. A simple scalar processor would therefore look up both src register
mappings for each instruction, and allocate a new destination mapping per clock cycle. Superscalar
processors must be able to do that as well, but they must also ensure that any dest-to-src relationships
between the two concurrent instructions are handled correctly. Consider the sample code sequence in
Figure 2.38. Assume that we would like to simultaneously rename the first two instructions. Further assume
that the next two available T registers to be used are known at the beginning of the clock cycle in which
these two instructions are being renamed. Conceptually, what we want is for the first instruction to
do its rename table lookups, and then update the table per its destination’s T register. Then the second
instruction would do exactly the same thing, and any interinstruction dependency would thereby be handled
correctly. But there’s not enough time to write that T register designation into the renaming table and then
look it up again for the second instruction, all in the same clock cycle. That register substitution must
instead be done live (in parallel with the register rename table update). Figure 2.39 shows a circuit diagram,
using multiplexers and comparators, that will accomplish the necessary on-the-fly register renaming. Your
task is to show the cycle-by-cycle state of the rename table for every instruction of the code. Assume the
table starts out with every entry equal to its index (T0 = 0; T1 = 1, . . .).
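The sequential-semantics renaming described here (look up each source in the table, then update the table with the destination's new T register, one instruction at a time) can be sketched as below; the mux/comparator network of Figure 2.39 exists precisely to get the same net effect for two instructions in one clock. The two-instruction example is hypothetical, not the Figure 2.38 sequence:

```python
# Scalar register renaming: for each instruction, map its architectural
# sources through the rename table, then allocate a fresh T register for
# its destination and record that mapping for later instructions.
def rename(code, table, next_t):
    out = []
    for dst, srcs in code:
        renamed_srcs = [table[s] for s in srcs]   # look up sources first
        table[dst] = next_t                       # then bind dst -> new T
        out.append((next_t, renamed_srcs))
        next_t += 1
    return out

table = {r: r for r in range(8)}       # T0 = 0, T1 = 1, ... initially
code = [(1, [2, 3]),                   # e.g. R1 <- R2 op R3
        (4, [1, 2])]                   #      R4 <- R1 op R2 (uses new R1)
print(rename(code, table, next_t=8))   # [(8, [2, 3]), (9, [8, 2])]
```

Because the table is updated between the two instructions, the second instruction's source R1 correctly picks up T8; a 2-wide renamer must reproduce this dst-to-src forwarding combinationally within a single cycle.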
2.11 [10/10/10] <2.3> Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute,
memory, write back) and the code in Figure 2.41. All ops are 1 cycle except LW and SW, which are 1 + 2
cycles, and branches, which are 1 + 1 cycles. There is no forwarding. Show the phases of each instruction
per clock cycle for one iteration of the loop.
a. [10] <2.3> How many clock cycles per loop iteration are lost to branch overhead?
b. [10] <2.3> Assume a static branch predictor, capable of recognizing a backwards branch in the decode
stage. Now how many clock cycles are wasted on branch overhead?
c. [10] <2.3> Assume a dynamic branch predictor. How many cycles are lost on a correct prediction?
2.12 [20/20/20/10/20] <2.4, 2.7, 2.10> Let’s consider what dynamic scheduling might achieve here.
Assume a microarchitecture as shown in Figure 2.42. Assume that the ALUs can do all arithmetic ops
(MULTD, DIVD, ADDD, ADDI, SUB) and branches, and that the Reservation Station (RS) can dispatch at
most one operation to each functional unit per cycle (one op to each ALU plus one memory op to the
LD/ST unit).
a. [15] <2.4> Suppose all of the instructions from the sequence in Figure 2.35 are present in the RS, with no
renaming having been done. Highlight any instructions in the code where register renaming would improve
performance.
Hint: Look for RAW and WAW hazards. Assume the same functional unit latencies as in Figure 2.35.
b. [20] <2.4> Suppose the register-renamed version of the code from part (a) is resident in the RS in clock
cycle N, with latencies as given in Figure 2.35. Show how the RS should dispatch these instructions out-of-
order, clock by clock, to obtain optimal performance on this code. (Assume the same RS restrictions as in
part (a). Also assume that results must be written into the RS before they’re available for use; i.e., no
bypassing.) How many clock cycles does the code sequence take?
Case Study 2: Modeling a Branch Predictor
Concept illustrated by this case study
Modeling a Branch Predictor
Besides studying microarchitecture techniques, to really understand computer architecture you must also
program computers. Getting your hands dirty by directly modeling various microarchitectural ideas is
better yet. Write a C or Java program to model a 2,1 branch predictor. Your program will read a series of
lines from a file named history.txt (available on the companion CD; see Figure 2.43).
Each line of that file has three data items, separated by tabs. The first datum on each line is the address of
the branch instruction in hex. The second datum is the branch target address in hex. The third datum is a 1
or a 0; 1 indicates a taken branch, and 0 indicates not taken. The total number of branches your model will
consider is, of course, equal to the number of lines in the file. Assume a direct-mapped BTB, and don't
worry about instruction lengths or alignment (i.e., if your BTB has four entries, then branch instructions at
0x0, 0x1, 0x2, and 0x3 will reside in those four entries, but a branch instruction at 0x4 will overwrite
BTB[0]). For each line in the input file, your model will read the pair of data values, adjust the various
tables per the branch predictor being modeled, and collect key performance statistics. The final output of
your program will look like that shown in Figure 2.44.
Make the number of BTB entries in your model a command-line option.
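The assignment asks for C or Java, but the core 2-bit saturating-counter logic is the same in any language. Here is a minimal sketch in Python, with a hypothetical `simulate` driver that ignores BTB tags and hit-rate bookkeeping and only scores taken/not-taken predictions:

```python
# 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken.
class TwoBitCounter:
    def __init__(self, state=0):
        self.state = state                  # 0 = strong NT ... 3 = strong T

    def predict(self):
        return self.state >= 2              # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A direct-mapped predictor table is then a list of counters indexed by
# branch address modulo the table size (no tags, per the sketch above).
def simulate(history, btb_size=64):
    btb = [TwoBitCounter() for _ in range(btb_size)]
    correct = 0
    for addr, taken in history:
        ctr = btb[addr % btb_size]
        correct += (ctr.predict() == taken)
        ctr.update(taken)
    return correct / len(history)

# Ten executions of one always-taken branch: the counter warms up after
# two taken outcomes, so 8 of 10 predictions are correct.
print(simulate([(0x40, True)] * 10))
```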
2.13 [20/10/10/10/10/10/10] <2.3> Write a model of a simple four-state branch target buffer with 64 entries.
a. [20] <2.3> What is the overall hit rate in the BTB (the fraction of times a branch was looked up in the
BTB and found present)?
b. [10] <2.3> What is the overall branch misprediction rate on a cold start (the fraction of times a branch was not correctly predicted taken or not taken, regardless of whether that prediction “belonged to” the branch being predicted)?
c. [10] <2.3> Find the most common branch. What was its contribution to the overall number of correct
predictions? (Hint: Count the number of times that branch occurs in the history.txt file, then track how each
instance of that branch fares within the BTB model.)
d. [10] <2.3> How many capacity misses did your branch predictor suffer?
e. [10] <2.3> What is the effect of a cold start versus a warm start? To find out, run the same input data set
once to initialize the history table, and then again to collect the new set of statistics.
f. [10] <2.3> Cold-start the BTB 4 more times, with BTB sizes 16, 32, and 64. Graph the resulting five
misprediction rates. Also graph the five hit rates.
g. [10] Submit the well-written, commented source code for your branch target buffer model.