
2. OpenMP can set you free…. Consider the following OpenMP snippet:

    int values[100000];
    #pragma omp parallel
    {
        int i = omp_get_thread_num();
        int n = omp_get_num_threads();
        for (; i < 100000; i += n) {
            values[i] = i;
            #pragma omp barrier
        }
    }

#pragma omp barrier is a synchronization construct that causes a thread reaching it to continue execution only after all other threads have reached the barrier.

Suppose each core has a single level of non-shared, write-back, write-allocate, direct-mapped, 32 KB data cache with 64-byte cache blocks. Assume data cache accesses other than those to the ‘values’ array are negligible and that all data caches are initially empty. Assume ints are 32 bits. Assume there are as many cores as threads.

a. If the snippet is run with one thread, how many data cache misses for the values array will there be?

b. If the snippet is run with two threads (each allocated to a separate core), what is the maximum number of data cache misses for the values array?

c. In five words or less, name the phenomenon that the difference or lack of difference between your answers to (a) and (b) illustrates.

d. Using just two threads, if we remove the barrier, could the number of data cache misses for accesses to the values array decrease by more than a factor of 2 from your answer to (b)? Explain briefly.
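As a study aid, the cache geometry in the problem statement can be turned into a small mapping from array index to cache line. This is only a sketch: the helper names `block_of` and `line_of` are hypothetical, and it assumes the `values` array starts at an address that is a multiple of the cache size, which the problem does not state.

```c
#include <stddef.h>

/* Cache geometry taken from the problem statement. */
enum {
    CACHE_BYTES    = 32 * 1024,                 /* 32 KB data cache        */
    BLOCK_BYTES    = 64,                        /* 64-byte cache blocks    */
    INT_BYTES      = 4,                         /* 32-bit ints             */
    NUM_BLOCKS     = CACHE_BYTES / BLOCK_BYTES, /* lines in the cache      */
    INTS_PER_BLOCK = BLOCK_BYTES / INT_BYTES    /* ints sharing one block  */
};

/* Block number, relative to the start of the array, that holds values[i].
   Assumption: the array begins on a cache-size-aligned address. */
size_t block_of(size_t i)
{
    return (i * INT_BYTES) / BLOCK_BYTES;
}

/* Direct-mapped cache line that the block holding values[i] maps to. */
size_t line_of(size_t i)
{
    return block_of(i) % NUM_BLOCKS;
}
```

Calling `block_of` on consecutive indices shows which elements share a block, which is the information needed to reason about which accesses can hit and which threads touch the same block.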


Spring 2011 Final - Alvin

6. Revenge of the AMAT

• Suppose that for 1000 memory references, we have
  o 40 misses in direct-mapped L1$ (i.e., the miss rate is 4%)
  o 20 misses in 2-way set associative L1$ (i.e., the miss rate is 2%)
  o 10 misses in L2$ (i.e., the global miss rate is 1%)

• Further,
  o L1$ hits in 1 cycle
  o L2$ hits in 10 cycles
  o Miss to main memory costs 100 cycles

• Assume that we have 1.5 memory references per instruction (i.e. 50% loads and stores). In other words, for 1000 instructions we have 1500 memory references.

• Ideal CPI is 1.0 (if we had 100% hit rate in L1$)

a. What is the local miss rate for L2$...
   i. assuming a direct-mapped L1$?

   ii. assuming 2-way set associative L1$?

b. What is the AMAT (Average Memory Access Time)
   i. assuming a direct-mapped L1$?


c. How much faster is the AMAT for a 2-way set associative cache? Give your answer as a ratio.

d. What is the average number of memory stall clock cycles per reference
   i. assuming a direct-mapped L1$?


e. What is the average number of memory stall clock cycles per instruction
   i. assuming a direct-mapped L1$?


f. How much faster would a program run using a 2-way set associative cache?

g. Are the answers for AMAT (6c) and program execution time (6f) above the same? Explain why or why not.
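For reference, the quantities asked for above are related by a few standard formulas, sketched here in C. The function names are hypothetical, and the sketch assumes one common reading of the parameters: every reference pays the L1 hit time, each L1 miss additionally pays the L2 hit time, and each L2 miss additionally pays the main-memory penalty. Check that reading against the course's convention before relying on it.

```c
/* Local L2 miss rate: L2 misses divided by L2 accesses.
   L2 is accessed once per L1 miss. */
double l2_local_miss_rate(double l1_misses, double l2_misses)
{
    return l2_misses / l1_misses;
}

/* Two-level AMAT under the assumption stated in the lead-in. */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_local_rate,
                      double mem_penalty)
{
    return l1_hit + l1_miss_rate * (l2_hit + l2_local_rate * mem_penalty);
}

/* Memory stall cycles per reference: the part of AMAT beyond the L1 hit time. */
double stalls_per_reference(double amat, double l1_hit)
{
    return amat - l1_hit;
}

/* Memory stall cycles per instruction. */
double stalls_per_instruction(double refs_per_instr, double amat, double l1_hit)
{
    return refs_per_instr * stalls_per_reference(amat, l1_hit);
}
```

With made-up inputs (not the exam's), `amat_two_level(1, 0.25, 10, 0.5, 100)` evaluates 1 + 0.25 × (10 + 0.5 × 100). Execution time additionally depends on the ideal CPI, which is why AMAT ratios and run-time ratios need not agree.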

8. One, two, three….SIMD!

a. SIMDize the following code:

    void count( int n, float *c ) {
        for( int i = 0; i < n; i++ )
            c[i] = i;
    }

Enter your solution by filling in the spaces provided. Assume n is a multiple of 4. (_mm_set1_ps(x) returns a __m128 with all four elements set to x.)

    void countfast( int n, float *c ) {
        float m[4] = { ____, ____, ____, ____ };
        __m128 iterate = _mm_loadu_ps( m );
        for( int i = 0; i < __________; i++ ) {
            _mm_storeu_ps( ___________, iterate );
            iterate = _mm_add_ps( iterate, _mm_set1_ps( ___ ));
        }
    }

b. Horner’s rule is an efficient way to find the value of the polynomial $p(x) = c_0 x^{n-1} + c_1 x^{n-2} + \dots + c_{n-2} x + c_{n-1}$:

    float poly( int n, float *c, float x) {
        float p = 0;
        for( int i = 0; i < n; i++ )
            p = p*x + c[i];
        return p;
    }

Complete the following SIMD solution by filling in the blanks. Assume n is a multiple of 4.

    float fastpoly( int n, float *c, float x ) {
        __m128 p = _mm_setzero_ps( );
        for ( int i = 0; i < n; i += 4 ) {
            p = _mm_mul_ps( p, _mm_set1_ps( __________ ) );
            p = _mm_add_ps( p, _mm_loadu_ps( _________ ) );
        }
        float m[4] = { _____, _____, _____, _____ };
        p = _mm_mul_ps( p, _mm_loadu_ps( m ) );
        _mm_storeu_ps( m, p );
        return _____________________________________;
    }
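To make the scalar Horner loop concrete before vectorizing it, here is a small worked instance. The `poly` function is the one given in part (b); the coefficients and the wrapper `poly_example` are made up for illustration.

```c
/* Scalar Horner's rule, as given in part (b). */
float poly(int n, float *c, float x)
{
    float p = 0;
    for (int i = 0; i < n; i++)
        p = p * x + c[i];
    return p;
}

/* Hypothetical example: p(x) = 1*x^2 + 2*x + 3 evaluated at x = 2.
   Iterations: p = 0*2 + 1 = 1, then 1*2 + 2 = 4, then 4*2 + 3 = 11. */
float poly_example(void)
{
    float c[3] = { 1.0f, 2.0f, 3.0f };
    return poly(3, c, 2.0f);
}
```

A scalar reference like this is also a convenient way to check a filled-in `fastpoly` on inputs whose length is a multiple of 4.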


10. Three’s Company

Consider the following datapath with an Arithmetic Logic Unit (ALU) and an eight-register register file organized around a single bus. The ALU applies operations such as add and subtract to its two input operands to generate an output result.

The register file has an asynchronous read and a synchronous write. That is, as soon as the Read Enable (RE) is asserted, the register file selects the indicated 32-bit register and presents its value on the Data Out (DO). On the other hand, the Write Enable (WE) is sampled only on the rising edge of the clock, and only writes the indicated register from the Data In (DI) lines on the same edge that WE is asserted.

The ALU and Register File share the Bus via a 32-bit wide 2:1 multiplexer. When SelALU is set to 1, the ALU path is connected to the Bus. Otherwise, the Register File path is connected to the Bus.

The datapath must support three-address instructions of the form Rz ← Rx <op> Ry. To make use of a single bus architecture, the ALU can be surrounded by one, two, or three 32-bit temporary registers, labeled A, B, and C, as shown below (the temporary registers are shown as dotted lines; the correct solution requires at least one and possibly all three of the registers):


Fall 2010 Final - Alvin


Using as few of the A/B/C registers and as few clock cycles as possible, what is the minimum number of each needed to implement the register transfer for instructions of the three-address type (circle one for each):

Registers:    1 2 3
Clock Cycles: 1 2 3

For your answer, on the previous page cross out the registers you don’t need, and fill in the outline of the registers that you do. For each clock cycle that you need according to your answer above, write in the space below the control signals that must be asserted to implement the register transfers for the three-address instructions:

Clock Cycle 1:
Clock Cycle 2:
Clock Cycle 3:

Name: _______________________________ Login: cs61c-____


F2) Tune in to 101 on your FSM dial...

We are designing a palindrome-finder circuit with a 1-bit input I(t) and a 1-bit output O(t), that will produce, at time t, whether the sequence {I(t-2), I(t-1), I(t)} is the same backwards and forwards (e.g., 101). We’ll assume I(t) has been 1 for all negative time (i.e., before the finder circuit starts). As an example, the input:

I: 1 1 0 0 1 1 1 0 0 0 1 0 1 1 0 0

will produce the output:

O: 1 1 0 0 0 0 1 0 0 1 0 1 1 0 0 0

a) Complete the FSM diagram below. Our states have been labeled Sxy indicating that the previous 2 bits, {I(t-2), I(t-1)} would be {x, y}. Fill in the truth table on the right. The previous state is encoded in (P1,P0), the next state is encoded in (N1,N0), and the output is encoded as O. Make sure to indicate the value of the output on your state transitions AND to indicate the starting state with an “incoming arrow”.

b) Provide a fully reduced (i.e., fewest gates to implement…you can use any n-input gates) Boolean expression for the Output O as a function of P1, P0 and I. If there is a name for the circuit, write it in the box above. E.g., “The always-1”, “3-input NAND”, etc. A 2-input XOR has the symbol of “⊕”.

c) How many different answers could I have put in the box for “b” above? Said another way, how many different circuits can a 3-LUT imitate?

d) We’re always concerned about testing. What is the shortest length of an I(t) stream that can guarantee you’ve tested this particular circuit exhaustively?

e) Finally, we wish to build our circuit as we normally do for SDS systems (shown below). Given the four standard spec times from the chip manufacturer (τsetup, τhold, τclk-to-q, and τCL), what is the smallest clock period τ we can drive our system with? (Write your answer as an expression involving the spec variables.) Feel free to draw timing diagrams if you wish.

P1 P0 I | O N1 N0
 0  0 0 |
 0  0 1 |
 0  1 0 |
 0  1 1 |
 1  0 0 |
 1  0 1 |
 1  1 0 |
 1  1 1 |

Name: __________    O = __________

[Figure: FSM diagram with states S00, S10, S01, S11]

[Figure: SDS circuit with signals CLK, PS, I, NS]
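The specification in F2 can be checked against the example stream with a short behavioural model. This is a sketch of the stated behaviour, not of the gate-level circuit the question asks for; the function names are hypothetical, and `prev2`/`prev1` play the role of the Sxy state.

```c
/* Behavioural model of the palindrome finder: O(t) is 1 exactly when
   {I(t-2), I(t-1), I(t)} reads the same in both directions. For a 3-bit
   window that is the case exactly when I(t-2) equals I(t). */
void palindrome_finder(const int *in, int *out, int len)
{
    int prev2 = 1, prev1 = 1;      /* I(t) is 1 for all negative time */
    for (int t = 0; t < len; t++) {
        out[t] = (prev2 == in[t]);
        prev2 = prev1;             /* shift the 2-bit history */
        prev1 = in[t];
    }
}

/* Returns 1 if the model reproduces the example stream from the problem. */
int matches_problem_example(void)
{
    const int in[16]       = {1,1,0,0,1,1,1,0,0,0,1,0,1,1,0,0};
    const int expected[16] = {1,1,0,0,0,0,1,0,0,1,0,1,1,0,0,0};
    int out[16];
    palindrome_finder(in, out, 16);
    for (int t = 0; t < 16; t++)
        if (out[t] != expected[t])
            return 0;
    return 1;
}
```

Running the model on other input streams is one way to sanity-check a completed truth table row by row.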


Spring 2007 Final - Justin


Question F3: Pipelining (18 points, 24 minutes)

Given the following MIPS code snippet (note that instruction #6 could be anything):

    loop: 1  addi $t0, $t0, 4
          2  lw   $v0, 0($t0)
          3  sw   $v0, 20($t0)
          4  lw   $s0, 60($t0)
          5  bne  $s0, $0, loop
          6  ## The following instruction could be anything!

a) Detect hazards and insert no-ops to ensure correct operation. Assume no delayed branch, no forwarding units and no interlocked pipeline stages. Your answer on the right should take the form of pair(s) of numbers: num@location, indicating that num no-ops should be placed at location. E.g., if you wanted to place 6 no-ops between lines 2 and 3 (i.e., location=2.5) and 8 no-ops between lines 5 and 6 (i.e., location=5.5), you would write: “6@2.5, 8@5.5”. (6 points)

Scratch space

b) Now, reorder/rewrite the program to maximize performance. Assume delayed branch and forwarding units, but no interlocked pipeline stages. For unknown reasons, the first instruction after the loop label must be the addi. Feel free to insert no-ops where needed. You should be able to do it using 6 instructions per loop (easier, half credit) or only 5 (hard, full credit). (12 pts)

## Extra instructions before the loop if necessary

## Extra instructions before the loop if necessary

    loop: 1  addi $t0, $t0, 4
          2
          3
          4
          5
          6  ## The following instruction could be anything!


Spring 2004 Final - Justin


F3) “These Pipes are Clean…” (22 pts, 30 min)

Consider a processor with the following specification:

  o Standard five (5) stage (F, D, E, M, W) pipeline.
  o No forwarding.
  o Stalls on all data and control hazards.
  o Non-delayed branches
  o Branch comparison occurs during the second stage.
  o Instructions are not fetched until branch comparison is done.
  o Memory CAN be read/written on same clock cycle.
  o The same register CAN be read & written on the same clock cycle.
  o No out-of-order execution
  o “Dumb” control that does not optimize for “always-branch” conditional branches

a) Count how many cycles will be needed to execute the code below and write out each instruction’s progress through the pipeline by filling in the table below with pipeline stages (F, D, E, M, W).

    add $t1, $t2, $t3
    xor $t1, $t4, $t5
    lw  $t3, 0($t1)
    beq $t3, $t3, 1
    lw  $t5, 0($t3)
    xor $t4, $t5, $t6
    add $t5, $t5, $t4

Cycle  | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Inst 1 |
Inst 2 |
Inst 3 |
Inst 4 |
Inst 5 |
Inst 6 |

b) Considering the following three changes, fill in the table again:

  o Our processor now forwards values
  o Interlocks on load hazards
  o “Intelligent” control that optimizes for “always-branch” conditional branches

Cycle  | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Inst 1 |
Inst 2 |
Inst 3 |
Inst 4 |
Inst 5 |
Inst 6 |


Fall 2006 Final - Justin


F2) Congressman Mark Foley: “It was the Page’s fault” (22 pts, 30 min)

The specs for a MIPS machine’s memory system that has one level of cache and virtual memory are:

  o 1MiB of Physical Address Space
  o 4GiB of Virtual Address Space
  o 4KiB page size
  o 16KiB 8-way set-associative write-through cache, LRU replacement
  o 1KiB Cache Block Size
  o 2-entry TLB, LRU replacement

The following code is run on the system, which has no other users and process switching turned off.

    #define NUM_INTS 8192 // This many ints...
    int *A = (int *)malloc(NUM_INTS * sizeof(int)); // malloc returns address 0x100000
    int i, total = 0;
    for(i = 0; i < NUM_INTS; i += 128) A[i] = i;
    for(i = 0; i < NUM_INTS; i += 128) total += A[i]; // SPECIAL

a) What is the T:I:O bit breakup for the cache (assuming byte addressing)? ____:____:____

b) What is the VPN : PO bit breakup for VM (assuming byte addressing)? ______:______

For the following questions, only consider the line marked “SPECIAL”. Your answer can be a fraction.

c) Calculate the hit percentage for the cache

d) Calculate the hit percentage for the TLB

e) Calculate the page hit percentage for the page table

Show all your work below...
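A tag:index:offset breakup follows mechanically from the cache parameters, as the sketch below illustrates. The names `log2_pow2`, `tio_t` and `tio_split` are hypothetical, all sizes are assumed to be powers of two, and the caller has to decide which address width applies (a physically addressed cache uses the physical address width).

```c
/* Number of bits needed to select one of n items; n is assumed to be a
   power of two. */
unsigned log2_pow2(unsigned long n)
{
    unsigned bits = 0;
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* Widths of the tag, index and offset fields of an address. */
typedef struct {
    unsigned tag;
    unsigned index;
    unsigned offset;
} tio_t;

/* Splits an addr_bits-wide address for a set-associative cache. */
tio_t tio_split(unsigned addr_bits, unsigned long cache_bytes,
                unsigned long block_bytes, unsigned long ways)
{
    tio_t r;
    unsigned long sets = cache_bytes / (block_bytes * ways);
    r.offset = log2_pow2(block_bytes);  /* selects a byte within a block */
    r.index  = log2_pow2(sets);         /* selects a set                 */
    r.tag    = addr_bits - r.index - r.offset;
    return r;
}
```

For example, with made-up parameters (32-bit addresses, a 32 KiB direct-mapped cache, 64-byte blocks), `tio_split(32, 32768, 64, 1)` gives 512 sets and therefore 9 index bits and 6 offset bits, leaving 17 tag bits. The same arithmetic with a page size in place of the block size gives the page-offset width.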


Fall 2006 Final - Justin


8. Bigger, Stronger, Faster:

Suppose that you are running an algorithm for various problem sizes, and have obtained the data below. Sketch a weak scaling plot of parallel code performance that shows speedup over the serial implementation. Be sure to label the Y-axis.

Problem Size | Gflop/s (serial) | Threads | Gflop/s (parallel)
         100 |                5 |       1 |                  5
         200 |                5 |       2 |                 10
         400 |                5 |       4 |                 19
         600 |                5 |       6 |                 25
         800 |                5 |       8 |                 35
        1000 |                5 |      10 |                 36
        1200 |                5 |      12 |                 37
        1400 |                5 |      14 |                 37
        1600 |                5 |      16 |                 38
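Each point of the plot comes from the table by a single division, sketched here. The function names are hypothetical; the efficiency helper is an optional extra for comparing a point against the linear-speedup line, not something the question asks for.

```c
/* Speedup over serial: parallel rate divided by serial rate. */
double speedup(double parallel_gflops, double serial_gflops)
{
    return parallel_gflops / serial_gflops;
}

/* Fraction of linear (ideal) speedup achieved; linear speedup equals the
   thread count. */
double parallel_efficiency(double speedup_value, int threads)
{
    return speedup_value / threads;
}
```

For the row with 2 threads, `speedup(10, 5)` is 2, which sits on the linear-speedup line; rows whose speedup is smaller than their thread count fall below it.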

[Plot area: “Weak Scaling of Speedup over Serial”; x-axis: Threads, ticks 1 2 4 6 8 10 12 14 16 18 20 22; reference line: Linear Speedup]


Fall 2010 Final - Sean


7. Pay It Forward

Consider the excerpt below of a 5-stage pipelined MIPS datapath.

a. Consider the following sequence of instructions

    [A] srl  $zero, $zero, 0
    [B] addu $t0, $t1, $t2
    [C] addu $t0, $t0, $t2
    [D] lw   $s0, 0($t3)
    [E] subu $t3, $s0, $t0

During which of these instructions’ decode stages in the sequence above should ControlRS be 1 to avoid pipeline stalls? Use the labels [A], [B], …

b. Which fields of which instructions from part a does the control logic need to compute the value of ControlRS?


Fall 2010 Final - Sean

CS61c-Final, Spring 1999, Login name: CS61c-_____

5/12/99 5

The Newsgroup Question (11 points):

Although the CS61C review lecture on variable arguments was great, Joe Computer is very puzzled. What is this M'Piero thing? What does it have to do with Kelvin? "I don't get it!" Determined to find the answer to his confusion, he decides to check the newsgroup by starting the program "trn."

In what order do things happen when trn is run? Part 1 lists a set of things that occur when trn is run. Please time-order the steps from 1 to 13. The odd numbered steps are given to you. Fill in the rest with even numbers.

Please assume:
- No part of the program has been loaded into memory yet.
- Page size is 4KB and there is only one cache.
- The page table entry loaded from the memory for page 0x00040 maps to physical page 0x14329.
- The TLB is between the CPU and the cache, as in class (the cache uses physical addresses).
- Block size for the cache is 8 words (32 bytes).
- In part 1 all of the actions occur. In part 2, some of them are incorrect and do not occur.
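The addresses that appear in the steps below follow from the 4KB page size: the low 12 bits of an address are the page offset and the remaining upper bits are the page number. A minimal sketch, with hypothetical helper names:

```c
#include <stdint.h>

#define PAGE_OFFSET_BITS 12u   /* 4KB pages: 2^12 bytes per page */

/* Virtual page number: the address with the page offset shifted out. */
uint32_t vpn_of(uint32_t vaddr)
{
    return vaddr >> PAGE_OFFSET_BITS;
}

/* Page offset: the low 12 bits, unchanged by translation. */
uint32_t offset_of(uint32_t vaddr)
{
    return vaddr & ((1u << PAGE_OFFSET_BITS) - 1u);
}

/* Physical address: physical page number joined with the same offset. */
uint32_t translate(uint32_t vaddr, uint32_t ppn)
{
    return (ppn << PAGE_OFFSET_BITS) | offset_of(vaddr);
}
```

With the mapping given in the assumptions, `vpn_of(0x00040000)` is 0x00040 and `translate(0x00040000, 0x14329)` is 0x14329000, matching the addresses used in the steps.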

Part 1 (6 points):

Given steps:

__1__ Joe Computer types "trn" at the command line.

__3__ The CPU attempts to fetch the first instruction, 0x00040000 (pointed to by the pc).

__5__ The page table for this process is accessed to find the entry for address 0x00040000, which has the invalid bit set (not loaded from disk yet).

__7__ The TLB is updated with an entry mapping virtual page 0x00040 to physical page 0x14329, with the valid bit set.

__9__ The cache misses for the block containing 0x14329000 and attempts to load the block from memory.

_11__ The instruction at virtual address 0x00040000 is successfully loaded from the cache, completing the instruction fetch phase.

_13__ The CPU attempts to fetch the second instruction, 0x00040004.

Unordered steps: (Assign the even step numbers 2, 4, 6, 8, 10, and 12 to the six options below)

_____ The TLB hits for virtual page number 0x00040, the physical address 0x14329000 is sent to the cache.

_____ The TLB misses while attempting to find an entry for the virtual page number 0x00040.

_____ Physical page number 0x14329 is loaded into memory from disk, and the page table is updated.

_____ The instruction at virtual address 0x00040000 is successfully fetched, and on the next clock tick will move on to its decode stage.

_____ A page table for the process is created by the operating system. Static memory area is created, space is allocated for the static parts (i.e. arrays) of the program, heap and stack are initialized. All the TLB entries from the previous process are marked invalid.

_____ The block containing 0x14329000 is loaded into the cache from memory.


Spring 1999 Final - Sean


[Newsgroup Question continued]

Part 2 (5 points):

Now that you have ordered what happens for the first instruction, what will happen for the second instruction? (Assume that this question starts where Part 1 left off. Remember, some of these may NOT occur. Please order the *correct* actions [starting with the number 1], and put an “X” in front of incorrect actions):

_____ The TLB misses for the virtual page corresponding to address 0x00040004, and the previous procedures are used to load the right page into memory and update the page table.

_____ The cache misses for the block containing 0x14329004 and attempts to load the block from memory.

_____ The TLB hits for virtual page number 0x00040, the physical address 0x14329004 is sent to the cache.

_____ (after many more instructions are executed)... The newsgroup article is read, M'Piero is displayed on the screen, Joe Computer finally gets the joke and posts a message praising the fact that CS61C has such a nifty teaching staff this semester!

_____ The instruction at virtual address 0x00040004 is successfully loaded from the cache, completing its instruction fetch phase.

_____ The block containing 0x14329000 is loaded into the cache from memory.


M1) Hey buddy, can you run these instructions for me? Thanks! (10 pts, 20 min)

Consider the following non-delayed branch MIPS function foo:

    foo:  li  $v0, 0
          la  $t9, loop
          sw  $a1, 0($t9)
          sw  $a2, 4($t9)
          sw  $a3, 8($t9)
    loop: nop
          nop
          nop
          bne $a0, $0, loop
          jr  $ra

a) What does the following function call (in C) return? ________

    foo(-1, 0x30880001, 0x00481020, 0x00042042);

b) You can probably see how foo could pose a security threat if misused. For the good of humanity, we must seal its functionality forever, and render it harmless. That is, you’re going to call it once with a special set of arguments for $a0-$a3 (list these below in human-readable form … not as numbers!) so that every future call to foo always just returns $a0 regardless of the value of $a1-$a3. Oh, and the call to foo with the arguments below should cause it (this time only) to return 0 to signal success that it has been “neutralized”.

$a0: __________________________
$a1: __________________________
$a2: __________________________
$a3: __________________________
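Because foo stores its arguments over the three nops, reading the problem requires decoding 32-bit words as MIPS instructions. The field boundaries of the standard R-type and I-type formats can be extracted as sketched below; the helper names are hypothetical, and mapping opcode and funct values to mnemonics still requires the MIPS reference sheet.

```c
#include <stdint.h>

/* Bits 31..26: opcode (0 indicates an R-type instruction). */
unsigned opcode_of(uint32_t w) { return (w >> 26) & 0x3Fu; }

/* Bits 25..21 and 20..16: rs and rt register numbers. */
unsigned rs_of(uint32_t w) { return (w >> 21) & 0x1Fu; }
unsigned rt_of(uint32_t w) { return (w >> 16) & 0x1Fu; }

/* R-type only: rd (bits 15..11), shamt (bits 10..6), funct (bits 5..0). */
unsigned rd_of(uint32_t w)    { return (w >> 11) & 0x1Fu; }
unsigned shamt_of(uint32_t w) { return (w >> 6) & 0x1Fu; }
unsigned funct_of(uint32_t w) { return w & 0x3Fu; }

/* I-type only: 16-bit immediate (bits 15..0), not sign-extended here. */
unsigned imm_of(uint32_t w) { return w & 0xFFFFu; }
```

Applying these helpers to each argument word gives the numeric fields, and the same field layout can be used in reverse to assemble the words needed for part (b).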


Spring 2008 Final - Sean
