CSC 4250 Computer Architectures

October 20, 2006

Chapter 3. Instruction-Level Parallelism & Its Dynamic Exploitation

One More Example on Tomasulo’s Algorithm

L.D F0,0(R0)

ADD.D F0,F0,F2

MUL.D F0,F0,F4

ADD.D F0,F0,F2

MUL.D F0,F0,F4

S.D F0,0(R0)

ADD.D F0,F4,F2

IBM 360 Assembly Language

Only two operands. Advantage? Disadvantage? Example:

L.D F0,0(R0)

ADD.D F0,F2

MUL.D F0,F4

ADD.D F0,F2

MUL.D F0,F4

S.D F0,0(R0)

… …
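One way to explore the advantage/disadvantage question is to rewrite three-operand instructions mechanically. The sketch below is not from the slides: the function name `to_two_operand` is hypothetical, and `MOV.D` is used only as a stand-in register-copy mnemonic. It assumes that in the two-operand form the first operand is both a source and the destination, as in the example above.

```python
# Illustrative sketch: rewrite "OP dest,src1,src2" into two-operand form,
# where "OP dest,src" is assumed to mean dest = dest OP src.
COMMUTATIVE = {"ADD.D", "MUL.D"}

def to_two_operand(op, dest, src1, src2):
    """Return the two-operand instruction(s) equivalent to op dest,src1,src2."""
    if dest == src1:
        # Destination already doubles as the first source: one shorter instruction.
        return [f"{op} {dest},{src2}"]
    if dest == src2 and op in COMMUTATIVE:
        # Operands can be swapped for commutative operations.
        return [f"{op} {dest},{src1}"]
    if dest == src2:
        raise ValueError("needs a temporary register; not handled in this sketch")
    # Otherwise the first source must be copied first, because the
    # two-operand form overwrites its first operand.
    return [f"MOV.D {dest},{src1}", f"{op} {dest},{src2}"]

print(to_two_operand("ADD.D", "F0", "F0", "F2"))
print(to_two_operand("ADD.D", "F0", "F4", "F2"))
```

The first call needs a single instruction with one fewer register field to encode. The second call, which corresponds to the last instruction of the Tomasulo example, needs an extra copy.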

Figure 0.1

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √
ADD.D F0,F0,F2
MUL.D F0,F0,F4
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     No
Add2     No
Add3     No
Mult1    No
Mult2    No
Store1   No

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Load1

Figure 0.2

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √       √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     No
Add3     No
Mult1    No
Mult2    No
Store1   No

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Add1

Figure 0.3

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √       √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
ADD.D F0,F0,F2
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     No
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    No
Store1   No

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Mult1

Figure 0.4

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √       √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    No
Store1   No

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Add2

Figure 0.5

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √       √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
S.D F0,0(R0)
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   No

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Mult2

Figure 0.6

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √       √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
S.D F0,0(R0)        √
ADD.D F0,F4,F2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   Yes    Store                               Mult2   0+Reg[R0]

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Mult2

Figure 0.7

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √       √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
S.D F0,0(R0)        √
ADD.D F0,F4,F2      √

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     Yes    Add     Reg[F4]   Reg[F2]
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   Yes    Store                               Mult2   0+Reg[R0]

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Add3

Figure 0.8

Instruction         Issue   Execute   Write Result
L.D F0,0(R0)        √       √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
ADD.D F0,F0,F2      √
MUL.D F0,F0,F4      √
S.D F0,0(R0)        √
ADD.D F0,F4,F2      √       √         √

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        0+Reg[R0]
Add1     Yes    Add               Reg[F2]   Load1
Add2     Yes    Add               Reg[F2]   Mult1
Add3     No
Mult1    Yes    Mult              Reg[F4]   Add1
Mult2    Yes    Mult              Reg[F4]   Add2
Store1   Yes    Store                               Mult2   0+Reg[R0]

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi
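The renaming that produces the station entries in these figures can be sketched in a few lines of Python. This is an illustration, not the lecture's own code. It models only the issue step and assumes that no instruction writes its result while the seven instructions issue, and that the station counts are the ones listed in the figures. Placing the store's pending value tag in Qk is an assumption about the column layout.

```python
from dataclasses import dataclass

@dataclass
class Station:
    busy: bool = False
    op: str = ""
    vj: str = ""
    vk: str = ""
    qj: str = ""
    qk: str = ""
    a: str = ""

# Station pools as listed in the figures (assumed complete).
POOLS = {"Load": ["Load1"], "Add": ["Add1", "Add2", "Add3"],
         "Mult": ["Mult1", "Mult2"], "Store": ["Store1"]}
KIND = {"L.D": "Load", "ADD.D": "Add", "MUL.D": "Mult", "S.D": "Store"}

PROGRAM = [
    ("L.D", "F0", "0", "R0"),
    ("ADD.D", "F0", "F0", "F2"),
    ("MUL.D", "F0", "F0", "F4"),
    ("ADD.D", "F0", "F0", "F2"),
    ("MUL.D", "F0", "F0", "F4"),
    ("S.D", "F0", "0", "R0"),
    ("ADD.D", "F0", "F4", "F2"),
]

def issue_all(program):
    """Issue every instruction in order; nothing completes in the meantime."""
    stations = {n: Station() for names in POOLS.values() for n in names}
    qi = {}  # register -> tag of the station that will produce its value

    def read(reg):
        # Return (value, tag): a pending register yields only the producer's tag.
        tag = qi.get(reg, "")
        return ("", tag) if tag else (f"Reg[{reg}]", "")

    for op, reg, x, y in program:
        kind = KIND[op]
        name = next((n for n in POOLS[kind] if not stations[n].busy), None)
        if name is None:
            raise RuntimeError(f"structural hazard: no free {kind} station")
        rs = stations[name]
        rs.busy, rs.op = True, kind
        if kind == "Load":
            rs.a = f"{x}+Reg[{y}]"
            qi[reg] = name               # F0 is renamed to this station
        elif kind == "Store":
            rs.vk, rs.qk = read(reg)     # value to store (assumed Vk/Qk)
            rs.a = f"{x}+Reg[{y}]"
        else:
            rs.vj, rs.qj = read(x)
            rs.vk, rs.qk = read(y)
            qi[reg] = name
    return stations, qi

stations, qi = issue_all(PROGRAM)
for name, rs in stations.items():
    print(name, rs)
print("Qi:", qi)
```

After all seven instructions issue, the entries correspond to Figure 0.7: each dependent instruction waits on the tag of its producer rather than on F0 itself, and only Add3 is recorded as the pending writer of F0.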

Modified Loop-Based Example

Loop: L.D F0,0(R1)

MUL.D F0,F0,F2

ADD.D F0,F0,F4

S.D F0,0(R1)

DADDIU R1,R1,#−8

BNE R1,R2,Loop

Figure 0.1. One active iteration of loop

Instruction         Iteration   Issue   Execute   Write Result
L.D F0,0(R1)        1           √       √
MUL.D F0,F0,F2      1           √
ADD.D F0,F0,F4      1           √
S.D F0,0(R1)        1           √
L.D F0,0(R1)        2
MUL.D F0,F0,F2      2
ADD.D F0,F0,F4      2
S.D F0,0(R1)        2

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        Reg[R1]
Load2    No
Add1     Yes    Add               Reg[F4]   Mult1
Add2     No
Mult1    Yes    Mult              Reg[F2]   Load1
Mult2    No
Store1   Yes    Store                               Add1    Reg[R1]
Store2   No

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Add1

Figure 0.2. Two active iterations of loop

Instruction         Iteration   Issue   Execute   Write Result
L.D F0,0(R1)        1           √       √
MUL.D F0,F0,F2      1           √
ADD.D F0,F0,F4      1           √
S.D F0,0(R1)        1           √
L.D F0,0(R1)        2           √       √
MUL.D F0,F0,F2      2           √
ADD.D F0,F0,F4      2           √
S.D F0,0(R1)        2           √

Name     Busy   Op      Vj        Vk        Qj      Qk      A
Load1    Yes    Load                                        Reg[R1]
Load2    Yes    Load                                        Reg[R1]-8
Add1     Yes    Add               Reg[F4]   Mult1
Add2     Yes    Add               Reg[F4]   Mult2
Mult1    Yes    Mult              Reg[F2]   Load1
Mult2    Yes    Mult              Reg[F2]   Load2
Store1   Yes    Store                               Add1    Reg[R1]
Store2   Yes    Store                               Add2    Reg[R1]-8

Register   F0      F2   F4   F6   F8   F10   F12   …   F30
Qi         Add2
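The way two iterations coexist in the reservation stations can be reproduced with a short sketch. It is illustrative only and rests on several assumptions: the branch is predicted taken, the integer DADDIU has already updated R1 when the next iteration issues, there is one free station of each kind per iteration, and stations are numbered by iteration. The function name `issue_iterations` is hypothetical.

```python
def issue_iterations(n_iter):
    """Issue n_iter copies of the loop body; return station rows and F0's tag."""
    qi = {}    # register -> tag of the station that will write it
    rows = []  # (station, Op, Vk, pending tag, A) in issue order
    for i in range(1, n_iter + 1):
        offset = -8 * (i - 1)  # effect of DADDIU R1,R1,#-8 in earlier iterations
        addr = "Reg[R1]" if offset == 0 else f"Reg[R1]{offset}"
        # L.D F0,0(R1): F0 is renamed to this iteration's load station.
        rows.append((f"Load{i}", "Load", "", "", addr))
        qi["F0"] = f"Load{i}"
        # MUL.D F0,F0,F2: waits on the load of the same iteration.
        rows.append((f"Mult{i}", "Mult", "Reg[F2]", qi["F0"], ""))
        qi["F0"] = f"Mult{i}"
        # ADD.D F0,F0,F4: waits on the multiply.
        rows.append((f"Add{i}", "Add", "Reg[F4]", qi["F0"], ""))
        qi["F0"] = f"Add{i}"
        # S.D F0,0(R1): waits on the add; it does not rename F0.
        rows.append((f"Store{i}", "Store", "", qi["F0"], addr))
    return rows, qi

rows, qi = issue_iterations(2)
for row in rows:
    print(row)
print("Qi:", qi)
```

Every use of F0 in the second iteration is tied to a station of that iteration (Load2, Mult2, Add2), so the reuse of F0 across iterations creates no WAW or WAR stall. In effect the hardware unrolls the loop dynamically.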

Dynamic Branch Prediction

Static branch prediction is covered in Appendix A.

Branch prediction buffer: a small memory indexed by the lower portion of the address of the branch instruction. The memory contains a bit that says whether the branch was recently taken or not.

The prediction bit may have been placed there by another branch whose address has the same low-order bits.

Figure 3.14. A Branch Prediction Buffer. Use the 4 low-order address bits of the branch (word address) to choose a row.
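A minimal sketch of such a buffer, assuming 4-byte instructions (so the word address is the byte address shifted right by two) and one prediction bit per row. The class name and the two branch addresses are hypothetical, chosen only to show two branches sharing a row.

```python
class BranchPredictionBuffer:
    """One prediction bit per row, indexed by low-order word-address bits."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        self.taken = [False] * (1 << index_bits)  # all rows start "not taken"

    def row(self, pc):
        # Drop the 2 byte-offset bits, then keep the low-order index bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.taken[self.row(pc)]

    def update(self, pc, taken):
        self.taken[self.row(pc)] = taken

bpb = BranchPredictionBuffer()
branch_a, branch_b = 0x1000, 0x1040  # hypothetical addresses, 16 words apart
bpb.update(branch_a, True)           # branch A was just taken
print(bpb.row(branch_a), bpb.row(branch_b))
print(bpb.predict(branch_b))         # B reads the bit that A left behind
```

Because the buffer has no tags, branch B receives the prediction written by branch A. The prediction is only a hint, so such a collision costs accuracy, not correctness.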

Nested Loops

Loop1: L.D F2,1600(R1)

DADDIU R2,R0,#80

Loop2: L.D F0,1000(R2)

ADD.D F0,F0,F2

S.D F0,1000(R2)

DADDIU R2,R2,#−8

BNEZ R2,Loop2

DADDIU R1,R1,#−8

BNEZ R1,Loop1
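The behavior of a 1-bit predictor on the inner-loop branch can be checked with a small simulation. R2 starts at 80 and drops by 8 each time, so BNEZ R2,Loop2 is taken nine times and then falls through once per pass. The number of outer passes is not given in the code, so the value 5 below is a hypothetical choice, as is the initial prediction of not taken.

```python
def inner_branch_outcomes(outer_passes, inner_iterations=10):
    """Outcomes of BNEZ R2,Loop2: taken except on the last iteration of a pass."""
    for _ in range(outer_passes):
        for i in range(inner_iterations):
            yield i < inner_iterations - 1

def one_bit_mispredictions(outcomes, initial=False):
    """Count mispredictions when the bit simply records the last outcome."""
    bit, misses = initial, 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
        bit = taken
    return misses

passes = 5  # hypothetical number of outer-loop passes
misses = one_bit_mispredictions(inner_branch_outcomes(passes))
print(misses, "mispredictions out of", passes * 10)  # 10 mispredictions out of 50
```

The branch is taken 90% of the time, yet the 1-bit scheme mispredicts twice per pass: once on loop exit and once more on re-entry, because the exit flipped the bit.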

Figure 3.7. States in 2-bit Prediction Scheme
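The figure itself is not reproduced in this transcript. As an illustration, the sketch below uses a 2-bit saturating counter, one common form of 2-bit prediction; the exact transitions in the figure may differ, so treat the update rule here as an assumption. The trace is hypothetical: a loop branch taken nine times and then not taken, repeated for five passes.

```python
def two_bit_mispredictions(outcomes, counter=0):
    """Count mispredictions of a 2-bit saturating counter (0..3).

    Values 2 and 3 predict taken; values 0 and 1 predict not taken.
    """
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        # Move toward 3 on taken and toward 0 on not taken, saturating at the ends.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses

# Hypothetical trace: taken 9 times, then not taken, for 5 passes.
trace = ([True] * 9 + [False]) * 5
print(two_bit_mispredictions(trace))     # 7 when starting at 0 (strongly not taken)
print(two_bit_mispredictions(trace, 3))  # 5 when starting at 3 (strongly taken)
```

After warm-up the counter mispredicts only the loop exit, once per pass, because a single not-taken outcome is not enough to flip the prediction.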

Figure 3.8. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer for SPEC89 Benchmarks

Figure 3.9. Prediction Accuracy of 4096-entry 2-bit Prediction Buffer versus an infinite 2-bit Prediction Buffer for SPEC89