farhan mohamed ali (w2-1) jigar vora (w2-2) sonali kapoor (w2-3) avni jhunjhunwala (w2-4)

1

Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4)

Presentation 12

MAD MAC 525

26th April, 2006

Short Final Presentation

W2

Project Objective: Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC) which will revolutionize graphics.

Design Manager: Zack Menegakis

2

Agenda

• Marketing (Jigar)

• Project Description (Farhan)

• Algorithmic Description (Farhan)

• Design Process (Sonali)

• Floorplan Evolution (Sonali)

• Layout (Avni)

• Design Specifications (Avni)

• Conclusion (Jigar)

3

MARKETING

• Application of product: HDR rendering in gaming graphics

• Why HDR? Used in games like Far Cry

• Optimized for speed (chosen because of the market)

• Competition: possible barriers to entry if we enter the market

4

MAD MAC and HDR

• What is HDR?

• Show animation explaining concept

5

MAD MAC and HDR

• MAD MAC accelerates FP16 blending to enable true HDR graphics

• What is HDR?

• HDR = High Dynamic Range

• Dynamic range is defined as the ratio of the largest value of a signal to the lowest measurable value

• Dynamic range of luminance in real-world scenes can be 100,000 : 1

• With HDR rendering, pixel intensities are allowed to extend beyond the [0..1] range of traditional graphics

• Nature isn’t clamped to [0..1] and neither should CG be

• In lay terms:

• Bright things can be really bright

• Dark things can be really dark

• And the details can be seen in both
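To put these ratios in perspective, here is a minimal Python sketch that expresses dynamic range in stops (factors of two). The 100,000 : 1 figure comes from the slide. The 8-bit comparison (255 : 1) and the half-precision limits (largest finite value 65504, smallest normal value 2^-14) are added here for illustration, and the function name is hypothetical.

```python
import math

def dynamic_range_stops(largest, smallest):
    """Dynamic range (largest / smallest measurable value) in stops, i.e. log2 of the ratio."""
    return math.log2(largest / smallest)

# Real-world luminance range quoted on the slide: 100,000 : 1
scene_stops = dynamic_range_stops(100_000, 1)

# Assumption for comparison: an 8-bit channel spans codes 1..255
eight_bit_stops = dynamic_range_stops(255, 1)

# 16-bit half float, normal numbers only: 65504 down to 2**-14
half_stops = dynamic_range_stops(65504.0, 2.0 ** -14)

print(f"scene : {scene_stops:.1f} stops")      # roughly 16.6
print(f"8-bit : {eight_bit_stops:.1f} stops")  # roughly 8
print(f"half  : {half_stops:.1f} stops")       # roughly 30
```

Under these assumptions the half format comfortably covers the quoted scene range, while an 8-bit channel covers only about half of it on a log scale.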

7

• Multiply Accumulate unit (MAC)

• Executes the function A*B + C on 16-bit floating-point inputs in OpenEXR format

• Multiply and add in parallel to greatly speed up operation

• Rounding is performed only once, giving greater accuracy than separate multiply and add operations

• Also known as:

• Fused Multiply Add (FMA)

• Multiply Add (MAD/MADD) in graphics shader programs

• Many applications benefit from a fast FMA

• Graphics – HDR rendering, blending and shader ops

• DSPs – computing vector dot-products in digital filters

• Fast division, square root – eliminates extra hardware

• Available in many newer CPUs and DSPs because it’s so cool

• One ring (circuit) to rule them all!
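The accuracy benefit of rounding once can be seen in a small Python sketch. It is an illustration only, not the chip's datapath: `to_half`, `mul_then_add` and `fused_mac` are hypothetical helper names, half-precision rounding is borrowed from the standard `struct` module's `e` format, and the exact result is assumed to fit in a double before the single rounding step (true for the values used here).

```python
import struct
from fractions import Fraction

def to_half(x):
    """Round a Python float to the nearest 16-bit half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded before the addition."""
    return to_half(to_half(a * b) + c)

def fused_mac(a, b, c):
    """Fused multiply-accumulate: A*B + C is formed exactly, then rounded once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return to_half(float(exact))

# All three inputs are exactly representable in half precision.
a = b = 32.03125   # 32 * (1 + 2**-10)
c = -1026.0

print(mul_then_add(a, b, c))  # 0.0: the low bits of the product were rounded away
print(fused_mac(a, b, c))     # 0.0009765625, i.e. 2**-10
```

The exact product is 1026 + 2^-10. Rounding it to half precision first discards the 2^-10 term, so the separate version cancels to zero, while the fused version keeps it.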

PROJECT DESCRIPTION

8

ALGORITHMIC DESCRIPTION

• Step through entire process

• Multiply and align occur concurrently: C is always aligned to A*B

• Outputs go to adder, normalize, round, overflow checker and output register
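The steps above can be sketched as a behavioural Python model. This is a simplified illustration under several assumptions, not the chip's implementation: it uses Python's arbitrary-precision integers instead of fixed-width buses, flushes denormals to zero, does not handle infinity or NaN inputs, rounds to nearest even, and returns the infinity encoding on overflow. All function names are hypothetical.

```python
import struct

BIAS, FRAC_BITS = 15, 10  # half format: 5-bit exponent (bias 15), 10-bit fraction

def half_bits(x):
    """Encode a Python float as a 16-bit half pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def half_value(bits):
    """Decode a 16-bit half pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def unpack(bits):
    """Return (sign, exponent of the significand's LSB, 11-bit significand)."""
    sign, exp, frac = bits >> 15, (bits >> FRAC_BITS) & 0x1F, bits & 0x3FF
    if exp == 0:                      # zero or denormal: treated as zero
        return sign, 0, 0
    return sign, exp - BIAS - FRAC_BITS, (1 << FRAC_BITS) | frac

def mac_half(a_bits, b_bits, c_bits):
    sa, ea, ma = unpack(a_bits)
    sb, eb, mb = unpack(b_bits)
    sc, ec, mc = unpack(c_bits)
    # Multiply significands and add exponents (exponent calculation).
    sp, ep, mp = sa ^ sb, ea + eb, ma * mb
    # Align: express the product and C on a common exponent.
    e = min(ep, ec)
    p = (mp << (ep - e)) * (-1 if sp else 1)
    c = (mc << (ec - e)) * (-1 if sc else 1)
    # Add (or subtract, depending on the signs).
    total = p + c
    if total == 0:
        return 0
    sign, mag = (1 if total < 0 else 0), abs(total)
    # Normalize: find how far to shift so that 11 significant bits remain.
    shift = mag.bit_length() - (FRAC_BITS + 1)
    if shift > 0:
        # Round once, to nearest even, on the bits shifted out.
        q, rem, half = mag >> shift, mag & ((1 << shift) - 1), 1 << (shift - 1)
        if rem > half or (rem == half and q & 1):
            q += 1
        if q >> (FRAC_BITS + 1):      # rounding carried into a 12th bit
            q >>= 1
            shift += 1
    else:
        q = mag << -shift
    exp = e + shift + FRAC_BITS + BIAS
    # Overflow / underflow check.
    if exp >= 31:
        return (sign << 15) | 0x7C00  # assumption: overflow yields infinity
    if exp <= 0:
        return sign << 15             # assumption: underflow flushes to zero
    return (sign << 15) | (exp << FRAC_BITS) | (q & 0x3FF)

result = mac_half(half_bits(1.5), half_bits(2.0), half_bits(0.25))
print(half_value(result))  # 3.25
```

Because the alignment is done on exact integers, no sticky bit is needed in this model. A hardware datapath with fixed bus widths would have to track the shifted-out bits explicitly.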

9

Block Diagram

[Figure: MAC block diagram. Three 16-bit inputs feed RegArray A, RegArray B and RegArray C. These drive the Multiplier, Exp Calc and Align blocks, followed by the Adder/Subtractor with Control Logic & Sign Determine, the Leading 0 Anticipator, Normalize, Round and the Ovf Checker. The 16-bit output is held in RegY.]

10

IMPLEMENTATION

• Implementation of each module: how and why we chose a particular method, keeping the goal of speed in mind (multiplier, adder)

11

Design Decisions (contd.):

• Multiplier Implementation

– 11 x 11 Carry-Save Multiplier

– Reasons:

• Fast because it avoids having ripple carry in every stage

• Enables Compact Layout
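The idea behind carry-save multiplication can be illustrated with a short Python sketch. It is a generic bit-level model, not the team's circuit: each partial product is folded into a (sum, carry) pair by a row of full adders, so no carry ripples between rows, and a single carry-propagate addition happens only at the end. The function names and the row-by-row arrangement are assumptions made for illustration.

```python
def carry_save_add(x, y, z):
    """3:2 compression: reduce three operands to a sum word and a carry word.

    Each bit position acts as an independent full adder, so there is no
    carry propagation across the word.
    """
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry

def carry_save_multiply(a, b, width=11):
    """Multiply two unsigned `width`-bit integers using carry-save accumulation."""
    s, c = 0, 0
    for i in range(width):
        partial_product = (a << i) if (b >> i) & 1 else 0
        s, c = carry_save_add(s, c, partial_product)
    # Only this final step needs a carry-propagate adder.
    return s + c

print(carry_save_multiply(1025, 1025))  # 1050625
```

The invariant is that `s + c` always equals the sum of the partial products consumed so far, which is why deferring the real addition to the end gives the same product.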

12

Design Process

• Verilog -> Schematic -> Layout

– Behavioral -> Structural Verilog

– Transistors/gates -> Full Schematic

– Gate/Component Layout -> Top Level

• Transistor count fluctuated from 20,200 to 12,800

• Major design decisions

– Decided against implementing denormal arithmetic because it would increase the complexity of the project beyond the scope of the class

– Rounding is performed only once, at the end

– Picked nPass over Tgate in the normalize shifter

– Adder: variable-length carry select -> Han-Carlson binary tree adder
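For readers unfamiliar with the Han-Carlson structure, the following Python sketch models its prefix network at the bit level. It is a textbook-style illustration under assumptions, not the team's schematic: the 16-bit default width, the zero carry-in and all names are hypothetical. Odd bit positions are combined with a Kogge-Stone style tree, and even positions are resolved in one extra final stage, which roughly halves the wiring of a full Kogge-Stone adder.

```python
def han_carlson_add(a, b, width=16):
    """Add two unsigned `width`-bit integers with a Han-Carlson prefix network."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]  # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    G, P = g[:], p[:]

    def combine(hi, lo):
        """Prefix operator: merge a (G, P) group with the group just below it."""
        return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]

    # Stage 1: each odd position absorbs its even neighbour.
    for i in range(1, width, 2):
        G[i], P[i] = combine((g[i], p[i]), (g[i - 1], p[i - 1]))

    # Kogge-Stone stages on odd positions only, distance 2, 4, 8, ...
    d = 2
    while d < width:
        prev_G, prev_P = G[:], P[:]  # all cells in a stage read the previous stage
        for i in range(d + 1, width, 2):
            G[i], P[i] = combine((prev_G[i], prev_P[i]), (prev_G[i - d], prev_P[i - d]))
        d *= 2

    # Final stage: even positions pick up the completed prefix below them.
    for i in range(2, width, 2):
        G[i], P[i] = combine((g[i], p[i]), (G[i - 1], P[i - 1]))

    carries = [0] + G  # carries[i] is the carry into bit i
    total = sum((p[i] ^ carries[i]) << i for i in range(width))
    return total | (carries[width] << width)  # include the carry-out bit

print(han_carlson_add(0x7FFF, 0x0001))  # 32768
```

The number of prefix stages grows with log2 of the width plus one, compared with a delay that grows linearly with block count in a carry-select chain.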

13

VERIFICATION OF DESIGN

Verilog Simulations (show outputs)

– Overview

– How/Why it works

– Behavioral/Structural

Explain why we couldn’t get a high-level simulator and how we tested our Verilog design.

14

SCHEMATICS

• Show schematics of major blocks: adder, multiplier, and top-level

• HOW WE VERIFIED: analog simulation

15

Top Level Schematic

16

Multiplier Schematic

17

Adder Schematic

18

FLOORPLAN EVOLUTION

• Initial floorplan

• How it evolved (with animation)- why and how we changed it

19

Main Floorplan

[Figure: main floorplan. Labeled blocks: Reg A, Reg B, Reg C, Multiplier, Align C, ExpCalc, Adder, Ld Zero, Normalize, Round and Reg Y, with pipeline registers between stages.]

20

Floorplan

21

Full Chip Layout

[Figure: full chip layout. Labeled regions: Exponent, Align, Zero, Adder, Multiplier, Normalize, Round and Ovf.]

22

Pipelining

• Initially planned 5-6 pipeline stages

• Reduced to 4 pipeline stages – made possible by implementing fast carry lookahead adders in critical path modules (adder and multiplier)
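A rough way to reason about the four-stage pipeline is that the clock period is bounded by the slowest stage. The Python sketch below uses the layout delays reported in this deck's per-block timing figures. The grouping of blocks into stages is an assumption read off the pipelining diagram, and the stage names are hypothetical.

```python
# Layout delays in ns, from the deck's per-block timing figures.
# Assumption: each stage's delay is dominated by the blocks named here.
stage_delays_ns = {
    "multiply / align": 2.25,             # multiplier with pipelining
    "add / leading zero": 1.70,           # adder
    "normalize + round": 0.437 + 0.986,   # the two blocks in series
    "overflow check": 0.475,
}

# The slowest stage sets the minimum clock period.
min_period_ns = max(stage_delays_ns.values())

# At the estimated 400 MHz clock, one result completes per cycle after
# an initial latency of one cycle per stage.
clock_mhz = 400
period_ns = 1000 / clock_mhz
latency_ns = len(stage_delays_ns) * period_ns

print(f"slowest stage : {min_period_ns} ns")
print(f"clock period  : {period_ns} ns")
print(f"latency       : {latency_ns} ns for {len(stage_delays_ns)} stages")
```

Under these assumptions the slowest stage matches the 2.25 ns worst-case delay quoted in the design specifications, and the 2.5 ns period at 400 MHz leaves some margin over it.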

23

Pipelining Stages

[Figure: main floorplan annotated with pipeline stages. Labeled blocks: Reg A, Reg B, Reg C, Multiplier, Align C, ExpCalc, Adder, Ld Zero, Normalize, Round, Overflow checker and Reg Y, separated by pipeline registers.]

24

LAYOUT

• Final Layout

• Layout of large blocks such as multiplier, adder and normalize

25

Layout Decisions

• 3 standard cell heights

• Uniform width vdd and ground rails

• Wider vdd and ground rails in power hungry modules

• Max of 8 flip flops per clock pulse generator

• Metal directionality

26

Multiplier Layout with pipelining

27

Adder Layout

28

Normalize Layout

29

FINAL LAYOUT

30

Design Specifications

• Worst case delay = 2.25ns

• Long buses are all buffered (not tested yet)

• Estimated clocking speed = 400MHz

• Height x width = 193.86 um x 301.545 um

• Area = 58,458 um^2

• Aspect ratio = 1:1.55

• Total transistor density = 0.22 transistors/um^2
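The figures on this slide can be cross-checked with a few lines of arithmetic. The Python sketch below recomputes area, aspect ratio, density and the clock bound from the quoted dimensions and delay. The total transistor count of 12,730 is taken from the deck's per-block area table, and the variable names are hypothetical.

```python
height_um, width_um = 193.86, 301.545
transistor_count = 12_730          # total from the per-block area table
worst_case_delay_ns = 2.25

area_um2 = height_um * width_um                # about 58,458 um^2
aspect_ratio = width_um / height_um            # about 1.56, close to the quoted 1 : 1.55
density = transistor_count / area_um2          # about 0.22 transistors per um^2
max_clock_mhz = 1000 / worst_case_delay_ns     # upper bound set by the worst-case delay

print(f"area    : {area_um2:,.0f} um^2")
print(f"aspect  : 1 : {aspect_ratio:.3f}")
print(f"density : {density:.2f} transistors/um^2")
print(f"f_max   : {max_clock_mhz:.0f} MHz")
```

The worst-case delay alone would allow roughly 444 MHz, so the 400 MHz estimate includes some margin.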

31

Layout densities

• Active : 14.05%

• Poly : 9.25%

• Metal 1 : 33.89%

• Metal 2 : 18.00%

• Metal 3 : 14.99%

• Metal 4 : 6.29%

32

Layer Masks - Poly

33

Layer Masks – Metal 1

37

Block                   Schematic Power (mW, 350 MHz)  Layout Power (mW)  Schematic Delay  Layout Delay
Multiplier              2.97                           N/A                3.38 ns          N/A
Multiplier w/ pipeline  ??                             ??                 1.9 ns           2.25 ns
Exponents               1.608                          2.21               1.01 ns          1.2 ns
Align                   0.094                          0.113              480 ps           637 ps
Adder                   8.48                           9.73               1.34 ns          1.7 ns
Leading 0               0.232                          0.857              506 ps           551 ps
Normalize               1.458                          1.546              407 ps           437 ps
Round                   0.631                          1.21               864 ps           986 ps
OvfCheck                0.13                           0.19               453 ps           475 ps
Registers               ??                             ??                 179 ps           193 ps
Total                   ??                             ??                 -                -

38

Block                   Area (um^2)  Transistor Count  Transistor Density
Multiplier w/ pipeline  20,388       4,496             0.22
Exponents               5,163        738               0.14
Align                   3,995        500               0.13
Adder                   13,202       3,174             0.24
Leading 0               1,253        364               0.29
Normalize               3,190        942               0.3
Round                   1,802        494               0.28
OvfCheck                200          70                0.35
Registers, etc.         N/A          1,948             N/A
Total                   58,458       12,730            0.22

39

Conclusion

• More marketing

• Summarize chip functionality

• Extending applications of chip

40

Comments?
