1
Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4)
Presentation 12
MAD MAC 525
26th April, 2006
Short Final Presentation
W2
Project Objective: Design a crucial part of a GPU, the Multiply Accumulate Unit (MAC), which will revolutionize graphics.
Design Manager: Zack Menegakis
2
Agenda
• Marketing (Jigar)
• Project Description (Farhan)
• Algorithmic Description (Farhan)
• Design Process (Sonali)
• Floorplan Evolution (Sonali)
• Layout (Avni)
• Design Specifications (Avni)
• Conclusion (Jigar)
3
MARKETING
• Application of product: HDR rendering in gaming graphics
• Why HDR? It is used in games like Far Cry
• Optimization for speed (chosen because of the market)
• Competition: possible barriers to entry if we enter the market
4
MAD MAC and HDR
• What is HDR?
• Show animation explaining concept
5
MAD MAC and HDR
• MAD MAC accelerates FP16 blending to enable true HDR graphics
• What is HDR?
• HDR = High Dynamic Range
• Dynamic range is defined as the ratio of the largest value of a signal to the lowest measurable value
• Dynamic range of luminance in real-world scenes can be 100,000 : 1
• With HDR rendering, pixel intensities are allowed to extend beyond the [0..1] range of traditional graphics
• Nature isn’t clamped to [0..1] and neither should CG
• In lay terms:
• Bright things can be really bright
• Dark things can be really dark
• And the details can be seen in both
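The difference between clamped and HDR pixel values can be sketched in a few lines of Python. This is only an illustration: the sample luminance values and the helper names `to_half`, `clamp01` and `dynamic_range` are assumptions made for the example, and the limits shown are those of the standard 16-bit half-precision format (1 sign, 5 exponent, 10 fraction bits).

```python
import struct

FP16_MAX = 65504.0            # largest finite half-precision value
FP16_MIN_NORMAL = 2.0 ** -14  # smallest normal (non-denormal) half-precision value

def to_half(x: float) -> float:
    """Round x to the nearest half-precision value using struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def clamp01(x: float) -> float:
    """Traditional rendering: clamp the intensity to [0, 1]."""
    return min(1.0, max(0.0, x))

def dynamic_range(largest: float, smallest: float) -> float:
    """Ratio of the largest value to the smallest measurable value."""
    return largest / smallest

# Hypothetical scene luminances, from deep shadow to a bright light source.
scene = [0.002, 0.5, 1.0, 40.0, 1200.0]
clamped = [clamp01(v) for v in scene]   # bright values all collapse to 1.0
hdr = [to_half(v) for v in scene]       # bright values keep their magnitude

print("clamped:", clamped)
print("fp16   :", hdr)
print("fp16 normal-range ratio:", dynamic_range(FP16_MAX, FP16_MIN_NORMAL))
```

With clamping, the two brightest samples become indistinguishable, while the FP16 values preserve the difference; the ratio between the largest and smallest normal FP16 values comfortably exceeds the 100,000 : 1 figure quoted for real-world scenes.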
6
7
PROJECT DESCRIPTION
• Multiply Accumulate unit (MAC)
– Executes the function AB+C on 16-bit floating-point inputs. Inputs will be in OpenEXR format.
– Multiply and add in parallel to greatly speed up the operation
– Rounding is performed only once, so accuracy is greater than with individual multiply and add functions
• Also known as:
– Fused Multiply Add (FMA)
– Multiply Add (MAD/MADD) in graphics shader programs
• Many applications benefit from a fast FMA
– Graphics: HDR rendering, blending and shader ops
– DSPs: computing vector dot-products in digital filters
– Fast division, square root: eliminates extra hardware
• Available in many newer CPUs and DSPs because it’s so cool
• One ring (circuit) to rule them all!
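The accuracy benefit of rounding only once can be illustrated with a small Python sketch. It is not the project's implementation: `to_half`, `mac_fused` and `mac_unfused` are hypothetical helpers, and the input values are chosen by hand so that the two approaches disagree.

```python
import struct
from fractions import Fraction

def to_half(x: float) -> float:
    """Round x to the nearest 16-bit half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def mac_fused(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to half precision.

    Simplification: the exact result passes through a double before the
    final rounding, which is harmless for this demo but is not a
    bit-exact hardware model in every corner case.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return to_half(float(exact))

def mac_unfused(a: float, b: float, c: float) -> float:
    """Round after the multiply and again after the add."""
    product = to_half(a * b)
    return to_half(product + c)

# Hand-picked half-precision inputs: the exact product is 4104.00390625,
# but the nearest half-precision value to it is 4104.0.
a = b = 64.0625
c = -4104.0

print("fused  :", mac_fused(a, b, c))
print("unfused:", mac_unfused(a, b, c))
```

In this example the separately rounded multiply discards the low-order part of the product before the add, so the unfused result loses the small remainder that the fused computation keeps.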
8
ALGORITHMIC DESCRIPTION
• Step through entire process
• Multiply and align occur concurrently: C is always aligned to A*B
• Outputs go to the adder, then normalize, round, overflow checker and output register
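The sequence of steps above can be mirrored in a behavioral Python model. This is a sketch under stated assumptions, not the team's Verilog: exponent field 0 is treated as zero and results below the normal range are flushed to zero (no denormals), infinities and NaNs on the inputs are not handled, rounding is round-to-nearest-even, and alignment uses exact integer shifts where hardware would use a bounded shifter. All function names are hypothetical.

```python
import struct

def float_to_bits(x: float) -> int:
    """Encode a Python float as a 16-bit half-precision pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bits_to_float(h: int) -> float:
    """Decode a 16-bit half-precision pattern to a Python float."""
    return struct.unpack("<e", struct.pack("<H", h))[0]

def decode(h: int):
    """Split into sign, unbiased exponent and 11-bit significand.

    The value is (-1)**sign * sig * 2**exp. Assumption: exponent
    field 0 is treated as zero (denormals are not supported).
    """
    sign, exp, frac = h >> 15, (h >> 10) & 0x1F, h & 0x3FF
    sig = 0 if exp == 0 else (0x400 | frac)
    return sign, exp - 25, sig

def mac(a: int, b: int, c: int) -> int:
    """Behavioral A*B + C on half-precision bit patterns."""
    sa, ea, ma = decode(a)
    sb, eb, mb = decode(b)
    sc, ec, mc = decode(c)
    # Multiply: 11 x 11 significand product, exponents add.
    sp, ep, mp = sa ^ sb, ea + eb, ma * mb
    # Align: bring C and the product to a common exponent (exact shifts).
    e0 = min(ep, ec)
    prod, addend = mp << (ep - e0), mc << (ec - e0)
    # Add or subtract according to the signs.
    total = (-prod if sp else prod) + (-addend if sc else addend)
    if total == 0:
        return 0
    sign, mag = (1 if total < 0 else 0), abs(total)
    # Normalize to an 11-bit significand, then round once (nearest even).
    shift = mag.bit_length() - 11
    if shift > 0:
        sig = mag >> shift
        rem, half = mag & ((1 << shift) - 1), 1 << (shift - 1)
        if rem > half or (rem == half and sig & 1):
            sig += 1
        if sig == 1 << 11:        # rounding carried out of the significand
            sig >>= 1
            shift += 1
    else:
        sig = mag << -shift
    exp_field = e0 + shift + 25
    # Overflow check: saturate to infinity; flush underflow to zero.
    if exp_field > 30:
        return (sign << 15) | 0x7C00
    if exp_field < 1:
        return sign << 15
    return (sign << 15) | (exp_field << 10) | (sig & 0x3FF)

def mac_float(a: float, b: float, c: float) -> float:
    """Convenience wrapper taking and returning Python floats."""
    return bits_to_float(mac(float_to_bits(a), float_to_bits(b), float_to_bits(c)))

print(mac_float(1.5, 2.0, 0.25))
print(mac_float(64.0625, 64.0625, -4104.0))
```

A model like this is mainly useful as a reference to compare against simulation outputs; it says nothing about timing or about how the hardware overlaps the multiply and align steps.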
9
Block Diagram
[Figure: block diagram. Blocks shown: RegArray A, RegArray B, RegArray C (16-bit inputs); Multiplier; Exp Calc; Align; Adder/Subtractor; Control Logic & Sign Dtrmin; Leading 0 Anticipator; Normalize; Round; Ovf Checker; RegY (16-bit output).]
10
IMPLEMENTATION
• Implementation of each module: how and why we chose a particular method, keeping in mind the goal of speed (multiplier, adder)
11
Design Decisions (contd.):
• Multiplier Implementation
– 11 x 11 Carry-Save Multiplier
– Reasons:
• Fast because it avoids having ripple carry in every stage
• Enables Compact Layout
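The reason a carry-save multiplier avoids a ripple carry in every stage can be shown with a short Python sketch. This is a generic illustration of the carry-save technique, not the project's circuit: the function names are hypothetical, and the only detail taken from the slide is the 11-bit operand width.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compression: reduce three numbers to a sum word and a carry word.

    Each bit position is handled independently (a row of full adders),
    so no carry ripples across the word. Invariant: s + c == x + y + z.
    """
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def carry_save_multiply(a: int, b: int, width: int = 11) -> int:
    """Multiply two unsigned width-bit numbers using carry-save rows."""
    s, c = 0, 0
    for i in range(width):
        # Partial product for bit i of b.
        pp = (a << i) if (b >> i) & 1 else 0
        s, c = carry_save_add(s, c, pp)
    # Only the final step needs a carry-propagating adder.
    return s + c

print(carry_save_multiply(1025, 1025))
```

Each row only compresses three words into two, so the delay per row is one full adder regardless of word width; the single carry-propagate addition at the end is where a fast adder pays off.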
12
Design Process
• Verilog -> Schematic -> Layout
– Behavioral -> Structural Verilog
– Transistors/gates -> Full Schematic
– Gate/Component Layout -> Top Level
• Transistor count fluctuated from 20,200 to 12,800
• Major design decisions
– Decided against implementing denormal arithmetic because it would increase the complexity of the project beyond the scope of the class
– Round performed only once at the end
– Picked nPass over Tgate in the normalize shifter
– Adder: variable length carry select -> Han-Carlson binary tree adder
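For readers unfamiliar with the Han-Carlson structure, the following Python sketch models a parallel-prefix adder of that style at the bit level. It follows the commonly described arrangement (a Kogge-Stone prefix network on the odd bit positions plus one extra stage for the even positions); the bit width, the function name and the absence of a carry-in are assumptions, since the slides do not give the adder's details.

```python
def han_carlson_add(a: int, b: int, width: int = 16) -> int:
    """Add two unsigned width-bit numbers with a Han-Carlson style prefix tree."""
    # Bitwise generate and propagate signals.
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]

    # First stage: each odd bit absorbs its even neighbour.
    for i in range(1, width, 2):
        G[i] = g[i] | (p[i] & g[i - 1])
        P[i] = p[i] & p[i - 1]

    # Kogge-Stone stages on odd positions only, spans 2, 4, 8, ...
    d = 2
    while d < width:
        Gn, Pn = G[:], P[:]   # all cells of a stage read the previous stage
        for i in range(1, width, 2):
            if i - d >= 1:
                Gn[i] = G[i] | (P[i] & G[i - d])
                Pn[i] = P[i] & P[i - d]
        G, P = Gn, Pn
        d *= 2

    # Final stage: even positions pick up the prefix of the odd bit below.
    for i in range(2, width, 2):
        G[i] = g[i] | (p[i] & G[i - 1])

    # The carry into bit i is the group generate of bits 0..i-1.
    carry_in = [0] + G[:-1]
    total = sum((p[i] ^ carry_in[i]) << i for i in range(width))
    return total | (G[width - 1] << width)   # include the carry-out bit

print(hex(han_carlson_add(0xFFFF, 0x0001)))
```

Compared with a full Kogge-Stone network, this arrangement needs roughly half the prefix cells and wires at the cost of one extra logic stage, which is the usual argument for choosing it when layout area matters alongside speed.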
13
VERIFICATION OF DESIGN
Verilog Simulations (show outputs)
– Overview
– How/Why it works
– Behavioral/Structural
Explain why we couldn’t get a high-level simulator and how we tested our Verilog design.
14
SCHEMATICS
• Show schematics of major blocks: adder, multiplier, and top-level
• HOW WE VERIFIED: analog simulation
15
Top Level Schematic
16
Multiplier Schematic
17
Adder Schematic
18
FLOORPLAN EVOLUTION
• Initial floorplan
• How it evolved (with animation): why and how we changed it
19
Main Floorplan
[Figure: floorplan with blocks Reg A, Reg B, Reg C, Multiplier, Align C, ExpCalc, Pipeline Reg (three instances), Adder, Ld Zero, Normalize, Round and Reg Y.]
20
Floorplan
21
Full Chip Layout
[Figure: full chip layout with regions labeled Exponent, Align, Zero, Adder, Multiplier, Normalize, Round and Ovf.]
22
Pipelining
• Initially planned 5-6 pipeline stages
• Reduced to 4 pipeline stages – made possible by implementing fast carry lookahead adders in critical path modules (adder and multiplier)
23
Pipelining Stages
[Figure: floorplan with pipeline registers marked. Blocks: Reg A, Reg B, Reg C, Multiplier, Align C, ExpCalc, Adder, Ld Zero, Normalize, Round, Overflow checker and Reg Y, with Pipeline Reg (five instances) between them.]
24
LAYOUT
• Final Layout
• Layout of large blocks such as multiplier, adder and normalize
25
Layout Decisions
• 3 standard cell heights
• Uniform width vdd and ground rails
• Wider vdd and ground rails in power hungry modules
• Max of 8 flip flops per clock pulse generator
• Metal directionality
26
Multiplier Layout with pipelining
27
Adder Layout
28
Normalize Layout
29
FINAL LAYOUT
30
Design Specifications
• Worst case delay = 2.25 ns
• Long buses are all buffered (not tested yet)
• Estimated clocking speed = 400 MHz
• Height by width = 193.86 um x 301.545 um
• Area = 58,458 um^2
• Aspect ratio = 1:1.55
• Total transistor density = 0.22 transistors/um^2
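The derived figures above can be cross-checked from the raw dimensions and the worst-case delay. The short Python sketch below does only that arithmetic; the function names are hypothetical, and the clock bound assumes the worst-case stage delay is the only limit on the clock period (no setup time or clock skew margin), which is why it comes out above the estimated clocking speed.

```python
HEIGHT_UM = 193.86           # from the slide
WIDTH_UM = 301.545           # from the slide
WORST_CASE_DELAY_NS = 2.25   # from the slide

def area_um2(height_um: float, width_um: float) -> float:
    """Chip area in square micrometres."""
    return height_um * width_um

def aspect_ratio(height_um: float, width_um: float) -> float:
    """Width relative to height, i.e. the x in 1:x."""
    return width_um / height_um

def max_clock_mhz(delay_ns: float) -> float:
    """Upper bound on clock frequency if delay_ns is the full clock period."""
    return 1000.0 / delay_ns

print(f"area         = {area_um2(HEIGHT_UM, WIDTH_UM):.0f} um^2")
print(f"aspect ratio = 1:{aspect_ratio(HEIGHT_UM, WIDTH_UM):.3f}")
print(f"clock bound  = {max_clock_mhz(WORST_CASE_DELAY_NS):.1f} MHz")
```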
31
Layout densities
• Active : 14.05%
• Poly : 9.25%
• Metal 1 : 33.89%
• Metal 2 : 18.00%
• Metal 3 : 14.99%
• Metal 4 : 6.29%
32
Layer Masks - Poly
33
Layer Masks – Metal 1
34
Layer Masks – Metal 2
35
Layer Masks – Metal 3
36
Layer Masks – Metal 4
37
Block          Schematic Power    Layout Power   Schematic   Layout
               (mW, 350 MHz)      (mW)           Delay       Delay
Multiplier     2.97               N/A            3.38n       N/A
-w/ pipeline   ??                 ??             1.9n        2.25n
Exponents      1.608              2.21           1.01n       1.2n
Align          0.094              0.113          480p        637p
Adder          8.48               9.73           1.34n       1.7n
Leading 0      0.232              0.857          506p        551p
Normalize      1.458              1.546          407p        437p
Round          0.631              1.21           864p        986p
OvfCheck       0.13               0.19           453p        475p
Registers      ??                 ??             179p        193p
Total          ??                 ??             -           -
38
Block                     Area (um^2)   Transistor Count   Transistor Density
Multiplier -w/ pipeline   20,388        4,496              0.22
Exponents                 5,163         738                0.14
Align                     3,995         500                0.13
Adder                     13,202        3,174              0.24
Leading 0                 1,253         364                0.29
Normalize                 3,190         942                0.3
Round                     1,802         494                0.28
OvfCheck                  200           70                 0.35
Registers, etc            N/A           1,948              N/A
Total                     58,458        12,730             0.22
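The density column is simply the transistor count divided by the block area. A minimal Python sketch of that calculation follows, using the area and count figures from the table; the dictionary layout and function name are assumptions made for the example, and the Registers row is left out because no area is reported for it.

```python
# (area in um^2, transistor count), copied from the table above.
BLOCKS = {
    "Multiplier w/ pipeline": (20388, 4496),
    "Exponents": (5163, 738),
    "Align": (3995, 500),
    "Adder": (13202, 3174),
    "Leading 0": (1253, 364),
    "Normalize": (3190, 942),
    "Round": (1802, 494),
    "OvfCheck": (200, 70),
}
TOTAL_AREA_UM2 = 58458
TOTAL_TRANSISTORS = 12730

def density(area_um2: float, transistors: int) -> float:
    """Transistors per square micrometre."""
    return transistors / area_um2

for name, (area, count) in BLOCKS.items():
    share = 100.0 * area / TOTAL_AREA_UM2
    print(f"{name:24s} {density(area, count):.2f} tr/um^2  {share:5.1f}% of chip area")

print(f"{'Total':24s} {density(TOTAL_AREA_UM2, TOTAL_TRANSISTORS):.2f} tr/um^2")
```

Printing the area share next to the density makes it easier to see which blocks dominate the floorplan and which are packed most tightly.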
39
Conclusion
• More marketing
• Summarize chip functionality
• Extending applications of chip
40
Comments?