design and implementation of low-power …ssl.kaist.ac.kr/2007/data/thesis/ramchan woo_ms.pdframchan...
TRANSCRIPT
Ramchan Woo 1
Design and Implementation of Low-PowerEmbedded 3D Graphics Rendering Engine
for Mobile Applications using the Embedded Memory Logic Technology
Ramchan Woo
2000. 12. 19
Semiconductor System Laboratory
Department of Electrical Engineering and Computer Science
Korea Advanced Institute of Science and Technology (KAIST)
Ramchan Woo 2
Outline
• Introduction• Architecture• Memory Access• Circuit Implementation• Performance Comparison• Conclusion and Further Works
Ramchan Woo 3
Introduction
• Multimedia PDA-Chip– Personal Information
Management– MP3 Playback– Realtime Video Decoding– 3D Graphics Rendering
World’s FirstOne-chip Implementation
for PDA
Ramchan Woo 4
3D-CG in PDA-Chip
Internal32bitRISC
E3GRECore
FrameBuffer(6Mb)
MPEG-4Decoder
MPEG-4Decoder
InterfaceSRAM (1kb)
ExternalMemory
ExternalPC
ExternalGeometric Operation
ExternalPolygon Database Polygon BufferPreprocessing Rendering
Polygon conversionGeometric transformation
LIghtingClipping
Temporally storing buffer to match the bandwidth between 32bit RISC bus and 402bit Rendering
Engine bus
ShadingDepth comparisonAlpha blendingRasterizing
LCDDisplay
32bit 402bit
2048bit
24bit32bit
Ramchan Woo 5
E3GRE Features
• Gouraud Shading• Hidden Surface Removal with 16bit Z-buffer• Alpha-Blending for Transparency• Double-Buffering for Flicker-Free Animation• Direct Video Transfer through SAM• 24bit True color on 256 x 256 screen• 2.22Mtris/sec @ 20MHz• 3.2GByte/sec Memory BW @ 20MHz • 6Mb Embedded DRAM Frame-Buffer• Low Power consumption
Ramchan Woo 6
Outline
• Introduction• Architecture• Memory Access• Circuit Implementation• Performance Comparison• Conclusion and Further Works
Ramchan Woo 7
Previous EML Architecures
GPP + Single DRAM- IRAM (Berkeley), M32RD (Mitsubishi)- General-Purpose Scalar Processor- Sequential Memory Access with Bank Interleaving- Small-Width of Datapath (32b or 64b)- Optional Vector Unit (1024b -> 32b)- Huge Area, Huge Power
CacheGeneral
purposeProcessor
VectorUnit
MemoryM
em
ory
Inte
rface
Specialpurpose
Processor Cache
Mem
ory
Inte
rface
Memory
SPP + Single DRAM- 3D-RAM (Mitsubishi), GS@PS2 (SONY)- Special-Purpose Scalar Processor- Large-Width of Datapath (128b)- Sequential Memory Access with Bank Interleaving- High Power due to High-CLK @ SPP
Ramchan Woo 8
Previous EML Architecures (Cont’d)
1D Array (PEs+DRAMs)- Media-Chip (Hitachi), PIP-RAM (NEC)- Processing Elements- Independent Memory Access- Extra Controller is Required for 3D-CG- Low-Power- Area Panelty due to Bad DRAM Cell-Efficiency- Layout Difficulty
PE Memory
PE Memory
PE Memory
PE Memory
PE Memory
PE Memory
PE
Contr
ol
Memory
2D Array (PEs+DRAMs)- RamP-1 (KAIST)- Processing Elements- Independent Memory Access- Well matched to 3D-CG- Low-Power, High-Performance- Large Area Panelty due to Bad DRAM Cell-Efficiency- Layout Difficulty
Contr
ol PE
PEM
PEM
PE M
PE M
PE
PEM
PEM
PE M
PE M
PE
PEM
PEM
PE M
PE M
PE
PEM
PEM
PE M
PE M
Ramchan Woo 9
Proposed ViSTA Architecture
PEC
ontr
ol
PEMemory
PE
PEMemory
PE
PEMemory
PE
PEMemory
PE
Mem
ory
Inte
rface
Virtually Spanning 2D Array (ViSTA)- Hierarchical 1+8 Processing Elements- Independent-controlled DRAM’s- “Logically Local” / “Physically Global” Frame Buffer- Intellegent Memory Interface Circuit- Low-Power, High-Performance, Small-Area- Applicable to PDA-3D
Ramchan Woo 10
ViSTA Operation
Control
EP
PP
PP
PP Inte
rface M
M
M
EP
PP
PP
PP Inte
rface M
M
M
EP
PP
PP
PP Inte
rface M
M
M
EP
PP
PP
PP Inte
rface M
M
M
EP
PP
PP
PP
Inte
rface
M
M
M
EP
PP
PP
PP
Inte
rface
M
M
M
EP
PP
PP
PP
Inte
rface
M
M
M
EP
PP
PP
PP
Inte
rface
M
M
M
Virtually spans 2D array through EP pipeline
Horizonral renderingand memory accessthrough 8 PP'sin parallel
Vertical renderingthrough 8-stage
EP pipeline
2D Screen Space
Polygon Vertex
Rendered Polygon
Ramchan Woo 11
E3GRE Block Diagram
PF
EPI
EPD
PP
PolygonVertexFetch
EdgeInterpolation
HorizontalDivision
PixelCalculation
Pipeline
CLK cycle
Fetch & Control
Polygon Vertex
L R
PF
EPI
EPD
PP
Fra
me B
uff
er
Inte
rface
512kb 512kb
512kb 512kb
512kb 512kb
512kb 512kb
512kb 512kb
512kb 512kb
SAM(1.5Kb SRAM)
6Mb Embedded DRAM (Color Buffer + Z-Buffer)
512kb x 12 independent Macro
RGB out
402bit
640bit
8 Pixel Engines
2048bit
DLL
MUL
80MHz 20MHz80MHz100MHz
E3GRE Core
Clock
• Core and eDRAM are separated by FBI
Ramchan Woo 12
Outline
• Introduction• Architecture• Memory Access• Circuit Implementation• Performance Comparison• Conclusion and Further Works
Ramchan Woo 13
Conventional Memory Mappings
A
C D
A A
AB B
B B
CD
2D Blocking- PC-Graphics- 4-way Bank Interleaving- 1 Block maps into 1 DRAM Row- Unnecessary Power Consumption- Serial Pixel Access- Doesn’t fit well to Parallel Rasterization
Independent Assignment- RamP-1, PixelFlow, InfiniteReality- Independent Memory Control- Reduced Power Consumption- Parallel Pixel Access- Well matched to Parallel Rasterization- Increased Die Area due to DRAM cell
efficiency- Layout difficulty
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
8 9 A B C D E F 8 9 A B C D E F
G H I J K L M N G H I J K L M N
O P Q R S T U V O P Q R S T U V
W X Y Z a b c d
e f g h i j k l
m n o p q r s t
u v w x y z ! @
W X Y Z a b c d
e f g h i j k l
m n o p q r s t
u v w x y z ! @
0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
8 9 A B C D E F 8 9 A B C D E F
G H I J K L M N G H I J K L M N
O P Q R S T U V O P Q R S T U V
W X Y Z a b c d
e f g h i j k l
m n o p q r s t
u v w x y z ! @
W X Y Z a b c d
e f g h i j k l
m n o p q r s t
u v w x y z ! @
Ramchan Woo 14
Proposed Memory Mapping
• Selective and Alternative Line-Block Activation (SALBA)– 4 Independent Memory– Line-Block consists of 8x1 Screen Pixels– 1 Line-Block maps into 1 DRAM SWD– Line-Block Read/Write with embedded-DRAM (x320/block)
A0A1A0A1A0A1A0A1A0A1A0A1
B0B1B0B1B0B1B0B1B0B1B0B1
B0B1B0B1B0B1B0B1B0B1B0B1
Screen
Macro #A0
Macro #A1
Macro #B0
Macro #B1
DRAM
Ramchan Woo 15
Memory Mapping (cont’d)
• Simultaneous and Continuous RMWCycle
READ WRITE
READ WRITE
READ WRITE
READ WRITE
READ WRITE
V0
V1
V2
V3
V4
Pixel
MacroA0, B0
READ and WRITE canoccur simultaneously
MacroA1, B1
V0V1V2V3V4V5V6V7
A0 B0
B1A1
A0 B0
B1A1
A0 B0
B1A1
A0 B0
B1A1
READ
WRITE READ
WRITE READ
WRITE READ
WRITE
CLK cycle
Modify Modify
Modify
A0, B0
A1, B1
A0, B0 A0, B0 A0, B0
A1, B1 A1, B1 A1, B1
Ramchan Woo 16
Low-Power DRAM Access
• Selective Macro Activation (SMA)#A
#A, #B#A, #B#A, #B#A, #B#A, #B
#B#B
Active only necessary Macro(s)
• Partial Wordline Activation (PWA)
Active only necessary Sub-Wordline
GW
L d
rv
SWD
One LB maps intoOne SWD-Activation
Simultaneously read or write8pixel x (24bit RGB + 16bit Z)
x320
Ramchan Woo 17
Low-Power DRAM Access
• Partial I/O Activation (PIA)
SW
D
SW
DSW
D
SWD inMacro #A0
SWD inMacro #B0
SWD inMacro #A1
SWD inMacro #B1
Unnecessary PowerConsumption
SW
D
• Eliminating Redundant Commands
Active only necessary I/O
ACT READ WRITE PCG ACT READRMW
ConventionalREAD and WRITE
ACT READ WRITE PCG ACT READ
MEMclk
PCG ACT PCG
RedundantActivation
Ramchan Woo 18
Memory Interface in ViSTA/SALBA
• Comand Generation for 12 Memories• Run-time bus reconfiguration
– Optimized for RMW data transaction– Data transfer between (2048bit @ 100MHz 2.5V
MEM) and (640bit @ 20MHz 1.5V PP)• Macro-based Design
– Any kind of DRAM macro can be attached– No wait for DRAM macro in RE core design– No more pitch-matched design of PE’s– More layout freedom to designers
Ramchan Woo 19
Run-Time Bus Reconfiguration
• Conventional • Proposed
PE
Mem
ory
Mem
ory
Mem
ory
Mem
ory
Mem
ory
PE
Mem
ory
Mem
ory
Mem
ory
Mem
ory
Mem
ory
PE PE PE PE
(a) Inter-memorybus reconfiguration
(b) Intra-memorybus reconfiguration
MemoryInterface
MemoryInterface
PE
Mem
ory
Mem
ory
Mem
ory
Mem
ory
Mem
ory
PE PE PE PE
MemoryInterface
Ramchan Woo 20
Memory Configuration
A0 B0
A1 B1
A0 B0
A1 B1
A0 B0
A1 B1
FB_alpha
FB_beta
Z-buffer
FrameBuffer
Interface
SAM
SAM
PP
768bit
768bit
512bit
640bit
100MHzMemory Clock
20MHzRendering Clock
• 3.2GByte/sec Memory BW @ 20MHz– 20MHz x 2rmw x 8PPs x 40RGBZ x 2ABmem
Ramchan Woo 21
Memory Access Modes
– (a) : Normal Read-Modify-Write– (b) : Memory Test / FB Initilize– (c) : Refresh
A0 B0 A1 B1
3DRE
A0 B0 A1 B1
3DRE
(a) Communication (b) Individual Activation
A0 B0 A1 B1
3DRE
(c) Broadcasting
ActivatedROW
Ramchan Woo 22
Outline
• Introduction• Architecture• Memory Access• Circuit Implementation• Performance Comparison• Conclusion and Further Works
Ramchan Woo 23
E3GRE Core Floorplan
PF
EPI
EPD
PP
FBI
SAMCTRL
Decap CLK
Low-Power Datapath- PowerPC 603 Latch- Adder- Divider- Alpha-Blend Unit
Special Functional Unit- Bus Interface Circuit- Clock Multipler/Generator
Ramchan Woo 24
Adder
• Low-Power and Fast Addition– More than 100 adders– Static version of “A 670ps, 64bit Dynamic Low-
Power Adder @ 0.25um, 2.5V”, [ISCAS 2000]• DORP instead of XORP• Separated Carry Propagation Path• Efficient Carry Grouping
– 11bit fixed-point for 8bit R,G,B,X,Y– 19bit fixed-point for 16bit Z– Less than 0.01mW power consumption– Less than 1ns 19bit addition time
Ramchan Woo 25
SAS Divider
• Shift-Add-Select Approximation for Low-Power and 1-cycle Division
2's complement Unpack
Routing Track
0 1 32 4 5 6
MUX
2's complenent Pack
++
+ ++
div2 div4 div3div01 div5div7div6
Divider
DividendDivisor
Signed Number
Unsigned Number
Sign
Signed Number
Unsigned Number
Output
D1/D
(exact)1/D
(approx.)Error
(% of X)
0 (000) Not Defined
1 (001) 1 1=(1/2)0 0
2 (010) 0.50.5
=(1/2)1 0
3 (011) 0.333333…0.328125
=(1/2)2+(1/2)4+(1/2)6 0.52%
4 (100) 0.250.25
= (1/2)2 0
5 (101) 0.20.203125
=(1/2)3+(1/2)4+(1/2)6 0.31%
6 (110) 0.166666…0.171875
=(1/2)3+(1/2)5+(1/2)6 0.52%
7 (111) 0.142857 0.140625=(1/2)3+(1/2)6 0.22%
Ramchan Woo 26
SAS Divider (Cont’d)
• Image Comparison between FP-div and SAS-div
FP-div SAS-divConventional
Floating-Point
Divider
Proposed
SAS-Approximation
Divider
Ramchan Woo 27
Alpha Blend Unit
• Restrict Alpha Value for Low-Power Operation -> Mul-Free– A=100%, 50%, 25%, 12.5%, 0%– 24 Alpha-Blend Unit
NEW TransparentObject
X
Y
Z
OLDObject
3D space 2D display
Blended
OLD OLD)-(NEWAOLDA)-(1NEWA BLEND
+×=×+×=
SUB
RightSHIFT
ADD
OLD NEW
Alpha
Alpha-Blend
Blended
Ramchan Woo 28
Frame Buffer Interface
• Run-Time Bus Reconfiguration• DRAM Control (12 independent Macros)
A0
B0
A1
B1
A0
B0
A1
B1
A0
B0
A1
B1
CB
_alp
ha
CB
_beta
Z-B
uff
er
SAM
A0
B0
A1
B1
Double-BufferControl
ctrl01 ctrlAB
PP
EPDA0 cmdB0 cmdA1 cmdB1 cmd
Control
FBI
CLK
100MHz 20MHz
Latc
hIN
IT
2.5V 1.5V
CLKSync
Read
Shifte
rShifte
r
Write
Address, Command
x320
x320
x640
x640
x512
x768
x768
Ramchan Woo 29
FBI Circuits
A0<319>
A1<319>
B0<319>
B1<319>
readH<319>
readL<319>
writeH<319>
writeL<319>
ctrl01 ctrlABread
ctrlABwritectrl01
B1<0>B0<0>A1<0>A0<0>
writeL<0>writeH<0>readL<0>readH<0>
toPP-Read-Bus
(x320)
to/from Macro A0(FB_a, FB_b, ZB)
PP0
PP7
to/from Macro A1(FB_a, FB_b, ZB)
to/from Macro B0(FB_a, FB_b, ZB)
to/from Macro B1(FB_a, FB_b, ZB)
Cascaded2-to-2 MUX
16-to-8Bus-Shifter
BidirectionalMemory Bus
OmnidirectionalPP Bus
PP0
PP7
fromPP-Write-Bus
(x320)
readH-Bus(x320)
readL-Bus(x320)
writeH-Bus(x320)
writeL-Bus(x320)
Shift
Shift
Shift
Shift
Ramchan Woo 30
Clock Generation/Multiplication
• Multiple Clocks– RISC : 80MHz / E3GRE : 20MHz / FB : 100MHz
• DLL-based
PFD
FF FF
DIV4EXTclk(80MHz)
ChargePump
VCDL
DLL
MUL5
DelayTunning
andBuffering
RISCclk(80MHz)
REclk(20MHz)
MEMclk(100MHz)
Ramchan Woo 31
CLK Circuit
CLKin
DelayControl
CLKoutint1 int2 int9
REclk
CLKin
int1
int2
int9
CLKout
ODD pulse(F)
EVEN pulse(R)
int1 int3 int5 int7 int9
VCDLMUL5
CLKin int2 int4 int6 int8
CLKout
Synchronized byCharge Pump
Feedback
Rising EdgeDetect
CLKin
int2
int4
int6
int8
int1
int3
int5
int7
int9
r1
r2
r3
r4
r5
f1
f2
f3
f4
f5
CLKout
r1 r2 r3 r4 r5
f1 f2 f3 f4 f5
PulseShaping
Ramchan Woo 32
CLK Simulation Result
VCDL
MU
L5Charge
Pump
PFD
DIV4 Clock Buffer
CLK Decoupling Capacitor
E3GRECore
CLK
Vdd
Gnd
20pF
Ramchan Woo 33
Chip Micrograph
MC Frame Buffer
#1
MC Frame Buffer
#2
E3GREColor Buffer
E3GREZ-Buffer
32bitRISC
Internal DRAM
CLK
E3GRECore
MCA
BandwidthEqualizer
SAMYCrCb to RGB
Ramchan Woo 34
Rendering Outputs
100% 50% 25% 12.5%
Name Data Size Number of Polygons
COLOR 100KByte 1,972
CAR 1.32MByte 26,700
DINO 352KByte 6,941
Ramchan Woo 35
Circuit Summary
Process 0.18µm Hyundai CMOS EML 3P6M
Total 122E3GRE Core 36
CLK 2
Total 24
E3GRE Core 188,188
Z-Buffer 2Mb DRAM
Frame-Buffer 100MHz
Frame-Buffer 2.5V
SAM 1.5Kb SRAM
CLK 1,343
E3GRE Core 5.75Area(mm2) FB 18.2
CLK 0.03
FB 84
Total 190,231
Color-Buffer 4Mb DRAM
E3GRE Core 20MHz
E3GRE Core 1.5VPower Supply
Operating Frequency
Embedded Memories
Number of Logic Transistors
Power Consumption(mW)
Ramchan Woo 36
Performance Comparison3D-RAM Media-Chip RamP-1 E3GRE
Year JSSC1995 JSSC 1997 ISSCC 2000 ISSCC 2001
Number of Chips 12 1 8 1
Die Area 1680mm2
(140mm2/chip)122mm2 416mm2
(52mm2/chip)24mm2
Power Consumption
9600mW(for ALU and SAM)
640mW(for 8Mb DRAM)
4700mW
122mW(36mW Logic
84mW 6Mb DRAM2mW CLK)
Functional Units
eDRAM, eSRAMSIMD-ShadeSIMD-BlendZ-compare
eDRAM4 ALUs
eDRAM, eSRAMSIMD-Shade, SIMD-Blend, Z-compare
Horizontal Setup, SIMD-Divide
Process 0.5µm 0.4µm 0.35µm 0.18µm
Main Clock 100MHz 100MHz 100MHz 20MHz
Maximum Pixel Fill Rate
400Mpixels(33Mpixels/chip)
400Mpixelsfor clearing
704Mpixels(88Mpixels/chip)
71Mpixels(320Mpixels for clearing)
SustainedTriangle Fill Rate
Less Much Less 352Mpixels(44Mpixels/chip)
71Mpixels
Ramchan Woo 37
Conclusion
• E3GRE Design/Implementation for PDA-Chip– ViSTA Architecture / SALBA Memory Mapping @ Low CLK– Low-Power Memory Access (SMA, PWA, PIA)– Low-Power Circuits– 100% Full-Custom Design (190,231 logic TRs with 6Mb DRAM Macro)
• Features– 2.22Mtris/sec (20MHz/9cycle)– 71Mpixel/sec Pixel Fill Rate (32pixel/triangle)– 122mW with Embedded Frame-Buffer and Clock Generation– 3.2GByte/sec Embedded-DRAM Bandwidth– 24bit True-Color on 256x256 screen– Hidden Surface Removal with 16bit Z-buffer– Double-Buffering for Flicker-Free Animation– Gouraud Shading, Depth Comparison, Alpha-Blending– Direct Video Transfer through SAM
Ramchan Woo 38
Further Works
• Efficient Texture Fetching using Unified DRAM Texture Cache
• Complete PDA-3D Pipeline with MobileGLOn-chipTexture
L1$
16KBSRAM
2texel/clkRasterizer
Rasterizer
Rasterizer
Rasterizer
Off-chipTexture
L2$
4MBDRAM
SystemMemory
128MBDRAM
Control
CPU
AGP
Rasterizer
Rasterizer
Rasterizer
Rasterizer
8texel/clk
On-chipUnified
Texture $
embeddedDRAM
ExternalMemory
DRAM