design and implementation of low-power …ssl.kaist.ac.kr/2007/data/thesis/ramchan woo_ms.pdframchan...

Ramchan Woo 1

Design and Implementation of Low-PowerEmbedded 3D Graphics Rendering Engine

for Mobile Applications using the Embedded Memory Logic Technology

Ramchan Woo

2000. 12. 19

Semiconductor System Laboratory

Department of Electrical Engineering and Computer Science

Korea Advanced Institute of Science and Technology (KAIST)

Ramchan Woo 2

Outline

• Introduction• Architecture• Memory Access• Circuit Implementation• Performance Comparison• Conclusion and Further Works

Ramchan Woo 3

Introduction

• Multimedia PDA-Chip– Personal Information

Management– MP3 Playback– Realtime Video Decoding– 3D Graphics Rendering

World’s FirstOne-chip Implementation

for PDA

Ramchan Woo 4

3D-CG in PDA-Chip

Internal32bitRISC

E3GRECore

FrameBuffer(6Mb)

MPEG-4Decoder

MPEG-4Decoder

InterfaceSRAM (1kb)

ExternalMemory

ExternalPC

ExternalGeometric Operation

ExternalPolygon Database Polygon BufferPreprocessing Rendering

Polygon conversionGeometric transformation

LIghtingClipping

Temporally storing buffer to match the bandwidth between 32bit RISC bus and 402bit Rendering

Engine bus

ShadingDepth comparisonAlpha blendingRasterizing

LCDDisplay

32bit 402bit

2048bit

24bit32bit

Ramchan Woo 5

E3GRE Features

• Gouraud Shading• Hidden Surface Removal with 16bit Z-buffer• Alpha-Blending for Transparency• Double-Buffering for Flicker-Free Animation• Direct Video Transfer through SAM• 24bit True color on 256 x 256 screen• 2.22Mtris/sec @ 20MHz• 3.2GByte/sec Memory BW @ 20MHz • 6Mb Embedded DRAM Frame-Buffer• Low Power consumption

Ramchan Woo 6

Outline


Ramchan Woo 7

Previous EML Architecures

GPP + Single DRAM- IRAM (Berkeley), M32RD (Mitsubishi)- General-Purpose Scalar Processor- Sequential Memory Access with Bank Interleaving- Small-Width of Datapath (32b or 64b)- Optional Vector Unit (1024b -> 32b)- Huge Area, Huge Power

CacheGeneral

purposeProcessor

VectorUnit

MemoryM

em

ory

Inte

rface

Specialpurpose

Processor Cache

Mem

ory

Inte

rface

Memory

SPP + Single DRAM- 3D-RAM (Mitsubishi), GS@PS2 (SONY)- Special-Purpose Scalar Processor- Large-Width of Datapath (128b)- Sequential Memory Access with Bank Interleaving- High Power due to High-CLK @ SPP

Ramchan Woo 8

Previous EML Architecures (Cont’d)

1D Array (PEs+DRAMs)- Media-Chip (Hitachi), PIP-RAM (NEC)- Processing Elements- Independent Memory Access- Extra Controller is Required for 3D-CG- Low-Power- Area Panelty due to Bad DRAM Cell-Efficiency- Layout Difficulty

PE Memory

PE Memory

PE Memory

PE Memory

PE Memory

PE Memory

PE

Contr

ol

Memory

2D Array (PEs+DRAMs)- RamP-1 (KAIST)- Processing Elements- Independent Memory Access- Well matched to 3D-CG- Low-Power, High-Performance- Large Area Panelty due to Bad DRAM Cell-Efficiency- Layout Difficulty

Contr

ol PE

PEM

PEM

PE M

PE M

PE

PEM

PEM

PE M

PE M

PE

PEM

PEM

PE M

PE M

PE

PEM

PEM

PE M

PE M

Ramchan Woo 9

Proposed ViSTA Architecture

PEC

ontr

ol

PEMemory

PE

PEMemory

PE

PEMemory

PE

PEMemory

PE

Mem

ory

Inte

rface

Virtually Spanning 2D Array (ViSTA)- Hierarchical 1+8 Processing Elements- Independent-controlled DRAM’s- “Logically Local” / “Physically Global” Frame Buffer- Intellegent Memory Interface Circuit- Low-Power, High-Performance, Small-Area- Applicable to PDA-3D

Ramchan Woo 10

ViSTA Operation

Control

EP

PP

PP

PP Inte

rface M

M

M

EP

PP

PP

PP Inte

rface M

M

M

EP

PP

PP

PP Inte

rface M

M

M

EP

PP

PP

PP Inte

rface M

M

M

EP

PP

PP

PP

Inte

rface

M

M

M

EP

PP

PP

PP

Inte

rface

M

M

M

EP

PP

PP

PP

Inte

rface

M

M

M

EP

PP

PP

PP

Inte

rface

M

M

M

Virtually spans 2D array through EP pipeline

Horizonral renderingand memory accessthrough 8 PP'sin parallel

Vertical renderingthrough 8-stage

EP pipeline

2D Screen Space

Polygon Vertex

Rendered Polygon

Ramchan Woo 11

E3GRE Block Diagram

PF

EPI

EPD

PP

PolygonVertexFetch

EdgeInterpolation

HorizontalDivision

PixelCalculation

Pipeline

CLK cycle

Fetch & Control

Polygon Vertex

L R

PF

EPI

EPD

PP

Fra

me B

uff

er

Inte

rface

512kb 512kb

512kb 512kb

512kb 512kb

512kb 512kb

512kb 512kb

512kb 512kb

SAM(1.5Kb SRAM)

6Mb Embedded DRAM (Color Buffer + Z-Buffer)

512kb x 12 independent Macro

RGB out

402bit

640bit

8 Pixel Engines

2048bit

DLL

MUL

80MHz 20MHz80MHz100MHz

E3GRE Core

Clock

• Core and eDRAM are separated by FBI

Ramchan Woo 12

Outline


Ramchan Woo 13

Conventional Memory Mappings

A

C D

A A

AB B

B B

CD

2D Blocking- PC-Graphics- 4-way Bank Interleaving- 1 Block maps into 1 DRAM Row- Unnecessary Power Consumption- Serial Pixel Access- Doesn’t fit well to Parallel Rasterization

Independent Assignment- RamP-1, PixelFlow, InfiniteReality- Independent Memory Control- Reduced Power Consumption- Parallel Pixel Access- Well matched to Parallel Rasterization- Increased Die Area due to DRAM cell

efficiency- Layout difficulty

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

8 9 A B C D E F 8 9 A B C D E F

G H I J K L M N G H I J K L M N

O P Q R S T U V O P Q R S T U V

W X Y Z a b c d

e f g h i j k l

m n o p q r s t

u v w x y z ! @

W X Y Z a b c d

e f g h i j k l

m n o p q r s t

u v w x y z ! @

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

8 9 A B C D E F 8 9 A B C D E F

G H I J K L M N G H I J K L M N

O P Q R S T U V O P Q R S T U V

W X Y Z a b c d

e f g h i j k l

m n o p q r s t

u v w x y z ! @

W X Y Z a b c d

e f g h i j k l

m n o p q r s t

u v w x y z ! @

Ramchan Woo 14

Proposed Memory Mapping

• Selective and Alternative Line-Block Activation (SALBA)– 4 Independent Memory– Line-Block consists of 8x1 Screen Pixels– 1 Line-Block maps into 1 DRAM SWD– Line-Block Read/Write with embedded-DRAM (x320/block)

A0A1A0A1A0A1A0A1A0A1A0A1

B0B1B0B1B0B1B0B1B0B1B0B1

B0B1B0B1B0B1B0B1B0B1B0B1

Screen

Macro #A0

Macro #A1

Macro #B0

Macro #B1

DRAM

Ramchan Woo 15

Memory Mapping (cont’d)

• Simultaneous and Continuous RMWCycle

READ WRITE

READ WRITE

READ WRITE

READ WRITE

READ WRITE

V0

V1

V2

V3

V4

Pixel

MacroA0, B0

READ and WRITE canoccur simultaneously

MacroA1, B1

V0V1V2V3V4V5V6V7

A0 B0

B1A1

A0 B0

B1A1

A0 B0

B1A1

A0 B0

B1A1

READ

WRITE READ

WRITE READ

WRITE READ

WRITE

CLK cycle

Modify Modify

Modify

A0, B0

A1, B1

A0, B0 A0, B0 A0, B0

A1, B1 A1, B1 A1, B1

Ramchan Woo 16

Low-Power DRAM Access

• Selective Macro Activation (SMA)#A

#A, #B#A, #B#A, #B#A, #B#A, #B

#B#B

Active only necessary Macro(s)

• Partial Wordline Activation (PWA)

Active only necessary Sub-Wordline

GW

L d

rv

SWD

One LB maps intoOne SWD-Activation

Simultaneously read or write8pixel x (24bit RGB + 16bit Z)

x320

Ramchan Woo 17

Low-Power DRAM Access

• Partial I/O Activation (PIA)

SW

D

SW

DSW

D

SWD inMacro #A0

SWD inMacro #B0

SWD inMacro #A1

SWD inMacro #B1

Unnecessary PowerConsumption

SW

D

• Eliminating Redundant Commands

Active only necessary I/O

ACT READ WRITE PCG ACT READRMW

ConventionalREAD and WRITE

ACT READ WRITE PCG ACT READ

MEMclk

PCG ACT PCG

RedundantActivation

Ramchan Woo 18

Memory Interface in ViSTA/SALBA

• Comand Generation for 12 Memories• Run-time bus reconfiguration

– Optimized for RMW data transaction– Data transfer between (2048bit @ 100MHz 2.5V

MEM) and (640bit @ 20MHz 1.5V PP)• Macro-based Design

– Any kind of DRAM macro can be attached– No wait for DRAM macro in RE core design– No more pitch-matched design of PE’s– More layout freedom to designers

Ramchan Woo 19

Run-Time Bus Reconfiguration

• Conventional • Proposed

PE

Mem

ory

Mem

ory

Mem

ory

Mem

ory

Mem

ory

PE

Mem

ory

Mem

ory

Mem

ory

Mem

ory

Mem

ory

PE PE PE PE

(a) Inter-memorybus reconfiguration

(b) Intra-memorybus reconfiguration

MemoryInterface

MemoryInterface

PE

Mem

ory

Mem

ory

Mem

ory

Mem

ory

Mem

ory

PE PE PE PE

MemoryInterface

Ramchan Woo 20

Memory Configuration

A0 B0

A1 B1

A0 B0

A1 B1

A0 B0

A1 B1

FB_alpha

FB_beta

Z-buffer

FrameBuffer

Interface

SAM

SAM

PP

768bit

768bit

512bit

640bit

100MHzMemory Clock

20MHzRendering Clock

• 3.2GByte/sec Memory BW @ 20MHz– 20MHz x 2rmw x 8PPs x 40RGBZ x 2ABmem

Ramchan Woo 21

Memory Access Modes

– (a) : Normal Read-Modify-Write– (b) : Memory Test / FB Initilize– (c) : Refresh

A0 B0 A1 B1

3DRE

A0 B0 A1 B1

3DRE

(a) Communication (b) Individual Activation

A0 B0 A1 B1

3DRE

(c) Broadcasting

ActivatedROW

Ramchan Woo 22

Outline


Ramchan Woo 23

E3GRE Core Floorplan

PF

EPI

EPD

PP

FBI

SAMCTRL

Decap CLK

Low-Power Datapath- PowerPC 603 Latch- Adder- Divider- Alpha-Blend Unit

Special Functional Unit- Bus Interface Circuit- Clock Multipler/Generator

Ramchan Woo 24

Adder

• Low-Power and Fast Addition– More than 100 adders– Static version of “A 670ps, 64bit Dynamic Low-

Power Adder @ 0.25um, 2.5V”, [ISCAS 2000]• DORP instead of XORP• Separated Carry Propagation Path• Efficient Carry Grouping

– 11bit fixed-point for 8bit R,G,B,X,Y– 19bit fixed-point for 16bit Z– Less than 0.01mW power consumption– Less than 1ns 19bit addition time

Ramchan Woo 25

SAS Divider

• Shift-Add-Select Approximation for Low-Power and 1-cycle Division

2's complement Unpack

Routing Track

0 1 32 4 5 6

MUX

2's complenent Pack

++

+ ++

div2 div4 div3div01 div5div7div6

Divider

DividendDivisor

Signed Number

Unsigned Number

Sign

Signed Number

Unsigned Number

Output

D1/D

(exact)1/D

(approx.)Error

(% of X)

0 (000) Not Defined

1 (001) 1 1=(1/2)0 0

2 (010) 0.50.5

=(1/2)1 0

3 (011) 0.333333…0.328125

=(1/2)2+(1/2)4+(1/2)6 0.52%

4 (100) 0.250.25

= (1/2)2 0

5 (101) 0.20.203125

=(1/2)3+(1/2)4+(1/2)6 0.31%

6 (110) 0.166666…0.171875

=(1/2)3+(1/2)5+(1/2)6 0.52%

7 (111) 0.142857 0.140625=(1/2)3+(1/2)6 0.22%

Ramchan Woo 26

SAS Divider (Cont’d)

• Image Comparison between FP-div and SAS-div

FP-div SAS-divConventional

Floating-Point

Divider

Proposed

SAS-Approximation

Divider

Ramchan Woo 27

Alpha Blend Unit

• Restrict Alpha Value for Low-Power Operation -> Mul-Free– A=100%, 50%, 25%, 12.5%, 0%– 24 Alpha-Blend Unit

NEW TransparentObject

X

Y

Z

OLDObject

3D space 2D display

Blended

OLD OLD)-(NEWAOLDA)-(1NEWA BLEND

+×=×+×=

SUB

RightSHIFT

ADD

OLD NEW

Alpha

Alpha-Blend

Blended

Ramchan Woo 28

Frame Buffer Interface

• Run-Time Bus Reconfiguration• DRAM Control (12 independent Macros)

A0

B0

A1

B1

A0

B0

A1

B1

A0

B0

A1

B1

CB

_alp

ha

CB

_beta

Z-B

uff

er

SAM

A0

B0

A1

B1

Double-BufferControl

ctrl01 ctrlAB

PP

EPDA0 cmdB0 cmdA1 cmdB1 cmd

Control

FBI

CLK

100MHz 20MHz

Latc

hIN

IT

2.5V 1.5V

CLKSync

Read

Shifte

rShifte

r

Write

Address, Command

x320

x320

x640

x640

x512

x768

x768

Ramchan Woo 29

FBI Circuits

A0<319>

A1<319>

B0<319>

B1<319>

readH<319>

readL<319>

writeH<319>

writeL<319>

ctrl01 ctrlABread

ctrlABwritectrl01

B1<0>B0<0>A1<0>A0<0>

writeL<0>writeH<0>readL<0>readH<0>

toPP-Read-Bus

(x320)

to/from Macro A0(FB_a, FB_b, ZB)

PP0

PP7

to/from Macro A1(FB_a, FB_b, ZB)

to/from Macro B0(FB_a, FB_b, ZB)

to/from Macro B1(FB_a, FB_b, ZB)

Cascaded2-to-2 MUX

16-to-8Bus-Shifter

BidirectionalMemory Bus

OmnidirectionalPP Bus

PP0

PP7

fromPP-Write-Bus

(x320)

readH-Bus(x320)

readL-Bus(x320)

writeH-Bus(x320)

writeL-Bus(x320)

Shift

Shift

Shift

Shift

Ramchan Woo 30

Clock Generation/Multiplication

• Multiple Clocks– RISC : 80MHz / E3GRE : 20MHz / FB : 100MHz

• DLL-based

PFD

FF FF

DIV4EXTclk(80MHz)

ChargePump

VCDL

DLL

MUL5

DelayTunning

andBuffering

RISCclk(80MHz)

REclk(20MHz)

MEMclk(100MHz)

Ramchan Woo 31

CLK Circuit

CLKin

DelayControl

CLKoutint1 int2 int9

REclk

CLKin

int1

int2

int9

CLKout

ODD pulse(F)

EVEN pulse(R)

int1 int3 int5 int7 int9

VCDLMUL5

CLKin int2 int4 int6 int8

CLKout

Synchronized byCharge Pump

Feedback

Rising EdgeDetect

CLKin

int2

int4

int6

int8

int1

int3

int5

int7

int9

r1

r2

r3

r4

r5

f1

f2

f3

f4

f5

CLKout

r1 r2 r3 r4 r5

f1 f2 f3 f4 f5

PulseShaping

Ramchan Woo 32

CLK Simulation Result

VCDL

MU

L5Charge

Pump

PFD

DIV4 Clock Buffer

CLK Decoupling Capacitor

E3GRECore

CLK

Vdd

Gnd

20pF

Ramchan Woo 33

Chip Micrograph

MC Frame Buffer

#1

MC Frame Buffer

#2

E3GREColor Buffer

E3GREZ-Buffer

32bitRISC

Internal DRAM

CLK

E3GRECore

MCA

BandwidthEqualizer

SAMYCrCb to RGB

Ramchan Woo 34

Rendering Outputs

100% 50% 25% 12.5%

Name Data Size Number of Polygons

COLOR 100KByte 1,972

CAR 1.32MByte 26,700

DINO 352KByte 6,941

Ramchan Woo 35

Circuit Summary

Process 0.18µm Hyundai CMOS EML 3P6M

Total 122E3GRE Core 36

CLK 2

Total 24

E3GRE Core 188,188

Z-Buffer 2Mb DRAM

Frame-Buffer 100MHz

Frame-Buffer 2.5V

SAM 1.5Kb SRAM

CLK 1,343

E3GRE Core 5.75Area(mm2) FB 18.2

CLK 0.03

FB 84

Total 190,231

Color-Buffer 4Mb DRAM

E3GRE Core 20MHz

E3GRE Core 1.5VPower Supply

Operating Frequency

Embedded Memories

Number of Logic Transistors

Power Consumption(mW)

Ramchan Woo 36

Performance Comparison3D-RAM Media-Chip RamP-1 E3GRE

Year JSSC1995 JSSC 1997 ISSCC 2000 ISSCC 2001

Number of Chips 12 1 8 1

Die Area 1680mm2

(140mm2/chip)122mm2 416mm2

(52mm2/chip)24mm2

Power Consumption

9600mW(for ALU and SAM)

640mW(for 8Mb DRAM)

4700mW

122mW(36mW Logic

84mW 6Mb DRAM2mW CLK)

Functional Units

eDRAM, eSRAMSIMD-ShadeSIMD-BlendZ-compare

eDRAM4 ALUs

eDRAM, eSRAMSIMD-Shade, SIMD-Blend, Z-compare

Horizontal Setup, SIMD-Divide

Process 0.5µm 0.4µm 0.35µm 0.18µm

Main Clock 100MHz 100MHz 100MHz 20MHz

Maximum Pixel Fill Rate

400Mpixels(33Mpixels/chip)

400Mpixelsfor clearing

704Mpixels(88Mpixels/chip)

71Mpixels(320Mpixels for clearing)

SustainedTriangle Fill Rate

Less Much Less 352Mpixels(44Mpixels/chip)

71Mpixels

Ramchan Woo 37

Conclusion

• E3GRE Design/Implementation for PDA-Chip– ViSTA Architecture / SALBA Memory Mapping @ Low CLK– Low-Power Memory Access (SMA, PWA, PIA)– Low-Power Circuits– 100% Full-Custom Design (190,231 logic TRs with 6Mb DRAM Macro)

• Features– 2.22Mtris/sec (20MHz/9cycle)– 71Mpixel/sec Pixel Fill Rate (32pixel/triangle)– 122mW with Embedded Frame-Buffer and Clock Generation– 3.2GByte/sec Embedded-DRAM Bandwidth– 24bit True-Color on 256x256 screen– Hidden Surface Removal with 16bit Z-buffer– Double-Buffering for Flicker-Free Animation– Gouraud Shading, Depth Comparison, Alpha-Blending– Direct Video Transfer through SAM

Ramchan Woo 38

Further Works

• Efficient Texture Fetching using Unified DRAM Texture Cache

• Complete PDA-3D Pipeline with MobileGLOn-chipTexture

L1$

16KBSRAM

2texel/clkRasterizer

Rasterizer

Rasterizer

Rasterizer

Off-chipTexture

L2$

4MBDRAM

SystemMemory

128MBDRAM

Control

CPU

AGP

Rasterizer

Rasterizer

Rasterizer

Rasterizer

8texel/clk

On-chipUnified

Texture $

embeddedDRAM

ExternalMemory

DRAM

design and implementation of low-power …ssl.kaist.ac.kr/2007/data/thesis/ramchan woo_ms.pdframchan...

Documents