Yang Greenstein, Part 2


AMD: DV Club - Westford MA, May 30, 2008

How Shaders are Created

[Diagram: Application, API, GPU Driver, Video BIOS]


Images


Display Processing

Advanced Gamma and Color Correction

[Image: no correction]


Display Processing

Advanced Gamma and Color Correction

[Images: no correction vs. Avivo Display Engine 10-bit gamma and color correction]
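Why does gamma-table precision matter? Here is a minimal C++ sketch. It assumes a simple power-law curve; gamma 2.2 is an illustrative value, not taken from the slides. The helper names `buildGammaLut` and `distinctLevels` are hypothetical. The sketch counts how many distinct output codes survive when 8-bit input is corrected through an 8-bit versus a 10-bit output table. It is not a model of the Avivo hardware.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Build a gamma-correction lookup table that maps every inBits-wide input
// code to an outBits-wide output code using a power-law curve (assumption).
std::vector<uint32_t> buildGammaLut(int inBits, int outBits, double gamma) {
    assert(inBits > 0 && inBits <= 16 && outBits > 0 && outBits <= 16 && gamma > 0.0);
    const uint32_t inMax = (1u << inBits) - 1;
    const uint32_t outMax = (1u << outBits) - 1;
    std::vector<uint32_t> lut(inMax + 1);
    for (uint32_t i = 0; i <= inMax; ++i) {
        const double normalized = static_cast<double>(i) / inMax;
        const double corrected = std::pow(normalized, 1.0 / gamma);
        lut[i] = static_cast<uint32_t>(std::lround(corrected * outMax));
    }
    return lut;
}

// Number of distinct output codes; fewer codes means coarser steps (banding).
std::size_t distinctLevels(const std::vector<uint32_t>& lut) {
    return std::set<uint32_t>(lut.begin(), lut.end()).size();
}
```

Under these assumptions, `distinctLevels(buildGammaLut(8, 8, 2.2))` comes out below 256, because neighboring input codes collapse where the curve is flat. `distinctLevels(buildGammaLut(8, 10, 2.2))` keeps all 256. Collapsed codes are what appear on screen as banding.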


“Call of Juarez” using DirectX 9


“Call of Juarez” using DirectX 10


GPU Verification


Graphics Verification Challenges

Large complex ASICs:

• Approaching 1B transistors; >50 different clocks; >600 MHz; >100 top-level tiles

• Parallel SIMDs; multiple pipelines; hundreds of threads in flight; >300 ALUs

• High-bandwidth memory/cache interface; PCI Express; DisplayPort

Third-party compliance: DirectX and OpenGL graphics APIs and applications

Firmware is critical to ASIC function

• ASIC validation uses the firmware release as part of tape-out

• Firmware debug takes a significant amount of time

Processing full frames requires days to weeks of RTL simulation

Market window is small; the consumer market is harsh!

• Schedule is KING

• Need incremental development, hierarchy, and reuse of prior work

• Respins are costly; time to market is critical

• Christmas, Dads/Grads, or bust!


GPU Architecture


Top Level: Radeon 2900

Red – Compute

Yellow – Cache

Unified shader

Shader R/W

Instr./Const. cache

Unified texture cache

Compression

4 SIMDs

16 Pipelines/SIMD

5 stream processors (32-bit FP) per pipeline

320 ALU ops in parallel

Over 700M transistors
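The 320 figure follows directly from the three counts above. Here is a trivial C++ check; the constant names are hypothetical, and "ALU op" is read as one 32-bit FP stream-processor operation in parallel, as the slide implies.

```cpp
#include <cassert>

// Radeon 2900 shader-array figures from the slide (constant names are hypothetical).
constexpr int kSimds = 4;
constexpr int kPipelinesPerSimd = 16;
constexpr int kStreamProcessorsPerPipeline = 5;  // 32-bit FP each

// Total ALU operations that can proceed in parallel across the shader array.
constexpr int parallelAluOps() {
    return kSimds * kPipelinesPerSimd * kStreamProcessorsPerPipeline;
}

static_assert(parallelAluOps() == 320, "4 x 16 x 5 should equal 320");
```

This is also consistent with the ">300 ALUs" quoted under Graphics Verification Challenges.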

[Block diagram: Command Processor, Vertex Index Fetch, Tessellator, Vertex Assembler, Geometry Assembler, Setup Unit, Rasterizer, Interpolators, Hierarchical Z, Ultra-Threaded Dispatch Processor, Unified Shader Processors, Shader Caches (Instruction & Constant), Texture Units, L1 and L2 Texture Caches, Memory Read/Write Cache, Stream Out, Shader Export, Render Back-Ends, Z/Stencil Cache, Color Cache]


Technical Solutions

Layered CODE Methodology

• Multiple Layers of Testbenches

• Maximize Controllability, Observability, and Debug Efficiency

• Reference Model

Tools

• Coverage and assertions

• Visualization

HW Emulation


Layered CODE Verification

Testbench Capability: Maximize Controllability, Observability, and Debug Efficiency

Level | Expected bugs found (for efficiency) | Controllability | Observability | Debug/fix efficiency
Sub-block | Most | Max | Max: closest to design; internal corner states | Minutes
Block | Many | High | High: block I/O | Minutes to hours
Chip/System | Few | Med | Med: chip I/O | Hours to days
Silicon in lab | Zero | Low | Low | Days to weeks

Controllability: I/O; pipeline timing; sequencing; internal state; error injection
Observability: checking results in I/O; internal states


Reference Model Methodology

C++ reference model of the DUT

• One “block” = one C++ object

• Non-synthesizable, so easier to write than RTL

• Very fast

– Several orders of magnitude faster than RTL simulation of the design

– Used by the driver and performance teams

Transaction-level accuracy

• Block-to-block interfaces modeled (see SystemVerilog definition)

• Matches the design almost exactly

• Sub-transaction debug taps for added accuracy
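To make "one block = one C++ object" and transaction-level interfaces concrete, here is a minimal C++ sketch. The transaction type `PixelTxn` and the blocks `ShadeBlock` and `SinkBlock` are hypothetical stand-ins, not AMD's model. Each block is an object that receives a transaction on its input interface and forwards its result to the next block.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical transaction carried on a block-to-block interface.
struct PixelTxn {
    uint32_t x;
    uint32_t y;
    float r, g, b;
};

// Each block of the reference model is one C++ object with one input interface.
class Block {
public:
    virtual ~Block() = default;
    virtual void accept(const PixelTxn& txn) = 0;
};

// Hypothetical shading stage: scales the color and forwards the transaction.
class ShadeBlock : public Block {
public:
    ShadeBlock(Block& next, float scale) : next_(next), scale_(scale) {}
    void accept(const PixelTxn& txn) override {
        PixelTxn out = txn;
        out.r *= scale_;
        out.g *= scale_;
        out.b *= scale_;
        next_.accept(out);
    }
private:
    Block& next_;
    float scale_;
};

// Hypothetical end of the chain: records every transaction it receives so a
// test can compare it with what the RTL produced at the same interface.
class SinkBlock : public Block {
public:
    void accept(const PixelTxn& txn) override { received.push_back(txn); }
    std::vector<PixelTxn> received;
};
```

Suppose `ShadeBlock shade(sink, 0.5f)` is chained in front of a `SinkBlock sink`. Calling `shade.accept(...)` then leaves one scaled transaction in `sink.received`, ready to be compared against RTL output.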


Testbenches

Sub-block testbenches: designer bootstrap

Block-level testbenches: constrained-random

• Tests written in C++

• Test library in C++

– SCV (SystemC Verification Library) and other randomization

• Threaded transport layer

– Based on SystemC

– C++ to C++

– C++ to Verilog

• Two-pass approach

– Run on the reference model, then on RTL, then compare

SystemVerilog testbenches are also used
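Here is a minimal sketch of the constrained-random, two-pass flow described above. It assumes a block whose behavior can be written as a function of one stimulus. `std::mt19937` stands in for SCV, and `Stimulus`, `generateStimuli`, and `twoPassCompare` are hypothetical names. In the real flow, the second pass would drive RTL through the transport layer rather than call a C++ function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

// Hypothetical stimulus for the block under test.
struct Stimulus {
    uint32_t a;
    uint32_t b;
};

// Constrained-random generation (std::mt19937 stands in for SCV).
// Illustrative constraint: both operands stay below `limit`.
std::vector<Stimulus> generateStimuli(std::size_t count, uint32_t limit, uint32_t seed) {
    assert(limit > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<uint32_t> dist(0, limit - 1);
    std::vector<Stimulus> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) out.push_back({dist(rng), dist(rng)});
    return out;
}

using Model = std::function<uint32_t(const Stimulus&)>;

// Two-pass flow: run every stimulus on the reference model, then on the
// design, then compare. Returns the indices of mismatching stimuli.
std::vector<std::size_t> twoPassCompare(const std::vector<Stimulus>& stim,
                                        const Model& refModel, const Model& design) {
    std::vector<uint32_t> expected;
    for (const Stimulus& s : stim) expected.push_back(refModel(s));  // pass 1
    std::vector<uint32_t> actual;
    for (const Stimulus& s : stim) actual.push_back(design(s));      // pass 2
    std::vector<std::size_t> mismatches;
    for (std::size_t i = 0; i < stim.size(); ++i)
        if (expected[i] != actual[i]) mismatches.push_back(i);
    return mismatches;
}
```

With an identical design, `twoPassCompare` returns an empty list. A design that is off by one flags every stimulus. Fixing the seed makes a failing run reproducible.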

[Diagram: layered stack of Test, Test library, Transport, Block reference model OR RTL]
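The slides base the transport layer on SystemC threads. As a rough, swapped-in illustration only, this C++17 sketch uses `std::thread` and a hypothetical blocking `Channel` to pass transactions between a test thread and a model thread. SystemC's cooperative scheduling is not reproduced here, and `runThroughModel` is a hypothetical helper.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking FIFO channel between two threads.
template <typename T>
class Channel {
public:
    void send(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(value));
        }
        cv_.notify_one();
    }
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }
    // Blocks until a value arrives; returns nullopt once closed and drained.
    std::optional<T> receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty() || closed_; });
        if (queue_.empty()) return std::nullopt;
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
    bool closed_ = false;
};

// The test thread sends requests; a model thread answers on a second channel.
std::vector<int> runThroughModel(const std::vector<int>& requests) {
    Channel<int> toModel;
    Channel<int> fromModel;
    std::thread model([&] {
        // Stand-in "model": doubles each request.
        while (auto req = toModel.receive()) fromModel.send(*req * 2);
        fromModel.close();
    });
    for (int r : requests) toModel.send(r);
    toModel.close();
    std::vector<int> responses;
    while (auto resp = fromModel.receive()) responses.push_back(*resp);
    model.join();
    return responses;
}
```

`runThroughModel({1, 2, 3})` returns `{2, 4, 6}`. The test only sees the channels, so the thread behind them could in principle be a C++ model or a bridge to a simulator.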


Testbenches

Chip/system testbenches

• Tests written in C++

• Tests debugged on the chip reference model

– A collection of block reference models (see the previous slide)

• Test library in C++

– Mimics OpenGL, an industry standard

Test portability

• Write once, run everywhere:

– Reference model

– Design

– H/W emulation

– Lab/diags

– Production drivers

• Overall time to market (TTM) improved

– The driver schedule is nontrivial

[Diagram: layered stack of Test; OpenGL-like test library OR production driver; Transport; Chip reference model OR RTL OR emulation OR real H/W]
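One way to get "write once, run everywhere" is to write tests against an abstract target and put each platform behind its own implementation. Here is a minimal C++ sketch of that idea. `Command`, `Target`, `RefModelTarget`, `RecordingTarget`, and `clearScreenTest` are all hypothetical, and a real OpenGL-like library would issue far richer commands.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical command produced by an OpenGL-like test library.
struct Command {
    std::string op;
    uint32_t arg;
};

// Hypothetical target interface: reference model, RTL, emulator, or real H/W
// would each sit behind one implementation (via the transport layer).
class Target {
public:
    virtual ~Target() = default;
    virtual void submit(const Command& cmd) = 0;
    virtual uint32_t readPixel(int x, int y) = 0;
};

// Stand-in for the chip reference model: a tiny framebuffer.
class RefModelTarget : public Target {
public:
    RefModelTarget(int width, int height)
        : width_(width), pixels_(static_cast<std::size_t>(width) * height, 0u) {}
    void submit(const Command& cmd) override {
        if (cmd.op == "clear") std::fill(pixels_.begin(), pixels_.end(), cmd.arg);
    }
    uint32_t readPixel(int x, int y) override {
        return pixels_.at(static_cast<std::size_t>(y) * width_ + x);
    }
private:
    int width_;
    std::vector<uint32_t> pixels_;
};

// Wraps any target and logs the command stream (e.g., for later replay).
class RecordingTarget : public Target {
public:
    explicit RecordingTarget(Target& inner) : inner_(inner) {}
    void submit(const Command& cmd) override {
        log.push_back(cmd);
        inner_.submit(cmd);
    }
    uint32_t readPixel(int x, int y) override { return inner_.readPixel(x, y); }
    std::vector<Command> log;
private:
    Target& inner_;
};

// Written once against Target, so it runs unchanged on every implementation.
// Assumes a framebuffer of at least 4x4 pixels.
bool clearScreenTest(Target& target) {
    target.submit({"clear", 0xFF0000FFu});
    return target.readPixel(0, 0) == 0xFF0000FFu && target.readPixel(3, 3) == 0xFF0000FFu;
}
```

The same `clearScreenTest` passes on a bare `RefModelTarget` and on a `RecordingTarget` wrapped around one. The recorded log hints at how captured command streams could be replayed elsewhere.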


HW Emulation

Usage:

In-circuit emulation of the full chip design, running chip DV tests and the SW stack

• Runs up to 1000X faster than SW (RTL) simulation

• Renders full image frames in minutes/hours vs. days/weeks

• Captures and plays back scenes from benchmarks and games

Pre-silicon

• Verifies chip/system-level functionality and performance, block interactions, and stress cases

• Allows longer runs of random tests to look for hangs

• Prototypes and tests SW drivers and diagnostics

• Develops boot-up settings

Post-silicon: bring-up to production

• Debug platform for silicon

• Validates ECOs
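Here is a quick sanity check of the speedup numbers in the usage list. It assumes the emulator's advantage is a single constant factor over RTL simulation; the slide says "up to" 1000X, so real runs vary. `emulationMinutes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

constexpr double kMinutesPerDay = 24.0 * 60.0;

// Hypothetical helper: wall-clock minutes on an emulator that runs `speedup`
// times faster than RTL simulation of the same workload.
double emulationMinutes(double rtlMinutes, double speedup) {
    assert(rtlMinutes >= 0.0 && speedup > 0.0);
    return rtlMinutes / speedup;
}
```

At 1000X, a frame that needs one day of RTL simulation (1440 minutes) takes about 1.44 minutes. A two-week run (20160 minutes) takes about 20 minutes. At an illustrative 100X, the two-week run takes about 3.4 hours. This is consistent with "minutes/hours vs. days/weeks".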


Coverage and Assertions

Assertions are a Good Thing

• White-box testing

• Lets designers contribute directly to DV

• Etc.
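As an example of a white-box check, here is a C++ sketch of a common protocol assertion on a hypothetical valid/ready interface. The rule: once `valid` is raised and not yet accepted, the sender must keep `valid` high and `data` stable. In RTL this would usually be a SystemVerilog assertion. The function below, `firstHandshakeViolation` (a hypothetical name), checks a recorded trace instead.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One sampled clock cycle of a hypothetical valid/ready interface.
struct Cycle {
    bool valid;
    bool ready;
    uint32_t data;
};

// Protocol assertion: if valid is high and ready is low in cycle t, then in
// cycle t+1 valid must still be high and data must be unchanged.
// Returns the index of the first violating cycle, or -1 if the trace passes.
int firstHandshakeViolation(const std::vector<Cycle>& trace) {
    for (std::size_t t = 0; t + 1 < trace.size(); ++t) {
        const Cycle& now = trace[t];
        const Cycle& next = trace[t + 1];
        if (now.valid && !now.ready && (!next.valid || next.data != now.data))
            return static_cast<int>(t + 1);
    }
    return -1;
}
```

If `valid` drops before `ready` is seen, the function returns the index of the offending cycle. That points debug straight at the cycle where the rule broke.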

Functional coverage is a Good Thing

• Deep corner cases

• The API spec does not expose all implementation details

• Etc.

• Bug rates and DV closure improved greatly when functional coverage was adopted
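Functional coverage records which interesting combinations a random test actually exercised. Here is a minimal C++ sketch of cross coverage over a hypothetical pair of features: primitive type and depth test on/off. `CrossCoverage` is a hypothetical name. In a SystemVerilog flow this would be a covergroup with a cross.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>

enum class Primitive { Points, Lines, Triangles };
enum class DepthTest { Off, On };

// Hypothetical cross-coverage collector: every (primitive, depth-test) pair is one bin.
class CrossCoverage {
public:
    void sample(Primitive p, DepthTest d) { ++hits_[{p, d}]; }
    int hitsFor(Primitive p, DepthTest d) const {
        auto it = hits_.find({p, d});
        return it == hits_.end() ? 0 : it->second;
    }
    // Percentage of the 3 x 2 = 6 bins hit at least once.
    double percent() const { return 100.0 * hits_.size() / kTotalBins; }
private:
    static constexpr int kTotalBins = 3 * 2;
    std::map<std::pair<Primitive, DepthTest>, int> hits_;
};
```

Sampling triangles with depth test on twice and points with depth test off once hits 2 of 6 bins, about 33%. The empty bins show which constraints to steer next.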


Visualization

It is graphics, after all

• Nice to see pretty pictures of what you are drawing

[Image: two overlapping textured triangles, with depth]
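A low-effort way to "see" simulation output is to dump the framebuffer to an image file. This sketch writes the plain-text PPM (P3) format, which many image viewers can open. `toPpm` and the packed 8-bit RGB layout are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: serialize a row-major 8-bit RGB framebuffer as an
// ASCII PPM (P3) image, one text line per pixel row.
std::string toPpm(int width, int height, const std::vector<uint8_t>& rgb) {
    assert(rgb.size() == static_cast<std::size_t>(width) * height * 3);
    std::ostringstream out;
    out << "P3\n" << width << ' ' << height << "\n255\n";
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t i = (static_cast<std::size_t>(y) * width + x) * 3;
            out << int(rgb[i]) << ' ' << int(rgb[i + 1]) << ' ' << int(rgb[i + 2]);
            out << (x + 1 < width ? ' ' : '\n');
        }
    }
    return out.str();
}
```

For example, `std::ofstream("frame.ppm") << toPpm(w, h, pixels);` writes a viewable image of a simulated frame.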


Visualization

Corruptions become easier to see, and patterns easier to recognize

[Image: color corruption]
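Corruptions stand out even more when the actual frame is diffed against the reference-model frame. Here is a minimal C++ sketch. It assumes both frames are the same size, with one packed RGBA value (0xRRGGBBAA) per pixel. `diffMask` is a hypothetical helper that paints mismatches red on black and counts them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical image-diff helper: compares expected (reference model) and
// actual (e.g., RTL) framebuffers and returns a mask image in which
// mismatching pixels are opaque red and matching pixels are opaque black.
std::vector<uint32_t> diffMask(const std::vector<uint32_t>& expected,
                               const std::vector<uint32_t>& actual,
                               std::size_t* mismatchCount) {
    assert(expected.size() == actual.size());
    std::vector<uint32_t> mask(expected.size(), 0x000000FFu);  // opaque black
    std::size_t count = 0;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        if (expected[i] != actual[i]) {
            mask[i] = 0xFF0000FFu;  // opaque red
            ++count;
        }
    }
    if (mismatchCount != nullptr) *mismatchCount = count;
    return mask;
}
```

Written out as an image, the mask turns scattered mismatches into visible shapes, so a pattern such as a band or a tile-shaped block is easy to spot.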


Summary

AMD + ATI = positioned for success

Graphics business/technology has many challenges

Market window is everything

Techniques mostly leverage standard industry practice, with some twists

• Reference-model-based flow

• High quality is required

– Rely on coverage, constrained-random testing, etc.

• H/W and S/W are both key to product success

– Seamless integration required

We are growing

• Always looking for good people!



Backup Slides