Intel’s Larrabee

Vipin P. Nair
S7-EC
Roll no: 24CEK

Introduction

• Larrabee is a many-core general-purpose graphics processing unit (GPGPU) that combines the functions of a multi-core CPU and a GPU.
• Larrabee is based on Intel’s x86 architecture.

Architectural convergence

Features

• Rasterization, depth testing and alpha blending run entirely in software; texture filtering uses dedicated fixed-function units
• A binned renderer increases parallelism and reduces memory bandwidth
• Parallel processing for image processing, physical simulation, and medical and financial analysis
• GDDR5 memory support
• Each core delivers up to 32 GFLOPS in single precision at a 1 GHz clock (16 lanes × 2 operations per fused multiply-add), so a chip with tens of cores reaches the teraflops range
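The per-core throughput figure above follows from simple arithmetic. The sketch below reproduces it, assuming that one fused multiply-add counts as two floating-point operations per lane per clock. The core counts in the loop are illustrative values, not product specifications.

```python
# Peak single-precision throughput from the figures quoted on the slide.
LANES = 16           # single-precision lanes per vector unit
FLOPS_PER_LANE = 2   # assumption: a fused multiply-add counts as 2 flops
CLOCK_GHZ = 1.0      # clock rate quoted on the slide


def peak_gflops(cores, clock_ghz=CLOCK_GHZ):
    """Theoretical peak in GFLOPS for a chip with `cores` cores."""
    return cores * LANES * FLOPS_PER_LANE * clock_ghz


# Illustrative core counts only; the slide does not fix a core count.
for cores in (1, 16, 32, 64):
    print(f"{cores:2d} cores: {peak_gflops(cores):7.1f} GFLOPS")
```

These are theoretical peaks; sustained throughput depends on how well the workload keeps all 16 lanes busy.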

Differences with CPU

• In-order execution instead of out-of-order execution
• Vector processing unit operates on 16 single-precision floating-point numbers at a time
• Texture sampling units for trilinear and anisotropic filtering and texture decompression
• 1024-bit ring bus between cores (512 bits in each direction)
• Cache control instructions
• 4-way multithreading per core

Difference with GPU

• x86 instruction set with Larrabee-specific extensions

• Cache coherency across all cores
• Z-buffering, clipping and blending are done in software, without dedicated graphics hardware

Larrabee – Block Diagram

Architecture

• Cores communicate over a 1024-bit wide ring bus
 – Fast access to memory, I/O interfaces and fixed-function blocks
 – Fast access for cache coherency
• L2 cache is partitioned among the cores
 – Provides high aggregate bandwidth
 – Allows data replication and sharing
• Optimized for highly parallel workloads using vector processing
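To illustrate why a bidirectional ring gives short paths between cores, the sketch below counts the fewest hops between two agents on a ring. The function name `ring_hops`, the agent numbering and the 16-agent ring size are assumptions made for illustration; the slides do not describe the routing policy.

```python
def ring_hops(src, dst, agents):
    """Fewest hops between two agents on a bidirectional ring.

    Agents are numbered 0..agents-1 around the ring (hypothetical numbering).
    """
    clockwise = (dst - src) % agents
    return min(clockwise, agents - clockwise)


AGENTS = 16  # assumed ring size, for illustration only

# With two directions the worst case is half the ring, not the whole ring.
worst = max(ring_hops(0, d, AGENTS) for d in range(AGENTS))
print("worst-case hops:", worst)
```

With a single direction the worst case would be `agents - 1` hops, so the second direction roughly halves the longest trip.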

In-order CPU Core

• Separate scalar and vector units with separate registers
• Vector unit: 16 32-bit operations per clock
• In-order instruction execution
• Fast access from the 64 KB L1 cache
• Direct connection to each core’s 256 KB subset of the L2 cache
• Prefetch instructions load the L1 and L2 caches

Vector Unit

• Vector-complete instruction set
 – Scatter/gather for vector load/store
 – Mask registers select which lanes to write, which allows data-parallel flow control
 – Masks also support data compaction
• Vector instructions support
 – Full speed when data is in the L1 cache
 – Fused multiply-add (three arguments)
 – Int32, Float32 and Float64 data
 – Reading 8-bit unorm, 8-bit uint, 16-bit sint and 16-bit float data and converting it to 32-bit floats or integers
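The vector-unit features above can be modelled in a few lines. This is a plain-Python sketch of the ideas, not Larrabee's instruction set: `masked_fma`, `compact`, `gather` and `unorm8_to_float` are hypothetical helper names, and a list of 16 floats stands in for a vector register.

```python
LANES = 16  # one vector register holds 16 32-bit elements


def masked_fma(a, b, c, mask, dest):
    """Per-lane a*b + c, written only where the mask is set.

    Lanes with a clear mask bit keep the old value from `dest`.
    (Hardware FMA rounds once; this model rounds after each step.)
    """
    return [a[i] * b[i] + c[i] if mask[i] else dest[i] for i in range(LANES)]


def compact(values, mask):
    """Pack the selected lanes together, preserving lane order."""
    return [v for v, m in zip(values, mask) if m]


def gather(memory, indices):
    """Load one element per lane from arbitrary positions in memory."""
    return [memory[i] for i in indices]


def unorm8_to_float(byte):
    """Convert an 8-bit unsigned-normalized value to a float in [0, 1]."""
    return byte / 255.0


x = [float(i) for i in range(LANES)]
mask = [v >= 8.0 for v in x]  # "if x >= 8" expressed as a lane mask

# Lanes where the condition holds compute 2*x + 1; the others keep x.
y = masked_fma(x, [2.0] * LANES, [1.0] * LANES, mask, x)
print(y)
print(compact(x, mask))
print(gather(x, [15, 0, 7, 8]))
print(unorm8_to_float(255))
```

The mask plays the role of a branch: both sides of an `if` can be evaluated on all lanes, with the mask deciding which lanes commit their result.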

Fixed Function Logic

• Software, rather than fixed-function logic, handles rasterization, interpolation and post-shader alpha blending.

• Includes fixed function texture filter logic

• Virtual memory for textures

Larrabee’s Binning Renderer

• Binning pipeline
 – Reduces synchronization
 – Front end processes vertex and geometry shading
 – Back end processes pixel shading, stencil testing and blending
 – A bin FIFO sits between them
• Multi-tasking by cores
 – Each orange box in the diagram is a core
 – Cores run independently
 – Other cores can run other tasks, e.g. physics
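The front-end binning step can be sketched as follows. This is a minimal illustration that assigns each triangle to every screen tile its bounding box touches. The 64-pixel tile size, the function name `bin_triangles` and the bounding-box test are assumptions, since the slides do not give these details.

```python
TILE = 64  # assumed tile edge in pixels; the slides do not give a size


def bin_triangles(triangles, width, height, tile=TILE):
    """Append each triangle's index to every tile its bounding box touches.

    `triangles` is a list of three (x, y) screen-space points each.
    Returns a dict mapping (tile_x, tile_y) to a list of triangle indices.
    """
    cols = (width + tile - 1) // tile
    rows = (height + tile - 1) // tile
    bins = {(tx, ty): [] for ty in range(rows) for tx in range(cols)}
    for idx, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen, then convert to tile units.
        x0 = max(int(min(xs)), 0) // tile
        x1 = min(int(max(xs)), width - 1) // tile
        y0 = max(int(min(ys)), 0) // tile
        y1 = min(int(max(ys)), height - 1) // tile
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(idx)
    return bins


triangles = [
    [(10, 10), (20, 10), (10, 20)],     # fits inside the top-left tile
    [(50, 50), (100, 60), (60, 100)],   # straddles all four tiles
]
bins = bin_triangles(triangles, width=128, height=128)
for tile_xy in sorted(bins):
    print(tile_xy, bins[tile_xy])
```

Once the bins are filled, each tile can be rendered by a different core without touching any other tile's pixels, which is where the reduced synchronization comes from.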

Back-end Rendering a Tile

• Orange boxes represent work on separate threads
• Three work threads do Z testing, pixel shading and blending
• A setup thread reads from the bins and does pre-processing
• Combines task-parallel, data-parallel and sequential work
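The Z and blending stages of the back end can be sketched for a single tile as below. This is a sequential, single-threaded illustration under stated assumptions: fragments arrive already shaded, smaller z means nearer, and only fully opaque fragments update the depth buffer. The function name `render_tile` and the fragment layout are hypothetical.

```python
def render_tile(fragments, tile_w, tile_h, clear_color=(0.0, 0.0, 0.0)):
    """Depth-test and alpha-blend fragments into tile-local buffers.

    Each fragment is (x, y, z, (r, g, b), alpha); smaller z is nearer.
    Returns the tile's colour buffer and depth buffer.
    """
    depth = [[float("inf")] * tile_w for _ in range(tile_h)]
    color = [[clear_color] * tile_w for _ in range(tile_h)]
    for x, y, z, rgb, alpha in fragments:
        if z >= depth[y][x]:
            continue  # Z stage: the fragment is hidden, discard it
        if alpha == 1.0:
            depth[y][x] = z  # assumption: only opaque fragments write depth
        dst = color[y][x]
        # Blend stage: standard "source over destination" alpha blend.
        color[y][x] = tuple(alpha * s + (1.0 - alpha) * d
                            for s, d in zip(rgb, dst))
    return color, depth


fragments = [
    (0, 0, 0.5, (1.0, 0.0, 0.0), 1.0),  # opaque red
    (0, 0, 0.8, (0.0, 0.0, 1.0), 1.0),  # blue behind red: rejected by Z
    (0, 0, 0.2, (1.0, 1.0, 1.0), 0.5),  # half-transparent white in front
]
color, depth = render_tile(fragments, tile_w=4, tile_h=4)
print(color[0][0], depth[0][0])
```

Because the buffers cover only one tile, they are small enough to stay in a core's local cache while the tile is being rendered.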

Pipeline can be changed

• Parts can move between the front end and the back end
 – Vertex shading, tessellation, rasterization, etc.
 – Allows balancing computation against bandwidth
• New features
 – Transparency, shadowing, ray tracing, etc.
 – Each of these needs irregular data structures
 – It also helps to be able to “repack” the data

Transparency

Transparency with & without pre-resolve effects

Examples of using Tasks

• Applications
 – Scene traversal and culling
 – Procedural geometry synthesis
 – Physics contact group solve
 – Data-parallel strand groups
 – Distribute work across threads and cores using the task system
 – Exploit core resources with SIMD
• Larrabee can submit work to itself!
 – Tasks can spawn other tasks
 – Exposed in the Larrabee Native programming interface (C/C++ compiler)
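The idea of tasks spawning other tasks can be sketched with a simple work queue. This is not the Larrabee Native interface: `run_tasks` and `traverse` are hypothetical names, the queue runs on one thread, and the scene graph is made-up data. It only shows the pattern used for something like scene traversal.

```python
from collections import deque


def run_tasks(initial):
    """Run tasks from a FIFO queue; a task may return new tasks to enqueue."""
    queue = deque(initial)
    results = []
    while queue:
        task = queue.popleft()
        spawned = task(results)
        if spawned:
            queue.extend(spawned)
    return results


def traverse(node):
    """Build a task that visits `node` and spawns one task per child."""
    name, children = node

    def task(results):
        results.append(name)
        return [traverse(child) for child in children]

    return task


# Made-up scene graph: (name, list of child nodes).
scene = ("root", [("car", [("wheel", [])]), ("tree", [])])
visited = run_tasks([traverse(scene)])
print(visited)
```

In a real task system the queue would be shared by many hardware threads, so spawned tasks spread across cores instead of running one after another.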

Application scaling studies

Scalability Studies

• Based on memory bandwidth and texture filtering speed

Performance Breakdowns

Binning & Bandwidth Studies

Bandwidth

• Immediate-mode rendering uses more memory bandwidth than binned rendering
 – 2.4 to 7 times more for F.E.A.R.
 – 1.5 to 2.6 times more for Gears of War
 – 1.6 to 1.8 times more for Half-Life 2: Episode 2

Overall performance

Conclusion

The Larrabee architecture opens a rich set of opportunities for both graphics rendering and throughput computing, and it is an appropriate platform for the convergence of the GPU and the CPU.

Reference

• Larry Seiler, Doug Carmean, Toni Juan (Intel Corporation), Jeremy Sugerman and Pat Hanrahan (Stanford University), “Larrabee: A Many-Core x86 Architecture for Visual Computing”, IEEE Digital Library
• IEEE Spectrum, January 2008
• ACM Transactions on Graphics, Article 18
• www.intel.com
• www.wikipedia.com

Questions

Thank You
