Intel’s Larrabee
DESCRIPTION
Larrabee is a new processor from Intel. It combines the features of both CPU & GPU.
TRANSCRIPT
Intel’s Larrabee
Vipin.p.nair
S7-EC
Roll no: 24
CEK
Introduction
• It is a multicore general-purpose graphics processing unit (GPGPU) that combines the functions of a multicore CPU and a GPU.
• Larrabee is based on Intel’s x86 architecture.
Architectural convergence
Features
• Rasterization, depth testing and alpha blending run entirely in software; texture filtering is kept in fixed function logic
• Implements a binned renderer to increase parallelism
• Reduced memory bandwidth
• Parallel processing for image processing, physical simulation, and medical & financial analysis
• GDDR5 RAM support
• Each core can execute 32 GFLOPS at a 1 GHz clock, which gives several TFLOPS across all cores
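The 32 GFLOPS per-core figure can be checked with simple arithmetic. The sketch below assumes that every one of the 16 vector lanes issues one fused multiply-add per clock and that an FMA counts as two floating point operations. The core counts (32 and 64) are hypothetical and do not come from the slides.

```python
# Back-of-the-envelope peak throughput, assuming one FMA per lane per clock.
VECTOR_LANES = 16          # single-precision lanes per vector unit (from the slides)
FLOPS_PER_FMA = 2          # assumption: one multiply plus one add per FMA
CLOCK_HZ = 1_000_000_000   # 1 GHz clock (from the slides)


def peak_gflops_per_core(lanes=VECTOR_LANES, clock_hz=CLOCK_HZ):
    """Peak single-precision GFLOPS for one core."""
    return lanes * FLOPS_PER_FMA * clock_hz / 1e9


def peak_tflops(cores, lanes=VECTOR_LANES, clock_hz=CLOCK_HZ):
    """Peak chip throughput in TFLOPS for a given (hypothetical) core count."""
    return cores * peak_gflops_per_core(lanes, clock_hz) / 1000.0


print(peak_gflops_per_core())       # per-core peak in GFLOPS
for cores in (32, 64):              # hypothetical core counts
    print(cores, peak_tflops(cores))
```

Under these assumptions a chip needs a few dozen cores to pass 1 TFLOPS, which fits the "several TFLOPS" claim.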
Differences with CPU
• In-order execution (no out-of-order execution)
• Vector processing unit operates on 16 single-precision floating point numbers at a time
• Texture sampling units: trilinear/anisotropic filtering & texture decompression
• 1024-bit ring bus between cores
• Cache control instructions
• 4-way multithreading
Difference with GPU
• x86 instruction set with Larrabee-specific extensions
• Cache coherency across all its cores
• Z-buffering, clipping, and blending without using graphics hardware
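To illustrate what "z-buffering and blending without graphics hardware" means, here is a minimal per-pixel sketch in plain software. The function name, the "less than" depth comparison and the standard "over" blend are assumptions chosen for illustration, not Larrabee's actual renderer code.

```python
def depth_test_and_blend(src_rgba, src_z, dst_rgb, dst_z):
    """Return (rgb, z) for one pixel after a depth test and alpha blend.

    Assumptions: smaller z is closer to the viewer, and a fragment that
    passes the test also overwrites the stored depth.
    """
    if src_z >= dst_z:
        # Fragment is behind the stored pixel: keep the old colour and depth.
        return dst_rgb, dst_z
    r, g, b, a = src_rgba
    # "Over" blend: result = alpha * source + (1 - alpha) * destination.
    blended = tuple(a * s + (1.0 - a) * d for s, d in zip((r, g, b), dst_rgb))
    return blended, src_z


# A half-transparent red fragment in front of a blue pixel.
print(depth_test_and_blend((1.0, 0.0, 0.0, 0.5), 0.3, (0.0, 0.0, 1.0), 0.8))
```

Because this is ordinary code, the comparison or the blend equation can be changed without any hardware change.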
Larrabee – Block Diagram
Architecture
• Cores communicate on a 1024-bit wide ring bus
  – Fast access to memory, I/O interfaces and fixed function blocks
  – Fast access for cache coherency
• L2 cache is partitioned among the cores
  – Provides high aggregate bandwidth
  – Allows data replication & sharing
• Optimized for highly parallel workloads using a vector processor
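A rough way to reason about the ring and the partitioned L2 is sketched below. It assumes a bidirectional ring where a message takes the shorter direction, and 256 KB of L2 per core. The routing rule, the function names and the core count are illustrative assumptions only.

```python
def ring_hops(src, dst, nodes):
    """Fewest hops between two stops on a bidirectional ring (assumed routing)."""
    clockwise = (dst - src) % nodes
    return min(clockwise, nodes - clockwise)


def aggregate_l2_kib(cores, l2_per_core_kib=256):
    """Total L2 capacity when each core owns its own slice."""
    return cores * l2_per_core_kib


NODES = 16                           # hypothetical number of ring stops
print(ring_hops(0, 5, NODES))        # goes clockwise
print(ring_hops(0, 12, NODES))       # shorter the other way round
print(aggregate_l2_kib(32))          # hypothetical 32-core chip, in KiB
```

The worst case distance on such a ring is half the number of stops, and total L2 capacity grows linearly with the core count.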
In-order CPU Core
• Separate scalar & vector units with separate registers
• Vector unit: 16 32-bit ops/clock
• In-order instruction execution
• Fast access from 64 KB L1 cache
• Direct connection to each core’s subset of the 256 KB L2 cache
• Prefetch instructions load L1 and L2 caches
Vector Unit
• Vector-complete instruction set
  – Scatter/gather for vector load/store
  – Mask registers select lanes to write, which allows data-parallel flow control
  – Masks also support data compaction
• Vector instructions support
  – Full speed when data is in L1 cache
  – Fused multiply-add (three arguments)
  – Int32, Float32 and Float64 data
  – Can read 8-bit unorm, 8-bit uint, 16-bit sint and 16-bit float data and convert it into 32-bit floats/integers
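The vector unit features above can be emulated on ordinary lists to show how masks give data-parallel flow control. This is a sketch in scalar Python, not Larrabee code. All function names are hypothetical, and the unorm conversion assumes the usual byte / 255 mapping.

```python
LANES = 16  # vector width in 32-bit elements (from the slides)


def gather(memory, indices):
    """Vector load from non-contiguous addresses: one element per lane."""
    return [memory[i] for i in indices]


def masked_fma(a, b, c, mask, dest):
    """Per lane: a*b + c where the mask bit is set, else keep dest."""
    return [x * y + z if m else d
            for x, y, z, m, d in zip(a, b, c, mask, dest)]


def compact(values, mask):
    """Pack the lanes selected by the mask together (data compaction)."""
    return [v for v, m in zip(values, mask) if m]


def unorm8_to_float(byte):
    """Convert an 8-bit unorm value to a float in [0, 1] (assumed mapping)."""
    return byte / 255.0


# Data-parallel version of: if x > 0: y = 2*x + 1 (other lanes unchanged).
memory = [float(i - 8) for i in range(64)]
x = gather(memory, list(range(0, 32, 2)))   # 16 strided loads
mask = [v > 0 for v in x]                   # one predicate bit per lane
y = masked_fma(x, [2.0] * LANES, [1.0] * LANES, mask, [0.0] * LANES)
active = compact(y, mask)                   # only lanes that took the branch
print(y)
print(active)
```

All 16 lanes execute the same instruction. The mask decides which results are written, so a branch becomes a predicate instead of a jump.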
Fixed Function Logic
• Software replaces fixed function logic for post-shader alpha blending, rasterization and interpolation.
• Includes fixed function texture filter logic
• Virtual memory for textures
Larrabee’s Binning Renderer
• Binning pipeline
  – Reduces synchronization
  – Front end processes vertex & geometry shading
  – Back end processes pixel shading, stencil testing, blending
  – Bin FIFO between them
• Multi-tasking by cores
  – Each orange box is a core
  – Cores run independently
  – Other cores can run other tasks, e.g. physics
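The front-end binning step can be sketched as follows. It assumes square screen tiles of a hypothetical 64 × 64 pixels and a conservative bounding-box test. The function name and data layout are illustrative and are not taken from the Larrabee renderer.

```python
import math

TILE = 64  # hypothetical tile size in pixels


def bin_triangles(triangles, width, height, tile=TILE):
    """Front end: map each tile (tx, ty) to the triangles that may touch it.

    Uses the triangle's bounding box, so a bin can hold a triangle that
    does not actually cover any pixel of that tile (conservative test).
    """
    bins = {}
    for idx, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen.
        x0 = max(math.floor(min(xs)), 0)
        x1 = min(math.floor(max(xs)), width - 1)
        y0 = max(math.floor(min(ys)), 0)
        y1 = min(math.floor(max(ys)), height - 1)
        if x0 > x1 or y0 > y1:
            continue  # entirely off-screen
        for ty in range(y0 // tile, y1 // tile + 1):
            for tx in range(x0 // tile, x1 // tile + 1):
                bins.setdefault((tx, ty), []).append(idx)
    return bins


triangles = [
    ((10, 10), (50, 10), (10, 50)),          # fits in one tile
    ((60, 60), (130, 60), (60, 130)),        # spans several tiles
    ((200, 200), (250, 200), (200, 250)),    # far corner
]
bins = bin_triangles(triangles, 256, 256)
for tile_xy in sorted(bins):                 # back end: one independent job per tile
    print(tile_xy, bins[tile_xy])
```

Each bin is self-contained, so the back end can hand different tiles to different cores with little synchronization.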
Back-end Rendering a Tile
• Orange boxes represent work on separate threads
• Three work threads do Z, pixel shader, and blending
• Setup thread reads from bins and does pre-processing
• Combines task parallel, data parallel, and sequential
Pipeline can be changed
• Parts can move between front end & back end
  – Vertex shading, tessellation, rasterization, etc.
  – Allows balancing computation vs. bandwidth
• New features
  – Transparency, shadowing, ray tracing etc.
  – Each of these needs irregular data structures
  – Also helps to be able to “repack” the data
Transparency
Transparency with & without pre-resolve effects
Examples of using Tasks
• Applications
  – Scene traversal and culling
  – Procedural geometry synthesis
  – Physics contact group solve
  – Data parallel strand groups
  – Distribute across threads/cores using task system
  – Exploit core resources with SIMD
• Larrabee can submit work to itself!
  – Tasks can spawn other tasks
  – Exposed in Larrabee Native programming interface (C/C++ compiler)
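The idea that tasks can spawn other tasks can be sketched with a simple work queue. This is a single-threaded illustration with hypothetical names (`run_tasks`, `traverse`), not the Larrabee Native API. A real task system would distribute the queue across threads and cores.

```python
from collections import deque


def run_tasks(initial):
    """Run tasks from a FIFO queue; a task may return new tasks to enqueue."""
    queue = deque(initial)
    results = []
    while queue:
        task = queue.popleft()
        spawned = task(results)
        queue.extend(spawned or [])
    return results


def traverse(node):
    """Build a task that visits one scene node and spawns a task per child."""
    def task(results):
        results.append(node["name"])
        return [traverse(child) for child in node.get("children", [])]
    return task


# Hypothetical scene graph used for scene traversal.
scene = {"name": "root", "children": [
    {"name": "car", "children": [{"name": "wheel"}]},
    {"name": "tree"},
]}
order = run_tasks([traverse(scene)])
print(order)
```

The amount of work is not known up front: it is discovered as tasks run. That is why irregular workloads such as scene traversal suit a task system.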
Application scaling studies
Scalability Studies
• Based on memory Bandwidth & texture filtering speed
Performance Breakdowns
Binning & Bandwidth Studies
Bandwidth
• Immediate mode uses more bandwidth than binned mode
  – 2.4 to 7 times more for F.E.A.R.
  – 1.5 to 2.6 times more for Gears of War
  – 1.6 to 1.8 times more for Half Life 2 Episode 2
Overall performance
Conclusion
The Larrabee architecture opens a rich set of opportunities for both graphics rendering and throughput computing, and is an appropriate platform for the convergence of GPU & CPU.
Reference
• IEEE Digital Library: “Larrabee: a many-core x86 architecture for visual computing”, Larry Seiler, Doug Carmean, Toni Juan (Intel Corporation), Jeremy Sugerman & Pat Hanrahan (Stanford University)
• IEEE Spectrum, January 2008
• ACM Transactions on Graphics, Article 18
• www.intel.com
• www.wikipedia.com
Questions
Thank You