Nat Duca, Jonathan Cohen (Johns Hopkins University)
Peter Kirchner (IBM Research)
Stream Caching: Mechanisms for General Purpose Stream Processing
Talk Outline
● Objective: reconcile current practices of CPU design with stream processing theory
● Part 1: Streaming ideas in current architectures
– Latency and die-space
– Processor types and tricks
● Part 2: Insights about stream caches
– Could window-based streaming be the next step in computer architecture?
Streaming Architectures
● Graphics processors
● Signal processors
● Network processors
● Scalar/Superscalar processors
● Data stream processors?
● Software architectures?
What is a Streaming Computer?
● Two [overlapping] ideas
– A system that executes strict-streaming algorithms [unbounded N, small M]
– A general purpose system that is geared toward general computation, but is best for the streaming case
● Big motivator: ALU-bound computation!
● To what extent do present computer architectures serve these two views of a streaming computer?
[Super]scalar Architectures
● Keep memory latency from limiting computation speed
● Solutions:
– Caches
– Pipelining
– Prefetching
– Eager execution / branch prediction [the "super" in superscalar]
● These are heuristics to locate streaming patterns in unstructured program behavior
By the Numbers, Data
● Optimized using caches, pipelines, and eager execution
– Random: 182 MB/s
– Sequential: 315 MB/s
● Optimized using prefetching
– Random: 490 MB/s
– Sequential: 516 MB/s
● Theoretical maximum: 533 MB/s
By the Numbers, Observations
● Achieving full throughput on a scalar CPU requires either
– (a) prefetching [requires advance knowledge]
– (b) sequential access [no advance knowledge required]
● Vector architectures hide latency in their instruction set using implicit prefetching
● Dataflow machines solve latency using automatic prefetching
● Rule 1: Sequential I/O simplifies control logic and memory access
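Rule 1 can be sketched directly in software: a strict-streaming pass touches each record once, in order, with O(1) working state, so there is nothing for a cache or predictor to guess at (a minimal illustrative sketch; the function name is ours):

```python
def stream_stats(records):
    """One strictly sequential pass over a stream; constant working set.

    No random access into the input, so no cache, prefetch, or
    branch-prediction machinery is needed to sustain full throughput.
    """
    n = total = 0
    for r in records:   # sequential I/O only
        n += 1
        total += r
    return n, total

# Works on any iterable, including streams too large to hold in memory.
count, total = stream_stats(range(1_000_000))
```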
Superscalar (e.g. P4)
Local Memory Hierarchy
Cache
Prefetch
The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic.
The remaining area is primarily the floating-point ALU.
Can We Build This Machine?
Local Memory Hierarchy
In Streams Out Streams
● Rule 2: Small memory footprint allows more room for ALU --> more throughput
Part II: Chromium
● Pure stream processing model
● Deals with the OpenGL command stream
– Begin(Triangles); Vertex, Vertex, Vertex; End;
● Record splits are supported, joins are not
● You perform useful computation in Chromium by joining together stream processors into a DAG
– Note: the DAG is constructed across multiple processors (unlike dataflow)
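This model has a simple software analogue: stream processors as chained generators over command records, with a record split duplicating the stream to two consumers (the record format follows the OpenGL example above; the processor names are illustrative, not Chromium's API):

```python
import itertools

def gl_source():
    """An OpenGL-style command stream: Begin(Triangles); Vertex x3; End."""
    yield ("Begin", "Triangles")
    for v in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
        yield ("Vertex", v)
    yield ("End", None)

def translate(stream, dx, dy):
    """A stream processor: rewrites Vertex records, passes others through."""
    for op, arg in stream:
        if op == "Vertex":
            x, y = arg
            yield (op, (x + dx, y + dy))
        else:
            yield (op, arg)

# A record split: the same stream feeds two downstream processors.
left, right = itertools.tee(gl_source())
shifted = list(translate(left, 1.0, 0.0))
original = list(right)
```

Chaining more processors is just nesting more generators; as the slide notes, splits compose naturally this way while joins do not.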
Chromium w/ Stream Caches
● We added join capability to Chromium for the purpose of collapsing multiple records to one
– Incidentally: this allows windowed computations
● Thought: there seems to be a direct connection between streaming joins and sliding windows
● Because we're in software, the windows can become quite big without too much hassle
● What if we move to hardware?
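The join-window connection can be made concrete: collapsing M consecutive records to one is exactly a window of size M over the input (a minimal sketch; the primitive-assembly use case is ours, not the deck's):

```python
from collections import deque

def assemble(vertices, M=3):
    """A streaming join: collapse M consecutive Vertex records into one
    primitive record, i.e. a window of size M over the input stream."""
    win = deque(maxlen=M)
    for v in vertices:
        win.append(v)
        if len(win) == M:
            yield ("Triangle", tuple(win))
            win.clear()   # tumbling window; omit clear() for a sliding one

tris = list(assemble([(0, 0), (1, 0), (0, 1), (2, 0), (3, 0), (2, 1)]))
```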
Windowed Streaming
In Streams Out Streams
Window Buffer
Uses for a window buffer of size M:
● Store program structures of up to size M
● Cache M input records, where M << N
Windowed Streaming
In Streams Out Streams
Window Buffer
Realistic values of M if you stay exclusively on-chip: 128 KB ... 256 KB ... 2 MB [DRAM-on-chip tech is promising]
Impact on Window Size
In Streams Out Streams
Window Buffer
Insight: As M increases, this starts to resemble a superscalar computer
The Continuum Architecture
Memory Hierarchy
In Streams Out Streams
● For too large a value of M:
– Non-sequential I/O --> caches
– Caches --> less room for ALU (etc)
Windowed Streaming
In Streams Out Streams
Window Buffer
Thought: Can we augment window-buffer limit by a loopback feature?
Loopback streams
Windowed Streaming
In Streams Out Streams
Window Buffer
Thought: What do we gain by allowing a finite delay in the loopback stream?
Loopback streams
Memory
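What a finite delay on the loopback stream buys can be sketched as well: feeding each output record back as a second input, one step late, lets a stateless window processor compute running results over an unbounded stream — the loop itself acts as memory (an illustrative sketch, assuming a unit-delay loop):

```python
def loopback(stream, combine, init=0):
    """A stream processor with a loopback stream: each output record is
    fed back as a second input with a one-record delay."""
    fb = init                    # the record currently "in flight" on the loop
    for x in stream:
        out = combine(fb, x)     # join the loopback record with the fresh input
        yield out
        fb = out                 # travels around the loop; arrives next step

sums = list(loopback([1, 2, 3, 4], lambda a, b: a + b))
```

With `combine=sum` this yields prefix sums; swapping in `max` gives a running maximum, so one window-sized processor plus a delayed loop covers a family of unbounded-stream reductions.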
Versatility of Streaming Networks?
● Question: What algorithms can we support here? How?
– Both from a theoretical and a practical view
● We have experimented with graphics problems only:
– Stream compression, visibility & culling, level of detail
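Of those, stream compression reduces to the window buffer most directly: cache the last M records and replace any repeat by a short back-reference into the window (a hypothetical LZ-style sketch under our own assumptions, not the scheme the authors implemented):

```python
from collections import deque

def window_compress(records, M=8):
    """Replace any record seen within the last M records by a back-reference.

    The size-M window buffer is the only state, so this is a single
    streaming pass regardless of the stream length N.
    """
    window = deque(maxlen=M)
    for rec in records:
        recent = list(window)[::-1]           # newest record first
        if rec in recent:
            yield ("REF", recent.index(rec))  # distance back into the window
        else:
            yield ("LIT", rec)
        window.append(rec)

out = list(window_compress(["a", "a", "b", "a"], M=2))
```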
New Concepts with Streaming Networks
● An individual processor's cost is small
● Highly flexible: use high-level ideas from dataflow
– Multiple streams in and out
– Interleaved or non-interleaved
– Scalable window size
● Open to entirely new concepts
– E.g. how do you add more memory to this system?
Summary
● Systems are easily built on the basis of streaming I/O and memory models
● By design, such a system makes maximum use of its hardware: very, very efficient
● Continuum of architectures: pure streaming to superscalar
● Stream processors are trivially chained, even in cycles
● Such a chained architecture may be highly flexible:
– Experimental evidence & systems work
– Dataflow literature
– Streaming literature