Nat Duca, Jonathan Cohen (Johns Hopkins University)
Peter Kirchner (IBM Research)
Stream Caching: Mechanisms for General Purpose Stream Processing
Talk Outline
● Objective: reconcile current practices of CPU design with stream processing theory
● Part 1: Streaming ideas in current architectures
– Latency and die-space
– Processor types and tricks
● Part 2: Insights about stream caches
– Could window-based streaming be the next step in computer architecture?
Streaming Architectures
● Graphics processors
● Signal processors
● Network processors
● Scalar/Superscalar processors
● Data stream processors?
● Software architectures?
What is a Streaming Computer?
● Two [overlapping] ideas
– A system that executes strict-streaming algorithms [unbounded N, small M]
– A general purpose system that is geared toward general computation, but is best for the streaming case
● Big motivator: ALU-bound computation!
● To what extent do present computer architectures serve these two views of a streaming computer?
[Super]scalar Architectures
● Keep memory latency from limiting computation speed
● Solutions:
– Caches
– Pipelining
– Prefetching
– Eager execution / branch prediction [the "super" in superscalar]
● These are heuristics to locate streaming patterns in unstructured program behavior
By the Numbers, Data
● Optimized using caches, pipelines, and eager execution
– Random: 182 MB/s
– Sequential: 315 MB/s
● Optimized using prefetching
– Random: 490 MB/s
– Sequential: 516 MB/s
● Theoretical maximum: 533 MB/s
By the Numbers, Observations
● Achieving full throughput on a scalar CPU requires either
– (a) prefetching [requires advance knowledge]
– (b) sequential access [no advance knowledge required]
● Vector architectures hide latency in their instruction set using implicit prefetching
● Dataflow machines solve latency using automatic prefetching
● Rule 1: Sequential I/O simplifies control logic and memory access
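Rule 1 can be sketched directly in software: a strict-streaming pass touches each record once, in order, with O(1) working state, so there is nothing for a cache or predictor to guess at (a minimal illustrative sketch; the function name is ours):

```python
def stream_stats(records):
    """One strictly sequential pass over a stream; constant working set.

    No random access into the input, so no cache, prefetch, or
    branch-prediction machinery is needed to sustain full throughput.
    """
    n = total = 0
    for r in records:   # sequential I/O only
        n += 1
        total += r
    return n, total

# Works on any iterable, including streams too large to hold in memory.
count, total = stream_stats(range(1_000_000))
```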
Superscalar (e.g. P4)
Local Memory Hierarchy
Cache
Prefetch
The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic.
The remaining area is primarily the floating-point ALU.
Can We Build This Machine?
Local Memory Hierarchy
In Streams Out Streams
● Rule 2: Small memory footprint allows more room for ALU --> more throughput
Part II: Chromium
● Pure stream processing model
● Deals with the OpenGL command stream
– Begin(Triangles); Vertex, Vertex, Vertex; End;
● Record splits are supported, joins are not
● You perform useful computation in Chromium by joining together stream processors into a DAG
– Note: the DAG is constructed across multiple processors (unlike dataflow)
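This model has a simple software analogue: stream processors as chained generators over command records, with a record split duplicating the stream to two consumers (the record format follows the OpenGL example above; the processor names are illustrative, not Chromium's API):

```python
import itertools

def gl_source():
    """An OpenGL-style command stream: Begin(Triangles); Vertex x3; End."""
    yield ("Begin", "Triangles")
    for v in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]:
        yield ("Vertex", v)
    yield ("End", None)

def translate(stream, dx, dy):
    """A stream processor: rewrites Vertex records, passes others through."""
    for op, arg in stream:
        if op == "Vertex":
            x, y = arg
            yield (op, (x + dx, y + dy))
        else:
            yield (op, arg)

# A record split: the same stream feeds two downstream processors.
left, right = itertools.tee(gl_source())
shifted = list(translate(left, 1.0, 0.0))
original = list(right)
```

Chaining more processors is just nesting more generators; as the slide notes, splits compose naturally this way while joins do not.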
Chromium w/ Stream Caches
● We added join capability to Chromium for the purpose of collapsing multiple records to one
– Incidentally: this allows windowed computations
● Thought: there seems to be a direct connection between streaming joins and sliding windows
● Because we're in software, the windows can become quite big without too much hassle
● What if we move to hardware?
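The join-window connection can be made concrete: collapsing M consecutive records to one is exactly a window of size M over the input (a minimal sketch; the primitive-assembly use case is ours, not the deck's):

```python
from collections import deque

def assemble(vertices, M=3):
    """A streaming join: collapse M consecutive Vertex records into one
    primitive record, i.e. a window of size M over the input stream."""
    win = deque(maxlen=M)
    for v in vertices:
        win.append(v)
        if len(win) == M:
            yield ("Triangle", tuple(win))
            win.clear()   # tumbling window; omit clear() for a sliding one

tris = list(assemble([(0, 0), (1, 0), (0, 1), (2, 0), (3, 0), (2, 1)]))
```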
Windowed Streaming
In Streams Out Streams
Window Buffer
Uses for a window buffer of size M:
● Store program structures of up to size M
● Cache M input records, where M << N
Windowed Streaming
In Streams Out Streams
Window Buffer
Realistic values of M if you stay exclusively on-chip: 128 KB ... 256 KB ... 2 MB [DRAM-on-chip tech is promising]
Impact on Window Size
In Streams Out Streams
Window Buffer
Insight: As M increases, this starts to resemble a superscalar computer
The Continuum Architecture
Memory Hierarchy
In Streams Out Streams
● For too large a value of M:
– Non-sequential I/O --> caches
– Caches --> less room for ALU (etc)
Windowed Streaming
In Streams Out Streams
Window Buffer
Thought: Can we augment window-buffer limit by a loopback feature?
Loopback streams
Windowed Streaming
In Streams Out Streams
Window Buffer
Thought: What do we gain by allowing a finite delay in the loopback stream?
Loopback streams
Memory
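What a finite delay on the loopback stream buys can be sketched as well: feeding each output record back as a second input, one step late, lets a stateless window processor compute running results over an unbounded stream — the loop itself acts as memory (an illustrative sketch, assuming a unit-delay loop):

```python
def loopback(stream, combine, init=0):
    """A stream processor with a loopback stream: each output record is
    fed back as a second input with a one-record delay."""
    fb = init                    # the record currently "in flight" on the loop
    for x in stream:
        out = combine(fb, x)     # join the loopback record with the fresh input
        yield out
        fb = out                 # travels around the loop; arrives next step

sums = list(loopback([1, 2, 3, 4], lambda a, b: a + b))
```

With `combine=sum` this yields prefix sums; swapping in `max` gives a running maximum, so one window-sized processor plus a delayed loop covers a family of unbounded-stream reductions.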
Versatility of Streaming Networks?
● Question: What algorithms can we support here? How?
– Both from a theoretical and a practical view
● We have experimented with graphics problems only:
– Stream compression, visibility & culling, level of detail
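Of those, stream compression reduces to the window buffer most directly: cache the last M records and replace any repeat by a short back-reference into the window (a hypothetical LZ-style sketch under our own assumptions, not the scheme the authors implemented):

```python
from collections import deque

def window_compress(records, M=8):
    """Replace any record seen within the last M records by a back-reference.

    The size-M window buffer is the only state, so this is a single
    streaming pass regardless of the stream length N.
    """
    window = deque(maxlen=M)
    for rec in records:
        recent = list(window)[::-1]           # newest record first
        if rec in recent:
            yield ("REF", recent.index(rec))  # distance back into the window
        else:
            yield ("LIT", rec)
        window.append(rec)

out = list(window_compress(["a", "a", "b", "a"], M=2))
```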
New Concepts with Streaming Networks
● An individual processor's cost is small
● Highly flexible: use high-level ideas from dataflow
– Multiple streams in and out
– Interleaved or non-interleaved
– Scalable window size
● Open to entirely new concepts
– E.g. how do you add more memory to this system?
Summary
● Systems are easily built on the basis of streaming I/O and memory models
● By design, such a system makes maximum use of its hardware: very, very efficient
● Continuum of architectures: pure streaming to superscalar
● Stream processors are trivially chained, even in cycles
● Such a chained architecture may be highly flexible:
– Experimental evidence & systems work
– Dataflow literature
– Streaming literature