Tom Knight
M.I.T. Artificial Intelligence Laboratory
Aries: Integrated Symbolic and Numeric Database Handling with Parallel Distributed Processors
DARPA DIS Review 5/24/00
Tom Knight
Andrew Huang
Norm Margolus, Howie Shrobe, Peggy Chen, Greg Sullivan, Michael Phillips, Ben Vandiver
Kalman Reti, JP Grossman, Jeremy Brown, John Mallory, Tom Cleary
Aries Project Components
• Technology development
• Language development
• Core Data and Pointer Representations
• System Architecture
• Experimental vehicle
• Benchmark plan
Technology Substrate
• Fast SRAM tagged DRAM banks
– leverage DRAM integration by allowing fast access to SRAM tags in parallel with slow DRAM access
– Traps, GC forwarding pointers, reference counts, timestamps for parallel out-of-order execution
– Multiple reads/writes of SRAM data during fetch of the DRAM line
• Microchannel liquid cooling technology
– copper-laminate manifold heat sinks
– etched silicon microchannels
Language Development
• Simple modifications to Scheme
– parallel array operations from APL
– parallel pointer operations (e.g. join, mark)
– sophisticated GC techniques
– data structures built on new object pointers
– multi-threading synchronized with combiners and splitters
– Units
• Preparation for a much more extensive complete language rewrite
Pointer and Array Representations
• New pointer structure allowing
– internal object pointers
– immediate access to object header
– immediate access to object size
– little wasted memory (< 3%)
• Objects < 32 words have dense, efficient representation
• Add only pointers
[Figure: pointer layout showing fields B, L, E and the address, with labels "5 5 6" and "> 64"; objects are built from blocks of size 2^E]
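The slide names the pointer fields B, L and E but does not define them. The sketch below is one hedged reading, not the Aries encoding: it assumes E is the block-size exponent, L is the object length in blocks (stored minus one in 5 bits), and B is the index of the block the address falls in. Under those assumptions an interior pointer yields the object header address and a size bound with shifts and one subtraction, with no memory lookup. All names are hypothetical.

```python
from dataclasses import dataclass

MAX_BLOCKS = 32  # assumption: a 5-bit length field counts 1..32 blocks


@dataclass(frozen=True)
class Pointer:
    address: int   # word address, possibly inside the object
    b_index: int   # assumed meaning of B: which block of the object is addressed
    l_blocks: int  # assumed meaning of L: object length in blocks, minus one
    e_exp: int     # E: blocks are 2**e_exp words long


def make_pointer(base, size, offset):
    """Build a pointer `offset` words into an object of `size` words at `base`."""
    if not 0 <= offset < size:
        raise ValueError("offset outside object")
    e_exp = 0
    while -(-size // (1 << e_exp)) > MAX_BLOCKS:  # ceiling division
        e_exp += 1
    if base % (1 << e_exp):
        raise ValueError("base must be aligned to the block size")
    blocks = -(-size // (1 << e_exp))
    return Pointer(base + offset, offset >> e_exp, blocks - 1, e_exp)


def object_base(p):
    """Header address: start of the addressed block, minus the blocks before it."""
    block_start = (p.address >> p.e_exp) << p.e_exp
    return block_start - (p.b_index << p.e_exp)


def size_bound(p):
    """Object size in words, rounded up to a whole number of blocks."""
    return (p.l_blocks + 1) << p.e_exp


ptr = make_pointer(base=256, size=100, offset=37)
print(object_base(ptr), size_bound(ptr))
```

In this reading, objects of up to 32 words use one-word blocks and carry no rounding overhead, which matches the slide's remark about small objects; the slide's memory-waste figure is not something this sketch attempts to reproduce.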
Address Format
[Figure: address format, with fields labelled Address, Node index, and Node #]
• MSB of address determines data stripe layout
• index = 0 gives single word per processor
• index = max gives whole objects in processor
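One way to picture the two extremes on this slide is to let the index select which address bits name the node. This is only an illustrative sketch under that assumption; the function name, the modulo mapping and the node count are hypothetical and not taken from the slide.

```python
def node_for(word_offset, stripe_index, num_nodes):
    """Node holding a word, assuming stripes of 2**stripe_index consecutive words."""
    return (word_offset >> stripe_index) % num_nodes


# index = 0: consecutive words land on consecutive processors
fine = [node_for(w, stripe_index=0, num_nodes=4) for w in range(8)]

# a large index: the whole 8-word object stays on one processor
coarse = [node_for(w, stripe_index=3, num_nodes=4) for w in range(8)]

print(fine, coarse)
```

Fine striping suits parallel array operations, while coarse striping keeps a small object local to one processor.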
Squids for Pointer Aliasing Equality
• Create a randomized SQUID (Short Quasi Unique IDentifier) tag for newly created objects (hash address, etc.)
• Store the SQUID with the pointer to the object
• Copy the SQUID with forwarding or sub-component pointer creation
• Memory aliasing check has three outcomes
– pointers identical => insert barrier
– pointers different, squid identical => insert barrier (rare)
– pointers different, squid different => no barrier needed
• GC forwards; EQ testing; sub-object equality
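The three-outcome aliasing check can be sketched in a few lines. This is a minimal illustration, assuming an 8-bit tag drawn from a random number generator; the tag width, the `Ref` type and the helper names are all hypothetical.

```python
import random
from typing import NamedTuple

SQUID_BITS = 8  # assumed tag width; the slide does not give one


class Ref(NamedTuple):
    address: int  # where the pointer currently points
    squid: int    # short quasi-unique ID of the object it belongs to


def new_ref(address, rng):
    """Pointer to a newly created object, tagged with a random SQUID."""
    return Ref(address, rng.getrandbits(SQUID_BITS))


def derive(ref, new_address):
    """Forwarding or sub-component pointer: new address, same SQUID."""
    return Ref(new_address, ref.squid)


def needs_barrier(a, b):
    """Conservative aliasing check with the three outcomes listed above."""
    if a.address == b.address:
        return True   # pointers identical
    if a.squid == b.squid:
        return True   # same object reached another way, or a rare tag collision
    return False      # different SQUIDs: the pointers cannot name the same object


rng = random.Random(0)
obj = new_ref(0x1000, rng)
field = derive(obj, 0x1008)          # sub-component pointer into the same object
other = Ref(0x2000, obj.squid ^ 1)   # tag forced to differ, for the demo
print(needs_barrier(obj, field), needs_barrier(obj, other))
```

The check errs only on the safe side: a tag collision costs an unnecessary barrier, never a missed one.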
Fast inter-node memory references
• Build on the low latency Metro architecture
– DeHon et al., ISCA 1994
• Bring the network onto the die
– H tree network connecting on chip multiprocessor array
– uniform addressing and access across the on/off chip pins
• Processor to processor
– both remote memory reference and message passing
• Connection rather than packet based
– acknowledgement inherent
– return path pre-allocated
– very simple retry-based routing and error recovery
– provably good message handling
Key Metro Characteristics
• Connection rather than packet based
• No buffering, flow control, emergency handling
• Provably good message routing
– insensitive to network permutation patterns
• One clock pin-pin routing within the component
• Scalable bisection bandwidth at configuration time
• Fault tolerant in both wiring and routers
• Inherent acknowledgement and reply path
Proposed Aries Packaging
• 32-64 slot active routing backplane (“Rack”)
– Integral/redundant cooling, power
• Two interchangeable slot components:
• Computation Cluster (“Processor Box”)
– Disk
– Memory
– Processor clusters & active RAM
• Communication channel (“Network Box”)
– High bandwidth channel to other active backplanes
Packaging Overview
Cluster Configurations
Cluster Configurations
Multiple Domains of Execution
[Figure: registers shown with parallel domains labelled Value, Type, Units, Owner, and Integrity]
Execution Unit Hardware
• Multiple execution domains
– value, type, units, ownership
• Value domain requires special hardware
• Strategy: handle others with uniform hash array associative match techniques and software traps
• Same hardware useful for
– data cache (default -- N way set associative)
– value cache -- long ops and functional subroutines
– tags
– ownership
– units
Uniform Reconfigurable Hardware
[Figure: block diagram with stages labelled Operands, Bit Mask, Hash Table Lookup, Compare, Trap, Bit Mask, and Result insert]
Contexts
• Ownership / Pedigree carried with all data
• Detect / enforce ownership rights to computed data
• Control access to critical owned data
– Only data you have a right to see is visible
• Control actuation / authorization of sensitive tasks
– Only data not touched by unauthorized users can actuate
• Automatic propagation of the set of assumptions
– We know on what basis a decision has been made
– We know how to revoke it
– System has access to this data itself (Introspection)
Prototype and Verification Vehicle
• Construct a 1-10% uniform speed scaled prototype
• Off the shelf components
• FPGA based design
• Compromise on size, cost, performance, power
– Do not compromise on functionality or debug access
• Becomes an experimental architecture vehicle
• Allows language & software development
• Allows inexpensive feature evaluation
• Debugging hooks everywhere
• Ready to transition to real hardware
– real implementation will still retain many of these ideas
Experimental Vehicle Details
• High performance disk drives
• Xilinx Virtex-E FPGA arrays
– fast serial links (LVDS), embedded memory
– debug/access paths for all signals
• Fast host interface by masquerading as SDRAM
• Memory augmented with fast SRAM tags
• FPGA implementation of Metro network
– debug in a PC cluster environment
– off-the-shelf LVDS serializers @ 5 Gbps/chip
• Cycle by cycle power profiling
FPGA & SRAM (“Moore”) Board
Open PC System Context
Moore Board Details
• User can implement an early Pentium-class processor
• Performance scales with Moore’s law over time
– leverage latest process technologies in the form of FPGAs, SRAMs, new bus technologies
– minimal cost, risk
• High-performance host interface (direct memory-map via PC-100 DIMM interface)
Benchmark Plans
• Array primitive performance
• Network intensive performance
– Sparse matrix operations
– Parallel Database Join
• DIS Data Management benchmark (rewritten)
• Persistent data storage
– Database operations on out of core data
• Database commit performance
Summary
• Coherent plan for next generation HW & SW
• Details will be verified with experimental vehicle
• Manpower and resource limited
– the ideas are largely here
• We will know what to build when we have the resources to do so
– early prototypes under way
“Box” Architecture
[Figure: boxes labelled Processor and Network]
What’s Wrong with this Picture?
• ~80% of die area is cache, LSU and schedulers to help hide memory latency
– PPC 750 die shot below (obtained from IBM website)
Our Way
• Rendition of what Aries proc. + mem die might look like
[Figure: die rendition with regions labelled Multi-bank DRAM and Execution Unit; die shots of IBM SA-27E DRAM-ASIC process and IBM PPC750, obtained from IBM website]
16 GB/s, 4-8 cycles latency per DRAM bank, multiple banks per execution unit
Total bandwidth between execution units and DRAM for this chip: 128 GB/s at 4-8 cycle latency (supposing 1 GHz processors)
Possible Die Layout of processor + DRAM
8 processors/chip + 128 MBits DRAM
System Architecture
• Aggregate bandwidth to embedded DRAM processing elements is in excess of 16 Terabytes/s
• example card has 2 GBytes of DRAM-PEs = 1024 processing elements
• This architecture can comprehensively search for a keyword in 2 GB of pre-loaded memory in about 500 µs
one node, consisting of 256 MB DRAM, 128 processors, and a network processor
DIMM-style cards with 4 DRAM + processor chips (M), 1 network interface chip (N) (64 MBytes of memory/card)
VME 9U card
lots of wires
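The bandwidth figures on this slide can be cross-checked with a short calculation. It is a back-of-the-envelope sketch that assumes each of the 1024 processing elements streams from one DRAM bank at the 16 GB/s per-bank rate quoted on the "Our Way" slide; the constant names are illustrative.

```python
PES_PER_CARD = 1024          # processing elements on the example card
BANK_BANDWIDTH = 16e9        # bytes/s per DRAM bank (per-bank figure quoted earlier)
CARD_MEMORY = 2 * 2**30      # 2 GBytes of DRAM on the card

# Assumption: one bank streaming per processing element, all in parallel.
aggregate = PES_PER_CARD * BANK_BANDWIDTH
scan_time = CARD_MEMORY / aggregate

print(f"aggregate bandwidth: {aggregate / 1e12:.1f} TB/s")
print(f"bandwidth-limited scan of 2 GB: {scan_time * 1e6:.0f} microseconds")
```

This gives roughly 16 TB/s, consistent with the aggregate figure above, and a scan time on the order of a hundred microseconds. Bandwidth alone therefore bounds a full-memory pass from below; an actual keyword search also has to cover the per-word comparison work.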
Dynamically Reconfigurable Pipeline
• MATRIX style high level functional blocks
• Dynamically reconfigurable interconnect
[Figure: functional blocks labelled x, S, +, / and conditional unit; similar to Rixner, Dally 1998]
System Integrity
• Per object capabilities
• Triad Pointer Structures
• Garbage Collection
• Transactional Semantics
• Reliable message transport
• Data integrity labelling
• Data ownership
Ownership / Accessor Labels
• Every word of data d is labelled with owner / accessor information L(d)
• Every channel c is labelled with potential readers L(c)
[Figure: data word d stored together with its label L(d)]
• To write d to c, L(c) must be contained in L(d)
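The write rule is a subset test. A minimal sketch, assuming labels are modelled as plain sets of principal names (the function and the names are hypothetical):

```python
def may_write(data_label, channel_label):
    """d may be written to c only if every potential reader of c may access d."""
    return set(channel_label) <= set(data_label)


salary_label = {"alice", "payroll"}          # L(d): who may access the data
private_channel = {"alice"}                  # L(c): who may read the channel
shared_channel = {"alice", "payroll", "eve"}

print(may_write(salary_label, private_channel))  # allowed
print(may_write(salary_label, shared_channel))   # refused: eve could read it
```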
Synthesizing Labels
PC: add r1, r2, r3
[Figure: L(pc), L(add), L(r1) and L(r2) feed the Label Intersector, which produces L(r3)]
Label Intersector Efficiency
• Compute Joins Once
• Store results in hash table
• Hardware label cache
– similar to value or TLB cache
• No slowdown in the usual case
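A software analogue of "compute joins once" is a memoized pairwise join. The sketch below models a label as a frozenset of permitted accessors and uses `functools.lru_cache` as a stand-in for the hash table and hardware label cache; the label model and the function names are assumptions for illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def join_pair(a, b):
    """Join of two labels: only accessors permitted by both remain."""
    return a & b


def result_label(*labels):
    """Label of a result computed from inputs carrying the given labels."""
    out = labels[0]
    for label in labels[1:]:
        out = join_pair(out, label)
    return out


everyone = frozenset({"alice", "bob", "carol"})
l_pc, l_add = everyone, everyone
l_r1 = frozenset({"alice", "bob"})
l_r2 = frozenset({"alice", "carol"})

l_r3 = result_label(l_pc, l_add, l_r1, l_r2)   # label for: add r1, r2, r3
again = result_label(l_pc, l_add, l_r1, l_r2)  # same label combination: served from cache
print(sorted(l_r3), join_pair.cache_info().hits)
```

Programs tend to combine the same few labels repeatedly, so after warm-up nearly every join becomes a lookup.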
PC Restriction
PC: bne r1, foo
[Figure: L(pc), L(bne) and L(r1) feed the Label Intersector, which produces the new L(pc)]
Dynamic PC declassification
• How can you declassify the program counter?
• Choose a definitive return address before the branch
– Halfway to transactional semantics
pushpc endpoint     save a definitive return, raise pc security level
bne r1, foo
…
poppc               lower and return
…
foo:
…
poppc               lower and return
Efficiency Techniques
• Per word labels are costly and awkward
• Most components of a compound structure have identical labels
• Reclassification of entire data structures should be efficient
Put labels only on inter-structure links
Domain Representation
[Figure: Domain Representation, Semantic View: structure elements individually labelled L1 or L2]
Domain Representation
[Figure: Domain Representation, Implementation View: labels L1 and L2 shown at only a few points]
Label Format Details
• Ownership and Security model derived from the work of Myers and Liskov
• A label is a set of pairs
– each pair consists of an owner
– and a set of permitted accessors
• The effective set of accessors is the intersection of the permitted accessor sets
• Transitive permission delegation
– revocable
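The label structure described above can be written down directly. This sketch represents a label as a mapping from each owner to the set of accessors that owner permits; delegation and revocation are left out, and the principal names are made up for the example.

```python
def effective_accessors(label, principals):
    """Accessors permitted by every owner named in the label.

    `label` maps owner -> set of permitted accessors; `principals` is the
    universe of known principals (hypothetical, used for the empty label).
    """
    allowed = set(principals)
    for permitted in label.values():
        allowed &= set(permitted)
    return frozenset(allowed)


principals = {"alice", "bob", "carol", "dave"}
label = {
    "alice": {"alice", "bob", "carol"},   # alice permits these accessors
    "bob": {"bob", "carol", "dave"},      # bob permits these accessors
}
print(sorted(effective_accessors(label, principals)))
```

Adding an owner to a label can only shrink the effective set, which is why data derived from several sources ends up at least as restricted as each of its inputs.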
Ownership Semantics
• Fine grained control of ownership of data and derived data
• Safe dynamic control flow declassification
• Fine grained control over information dissemination
• Efficient implementation as a parallel execution domain
• Leverages strong capability and transactional models of data representation and execution
Summary
• Domain level parallelism within processors
– robustness
– security
– higher level semantics
• Processor level parallelism within active RAM
– parallelism friendly data structures, operations
– network friendly communications
• Explicitly parallel transactional programming environment
• Emphasis on the conceptual simplicity of the programming models
Symbolic Computing
• Symbolic data has no inherent local structure
– Knowledge Databases
– Indices
– Higher level vision
    • target recognition
• Architectures must implement efficient non local communications
• Data representations are key
– Triad pointer structures
• Clean parallel semantics
– Transactions as hardware and software primitives
Triad Pointer Structures
• All pointers have back pointers associated
• Fan in trees for parallel data access
– Combining networks
– Limited fan in and contention -- memoizing
• Fan out trees for data & operator distribution
• Data movement is straightforward (paging)
• Garbage collection is no longer an issue
– Compactness comes from linearized freelists
– Utility in balancing fanin/out trees (local)
• Typed pointers and data allow distributed processors to manipulate data autonomously
Semantically Richer Objects and Operators
• Sets
• Ordered sets
• Vectors, Tensors
• APL operators
• Extension of the type system
– Units
– Persistent objects
• Unguarded objects & barriers
– Explicit communication between concurrent threads
– Combiners only allowed
– I/O as an unguarded operation
Transactional Processing
• We already use transactional semantics
– instruction execution on any modern processor
– system calls in some operating systems mimic instruction semantics (e.g. ITS)
• Early proposals for parallel execution of sequential code relied on transactional semantics
• Raise the level - Reed
– make transactions visible at the language level
– provide hardware support for efficient transactional processing
• Timestamp modifications for auditing, rollback and introspective analysis
Computing Models
• Sequential
• Semantically sequential
– deterministic results
– static
– dynamic
• Concurrent independent
• Concurrent atomic
– nondeterministic results
Timestamping Data
• Separate virtual and real time - Jefferson
• Assign virtual time ranges to data objects
• Subranging for nested transactions
– Reassignment of virtual time tokens on allocation failure
• Guarded references access the correct version of data
Liquid: Viewing Serial Execution as a Sequence of Transactions
• Instructions read and then modify processor and memory state -- a “transaction”
• Blocks of instructions can be viewed similarly
• Key idea:
– execute multiple, logically sequential blocks in parallel
    • independent threads
– use database commit techniques to handle otherwise intractable problems of aliasing
– use cache coherency mechanisms to automatically detect aliasing problems and back out
    • optimistic concurrency in executing parallel threads
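The key idea can be sketched in software as optimistic execution with in-order commit. This is an illustration only, not the Liquid mechanism: per-location version counters stand in for the cache coherency hardware, speculation is simulated sequentially, and every name here is hypothetical.

```python
class Memory:
    def __init__(self, initial):
        self.data = dict(initial)
        self.version = {key: 0 for key in initial}  # bumped on every commit


def run_block(block, mem):
    """Run one block speculatively: record versions read, buffer writes."""
    reads, writes = {}, {}

    def load(key):
        if key in writes:
            return writes[key]
        reads.setdefault(key, mem.version.get(key, 0))
        return mem.data.get(key, 0)

    def store(key, value):
        writes[key] = value

    block(load, store)
    return reads, writes


def execute(blocks, mem):
    """Speculate every block from the same state, then commit in program order."""
    speculative = [run_block(block, mem) for block in blocks]  # stands in for parallel threads
    retries = 0
    for block, (reads, writes) in zip(blocks, speculative):
        stale = any(mem.version.get(k, 0) != v for k, v in reads.items())
        if stale:  # an earlier block wrote something this block read: back out
            retries += 1
            reads, writes = run_block(block, mem)
        for key, value in writes.items():
            mem.data[key] = value
            mem.version[key] = mem.version.get(key, 0) + 1
    return retries


mem = Memory({"x": 1, "y": 0, "z": 0})
blocks = [
    lambda load, store: store("x", load("x") + 1),
    lambda load, store: store("y", load("x") * 10),  # depends on the first block
    lambda load, store: store("z", 7),               # independent
]
retries = execute(blocks, mem)
print(mem.data, retries)
```

Only the block that actually read stale data is re-executed; the independent block commits its speculative result unchanged, and the final state matches plain sequential execution.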
Research Plan
• Architect physical simulation model
• Locate partner to spin
• Language design for symbolic processing
• Feature list and architecture strawman for symbolic applications
• Implement symbolic processing in FPGA or Alpha emulation
• Logic level design of network and component
• Locate partner to spin design
Impact
• New generation of data oriented processing
– physical
– symbolic
• Parallel performance
• Almost serial programming model
• 100 - 1000x performance on a wide range of problems
– includes important symbolic processing and knowledge database retrieval problems as well as physical simulation techniques
Language Innovation
• Learn the lessons from decades of AI languages
– cleanliness in design
– manifest performance costs
– simple data structures
– trivial syntax
– performance counts
– pointer allocation is central
– type checking at both compile and run time is essential
– security can be enforced by pointer hygiene
– side effects are necessary as a programming tool
– manifest data types allow distributed data handling
Benchmark Problems
• Graphics
– polygon rendering
– point sample rendering
– ray tracing
– radiosity
• CAD
– verilog
– hspice
– place and route
• Simulation
– mechanics
– n body
– fluid flow
– static PDEs
– EM field solver
• AI
– neural networks
– genetic algorithms
– knowledge databases
– computer vision
– object recognition
– database matching
– biological sequences
• Numerical
– factoring
– primality testing
• Miscellaneous
– text searching
– chess
– Mandelbrot sets
– protein folding
Enabling Ideas
• Physical simulations
– SIMD processing
– Skip samples
– Lookup tables plus multiply array
• Symbolic processing
– on chip GC
– Metro routing
– Fat tree packaging
– Timestamped memory
– Commit operations fundamental
Architectural Synthesis
• Lisp Machine
– CADR, LM-2, 3600, Ivory, Open Genera
• Connection Machine
• Cross Omega Machine
• CAM-8
• Abacus
• Transit/Metro
• Matrix
• Terasys
• Liquid
Technology Opportunity
• DRAM + Logic
– Terabaud access to on chip memory
– with state of the art on chip logic and interconnect
• BGA Packaging
– 600-1000 pins/die
– GHz+ signalling
– Small die footprint
• Reconfigurable logic
– Commodity component opportunity
– Delayed binding of architectures
Substrate Technology Development
• Resonant Clocking Design Tool
Crafted resonant transmission line
Garbage Collection Technology
• Problem: Large, distributed, highly linked, persistent data structures, partially swapped out
• Slow access to large parts of the data
• Solution: incremental, distributed, local garbage collection techniques
– Maintenance of in and out vectors of external pointers
– In set used as a root for local garbage collection
– Out set maintained and opportunistically sent out
• Object reference safety essential -- sizes from ptr
• Techniques based on Area GC of Bishop (1968)
• Maheshwari and Liskov (1998)
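The in set / out set idea can be sketched as a local mark phase. This is a simplified illustration, not the collector described on the slide: the heap is a dictionary from object name to the names it references, any name absent from the local heap is treated as remote, and the function and object names are hypothetical.

```python
def local_collect(heap, local_roots, in_set):
    """Collect one node's heap without contacting any other node.

    heap: local object -> list of referenced objects (names not in `heap` are remote)
    in_set: local objects that some remote node may still reference
    Returns (out_set, garbage); garbage objects are removed from `heap`.
    """
    live, out_set = set(), set()
    stack = [obj for obj in set(local_roots) | set(in_set) if obj in heap]
    while stack:
        obj = stack.pop()
        if obj in live:
            continue
        live.add(obj)
        for ref in heap[obj]:
            if ref in heap:
                stack.append(ref)   # local reference: keep marking
            else:
                out_set.add(ref)    # external pointer held by a live object
    garbage = set(heap) - live
    for obj in garbage:
        del heap[obj]
    return out_set, garbage


heap = {
    "a": ["b", "remote1"],
    "b": [],
    "c": ["remote2"],   # kept alive only because a remote node points at it
    "d": ["c"],
    "e": ["remote3"],   # unreachable, so its external pointer is dropped too
}
out_set, garbage = local_collect(heap, local_roots={"a"}, in_set={"c"})
print(sorted(out_set), sorted(garbage))
```

The recomputed out set is what the node would later send to its peers, opportunistically, so that they can shrink their own in sets.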
Garbage Collection Phases
• Within processor, within core collection
• Among processors following coordinated distribution of out sets
• Distributed marking of out sets from in sets
– followed on null updates by global collection
• Compaction and maintenance of dense out sets for swapped out pages
– possible because we can afford to spend lots of time when swapping page data
• Layered on top of conventional ephemeral techniques (temporal reference counting)
Cluster Configurations