OceanStore: Status and Directions
ROC/OceanStore Retreat, 1/13/03
John Kubiatowicz, University of California at Berkeley
OceanStore:2 ROC/OceanStore Jan’02
Everyone’s Data, One Utility
• Millions of servers, billions of clients…
• 1000-YEAR durability (excepting fall of society)
• Maintains privacy, access control, authenticity
• Incrementally scalable (“evolvable”)
• Self-maintaining!
• Not quite peer-to-peer:
– Utilizing servers in infrastructure
– Some computational nodes more equal than others
OceanStore Data Model
• Versioned objects
– Every update generates a new version
– Can always go back in time (“time travel”)
• Each version is read-only
– Can have a permanent name
– Much easier to repair
• An object is a signed mapping between a permanent name and its latest version
– Write access control/integrity involves managing these mappings
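To make the versioning model concrete, here is a toy sketch (names invented; a plain attribute stands in for the signed mapping, and no real cryptographic signing is performed): every update creates a new read-only version named by its content hash, while only the AGUID-to-latest-version binding mutates.

```python
import hashlib

def h(data: bytes) -> str:
    """Content hash used as a permanent, self-verifying name."""
    return hashlib.sha256(data).hexdigest()

class VersionedObject:
    """Toy model of the OceanStore data model: every update creates a
    new read-only version; a signed mapping (here, just an attribute)
    binds the permanent AGUID to the latest version GUID."""

    def __init__(self, name: str, owner_key: str):
        # AGUID: permanent name derived from human name + owner keys
        self.aguid = h((name + owner_key).encode())
        self.versions = {}   # VGUID -> immutable payload
        self.history = []    # ordered VGUIDs (enables time travel)
        self.latest = None

    def update(self, payload: bytes) -> str:
        vguid = h(payload)   # each version named by its content hash
        self.versions[vguid] = payload
        self.history.append(vguid)
        self.latest = vguid  # the only mutable state
        return vguid

    def read(self, vguid=None) -> bytes:
        """Read the latest version, or 'time travel' to an older one."""
        return self.versions[vguid or self.latest]

obj = VersionedObject("home/report.txt", "alice-public-key")
v1 = obj.update(b"draft 1")
v2 = obj.update(b"draft 2")
assert obj.read() == b"draft 2"
assert obj.read(v1) == b"draft 1"   # old versions remain readable
```

Because versions are immutable and content-named, repair is simple: any copy of an old version can be checked against its name.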
[Figure: comet analogy — incoming updates form the head (latest version); older read-only versions trail behind.]
Self-Verifying Objects
[Figure: data blocks d1…d9 hang off a B-tree of indirect blocks whose root block M is named by the version GUID (VGUIDi). A copy-on-write update of d8 and d9 produces d'8 and d'9 and a new root (VGUIDi+1) carrying a backpointer to the previous version. The AGUID maps, via updates and heartbeats, to this read-only data.]
• AGUID = hash{name + keys}
• Heartbeat: {AGUID, VGUID, timestamp}signed
• Patrick Eaton: discussions of the future formats
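A minimal sketch of the self-verifying idea (helper names invented; the real design uses a full B-tree of indirect blocks, not the single level shown): a version GUID is the secure hash of its root block, so any fetched version can be checked without trusting the server that supplied it.

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def build(blocks):
    """One-level 'indirect block' of hash pointers over the data
    blocks; the VGUID is the hash of that indirect block, so the whole
    version is named by (and verifiable against) its content."""
    indirect = b"".join(h(b) for b in blocks)
    return h(indirect), indirect

def verify(vguid, indirect, blocks):
    """Re-hash everything a server handed back and compare pointers."""
    if h(indirect) != vguid:
        return False
    ptrs = [indirect[i:i + 32] for i in range(0, len(indirect), 32)]
    return len(ptrs) == len(blocks) and all(
        h(b) == p for b, p in zip(blocks, ptrs))

blocks = [b"d1", b"d2", b"d3"]
vguid, ind = build(blocks)
assert verify(vguid, ind, blocks)
assert not verify(vguid, ind, [b"d1", b"dX", b"d3"])  # tamper detected

# Copy-on-write: changing one block yields a fresh VGUID; the old
# version's blocks and name remain intact for time travel.
vguid2, _ = build([b"d1", b"d2", b"d3-new"])
assert vguid2 != vguid
```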
The Path of an OceanStore Update
[Figure: clients send updates to the inner-ring servers; committed results propagate through multicast trees to second-tier caches.]
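A toy walk-through of that path, under simplifying assumptions (all names invented; the ring commits with the usual Byzantine threshold of more-than-two-thirds agreement, n = 3f + 1):

```python
def inner_ring_commit(votes, ring_size):
    """Commit only with more than two-thirds agreement: n = 3f + 1
    servers tolerate f Byzantine faults."""
    return 3 * votes > 2 * ring_size

def disseminate(tree, node, payload, delivered):
    """Push the committed result down the multicast tree."""
    delivered[node] = payload
    for child in tree.get(node, []):
        disseminate(tree, child, payload, delivered)

ring_size = 4                            # f = 1, so n = 3f + 1 = 4
tree = {"ring": ["cacheA", "cacheB"],    # inner ring -> 2nd-tier caches
        "cacheA": ["client1", "client2"]}
delivered = {}
if inner_ring_commit(votes=3, ring_size=ring_size):
    disseminate(tree, "ring", b"version-2", delivered)

assert delivered["client2"] == b"version-2"
assert not inner_ring_commit(votes=2, ring_size=ring_size)  # too few
```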
OceanStore Goes Global!
• PlanetLab global network
– 98 machines at 42 institutions in North America, Europe, and Australia (~60 machines utilized)
– 1.26 GHz PIII (1 GB RAM), 1.8 GHz PIV (2 GB RAM)
– North American machines (2/3) on Internet2
• OceanStore components running “globally:”
– Word on the street: it was “straightforward.”
– Basic architecture scales
– Lots of communications issues (NATs, timeouts, etc.)
– Locality really important
• Challenge: stability and fault tolerance!
• Dennis Geels: analysis (FAST 2003 paper)
• Steve Czerwinski / B. Hoon Kang: tentative updates
Enabling Technology: DOLR (Decentralized Object Location and Routing), “Tapestry”
[Figure: the DOLR layer locates replicas of objects (GUID1, GUID2) anywhere in the network and routes messages to them.]
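The routing idea behind Tapestry’s DOLR can be sketched in a few lines: Plaxton-style routing resolves the destination GUID one digit at a time, so each hop lands on a node sharing a longer ID prefix and any target is reached in O(log N) hops. This sketch cheats by using a full membership list (a real Tapestry node keeps only a small per-digit routing table), and the node IDs are invented:

```python
def shared_prefix(a, b):
    """Number of leading digits two IDs have in common."""
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(nodes, start, guid):
    """Greedy digit-by-digit routing: each hop must match at least one
    more digit of the target GUID than the current node does."""
    path, cur = [start], start
    while cur != guid:
        better = [n for n in nodes
                  if shared_prefix(n, guid) > shared_prefix(cur, guid)]
        if not better:
            break  # cur is the surrogate root for this GUID
        cur = max(better, key=lambda n: shared_prefix(n, guid))
        path.append(cur)
    return path

nodes = ["0128", "1167", "3213", "3274", "4577", "5544", "6003", "9098"]
path = route(nodes, "9098", "3274")
assert path[-1] == "3274"
assert len(path) <= 1 + len("3274")  # at least one digit per hop
```

When no node with the exact GUID exists, the walk ends at a deterministic surrogate root, which is where object pointers are published.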
Self-Organizing Second Tier
• Preliminary results show that this is effective
– Dennis will talk about effectiveness for streaming updates
• Have simple algorithms for placing replicas on nodes in the interior
– Intuition: locality properties of the network help place replicas
– DOLR helps associate parents and children to build the multicast tree
Tapestry Stability under Faults
• Instability is the common case…!
– Small half-life for P2P apps (1 hour????)
– Congestion, flash crowds, misconfiguration, faults
• Must use the DOLR under instability!
– The right thing must just happen
• Tapestry is a natural framework to exploit redundant elements and connections
– Multiple roots, links, etc.
– Easy to reconstruct routing and location information
– Stable, repairable layer
• Thermodynamic analogies:
– Heat capacity of the DOLR network
– Entropy of links (decay of underlying order)
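One concrete form of the “multiple roots” redundancy is to publish each object under several root IDs derived from salted hashes of its GUID, so a lookup can fall back when a root fails. A sketch under that assumption (all names invented):

```python
import hashlib

def roots(guid: str, n: int = 3):
    """Derive n independent root IDs for one object via salted hashes."""
    return [hashlib.sha256(f"{guid}:{salt}".encode()).hexdigest()
            for salt in range(n)]

def lookup(guid, alive_roots, directory):
    """Try each root in turn; succeed if any live root has a pointer."""
    for r in roots(guid):
        if r in alive_roots and guid in directory.get(r, set()):
            return r
    return None

guid = "obj-42"
r0, r1, r2 = roots(guid)
directory = {r0: {guid}, r1: {guid}, r2: {guid}}  # published at all roots

assert lookup(guid, {r0, r1, r2}, directory) == r0
assert lookup(guid, {r2}, directory) == r2  # survives loss of two roots
assert lookup(guid, set(), directory) is None
```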
Single Node Tapestry
[Figure: layered node architecture. Applications (OceanStore, application-level multicast, other applications) sit atop an application interface / upcall API; below it, the router with its routing table & object-pointer DB and dynamic node management; underneath, network link management and transport protocols.]
It’s Alive on PlanetLab!
• Tapestry Java deployment
– 6-7 nodes on each physical machine
– IBM Java JDK 1.3.0
– Node virtualization inside JVM and SEDA
– Scheduling between virtual nodes increases latency
• Dynamic insertion algorithms mostly working
– Experiments with many simultaneous insertions
– Node deletion getting there
• Tomorrow: Ben Zhao on Tapestry deployment
Object Location
[Plot: RDP (min, median, 90th percentile) vs. client-to-object round-trip ping time, in 1 ms buckets; RDP spans roughly 0-25 over ping times of 0-200 ms.]
Tradeoff: Storage vs Locality
Tomorrow: Jeremy Stribling on Locality
Archival Dissemination of Fragments
Fraction of Blocks Lost per Year (FBLPY)
• Exploit the law of large numbers for durability!
• With a 6-month repair epoch, FBLPY:
– Replication: 0.03
– Fragmentation: 10^-35
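A back-of-envelope version of the law-of-large-numbers argument, using an assumed (illustrative, not measured) per-copy failure probability per repair epoch: losing a block needs only 2 replica deaths under replication, but 17 of 32 fragment deaths under a 16-of-32 erasure code.

```python
from math import comb

p = 0.1  # assumed probability a replica/fragment dies before repair

# Replication (2 copies): the block is lost only if both replicas die.
loss_replication = p ** 2

# Fragmentation (n = 32 fragments, any m = 16 rebuild the block):
# lost only if fewer than m fragments survive the epoch.
n, m = 32, 16
loss_fragments = sum(comb(n, k) * (1 - p) ** k * p ** (n - k)
                     for k in range(m))

# Same raw storage ratio in both cases (2x), wildly different durability.
assert loss_fragments < loss_replication / 1e5
```

The exact 10^-35 figure on the slide comes from the project’s own parameters; the point this sketch preserves is the orders-of-magnitude gap at equal storage overhead.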
The Dissemination Process: Achieving Failure Independence
[Figure: introspection loop. Monitoring of the network feeds a model builder (with human input); the resulting model drives a set creator, which probes candidate servers by type and hands dissemination sets to the inner rings, which then push fragments to the chosen servers.]
Active Data Maintenance
• Tapestry enables “data-driven multicast”
– Mechanism for local servers to watch each other
– Efficient use of bandwidth (locality)
[Figure: a ring of L1 heartbeats among neighboring nodes (IDs 3274, 4577, 5544, AE87, 3213, 9098, 1167, 6003, 0128) connected by L1/L2/L3 Tapestry links.]
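The heartbeat ring can be sketched as follows (details invented): each server watches the next server around the ring and flags any successor whose heartbeat has gone silent, triggering repair of the fragments it held.

```python
def missing_heartbeats(ring, last_seen, now, timeout):
    """Each node checks its successor around the ring; return the
    nodes whose heartbeats are stale and should be repaired."""
    flagged = []
    for i, node in enumerate(ring):
        successor = ring[(i + 1) % len(ring)]
        if now - last_seen[successor] > timeout:
            flagged.append(successor)
    return flagged

ring = ["3274", "4577", "5544", "9098"]
last_seen = {"3274": 100, "4577": 100, "5544": 40, "9098": 100}

# Node 5544 last beat at t=40; at t=105 with a 30-tick timeout it is
# flagged by its predecessor 4577.
assert missing_heartbeats(ring, last_seen, now=105, timeout=30) == ["5544"]
```

Because each node watches only one neighbor, monitoring cost stays constant per node no matter how large the system grows, which is the locality/bandwidth point the bullet makes.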
Project Seagull
• Push for a long-term stable archive
– Fault-tolerant networking
– Periodic restart of servers
– Correlation analysis for fragment placement
– Efficient heartbeats for fragment tracking
– Repair mechanisms
• Use for a backup system
– Conversion of dump to use OceanStore
– With versioning: yields a first-class archival system
• Use for web browsing
– Versioning yields a long-term history of web sites
Pond: the OceanStore Prototype
First Implementation [Java]
• Event-driven state-machine model
– 150,000 lines of Java code and growing
• Included components:
– DOLR network (Tapestry): object location with locality; self-configuring, self-repairing
– Full write path: conflict resolution and Byzantine agreement
– Self-organizing second tier: replica placement and multicast tree construction
– Introspective gathering of tacit info and adaptation: clustering, prefetching, adaptation of network routing
– Archival facilities: interleaved Reed-Solomon codes for fragmentation; independence monitoring; data-driven repair
• Downloads available from www.oceanstore.org
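As a flavor of the m-of-n property the archival layer relies on (the prototype’s interleaved Reed-Solomon codes are far more engineered than this), here is a toy erasure code over a prime field: m data words define a degree-(m-1) polynomial, n evaluation points become the fragments, and any m surviving fragments interpolate the block back.

```python
P = 2_147_483_647  # prime modulus (2^31 - 1); all arithmetic is mod P

def eval_lagrange(points, x):
    """Evaluate the unique degree-(m-1) polynomial through `points`
    at x, using Lagrange interpolation over the field Z_P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * ((x - xj) % P) % P
                den = den * ((xi - xj) % P) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P
    return total

def encode(data, n):
    """Treat the m data words as polynomial values at x = 0..m-1 and
    emit n fragments: the polynomial's values at x = 0..n-1."""
    pts = list(enumerate(data))
    return [(x, eval_lagrange(pts, x)) for x in range(n)]

def decode(fragments, m):
    """Rebuild the m data words from ANY m surviving fragments."""
    pts = fragments[:m]
    return [eval_lagrange(pts, x) for x in range(m)]

block = [11, 22, 33, 44]          # m = 4 data words
frags = encode(block, n=8)        # n = 8 fragments (2x overhead)
assert decode(frags[4:8], 4) == block   # even with the first 4 lost
```

This is exactly why fragment loss must be so much worse than replica loss before a block dies: any m-of-n subset suffices.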
Event-Driven Architecture of an OceanStore Node
• Data-flow style
– Arrows indicate flow of messages
• Potential to exploit small multiprocessors at each physical node
[Figure: message flow among the stages within a node and out to the world.]
Working Applications
MINO: Wide-Area E-Mail Service
[Figure: clients on a local network reach MINO through IMAP and SMTP proxies, which sit on a mail object layer atop the OceanStore client API; mail lives in replicated OceanStore objects and interoperates with traditional mail gateways across the Internet.]
• Complete mail solution
– E-mail inbox
– IMAP folders
Riptide: Caching the Web with OceanStore
Other Apps
• Long-running archive
– Project Seagull
• File system support
– NFS with time travel (like VMS)
– Windows installable file system (soon)
• Anonymous file storage
– Mnemosyne uses Tapestry by itself
• Palm Pilot synchronization
– Palm database as an OceanStore DB
• Come see the OceanStore demo at the poster session: IMAP on OceanStore / versioned NFS
Future Challenges
• Fault tolerance
– Network/Tapestry layer
– Inner ring
• Repair
– Continuous monitoring/restart of components
• Online/offline validation
– What mechanisms can be used to increase confidence and reliability in systems like OceanStore?
• More intelligent replica management
• Security
– Data-level security
– Tapestry-level admission control
• “Eat our own dogfood”
– Continuous deployment of OceanStore components
• Large-scale thermodynamic design
– Is there a science of aggregate systems design?
OceanStore Sessions
http://10.0.0.1/
• ROC: Monday (3:30pm - 5:00pm)
– OceanStore Pond deployment
– Evolution of data format and structure
– Tentative updates
• Shared: Monday (5:30pm - 6:00pm)
– OceanStore long-term archival storage
• Sahara: Tuesday (8:30am - 9:10am)
– Tapestry status and deployment
– Peer-to-peer benchmarking (Chord/Tapestry)
– Tapestry locality enhancement
• Sahara: Tuesday (11:35am - 12:00pm)
– Peer-to-peer APIs
For more info: http://oceanstore.org
• OceanStore vision paper (ASPLOS 2000): “OceanStore: An Architecture for Global-Scale Persistent Storage”
• OceanStore prototype (FAST 2003): “Pond: the OceanStore Prototype”
• Tapestry algorithms paper (SPAA 2002): “Distributed Object Location in a Dynamic Network”
• Upcoming Tapestry deployment paper (JSAC): “Tapestry: A Global-Scale Overlay for Rapid Service Deployment”
• Probabilistic routing (INFOCOM 2002): “Probabilistic Location and Routing”
• Upcoming CACM paper (not until February): “Extracting Guarantees from Chaos”