
Page 1:

OceanStore: Toward Global-Scale, Self-Repairing, Secure and Persistent Storage

John Kubiatowicz, University of California at Berkeley

Page 2:

OceanStore: MIT Seminar ©2002 John Kubiatowicz/UC Berkeley

OceanStore Context: Ubiquitous Computing

• Computing everywhere:
– Desktop, laptop, palmtop
– Cars, cellphones
– Shoes? Clothing? Walls?

• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser

• Where is persistent data????

Page 3:

Utility-based Infrastructure?

[Diagram: a federated storage utility spanning provider clouds such as Pac Bell, Sprint, AT&T, IBM, and a Canadian OceanStore]

• Data service provided by storage federation
• Cross-administrative domain
• Pay for service

Page 4:

OceanStore: Everyone’s Data, One Big Utility

“The data is just out there”

• How many files in the OceanStore?
– Assume 10^10 people in world
– Say 10,000 files/person (very conservative?)
– So 10^14 files in OceanStore!
– If 1 GB files (OK, a stretch), get 1 mole of bytes!

• Truly impressive number of elements… but small relative to physical constants

• Aside: new results: 1.5 exabytes/year (1.5×10^18)
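The back-of-envelope arithmetic above can be checked in a few lines (a sketch of the slide's own assumptions; the "mole of bytes" comparison uses Avogadro's number):

```python
# Back-of-envelope scale of OceanStore, per the slide's assumptions.
PEOPLE = 10**10           # assumed world population
FILES_PER_PERSON = 10**4  # 10,000 files per person
BYTES_PER_FILE = 10**9    # 1 GB per file ("a stretch")
AVOGADRO = 6.022e23       # one mole

files = PEOPLE * FILES_PER_PERSON
total_bytes = files * BYTES_PER_FILE

print(f"files: 10^{len(str(files)) - 1}")             # 10^14 files
print(f"bytes as fraction of a mole: {total_bytes / AVOGADRO:.2f}")
```

So 10^23 bytes lands within a factor of ten of a mole, which is the slide's point: huge, but small relative to physical constants.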

Page 5:

OceanStore Assumptions

• Untrusted Infrastructure:
– The OceanStore is comprised of untrusted components
– Individual hardware has finite lifetimes
– All data encrypted within the infrastructure

• Responsible Party:
– Some organization (i.e., a service provider) guarantees that your data is consistent and durable
– Not trusted with content of data, merely its integrity

• Mostly Well-Connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit multicast for quicker consistency when possible

• Promiscuous Caching:
– Data may be cached anywhere, anytime

Page 6:

The Peer-to-Peer View: Irregular Mesh of “Pools”

Page 7:

Key Observation: Want Automatic Maintenance

• Can’t possibly manage billions of servers by hand!

• System should automatically:
– Adapt to failure
– Exclude malicious elements
– Repair itself
– Incorporate new elements

• System should be secure and private
– Encryption, authentication

• System should preserve data over the long term (accessible for 1000 years):
– Geographic distribution of information
– New servers added from time to time
– Old servers removed from time to time
– Everything just works

Page 8:

Outline: Three Technologies and a Principle

Principle: ThermoSpective Systems Design
– Redundancy and repair everywhere

1. Structured, Self-Verifying Data
– Let the infrastructure know what is important

2. Decentralized Object Location and Routing
– A new abstraction for routing

3. Deep Archival Storage
– Long-term durability

Page 9:

ThermoSpective Systems

Page 10:

Goal: Stable, Large-Scale Systems

• State of the art:
– Chips: 10^8 transistors, 8 layers of metal
– Internet: 10^9 hosts, terabytes of bisection bandwidth
– Societies: 10^8 to 10^9 people, 6 degrees of separation

• Complexity is a liability!
– More components → higher failure rate
– Chip verification > 50% of design team
– Large societies unstable (especially when centralized)
– Small, simple, perfect components combine to generate complex emergent behavior!

• Can complexity be a useful thing?
– Redundancy and interaction can yield stable behavior
– Better figure out new ways to design things…

Page 11:

Question: Can We Exploit Complexity to Our Advantage?

Moore’s Law gains → potential for stability

Page 12:

The Thermodynamic Analogy

• Large systems have a variety of latent order
– Connections between elements
– Mathematical structure (erasure coding, etc.)
– Distributions peaked about some desired behavior

• Permits “stability through statistics”
– Exploit the behavior of aggregates (redundancy)

• Subject to entropy
– Servers fail, attacks happen, system changes

• Requires continuous repair
– Apply energy (i.e., through servers) to reduce entropy

Page 13:

The Biological Inspiration

• Biological systems are built from (extremely) faulty components, yet:
– They operate with a variety of component failures → redundancy of function and representation
– They have stable behavior → negative feedback
– They are self-tuning → optimization of the common case

• Introspective (Autonomic) Computing:
– Components for performing
– Components for monitoring and model building
– Components for continuous adaptation

[Diagram: Dance → Monitor → Adapt cycle]

Page 14:

ThermoSpective

• Many redundant components (fault tolerance)
• Continuous repair (entropy reduction)

[Diagram: Dance → Monitor → Adapt cycle]

Page 15:

Object-Based Storage

Page 16:

Let the Infrastructure help You!

• End-to-end and everywhere else:
– Must distribute responsibility to guarantee QoS, latency, availability, durability

• Let the infrastructure understand the vocabulary or semantics of the application
– Rules of correct interaction?

Page 17:

OceanStore Data Model

• Versioned Objects
– Every update generates a new version
– Can always go back in time (time travel)

• Each version is read-only
– Can have a permanent name
– Much easier to repair

• An object is a signed mapping between a permanent name and the latest version
– Write access control/integrity involves managing these mappings

[Diagram: comet analogy — a stream of updates trailing off into versions]

Page 18:

Secure Hashing

• Read-only data: GUID is hash over actual data
– Uniqueness and unforgeability: the data is what it is!
– Verification: check hash over data

• Changeable data: GUID is combined hash over a human-readable name + public key
– Uniqueness: GUID space selected by public key
– Unforgeability: public key is indelibly bound to GUID

• Thermodynamic insight: hashing makes “data particles” unique, simplifying interactions

[Diagram: DATA → SHA-1 → 160-bit GUID]
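The two GUID flavours can be sketched in a few lines (the deck specifies SHA-1 and a name+key hash, but not an exact serialization, so the helper names and byte layout here are illustrative):

```python
import hashlib

def data_guid(data: bytes) -> str:
    # Read-only data: the GUID is the SHA-1 hash of the data itself,
    # so any holder of the block can verify it against its name.
    return hashlib.sha1(data).hexdigest()

def object_guid(name: str, public_key: bytes) -> str:
    # Changeable data: the GUID combines a human-readable name with the
    # owner's public key (the concatenation here is an assumption).
    return hashlib.sha1(name.encode() + public_key).hexdigest()

block = b"hello, ocean"
guid = data_guid(block)
assert len(guid) == 40            # 160 bits, as hex digits
assert data_guid(block) == guid   # verification: re-hash and compare
```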

Page 19:

Self-Verifying Objects

[Diagram: versions VGUIDi and VGUIDi+1, each a B-tree of indirect blocks over data blocks d1…d9; the new version shares unchanged blocks (copy-on-write), replaces d8 and d9 with d′8 and d′9, and carries a backpointer to its predecessor. Updates flow in; heartbeats and read-only data flow out.]

• AGUID = hash{name + keys}

• Heartbeat: {AGUID, VGUID, timestamp}signed
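Because every block is named by its hash, a reader can verify a whole version tree recursively; a toy sketch (the block format with an "I:" prefix for indirect blocks is hypothetical, the deck only fixes hash-based naming):

```python
import hashlib

def guid(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()

store = {}  # GUID -> block; an untrusted block store

def put(block: bytes) -> str:
    g = guid(block)
    store[g] = block
    return g

def verify(g: str) -> bool:
    # A block is self-verifying: its name must equal the hash of its bytes.
    # Indirect blocks (prefix b"I:") list child GUIDs, verified recursively.
    block = store[g]
    if guid(block) != g:
        return False
    if block.startswith(b"I:"):
        return all(verify(child.decode()) for child in block[2:].split(b","))
    return True

d1, d2 = put(b"data block 1"), put(b"data block 2")
root = put(b"I:" + ",".join([d1, d2]).encode())
assert verify(root)
store[d2] = b"tampered"   # corruption anywhere in the tree is detected
assert not verify(root)
```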

Page 20:

The Path of an OceanStore Update

[Diagram: clients submit updates to the inner-ring servers; the agreed result flows down multicast trees to second-tier caches]

Page 21:

OceanStore Consistency via Conflict Resolution

• Consistency is a form of optimistic concurrency
– An update packet contains a series of predicate-action pairs which operate on encrypted data
– Each predicate is tried in turn:
• If none match, the update is aborted
• Otherwise, the action of the first true predicate is applied

• Role of Responsible Party
– All updates submitted to the Responsible Party, which chooses a final total order
– Byzantine agreement with threshold signatures

• This is powerful enough to synthesize:
– ACID database semantics
– Release consistency (build and use MCS-style locks)
– Extremely loose (weak) consistency
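The predicate-action model above can be sketched as follows (the object shape and predicates are hypothetical; real OceanStore predicates operate on encrypted data):

```python
# An update is a list of (predicate, action) pairs: the action of the
# first true predicate is applied; if none match, the update aborts.
def apply_update(obj, update):
    for predicate, action in update:
        if predicate(obj):
            return action(obj)  # produces the next version
    raise RuntimeError("update aborted: no predicate matched")

obj = {"version": 3, "balance": 100}

update = [
    # Compare-and-swap style: apply only against the expected version.
    (lambda o: o["version"] == 3,
     lambda o: {"version": 4, "balance": o["balance"] - 25}),
]

obj = apply_update(obj, update)
assert obj == {"version": 4, "balance": 75}
```

A second client racing with the same predicate would find version 4 and abort, which is exactly the optimistic-concurrency behavior the slide describes.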

Page 22:

Self-Organizing Soft-State Replication

• Simple algorithms for placing replicas on nodes in the interior
– Intuition: locality properties of Tapestry help select positions for replicas
– Tapestry helps associate parents and children to build the multicast tree

• Preliminary results show that this is effective

Page 23:

Decentralized Object Location and Routing

Page 24:

Locality, Locality, Locality: One of the Defining Principles

• The ability to exploit local resources over remote ones whenever possible

• “-Centric” approach
– Client-centric, server-centric, data source-centric

• Requirements:
– Find data quickly, wherever it might reside
• Locate nearby objects without global communication
• Permit rapid object migration
– Verifiable: can’t be sidetracked
• Data name cryptographically related to data

Page 25:

Enabling Technology: DOLR (Decentralized Object Location and Routing)

[Diagram: a DOLR cloud routes messages addressed to GUID1 and GUID2 to nearby replicas of those objects]

Page 26:

Basic Tapestry Mesh: Incremental Prefix-Based Routing

[Diagram: nodes with IDs such as 0xEF34, 0xEF31, 0xEFBA, 0x0921, 0xE932, 0xEF37, 0xE324, 0xEF97, 0xEF32, 0xFF37, 0xE555, 0xE530, 0xEF44, 0x0999, 0x099F, 0xE399, 0xEF40; numbered links (levels 1-4) show routes resolving one more matching prefix digit per hop]
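Prefix routing of this kind is easy to simulate. The sketch below resolves one more matching hex digit per hop; the node IDs come from the figure, but the neighbor-selection rule is a simplification of real Tapestry routing tables:

```python
def prefix_len(a, b):
    """Number of leading hex digits a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(src, dest, nodes):
    # At each hop, move to a node sharing a strictly longer prefix with
    # dest; preferring the shortest such extension mimics Tapestry's
    # digit-by-digit resolution.
    path, cur = [src], src
    while cur != dest:
        k = prefix_len(cur, dest)
        cands = [n for n in nodes if prefix_len(n, dest) > k]
        cur = min(cands, key=lambda n: prefix_len(n, dest))
        path.append(cur)
    return path

nodes = {"0921", "E932", "EF37", "EF34", "EF32"}
path = route("0921", "EF32", nodes)
assert path[-1] == "EF32"
# Each hop matches at least one more digit of the destination.
assert all(prefix_len(a, "EF32") < prefix_len(b, "EF32")
           for a, b in zip(path, path[1:]))
```

Because the matched prefix grows by at least one digit per hop, any route terminates in at most as many hops as the ID has digits.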

Page 27:

Use of Tapestry Mesh: Randomization and Locality

Page 28:

Stability under Faults

• Instability is the common case…!
– Small half-life for P2P apps (1 hour????)
– Congestion, flash crowds, misconfiguration, faults

• Must use DOLR under instability!
– The right thing must just happen

• Tapestry is a natural framework to exploit redundant elements and connections
– Multiple roots, links, etc.
– Easy to reconstruct routing and location information
– Stable, repairable layer

• Thermodynamic analogies:
– Heat capacity of DOLR network
– Entropy of links (decay of underlying order)

Page 29:

Single Node Tapestry

[Diagram: node architecture, top to bottom]
• Applications: OceanStore, application-level multicast, other applications
• Application Interface / Upcall API
• Router (routing table & object pointer DB), with dynamic node management alongside
• Transport protocols
• Network link management

Page 30:

It’s Alive!

• PlanetLab global network
– 98 machines at 42 institutions, in North America, Europe, Australia (~60 machines utilized)
– 1.26 GHz PIII (1 GB RAM), 1.8 GHz P4 (2 GB RAM)
– North American machines (2/3) on Internet2

• Tapestry Java deployment
– 6-7 nodes on each physical machine
– IBM Java JDK 1.3.0
– Node virtualization inside JVM and SEDA
– Scheduling between virtual nodes increases latency

Page 31:

Object Location

[Graph: RDP (min, median, 90th percentile) vs. client-to-object round-trip ping time, in 1 ms buckets]

Page 32:

Tradeoff: Storage vs Locality

Page 33:

Management Behavior

• Integration latency (left)
– Humps = additional levels

• Per-node integration bandwidth cost (right)
– Localized!

• Continuous, multi-node insertion/deletion works!

[Graphs: left, integration latency (ms) vs. size of existing network (0-500 nodes); right, control traffic bandwidth (KB) vs. size of existing network (0-400 nodes)]

Page 34:

Deep Archival Storage

Page 35:

Two Types of OceanStore Data

• Active Data: “Floating Replicas”
– Per-object virtual server
– Interaction with other replicas for consistency
– May appear and disappear like bubbles

• Archival Data: OceanStore’s Stable Store
– m-of-n coding: like a hologram
• Data coded into n fragments, any m of which are sufficient to reconstruct (e.g., m=16, n=64)
• Coding overhead is proportional to n/m (e.g., 4)
• The other parameter, rate, is 1/overhead

– Fragments are cryptographically self-verifying

• Most data in the OceanStore is archival!
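A toy m-of-n code shows the idea behind the stable store (Lagrange interpolation over a prime field; real OceanStore uses interleaved Reed-Solomon codes, and this sketch omits fragment self-verification):

```python
# Toy m-of-n erasure code via Lagrange interpolation over a prime field.
PRIME = 2**31 - 1  # a Mersenne prime; all arithmetic is mod PRIME

def _interp(points, t):
    # Evaluate the unique degree-(len(points)-1) polynomial through
    # `points` at position t, using Lagrange's formula mod PRIME.
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (t - xj) % PRIME
                den = den * (xi - xj) % PRIME
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total

def encode(words, n):
    # The m data words are the polynomial's values at 0..m-1, so the
    # first m fragments are the data itself (a systematic code).
    m = len(words)
    pts = list(enumerate(words))
    return [(x, words[x] if x < m else _interp(pts, x)) for x in range(n)]

def decode(frags, m):
    # Any m fragments pin down the polynomial; re-evaluate at 0..m-1.
    assert len(frags) >= m
    return [_interp(frags[:m], t) for t in range(m)]

words = [7, 11, 13, 42]                # m = 4 data words
frags = encode(words, 8)               # n = 8 fragments
assert decode(frags[3:7], 4) == words  # any 4 of the 8 suffice
```

With the slide’s m=16, n=64, the overhead is n/m = 4 and the rate is 1/4.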

Page 36:

Archival Dissemination of Fragments

Page 37:

Fraction of Blocks Lost per Year (FBLPY)

• Exploit the law of large numbers for durability!
• With 6-month repair, FBLPY:
– Replication: 0.03
– Fragmentation: 10^-35
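The gap between replication and fragmentation can be reproduced with a simple binomial model. The per-server failure probability within a repair epoch is a hypothetical parameter here; the slide's exact numbers come from a more detailed disk-failure model:

```python
from math import comb

def loss_replication(p, copies):
    # A block is lost only if every replica fails within the repair epoch.
    return p ** copies

def loss_fragments(p, n, m):
    # A block is lost if fewer than m of the n fragments survive.
    q = 1.0 - p  # per-fragment survival probability
    return sum(comb(n, i) * q**i * p**(n - i) for i in range(m))

p = 0.1  # assumed failure probability per repair epoch (illustrative)
rep = loss_replication(p, 2)
frag = loss_fragments(p, n=64, m=16)
print(rep, frag)  # fragmentation is astronomically more durable
assert frag < rep
```

Even with a pessimistic 10% per-epoch failure rate, losing 49 of 64 fragments at once is essentially impossible, which is the law-of-large-numbers effect the slide invokes.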

Page 38:

The Dissemination Process: Achieving Failure Independence

[Diagram: network monitoring feeds a model builder (introspection, plus human input); a set creator uses the model and probe type to choose dissemination sets; inner-ring servers then scatter fragments to the chosen servers]

Page 39:

Independence Analysis

• Information gathering:
– State of fragment servers (up/down/etc.)

• Correlation analysis:
– Use a metric such as mutual information
– Cluster via that metric
– Result partitions servers into uncorrelated clusters
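The correlation step might look like this: estimate mutual information between servers' up/down traces, then cluster servers whose MI is high (the traces below are illustrative, not real monitoring data):

```python
from math import log2

def mutual_info(a, b):
    """Mutual information (bits) between two equal-length 0/1 traces."""
    n = len(a)
    joint, pa, pb = {}, {}, {}
    for x, y in zip(a, b):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        pa[x] = pa.get(x, 0) + 1
        pb[y] = pb.get(y, 0) + 1
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * log2(pxy / ((pa[x] / n) * (pb[y] / n)))
    return mi

# Toy up/down traces: s1 and s2 fail together (same rack, say); s3 doesn't.
s1 = [0, 1] * 50
s2 = list(s1)
s3 = [0] * 50 + [1] * 50

assert mutual_info(s1, s2) > mutual_info(s1, s3)
```

Servers with near-zero pairwise MI go into different clusters, and fragments of one block are spread across clusters so that failures hit them independently.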

Page 40:

Active Data Maintenance

• Tapestry enables “data-driven multicast”
– Mechanism for local servers to watch each other
– Efficient use of bandwidth (locality)

[Diagram: Tapestry nodes (3274, 4577, 5544, AE87, 3213, 9098, 1167, 6003, 0128) connected by L1, L2, and L3 links, with a ring of L1 heartbeats among nearby nodes]

Page 41:

1000-Year Durability?

• Exploiting infrastructure for repair
– DOLR permits an efficient heartbeat mechanism to notice:
• Servers going away for a while
• Or, going away forever!
– Continuous sweep through data also possible
– Erasure code provides flexibility in timing

• Data continuously transferred from physical medium to physical medium
– No “tapes decaying in basement”
– Information becomes fully virtualized

• Thermodynamic analogy: use of energy (supplied by servers) to suppress entropy

Page 42:

PondStore Prototype

Page 43:

First Implementation [Java]:

• Event-driven state-machine model
– 150,000 lines of Java code and growing

• Included components:
– DOLR network (Tapestry)
• Object location with locality
• Self-configuring, self-repairing
– Full write path
• Conflict resolution and Byzantine agreement
– Self-organizing second tier
• Replica placement and multicast tree construction
– Introspective gathering of tacit info and adaptation
• Clustering, prefetching, adaptation of network routing
– Archival facilities
• Interleaved Reed-Solomon codes for fragmentation
• Independence monitoring
• Data-driven repair

• Downloads available from www.oceanstore.org

Page 44:

Event-Driven Architecture of an OceanStore Node

• Data-flow style
– Arrows indicate flow of messages

• Potential to exploit small multiprocessors at each physical node

[Diagram: event-driven stages exchanging messages with the outside world]

Page 45:

First Prototype Works!

• Latest: up to 8 MB/sec (local area network)
– Biggest constraint: threshold signatures

• Still a ways to go, but working

Page 46:

Update Latency

• Cryptography in critical path (not surprising!)

Page 47:

Working Applications

Page 48:

MINO: Wide-Area E-Mail Service

[Diagram: clients on a local network talk to SMTP and IMAP proxies; a mail object layer above the OceanStore Client API stores mail as replicated OceanStore objects, alongside traditional mail gateways on the Internet]

• Complete mail solution
– Email inbox
– IMAP folders

Page 49:

Riptide: Caching the Web with OceanStore

Page 50:

Other Apps

• Long-running archive
– Project Seagull

• File-system support
– NFS with time travel (like VMS)
– Windows installable file system (soon)

• Anonymous file storage:
– Mnemosyne uses Tapestry by itself

• Palm Pilot synchronization
– Palm database as an OceanStore DB

Page 51:

Conclusions

• Exploitation of complexity
– Large amounts of redundancy and connectivity
– Thermodynamics of systems: “stability through statistics”
– Continuous introspection

• Help the infrastructure to help you
– Decentralized Object Location and Routing (DOLR)
– Object-based storage
– Self-organizing redundancy
– Continuous repair

• OceanStore properties:
– Provides security, privacy, and integrity
– Provides extreme durability
– Lower maintenance cost through redundancy, continuous adaptation, self-diagnosis and repair

Page 52:

For more info: http://oceanstore.org

• OceanStore vision paper (ASPLOS 2000):
“OceanStore: An Architecture for Global-Scale Persistent Storage”

• Tapestry algorithms paper (SPAA 2002):
“Distributed Object Location in a Dynamic Network”

• Bloom filters for probabilistic routing (INFOCOM 2002):
“Probabilistic Location and Routing”

• Upcoming CACM paper (not until February):
“Extracting Guarantees from Chaos”

Page 53:

Backup Slides

Page 54:

Secure Naming

• Naming hierarchy:
– Users map from names to GUIDs via a hierarchy of OceanStore objects (à la SDSI)
– Requires a set of “root keys” to be acquired by the user

[Diagram: an out-of-band “root link” anchors a directory chain Foo → Bar → Baz → Myfile]

Page 55:

Self-Organized Replication

Page 56:

Effectiveness of second tier

Page 57:

Second-Tier Adaptation: Flash Crowd

• Actual web cache running on OceanStore
– Replica 1 far away
– Replica 2 close to most requestors (created t ≈ 20)
– Replica 3 close to rest of requestors (created t ≈ 40)

Page 58:

Introspective Optimization

• Secondary tier self-organized into overlay multicast tree:
– Presence of DOLR with locality to suggest placement of replicas in the infrastructure
– Automatic choice between update vs. invalidate

• Continuous monitoring of access patterns:
– Clustering algorithms to discover object relationships
• Clustered prefetching: demand-fetching related objects
• Proactive prefetching: get data there before it is needed
– Time-series analysis of user and data motion
• Placement of replicas to increase availability

Page 59:

Statistical Advantage of Fragments

• Latency and standard deviation reduced:
– Memory-less latency model
– Rate-1/2 code with 32 total fragments

[Graph: time to coalesce vs. number of fragments requested (16-31 of 32), TI5000]

Page 60:

Parallel Insertion Algorithms (SPAA ’02)

• Massive parallel insert is important
– We now have algorithms that handle “arbitrary simultaneous inserts”
– Construction of nearest-neighbor mesh links
• log² n message complexity → fully operational routing mesh
– Objects kept available during this process
• Incremental movement of pointers

• Interesting issue: introduction service
– How does a new node find a gateway into the Tapestry?

Page 61:

Can You Delete (Eradicate) Data?

• Eradication is antithetical to durability!
– If you can eradicate something, then so can someone else! (denial of service)
– Must have an “eradication certificate” or similar

• Some answers:
– Bays: limit the scope of data flows
– Ninja Monkeys: hunt and destroy with certificate

• Related: revocation of keys
– Need hunt-and-re-encrypt operation

• Related: version pruning
– Temporary files: don’t keep versions for long
– Streaming, real-time broadcasts: keep? Maybe
– Locks: keep? No, yes, maybe (auditing!)
– Every keystroke made: keep? For a short while?