EECE 571R: Data-intensive computing systems

Matei Ripeanu

matei at ece.ubc.ca

Matei Ripeanu, UBC EECE571R Data-intensive computing (Spring’07)

Contact Info

Email: matei @ ece.ubc.ca
Office: KAIS 4033
Office hours: by appointment (email me)
Course page: http://www.ece.ubc.ca/~matei/EECE571/


EECE 571R: Course Goals

Primary
– Gain deep understanding of fundamental issues that affect design of:
> Data-intensive systems
> (more generally) Large-scale distributed systems
– Survey main current research themes
– Gain experience with distributed systems research
> Research on: federated system, networks

Secondary
– By studying a set of outstanding papers, build knowledge of how to do & present research
– Learn how to read papers & evaluate ideas


What I’ll Assume You Know

Basic Internet architecture
– IP, TCP, DNS, HTTP

Basic principles of distributed computing
– Asynchrony (cannot distinguish between communication failures and latency)
– Incomplete & inconsistent global state knowledge (cannot know everything correctly)
– Failures happen (in large systems, even rare failures of individual components aggregate to high failure rates)
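A back-of-the-envelope sketch of the last point, assuming component failures are independent (an assumption made here for illustration; the slide does not state it). With per-component failure probability p, the chance that at least one of n components fails is 1 - (1 - p)^n. The function name and the sample value of p are hypothetical.

```python
def prob_any_failure(p_component: float, n_components: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p_component) ** n_components


if __name__ == "__main__":
    # Hypothetical per-component failure probability over some time window.
    p = 1e-4
    for n in (1, 100, 10_000, 1_000_000):
        print(f"n={n:>9}: P(at least one failure) = {prob_any_failure(p, n):.4f}")
```

Under these assumptions a failure that is rare for one machine becomes close to certain somewhere in a system of a million machines.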

If there are things that don’t make sense, ask!


Outline

Case study (and project ideas):
– Volunteer computing: SETI@home / BOINC
– Virtual Data System
– Batch Aware Distributed File System

Administrative


How does it work?

SETI@home

Master-worker architecture

Characteristics:
– Fixed-rate data processing task
– Low bandwidth/computation ratio
– Independent parallelism
– Error tolerance


SETI@home Operations

[Figure: SETI@home operations. Components shown: data recorder, DLT tapes, splitters, WU storage, data server, screensavers, result queue, acct. queue, garbage collector, science DB, user DB, master DB, redundancy checking, RFI elimination, repeat detection, tape archive/delete, tape backup, web site, CGI program, web page generator.]


History and Statistics

Conceived 1995, launched April 1999
Millions of users, hosts…
No ET signals yet, but other results

(as of Wed Feb 23 07:04:51)

                             Total                   Last 24 Hours
Users                        5,361,313               4,391
Results received             1,779 million           5 million
Total CPU time               2.2 million years       3610.717 years
Average CPU time/work unit   10 hr 58 min 14.0 sec   6 hr 19 min 30.1 sec
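As a sanity check, the reported averages can be roughly reproduced from the other rows by dividing total CPU time by results received. This is a sketch: the result counts on the slide are rounded, and a 365.25-day year is assumed, so only approximate agreement should be expected.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumed year length


def avg_hours_per_result(cpu_years: float, results: float) -> float:
    """Average CPU hours spent per returned result."""
    return cpu_years * HOURS_PER_YEAR / results


# Figures taken from the statistics table (rounded on the slide).
total_avg = avg_hours_per_result(2.2e6, 1_779e6)
last_day_avg = avg_hours_per_result(3610.717, 5e6)

if __name__ == "__main__":
    print(f"all time:      {total_avg:.2f} h per result")
    print(f"last 24 hours: {last_day_avg:.2f} h per result")
```

Both values land within a few percent of the averages in the table, and the lower recent figure shows that the average CPU time per work unit has dropped over the life of the project.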


Millions of individual contributors! (Problems)

Server scalability
Dealing with excess CPU time
Untrusted environment: Bad user behavior
– Cheating
– Team recruitment by spam
– Sale of accounts on eBay
Malfunctions of individual components


SETI@home: Summary

The characteristics of the problem …
– Massive (“embarrassing”) parallelism
– Low bandwidth/computation ratio
– Fixed-rate data processing task

… make possible a solution that operates in an unfriendly environment
– Wide area distribution; huge scale
– High failure rates
– Untrusted/malicious components

Solution: Master-worker design
> Master = central point of control
> Single point of failure
> Performance bottleneck
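A minimal sketch of how a master can tolerate faulty or cheating workers by redundancy checking: each work unit goes to several workers and a result is accepted only when enough of them agree. This is an illustration, not SETI@home's actual code; the function names, the round-robin assignment, and the `replicas`/`quorum` parameters are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Optional, Sequence, Tuple


def validate(results: Sequence[Hashable], quorum: int) -> Optional[Hashable]:
    """Return the value reported by at least `quorum` workers, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None


def run_master(
    work_units: Sequence[int],
    workers: Sequence[Callable[[int], Hashable]],
    replicas: int = 3,
    quorum: int = 2,
) -> Tuple[Dict[int, Hashable], List[int]]:
    """Send each unit to `replicas` distinct workers (needs replicas <= len(workers))."""
    accepted: Dict[int, Hashable] = {}
    retry: List[int] = []
    for i, unit in enumerate(work_units):
        # Round-robin assignment; a real master would wait for workers to ask for work.
        assigned = [workers[(i + j) % len(workers)] for j in range(replicas)]
        verdict = validate([worker(unit) for worker in assigned], quorum)
        if verdict is None:
            retry.append(unit)  # no agreement: reissue the unit later
        else:
            accepted[unit] = verdict
    return accepted, retry


if __name__ == "__main__":
    honest = lambda x: x * x
    cheater = lambda x: 0  # returns a bogus result without computing
    print(run_master([1, 2, 3, 4], [honest, honest, cheater, honest]))
```

The master remains the single point of control, which is why it is also the single point of failure and the performance bottleneck noted above.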







Virtual Data System

Context: ’big science’

Motivation/goals: support science process,

– i.e., track all aspects of data capture, production, transformation, and analysis

Requirements: ability to define complex workflows, and to reliably & efficiently execute workflows in heterogeneous, multi-domain environments.

Derived benefits: helps to audit, validate, reproduce, and/or rerun with corrections various data transformations.


BIG Science!

The European Organisation for Nuclear Research

CERN builds particle accelerators for particle physics research


Data Handling and Computation for Physics Analysis

[Figure: CERN data flow. Stages and data products shown: detector, raw data, event filter (selection & reconstruction), processed data, event summary data, reconstruction, event reprocessing, event simulation, simulation, analysis, batch physics analysis, analysis objects (extracted by physics topic), interactive physics analysis. Label on figure: les.robertson@cern.ch, CERN.]


CMS Grid Hierarchy

[Figure: CMS Grid Hierarchy. Tier labels: Tier 0, Tier 1, Tier 2, Tier 3, Tier 4. Nodes: Online System; CERN Computer Center (> 20 TIPS); USA, France, Italy, and UK Centers; Tier2 Centers; Institutes; physics data cache; workstations and other portals. Link labels: 100 MB ~ 1.5 GB/sec, 10 ~ 40 Gbits/sec, 2.5-10 Gbits/sec, 0.6-2.5 Gbits/sec, 0.1-1 Gbits/sec. Annotations: bunch crossing per 25 ns; 100 triggers per second; ~1 MByte per event; experiment with 2500 physicists in 40 countries; 10s of Petabytes/Yr by 2008.]
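The figure's annotations can be turned into rough data rates. The sketch below uses only the numbers on the slide (25 ns bunch crossing, 100 triggers per second, about 1 MByte per event); the assumptions that 1 MByte means 10^6 bytes and that the detector records continuously for a full year are made here for illustration.

```python
BUNCH_CROSSING_INTERVAL_S = 25e-9   # from the slide
TRIGGERS_PER_S = 100                # from the slide
EVENT_SIZE_BYTES = 1e6              # "~1 MByte", assumed decimal
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumes continuous running

crossing_rate_hz = 1 / BUNCH_CROSSING_INTERVAL_S          # about 40 MHz
trigger_reduction = crossing_rate_hz / TRIGGERS_PER_S     # crossings per kept event
raw_rate_bytes_per_s = TRIGGERS_PER_S * EVENT_SIZE_BYTES  # about 100 MB/s
raw_pb_per_year = raw_rate_bytes_per_s * SECONDS_PER_YEAR / 1e15

if __name__ == "__main__":
    print(f"crossing rate:     {crossing_rate_hz / 1e6:.0f} MHz")
    print(f"trigger reduction: 1 in {trigger_reduction:,.0f}")
    print(f"raw event rate:    {raw_rate_bytes_per_s / 1e6:.0f} MB/s")
    print(f"raw data per year: {raw_pb_per_year:.1f} PB")
```

The 100 MB/s product matches the low end of the "100 MB ~ 1.5 GB/sec" link label. Triggered events alone give a few petabytes per year under these assumptions; the slide does not break down what else makes up its "10s of Petabytes/Yr" figure.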


Motivations (1)

[Figure: virtual data schema. Entities: Transformation, Derivation, Data. Relations: product-of, execution-of, consumed-by/generated-by.]

“I’ve detected a calibration error in an instrument and want to know which derived data to recompute.”

“I’ve come across some interesting data, but I need to understand the nature of the corrections applied when it was constructed before I can trust it for my purposes.”

“I want to search an astronomical database for galaxies with certain characteristics. If a program that performs this analysis exists, I won’t have to write one from scratch.”

“I want to apply an astronomical analysis program to millions of objects. If the results already exist, I’ll save weeks of computation.”
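The first question ("which derived data to recompute") becomes a graph traversal once every derivation records what it consumed and generated. A minimal sketch, assuming a hypothetical in-memory catalog; the class and method names are illustrative and are not the Virtual Data System's API.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set


class ProvenanceCatalog:
    """Records derivations: which transformation consumed and generated which data."""

    def __init__(self) -> None:
        self._derived_from: Dict[str, Set[str]] = defaultdict(set)  # input -> direct outputs
        self.producer: Dict[str, str] = {}  # output -> transformation that generated it

    def record_derivation(
        self, transformation: str, inputs: Iterable[str], outputs: Iterable[str]
    ) -> None:
        outputs = list(outputs)
        for out in outputs:
            self.producer[out] = transformation
        for src in inputs:
            self._derived_from[src].update(outputs)

    def downstream(self, dataset: str) -> Set[str]:
        """All data products that directly or indirectly depend on `dataset`."""
        seen: Set[str] = set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            for nxt in self._derived_from.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


if __name__ == "__main__":
    catalog = ProvenanceCatalog()
    catalog.record_derivation("calibrate", ["raw"], ["calibrated"])
    catalog.record_derivation("reconstruct", ["calibrated"], ["events"])
    catalog.record_derivation("simulate", ["config"], ["sim_events"])
    # A calibration error in "raw" taints everything derived from it.
    print(sorted(catalog.downstream("raw")))
```

The same records answer the second question as well: `producer` tells a user which transformation generated a data product before they decide to trust it.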


Motivations (2)

Data track-ability and result audit-ability
Repair and correction of data
– Rebuild data products (cf. “make”)
Workflow management
– A new, structured paradigm for organizing, locating, specifying, and requesting data products
Performance optimizations
– Ability to re-create data rather than move it
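The "make" analogy can be sketched as a staleness check: a data product needs rebuilding if it is missing, if any input is newer than it, or if any input itself needs rebuilding. The function below is an illustration under those assumptions; the names and the timestamp representation are hypothetical, and source datasets (those with no recorded inputs) are assumed to always have a timestamp.

```python
from typing import Dict, List, Optional


def needs_rebuild(
    target: str,
    inputs_of: Dict[str, List[str]],
    timestamp: Dict[str, float],
    _memo: Optional[Dict[str, bool]] = None,
) -> bool:
    """make-style check: is `target` missing or older than anything it depends on?"""
    memo = {} if _memo is None else _memo
    if target in memo:
        return memo[target]
    inputs = inputs_of.get(target, [])
    if not inputs:
        result = False  # source data is never rebuilt
    elif target not in timestamp:
        result = True   # product does not exist yet
    else:
        result = any(
            needs_rebuild(src, inputs_of, timestamp, memo)
            or timestamp[src] > timestamp[target]
            for src in inputs
        )
    memo[target] = result
    return result


if __name__ == "__main__":
    inputs_of = {"calibrated": ["raw"], "histogram": ["calibrated"]}
    # "raw" was corrected at t=30, after both products were built.
    timestamp = {"raw": 30.0, "calibrated": 20.0, "histogram": 25.0}
    print(needs_rebuild("histogram", inputs_of, timestamp))
```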


Requirements

Express complex multi-step “workflows”
– Perhaps 100,000s of individual tasks
Operate on heterogeneous distributed data
– Different formats & access protocols
Harness many computing resources
– Parallel computers &/or distributed Grids
Execute workflows reliably
– Despite diverse failure conditions
Enable reuse of data & workflows
– Discovery & composition
Support many users, workflows, resources
– Policy specification & enforcement


Virtual Data System

[Figure: Virtual Data System in three phases: workflow spec, create execution plan, Grid workflow execution. Components shown: VDL program, virtual data catalog, virtual data workflow generator, abstract workflow, local planner, job planner, job cleanup, DAGman DAG, statically partitioned DAG, dynamically planned DAG, DAGman & Condor-G.]
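Executing a workflow expressed as a DAG means running each task only after everything it depends on has finished, and running independent tasks in parallel. The sketch below groups tasks into such "waves" using Python's standard `graphlib` module (Python 3.9+). It illustrates the idea only and is not DAGMan's or Pegasus's interface; the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def plan_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; every task in a wave can run in parallel.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the workflow is not a DAG
    waves: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    workflow = {
        "reconstruct": {"filter"},
        "simulate": set(),
        "analyze": {"reconstruct", "simulate"},
    }
    for number, wave in enumerate(plan_waves(workflow), start=1):
        print(f"wave {number}: {wave}")
```

A real planner would also decide where each task runs and would retry failed tasks, which is what the reliability requirement above asks for.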


VDS Software Stack

Express complex multi-step “workflows”
– Perhaps 100,000s of individual tasks
Operate on heterogeneous distributed data
– Different formats & access protocols
Harness many computing resources
– Parallel computers &/or distributed res.
Execute workflows reliably & efficiently
– Despite diverse failure conditions
Enable reuse of data & workflows
– Discovery & composition
Support many users, workflows, resources
– Policy specification & enforcement

[Stack labels on the slide: VDL, XDTM; Pegasus, DAGman, Globus; VDC; TBD]






Batch-aware Distributed File System


Motivating question: Are existing distributed file systems adequate for batch computing workloads?

NO. Internal decisions inappropriate
– Caching, consistency, replication

A solution: Combine scheduling knowledge with external storage control
– Detailed information about workload is known
– Storage layer allows external control
– External scheduler makes informed storage decisions

Combining information and control results in
– Improved performance
– More robust failure handling
– Simplified implementation

Explicit Control in a Batch-Aware Distributed File System, John Bent, Douglas Thain, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Miron Livny (NSDI '04)


Outline

Batch computing
– Systems
– Workloads
– Environment
– Why not DFS?

Solution: BAD-FS
– Design
– Experimental evaluation


Batch computing

[Figure labels: home storage, Internet.]


Batch computing

Not interactive
Compute loop
– Users submit jobs
> Job description languages
– System itself executes
– Results are copied back to user system
Many existing batch systems
– Condor, LSF, PBS, Sun Grid Engine


Batch computing

[Figure: batch system. Labels: Internet, scheduler, job queue (1 2 3 4), four compute nodes each with a CPU manager, home storage (1 2 3 4).]


Batch workloads

General properties
– Large number of processes
– Process and data dependencies
– I/O intensive

Different types of I/O
– Endpoint
– Batch
– Pipeline

Usage: mainly scientific workloads, but also video production, data mining, electronic design, financial services, graphic rendering

Pipeline and Batch Sharing in Grid Workloads, Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC 12, 2003.
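The three I/O types can be told apart from who reads and writes each file. The sketch below encodes one reading of the terms, stated here as an assumption since the slide does not define them: pipeline data is written by one job and read by a later one, batch data is read-only input shared by several pipelines, and everything else (unique inputs and final outputs) is endpoint data. The `Job` type and file names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Job:
    pipeline: str                                  # which pipeline the job belongs to
    reads: Set[str] = field(default_factory=set)
    writes: Set[str] = field(default_factory=set)


def classify_files(jobs: Iterable[Job]) -> Dict[str, str]:
    """Label every file as 'pipeline', 'batch', or 'endpoint' I/O."""
    readers: Dict[str, Set[str]] = {}
    writers: Dict[str, Set[str]] = {}
    for job in jobs:
        for name in job.reads:
            readers.setdefault(name, set()).add(job.pipeline)
        for name in job.writes:
            writers.setdefault(name, set()).add(job.pipeline)
    kinds: Dict[str, str] = {}
    for name in set(readers) | set(writers):
        read_by = readers.get(name, set())
        written_by = writers.get(name, set())
        if read_by and written_by:
            kinds[name] = "pipeline"   # intermediate data passed between jobs
        elif len(read_by) > 1:
            kinds[name] = "batch"      # read-only input shared across pipelines
        else:
            kinds[name] = "endpoint"   # unique input or final output
    return kinds


if __name__ == "__main__":
    jobs = [
        Job("p1", reads={"calib.db", "in1"}, writes={"tmp1"}),
        Job("p1", reads={"calib.db", "tmp1"}, writes={"out1"}),
        Job("p2", reads={"calib.db", "in2"}, writes={"tmp2"}),
        Job("p2", reads={"calib.db", "tmp2"}, writes={"out2"}),
    ]
    for name, kind in sorted(classify_files(jobs).items()):
        print(f"{name:9} {kind}")
```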


Batch workloads

[Figure: batch workload structure. Labels: endpoint, batch dataset, pipeline.]


Cluster-to-cluster (c2c)

Not quite p2p
– More organized
– Less hostile
– More homogeneity
Each cluster is autonomous
– Run and managed by different entities
An obvious bottleneck is wide-area network

Q: How to manage flow of data into, within and out of these clusters?

[Figure labels: Internet, home store.]


Why not a traditional Distributed File System?

Distributed file system (DFS) would be ideal
– Easy to use
– Uniform name space

But . . .
– Designed for wide-area networks
– Not practical
– Embedded decisions are wrong

[Figure labels: Internet, home store.]


Distributed file systems make ‘bad’ decisions

Caching
– Must guess what and how to cache
Consistency
– Output: Must guess when to commit
– Input: Needs mechanism to invalidate cache
Replication
– Must guess what to replicate


BAD-FS makes ‘good’ (i.e. informed) decisions

Removes the guesswork
– Scheduler has detailed workload knowledge
– Storage layer designed to allow external control
– Scheduler makes informed storage decisions
> Manages data as well as computations
Retains simplicity of distributed file systems
Practical and deployable


Outline

Introduction
Batch computing
– Systems
– Workloads
– Environment
– Why not DFS?

One solution: BAD-FS
– Design
– Experimental evaluation


Solution BAD-FS: Practical and deployable

User-level; requires no privilege
Packaged as a modified batch system
A new batch system which includes BAD-FS
General: will work on all batch systems

[Figure labels: Internet, home store, eight nodes each labeled SGE and BAD-FS.]


Solution BAD-FS: Components

[Figure: the batch system diagram (scheduler, job queue, four compute nodes each with a CPU manager, home storage) annotated with the BAD-FS components:
1) Storage managers
2) Batch-Aware Distributed File System
3) Expanded job description language
4) BAD-FS scheduler]


Information used

Remote cluster knowledge
– Storage availability
– Failure rates
Workload knowledge
– Data type (batch, pipeline, or endpoint)
– Data quantity
– Job dependencies


Control through volumes

Guaranteed storage allocations
– Containers for job I/O

Scheduler
– Creates volumes to cache input data
> Subsequent jobs can reuse this data
– Creates volumes to buffer output data
> Destroys pipeline, copies endpoint
– Configures workload to access containers
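The volume lifecycle described above can be sketched as follows. This is a toy model, not the BAD-FS implementation: the `StorageManager` class, its methods, and the byte counts are hypothetical. It shows a cache volume that later pipelines reuse, a scratch volume that is destroyed after the run, and only endpoint data crossing the wide-area link.

```python
from typing import Dict


class StorageManager:
    """Toy storage manager on one cluster, offering guaranteed volume allocations."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.volumes: Dict[str, int] = {}  # volume name -> reserved bytes

    def free(self) -> int:
        return self.capacity - sum(self.volumes.values())

    def create_volume(self, name: str, size: int) -> None:
        if size > self.free():
            raise RuntimeError(f"cannot guarantee {size} bytes for volume {name!r}")
        self.volumes[name] = size

    def destroy_volume(self, name: str) -> None:
        del self.volumes[name]


def run_pipeline(
    storage: StorageManager, batch_bytes: int, pipeline_bytes: int, endpoint_bytes: int
) -> int:
    """Run one pipeline; return the bytes that crossed the wide-area network."""
    wide_area = 0
    if "cache" not in storage.volumes:
        storage.create_volume("cache", batch_bytes)
        wide_area += batch_bytes  # first pipeline fetches the shared input once
    storage.create_volume("scratch", pipeline_bytes + endpoint_bytes)
    # ... the jobs would run here, reading "cache" and writing into "scratch" ...
    wide_area += endpoint_bytes          # endpoint output is copied home
    storage.destroy_volume("scratch")    # pipeline data is discarded locally
    return wide_area


if __name__ == "__main__":
    storage = StorageManager(capacity=100)
    print(run_pipeline(storage, batch_bytes=50, pipeline_bytes=30, endpoint_bytes=5))
    print(run_pipeline(storage, batch_bytes=50, pipeline_bytes=30, endpoint_bytes=5))
```

In this model the second pipeline moves far less data over the wide area than the first, because the batch input is already cached and pipeline data never leaves the cluster.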


Knowledge plus control

Enhanced performance
– I/O scoping
– Capacity-aware scheduling
Improved failure handling
– Cost-benefit replication
Simplified implementation
– No cache consistency protocol
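Two of these ideas reduce to simple decisions once the scheduler knows storage capacity, data sizes, and failure rates. The sketch below is one possible reading, not the published algorithms: capacity-aware scheduling is modeled as admitting only pipelines whose storage needs fit, and cost-benefit replication as replicating data when the expected cost of regenerating it exceeds the cost of copying it. All names and numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def admit(
    pipelines: Sequence[Tuple[str, int]], free_bytes: int
) -> Tuple[List[str], List[str]]:
    """Capacity-aware scheduling: start only pipelines whose volumes fit in free storage."""
    running: List[str] = []
    waiting: List[str] = []
    for name, needed in pipelines:
        if needed <= free_bytes:
            free_bytes -= needed
            running.append(name)
        else:
            waiting.append(name)  # would overflow storage; run it later
    return running, waiting


def should_replicate(
    p_failure: float, regenerate_seconds: float, replicate_seconds: float
) -> bool:
    """Cost-benefit replication: copy data if losing it is expected to cost more."""
    return p_failure * regenerate_seconds > replicate_seconds


if __name__ == "__main__":
    print(admit([("a", 40), ("b", 70), ("c", 50)], free_bytes=100))
    print(should_replicate(p_failure=0.1, regenerate_seconds=3600, replicate_seconds=60))
```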


Real workload experience

Setup
– 16 jobs
– 16 compute nodes
– Emulated wide-area

Configuration
– Remote I/O
– AFS-like with /tmp
– BAD-FS

Result is order of magnitude improvement


BAD-FS Lessons

Generic solutions may be inefficient
– Often designed with specific tradeoffs in mind (e.g., most common workloads)
Fix:
– Redesign for new workload
– Use explicit information available at runtime to optimize the execution of lower layers


Course Organization/Syllabus/etc.


Administrivia: Course structure

Lectures
– About 1/3 of all classes
Student projects
– Aim high! Have fun! It’s a class project, not your PhD!
– Teams of up to 3 students
– Project presentations at the end of the term
Paper discussion
– The other classes


Administrivia: Weekly schedule (tentative)

1. Introduction. Overview of current research problems, technologies, and applications.
2. File system semantics, data durability and availability, replication and consistency, fault-tolerance.
3. Data storage technologies. Storage hierarchies. Capacity management.
4. Scientific applications: data access patterns, workload characterization.
5. Integration with compute systems. Grids and Virtual Data.
6. Performance focus: caching, parallel access, striping.
7. Structured overlays. Distributed hash tables. Data systems harnessing structured overlays.
8. Security.
9. Applications I: Experience with deployed systems. (NFS, AFS, Google File System)
10. Applications II: Data archival. Cooperative internet proxy caches. Content distribution networks.
11. Applications III: Peer-to-peer file-sharing (BitTorrent, FreeLoader)
12. Project presentations


Administrivia: Grading

Paper reviewing: 35%
Discussion leading: 15%
Project: 50%


Administrivia: Paper Reviewing (1)

Goals:
– Think about what you read
– Expand your knowledge beyond the papers that are assigned
– Get used to writing paper reviews

Reviews due by midnight the day before the class
Be professional in your writing
Have an eye on the writing style:
– Clarity
– Beware of traps: learn to use them in writing and detect them in reading
– Detect (and stay away from) trivial claims. E.g., 1st sentence in the Introduction: “The tremendous/unprecedented/phenomenal growth/scale/ubiquity of the Internet…”



Follow the form provided when relevant.
State the main contribution of the paper
Critique the main contribution:
– Rate the significance of the paper on a scale of 5 (breakthrough), 4 (significant contribution), 3 (modest contribution), 2 (incremental contribution), 1 (no contribution or negative contribution). Explain your rating in a sentence or two.
– Rate how convincing the methodology is. Do the claims and conclusions follow from the experiments? Are the assumptions realistic? Are the experiments well designed? Are there different experiments that would be more convincing? Are there other alternatives the authors should have considered? (And, of course, is the paper free of methodological errors?)
What is the most important limitation of the approach?
What are the three strongest and/or most interesting ideas in the paper?
What are the three most striking weaknesses in the paper?
Name three questions that you would like to ask the authors.
Detail an interesting extension to the work not mentioned in the future work section.
Optional comments on the paper that you’d like to see discussed in class.


Administrivia: Discussion leading

Come prepared!
– Prepare discussion outline
– Prepare questions:
> “What if”s
> Unclear aspects of the solution proposed
> …
– Similar ideas in different contexts
– Initiate short brainstorming sessions

Leaders do NOT need to submit paper reviews
Main goals:
– Keep discussion flowing
– Keep discussion relevant
– Engage everybody (I’ll have an eye on this, too)


Administrivia: Projects

Combine with your research if relevant to the class
Get approval from all instructors if you overlap final projects:
– Don’t sell the same piece of work twice
– You can get more than twice as many results with less than twice as much work
Aim high!
– Put in one extra month and get a publication out of it
– It is doable!
Try ideas that you postponed out of fear: it’s just a class, not your PhD.


Administrivia: Project deadlines (tentative)

3rd week (Tue): 1-page project proposal
5th week (Tue): 3-page literature survey
– Know relevant work in your problem area
– If implementation project, list tools, similar projects
– Expand proposal
7th week (Tue): 5-page Midterm project due
– Have a clear image of what’s possible/doable
– Report preliminary results
First week of exam session: In-class project presentation
– Demo, if appropriate
Last week of exam session:
– 10-page write-up


Next Class (Thu, 11/01)

Note room change: KAIS
Discussion of some project ideas
Presentation by Matei

To do:
Subscribe to mailing list
Volunteers for discussion leaders for class next week


Questions?
