Post on 12-Jan-2016

David P. Anderson
Space Sciences Laboratory
University of California – Berkeley

Public Distributed Computing with BOINC

Public-resource computing

● 1 billion Internet-connected PCs in 2010
● >50% of PCs are privately owned
● Assume 100M participants
  – At least 100 PetaFLOPs
  – At least 1 Exabyte (10^18) storage
● Problems
  – incentive, security, failures, ...
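
The totals above imply a fairly modest contribution from each machine. The short calculation below backs out those per-host figures. They are derived here from the slide's aggregate numbers and are not stated in the talk itself.

```python
# Back out the per-host figures implied by the slide's aggregate estimates.
PARTICIPANTS = 100 * 10**6          # "Assume 100M participants"
TOTAL_FLOPS = 100 * 10**15          # "At least 100 PetaFLOPs"
TOTAL_STORAGE_BYTES = 10**18        # "At least 1 Exabyte (10^18) storage"

flops_per_host = TOTAL_FLOPS // PARTICIPANTS
bytes_per_host = TOTAL_STORAGE_BYTES // PARTICIPANTS

print(f"{flops_per_host // 10**9} GFLOPS per host")
print(f"{bytes_per_host // 10**9} GB per host")
```

In other words, the estimate assumes roughly 1 GFLOPS and 10 GB of spare disk per participating PC.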

SETI@home

● Started May 1999
● ~600,000 active participants
● ~60 TeraFLOPs
● Problems with current software
  – hard to change/add algorithms
  – can't share participants w/ other projects
  – inflexible data architecture

SETI@home data architecture

[Diagram: SETI@home data flow, two panels labeled "current" and "ideal". Labels: tapes; Berkeley; Stanford; USC; Internet2 (free); commercial Internet; 50 Mbps; participants]

BOINC: Berkeley Open Infrastructure for Network Computing

● Multiple projects
  – easy to develop and operate
  – independent
● Support wide range of tasks
  – computation/storage
  – task “topologies”
● Participant features
  – can choose projects, resource allocation
  – configurable; invisible on participant hosts
  – many platforms supported

BOINC server architecture

[Diagram: server components. Labels: work generator; project DB; BOINC DB; timeout/retry; validator; assimilator; file deleter; data servers; scheduling server; Web interfaces (PHP)]

BOINC client architecture

[Diagram: client components. Labels: BOINC core client; screensaver; applications, each linked with the BOINC library; files, shared memory; messages; schedulers, data servers]

Data architecture

● Files
  – immutable, replicated
  – may originate on client or project
  – may remain resident on client
● Persistent, non-intrusive file transfers
● XML descriptor:

<file_info>
    <name>arecibo_3392474_jun_23_01</name>
    <url>http://ds.ssl.berkeley.edu/a3392474</url>
    <url>http://dt.ssl.berkeley.edu/a3392474</url>
    <md5_cksum>uwi7eyufiw8e972h8f9w7</md5_cksum>
    <nbytes>10000000</nbytes>
</file_info>
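
As an illustration of how a client might consume such a descriptor, the sketch below parses it with Python's standard XML library and checks a downloaded payload against the declared size and checksum. The function names are hypothetical, and the check assumes the checksum is a hex-encoded MD5 digest. The value on the slide is a placeholder rather than a valid digest, so no real payload would match it.

```python
import hashlib
import xml.etree.ElementTree as ET

DESCRIPTOR = """
<file_info>
    <name>arecibo_3392474_jun_23_01</name>
    <url>http://ds.ssl.berkeley.edu/a3392474</url>
    <url>http://dt.ssl.berkeley.edu/a3392474</url>
    <md5_cksum>uwi7eyufiw8e972h8f9w7</md5_cksum>
    <nbytes>10000000</nbytes>
</file_info>
"""

def parse_file_info(xml_text):
    """Return the descriptor fields as a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.findtext("name"),
        # Several <url> entries: replicas of the same immutable file.
        "urls": [u.text for u in root.findall("url")],
        "md5": root.findtext("md5_cksum"),
        "nbytes": int(root.findtext("nbytes")),
    }

def matches_descriptor(data, info):
    """Check downloaded bytes against the declared size and MD5 digest.

    Assumes the digest is hex-encoded; that is an assumption, not from the slide.
    """
    return (len(data) == info["nbytes"]
            and hashlib.md5(data).hexdigest() == info["md5"])

info = parse_file_info(DESCRIPTOR)
print(info["name"], len(info["urls"]), info["nbytes"])
```

Because files are immutable, a check like this only needs to run once per download. Any replica URL that yields matching bytes is as good as another.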

BOINC applications

● Any language (C, C++, Fortran)
● BOINC API
  – filename translation
  – checkpoint/restart, % done, CPU time
  – graphics (based on OpenGL, GLUT)
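
The checkpoint/restart and "% done" items can be pictured with a small sketch. This is not the BOINC API, which targets C, C++ and Fortran. It is a generic Python illustration of the pattern, with hypothetical function names, a JSON state file, and a `stop_after` parameter that simulates the host being interrupted.

```python
import json
import os
import tempfile

def write_checkpoint(state, path):
    """Write state to a temp file, then rename, so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def run(niter, checkpoint_path, checkpoint_every=100, stop_after=None):
    """Sum 0..niter-1, resuming from the last checkpoint if one exists.

    Returns (fraction_done, running_total).
    """
    state = {"i": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    steps = 0
    while state["i"] < niter:
        if stop_after is not None and steps >= stop_after:
            break  # simulated interruption (shutdown, user activity, ...)
        state["total"] += state["i"]
        state["i"] += 1
        steps += 1
        if state["i"] % checkpoint_every == 0:
            write_checkpoint(state, checkpoint_path)
    return state["i"] / niter, state["total"]

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "state.json")
    frac_first, _ = run(1000, ckpt, stop_after=250)   # interrupted at iteration 250
    frac_second, total = run(1000, ckpt)              # resumes from iteration 200
```

The second run repeats the 50 iterations done after the last checkpoint, which is the usual trade-off: less frequent checkpoints mean less disk traffic but more repeated work after an interruption.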

Work units

● Template for a computation
● Resource estimates
  – Integer, FP ops; memory; disk space
● Delay bound
  – determines retry, client abort

<file_info>
    <name>arecibo_3392474_jun_23_01</name>
    ...
</file_info>
<workunit>
    <name>ar_13323313</name>
    <file_ref>
        <name>arecibo_3392474_jun_23_01</name>
        <open_name>input_file</open_name>
    </file_ref>
    <command_line>-niter 1000</command_line>
</workunit>
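
The `<file_ref>` element is what makes filename translation possible: the application opens a fixed logical name (`input_file`) while the physical file name differs for every work unit. A minimal sketch of that mapping follows, using hypothetical helper names and Python's standard XML parser rather than any BOINC code.

```python
import xml.etree.ElementTree as ET

WORKUNIT = """
<workunit>
    <name>ar_13323313</name>
    <file_ref>
        <name>arecibo_3392474_jun_23_01</name>
        <open_name>input_file</open_name>
    </file_ref>
    <command_line>-niter 1000</command_line>
</workunit>
"""

def logical_to_physical(workunit_xml):
    """Map each logical open_name to the physical file name it refers to."""
    root = ET.fromstring(workunit_xml.strip())
    return {ref.findtext("open_name"): ref.findtext("name")
            for ref in root.findall("file_ref")}

def resolve(open_name, mapping):
    """Hypothetical stand-in for filename translation: unknown names pass through."""
    return mapping.get(open_name, open_name)

mapping = logical_to_physical(WORKUNIT)
args = ET.fromstring(WORKUNIT.strip()).findtext("command_line").split()
print(resolve("input_file", mapping), args)
```

With this indirection the same application binary and command line can be reused across work units; only the descriptor changes.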

Results

● An instance of a computation (completed or not)

● Includes: host ID, claimed/granted credit

<file_info>
    <name>arecibo_3392474_jun_23_01.out</name>
    ...
</file_info>
<result>
    <workunit_name>ar_13323313</workunit_name>
    <file_ref>
        <name>arecibo_3392474_jun_23_01.out</name>
        <open_name>output_file</open_name>
    </file_ref>
</result>

Scheduling

● Work buffering on client
  – upper, lower bounds
● Host attributes
  – FP/int/mem speeds, disk/memory sizes
  – network bandwidth up/down
  – fraction of time connected, computing
● Scheduler policy:
  – send as much work as requested, subject to feasibility, WU deadlines
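
One way to read the scheduler policy is as a greedy loop: keep handing out work until the requested duration is covered, skipping any work unit the host cannot hold or cannot finish within its delay bound. The sketch below is an illustration under that reading, not BOINC's scheduler; the class names, field names, and the simple completion-time estimate are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class HostInfo:
    flops: float        # measured FP speed, operations per second
    mem_bytes: float
    disk_bytes: float
    on_fraction: float  # fraction of time the host is actually computing

@dataclass
class WorkUnit:
    name: str
    fp_ops: float       # estimated FP operations
    mem_bytes: float
    disk_bytes: float
    delay_bound: float  # seconds allowed before the result is considered late

def pick_work(host, candidates, seconds_requested):
    """Greedily choose work units until the requested CPU time is covered."""
    sent, queued_seconds, disk_left = [], 0.0, host.disk_bytes
    for wu in candidates:
        if queued_seconds >= seconds_requested:
            break
        cpu_seconds = wu.fp_ops / host.flops
        # Wall-clock finish time if run after everything already queued,
        # stretched by the fraction of time the host really computes.
        finish = (queued_seconds + cpu_seconds) / host.on_fraction
        feasible = wu.mem_bytes <= host.mem_bytes and wu.disk_bytes <= disk_left
        if feasible and finish <= wu.delay_bound:
            sent.append(wu)
            queued_seconds += cpu_seconds
            disk_left -= wu.disk_bytes
    return sent

host = HostInfo(flops=1e9, mem_bytes=1e9, disk_bytes=1e9, on_fraction=0.5)
candidates = [
    WorkUnit("a", 1e12, 1e8, 1e8, 86400),
    WorkUnit("b", 1e12, 2e9, 1e8, 86400),   # needs more memory than the host has
    WorkUnit("c", 1e12, 1e8, 1e8, 3000),    # cannot finish within its delay bound
    WorkUnit("d", 2e12, 1e8, 1e8, 86400),
    WorkUnit("e", 1e12, 1e8, 1e8, 86400),   # not reached: request already covered
]
chosen = [wu.name for wu in pick_work(host, candidates, seconds_requested=2500)]
```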

Client/server protocol (XML-RPC)

● Request
  – Authentication
  – Host description
  – Persistent file descriptions
  – Result descriptions
  – Duration of work requested
● Reply
  – Application, workunit, result descriptors
  – Result acknowledgements
  – Preferences
  – Control messages (redirect, back off, etc.)

Work sequences

● Handle long (weeks or months) computations with large local state
● Sequence normally stays on one host; move to different host if failure
● Scheduling, redundancy checking are tricky

[Diagram labels: Upload state; Check for abort]

Redundant computing

● Create several results per workunit
● Find “canonical result” with project-specific consensus policy
● Generate additional copies as needed, up to error thresholds
● One result per WU per user
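
A consensus policy can be as simple as "the first output that a quorum of results agree on". The sketch below shows that idea with a pluggable comparison function, since the slide says the policy is project-specific. The function name, the quorum parameter, and the default exact-equality test are assumptions made for illustration.

```python
def find_canonical(results, quorum, equivalent=lambda a, b: a == b):
    """Return an output that at least `quorum` results agree on, else None.

    results: list of (host_id, output) pairs for one work unit.
    equivalent: project-specific comparison (exact equality by default).
    """
    groups = []  # each group holds outputs judged equivalent to its first member
    for _host_id, output in results:
        for group in groups:
            if equivalent(group[0], output):
                group.append(output)
                break
        else:
            groups.append([output])
    for group in groups:
        if len(group) >= quorum:
            return group[0]
    return None

exact = find_canonical([(1, "x"), (2, "y"), (3, "x")], quorum=2)
undecided = find_canonical([(1, "x"), (2, "y")], quorum=2)
# Floating-point outputs usually need a tolerance instead of exact equality.
fuzzy = find_canonical([(1, 1.0), (2, 1.0000001), (3, 2.0)], quorum=2,
                       equivalent=lambda a, b: abs(a - b) < 1e-3)
```

When no group reaches the quorum, the result is `None`, which corresponds to the "generate additional copies as needed" case above.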

Participant Credit

● Goals:
  – credit for work actually done (CPU, network, storage)
  – don't know workunit size in advance
  – cheat-proof
● Integration with redundancy
  – claimed credit = benchmark * CPU time
  – granted credit = minimum claimed credit
● Handling graphics coprocessors
  – project-specific benchmarks
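
The two credit formulas are easy to work through with numbers. In the sketch below the benchmark is treated as "credit units per CPU second", which is an assumption about units, and the three claims are invented for illustration.

```python
def claimed_credit(benchmark, cpu_seconds):
    """claimed credit = benchmark * CPU time (units are an assumption)."""
    return benchmark * cpu_seconds

def granted_credit(claims):
    """granted credit = minimum claimed credit among the agreeing results."""
    return min(claims)

claims = [
    claimed_credit(2.0, 100),    # fast host, honest report
    claimed_credit(1.0, 210),    # slower host, honest report
    claimed_credit(2.0, 5000),   # inflated CPU time
]
granted = granted_credit(claims)
print(claims, granted)
```

Taking the minimum means an inflated claim cannot raise the credit anyone receives for that work unit, which is how redundancy supports the cheat-proof goal.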

Work unit lifecycle

● Work generator: create WU, N results
● Timeout check
  – create new results if needed
  – detect too many errors, too many results without consensus
● Validator
  – find canonical result; grant credit
● Assimilator
  – merge canonical result into project DB
● File deleter
  – delete input and output files when no longer needed
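
The timeout-check step carries most of the lifecycle's decision logic, so a sketch may help. The version below is an illustration only: the state names, the threshold parameters, and the rule for how many new results to create are assumptions, not the behavior of BOINC's own daemon.

```python
def timeout_check(results, now, has_canonical, nredundancy, max_errors, max_successes):
    """Decide what to do with one work unit. Returns (action, n_new_results).

    results: list of dicts with 'state' in
             {'unsent', 'in_progress', 'success', 'error'} and a 'deadline'.
    Note: overdue results are marked as errors in place.
    """
    for r in results:
        if r["state"] == "in_progress" and now > r["deadline"]:
            r["state"] = "error"

    errors = sum(r["state"] == "error" for r in results)
    successes = sum(r["state"] == "success" for r in results)
    outstanding = sum(r["state"] in ("unsent", "in_progress") for r in results)

    if has_canonical:
        return ("done", 0)
    if errors > max_errors:
        return ("too_many_errors", 0)
    if successes > max_successes:
        return ("no_consensus", 0)     # too many results without consensus
    if successes < nredundancy:
        needed = nredundancy - successes - outstanding
    else:
        # Enough results came back but none agreed: ask for one more.
        needed = 1 if outstanding == 0 else 0
    return ("continue", max(0, needed))

example = [
    {"state": "success", "deadline": 50},
    {"state": "in_progress", "deadline": 80},    # overdue at now=100
    {"state": "in_progress", "deadline": 500},
]
decision = timeout_check(example, now=100, has_canonical=False,
                         nredundancy=3, max_errors=10, max_successes=10)
```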

Participating in a BOINC project

[Diagram: interaction between User, Project web site, and core client. Steps shown: create account; email account ID; download core client; enter account ID, project URL; get list of scheduling servers; scheduler RPC]

Windows GUI

● Multi-language
● Operations: suspend/resume, attach/detach projects, etc.

Participant preferences

Project-specific preferences

User-visible web features

● User profiles
  – user of the day
● Forums
● Self-moderating FAQs
● Teams
● XML data export (3rd party statistics reporting)

Project configuration file

<boinc>
  <config>
    <db_name>ap</db_name>
    <db_passwd></db_passwd>
    <shmem_key>0x35740417</shmem_key>
    <key_dir>/mydisks/a/users/boincadm/keys</key_dir>
    <upload_url>http://setiboinc.ssl.berkeley.edu/ap_cgi/file_upload_handler</upload_url>
    <upload_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/upload</upload_dir>
    <cgi_url>http://setiboinc.ssl.berkeley.edu/ap_cgi</cgi_url>
    <log_dir>/mydisks/a/users/boincadm/projects/AstroPulse_Beta/log</log_dir>
    <disable_account_creation/>
  </config>
  <daemons>
    <daemon><cmd>feeder -d 1</cmd></daemon>
    <daemon><cmd>validate_test -d 2 -app AstroPulse -quorum 3</cmd></daemon>
    <daemon><cmd>timeout_check -d 2 -app AstroPulse -nerror 10 -ndet 10 -nredundancy 3</cmd></daemon>
    <daemon><cmd>assimilator -d 2 -app AstroPulse</cmd></daemon>
    <daemon><cmd>file_deleter -d 2</cmd></daemon>
  </daemons>
  <tasks>
    <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
    <task><cmd>get_load</cmd><period>5 min</period></task>
    <task><cmd>db_count "user"</cmd><output>count_users.out</output><period>5 min</period></task>
    <task><cmd>db_count "result"</cmd><output>count_results_all.out</output><period>5 min</period></task>
  </tasks>
</boinc>
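
To show how a control program might read a file of this shape, the sketch below parses a shortened excerpt of the configuration and lists the daemons and periodic tasks. The helper names are hypothetical, and the handling of `<period>` values covers only the two units that appear on the slide; how BOINC itself parses them is not shown in the talk.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the slide's configuration file.
CONFIG = """
<boinc>
  <config>
    <db_name>ap</db_name>
    <disable_account_creation/>
  </config>
  <daemons>
    <daemon><cmd>feeder -d 1</cmd></daemon>
    <daemon><cmd>file_deleter -d 2</cmd></daemon>
  </daemons>
  <tasks>
    <task><cmd>get_load</cmd><period>5 min</period></task>
    <task><cmd>db_count "user"</cmd><output>count_users.out</output><period>5 min</period></task>
    <task><cmd>update_stats -update_users -update_hosts -update_teams</cmd><period>1 hour</period></task>
  </tasks>
</boinc>
"""

UNIT_SECONDS = {"min": 60, "hour": 3600}  # only the units seen on the slide

def period_seconds(text):
    """Convert a period such as '5 min' to seconds (assumed '<count> <unit>' form)."""
    count, unit = text.split()
    return int(count) * UNIT_SECONDS[unit]

def load_config(xml_text):
    """Return (daemon commands, task tuples, account-creation-disabled flag)."""
    root = ET.fromstring(xml_text.strip())
    daemons = [d.findtext("cmd") for d in root.iter("daemon")]
    tasks = [(t.findtext("cmd"), t.findtext("output"), period_seconds(t.findtext("period")))
             for t in root.iter("task")]
    disabled = root.find("config/disable_account_creation") is not None
    return daemons, tasks, disabled

daemons, tasks, account_creation_disabled = load_config(CONFIG)
```

Tasks without an `<output>` element come back with `None` in that position, so the caller can decide where their output should go.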

Project control

● Single control program
  – enable, disable
  – cron
  – status
● uses PID files to keep track of daemons
● uses timestamp file for periodic tasks
● uses lockfiles for mutual exclusion
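
The lockfile and timestamp-file techniques are standard Unix idioms and can be sketched in a few lines. The code below is a generic illustration with hypothetical function names and file names, not the control program's actual source. It relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, and on the timestamp file's modification time.

```python
import os
import tempfile
import time

def acquire_lock(path):
    """Create the lockfile atomically; return True only if this call created it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the holder, as a PID file would
    return True

def release_lock(path):
    os.remove(path)

def task_is_due(stamp_path, period_seconds, now=None):
    """A task is due if it never ran or its timestamp file is old enough."""
    now = time.time() if now is None else now
    try:
        last_run = os.path.getmtime(stamp_path)
    except FileNotFoundError:
        return True
    return now - last_run >= period_seconds

def mark_run(stamp_path):
    """Create the timestamp file if needed and set its mtime to now."""
    with open(stamp_path, "a"):
        pass
    os.utime(stamp_path, None)

with tempfile.TemporaryDirectory() as workdir:
    lock = os.path.join(workdir, "cron.lock")
    first = acquire_lock(lock)     # succeeds
    second = acquire_lock(lock)    # fails: the lock is already held
    release_lock(lock)

    stamp = os.path.join(workdir, "get_load.stamp")
    due_before = task_is_due(stamp, 300)
    mark_run(stamp)
    due_after = task_is_due(stamp, 300)
    due_later = task_is_due(stamp, 300, now=time.time() + 301)
```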

Python-based testing system

● Create objects representing projects, hosts, applications, work, etc.
● Activate objects to realize them (create databases and directories, run servers and clients)
● Simulate various types of failures
● Check correctness of final system state (database, result files, etc.)

# Host, UserUC, ProjectUC and run_check_all are provided by the test framework.
host = Host()
user = UserUC()
# Two projects sharing one user and one host, with resource shares 1 and 5.
for i in range(2):
    ProjectUC(users=[user], hosts=[host], redundancy=5,
              short_name="test_1sec_%d" % i,
              resource_share=[1, 5][i])
run_check_all()

Monitoring/debugging tools

● All backend processes create log files
  – web/grep tool for tracking particular WU/result
● Database browsing tools
  – summary of activity; entry point for browsing
● Strip charts
  – record, graph measures of system health
● Watchdogs
  – detect system failures; ring pager

Summary and status

● BOINC is funded by a 3-year NSF grant
● Computing projects at Space Sciences Lab
  – Astropulse (in beta test)
  – SETI@home (original, Australian)
● Other projects
  – Folding@home
  – Climateprediction.net
● Source code is free for noncommercial use
