judyqiu talk iit nov 4 2011


Upload: sherril-vincent

Post on 28-Feb-2018


7/25/2019 JudyQiu Talk IIT Nov 4 2011

http://slidepdf.com/reader/full/judyqiu-talk-iit-nov-4-2011 1/81

Iterative MapReduce Enabling HPC-Cloud Interoperability

SALSA HPC Group
http://salsahpc.indiana.edu

School of Informatics and Computing

Indiana University

A New Book from Morgan Kaufmann Publishers, an imprint of Elsevier, Inc., Burlington, MA 01803, USA. (ISBN: 9780123858801)

Distributed Systems and Cloud Computing: From Parallel Processing to the Internet of Things

Kai Hwang, Geoffrey Fox, Jack Dongarra

SALSA HPC Group

Twister
Bingjing Zhang, Richard Teng
Funded by Microsoft Foundation Grant, Indiana University's Faculty Research Support Program and NSF OCI-1032677 Grant

Twister4Azure
Thilina Gunarathne
Funded by Microsoft Azure Grant

High-Performance Visualization Algorithms for Data-Intensive Analysis
Seung-Hee Bae and Jong Youl Choi
Funded by NIH Grant 1RC2HG005806-01

DryadLINQ CTP Evaluation
Hui Li, Yang Ruan, and Yuduo Zhou
Funded by Microsoft Foundation Grant

Cloud Storage, FutureGrid
Xiaoming Gao, Stephen Wu
Funded by Indiana University's Faculty Research Support Program and National Science Foundation Grant 0910812

Million Sequence Challenge
Saliya Ekanayake, Adam Hughs, Yang Ruan
Funded by NIH Grant 1RC2HG005806-01

Cyberinfrastructure for Remote Sensing of Ice Sheets
Jerome Mitchell
Funded by NSF Grant OCI-0636361

MICROSOFT

Alex Szalay, The Johns Hopkins University


Paradigm Shift in Data Intensive Computing

Intel's Application Stack

(Iterative) MapReduce in Context

Applications: Support Scientific Simulations (Data Mining and Data Analysis)
  Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping
  Security, Provenance, Portal
  Services and Workflow

Programming Model: High Level Language

Runtime: Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)

Storage: Distributed File Systems, Object Store, Data Parallel File System

Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Virtualization, Azure Cloud, Grid Appliance

Hardware: CPU Nodes, Virtualization, GPU Nodes


What are the challenges?

Providing both cost effectiveness and powerful parallel programming paradigms that are capable of handling the incredible increases in dataset sizes (large-scale data analysis for Data Intensive applications).

Research issues
• portability between HPC and Cloud systems
• scaling performance
• fault tolerance

These challenges must be met for both computation and storage. If computation and storage are separated, it's not possible to bring computing to data.

Data locality
• its impact on performance;
• the factors that affect data locality;
• the maximum degree of data locality that can be achieved.

Factors beyond data locality to improve performance
Achieving the best data locality is not always the optimal scheduling decision. For instance, if the node where the input data of a task are stored is overloaded, running the task on it will result in performance degradation.

Task granularity and load balance
In MapReduce, task granularity is fixed. This mechanism has two drawbacks:
1) limited degree of concurrency
2) load unbalancing resulting from the variation of task execution time.
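The trade-off between data locality and node load described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name, the `pickNode` method, and the `maxLoad` limit are hypothetical and are not taken from Hadoop, Twister, or any scheduler mentioned in the talk.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: prefer a data-local node unless all of them are overloaded.
final class LocalityScheduler {
    /**
     * @param dataNodes nodes that hold the task's input data
     * @param load      current number of running tasks per node
     * @param maxLoad   assumed load limit above which a node counts as overloaded
     * @return the chosen node, or null if no node is known
     */
    static String pickNode(List<String> dataNodes, Map<String, Integer> load, int maxLoad) {
        String best = null;
        // First pass: least-loaded node among those that store the input data.
        for (String node : dataNodes) {
            int l = load.getOrDefault(node, 0);
            if (l < maxLoad && (best == null || l < load.getOrDefault(best, 0))) {
                best = node;
            }
        }
        if (best != null) {
            return best;
        }
        // Fallback: give up locality and take the least-loaded node overall.
        for (Map.Entry<String, Integer> e : load.entrySet()) {
            if (best == null || e.getValue() < load.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }
}
```

A real scheduler would also weigh the cost of moving the input data to the fallback node; the sketch ignores that cost on purpose to keep the locality-versus-load decision visible.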


MICROSOFT 


MICROSOFT 

Clouds hide Complexity

SaaS: Software as a Service
(e.g. Clustering is a service)

IaaS (HaaS): Infrastructure as a Service
(get computer time with a credit card and with a Web interface like EC2)

PaaS: Platform as a Service
IaaS plus core software capabilities on which you build SaaS
(e.g. Azure is a PaaS; MapReduce is a Platform)

Cyberinfrastructure Is "Research as a Service"

• Please sign and return your video waiver.
• Plan to arrive early to your session in order to copy your presentation to the conference PC.
• Poster drop-off is at Scholars Hall on Wednesday from 7:30 am - Noon. Please take your poster with you after the session on Wednesday.

[Bar chart: Number of Submissions (0 to 100) per Topic]
Innovations in IP (esp. Open Source) Systems
Consistency models
Integration of Mainframe and Large Systems
Power-aware Profiling, Modeling, and Optimizations
IT Service and Relationship Management
Scalable Fault Resilience Techniques for Large Computing
New and Innovative Pedagogical Approaches
Data grid & Semantic web
Peer to peer computing
Autonomic Computing
Scalable Scheduling on Heterogeneous Architectures
Hardware as a Service (HaaS)
Utility computing
Novel Programming Models for Large Computing
Fault tolerance and reliability
Optimal deployment configuration
Load balancing
Web services
Auditing, monitoring and scheduling
Software as a Service (SaaS)
Security and Risk
Virtualization technologies
High-performance computing
Cloud-based Services and Education
Middleware frameworks
Cloud/Grid architecture

Gartner 2009 Hype Curve
Source: Gartner (August 2009)

HPC?

Operation                                    Latency
L1 cache reference                           0.5 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns
Mutex lock/unlock                            25 ns
Main memory reference                        100 ns
Compress 1K w/cheap compression algorithm    3,000 ns
Send 2K bytes over 1 Gbps network            20,000 ns
Read 1 MB sequentially from memory           250,000 ns
Round trip within same datacenter            500,000 ns
Disk seek                                    10,000,000 ns
Read 1 MB sequentially from disk             20,000,000 ns
Send packet CA->Netherlands->CA              150,000,000 ns
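As a small worked example of how these latencies add up, the sketch below estimates the time to read data sequentially from memory and from disk. It assumes 250,000 ns per MB from memory, 20,000,000 ns per MB from disk, and one 10,000,000 ns seek; the class and method names are hypothetical.

```java
// Back-of-the-envelope arithmetic with assumed per-MB latencies.
final class LatencyMath {
    static final long MEM_READ_1MB_NS = 250_000L;      // read 1 MB sequentially from memory
    static final long DISK_SEEK_NS = 10_000_000L;      // one disk seek
    static final long DISK_READ_1MB_NS = 20_000_000L;  // read 1 MB sequentially from disk

    /** Seconds to stream the given number of megabytes from memory. */
    static double secondsFromMemory(long megabytes) {
        return megabytes * MEM_READ_1MB_NS / 1e9;
    }

    /** Seconds to seek once and then stream the given number of megabytes from disk. */
    static double secondsFromDisk(long megabytes) {
        return (DISK_SEEK_NS + megabytes * DISK_READ_1MB_NS) / 1e9;
    }
}
```

Under these assumptions, 1024 MB takes about 0.256 s from memory and about 20.49 s from disk, a gap of roughly 80x, which is one reason caching static data in memory matters for iterative computations.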

Programming on a Computer Cluster

Servers running Hadoop at Yahoo.com

http://thecloudtutorial.com/hadoop-tutorial.html


Parallel Thinking


SPMD Software

Single Program Multiple Data (SPMD):
a coarse-grained SIMD approach to programming for MIMD systems.

Data parallel software:
do the same thing to all elements of a structure (e.g., many matrix algorithms). Easiest to write and understand.

Unfortunately, difficult to apply to complex problems (as were the SIMD machines; MapReduce).

What applications are suitable for SPMD? (e.g. WordCount)
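WordCount fits SPMD because every worker runs the same counting program on its own slice of the data. The sketch below illustrates this with Java parallel streams on one machine; it is a stand-in for the idea, not Hadoop or Twister code, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine illustration of the SPMD pattern.
final class SpmdWordCount {
    /** The "single program": count the words of one data partition. */
    static Map<String, Long> countPartition(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    /** Run the same program on every partition in parallel, then merge the partial counts. */
    static Map<String, Long> count(List<String> partitions) {
        return partitions.parallelStream()
                .map(SpmdWordCount::countPartition)
                .reduce(new TreeMap<>(), (a, b) -> {
                    Map<String, Long> merged = new TreeMap<>(a); // copy, never mutate the identity
                    b.forEach((word, c) -> merged.merge(word, c, Long::sum));
                    return merged;
                });
    }
}
```

The merge step is associative, so the partial results can be combined in any grouping, which is what makes the pattern easy to distribute.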

MPMD Software

Multiple Program Multiple Data (MPMD):
a coarse-grained MIMD approach to programming

Data parallel software:
do the same thing to all elements of a structure (e.g., many matrix algorithms). Easiest to write and understand.

It applies to complex problems (e.g. MPI, distributed system).

What applications are suitable for MPMD? (e.g. wikipedia)

Programming Models and Tools

MapReduce in Heterogeneous Environment

MICROSOFT

Next Generation Sequencing Pipeline on Cloud

[Pipeline diagram, stages numbered 1 to 5: FASTA File (N Sequences) -> block Pairings -> Blast, Pairwise Distance Calculation (MapReduce) -> Dissimilarity Matrix, N(N-1)/2 values -> Pairwise clustering and MDS (MPI) -> Clustering; Visualization with PlotViz]

• Users submit their jobs to the pipeline and the results will be shown in a visualization tool.
• This chart illustrates a hybrid model with MapReduce and MPI. Twister will be a unified solution for the pipeline mode.
• The components are services and so is the whole pipeline.
• We could research on which stages of pipeline services are suitable for private or commercial Clouds.
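The N(N-1)/2 figure and the "block pairings" in the diagram can be made concrete with a short sketch. It assumes a symmetric distance, so only blocks on or above the diagonal need computing; the class and method names are hypothetical and do not come from the SALSA code base.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting a symmetric pairwise computation into blocks.
final class BlockPairings {
    /** Number of distinct unordered pairs among n sequences: n(n-1)/2. */
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    /**
     * Upper-triangular block pairs {rowBlock, colBlock} with rowBlock <= colBlock.
     * Each pair could be handed to one map task.
     */
    static List<int[]> blockPairs(int numBlocks) {
        List<int[]> pairs = new ArrayList<>();
        for (int r = 0; r < numBlocks; r++) {
            for (int c = r; c < numBlocks; c++) {
                pairs.add(new int[] {r, c});
            }
        }
        return pairs;
    }
}
```

For one million sequences `pairCount` gives 499,999,500,000 distances, which is why the calculation is split into many independent blocks.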

Motivation

Data Deluge: Experiencing in many domains
MapReduce: Data Centered, QoS
Classic Parallel Runtimes (MPI): Efficient and Proven techniques

Expand the Applicability of MapReduce to more classes of Applications

Map-Only: Input -> map -> Output
MapReduce: Input -> map -> reduce
Iterative MapReduce: Input -> map -> reduce, with iterations
More Extensions: Pij

Twister v0.9
New Infrastructure for Iterative MapReduce Programming

• Distinction on static and variable data
• Configurable long running (cacheable) map/reduce tasks
• Pub/sub messaging based communication/data transfers
• Broker Network for facilitating communication

Main program's process space:

    configureMaps(..)
    configureReduce(..)
    while(condition){
        runMapReduce(..)      // Map(), Reduce(), Combine() operation
        updateCondition()
    } //end while
    close()

• Worker Nodes: cacheable map/reduce tasks, Local Disk
• Communications/data transfers via the pub-sub broker network & direct TCP
• Iterations: may send <Key,Value> pairs directly
• Main program may contain many MapReduce invocations or iterative MapReduce invocations
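The driver loop above can be imitated on a single machine to show the control flow: static partitions are configured once, variable data is re-sent every iteration, and the loop stops when the combined result stops changing. This is a minimal sketch under those assumptions. It is not the Twister API; `IterativeDriver`, `run`, and all parameters are hypothetical.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

// Hypothetical single-machine imitation of an iterative MapReduce driver loop.
final class IterativeDriver {
    static double run(List<double[]> staticPartitions,          // configured once, reused every iteration
                      double initial,                           // variable data, re-broadcast each iteration
                      BiFunction<double[], Double, Double> map,  // map(partition, broadcast) -> partial result
                      BinaryOperator<Double> reduce,             // combines two partial results
                      double tolerance, int maxIterations) {
        double current = initial;
        for (int i = 0; i < maxIterations; i++) {
            final double broadcast = current;
            double next = staticPartitions.parallelStream()
                    .map(p -> map.apply(p, broadcast))  // the "runMapReduce" step
                    .reduce(reduce)
                    .orElse(broadcast);
            boolean converged = Math.abs(next - current) < tolerance; // the "updateCondition" step
            current = next;
            if (converged) {
                break;
            }
        }
        return current;
    }
}
```

Keeping the partitions in one long-lived list mirrors the idea of cacheable tasks: only the small variable value travels between iterations.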

[Architecture diagram]

Master Node: Main Program, Twister Driver
Pub/sub Broker Network (brokers B): one broker serves several Twister daemons
Worker Node: Twister Daemon, Worker Pool with cacheable map and reduce tasks, Local Disk

Scripts perform: data distribution, data collection, and partition file creation


MRRoles4Azure

Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.

Iterative MapReduce for Azure

• Programming model extensions to support broadcast data
• Merge Step
• In-Memory Caching of static data
• Cache aware hybrid scheduling using Queues, bulletin board (special table) and execution histories

MRRoles4Azure

• Distributed, highly scalable and highly available cloud services as the building blocks.
• Utilize eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
• Decentralized architecture with global queue based dynamic task scheduling
• Minimal management and maintenance overhead
• Supports dynamically scaling up and down of the compute resources.
• MapReduce fault tolerance
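Global-queue dynamic scheduling means that no master assigns tasks: each worker pulls the next task whenever it is free, so faster workers naturally take more work. The sketch below imitates this with threads and an in-memory queue standing in for a cloud queue service; it does not use any Azure SDK, and the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical in-memory imitation of queue-based dynamic task scheduling.
final class QueueScheduling {
    /** Workers drain one shared queue of task sizes; returns the total amount of work processed. */
    static long runAll(List<Integer> taskSizes, int workers) {
        Queue<Integer> queue = new ConcurrentLinkedQueue<>(taskSizes); // stand-in for a cloud queue
        AtomicLong processed = new AtomicLong();
        Thread[] pool = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            pool[w] = new Thread(() -> {
                Integer task;
                while ((task = queue.poll()) != null) { // pull the next task; no central assignment
                    processed.addAndGet(task);           // "execute" the task
                }
            });
            pool[w].start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return processed.get();
    }
}
```

A real cloud queue would additionally hide a message while it is being processed and make it visible again if the worker fails, which is one way fault tolerance can be layered on this pattern; the sketch leaves that out.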

Performance Comparisons

[Figures: Cap3 Sequence Assembly; Smith Waterman Sequence Alignment; BLAST Sequence Search]

Performance - Kmeans Clustering

[Figures: Number of Executing Map Task Histogram; Strong Scaling with 128M Data Points; Weak Scaling; Task Execution Time Histogram]

Performance - Multi Dimensional Scaling

[Figures: Azure Instance Type Study; Number of Executing Map Tasks; Weak Scaling; Data Size Scaling]

PlotViz, Visualization System

Parallel Visualization Algorithms
• Parallel visualization algorithms (GTM, MDS, …)
• Improved quality by using DA optimization
• Interpolation

PlotViz
• Provide Virtual 3D space
• Cross-platform
• Visualization Toolkit (VTK)
• Qt framework

GTM vs. MDS

                      GTM                        MDS (SMACOF)
Purpose               • Non-linear dimension reduction
                      • Find an optimal configuration in a lower-dimension
                      • Iterative optimization method
Input                 Vector-based data          Non-vector (Pairwise similarity matrix)
Objective Function    Maximize Log-Likelihood    Minimize STRESS or SSTRESS
Complexity            O(KN) (K << N)             O(N^2)
Optimization Method   EM                         Iterative Majorization (EM-like)

Parallel GTM

• Finding K clusters for N data points
  - Relationship is a bipartite graph (bi-graph)
  - Represented by K-by-N matrix (K << N)
• Decomposition for P-by-Q compute grid
  - Reduce memory requirement by 1/PQ

[Diagram: K latent points (row blocks 1, 2) against N data points (column blocks A, B, C)]

GTM SOFTWARE STACK
GTM / GTM-Interpolation
Parallel HDF5, ScaLAPACK
MPI / MPI-IO, Parallel File System
Cray / Linux / Windows Cluster

Scalable MDS

Parallel MDS
• O(N^2) memory and computation required.
  - 100k data -> 480GB memory
• Balanced decomposition of NxN matrices by P-by-Q grid.
  - Reduce memory and computing requirement by 1/PQ
• Communicate via MPI primitives

MDS Interpolation
• Finding approximate mapping position w.r.t. k-NN's prior mapping.
• Per point it requires:
  - O(M) memory
  - O(k) computation
• Pleasingly parallel
• Mapping 2M in 1450 sec.
  - vs. 100k in 27000 sec.
  - 7500 times faster than estimation of the full MDS.

[Diagram: NxN matrix decomposed into column blocks c1, c2, c3 and row blocks r1, r2]
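The O(N^2) memory cost and the 1/PQ reduction can be checked with simple arithmetic. The sketch assumes 8-byte doubles and decimal gigabytes; how many N x N matrices an MDS implementation keeps at once is not modeled here, and the class and method names are hypothetical.

```java
// Hypothetical memory estimate for one dense N x N matrix of doubles.
final class MdsMemory {
    /** Size in GB (10^9 bytes) of one N x N matrix of 8-byte doubles. */
    static double matrixGigabytes(long n) {
        return n * n * 8 / 1e9;
    }

    /** Share of that matrix held by each process on a P-by-Q compute grid. */
    static double perProcessGigabytes(long n, int p, int q) {
        return matrixGigabytes(n) / (p * q);
    }
}
```

With these assumptions a single matrix for 100,000 points is 80 GB, and on a 4-by-8 grid each process holds 2.5 GB of it. An algorithm that keeps several such matrices multiplies these figures accordingly.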

Interpolation extension to GTM/MDS

Full data processing by GTM or MDS is computing- and memory-intensive.

Two step procedure
• Training: training by M samples out of N data
• Interpolation: remaining (N-M) out-of-samples are approximated without training

[Diagram: Total N data split into n In-sample and N-n Out-of-sample. In-sample -> Training (MPI, Twister) -> Trained data. Out-of-sample, partitioned 1, 2, ..., P-1, P -> Interpolation (Twister) -> Interpolated map]

GTM/MDS Applications

PubChem data with CTD visualization by using MDS (left) and GTM (right)
About 930,000 chemical compounds are visualized as a point in 3D space, annotated by the related genes in Comparative Toxicogenomics Database (CTD)

Chemical compounds shown in literatures, visualized by MDS (left) and GTM (right)
Visualized 234,000 chemical compounds which may be related with a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2) based on the dataset collected from

Twister-MDS Demo

This demo is for real time visualization of the process of multidimensional scaling (MDS) calculation.

We use Twister to do parallel calculation inside the cluster, and use PlotViz to show the intermediate results at the user client computer.

The process of computation and monitoring is automated by the program.

Twister-MDS Output

MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children's Research Institute

Twister-MDS Work Flow

Master Node: Twister Driver, Twister-MDS
ActiveMQ Broker
Client Node: MDS Monitor, PlotViz, Local Disk

I. Send message to start the job
II. Send intermediate results
III. Write data
IV. Read data

Twister-MDS Structure

MDS Output Monitoring Interface
Pub/Sub Broker Network
Master Node: Twister Driver, Twister-MDS
Worker Node (two shown): Worker Pool, Twister Daemon, map and reduce tasks
Computation steps: calculateBC, calculateStress

Bioinformatics Pipeline

Gene Sequences (N = 1 Million) -> Select Reference
  Reference Sequence Set (M = 100K) -> Pairwise Alignment & Distance Calculation -> Distance Matrix -> Multi-Dimensional Scaling (MDS) -> Reference Coordinates (x, y, z)
  N - M Sequence Set (900K) -> Interpolative MDS with Pairwise Distance Calculation -> N - M Coordinates (x, y, z)
Reference Coordinates and N - M Coordinates -> Visualization -> 3D Plot

O(N^2)

New Network of Brokers

Legend: Twister Driver Node; Twister Daemon Node; ActiveMQ Broker Node; Broker-Daemon Connection; Broker-Broker Connection; Broker-Driver Connection

A. Full Mesh Network
B. Hierarchical Sending
C.

[?] Brokers and 32 Computing Nodes in total
5 Brokers and [?] Computing Nodes in total

Performance Improvement

[Chart: Twister-MDS Execution Time, 100 iterations, 40 nodes, under different input data sizes. Series: Original Execution Time (1 broker only); Current Execution Time ([?] brokers, the best broker number). X axis: Number of Data Points (38400, 51200, 76800, 102400). Y axis: Total Execution Time (Seconds), 0 to 1600.]

Broadcasting on 40 Nodes
(In Method C, centroids are split to 160 blocks, sent through 40 brokers in 4 rounds)

[Chart: Method C vs. Method B. X axis: data size (400M, 600M, 800M). Y axis: Broadcasting Time (Unit: Second), 0 to 100.]

Twister New Architecture

[Architecture diagram]
Master Node: Twister Driver
Brokers (three shown)
Worker Node (two shown): Twister Daemon with cacheable map, reduce, and merge tasks
Configure Mapper: Add to MemCache
Map: map broadcasting chain
Reduce: reduce collection chain

Chain/Ring Broadcasting

[Diagram: Twister Driver Node followed by a chain of Twister Daemon Nodes]

• Driver sender:
  • send broadcasting data
  • get acknowledgement
  • send next broadcasting data
  • …
• Daemon sender:
  • receive data from the last daemon (or driver)
  • cache data to daemon
  • send data to next daemon (waits for ACK)
  • send acknowledgement to the last daemon
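The driver and daemon roles listed above can be walked through with a small in-process simulation. It is a simplified sketch: method calls stand in for network messages, each block is fully acknowledged before the next is sent, and the class names are hypothetical rather than taken from Twister.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-process simulation of chain broadcasting.
final class ChainBroadcast {
    /** One daemon in the chain: caches a block, forwards it, then acknowledges. */
    static final class Daemon {
        final List<String> cache = new ArrayList<>();
        final Daemon next; // null for the last daemon in the chain

        Daemon(Daemon next) {
            this.next = next;
        }

        /** Returns true (the ACK) once this daemon and every daemon after it hold the block. */
        boolean receive(String block) {
            cache.add(block);                                            // cache data to daemon
            boolean downstreamAck = next == null || next.receive(block); // send to next, wait for ACK
            return downstreamAck;                                        // ACK to the previous sender
        }
    }

    /** Driver: send a block, wait for the ACK, then send the next block. Returns blocks acknowledged. */
    static int broadcast(Daemon first, List<String> blocks) {
        int acked = 0;
        for (String block : blocks) {
            if (!first.receive(block)) {
                break;
            }
            acked++;
        }
        return acked;
    }
}
```

In the chain, the driver sends each block only once and every daemon forwards to a single neighbour, so the outgoing traffic per node stays constant as the chain grows.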

Chain Broadcasting Protocol

[Sequence diagram: Driver -> Daemon 0 -> Daemon 1 -> Daemon 2. The Driver repeats "send, get ack". Daemon 0 and Daemon 1 repeat "receive, handle data, send, get ack, ack". Daemon 2 repeats "receive, handle data, ack" with the note "I know this is the end of Daemon Chain". After the final block the note reads "I know this is the end of Cache Block".]

Broadcasting Time Comparison

[Chart: Broadcasting Time Comparison on 80 nodes, 600 MB data, 160 pieces. Series: Chain Broadcasting; All-to-All Broadcasting, 40 brokers. X axis: Execution No. Y axis: Broadcasting Time (Unit: Seconds), 0 to 30.]

Applications & Different Interconnection Patterns

Map Only (Input -> map -> Output)
  CAP3 Analysis; Document conversion (PDF -> HTML); Brute force searches in cryptography; Parametric sweeps
  - CAP3 Gene Assembly
  - PolarGrid Matlab data analysis

Classic MapReduce (Input -> map -> reduce)
  High Energy Physics (HEP) Histograms; SWG gene alignment; Distributed search; Distributed sorting; Information retrieval
  - Information Retrieval
  - HEP Data Analysis
  - Calculation of Pairwise Distances for ALU Sequences

Iterative Reductions, Twister (Input -> map -> reduce, with iterations)
  Expectation maximization algorithms; Clustering; Linear Algebra
  - Kmeans
  - Deterministic Annealing Clustering
  - Multidimensional Scaling MDS

Loosely Synchronous (Pij)
  Many MPI scientific applications utilizing wide variety of communication constructs including local interactions
  - Solving Differential Equations and
  - particle dynamics with short range forces

Domain of MapReduce and Iterative Extensions: the first three columns. MPI: the last column.

Twister Futures

• Development of library of Collectives to use at Reduce phase
  - Broadcast and Gather needed by current applications
  - Discover other important ones
  - Implement efficiently on each platform, especially Azure
• Better software message routing with broker networks using asynchronous I/O with communication fault tolerance
• Support nearby location of data and computing using data parallel file systems
• Clearer application fault tolerance model based on implicit synchronization points at iteration end points
• Later: Investigate GPU support
• Later: run time for data parallel languages like Sawzall, Pig Latin, LINQ

Convergence is Happening

Data Intensive Paradigms: Data intensive application (three basic activities): capture, curation, and analysis (visualization)
Clouds: Cloud infrastructure and runtime
Multicore: Parallel threading and processes

FutureGrid: a Grid Testbed

• IU Cray operational, IU IBM (iDataPlex) completed stability test May 6
• UCSD IBM operational, UF IBM stability test completes ~ May 12
• Network, NID and PU HTC system operational
• UC IBM stability test completes ~ May 27; TACC Dell awaiting delivery of components

NID: Network Impairment Device
Private / Public FG Network

SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds on FutureGrid

• Switchable clusters on the same hardware (~5 minutes between different OS such as Linux+Xen to Windows+HPCS)
• Support for virtual clusters
• SW-G: Smith Waterman Gotoh Dissimilarity Computation as a pleasingly parallel problem suitable for MapReduce style applications

Dynamic Cluster Architecture
  SW-G Using Hadoop on Linux Bare-system
  SW-G Using Hadoop on Linux on Xen
  SW-G Using DryadLINQ on Windows Server 2008 Bare-system
  XCAT Infrastructure
  iDataplex Bare-metal Nodes (32 nodes)

Monitoring Infrastructure
  Monitoring Interface
  Pub/Sub Broker Network
  Summarizer, Switcher

Monitoring & Control Infrastructure
  Virtual/Physical Clusters
  XCAT Infrastructure
  iDataplex Bare-metal Nodes

SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds using a FutureGrid cluster

• Top: 3 clusters are switching applications on fixed environment. Takes approximately 30 seconds.
• Bottom: Cluster is switching between environments: Linux; Linux+Xen; Windows+HPCS.

Experimenting Lucene Index on HBase in an HPC Environment

• Background: data intensive computing requires storage solutions for huge amounts of data
• One proposed solution: HBase, Hadoop implementation of Google's BigTable

System design and implementation - solution

• Inverted index:
  "cloud" -> doc1, doc2, …
  "computing" -> doc1, doc3, …
• Apache Lucene:
  - A library written in Java for building inverted indices and supporting full-text search
  - Incremental indexing, document scoring, and multi-index search with merged results, etc.
  - Existing solutions using Lucene store index data with files - no natural integration with HBase
• Solution: maintain inverted indices directly in HBase as tables
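The inverted index idea can be shown with an in-memory map from each term to the set of documents that contain it. This is a plain-Java illustration of the data structure only; it uses neither Lucene nor HBase, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory inverted index: term -> sorted set of document ids.
final class InvertedIndex {
    private final Map<String, SortedSet<String>> postings = new TreeMap<>();

    /** Index one document: every distinct term in the text points back to docId. */
    void add(String docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Documents containing the term, or an empty set if the term is unknown. */
    SortedSet<String> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }
}
```

Storing the same mapping in a table keyed by term, with document ids as the cell contents, is the step from this structure to the table-based design the slide proposes.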

System design

• Data from a real digital library application: bibliography data, page image data, texts data
• System design: [diagram]

System design

• Table schemas:
  - title index table: <term value> --> {frequencies:[<doc id>, <doc id>, ...]}
  - texts index table: <term value> --> {frequencies:[<doc id>, <doc id>, ...]}
  - texts term position vector table: <term value> --> {positions:[<doc id>, <doc id>, ...]}
• Natural integration with HBase
• Reliable and scalable index data storage
• Real-time document addition and deletion
• MapReduce programs for building index and analyzing index data

System implementation

• Experiments completed in the Alamo HPC cluster of FutureGrid
• MyHadoop -> MyHBase
• Workflow: [diagram]

Index data analysis

• Test run with 5 books
• Total number of distinct terms: 8263
• Following figures show different features about the text index table

Index data analysis

[Figures: features of the text index table]

Comparison with related work

• Pig and Hive:
  - Distributed platforms for analyzing and warehousing large data sets
  - Pig Latin and HiveQL have operators for search
  - Suitable for batch analysis to large data sets
• SolrCloud, ElasticSearch, Katta:
  - Distributed search systems based on Lucene indices
  - Indices organized as files; not a natural integration with HBase
• Solandra:
  - Inverted index implemented as tables in Cassandra
  - Different index table designs; no MapReduce support

Future work

• Distributed performance evaluation
• More data analysis or text mining based on the index data
• Distributed search engine integrated with HBase region servers

Education and Broader Impact

We devote a lot to guide students who are interested in computing

Education

We offer classes with emerging new topics, together with tutorials on the most popular cloud computing tools

Broader Impact

Hosting workshops and spreading our technology across the nation

Giving students unforgettable research experience

Acknowledgement

SALSA HPC Group
Indiana University
http://salsahpc.indiana.edu

High Energy Physics Data Analysis

An application analyzing data from Large Hadron Collider (1TB but 100 Petabytes eventually)

Input to a map task: <key, value>
  key = Some Id
  value = HEP file Name

Output of a map task: <key, value>
  key = random # (0 <= num <= max reduce tasks)
  value = Histogram as binary data

Input to a reduce task: <key, List<value>>
  key = random # (0 <= num <= max reduce tasks)
  value = List of histogram as binary data

Output from a reduce task: value
  value = Histogram file

Combine outputs from reduce tasks to form the final histogram

Reduce Phase of Particle Physics

"Find the Higgs" using Dryad

• Combine Histograms produced by separate Root "Maps" (of event data to partial histograms) into a single Histogram delivered to Client
• This is an example using MapReduce to do distributed histogramming.

[Figure: Higgs in Monte Carlo]

The overall MapReduce HEP Analysis Process

[Figure: Events (Pi) are split across map tasks; each map emits <Bin_i, 1> for every event; the pairs are sorted by bin; each reduce sums the counts for its bins.]

http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview.png

From WordCount to HEP Analysis

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken()); // Parsing
            output.collect(word, one);
        }
    }

    /* Single property/histogram */
    double event = 0;
    int BIN_SIZE = 100;
    double bins[] = new double[BIN_SIZE];
    …….
    if (event is in bins[i]) { //Pseudo
        event.set(i);
    }
    output.collect(event, one);

    /* Multiple properties/histograms */
    int PROPERTIES = 10;
    int BIN_SIZE = 100; //assume properties are normalized
    double eventVector[] = new double[VEC_LENGTH];
    double bins[] = new double[BIN_SIZE];
    …….
    for (int i = 0; i < VEC_LENGTH; i++) {
        for (int j = 0; j < PROPERTIES; j++) {
            if (eventVector[i] is in bins[j]) { //Pseudo
                ++bins[j];
            }
        }
    }
    output.collect(Property, bins[]) //Pseudo
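The single-property pseudocode above can be turned into a small runnable sketch: each map call bins the events of one partition, and reduce adds partial histograms together. It assumes event values normalized to [0, 1) and 100 bins, and it runs in plain Java without Hadoop; the class and method names are hypothetical.

```java
// Hypothetical plain-Java version of distributed histogramming.
final class HistogramMapReduce {
    static final int BIN_COUNT = 100;

    /** Map: build the partial histogram of one partition of events. */
    static long[] map(double[] events) {
        long[] bins = new long[BIN_COUNT];
        for (double e : events) {
            if (e < 0.0 || e >= 1.0) {
                continue; // events outside the assumed normalized range are dropped
            }
            int i = Math.min((int) (e * BIN_COUNT), BIN_COUNT - 1);
            bins[i]++;
        }
        return bins;
    }

    /** Reduce: add two partial histograms bin by bin. */
    static long[] reduce(long[] a, long[] b) {
        long[] sum = new long[BIN_COUNT];
        for (int i = 0; i < BIN_COUNT; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }
}
```

Because bin-wise addition is associative and commutative, partial histograms can be merged in any order, which is the property the WordCount analogy relies on.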

K-Means Clustering

In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm (EM) for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data as well as in the iterative refinement approach employed by both algorithms. (wikipedia)

E-step: the "assignment" step as expectation step
M-step: the "update step" as maximization step

How it works?

[Figure: k-means iterations, from wikipedia]

K-means Clustering Algorithm for MapReduce

    Do
        Broadcast Cn

        [Perform in parallel] - the map() operation (Map, E-Step)
        for each Vi
            for each Cn,j
                Dij <= Euclidian (Vi, Cn,j)
            Assign point Vi to Cn,j with minimum Dij
        for each Cn,j
            Cn,j <= Cn,j/K

        [Perform Sequentially] - the reduce() operation (Reduce, M-Step, Global reduction)
        Collect all Cn
        Calculate new cluster centers Cn+1
        Diff <= Euclidian (Cn, Cn+1)
    while (Diff > THRESHOLD)

Vi refers to the ith vector
Cn,j refers to the jth cluster center in nth iteration
Dij refers to the Euclidian distance between ith vector and jth cluster center
K is the number of cluster centers
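The pseudocode above maps onto three small functions: a map step that assigns points to the nearest center and accumulates partial sums, a reduce step that turns the combined sums into new centers, and a driver loop that repeats until the centers stop moving. The following is a minimal single-machine sketch for 2-D points. It is not Twister code; all names are hypothetical, and keeping the old center for an empty cluster is a choice made here, not something the slide specifies.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical single-machine sketch of K-means in map/reduce form (2-D points).
final class KMeansMapReduce {
    /** Map: per-cluster partial sums for one partition. Row j holds {sumX, sumY, count}. */
    static double[][] map(double[][] points, double[][] centers) {
        double[][] partial = new double[centers.length][3];
        for (double[] v : points) {
            int nearest = 0;
            double best = Double.MAX_VALUE;
            for (int j = 0; j < centers.length; j++) {
                double dx = v[0] - centers[j][0];
                double dy = v[1] - centers[j][1];
                double d = dx * dx + dy * dy; // squared distance is enough to find the minimum
                if (d < best) {
                    best = d;
                    nearest = j;
                }
            }
            partial[nearest][0] += v[0];
            partial[nearest][1] += v[1];
            partial[nearest][2] += 1;
        }
        return partial;
    }

    /** Reduce: combine all partial sums and compute the new centers. */
    static double[][] reduce(List<double[][]> partials, double[][] oldCenters) {
        double[][] next = new double[oldCenters.length][2];
        for (int j = 0; j < oldCenters.length; j++) {
            double sx = 0, sy = 0, n = 0;
            for (double[][] p : partials) {
                sx += p[j][0];
                sy += p[j][1];
                n += p[j][2];
            }
            next[j][0] = n > 0 ? sx / n : oldCenters[j][0]; // empty cluster keeps its old center
            next[j][1] = n > 0 ? sy / n : oldCenters[j][1];
        }
        return next;
    }

    /** Driver: broadcast centers, map, reduce, repeat until total movement drops below threshold. */
    static double[][] cluster(List<double[][]> partitions, double[][] centers,
                              double threshold, int maxIterations) {
        for (int it = 0; it < maxIterations; it++) {
            final double[][] current = centers;
            List<double[][]> partials = partitions.parallelStream()
                    .map(p -> map(p, current))
                    .collect(Collectors.toList());
            double[][] next = reduce(partials, current);
            double diff = 0;
            for (int j = 0; j < next.length; j++) {
                diff += Math.hypot(next[j][0] - current[j][0], next[j][1] - current[j][1]);
            }
            centers = next;
            if (diff < threshold) {
                break;
            }
        }
        return centers;
    }
}
```

Only the small array of centers and the per-cluster sums move between steps; the point partitions stay where they are, which is the property an iterative MapReduce runtime exploits by caching static data.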

Parallelization of K-means Clustering

[Diagram: the current centers C1, C2, C3, …, Ck are broadcast to every partition; each partition returns, for each cluster, its partial sums and count: (x1, y1, count1), (x2, y2, count2), (x3, y3, count3), …, (xk, yk, countk).]

Twister K-means Execution

[Diagram: map tasks are configured with <c, File1>, <c, File2>, …, <c, Filek>; the centers <K, C1 C2 C3 … Ck> are broadcast to every map task; each map task returns its partial C1, C2, C3, …, Ck, which are combined into the new <K, C1 C2 C3 … Ck>.]