7/25/2019 JudyQiu Talk IIT Nov 4 2011
http://slidepdf.com/reader/full/judyqiu-talk-iit-nov-4-2011 1/81
Iterative MapReduce Enabling HPC-Cloud Interoperability
SALSA HPC Group
http://salsahpc.indiana.edu
School of Informatics and Computing
Indiana University
A New Book from Morgan Kaufmann Publishers, an imprint of Elsevier, Inc., Burlington, MA 01803, USA. (ISBN: 9780123858801)
Distributed Systems and Cloud Computing:
From Parallel Processing to the Internet of Things
Kai Hwang, Geoffrey Fox, Jack Dongarra
Twister
Bingjing Zhang, Richard Teng
Funded by Microsoft Foundation Grant, Indiana University's Faculty Research Support Program and NSF OCI-1032677 Grant
Twister4Azure
Thilina Gunarathne
Funded by Microsoft Azure Grant
High-Performance Visualization Algorithms for Data-Intensive Analysis
Seung-Hee Bae and Jong Youl Choi
Funded by NIH Grant 1RC2HG005806-01
SALSA HPC Group
DryadLINQ CTP Evaluation
Hui Li, Yang Ruan, and Yuduo Zhou
Funded by Microsoft Foundation Grant
Cloud Storage, FutureGrid
Xiaoming Gao, Stephen Wu
Funded by Indiana University's Faculty Research Support Program and Natural Science Foundation Grant 0910812
Million Sequence Challenge
Saliya Ekanayake, Adam Hughs, Yang Ruan
Funded by NIH Grant 1RC2HG005806-01
Cyberinfrastructure for Remote Sensing of Ice Sheets
Jerome Mitchell
Funded by NSF Grant OCI-0636361
MICROSOFT
Alex Szalay, The Johns Hopkins University
Paradigm Shift in Data Intensive Computing
Intel's Application Stack
(Iterative) MapReduce in Context
[Figure: layered architecture, top to bottom]
Applications: Support Scientific Simulations (Data Mining and Data Analysis)
    Kernels, Genomics, Proteomics, Information Retrieval, Polar Science, Scientific Simulation Data Analysis and Management, Dissimilarity Computation, Clustering, Multidimensional Scaling, Generative Topological Mapping
    Security, Provenance, Portal
    Services and Workflow
Programming Model: High Level Language
    Cross Platform Iterative MapReduce (Collectives, Fault Tolerance, Scheduling)
Runtime
Storage: Distributed File Systems, Data Parallel File System, Object Store
Infrastructure: Linux HPC Bare-system, Amazon Cloud, Windows Server HPC Bare-system, Virtualization, Azure Cloud, Grid Appliance
Hardware: CPU Nodes, GPU Nodes, Virtualization
What are the challenges?
Providing both cost effectiveness and powerful parallel programming paradigms that are capable of handling the incredible increases in dataset sizes (large-scale data analysis for Data Intensive applications).
Research issues
    portability between HPC and Cloud systems
    scaling performance
    fault tolerance
These challenges must be met for both computation and storage. If computation and storage are separated, it's not possible to bring computing to data.
Data locality
    its impact on performance;
    the factors that affect data locality;
    the maximum degree of data locality that can be achieved.
Factors beyond data locality to improve performance
    To achieve the best data locality is not always the optimal scheduling decision. For instance, if the node where input data of a task are stored is overloaded, to run the task on it will result in performance degradation.
Task granularity and load balance
    In MapReduce, task granularity is fixed. This mechanism has two drawbacks:
    1) limited degree of concurrency
    2) load unbalancing resulting from the variation of task execution time.
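The trade-off between data locality and node load described above can be illustrated with a small sketch. This is not the scheduler of any particular runtime; the `Node` class, the `slots` capacity and the `overload_ratio` threshold are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running: int  # tasks currently running on this node
    slots: int    # hypothetical capacity of this node


def pick_node(data_node, nodes, overload_ratio=1.0):
    """Prefer the node that stores the input data, unless it is overloaded."""
    local = next(n for n in nodes if n.name == data_node)
    if local.running < local.slots * overload_ratio:
        return local.name  # best data locality
    # The local node is overloaded: give up locality for the least loaded node.
    return min(nodes, key=lambda n: n.running / n.slots).name
```

With a fully loaded local node the task moves elsewhere; with spare capacity it stays with its data.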
MICROSOFT
MICROSOFT
Clouds hide Complexity
SaaS: Software as a Service
    (e.g. Clustering is a service)
IaaS (HaaS): Infrastructure as a Service
    (get computer time with a credit card and with a Web interface like EC2)
PaaS: Platform as a Service
    IaaS plus core software capabilities on which you build SaaS
    (e.g. Azure is a PaaS; MapReduce is a Platform)
Cyberinfrastructure Is "Research as a Service"
• Please sign and return your video waiver.
• Plan to arrive early to your session in order to copy your presentation to the conference PC.
• Poster drop-off is at Scholars Hall on Wednesday from :30 am - Noon. Please take your poster with you after the session on Wednesday.
[Figure: bar chart. X axis: Number of Submissions (0 to 100). Y axis: Topic.]
Topics:
Innovations in IP (esp. Open Source) Systems
Consistency models
Integration of Mainframe and Large Systems
Power-aware Profiling, Modeling, and Optimizations
IT Service and Relationship Management
Scalable Fault Resilience Techniques for Large Computing
New and Innovative Pedagogical Approaches
Data grid & Semantic web
Peer to peer computing
Autonomic Computing
Scalable Scheduling on Heterogeneous Architectures
Hardware as a Service (HaaS)
Utility computing
Novel Programming Models for Large Computing
Fault tolerance and reliability
Optimal deployment configuration
Load balancing
Web services
Auditing, monitoring and scheduling
Software as a Service (SaaS)
Security and Risk
Virtualization technologies
High-performance computing
Cloud-based Services and Education
Middleware frameworks
Cloud/Grid architecture
Gartner 2009 Hype Curve
Source: Gartner (August 2009)
HPC ?
L1 cache reference                                     0.5 ns
Branch mispredict                                        5 ns
L2 cache reference                                       7 ns
Mutex lock/unlock                                       25 ns
Main memory reference                                  100 ns
Compress 1K w/cheap compression algorithm            3,000 ns
Send 2K bytes over 1 Gbps network                   20,000 ns
Read 1 MB sequentially from memory                 250,000 ns
Round trip within same datacenter                  500,000 ns
Disk seek                                       10,000,000 ns
Read 1 MB sequentially from disk                20,000,000 ns
Send packet CA->Netherlands->CA                150,000,000 ns
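A short worked calculation shows why these numbers push data-intensive work toward many disks in parallel. The dictionary keys and the helper names `slowdown` and `scan_seconds` are made up for this sketch; the latencies are the ones listed above, and the model assumes perfectly even splitting with no coordination cost.

```python
# Latencies in nanoseconds, taken from the table above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "read_1mb_memory": 250_000,
    "datacenter_round_trip": 500_000,
    "disk_seek": 10_000_000,
    "read_1mb_disk": 20_000_000,
}


def slowdown(slow, fast):
    """How many times slower one operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


def scan_seconds(megabytes, disks=1):
    """Idealized time to read data sequentially, split evenly over disks."""
    return (megabytes / disks) * LATENCY_NS["read_1mb_disk"] / 1e9


print(slowdown("read_1mb_disk", "read_1mb_memory"))  # disk vs. memory
print(scan_seconds(1_000_000))                       # about 1 TB on one disk
print(scan_seconds(1_000_000, disks=1000))           # same data on 1000 disks
```

Under these assumptions a sequential scan of about 1 TB takes hours on one disk and seconds when spread over a thousand, which is the motivation for programming on a cluster.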
http://thecloudtutorial.com/hadoop-tutorial.html
Servers running Hadoop at Yahoo.com
Programming on a Computer Cluster
Parallel Thinking
SPMD Software
Single Program Multiple Data (SPMD): a coarse-grained SIMD approach to programming for MIMD systems.
Data parallel software: do the same thing to all elements of a structure (e.g., many matrix algorithms). Easiest to write and understand.
Unfortunately, difficult to apply to complex problems (as were the SIMD machines; MapReduce).
What applications are suitable for SPMD? (e.g. Wordcount)
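Wordcount fits SPMD because every worker runs the same program on its own slice of the data. The sketch below illustrates that idea with a thread pool standing in for cluster nodes; the function names and the partition layout (a list of lists of lines) are assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """The single program: count the words in one partition of lines."""
    return Counter(word for line in partition for word in line.split())


def spmd_wordcount(partitions, workers=2):
    """Run the same program on every partition, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

Calling `spmd_wordcount([["a b a"], ["b c"]])` merges the two partial counts into one total per word.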
MPMD Software
Multiple Program Multiple Data (MPMD): a coarse-grained MIMD approach to programming
Data parallel software: do the same thing to all elements of a structure (e.g., many matrix algorithms). Easiest to write and understand.
It applies to complex problems (e.g. MPI, distributed system).
What applications are suitable for MPMD? (e.g. wikipedia)
Programming Models and Tools
MapReduce in Heterogeneous Environment
MICROSOFT
Next Generation Sequencing Pipeline on Cloud
[Figure: pipeline with numbered steps. FASTA File (N Sequences) -> Blast -> block pairings -> Pairwise Distance Calculation -> Dissimilarity Matrix (N(N-1)/2 values) -> Pairwise clustering and MDS -> Clustering and Visualization (PlotViz). Stages are labeled MapReduce or MPI.]
[...] submit their jobs to the pipeline and the results will be shown in a visualization tool.
[...] illustrates a hybrid model with MapReduce and MPI. Twister will be a unified solution for the pipeline.
[...] components are services and so is the whole pipeline.
[...] research on which stages of pipeline services are suitable for private or commercial Clouds.
Motivation
Data Deluge: Experiencing in many domains
MapReduce: Data Centered, QoS
Classic Parallel Runtimes (MPI): Efficient and Proven techniques
Expand the Applicability of MapReduce to more classes of Applications
[Figure: four patterns. Map-Only (Input -> map -> Output); MapReduce (Input -> map -> reduce); Iterative MapReduce (Input -> map -> reduce, with iterations); More Extensions (Pij)]
Twister v0.9
New Infrastructure for Iterative MapReduce Programming
Distinction on static and variable data
Configurable long running (cacheable) map/reduce tasks
Pub/sub messaging based communication/data transfers
Broker Network for facilitating communication
configureMaps(..)
configureReduce(..)
while(condition){
    runMapReduce(..)
    updateCondition()
} //end while
close()
[Figure: Map(), Reduce() and Combine() operation run across Worker Nodes with Local Disk, repeated over Iterations]
Cacheable map/reduce tasks
May send <Key,Value> pairs directly
Communications/data transfers via the pub-sub broker network & direct TCP
• Main program may contain many MapReduce invocations or iterative MapReduce invocations
Main program's process space
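The driver loop above can be mimicked in a few lines to show the separation between static data (configured once and cached in the tasks) and variable data (sent again on every iteration). This is only a single-process sketch and not the Twister API; `run_iterative_mapreduce` and every parameter name are hypothetical.

```python
def run_iterative_mapreduce(static_partitions, variable, map_fn, reduce_fn,
                            combine_fn, converged, max_iters=100):
    """Repeat map -> reduce -> combine until the variable data stops changing."""
    for iteration in range(1, max_iters + 1):
        # Static partitions stay in place; only `variable` changes per iteration.
        map_out = [map_fn(part, variable) for part in static_partitions]
        reduced = reduce_fn(map_out)
        new_variable = combine_fn(reduced, variable)
        if converged(variable, new_variable):
            return new_variable, iteration
        variable = new_variable
    return variable, max_iters


# Example: move x toward the mean of all data by half the gap on each iteration.
partitions = [[1.0, 2.0], [3.0, 6.0]]
x, iters = run_iterative_mapreduce(
    partitions,
    0.0,
    map_fn=lambda part, x: (sum(v - x for v in part), len(part)),
    reduce_fn=lambda outs: (sum(s for s, _ in outs), sum(n for _, n in outs)),
    combine_fn=lambda red, x: x + 0.5 * red[0] / red[1],
    converged=lambda old, new: abs(new - old) < 1e-9,
)
```

The design point the sketch tries to capture is that the loop lives in the main program, so long-running tasks can keep their static partitions between iterations instead of reloading them.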
[Figure: Twister architecture]
Master Node: Twister Driver, Main Program
Pub/sub Broker Network (brokers marked B)
Worker Node: Twister Daemon, Worker Pool, Local Disk
Cacheable map and reduce tasks
One broker serves several Twister daemons
Scripts perform: Data distribution, data collection, and partition file creation
MRRoles4Azure
Azure Queues for scheduling, Tables to store meta-data and monitoring data, Blobs for input/output/intermediate data storage.
Iterative MapReduce for Azure
Programming model extensions to support broadcast data
Merge Step
In-Memory Caching of static data
Cache aware hybrid scheduling using Queues, bulletin board (special table) and execution histories
MRRoles4Azure
• Distributed, highly scalable and highly available cloud services as the building blocks.
• Utilize eventually-consistent, high-latency cloud services effectively to deliver performance comparable to traditional MapReduce runtimes.
• Decentralized architecture with global queue based dynamic task scheduling
• Minimal management and maintenance overhead
• Supports dynamically scaling up and down of the compute resources.
• MapReduce fault tolerance
Performance Comparisons
Cap3 Sequence Assembly
Smith Waterman Sequence Alignment
BLAST Sequence Search
Performance - Kmeans Clustering
Number of Executing Map Task Histogram
Strong Scaling with 128M Data Points
Weak Scaling
Task Execution Time Histogram
Performance - Multi Dimensional Scaling
Azure Instance Type Study
Number of Executing Map Tasks
Weak Scaling
Data Size Scaling
PlotViz, Visualization System
Parallel visualization algorithms (GTM, MDS, ...)
Improved quality by using DA optimization
Interpolation
Parallel Visualization Algorithms
PlotViz
Provide Virtual space
Cross-platform
Visualization Toolkit (VTK)
Qt framework
GTM vs. MDS
                       GTM                        MDS (SMACOF)
Purpose                • Non-linear dimension reduction
                       • Find an optimal configuration in a lower-dimension
                       • Iterative optimization method
Input                  Vector-based data          Non-vector (Pairwise similarity matrix)
Objective Function     Maximize Log-Likelihood    Minimize STRESS or SSTRESS
Complexity             O(KN) (K << N)             O(N^2)
Optimization Method    EM                         Iterative Majorization (EM-like)
Parallel GTM
Finding K clusters for N data points
Relationship is a bipartite graph (bi-graph)
Represented by K-by-N matrix (K << N)
Decomposition for P-by-Q compute grid
Reduce memory requirement by 1/PQ
[Figure: K latent points against N data points, decomposed onto a compute grid with rows 1, 2 and columns A, B, C]
GTM SOFTWARE STACK
GTM / GTM-Interpolation
Parallel HDF5, ScaLAPACK
MPI / MPI-IO
Parallel File System
Cray / Linux / Windows Cluster
Scalable MDS
Parallel MDS
• O(N^2) memory and computation required.
    - 100k data -> 480GB memory
• Balanced decomposition of NxN matrices by P-by-Q grid.
    - Reduce memory and computing requirement by 1/PQ
• Communicate via MPI primitives
MDS Interpolation
• Finding approximate mapping position w.r.t. k-NN's prior mapping.
• Per point it requires:
    - O(M) memory
    - O(k) computation
• Pleasingly parallel
• Mapping 2M in 1450 sec.
    - vs. 100k in 27000 sec.
    - 7500 times faster than estimation of the full MDS.
[Figure: matrix decomposition with columns c1, c2, c3 and rows r1, r2]
Interpolation extension to GTM/MDS
Full data processing by GTM or MDS is computing- and memory-intensive
Two step procedure
    Training: training by M samples out of N data
    Interpolation: remaining (N-M) out-of-samples are approximated without training
[Figure: Total N data split into n In-sample and N-n Out-of-sample. Training (MPI, Twister) produces Trained data; Interpolation (Twister) over partitions 1, 2, ..., P-1, P produces the Interpolated map.]
GTM/MDS Applications
PubChem data with CTD visualization by using MDS (left) and GTM (right)
About 930,000 chemical compounds are visualized as a point in 3D space, annotated by the related genes in Comparative Toxicogenomics Database (CTD)
Chemical compounds shown in literatures, visualized by MDS (left) and GTM (right)
Visualized 234,000 chemical compounds which may be related with a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2) based on the dataset collected from
Twister-MDS Demo
This demo is for real time visualization of the process of multidimensional scaling (MDS) calculation.
We use Twister to do parallel calculation inside the cluster, and use PlotViz to show the intermediate results at the user client computer.
The process of computation and monitoring is automated by the program.
Twister-MDS Output
MDS projection of 100,000 protein sequences showing a few experimentally identified clusters in preliminary work with Seattle Children's Research Institute
Twister-MDS Work Flow
[Figure: Master Node (Twister Driver, Twister-MDS), ActiveMQ Broker, Client Node (MDS Monitor, PlotViz), Local Disk]
I. Send message to start the job
II. Send intermediate results
III. Write data
IV. Read data
Twister-MDS Structure
[Figure: MDS Output Monitoring Interface and Master Node (Twister Driver, Twister-MDS) connected through the Pub/Sub Broker Network to Worker Nodes (Worker Pool, Twister Daemon). Two map/reduce pairs are shown: calculateStress and calculateBC.]
Bioinformatics Pipeline
[Figure: Gene Sequences (N = 1 Million) -> Select Reference -> Reference Sequence Set (M = 100K) -> Pairwise Alignment & Distance Calculation -> Distance Matrix -> Multi-Dimensional Scaling (MDS) -> Reference Coordinates (x, y, z). The N - M Sequence Set (900K) -> Interpolative MDS with Pairwise Distance Calculation -> N - M Coordinates (x, y, z). Both coordinate sets -> Visualization -> 3D Plot. One stage is annotated O(N^2).]
New Network of Brokers
Legend: Twister Driver Node; Twister Daemon Node; ActiveMQ Broker Node; Broker-Daemon Connection; Broker-Broker Connection; Broker-Driver Connection
A. Full Mesh Network
B. Hierarchical Sending
C.
Brokers and 32 Computing Nodes in total
5 Brokers and Computing Nodes in total
Performance Improvement
[Figure: Twister-MDS Execution Time, 100 iterations, 40 nodes, under different input data sizes. X axis: Number of Data Points. Y axis: Total Execution Time (Seconds). Series: Original Execution Time (1 broker only); Current Execution Time (the best broker number).]
Broadcasting on 40 Nodes
(In Method C, centroids are split to 160 blocks, sent through 40 brokers in 4 rounds)
[Figure: bar chart. X axis: data size (400M, 600M, 800M). Y axis: Broadcasting Time (Unit: Second). Series: Method C, Method B.]
Twister New Architecture
[Figure: Master Node runs the Twister Driver; Worker Nodes run Twister Daemons with cacheable map, merge and reduce tasks; Brokers connect them. Steps shown: Configure Mapper, Add to MemCache, map broadcasting chain, reduce collection chain.]
Chain/Ring Broadcasting
Twister Driver Node
Twister Daemon Node
• Driver sender:
    • send broadcasting data
    • get acknowledgement
    • send next broadcasting data
    • …
• Daemon sender:
    • receive data from the last daemon (or driver)
    • cache data to daemon
    • send data to next daemon (waits for ACK)
    • send acknowledgement to the last daemon
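The driver and daemon steps listed above can be traced with a small single-process simulation. It models each hop as a function call, so "waits for ACK" becomes "waits for the call to return"; the function name and the log format are invented for this sketch, and real daemons would of course exchange network messages.

```python
def chain_broadcast(blocks, num_daemons):
    """Simulate chain broadcasting of data blocks through a line of daemons."""
    caches = [[] for _ in range(num_daemons)]
    log = []

    def daemon_receive(idx, block):
        caches[idx].append(block)            # cache data to daemon
        if idx + 1 < num_daemons:
            daemon_receive(idx + 1, block)   # send to next daemon, wait for ACK
        log.append(("ack", idx, block))      # acknowledge to the previous sender

    for block in blocks:
        # Driver: send one block, then wait for the ACK before the next block.
        daemon_receive(0, block)
    return caches, log


caches, log = chain_broadcast(["b0", "b1"], num_daemons=3)
```

In this simplified model the acknowledgements for a block travel back from the end of the chain to the driver, and every daemon ends up caching every block.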
Chain Broadcasting Protocol
[Figure: sequence diagram between Driver, Daemon 0, Daemon 1 and Daemon 2. Each daemon does receive, handle data, send, then returns ack; each sender does send, get ack. Notes on the last daemon: "I know this is the end of Daemon Chain"; "I know this is the end of Cache Block".]
Broadcasting Time Comparison
[Figure: Broadcasting Time Comparison on 80 nodes, 600 MB data, 160 pieces. X axis: Execution No. Y axis: Broadcasting Time (Unit: Seconds), 0 to 30. Series: Chain Broadcasting; All-to-All Broadcasting, 40 brokers.]
Applications & Different Interconnection Patterns
Map Only
    CAP3 Analysis; Document conversion (PDF -> HTML); Brute force searches in cryptography; Parametric sweeps
    - CAP3 Gene Assembly
    - PolarGrid Matlab data analysis
Classic MapReduce
    High Energy Physics (HEP) Histograms; SWG gene alignment; Distributed search; Distributed sorting; Information retrieval
    - Information Retrieval
    - HEP Data Analysis
    - Calculation of Pairwise Distances for ALU Sequences
Iterative Reductions (Twister)
    Expectation maximization algorithms; Clustering; Linear Algebra
    - Kmeans
    - Deterministic Annealing Clustering
    - Multidimensional Scaling MDS
Loosely Synchronous
    Many MPI scientific applications utilizing wide variety of communication constructs including local interactions
    - Solving Differential Equations
    - particle dynamics with short range forces
[Figure: Input -> map -> Output; Input -> map -> reduce; Input -> map -> reduce with iterations; Pij]
Domain of MapReduce and Iterative Extensions | MPI
Twister Futures
Development of library of Collectives to use at Reduce phase
    Broadcast and Gather needed by current applications
    Discover other important ones
    Implement efficiently on each platform, especially Azure
Better software message routing with broker networks using asynchronous I/O with communication fault tolerance
Support nearby location of data and computing using data parallel file systems
Clearer application fault tolerance model based on implicit synchronization points at iteration end points
Later: Investigate GPU support
Later: run time for data parallel languages like Sawzall, Pig Latin, LINQ
Convergence is Happening
Multicore
Clouds
Data Intensive Paradigms
Data intensive application (three basic activities): capture, curation, and analysis (visualization)
Cloud infrastructure and runtime
Parallel threading and processes
FutureGrid: a Grid Testbed
• IU Cray operational, IU IBM (iDataPlex) completed stability test May 6
• UCSD IBM operational, UF IBM stability test completes ~ May 12
• Network, NID and PU HTC system operational
• UC IBM stability test completes ~ May 2; TACC Dell awaiting delivery of components
NID: Network Impairment Device
Private / Public FG Network
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds on FutureGrid
• Switchable clusters on the same hardware (~5 minutes between different OS such as Linux+Xen to Windows+HPCS)
• Support for virtual clusters
• SW-G: Smith Waterman Gotoh Dissimilarity Computation as a pleasingly parallel problem suitable for MapReduce style applications
Dynamic Cluster Architecture
[Figure: Monitoring & Control Infrastructure (Monitoring Interface, Pub/Sub Broker Network, Summarizer, Switcher) above Virtual/Physical Clusters: SW-G Using Hadoop on Linux Bare-system, SW-G Using Hadoop on Linux on Xen, SW-G Using DryadLINQ on Windows Server 2008 Bare-system; all on XCAT Infrastructure over iDataplex Bare-metal Nodes (32 nodes).]
SALSAHPC Dynamic Virtual Cluster on FutureGrid -- Demo at SC09
Demonstrate the concept of Science on Clouds using a FutureGrid cluster
• Top: 3 clusters are switching applications on fixed environment. Takes approximately 30 seconds.
• Bottom: Cluster is switching between environments: Linux; Linux +Xen; Windows +HPCS.
Experimenting Lucene Index on HBase in an HPC Environment
• Background: data intensive computing requires storage solutions for huge amounts of data
• One proposed solution: HBase, Hadoop implementation of Google's BigTable
System design and implementation - solution
• Inverted index:
    "cloud" -> doc1, doc2, …
    "computing" -> doc1, doc3, …
• Apache Lucene:
    - A library written in Java for building inverted indices and supporting full-text search
    - Incremental indexing, document scoring, and multi-index search with merged results, etc.
    - Existing solutions using Lucene store index data with files - no natural integration with HBase
• Solution: maintain inverted indices directly in HBase as tables
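A toy inverted index makes the "term -> documents" mapping above concrete. The sketch keeps the index in an in-memory dictionary shaped like a table with one row per term and one column per document id; it does not use Lucene or HBase, and the function names and the row layout are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: frequency}, one 'row' per term."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, term):
    """Return the ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


docs = {
    "doc1": "cloud computing",
    "doc2": "cloud storage",
    "doc3": "parallel computing",
}
index = build_inverted_index(docs)
```

Because each term owns a single row, adding or deleting a document only touches the rows of the terms it contains, which is one reason a row-oriented table store is a natural home for such an index.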
System design
• Data from a real digital library application: bibliography data, page image data, texts data
• System design:
[Figure: system design diagram]
System design
• Table schemas:
    - title index table: <term value> --> {frequencies:[<doc id>, <doc id>, ...]}
    - texts index table: <term value> --> {frequencies:[<doc id>, <doc id>, ...]}
    - texts term position vector table: <term value> --> {positions:[<doc id>, <doc id>, ...]}
• Natural integration with HBase
• Reliable and scalable index data storage
• Real-time document addition and deletion
• MapReduce programs for building index and analyzing index data
System implementation
• Experiments completed in the Alamo HPC cluster of FutureGrid
• MyHadoop -> MyHBase
• Workflow:
[Figure: workflow diagram]
Index data analysis
• Test run with 5 books
• Total number of distinct terms: 8263
• Following figures show different features about the text index table
Index data analysis
Comparison with related work
• Pig and Hive:
    - Distributed platforms for analyzing and warehousing large data sets
    - Pig Latin and HiveQL have operators for search
    - Suitable for batch analysis to large data sets
• SolrCloud, ElasticSearch, Katta:
    - Distributed search systems based on Lucene indices
    - Indices organized as files; not a natural integration with HBase
• Solandra:
    - Inverted index implemented as tables in Cassandra
    - Different index table designs; no MapReduce support
Future work
• Distributed performance evaluation
• More data analysis or text mining based on the index data
• Distributed search engine integrated with HBase region servers
Education and Broader Impact
We devote a lot to guide students who are interested in computing
Education
We offer classes with emerging new topics
Together with tutorials on the most popular cloud computing tools
Broader Impact
Hosting workshops and spreading our technology across the nation
Giving students unforgettable research experience
Acknowledgement
SALSA HPC Group
Indiana University
http://salsahpc.indiana.edu
High Energy Physics Data Analysis
An application analyzing data from Large Hadron Collider (1TB but 100 Petabytes eventually)
Input to a map task: <key, value>
    key = Some Id
    value = HEP file Name
Output of a map task: <key, value>
    key = random # (0 <= num <= max reduce tasks)
    value = Histogram as binary data
Input to a reduce task: <key, List<value>>
    key = random # (0 <= num <= max reduce tasks)
    value = List of histogram as binary data
Output from a reduce task: value
    value = Histogram file
Combine outputs from reduce tasks to form the final histogram
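The key/value flow described above can be sketched without any MapReduce runtime. Here each "file" is just a list of event values assumed to be normalized to [0, 1), histograms are plain lists of counts rather than binary data, and the names `map_task`, `reduce_task` and `histogram_job` are hypothetical.

```python
import random
from collections import defaultdict


def map_task(events, num_bins, num_reducers, rng):
    """Build a partial histogram for one file and tag it with a random reduce key."""
    hist = [0] * num_bins
    for e in events:
        hist[min(int(e * num_bins), num_bins - 1)] += 1
    return rng.randrange(num_reducers), hist


def reduce_task(histograms):
    """Add a list of histograms bin by bin."""
    return [sum(column) for column in zip(*histograms)]


def histogram_job(files, num_bins=10, num_reducers=2, seed=0):
    rng = random.Random(seed)
    groups = defaultdict(list)
    for events in files:
        key, hist = map_task(events, num_bins, num_reducers, rng)
        groups[key].append(hist)                 # shuffle: group by reduce key
    partials = [reduce_task(hists) for hists in groups.values()]
    return reduce_task(partials)                 # final combine step
```

The random key only spreads partial histograms over reducers for load balance; since addition is associative, the final histogram does not depend on which reducer handled which file.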
Reduce Phase of Particle Physics
"Find the Higgs" using Dryad
• Combine Histograms produced by separate Root "Maps" (of event data to partial histograms) into a single Histogram delivered to Client
• This is an example using MapReduce to do distributed histogramming.
Higgs in Monte Carlo
The overall MapReduce HEP Analysis Process
http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview.png
[Figure: word-count style flow applied to histogramming. Events (Pi) are split across maps; each map emits <Bini, 1>; the pairs are sorted by bin; the output is a count per bin.]
From WordCount to HEP Analysis
public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken()); // Parsing
        output.collect(word, one);
    }
}

/* Single property/histogram */
double event = 0;
int BIN_SIZE = 100;
double bins[] = new double[BIN_SIZE];
...
if (event is in bin[i]) { // Pseudo
    event.set(i);
}
output.collect(event, one);

/* Multiple properties/histograms */
int PROPERTIES = 10;
int BIN_SIZE = 100; // assume properties are normalized
double eventVector[] = new double[VEC_LENGTH];
double bins[] = new double[BIN_SIZE];
...
for (int i = 0; i < VEC_LENGTH; i++) {
    for (int j = 0; j < PROPERTIES; j++) {
        if (eventVector[i] is in bins[j]) { // Pseudo
            ++bins[j];
        }
    }
}
output.collect(Property, bins[]) // Pseudo
K-Means Clustering
In statistics and machine learning, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. It is similar to the expectation-maximization algorithm (EM) for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data as well as in the iterative refinement approach employed by both algorithms.
wikipedia
E-step: the "assignment" step as expectation step
M-step: the "update step" as maximization step
How it works?
[Figure: k-means iterations, from wikipedia]
K-means Clustering Algorithm for MapReduce
Do
    Broadcast Cn
    [Perform in parallel] - the map() operation (E-Step)
    for each Vi
        for each Cn,j
            Dij = Euclidian (Vi, Cn,j)
        Assign point Vi to Cn,j with minimum Dij
    for each Cn,j
        Cn,j = Cn,j / K
    [Perform Sequentially] - the reduce() operation (M-Step, global reduction)
    Collect all Cn
    Calculate new cluster centers Cn+1
    Diff = Euclidian (Cn, Cn+1)
while (Diff > THRESHOLD)

Vi refers to the ith vector
Cn,j refers to the jth cluster center in nth iteration
Dij refers to the Euclidian distance between ith vector and jth cluster center
K is the number of cluster centers
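The pseudocode above maps onto a short runnable sketch: the map step assigns points to the nearest broadcast center and emits per-center partial sums and counts, and the reduce step merges them into new centers. This is a single-process illustration rather than Twister code; the function names are invented, the new center is taken as the mean of the points assigned to it (the usual k-means update), an empty cluster keeps its old center, and `math.dist` requires Python 3.8 or later.

```python
import math


def kmeans_map(points, centers):
    """E-step on one partition: per-center coordinate sums and point counts."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k), key=lambda c: math.dist(p, centers[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def kmeans_reduce(partials, old_centers):
    """M-step: merge the partial sums and counts into new centers."""
    k, dim = len(old_centers), len(old_centers[0])
    new_centers = []
    for j in range(k):
        n = sum(counts[j] for _, counts in partials)
        if n == 0:
            new_centers.append(list(old_centers[j]))  # keep an empty cluster's center
        else:
            new_centers.append(
                [sum(sums[j][d] for sums, _ in partials) / n for d in range(dim)])
    return new_centers


def kmeans(partitions, centers, threshold=1e-6, max_iters=100):
    for _ in range(max_iters):
        partials = [kmeans_map(part, centers) for part in partitions]  # broadcast + map
        new_centers = kmeans_reduce(partials, centers)                 # global reduction
        diff = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if diff < threshold:
            break
    return centers
```

Only the small list of centers and the per-partition sums move between iterations; the points themselves stay in their partitions, which is what makes the algorithm a good fit for cached, long-running map tasks.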
Parallelization of K-means Clustering
[Figure: the current centers C1, C2, C3, …, Ck are broadcast to every Partition; for each center, each Partition produces partial sums and a count (x1 y1 count1, x2 y2 count2, x3 y3 count3, …, xk yk countk).]
Twister K-means