parallel applications and tools for cloud computing environments

Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010


Page 1: Parallel Applications And Tools For Cloud Computing Environments

Parallel Applications And Tools For Cloud Computing Environments

CloudCom 2010, Indianapolis, Indiana, USA

Nov 30 – Dec 3, 2010

Page 2: Parallel Applications And Tools For Cloud Computing Environments

Azure MapReduce

Page 3: Parallel Applications And Tools For Cloud Computing Environments

AzureMapReduce
A MapReduce runtime for Microsoft Azure using Azure cloud services:
  Azure Compute
  Azure BLOB storage for input, output, and intermediate data storage
  Azure Queues for task scheduling
  Azure Table for management and monitoring data storage
Advantages of the cloud services:
  Distributed, highly scalable, and highly available
  Backed by industrial-strength data centers and technologies
Decentralized control
Dynamically scale up/down
No single point of failure

Page 4: Parallel Applications And Tools For Cloud Computing Environments

AzureMapReduce Features
Familiar MapReduce programming model
Combiner step
Fault tolerance:
  Rerunning of failed and straggling tasks
Web-based monitoring console
Easy testing and deployment
Customizable:
  Custom input and output formats
  Custom key and value implementations
Load-balanced, global-queue-based scheduling
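The global-queue scheduling and the rerunning of failed tasks listed above can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions, not AzureMapReduce's actual code: `VisibilityQueue` and `run_workers` are hypothetical names, and the queue only imitates the visibility-timeout behaviour of a cloud queue (a message that is fetched but never deleted becomes visible again), using a logical clock instead of real time.

```python
from collections import deque


class VisibilityQueue:
    """In-memory stand-in for a cloud queue with a visibility timeout (hypothetical)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.visible = deque()
        self.in_flight = {}  # task -> logical time at which it becomes visible again

    def put(self, task):
        self.visible.append(task)

    def get(self, now):
        # Tasks whose timeout has expired go back to the visible queue first.
        for task, deadline in list(self.in_flight.items()):
            if deadline <= now:
                del self.in_flight[task]
                self.visible.append(task)
        if not self.visible:
            return None
        task = self.visible.popleft()
        self.in_flight[task] = now + self.timeout
        return task

    def delete(self, task):
        # A worker deletes the message only after it has finished the task.
        self.in_flight.pop(task, None)

    def empty(self):
        return not self.visible and not self.in_flight


def run_workers(tasks, failing_first_attempt, timeout=3):
    """Process every task; tasks in failing_first_attempt crash once, then succeed."""
    queue = VisibilityQueue(timeout)
    for t in tasks:
        queue.put(t)
    attempts = {t: 0 for t in tasks}
    done = []
    now = 0
    while not queue.empty():
        task = queue.get(now)
        now += 1
        if task is None:
            continue  # nothing visible yet; wait for a timeout to expire
        attempts[task] += 1
        if task in failing_first_attempt and attempts[task] == 1:
            continue  # simulated worker crash: the message is never deleted
        queue.delete(task)
        done.append(task)
    return done, attempts


done, attempts = run_workers(["map-0", "map-1", "map-2"], {"map-1"})
```

No central scheduler tracks the failed task here: it simply reappears in the queue once its timeout expires and is picked up again, which is one way a queue-based design can avoid a single point of failure.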

Page 5: Parallel Applications And Tools For Cloud Computing Environments

Advantages
Fills the void of parallel programming frameworks on Microsoft Azure
Well-known, easy-to-use programming model
Overcomes the possible unreliability of cloud compute nodes
Designed to coexist with the eventual consistency of cloud services
Allows the user to overcome the large latencies of cloud services by using coarser-grained tasks
Minimal management/maintenance overhead
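The point about coarser-grained tasks comes down to simple arithmetic: if every task pays a fixed service latency (for example a queue poll or a blob fetch), fewer and larger tasks spend a larger share of their time on useful work. The sketch below uses hypothetical function names and illustrative numbers, not measurements from the talk.

```python
def useful_fraction(compute_seconds, latency_seconds):
    """Fraction of a task's wall time spent computing, given a fixed per-task latency."""
    return compute_seconds / (compute_seconds + latency_seconds)


def total_time(total_compute, num_tasks, latency_seconds, workers):
    """Idealized wall time when equal-sized tasks are spread evenly over workers."""
    per_task = total_compute / num_tasks + latency_seconds
    tasks_per_worker = -(-num_tasks // workers)  # ceiling division
    return tasks_per_worker * per_task


# Assumed numbers: 1 hour of total compute, 0.5 s latency per task, 16 workers.
fine = total_time(3600, 36000, 0.5, 16)   # 0.1 s of compute per task
coarse = total_time(3600, 360, 0.5, 16)   # 10 s of compute per task
```

Under these assumptions the fine-grained split is dominated by latency, while the coarse-grained split keeps most of the wall time in computation. The trade-off is that very coarse tasks leave less room for load balancing.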

Page 6: Parallel Applications And Tools For Cloud Computing Environments

AzureMapReduce Architecture

Page 7: Parallel Applications And Tools For Cloud Computing Environments

Performance

[Figure: Smith-Waterman Pairwise Distance All-Pairs Normalized Performance. x-axis: Num. of Cores * Num. of Blocks; y-axis: Adjusted Time (s); series: Azure MR, Amazon EMR, Hadoop on EC2, Hadoop on Bare Metal]

[Figure: CAP3 Sequence Assembly Parallel Efficiency. x-axis: Num. of Cores * Num. of Files; y-axis: Parallel Efficiency; series: Azure MapReduce, Amazon EMR, Hadoop Bare Metal, Hadoop on EC2]
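The transcript does not say how parallel efficiency was computed for these charts. A commonly used definition is E = T(1) / (p * T(p)), where T(1) is the sequential time and T(p) the time on p cores. The sketch below assumes that definition; the timings are made up for illustration.

```python
def parallel_efficiency(t_sequential, t_parallel, cores):
    """E = T(1) / (p * T(p)); a value of 1.0 means perfect scaling."""
    if cores <= 0 or t_parallel <= 0:
        raise ValueError("cores and t_parallel must be positive")
    return t_sequential / (cores * t_parallel)


# Hypothetical timings: 6400 s sequentially, 125 s on 64 cores.
eff = parallel_efficiency(6400.0, 125.0, 64)
```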

Page 8: Parallel Applications And Tools For Cloud Computing Environments

Large-scale PageRank with Twister

Page 9: Parallel Applications And Tools For Cloud Computing Environments

PageRank with MapReduce
Efficient processing of large-scale PageRank challenges current MapReduce runtimes.
Difficulties: messaging > memory > computation
Implementations: Twister, DryadLINQ, Hadoop, MPI
Optimization strategies:
  Load static data in memory
  Fit partition size to memory
  Local merge in Reduce stage
Results visualization with PlotViz3:
  1K 3D vertices processed with MDS
  Red vertex represents "wikipedia.org"

Page 10: Parallel Applications And Tools For Cloud Computing Environments

PageRank Optimization Strategies

[Figure: Twister vs. Hadoop comparison chart]

1. Implemented with Twister and Hadoop on 50 million web pages.

2. Twister caches the partitions of the web graph in memory across iterations, while Hadoop needs to reload each partition from disk into memory in every iteration.
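The caching strategy described above, together with a local merge of rank contributions, can be sketched in plain Python. This is an illustrative single-process version, not Twister or Hadoop code: `pagerank` is a hypothetical function, the partitions are ordinary dictionaries kept in memory for all iterations, and the sketch assumes every page has at least one outgoing link and every link target is itself a key in some partition.

```python
from collections import defaultdict


def pagerank(partitions, num_pages, iterations=20, damping=0.85):
    """Iterative PageRank over adjacency-list partitions held in memory.

    partitions: list of dicts {page: [outgoing links]}. They are built once
    and reused in every iteration (the static data is never reloaded).
    """
    ranks = {p: 1.0 / num_pages for part in partitions for p in part}
    for _ in range(iterations):
        partials = []
        for part in partitions:  # "map" over each cached partition
            local = defaultdict(float)  # local merge: one sum per target page
            for page, links in part.items():
                if links:
                    share = ranks[page] / len(links)
                    for target in links:
                        local[target] += share
            partials.append(local)
        merged = defaultdict(float)  # "reduce": combine the per-partition sums
        for local in partials:
            for target, value in local.items():
                merged[target] += value
        ranks = {
            p: (1.0 - damping) / num_pages + damping * merged[p] for p in ranks
        }
    return ranks


# Tiny example graph split into two partitions: A -> B, C; B -> C; C -> A.
ranks = pagerank([{"A": ["B", "C"]}, {"B": ["C"], "C": ["A"]}], 3, iterations=50)
```

Only the small rank vector changes between iterations, so keeping the graph partitions resident avoids repeating the most expensive I/O, and merging contributions per partition reduces the number of values passed to the reduce step.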

1. Implemented with DryadLINQ on 50 million web pages on a 32-node Windows HPC cluster.

2. Split the web graph at different granularities:
   Coarse granularity: split the whole web graph into 1280 files.
   Fine granularity: split the whole web graph into 256 files.

[Figure: DryadLINQ results for fine and coarse granularity, each with a linear trend line; x-axis: number of files (160/32, 320/64, 640/128, 960/196, 1280/256)]

Page 11: Parallel Applications And Tools For Cloud Computing Environments

PageRank Architecture

Page 12: Parallel Applications And Tools For Cloud Computing Environments

Twister BLAST

Page 13: Parallel Applications And Tools For Cloud Computing Environments

Twister-BLAST
A simple parallel BLAST application based on the Twister MapReduce framework
Runs on a single machine, a cluster, or the Amazon EC2 cloud platform
Adaptable to the latest BLAST tool (BLAST+ 2.2.24)
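One plausible way to parallelize BLAST in a MapReduce style is to split the query sequences into chunks and let each map task run the BLAST binary on one chunk. The slides do not show how Twister-BLAST partitions its input, so the sketch below is an assumption: `split_fasta` and `blast_command` are hypothetical helpers, and the command is only built, not executed.

```python
def split_fasta(text, num_parts):
    """Split FASTA text into at most num_parts chunks without breaking a record."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line])  # a header line starts a new record
        elif records and line.strip():
            records[-1].append(line)
    chunks = [[] for _ in range(num_parts)]
    for i, record in enumerate(records):
        chunks[i % num_parts].append("\n".join(record))  # round-robin assignment
    return ["\n".join(chunk) + "\n" for chunk in chunks if chunk]


def blast_command(query_path, db_name, out_path, program="blastn"):
    """Argument list a map task could pass to subprocess.run (not executed here)."""
    return [program, "-query", query_path, "-db", db_name, "-out", out_path]


queries = ">s1\nACGT\nTTGA\n>s2\nGGCC\n>s3\nATAT\n"
parts = split_fasta(queries, 2)
# File and database names below are placeholders.
cmd = blast_command("part0.fa", "example_db", "part0.out")
```

Because each chunk is searched independently against the same database, the map tasks need no communication with each other, which is why the database has to be available on every node.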

Page 14: Parallel Applications And Tools For Cloud Computing Environments

Twister-BLAST Architecture

Page 15: Parallel Applications And Tools For Cloud Computing Environments

Database Management
Replicated to all the nodes in order to support BLAST binary execution
Compressed before replication
Transported through the file-share script tool in Twister

Page 16: Parallel Applications And Tools For Cloud Computing Environments

Twister-BLAST Performance

Page 17: Parallel Applications And Tools For Cloud Computing Environments

SALSA Portal and Biosequence Analysis Workflow

Page 18: Parallel Applications And Tools For Cloud Computing Environments

Biosequence Analysis: Conceptual Workflow

[Diagram: Alu Sequences; Pairwise Alignment & Distance Calculation (produces the Distance Matrix); Pairwise Clustering (produces Cluster Indices); Multi-Dimensional Scaling (produces Coordinates); Visualization (produces the 3D Plot)]
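The first stage of this workflow, turning a set of sequences into a symmetric distance matrix, can be sketched as follows. The real pipeline derives distances from pairwise alignments; here a much simpler mismatch fraction over equal-length strings stands in for that step, and both function names are hypothetical.

```python
from itertools import combinations


def mismatch_distance(a, b):
    """Fraction of differing positions; a stand-in for an alignment-based distance."""
    if len(a) != len(b):
        raise ValueError("this simplified distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(sequences, distance=mismatch_distance):
    """Symmetric all-pairs matrix; each pair is computed once and mirrored."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = distance(sequences[i], sequences[j])
    return matrix


matrix = distance_matrix(["ACGT", "ACGA", "TCGA"])
```

The resulting matrix is the shared input to both the clustering and the scaling stages. Since every pair is independent, the upper triangle can be divided into blocks and computed in parallel.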

Page 19: Parallel Applications And Tools For Cloud Computing Environments

Biosequence Analysis: Workflow Implementation

[Diagram: a Job Configuration and Submission Tool submits jobs to a Microsoft HPC Cluster; the Cluster Head-node distributes each job to the Compute Nodes, which run Sequence Aligning, Pairwise Clustering, and Dimension Scaling and write the results; the results are retrieved for the PlotViz 3D Visualization Tool]

Page 20: Parallel Applications And Tools For Cloud Computing Environments

SALSA Portal Use Cases

[Use-case diagram; recoverable use case: Create Biosequence Analysis Job]

Page 21: Parallel Applications And Tools For Cloud Computing Environments

SALSA Portal Architecture

Page 22: Parallel Applications And Tools For Cloud Computing Environments

PlotViz Visualization with parallel MDS/GTM

Page 23: Parallel Applications And Tools For Cloud Computing Environments

PlotViz
A tool for visualizing data points:
  Dimension reduction by GTM and MDS
  Browse large and high-dimensional data
  Use many open (value-added) data
Parallel visualization algorithms:
  GTM (Generative Topographic Mapping)
  MDS (Multi-dimensional Scaling)
  Interpolation extensions to GTM and MDS
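MDS looks for low-dimensional coordinates whose pairwise Euclidean distances match the given dissimilarities as closely as possible. A standard way to measure the mismatch is the raw stress, the sum of squared differences over all pairs. The sketch below only evaluates that objective for given coordinates; it is not the parallel MDS implementation from the talk, and `raw_stress` is a hypothetical name.

```python
import math


def raw_stress(dissimilarities, coordinates):
    """Sum over pairs i < j of (distance between points i and j - dissimilarity)^2."""
    total = 0.0
    n = len(coordinates)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coordinates[i], coordinates[j])
            total += (d - dissimilarities[i][j]) ** 2
    return total


# Two points in 3D that are 5 units apart.
points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
perfect = raw_stress([[0.0, 5.0], [5.0, 0.0]], points)
off_by_one = raw_stress([[0.0, 4.0], [4.0, 0.0]], points)
```

An MDS solver iteratively moves the points to reduce this value. The double loop over pairs is also the reason the computation grows quadratically with the number of points, which motivates parallel and interpolation-based variants.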

Page 24: Parallel Applications And Tools For Cloud Computing Environments

PlotViz System Overview

[Diagram: Visualization Algorithms (parallel dimension reduction algorithms); Chem2Bio2RDF (aggregated public databases: PubChem, CTD, DrugBank, QSAR); PlotViz (light-weight client); connecting labels: 3-D Map File, SPARQL query, Meta data]

Page 25: Parallel Applications And Tools For Cloud Computing Environments

CTD data for gene-disease

PubChem data with CTD, visualized by using MDS (left) and GTM (right).
About 930,000 chemical compounds are each visualized as a point in 3D space, annotated by the related genes in the Comparative Toxicogenomics Database (CTD).

Page 26: Parallel Applications And Tools For Cloud Computing Environments

Chem2Bio2RDF

Chemical compounds reported in the literature, visualized by MDS (left) and GTM (right).
234,000 chemical compounds that may be related to a set of 5 genes of interest (ABCB1, CHRNB2, DRD2, ESR1, and F2) are visualized, based on a dataset collected from major journal literature that is also stored in the Chem2Bio2RDF system.

Page 27: Parallel Applications And Tools For Cloud Computing Environments


Activity Cliffs

GTM Visualization of bioassay activities

Page 28: Parallel Applications And Tools For Cloud Computing Environments

Solvent Screening

Visualizing 215 solvents
215 solvents (colored and labeled) are embedded with 100,000 chemical compounds (colored in grey) from the PubChem database.