An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. An MPI-IO Cloud Cluster Bioinformatics Summer Project Brandon Posey, Dougal Ballantyne, Boyd Wilson November 13, 2013


DESCRIPTION

Researchers at Clemson University assigned a student summer intern to explore bioinformatics cloud solutions that leverage MPI, the OrangeFS parallel file system, AWS CloudFormation templates, and a Cluster Scheduler. The result was an AWS cluster that runs bioinformatics code optimized using MPI-IO. We give an overview of the process and show how easy it is to create clusters in AWS.

TRANSCRIPT

Page 1: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

An MPI-IO Cloud Cluster Bioinformatics Summer Project

Brandon Posey, Dougal Ballantyne, Boyd Wilson

November 13, 2013

Page 2: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Filesystems on AWS

Page 3: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

What filesystems *MUST* you use on AWS?

Page 4: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

The one that meets the needs of your unique application!

Some things to consider:
• Total amount of storage required?
• Resilience required?
• Expected number of clients?
• Locality of servers and clients?
• Average file sizes? (KB, MB, GB, TB)
• Block sizes used by applications?
• IO profile? Read/Write%?
• Typical IO use case?

Page 5: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Filesystems on AWS are all about building blocks!

Page 6: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Building Blocks
• Amazon Elastic Compute Cloud (Amazon EC2)
  – 1 ECU to 88 ECU of compute power
  – 613 MB to 240 GB of memory
  – Shared network, EBS-optimized, dedicated 10 Gb
• Amazon Simple Storage Service (Amazon S3)
  – Unlimited capacity
  – Web-scale
  – Lifecycle management


Page 7: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Building Blocks
• Local storage (ephemeral)
  – 150 GB to 3360 GB per instance
  – HDD and SSD
  – FREE! (part of instance cost)
• Amazon Elastic Block Store (Amazon EBS)
  – 1 GB to 1000 GB per volume
  – Standard and Provisioned IOPS
  – Multiple volumes per instance
  – Supports snapshot to Amazon S3


Page 8: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Storage-optimized EC2 instances
http://aws.amazon.com/ec2/instance-types/
"This family includes the HI1 and HS1 instance types, and provides you with Intel Xeon processors and direct-attached storage options optimized for applications with specific disk I/O and storage capacity requirements."
• HI1 instances feature SSD storage
• HS1 instances feature direct-attached HDD storage

Page 9: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Amazon EBS-optimized instances
http://aws.amazon.com/ebs/
"To enable your Amazon EC2 instances to fully utilize the IOPS provisioned on an EBS volume, you can launch selected Amazon EC2 instance types as “EBS-Optimized” instances."

Page 10: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

What Are Your Needs?
• Temporary or long-term storage?
• Shared or per instance?
• How much?
• How fast?

Page 11: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Long-term storage
• Use Amazon S3
• Pull datasets when needed
• Easy to access using AWS CLI or API

$ aws s3 cp s3://mybucket/dataset/input /ephemeral/input

• Lifecycle to Amazon Glacier

Page 12: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Temporary Storage
• Local ephemeral for scratch
• Distributed filesystem for high-performance scratch
  – OrangeFS
  – Lustre
  – Ceph

• Pull data from Amazon S3

Page 13: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

How much?
• With Amazon S3, you pay for what you use
• With Amazon EBS, you pay for what you provision
• Keeping data in Amazon S3 and only pulling what is needed helps manage cost
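The difference between paying for use and paying for provisioned capacity can be made concrete with a little arithmetic. The sketch below is illustrative only: both per-GB prices are hypothetical placeholders rather than AWS list prices, and the function names and dataset sizes are invented for the example.

```python
# Hypothetical prices in US cents per GB-month; NOT actual AWS pricing.
S3_CENTS_PER_GB_MONTH = 10
EBS_CENTS_PER_GB_MONTH = 10


def s3_monthly_cents(stored_gb, price=S3_CENTS_PER_GB_MONTH):
    """S3 bills for the data actually stored."""
    return stored_gb * price


def ebs_monthly_cents(provisioned_gb, price=EBS_CENTS_PER_GB_MONTH):
    """EBS bills for the provisioned volume size, whether or not it is full."""
    return provisioned_gb * price


def compare(dataset_gb, working_set_gb, provisioned_gb):
    """Return (all_on_ebs, s3_plus_scratch) monthly cost in cents.

    all_on_ebs:      the whole dataset lives on one provisioned EBS volume.
    s3_plus_scratch: the dataset lives in S3 and only the working set is
                     pulled onto a scratch EBS volume sized to fit it.
    """
    all_on_ebs = ebs_monthly_cents(provisioned_gb)
    s3_plus_scratch = s3_monthly_cents(dataset_gb) + ebs_monthly_cents(working_set_gb)
    return all_on_ebs, s3_plus_scratch


if __name__ == "__main__":
    ebs_only, s3_based = compare(dataset_gb=200, working_set_gb=50, provisioned_gb=1000)
    print(f"1000 GB EBS volume holding 200 GB: {ebs_only / 100:.2f} USD/month")
    print(f"200 GB in S3 + 50 GB EBS scratch:  {s3_based / 100:.2f} USD/month")
```

With real prices substituted, the same comparison shows how much of an EBS bill is unused provisioned capacity.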

Page 14: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

How fast?
• Ephemeral storage can deliver up to 2.2 GB/sec
  – more instances == more throughput
• Amazon EBS volumes support up to 4000 IOPS per volume
  – more volumes == more IOPS
• Amazon S3 scales horizontally
  – more clients == more throughput
  – more connections == more throughput
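The "more X == more Y" bullets amount to simple multiplication of a per-unit ceiling. The sketch below uses the two figures quoted on the slide and assumes ideal linear scaling, which real workloads may not reach. The function names are invented for illustration, and the 2.2 GB/sec figure is assumed to be a per-instance maximum.

```python
EBS_MAX_IOPS_PER_VOLUME = 4000   # figure quoted on the slide
EPHEMERAL_MAX_GB_PER_SEC = 2.2   # figure quoted on the slide, assumed per instance


def aggregate_ebs_iops(volumes, iops_per_volume=EBS_MAX_IOPS_PER_VOLUME):
    """Upper bound on IOPS when striping across several EBS volumes."""
    return volumes * iops_per_volume


def volumes_needed(target_iops, iops_per_volume=EBS_MAX_IOPS_PER_VOLUME):
    """Smallest number of volumes whose combined ceiling reaches target_iops."""
    return -(-target_iops // iops_per_volume)  # integer ceiling division


def aggregate_ephemeral_gb_per_sec(instances, per_instance=EPHEMERAL_MAX_GB_PER_SEC):
    """Upper bound on ephemeral throughput across several instances."""
    return instances * per_instance


if __name__ == "__main__":
    print(aggregate_ebs_iops(4))
    print(volumes_needed(10000))
    print(aggregate_ephemeral_gb_per_sec(4))
```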

Page 15: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Making filesystems persist
• Use Amazon EBS for block storage
• Use Amazon EBS snapshots for recovery
• Use a replicated distributed filesystem

Page 16: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Automating deployments
• AWS CloudFormation
• Drive storage through parameters
• Easy to set up and tear down
• Track template changes in SCM
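As a rough illustration of driving storage through parameters, the sketch below generates a minimal CloudFormation template in which the size and placement of an EBS volume come from stack parameters. It is not the template used in this project. The logical names `ScratchVolumeSize`, `ScratchVolumeZone`, and `ScratchVolume` are hypothetical, and a real cluster template would contain many more resources.

```python
import json


def build_storage_template(default_size_gb=100):
    """Return a minimal template dict with a parameter-driven EBS volume."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative scratch volume sized by a stack parameter",
        "Parameters": {
            # Hypothetical parameter names, chosen for this example only.
            "ScratchVolumeSize": {
                "Type": "Number",
                "Default": default_size_gb,
                "MinValue": 1,
                "MaxValue": 1000,
                "Description": "Size of the scratch EBS volume in GB",
            },
            "ScratchVolumeZone": {
                "Type": "String",
                "Description": "Availability Zone for the volume",
            },
        },
        "Resources": {
            "ScratchVolume": {
                "Type": "AWS::EC2::Volume",
                "Properties": {
                    "Size": {"Ref": "ScratchVolumeSize"},
                    "AvailabilityZone": {"Ref": "ScratchVolumeZone"},
                },
            }
        },
    }


if __name__ == "__main__":
    # The JSON text is what would be checked into source control.
    print(json.dumps(build_storage_template(), indent=2))
```

Because the template is plain JSON, changing the storage layout becomes a parameter change at stack creation time, and the template file itself can be diffed like any other source file.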

Page 17: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Solutions on AWS • OrangeFS from Omnibond

• Red Hat Storage 2.0

• Intel Cloud Edition Lustre - Private Beta

Page 18: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Customer presentation

Page 19: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

RNA-Seq Differential Gene Expression Workflow

Dr. Alex Feltus, a professor at Clemson University, had been discussing how to optimize the Gene Expression Workflow with Eddie Duffy and Dr. Barr Von Oehsen. As a result, a summer project with Brandon Posey was started to carry out this optimization in the AWS cloud. The FastQ steps were the longest processing steps, so the optimization started there.

*Workflow chart provided with permission from Allele Systems (www.allelesystems.com)

Page 20: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

OrangeFS – Scalable Parallel File System on AWS

Available on the AWS Marketplace and brought to you by Omnibond

OrangeFS Instance

Unified High Performance File System

Amazon DynamoDB

Amazon EBS volumes

Page 21: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Cloud Cluster Built using AWS, Torque/Maui, OrangeFS

OrangeFS WebDAV

Torque / Maui

Optimization Areas
• Data uploaded and retrieved via OrangeFS WebDAV interface
• MPI jobs are submitted via the Torque/Maui scheduler
• All built with an AWS CloudFormation template

MPI-IO Clients

OrangeFS Servers

Amazon DynamoDB

Page 22: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

AWS CloudFormation Prompts

"KeyName" : {

"VpcId" : {

"VpcPublicSubnetId" : {

"NAT & OrangeFS… AccessFrom" : {

"FSConfigDDB" : {… "WorkerConfigDDB" : {… "Type" : "AWS::DynamoDB::Table",

"CfnUser" : { …. "Type" : "AWS::IAM::User",…

Page 23: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

AWS CloudFormation – Amazon DynamoDB

"FSConfigDDB" : {

"Type" : "AWS::DynamoDB::Table",

"WorkerConfigDDB" : {

"Type" : "AWS::DynamoDB::Table",

Page 24: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

AWS CloudFormation – IAM & Network

"instanceRootRole" : {

"instanceRootProfile" : {

"HostKeys" : {

"PrivateSubnet" : {

"PrivateRouteTable" : {

"PrivateSubnetRouteTableAssociation" : {

"PrivateNetworkAcl" : {

"NATIPAddress" : {… "Type" : "AWS::EC2::EIP",

Page 25: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

AWS CloudFormation – Instances

"NATDevice" : {…

"Type" : "AWS::EC2::Instance",

"MasterCoordinator" : {… "Type" : "AWS::EC2::Instance",

"OrangeFSFleet" : {… "Type" : "AWS::AutoScaling::AutoScalingGroup",

"WorkerFleet" : {… "Type" : "AWS::AutoScaling::AutoScalingGroup",

"WebDavDevice" : {… "Type" : "AWS::EC2::Instance",

Page 26: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

AWS CloudFormation – Cloud Init (Python & Boto)

"sudo /usr/bin/python2.7 /home/ec2-user/TorqueMasterConfigure.py -l DEBUG -f /home/ec2-user/MasterConfig.log",
" -n ", {"Ref" : "WorkerConfigDDB"},
" -o ", {"Ref" : "FSConfigDDB"},
" -s ", {"Fn::FindInMap" : [ "ConfigParameters", "OrangeFSFleetSize", "item"]},
" -z ", {"Fn::FindInMap" : [ "ConfigParameters", "WorkerFleetSize", "item"]},
" -m ", {"Fn::FindInMap" : [ "ConfigParameters", "WorkerMaxFleetSize", "item"]},
" -p ", {"Fn::FindInMap" : [ "ConfigParameters", "OrangeFSPort", "item"]},
" -a ", {"Fn::FindInMap" : [ "ConfigParameters", "FSName", "item"]},
" -d ", {"Fn::FindInMap" : [ "ConfigParameters", "FSID", "item"]},
"\n",

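The fragment above mixes literal strings with `Ref` and `Fn::FindInMap` lookups. To show how such a list turns into a single command line, the sketch below resolves a small fragment locally. It assumes the pieces sit inside an `Fn::Join` with an empty delimiter, which the slide does not show. The mapping values and the table name returned for `Ref` are hypothetical, and the resolver handles only the three intrinsics used here.

```python
def resolve(node, mappings, refs):
    """Resolve Ref, Fn::FindInMap and Fn::Join in a template fragment.

    mappings: the template's Mappings section as a nested dict.
    refs:     logical name -> value that Ref would return (assumed values).
    """
    if isinstance(node, dict):
        if "Ref" in node:
            return refs[node["Ref"]]
        if "Fn::FindInMap" in node:
            map_name, top_key, second_key = node["Fn::FindInMap"]
            return mappings[map_name][top_key][second_key]
        if "Fn::Join" in node:
            delimiter, parts = node["Fn::Join"]
            return delimiter.join(str(resolve(p, mappings, refs)) for p in parts)
        raise ValueError(f"unsupported intrinsic: {sorted(node)}")
    return node  # plain strings pass through unchanged


if __name__ == "__main__":
    # Hypothetical values; the real template's mappings are not shown.
    mappings = {"ConfigParameters": {"OrangeFSFleetSize": {"item": "4"}}}
    refs = {"WorkerConfigDDB": "demo-WorkerConfigDDB"}
    fragment = {"Fn::Join": ["", [
        " -n ", {"Ref": "WorkerConfigDDB"},
        " -s ", {"Fn::FindInMap": ["ConfigParameters", "OrangeFSFleetSize", "item"]},
        "\n",
    ]]}
    print(repr(resolve(fragment, mappings, refs)))
```

In this reading, the configuration script receives the DynamoDB table names and fleet sizes as ordinary command-line flags once CloudFormation has substituted the values.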
Page 27: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Demo • Spin up a cluster on AWS live

Page 28: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

*Workflow chart provided with permission from Allele Systems (www.allelesystems.com)

RNA-Seq Differential Gene Expression Workflow

Optimization Areas
• FastQ Splitter rewritten in MPI-IO to leverage OrangeFS in AWS

• Merge-FastQ also rewritten in MPI-IO to leverage OrangeFS in AWS
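One common way to parallelize a splitter is to give each rank a byte range of the input and then move each range boundary forward to the next record start, so that no record is cut in half. The sketch below simulates that idea in plain Python with a loop standing in for the MPI ranks. It is not the project's MPI-IO code: the function names and the `<input>.<rank>` output naming are assumptions, and it expects four-line FASTQ records.

```python
import os


def next_record_offset(handle, pos, size):
    """Return the offset of the first FASTQ record starting at or after pos."""
    if pos <= 0:
        return 0
    if pos >= size:
        return size
    handle.seek(pos - 1)
    handle.readline()  # finish the line containing byte pos - 1
    while True:
        start = handle.tell()
        if start >= size:
            return size
        first = handle.readline()
        handle.readline()
        third = handle.readline()
        # A header line starts with '@' and the line two below starts with '+'.
        # Checking both avoids mistaking a quality line that begins with '@'.
        if first.startswith(b"@") and third.startswith(b"+"):
            return start
        handle.seek(start + len(first))


def rank_byte_ranges(path, nranks):
    """Return one record-aligned (start, end) byte range per rank."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        cuts = [next_record_offset(handle, size * rank // nranks, size)
                for rank in range(nranks)]
    cuts.append(size)
    return [(cuts[rank], cuts[rank + 1]) for rank in range(nranks)]


def split_fastq(path, nranks):
    """Write one chunk file per simulated rank and return the chunk paths."""
    outputs = []
    with open(path, "rb") as handle:
        for rank, (start, end) in enumerate(rank_byte_ranges(path, nranks)):
            handle.seek(start)
            chunk_path = f"{path}.{rank}"
            with open(chunk_path, "wb") as out:
                out.write(handle.read(end - start))
            outputs.append(chunk_path)
    return outputs
```

In an MPI program each rank would compute only its own range and read it directly from the shared file, which is the access pattern a parallel file system such as OrangeFS is designed to serve.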

Page 29: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Genomics – Data @@@FFF=BFHFDHCCDECJHIIIHG@GEEGAGEHFDHDHGIF@FGDEBFGIIGG=CGFGCDCEGHFEEECEBADBB?BCCCC<5:>@CCCA<9>C@A@ACB

@HWI-ST1097:170:C1LBBACXX:6:1101:1379:2208 1:N:0:CGATGT

CCTGTTATTGCCTCAAACTTCCGTGGCCTAAAACGCCAAAGTCCCCCTAAGAAGATAGCTGCGGGGGGGTGGCTCCGCCTAGCTAGTTAGGAAGCTGAGGG

+

CCCFFFFFHHHHHJJJJJJJJJJFAC8A*1?E#####################################################################

@HWI-ST1097:170:C1LBBACXX:6:1101:1582:2059 1:N:0:CGATGT

GTATTGTCATAAGCAGTTAAAGCTGATGTGCGCCTGTCATGTAATGCTGTAGAAACAAGCTCAGCAAGCTGCTGCTTTTGTGTTCTTGCACCGGAGNTCTT
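The data above is in FASTQ format: each read is four lines, namely an `@` header, the base sequence, a `+` separator, and a quality string with one character per base. A minimal reader, written here only to make that layout explicit (the names are invented and it is not part of the project's code), could look like this:

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # text after the leading '@'
    sequence: str  # bases, e.g. ACGTN
    quality: str   # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield four-line FASTQ records from a text handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:].rstrip("\n"), sequence, quality)


if __name__ == "__main__":
    sample = io.StringIO("@r1 1:N:0:CGATGT\nACGTN\n+\nIIII#\n")
    for record in read_fastq(sample):
        print(record.header, len(record.sequence))
```

Note that a quality line may itself begin with `@`, as the first line on this slide does, so a line-by-line scan for `@` is not enough to find record starts.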

Page 30: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Torque/Maui Job

#!/bin/bash

#PBS -l nodes=4

#PBS -l walltime=4:00:00

#PBS -j oe

#PBS -q batch

#PBS -N AWS

cd /mnt/orangefs

mpirun /usr/local/bin/concat -p '/mnt/orangefs/Sample_Feltus1_L006_R2.cat.fastq.*' -o Combined.fastq >> /mnt/orangefs/Results.txt
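The `concat` program itself is not shown in the slides. One way a parallel merge can work is for every rank to learn the sizes of all parts, compute its own starting offset in the output as the sum of the sizes before it, and then write at that offset independently of the others. The sketch below simulates that pattern with `os.pwrite` (POSIX only) and a loop in place of MPI ranks. The function names are invented, and this is an assumption about the technique, not the authors' implementation.

```python
import os


def write_offsets(sizes):
    """Exclusive prefix sum: offsets[i] is where part i starts in the output."""
    offsets, total = [], 0
    for size in sizes:
        offsets.append(total)
        total += size
    return offsets


def pwrite_all(fd, data, offset):
    """Write all of data at offset, retrying on short writes."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.pwrite(fd, view, offset)
        view = view[written:]
        offset += written


def concat_parts(part_paths, out_path):
    """Merge parts into out_path, one positioned write per simulated rank."""
    sizes = [os.path.getsize(p) for p in part_paths]
    offsets = write_offsets(sizes)
    fd = os.open(out_path, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        # Reverse order on purpose: with explicit offsets the write order
        # does not matter, which is what lets ranks proceed independently.
        for path, offset in reversed(list(zip(part_paths, offsets))):
            with open(path, "rb") as part:
                pwrite_all(fd, part.read(), offset)
    finally:
        os.close(fd)
    return sum(sizes)
```

With MPI the offsets would typically come from a collective exchange of sizes, and each rank would issue a positioned file write into the shared output on the parallel file system.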

Page 31: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

FastQ Splitter Time (seconds)

[Chart: Read Input, Transfer, and Write Output times for m1.xlarge, m3.xlarge, and cc2.8xlarge on a 0 to 100 second axis; the Old Method is shown separately on a 0 to 4000 second axis.]

Page 32: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

FastQ Merge Time (seconds)

[Chart: Merge Time for m1.xlarge, m3.xlarge, and cc2.8xlarge on a 0 to 120 second axis; the Old Method is shown separately on a 0 to 2500 second axis.]

Page 33: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

Demo • Torque/Maui job on the cluster that was spun up

Page 34: An MPI-IO Cloud Cluster Bioinformatics Summer Project (BDT205) | AWS re:Invent 2013

More Info
• AWS Marketplace…
  – OrangeFS Community Edition
  – OrangeFS Advanced Edition
• Community… Orangefs.org
• Pipeline – Allele Systems… allelesystems.com
