petascale genomics (strata singapore 20151203)

Post on 21-Jan-2018

1.005 Views

Category:

Health & Medicine

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1© Cloudera, Inc. All rights reserved.

Scaling Up Genomics with Hadoop and Spark

Uri Laserson | @laserson | 14 November 2015

Petascale Genomics

2© Cloudera, Inc. All rights reserved.

We come in peace.

Pioneer plaque

3© Cloudera, Inc. All rights reserved.

What is genomics?

4© Cloudera, Inc. All rights reserved.

Organism

5© Cloudera, Inc. All rights reserved.

Organism Cell

6© Cloudera, Inc. All rights reserved.

Organism Cell Genome

7© Cloudera, Inc. All rights reserved.

8© Cloudera, Inc. All rights reserved.

9© Cloudera, Inc. All rights reserved.

Reference chromosome

10© Cloudera, Inc. All rights reserved.

Reference chromosome

Location

11© Cloudera, Inc. All rights reserved.“… decoding the Book of Life”

12© Cloudera, Inc. All rights reserved.

...atatggaaccaaaaaagagcccgcatcgccaaggcaatcctaagccaaaagaacaaagctggaggcatcacactacctgacttcaaactatactaca

agcctacagtaaccaaaacagcatggtactggtaccaaaacagagatatagatcaatggaacagaacagagccctcagaaataacgccgcatatctacaa

ctatctgatctttgacgaacctgagaaaaacaagcaatggggaaaggattccctatttaataaatggtgctgggaaaactggctagccatatgtagaaag

ctgaaactggatcccttccttacaccttatacaaaaatcaattcaagatggattaaagacttaaacgttagacctaaaaccataaaaaccctagaagaaa

acctaggcagtaccattcaggacataggcatgggcaaggacttcatgtccaaaacaccaaaagcaatggcaacaaaagacaaaattgacaaatgggatct

aattaaactaaagagcttctgcacagcaaaagaaactaccatcagagtgaacaggaaacctacaaaatgggagaaaattttcgcaacctactcatctgac

aaagggctaatatccagaatctacaatgaactcaaacaaatttacaagaaaaaaacaaacaaccccatcaaaaagtgggcaaaggacatgaacagacact

tctcaaatgaagacatttatgcagccaaaaaacacatgaaaaaatgctcatcatcactggccatcagagaaatgcaaatcaaaaccacaatgagatacca

tctcacaccagttagaatggcaatcattaaaaagtcaggaaacaacaggtgctggagaggatgtggagaaataggaacacttttacactgttggtgggac

tgtaaactagttcaaccattgtggaagtcagtgtggtgattcctcagggatctagaactagaaataccatttgacccagccatcccattactgggtatat

acccaaaggactataaatcatgctgctataaagacacatgcacacgtatgtttattgcggcattattcacaatagcaaagacttggaaccaacccaaatg

tccaacaatgataaactggattaagaaaatgtggcacatatacaccatggaatactctgcagccataaaaaaggatgagttcatgtcctttgtagggaca

tggatgaaattggaaatcatcattctcagtaaactatcgcaagaataaaaaaccaaacaccgcatattctcactcataggtgggaattgaacaatgagat

cacatggacacaggaagaggaatatcacactctggggactgtggtggggtggggggaggggggagggatagcattgggagatatacctaatgctagatga

cgagttagtgggtgcagcgcaccagcatggcacatgtatacatatgtaactaacctgcacattgtgcacatgtaccctaaaacttaaagtataataaaaa

aataaaaaaaataaagtgtgtgtgtgtatgactttaattaacttgatcacccacacacacacaaacactgaccaaaattaatatcaagtcaggtctgtct

gaatgtaaagccaacagcaaacatccctctctccaaatggaaaagaaacagggggttatgggcagctacactgctaaatgttaaaactttatttttaaat

gtggccataaaaatcactaaataaaattgataatatatgtttttgatgaataaattttatatatgtctacactggaaactatatagcaataaaaactaac

catgtacaactaaactcataaatttcataaacataataagtaaaagaagccagacaaaaagtagtgtatactgttaaattccatttatataaaagttcaa

aaaagccaaaaagaaactatgctgttaaaagtaaggattatagttactattcagggaagagagtagtggctggaaagaaacataaagggggtctctgaag

tggaataatgttctgttttttgatctgggtattagggtgtttaatttcggaaaattattttatctttatacttattgtattattgattttttgcttaaca

aattactcaaaacttagaggtttaaaaaaaattaattattgtattaatttctctgggccaggaattggagagagcttagctgggtagttctggttcaaaa

tttctcatgagattaccgtcaagctgttggagggggctgcatcatctgaaggcttgaccgaggctagaggatctactttcaagatggcccactcacatgg

ctgttggcaagaagtttcagtttctcactagcttctagcaggaggccataatttctcaccacatagatctctctatagggctactcgagtgtcctcacag

caaggtagctggctttcttcagagccaagtgactcaaaggcaaagaggaagtcactatgccatttatgacctagttttggaactcacactttgttccgaa

ttgaccttccatcactttctagtcattaggatttaagtcactaactctgatccatagtcaaggggagtaaaatttggctttattgttggaggatggagta

gcaaagaatttgttgacacattttaaaactaccatacttaaacagttcatttttctgaatatgcttcaattagaagttaaaatgatgcaattttaaaaca

ttgtttcaaatgaacactgttagggagagaagtgcttcttctccatatctaatgtttcttccatatttagggagttccattagtttaacactttaag...

13© Cloudera, Inc. All rights reserved.

14© Cloudera, Inc. All rights reserved.

15© Cloudera, Inc. All rights reserved.

16© Cloudera, Inc. All rights reserved.

17© Cloudera, Inc. All rights reserved.

18© Cloudera, Inc. All rights reserved.

>read1TTGGACATTTCGGGGTCTCAGATT>read2AATGTTGTTAGAGATCCGGGATTT>read3GGATTCCCCGCCGTTTGAGAGCCT>read4AGGTTGGTACCGCGAAAAGCGCAT

19© Cloudera, Inc. All rights reserved.

>read1TTGGACATTTCGGGGTCTCAGATT>read2AATGTTGTTAGAGATCCGGGATTT>read3GGATTCCCCGCCGTTTGAGAGCCT>read4AGGTTGGTACCGCGAAAAGCGCAT

Bioinformatics!

20© Cloudera, Inc. All rights reserved.

>read1TTGGACATTTCGGGGTCTCAGATT>read2AATGTTGTTAGAGATCCGGGATTT>read3GGATTCCCCGCCGTTTGAGAGCCT>read4AGGTTGGTACCGCGAAAAGCGCAT

Bioinformatics!

21© Cloudera, Inc. All rights reserved.

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Pipelines!

22© Cloudera, Inc. All rights reserved.

##fileformat=VCFv4.1

##fileDate=20090805

##source=myImputationProgramV3.1

##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

##phasing=partial

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">

##FILTER=<ID=q10,Description="Quality below 10">

##FILTER=<ID=s50,Description="Less than 50% of samples have data">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3

20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4

20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

Compressed text files (non-splittable)Semi-structuredPoorly specified

23© Cloudera, Inc. All rights reserved.

##fileformat=VCFv4.1

##fileDate=20090805

##source=myImputationProgramV3.1

##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

##phasing=partial

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">

##FILTER=<ID=q10,Description="Quality below 10">

##FILTER=<ID=s50,Description="Less than 50% of samples have data">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3

20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4

20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

Compressed text files (non-splittable)Semi-structuredPoorly specified

Global sort order

24© Cloudera, Inc. All rights reserved.

CHPC (scheduler)POSIX filesystem

JavaHPC (Queue)POSIX filesystem

C++Single-nodeSQLite

It’s file formats all the way down!

25© Cloudera, Inc. All rights reserved.

Dedup

26© Cloudera, Inc. All rights reserved.

/*** Main work method. Reads the BAM file once and collects sorted information about* the 5' ends of both ends of each read (or just one end in the case of pairs).* Then makes a pass through those determining duplicates before re-reading the* input file and writing it out with duplication flags set correctly.*/protected int doWork() {

// build some data structuresbuildSortedReadEndLists(useBarcodes);generateDuplicateIndexes(useBarcodes);

final SAMFileWriter out =new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);

final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;while (iterator.hasNext()) {

final SAMRecord rec = iterator.next();if (!rec.isSecondaryOrSupplementary()) {

if (recordInFileIndex == nextDuplicateIndex) {rec.setDuplicateReadFlag(true);// Now try and figure out the next duplicate indexif (this.duplicateIndexes.hasNext()) {

nextDuplicateIndex = this.duplicateIndexes.next();} else {

// Only happens once we've marked all the duplicatesnextDuplicateIndex = -1;

}} else {

rec.setDuplicateReadFlag(false);

Method

Code

27© Cloudera, Inc. All rights reserved.

/*** Main work method. Reads the BAM file once and collects sorted information about* the 5' ends of both ends of each read (or just one end in the case of pairs).* Then makes a pass through those determining duplicates before re-reading the* input file and writing it out with duplication flags set correctly.*/protected int doWork() {

// build some data structuresbuildSortedReadEndLists(useBarcodes);generateDuplicateIndexes(useBarcodes);

final SAMFileWriter out =new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);

final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;while (iterator.hasNext()) {

final SAMRecord rec = iterator.next();if (!rec.isSecondaryOrSupplementary()) {

if (recordInFileIndex == nextDuplicateIndex) {rec.setDuplicateReadFlag(true);// Now try and figure out the next duplicate indexif (this.duplicateIndexes.hasNext()) {

nextDuplicateIndex = this.duplicateIndexes.next();} else {

// Only happens once we've marked all the duplicatesnextDuplicateIndex = -1;

}} else {

rec.setDuplicateReadFlag(false);

Method

Code

28© Cloudera, Inc. All rights reserved.

@Option(shortName = "MAX_FILE_HANDLES",doc = "Maximum number of file handles to keep open when spilling " +

"read ends to disk. Set this number a little lower than the " +"per-process maximum number of file that may be open. This " +"number can be found by executing the 'ulimit -n' command on " +"a Unix system.")

public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;

29© Cloudera, Inc. All rights reserved.

@Option(shortName = "MAX_FILE_HANDLES",doc = "Maximum number of file handles to keep open when spilling " +

"read ends to disk. Set this number a little lower than the " +"per-process maximum number of file that may be open. This " +"number can be found by executing the 'ulimit -n' command on " +"a Unix system.")

public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;

Dedup

Method/Algo

Code

Platform

30© Cloudera, Inc. All rights reserved.

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

31© Cloudera, Inc. All rights reserved.

It’s pipelines all the way down!

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

32© Cloudera, Inc. All rights reserved.

It’s pipelines all the way down!

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Node 1

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Node 2

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Node 3

33© Cloudera, Inc. All rights reserved.

Manually running pipelines on HPC

$ bsub –q shared_12h python split_genotypes.py

$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_1.vcf agg1.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_2.vcf agg2.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_3.vcf agg3.csv$ bsub –q shared_12h –R mem=4g python query_agg.py genotypes_4.vcf agg4.csv

$ bsub –q shared_12h python merge_maf.py

34© Cloudera, Inc. All rights reserved.

35© Cloudera, Inc. All rights reserved.

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Alignment Dedup Recalibrate QC/Filter

Alignment Dedup Recalibrate QC/Filter

36© Cloudera, Inc. All rights reserved.

Node 1

Alignment Dedup Recalibrate QC/FilterVariantCalling

VariantAnnotation

Node 2

Node 3

Alignment Dedup Recalibrate QC/Filter

Alignment Dedup Recalibrate QC/Filter

Node 4

37© Cloudera, Inc. All rights reserved.

Node 1

Alignment Dedup QC/FilterVariantCalling

VariantAnnotation

Node 2

Node 3

Alignment Dedup QC/Filter

Alignment Dedup QC/Filter

Node 4

Recalibrate

38© Cloudera, Inc. All rights reserved.

Why Are We Still Defining File Formats By Hand?

• Instead of defining custom file formats for each data type and access pattern…

• Parquet creates a compressed format for each Avro-defineddata model

• Improvements over existing formats• ~20% for BAM• ~90% for VCF

39© Cloudera, Inc. All rights reserved.

YARN-managedHadoop cluster

Sparkexecutors

𝑗=1

𝑑𝑖

𝑃(𝑏𝑖𝑗|𝑒𝑖𝑗 , 𝑓𝑖)

𝑗=1

𝑑𝑖

𝑃(𝑏𝑖𝑗|𝑒𝑖𝑗 , 𝑓𝑖)

𝑗=1

𝑑𝑖

𝑃(𝑏𝑖𝑗|𝑒𝑖𝑗 , 𝑓𝑖)Partial sums

𝑖=1

𝑁

𝑗=1

𝑑𝑖

𝑃(𝑏𝑖𝑗|𝑒𝑖𝑗 , 𝑓𝑖)

Driver

Applicationcode

ContEst Algorithm

40© Cloudera, Inc. All rights reserved.

Hadoop provides layered abstractions for data processing

HDFS (scalable, distributed storage)

YARN (resource management)

MapReduce Impala (SQL) Solr (search) Spark

ADAMquince guacamole …

bd

g-fo

rmat

s (A

vro

/Par

qu

et)

41© Cloudera, Inc. All rights reserved.

• Hosted at Berkeley and the

AMPLab

• Apache 2 License

• Contributors from both

research and commercial

organizations

• Core spatial primitives,

variant calling

• Avro and Parquet for data

models and file formats

Spark + Genomics = ADAM

42© Cloudera, Inc. All rights reserved.

Core Genomics Primitives: Spatial Join

43© Cloudera, Inc. All rights reserved.

Executing query in Hadoop: interactive Spark shell (ADAM)

def inDbSnp(g: Genotype): Boolean = true or false

def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()

val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()

val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase”)

val genotypesRDD = sc.adamLoad("path/to/genotypes")

val filteredRDD = genotypesRDD

.filter(!inDbSnp(_))

.filter(isDeleterious(_))

.filter(isFramingham(_))

val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

val maf = joinedRDD

.keyBy(x => (x.getVariant, getPopulation(x)))

.groupByKey()

.map(computeMAF(_))

maf.saveAsNewAPIHadoopFile("path/to/output")

apply predicates

load data

join data

group-byaggregate (MAF)

persist data

44© Cloudera, Inc. All rights reserved.

Executing query in Hadoop: distributed SQL

SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)

FROM genotypes g

INNER JOIN samples s

ON g.sample = s.sample

INNER JOIN dnase d

ON g.chr = d.chr

AND g.pos >= d.start

AND g.pos < d.end

LEFT OUTER JOIN dbsnp p

ON g.chr = p.chr

AND g.pos = p.pos

AND g.ref = p.ref

AND g.alt = p.alt

WHERE

s.study = "framingham"

p.pos IS NULL AND

g.polyphen IN ( "possibly damaging", "probably damaging" )

GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop

apply predicates

“load” and join data

group-by

aggregate (UDAF)

45© Cloudera, Inc. All rights reserved.

ADAM preliminary performance

46© Cloudera, Inc. All rights reserved.

1. Somebody will build on your code

2. You should have assembled a team to build your software

3. If you choose the right license, more people will use and build on your

software.

4. Making software free for commercial use shows you are not against

companies.

5. You should maintain your software indefinitely

6. Your “stable URL” can exist forever

7. You should make your software “idiot proof”

8. You used the right programming language for the task.

Lior Pachterhttps://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/

“Myths of Bioinformatics Software”

47© Cloudera, Inc. All rights reserved.

48© Cloudera, Inc. All rights reserved.

Acknowledgements

UCBerkeleyMatt MassieFrank NothaftMichael Heuer

TamrTimothy Danford

MSSMJeff HammerbacherRyan Williams

ClouderaTom WhiteSandy Ryza

49© Cloudera, Inc. All rights reserved.

Thank you@laserson

laserson@cloudera.com

top related