PAGE: A Framework for Easy Parallelization of Genomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
IPDPS 2014, Phoenix, Arizona
IPDPS'14 2
Motivation
• Sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts
Motivation
• Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html
Typical Analysis on Genomic Data
• Single Nucleotide Polymorphism (SNP) calling
Positions: 1 2 3 4 5 6 7 8
Reference: A G C G T A C C

Alignment File-1
Read-1: A G C G
Read-2: G C G G
Read-3: G C G T A
Read-4: C G T T C C

Alignment File-2
Read-1: A G A G
Read-2: A G A G T
Read-3: G A G T
Read-4: G T T C C

[Figure: candidate SNP positions marked ✖ ✓ ✖]
*Adapted from Wikipedia
A single SNP may cause Mendelian disease!
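A minimal sketch of pileup-based SNP calling on the reads from this slide. The read start positions and the `min_support` threshold are assumptions made for illustration (the slide text does not give them), and `call_snps` is a hypothetical helper, not part of VarScan or PAGE.

```python
from collections import Counter

REFERENCE = "AGCGTACC"

# (start position, bases). Start positions are ASSUMED for illustration;
# they are not given in the slide text.
FILE_1 = [(1, "AGCG"), (2, "GCGG"), (2, "GCGTA"), (3, "CGTTCC")]
FILE_2 = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]


def call_snps(reference, reads, min_support=2):
    """Report positions where at least min_support reads disagree with the reference."""
    pileup = {}
    for start, bases in reads:
        for offset, base in enumerate(bases):
            pileup.setdefault(start + offset, Counter())[base] += 1

    snps = []
    for pos in sorted(pileup):
        ref_base = reference[pos - 1]  # positions are 1-based
        for base, count in pileup[pos].items():
            if base != ref_base and count >= min_support:
                snps.append((pos, ref_base, base, count))
    return snps


snps_file_1 = call_snps(REFERENCE, FILE_1)
snps_file_2 = call_snps(REFERENCE, FILE_2)
```

Under these assumed offsets, a mismatch seen in a single read is treated as noise, while a mismatch supported by several reads at the same position is reported as a SNP.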
Outline
• Motivation
• Existing Solutions for Implementation
• Our Work
• Experimental Evaluation
• Conclusion
Existing Solutions for Implementation
• Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling
• Parallel implementations
– Turboblast: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
• Middleware systems
– Hadoop
  • Not designed for the specific needs of genetic data
  • Limited programmability
– Genome Analysis Tool Kit (GATK)
  • Designed for genetic data processing
  • Provides special data traversal patterns
  • Limited parallelization for some of its tools
Our Goal
• We want to develop a middleware system that
– Is specific to parallel genetic data processing
– Allows parallelization of a variety of genetic algorithms
– Works with different popular genetic data formats
– Allows the use of existing programs
Challenges
• Load imbalance due to the nature of genomic data
– It is not just an array of A, G, C and T characters
• High overhead of tasks
• I/O contention
[Figure: coverage variance]
Our Work
• PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications
• Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language
Intra-dependent Processing
• Each file is processed independently
[Figure: each input file (File-1 ... File-m) is split into Region-1 ... Region-n; one map task per region produces O-i1 ... O-in, and a reduce task combines them into Output-i]
Inter-dependent Processing
• Each map task processes a particular region of ALL files
[Figure: map tasks for Region-1 ... Region-k ... Region-n read the input files and produce O1 ... Ok ... On; a reduce task combines them into a single Output]
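The difference between the two execution models can be sketched by enumerating the tasks each one creates. This is an illustrative sketch only: the function names and the tuple layout are hypothetical, and PAGE's internal task representation is not described in these slides.

```python
def intra_dependent_tasks(files, regions):
    """One map task per (file, region); one reduce per file (Output-1 ... Output-m)."""
    maps = [(f, r) for f in files for r in regions]
    reduces = [(f, [(f, r) for r in regions]) for f in files]
    return maps, reduces


def inter_dependent_tasks(files, regions):
    """One map task per region, reading that region from ALL files; one final output."""
    maps = [(tuple(files), r) for r in regions]
    reduces = [("output", maps)]
    return maps, reduces


# Hypothetical inputs for illustration.
files = ["a.bam", "b.bam"]
regions = ["region-1", "region-2", "region-3"]

intra_maps, intra_reduces = intra_dependent_tasks(files, regions)
inter_maps, inter_reduces = inter_dependent_tasks(files, regions)
```

With 2 files and 3 regions, the intra-dependent model yields 6 map tasks and 2 outputs, while the inter-dependent model yields 3 map tasks and 1 output.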
What Can PAGE Parallelize?
• PAGE can parallelize all applications that have the following property
• M is the map task
• R, R1 and R2 are three regions such that R = concatenation of R1 and R2
• M(R) = M(R1) ⊕ M(R2), where ⊕ is the reduction function
[Figure: region R split into R1 and R2]
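The property can be checked on a toy example. In this sketch the map task lists the positions where a sample differs from the reference, and the reduction ⊕ is list concatenation; both choices, the `sample` string, and the function names are assumptions made for illustration.

```python
def map_task(reference, sample, start, end):
    """M: positions in [start, end) where the sample differs from the reference."""
    return [i for i in range(start, end) if reference[i] != sample[i]]


def reduce_fn(left, right):
    """The reduction function: concatenate per-region results in region order."""
    return left + right


reference = "AGCGTACC"
sample = "AGAGTTCC"  # hypothetical sample sequence

# M(R) over the whole region R = [0, 8)
whole = map_task(reference, sample, 0, 8)

# M(R1) combined with M(R2), where R1 = [0, 4) and R2 = [4, 8)
split = reduce_fn(map_task(reference, sample, 0, 4),
                  map_task(reference, sample, 4, 8))
```

Here `whole` and `split` are equal, so this map task qualifies. A map task whose result at one position depends on data far outside its region would not satisfy the property without extra handling at region boundaries.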
Data Partitioning
• Data is NOT packaged into equal-size data blocks as in Hadoop
– Each application has a different way of reading the data
– Equal-size data block packaging ignores nucleotide base location information
• The genome is divided into regions, and each map task is assigned a region
– Takes location information into account
– The map task is responsible for accessing its particular region of the input files
  • This is a common feature of many genomic tools (GATK, SamTools)
Genome Partition
• PAGE provides two data partitioning methods
– By-locus partitioning: chromosomes are divided into regions
– By-chromosome partitioning: each chromosome is kept whole
[Figure: Chr-1 ... Chr-6 shown under each partitioning method]
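The two partitioning methods can be sketched as follows. The chromosome lengths, the fixed `region_size`, and the function names are hypothetical; how PAGE actually sizes regions is not stated here. The `chr:start-end` string follows the region syntax accepted by samtools.

```python
def by_chromosome(chrom_lengths):
    """By-chromosome partitioning: one region per whole chromosome."""
    return [(name, 1, length) for name, length in chrom_lengths.items()]


def by_locus(chrom_lengths, region_size):
    """By-locus partitioning: split each chromosome into fixed-size regions."""
    regions = []
    for name, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append((name, start, end))
    return regions


def as_region_string(region):
    """Format a region as chr:start-end (1-based, inclusive)."""
    name, start, end = region
    return f"{name}:{start}-{end}"


# Toy chromosome lengths, for illustration only.
chrom_lengths = {"chr1": 2500, "chr2": 1000}

whole_chromosomes = by_chromosome(chrom_lengths)
locus_regions = by_locus(chrom_lengths, region_size=1000)
```

By-locus partitioning produces more, smaller tasks, which gives a scheduler more room to balance load; by-chromosome partitioning suits tools that need to see a chromosome in one piece.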
Task Scheduling
PAGE provides two types of scheduling schemes.
Static
• Each processor is responsible for regions of equal length.
• All map tasks must finish before the reduce tasks execute.
Dynamic
• Map and reduce tasks are assigned by a master process.
• Reduce tasks can start once enough intermediate results are available.
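A small simulation shows why the choice of scheme matters when equal-length regions have unequal processing costs (for example, because of coverage variance). This is a sketch under simplifying assumptions: it models map tasks only, the per-region costs are made up, and the function names are hypothetical.

```python
import heapq


def static_makespan(costs, workers):
    """Static: each worker gets a contiguous block with the same number of regions."""
    per_worker = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + per_worker])
             for i in range(0, len(costs), per_worker)]
    return max(loads)


def dynamic_makespan(costs, workers):
    """Dynamic: a master hands the next region to whichever worker is free first."""
    free_at = [0] * workers
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)      # earliest available worker
        heapq.heappush(free_at, start + cost)
    return max(free_at)


# Made-up per-region costs: two expensive regions followed by six cheap ones.
costs = [9, 8, 1, 1, 1, 1, 1, 1]

static_time = static_makespan(costs, workers=2)
dynamic_time = dynamic_makespan(costs, workers=2)
```

With these costs the static scheme puts both expensive regions on the same worker, while the dynamic scheme spreads them out and finishes sooner. Dynamic scheduling pays for this with master coordination, which the simulation does not model.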
Applications Developed Using PAGE
• We parallelized 4 applications
– VarScan: SNP detection
– Realigner Target Creator: detects insertions/deletions in alignment files
– Indel Realigner: applies local realignment to improve the quality of alignment files
– Unified Genotyper: SNP detection
Sample Application Development with PAGE
• Serial execution command of the VarScan software
– samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
• To parallelize VarScan with PAGE, the user needs to define:
– Genome Partition: By-Locus
– Scheduling Scheme: Dynamic (or Static)
– Execution Model: Inter-dependent
– Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp > outputloc
– Reduction: the cat bash shell command
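The substitution of `regionloc` and `outputloc` can be sketched as plain command-string construction. This is not PAGE's actual interface: the helper names, region strings, and output file names below are hypothetical, and the commands are only built, not executed.

```python
import shlex


def map_command(file_list, reference, region, output_path):
    """Fill regionloc and outputloc into the map command from the slide."""
    return (
        f"samtools mpileup -b {shlex.quote(file_list)} -r {shlex.quote(region)} "
        f"-f {shlex.quote(reference)} | java -jar VarScan.jar mpileup2snp "
        f"> {shlex.quote(output_path)}"
    )


def reduce_command(output_paths, final_output):
    """Reduction with cat: concatenate per-region outputs in region order."""
    parts = " ".join(shlex.quote(p) for p in output_paths)
    return f"cat {parts} > {shlex.quote(final_output)}"


# Hypothetical regions and file names for illustration.
regions = ["1:1-1000", "1:1001-2000"]
outputs = [f"out_{i}.txt" for i in range(len(regions))]

map_cmds = [map_command("file_list", "reference", r, o)
            for r, o in zip(regions, outputs)]
reduce_cmd = reduce_command(outputs, "all_snps.txt")
```

Each string in `map_cmds` is one map task for one region; `reduce_cmd` joins the per-region results, which works because the map outputs for adjacent regions simply concatenate.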
Experiments
• Experimental setup
– In our cluster, each node has
  • 12 GB memory
  • 8 cores (2.53 GHz)
– We obtained the data from the 1000 Genomes Project
– We evaluated PAGE with 4 applications
– We compared PAGE with Hadoop Streaming and GATK
Comparison with GATK
- Indel Realigner tool of GATK
[Figure, two panels: Scalability (data size: 11 GB) and Data Size Impact (# of cores: 128). Speedup labels: 3.3x (GATK), 9x (PAGE)]
Comparison with GATK
- Unified Genotyper tool of GATK
[Figure, two panels: Scalability (data size: 34 GB) and Data Size Impact (# of cores: 128). Speedup labels: 10.9x (GATK), 12.8x (PAGE)]
Comparison with Hadoop Streaming
- VarScan Application
[Figure, two panels: Scalability (data size: 52 GB) and Data Size Impact (# of cores: 128). Speedup labels: 6.9x (Hadoop Streaming), 12.7x (PAGE)]
Summary of Experimental Results
Speedup when the computing power is increased by 16 times

                  Indel Realigner  Unified Genotyper  VarScan  Realigner Target Creator
PAGE              9x               12.8x              12.7x    14.1x
GATK              3.3x             10.9x              -        -
Hadoop Streaming  -                -                  6.9x     -
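One way to read the speedups reported above is as parallel efficiency: speedup divided by the 16x increase in computing power. The sketch below does that arithmetic; it assumes the 16x increase is the right denominator, and the dictionary layout and function name are illustrative.

```python
SCALE_UP = 16  # increase in computing power stated on the slide

# Speedups as reported; missing combinations are simply omitted.
speedups = {
    "PAGE": {
        "Indel Realigner": 9.0,
        "Unified Genotyper": 12.8,
        "VarScan": 12.7,
        "Realigner Target Creator": 14.1,
    },
    "GATK": {"Indel Realigner": 3.3, "Unified Genotyper": 10.9},
    "Hadoop Streaming": {"VarScan": 6.9},
}


def efficiency(speedup, scale_up=SCALE_UP):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup / scale_up


efficiencies = {
    system: {app: efficiency(s) for app, s in apps.items()}
    for system, apps in speedups.items()
}

for system, apps in efficiencies.items():
    for app, eff in apps.items():
        print(f"{system:17s} {app:25s} {eff:.0%}")
```

An efficiency of 100% would mean perfectly linear scaling.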
Conclusion
• We developed a middleware that
– Easily parallelizes genomic applications
– Has high applicability
  • No restriction on programming language or data format
  • Allows the use of existing applications
– Lets the user control the parallel execution while hiding the details
  • Alternative scheduling schemes, execution models and data partitioning types
– Scales well
Thank you for listening …
Questions