Hadoop for Bioinformatics: Building a Scalable Variant Store

Hadoop ecosystem for genomics. Uri Laserson, Mount Sinai School of Medicine, 29 October 2013


DESCRIPTION

Talk at Mount Sinai School of Medicine. Introduction to the Hadoop ecosystem, problems in bioinformatics data analytics, and a specific use case of building a genome variant store backed by Cloudera Impala.

TRANSCRIPT

Page 1: Hadoop for Bioinformatics: Building a Scalable Variant Store

1

Hadoop ecosystem for genomics
Uri Laserson
Mount Sinai School of Medicine
29 October 2013

Page 2: Hadoop for Bioinformatics: Building a Scalable Variant Store

2

Agenda

1. Hadoop overview
   • Historical context
   • Hadoop overview
   • Some sins in bioinformatics

2. Scalable variant store
   • Possible conventional solutions
   • Hadoop/Impala implementation

Page 3: Hadoop for Bioinformatics: Building a Scalable Variant Store

3

Historical Context

Page 4: Hadoop for Bioinformatics: Building a Scalable Variant Store

4

1999!

Page 5: Hadoop for Bioinformatics: Building a Scalable Variant Store

5

Indexing the Web

• Web is huge
   • Hundreds of millions of pages in 1999

• How do you index it?
   • Crawl all the pages
   • Rank pages based on relevance metrics
   • Build search index of keywords to pages
   • Do it in real time!

Page 6: Hadoop for Bioinformatics: Building a Scalable Variant Store

6

Page 7: Hadoop for Bioinformatics: Building a Scalable Variant Store

7

Databases in 1999

1. Buy a really big machine
2. Install expensive DBMS on it
3. Point your workload at it
4. Hope it doesn’t fail
5. Ambitious: buy another big machine as backup

Page 8: Hadoop for Bioinformatics: Building a Scalable Variant Store

8

Page 9: Hadoop for Bioinformatics: Building a Scalable Variant Store

9

Database Limitations

• Didn’t scale horizontally
• High marginal cost ($$$)
• No real fault-tolerance story
• Vendor lock-in ($$$)
• SQL unsuited for search ranking
   • Complex analysis (PageRank)
   • Unstructured data

Page 10: Hadoop for Bioinformatics: Building a Scalable Variant Store

10

Page 11: Hadoop for Bioinformatics: Building a Scalable Variant Store

11

Google does something different

• Designed their own storage and processing infrastructure

• Google File System (GFS) and MapReduce (MR)
• Goals: KISS
   • Cheap
   • Scalable
   • Reliable

Page 12: Hadoop for Bioinformatics: Building a Scalable Variant Store

12

Google does something different

• It worked!
• Powered Google Search for many years
• General framework for large-scale batch computation tasks
• Still used internally at Google to this day

Page 13: Hadoop for Bioinformatics: Building a Scalable Variant Store

13

Google benevolent enough to publish

2003 (GFS paper)    2004 (MapReduce paper)

Page 14: Hadoop for Bioinformatics: Building a Scalable Variant Store

14

Birth of Hadoop at Yahoo!

• 2004-2006: Doug Cutting and Mike Cafarella implement GFS/MR.

• 2006: Spun out as Apache Hadoop
   • Named after Doug’s son’s yellow stuffed elephant

Page 15: Hadoop for Bioinformatics: Building a Scalable Variant Store

15

Open-source proliferation

Google            Open-source        Function
GFS               HDFS               Distributed file system
MapReduce         MapReduce          Batch distributed data processing
Bigtable          HBase              Distributed DB/key-value store
Protobuf/Stubby   Thrift or Avro     Data serialization/RPC
Pregel            Giraph             Distributed graph processing
Dremel/F1         Cloudera Impala    Scalable interactive SQL (MPP)
FlumeJava         Crunch             Abstracted data pipelines on Hadoop

Page 16: Hadoop for Bioinformatics: Building a Scalable Variant Store

16

Overview of core technology

Page 17: Hadoop for Bioinformatics: Building a Scalable Variant Store

17

HDFS design assumptions

• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
   • Massive scale means failures very likely
   • Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only

Page 18: Hadoop for Bioinformatics: Building a Scalable Variant Store

18

HDFS properties

• Fault-tolerant
   • Gracefully responds to node/disk/network failures
• Horizontally scalable
   • Low marginal cost
• High-bandwidth

[Figure: HDFS storage distribution. An input file is split into blocks 1-5, and each block is replicated three times across Nodes A-E.]
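A small sketch of programmatic access to such a cluster over WebHDFS, HDFS's REST interface (the hostname and path are placeholders, not from the talk):

# List files under an HDFS directory via WebHDFS; each FileStatus
# entry reports the file's length and replication factor.
import requests

r = requests.get('http://namenode.example.com:50070/webhdfs/v1/data/variants',
                 params={'op': 'LISTSTATUS'})
for f in r.json()['FileStatuses']['FileStatus']:
    print(f['pathSuffix'], f['length'], f['replication'])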

Page 19: Hadoop for Bioinformatics: Building a Scalable Variant Store

19

MapReduce computation

Page 20: Hadoop for Bioinformatics: Building a Scalable Variant Store

20

MapReduce computation

• Structured as:
   1. Embarrassingly parallel “map” stage
   2. Cluster-wide distributed sort (“shuffle”)
   3. Aggregation “reduce” stage
• Data locality: process the data where it is stored
• Fault tolerance: failed tasks are automatically detected and restarted
• Schema-on-read: data need not be stored conforming to a rigid schema

Page 21: Hadoop for Bioinformatics: Building a Scalable Variant Store

21

WordCount example
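The original slide image is not preserved in the transcript. A minimal sketch of WordCount in the Hadoop Streaming style (an assumption; the slide's actual code is unknown):

#!/usr/bin/env python
# mapper.py: the "map" stage, run in parallel over input splits.
# Emits a (word, 1) pair for every word seen.
import sys

for line in sys.stdin:
    for word in line.split():
        sys.stdout.write('%s\t1\n' % word)

#!/usr/bin/env python
# reducer.py: the "reduce" stage. The shuffle delivers input sorted by
# key, so all counts for a given word arrive adjacent to each other.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t')
    if word != current:
        if current is not None:
            sys.stdout.write('%s\t%d\n' % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    sys.stdout.write('%s\t%d\n' % (current, total))

These would be submitted with the hadoop-streaming jar, passing mapper.py and reducer.py as the -mapper and -reducer arguments.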

Page 22: Hadoop for Bioinformatics: Building a Scalable Variant Store

22

Cloudera Hadoop Stack

Page 23: Hadoop for Bioinformatics: Building a Scalable Variant Store

23

Cloudera Hadoop Stack

Page 24: Hadoop for Bioinformatics: Building a Scalable Variant Store

24

Cloudera Hadoop Stack

Page 25: Hadoop for Bioinformatics: Building a Scalable Variant Store

25

Cloudera Hadoop Stack

Storm (stream processing) • Spark (distributed memory) • GraphLab (graph computation)

Page 26: Hadoop for Bioinformatics: Building a Scalable Variant Store

26

Cloudera Impala

Modern MPP database built on top of HDFS

Designed for interactive queries on terabyte-scale data sets.

Page 27: Hadoop for Bioinformatics: Building a Scalable Variant Store

27

Cloudera Search

• Interactive search queries on top of HDFS (see the query sketch below)
• Built on Solr and SolrCloud
• Near-realtime indexing of new documents
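A minimal sketch of such a search query against a Solr collection over HTTP (host and collection name are placeholders, not from the talk):

# Query a Solr collection and report the number of matching documents.
import requests

r = requests.get('http://search.example.com:8983/solr/variants/select',
                 params={'q': 'gene:BRCA1', 'wt': 'json'})
print(r.json()['response']['numFound'])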

Page 28: Hadoop for Bioinformatics: Building a Scalable Variant Store

28

Serialization/RPC formats

• Specify schemas/services in user-friendly IDLs
• Code generation to multiple languages (wire-compatible/portable)
• Compact, binary formats
• Natural support for schema evolution
• Multiple implementations:
   • Apache Thrift, Apache Avro, Google’s Protocol Buffers

Page 29: Hadoop for Bioinformatics: Building a Scalable Variant Store

29

Serialization/RPC formats

service Twitter {
  void ping();
  bool postTweet(1: Tweet tweet);
  TweetSearchResult searchTweets(1: string query);
}

struct Tweet {
  1: required i32 userId;
  2: required string userName;
  3: required string text;
  4: optional Location loc;
  16: optional string language = "english";
}

Page 30: Hadoop for Bioinformatics: Building a Scalable Variant Store

30

Serialization/RPC formats

struct Observation {
  // can be a general contig too
  1: required string chromosome,
  // python-style 0-based slicing
  2: required i64 start,
  3: required i64 end,
  // unique identifier for the data set
  // (like a UCSC genome browser track)
  4: required string track,
  // these are likely derived from the
  // track; separated for convenient joins
  5: optional string experiment,
  6: optional string sample,
  // one of these should be non-null,
  // depending on the type of data
  7: optional string valueStr,
  8: optional i64 valueInt,
  9: optional double valueDouble
}
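A short sketch of using the generated Python bindings, assuming this IDL lives in a file named observation.thrift and was compiled with `thrift --gen py` (the file name and field values are illustrative):

# Construct a record with the generated class; optional fields may be omitted.
from observation.ttypes import Observation

obs = Observation(chromosome='20', start=14369, end=14370,
                  track='1000genomes', sample='NA00001', valueStr='G/A')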

Page 31: Hadoop for Bioinformatics: Building a Scalable Variant Store

31

Parquet format

Row-major format

Page 32: Hadoop for Bioinformatics: Building a Scalable Variant Store

32

Parquet format

Column-major format
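A toy illustration (not from the slides) of the same records laid out both ways:

# Three records, each with (chrom, pos, ref) fields.
records = [('chr1', 100, 'A'), ('chr1', 200, 'G'), ('chr2', 150, 'T')]

# Row-major: whole records stored one after another.
row_major = records

# Column-major: one contiguous array per field. A query that only needs
# `pos` scans a single column instead of touching every record.
chrom, pos, ref = zip(*records)
print(pos)   # (100, 200, 150)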

Page 33: Hadoop for Bioinformatics: Building a Scalable Variant Store

33

Parquet format advantages

• Columnar format
   • Read fewer bytes
   • Compression more efficient (incl. dictionary encodings)
• Thrift/Avro/Protobuf-compatible data model
   • Support for nested data structures
• Binary encodings
• Hadoop-friendly (“splittable”; implemented in Java)
• Predicate pushdown (see the sketch after this list)
• http://parquet.io/
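A sketch of column pruning and predicate pushdown using the modern pyarrow library (which postdates this talk; the file name is hypothetical):

# Read only two columns, and let Parquet statistics skip row groups
# whose values cannot match the filter.
import pyarrow.parquet as pq

table = pq.read_table('variants.parquet',
                      columns=['vcf_chrom', 'vcf_pos'],
                      filters=[('vcf_chrom', '=', '20')])
print(table.num_rows)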

Page 34: Hadoop for Bioinformatics: Building a Scalable Variant Store

34

Query Times on TPCDS Queries

[Chart: query times in seconds on TPC-DS queries Q27-Q96, comparing Text, SequenceFile w/ Snappy, RCFile w/ Snappy, and Parquet w/ Snappy file formats]

Page 35: Hadoop for Bioinformatics: Building a Scalable Variant Store

35

Core paradigm shifts with Hadoop

Colocation of storage and compute

Fault tolerance with cheap hardware

Page 36: Hadoop for Bioinformatics: Building a Scalable Variant Store

36

Benefits of Hadoop ecosystem

• Inexpensive commodity compute/storage
   • Tolerates random hardware failure
• Decreased need for high-bandwidth network pipes
   • Co-locate compute and storage
   • Exploit data locality
• Simple horizontal scalability by adding nodes
   • MapReduce jobs effectively guaranteed to scale
• Fault-tolerance/replication built in; data is durable
• Large ecosystem of tools
• Flexible data storage; schema-on-read; unstructured data

Page 37: Hadoop for Bioinformatics: Building a Scalable Variant Store

37

Some sins in genomics data infrastructure

Page 38: Hadoop for Bioinformatics: Building a Scalable Variant Store

38

HPC separates compute from storage

Storage infrastructure:
• Proprietary, distributed file system
• Expensive

Compute cluster:
• High-performance hardware
• Low failure rate
• Expensive

Connected by a big network pipe ($$$)

User typically works by manually submitting jobs to a scheduler (e.g., LSF, Grid Engine, etc.)

HPC is about compute. Hadoop is about data.

Page 39: Hadoop for Bioinformatics: Building a Scalable Variant Store

39

Hadoop colocates compute and storage

Compute cluster and storage infrastructure are one and the same:
• Commodity hardware
• Data locality
• Reduced networking needs

HPC is about compute. Hadoop is about data.

Page 40: Hadoop for Bioinformatics: Building a Scalable Variant Store

40

HPC is lower-level than Hadoop

• HPC only exposes job scheduling
• Parallelization typically occurs through MPI
   • Very low-level communication primitives
   • Difficult to horizontally scale by simply adding nodes
• Large data sets must be manually split
• Failures must be dealt with manually
• Hadoop has fault tolerance, data locality, and horizontal scalability

Page 41: Hadoop for Bioinformatics: Building a Scalable Variant Store

41

File system as DB; text file as lowest common denominator

• Broad joint caller with 25k genomes hits file handle limits

• Files streamed over network (HPC architecture)
• Large files split manually
• Sharing data/collaborating involves copying large files

Page 42: Hadoop for Bioinformatics: Building a Scalable Variant Store

42

Job scheduler as workflow tool

• Submitting jobs to a scheduler is very low level
• Workflow engines/execution models provide high-level execution graphs with fault tolerance
   • e.g., MapReduce, Oozie, Spark, Luigi, Crunch, Cascading, Pig, Hive

Page 43: Hadoop for Bioinformatics: Building a Scalable Variant Store

43

Poor security/access models

• Deal with a complex set of constraints from a variety of consents/redactions
   • Certain individuals redact certain parts of their genomes
   • Certain samples can only be used as controls for particular studies
   • Different research groups want to control access to the data they generate
   • Clinical trial data must have more rigorous access restrictions

Page 44: Hadoop for Bioinformatics: Building a Scalable Variant Store

44

Treating computation as free

• Many institutions make large clusters available for “free” to the average researcher

• Focus of dropping sequencing costs has been on the biochemistry

Page 45: Hadoop for Bioinformatics: Building a Scalable Variant Store

45

Treating computation as free

Stein, L. D. The case for cloud computing in genome informatics. Genome Biol (2010).

Page 46: Hadoop for Bioinformatics: Building a Scalable Variant Store

46

Treating computation as free

Sboner et al. “The real cost of sequencing: higher than you think”. Genome Biology (2011).

Page 47: Hadoop for Bioinformatics: Building a Scalable Variant Store

47

Lack of benchmarks for tracking progress

• Need to benchmark whether the quality of methods is improving

http://www.nist.gov/mml/bbd/ppgenomeinabottle2.cfm

Page 48: Hadoop for Bioinformatics: Building a Scalable Variant Store

48

Lack of benchmarks for tracking progress

Bradnam et al. “Assemblathon 2”, Gigascience 2, 10 (2013).

Page 49: Hadoop for Bioinformatics: Building a Scalable Variant Store

49

Academic code

“…people in my lab have requested code from authors and received source code with syntax errors in it” [3]

Most developers self-taught. Only one-third think formal training is important. [1, 2]

[1] Haussler et al. “A Million Cancer Genome Warehouse” (2012)
[2] Hannay et al. “How do scientists develop and use scientific software?” (2009)
[3] http://ivory.idyll.org/blog/on-code-review-of-scientific-code.html

Unreproducible, unbuildable, undocumented, unmaintained, unavailable, backward-incompatible, shitty code

Page 50: Hadoop for Bioinformatics: Building a Scalable Variant Store

50

Fundamentally a barrier to scaling.

Page 51: Hadoop for Bioinformatics: Building a Scalable Variant Store

51

Page 52: Hadoop for Bioinformatics: Building a Scalable Variant Store

52

NCBI Sequence Read Archive (SRA)

Today… 1.14 petabytes

One year ago… 609 terabytes

Page 53: Hadoop for Bioinformatics: Building a Scalable Variant Store

53

Every ‘ome has a -seq

Genome: DNA-seq
Transcriptome: RNA-seq, FRT-seq, NET-seq
Methylome: Bisulfite-seq
Immunome: Immune-seq
Proteome: PhIP-seq, Bind-n-seq

Page 54: Hadoop for Bioinformatics: Building a Scalable Variant Store

54

Prescriptions for the future

Page 55: Hadoop for Bioinformatics: Building a Scalable Variant Store

55

Move to Hadoop-style environment

• Data centralization on HDFS
• Data-local execution to avoid moving terabytes
• Higher-level execution engines to abstract computations away from details of execution
• Hadoop-friendly, evolvable serialization formats for:
   • Storage and compute efficiency
   • Abstracting the data model from data storage details
• Built-in horizontal scalability and fault tolerance

Page 56: Hadoop for Bioinformatics: Building a Scalable Variant Store

56

APIs instead of file formats

• Service-oriented architectures ensure stable contracts
• Allows for implementation changes with new technologies
• The software community has lots of experience with this type of architecture, along with mature tools
• Can be implemented as language-independent (see the sketch below)
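A sketch of what a language-independent API call might look like (the endpoint shape is purely hypothetical; the talk does not specify one):

# Any client that speaks HTTP/JSON can consume the service, regardless
# of how the storage backend is implemented.
import requests

resp = requests.get('http://varstore.example.com/v1/variants',
                    params={'chrom': '20', 'start': 14000, 'end': 18000})
for v in resp.json():
    print(v['pos'], v['ref'], v['alt'])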

Page 57: Hadoop for Bioinformatics: Building a Scalable Variant Store

57

High-granularity access/common consent

1. Use technologies with highly granular access control
   • e.g., Apache Accumulo, cell-based access control
2. Create common consents for patients to “donate” their data to research
   • e.g., Personal Genome Project, SAGE Portable Legal Consent, NCI “information donor”

Page 58: Hadoop for Bioinformatics: Building a Scalable Variant Store

58

Tools for open-source/reproducibility

• Software and computations should be open-sourced, e.g., on GitHub
• Release VMs or IPython notebooks with publications
   • “Executable paper” to generate figures
   • Allow others to easily recompute all analyses

Page 59: Hadoop for Bioinformatics: Building a Scalable Variant Store

59

Building scalable variant store

Page 60: Hadoop for Bioinformatics: Building a Scalable Variant Store

60

Genomics ETL

biochemistry → .fastq → short read alignment → .bam → genotype calling → .vcf → analysis

• Short read alignment is embarrassingly parallel
• Pileup/variant calling requires a distributed sort
• GATK is a reimplementation of MapReduce; could run on Hadoop
• Early Hadoop tools:
   • Crossbow: short read alignment/variant calling
   • Hadoop-BAM: distributed bamtools
   • BioPig: manipulating large fasta/q
   • Contrail: de novo assembly

Page 61: Hadoop for Bioinformatics: Building a Scalable Variant Store

61

Genomics ETL

GATK best practices

Page 62: Hadoop for Bioinformatics: Building a Scalable Variant Store

62

ADAM

Page 63: Hadoop for Bioinformatics: Building a Scalable Variant Store

63

ADAM

• Defining an alternative to the BAM format that is:
   • Hadoop-friendly, splittable, designed for distributed computing
   • Built as Avro objects
   • Stored in Parquet format (columnar)
• Attempting to reimplement the GATK pipeline to run on Hadoop/Parquet
• Currently run out of the AMPLab at UC Berkeley

Page 64: Hadoop for Bioinformatics: Building a Scalable Variant Store

64

Genomics ETL

.fastq → short read alignment → .bam → genotype calling → .vcf → analysis

Page 65: Hadoop for Bioinformatics: Building a Scalable Variant Store

65

Querying large, integrated variant data

• Biotech client has thousands of genomes
• Want to expose ad hoc querying functionality at large scale
   • e.g., vcftools/PLINK-SEQ on terabyte-scale data sets
• Integrating data with public data sets (e.g., ENCODE, UCSC tracks, dbSNP, etc.)
   • Terabyte-scale annotation sets

Page 66: Hadoop for Bioinformatics: Building a Scalable Variant Store

66

Conventional approaches: manual

• Manually parsing flat files
• Write ad hoc scripts in perl or python
• Build data structures in memory for histograms/aggregations
• Custom script per query

import numpy as np
import vdj  # the author's library for immune repertoire data

counts_dict = {}
for chain in vdj.parse_VDJXML(inhandle):
    try:
        counts_dict[chain.junction] += 1
    except KeyError:
        counts_dict[chain.junction] = 1

for count in counts_dict.itervalues():
    print >>outhandle, np.int_(count)

Page 67: Hadoop for Bioinformatics: Building a Scalable Variant Store

67

Conventional approaches: database

• Very feature-rich
   • Common analytical tasks (e.g., joins, group-by, etc.)
   • Access control
   • Very mature
• Scalability issues
   • Indices can be prohibitive
   • RDBMS: schemas can be annoyingly rigid
   • NoSQL: adolescent implementations (but easy to get started)

Page 68: Hadoop for Bioinformatics: Building a Scalable Variant Store

68

Conventional approaches: domain-specific

• e.g., PLINK/SEQ
• Designed for specific use cases
• Workflows are highly opinionated/rigid
• Requires learning another language
• Scalability issues

Page 69: Hadoop for Bioinformatics: Building a Scalable Variant Store

69

Hadoop sol’n: storage

• Impala/Hive metastore provides a unified, flexible data model
• Define Avro types for all data
• Data stored in Parquet format to maximize compression and query performance

Page 70: Hadoop for Bioinformatics: Building a Scalable Variant Store

70

Hadoop sol’n: available analytics engines

• Analytical operations implemented by experts in distributed systems
• Impala implements RDBMS-style operations
• Search offers metadata indexing
• Spark offers in-memory processing for ML
• HDFS-based analytical engines designed for horizontal scalability

Page 71: Hadoop for Bioinformatics: Building a Scalable Variant Store

71

Variant store architecture

[Architecture diagram: .vcf files and external annotations (.csv) are loaded via ETL into .parquet files governed by an Avro schema; the Hive metastore and Impala query engine sit on top, serving queries and results through the Impala shell, JDBC, a Thrift service, and a REST API]

Page 72: Hadoop for Bioinformatics: Building a Scalable Variant Store

72

Example schema

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

Page 73: Hadoop for Bioinformatics: Building a Scalable Variant Store

73

Example schema

(Same VCF example as on Page 72.)

Page 74: Hadoop for Bioinformatics: Building a Scalable Variant Store

74

Example schema

(Same VCF example as on Page 72.)

Page 75: Hadoop for Bioinformatics: Building a Scalable Variant Store

75

Why denormalization is good

• Replace joins with filters (toy sketch below)
   • For query engines with efficient scans, this simplifies queries and can improve performance
   • Parquet format supports predicate pushdown, reducing necessary I/O
• Because storage is cheap, amortize the cost of an up-front join over the simpler queries going forward
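A toy illustration (not from the talk) of paying for the join once at load time so that every later query is a filter:

# Hypothetical variant rows and an annotation lookup table.
variants = [
    {'chrom': '20', 'pos': 14370, 'sample': 'NA00001'},
    {'chrom': '20', 'pos': 17330, 'sample': 'NA00002'},
]
dbsnp = {('20', 14370): 'rs6054257'}

# One up-front "join": attach the annotation to every variant row.
for v in variants:
    v['dbsnp'] = dbsnp.get((v['chrom'], v['pos']))

# Later queries become simple filters over the denormalized rows.
novel = [v for v in variants if v['dbsnp'] is None]
print(novel)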

Page 76: Hadoop for Bioinformatics: Building a Scalable Variant Store

76

Example schema

{
  "name": "VCF",
  "type": "record",
  "fields": [
    {"type": "string", "name": "VCF_CHROM"},
    {"type": "int", "name": "VCF_POS"},
    {"type": "string", "name": "VCF_ID"},
    {"type": "string", "name": "VCF_REF"},
    {"type": "string", "name": "VCF_ALT"},
    ...
    {"default": null, "doc": "Genotype", "type": ["null", "string"], "name": "VCF_CALL_GT"},
    {"default": null, "doc": "Genotype Quality", "type": ["null", "int"], "name": "VCF_CALL_GQ"},
    {"default": null, "doc": "Read Depth", "type": ["null", "int"], "name": "VCF_CALL_DP"},
    {"default": [], "doc": "Haplotype Quality", "type": {"type": "array", "items": "int"}, "name": "VCF_CALL_HQ"}
  ]
}
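A minimal sketch of writing records with this schema using the Apache Avro Python library (Python 2-era API; file names are hypothetical, and the elided fields are assumed to have defaults):

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Parse the schema and append one denormalized genotype-call record.
schema = avro.schema.parse(open('vcf.avsc').read())
writer = DataFileWriter(open('variants.avro', 'wb'), DatumWriter(), schema)
writer.append({'VCF_CHROM': '20', 'VCF_POS': 14370, 'VCF_ID': 'rs6054257',
               'VCF_REF': 'G', 'VCF_ALT': 'A', 'VCF_CALL_GT': '0|0',
               'VCF_CALL_GQ': 48, 'VCF_CALL_DP': 1, 'VCF_CALL_HQ': [51, 51]})
writer.close()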

Page 77: Hadoop for Bioinformatics: Building a Scalable Variant Store

77

Example variant-filtering query

• “Give me all SNPs that are:
   • on chromosome 5
   • absent from dbSNP
   • present in COSMIC
   • observed in breast cancer samples
   • absent from prostate cancer samples”
• On the full 1000 Genomes data set (~37 billion variants), the query finishes in a couple of seconds

Page 78: Hadoop for Bioinformatics: Building a Scalable Variant Store

78

Example variant-filtering query

SELECT
  cosmic AS snp_id,
  vcf_chrom AS chr,
  vcf_pos AS pos,
  sample_id AS sample,
  vcf_call_gt AS genotype,
  sample_affection AS phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE cosmic IS NOT NULL
  AND dbsnp IS NULL
  AND sample_study = "breast_cancer"
  AND vcf_chrom = "16";
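One way to run such a query from Python is Cloudera's impyla client (an assumption; the talk does not name a client, and the hostname is a placeholder):

# Impala speaks the standard Python DB-API through impyla.
from impala.dbapi import connect

conn = connect(host='impalad.example.com', port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT cosmic AS snp_id, vcf_chrom AS chr, vcf_pos AS pos,
           sample_id AS sample, vcf_call_gt AS genotype
    FROM hg19_parquet_snappy_join_cached_partitioned
    WHERE cosmic IS NOT NULL AND dbsnp IS NULL
      AND sample_study = 'breast_cancer' AND vcf_chrom = '16'
""")
for row in cur.fetchall():
    print(row)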

Page 79: Hadoop for Bioinformatics: Building a Scalable Variant Store

79

Impala execution

• Query compiled into an execution tree, chopped up across all nodes (if possible)
• Two join implementations:
   1. Broadcast: each node gets a copy of the full right table
   2. Shuffle: both sides of the join are partitioned
• Partitioned tables vastly reduce the amount of I/O
• File formats make an enormous difference in query performance

Page 80: Hadoop for Bioinformatics: Building a Scalable Variant Store

80

Other desirable query examples

• “How do the mutations in a given subject compare to the mutations in other phenotypically similar subjects?”
• “For a given gene, in what pathways and cancer subtypes is it involved?” (connecting phenotypes to annotations)
• “How common is an observed set of mutations?”
• “For a given type of cancer, what are the characteristic disruptions?”

Page 81: Hadoop for Bioinformatics: Building a Scalable Variant Store

81

Types of queries desired

• Lots of these queries can be translated simply into SQL
• Similar to functionality provided by PLINK/SEQ, but designed to scale to much larger data sets

Page 82: Hadoop for Bioinformatics: Building a Scalable Variant Store

82

All-vs-all eQTL

• Possible to generate trillions of hypothesis tests
   • 10^7 loci × 10^4 phenotypes × 10s of tissues = 10^12 p-values
   • Tested below on 120 billion associations
• Example queries:
   • “Given 5 genes of interest, find the top 20 most significant eQTLs (cis and/or trans)”
      • Finishes in several seconds
   • “Find all cis-eQTLs across the entire genome”
      • Finishes in a couple of minutes
      • Limited by disk throughput

Page 83: Hadoop for Bioinformatics: Building a Scalable Variant Store

83

All-vs-all eQTL

• “Find all SNPs that are:
   • in LD with some lead SNP or eQTL of interest
   • aligned with some functional annotation of interest”
• Still in testing, but likely finishes in seconds

Schaub et al., Genome Research, 2012

Page 84: Hadoop for Bioinformatics: Building a Scalable Variant Store

84

Conclusions

• Hadoop ecosystem provides a centralized, scalable repository for data
• An abundance of tools for providing views/analytics into the data store
• Separate implementation details from data pipelines
• Software quality, data structures, and file formats matter
• Genomics has much to gain from moving away from HPC architecture toward Hadoop ecosystem architecture

Page 85: Hadoop for Bioinformatics: Building a Scalable Variant Store

85

Cloud-based implementation

• Hadoop-ecosystem architecture translates easily to the cloud (AWS, OpenStack)
• Provides elastic capacity; no large initial CAPEX
• Risk of vendor lock-in once the data set is large
• Allows simple sharing of data, e.g., via public S3 buckets

Page 86: Hadoop for Bioinformatics: Building a Scalable Variant Store

86

Future work

• Broad Institute has experimented with Google’s BigQuery for a variant store
   • BigQuery is Google’s Dremel exposed to the public on Google’s cloud
   • Closed-source; only on Google’s cloud
   • Broad developed an API for working with variant data
• Will soon develop an Impala-backed implementation of the Broad API
   • To be open-sourced

Page 87: Hadoop for Bioinformatics: Building a Scalable Variant Store

87

Future work

• Drive towards several large data warehouses; storage backends optimized for particular access patterns
• Each can expose one or more APIs for different applications/access levels

• Haussler, D. et al. A Million Cancer Genome Warehouse (2012). Tech report.

Page 88: Hadoop for Bioinformatics: Building a Scalable Variant Store

88

Acknowledgements

Cloudera: Josh Wills, Jeff Hammerbacher, Impala team (Nong Li), Sandy Ryza

Julien Le Dem (Twitter)

Our biotech client

Mike Schatz (CSHL), Matt Massie

Page 89: Hadoop for Bioinformatics: Building a Scalable Variant Store

89