o connor bosc2010
TRANSCRIPT
![Page 1: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/1.jpg)
SeqWare Query EngineStoring & Searching Sequence Data in the Cloud
Brian O'ConnorUNC Lineberger Comprehensive
Cancer Center
BOSCJuly 9th, 2010
![Page 2: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/2.jpg)
SeqWare Query Engine
● Want to ask simple questions: ● “What SNVs are in 5'UTR of phosphatases?”● “What frameshift indels affect PTEN?”● “What genes include homozygous, non-
synonymous SNVs?”● SeqWare Query Engine created to query data
● RESTful Webservice● Scalable/Queryable Backend
![Page 3: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/3.jpg)
Variant Annotation with SeqWareWhole Genome/Exome
pileup
Alignment
Variant,Coverage, &
Consequence
dbSNP
Consequence
SeqWare Pipeline
SeqWare QueryEngine WebserviceBAM
VariantCalling
SeqWare QueryEngine
SeqWare QueryEngine Backend
HBase orBerkeley DBStores
RESTlet
BackendInterface
WIG BED
![Page 4: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/4.jpg)
Webservice Interfaces
track name="SeqWare BED Mon Oct 26 22:16:18 PDT 2009" chr10 89675294 89675295 G->T(24:23:95.8%[F:18:75.0%|R:5:20.8%])
SeqWare Query Engine includes a RESTful web service that returns XML describing variant DBs, a web form for querying, and returns BED/WIG data available via a well-defined URL
http://server.ucla.edu/seqware/queryengine/realtime/variants/mismatches/1?format=bed&filter.tag=PTEN
orRESTful XML Client API HTML Forms
![Page 5: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/5.jpg)
Loading in Genome BrowsersSeqWare Query Engine URLs can be directly loaded into IGV & UCSC genome browsers
![Page 6: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/6.jpg)
Requirements for QueryEngine Backend
The backend must:– Represent many types of data– Support a rich level of annotation – Support very large variant databases
(~3 billion rows x thousands of columns)– Be distributed across a cluster– Support processing, annotating, querying &
comparing samples (variants, coverage, annotations)
– Support a crazy growth of data
![Page 7: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/7.jpg)
Increase in Sequencer OutputNelson Lab - UCLA
08/10/06 02/26/07 09/14/07 04/01/08 10/18/08 05/06/09 11/22/09 06/10/10 12/27/10
10000000
100000000
1000000000
10000000000
100000000000
Illumina Sequencer Ouput
Sequence File Sizes Per Lane
Date
File
Siz
e (
Byt
es)
Log scale
Suggests Sequencer OutputIncreases by 5-10x Every 2 Years!
Far outpacing hard drive,CPU, and bandwidth growth
![Page 8: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/8.jpg)
HBase to the Rescue?● Billions of rows x millions of columns!● Focus on random access (vs. HDFS)● Table is column oriented, sparse matrix● Versioning (timestamps) built in● Flexible storage of different data types● Splits DB across many nodes transparently● Locality of data, I can run map/reduce jobs that
process the table rows present on a given node● 22M variants processed <1 minute on 5 node cluster
![Page 9: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/9.jpg)
Underlying HBase Tables
chr15:00000123454
key
byte[]
variant:genome4
byte[]
variant:genome7
byte[]
coverage:genome7hg18Table
Variant object byte array
Database on filesystem (HDFS)
family label
chr15:00000123454
key
t1
timestamp
genome7
column:variant
byte[]
is_dbSNP|hg18.chr15.00000123454.variant.genome4.SNV.A.G.v123
key
byte[]
rowId:Genome1102NTagIndexTable
queries look up by tag thenfilter the variant results
![Page 10: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/10.jpg)
HBase API & Map/Reduce Querying● HBase API
● Powers the current Backend & Webservice● Provides a familiar API, scanners, iterators, etc● Backend written using this, retrieve variants by tags● Distributed database but single thread using API
● Prototype somatic mutations by Map/Reduce● Every row is examined, variants in tumor not in
normal are retrieved● Map/Reduce jobs run on node with local data● Highly parallel & faster than API with single thread
![Page 11: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/11.jpg)
SeqWare Query Engine on HBase
DataNode
DataNode
HBase on HDFSVariant & CoverageDatabase System
Analysis/Web Nodes
Analysis/Web Nodes
Analysis/Web Nodes
Querying &LoadingNodes processqueries via API
RESTfulWeb Service
Backend
MetaDB
Webservice combinesVariant/Coverage datawith metadata
BED/WIGFiles
XMLMetadata
clients
NameNode
ETL Map Job
ETL Map Job
ETLReduce Job
ETLjobs extract,transform, &/orload in parallel
Webservice
MapReduce HBase API
![Page 12: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/12.jpg)
Status of HBase Backend
● Both BerkeleyDB & HBase, Relational soon● Multiple genomes stored in the same table,
very Map/Reduce compatible● Basic secondary indexing for “tags”● API used for queries via Webservice● Prototype Map/Reduce example for “somatic”
mutation detection in paired normal/cancer samples
● Currently loading 1102 normal/tumor (GBM)
![Page 13: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/13.jpg)
Backend Performance Comparison
0 2000 4000 6000 8000 10000 12000 14000 16000 18000
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
8000000
Pileup Load Time 1102N
HBase vs. Berkeley DB
load time bdbload time hbase
time (s)
vari
an
ts
![Page 14: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/14.jpg)
Backend Performance Comparison
0 1000 2000 3000 4000 5000 6000 7000
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
8000000
BED Export Time 1102NHBase API vs. M/R vs. BerkeleyDB
dump time bdbdump time hbasedump time m/r
time
vari
an
ts
![Page 15: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/15.jpg)
HBase/Hadoop Have Potential!
● Era of Big Data for Biology is here!● CPU bound problems no doubt but as short
reads become long reads and prices per GBase drop the problems seem to shift to handling/mining data
● Tools designed for Peta-scale datasets are key
![Page 16: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/16.jpg)
Next Steps● Model other datatypes: copy number, RNAseq
gene/exon/splice junction counts, isoforms etc
● Focus on porting analysis/querying to Map/Reduce
● Indexing beyond “tags” with Katta (distributed Lucene)
● Push scalability, what are the limits of an 8 node HBase/Hadoop cluster?
● Look at Cascading, Pig, Hive, etc as advanced workflow and data mining tools
● Standards for Webservice dialect (DAS?)
● Exposing Query Engine through GALAXY
![Page 17: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/17.jpg)
Acknowledgments
Jordan Mendler Michael Clark Hane Lee Bret Harry Stanley Nelson
Sara Grimm Matt Soloway Jianying Li Feri Zsuppan Neil Hayes Chuck Perou Derek Chiang
UCLA UNC
![Page 18: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/18.jpg)
Resources● Hbase & Hadoop: http://hadoop.apache.org ● When to use HBase:
http://blog.rapleaf.com/dev/?p=26● NOSQL presentations:
http://blog.oskarsson.nu/2009/06/nosql-debrief.html
● Other DBs: CouchDB, Hypertable, Cassandra, Project Voldemort, and more...
● Data mining tools: Pig and Hive● SeqWare: http://seqware.sourceforge.net● [email protected]
![Page 19: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/19.jpg)
Extra Slides
![Page 20: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/20.jpg)
Overview
● SeqWare Query Engine background● New tools for combating the data deluge● HBase/Hadoop in SeqWare Query Engine
● HBase for backend● Map/Reduce & HBase API for webservice
● Better performance and scalability?● Next steps
![Page 21: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/21.jpg)
SeqWare Query Engine:BerkeleyDB
GenomeDatabase
GenomeDatabase
GenomeDatabase
BerkeleyDBVariant & CoverageDatabases
Lustre Filesystem
Analysis/Web Nodes
Analysis/Web Nodes
Analysis/Web Nodes
Web/AnalysisNodes processqueries
RESTfulWeb Service
backend webservice
MetaDB
Webservice combinesVariant/Coverage datawith metadata
BED/WIGFiles
XMLMetadata
clients
![Page 22: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/22.jpg)
More
● Details on API vs. M/R● Details on XML Restful API & web app including
loading in UCSC browser● Details on generic store object (BerkeleyDB,
HBase, and Relational at Renci)● Byte serialization from BerkeleyDB, custom
secondary key creation
![Page 23: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/23.jpg)
Pressures of Sequencing
● A lot of data (50GB SRF file, 150GB alignment files, 60GB variants for a 20x human genome)
● PostgreSQL (2xquad core, 64GB RAM) died with the Celsius schema (microarray database) after loading ~1 billion rows
● Needs to be processed, annotated, and queryable & comparable (variants, coverage, annotations)
● ~3 billion rows x thousands of columns● COMBINE WITH PREVIOUS SLIDE
![Page 24: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/24.jpg)
Thoughts on BerkeleyDB
● BerkeleyDB let me:● Create a database per genome, independent from a
single database daemon● Provision database to cluster● Adapt to key-value database semantics
● Limitations:● Creation on single node only● Not inherently distributed● Performance issues with big DBs, high I/O wait
● Google to the rescue?
![Page 25: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/25.jpg)
HBase Backend
● How the table(s) are actually structured● Variants● Coverage● Etc
● How I do indexing currently (similar to indexing feature extension)● Multiple secondary indexes
![Page 26: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/26.jpg)
Frontend
● RESTlet API● What queries can you do?
● Examples● URLs
● Potential for swapping out generic M/R for many of these queries (less reliance on indexes which will speed things up as DB grows)
![Page 27: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/27.jpg)
Ideas for a distributed future
● Federated Dbs/datastores/clusters for computation rather than one giant datacenter
● Distribute software not data
![Page 28: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/28.jpg)
Potential Questions
● How big is the DB to store whole human genome?
● How long does it take to M/R 3 billion positions on 5 node cluster?
● How does my stuff compare to other bioinf software? GATK, Crossbow, etc
● How did I choose HBase instead of Pig, Hive, etc?
![Page 29: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/29.jpg)
Current Prototyping Work
● Validate creation of U87 (genome resequencing at 20x) genome database● SNVs● Coverage● Annotations
● Test fast querying of record subsets● Test fast processing of whole DB using
MapReduce● Test stability, fault-tolerance, auto-balancing,
and deployment issues along the way
![Page 30: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/30.jpg)
What About Fast Queries?
● I'm fairly convinced I can create a distributed HBase database on a Hadoop cluster
● I have a prototype HBase database running on two nodes
● But HBase shines when bulk processing DB● Big question is how to make individual lookups
fast● Possible solution is Hbase+Katta for indexes
(distributed Lucene)
![Page 31: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/31.jpg)
SeqWare Query Engine
GenomeDatabase
GenomeDatabase
GenomeDatabase
BerkeleyDBVariant & CoverageDatabases
Lustre Filesystem
Analysis/Web Nodes(8 CPU, 32GB RAM)
Analysis/Web Nodes(8 CPU, 32GB RAM)
Analysis/Web Nodes(8 CPU, 32GB RAM)
Web/AnalysisNodes processqueries
RESTfulWeb Service
backend webservice
AnnotationDatabase
Webservice combinesVariant/Coverage datawith annotations (hg18)
BED/WIGFiles
DASXML
clients
![Page 32: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/32.jpg)
How Do We Scale Up the QE?
● Sequencers are increasing output by a factor of 10 every two years!
● Hard drives: 4x every 2 years● CPUs: 2x every 2 years● Bandwidth: 2x every 2 years (really?!)● So there's a huge disconnect, can't just throw
more hardware at a single database server!● Must look for better ways to scale for
![Page 33: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/33.jpg)
Google to the Rescue?
● Companies like Google, Amazon, Facebook, etc have had to deal with massive scalability issues over the last 10+ years
● Solutions include:● Frameworks like MapReduce● Distributed file systems like HDFS● Distributed databases like HBase
● Focus here on HBase
![Page 34: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/34.jpg)
What Do You Give Up?
● SQL queries● Well defined schema, normalized data structure● Relationships manged by DB● Flexible and easy indexing of table columns● Existing tools that query a SQL database must
be re-written● Certain ACID aspects● Software maturity, most distributed NOSQL
projects are very new
![Page 35: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/35.jpg)
What Do You Gain?
● Scalability is the clear win, you can have many processes on many nodes hit the collection of database servers
● Ability to look at very large datasets and do complex computations across a cluster
● More flexibility in representing information now and in the future
● HBase includes data timestamps/versions● Integration with Hadoop
![Page 36: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/36.jpg)
SeqWare Query Engine
DataNode
DataNode
HBaseVariant & CoverageDatabase System
Hadoop HDFS
Analysis/Web Nodes
Analysis/Web Nodes
Analysis/Web Nodes
Web/AnalysisNodes processqueries
RESTfulWeb Service
backend webservice
AnnotationDatabase
Webservice combinesVariant/Coverage datawith annotations (hg18)
BED/WIGFiles
XMLMetadata
clients
NameNode
ETL Map Job
ETL Map Job
HadoopMap Reduce
ETLReduce Job
MapReducejobs extract,transform, &load in parallel
![Page 37: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/37.jpg)
What an HBase DB Looks Like
A Record in my HBase
chr15:00000123454
key
byte[]
variant:genome4
byte[]
variant:genome7
byte[]
coverage:genome7
A Record in my HBase
Variant object to byte array
Database on filesystem (HDFS)
family label
chr15:00000123454
key
t1
timestamp
genome7
column:variant
byte[]
![Page 38: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/38.jpg)
Scalability and BerkeleyDB
● BerkeleyDB let me:
● Create a database per genome, independent from a single database daemon
● Provision database to cluster for distributed analysis● Adapt to key-value database semantics with nice API
● Limitations:
● Creation on single node only● Want to query easily across genomes● Database are not distributed● I saw performance issues, high I/O wait
![Page 39: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/39.jpg)
Would 2,000 Genomes Kill SQL?● Say each genome has 5M variants (not counting
coverage!)● 5M variant rows x 2,000 genomes = 10 billion rows● Our DB server running PostgreSQL (2xquad core,
64GB RAM) died with the Celsius (Chado) schema after loading ~1 billion rows
● So maybe conservatively we would have issues with 150+ genomes
● That threshold is probably 1 year away with public datasets available via SRA, 1000 genomes, TCGA
![Page 40: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/40.jpg)
Related Projects
![Page 41: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/41.jpg)
My Abstract
● backend/frontend● Traverse and query with Map/Reduce● Java web service with RESTlet● Deployment on 8 node cluster
![Page 42: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/42.jpg)
Background on Problem
● Why abandon PostgreSQL/MySQL/SQL?● Experience with Celsius...
● What you give up● What you gain
![Page 43: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/43.jpg)
First Solution: BerkeleyDB
● Good:● key/value data store● Easy to use● Great for testing
● Bad:● Not performant for multiple genomes● Manual distribution across cluster● Annoying phobia of shared filesystems
![Page 44: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/44.jpg)
Sequencers vs. Information Technology
● Sequencers are increasing output by a factor of 10 every two years!
● Hard drives: 4x every 2 years● CPUs: 2x every 2 years● Bandwidth: 2x every 2 years (really?!)● So there's a huge disconnect, can't just throw
more hardware at a single database server!● Must look for better ways to scale
![Page 45: O connor bosc2010](https://reader034.vdocuments.site/reader034/viewer/2022050817/55599b99d8b42ac7648b5629/html5/thumbnails/45.jpg)
● What are we doing, what are the challenges. Big picture of the project (webservice, backend etc)
● How did people solve this problem before? How did I attempt to solve this problem? Where did it break down?
● “New” approach, looking to Google et al for scalability for big data problems
● What is Hbase/Hadoop & what do they provide?
● How did I adapt Hbase/hadoop to my problem?
● Specifics of implementation: overall flow, tables, query engine search (API), example M/R task
● Is this performant, does this scale? Can I get billionxmillion? Fast arbitrary retrieval?
● Next steps: more data types, focus on M/R for analytical tasks, focus on Katta for rich querying, push scalability w/ 8 nodes (test with genomes), look at Cascading & other tools for datamining