A Genome Sequence Analysis System Built with Hypertable

Doug Judd, CEO, Hypertable, Inc.

Upload: dataversity

Post on 14-Jul-2015


TRANSCRIPT

Page 1: A Genome Sequence Analysis System Built with Hypertable

A Genome Sequence Analysis System Built with Hypertable

Doug Judd
CEO, Hypertable, Inc.

Page 2: A Genome Sequence Analysis System Built with Hypertable

Application Development Team

UCSF-Abbott Viral Diagnostics and Discovery Center
  Director: Dr. Charles Chiu, M.D./Ph.D.
  http://vddc.ucsf.edu/

Helices Inc.
  Taylor Sittler, M.D.
  John Dennis
  Brad Miller, M.D.
  http://helic.es/

Page 3: A Genome Sequence Analysis System Built with Hypertable

What is Hypertable?

Modeled after Google’s Bigtable
Open Source (GPL v2)
Horizontally Scalable
High Performance Implementation (C++)
Thrift Interface for all popular languages (Java, PHP, Ruby, Python, Perl, etc.)
NoSQL
  No joins (not yet)
  No transactions (not yet)
Project Started in March 2007

Page 4: A Genome Sequence Analysis System Built with Hypertable

Hypertable Deployments

Page 5: A Genome Sequence Analysis System Built with Hypertable

Why NoSQL?

Page 6: A Genome Sequence Analysis System Built with Hypertable

Source: Nature 458, 719-724 (2009)

Page 7: A Genome Sequence Analysis System Built with Hypertable

Source: wired.com, February 2011

Page 8: A Genome Sequence Analysis System Built with Hypertable

Genomics 101

Page 9: A Genome Sequence Analysis System Built with Hypertable

Base Pair (aka “base”)

Two nucleotides on opposite complementary DNA or RNA strands, connected via hydrogen bonds
Double-stranded DNA/RNA is made up of base pairs
  adenine (A) pairs with thymine (T)
  guanine (G) pairs with cytosine (C)
Base-paired DNA sequence:
  ATCGATTGAGCTCTAGCG
  TAGCTAACTCGAGATCGC
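The pairing rules on this slide can be checked with a few lines of Python. This is only an illustrative sketch: the function name `complement` is made up for the example, and it pairs bases position by position as the slide does, without reversing the strand.

```python
# Position-by-position DNA complement: A<->T, G<->C (as on the slide).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base that pairs with each base of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


top = "ATCGATTGAGCTCTAGCG"
print(top)
print(complement(top))
```

Running it on the top strand from the slide reproduces the bottom strand.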

Page 10: A Genome Sequence Analysis System Built with Hypertable

Gene

Encodes info on how to make a protein
DNA or RNA sequence
Thousands to millions of base pairs long
Corresponds to various different biological traits
Human genome contains about 23,000 genes

Page 11: A Genome Sequence Analysis System Built with Hypertable

Biological Samples

Specimen taken from human or animal
  Nasal swabs
  Blood serum
  Diarrheal
  Cerebral spinal fluid
Sent to a sequencing company to process into DNA sequence information in digital format
Each sample will generate anywhere from 1M to 100M “reads”
A read is a short DNA sequence snippet of approximately 100 bases
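A rough sense of scale follows from the two figures on this slide (1M to 100M reads per sample, about 100 bases per read). The helper below is a back-of-the-envelope sketch; `bases_per_sample` is a made-up name and the fixed read length of 100 is the slide's approximation.

```python
def bases_per_sample(num_reads: int, read_length: int = 100) -> int:
    """Approximate number of bases in one sample's reads."""
    return num_reads * read_length


low = bases_per_sample(1_000_000)      # smallest sample on the slide
high = bases_per_sample(100_000_000)   # largest sample on the slide
print(f"{low:,} to {high:,} bases per sample")
```

Under those assumptions a single sample carries on the order of 100 million to 10 billion bases.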

Page 12: A Genome Sequence Analysis System Built with Hypertable

Example Reads File

GTGGATAGGGGGAGACTAATGTAGTATGATTATCATCATCAACAGAAGCTATGACACCAGGATAAACATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATTCATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATTGGGAGGATACTAGGGATTACTGAAGCCAACTTTGCAGACTCATACATTTGACTAGACACAGCCACATTACAGTTTTCTGAGGAAAATTCTTAAGATGTTACCCCAAAACATAGCATTTTAAATTAAAACGGACCGGCTGAAGCCATGGCAGAAGAACATAAATTGTGAAGATTTCATGGGCATTTATTAGTTGGAAGTGATAAGTGTCCATGAAATCTTCACAATTTATGTTCAGAGATTGCAGTAAAGACAGGTGTAAAGACACAGCAAAGCTAAGAGGACCCAACACACGGTAGGGTCGGGGACCTTGGAGAAACATGGTGGCTTCTTCCTACATGCTTGTGATAGATGACCAAAAAACATTTGTTGAGTTGATGAATAGTACAAAAAAGGGGCGGATAATAAATGAAAAGGGAATGTGCTGTTATTTCCTACTAAGATCAGAAAGAGATATAAACAAAAGCTGTCATCACTTAGGGACTTCAGCCACATAAAACAATGTCAGGCTAGTCACTTAGAGCTTTGGGACTAGTTGAGTGGCAGCTTAACAAAGCAACGCAATATCCATAGGGATTGGGGATATTTACATCTAGTGGATTCTACCAGTATGGTGGTCTTATGTGGACTGCACGTGGTTTTCTAGTAAGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTATAGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTATGTTGCCAAAGGAGATTAACATTTGAGTCAGTGGGCGGGGTAAGGCCGACCTACCCTTAATCTGGTGGAGAAAGAAGCTGCTAATGGAGTTTAAAAGGTTACTGTCATTAATGAAAAATAAATTTACAGCCAGACATTTATGAACAGAAATGGGAAAAACACACTAGGAAAGCACTGCAAAGACTAATCTGTCTTTAAAGGAGATAGAGTGACTCCAGGCCCCTTAGAAATGACTATACCTGGCAGAGCATGCCAACTGATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAAATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAAGGAAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGTCATACTGCCCAGCGGGGTTTTTTTTTGTTTCATATTAACTTTAAAGTAGTTTTTTTCCATTTTGTGAAGAAAGACATAAAGAACCAAGGCTAATAGTTGTTTGAGTTGTACTTACCATGTTGTTAAATGTCACCTCACACCGCTGCCAGCCTATCAGAGCCGGGAATTACACCGTGCTTGGAGTTCTGGCACAGATCCACAGCTACAGTTCTTCATTGTAAGAAATGGATGCTAACATGTAACAAGAAAACATCTGAAGGTTAAACTCAAATAAATGGGTTAATAGTTTGTCTTTCGGTCTTCATACTTTCAATATAAGTGGTTTACTTAGCCGA

Page 13: A Genome Sequence Analysis System Built with Hypertable

Sequence Alignment

Arranging the sequences of DNA or RNA to identify regions of similarity
Fuzzy matching algorithm
Alignment methods
  BLAST - Basic Local Alignment Search Tool
  MegaBLAST
Faster but less accurate alignment methods
  SOAP - Short Oligonucleotide Analysis Package
  BLAT - BLAST-like Alignment Tool
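To make "fuzzy matching" concrete, here is a minimal local-alignment scorer using textbook Smith-Waterman dynamic programming. It is not how BLAST, MegaBLAST, SOAP, or BLAT are implemented (those rely on faster heuristics); the function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGTACGT", "ACGAACGT"))
```

A read "aligns" to a reference when its score clears some threshold; the single mismatch in the example lowers the score without ruling out the match.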

Page 14: A Genome Sequence Analysis System Built with Hypertable

Taxonomy

Hierarchical biological classification
Method to group and categorize organisms by biological type
Basic Ranks
  Kingdom, Phylum/Division, Class, Order, Family, Genus, Species
Downloadable from National Center for Biotechnology Information (NCBI) website
Every node in the taxonomy tree is assigned a unique numeric ID

Page 15: A Genome Sequence Analysis System Built with Hypertable

GenBank

NIH genetic sequence database
  380,000 distinct organisms
  126,551,501,141 nucleotide bases
  135,440,924 sequence records
Most important and most influential database for research in almost all biological fields
Growth rate is exponential
Information on each sequence includes:
  Numeric ID
  Taxonomic information

Page 16: A Genome Sequence Analysis System Built with Hypertable

Schema Design

Page 17: A Genome Sequence Analysis System Built with Hypertable

Taxa Table

Schema
CREATE TABLE Taxa (ID, Type, Children, Name);

Contents
/1              ID           1
/1              ID:fullName  /root
/1              Type         no rank
/1              Children     1,10239,12884,12908,28384,131567
/1              Name         root
/1/10239        ID           10239
/1/10239        ID:fullName  /root/Viruses
/1/10239        Type         superkingdom
/1/10239        Children     12333,12429,12877,29258,35237, …
/1/10239        Name         Viruses
/1/10239/12333  ID           12333
/1/10239/12333  ID:fullName  /root/Viruses/unclassified phages
/1/10239/12333  Type         no rank
/1/10239/12333  Children     12340,12347,12366,12371,12374, …
/1/10239/12333  Name         unclassified phages
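The row keys in the Taxa table appear to be the path of taxonomy IDs from the root down to the node. The sketch below shows one way such keys could be built from a child-to-parent map; the function names and the `parent` dictionary are hypothetical, and the root is assumed to list itself as its own parent (consistent with the slide, where node 1 appears among its own children).

```python
def row_key(tax_id: int, parent: dict[int, int]) -> str:
    """Build a path-style row key such as '/1/10239/12333'."""
    path = [tax_id]
    while parent[path[-1]] != path[-1]:   # stop at the self-parented root
        path.append(parent[path[-1]])
    return "/" + "/".join(str(t) for t in reversed(path))


def in_subtree(key: str, subtree_root_key: str) -> bool:
    """True if key is the subtree root or one of its descendants."""
    return key == subtree_root_key or key.startswith(subtree_root_key + "/")


# IDs taken from the slide: root -> Viruses -> unclassified phages.
parent = {1: 1, 10239: 1, 12333: 10239}
print(row_key(12333, parent))
```

Because a Bigtable-style store keeps rows sorted by key, a path-style key places every descendant of a node in one contiguous key range, so a subtree can presumably be read with a single prefix scan.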

Page 18: A Genome Sequence Analysis System Built with Hypertable

Reads Table

Schema
CREATE TABLE Reads (Sequence, Quality, GeneKey, Comments);

Contents
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Sequence                        ATCGCACCATTGAACTCCAGTC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Quality                         eeaeeeed\\e_Ycc]dcacab...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Comments:qualityFilter          11071815...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Sequence                        GGCTTACGCCTGTAATCCCAGC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  Quality                         gfee_cgggegggecggggegc...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1  GeneKey:gnl|GNOMON|1320663.m    11...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  Sequence                        AGGATACGGAAGGCCCAAGGAG...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  Quality                         cdd`dffffffgffgggegf^e...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1  GeneKey:chr10                   110718151643.1308...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Sequence                        ACGGAAGAGCACACGTCTGAAC...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Quality                         cbccb[^W\\Ub]_b`_[bR_]...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1  Comments:qualityFilter          11071815...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Sequence                        GAACTCCAGTCACACAGTGATC...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Quality                         eeeeeeeeeeeceeeeeaeeTQ...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1  Comments:qualityFilter          11071815...

Page 19: A Genome Sequence Analysis System Built with Hypertable

Genes Table

Schema
CREATE TABLE Genes (Sequence, TaxID, ID, ReadID);

Contents
1000075  Sequence  GAATTCCATGGCAGTAAAACATCTTCCCTTC…
1000075  TaxID     9606
1000075  ID:name   HSLFBPS6 Human fructose-1,6-biphosphatase
1000075  ReadID:0310.Lane8big,HWI-EAS355:8:91:1231:1315#0/1 …
1000075  ReadID:0908.Mexus2.TATTAT,SCS:1:22:395:324#0/1_TA …
1000075  ReadID:0916.Enceph2,SCS:6:24:1519:513#0/1
1000075  ReadID:0916.Mexus,SCS:1:22:410:248#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:17:811:769#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:21:1132:1067#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:24:1207:492#0/1
1000075  ReadID:0916.MonkeyAdeno,SCS:2:33:1138:547#0/1
1000075  ReadID:0916.Parecho,SCS:3:4:679:1416#0/1|1
1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075  ReadID:HIV.HIV18_Lane7.s_7_sequence.unbiased,SCS:7:30 …

Page 20: A Genome Sequence Analysis System Built with Hypertable

Monitoring Table Overview

Page 21: A Genome Sequence Analysis System Built with Hypertable

Applications

Page 22: A Genome Sequence Analysis System Built with Hypertable

Novel Virus Discovery

Process for discovering new viral DNA in a biological sample
Algorithm Overview
  Import biological sample read data from sequencing company into system
  Strip out all reads that align to known DNA sequences
  What’s left over is novel

Page 23: A Genome Sequence Analysis System Built with Hypertable

Novel Virus Discovery: Algorithm Detail

Import sample data into Reads table
Run MapReduce program to filter/align reads and update Comments column of Reads table
  Filter out poor quality (“low entropy”) reads
  Align to common human RNA/DNA
  Align to virus database
  Align to GenBank
All reads left in Reads table with no Comments column are novel
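The filtering cascade can be sketched in plain Python, outside of MapReduce and Hypertable. Everything here is an assumption made for illustration: Shannon entropy with a threshold of 1.0 bit stands in for the "low entropy" test, exact substring search stands in for a real aligner, and the function names and tiny reference sets are invented.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the base composition, in bits per base."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def annotate(read, references, min_entropy=1.0):
    """Return a comment for a read, or None if nothing explains it."""
    if shannon_entropy(read) < min_entropy:
        return "qualityFilter"
    for name, sequences in references:   # checked in order, like the slide
        # Stand-in for alignment: exact substring match (assumption).
        if any(read in s for s in sequences):
            return name
    return None


def novel_reads(reads, references):
    """Reads that end up with no comment are the candidates for novelty."""
    return [r for r in reads if annotate(r, references) is None]


references = [("human", ["GGGACGTACGTTTT"]), ("virus", ["CCCCTTGACCAAGG"])]
reads = ["AAAAAAAA", "ACGTACGT", "TTGACCAA", "GATTACAG"]
print(novel_reads(reads, references))
```

In the system described on the slide, the comment would be written back to the read's row, so the final "novel" set is simply the rows without that column.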

Page 24: A Genome Sequence Analysis System Built with Hypertable

Pathogen Discovery in Cancer Samples

Accomplished using same technique as novel virus discovery
Matthew Meyerson’s Lab @ Broad Institute

Page 25: A Genome Sequence Analysis System Built with Hypertable

Taxonomic Tree Viewer

Display Taxonomy breakdown of biological sample
For each aligned read in sample, consult Genes table to determine Taxonomy ID
Populate HitSummary table with taxonomy IDs for all aligned reads from all samples
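The lookup described on this slide amounts to a join from reads to genes to taxonomy IDs, followed by a count per sample. The sketch below uses in-memory dictionaries in place of the Reads, Genes, and HitSummary tables; the function name, the dictionary shapes, and the read names `r1`/`r2` are hypothetical.

```python
from collections import Counter, defaultdict


def hit_summary(read_to_gene, gene_to_taxid):
    """Count aligned reads per taxonomy ID, grouped by sample."""
    summary = defaultdict(Counter)
    for (sample, _read_id), gene_key in read_to_gene.items():
        tax_id = gene_to_taxid.get(gene_key)
        if tax_id is not None:          # skip genes with no known taxonomy ID
            summary[sample][tax_id] += 1
    return summary


# Gene 1000075 and TaxID 9606 come from the Genes table slide.
gene_to_taxid = {"1000075": 9606}
read_to_gene = {
    ("0916.Mexus", "r1"): "1000075",
    ("0916.Mexus", "r2"): "1000075",
    ("0916.Enceph2", "r1"): "1000075",
    ("0916.Enceph2", "r2"): "unknown-gene",
}
summary = hit_summary(read_to_gene, gene_to_taxid)
print(dict(summary["0916.Mexus"]))
```

The per-sample counts are what a tree viewer would roll up along the taxonomy hierarchy to show a breakdown at each rank.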

Page 26: A Genome Sequence Analysis System Built with Hypertable

Depletion Array (future)

Align reads to human genome
Determine set of probes - sequences of the human genome with the most alignments
Send probes to Agilent to produce vial of “magnetized” DNA sequences of the probes
Mix vial in with biological sample
Magnetized DNA binds to human DNA, which precipitates from solution
Increases viral percentage of sample from ~0.01% - 0.1% to 10%
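Probe selection, as described, is a top-k count over alignment targets. This is a minimal sketch under the assumption that each alignment is recorded as a (read, region) pair; the region labels and the function name `select_probes` are invented for the example.

```python
from collections import Counter


def select_probes(alignments, k):
    """Return the k human-genome regions hit by the most reads."""
    counts = Counter(region for _read, region in alignments)
    return [region for region, _count in counts.most_common(k)]


alignments = [
    ("r1", "chr1:100"), ("r2", "chr1:100"), ("r3", "chr2:50"),
    ("r4", "chr1:100"), ("r5", "chr2:50"), ("r6", "chr3:7"),
]
print(select_probes(alignments, 2))
```

The regions with the most alignments are the ones whose removal depletes the most human DNA, which is what raises the viral fraction of what remains.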

Page 27: A Genome Sequence Analysis System Built with Hypertable

The End

Questions?