a genome sequence analysis system built with hypertable
TRANSCRIPT
![Page 1: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/1.jpg)
A Genome Sequence Analysis System Built with Hypertable
Doug Judd
CEO, Hypertable, Inc.
![Page 2: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/2.jpg)
Application Development Team
UCSF-Abbott Viral Diagnostics and Discovery Center
  Director: Dr. Charles Chiu, M.D./Ph.D.
  http://vddc.ucsf.edu/
Helices Inc.
  Taylor Sittler, M.D.
  John Dennis
  Brad Miller, M.D.
  http://helic.es/
![Page 3: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/3.jpg)
What is Hypertable?
Modeled after Google’s Bigtable
Open Source (GPL v2)
Horizontally Scalable
High Performance Implementation (C++)
Thrift Interface for all popular languages (Java, PHP, Ruby, Python, Perl, etc.)
NoSQL
  No joins (not yet)
  No transactions (not yet)
Project Started in March 2007
![Page 4: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/4.jpg)
Hypertable Deployments
![Page 5: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/5.jpg)
Why NoSQL?
![Page 6: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/6.jpg)
Source: Nature 458, 719-724 (2009)
![Page 7: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/7.jpg)
Source: wired.com, February 2011
![Page 8: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/8.jpg)
Genomics 101
![Page 9: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/9.jpg)
Base Pair (aka “base”)
Two nucleotides on opposite complementary DNA or RNA strands connected via hydrogen bonds
Double-stranded DNA/RNA is made up of base pairs
  adenine (A) pairs with thymine (T)
  guanine (G) pairs with cytosine (C)
Base-paired DNA sequence:
  ATCGATTGAGCTCTAGCG
  TAGCTAACTCGAGATCGC
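The pairing rule on this slide can be checked mechanically. The following is a minimal Python sketch, not code from the presented system; the names `PAIRS` and `complement` are made up for illustration. It derives the second strand of the slide's example from the first.

```python
# DNA pairing rule from the slide: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRS[base] for base in strand)


top = "ATCGATTGAGCTCTAGCG"     # first strand shown on the slide
bottom = complement(top)       # second strand, derived from the first
print(top)
print(bottom)
```

Applying `complement` twice returns the original strand, which is a quick sanity check on the pairing table.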
![Page 10: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/10.jpg)
Gene
Encodes info on how to make a protein
DNA or RNA sequence
Thousands to millions of base pairs long
Corresponds to various different biological traits
Human genome contains about 23,000 genes
![Page 11: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/11.jpg)
Biological Samples
Specimen taken from human or animal
  Nasal swabs
  Blood serum
  Diarrheal
  Cerebrospinal fluid
Sent to a sequencing company to process into DNA sequence information in digital format
Each sample will generate anywhere from 1M to 100M “reads”
A read is a short DNA sequence snippet of approximately 100 bases
![Page 12: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/12.jpg)
Example Reads File
GTGGATAGGGGGAGACTAATGTAGTATGATTATCATCATCAACAGAAGCTATGACACCAGGATAAACATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATTCATTTCTTATTGCTGAAAGTATTCTATTGTAGAGATGTACCACAATTTGGTTTCTGGTTTTGTATTGGGAGGATACTAGGGATTACTGAAGCCAACTTTGCAGACTCATACATTTGACTAGACACAGCCACATTACAGTTTTCTGAGGAAAATTCTTAAGATGTTACCCCAAAACATAGCATTTTAAATTAAAACGGACCGGCTGAAGCCATGGCAGAAGAACATAAATTGTGAAGATTTCATGGGCATTTATTAGTTGGAAGTGATAAGTGTCCATGAAATCTTCACAATTTATGTTCAGAGATTGCAGTAAAGACAGGTGTAAAGACACAGCAAAGCTAAGAGGACCCAACACACGGTAGGGTCGGGGACCTTGGAGAAACATGGTGGCTTCTTCCTACATGCTTGTGATAGATGACCAAAAAACATTTGTTGAGTTGATGAATAGTACAAAAAAGGGGCGGATAATAAATGAAAAGGGAATGTGCTGTTATTTCCTACTAAGATCAGAAAGAGATATAAACAAAAGCTGTCATCACTTAGGGACTTCAGCCACATAAAACAATGTCAGGCTAGTCACTTAGAGCTTTGGGACTAGTTGAGTGGCAGCTTAACAAAGCAACGCAATATCCATAGGGATTGGGGATATTTACATCTAGTGGATTCTACCAGTATGGTGGTCTTATGTGGACTGCACGTGGTTTTCTAGTAAGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTATAGATAGCAGCTCTTCCCAAATTTATTTATAATTGTGGCATTATTTATAATATCAAAATATTATGTTGCCAAAGGAGATTAACATTTGAGTCAGTGGGCGGGGTAAGGCCGACCTACCCTTAATCTGGTGGAGAAAGAAGCTGCTAATGGAGTTTAAAAGGTTACTGTCATTAATGAAAAATAAATTTACAGCCAGACATTTATGAACAGAAATGGGAAAAACACACTAGGAAAGCACTGCAAAGACTAATCTGTCTTTAAAGGAGATAGAGTGACTCCAGGCCCCTTAGAAATGACTATACCTGGCAGAGCATGCCAACTGATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAAATGGGCTCGAGTCCTCACAAATATGAATTCCCCCTAAGTCTTGAGAGGTCATTTGTGCATTTGGAAGGAAGAACATTCCATGCTCATGGGTAGGAAGAATCAATATCGTGAAAATGGTCATACTGCCCAGCGGGGTTTTTTTTTGTTTCATATTAACTTTAAAGTAGTTTTTTTCCATTTTGTGAAGAAAGACATAAAGAACCAAGGCTAATAGTTGTTTGAGTTGTACTTACCATGTTGTTAAATGTCACCTCACACCGCTGCCAGCCTATCAGAGCCGGGAATTACACCGTGCTTGGAGTTCTGGCACAGATCCACAGCTACAGTTCTTCATTGTAAGAAATGGATGCTAACATGTAACAAGAAAACATCTGAAGGTTAAACTCAAATAAATGGGTTAATAGTTTGTCTTTCGGTCTTCATACTTTCAATATAAGTGGTTTACTTAGCCGA
![Page 13: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/13.jpg)
Sequence Alignment
Arranging the sequences of DNA or RNA to identify regions of similarity
Fuzzy matching algorithm
Alignment methods
  BLAST - Basic Local Alignment Search Tool
  MegaBLAST
Faster but less accurate alignment methods
  SOAP - Short Oligonucleotide Analysis Package
  BLAT - BLAST-like Alignment Tool
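To make "fuzzy matching" concrete, here is a small Python sketch of a classic local-alignment score in the Smith-Waterman style. It is not what BLAST, SOAP, or BLAT do internally (those tools rely on faster heuristics), and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("ACGT", "ACGT"))    # identical sequences
print(local_alignment_score("ACGTT", "ACGAT"))  # one mismatch is tolerated
print(local_alignment_score("AAAA", "TTTT"))    # nothing in common
```

A read would be treated as "aligned" when its score against a reference exceeds some threshold; picking that threshold is part of what the real tools tune.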
![Page 14: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/14.jpg)
Taxonomy
Hierarchical biological classification
Method to group and categorize organisms by biological type
Basic Ranks
  Kingdom, Phylum/Division, Class, Order, Family, Genus, Species
Downloadable from National Center for Biotechnology Information (NCBI) website
Every node in the taxonomy tree is assigned a unique numeric ID
![Page 15: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/15.jpg)
GenBank
NIH genetic sequence database
  380,000 distinct organisms
  126,551,501,141 nucleotide bases
  135,440,924 sequence records
Most important and most influential database for research in almost all biological fields
Growth rate is exponential
Information on each sequence includes:
  Numeric ID
  Taxonomic information
![Page 16: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/16.jpg)
Schema Design
![Page 17: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/17.jpg)
Taxa Table
Schema
CREATE TABLE Taxa (ID, Type, Children, Name);
Contents
/1 ID 1
/1 ID:fullName /root
/1 Type no rank
/1 Children 1,10239,12884,12908,28384,131567
/1 Name root
/1/10239 ID 10239
/1/10239 ID:fullName /root/Viruses
/1/10239 Type superkingdom
/1/10239 Children 12333,12429,12877,29258,35237, …
/1/10239 Name Viruses
/1/10239/12333 ID 12333
/1/10239/12333 ID:fullName /root/Viruses/unclassified phages
/1/10239/12333 Type no rank
/1/10239/12333 Children 12340,12347,12366,12371,12374, …
/1/10239/12333 Name unclassified phages
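The row keys in the Taxa table look like the path of taxonomy IDs from the root down to the node. The Python sketch below shows one way such a key could be built, assuming a hypothetical `parent_of` mapping in which the root (ID 1) is its own parent; `row_key` and `subtree` are illustrative names, not part of the presented system. Because a Bigtable-style store keeps rows sorted by key, a prefix match on a node's key would then select that node's whole subtree.

```python
def row_key(tax_id: int, parent_of: dict) -> str:
    """Build a '/1/10239/...' style key by walking up to the root."""
    path = [tax_id]
    while parent_of[path[-1]] != path[-1]:   # the root is its own parent
        path.append(parent_of[path[-1]])
    return "/" + "/".join(str(t) for t in reversed(path))


def subtree(keys, root_key: str):
    """Return the node itself plus every key nested below it, in key order."""
    return [k for k in sorted(keys)
            if k == root_key or k.startswith(root_key + "/")]


# Hypothetical parent map covering the three nodes shown on the slide.
parent_of = {1: 1, 10239: 1, 12333: 10239}
keys = [row_key(t, parent_of) for t in parent_of]
print(keys)
print(subtree(keys, "/1/10239"))
```

Matching on `root_key + "/"` rather than the bare prefix avoids pulling in unrelated IDs that merely share leading digits.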
![Page 18: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/18.jpg)
Reads Table
Schema
CREATE TABLE Reads (Sequence, Quality, GeneKey, Comments);
Contents
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1 Sequence ATCGCACCATTGAACTCCAGTC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1 Quality eeaeeeed\\e_Ycc]dcacab...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1 Comments:qualityFilter 11071815...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1 Sequence GGCTTACGCCTGTAATCCCAGC...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1 Quality gfee_cgggegggecggggegc...
AbCam1_100_ACAGTG,HWI...56#ACAGTG/1 GeneKey:gnl|GNOMON|1320663.m 11...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1 Sequence AGGATACGGAAGGCCCAAGGAG...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1 Quality cdd`dffffffgffgggegf^e...
AbCam1_100_ACAGTG,HWI...17#ACAGTG/1 GeneKey:chr10 110718151643.1308...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1 Sequence ACGGAAGAGCACACGTCTGAAC...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1 Quality cbccb[^W\\Ub]_b`_[bR_]...
AbCam1_100_ACAGTG,HWI...80#ACAGTG/1 Comments:qualityFilter 11071815...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1 Sequence GAACTCCAGTCACACAGTGATC...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1 Quality eeeeeeeeeeeceeeeeaeeTQ...
AbCam1_100_ACAGTG,HWI...88#ACAGTG/1 Comments:qualityFilter 11071815...
![Page 19: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/19.jpg)
Genes Table
Schema
CREATE TABLE Genes (Sequence, TaxID, ID, ReadID);
Contents
1000075 Sequence GAATTCCATGGCAGTAAAACATCTTCCCTTC…
1000075 TaxID 9606
1000075 ID:name HSLFBPS6 Human fructose-1,6-biphosphatase
1000075 ReadID:0310.Lane8big,HWI-EAS355:8:91:1231:1315#0/1 …
1000075 ReadID:0908.Mexus2.TATTAT,SCS:1:22:395:324#0/1_TA …
1000075 ReadID:0916.Enceph2,SCS:6:24:1519:513#0/1
1000075 ReadID:0916.Mexus,SCS:1:22:410:248#0/1
1000075 ReadID:0916.MonkeyAdeno,SCS:2:17:811:769#0/1
1000075 ReadID:0916.MonkeyAdeno,SCS:2:21:1132:1067#0/1
1000075 ReadID:0916.MonkeyAdeno,SCS:2:24:1207:492#0/1
1000075 ReadID:0916.MonkeyAdeno,SCS:2:33:1138:547#0/1
1000075 ReadID:0916.Parecho,SCS:3:4:679:1416#0/1|1
1000075 ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075 ReadID:HIV.HIV18_Lane7.s_7_sequence.AAA,SCS:7:30:688 …
1000075 ReadID:HIV.HIV18_Lane7.s_7_sequence.unbiased,SCS:7:30 …
![Page 20: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/20.jpg)
Monitoring Table Overview
![Page 21: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/21.jpg)
Applications
![Page 22: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/22.jpg)
Novel Virus Discovery
Process for discovering new viral DNA in a biological sample
Algorithm Overview
  Import biological sample read data from sequencing company into system
  Strip out all reads that align to known DNA sequences
  What’s left over is novel
![Page 23: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/23.jpg)
Novel Virus Discovery: Algorithm Detail
Import sample data into Reads table
Run MapReduce program to filter/align reads and update Comments column of Reads table
  Filter out poor quality (“low entropy”) reads
  Align to common human RNA/DNA
  Align to virus database
  Align to GenBank
All reads left in Reads table with no Comments column are novel
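The filter-then-align pipeline above can be sketched in a few lines of Python. This is an in-memory stand-in for the MapReduce job, not the presented implementation. The entropy threshold, the `aligners` list, and the toy matching rule are all assumptions made for illustration. Only the `qualityFilter` label comes from the Reads table contents shown earlier.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the base composition, in bits per base."""
    n = len(seq)
    return sum((c / n) * math.log2(n / c) for c in Counter(seq).values())


def annotate(reads: dict, aligners: list, min_entropy: float = 1.0) -> dict:
    """Return {read_id: comment} for reads that were filtered out or aligned."""
    comments = {}
    for read_id, seq in reads.items():
        if shannon_entropy(seq) < min_entropy:   # assumed threshold
            comments[read_id] = "qualityFilter"
            continue
        for name, matches in aligners:           # tried in the listed order
            if matches(seq):
                comments[read_id] = name
                break
    return comments


def novel_reads(reads: dict, comments: dict) -> list:
    """Reads that received no comment are the candidates for novel sequence."""
    return [read_id for read_id in reads if read_id not in comments]


# Toy data: the 'human' aligner here is a placeholder predicate, not a real aligner.
reads = {"r1": "AAAAAAAAAA", "r2": "ACGTACGTAC", "r3": "GATTACAGAT"}
aligners = [("human", lambda s: s.startswith("ACGT"))]
comments = annotate(reads, aligners)
print(comments)
print(novel_reads(reads, comments))
```

In the real system each aligner stage would presumably be one of the tools from the Sequence Alignment slide run against a reference database, with the result written to the Comments column instead of a dictionary.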
![Page 24: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/24.jpg)
Pathogen Discovery in Cancer Samples
Accomplished using same technique as novel virus discovery
Matthew Meyerson's Lab @ Broad Institute
![Page 25: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/25.jpg)
Taxonomic Tree Viewer
Display taxonomy breakdown of biological sample
For each aligned read in sample, consult Genes table to determine Taxonomy ID
Populate HitSummary table with taxonomy IDs for all aligned reads from all samples
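The lookup-and-count step behind the tree viewer can be sketched as follows. This is an illustrative Python outline under assumed data shapes: `aligned` as (read ID, gene ID) pairs and `gene_tax_id` as a gene-to-TaxID mapping stand in for the Reads and Genes tables. Gene `1000075` with TaxID `9606` is taken from the Genes table slide, while the second gene is invented.

```python
from collections import Counter


def hit_summary(aligned, gene_tax_id: dict) -> Counter:
    """Count aligned reads per taxonomy ID.

    aligned     -- iterable of (read_id, gene_id) pairs
    gene_tax_id -- mapping gene_id -> taxonomy ID (stand-in for the Genes table)
    """
    return Counter(gene_tax_id[gene_id] for _read_id, gene_id in aligned)


# Toy input: 'g2' and its TaxID are hypothetical.
gene_tax_id = {"1000075": 9606, "g2": 10239}
aligned = [("r1", "1000075"), ("r2", "1000075"), ("r3", "g2")]
summary = hit_summary(aligned, gene_tax_id)
print(summary.most_common())
```

The per-ID counts could then be attached to nodes of the taxonomy tree for display.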
![Page 26: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/26.jpg)
Depletion Array (future)
Align reads to human genome
Determine set of probes - sequences of human genome with most number of alignments
Send probes to Agilent to produce vial of “magnetized” DNA sequences of the probes
Mix vial in with biological sample
Magnetized DNA binds to human DNA which precipitates from solution
Increases viral percentage of sample from ~0.01% - 0.1% to 10%
![Page 27: A Genome Sequence Analysis System Built with Hypertable](https://reader038.vdocuments.site/reader038/viewer/2022110310/55a3f0611a28ab38168b469c/html5/thumbnails/27.jpg)
The End
Questions?