EDBT 2010, March 22-26, 2010
Tien Huynh (1), Michalis Vlachos (2), Isidore Rigoutsos (3)
Anchoring Millions of Distinct Reads on the Human Genome Within Seconds
(1) IBM TJ Watson Research, NY, USA
(2) IBM Zurich Research Laboratory, CH
(3) Thomas Jefferson University, USA
-
Introduction - Past
Bioinformatics Sequencing Evolution
Human Genome
  3.3B nucleotides
  825MB raw data
  20MB compressed
  25,000 genes
-
Introduction - Past, Present, Future
It took 13 years for teams of scientists around the globe to first read the human genome, completing the project in 2001. In 2007, it took 2 months to sequence the genome of DNA co-discoverer James Watson. By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg. (Science, 2009)
-
What we want to do
Question from industry: Within 24 hours, can we find, FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome?
[Figure labels: Reference genome; New generation sequencer]
-
Applications
Cancer research: help isolate cancer-initiating mutations
Better design of siRNAs (short interfering RNAs)
  RNA sequences of ~21 nucleotides -> RNA interference
  Therapeutic gene silencing
  Cancer therapy by tumor-related gene targeting
Junk DNA analysis
Re-sequencing
etc.
-
How can we achieve that?
Solve as a hash join between the reference and the query database
Reference Genome/Database (e.g. human genome)
  Static, one time
Query Database (produced fragments)
  Dynamic, recreated on every experiment
How fast can we achieve this on a commodity desktop?
We call the technique QPick (= Quick Pick)
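The hash-join idea can be illustrated on plain strings. This is only a minimal sketch: the names `build_index` and `hash_join`, the fixed fragment length `k`, and the in-memory dictionary are assumptions made for illustration. QPick itself is described as disk based and uses the bit encodings shown on later slides.

```python
from collections import defaultdict

def build_index(reference, k):
    """Build side: map every length-k substring of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def hash_join(index, queries):
    """Probe side: look up each query fragment and collect all of its positions."""
    return {q: index.get(q, []) for q in queries}

# The reference index is static and built once;
# the query set is rebuilt for every experiment.
reference = "ACGAGTACACGAGTCG"
index = build_index(reference, k=5)
hits = hash_join(index, ["ACGAG", "GTACA", "TTTTT"])
```

In this toy run, a fragment that occurs twice in the reference gets both positions, and a fragment that does not occur gets an empty list.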
-
k-mer extraction
[Figure: k-mers extracted from the genomic sequence (the database); k-mers shown: ACGGTTT, CGGTTTA, GGTTTAG]
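A sliding-window extraction could look like the sketch below. It assumes the example sequence is ACGGTTTAG with k = 7 and a step of one symbol, which is consistent with the three overlapping 7-mers on this slide; the function name `extract_kmers` is hypothetical.

```python
def extract_kmers(sequence, k):
    """Yield (position, k-mer) for every window of length k, sliding by one symbol."""
    for pos in range(len(sequence) - k + 1):
        yield pos, sequence[pos:pos + k]

# Assumed example sequence; k = 7 reproduces the three k-mers on the slide.
kmers = list(extract_kmers("ACGGTTTAG", 7))
for pos, kmer in kmers:
    print(pos, kmer)
```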
-
QPick - Advantages
Disk based. Small memory requirements
Other techniques run out of memory on large datasets
3GB RAM is sufficient to:
  Index / search the human genome
  Query millions of short fragments
-
QPick - Advantages
Highly parallelizable
A single CPU/core is sufficient.
-
QPick - Advantages
Flexibility: rigid and wildcard matches
[Figure: a rigid query (ACGAGTAC) and a wildcard query (AACG?AGTA?, where each ? is a wildcard) matched against the reference genome ACGAGTACACGTTACGAGG..]
-
QPick - Advantages
Completeness of results
Various competitors failed to return all matches
QPick does not miss any matches
[Figure residue: ACGTTACGCGTTACGACG?AGTA]
-
Technical Highlights
Data compression
Exploit small DNA alphabet
Fast bitwise operations
Take advantage of 64-bit word comparisons
Simple implementation (hash based)
Data pruning
[Figure: symbols A, C, G, T encoded as bits (0011100011) and packed into 64-bit words]
-
Data Representation
[Figure: a window of length Lmax over the reference genome ACGAGTACACGAGTCG]
-
Data Representation - Head
[Figure: window over the reference genome, split into head and tail]
Head, binary: 0001101100011011000110110001
Head, decimal: 28,422,577
The head maps to position 28,422,577
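The binary-to-decimal step on this slide can be reproduced with a 2-bit packing. The mapping A=00, C=01, G=10, T=11 is an assumption: under it, the 28-bit string on the slide decodes to the 14-symbol head ACGTACGTACGTAC and to the decimal value 28,422,577 quoted above. The name `encode_head` is hypothetical.

```python
# Assumed 2-bit code; the slide does not state the mapping explicitly.
CODE_2BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode_head(head):
    """Pack a nucleotide string into an integer, 2 bits per symbol,
    with the first symbol in the most significant position."""
    key = 0
    for symbol in head:
        key = (key << 2) | CODE_2BIT[symbol]
    return key

key = encode_head("ACGTACGTACGTAC")   # 14 symbols -> 28 bits
print(format(key, "028b"), key)
```

Because the head is a plain integer, it can serve directly as a position in the hashtable rather than being stored with each entry.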
-
Data Representation
Advantages
  Head = key for hashtable
  Reduction in dataset size
  We don't have to explicitly store the head
-
Data Representation - Tail
[Figure: reference genome window (GTAC) with head and tail marked; the tail is padded to proper length]
Codebook (4 bits per symbol):
  A  0001
  G  0010
  T  0100
  C  1000
  .  0000
Tail: 16 nts, 16 x 4 bits = one 64-bit word
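A sketch of the tail encoding, using the codebook from this slide (A=0001, G=0010, T=0100, C=1000, wildcard "."=0000). The comparison in `tail_matches` is an assumption about how such a one-hot code can be matched with a single 64-bit operation; the slides do not spell out the exact test QPick uses. Both function names are hypothetical.

```python
CODE_4BIT = {"A": 0b0001, "G": 0b0010, "T": 0b0100, "C": 0b1000, ".": 0b0000}
TAIL_LEN = 16  # 16 symbols x 4 bits = one 64-bit word

def encode_tail(tail):
    """Pad the tail with '.' to 16 symbols and pack it into a 64-bit integer."""
    if len(tail) > TAIL_LEN:
        raise ValueError("tail longer than 16 symbols")
    word = 0
    for symbol in tail.ljust(TAIL_LEN, "."):
        word = (word << 4) | CODE_4BIT[symbol]
    return word

def tail_matches(query_word, reference_word):
    """Assumed test: every non-wildcard query nibble must equal the reference
    nibble; wildcard nibbles (0000) in the query match anything."""
    return (query_word & reference_word) == query_word

exact = tail_matches(encode_tail("AGTC"), encode_tail("AGTC"))
wild = tail_matches(encode_tail("AG.C"), encode_tail("AGTC"))
miss = tail_matches(encode_tail("AGTC"), encode_tail("AGTA"))
```

With a one-hot code, a mismatching symbol clears the whole nibble under AND, so a single word comparison covers all 16 positions at once.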
-
Example
[Figure: reference genome ATAGACTAAAAAAAAAAAAAAATT; labels: 24 symbols, padding]
-
Up to now we have seen how to encode the reference genome
Now we show how to encode the query sequences
-
Encoding the query DB
Now, instead of one long sequence, many shorter ones (millions)
[Figure: example query sequences AGTC ., AGC ., AGTC . ., AGTCG]
We do similar extractions into heads and tails
  head: 14 symbols, used as key
  tail: 16 symbols = 64 bits
-
Encoding the query DB
Differences between target db and query db:
  Query sequences may contain wildcards (in the db they are only used for padding)
  Query sequences may have variable length (compared to the fixed-size n-grams extracted from the target db)
[Figure: example query sequences AGTC . ., AGC ., AGTC . ., AGTCG]
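Variable-length queries can be brought to a fixed layout by splitting them into a 14-symbol head and a tail padded with wildcards. This is a sketch under assumptions: the head is taken to be the first 14 symbols, queries are taken to be between 14 and 30 symbols long, and the name `split_query` is hypothetical.

```python
HEAD_LEN = 14   # head symbols, used as the hash key
TAIL_LEN = 16   # tail symbols, packed into one 64-bit word

def split_query(query):
    """Split a query into (head, tail); a short tail is padded with '.' wildcards."""
    if not HEAD_LEN <= len(query) <= HEAD_LEN + TAIL_LEN:
        raise ValueError("this sketch expects 14 to 30 symbols per query")
    head = query[:HEAD_LEN]
    tail = query[HEAD_LEN:].ljust(TAIL_LEN, ".")
    return head, tail

head, tail = split_query("ACGTACGTACGTACGGT")   # 17 symbols
print(head, tail)
```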
-
Encoding the query DB - wildcards
Treated differently depending on where they appear
  Head: expanded to the possible symbols
  Tail: encoded as binary wildcard 0000
[Figure: query AGTC . and its expansions]
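Expanding a wildcard in the head means generating one concrete head per possible symbol, so that each can be used as an ordinary hash key. A minimal sketch, assuming "." marks the wildcard and the alphabet is A, C, G, T; the name `expand_head` is hypothetical.

```python
from itertools import product

def expand_head(head, wildcard="."):
    """Return every concrete head obtained by replacing each wildcard
    with one of A, C, G, T."""
    choices = ["ACGT" if symbol == wildcard else symbol for symbol in head]
    return ["".join(combo) for combo in product(*choices)]

one = expand_head("AG.C")    # 4 concrete heads
two = expand_head("A..C")    # 16 concrete heads
print(one)
```

The number of keys grows as 4 to the power of the number of wildcards in the head, which is one plausible reason for handling tail wildcards with the 0000 code instead of expansion.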
-
Handling Forward & Reverse DNA strands
[Figure: reference genome ACGAGTACACGAGTCG with forward strand and backward strand]
If we explicitly encoded the reverse strand, we would be indexing twice as much data
-
Handling Forward & Reverse DNA strands
[Figure: reference genome ACGAGTACACGAGTCG with forward strand and backward strand]
Form complementary sequences
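One way to avoid indexing the backward strand is to index only the forward strand and also look up the reverse complement of each query. The sketch below assumes this is what "form complementary sequences" refers to; the name `reverse_complement` is hypothetical, and wildcards (".") are left unchanged.

```python
# A pairs with T, C pairs with G; any other character (e.g. '.') is kept as-is.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Complement every symbol, then reverse the sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

query = "ACGAGTAC"
backward = reverse_complement(query)
print(query, backward)
```

A query is then searched twice against the same forward index, once as given and once as its reverse complement, so the index size stays unchanged.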
-
Overall search process.
-
Was our hash key function a correct one?
A bad key function would have led to many collisions
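One simple way to check a key function is to count how many entries share each key and look at the distribution of bucket sizes. The sketch below does this on a toy sequence with a 3-symbol head; the sequence, the head length, and the function names are illustrative assumptions, not the experiment behind this slide.

```python
from collections import Counter

def bucket_sizes(sequence, head_len):
    """Count how many windows of the sequence share each head (one bucket per head)."""
    windows = range(len(sequence) - head_len + 1)
    return Counter(sequence[i:i + head_len] for i in windows)

def occupancy_histogram(buckets):
    """Map bucket size -> number of buckets with that size."""
    return Counter(buckets.values())

buckets = bucket_sizes("ACGAGTACACGAGTCG", head_len=3)
histogram = occupancy_histogram(buckets)
print(dict(histogram))
```

A histogram dominated by small bucket sizes indicates few collisions; a long tail of very large buckets would point to a poor key function or to highly repetitive input.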
-
Experiments
Comparison with other Short Sequence Anchoring Tools
QPick Performance
  Number of wildcards
  Index creation time
  Query time
Datasets
  Homo sapiens: ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/
  Mus musculus: ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/
-
COMPARISON: Short Sequence Anchoring Tools
Up to 60x faster
Tool       Hits         Misses      Time (sec)   Improvement
QPick      10,611,858   0           79           -
BWA        10,611,858   0           149          1.8x
fetchGWI   10,611,856   2           186          2.3x
Bowtie     10,611,858   0           331          4.1x
SOAP       10,573,528   38,330      4728         59.84x
Eland      10,559,799   5,014,059   555          7.0x
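The Improvement column appears to be each tool's running time divided by QPick's. The sketch below recomputes that ratio, taking the times as read from the table (79, 149, 186, 331, 4728 and 555 seconds); that reading of the flattened table is an assumption.

```python
# Running times in seconds, as read from the comparison table.
times_sec = {
    "QPick": 79, "BWA": 149, "fetchGWI": 186,
    "Bowtie": 331, "SOAP": 4728, "Eland": 555,
}

baseline = times_sec["QPick"]
speedup = {tool: t / baseline for tool, t in times_sec.items() if tool != "QPick"}

for tool, ratio in speedup.items():
    print(f"{tool}: {ratio:.2f}x")
```

The largest ratio, for SOAP, is just under 60, which matches the "up to 60x faster" headline.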
-
Visualizing the results
-
Comparison with Hash-Based techniques
QPick 16x faster than fetchGWI
-
Varying the wildcards
5-6 seconds retrieval time for 0-1 wildcards
-
Full Human-Genome Search
Time to index
  Reference genome
  Query sequences
Time to search
  1M-10M queries (short DNA fragments)
-
Full Genome Search: Index Time
~7 hours (on a single-core CPU)
< 50 sec for query hashtables
-
Full Genome Search: Search Time
~130 sec to find 1M matches (single core)
-
Conclusions
QPick: fast and complete search for short sequence fragments
Takes advantage of:
  Small DNA alphabet
  Bit packetization
  Hash joins
Up to 60x faster than the competing techniques
Applications for .
Future
[Figure: A, C, G, T bit encoding, 0011100011]
The long serpentine is the human genome
All techniques developed 2008-2009