EDBT 2010, March 22-26, 2010

EDBT 2010, March 22-26, 2010. Tien Huynh (1), Michalis Vlachos (2), Isidore Rigoutsos (3). Anchoring Millions of Distinct Reads on the Human Genome Within Seconds. (1) IBM TJ Watson Research, NY, USA; (2) IBM Zurich Research Laboratory, CH; (3) Thomas Jefferson University, USA.

  • EDBT 2010, March 22-26, 2010

    Anchoring Millions of Distinct Reads on the Human Genome Within Seconds

    Tien Huynh (1), Michalis Vlachos (2), Isidore Rigoutsos (3)

    (1) IBM TJ Watson Research, NY, USA
    (2) IBM Zurich Research Laboratory, CH
    (3) Thomas Jefferson University, USA

  • Introduction - Past
    - Bioinformatics: sequencing evolution
    - Human genome: 3.3B nucleotides, 825MB raw data, 20MB compressed, 25,000 genes

  • Introduction - Past, Present, Future
    - It took 13 years for teams of scientists around the globe to first read the human genome, completing the project in 2001.
    - In 2007, it took 2 months to sequence the genome of DNA co-discoverer James Watson.
    - By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg. (Science, 2009)

  • What we want to do
    - Question from industry: within 24 hours, can we find FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome?
    - Diagram labels: reference genome, new-generation sequencer

  • Applications
    - Cancer research: help isolate cancer-initiating mutations
    - Better design of siRNAs (short interfering RNAs)
        - RNA sequences of ~21 nucleotides -> RNA interference
        - Therapeutic gene silencing
        - Cancer therapy by tumor-related gene targeting
    - Junk DNA analysis
    - Re-sequencing
    - etc.

  • How can we achieve that?
    - Solve as a hash join between the reference and the query database
    - Reference genome/database (e.g. human genome): static, built one time
    - Query database (produced fragments): dynamic, recreated on every experiment
    - How fast can we achieve this on a commodity desktop?

    We call the technique QPick (=Quick Pick)
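A hash join of this kind can be sketched in a few lines. This is only an illustration of the general build/probe idea, not QPick itself: the function names `build_index` and `probe`, the toy reference string, and the in-memory dictionary are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Build phase: map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def probe(index, queries):
    """Probe phase: look up each query and collect all matching positions."""
    return {q: list(index.get(q, [])) for q in queries}


# Hypothetical toy data, chosen only to make the example easy to follow.
reference = "ACGTACGTTAGC"
index = build_index(reference, 4)
hits = probe(index, ["ACGT", "TTAG", "GGGG"])
print(hits)
```

The reference index is built once and reused, while the query side changes with every experiment, which mirrors the static/dynamic split described on the slide.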

  • k-mer extraction
    - Overlapping k-mers extracted from the genomic sequence (the database): ACGGTTT, CGGTTTA, GGTTTAG
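The fused string on this slide reads as three overlapping 7-mers (ACGGTTT, CGGTTTA, GGTTTAG), which is what a sliding window over the sequence ACGGTTTAG would produce. A minimal sketch of that extraction, assuming a window that advances one nucleotide at a time (the function name `extract_kmers` is hypothetical):

```python
def extract_kmers(sequence, k):
    """Return (position, k-mer) pairs for every window of length k."""
    return [(pos, sequence[pos:pos + k]) for pos in range(len(sequence) - k + 1)]


kmers = extract_kmers("ACGGTTTAG", 7)
for pos, kmer in kmers:
    print(pos, kmer)
```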

  • QPick - Advantages
    - Disk based, small memory requirements
    - Other techniques run out of memory on large datasets
    - 3GB RAM is sufficient to:
        - Index
        - Search the human genome
        - Query millions of short fragments

  • QPick - Advantages
    - Highly parallelizable
    - A single CPU/core is sufficient

  • QPick - Advantages
    - Flexibility: rigid and wildcard matches
    - Wildcard query: AACG?AGTA? (? = wildcard)
    - Rigid query: ACGAGTAC
    - Reference genome: ACGAGTACACGTTACGAGG..

  • QPick - Advantages
    - Completeness of results
    - Various competitors failed to return all matches
    - QPick does not miss any matches

    Diagram text: ACGTTACGCGTTACGACG?AGTA

  • Technical Highlights
    - Data compression
        - Exploit the small DNA alphabet
    - Fast bitwise operations
        - Take advantage of 64-bit word comparisons
    - Simple implementation (hash based)
    - Data pruning

    Diagram labels: ACGT, 0011100011, 64 bits

  • Data Representation
    - Diagram labels: reference genome (ACGAGTACACGAGTCG), window, Lmax

  • Data Representation - Head
    - Binary: 0001101100011011000110110001
    - Decimal: 28,422,577
    - Diagram labels: reference genome, window, head, tail, position 28,422,577

  • Data Representation - Advantages
    - Head = key for the hashtable
    - Reduction in dataset size
    - We don't have to explicitly store the head
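The head example can be checked with a short sketch. It assumes a 2-bit code A=00, C=01, G=10, T=11. That mapping is not stated on the slide, but it is the one under which the 28-bit pattern shown decodes to a 14-symbol head (ACGTACGTACGTAC) and gives the decimal value 28,422,577. The function name `encode_head` is hypothetical.

```python
# Assumed 2-bit code per nucleotide (inferred from the slide's binary/decimal pair).
CODE2 = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_head(head):
    """Pack a head of nucleotides into an integer, 2 bits per symbol."""
    key = 0
    for base in head:
        key = (key << 2) | CODE2[base]
    return key


key = encode_head("ACGTACGTACGTAC")
bits = format(key, "028b")  # 14 symbols x 2 bits = 28 bits
print(bits, key)
```

Because the integer is itself the hashtable position, the head never needs to be stored alongside the entry, which is the size reduction the slide refers to.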

  • Data Representation - Tail
    - Example: AGTC -> 0001 0010 0100 1000 . 0000 (padded to proper length)
    - Tail: 16 nts, 16 x 4 bits = one 64-bit word
    - Diagram labels: reference genome, codebook, GTAC, head, tail
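A minimal sketch of the tail encoding follows. It reads the slide's pairing of AGTC with 0001 0010 0100 1000 as a one-hot codebook (A=0001, G=0010, T=0100, C=1000) with 0000 for the wildcard and for padding; that reading is an interpretation of a garbled slide. The matching rule in `tail_matches` is also an assumption, not something the slides spell out: it treats a query nibble of 0000 as matching anything. The `?` wildcard character is borrowed from the query examples elsewhere in the deck.

```python
# Assumed 4-bit codebook; "?" is the wildcard and is also used for padding.
CODE4 = {"A": 0b0001, "G": 0b0010, "T": 0b0100, "C": 0b1000, "?": 0b0000}
TAIL_LEN = 16  # 16 symbols x 4 bits = one 64-bit word


def encode_tail(tail):
    """Pack up to 16 symbols into a 64-bit integer, padding with wildcards."""
    if len(tail) > TAIL_LEN:
        raise ValueError("tail longer than 16 symbols")
    word = 0
    for sym in tail.ljust(TAIL_LEN, "?"):
        word = (word << 4) | CODE4[sym]
    return word


def tail_matches(query_word, ref_word):
    """Assumed rule: every non-wildcard query nibble must equal the reference nibble."""
    return (query_word & ref_word) == query_word


ref = encode_tail("AGTCAGTCAGTCAGTC")
print(hex(encode_tail("AGTC")))
print(tail_matches(encode_tail("AG?C"), ref), tail_matches(encode_tail("AGAC"), ref))
```

With one-hot codes, a whole 16-symbol tail is compared with a single AND and a single equality test on one machine word.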

  • Example
    - Reference genome: ATAGACTAAAAAAAAAAAAAAATT (24 symbols)
    - Diagram label: padding

  • Up to now we have seen how to encode the reference genome

    Now we show how to encode the query sequences

  • Encoding the query DB
    - Now, instead of one long sequence, we have many shorter ones (millions): AGTC . / AGC . / AGTC . . / AGTCG
    - We do similar extractions into heads and tails
    - Head: 14 symbols, used as the key
    - Tail: 16 symbols = 64 bits

  • Encoding the query DB
    Differences between target db and query db:
    - Query sequences may contain wildcards (in the db, wildcards are used only for padding)
    - Query sequences may have variable length (compared to the fixed-size n-grams extracted from the target db)

    Diagram text: AGTC . . / AGC . / AGTC . . / AGTCG

  • Encoding the query DB - wildcards
    Treated differently depending on where they appear:
    - Head: expanded to the possible symbols
    - Tail: encoded as binary wildcard 0000

    Diagram text: AGTC . / AGTCAGTCAGTCAGTCAGTC
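Expanding a wildcard in the head can be sketched as enumerating every concrete head it stands for, so that each one becomes an ordinary hash key. The sketch below assumes `?` as the wildcard character (as in the query examples earlier in the deck) and the four-letter alphabet ACGT; `expand_head` is a hypothetical name.

```python
from itertools import product


def expand_head(head, wildcard="?"):
    """Replace each wildcard with A, C, G and T in turn; return all concrete heads."""
    slots = ["ACGT" if sym == wildcard else sym for sym in head]
    return ["".join(combo) for combo in product(*slots)]


one = expand_head("A?G")
two = expand_head("?C?")
print(one)
print(len(two))
```

Each wildcard multiplies the number of lookups by four, which is presumably why wildcards in the tail are handled with the 0000 bit pattern instead of by expansion.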

  • Handling Forward & Reverse DNA strands
    - Reference genome: ACGAGTACACGAGTCG (forward strand, backward strand)
    - If we explicitly encoded the reverse strand, we would be indexing twice as much data

  • Handling Forward & Reverse DNA strands
    - Reference genome: ACGAGTACACGAGTCG (forward strand, backward strand)
    - Form complementary sequences
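One common way to avoid indexing the backward strand is to search the forward-strand index with the reverse complement of each query. The slides only say "form complementary sequences", so treating this as a reverse complement applied on the query side is an assumption. The function name and the `?` wildcard handling are hypothetical.

```python
# Complement table; the wildcard "?" maps to itself.
COMPLEMENT = str.maketrans("ACGT?", "TGCA?")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


rc = reverse_complement("ACGAGTAC")
print(rc)
```

Applying the function twice returns the original sequence, so a hit for the reverse complement on the forward strand corresponds to a hit for the original query on the backward strand.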

  • Overall search process.

  • Was our hash key function a correct one?
    - A bad key function would have led to many collisions

  • Experiments
    - Comparison with other short sequence anchoring tools
    - QPick performance
        - Number of wildcards
        - Index creation time
        - Query time

    Datasets
    - Homo sapiens: ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/
    - Mus musculus: ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/

  • Comparison: Short Sequence Anchoring Tools
    - Up to 60x faster

    Tool       Hits         Misses      Time (sec)   Improvement
    QPick      10,611,858   0           79           -
    BWA        10,611,858   0           149          1.8x
    fetchGWI   10,611,856   2           186          2.3x
    Bowtie     10,611,858   0           331          4.1x
    SOAP       10,573,528   38,330      4728         59.84x
    Eland      10,559,799   5,014,059   555          7.0x
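The Improvement column appears to be each tool's time divided by QPick's 79 seconds, truncated rather than rounded. A minimal sketch under that assumption is shown below. The times are read from the fused table row, and that parsing is itself an interpretation; `truncated_ratio` is a hypothetical helper.

```python
import math

QPICK_SEC = 79
# Times in seconds as read from the comparison table (parsing is an assumption).
times_sec = {"BWA": 149, "fetchGWI": 186, "Bowtie": 331, "SOAP": 4728, "Eland": 555}


def truncated_ratio(time_sec, baseline_sec, decimals):
    """Ratio time/baseline, truncated (not rounded) to the given number of decimals."""
    scale = 10 ** decimals
    return math.floor(time_sec * scale / baseline_sec) / scale


improvement = {tool: truncated_ratio(t, QPICK_SEC, 1) for tool, t in times_sec.items()}
soap_two_decimals = truncated_ratio(times_sec["SOAP"], QPICK_SEC, 2)
print(improvement, soap_two_decimals)
```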

  • Visualizing the results

  • Comparison with Hash-Based techniques
    - QPick 16x faster than FetchGWI

  • Varying the wildcards
    - 5-6 seconds retrieval time for 0-1 wildcards

  • Full Human-Genome Search
    - Time to index: reference genome, query sequences
    - Time to search: 1M-10M queries (short DNA fragments)

  • Full Genome Search - Index Time
    - ~7 hours (on a single-core CPU)
    - < 50 sec for query hashtables

  • Full Genome Search - Search Time
    - ~130 sec to find 1M matches (single core)

  • Conclusions
    - QPick: fast and complete search for short sequence fragments
    - Takes advantage of:
        - Small DNA alphabet
        - Bit packetization
        - Hash joins
    - Up to 60x faster than competitive techniques
    - Applications for .
    - Future

    Diagram labels: ACGT, 0011100011

    The long serpentine is the human genome. All techniques developed 2008-2009.