edbt 2010 march 22-26 2010

Anchoring Millions of Distinct Reads on the Human Genome Within Seconds
Tien Huynh (1), Michalis Vlachos (2), Isidore Rigoutsos (3)
(1) IBM TJ Watson Research, NY, USA
(2) IBM Zurich Research Laboratory, CH
(3) Thomas Jefferson University, USA
EDBT 2010, March 22-26, 2010

Introduction - Past
Bioinformatics Sequencing Evolution
Human Genome
  3.3B nucleotides
  825MB raw data
  20MB compressed
  25,000 genes

Introduction - Past, Present, Future
It took 13 years for teams of scientists around the globe to first read the human genome, completing the project in 2001. In 2007, it took 2 months to sequence the genome of DNA co-discoverer James Watson. By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg.
Source: Science, 2009

What we want to do
Question from industry: within 24 hours, can we find, FAST, ALL the positions of (newly sequenced) DNA fragments on a reference genome?
[Figure: new-generation sequencer and reference genome]

Applications
Cancer research: help isolate cancer-initiating mutations
Better design of siRNAs (short interfering RNAs)
  RNA sequences of ~21 nucleotides -> RNA interference
  Therapeutic gene silencing
  Cancer therapy by tumor-related gene targeting
Junk DNA analysis
Re-sequencing
etc.

How can we achieve that?
Solve as a hash join between the reference and the query database
Reference Genome/Database (e.g. human genome)
  Static, one time
Query Database (produced fragments)
  Dynamic, recreated on every experiment
How fast can we achieve this on a commodity desktop?
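The hash-join formulation can be sketched in a few lines. This is a toy in-memory version, assuming exact matches only and plain Python strings; the names `build_index`, `probe`, and `key_len` are illustrative and not taken from QPick, which is disk based and bit-packed.

```python
from collections import defaultdict

def build_index(reference, key_len):
    """Build side: map every key_len-long window to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - key_len + 1):
        index[reference[pos:pos + key_len]].append(pos)
    return index

def probe(index, reference, query, key_len):
    """Probe side: look up the query prefix, then verify the rest of the query."""
    hits = []
    for pos in index.get(query[:key_len], []):
        if reference.startswith(query, pos):
            hits.append(pos)
    return hits

reference = "ACGAGTACACGAGTCG"
index = build_index(reference, key_len=4)   # static side, built once
queries = ["ACGAGT", "GTAC", "TTTT"]        # dynamic side, new per experiment
results = {q: probe(index, reference, q, key_len=4) for q in queries}
print(results)
```

The build side is paid once for the reference, while each experiment only pays for probing, which is what makes the static/dynamic split attractive.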

We call the technique QPick (=Quick Pick)

[Figure: k-mer extraction from the genomic sequence (the database); windows shown: ACGGTTT, CGGTTTA, GGTTTAG]
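K-mer extraction is a sliding window over the sequence. A minimal sketch follows; the sequence ACGGTTTAG and k = 7 are chosen because they reproduce the three windows that appear to be drawn on this slide, and that reading of the figure is an assumption.

```python
def kmers(sequence, k):
    """Yield (position, k-mer) for every window of length k, sliding by one."""
    for pos in range(len(sequence) - k + 1):
        yield pos, sequence[pos:pos + k]

windows = list(kmers("ACGGTTTAG", 7))
for pos, window in windows:
    print(pos, window)
```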

QPick - Advantages
Disk based, small memory requirements
Other techniques run out of memory on large datasets
3GB RAM is sufficient to:
  Index and search the human genome
  Query millions of short fragments

QPick - Advantages
Highly parallelizable
A single CPU/core is sufficient

QPick - Advantages
Flexibility
Rigid and wildcard matches
[Figure: a rigid query (ACGAGTAC) and queries containing "?" wildcards matched against the reference genome]

QPick - Advantages
Completeness of results
Various competitors failed to return all matches
QPick does not miss any matches
[Figure: example matches of a wildcard query on the reference]

Technical Highlights
Data Compression
  Exploit small DNA alphabet
Fast bitwise operations
  Take advantage of 64-bit word comparisons
Simple implementation (hash based)
Data Pruning

[Figure: nucleotides A, C, G, T encoded as bits and packed into 64-bit words]

Data Representation
[Figure: a window of length Lmax sliding over the reference genome ACGAGTACACGAGTCG]

Data Representation - Head
Binary: 0001101100011011000110110001
Decimal: 28,422,577
[Figure labels: reference genome, window, head, tail, position 28,422,577]
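The head encoding can be checked with a short sketch. It assumes the 2-bit code A = 00, C = 01, G = 10, T = 11, which the slides do not spell out; under that assumption the 28-bit pattern on this slide corresponds to the 14-symbol head ACGTACGTACGTAC. The names `CODE2` and `encode_head` are illustrative.

```python
# Assumed 2-bit code; the mapping is not stated explicitly on the slides.
CODE2 = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
HEAD_LEN = 14  # 14 symbols x 2 bits = 28-bit key

def encode_head(head):
    """Pack a 14-nucleotide head into an integer, first symbol in the high bits."""
    if len(head) != HEAD_LEN:
        raise ValueError("head must be exactly 14 nucleotides")
    key = 0
    for symbol in head:
        key = (key << 2) | CODE2[symbol]
    return key

key = encode_head("ACGTACGTACGTAC")
print(format(key, "028b"), key)  # 0001101100011011000110110001 28422577
```

Because the key is a plain integer below 2^28, it can serve directly as a bucket number, so the head itself never needs to be stored.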

Data Representation - Advantages
Head = key for hashtable
Reduction in dataset size
  We don't have to explicitly store the head

Data Representation - Tail
Codebook: A = 0001, G = 0010, T = 0100, C = 1000, . = 0000
Tail padded to proper length
Tail: 16 nts, 16 x 4 bits = one 64-bit word
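A minimal sketch of the tail encoding, using the 4-bit codebook as read from this slide (A = 0001, G = 0010, T = 0100, C = 1000, wildcard/padding = 0000). Placing the first symbol in the highest nibble is an assumption, and `CODE4` and `encode_tail` are illustrative names.

```python
# Codebook as read from the slide; "." is the wildcard / padding symbol.
CODE4 = {"A": 0b0001, "G": 0b0010, "T": 0b0100, "C": 0b1000, ".": 0b0000}
TAIL_LEN = 16  # 16 symbols x 4 bits = one 64-bit word

def encode_tail(tail):
    """Pad the tail with wildcards to 16 symbols and pack it into 64 bits."""
    if len(tail) > TAIL_LEN:
        raise ValueError("tail must be at most 16 nucleotides")
    word = 0
    for symbol in tail.ljust(TAIL_LEN, "."):
        word = (word << 4) | CODE4[symbol]
    return word

word = encode_tail("AGTC")
print(format(word, "016x"))  # 1248000000000000
```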

Example
Reference genome: ATAGACTAAAAAAAAAAAAAAATT (24 symbols)
[Figure: the 24 symbols followed by padding]

Up to now we have seen how to encode the reference genome

Now we show how to encode the query sequences

Encoding the query DB
Now, instead of one long sequence, many shorter ones (millions)
[Figure: query sequences of different lengths, some containing "." wildcards]
We do similar extractions in heads and tails
  head: 14 symbols = key
  tail: 16 symbols = 64 bits

Encoding the query DB
Differences between target db and query db:
Query sequences may contain wildcards (in the db only used for padding)
Query sequences may have variable length (compared to the fixed size n-grams extracted from the target db)

Encoding the query DB - wildcards
Treated differently depending on where they appear
Head: expanded to the possible symbols
Tail: encoded as binary wildcard 0000
[Figure: a query with a wildcard in the head expanded into one query per possible symbol]
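Both wildcard treatments can be sketched briefly. Head wildcards are expanded into concrete heads so that each can be looked up as an ordinary key; tail wildcards stay in place as 0000 nibbles. The matching rule `(ref & query) == query` is an assumption about how the one-hot codebook might be used with 64-bit operations, not a statement of QPick's actual comparison; all function names are illustrative.

```python
from itertools import product

# One-hot codebook for tails; "." is the wildcard / padding symbol.
CODE4 = {"A": 0b0001, "G": 0b0010, "T": 0b0100, "C": 0b1000, ".": 0b0000}

def expand_head(head):
    """Replace each '.' in the head by every nucleotide, giving concrete heads."""
    choices = ["ACGT" if symbol == "." else symbol for symbol in head]
    return ["".join(combo) for combo in product(*choices)]

def encode_tail(tail, length=16):
    """Pad with wildcards and pack 4 bits per symbol, first symbol highest."""
    word = 0
    for symbol in tail.ljust(length, "."):
        word = (word << 4) | CODE4[symbol]
    return word

def tail_matches(ref_word, query_word):
    """Assumed rule: every non-wildcard query nibble must equal the reference nibble."""
    return (ref_word & query_word) == query_word

heads = expand_head("A.G")
ref = encode_tail("AGTC")
print(heads)
print(tail_matches(ref, encode_tail("A.TC")), tail_matches(ref, encode_tail("ACTC")))
```

Each head wildcard multiplies the number of lookups by four, whereas a tail wildcard costs nothing extra, which is one plausible reason to treat the two positions differently.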

Handling Forward & Reverse DNA strands
If we explicitly encoded the reverse strand, we would be indexing twice as much data
Form complementary sequences instead
[Figure: reference genome ACGAGTACACGAGTCG with forward and backward strands]
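One way to avoid indexing both strands is to search each query together with its reverse complement against the forward-strand index. The sketch below assumes that this is what "form complementary sequences" refers to; the slides do not say on which side the complement is formed. The name `reverse_complement` is illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement; the wildcard '.' is left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]

query = "ACGAGTAC"
both_strands = [query, reverse_complement(query)]
print(both_strands)
```

This doubles the (small, per-experiment) query set instead of doubling the (large, static) reference index.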

Overall search process.

Was our hash key function a correct one?
A bad key function would have led to many collisions

Experiments
Comparison with other Short Sequence Anchoring Tools
QPick Performance
  Number of Wildcards
  Index Creation Time
  Query Time

Datasets
Homo sapiens: ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/
Mus musculus: ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/

Comparison: Short Sequence Anchoring Tools
Up to 60x faster

Tool       Hits         Misses      Time (sec)   Improvement
QPick      10,611,858   0           79           -
BWA        10,611,858   0           149          1.8x
fetchGWI   10,611,856   2           186          2.3x
Bowtie     10,611,858   0           331          4.1x
SOAP       10,573,528   38,330      4728         59.84x
Eland      10,559,799   5,014,059   555          7.0x

Visualizing the results

Comparison with Hash-Based techniques
QPick 16x faster than FetchGWI

Varying the wildcards
5-6 seconds retrieval time for 0-1 wildcards

Full Human-Genome Search
Time to index
  Reference Genome
  Query sequences
Time to Search
  1M-10M queries (short DNA fragments)

Full Genome Search - Index Time
~7 hours (on single core CPU)
< 50 sec for query hashtables

Full Genome Search - Search Time
~130 sec to find 1M matches (single core)

Conclusions
QPick: fast and complete search for short sequence fragments
Takes advantage of:
  Small DNA alphabet
  Bit packetization
  Hash joins
Up to 60x faster than competing techniques
Applications for ...
Future


The long serpentine is the human genome
All techniques developed 2008-2009
