Page 1: 1 Tien Huynh 1, Michalis Vlachos 2, Isidore Rigoutsos 3 EDBT 2010 March 22-26 2010 Anchoring Millions of Distinct Reads on the Human Genome Within Seconds

1

Tien Huynh (1), Michalis Vlachos (2), Isidore Rigoutsos (3)

EDBT 2010, March 22-26, 2010

Anchoring Millions of Distinct Reads on the Human Genome Within Seconds

(1) IBM TJ Watson Research, NY, USA
(2) IBM Zurich Research Laboratory, CH
(3) Thomas Jefferson University, USA

2

Introduction - Past
• Bioinformatics Sequencing Evolution

Human Genome
• 3.3B nucleotides
• 825MB raw data
• 20MB compressed
• 25,000 genes
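The 825MB figure is consistent with storing each nucleotide in 2 bits, which is enough for a four-letter alphabet. The short check below is only a sketch of that arithmetic; the 2-bit packing and the use of decimal megabytes are assumptions made here, not statements from the slide.

```python
NUCLEOTIDES = 3_300_000_000   # 3.3B nucleotides, as quoted on the slide
BITS_PER_NUCLEOTIDE = 2       # assumption: 4 symbols (A, C, G, T) -> 2 bits each

raw_bytes = NUCLEOTIDES * BITS_PER_NUCLEOTIDE // 8
raw_megabytes = raw_bytes / 1_000_000   # assumption: decimal megabytes

print(raw_megabytes)
```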

3

Introduction – Past, Present, Future

It took 13 years for teams of scientists around the globe to first read the human genome – completing the project in 2001.

In 2007, it took 2 months to sequence the genome of DNA-co-discoverer James Watson.

By 2013 it is likely that your personal genome could be read in the time it takes to boil an egg.

“Science 2009”

4

What we want to do

Question from industry: can we find, FAST (within 24 hours), ALL the positions of (newly sequenced) DNA fragments on a reference genome?

[Figure: a new-generation sequencer produces millions of short reads of 20-75 nts (e.g. C G T T A C G, or A C G ? A G T A with a wildcard) that must be located on the reference genome (e.g. A C G T T A C G).]
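To make the task concrete, here is a brute-force baseline that reports every position of a read on a reference, treating `?` as a wildcard. It is not the QPick method, only a reference point for what "all positions" means; the function names are hypothetical.

```python
def matches_at(reference: str, read: str, start: int) -> bool:
    """True if `read` matches `reference` at `start`; '?' matches any symbol."""
    window = reference[start:start + len(read)]
    return all(q == "?" or q == r for q, r in zip(read, window))


def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where `read` occurs in `reference`."""
    last_start = len(reference) - len(read)
    return [i for i in range(last_start + 1) if matches_at(reference, read, i)]


reference = "ACGTTACG"
exact_hits = find_all(reference, "CGTTACG")   # read without wildcards
wildcard_hits = find_all(reference, "?CG")    # read with one wildcard
print(exact_hits, wildcard_hits)
```

This scan costs time proportional to (reference length x read length) per read, which is what a hash-based approach is meant to avoid at the scale of millions of reads.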

5

Applications
• Cancer Research: help isolate cancer-initiating mutations
• Better design of siRNAs (short interfering RNAs)
– RNA sequences of ~21 nucleotides -> RNA interference
– Therapeutic gene silencing
– Cancer therapy by tumor-related gene targeting
• Junk DNA analysis
• Re-sequencing
• etc.

6

How can we achieve that?
• Solve as a hash join between the reference and the query database
– Reference Genome/Database (e.g. human genome)
  • Static, one time
– Query Database (produced fragments)
  • Dynamic, recreated on every experiment
– How fast can we achieve this on a commodity desktop?
• We call the technique QPick (= Quick Pick)
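The hash-join idea can be sketched on a toy scale: build a hashtable over the fixed-length substrings of the reference once, then probe it with each read. This is a simplified illustration that assumes exact, equal-length reads and an in-memory dictionary; it omits QPick's head/tail split, wildcards and disk-based storage, and the function names are hypothetical.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict[str, list[int]]:
    """Static side of the join: map every k-mer to its 0-based positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hash_join(index: dict[str, list[int]], reads: list[str]) -> dict[str, list[int]]:
    """Dynamic side of the join: look up each read in the reference index."""
    return {read: index.get(read, []) for read in reads}


index = build_index("ACGTTACG", k=3)
hits = hash_join(index, ["ACG", "TTA", "GGG"])
print(hits)
```

The index is built once per reference genome, while the set of reads changes with every experiment, which mirrors the static/dynamic split on the slide.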

7

[Figure: QPick overview.
Target database: genomic sequence (the database) -> k-mer extraction -> head/tail split -> key extraction & bit packetization -> hashtable.
Query database: short-read generator output (with wildcards) -> head & tail separation -> query set expansion (wildcards + reverse strand) -> bit packetization / hashtable construction.
The target database and the query database are then combined with a hashtable join.]

8

QPick - Advantages

• Disk Based. Small Memory Requirements
– Other techniques run out of memory on large datasets…
– 3GB RAM is sufficient to
  a) Index
  b) Search Human Genome
  c) Query Millions of Short Fragments

9

QPick - Advantages

• Highly Parallelizable
– A single CPU/core is sufficient.

10

QPick - Advantages

• Flexibility
– Rigid and wildcard matches

[Figure: a rigid query (A C G A G T A C) and a query with wildcards (A C G ? A G T A ?) matched against the reference genome.]

11

QPick - Advantages

• Completeness of results
– Various competitors failed to return all matches

[Figure: the reads C G T T A C G and A C G ? A G T A anchored on the reference A C G T T A C G.]

QPick does not miss any matches

12

Technical Highlights
• Data Compression
– Exploit small DNA alphabet
• Fast bitwise operations
– Take advantage of 64-bit word comparisons
• Simple implementation (hash based)
• Data Pruning

[Figure: the symbols A, C, G, T packed into a 64-bit word (…0011100011…).]

13

Data Representation

[Figure: a window of length Lmax sliding over the reference genome.]

14

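Data Representation - Head

Codebook (2 bits per symbol):
A  00
G  01
T  10
C  11

Head of the window: A G T C A G T C A G T C A G
Head bits: 00 01 10 11 00 01 10 11 00 01 10 11 00 01
Binary: 0001101100011011000110110001
Decimal: 28,422,577

[Figure: the head of the window on the reference genome is converted to the number 28,422,577, which gives the position under which the tail is stored.]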
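The head encoding can be reproduced with a few lines of code. The sketch below uses the slide's 2-bit codebook and the 14-symbol head from the example; the function name is hypothetical, and packing the first symbol into the most significant bits is inferred from the binary string shown on the slide.

```python
# 2-bit codebook from the slide
HEAD_CODE = {"A": 0b00, "G": 0b01, "T": 0b10, "C": 0b11}


def encode_head(head: str) -> int:
    """Pack the head into an integer, first symbol in the highest bits."""
    key = 0
    for symbol in head.upper():
        key = (key << 2) | HEAD_CODE[symbol]
    return key


key = encode_head("AGTCAGTCAGTCAG")
binary = format(key, "028b")   # 14 symbols x 2 bits = 28 bits
print(binary, key)
```

Because the integer is itself the hashtable key, the head never needs to be stored next to the tail.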

15

Data Representation

Advantages
– Head = key for hashtable
– Reduction in dataset size
  • We don’t have to explicitly store the head

16

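Data Representation - Tail

Codebook (4 bits per symbol):
A  0001
G  0010
T  0100
C  1000
.  0000

[Figure: a window on the reference genome split into head and tail, with the tail padded by dots (. . . .).]

• The tail is padded to the proper length: 16 nts x 4 bits = one 64-bit word
• Example tail word: 0100 1000 … 0000
• Next to the tail we store the position of the window (how many shifts we have made since the beginning of the string), e.g. 1
• A pattern p matches a query q when p & q = q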
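The tail encoding and the `p & q = q` test can be sketched as follows. The codebook is the one on the slide; the function names, the use of `.` for both padding and wildcards, and placing the first symbol in the highest 4 bits are assumptions made for illustration.

```python
# 4-bit one-hot codebook from the slide; '.' (padding / wildcard) is all zeros
TAIL_CODE = {"A": 0b0001, "G": 0b0010, "T": 0b0100, "C": 0b1000, ".": 0b0000}
TAIL_LENGTH = 16   # 16 symbols x 4 bits = one 64-bit word


def encode_tail(tail: str) -> int:
    """Pad the tail with '.' to 16 symbols and pack it into a 64-bit integer."""
    padded = tail.upper().ljust(TAIL_LENGTH, ".")
    word = 0
    for symbol in padded:
        word = (word << 4) | TAIL_CODE[symbol]
    return word


def tail_matches(pattern: int, query: int) -> bool:
    """The slide's test: every bit set in the query is also set in the pattern."""
    return (pattern & query) == query


p = encode_tail("GTAGTCAC")          # tail from the reference
hit = tail_matches(p, encode_tail("GTA.TC"))   # wildcard and shorter length
miss = tail_matches(p, encode_tail("GTTG"))    # third symbol differs
print(hit, miss)
```

Since a wildcard or a padding symbol contributes only zero bits to the query, a single AND plus one comparison on a 64-bit word checks all 16 tail positions at once.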

17

Example

Reference genome (24 symbols): ATAGACTAAAAAAAAAAAAAAATT

Each window holds 30 symbols (14-symbol head + 16-symbol tail); tails that run past the end of the sequence are padded with dots.

head            tail              position
atagactaaaaaaa  AAAAAAAATT......  1
tagactaaaaaaaa  AAAAAAATT.......  2
……
aaaaaaaaaaaaaa  ATT.............  8
aaaaaaaaaaaaaa  TT..............  9
aaaaaaaaaaaaat  T...............  10
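The rows of this example can be regenerated with a sliding window. The sketch assumes 1-based positions, a 14-symbol head and a 16-symbol tail as in the talk; the function name is hypothetical, and whether windows whose tail is entirely padding are kept is an assumption (they are kept here).

```python
HEAD_LENGTH = 14
TAIL_LENGTH = 16


def extract_windows(reference: str) -> list[tuple[str, str, int]]:
    """Slide a window over the reference; return (head, padded tail, position)."""
    windows = []
    for start in range(len(reference) - HEAD_LENGTH + 1):
        head = reference[start:start + HEAD_LENGTH].lower()
        tail = reference[start + HEAD_LENGTH:start + HEAD_LENGTH + TAIL_LENGTH]
        windows.append((head, tail.ljust(TAIL_LENGTH, "."), start + 1))
    return windows


windows = extract_windows("ATAGACTAAAAAAAAAAAAAAATT")
for head, tail, position in windows:
    print(head, tail, position)
```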

18

Up to now we have seen how to encode the reference genome

Now we show how to encode the query sequences

20

Encoding the query DB

Now, instead of one long sequence, many shorter ones (millions):

A G T C A G T C .
A G C .
A G T C A G T C . .
A G T C G

We do similar extractions in heads and tails:
– head: 14 symbols (the key)
– tail: 16 symbols = 64 bits

22

Encoding the query DB

Differences between target db and query db:
1. Query sequences may contain wildcards (in the db only used for padding)
2. Query sequences may have variable length (compared to the fixed size n-grams extracted from the target db)

A G T C A G T C . .
A G C .
A G T C A G T C . .
A G T C G

24

Encoding the query DB - wildcards

Treated differently depending where they appear:
1. Head. Expanded to the possible symbols
2. Tail. Encoded as binary wildcard 0000

[Figure: the head A G T C . is expanded into A G T C A, A G T C G, A G T C T and A G T C C.]
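Head wildcards have to be expanded because the head is used as an exact hashtable key. A minimal sketch of that expansion is shown below; the function name and the use of `.` as the wildcard symbol are assumptions for illustration.

```python
from itertools import product

ALPHABET = "AGTC"   # symbol order taken from the slide's codebook


def expand_head(head: str, wildcard: str = ".") -> list[str]:
    """Replace every wildcard in the head by each of the four symbols."""
    choices = [ALPHABET if symbol == wildcard else symbol for symbol in head]
    return ["".join(combo) for combo in product(*choices)]


expanded = expand_head("AGTC.")
print(expanded)
```

Each wildcard in the head multiplies the number of keys by four, so two wildcards give 16 keys. Wildcards in the tail cost nothing extra because they are absorbed by the 0000 encoding.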

25

Handling Forward & Reverse DNA strands

[Figure: the reference genome with its forward strand and backward strand.]

If we explicitly encoded the reverse strand, we would be indexing twice as much data.

28

Handling Forward & Reverse DNA strands

Form complementary sequences

[Figure: forward and backward strands of the reference genome, with the paired sequences G G C C T T T / A A A G G C C and C C T / G G A.]
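Instead of indexing the reverse strand, the reverse complement of each read can be added to the query set, so only the forward strand of the reference is indexed. The sketch below shows that step under the standard base pairing (A-T, C-G); the function names are hypothetical and wildcard symbols are simply left unchanged.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")   # standard Watson-Crick pairing


def reverse_complement(read: str) -> str:
    """Complement every base and reverse the result; other symbols pass through."""
    return read.translate(COMPLEMENT)[::-1]


def expand_with_reverse_strand(reads: list[str]) -> list[str]:
    """Return each read followed by its reverse complement."""
    expanded = []
    for read in reads:
        expanded.extend([read, reverse_complement(read)])
    return expanded


query_set = expand_with_reverse_strand(["GGCCTTT"])
print(query_set)
```

This doubles the (small) query database rather than the (large) reference index.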

29

Overall search process….

32

Was our hash key function a good choice? A bad key function would have led to many collisions…

YES! 98.5% of buckets contain fewer than 10 entries
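A statistic of this kind can be computed directly from the list of keys inserted into the hashtable. The sketch below uses made-up toy keys, not the talk's data, and counts only non-empty buckets, which is an assumption about how the 98.5% was measured.

```python
from collections import Counter


def fraction_small_buckets(keys, limit: int = 10) -> float:
    """Fraction of non-empty buckets holding fewer than `limit` entries."""
    sizes = Counter(keys)   # bucket key -> number of entries
    small = sum(1 for count in sizes.values() if count < limit)
    return small / len(sizes)


# Toy data: bucket "a" is crowded, the other three are small
toy_keys = ["a"] * 12 + ["b"] * 3 + ["c"] + ["d"] * 2
fraction = fraction_small_buckets(toy_keys)
print(fraction)
```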

33

Experiments
• Comparison with other Short Sequence Anchoring Tools
• QPick Performance
– Number of Wildcards
– Index Creation Time
– Query Time
• Datasets
– Homo sapiens: ftp://ftp.ensembl.org/pub/release-42/homo_sapiens_42_36d/data/fasta/dna/
– Mus musculus: ftp://ftp.ensembl.org/pub/release-49/fasta/mus_musculus/dna/

34

COMPARISON: Short Sequence Anchoring Tools

Up to 60x faster …

Tool      Hits        Misses     Time (sec)  Improvement
QPick     10,611,858  0          79          -
BWA       10,611,858  0          149         1.8x
fetchGWI  10,611,856  2          186         2.3x
Bowtie    10,611,858  0          331         4.1x
SOAP      10,573,528  38,330     4728        59.84x
Eland     10,559,799  5,014,059  555         7.0x
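The Improvement column appears to be each tool's running time divided by QPick's 79 seconds, with the result truncated rather than rounded. The check below recomputes the ratios from the times in the table; the truncation helper is an assumption used to reproduce the printed values.

```python
import math

QPICK_SECONDS = 79
times = {"BWA": 149, "fetchGWI": 186, "Bowtie": 331, "SOAP": 4728, "Eland": 555}


def truncate(value: float, digits: int) -> float:
    """Cut `value` to `digits` decimals without rounding up."""
    factor = 10 ** digits
    return math.floor(value * factor) / factor


# Ratio of each tool's time to QPick's time, truncated to two decimals
improvement = {tool: truncate(t / QPICK_SECONDS, 2) for tool, t in times.items()}
for tool, ratio in improvement.items():
    print(f"{tool}: {ratio}x")
```

The largest ratio, SOAP at 59.84x, is the source of the "up to 60x faster" headline.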

36

Comparison with Hash-Based techniques

QPick – 16x faster than FetchGWI

37

Varying the wildcards
• 5-6 seconds retrieval time for 0-1 wildcards
• <60 sec for up to 4 wildcards

38

Full Human-Genome Search

• Time to index
– Reference Genome
– Query sequences
• Time to Search
– 1M-10M queries (short DNA fragments)

39

Full Genome Search – Index Time

~ 7 hours (on single core CPU)

< 50 sec for query hashtables

40

Full Genome Search – Search Time

• ~130sec to find 1M matches (single core)

41

Conclusions
• QPick: Fast and complete search for short sequence fragments
• Takes Advantage of:
– Small DNA alphabet
– Bit packetization
– Hash Joins
• Up to 60x faster than competitive techniques
• Applications for ….
• Future…

[Figure: the symbols A, C, G, T packed into bits (…0011100011…).]