
Hashing Algorithm and its Applications in Bioinformatics

By Zemin Ning

Informatics Division

The Wellcome Trust Sanger Institute

Outline of the Talk:

Research Background

SSAHA – The Fastest Sequence Search Engine

- Hash table;
- Sequence search based on the hash table;
- Various applications.

Euler Path – consensus generation

- Euler Path;
- Consensus generation;
- SNP calling.

Phusion – the WGS assembler:

- Phusion pipeline;
- Reads grouping;
- Applications.

Current Research

Powder Simulation

Hair Dynamics

Genetics and Human Hair Structure

AFRICAN, CAUCASIAN, EAST ASIAN

Sequence Search and Alignment

Algorithms

- Dynamic programming;
- Suffix tree;
- Hash method;
- …

Software tools

- FASTA;
- BLAST;
- Cross_Match;
- Blat;
- …

CPU vs Memory

Objectives:

With the SSAHA algorithm, we aim to achieve the following objectives:

(i) To develop a sequence search engine that searches genomic sequences at high speed and with acceptable accuracy;

(ii) To explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection;

(iii) To provide possible tools for sequence analysis based on the search engine.

Automatic Sequencing

[Figure: flowchart of the sequencing process, with the following stages]

- Whole genome or BAC/cosmid clone;
- DNA fragmentation: sonic disruption, nebulization;
- Small fragments: 1.0 - 2.0 kb;
- Clone library: pUC18;
- DNA sequencing: random clones;
- Partial assembly: contigs;
- Finishing: quality, both-strand coverage, gap filling;
- Final consensus sequence: ATGCAGGTCC …….

Sequence Representation

Sequence S: $(s_1, s_2, \ldots, s_i, \ldots, s_m)$, $i = 1, 2, \ldots, m$

K-tuple: $(s_i s_{i+1} \ldots s_{i+k-1})$

Using two binary digits for each base, we may have the following representations:

“A” = 00; “C” = 01; “G” = 10; “T” = 11

For any of the m/k non-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way:

$$E = \sum_{i=1}^{2k} 2^{i-1} \varepsilon_i \quad \text{with} \quad E_{max} = 2^{2k} - 1$$

where $\varepsilon_i$ = 0 or 1, depending on the value of the sequence base, and $E_{max}$ is the maximum of the possible E values.

SSAHA Index:

Hash Table: A 2-tuple hashing table of S1, S2 and S3

S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3=(GGATCCCCTGTCCTCTCTGTCACATA)

E    k-tuple  Ni   Indices and Offsets
0    AA       1    2, 19
1    AC       3    1, 9    2, 5    2, 11
2    AG       2    1, 15   2, 35
3    AT       2    2, 13   3, 3
4    CA       7    2, 3    2, 9    2, 21   2, 27   2, 33   3, 21   3, 23
5    CC       4    1, 21   2, 31   3, 5    3, 7
6    CG       1    1, 5
7    CT       6    1, 23   2, 39   2, 43   3, 13   3, 15   3, 17
8    GA       4    1, 3    1, 17   2, 15   2, 25
9    GC       0
10   GG       5    1, 25   1, 31   2, 17   2, 29   3, 1
11   GT       6    1, 1    1, 27   1, 29   2, 1    2, 37   3, 19
12   TA       1    3, 25
13   TC       6    1, 7    1, 11   1, 19   2, 23   2, 41   3, 11
14   TG       3    1, 13   2, 7    3, 9
15   TT       0
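A minimal Python sketch of how such a table could be built. It assumes the two-bit base codes given earlier, with the first base of the k-tuple in the high-order bits (this reproduces E = 0 for AA, 1 for AC and 4 for CA). The names `encode` and `build_hash_table` are illustrative and not taken from SSAHA.

```python
from collections import defaultdict

BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # two bits per base

def encode(ktuple):
    """Pack a k-tuple into the integer E, first base in the highest bits."""
    e = 0
    for base in ktuple:
        e = (e << 2) | BASE_CODE[base]
    return e

def build_hash_table(sequences, k):
    """Map E -> list of (sequence index, offset), both 1-based,
    using the non-overlapping k-tuples of every sequence."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table[encode(seq[start:start + k])].append((index, start + 1))
    return table

S1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
S2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
S3 = "GGATCCCCTGTCCTCTCTGTCACATA"

table = build_hash_table([S1, S2, S3], k=2)
print(table[encode("TG")])  # [(1, 13), (2, 7), (3, 9)]
```

K-tuples that never occur, such as GC here, simply have no entry in this sketch; a fixed-size array of 2^(2k) slots would be the other obvious layout.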

Query sequence: Sq = (TGCAACAT)


For each overlapping k-tuple t of the query, f(t) lists the hits (index, offset) found in the hash table, F(t) lists the same hits with the offset shifted by -(t-1), and Fs(t) is the list of all shifted hits after sorting.

k-tuple  f(t)     F(t)     -(t-1)  Fs(t)
TG       1, 13    1, 13     0      1, 5
         2, 7     2, 7      0      1, 13
         3, 9     3, 9      0      2, -2
GC                         -1
CA       2, 3     2, 1     -2      2, 1
         2, 9     2, 7     -2      2, 1
         2, 21    2, 19    -2      2, 4
         2, 27    2, 25    -2      2, 7
         2, 33    2, 31    -2      2, 7
         3, 21    3, 19    -2      2, 7
         3, 23    3, 21    -2      2, 7
AA       2, 19    2, 16    -3      2, 16
AC       1, 9     1, 5     -4      2, 16
         2, 5     2, 1     -4      2, 19
         2, 11    2, 7     -4      2, 21
CA       2, 3     2, -2    -5      2, 25
         2, 9     2, 4     -5      2, 28
         2, 21    2, 16    -5      2, 31
         2, 27    2, 22    -5      3, -3
         2, 33    2, 28    -5      3, 9
         3, 21    3, 16    -5      3, 16
         3, 23    3, 18    -5      3, 18
AT       2, 13    2, 7     -6      3, 19
         3, 3     3, -3    -6      3, 21

Array of index and offset data for the query sequence Sq = (TGCAACAT)
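The search over the query Sq can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the table is keyed by the k-tuple string rather than by E, and the names `build_hash_table` and `search` are hypothetical. Hits are shifted by -(t-1) and sorted, so that hits belonging to the same ungapped match become equal and adjacent.

```python
from collections import defaultdict
from itertools import groupby

def build_hash_table(sequences, k):
    """Non-overlapping k-tuples of the subjects -> list of (index, offset)."""
    table = defaultdict(list)
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table[seq[start:start + k]].append((index, start + 1))
    return table

def search(table, query, k):
    """Return [((index, shifted offset), hit count), ...] in sorted order."""
    hits = []
    for t in range(1, len(query) - k + 2):      # every overlapping k-tuple
        word = query[t - 1:t - 1 + k]
        for index, offset in table.get(word, []):
            hits.append((index, offset - (t - 1)))
    hits.sort()
    return [(key, len(list(run))) for key, run in groupby(hits)]

subjects = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]
table = build_hash_table(subjects, k=2)
runs = search(table, "TGCAACAT", k=2)
best = max(runs, key=lambda item: item[1])
print(best)  # ((2, 7), 4): four k-tuples agree on S2 at offset 7
```

The longest run, (2, 7), corresponds to the query matching S2 starting at base 7.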

64 Bit Machines

To carry out the search quickly and effectively, it is helpful in the computer code to combine the index and offset integer arrays into a single long integer array. We are targeting implementations on 64-bit machines. The long integer array can be expressed as

$$F(t) = \{H(E(t),1), H(E(t),2), \ldots, H(E(t),N_t)\}$$

with

$$H(E(t),i) = 2^{32} H_1(E(t),i) + H_2'(E(t),i), \quad i = 1, 2, \ldots, N_t$$

It is seen from the above equation that the offset value takes the low-order bits while the index takes the high-order bits of the long integer.

[Figure: layout of the long integer, with Index in the high bits and Offset in the low bits]
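The packing can be illustrated with a short Python sketch. Storing a negative shifted offset as a 32-bit two's complement value is an assumption made here for illustration; the slides do not say how negative offsets are handled. The names `pack` and `unpack` are hypothetical.

```python
MASK32 = (1 << 32) - 1

def pack(index, offset):
    """H = 2^32 * index + offset, with the offset in the low 32 bits."""
    return (index << 32) | (offset & MASK32)

def unpack(h):
    """Recover (index, offset) from a packed 64-bit value."""
    index = h >> 32
    offset = h & MASK32
    if offset >= 1 << 31:       # assumption: low word is a signed 32-bit value
        offset -= 1 << 32
    return index, offset

h = pack(2, 7)
print(h, unpack(h))             # 8589934599 (2, 7)
print(unpack(pack(2, -2)))      # (2, -2)
```

With this layout a single integer sort orders the hits by index first and by offset second, which is what the sorted list of hits relies on; under the two's complement assumption, negative offsets would sort after positive ones within the same index.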

Power Law: CPU time v query length

[Figure: normalized CPU time (0 to 1) plotted against the number of k-tuples in the query, n/k (0 to 1200); fitted power law y = 9E-05 x^1.3197, R^2 = 0.9948]

Fig. 1 Normalized CPU time plotted against the number of k-tuples in query (k=12) using Quicksort.

SSAHA Memory

Memory for subject: $M_s = 4 N_s / k + 4 \cdot 2^{2k}$

Memory for query: $M_q = N_q$

House keeping: 10-20% of total

Total memory: $M_{total} = 1.2 (M_s + M_q)$
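As a worked example, the memory formulas can be evaluated with a few lines of Python. The unit is assumed to be bytes (4 bytes per stored integer), which the slide does not state explicitly, and the function name and example sizes are illustrative only.

```python
def ssaha_memory(n_subject, n_query, k, overhead=1.2):
    """Estimated memory: overhead * (M_s + M_q), with
    M_s = 4 * N_s / k + 4 * 2^(2k) and M_q = N_q."""
    m_subject = 4 * n_subject / k + 4 * 2 ** (2 * k)
    m_query = n_query
    return overhead * (m_subject + m_query)

# Hypothetical example: a 3 Gb subject database, a 1 kb query, k = 12.
estimate = ssaha_memory(3_000_000_000, 1_000, k=12)
print(f"{estimate / 2**30:.2f} GiB")
```

For k = 12 the 4 * 2^(2k) term is about 67 million, so the 4 * N_s / k term dominates for a database of this size.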

SSAHA2 Client

The SSAHA Trace Server

The server aims to provide a near real-time (under 10 seconds) search service for a clustered 1.0 TB database. The solution can be extended by plugging in extra appliances.

The Seven Bridges of Konigsberg

[Figure: map of Konigsberg with regions a, b, c, d separated by the Pregel River, and the corresponding graph with vertices a, b, c, d]

During the 18th century, the city of Konigsberg (in East Prussia) was divided into four sections (a,b,c,d respectively) by the Pregel River. Seven bridges connected these regions.

Question: Is it possible to walk about the city so as to cross each bridge exactly once and then return to the starting point?

Vertex Degree, Euler Circuit and Euler Path

Vertex degree: For an undirected graph G, the degree of a vertex is defined as the number of edges incident to that vertex.

Euler circuit: For an undirected graph G, if there is a circuit in G that traverses every edge of the graph exactly once, then G is said to have an Euler circuit.

[Figure: example graph with vertices a, b, c, d, e, f]

Euler path: If there is an open trail from a to c in G and this trail traverses each edge in G exactly once, then the trail is called an Euler trail or Euler path.
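These definitions lead to the classical degree test: a connected undirected graph has an Euler circuit when every vertex has even degree, and an Euler path when exactly two vertices have odd degree. A minimal Python sketch, assuming the graph is connected; the assignment of the labels a, b, c, d to the regions of Konigsberg is a guess, since the figure is not available, but the answer does not depend on it.

```python
from collections import Counter

def euler_status(edges):
    """Classify a connected undirected multigraph by its odd-degree vertices."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = sum(1 for d in degree.values() if d % 2 == 1)
    if odd == 0:
        return "circuit"
    if odd == 2:
        return "path"
    return "none"

# Seven bridges between four regions (label assignment is an assumption).
bridges = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "c"),
           ("a", "d"), ("b", "d"), ("c", "d")]
print(euler_status(bridges))  # none: all four regions have odd degree
```

Since all four regions have odd degree, the walk asked for in the Konigsberg question does not exist.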

Sequence Reconstruction - Hamiltonian path approach

S=(ATGCAGGTCC)

ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC

Spectrum: ATG AGG TGC TCC GTC GGT GCA CAG

Vertices: k-tuples from the spectrum shown in red (8);
Edges: overlapping k-tuples (7);
Path: visiting all vertices corresponding to the sequence.

Sequence Reconstruction - Euler path approach

Vertices: correspond to (k-1)-tuples (7);
Edges: correspond to k-tuples from the spectrum (8);
Path: visiting all EDGES corresponding to the sequence.

[Figure: graph with vertices AT, TG, GC, CG, GT, GG, CA]

Two sequences share this spectrum: ATGCGTGGCA and ATGGCGTGCA

ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
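The Euler path approach can be sketched in Python with Hierholzer's algorithm on the graph whose vertices are (k-1)-tuples and whose edges are the k-tuples of the spectrum. This is a generic textbook construction used for illustration, not code from the talk; it assumes the spectrum admits an Euler path, and the function name is hypothetical.

```python
from collections import defaultdict

def euler_path_sequence(kmers):
    """Rebuild a sequence by walking every k-tuple (edge) exactly once."""
    graph = defaultdict(list)
    balance = defaultdict(int)            # out-degree minus in-degree
    for kmer in kmers:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # Start where out-degree exceeds in-degree by one, if such a vertex exists.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:                           # Hierholzer's algorithm
        vertex = stack[-1]
        if graph[vertex]:
            stack.append(graph[vertex].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])

spectrum = ["ATG", "AGG", "TGC", "TCC", "GTC", "GGT", "GCA", "CAG"]
print(euler_path_sequence(spectrum))  # ATGCAGGTCC
```

When the spectrum has repeats, as in the ATGCGTGGCA example, more than one Euler path exists and the sketch returns just one of them.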

SSAHA Type Hash Table: Point to the Next - Hash Table Links

S1=(ATGCAGGTCC), S2=(ATCAGGTCC)
S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)

E    k-tuple  Indices, Offsets and links to the next
7    ATG      1,1,28   3,1,28   4,1,28
8    ATC      2,1,29
10   AGT      4,5,38
11   AGG      1,5,42   2,4,42   3,6,42
19   TAG      3,5,11
24   TTC      4,7,32
28   TGC      1,2,45   3,2,46   4,2,45
29   TCA      2,2,51
32   TCC      1,8,-1   2,7,-1   3,9,-1   4,8,-1
38   GTT      4,6,24
40   GTC      1,7,32   2,6,32   3,8,32
42   GGT      1,6,40   2,5,40   3,7,40
45   GCA      1,3,51   4,3,51
46   GCT      3,3,53
51   CAG      1,4,11   2,3,11   4,4,10
53   CTA      3,4,19

Each entry gives the sequence index, the offset and the E value of the k-tuple that follows it in that sequence; -1 marks the last k-tuple of a sequence. The last row is the k-tuple at offset 4 of S3, which is CTA (E = 53), the value that the GCT entry links to.

Consensus

ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC

CONS=(ATGCAGGTCC)

ATGC-AGGTCC
AT-C-AGGTCC
ATGCTAGGTCC
ATGC-AGTTCC
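One way to picture how a consensus can be read off the links is a majority vote: from each k-tuple, follow the link that most reads agree on. The Python sketch below is a deliberately simplified illustration under that assumption; it is not the algorithm used in the talk, it keys the links by k-tuple string instead of E, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_links(reads, k):
    """For every k-tuple, count which k-tuple follows it across all reads."""
    links = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            links[read[i:i + k]][read[i + 1:i + 1 + k]] += 1
    return links

def consensus(reads, k):
    """Walk from the most common starting k-tuple along majority links."""
    links = build_links(reads, k)
    current = Counter(read[:k] for read in reads).most_common(1)[0][0]
    sequence, seen = current, {current}
    while links[current]:
        current = links[current].most_common(1)[0][0]
        if current in seen:               # stop on a cycle caused by repeats
            break
        seen.add(current)
        sequence += current[-1]
    return sequence

reads = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
print(consensus(reads, k=3))  # ATGCAGGTCC
```

On the four example reads the walk visits ATG, TGC, GCA, CAG, AGG, GGT, GTC and TCC, which agrees with the consensus path above.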


eulerSNP

In polymorphic datasets of shotgun reads, eulerSNP uses a combined Euler path and hashing algorithm to detect SNPs and replace them with the most commonly occurring base at that location.

ATTCCAGGTCC
ATTC-AGCTCC

ATGCTAGGTCC
ATGC-AGGTCC



Phusion Assembler Pipeline

[Figure: flowchart of the Phusion pipeline; box labels as they appear on the slide]

- Reads Group
- Data Process
- RPphrap - Contig
- Shotgun Reads
- Read-pair Tracker
- Supercontig
- FPC Mapping
- RPjoin - Merge
- PRono
- Assembly

Kmer Word Hashing

Gap-Hash 4x3

ATGGCGTGCAGTCCATGTTCGGATCA

ATGGGCAGATGT
TGGCCAGTTGTT
GGCGAGTCGTTC
GCGTGTCCTTCG

Contiguous Base Hash

ATGGCGTGCAGTCCATGTTCGGATCA

ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
CGTGCAGTCCAT

K = 12
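The two kinds of 12-base words can be reproduced with a short Python sketch. It reads "4x3" as three blocks of four kept bases separated by three skipped bases, an interpretation inferred from the example words on the slide; the function names and parameters are illustrative.

```python
def contiguous_kmers(seq, k=12):
    """All overlapping words of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def gapped_kmers(seq, block=4, gap=3, blocks=3):
    """Words built from `blocks` runs of `block` bases, skipping `gap` bases
    between consecutive runs."""
    step = block + gap
    span = blocks * block + (blocks - 1) * gap
    words = []
    for i in range(len(seq) - span + 1):
        words.append("".join(seq[i + j * step:i + j * step + block]
                             for j in range(blocks)))
    return words

seq = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(gapped_kmers(seq)[:4])
# ['ATGGGCAGATGT', 'TGGCCAGTTGTT', 'GGCGAGTCGTTC', 'GCGTGTCCTTCG']
print(contiguous_kmers(seq)[:2])
# ['ATGGCGTGCAGT', 'TGGCGTGCAGTC']
```

Both schemes produce 12-base words, but a gapped word spans 18 bases of the read, so a single base difference inside a skipped position leaves the word unchanged.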

Zebrafish as a model organism

Danio rerio

Fish length: 3 cm

Estimated genome size: 1.55 Gb

Easy to maintain

- short generation time;
- can be kept at high densities.

Easy to manipulate

- external fertilisation and development;
- transparent embryos.

Sanger Institute WGS project started in spring 2001

- DNA source Tuebingen embryos;

- WGS read Insert sizes: 2 - 10 kb;

- BAC-end insert sizes: 165 – 175 kb;

- Polymorphism: ~ 1000 5 day old embryos;

- SNP density: One in every 200 bps;

- Indel density: One in every 1500 bps;

- Indel length: 2 – 30 bps.

Acknowledgements:

Jim Mullikin, Yong Gu, Adam Spargo, Richard Durbin, Kerstin Jekosch, Sean Humphray, Jane Rogers, Sanger Systems Support, Sanger Sequencing Facilities
