Hashing Algorithm and its Applications in Bioinformatics
By Zemin Ning
Informatics Division
The Wellcome Trust Sanger Institute
Outline of the Talk:
Research Background
SSAHA – The Fastest Sequence Search Engine
- Hash table;
- Sequence search based on the hash table;
- Various applications.
Euler Path – consensus generation
- Euler Path;
- Consensus generation;
- SNP calling.
Phusion – the WGS assembler:
- Phusion pipeline;
- Reads grouping;
- Applications.
Current Research
Powder Simulation
Hair Dynamics
Genetics and Human Hair Structure
AFRICAN, CAUCASIAN, EAST ASIAN
Sequence Search and Alignment Algorithms
- Dynamic programming;
- Suffix tree;
- Hash method;
- …
Software tools
- FASTA;
- BLAST;
- Cross_Match;
- Blat;
- …
CPU vs Memory
Objectives:
With the SSAHA algorithm, we aim to achieve the following objectives:
(i) To develop a sequence search engine that searches genomic sequences at high speed and with acceptable accuracy;
(ii) To explore applications such as large-scale sequence assembly and single nucleotide polymorphism (SNP) detection;
(iii) To provide possible tools for sequence analysis based on the search engine.
Automatic Sequencing
[Figure: flowchart of the automatic sequencing process]
Whole genome or BAC/cosmid clone
-> DNA fragmentation (sonic disruption, nebulization)
-> Small fragments (1.0 - 2.0 kb)
-> Clone library (pUC18)
-> DNA sequencing (random clones)
-> Partial assembly (contigs)
-> Finishing (quality, both strands coverage, gap filling)
-> Final consensus sequence: ATGCAGGTCC …….
Sequence Representation
Sequence $S = (s_1 s_2 \ldots s_i \ldots s_m)$, $i = 1, 2, \ldots, m$
K-tuple: $(s_i s_{i+1} \ldots s_{i+k-1})$
Using two binary digits for each base, we may have the following representations:
“A” = 00; “C” = 01; “G” = 10; “T” = 11
For any of the m/k non-overlapping k-tuples in the sequence, an integer may be used to represent the k-tuple in a unique way:
$$E = \sum_{i=1}^{2k} 2^{i-1} E_i \quad \text{with} \quad E_{\max} = 2^{2k} - 1$$
where $E_i$ = 0 or 1, depending on the value of the sequence base, and $E_{\max}$ is the maximum value of the possible E values.
SSAHA Index:
E    k-tuple  Ni  Indices and Offsets
0    AA       1   (2, 19)
1    AC       3   (1, 9) (2, 5) (2, 11)
2    AG       2   (1, 15) (2, 35)
3    AT       2   (2, 13) (3, 3)
4    CA       7   (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
5    CC       4   (1, 21) (2, 31) (3, 5) (3, 7)
6    CG       1   (1, 5)
7    CT       6   (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17)
8    GA       4   (1, 3) (1, 17) (2, 15) (2, 25)
9    GC       0
10   GG       5   (1, 25) (1, 31) (2, 17) (2, 29) (3, 1)
11   GT       6   (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19)
12   TA       1   (3, 25)
13   TC       6   (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11)
14   TG       3   (1, 13) (2, 7) (3, 9)
15   TT       0

S1=(GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2=(GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3=(GGATCCCCTGTCCTCTCTGTCACATA)
Hash Table: A 2-tuple hashing table of S1, S2 and S3
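The encoding and the table can be reproduced with a short Python sketch. It assumes that the first base of a k-tuple occupies the highest-order bits, which is what the table ordering suggests (AC = 1, CA = 4). The function names `encode_ktuple` and `build_hash_table` are illustrative only and are not taken from the SSAHA code.

```python
# Two bits per base, as on the slide: A=00, C=01, G=10, T=11.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_ktuple(ktuple):
    """Map a k-tuple to an integer E in the range 0 .. 2**(2k) - 1.

    Assumption: the first base takes the highest-order bits.
    """
    e = 0
    for base in ktuple:
        e = (e << 2) | BASE_CODE[base]
    return e


def build_hash_table(sequences, k):
    """Store (index, offset) for every non-overlapping k-tuple.

    Sequence indices and offsets are 1-based, as in the table.
    """
    table = {}
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            e = encode_ktuple(seq[start:start + k])
            table.setdefault(e, []).append((index, start + 1))
    return table


SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]
TABLE = build_hash_table(SUBJECTS, k=2)
print(TABLE[encode_ktuple("TG")])
```

Because the sequences are scanned in order, each list comes out already sorted by index and then by offset.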
Query sequence: Sq = (TGCAACAT)
k-tuple  f(t)     F(t)     -(t-1)  Fs(t)
TG       1, 13    1, 13     0      1, 5
         2, 7     2, 7      0      1, 13
         3, 9     3, 9      0      2, -2
GC       (no hits)         -1
CA       2, 3     2, 1     -2      2, 1
         2, 9     2, 7     -2      2, 1
         2, 21    2, 19    -2      2, 4
         2, 27    2, 25    -2      2, 7
         2, 33    2, 31    -2      2, 7
         3, 21    3, 19    -2      2, 7
         3, 23    3, 21    -2      2, 7
AA       2, 19    2, 16    -3      2, 16
AC       1, 9     1, 5     -4      2, 16
         2, 5     2, 1     -4      2, 19
         2, 11    2, 7     -4      2, 22
CA       2, 3     2, -2    -5      2, 25
         2, 9     2, 4     -5      2, 28
         2, 21    2, 16    -5      2, 31
         2, 27    2, 22    -5      3, -3
         2, 33    2, 28    -5      3, 9
         3, 21    3, 16    -5      3, 16
         3, 23    3, 18    -5      3, 18
AT       2, 13    2, 7     -6      3, 19
         3, 3     3, -3    -6      3, 21
f(t) lists the (index, offset) hits of the t-th query k-tuple, F(t) is f(t) with the offset shifted by -(t-1), and Fs(t) is the sorted list of all F(t) entries. The run of four (2, 7) entries in Fs(t) locates the query in S2 at offset 7.
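The search step in the table can be sketched in Python as follows. This is a minimal illustration rather than the SSAHA implementation: the table is keyed by the k-tuple string instead of the integer E, the query position t is counted from 0 (so the shift is `offset - t`), and the helper names `build_table`, `shifted_hits` and `best_run` are hypothetical.

```python
from itertools import groupby


def build_table(sequences, k):
    """Hash non-overlapping k-tuples to 1-based (index, offset) pairs."""
    table = {}
    for index, seq in enumerate(sequences, start=1):
        for start in range(0, len(seq) - k + 1, k):
            table.setdefault(seq[start:start + k], []).append((index, start + 1))
    return table


def shifted_hits(query, table, k):
    """Collect (index, offset - t) for every overlapping query k-tuple, sorted."""
    hits = []
    for t in range(len(query) - k + 1):
        for index, offset in table.get(query[t:t + k], []):
            hits.append((index, offset - t))
    return sorted(hits)


def best_run(hits):
    """Return (index, shift, count) for the longest run of equal entries."""
    runs = [(len(list(group)), key) for key, group in groupby(hits)]
    count, (index, shift) = max(runs)
    return index, shift, count


SUBJECTS = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]
HITS = shifted_hits("TGCAACAT", build_table(SUBJECTS, 2), 2)
print(best_run(HITS))
```

Sorting brings hits with the same sequence index and the same shift next to each other, so a match shows up as a run of identical entries.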
Array of index and offset data
Query sequence: Sq = (TGCAACAT)
64 Bit Machines
In order to carry out the search quickly and effectively, it is helpful in the computer code to combine the two integer arrays (index and offset) into a single long integer array. We are targeting implementations on 64-bit machines. The long integer array can be expressed as
$$F(t) = \{H(E(t),1), H(E(t),2), \ldots, H(E(t),N_t)\}$$
with
$$H(E(t),i) = 2^{32} H_1(E(t),i) + H_2'(E(t),i), \quad i = 1, 2, \ldots, N_t$$
It is seen from the above equation that the offset value takes the low bits, while the index part takes the high-order bits in the long integer:
[ Index (high 32 bits) | Offset (low 32 bits) ]
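The packing can be illustrated with a few lines of Python. This is a sketch under the assumption that both the index and the offset are non-negative and fit in 32 bits; the names `pack` and `unpack` are illustrative.

```python
MASK32 = (1 << 32) - 1


def pack(index, offset):
    """Combine index (high 32 bits) and offset (low 32 bits) into one integer."""
    return (index << 32) | (offset & MASK32)


def unpack(packed):
    """Recover (index, offset) from a packed 64-bit value."""
    return packed >> 32, packed & MASK32


pairs = [(3, 9), (1, 13), (2, 7)]
packed_sorted = sorted(pack(i, o) for i, o in pairs)
print([unpack(p) for p in packed_sorted])
```

One consequence of this layout is that sorting the packed integers orders the entries by index first and by offset second, so a single integer sort replaces a two-key sort.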
Power Law: CPU time v query length
[Figure: normalized CPU time (y-axis, 0 to 1) against the number of k-tuples in query, n/k (x-axis, 0 to 1200). Fitted curve: y = 9E-05 x^1.3197, R^2 = 0.9948.]
Fig. 1 Normalized CPU time plotted against the number of k-tuples in query (k=12) using Quicksort.
SSAHA Memory
Memory for subject: $M_s = 4 N_s/k + 4 \cdot 2^{2k}$
Memory for query: $M_q = N_q$
House keeping: 10-20% total
Total memory: $M = 1.2\,(M_s + M_q)$
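As a rough illustration, the memory estimate can be evaluated with a small Python function. The slide does not define the symbols or units, so this sketch assumes that N_s and N_q are the numbers of bases in the subject and the query and that the result is in bytes; the function name `ssaha_memory` is hypothetical.

```python
def ssaha_memory(n_subject, n_query, k, housekeeping=1.2):
    """Estimate total memory from the slide's formulas (assumed to be bytes)."""
    m_subject = 4 * n_subject / k + 4 * 2 ** (2 * k)  # position list + table
    m_query = n_query
    return housekeeping * (m_subject + m_query)


# Toy numbers chosen only to make the arithmetic easy to follow.
print(ssaha_memory(n_subject=1200, n_query=100, k=2))
```

The second term grows as 4^k, so for large k the table itself, rather than the subject data, dominates the estimate.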
SSAHA2 Client
The SSAHA Trace Server
It aims to provide a near real-time (under 10 seconds) search service for a clustered 1.0 TB database. The solution is extensible by plugging in extra appliances.
The Seven Bridges of Konigsberg
[Figure: map of Konigsberg showing the Pregel River and the four regions a, b, c, d, alongside the corresponding graph with vertices a, b, c, d.]
During the 18th century, the city of Konigsberg (in East Prussia) was divided into four sections (a, b, c and d) by the Pregel River. Seven bridges connected these regions.
Question: Is it possible to walk about the city so as to cross each bridge exactly once and then return to the starting point?
Vertex Degree, Euler Circuit and Euler Path
Vertex degree: For an undirected graph G, the degree of a vertex is defined as the number of edges in the graph that are incident to that vertex.
Euler circuit: For an undirected graph G, if there is a circuit in G that traverses every edge of the graph exactly once, then G is said to have an Euler circuit.
[Figure: example graph with vertices a, b, c, d, e, f.]
Euler path: If there is an open trail from a to c in G and this trail traverses each edge in G exactly once, then the trail is called an Euler trail or Euler path.
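These definitions lead to the classical degree test: a connected undirected graph has an Euler circuit when every vertex has even degree, and an Euler path when exactly two vertices have odd degree. The Python sketch below applies the test to the seven bridges. The assignment of bridges to the labels a, b, c, d is an assumption (the figure is not recoverable from the transcript), but the answer does not depend on it, since in the historical layout all four regions have odd degree. The function assumes the graph is connected and does not check this.

```python
from collections import Counter


def euler_status(edges):
    """Classify a connected undirected multigraph by its odd-degree vertices."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    odd = [vertex for vertex, d in degree.items() if d % 2 == 1]
    if not odd:
        return "circuit"
    if len(odd) == 2:
        return "path"
    return "none"


# Assumed labelling: region a touches five bridges, the others three each.
KONIGSBERG = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "c"),
              ("a", "d"), ("b", "d"), ("c", "d")]
print(euler_status(KONIGSBERG))
```

For the bridges the function reports that neither an Euler circuit nor an Euler path exists, which answers the question posed on the previous slide in the negative.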
Sequence Reconstruction - Hamiltonian path approach
S=(ATGCAGGTCC)
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
ATG AGG TGC TCC GTC GGT GCA CAG
Vertices: k-tuples from the spectrum shown in red (8);
Edges: overlapping k-tuples (7);
Path: visiting all vertices corresponding to the sequence.
Sequence Reconstruction - Euler path approach
Vertices: correspond to (k-1)-tuples (7);
Edges: correspond to k-tuples from the spectrum (8);
Path: visiting all EDGES corresponding to the sequence.
[Figure: graph with vertices AT, GT, CG, CA, GC, TG, GG.]
ATGCGTGGCA ATGGCGTGCA
ATG -> TGG -> GGC -> GCG -> CGT -> GTG -> TGC -> GCA
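The Euler path approach can be sketched in Python by building the graph of (k-1)-tuples and walking every edge once with Hierholzer's algorithm. This is a generic textbook construction used for illustration, not code from the talk, and the name `reconstruct` is hypothetical. For this spectrum two Euler paths exist, corresponding to the two sequences shown on the slide; which one is returned depends on the edge order.

```python
from collections import defaultdict


def reconstruct(spectrum):
    """Spell a sequence from an Euler path over the k-tuple spectrum."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for ktuple in spectrum:
        u, v = ktuple[:-1], ktuple[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one more outgoing than incoming edge, if any.
    starts = sorted(v for v, b in balance.items() if b == 1)
    start = starts[0] if starts else spectrum[0][:-1]
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())      # dead end: emit vertex
    path.reverse()
    return path[0] + "".join(v[-1] for v in path[1:])


SPECTRUM = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
print(reconstruct(SPECTRUM))
```

The ambiguity between the two valid reconstructions is a property of the spectrum itself: vertices TG and GC each have two outgoing edges, so the path can take the loop through GG before or after the loop through CG.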
SSAHA Type Hash Table
S1=(ATGCAGGTCC), S2=(ATCAGGTCC)
S3=(ATGCTAGGTCC), S4=(ATGCAGTTCC)

E    k-tuple  Indices, Offsets and links to the next
7    ATG      1,1,28   3,1,28   4,1,28
8    ATC      2,1,29
10   AGT      4,5,38
11   AGG      1,5,42   2,4,42   3,6,42
19   TAG      3,5,11
24   TTC      4,7,32
28   TGC      1,2,45   3,2,46   4,2,45
29   TCA      2,2,51
32   TCC      1,8,-1   2,7,-1   3,9,-1   4,8,-1
38   GTT      4,6,24
40   GTC      1,7,32   2,6,32   3,8,32
42   GGT      1,6,40   2,5,40   3,7,40
45   GCA      1,3,51   4,3,51
46   GCT      3,3,53
51   CAG      1,4,11   2,3,11   4,4,10
53   CTA      3,4,19
Point to the Next - Hash Table Links
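A table of this shape can be generated with the Python sketch below. The E values on this slide are consistent with a 1-based code in which the bases are ordered A, T, G, C (for example ATG = 7 and TCC = 32); that ordering is inferred from the numbers and is not stated in the talk. The names `encode` and `build_linked_table` are illustrative.

```python
# Base order inferred from the E values on the slide (assumption).
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}


def encode(ktuple):
    """1-based integer code of a k-tuple, first base most significant."""
    e = 0
    for base in ktuple:
        e = 4 * e + CODE[base]
    return e + 1


def build_linked_table(reads, k=3):
    """Map E -> list of (read index, offset, E of the next k-tuple or -1)."""
    table = {}
    for index, read in enumerate(reads, start=1):
        ktuples = [read[i:i + k] for i in range(len(read) - k + 1)]
        for offset, ktuple in enumerate(ktuples, start=1):
            # ktuples[offset] is the k-tuple that follows the current one.
            link = encode(ktuples[offset]) if offset < len(ktuples) else -1
            table.setdefault(encode(ktuple), []).append((index, offset, link))
    return table


READS = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
LINKED = build_linked_table(READS)
print(LINKED[encode("TGC")])
```

Each entry stores where the k-tuple occurs and which k-tuple follows it in that read, so the reads can be traversed directly through the table; a link of -1 marks the end of a read.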
Consensus
ATG -> TGC -> GCA -> CAG -> AGG -> GGT -> GTC -> TCC
CONS=(ATGCAGGTCC)
ATGC-AGGTCC
AT-C-AGGTCC
ATGCTAGGTCC
ATGC-AGTTCC
ATGC-AGGTCC
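One simple way to arrive at this consensus from the four reads is to follow, at each k-tuple, the link that most reads agree on. The Python sketch below does exactly that. It is a simplified majority-vote illustration and not necessarily the procedure used in the talk; it keys the links by k-tuple string, stops if a k-tuple repeats, and the name `consensus` is hypothetical.

```python
from collections import Counter, defaultdict


def consensus(reads, k=3):
    """Follow the most frequent next k-tuple from the most frequent start."""
    links = defaultdict(Counter)
    starts = Counter()
    for read in reads:
        ktuples = [read[i:i + k] for i in range(len(read) - k + 1)]
        starts[ktuples[0]] += 1
        for current, following in zip(ktuples, ktuples[1:]):
            links[current][following] += 1
        links[ktuples[-1]][None] += 1  # None plays the role of the -1 link
    current = starts.most_common(1)[0][0]
    sequence, seen = current, {current}
    while True:
        following = links[current].most_common(1)[0][0]
        if following is None or following in seen:
            break
        sequence += following[-1]
        seen.add(following)
        current = following
    return sequence


READS = ["ATGCAGGTCC", "ATCAGGTCC", "ATGCTAGGTCC", "ATGCAGTTCC"]
print(consensus(READS))
```

At TGC two reads continue to GCA and one to GCT, and at CAG two reads continue to AGG and one to AGT, so the walk follows the majority branch at both points.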
eulerSNP
In polymorphic datasets of shotgun reads, eulerSNP uses a combined Euler path and hashing algorithm to detect SNPs and replace them with the most commonly occurring base at that location.
Reads before SNP replacement:
ATGC-AGGTCC
ATGC-AGGTCC
ATTCCAGGTCC
ATTC-AGCTCC
ATGCTAGGTCC
ATGCTAGGTCC
Reads after SNP replacement:
ATGC-AGGTCC
ATGC-AGGTCC
ATGCTAGGTCC
ATGC-AGGTCC
ATGCTAGGTCC
ATGCTAGGTCC
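The replacement step can be illustrated on reads that are already aligned by taking, in each column, the most common base among the non-gap characters. This Python sketch is only a column-majority illustration on pre-aligned reads, with the six reads transcribed from the slide; it does not implement eulerSNP's Euler path and hashing machinery, and the name `replace_snps` is hypothetical.

```python
from collections import Counter


def replace_snps(aligned_reads):
    """Replace each base by the column's most common base; gaps are kept."""
    majority = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]
        majority.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return [
        "".join(base if base == "-" else major
                for base, major in zip(read, majority))
        for read in aligned_reads
    ]


BEFORE = [
    "ATGC-AGGTCC",
    "ATGC-AGGTCC",
    "ATTCCAGGTCC",
    "ATTC-AGCTCC",
    "ATGCTAGGTCC",
    "ATGCTAGGTCC",
]
for read in replace_snps(BEFORE):
    print(read)
```

Gaps are left untouched, so indels are preserved while substitutions are replaced by the majority base of their column.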
Phusion Assembler Pipeline
[Figure: pipeline flowchart. Recoverable box labels: Shotgun Reads; Data Process; Reads Group; Assembly; RPphrap - Contig; RPjoin – Merge; Read-pair Tracker; Supercontig; FPC Mapping; PRono.]
Kmer Word Hashing
Gap-Hash 4x3
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGGCAGATGT
TGGCCAGTTGTT
GGCGAGTCGTTC
GCGTGTCCTTCG
Contiguous Base Hash, K = 12
ATGGCGTGCAGTCCATGTTCGGATCA
ATGGCGTGCAGT
TGGCGTGCAGTC
GGCGTGCAGTCC
GCGTGCAGTCCA
CGTGCAGTCCAT
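The two ways of cutting words from a read can be compared with a short Python sketch. The gapped variant assumes that "4x3" means three blocks of four bases separated by three skipped bases, which is inferred from the example words rather than stated in the talk; the function names are illustrative.

```python
def gapped_words(seq, block=4, gap=3, nblocks=3):
    """Words built from nblocks blocks of `block` bases, skipping `gap` between."""
    step = block + gap
    span = nblocks * block + (nblocks - 1) * gap
    words = []
    for start in range(len(seq) - span + 1):
        parts = [seq[start + j * step:start + j * step + block]
                 for j in range(nblocks)]
        words.append("".join(parts))
    return words


def contiguous_words(seq, k=12):
    """All overlapping words of k consecutive bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


READ = "ATGGCGTGCAGTCCATGTTCGGATCA"
print(gapped_words(READ)[:4])
print(contiguous_words(READ)[:5])
```

Both variants produce 12-base words, but a gapped word samples an 18-base window of the read, so a single base difference that falls in a skipped position does not change the word.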
Zebrafish as a model organism
Danio rerio
Fish length: 3 cm
Estimated genome size: 1.55 Gb
Easy to maintain
- short generation time
- can be kept at high densities
Easy to manipulate
- external fertilisation and development
- transparent embryos
Sanger Institute WGS project started in spring 2001
- DNA source: Tuebingen embryos;
- WGS read insert sizes: 2 – 10 kb;
- BAC-end insert sizes: 165 – 175 kb;
- Polymorphism: ~1000 5-day-old embryos;
- SNP density: one in every 200 bp;
- Indel density: one in every 1500 bp;
- Indel length: 2 – 30 bp.
Acknowledgements:
Jim Mullikin, Yong Gu, Adam Spargo, Richard Durbin, Kerstin Jekosch, Sean Humphray, Jane Rogers, Sanger Systems Support, Sanger Sequencing Facilities