Species Identification through DNA String Analysis

Mark Vorster
Supervisor: Prof Philip Machanick

Research Overview

Goal
Aid bioinformaticians in research by providing a tool which can identify similar DNA sequences, in a timely manner, in order to infer homogeneity.

Reasons for the problem
Large data sets
Days of processing
No existing specific tools


Outline: Research Overview - Bioinformatics - String Matching - Discussion - Questions

Bioinformatics

"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."

Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."

Oxford English Dictionary


History of Bioinformatics and Genetics

1953 - Watson, Crick, Wilkins and Franklin.

Discrete abstraction

Adenine – Thymine
Guanine – Cytosine


[Figure: DNA double helix, labelled with the sugar-phosphate backbone, a base, the hydrogen bonds, and one helical turn = 3.4 nm. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]


Sequence Analysis and Sequence Alignment

Sequence Alignment
Global alignment is expensive
Assumption: sequences are already globally aligned

Alignment Differences
Original     TGAGCACCT
Insertion    TGACGCACCT
Deletion     TGA_CACCT
Replacement  TGATCACCT

Phylogenetic inference


FASTA File Format

Leading ‘>’
Sequence identifier
Description or comment
A number of lines of genetic code
Other symbols

>SequenceName description or comment
CCGGAATACCTAGGACGCCTTCATCCCCCGCCGGTCTGTGATGTCCCAATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG
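The layout above can be read with a few lines of Python. This is a minimal sketch, not the tool's own reader: the function name `parse_fasta` and the (identifier, description, sequence) tuple layout are assumptions made for illustration, and the "other symbols" mentioned on the slide are passed through unchanged.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into (identifier, description, sequence) tuples.

    Hypothetical helper for illustration only.
    """
    records = []
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records


sample = """>SequenceName description or comment
CCGGAATACCTAGGACGCCTTCATCCCCCGCCGGTCTGTGATGTCCCAATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG
"""
records = parse_fasta(sample)
```

Joining the sequence lines at the end of each record keeps multi-line sequences intact, which matters because FASTA files usually wrap long sequences.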


Approximate String Matching Algorithm

Nested loops are inefficient
Dynamic programming
Takes into account all previous information
Improved to O(n^2), where n is the number of bases in the shorter sequence

Goal: find the closest match between two strings,
or the minimum number of differences



Approximate String Matching Algorithm

D[i][j] is the minimum of:
MatchCost = D[i-1][j-1], if p_i = t_j
ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
InsertCost = D[i-1][j] + 1
DeleteCost = D[i][j-1] + 1

D[0][j] = 0 and D[i][0] = i

D[i-1][j-1]  D[i-1][j]
D[i][j-1]    D[i][j]
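The recurrence can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the project's implementation: the function names `approx_match_table` and `best_match` are hypothetical, and p is taken to be the pattern (rows) and t the text (columns).

```python
def approx_match_table(pattern, text):
    """Fill the table D using D[0][j] = 0 and D[i][0] = i."""
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 stays all zeros
    for i in range(1, m + 1):
        D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = pattern[i - 1] == text[j - 1]
            diagonal = D[i - 1][j - 1] + (0 if same else 1)  # MatchCost or ReviseCost
            insert_cost = D[i - 1][j] + 1
            delete_cost = D[i][j - 1] + 1
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    return D


def best_match(pattern, text):
    """Minimum number of differences between pattern and any part of text."""
    return min(approx_match_table(pattern, text)[-1])


table = approx_match_table("happy", "Have a hsppy day")
differences = best_match("happy", "Have a hsppy day")
```

Because row 0 is all zeros, a match may start anywhere in the text, so the smallest value in the last row is the number of differences for the closest occurrence of the pattern.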


Approximate String Matching Algorithm

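Initial table for the pattern "happy" and the text "Have a hsppy day" (_ marks a space in the text):

         H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1
a     2
p     3
p     4
y     5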





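Partially filled table (_ marks a space in the text, ? is the cell being computed):

         H  a  v  e  _  a
h     1  1  1  1  1  1  1
a     2  2  1  2  2  2  1
p     3  3  2  2  3  3  ?
p     4  4  3  3  3  4
y     5  5  4  4  4  4

MatchCost = D[i-1][j-1], if p_i = t_j
ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
InsertCost = D[i-1][j] + 1
DeleteCost = D[i][j-1] + 1

For the cell in row i (p_i = p) and column j (t_j = a), the neighbours are D[i-1][j-1] = 2, D[i-1][j] = 1 and D[i][j-1] = 3:

MatchCost = N/A (p_i ≠ t_j)
ReviseCost = 3
InsertCost = 2
DeleteCost = 4
-> Min = 2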





Completed table up to the end of "hsppy" (_ marks a space in the text):

         H  a  v  e  _  a  _  h  s  p  p  y
h     1  1  1  1  1  1  1  1  0  1  1  1  1
a     2  2  1  2  2  2  1  2  1  1  2  2  2
p     3  3  2  2  3  3  2  2  2  2  1  2  3
p     4  4  3  3  3  4  3  3  3  3  2  1  2
y     5  5  4  4  4  4  4  4  4  4  3  2  1



Changes

D[i][0] = i, if p_i = t_0
D[i][0] = i + 1, if p_i ≠ t_0

D[0][j] = j, if p_0 = t_j
D[0][j] = j + 1, if p_0 ≠ t_j

Additional stop case for mismatch
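The changed boundary conditions can be sketched as below, assuming the interior cells still use the same Match/Revise/Insert/Delete recurrence as before. The function name `aligned_table` is hypothetical, indices start at 0 as in the conditions above, and the additional stop case for mismatches is not shown because the slides do not specify it.

```python
def aligned_table(pattern, text):
    """Fill D for sequences assumed to start at the same place.

    D[i][0] and D[0][j] follow the changed boundary conditions; both
    sequences are assumed to be non-empty.
    """
    rows, cols = len(pattern), len(text)
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i if pattern[i] == text[0] else i + 1
    for j in range(cols):
        D[0][j] = j if pattern[0] == text[j] else j + 1
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i] == text[j]
            diagonal = D[i - 1][j - 1] + (0 if same else 1)  # MatchCost or ReviseCost
            D[i][j] = min(diagonal, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


dna_table = aligned_table("TACGAAGGGA", "TACGGACGGT")
```

Unlike the all-zero first row used for searching inside a longer text, these boundaries charge for every skipped base, so leading offsets between the two sequences count as differences.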




      T  A  C  G  G  A  C  G  G  T
T     0  2  3  4  5  6  7  8  9  9
A     2  0  1  2  3  4  5
C     3  1  0  1  2  3  4
G     4  2  1  0  1  2  3
A     5  3  2  1  1  1  2
A     6  4  3  2  2  1  2
G     7  5  4  3  2  2  2
G     8  6  5  4  3  3  3
G     9  7  6  5  4  4  4
A    10  8  7  6  5  4  5


Discussion



Grouping Algorithm

Scale of the problem
400 – 800 bases per sequence
Tens of thousands of sequences

Assumptions:
Sequences are globally aligned
Sequences begin at the same place

Example Grouping


Seq[336] HK2QS7R01AXRJ6 Seq[218] Seq[38] Seq[235] Seq[89] …

Seq[382] HK2QS7R01BR4Q9 Seq[173]

Seq[180] HK2QS7R01ABFDP Seq[339] Seq[289] Seq[491] Seq[319] …

Seq[269] HK2QS7R01AZHD7 Seq[402] Seq[112] Seq[203] Seq[137] …

Seq[210] HK2QS7R01BMNQ4 Seq[364]

Seq[270] HK2QS7R01AZFOG Seq[388] Seq[441]

Seq[442] HK2QS7R01ADASO Seq[426] Seq[233] Seq[374] Seq[416] …

… …


Results


O(n^2), where n is the number of sequences.

~1600 comparisons per second.

10,000 sequences: ~8.6 hours (down from 10 days).

Comparisons for n sequences = n(n-1)/2
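The running-time figure follows directly from the comparison count and the measured rate. A small sketch of the arithmetic, in which the function names are hypothetical and the rate is the ~1600 comparisons per second quoted above:

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def estimated_hours(n, comparisons_per_second=1600):
    """Rough wall-clock estimate for comparing every pair once."""
    return pairwise_comparisons(n) / comparisons_per_second / 3600


pairs = pairwise_comparisons(10_000)  # 49,995,000 comparisons
hours = estimated_hours(10_000)       # between 8.6 and 8.7 hours
```

Since the count grows quadratically, doubling the number of sequences roughly quadruples the estimated time.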

