Species Identification through DNA String Analysis

Mark Vorster
Supervisor: Prof Philip Machanick

Research Overview

Goal
Aid bioinformaticians in research by providing a tool which can identify similar DNA sequences, in a timely manner, in order to infer homogeneity.

Reason for problems
Large data sets
Days of processing
No existing specific tools

Research Overview - Bioinformatics - String Matching - Discussion - Questions

Bioinformatics

"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."

Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."

Oxford English Dictionary


History of Bioinformatics and Genetics

1953 - Watson, Crick, Wilkins and Franklin.

Discrete abstraction

Adenine – Thymine
Guanine – Cytosine

[Figure: DNA double helix, labelled with the sugar-phosphate backbone, a base and the hydrogen bonds; one helical turn = 3.4 nm. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]

Sequence Analysis and Sequence Alignment

Sequence Alignment

Global alignment is expensive

Assumption: Sequences are already Globally Aligned

Alignment Differences

Original     TGAGCACCT
Insertion    TGACGCACCT
Deletion     TGA_CACCT
Replacement  TGATCACCT

Phylogenetic inference

FASTA File Format

Leading ‘>’
Sequence identifier
Description or comment
A number of lines of genetic code

Other symbols

>SequenceName description or comment
CCGGAATACCTAGGACGCCTTCATCCCCCGCCGGTCTGTGATGTCCCAATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG

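The FASTA layout described above (a leading '>', an identifier, an optional description, then one or more lines of sequence) could be read with something like the following minimal sketch. The function name `parse_fasta` and the tuple layout are assumptions made for illustration, not part of the original tool, and the sketch does not validate the "other symbols" a real file may contain.

```python
def parse_fasta(text):
    """Parse FASTA text into (identifier, description, sequence) tuples."""
    records = []
    identifier, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                records.append((identifier, description, "".join(chunks)))
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, description, "".join(chunks)))
    return records


example = """>SequenceName description or comment
CCGGAATACCTAGGACGCCTTCATCCCCCGCC
GGTCTGTGATGTCCCAATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG
"""

records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, "|", description, "|", sequence)
```

The first record is deliberately split over two lines to show that sequence lines are concatenated.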
Approximate String Matching Algorithm

Nested loops are inefficient
Dynamic programming
Takes into account all previous information
Improved to O(n^2), where n is the number of bases in the shorter sequence

Goal: Find the closest match between two strings, or the minimum number of differences

Approximate String Matching Algorithm

D[i][j] is the minimum of:
MatchCost = D[i-1][j-1], if p_i = t_j
ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
InsertCost = D[i-1][j] + 1
DeleteCost = D[i][j-1] + 1

D[0][j] = 0 and D[i][0] = i

Cells involved in each step:
D[i-1][j-1]   D[i-1][j]
D[i][j-1]     D[i][j]

Approximate String Matching Algorithm

Pattern "happy" matched against the text "Have a hsppy day" ("_" marks a space in the text, "?" the cell being filled):

           H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h       1  1  1  1  1  1  1
a       2  2  1  2  2  2  1
p       3  3  2  2  3  3  ?
p       4  4  3  3  3  4
y       5  5  4  4  4  4

Filling the marked cell (pattern 'p' against text 'a'):
MatchCost = N/A
ReviseCost = D[i-1][j-1] + 1 = 3
InsertCost = D[i-1][j] + 1 = 2
DeleteCost = D[i][j-1] + 1 = 4
-> Min = 2

Table filled in up to the end of "hsppy":

           H  a  v  e  _  a  _  h  s  p  p  y
h       1  1  1  1  1  1  1  1  0  1  1  1  1
a       2  2  1  2  2  2  1  2  1  1  2  2  2
p       3  3  2  2  3  3  2  2  2  2  1  2  3
p       4  4  3  3  3  4  3  3  3  3  2  1  2
y       5  5  4  4  4  4  4  4  4  4  3  2  1
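The recurrence and the "happy" example above can be reproduced with a short dynamic-programming sketch. The function name `approximate_match` and its return values are assumptions for illustration; the costs and boundary conditions (D[0][j] = 0, D[i][0] = i) are taken from the slides.

```python
def approximate_match(pattern, text):
    """Fill the table D and return (best distance, end column, D).

    Row 0 is all zeros, so a match may start anywhere in the text;
    column 0 holds i, the cost of leaving i pattern characters unmatched.
    """
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        D[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i - 1] == text[j - 1]
            # MatchCost when the characters agree, ReviseCost otherwise.
            diagonal = D[i - 1][j - 1] + (0 if same else 1)
            insert_cost = D[i - 1][j] + 1
            delete_cost = D[i][j - 1] + 1
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    last_row = D[-1]
    best = min(last_row)
    # The column of the first minimum marks where the closest match ends.
    return best, last_row.index(best), D


best, end, D = approximate_match("happy", "Have a hsppy day")
print(best, end)  # 1 12
print(D[3][6])    # 2, the cell worked through on the slide
```

The smallest value in the last row is the minimum number of differences, and its column gives the text position at which the closest match ends (here just after "hsppy").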

Changes

D[i][0] = i, if p_i = t_0
D[i][0] = i + 1, if p_i ≠ t_0
D[0][j] = j, if p_0 = t_j
D[0][j] = j + 1, if p_0 ≠ t_j

Additional stop case for mismatch

    T  A  C  G  G  A  C  G  G  T
T   0  2  3  4  5  6  7  8  9  9
A   2  0  1  2  3  4  5
C   3  1  0  1  2  3  4
G   4  2  1  0  1  2  3
A   5  3  2  1  1  1  2
A   6  4  3  2  2  1  2
G   7  5  4  3  2  2  2
G   8  6  5  4  3  3  3
G   9  7  6  5  4  4  4
A  10  8  7  6  5  4  5
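The changed boundary conditions can be checked against the table above with the sketch below. It assumes, as the table suggests, that row 0 and column 0 now correspond to the first characters of the two sequences rather than to a NULL row, and that the interior cells use the same minimum-of-costs rule as before. The name `aligned_distance_table` is hypothetical, and the "additional stop case for mismatch" is not included because the slides do not say how it is defined.

```python
def aligned_distance_table(pattern, text):
    """Build D for two non-empty sequences assumed to start at the same place."""
    rows, cols = len(pattern), len(text)
    D = [[0] * cols for _ in range(rows)]
    # Changed boundaries: cost grows with the index, plus 1 on a mismatch.
    for i in range(rows):
        D[i][0] = i if pattern[i] == text[0] else i + 1
    for j in range(cols):
        D[0][j] = j if pattern[0] == text[j] else j + 1
    # Interior cells: same match/revise, insert and delete costs as before.
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i] == text[j]
            diagonal = D[i - 1][j - 1] + (0 if same else 1)
            D[i][j] = min(diagonal, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


table = aligned_distance_table("TACGAAGGGA", "TACGGACGGT")
print(table[0])      # first row of the slide's table
print(table[9][:7])  # the columns of the last row shown on the slide
```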

Discussion

Grouping Algorithm

Scale of the problem:
400 – 800 bases per sequence
Tens of thousands of sequences

Assumptions:
Sequences are globally aligned
Sequences begin at the same place

Example Grouping

Seq[336] HK2QS7R01AXRJ6 Seq[218] Seq[38] Seq[235] Seq[89] …

Seq[382] HK2QS7R01BR4Q9 Seq[173]

Seq[180] HK2QS7R01ABFDP Seq[339] Seq[289] Seq[491] Seq[319] …

Seq[269] HK2QS7R01AZHD7 Seq[402] Seq[112] Seq[203] Seq[137] …

Seq[210] HK2QS7R01BMNQ4 Seq[364]

Seq[270] HK2QS7R01AZFOG Seq[388] Seq[441]

Seq[442] HK2QS7R01ADASO Seq[426] Seq[233] Seq[374] Seq[416] …

… …
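The slides do not spell out how groups like those above are formed, so the following is only an illustrative sketch of one possibility: compare every pair of sequences and merge the groups of any two that fall within a threshold. The names `group_sequences` and `mismatch_count`, the threshold, and the merging rule are all assumptions. The stand-in distance simply counts differing positions, which relies on the stated assumption that sequences are aligned and begin at the same place; the approximate matching distance could be passed in instead.

```python
from itertools import combinations


def mismatch_count(a, b):
    """Stand-in distance: differing positions plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def group_sequences(seqs, threshold, distance=mismatch_count):
    """Group sequence indices whose pairwise distance is <= threshold."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent links to the group's representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    comparisons = 0
    for i, j in combinations(range(len(seqs)), 2):
        comparisons += 1
        if distance(seqs[i], seqs[j]) <= threshold:
            parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values()), comparisons


seqs = ["TGAGCACCT", "TGATCACCT", "ACGCCTGAT", "ACGCCTGAA"]
groups, comparisons = group_sequences(seqs, threshold=1)
print(groups)       # [[0, 1], [2, 3]]
print(comparisons)  # 6
```

Because every pair is compared, the number of comparisons grows quadratically with the number of sequences.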

Results

O(n^2), where n is the number of sequences.

~1600 comparisons per second.

10,000 sequences: ~8.6 hours (down from 10 days).

Comparisons for n sequences = n(n-1)/2
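As a quick check of these figures, the arithmetic can be sketched as follows. The function names are illustrative; the rate of 1600 comparisons per second is the one quoted above.

```python
def pairwise_comparisons(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def estimated_hours(n, comparisons_per_second=1600):
    """Estimated running time in hours at the quoted comparison rate."""
    return pairwise_comparisons(n) / comparisons_per_second / 3600


print(pairwise_comparisons(10_000))       # 49995000
print(round(estimated_hours(10_000), 2))  # 8.68
```

That is roughly 50 million comparisons, which at about 1600 per second comes to a little under 8.7 hours, in line with the ~8.6 hours reported.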
