Species Identification through DNA String Analysis
Mark Vorster
Supervisor: Prof Philip Machanick
Research Overview
Goal
Aid bioinformaticians in research by providing a tool that can identify similar DNA sequences, in order to infer homogeneity, in a timely manner.
Reasons for the problem
Large data sets
Days of processing
No existing specific tools
Research Overview - Bioinformatics - String Matching - Discussion - Questions
Bioinformatics
"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."
Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta
"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."
Oxford English Dictionary
History of Bioinformatics and Genetics
1953 - Watson, Crick, Wilkins and Franklin
Discrete abstraction
Adenine – Thymine
Guanine – Cytosine
Figure: DNA double helix, showing the sugar-phosphate backbone, bases and hydrogen bonds. One helical turn = 3.4 nm.
Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif
Sequence Analysis and Sequence Alignment
Sequence Alignment
Global alignment is expensive
Assumption: sequences are already globally aligned
Alignment differences
Original     TGAGCACCT
Insertion    TGACGCACCT
Deletion     TGA_CACCT
Replacement  TGATCACCT
Phylogenetic inference
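The three kinds of alignment difference listed above can be illustrated with a small sketch. It assumes both sequences are already globally aligned, as the slide does, and that `_` marks a gap. The function name `classify_differences` is hypothetical, and writing the original as `TGA_GCACCT` for the insertion case is an assumption made here so that both strings have the same length.

```python
from typing import Dict


def classify_differences(reference: str, sample: str, gap: str = "_") -> Dict[str, int]:
    """Count column-wise differences between two pre-aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("aligned sequences must have the same length")
    counts = {"insertion": 0, "deletion": 0, "replacement": 0}
    for ref_base, sample_base in zip(reference, sample):
        if ref_base == sample_base:
            continue  # identical column (or a gap in both): no difference
        if ref_base == gap:
            counts["insertion"] += 1    # sample has a base the reference lacks
        elif sample_base == gap:
            counts["deletion"] += 1     # sample is missing a reference base
        else:
            counts["replacement"] += 1  # both have a base, but they differ
    return counts


print(classify_differences("TGA_GCACCT", "TGACGCACCT"))  # one insertion
print(classify_differences("TGAGCACCT", "TGA_CACCT"))    # one deletion
print(classify_differences("TGAGCACCT", "TGATCACCT"))    # one replacement
```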
FASTA File Format
Leading '>'
Sequence identifier
Description or comment
A number of lines of genetic code
Other symbols
>SequenceName description or comment
CCGGAATACCTAGGACGCCTTCATCCCCCGCCGGTCTGTGATGTCCCAATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG
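A minimal sketch of reading this format, assuming a header line of the form `>identifier description` followed by one or more lines of bases. The function name `parse_fasta` and the choice to split the first sequence over two lines are illustrative assumptions, not part of the presented tool. The "other symbols" mentioned on the slide are not handled.

```python
from typing import Dict, Iterable, List, Optional, Tuple

SEQ_1 = "CCGGAATACCTAGGACGCCTTCATCCCCCGCCGGTCTGTGATGTCCCAATGGACCGGA"
SEQ_2 = "ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG"


def parse_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each sequence identifier to (description, concatenated bases)."""
    records: Dict[str, Tuple[str, str]] = {}
    name: Optional[str] = None
    description = ""
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = (description, "".join(chunks))
            header = line[1:].split(maxsplit=1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # a sequence may span several lines
    if name is not None:
        records[name] = (description, "".join(chunks))
    return records


sample = [
    ">SequenceName description or comment",
    SEQ_1[:30],  # first sequence split over two lines for illustration
    SEQ_1[30:],
    ">NextSequence description or comment",
    SEQ_2,
]
records = parse_fasta(sample)
print(sorted(records))
```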
Approximate String Matching Algorithm
Nested loops are inefficient
Dynamic programming
Takes into account all previous information
Improved to O(n^2), where n is the number of bases in the shorter sequence
Goal: find the closest match between two strings
Or the minimum number of differences
Approximate String Matching Algorithm
D[i][j] is the minimum of:
MatchCost = D[i-1][j-1], if p_i = t_j
ReviseCost = D[i-1][j-1] + 1, if p_i ≠ t_j
InsertCost = D[i-1][j] + 1
DeleteCost = D[i][j-1] + 1
Boundary conditions: D[0][j] = 0 and D[i][0] = i
Cells used to compute D[i][j]:
D[i-1][j-1]  D[i-1][j]
D[i][j-1]    D[i][j]
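The recurrence above can be sketched in a few lines of Python. This is an illustrative version, not the presented tool: the names `build_table` and `best_match` are hypothetical, comparisons are assumed to be case-sensitive, and the strings are 0-indexed, so p_i corresponds to `pattern[i - 1]` and t_j to `text[j - 1]`.

```python
from typing import List, Tuple


def build_table(pattern: str, text: str) -> List[List[int]]:
    """Fill D using D[0][j] = 0, D[i][0] = i and the four costs above."""
    rows, cols = len(pattern) + 1, len(text) + 1
    D = [[0] * cols for _ in range(rows)]  # row 0 stays 0: a match may start anywhere
    for i in range(1, rows):
        D[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            same = pattern[i - 1] == text[j - 1]
            diagonal = D[i - 1][j - 1] + (0 if same else 1)  # MatchCost or ReviseCost
            insert_cost = D[i - 1][j] + 1
            delete_cost = D[i][j - 1] + 1
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    return D


def best_match(pattern: str, text: str) -> Tuple[int, int]:
    """Return (minimum differences, column where the best match ends)."""
    last_row = build_table(pattern, text)[-1]
    cost = min(last_row)
    return cost, last_row.index(cost)


print(best_match("happy", "Have a hsppy day"))  # (1, 12)
```

The smallest value in the last row is the number of differences in the closest match, and its column is where that match ends in the text.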
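Initial table: pattern "happy" (rows) against text "Have a hsppy day" (columns; ␣ marks a space)
         H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1
a     2
p     3
p     4
y     5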
Partially filled table (␣ marks a space in the text)
         H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1  1  1  1  1  1  1
a     2  2  1  2  2  2  1
p     3  3  2  2  3  3
p     4  4  3  3  3  4
y     5  5  4  4  4  4
Next cell: i = 3 (p_i = p), j = 6 (t_j = a)
D[i-1][j-1] = 2, D[i-1][j] = 1, D[i][j-1] = 3
MatchCost = N/A, since p_i ≠ t_j
ReviseCost = D[i-1][j-1] + 1 = 3
InsertCost = D[i-1][j] + 1 = 2
DeleteCost = D[i][j-1] + 1 = 4
Min = 2
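Table filled up to the end of "hsppy" (␣ marks a space in the text); the closest match ends there with 1 difference
         H  a  v  e  ␣  a  ␣  h  s  p  p  y  ␣  d  a  y
NULL  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h     1  1  1  1  1  1  1  1  0  1  1  1  1
a     2  2  1  2  2  2  1  2  1  1  2  2  2
p     3  3  2  2  3  3  2  2  2  2  1  2  3
p     4  4  3  3  3  4  3  3  3  3  2  1  2
y     5  5  4  4  4  4  4  4  4  4  3  2  1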
Changes
D[i][0] = i, if p_i = t_0
D[i][0] = i + 1, if p_i ≠ t_0
D[0][j] = j, if p_0 = t_j
D[0][j] = j + 1, if p_0 ≠ t_j
Additional stop case for mismatch
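A sketch of the changed boundary conditions, assuming the inner cells are still filled with the same minimum-of-four-costs rule. Here row 0 and column 0 stand for the first characters p_0 and t_0 rather than a NULL row and column. The function name `build_modified_table` is hypothetical, both strings are assumed to be non-empty, and the additional stop case is left out because the slides do not specify it.

```python
from typing import List


def build_modified_table(p: str, t: str) -> List[List[int]]:
    """Fill D with the modified first row and column; no NULL row or column."""
    D = [[0] * len(t) for _ in p]
    for i in range(len(p)):
        D[i][0] = i if p[i] == t[0] else i + 1
    for j in range(len(t)):
        D[0][j] = j if p[0] == t[j] else j + 1
    for i in range(1, len(p)):
        for j in range(1, len(t)):
            diagonal = D[i - 1][j - 1] + (0 if p[i] == t[j] else 1)  # match or revise
            D[i][j] = min(diagonal, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


# Sequences taken from the deck's DNA example: rows are p, columns are t.
table = build_modified_table("TACGAAGGGA", "TACGGACGGT")
for base, row in zip("TACGAAGGGA", table):
    print(base, row)
```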
Table with the changed boundary conditions (partially filled)
    T  A  C  G  G  A  C  G  G  T
T   0  2  3  4  5  6  7  8  9  9
A   2  0  1  2  3  4  5
C   3  1  0  1  2  3  4
G   4  2  1  0  1  2  3
A   5  3  2  1  1  1  2
A   6  4  3  2  2  1  2
G   7  5  4  3  2  2  2
G   8  6  5  4  3  3  3
G   9  7  6  5  4  4  4
A  10  8  7  6  5  4  5
Discussion
Grouping Algorithm
Scale of the problem
400–800 bases per sequence
Tens of thousands of sequences
Assumptions:
Sequences are globally aligned
Sequences begin at the same place
Example Grouping
Seq[336] HK2QS7R01AXRJ6 Seq[218] Seq[38] Seq[235] Seq[89] …
Seq[382] HK2QS7R01BR4Q9 Seq[173]
Seq[180] HK2QS7R01ABFDP Seq[339] Seq[289] Seq[491] Seq[319] …
Seq[269] HK2QS7R01AZHD7 Seq[402] Seq[112] Seq[203] Seq[137] …
Seq[210] HK2QS7R01BMNQ4 Seq[364]
Seq[270] HK2QS7R01AZFOG Seq[388] Seq[441]
Seq[442] HK2QS7R01ADASO Seq[426] Seq[233] Seq[374] Seq[416] …
… …
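The slides do not describe how the groups are formed. One plausible approach, shown here purely as an assumption, is greedy grouping: each sequence joins the first group whose first member is within a difference threshold, and otherwise starts a new group. The names `edit_distance` and `group_sequences`, the threshold `max_diff`, and the sequence identifiers are all hypothetical. The distance used is a standard global edit distance, not necessarily the one used in the presented tool.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Standard global edit distance, keeping only two rows in memory."""
    previous = list(range(len(b) + 1))
    for i, base_a in enumerate(a, start=1):
        current = [i]
        for j, base_b in enumerate(b, start=1):
            current.append(min(
                previous[j - 1] + (0 if base_a == base_b else 1),  # match or replace
                previous[j] + 1,
                current[j - 1] + 1,
            ))
        previous = current
    return previous[-1]


def group_sequences(seqs: Dict[str, str], max_diff: int) -> List[List[str]]:
    """Greedily assign each sequence to the first group it is close enough to."""
    groups: List[List[str]] = []
    for name, seq in seqs.items():
        for group in groups:
            representative = seqs[group[0]]  # first member stands for the group
            if edit_distance(representative, seq) <= max_diff:
                group.append(name)
                break
        else:
            groups.append([name])  # no close group found: start a new one
    return groups


example = {
    "seqA": "TGAGCACCT",
    "seqB": "TGATCACCT",
    "seqC": "ACGCCTGAT",
    "seqD": "TGACGCACCT",
}
print(group_sequences(example, max_diff=1))
```

In the worst case every sequence is compared with every other one, which is in line with the quadratic growth in comparisons reported in the results.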
Results
O(n^2), where n is the number of sequences.
~1600 comparisons per second.
10,000 sequences: ~8.6 hours (down from 10 days).
Comparisons for n sequences = n(n-1)/2
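The figures above can be checked with a short calculation. This sketch assumes the rate of 1600 comparisons per second stays constant; the function names are illustrative.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def estimated_hours(n: int, comparisons_per_second: float = 1600.0) -> float:
    """Running time in hours at a constant comparison rate."""
    return pairwise_comparisons(n) / comparisons_per_second / 3600.0


n = 10_000
print(pairwise_comparisons(n))       # 49995000
print(round(estimated_hours(n), 2))  # 8.68
```

About 50 million comparisons at 1600 per second come to roughly 8.7 hours, in line with the ~8.6 hours reported.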