Species Identification through DNA String Analysis

Post on 01-Jan-2016



Mark Vorster
Supervisor: Prof Philip Machanick

Research Overview

Goal
Aid bioinformaticians in research by providing a tool that can identify similar DNA sequences, in a timely manner, in order to infer homogeneity.

Reasons for the problem
Large data sets
Days of processing
No existing specific tools

Research Overview - Bioinformatics - String Matching - Discussion - Questions

Bioinformatics

"Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualise such data."

Biomedical Information Science and Technology Initiatives Definition Committee - Dr Huerta

"The branch of science concerned with information and information flow in biological systems, esp. the use of computational methods in genetics and genomics."

Oxford English Dictionary


History of Bioinformatics and Genetics

1953 - Watson, Crick, Wilkins and Franklin

Discrete abstraction
Adenine – Thymine
Guanine – Cytosine

[Figure: DNA double helix showing the sugar-phosphate backbone, bases and hydrogen bonds; one helical turn = 3.4 nm. Source: http://www.accessexcellence.org/RC/VL/GG/images/structure.gif]
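Because each base pairs with exactly one partner, a DNA strand can be handled as a plain string over the letters A, C, G and T. The following is a minimal Python sketch of that abstraction; the name `complement` is an illustrative assumption and does not come from the slides.

```python
# Watson-Crick pairing as a lookup table: A pairs with T, G pairs with C.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand (hypothetical helper)."""
    return strand.upper().translate(PAIRING)


print(complement("TGAGCACCT"))  # ACTCGTGGA
```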


Sequence Analysis and Sequence Alignment

Sequence Alignment
Global alignment is expensive
Assumption: sequences are already globally aligned

Alignment Differences
Original:    TGAGCACCT
Insertion:   TGACGCACCT
Deletion:    TGA_CACCT
Replacement: TGATCACCT

Phylogenetic inference
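The three kinds of difference can be illustrated with Python's standard `difflib` module. This is only a sketch: `SequenceMatcher` uses its own matching heuristic, not the dynamic-programming method presented in these slides, and the helper name `alignment_differences` is an assumption made for this example.

```python
from difflib import SequenceMatcher


def alignment_differences(reference: str, other: str):
    """List the non-matching edit operations that turn reference into other."""
    matcher = SequenceMatcher(None, reference, other)
    return [
        (tag, reference[i1:i2], other[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


reference = "TGAGCACCT"
variants = [
    ("insertion", "TGACGCACCT"),
    ("deletion", "TGACACCT"),  # the slide marks the deleted base with "_"
    ("replacement", "TGATCACCT"),
]
for label, variant in variants:
    print(label, alignment_differences(reference, variant))
```

Each variant differs from the reference by a single operation at the fourth base, which is the kind of difference the string matching algorithm has to count.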


FASTA File Format

Leading '>'
Sequence identifier
Description or comment
A number of lines of genetic code
Other symbols

>SequenceName description or comment
CCGGAATACCTAGGACGCCTTCATCCCCCGCCGGTCTGTGATGTCCCAATGGACCGGA
>NextSequence description or comment
ACGCCTGATTACCTGCTAGTCGGGATGATAACCAAGAATTTGTGTCTG

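A reader for this layout can be sketched in a few lines of Python. The function name `parse_fasta` and the shortened sample sequences are assumptions made for illustration; the sketch treats the first word after '>' as the identifier and the rest of the header line as the description, and it does not interpret any of the other symbols a FASTA file may contain.

```python
from typing import Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier = None
    description = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            header = line[1:].split(None, 1)
            identifier = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# Illustrative input: shortened sequences, split over two lines for the first record.
example = """\
>SequenceName description or comment
CCGGAATACCTAGGACGCC
TTCATCCCCCGCCGGTCTG
>NextSequence description or comment
ACGCCTGATTACCTGCTAG
"""
for identifier, description, sequence in parse_fasta(example.splitlines()):
    print(identifier, "|", description, "|", sequence)
```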

Approximate String Matching Algorithm

Nested loops are inefficient
Dynamic programming
Takes all previous information into account
Improved to O(n^2), where n is the number of bases in the shorter sequence

Goal: find the closest match between two strings, or the minimum number of differences


Approximate String Matching Algorithm

D[i][j] is the minimum of:
MatchCost = D[i-1][j-1], if pi = tj
ReviseCost = D[i-1][j-1] + 1, if pi ≠ tj
InsertCost = D[i-1][j] + 1
DeleteCost = D[i][j-1] + 1

Boundary conditions: D[0][j] = 0 and D[i][0] = i

Cells used to compute D[i][j]:
D[i-1][j-1]   D[i-1][j]
D[i][j-1]     D[i][j]
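The recurrence and its boundary conditions translate almost line for line into code. The Python sketch below assumes that p is the pattern (rows, index i) and t is the text being searched (columns, index j); the function name `approx_match_matrix` and the sample strings are illustrative and not taken from the slides.

```python
from typing import List


def approx_match_matrix(pattern: str, text: str) -> List[List[int]]:
    """Fill the table D, with row 0 and column 0 acting as the NULL row/column."""
    m, n = len(pattern), len(text)
    D = [[0] * (n + 1) for _ in range(m + 1)]  # D[0][j] = 0
    for i in range(1, m + 1):
        D[i][0] = i  # D[i][0] = i
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = pattern[i - 1] == text[j - 1]
            # MatchCost when the characters agree, ReviseCost otherwise.
            diagonal = D[i - 1][j - 1] + (0 if same else 1)
            insert_cost = D[i - 1][j] + 1
            delete_cost = D[i][j - 1] + 1
            D[i][j] = min(diagonal, insert_cost, delete_cost)
    return D


D = approx_match_matrix("GAGC", "TGATCACCT")
for row in D:
    print(row)
print(min(D[-1]))  # 1: "GATC" in the text differs from "GAGC" by one base
```

Because row 0 is all zeros, a match may start anywhere in the text, and the smallest value in the last row is the minimum number of differences.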


Approximate String Matching Algorithm

Text: "Have a hsppy day" (spaces shown as _), pattern: "happy"

          H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h      1
a      2
p      3
p      4
y      5

Approximate String Matching Algorithm

          H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h      1  1  1  1  1  1  1
a      2  2  1  2  2  2  1
p      3  3  2  2  3  3
p      4  4  3  3  3  4
y      5  5  4  4  4  4

Computing the next cell, D[3][6] (pi = p, tj = a):
MatchCost = N/A (pi ≠ tj)
ReviseCost = D[2][5] + 1 = 3
InsertCost = D[2][6] + 1 = 2
DeleteCost = D[3][5] + 1 = 4
Minimum = 2

Approximate String Matching Algorithm

          H  a  v  e  _  a  _  h  s  p  p  y  _  d  a  y
NULL   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
h      1  1  1  1  1  1  1  1  0  1  1  1  1
a      2  2  1  2  2  2  1  2  1  1  2  2  2
p      3  3  2  2  3  3  2  2  2  2  1  2  3
p      4  4  3  3  3  4  3  3  3  3  2  1  2
y      5  5  4  4  4  4  4  4  4  4  3  2  1
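The answer is read from the last row of the table: its smallest entry is the minimum number of differences, and its column marks where the closest match ends. Since each row depends only on the row above it, the whole table need not be stored. The sketch below is an assumption-laden illustration (the name `best_match` and the choice of the first minimal column are not from the slides).

```python
from typing import Tuple


def best_match(pattern: str, text: str) -> Tuple[int, int]:
    """Return (differences, end) for the closest match of pattern in text.

    end is the index just past the last matched character of text.
    Only two rows of the table are held in memory at any time.
    """
    previous = [0] * (len(text) + 1)  # row 0: D[0][j] = 0
    for i, p in enumerate(pattern, start=1):
        current = [i]  # column 0: D[i][0] = i
        for j, t in enumerate(text, start=1):
            diagonal = previous[j - 1] + (0 if p == t else 1)
            current.append(min(diagonal, previous[j] + 1, current[j - 1] + 1))
        previous = current
    differences = min(previous)
    return differences, previous.index(differences)


text = "Have a hsppy day"
differences, end = best_match("happy", text)
print(differences, end)  # 1 12
print(text[end - 5:end])  # hsppy
```

The single difference is the "s" that replaces the "a" of "happy".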

Approximate String Matching Algorithm

Changes
D[i][0] = i, if pi = t0
D[i][0] = i + 1, if pi ≠ t0
D[0][j] = j, if p0 = tj
D[0][j] = j + 1, if p0 ≠ tj
Additional stop case for mismatch


Approximate String Matching Algorithm

    T   A   C   G   G   A   C   G   G   T
T   0   2   3   4   5   6   7   8   9   9
A   2   0   1   2   3   4   5
C   3   1   0   1   2   3   4
G   4   2   1   0   1   2   3
A   5   3   2   1   1   1   2
A   6   4   3   2   2   1   2
G   7   5   4   3   2   2   2
G   8   6   5   4   3   3   3
G   9   7   6   5   4   4   4
A  10   8   7   6   5   4   5
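The changed boundary conditions drop the NULL row and column, so row 0 and column 0 now correspond to the first characters of the two sequences and both sequences are compared from their starting positions. The Python sketch below follows the stated boundary rules and the same minimum-of-three recurrence for the interior cells. It fills the whole table and leaves out the additional stop case for mismatch, whose details the slides do not give; the function name `aligned_distance_matrix` is an assumption.

```python
from typing import List


def aligned_distance_matrix(p: str, t: str) -> List[List[int]]:
    """Fill D for two non-empty sequences using the changed boundary conditions."""
    rows, cols = len(p), len(t)
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i if p[i] == t[0] else i + 1  # first column
    for j in range(cols):
        D[0][j] = j if p[0] == t[j] else j + 1  # first row
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = D[i - 1][j - 1] + (0 if p[i] == t[j] else 1)
            D[i][j] = min(diagonal, D[i - 1][j] + 1, D[i][j - 1] + 1)
    return D


# Rows follow the sequence written down the side of the table,
# columns the sequence written across the top.
D = aligned_distance_matrix("TACGAAGGGA", "TACGGACGGT")
for row in D:
    print(row[:7])  # the seven columns shown in the table
```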


Discussion


Grouping Algorithm

Scale of the problem
400 – 800 bases per sequence
Tens of thousands of sequences

Assumptions
Sequences are globally aligned
Sequences begin at the same place

Example Grouping

Seq[336] HK2QS7R01AXRJ6 Seq[218] Seq[38] Seq[235] Seq[89] …

Seq[382] HK2QS7R01BR4Q9 Seq[173]

Seq[180] HK2QS7R01ABFDP Seq[339] Seq[289] Seq[491] Seq[319] …

Seq[269] HK2QS7R01AZHD7 Seq[402] Seq[112] Seq[203] Seq[137] …

Seq[210] HK2QS7R01BMNQ4 Seq[364]

Seq[270] HK2QS7R01AZFOG Seq[388] Seq[441]

Seq[442] HK2QS7R01ADASO Seq[426] Seq[233] Seq[374] Seq[416] …

… …
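The slides do not spell out the grouping rule, so the following Python sketch shows only one possible scheme: each sequence joins the first group whose leading sequence is within a chosen number of differences, and otherwise starts a new group. The position-by-position `mismatches` count relies on the stated assumptions (globally aligned sequences that begin at the same place) and stands in for the approximate string matching distance; the threshold `max_diff`, the function names and the sample sequences are all hypothetical.

```python
from typing import List, Sequence


def mismatches(a: str, b: str) -> int:
    """Count differing positions, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def group_sequences(seqs: Sequence[str], max_diff: int) -> List[List[int]]:
    """Greedy grouping; each group is a list of indices, leader first."""
    groups: List[List[int]] = []
    for index, seq in enumerate(seqs):
        for group in groups:
            leader = seqs[group[0]]
            if mismatches(leader, seq) <= max_diff:
                group.append(index)
                break
        else:
            groups.append([index])  # no close leader found: start a new group
    return groups


sample = ["TGAGCACCT", "TGATCACCT", "ACGTTTGCA", "TGAGCACCA", "ACGTTTGCC"]
print(group_sequences(sample, max_diff=1))  # [[0, 1, 3], [2, 4]]
```

The output has the same shape as the example grouping: a leading sequence followed by the sequences assigned to it.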


Results


O(n^2), where n is the number of sequences.

~1600 comparisons per second.

10,000 sequences: ~8.6 hours (down from 10 days).

Comparisons for n sequences = n(n-1)/2
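These figures are consistent with one another, as a quick calculation shows. The snippet below simply evaluates the formula with the quoted rate; the function names are illustrative.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def estimated_hours(n: int, comparisons_per_second: float = 1600.0) -> float:
    """Running time in hours at the given comparison rate."""
    return pairwise_comparisons(n) / comparisons_per_second / 3600


print(pairwise_comparisons(10_000))  # 49995000
print(f"{estimated_hours(10_000):.2f}")  # 8.68
```

Roughly 50 million comparisons at about 1600 per second take a little under 8.7 hours, in line with the quoted figure of about 8.6 hours.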
