SEQUENCE ANALYSIS By Jyotika Bhati

Post on 22-Dec-2015
Page 1: SEQUENCE ANALYSIS By Jyotika Bhati. The design, construction and use of software tools to generate, store, annotate, access and analyse data and information

SEQUENCE ANALYSIS

By Jyotika Bhati

Bioinformatics

The design, construction and use of software tools to generate, store, annotate, access and analyse data and information relating to molecular biology

OR

Biologists doing “stuff” with computers?

What is a Sequence?

• A sequence is an ordered list of objects (or events).

• A biological sequence is a single, continuous molecule of nucleic acid or protein.

• Sequence analysis in bioinformatics is an automated, computer-based examination of characteristic fragments, e.g. of a DNA strand.

• The term "sequence analysis" in biology implies subjecting a DNA or peptide sequence to sequence alignment, sequence database searches, repeated-sequence searches, or other bioinformatics methods on a computer.

Nucleotide Sequence Databases

GenBank at NCBI (National Center for Biotechnology Information)

EMBL nucleotide sequence database at EMBL (European Molecular Biology Laboratory)

DDBJ (DNA DataBank of Japan)

Protein Sequence Databases

SWISS-PROT

TrEMBL

Sequence Alignment

• The identification of residue-residue correspondences

• The basic tool in bioinformatics

WHY Sequence Alignment ?

• For discovering functional, structural and evolutionary information in biological sequences

• Eases further tasks like:
– Annotation of new sequences
– Modeling of protein structures
– Design and analysis of gene expression experiments

Basic Steps in Sequence Alignment

• Comparison of sequences to find similarity and dissimilarity in compared sequences

• Identification of gene-structures, reading frames, distributions of introns and exons and regulatory elements

• Finding and comparing point mutations to identify genetic markers

• Revealing the evolutionary and genetic diversity

• Functional annotation of genes

The Concept

• An alignment is a mutual arrangement of two sequences

• Exhibits where two sequences are similar, and where they differ

• An ‘optimal’ alignment has the most correspondences and the fewest differences

• Sequences that are similar often have the same or a similar function

Sequence alignment involves the identification of the correct location of deletions and insertions that have occurred in either of the two lineages since the divergence from a common ancestor.

Terms of sequence comparison

Sequence identity

• Exactly the same nucleotide or amino acid at the same position

Sequence similarity

•Substitutions with similar chemical properties

Sequence homology

•General term that indicates evolutionary relatedness among sequences

•Sequences are homologous if they are derived from a common ancestral sequence.
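The difference between identity and similarity can be illustrated with a small Python sketch. The grouping of amino acids into chemical classes below is a simplified assumption made for illustration (it is not taken from the slides), and `percent_identity` and `percent_similarity` are hypothetical helper names.

```python
# Simplified chemical classes of amino acids (an illustrative assumption).
GROUPS = [set("AVLIMC"), set("FWY"), set("ST"), set("NQ"),
          set("DE"), set("KRH"), set("G"), set("P")]

def _check(a, b):
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")

def percent_identity(a, b):
    """Percentage of aligned columns holding exactly the same residue."""
    _check(a, b)
    same = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * same / len(a)

def percent_similarity(a, b):
    """Percentage of columns that are identical or share a chemical class."""
    _check(a, b)
    def similar(x, y):
        if "-" in (x, y):
            return False
        return x == y or any(x in g and y in g for g in GROUPS)
    return 100.0 * sum(1 for x, y in zip(a, b) if similar(x, y)) / len(a)

# K/R and L/I differ but fall in the same class, so similarity exceeds identity.
print(percent_identity("MKLV", "MRIV"), percent_similarity("MKLV", "MRIV"))
```

Neither number says anything about homology by itself; both are only measurements on an alignment.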

Homology

• Homology designates a qualitative relationship of common descent between entities

• Two genes are either homologs or not!
– It doesn’t make sense to say “two genes are 43% homologous”
– It doesn’t make sense to say “John is 43% diabetic”

Two genes are orthologs if they originated from a single ancestral gene in the most recent common ancestor of their respective genomes

Two genes are paralogs if they are related by gene duplication

Things to consider

• To find the best alignment one needs to examine all possible alignments

• To reflect the quality of the possible alignments one needs to score them

• There can be different alignments with the same highest score

• Variations in the scoring scheme may change the ranking of alignments

Manual alignment

• When there are few gaps and the two sequences are not too different from each other, a reasonable alignment can be obtained by visual inspection.

• Advantages:
(1) use of a powerful and trainable tool (the brain, well… some brains)
(2) ability to integrate additional data

• Disadvantage: the method is subjective and does not scale.

Types of Alignment

- Pairwise Alignment
• Dot Matrix Method
• Dynamic Programming
• Word Method

- Multiple Alignment
• Dynamic Programming
• Progressive Methods
• Iterative Methods
• Motif Finding

Pairwise Sequence Alignment

• One pair of elements at a time

• Challenge: find the optimum alignment of 2 sequences with some degree of similarity

• Optimality is based on SCORE

• The score reflects the number of paired characters in the 2 sequences and the number and length of gaps introduced to adjust the sequences so that the maximum number of characters are in alignment

A pairwise alignment consists of a series of paired bases, one base from each sequence. There are three types of pairs:

(1) matches = the same nucleotide appears in both sequences.
(2) mismatches = different nucleotides are found in the two sequences.
(3) gaps = a base in one sequence and a null base in the other.

GCGGCCCATCAGGTACTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG

[Figure labels: Match, Gap, Mismatch]
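The three kinds of pairs can be counted column by column. The following Python sketch assumes that the two rows of the slide's example, read with the spacing removed, are `GCGGCCCATCAGGTACTTGGTG-G` and `GCGTTCCATC--CTGGTTGGTGTG`; `classify_columns` is a hypothetical helper name.

```python
from collections import Counter

def classify_columns(row1, row2, gap="-"):
    """Count matches, mismatches and gaps in a pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = Counter(match=0, mismatch=0, gap=0)
    for a, b in zip(row1, row2):
        if a == gap or b == gap:
            counts["gap"] += 1        # a base opposite a null base
        elif a == b:
            counts["match"] += 1      # same nucleotide in both rows
        else:
            counts["mismatch"] += 1   # different nucleotides
    return counts

# Rows as read from the slide (the exact column layout is an assumption).
print(classify_columns("GCGGCCCATCAGGTACTTGGTG-G",
                       "GCGTTCCATC--CTGGTTGGTGTG"))
```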

Dot Matrix Method

• Introduced in 1970 by A.J. Gibbs and G.A. McIntyre

• Method for comparing two nucleotide or amino acid sequences

• Each sequence forms one axis of the grid

• A dot is placed at every intersection where the same letter appears in both sequences

• Scanning the graph for a run of dots along a diagonal reveals similarity, i.e. a string of identical characters

• Longer sequences can also be compared on a single page by using smaller dots
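The procedure described above can be sketched in a few lines of Python. This is a minimal illustration with no windowing or filtering; `dot_matrix` and `render` are hypothetical names.

```python
def dot_matrix(seq_x, seq_y):
    """Return a grid with True wherever the two sequences share a letter.

    Rows follow seq_y, columns follow seq_x.
    """
    return [[a == b for a in seq_x] for b in seq_y]

def render(seq_x, seq_y):
    """Draw the grid as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq_x)]
    for b, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(b + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

# A run of dots along a diagonal marks a stretch of identical characters.
print(render("ATCG", "TCG"))
```

Real dot-plot tools usually score a sliding window rather than single letters, which suppresses the random dots that dominate nucleotide comparisons.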

Dot Matrix Method

• the dot matrix method reveals the presence of insertions or deletions

• comparing a single sequence to itself can reveal the presence of a repeat of a subsequence

• self comparison can reveal several features:

– similarity between chromosomes

– tandem genes

– repeated domains in a protein sequence

– regions of low sequence complexity (same characters are often repeated)

Tools generating Dot Matrices

• Dotlet (Java based web-application)

http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

• Compare & dotplot programmes in GCG Wisconsin Package (Genetics Computer Group [commercial])

• GeneAssist package of ABI/Perkin Elmer

• DOTTER (available on dapsas, UNIX X-Windows)

• DNA Strider (Macintosh only)

Dot Matrix Methods

• When to use:

– as a first look, unless the sequences are already known to be very much alike

• Demerits

– doesn’t readily resolve similarity that is interrupted by insertions or deletions

– difficult to find the best possible alignment (optimal alignment)

– most computer programs don’t show an actual alignment

Pairwise alignment: the problem

The number of possible pairwise alignments increases explosively with the length of the sequences: two protein sequences of length 100 amino acids can be aligned in approximately 10^60 different ways.

The time needed to test all possibilities is of the same order of magnitude as the entire lifetime of the universe.
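The growth can be checked numerically. The exact count depends on which alignments are treated as distinct, and the slides do not say which convention is meant, so the sketch below shows two common ones as an assumption: the central binomial coefficient C(2n, n), and the number of distinct paths through the alignment grid (diagonal, up and left moves). `count_paths` is a hypothetical name.

```python
from math import comb

def count_paths(n, m):
    """Number of paths through an (n+1) x (m+1) alignment grid.

    Each path is one way of placing matches/mismatches and gaps, using
    the recurrence f(i, j) = f(i-1, j) + f(i, j-1) + f(i-1, j-1).
    """
    prev = [1] * (m + 1)
    for _ in range(n):
        cur = [1]
        for j in range(1, m + 1):
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[m]

# Both counts are astronomically large for two sequences of length 100.
print(f"C(200, 100)         = {comb(200, 100):.2e}")
print(f"grid paths, 100x100 = {count_paths(100, 100):.2e}")
```

Either way, exhaustive enumeration is out of the question, which is why dynamic programming is used instead.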

Global versus local alignments

Global alignment: align full length of both sequences. (The “Needleman-Wunsch” algorithm).

Local alignment: find best partial alignment of two sequences (the “Smith-Waterman” algorithm).

[Figure: Seq 1 and Seq 2 shown under a global alignment and under a local alignment]

Global Sequence Alignment

• The Needleman–Wunsch algorithm performs a global alignment

• An example of dynamic programming

• First application of dynamic programming to biological sequence comparison

• Suitable when the two sequences are of similar length, with a significant degree of similarity throughout

• Aim: The best alignment over the entire length of two sequences

Steps in NW Algorithm

• Initialization

• Scoring

• Trace back (Alignment)

Consider the two DNA sequences to be globally aligned:

ATCG (x=4, length of sequence 1)

TCG (y=3, length of sequence 2)

Why Gap Penalties?

• The optimal alignment of two similar sequences is usually the one that
– maximizes the number of matches and
– minimizes the number of gaps.

• There is a tradeoff between these two: adding gaps reduces mismatches.

• Permitting the insertion of arbitrarily many gaps can lead to high-scoring alignments of non-homologous sequences.

• Penalizing gaps forces alignments to have relatively few gaps.

Initialization Step

• Create a matrix with X+1 rows and Y+1 columns

• The 1st row and the 1st column of the score matrix are filled with multiples of the gap penalty (here g = -1)

        T   C   G
    0  -1  -2  -3
A  -1
T  -2
C  -3
G  -4

Scoring

• The score of any cell C(i, j) is the maximum of:

score_diag = C(i-1, j-1) + S(i, j)

score_up = C(i-1, j) + g

score_left = C(i, j-1) + g

where S(i, j) is the substitution score for the letters at row i and column j, and g is the gap penalty

Scoring ….

• Example:

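The calculation for the cell C(2, 2), counting the gap row and gap column as row 1 and column 1 (that is, the cell for A against T). A and T differ, so S(i, j) = -1, and the gap penalty is g = -1:

score_diag = C(i-1, j-1) + S(i, j) = 0 + -1 = -1

score_up = C(i-1, j) + g = -1 + -1 = -2

score_left = C(i, j-1) + g = -1 + -1 = -2

The maximum of the three is -1, so C(2, 2) = -1.

        T   C   G
    0  -1  -2  -3
A  -1  -1
T  -2
C  -3
G  -4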

Scoring ….

• Final Scoring Matrix

Note: the last (bottom-right) cell always holds the score of the optimal global alignment, here 2

        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2
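The initialization and scoring steps can be sketched in Python as follows. The scoring values (match +1, mismatch -1, gap -1) are not stated on the slides; they are inferred here from the numbers in the example matrix, and `nw_fill` is a hypothetical name.

```python
def nw_fill(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch score matrix (rows: seq1, columns: seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    C = [[0] * cols for _ in range(rows)]
    # Initialization: first row and column are multiples of the gap penalty.
    for i in range(1, rows):
        C[i][0] = i * gap
    for j in range(1, cols):
        C[0][j] = j * gap
    # Scoring: each cell is the best of the diagonal, up and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s,   # score_diag
                          C[i - 1][j] + gap,     # score_up
                          C[i][j - 1] + gap)     # score_left
    return C

for row in nw_fill("ATCG", "TCG"):
    print(row)
```

With these assumed scores the function reproduces the final scoring matrix of the worked example, including the global score of 2 in the last cell.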

Trace back

• The trace back step determines the actual alignment(s) that result in the maximum score

• There are likely to be multiple maximal alignments

• Trace back starts from the last cell, i.e. position X, Y in the matrix

• Gives alignment in reverse order

Trace back ….

• There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left

• Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors. With sequence #1 along the rows and sequence #2 along the columns, as in the example matrix, these are the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #2). The trace back algorithm chooses one of the possible predecessors as the next cell in the path.

Trace back ….

• Starting from the last cell (score 2), the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of

Seq 1: G
       |
Seq 2: G

        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2

Trace back ….

• Final Trace back

Best Alignment (the leading A is aligned with a gap, so it has no match bar):

A T C G
  | | |
- T C G

        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2
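The complete procedure, including trace back, can be sketched as below. As an assumption, the scores (match +1, mismatch -1, gap -1) are inferred from the example matrix rather than stated on the slides; when several predecessors are possible this sketch simply prefers the diagonal, then up, then left. `needleman_wunsch` is a hypothetical name.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (aligned seq1, aligned seq2, score)."""
    n, m = len(seq1), len(seq2)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = i * gap
    for j in range(1, m + 1):
        C[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)

    # Trace back from the last cell; the alignment is built in reverse.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + s:       # diagonal: match/mismatch
                out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and C[i][j] == C[i - 1][j] + gap:   # up: gap in seq2
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
            continue
        out1.append("-"); out2.append(seq2[j - 1])   # left: gap in seq1
        j -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), C[n][m]

print(needleman_wunsch("ATCG", "TCG"))
```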

Local Sequence Alignment

• The Smith-Waterman algorithm performs a local alignment on two sequences

• It is an example of dynamic programming

• Useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context

• Aim: The best alignment over the conserved domain of two sequences

Differences in Needleman-Wunsch and Smith-Waterman Algorithms

• In the initialization stage, the first row and first column are all filled in with 0s

• While filling the matrix, if a score becomes negative, put in 0 instead

• In the traceback, start with the cell that has the highest score and work back until a cell with a score of 0 is reached.

Three steps in Smith-Waterman Algorithm

• Initialization

• Scoring

• Trace back (Alignment)

Consider the two DNA sequences to be locally aligned:

ATCG (x=4, length of sequence 1)

TCG (y=3, length of sequence 2)

Initialization Step

• Create a matrix with X+1 rows and Y+1 columns

• The 1st row and the 1st column of the score matrix are filled with 0s

        T   C   G
    0   0   0   0
A   0
T   0
C   0
G   0

Scoring

• The score of any cell C(i, j) is the maximum of:

score_diag = C(i-1, j-1) + S(i, j)

score_up = C(i-1, j) + g

score_left = C(i, j-1) + g

and

0

(here S(i, j) is the substitution score for the letters at row i and column j, and g is the gap penalty)

Scoring ….

• Example:

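The calculation for the cell C(2, 2), counting the gap row and gap column as row 1 and column 1 (that is, the cell for A against T). A and T differ, so S(i, j) = -1, and the gap penalty is g = -1:

score_diag = C(i-1, j-1) + S(i, j) = 0 + -1 = -1

score_up = C(i-1, j) + g = 0 + -1 = -1

score_left = C(i, j-1) + g = 0 + -1 = -1

All three are negative, so the cell takes the fourth option: C(2, 2) = 0.

        T   C   G
    0   0   0   0
A   0   0
T   0
C   0
G   0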

Scoring ….

• Final Scoring Matrix

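Note: the last cell does not necessarily hold the maximum alignment score; the maximum can occur anywhere in the matrix. (In this small example it happens to be the last cell, with score 3.)

        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3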
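The Smith-Waterman fill differs from the global case only in the zero initialization and the extra 0 in the maximum. A minimal Python sketch, again assuming match +1, mismatch -1 and gap -1 (values inferred from the example matrix, not stated on the slides); `sw_fill` is a hypothetical name.

```python
def sw_fill(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman score matrix (rows: seq1, columns: seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    H = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(0,                     # never go negative
                          H[i - 1][j - 1] + s,   # score_diag
                          H[i - 1][j] + gap,     # score_up
                          H[i][j - 1] + gap)     # score_left
    return H

H = sw_fill("ATCG", "TCG")
for row in H:
    print(row)
print("best local score:", max(max(row) for row in H))
```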

Trace back

• The trace back step determines the actual alignment(s) that result in the maximum score

• There are likely to be multiple maximal alignments

• Trace back starts from the cell with maximum value in the matrix

• Gives alignment in reverse order

Trace back ….

• There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left

• Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors. With sequence #1 along the rows and sequence #2 along the columns, as in the example matrix, these are the neighbor to the left (gap in sequence #1), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #2). The trace back algorithm chooses one of the possible predecessors as the next cell in the path. This continues until a cell with value 0 is reached.

Trace back ….

• Starting from the highest-scoring cell (score 3), the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of

Seq 1: G
       |
Seq 2: G

        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3

Trace back ….

• Final Trace back

Best Alignment:

T C G
| | |
T C G

        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3
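The local alignment, including trace back from the highest-scoring cell, can be sketched as follows. The scores (match +1, mismatch -1, gap -1) are an assumption inferred from the example matrix; ties are broken by preferring the diagonal, then up, then left. `smith_waterman` is a hypothetical name.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (aligned seq1 part, aligned seq2 part, score)."""
    n, m = len(seq1), len(seq2)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the maximum cell until a cell with score 0 is reached.
    out1, out2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:       # diagonal: match/mismatch
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:       # up: gap in seq2
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:                                    # left: gap in seq1
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), best

print(smith_waterman("ATCG", "TCG"))
```

Unlike the global result, the unmatched leading A of the first sequence is simply left out of the alignment.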

• The true alignment between two sequences is the one that reflects accurately the evolutionary relationships between the sequences.

• Since the true alignment is unknown, in practice we look for the optimal alignment, which is the one in which the numbers of mismatches and gaps are minimized according to certain criteria.

• Unfortunately, reducing the number of mismatches usually results in an increase in the number of gaps, and vice versa.

FASTA

1) Derived from the logic of the dot plot
– compute best diagonals from all frames of alignment

2) Word method looks for exact matches between words in query and test sequence
– hash tables (fast computer technique)
– DNA words are usually 6 bases
– protein words are 1 or 2 amino acids
– only searches for diagonals in the region of word matches = faster searching

FastA searches can be done on the WWW FastA server at EBI:

http://www2.ebi.ac.uk/fasta3/

FASTA Algorithm

Makes Longest Diagonal

3) after all diagonals found, tries to join diagonals by adding gaps

4) computes alignments in regions of best diagonals
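The word-matching and diagonal-finding idea behind FASTA can be sketched in Python. This is only an illustration of the first stage (hashing words and counting hits per diagonal); it does not rescore, join diagonals or compute the final alignment, and `best_diagonals` is a hypothetical name rather than part of the FASTA program.

```python
from collections import Counter, defaultdict

def best_diagonals(query, target, k=6, top=3):
    """Rank diagonals by the number of exact k-letter word matches.

    A diagonal is identified by (target position - query position).
    """
    # Hash table: every word of the query -> positions where it starts.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Look up each target word and tally the diagonal of every hit.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals.most_common(top)

# The target is the query shifted right by two letters, so diagonal 2 wins.
print(best_diagonals("ACGTACGT", "TTACGTACGT", k=4, top=1))
```

Because only exact word hits are examined, most of the dot-plot grid is never visited, which is where the speed comes from.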

FASTA Alignments

FASTA Format

• simple format used by almost all programs

• >header line with a [return] at the end

• Sequence (no specific requirements for line length, characters, etc.)

>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACAACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTTGCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACCCACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTGTGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCAGGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCATCTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGATGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATGGAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTCCATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATCCCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
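Reading this format takes only a few lines of code. The following Python sketch is a minimal parser under the rules listed above (a `>` header line followed by sequence lines of any length); `parse_fasta` is a hypothetical name, and for real work an established library parser would be the safer choice.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = ">seq1 demo\nACGT\nTTGA\n>seq2\nGG\n"
print(parse_fasta(example))
```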

BLAST Searches GenBank

[BLAST = Basic Local Alignment Search Tool]

The NCBI BLAST web server lets you compare your query sequence to various sections of GenBank:

– nr = non-redundant (main sections)
– month = new sequences from the past few weeks
– ESTs
– human, Drosophila, yeast, or E. coli genomes
– proteins (by automatic translation)

• The server runs on a VERY fast and powerful computer.

BLAST

• Uses word matching like FASTA

• Similarity matching of words (3 amino acids, 11 bases)
– does not require identical words

• If no words are similar, then no alignment
– won’t find matches for very short sequences

• Does not handle gaps well

BLAST Algorithm

BLAST Word Matching

Break query into words:

MEAAVKEEISVEDEAVDKNI

MEA EAA AAV AVK VKE KEE EEI EIS ISV ...

Break database sequences into words:
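Breaking a sequence into overlapping words is a one-line operation. A minimal Python sketch, with the word length of 3 taken from the protein example on the slide; `words` is a hypothetical name.

```python
def words(seq, k=3):
    """Return all overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# The query from the slide; the first nine words match the list shown there.
print(" ".join(words("MEAAVKEEISVEDEAVDKNI")[:9]))
```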

Compare Word Lists

Database Sequence Word Lists:

RTT AAQ SDG KSS SRW LLN QEL RWY VKI GKG DKI NIS LFC WDV AAV KVR PFR DEI … …

Query Word List:

MEA EAA AAV AVK VKL KEE EEI EIS ISV

Compare word lists by hashing (allow near matches)

ELEPRRPRYRVPDVLVADPPIARLSVSGRDENSVELTMEAT

TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY

IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH

MEA EAA AAV AVK KLV KEE EEI EIS ISV

Find locations of matching words in database sequences
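Locating word matches by hashing can be sketched as below. This simplified version finds exact word matches only; allowing near matches, as BLAST does, would additionally require scoring candidate words with a substitution matrix, which is left out here. The function names and the toy database are hypothetical.

```python
from collections import defaultdict

def build_word_index(database, k=3):
    """Hash every word of every database sequence to its (name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index

def find_hits(query, index, k=3):
    """Return (query position, sequence name, database position) per hit."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, name, dpos))
    return hits

# Toy database (made up for illustration).
db = {"s1": "GGAAVKK", "s2": "TTT"}
print(find_hits("MEAAVK", build_word_index(db)))
```

Each hit is a seed: a place where the query and a database sequence share a word and where an alignment might be worth extending.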

Extend hits one base at a time

HSPs are Aligned Regions

• The results of the word matching and the attempts to extend the alignment are segments called HSPs (High-scoring Segment Pairs)

• BLAST often produces several short HSPs rather than a single aligned region
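How a word hit grows into an HSP can be sketched with a simple ungapped extension. The scores (match +1, mismatch -1) and the stopping rule (give up once the running score falls `drop` below the best seen so far) are assumptions chosen for illustration, not BLAST's actual parameters; `extend_hit` is a hypothetical name.

```python
def extend_hit(query, subject, qpos, spos, k=3, match=1, mismatch=-1, drop=2):
    """Extend a word hit in both directions without gaps.

    Returns (score, query start, query end) with the end exclusive.
    """
    def extend(qi, si, step):
        score = best = length = best_len = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length   # new best extension
            elif best - score >= drop:
                break                            # fallen too far: stop
            qi += step
            si += step
        return best, best_len

    seed = sum(match if query[qpos + t] == subject[spos + t] else mismatch
               for t in range(k))
    right, right_len = extend(qpos + k, spos + k, 1)
    left, left_len = extend(qpos - 1, spos - 1, -1)
    return seed + left + right, qpos - left_len, qpos + k + right_len

query, subject = "TTMEAAVKEEGG", "CCMEAAVKEECC"
score, start, end = extend_hit(query, subject, 4, 4)   # seed word "AAV"
print(score, query[start:end])
```

Because the extension never inserts gaps, an insertion or deletion ends one segment and a new one has to start beyond it, which is one reason several short HSPs can appear instead of a single aligned region.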

Searching on the web: BLAST at NCBI

Very fast computer dedicated to running BLAST searches

Many databases that are always up to date

Nice simple web interface

But you still need knowledge about BLAST to use it properly

http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST Output: Alignments

>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1)
          Length = 756

 Score = 233 bits (593), Expect = 8e-62
 Identities = 117/131 (89%), Positives = 117/131 (89%)

Query: 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 60
           IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276 IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL 335

Query: 61  GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA 120
           GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336 GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA 395

Query: 121 FLQPLSKPLSS 131
           FLQPLSKPLSS
Sbjct: 396 FLQPLSKPLSS 406

(The run of X characters in the query marks low-complexity sequence that was filtered.)

DNA vs. Protein searches

• DNA is composed of 4 characters: A, G, C, T. On average, about 25% of the residues of any 2 unrelated aligned sequences are therefore expected to be identical by chance.

• Protein sequence is composed of 20 characters (amino acids), so the sensitivity of the comparison is improved. It is accepted that convergence of proteins is rare, meaning that high similarity between 2 proteins almost always indicates homology.

DNA vs. Protein searches

• What should we use to search for similarity, the nucleotide or the protein sequences?

• If we have a nucleotide sequence, should we search the DNA databases only? Or should we translate it to protein and search protein databases? Note that by translating into an amino acid sequence we’ll presumably lose information, since the genetic code is degenerate, meaning that two or more codons can be translated to the same amino acid.

Nucleotide, amino-acid sequences

-GGAGCCATATTAGATAGA-
-GGAGCAATTTTTGATAGA-

Gly Ala Ile Leu Asp Arg
Gly Ala Ile Phe Asp Arg

• 3 different DNA positions but only one different amino acid position: 2 of the nucleotide substitutions are therefore synonymous and one is non-synonymous.

DNA yields more phylogenetic information than proteins. The nucleotide sequences of a pair of homologous genes have a higher information content than the amino acid sequences of the corresponding proteins, because mutations that result in synonymous changes alter the DNA sequence but do not affect the amino acid sequence. (Amino-acid sequences, however, are aligned more efficiently.)
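The counts in this example can be verified with a short Python sketch. The codon table below is deliberately partial: it lists only the codons that occur in the two example sequences (with their standard genetic code assignments), and `compare_coding` is a hypothetical name.

```python
# Partial codon table: only the codons used in the slide's example.
CODONS = {
    "GGA": "Gly", "GCC": "Ala", "GCA": "Ala", "ATA": "Ile", "ATT": "Ile",
    "TTA": "Leu", "TTT": "Phe", "GAT": "Asp", "AGA": "Arg",
}

def compare_coding(dna1, dna2):
    """Return (nucleotide differences, synonymous codons, non-synonymous codons)."""
    nt_diffs = sum(a != b for a, b in zip(dna1, dna2))
    synonymous = non_synonymous = 0
    for i in range(0, len(dna1), 3):
        c1, c2 = dna1[i:i + 3], dna2[i:i + 3]
        if c1 == c2:
            continue
        if CODONS[c1] == CODONS[c2]:
            synonymous += 1        # DNA changed, amino acid did not
        else:
            non_synonymous += 1    # amino acid changed as well
    return nt_diffs, synonymous, non_synonymous

print(compare_coding("GGAGCCATATTAGATAGA", "GGAGCAATTTTTGATAGA"))
```

Counting per codon equals counting per substitution here only because each differing codon carries exactly one changed base.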

DNA vs. Protein searches

• What about very different DNA sequences that code for similar protein sequences? We certainly do not want to miss those.

• Conclusion: We should use proteins for database similarity searches when possible.

DNA vs. Protein searches

• The reasons for this conclusion are:

– When comparing DNA sequences, we get significantly more random matches than we get with proteins.

– The DNA databases are much larger, and grow faster, than protein databases. A bigger database means more random hits!

– For DNA we usually use identity matrices; for protein we use more sensitive matrices like PAM and BLOSUM, which allow for better search results.

– Conservation in evolution: protein sequences change more slowly than the DNA sequences that encode them.

An Overview of BLAST

Input query: amino acid sequence
– blastp: compares against a protein sequence database
– tblastn: compares against a translated nucleotide sequence database

Input query: DNA sequence
– blastn: compares against a nucleotide sequence database
– blastx: compares the translated query against a protein sequence database
– tblastx: compares the translated query against a translated nucleotide sequence database
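The choice of program follows directly from the type of query and the type of database, which can be written as a small lookup table. The labels used as keys and the function name `choose_program` are hypothetical; only the five program names come from the slide.

```python
# (query type, database type) -> BLAST program, as in the overview above.
PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("protein", "translated nucleotide"): "tblastn",
    ("dna", "nucleotide"): "blastn",
    ("dna", "protein"): "blastx",
    ("dna", "translated nucleotide"): "tblastx",
}

def choose_program(query_type, database_type):
    """Return the BLAST program for a query/database combination."""
    try:
        return PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no BLAST program for {query_type} query vs {database_type} database"
        ) from None

print(choose_program("dna", "protein"))
```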

Why use BLAST?

• To discover functional, structural and evolutionary similarities

• Because “similarity” may be an indicator of “homology” and thus provide some insight into function or gene identification.

• Applications include
– identifying orthologs and paralogs
– discovering new genes or proteins
– exploring protein structure and function

Meaningfulness

• Is the alignment correct?

• Can I make it better?

• Which programs are best?

• How do you know if it’s correct?

Is the Alignment Correct?

• What do we mean by correct?
– Mathematically rigorous
– Biologically meaningful
– Operationally useful

Can you make it better?

• Only if you know what you are doing!

• Define better?

• What’s the goal?

• What’s the biology?

Which programs are best?

• No simple answer

• Depends on the particular problem

• Recent objective studies help answer this problem

• Some tools help compare alignments

How do you know it is correct?

• Methods to evaluate the alignment

• Methods to evaluate the program/algorithm

• Structural information

• Biology

Next time we talk about MULTIPLE ALIGNMENT

THANK YOU