8/27/07 BCB 444/544 F07 ISU Dobbs #4 - Sequence Alignment

BCB 444/544

Finish: Lecture 2- Biological Databases

Lecture 4

Sequence Alignment

#4_Aug27

Required Reading (before lecture)

Mon Aug 27 - for Lecture #4: Pairwise Sequence Alignment

• Chp 3 - pp 31-41, Xiong textbook

Wed Aug 29 - for Lecture #5: Dynamic Programming

• Eddy: What is Dynamic Programming?

Thurs Aug 30 - Lab #2: Databases, ISU Resources, & Pairwise Sequence Alignment

Fri Aug 31 - for Lecture #6: Scoring Matrices and Alignment Statistics

• Chp 3 - pp 41-49

HW#2:

Back to: Chp 2- Biological Databases

• Xiong: Chp 2

Introduction to Biological Databases

• What is a Database?

• Types of Databases

• Biological Databases

• Pitfalls of Biological Databases

• Information Retrieval from Biological Databases

What is a Database?

Duh!!

OK: we'll skip that!

Types of Databases

3 Major types of electronic databases:

1. Flat files - simple text files

• no organization to facilitate retrieval

2. Relational - data organized as tables ("relations")

• shared features among tables allow rapid search

3. Object-oriented - data organized as "objects"

• objects associated hierarchically

Biological Databases

Currently - all 3 types, but MANY flat files

What are goals of biological databases?

1. Information retrieval

2. Knowledge discovery

Important issue: Interconnectivity

Types of Biological Databases

1- Primary

• "simple" archives of sequences, structures, images, etc.

• raw data, minimal annotations, not always well curated!

2- Secondary

• enhanced with more complete annotation of sequences, structures, images, etc.

• usually curated!

3- Specialized

• focused on a particular research interest or organism

• usually - not always - highly curated

Examples of Biological Databases

1- Primary

• DNA sequences

• GenBank - US

• European Molecular Biology Lab - EMBL

• DNA Data Bank of Japan - DDBJ

• Structures (Protein, DNA, RNA)

• PDB - Protein Data Bank

• NDB - Nucleic Acid Database

Examples of Biological Databases

2- Secondary

• Protein sequences

• Swiss-Prot, TrEMBL, PIR

• these recently combined into UniProt

3- Specialized

• Species-specific (or "taxonomic" specific)

• FlyBase, WormBase, AceDB, PlantDB

• Molecule-specific, disease-specific

Pitfalls of Biological Databases

• Errors!

• Lack of documentation re: quality or reliability of data

• Limited mechanisms for "data checking" or preventing propagation of errors (esp. annotation errors!!)

• Redundancy

• Inconsistency

• Incompatibility (format, terminology, data types, etc.)

Information Retrieval from Biological Databases

2 most popular retrieval systems:

• ENTREZ - NCBI

• will use a LOT - was introduced in Lab 1

• SRS - Sequence Retrieval System - EBI

• will use less, similar to ENTREZ

Both:

• Provide access to multiple databases

• Allow complex queries

Web Resources: Bioinformatics & Computational Biology

• NCBI - National Center for Biotechnology Information

• ISCB - International Society for Computational Biology

• JCB - Jena Center for Bioinformatics

• Pitt - OBRC Online Bioinformatics Resources Collection

• UBC - Bioinformatics Links Directory

• UWash - BioMolecules

• ISU - Bioinformatics Resources - Andrea Dinkelman

• ISU - YABI = "Yet Another Bioinformatics Index" (from BCB Lab at ISU)

• Wikipedia: Bioinformatics

ISU Resources & Experts

ISU Research Centers & Graduate Training Programs:

• LH Baker Center - Bioinformatics & Biological Statistics

• BCB - Bioinformatics & Computational Biology

• BCB Lab - (Student-Led Consulting & Resources)

• CIAG - Center for Integrated Animal Genomics

• CCILD - Computational Intelligence, Learning & Discovery

• IGERT Training Grant - Computational Molecular Biology

ISU Facilities:

• Biotechnology - Instrumentation Facilities

• PSI - Plant Sciences Institute

• PSI Centers

BEWARE!

SUMMARY: #2 - Biological Databases

Chp 3- Sequence Alignment

SECTION II SEQUENCE ALIGNMENT

Xiong: Chp 3

Pairwise Sequence Alignment

• Evolutionary Basis

• Sequence Homology versus Sequence Similarity

• Sequence Similarity versus Sequence Identity

• Methods

• Scoring Matrices

• Statistical Significance of Sequence Alignment

Adapted from Brown and Caragea, 2007, with some slides from: Altman, Fernandez-Baca, Batzoglou, Craven, Hunter, Page.

Motivation for Sequence Alignment

"Sequence comparison lies at the heart of bioinformatics analysis." Jin Xiong

Sequence comparison is important for drawing functional & evolutionary inferences re: new genes/proteins

Pairwise sequence alignment is fundamental; it is used to:

• Search for common patterns of characters

• Establish pair-wise correspondence between related sequences

Pairwise sequence alignment is the basis for:

• Database searching (e.g., BLAST)

• Multiple sequence alignment (MSA)

Why Align Sequences?

Databases contain many sequences with known functions & many sequences with unknown functions.

Genes (or proteins) with similar sequences may have similar structures and/or functions.

Sequence alignment can provide important clues to the function of a novel gene or protein

Examples of Bioinformatics Tasks that Rely on Sequence Alignment

• Genomic sequencing (> 500 complete genomes sequenced!)

• Assembling multiple sequence reads into contigs, scaffolds

• Aligning sequences with chromosomes

• Finding genes and regulatory regions

• Identifying gene products

• Identifying function of gene products

• Studying the structural organization of genomes

• Comparative genomics

Evolutionary Basis

• DNA, RNA and proteins are "molecular fossils"

• they encode the history of millions of years of evolution

• During evolution, molecular sequences accumulate random changes (mutations/variants)

• some of which provide a selective advantage or disadvantage, and some of which are neutral

• Sequences that are structurally and/or functionally important tend to be conserved

• (e.g., chromosomal telomeric sequences; enzyme active sites)

• Significant sequence conservation allows inference of evolutionary relatedness

Homology

Homology has a very specific meaning in evolutionary & computational biology - & the term is often used incorrectly

For us:

Homology = similarity due to descent from a common evolutionary ancestor

But, HOMOLOGY ≠ SIMILARITY

When 2 sequences share a sufficiently high degree of sequence similarity (or identity), we may infer that they are homologous

We can infer homology from similarity (can't prove it!)

Orthologs vs Paralogs

2 types of homologous sequences:

• Orthologs - "same genes" in different species; result of common ancestry; corresponding proteins have "same" functions (e.g., human α-globin & mouse α-globin)

• Paralogs - "similar genes" within a species; result of gene duplication events; corresponding proteins may (or may not) have similar functions (e.g., human α-globin & human β-globin)

A is the parent gene

Speciation leads to B & C

Duplication leads to C'

B and C are orthologous

C and C' are paralogous

[Figure: gene tree in which parent gene A gives rise to B and C by speciation, and C gives rise to C' by duplication]

Sequence Homology vs Similarity

• Homologous sequences - sequences that share a common evolutionary ancestry

• Similar sequences - sequences that have a high percentage of aligned residues with similar physicochemical properties

(e.g., size, hydrophobicity, charge)

IMPORTANT:

• Sequence homology:

• An inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity

• Homology is qualitative

• Sequence similarity:

• The direct result of observation from a sequence alignment

• Similarity is quantitative; can be described using percentages

Sequence Similarity vs Identity

For nucleotide sequences (DNA & RNA), sequence similarity and identity have the "same" meaning:

• Two DNA sequences can share a high degree of sequence identity (or similarity) -- means the same thing

• Drena's opinion: Always use "identity" when making quantitative comparisons re: DNA or RNA sequences (to avoid confusion!)

For protein sequences, sequence similarity and identity have different meanings:

• Identity = % of exact matches between two aligned sequences

• Similarity = % of aligned residues that share similar characteristics (e.g., physicochemical characteristics, structural propensities, evolutionary profiles)

• Drena's opinion: Always use "identity" when making quantitative comparisons re: protein sequences (to avoid confusion!)
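As a rough illustration of "identity = % of exact matches between two aligned sequences", here is a minimal Python sketch. The function name `percent_identity`, the use of "-" as the gap character, and the choice to divide by the full alignment length (gap columns included) are assumptions made for this example; other conventions for the denominator are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, so columns
    containing a gap count against identity.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as a match only if both residues agree and neither is a gap.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


# A short gapped protein pair, used here purely as an example.
print(percent_identity("RKVA-GMA", "RKIAVAMA"))  # 5 of 8 columns match -> 62.5
```

Computing "similarity" in the same way would additionally need a rule for which non-identical residues count as similar, which is why the two numbers can differ for proteins.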

What is Sequence Alignment?

Given 2 sequences of letters, and a scoring scheme for evaluating matching letters, find an optimal pairing of letters in one sequence to letters of other sequence.

Align:

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A SHORT SENTENCE.

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ######SHORT## SENTENCE##############.

OR

1: THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
2: THIS IS A ##SHORT###SENT#EN###CE##############.

Is one of these alignments "optimal"? Which is better?

Goal of Sequence Alignment

Find the best pairing of 2 sequences, such that there is maximum correspondence between residues

• DNA: 4-letter alphabet (+ gap)

TTGACAC
TTTACAC

• Proteins: 20-letter alphabet (+ gap)

RKVA-GMA
RKIAVAMA

Statement of Problem

Given:

• 2 sequences

• Scoring system for evaluating match (or mismatch) of two characters

• Penalty function for gaps in sequences

Find: Optimal pairing of sequences that

• Retains the order of characters

• Introduces gaps where needed

• Maximizes total score
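The problem statement above can be made concrete with a small dynamic-programming sketch in Python. This is only a preview under stated assumptions: the function name `global_alignment_score`, the example scores (match +1, mismatch -1, gap -2), and the linear gap penalty are choices made for illustration, and the function returns just the best total score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best total score over all order-preserving, end-to-end pairings of a and b.

    Assumptions: a fixed match/mismatch score and a linear gap penalty
    (every gap position costs the same).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("ACGA", "AGGA"))  # 3 matches, 1 mismatch -> 2
print(global_alignment_score("ACGA", "AGA"))   # 3 matches, 1 gap -> 1
```

Because each cell only looks at its three neighbours, the order of characters is retained automatically, and gaps appear only where they raise the total score.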

Types of Sequence Variation

• Sequences can diverge from a common ancestor through various types of mutations:

• Substitutions: ACGA → AGGA

• Insertions: ACGA → ACCGA

• Deletions: ACGA → AGA

• Insertions or deletions ("indels") result in gaps in alignments

• Substitutions result in mismatches

• No change → match

Gaps

Indels of various sizes can occur in one sequence relative to the other, e.g., corresponding to a shortening of the polypeptide chain in a protein.

Avoiding Random Alignments with a Scoring Function

• Introducing too many gaps generates nonsense alignments:

s--e-----qu---en--ce
sometimesquipsentice

• Need to distinguish between alignments that occur due to homology and those that occur by chance

• Define a scoring function that accounts for mismatches and gaps

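Scoring Function (F), with example values:

Match: m = +1

Mismatch: s = -1

Gap: d = -2

F = m(#matches) + s(#mismatches) + d(#gaps)

(Here s and d are negative numbers, so every mismatch and every gap lowers the score.)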
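To see how the scoring function penalizes a gap-heavy alignment, here is a minimal Python sketch. The function name `alignment_score` and the "-" gap character are assumptions for this example; the default values m = +1, s = -1, d = -2 are the example values from the slide, treated as signed scores.

```python
def alignment_score(aligned_a, aligned_b, m=1, s=-1, d=-2, gap="-"):
    """F = m(#matches) + s(#mismatches) + d(#gaps) for two aligned rows."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            gaps += 1          # a column with a gap in either row
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return m * matches + s * mismatches + d * gaps


# The "nonsense" alignment: 8 matches, 0 mismatches, 12 gaps.
print(alignment_score("s--e-----qu---en--ce", "sometimesquipsentice"))  # -16
```

Every aligned letter in the nonsense alignment is a match, yet the score is strongly negative because the 12 gap positions outweigh the 8 matches.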

Not All Mismatches are the Same

• Some amino acids are more "exchangeable" than others; e.g., Ser and Thr are more similar than Trp and Ala

• A substitution matrix can be used to introduce "mismatch costs" for handling different types of substitutions

• Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

Substitution Matrix

s(a,b) corresponds to score of aligning character a with character b

Match scores are often calculated based on frequency of mutations in very similar sequences

(more details later)
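One simple way to represent s(a,b) in code is a symmetric lookup table. The sketch below is illustrative only: the numbers in `TOY_SCORES` are made up for this example and are NOT taken from PAM, BLOSUM, or any published matrix, and the names `TOY_SCORES`, `s`, and `ungapped_score` are hypothetical.

```python
# Made-up scores for illustration only (not PAM or BLOSUM values).
TOY_SCORES = {
    ("S", "S"): 4, ("T", "T"): 4, ("W", "W"): 4, ("A", "A"): 4,
    ("S", "T"): 1,    # Ser/Thr: a "more exchangeable" pair gets a mild score
    ("A", "W"): -3,   # Ala/Trp: a dissimilar pair gets a larger penalty
}


def s(a, b, default=-1):
    """Score of aligning character a with character b (order does not matter)."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), default))


def ungapped_score(x, y):
    """Sum of s(a, b) over the columns of two equal-length, gap-free sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(s(a, b) for a, b in zip(x, y))


print(s("S", "T"), s("W", "A"))      # 1 -3
print(ungapped_score("SAT", "TAS"))  # 1 + 4 + 1 = 6
```

Storing each unordered pair once and checking both orders keeps the table symmetric, so s(a,b) always equals s(b,a).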

Methods

• Global and Local Alignment

• Alignment Algorithms

• Dot Matrix Method

• Dynamic Programming Method

• Gap penalties

• DP for Global Alignment

• DP for Local Alignment

• Scoring Matrices

• Amino acid scoring matrices

• PAM

• BLOSUM

• Comparisons between PAM & BLOSUM

• Statistical Significance of Sequence Alignment

Global vs Local Alignment

Local alignment

• Finds local regions with highest similarity between 2 sequences

• Aligns these without regard for rest of sequence

• Sequences are not assumed to be similar over entire length

Global alignment

• Finds best possible alignment across entire length of 2 sequences

• Aligned sequences assumed to be generally similar over entire length

Global vs Local Alignment - example

S = CTGTCGCTGCACG
T = TGCCGTG

CTGTCG-CTGCACG
-TGCCG--TG----

Global alignment

CTGTCG-CTGCACG
-TGC-CG-TG----

CTGTCGCTGCACG
---------TGC-CGTG

Local alignment

Which is better?

Global vs Local Alignment: When to use which?

Both are important but it is critical to use right method for a given task!

Global alignment:

• Good for: aligning closely related sequences of approx. same length

• Not good for: divergent sequences or sequences with different lengths

Local alignment:

• Good for: searching for conserved patterns (domains or motifs) in DNA or protein sequences

• Not good for: generating alignment of closely related sequences

Global and local alignments are fundamentally similar and differ only in optimization strategy used in aligning similar residues
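To illustrate how a local method can differ from a global one only in its optimization strategy, here is a minimal Python sketch of a local alignment score. It is a preview under assumptions, not the lecture's own code: the function name `local_alignment_score`, the example scores (match +1, mismatch -1, gap -2), and the linear gap penalty are chosen for illustration. Compared with an end-to-end (global) recurrence, the two changes are that a cell's score is never allowed to drop below zero, and the answer is the best cell anywhere in the table rather than the final corner.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of any pair of substrings of a and b, aligned with gaps.

    Assumptions: fixed match/mismatch scores and a linear gap penalty.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a fresh local alignment here
                prev[j - 1] + pair,   # extend by pairing a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])  # the optimum may end anywhere
        prev = curr
    return best


# Only the shared "TGC" region contributes; the flanking residues are ignored.
print(local_alignment_score("AAATGCAAA", "GGTGCGG"))  # 3
```

The floor at zero is what lets the method disregard the rest of each sequence outside the best-matching region.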

Alignment Algorithms

3 major methods for alignment:

1. Dot matrix analysis

2. Dynamic Programming

3. Word or k-tuple methods (later, in Chp 4)

Dot Matrix Method (Dot Plots)

• Place 1 sequence along top row of matrix

• Place 2nd sequence along left column of matrix

• Plot a dot each time there is a match between an element of row sequence and an element of column sequence

• For proteins, usually use more sophisticated scoring schemes than "identical match"

• Diagonal lines indicate areas of match

• Reverse diagonals (perpendicular to diagonal) indicate inversions

[Figure: dot plot with ACACG along the top row and ACCGG down the left column]
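The dot matrix construction can be sketched in a few lines of Python. This is a minimal illustration using the simple "identical match" rule; the function names `dot_plot` and `render` and the choice of "*" and "." as plot symbols are assumptions for this example.

```python
def dot_plot(row_seq, col_seq):
    """One string per character of col_seq; '*' marks an identical match."""
    return [
        "".join("*" if r == c else "." for r in row_seq)
        for c in col_seq
    ]


def render(row_seq, col_seq):
    """Text view: row sequence along the top, column sequence down the left."""
    lines = ["  " + row_seq]
    for c, row in zip(col_seq, dot_plot(row_seq, col_seq)):
        lines.append(c + " " + row)
    return "\n".join(lines)


print(render("ACACG", "ACCGG"))
#   ACACG
# A *.*..
# C .*.*.
# C .*.*.
# G ....*
# G ....*
```

Runs of "*" along a diagonal correspond to stretches where the two sequences match position by position. For proteins, the equality test could be replaced by a substitution-matrix score compared against a threshold.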

Exploring Dot Plots
