Sequence Based Analysis Tutorial

March 26, 2004
NIH Proteomics Workshop
Lai-Su L. Yeh, Ph.D.
Protein Science Team Lead
Protein Information Resource at Georgetown University Medical Center


Page 1: Sequence Based Analysis Tutorial

Sequence Based Analysis Tutorial

March 26, 2004
NIH Proteomics Workshop

Lai-Su L. Yeh, Ph.D.
Protein Science Team Lead
Protein Information Resource at Georgetown University Medical Center

Page 2: Sequence Based Analysis Tutorial

Retrieval, Sequence Search & Classification Methods

• Retrieve protein info by text / UID
• Sequence Similarity Search
    • BLAST, FASTA, Dynamic Programming
• Family Classification
    • Patterns, Profiles, Hidden Markov Models, Sequence Alignments, Neural Networks
• Integrated Search and Classification System

Page 3: Sequence Based Analysis Tutorial

Sequence Similarity Search

Based on Pair-Wise Comparisons
• Dynamic Programming Algorithms
    • Global Similarity: Needleman-Wunsch
    • Local Similarity: Smith-Waterman
• Heuristic Algorithms
    • FASTA: Based on K-Tuples (2-Amino Acid)
    • BLAST: Triples of Conserved Amino Acids
    • Gapped-BLAST: Allow Gaps in Segment Pairs
    • PHI-BLAST: Pattern-Hit Initiated Search
    • PSI-BLAST: Position-Specific Iterated Search
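The two dynamic programming algorithms named above differ mainly in how the score table is initialized and whether scores may reset to zero. The following is a minimal sketch that returns only the best score; the match, mismatch, and gap values are illustrative assumptions, not parameters from the tutorial, and a real search would use a substitution matrix instead.

```python
def align_score(a, b, match=2, mismatch=-1, gap=-2, local=True):
    """Best pairwise alignment score of strings a and b.

    local=True  -> Smith-Waterman (best-scoring pair of substrings)
    local=False -> Needleman-Wunsch (both sequences aligned end to end)
    Scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                # Local alignment may restart anywhere, so never go below zero.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]
```

With these values, `align_score("XXACDEXX", "ACDE")` finds the embedded match locally, while `local=False` also charges for the four unmatched flanking residues.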

Page 4: Sequence Based Analysis Tutorial

Sequence Similarity Search

• Similarity Search Parameters
    • Scoring Matrices: Based on Conserved Amino Acid Substitution
        • Dayhoff Mutation Matrix, e.g., PAM250 (~20% Identity)
        • Henikoff Matrix from Ungapped Alignments, e.g., BLOSUM62
    • Gap Penalty
• Search Time Comparisons
    • Smith-Waterman: 10 Min
    • FASTA: 2 Min
    • BLAST: 20 Sec
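The slide lists the gap penalty as a search parameter without giving a formula. As a hedged illustration, the sketch below contrasts a linear penalty with the commonly used affine form (a fixed opening cost plus a per-residue extension cost). The default values are assumptions chosen for illustration, and tools differ on whether the opening cost already includes the first gapped residue.

```python
def linear_gap_cost(length, per_residue=2):
    """Every gapped residue costs the same (illustrative value)."""
    return per_residue * length


def affine_gap_cost(length, gap_open=11, gap_extend=1):
    """Opening a gap is expensive, lengthening it is cheap.

    Convention assumed here: cost = gap_open + gap_extend * length.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * length
```

Under the affine form, one gap of length 4 costs 15, while four separate gaps of length 1 cost 48, so alignments with a few long gaps are preferred over many scattered ones. The linear form does not distinguish the two cases.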

Page 5: Sequence Based Analysis Tutorial

Feature Representation

• Features: Residue Physicochemical Properties, Context (Local & Global) Features, Evolutionary Features
• Alternative Alphabets: Classification of Amino Acids To Capture Different Features of Amino Acid Residues
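An alternative alphabet can be sketched as a mapping from the 20 amino acids to a smaller set of classes. The four-class grouping below (hydrophobic, polar, positive, negative) is an assumption made for illustration; the tutorial does not specify which classification it uses, and other groupings are equally plausible.

```python
# Hypothetical four-class grouping; class symbols are arbitrary labels.
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar / small
    "+": "KRH",       # positively charged
    "-": "DE",        # negatively charged
}

RESIDUE_TO_CLASS = {aa: cls for cls, members in GROUPS.items() for aa in members}


def recode(seq, unknown="x"):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(RESIDUE_TO_CLASS.get(aa, unknown) for aa in seq.upper())
```

Two sequences that differ only by substitutions within a class, such as `"MKDEL"` and `"LREDI"`, recode to the same string, which is the feature the reduced alphabet is meant to capture.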

Page 6: Sequence Based Analysis Tutorial

Substitution Matrix

• Likelihood of One Amino Acid Mutated into Another Over Evolutionary Time
• Negative Score: Unlikely to Happen (e.g., Gly/Trp, -7)
• Positive Score: Conservative Substitution (e.g., Lys/Arg, +3)
• High Score for Identical Matches: Rare Amino Acids (e.g., Trp, Cys)
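To show how such a matrix is used, the sketch below scores an ungapped alignment by summing pair scores. Only the Gly/Trp and Lys/Arg values come from the slide; the identity scores are illustrative placeholders in the spirit of a PAM-style matrix, and the table is deliberately tiny rather than a full 20 x 20 matrix.

```python
# Pair scores. The first two are quoted on the slide; the identity
# scores are illustrative assumptions (rare residues score highest).
PAIR_SCORES = {
    ("G", "W"): -7,
    ("K", "R"): 3,
    ("W", "W"): 17,
    ("C", "C"): 12,
    ("G", "G"): 5,
    ("K", "K"): 5,
}


def pair_score(a, b, default=0):
    """Symmetric lookup: score(a, b) == score(b, a)."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES.get((b, a), default)


def ungapped_score(s1, s2):
    """Sum of pair scores over two equal-length aligned segments."""
    if len(s1) != len(s2):
        raise ValueError("segments must have equal length")
    return sum(pair_score(a, b) for a, b in zip(s1, s2))
```

For example, aligning `"GKW"` with `"WRW"` sums an unlikely substitution (-7), a conservative one (+3), and a rare identical match (+17).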

Page 7: Sequence Based Analysis Tutorial

BLAST

• BLAST (Basic Local Alignment Search Tool)
    • To search a sequence against the database
    • Extremely fast
    • Robust
    • Most widely used
• It finds very short segment pairs between the query and sequences in the database
• These segments are then extended in both directions until the maximum possible score for that particular segment is reached
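The seed-and-extend idea described above can be sketched as follows. This is a simplification under stated assumptions: seeds are exact three-residue matches (BLAST itself also accepts similar, high-scoring words), scoring is a plain match/mismatch scheme, and extension stops once the running score falls more than `x_drop` below the best seen. All parameter values are illustrative.

```python
def find_seeds(query, subject, w=3):
    """Return (query_pos, subject_pos) of every shared word of length w."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend(query, subject, qi, sj, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed without gaps in both directions.

    Returns (best_score, q_start, q_end) with q_end exclusive.
    """
    score = best = w * match
    best_left, best_right = qi, qi + w

    # Extend to the right of the seed.
    q, s = qi + w, sj + w
    while q < len(query) and s < len(subject) and best - score <= x_drop:
        score += match if query[q] == subject[s] else mismatch
        q, s = q + 1, s + 1
        if score > best:
            best, best_right = score, q

    # Extend to the left, starting from the best right-hand score.
    score = best
    q, s = qi - 1, sj - 1
    while q >= 0 and s >= 0 and best - score <= x_drop:
        score += match if query[q] == subject[s] else mismatch
        if score > best:
            best, best_left = score, q
        q, s = q - 1, s - 1

    return best, best_left, best_right
```

For `query = "MKTACDEFGW"` and `subject = "PPTACDEFGY"`, the seed at `(4, 4)` (`"CDE"`) grows into the segment `query[2:9]`, which is `"TACDEFG"`.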

Page 8: Sequence Based Analysis Tutorial

BLAST Search

• From BLAST Search Interface
• Table-Format Result with BLAST Output and SSEARCH (Smith-Waterman) Pair-Wise Alignment

Page 9: Sequence Based Analysis Tutorial

BLAST/SSEARCH Results

SSEARCH Alignment

BLAST Alignment

Page 10: Sequence Based Analysis Tutorial

Family Classification Methods

Based on Family Information
• ClustalW Multiple Sequence Alignment
• PROSITE Pattern Search
• Profile Search
• Hidden Markov Models (HMMs)
• Neural Networks
• Integrated Analysis

Page 11: Sequence Based Analysis Tutorial

Multiple Sequence Alignment

• ClustalW
• Progressive Pairwise Approach
    • Based on Exhaustive Pairwise Alignments
• Neighbor Joining
    • Joining Order Corresponding to a Tree
• Alignment Varies
    • Dependent on Joining Order
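How a joining order is derived from pairwise distances can be illustrated with the first step of neighbor joining. The sketch below computes the standard criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and returns the pair with the lowest value. The distance matrix in the usage note is made up for illustration; in practice the distances would come from the exhaustive pairwise alignments.

```python
def nj_pick_pair(names, dist):
    """Return the pair joined first by neighbor joining, and its Q value.

    dist is a symmetric matrix of pairwise distances (list of lists).
    """
    n = len(names)
    totals = [sum(row) for row in dist]
    best_pair, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best_q is None or q < best_q:
                best_q, best_pair = q, (names[i], names[j])
    return best_pair, best_q


# Illustrative distances for five sequences a..e.
NAMES = ["a", "b", "c", "d", "e"]
DIST = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
```

Here `nj_pick_pair(NAMES, DIST)` selects `a` and `b`, even though `d` and `e` have the smallest raw distance. Because each join fixes which sequences are aligned together first, a different joining order can yield a different final alignment.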

Page 12: Sequence Based Analysis Tutorial

How do you build a tree?

• Pick sequences to align
• Align them
• Verify the alignment
• Keep the parts that are aligned correctly
• Build and evaluate a phylogenetic tree
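The step "keep the parts that are aligned correctly" is often approximated by discarding alignment columns that contain too many gaps. The sketch below takes that approach as an assumption; the tutorial does not say how columns are chosen, and manual inspection or other criteria may be used instead. The function name and threshold are hypothetical.

```python
def trim_alignment(aligned, max_gap_fraction=0.0):
    """Keep only columns whose gap fraction is at most max_gap_fraction.

    aligned is a list of equal-length strings using '-' for gaps.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col for col in range(length)
        if sum(seq[col] == "-" for seq in aligned) / len(aligned) <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in aligned]
```

With the default threshold, `trim_alignment(["MK-TA", "MKLTA", "M--TA"])` keeps only the three gap-free columns. Raising `max_gap_fraction` retains columns in which a minority of sequences have gaps.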

Page 13: Sequence Based Analysis Tutorial

Multiple Alignment and Tree

• From Text/Sequence Search Result or ClustalW Alignment Interface

Page 14: Sequence Based Analysis Tutorial

Page 15: Sequence Based Analysis Tutorial

Motif Patterns (Regular Expressions)

• Signature Patterns for Functional Motifs
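Because signature patterns are regular expressions, a PROSITE-style pattern can be translated into a Python regular expression and matched against a sequence. The sketch below covers the common syntax elements (`x`, `[...]`, `{...}`, `(n)`, `(n,m)`, `<`, `>`) and ignores rarer forms such as anchors inside brackets. The function names are hypothetical, and the example pattern is the well-known N-glycosylation site signature `N-{P}-[ST]-{P}`.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        m = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if m:                            # repetition: x(3) or x(2,4)
            element = m.group(1)
            upper = "," + m.group(3) if m.group(3) else ""
            repeat = "{" + m.group(2) + upper + "}"
        if element == "x":               # any residue
            core = "."
        elif element.startswith("{"):    # any residue except those listed
            core = "[^" + element[1:-1] + "]"
        else:                            # single residue or [ABC]
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(pattern, sequence):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]
```

For instance, `find_motifs("N-{P}-[ST]-{P}.", "AANGSAA")` reports a match starting at residue 3, whereas `"AANPSAA"` yields none because proline is excluded at the second position.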

ProClass Motif Alignments

Page 16: Sequence Based Analysis Tutorial

PIR Pattern Search

• From Text/Sequence Search Result or Pattern Search Interface
• One Query Sequence Against PROSITE Pattern Database
• One Query Pattern (PROSITE or User-Defined) Against Sequence DB

Page 17: Sequence Based Analysis Tutorial

Pattern Search Result (I)

• One Query Sequence Against PROSITE Pattern Database

Page 18: Sequence Based Analysis Tutorial

Pattern Search Result (II)

• One Query Pattern Against Sequence Database

Page 19: Sequence Based Analysis Tutorial

Profile Method

• Profile: A Table of Scores to Express Family Consensus Derived from Multiple Sequence Alignments
    • Num of Rows = Num of Aligned Positions
    • Each row contains a score for the alignment with each possible residue.
• Profile Searching
    • Summation of Scores for Each Amino Acid Residue along Query Sequence
    • Higher Match Values at Conserved Positions
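The profile described above can be sketched as one row of scores per aligned position, with a query scored by summing the row entries for its residues. The version below makes several simplifying assumptions that are not from the tutorial: log-odds scores against a uniform background, a fixed pseudocount, and ungapped matching of the query against the profile.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """One row per aligned column; each row maps residue -> log-odds score."""
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    profile = []
    for column in zip(*aligned):
        counts = Counter(r for r in column if r in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background)
            for r in ALPHABET
        })
    return profile


def score_window(profile, segment):
    """Sum the per-position scores for an ungapped segment."""
    return sum(row[residue] for row, residue in zip(profile, segment))


def best_hit(profile, query):
    """Return (score, 0-based start) of the best-scoring window in query."""
    width = len(profile)
    return max(
        (score_window(profile, query[i:i + width]), i)
        for i in range(len(query) - width + 1)
    )
```

Built from the toy alignment `["GKSG", "GKTG", "GKSG", "AKSG"]`, the fully conserved second column gives `K` a higher score than `S` receives in the more variable third column, and `best_hit` locates the motif inside `"MMGKSGMM"` at offset 2. Query residues outside `ALPHABET` are not handled in this sketch.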

Page 20: Sequence Based Analysis Tutorial

PIR HMM Domain/Motif Search

• From Text/Sequence Search Result or HMM Search Interface
• HMMER Model Building & Sequence Search
• Search One Query Protein Against All HMMs
• Search One HMM Against Sequence DB

Page 21: Sequence Based Analysis Tutorial

HMM Search Result (I)

• One Query Protein Against All Pfam HMMs

Page 22: Sequence Based Analysis Tutorial

HMM Search Result (II)

• Search User-Built HMM Against Protein Sequence DB
• Input Sequences (Optional Residue Ranges) -> Multiple Sequence Alignment -> Model Building -> HMM Search
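Outside the PIR web interface, the last two stages of this workflow (model building and HMM search) might be scripted with the HMMER command-line tools. The sketch below only assembles the command lines and does not run them. It assumes the HMMER 3 programs `hmmbuild` and `hmmsearch`; the version used at the time of the tutorial may have differed. The alignment is assumed to exist already, and all file names are hypothetical.

```python
def hmmbuild_cmd(hmm_out, alignment_file):
    """Command to build a profile HMM from a multiple sequence alignment."""
    return ["hmmbuild", hmm_out, alignment_file]


def hmmsearch_cmd(hmm_file, sequence_db, table_out=None):
    """Command to search a sequence database with a profile HMM."""
    cmd = ["hmmsearch"]
    if table_out:
        # Optional per-target summary table.
        cmd += ["--tblout", table_out]
    return cmd + [hmm_file, sequence_db]


def pipeline(alignment_file, sequence_db, name="family"):
    """Model building followed by HMM search, as a list of commands."""
    hmm_file = name + ".hmm"
    return [
        hmmbuild_cmd(hmm_file, alignment_file),
        hmmsearch_cmd(hmm_file, sequence_db, table_out=name + ".tbl"),
    ]
```

Each command list could then be passed to `subprocess.run(cmd, check=True)` on a machine where HMMER is installed, for example after calling `pipeline("family.sto", "proteins.fasta")`.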

Page 23: Sequence Based Analysis Tutorial

Secondary Structure Features

• α-Helix: Patterns of Hydrophobic Residue Conservation at Positions I, I+3, I+4, I+7 Are Highly Indicative of an (Amphipathic) α-Helix
• β-Strands That Are Half Buried in the Protein Core Will Tend to Have Hydrophobic Residues at Positions I, I+2, I+4, I+6
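The two periodicities can be checked with a simple scan for hydrophobic residues at the stated offsets. The sketch below is a simplification: it looks at a single sequence rather than conservation across an alignment, and the set of residues treated as hydrophobic is an assumption.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumed hydrophobic set
HELIX_OFFSETS = (0, 3, 4, 7)    # I, I+3, I+4, I+7
STRAND_OFFSETS = (0, 2, 4, 6)   # I, I+2, I+4, I+6


def pattern_starts(seq, offsets):
    """0-based positions I where every offset position is hydrophobic."""
    span = max(offsets)
    return [
        i for i in range(len(seq) - span)
        if all(seq[i + k] in HYDROPHOBIC for k in offsets)
    ]
```

For example, `pattern_starts("LKELLKKLAEE", HELIX_OFFSETS)` returns `[0]`, while the alternating sequence `"VKVTVEV"` matches only `STRAND_OFFSETS`. A hit is suggestive at best; it is not a secondary structure prediction.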

Page 24: Sequence Based Analysis Tutorial

Integrated Bioinformatics System for Function and Pathway Discovery

• Data Integration
• Associative Analysis

Sequence Analysis Pipeline

(Family Classification & Feature Identification)

Data Mining Tools

(Retrieval, Visualization, Analysis, Correlation)

Data Warehouse

(Gene, Protein, Family, Function, Structure, Pathway, Interaction)

Graphical User Interface

(Browsing, Querying, Navigation)

Input

(Gene/Protein Expression Data)

Output

(Analysis Results, Biological Interpretation)

Integrated Bioinformatics System

User

Input

(Local Data, Search Criteria, Report Format)


Page 25: Sequence Based Analysis Tutorial

Analytical Pipeline

Query Sequence; PIR-NREF; iProClass

Top-Matched Superfamilies/Domains

BLAST Search; HMM Domain Search

Predicted Superfamilies/Domains/Motifs/Sites/Signal Peptides/TMHs

SSEARCH; CLUSTALW

Superfamily/Domain/Motif Alignments

Family Relationships & Functional Features

Family Classification & Functional Analysis

HMM Motif Search; Pattern Search; SignalP/TMHMM

Page 26: Sequence Based Analysis Tutorial

Integrated Bioinformatics System

• Global Bioinformatics Analysis of 1000’s of Genes and Proteins
• Pathway Discovery, Target Identification

Gene Expression Data; Proteomic Data

Clustering

Expression Pattern

Visualization & Statistical Analysis

Clustered Matrix; Pathway Map; Process Hierarchy; Clustered Graph

Gene/Peptide-Protein Mapping

Pathway Discovery (Browsing, Sorting, Visualization & Statistical Analysis)

Functional Analysis (Sequence Analysis & Information Retrieval)

Integrated Protein Knowledge System

Comprehensive Protein Information Matrix

Protein List


Page 27: Sequence Based Analysis Tutorial

Page 28: Sequence Based Analysis Tutorial

Lab Section

Page 29: Sequence Based Analysis Tutorial

Peptide Search & Results

Page 30: Sequence Based Analysis Tutorial

BLAST Similarity Search

Page 31: Sequence Based Analysis Tutorial

BLAST Search Results

Page 32: Sequence Based Analysis Tutorial

Pair-Wise Alignment

Page 33: Sequence Based Analysis Tutorial

Multiple Sequence Alignment

Page 34: Sequence Based Analysis Tutorial

Pattern Search Results

Page 35: Sequence Based Analysis Tutorial

HMM Domain Search Result

Page 36: Sequence Based Analysis Tutorial

Building HMM Profile

Page 37: Sequence Based Analysis Tutorial

Using HMM Profile for Searching

Page 38: Sequence Based Analysis Tutorial

Rabbit Alpha Crystallin A Chain
An iProClass View of the entry