

Page 1

Teresa K. Attwood

School of Biological Sciences

University of Manchester, Oxford Road

Manchester M13 9PT, UK

http://www.bioinf.man.ac.uk/dbbrowser/

Bioinformatics: gene-protein-structure-function

Page 2

Foreword

• Predicting genes in uncharacterised genomic DNA is one of the main problems facing sequence annotators. De novo prediction methods (searching for splice-site consensus motifs, biased codon usage, etc.) have been only partially successful, & investigators have found that the surest way of predicting a gene is by alignment with a homologous protein sequence.

Page 3

Overview

• In silico structure & function prediction
  – the Holy Grail
  – a reality check
• What methods are available
  – PROSITE, PRINTS, Pfam, etc.
• Why not just use PSI-BLAST?
• Expert systems & other integrated approaches
• Conclusions

Page 4

The Holy Grail of bioinformatics

...to be able to understand the words in a sequence sentence that form a particular protein structure

Page 5

The reality of sequence analysis

...isn't so glamorous...

...but means we can recognise words that form characteristic patterns, even if we don't know the precise syntax to build complete protein sentences

Page 6

Science fact & fiction

• The state of the art is pattern recognition
• Sequence pattern recognition is easier to achieve & more reliable than fold recognition
  – which is ~50% reliable even in expert hands
• Prediction is still not possible – & is unlikely to be so for decades to come (if ever)
• Structural genomics will yield representative structures for many (not all) proteins in future
  – structures of new sequences will be determined by modelling
  – prediction will become an academic exercise
• But, to debunk a popular myth, knowing structure alone does not inherently tell us function

Page 7

In silico function prediction… a reality check

• What is the function of this structure?
• What is the function of this sequence?
• What is the function of this motif?
  – the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
  – knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level

Page 8

What's in a sequence?

Page 9

Methods for family analysis

• Single motif methods
  – Fuzzy regex (eMOTIF)
  – Exact regex (PROSITE)
• Full domain alignment methods
  – Profiles (PROFILE LIBRARY)
  – HMMs (Pfam)
• Multiple motif methods
  – Identity matrices (PRINTS)
  – Weight matrices (BLOCKS)
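The difference between an exact and a fuzzy (permissive) regex can be sketched in a few lines of Python. This is only an illustration: the toy motif alignment and the residue groups in `GROUPS` are assumptions made up for the example, and are not the groupings that eMOTIF or PROSITE actually use.

```python
import re

# Illustrative residue groups (an assumption, not eMOTIF's real groups).
GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def _column_to_regex(residues):
    """Render a set of allowed residues as one regex position."""
    text = "".join(sorted(residues))
    return text if len(text) == 1 else "[" + text + "]"


def exact_regex(aligned):
    """Allow only the residues actually observed in each column."""
    return "".join(_column_to_regex(set(col)) for col in zip(*aligned))


def fuzzy_regex(aligned):
    """Widen each variable column to whole residue groups."""
    positions = []
    for col in zip(*aligned):
        allowed = set(col)
        if len(allowed) > 1:  # leave fully conserved columns strict
            for group in GROUPS:
                if allowed & set(group):
                    allowed |= set(group)
        positions.append(_column_to_regex(allowed))
    return "".join(positions)


# Hypothetical ungapped motif alignment.
motif = ["DRYLA", "DRYVA", "ERYLA"]
print(exact_regex(motif))  # [DE]RY[LV]A
print(fuzzy_regex(motif))  # [DE]RY[ILMV]A

# An unseen variant (I at position 4) is missed by the exact regex
# but captured by the fuzzy one.
print(bool(re.fullmatch(exact_regex(motif), "DRYIA")))  # False
print(bool(re.fullmatch(fuzzy_regex(motif), "DRYIA")))  # True
```

The fuzzy form picks up more distant relatives, at the price of more chance matches.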

Page 10

The challenge of family analysis

• highly divergent family with single function?
• superfamily with many diverse functional families?
  – must distinguish between the two if function analysis is done in silico
  – a tough challenge!

Page 11

In the beginning was PROSITE

[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R

[Figure label: TM domain]
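To make the pattern concrete, here is a minimal sketch that translates PROSITE pattern syntax into a Python regular expression and scans a sequence with it. It handles only the elements used in this pattern (`[...]`, `{...}`, `X`, single residues and `(n)` / `(n,m)` repeats). The function name and the toy peptide are hypothetical, chosen for the example.

```python
import re

# PS00237, the G-protein coupled receptor signature from PROSITE.
PS00237 = ("[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-"
           "[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R")

# One pattern element: a residue class, an exclusion class or a single
# letter, optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(
    r"^(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core, low, high = match.groups()
        if core.startswith("{"):        # {EDPKRH} means "anything but"
            core = "[^" + core[1:-1] + "]"
        elif core in ("x", "X"):        # X means "any residue"
            core = "."
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex(PS00237)
print(regex)

# Toy peptide built to satisfy the pattern (not a real receptor sequence).
print(bool(re.search(regex, "ASLAALAALSLDR")))  # True
# Changing the final, invariant R breaks the match.
print(bool(re.search(regex, "ASLAALAALSLDK")))  # False
```

A single substitution at a strictly conserved position is enough to lose the match, which is why exact patterns of this kind produce false negatives.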

Page 12

Diagnostic limitations of PROSITE

G_PROTEIN_RECEPTOR; PATTERN
PS00237; G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=1121(1121); /POS=1057(1057); /FALSE_POS=64(64); /FALSE_NEG=112; /PARTIAL=48; /UNKNOWN=0(0)

• This represents an apparent 20% error rate – the actual rate is probably higher
• Thus, a match to a pattern is not necessarily true – & a mis-match is not necessarily false!
• False-negatives are a fundamental limitation to this type of pattern matching
  – if you don't know what you're looking for, you'll never know you missed it!
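The slide does not say how the 20% figure is derived. One reading that reproduces it (an assumption, not stated in the source) is to count false positives, false negatives and partial matches as errors, and divide by the total number of hits:

```python
# Counts taken from the PS00237 entry quoted above.
counts = {
    "total": 1121,      # /TOTAL
    "true_pos": 1057,   # /POS
    "false_pos": 64,    # /FALSE_POS
    "false_neg": 112,   # /FALSE_NEG
    "partial": 48,      # /PARTIAL
}


def apparent_error_rate(c):
    """Assumed definition: (FP + FN + partial) / total hits."""
    errors = c["false_pos"] + c["false_neg"] + c["partial"]
    return errors / c["total"]


# The total is simply true plus false positives.
assert counts["true_pos"] + counts["false_pos"] == counts["total"]

rate = apparent_error_rate(counts)
print("apparent error rate: %.1f%%" % (100 * rate))  # 20.0%
```

Sequences the pattern misses and nobody knows about are in none of these counts, which is why the true rate is likely to be higher.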

Page 13

Then came PRINTS

[Figure labels: TM domain, loop region, TM domain]

Page 14

Hierarchical family analysis

[Figure labels: TM domain, loop region, TM domain]

Page 15

What is PRINTS?
(not the best thing since sliced bread, but....)

• A db of diagnostic fingerprints that characterise proteins
  – family analysis is hierarchical, allowing fine-grained diagnoses
• Fingerprints are groups of conserved motifs, used for iterative db searching
  – iteration refines the fingerprint
  – potency is gained from the mutual context of motif neighbours
  – results are biologically more meaningful than from single motifs
  – results are manually annotated prior to inclusion in the db
• PRINTS has many applications, e.g.:
  – basis of BLOCKS & eMOTIF
  – EditToTrEMBL - to annotate TrEMBL
  – provide annotation & hierarchical protein classification in InterPro
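The idea of a fingerprint, several motifs matched in order so that each gains weight from its neighbours, can be sketched as below. This is a simplified illustration and not the PRINTS scanning algorithm: the scoring (unweighted residue counts per column), the greedy left-to-right search, the `min_fraction` threshold and the toy motifs are all assumptions made for the example.

```python
from collections import Counter


def frequency_matrix(aligned):
    """Per-column residue counts for an ungapped motif alignment."""
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("motif sequences must have equal length")
    return [Counter(seq[i] for seq in aligned) for i in range(width)]


def best_hit(matrix, seq, start=0):
    """Return (score, position) of the best window at or after start."""
    width = len(matrix)
    best = (-1, -1)
    for pos in range(start, len(seq) - width + 1):
        score = sum(col[seq[pos + i]] for i, col in enumerate(matrix))
        if score > best[0]:
            best = (score, pos)
    return best


def scan_fingerprint(fingerprint, seq, min_fraction=0.6):
    """Match motifs in order; return a list of (motif_index, position)."""
    hits, start = [], 0
    for index, aligned in enumerate(fingerprint):
        matrix = frequency_matrix(aligned)
        max_score = sum(max(col.values()) for col in matrix)
        score, pos = best_hit(matrix, seq, start)
        if pos >= 0 and score >= min_fraction * max_score:
            hits.append((index, pos))
            start = pos + len(matrix)   # later motifs must lie downstream
    return hits


# Hypothetical three-motif fingerprint and two toy sequences.
fingerprint = [
    ["GNLLV", "GNILV", "GNLLI"],
    ["DRYLA", "DRYVA", "ERYLA"],
    ["NPIIY", "NPLIY", "NPVIY"],
]
full = "MAAGNLLVTTTTDRYLASSSSNPIIYQQ"
partial = "MAAGNLLVTTTTDRYLASSSSQQQQQQQ"

print(scan_fingerprint(fingerprint, full))     # [(0, 3), (1, 12), (2, 21)]
print(scan_fingerprint(fingerprint, partial))  # [(0, 3), (1, 12)]
```

The second sequence matches two of three motifs. A partial match like this still carries diagnostic information that a single-motif method would either report in full or miss entirely.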

Page 16

Page 17

Page 18

Page 19

Visualising fingerprints

[Figure labels: N, C]

Page 20

Diagnosing partial matches

Page 21

Page 22

Page 23

Page 24

Page 25

[Figure labels: -opioid receptor, -opioid receptor, -opioid receptor, true]

Page 26

Why bother with family dbs?

• Seq searches won't always allow outright diagnosis
  – BLAST & FASTA are not infallible & often can't assign significant scores
  – outputs may be complicated by the multi-domain or modular nature of the protein, compositionally biased regions, repeats & so on
  – annotations of retrieved hits may be incorrect
• Pattern dbs contain potent descriptors
  – so, distant relationships missed by pairwise search tools may be captured by one or more of the family or functional site distillations

Page 27

Overview of resources

• PROSITE (SIB) - 1144 entries
  – single motifs (regexs) - best with small highly conserved sites
• Profile library (ISREC) - ~300 entries
  – weight matrices - good with divergent domains & superfamilies
• PRINTS (Manchester) - 1750 entries
  – multiple motifs (fingerprints) - best for families and sub-families
• Pfam (Sanger Centre) - 3849 entries
  – HMMs - good with divergent domains & superfamilies
• Blocks (FHCRC) - ~2608 entries
  – multiple motifs (derived from InterPro & PRINTS)
• eMOTIF (Stanford)
  – permissive regexs (derived from PRINTS & BLOCKS)

Page 28

Designing a search protocol

• Given a newly-determined sequence, want to know
  – what is my protein?
  – to what family does it belong?
  – what is its function?
  – how can we explain this in structural terms?
• Given the variety of dbs available, rather than rely on just one, it is important to devise a search protocol
  – search the sequence & family dbs
  – estimate significance
  – compare results & find a consensus
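The last step of such a protocol, comparing results and finding a consensus, might look like the sketch below. Everything here is hypothetical: the hit tuples, the family labels and the `min_support` threshold are invented for illustration. In practice the significance flag would come from each method's own statistics, and family names from different dbs would first have to be mapped onto a common vocabulary.

```python
from collections import Counter


def find_consensus(hits, min_support=2):
    """Return the family backed by at least min_support significant hits.

    hits is a list of (database, family, significant) tuples;
    the result is None when no family has enough support.
    """
    votes = Counter(family for _db, family, significant in hits
                    if significant)
    if not votes:
        return None
    family, support = votes.most_common(1)[0]
    return family if support >= min_support else None


# Hypothetical search results for one query sequence.
hits = [
    ("BLAST", "rhodopsin-like GPCR", True),
    ("PROSITE", "rhodopsin-like GPCR", True),
    ("PRINTS", "rhodopsin-like GPCR", True),
    ("Pfam", "7TM receptor", False),   # below its significance cut-off
]
print(find_consensus(hits))      # rhodopsin-like GPCR

# A lone significant hit is not treated as a diagnosis.
print(find_consensus(hits[:1]))  # None
```

Requiring agreement between independent methods is the point of the protocol: a single hit, however high-scoring, is only a hypothesis.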

Page 29

This does not simply mean....

• BLAST + PROSITE (e.g., on the Web)
  – or
• FASTA + motifs/profiles (e.g., using GCG)
• But this is still what most people do
  – including so-called expert systems for genome analysis

Page 30

Expert systems for functional analysis
...from genome data to biological knowledge

GeneQuiz - Automatic protein function annotation
MAGPIE - Automatic genome analysis
PEDANT - Automatic analysis of proteins

• How they describe themselves:
  GeneQuiz - Expert system for derivation of functional information
  MAGPIE - Automated Genome Project Investigation Environment
  PEDANT - Complete functional & structural characterisation of protein sequences
• What they do:
  GeneQuiz - BLAST/FASTA, PROSITE, BLOCKS
  MAGPIE - BLAST/FASTA, PROSITE, BLOCKS
  PEDANT - BLAST/FASTA, PROSITE, BLOCKS

Page 31

Challenges for expert systems

[Figure labels: D10226, R13F63, UL78_HCMV, AR05H51; 450, 320, 320]

Page 32

Page 33

What GeneQuiz said… a thrombin receptor?

Page 34

What GeneQuiz said later…

Page 35

Other integrated approaches
The European InterPro project

• To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource – InterPro
  – current release has 5312 entries
  – a central annotation resource, with pointers to its satellite dbs
  – initial partners were PRINTS, PROSITE, profiles & Pfam
  – new partners include ProDom, TIGRfam, SMART & hopefully others (e.g., BLOCKS, MetaFam)
  – lags behind its sources
  – major role in fly & human genome annotation

Page 36

InterPro – method comparison

Page 37

Ground rules for bioinformatics

• Don't always believe what programs tell you
  – they're often misleading & sometimes wrong!
• Don't always believe what databases tell you
  – they're often misleading & sometimes wrong!
• Don't always believe what lecturers tell you
  – they're sometimes misleading & often wrong!
• In short, don't be a naive user
  – when computers are applied to biology, it is vital to understand the difference between mathematical & biological significance
  – computers don't do biology – they do sums – quickly!

Page 38

Conclusions

• Success of search protocols based only on BLAST & PROSITE is likely to be limited
  – beware 'expert' systems
  – understand the methods
• No db is best - use several
  – different methods provide different perspectives
  – dbs aren't complete & their contents don't fully overlap
• The more dbs searched, the harder it is to interpret results
  – hence s/w being designed to give "intelligent" consensus outputs
• The more computers are used to automate analysis, the greater the need for collaboration
  – between s/w developers, annotators & 'wet' experimentalists
• Long way from having reliable analytical tools – but with the right approach…