Page 1: Genome Annotation

August 2008 Bioinformatics tools for Comparative Genomics of Vectors 1

Genome Annotation

Daniel Lawson, VectorBase @ EBI

Page 2: Genome Annotation

Genome annotation - building a pipeline

Genome sequence

Map repeats

Genefinding

Protein-coding genes

Map ESTs Map Peptides

nc-RNAs

Functional annotation

Release

Page 3: Genome Annotation

Repeat features

Genomes contain repetitive sequences

Genome               Size (Mb)   % Repeat
Aedes aegypti        1,300       ~70
Anopheles gambiae    260         ~30
Culex pipiens        540         ~50

Page 4: Genome Annotation

Repeat features: Tandem repeats

Pattern of two or more nucleotides repeated where the repetitions are directly adjacent to each other

Polymorphic between individuals/populations

Example programs: Tandem, TRF
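
As a rough illustration of the definition above, the following sketch finds perfect tandem repeats with a regular expression. It is not how Tandem or TRF work internally (TRF also detects approximate repeats); the function name and its parameters are assumptions made for this example.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return sorted (start, unit, copies) tuples for perfect tandem repeats.

    Only exact, directly adjacent copies are reported. A long repeat may be
    reported more than once under different unit lengths (e.g. CA and CACA).
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One unit of unit_len bases followed by at least (min_copies - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
        for match in pattern.finditer(seq):
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), match.group(1), copies))
    return sorted(hits)


print(find_tandem_repeats("GGCACACACACATT"))
```

Positions are 0-based. Because copy number at such loci is polymorphic, the same scan run on sequences from different individuals would be expected to give different copy counts.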

Page 5: Genome Annotation

Repeat features: Interspersed elements

Transposable elements (TEs)
Transposons, retrotransposons etc.
Entire research field in itself
Example programs: RepeatScout, RECON

Page 6: Genome Annotation

Finding repeats as a preliminary to gene prediction

Repeat discovery
  Literature and public databanks
  Automated approaches (e.g. RepeatScout or RECON)

Generate a library of example repeat sequences (FASTA file with a defined header line format)
Use RepeatMasker to search the genome and mask the sequence

Page 7: Genome Annotation

Masked sequence

A repeat-masked sequence is an artificial construct in which regions thought to be repetitive are replaced with X’s

Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of TEs in the final annotation set

>my sequence

atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct

>my sequence (repeatmasked)

atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct

Page 8: Genome Annotation

Masked sequence - Hard or Soft?

Sometimes we want to mark up repetitive sequence but not exclude it from downstream analyses. This is achieved with soft-masking, where repeats are written in lower case and the rest of the sequence in upper case

>my sequence

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (softmasked)

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT
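
The difference between hard and soft masking can be sketched in a few lines. This is only an illustration: repeat-masking tools such as RepeatMasker write masked sequence directly, and the function names and the 0-based, end-exclusive interval convention here are assumptions of the example.

```python
def _clamp(intervals, length):
    """Yield intervals trimmed to the sequence, skipping empty ones."""
    for start, end in intervals:
        start, end = max(0, start), min(length, end)
        if start < end:
            yield start, end


def hard_mask(seq, intervals, mask_char="X"):
    """Replace every base inside the repeat intervals with mask_char."""
    out = list(seq)
    for start, end in _clamp(intervals, len(out)):
        out[start:end] = mask_char * (end - start)
    return "".join(out)


def soft_mask(seq, intervals):
    """Upper-case the sequence, then lower-case the repeat intervals."""
    out = list(seq.upper())
    for start, end in _clamp(intervals, len(out)):
        out[start:end] = seq[start:end].lower()
    return "".join(out)


repeats = [(3, 7)]  # hypothetical repeat coordinates
print(hard_mask("ATGAGCTTCGAT", repeats))
print(soft_mask("ATGAGCTTCGAT", repeats))
```

Both functions preserve sequence length, so coordinates on the masked sequence still refer to the same bases. Soft masking keeps the original bases, so a downstream program can choose whether to ignore the lower-case regions.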

Page 11: Genome Annotation

More terminology

Gene prediction: Predicted exon structure for the primary transcript of a gene

CDS: Coding sequence for a protein-coding gene prediction (not necessarily continuous in a genomic context)

ORF: Open reading frame, a sequence devoid of stop codons

Similarity: Similarity between sequences, which does not necessarily imply any evolutionary linkage

ab initio prediction: Prediction of gene structure from first principles using only the genome sequence

Hidden Markov Model (HMM): Statistical model (dynamic Bayesian network) which can be used as a sensitive, statistically robust search algorithm. Use of profile HMMs to search sequence data is widespread
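
The ORF definition above (a stretch of sequence devoid of stop codons) can be sketched as follows. This is a minimal example that assumes the standard stop codons and scans only the three forward frames; the function name is made up for the illustration, and it does not require a start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return (frame, start, end) of the longest stop-free stretch.

    Coordinates are 0-based and end-exclusive; only the forward strand is
    scanned, and ties go to the earliest frame.
    """
    seq = seq.upper()
    best = (0, 0, 0)
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - run_start > best[2] - best[1]:
                    best = (frame, run_start, pos)
                run_start = pos + 3
        # Close the final run at the last complete codon of this frame.
        end = frame + 3 * ((len(seq) - frame) // 3)
        if end - run_start > best[2] - best[1]:
            best = (frame, run_start, end)
    return best


frame, start, end = longest_orf("ATGGCCAAATAA")
print(frame, start, end)
```

A fuller version would also scan the reverse complement. In eukaryotes an ORF is only a hint of coding potential, because a CDS may be split across several exons.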

Page 12: Genome Annotation

Eukaryote genome annotation

[Figure: flow diagram. Genome -> (transcription) -> primary transcript -> (RNA processing) -> processed mRNA -> (translation) -> polypeptide -> (protein folding) -> folded protein -> (enzyme activity) -> functional activity. Other labels: ATG, STOP, AAAn, m7G, A, B. Annotation tasks shown alongside: find locus, find exons using transcripts, find exons using peptides, find function.]

Page 13: Genome Annotation

Prokaryote genome annotation

[Figure: flow diagram. Genome -> (transcription) -> primary transcript -> (RNA processing) -> processed RNA -> (translation) -> polypeptide -> (protein folding) -> folded protein -> (enzyme activity) -> functional activity. Other labels: START, STOP, A, B. Annotation tasks shown alongside: find locus, find CDS, find function.]

Page 14: Genome Annotation

Genefinding

Two approaches: ab initio and similarity

Page 15: Genome Annotation

Genefinding resources

Transcript
  cDNA sequences
  EST sequences
  Other (MPSS, SAGE, ditags)

Peptide
  Non-redundant (nr) protein database
  Protein sequence data, mass spectrometry data

Genome
  Other genomic sequence

Page 16: Genome Annotation

ab initio prediction

[Figure: flow diagram. Genome -> (transcription) -> primary transcript -> (RNA processing) -> processed mRNA -> (translation) -> polypeptide -> (protein folding) -> folded protein -> (enzyme activity) -> functional activity. Other labels: ATG, STOP, AAAn, m7G, A, B.]

Page 18: Genome Annotation

Genefinding - ab initio predictions

Use compositional features of the DNA sequence to define coding segments (essentially exons)

  ORFs
  Coding bias
  Splice site consensus sequences
  Start and stop codons

Each feature is assigned a log likelihood score
Use dynamic programming to find the highest scoring path
Need to be trained using a known set of coding sequences
Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
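
The idea of scoring features with log likelihoods and using dynamic programming to find the highest scoring path can be sketched with a deliberately tiny two-state model. This is not the model used by any of the programs named above: the states, emission probabilities and transition probability below are hypothetical toy values, whereas a real gene finder is trained on known coding sequences and also models codon usage, splice sites and start/stop codons.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters, chosen for illustration only.
EMIT = {
    "noncoding": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "coding": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}
STAY = 0.9  # probability of staying in the current state at each base


def best_path(seq):
    """Return the highest-scoring state path for seq (Viterbi algorithm).

    Assumes seq contains only A, C, G and T.
    """
    if not seq:
        return []
    seq = seq.upper()
    log = math.log
    trans = {(a, b): log(STAY if a == b else 1.0 - STAY)
             for a in STATES for b in STATES}
    # Log-likelihood score of the best path ending in each state.
    score = {s: log(0.5) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best previous state for state s at position i + 1
    for base in seq[1:]:
        prev, score, ptr = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + trans[(p, s)])
            score[s] = prev[best] + trans[(best, s)] + log(EMIT[s][base])
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


path = best_path("AT" * 10 + "GC" * 10 + "AT" * 10)
print("".join("C" if s == "coding" else "n" for s in path))
```

With these toy values a GC-rich stretch is labelled coding only when it is long enough to outweigh the cost of switching state twice, which is the same trade-off a trained gene finder makes with far richer features.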

Page 19: Genome Annotation

ab initio prediction

[Figure: evidence tracks along the genome: coding potential, ATG & stop codons, splice sites.]

Page 21: Genome Annotation

ab initio prediction

[Figure: evidence tracks along the genome (coding potential, ATG & stop codons, splice sites), with a final step labelled "Find best prediction".]

Page 23: Genome Annotation

Similarity prediction

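[Figure: flow diagram. Genome -> (transcription) -> primary transcript -> (RNA processing) -> processed mRNA -> (translation) -> polypeptide -> (protein folding) -> folded protein -> (enzyme activity) -> functional activity. Other labels: ATG, STOP, AAAn, m7G, A, B. Annotation tasks shown alongside: find exons using transcripts, find exons using peptides.]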

Page 24: Genome Annotation

Genefinding - similarity

Use known coding sequence to define coding regions
  EST sequences
  Peptide sequences

Needs to handle fuzzy alignment regions around splice sites
Needs to attempt to find start and stop codons
Examples: EST2Genome, exonerate, genewise
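
One way to picture the handling of fuzzy alignment ends around splice sites is to move an approximate intron onto the nearest canonical GT...AG boundaries. The sketch below is a simplification and not the method used by EST2Genome, exonerate or genewise, which score splice sites inside the alignment model itself; the function name and the window parameter are assumptions of this example.

```python
def snap_intron(genome, start, end, window=3):
    """Adjust an approximate intron [start, end) to nearby GT...AG boundaries.

    Coordinates are 0-based and end-exclusive. Returns the adjusted
    (start, end), or None if no canonical donor/acceptor pair is found
    within `window` bases of the approximate ends.
    """
    genome = genome.upper()
    # Candidate donor sites: intron starts with GT.
    donors = [s for s in range(max(0, start - window), start + window + 1)
              if genome[s:s + 2] == "GT"]
    # Candidate acceptor sites: intron ends with AG.
    acceptors = [e for e in range(max(2, end - window),
                                  min(len(genome), end + window) + 1)
                 if genome[e - 2:e] == "AG"]
    if not donors or not acceptors:
        return None
    new_start = min(donors, key=lambda s: abs(s - start))
    new_end = min(acceptors, key=lambda e: abs(e - end))
    return (new_start, new_end) if new_start < new_end else None


genome = "ATGGCC" + "GTAAGTTTTTCAG" + "GGATAA"  # exon, intron, exon
intron = snap_intron(genome, 7, 18)  # alignment ends are off by one base
print(intron, genome[:intron[0]] + genome[intron[1]:])
```

Picking the nearest site is the crudest possible rule; it ignores reading frame and non-canonical splice sites, both of which a real similarity-based gene builder has to consider.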

Page 25: Genome Annotation

Similarity-based prediction

[Figure: a cDNA/peptide sequence is aligned to the genome ("Align") and the alignment is turned into a gene prediction ("Create prediction").]

Page 26: Genome Annotation

Genefinding - comparative

Use 2 or more genomic sequences to predict genes based on conservation of exon sequences
Examples: Twinscan and SLAM

Page 27: Genome Annotation

Genefinding - manual

Manual annotation is time consuming
Annotators use specialized utilities to view genomic regions with tiers/columns of data from which they construct a gene prediction
Most decisions are subjective and tedious to document
Avoids the systematic problems of ab initio predictors and automated annotation pipelines

Page 28: Genome Annotation

Manual prediction

[Figure: evidence tracks viewed by the annotator: coding potential, ATG & stop codons, splice sites, EST similarity.]

Page 30: Genome Annotation

Manual prediction

[Figure: evidence tracks viewed by the annotator (coding potential, ATG & stop codons, splice sites, EST similarity), with a final step labelled "Predict structure".]

Page 31: Genome Annotation

Genefinding - non-coding RNA genes

Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples
tRNAscan - uses an HMM and covariance model for prediction of tRNA genes
Rfam - a suite of HMMs trained against a large number of different RNA genes

Page 32: Genome Annotation

Overview of current annotation system

Assembled genome

VectorBase gene predictions

Sequencing centre gene predictions

Merge into canonical set

Protein analysis

Display on genome browser

Release to GenBank/EMBL/DDBJ

Page 33: Genome Annotation

VectorBase gene prediction pipeline

Blessed predictions
Community submissions
Manual annotations

Species-specific predictions Similarity predictions

Transcript based predictions Ab initio gene predictions

Canonical predictions

(Genewise) (Genewise)

(SNAP) (Exonerate)

(Apollo) (Genewise, Exonerate, Apollo)

Protein family HMMs (Genewise)

ncRNA predictions (Rfam)

Page 34: Genome Annotation

VectorBase curation database pipeline for manual/community annotation

Curation warehouse db

Manual annotation (Harvard)

Apollo Apollo

Community annotation (Community representatives)

Chado-XML Chado-XML Chado

Ensembl

GFF3

Gene build db

Community annotation

Page 35: Genome Annotation

Genefinding - Review

Gene prediction relies heavily on similarity data
EST/cDNA sequences are vital for genefinding
  Training for ab initio approaches
  Similarity builds
  Validating predictions

Protein data is the predominant supporting evidence for prediction in most vector genomes
Need to be wary of predicting from predictions

Genefinding is still something of a dark art
Efforts to standardize and document supporting evidence for prediction and modifications are ongoing

Page 36: Genome Annotation

Genefinding omissions

Alternative splice forms
  Currently there is no good method for predicting alternative isoforms
  Only created where supporting transcript evidence is present

Pseudogenes
  Each genome project has a fuzzy definition of pseudogenes
  Badly curated/described across the board

Promoters
  Rarely a priority for a genome project
  Some algorithms exist but usually not integrated into an annotation set

Page 37: Genome Annotation

Functional annotation

Page 38: Genome Annotation

Functional annotation

Utilise known structure/function information to infer facts related to the predicted protein sequence
Provide users with results from a number of standard algorithms/searches
Provide users with cross-references (dbxrefs) to other resources
Assign a simple one line description for each gene product
  This will never be comprehensive
  This will always be somewhat general

Page 39: Genome Annotation

Genome annotation

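[Figure: flow diagram. Genome -> (transcription) -> primary transcript -> (RNA processing) -> processed mRNA -> (translation) -> polypeptide -> (protein folding) -> folded protein -> (enzyme activity) -> functional activity. Other labels: ATG, STOP, AAAn, m7G, A, B. Annotation task shown alongside: find function.]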

Page 40: Genome Annotation

Functional annotation - protein similarities

Predicted proteins are searched against the non-redundant protein database to look for similarities

Visually assess the top 5-10 hits to identify whether these have been assigned a function

It is important to check how the function of the top hits has been assigned in order not to transfer erroneous annotations
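
To get from raw search output to the handful of hits a curator actually inspects, the results can be grouped by query and trimmed to the best few. The sketch below assumes the search was saved in the 12-column tab-separated layout produced by BLAST+ tabular output (query, subject, ..., e-value, bit score); the function name and the example hit lines are made up for illustration.

```python
from collections import defaultdict


def top_hits(tabular_lines, n=5):
    """Group tab-separated hit lines by query and keep the n best by bit score.

    Expects 12 columns per line with the e-value in column 11 and the bit
    score in column 12. Returns {query: [(subject, evalue, bitscore), ...]}.
    """
    hits = defaultdict(list)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        hits[query].append((subject, evalue, bitscore))
    return {q: sorted(h, key=lambda hit: -hit[2])[:n] for q, h in hits.items()}


# Hypothetical hit lines, for illustration only.
example = [
    "gene1\tprotA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "gene1\tprotB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
    "gene2\tprotC\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-10\t60",
]
for query, best in top_hits(example, n=1).items():
    print(query, best)
```

The script only narrows the list. Deciding whether a top hit's function was determined experimentally or merely copied from another prediction still needs a look at the database record itself.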

Page 41: Genome Annotation

Functional annotation - Protein domains

Protein domains have a number of definitions based on their size, folding and function/evolution.

Domains are a part of protein structure description
Domains with a similar structure are likely to be related evolutionarily and have a similar function
We can use this to infer function (& structure) for an unknown protein by comparison to known proteins
The tool of choice here is a Hidden Markov Model (HMM)

Page 42: Genome Annotation

Protein Domain databases

InterPro
UniProt - protein database
Prosite - database of regular expressions
Pfam - profile HMMs
PRINTS - conserved protein signatures
Prodom - collection of multiple sequence alignments
SMART - HMMs
TIGRfams - HMMs
PIRSF
Superfamily
Gene3D
Panther - HMMs
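
To make "database of regular expressions" concrete, a Prosite-style pattern can be translated into an ordinary regular expression and run against a protein. The converter below is a sketch covering only the common pattern elements (x, [..], {..}, repeat counts and the < > anchors); the function name and the test protein are assumptions of the example. The pattern used, N-{P}-[ST]-{P}, is the Prosite N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as x(2,4) or A(3).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        # [..] classes and single residues pass through unchanged.
        piece = core + ("{" + count + "}" if count else "")
        parts.append(("^" if anchor_start else "") + piece
                     + ("$" if anchor_end else ""))
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}.")
protein = "MKNGSAPL"  # hypothetical sequence
print(regex, [m.start() for m in re.finditer(regex, protein)])
```

Short patterns like this one match by chance very often, which is one reason profile HMM resources such as Pfam are generally preferred for assigning domains.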

Page 43: Genome Annotation

Functional annotation - Other features

Other features which can be determined
  Signal peptides
  Transmembrane domains
  Low complexity regions
  Various binding sites, glycosylation sites etc.

See http://expasy.org/tools/ for a good list of possible prediction algorithms

Page 44: Genome Annotation

Signal peptides

Short peptide sequence found at the N-terminus of a pre-protein which marks the peptide for transport across one or more membranes

e.g. SignalP

Page 45: Genome Annotation

Transmembrane domains

Simple hydrophobic regions which sit inside a membrane
Transmembrane domains anchor proteins in a membrane and can orient other domains in the protein correctly
Examples: Receptors, transporters, ion channels

Identified based on the protein composition using a simple sliding window algorithm or an HMM

e.g. TMpred, TMHMM
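
The "simple sliding window algorithm" can be sketched as a hydropathy scan. This example uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6, values commonly quoted for that scale; it is a stand-in to show the idea and is not the scoring scheme of TMpred or TMHMM. The function name and the test protein are made up.

```python
# Kyte-Doolittle hydropathy values per amino acid.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_windows(protein, window=19, threshold=1.6):
    """Return (start, mean_hydropathy) for windows at or above threshold.

    Starts are 0-based. Residues missing from the scale score 0.0.
    """
    scores = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in protein.upper()]
    hits = []
    for start in range(len(scores) - window + 1):
        mean = sum(scores[start:start + window]) / window
        if mean >= threshold:
            hits.append((start, round(mean, 2)))
    return hits


# Hypothetical protein: charged flanks around a 20-residue hydrophobic core.
protein = "DEKR" * 5 + "LIVAL" * 4 + "DEKR" * 5
print(hydrophobic_windows(protein))
```

Overlapping windows above the cutoff would normally be merged into a single candidate membrane-spanning segment before reporting.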

Page 46: Genome Annotation

Ontologies

Use of ontologies to annotate gene products

Gene Ontology (GO)
  Cellular component
  Molecular function
  Biological process

Sequence Ontology (SO)

Page 47: Genome Annotation

Other data to look at

Enzyme classification (EC) numbers

Phenotype information
  Alleles
  Gene knockouts
  RNAi knockdowns

Expression data
  EST libraries (source of RNA material)
  Microarrays
  SAGE tags

Page 48: Genome Annotation

Functional assignment

The assignment of a function to a gene product can be made by a human curator by assessing all of the data (similarities, protein domains, signal peptides etc.)

This is a labour-intensive process and, like gene prediction, is subjective

There are automated approaches (based on family and domain databases such as Panther or InterPro) but these are under-developed

A large number of predictions from a genome project remain ‘hypothetical protein’ or ‘conserved hypothetical protein’.

Page 49: Genome Annotation

Caveats to genome annotation

Annotation accuracy is only as good as the available supporting data at the time of annotation
Gene predictions will change over time as new data becomes available (ESTs, related genomes)
Functional assignments will change over time as new data becomes available (characterisation of hypothetical proteins)

Gene predictions are ‘best guess’
Functional annotations are not definitive and only a guide

If you want the annotation to improve you should get involved with whoever is sequencing (or has sequenced) your genome of interest.

For vectors you can mail [email protected] with suggestions and corrections.

