genome annotation

Genome Annotation Rosana O. Babu 1

Upload: adlai

Post on 11-Jan-2016

TRANSCRIPT

Page 1: Genome Annotation

Genome Annotation

Rosana O. Babu

Page 2: Genome Annotation

Sequence to Annotation

Page 3: Genome Annotation

Input1-Variant Annotation

Page 4: Genome Annotation

Input2- Structural Annotation

Structural annotation was conducted using AUGUSTUS (version 2.5.5) with magnaporthe_grisea as the species model

However, we have to train an oomycete-specific model to obtain accurate results

Page 5: Genome Annotation

Input3-Functional Annotation

Page 6: Genome Annotation

Genome Annotation

The process of identifying the locations of genes and coding regions in a genome and determining what those genes do

Locating the structural elements and attaching the related function to each genomic location

Page 7: Genome Annotation

Genome Annotation

Gene structure prediction

Identifying elements (introns/exons, CDS, start, stop) in the genome

Gene function prediction

Attaching biological information to these elements, e.g. which protein an exon codes for

Page 8: Genome Annotation

Eukaryote genome annotation

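[Figure: a eukaryotic gene (ATG to STOP, labels A and B) followed through transcription (primary transcript), RNA processing (processed mRNA with m7G cap and AAAn tail), translation (polypeptide), protein folding (folded protein) and enzyme activity (functional activity). Annotation steps marked on the figure: find locus, find exons using transcripts, find exons using peptides, find function.]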

Page 9: Genome Annotation

Prokaryote genome annotation

[Figure: a prokaryotic locus (START to STOP, labels A and B) followed through transcription (primary transcript), RNA processing (processed RNA), translation (polypeptide), protein folding (folded protein) and enzyme activity (functional activity). Annotation steps marked on the figure: find locus, find CDS, find function.]

Page 10: Genome Annotation

Genome annotation - workflow

Genome sequence

Repeats

Structural annotation: gene finding

Protein-coding genes, ncRNAs, introns

Functional annotation

Viewed & Released in Genome viewer

Masked or un-masked genome sequence

Page 11: Genome Annotation

Genome Repeats & features

Percentage of repetitive sequences in different organisms

Genome              Genome size (Mb)   % Repeat
Aedes aegypti       1,300              ~70
Anopheles gambiae   260                ~30
Culex pipiens       540                ~50

Microsatellite, minisatellite, tandem repeat, short tandem repeat, SSR

Polymorphic between individuals/populations

Page 12: Genome Annotation

Finding repeats as a preliminary to gene prediction

Repeat discovery

Literature and public databanks

Homology based approaches

Automated approaches (e.g. RepeatScout or RECON)

Tandem repeats: Tandem, TRF

Use RepeatMasker to search the genome and mask the sequence

Page 13: Genome Annotation

Masked sequence

A repeat-masked sequence is an artificial construct in which regions thought to be repetitive are replaced with X's

Widely used to reduce the overhead of subsequent computational analyses and to reduce the impact of transposable elements (TEs) in the final annotation set

>my sequence

atgagcttcgatagcgatcagctagcgatcaggctactattggcttctctagactcgtctatctctattagctatcatctcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctactattggctgatcttaggtcttctgatcttct

>my sequence (repeatmasked)

atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct

Positions/locations are not affected by masking
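The point that masking leaves coordinates untouched can be checked with a minimal Python sketch. The sequence, the repeat intervals and the function name `hard_mask` are made up for illustration; in practice the intervals would come from a tool such as RepeatMasker.

```python
def hard_mask(seq, repeats, mask_char="x"):
    """Replace each repeat interval (0-based, end-exclusive) with mask_char."""
    chars = list(seq)
    for start, end in repeats:
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


seq = "atgagcttcgatagcgatcagctagc"
repeats = [(6, 12), (18, 22)]  # hypothetical repeat coordinates
masked = hard_mask(seq, repeats)

print(masked)
print(len(seq) == len(masked))  # masking never shifts positions
```

Because every masked base is replaced one-for-one, a feature found at a given position in the masked sequence sits at the same position in the original.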

Page 14: Genome Annotation

Types of Masking- Hard or Soft?

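Sometimes we want to mark repetitive sequence without excluding it from downstream analyses. This is achieved with soft-masking, where repeats are written in lowercase and the rest of the sequence in uppercase; hard-masking replaces the repeats with x's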

>my sequence

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTGGCTTCTCTAGACTCGTCTATCTCTATTAGTATCATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTTCGATAGCGATCAGCTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (softmasked)

ATGAGCTTCGATAGCGCATCAGCTAGCGATCAGGCTACTATTggcttctctagactcgtctatctctattagtatcATCTCGATAGCGATCAGCTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTggcttcgatagcgatcagcTAGCGATCAGGCTACTATTGGCTGATCTTAGGTCTTCTGATCTTCT

>my sequence (hardmasked)

atgagcttcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxatctcgatagcgatcagctagcgatcaggctactattxxxxxxxxxxxxxxxxxxxtagcgatcaggctactattggcttcgatagcgatcagctagcgatcaggctxxxxxxxxxxxxxxxxxxxtcttctgatcttct
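The relation between the two masking styles can be sketched in a few lines of Python. This is an illustration under assumptions: the function names and coordinates are hypothetical, and the output keeps unmasked bases in uppercase, whereas the hard-masked example above is printed in lowercase.

```python
def soft_mask(seq, repeats):
    """Uppercase everything, then lowercase the repeat intervals (0-based, end-exclusive)."""
    chars = list(seq.upper())
    for start, end in repeats:
        chars[start:end] = [c.lower() for c in chars[start:end]]
    return "".join(chars)


def soft_to_hard(soft_seq, mask_char="x"):
    """Turn a soft-masked sequence into a hard-masked one."""
    return "".join(mask_char if c.islower() else c for c in soft_seq)


soft = soft_mask("ATGAGCTTCGAT", [(3, 7)])  # hypothetical repeat at positions 3..6
hard = soft_to_hard(soft)
print(soft)
print(hard)
```

Soft-masking is reversible (uppercasing restores the original sequence), while hard-masking discards the repeat bases, which is why soft-masked input is the more flexible choice when the downstream tool supports it.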

Page 15: Genome Annotation

Genome annotation - workflow

Genome sequence

Map repeats

Gene finding: structural annotation

Protein-coding genes, ncRNAs, introns

Functional annotation

Viewed & Released in Genome viewer

Masked or un-masked

Page 16: Genome Annotation

Structural annotation

Identification of genomic elements

Open reading frames and their localization

Coding regions

Location of regulatory motifs

Start/stop

Splice sites

Non-coding regions/RNAs
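As a concrete instance of locating one of these elements, the sketch below scans the three forward reading frames for open reading frames. It is a simplification for illustration only: it assumes the standard stop codons, ignores the reverse strand and introns, and the function name `find_orfs` and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) pairs, 0-based and end-exclusive, for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon (stop included).
    min_codons counts the codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


seq = "CCATGAAATTTTAGGG"
orfs = find_orfs(seq)
for start, end in orfs:
    print(start, end, seq[start:end])
```

A scan like this is closer to prokaryotic CDS finding; eukaryotic genes additionally need splice sites to be modelled.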

Page 17: Genome Annotation

Methods

Similarity: similarity between sequences, which does not necessarily imply any evolutionary linkage

Ab initio prediction: prediction of gene structure from first principles using only the genome sequence

Page 18: Genome Annotation

Genefinding

Two approaches: ab initio and similarity

Page 19: Genome Annotation

Gene-finding resources for homology-based methods

Transcript: cDNA sequences, EST sequences

Peptide: non-redundant (nr) protein database, protein sequence data, mass spectrometry data

Genome: other genomic sequence

Page 20: Genome Annotation

ab initio prediction

[Figure: signals scored along the genome in ab initio prediction: coding potential, ATG and stop codons, splice sites.]

Page 21: Genome Annotation

Genefinding - ab initio predictions

Use compositional features of the DNA sequence to define coding segments (essentially exons)

ORFs

Coding bias

Splice site consensus sequences

Start and stop codons

Methods

Training sets are required

Each feature is assigned a log likelihood score

Dynamic programming is used to find the highest-scoring path through the sequence

Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
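The combination of log-likelihood scores and dynamic programming can be illustrated with a toy two-state Viterbi decoder. This is a sketch, not how any of the listed tools is implemented: the states and all probabilities below are invented for illustration, whereas real gene finders use far richer models (exons, introns, splice sites) with parameters estimated from a training set.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical probabilities for illustration only.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the highest-scoring state path for seq, scoring in log space."""
    score = [{s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        row, ptr = {}, {}
        for s in STATES:
            # Best previous state for arriving in state s at this position.
            prev = max(STATES, key=lambda p: score[-1][p] + math.log(TRANS[p][s]))
            row[s] = score[-1][prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


path = viterbi("AATTAATTGCGCGCGCGC")
print(path)
```

With these toy parameters the AT-rich prefix is labelled noncoding and the GC-rich suffix coding; the switch penalty in `TRANS` is what keeps the path from flipping state at every base.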

Page 22: Genome Annotation

Genefinding - similarity

Use known coding sequence to define coding regions

EST sequences

Peptide sequences

Difficulty: alignments are fuzzy around splice sites, so exact exon boundaries are hard to place

Examples: EST2Genome, exonerate, genewise

Gene-finding - comparative

Use two or more genomic sequences to predict genes based on conservation of exon sequences

Examples: Twinscan and SLAM

Page 23: Genome Annotation

Genefinding - non-coding RNA genes

Non-coding RNA genes can be predicted using knowledge of their structure or by similarity with known examples

tRNAscan - uses an HMM and covariance model for prediction of tRNA genes

Rfam - a suite of HMMs trained against a large number of different RNA genes

Page 24: Genome Annotation

Gene-finding omissions

Alternative isoforms

Currently there is no good method for predicting alternative isoforms

Only created where supporting transcript evidence is present

Pseudogenes

Each genome project has a fuzzy definition of pseudogenes

Badly curated/described across the board

Promoters

Rarely a priority for a genome project

Some algorithms exist but usually not integrated into an annotation set

Page 25: Genome Annotation

Practical- structural annotation

Eukaryotes- AUGUSTUS (gene model)

~/Programs/augustus.2.5.5/bin/augustus --strand=both --genemodel=partial --singlestrand=true --alternatives-from-evidence=true --alternatives-from-sampling=true --progress=true --gff3=on --uniqueGeneId=true --species=magnaporthe_grisea our_genome.fasta >structural_annotation.gff

Prokaryotes – PRODIGAL (Codon Usage table)

~/Programs/prodigal.v2_60.linux -a protein_file.fa -g 11 -d nucleotide_exon_seq.fa -f gff -i contigs.fa -o genes_quality.txt -s genes_score.txt -t genome_training_file.txt
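Both commands above write GFF output, so a quick sanity check is to count the predicted features by type. The sketch below assumes standard nine-column, tab-separated GFF3 feature lines; the function name and the example records are made up for illustration and are not real AUGUSTUS output.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count GFF3 features by type (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a standard nine-column feature line
        counts[fields[2]] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    "##gff-version 3",
    "contig1\tAUGUSTUS\tgene\t100\t899\t0.95\t+\t.\tID=g1",
    "contig1\tAUGUSTUS\ttranscript\t100\t899\t0.95\t+\t.\tID=g1.t1;Parent=g1",
    "contig1\tAUGUSTUS\tCDS\t100\t399\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1",
    "contig1\tAUGUSTUS\tCDS\t600\t899\t0.95\t+\t0\tID=g1.t1.cds;Parent=g1.t1",
]
counts = count_feature_types(example)
print(dict(counts))
```

For a real run, the same function can be given an open file handle, for example `count_feature_types(open("structural_annotation.gff"))`.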

Page 26: Genome Annotation

Structural annotation was conducted using AUGUSTUS (version 2.5.5) with magnaporthe_grisea as the species model

However, we have to train a dedicated model to obtain accurate results

Page 27: Genome Annotation

Functional annotation

Page 28: Genome Annotation

Functional annotation

Attaching biological information to genomic elements

Biochemical function

Biological function

Involved regulation and interactions

Expression

Apply known structural information to the predicted protein sequence

Page 29: Genome Annotation

Genome annotation - workflow

Genome sequence

Map repeats

Gene finding: structural annotation

Protein-coding genes, ncRNAs, introns

Functional annotation

Viewed & Released in Genome viewer

Masked or un-masked

Page 30: Genome Annotation

Genome annotation

[Figure: the eukaryotic gene-to-function diagram again (genome, primary transcript, processed mRNA, polypeptide, folded protein, functional activity), now highlighting the step: find function.]

Page 31: Genome Annotation

Functional annotation – Homology Based

Predicted exons/CDS/ORFs are searched against non-redundant protein databases (NCBI nr, Swiss-Prot) to find similar sequences

Visually assess the top 5-10 hits to identify whether these have been assigned a function

Functions are then assigned to the predictions based on those hits

Page 32: Genome Annotation

Functional annotation - Other features

Other features which can be determined

Signal peptides

Transmembrane domains

Low-complexity regions

Various binding sites, glycosylation sites, etc.

Protein domains

See http://expasy.org/tools/ for a good list of possible prediction algorithms

Page 33: Genome Annotation

Functional annotation - Other features (Ontologies)

Use of ontologies to annotate gene products

Gene Ontology (GO)

Cellular component

Molecular function

Biological process

Page 34: Genome Annotation

Practical - FUNCTIONAL ANNOTATION

Homology Based Method

Set up a BLAST database (nucleotide/protein)

BLAST genome.fasta against it for annotations (nucleotide/protein)

Filter hits by E-value cutoff (<= 0.01) for nucleotide/protein

Further filter for the best BLAST hits (5-15) and assign functions

Removing positive-strand BLAST hits

Removing negative-strand BLAST hits
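The E-value and best-hit filtering steps could be sketched as follows. The sketch assumes BLAST tabular output (`-outfmt 6`, where columns 11 and 12 hold the E-value and bit score) and reads the 0.01 cutoff as "keep hits with an E-value at or below it"; the function name, the default `top_n` and the example records are hypothetical.

```python
def best_hits(blast_lines, max_evalue=0.01, top_n=5):
    """Keep, per query, the top_n hits with E-value <= max_evalue.

    Hits are ranked by ascending E-value, then by descending bit score.
    Returns {query_id: [(subject_id, evalue), ...]}.
    """
    hits = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            hits.setdefault(query, []).append((evalue, -bitscore, subject))
    return {
        query: [(subject, evalue) for evalue, _, subject in sorted(found)[:top_n]]
        for query, found in hits.items()
    }


# Hypothetical tabular records, for illustration only.
example = [
    "g1.t1\tsubjA\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-50\t200",
    "g1.t1\tsubjB\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30",
    "g1.t1\tsubjC\t60.0\t90\t36\t0\t1\t90\t1\t90\t1e-20\t120",
    "g2.t1\tsubjD\t30.0\t40\t28\t0\t1\t40\t1\t40\t2.0\t25",
]
result = best_hits(example)
print(result)
```

Queries with no hit under the cutoff (here `g2.t1`) drop out entirely, which is one way candidate "hypothetical proteins" end up without an assigned function.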

Page 35: Genome Annotation

Functional annotation- output

August 2008 Bioinformatics tools for Comparative Genomics of Vectors

Page 36: Genome Annotation

Conclusion

Annotation accuracy is only as good as the supporting data available at the time of annotation, so annotations need to be updated

Gene predictions will change over time as new data become available (ESTs, related genomes) that are more similar than the data previously available

Functional assignments will change over time as new data become available (e.g. characterization of hypothetical proteins)

Page 37: Genome Annotation

Thank You
