Computational Genome Annotation Chapter 3 Ying Xu

Upload: wilma

Post on 24-Feb-2016

Page 1: Computational Genome Annotation

Computational Genome Annotation

Chapter 3, Ying Xu

Page 2: Computational Genome Annotation

Introduction

• The DNA sequence of a genome encodes the organism’s entire functionality.

• Genome sizes range from millions of bases (microbes) to billions (human).

• What information is encoded in a continuous string of A’s, C’s, G’s, and T’s?

• Where is it located?

• What information is directly identifiable?

• How should the directly identified information be presented?

• Two approaches:

1. Ab initio approach,

2. Comparative approach.

Page 3: Computational Genome Annotation

• Ab initio approach: predicts functional elements from their statistical features, and can therefore identify novel functional elements.

• Comparative approach: identifies functional elements through sequence similarity to previously known ones.

Page 4: Computational Genome Annotation

3.2 Prediction of Protein-Coding genes

• Genes form the single largest set of functional elements in a genome.

• Coding regions make up 75-90% of a microbial genome.

• A sequence fragment between two stop codons in the same reading frame is called an open reading frame (ORF).
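The ORF definition above can be turned into a short scan. The following is a minimal sketch, assuming the standard stop codons TAA, TAG, and TGA and an uppercase A/C/G/T alphabet; the function names and the `min_len` parameter are illustrative and not taken from the chapter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Return (start, end) pairs of stop-to-stop fragments in one frame.

    start is the first base after a stop codon, end is the first base of the
    next in-frame stop codon (exclusive).
    """
    orfs = []
    prev_stop_end = None
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if prev_stop_end is not None and i > prev_stop_end:
                orfs.append((prev_stop_end, i))
            prev_stop_end = i + 3
    return orfs


def six_frame_orfs(seq, min_len=0):
    """Collect ORFs from the three frames of each strand."""
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_len:
                    results.append((strand, frame, s[start:end]))
    return results
```

Fragments that run off either end of the sequence are ignored here because they are not bounded by two stop codons; a real gene finder would usually keep them as partial ORFs.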

Page 5: Computational Genome Annotation

3.2.1 Evaluation of coding potential

• Ab initio prediction is based on di-codons (six-mers).

• E.g., the di-codon GACTGC occurs more often in noncoding regions than in coding regions in Shewanella oneidensis.

• There are 4,096 different di-codons (4^6 = 4,096).

Page 6: Computational Genome Annotation

For each di-codon X

• Count the total number of occurrences of X in coding regions and in noncoding regions.

• Relative frequency (RF) of X in coding regions = number of occurrences of X in coding regions / total number of di-codon occurrences in coding regions.

• Estimate the RF of X in noncoding regions in a similar fashion.

Page 7: Computational Genome Annotation

Preference model

• Preference value of X = log(FC(X)/FN(X)),

• FC(X): X’s relative frequency in coding regions,

• FN(X): X’s relative frequency in noncoding regions.

• If X has the same RF in both, its preference value is zero.

• A positive value means X has a higher RF in coding than in noncoding regions;

• otherwise, the value is negative.

Page 8: Computational Genome Annotation

• Overall preference value of a region = sum of the preference values of all its di-codons.

• Positive overall preference value -> coding region.

• Negative overall preference value -> noncoding region.

• GRAIL and SORFIND,

• Hidden Markov models.
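The di-codon preference model from the preceding slides can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used by any named program: six-mers are counted with a sliding window (`step=1`), and a pseudocount (not mentioned in the slides) is added so that unseen di-codons do not produce a logarithm of zero.

```python
import math
from collections import Counter


def relative_freqs(seqs, step=1, pseudocount=1.0):
    """Return a function X -> relative frequency of six-mer X in seqs.

    The pseudocount is an assumption added to avoid zero frequencies.
    """
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    total = sum(counts.values()) + pseudocount * 4096  # 4^6 possible six-mers
    return lambda x: (counts[x] + pseudocount) / total


def preference(x, fc, fn):
    """Preference value log(FC(X) / FN(X)) of one di-codon."""
    return math.log(fc(x) / fn(x))


def overall_preference(segment, fc, fn, step=1):
    """Sum of the preference values of all di-codons in a segment."""
    return sum(
        preference(segment[i:i + 6], fc, fn)
        for i in range(0, len(segment) - 5, step)
    )


# Toy training data (hypothetical, for illustration only).
fc = relative_freqs(["ATGGCGATGGCGATGGCG"])
fn = relative_freqs(["TTTTTTTTTTTTTTTTTT"])
print(overall_preference("ATGGCGATGGCG", fc, fn) > 0)  # coding-like segment
```

Setting `step=3` would restrict counting to in-frame di-codons, which is the natural choice when the reading frame of the training genes is known.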

Page 9: Computational Genome Annotation

Markov Chain Model

• The preference model treats consecutive 6-mers (di-codons) as independent.

• A Markov chain model captures the dependence relationships among consecutive di-codons.

Page 10: Computational Genome Annotation

Bayesian formula

• P(S = s1, s2, . . . , sk | coding) and P(S = s1, s2, . . . , sk | noncoding) are the probabilities of observing the DNA segment S = s1, s2, . . . , sk in a coding and in a noncoding region.

• P(coding|S) = P(S|coding) / (P(S|coding) + P(S|noncoding) P(noncoding)/P(coding))

FIFTH-ORDER MODEL
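A fifth-order Markov chain conditions each base on the five bases before it, which is what makes it a six-mer model. The sketch below estimates such a chain for coding and noncoding training sequences and plugs the two likelihoods into the Bayesian formula above. The pseudocount, the default prior of 0.5, and the choice to ignore the probability of the first five bases are assumptions made for this illustration.

```python
import math
from collections import defaultdict


def train_fifth_order(seqs, pseudocount=1.0):
    """Return cond_prob(context, base) estimated from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(5, len(s)):
            counts[s[i - 5:i]][s[i]] += 1

    def cond_prob(context, base):
        row = counts.get(context, {})
        return (row.get(base, 0.0) + pseudocount) / (
            sum(row.values()) + 4 * pseudocount
        )

    return cond_prob


def log_likelihood(seq, cond_prob):
    """log P(S | model), ignoring the first five bases (assumption)."""
    return sum(
        math.log(cond_prob(seq[i - 5:i], seq[i])) for i in range(5, len(seq))
    )


def posterior_coding(seq, p_coding, p_noncoding, prior_coding=0.5):
    """P(coding | S) from the Bayesian formula, computed in log space."""
    diff = log_likelihood(seq, p_noncoding) - log_likelihood(seq, p_coding)
    diff = min(diff, 700.0)  # guard against overflow in exp
    ratio = math.exp(diff) * (1.0 - prior_coding) / prior_coding
    return 1.0 / (1.0 + ratio)
```

Dividing numerator and denominator of the formula by P(S|coding) gives 1 / (1 + P(S|noncoding)/P(S|coding) * P(noncoding)/P(coding)), which is the form used in `posterior_coding`.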

Page 11: Computational Genome Annotation

3.2.2 Identification of translation start

• Sequences around the translation start (ATG) share similar patterns.

• New translation starts are predicted from previously known ones.

• A weight matrix is built from the flanking DNA sequences of known starts.

Page 12: Computational Genome Annotation

Weight matrix
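Since the slide's figure did not survive, here is a hedged sketch of how a weight matrix is commonly built and used: each column stores log-odds scores of the four bases at one position of aligned known sites, and a candidate window is scored by summing one entry per position. The uniform background of 0.25, the pseudocount, and the example sites are assumptions for illustration only.

```python
import math


def build_weight_matrix(aligned_sites, pseudocount=1.0, background=0.25):
    """Build a log-odds weight matrix from equal-length aligned sites."""
    length = len(aligned_sites[0])
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            base: math.log(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def score_window(window, matrix):
    """Sum the matrix entries selected by the bases of the window."""
    return sum(col[base] for col, base in zip(matrix, window))


def best_start(seq, matrix):
    """Return (score, position) of the best-scoring window in seq."""
    width = len(matrix)
    scored = [
        (score_window(seq[i:i + width], matrix), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(scored)


# Hypothetical aligned sequences flanking known translation starts.
known_sites = ["GGAGATG", "GGAGATG", "AGAGATG", "GGAGGTG"]
matrix = build_weight_matrix(known_sites)
print(best_start("CCCCGGAGATGCCCC", matrix)[1])  # position of the best window
```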

Page 13: Computational Genome Annotation

3.2.3 Ab initio Gene Prediction through Information Fusion

• Identify all ORFs in the six reading frames.

• Measure the coding potential of each ORF.

• Look for a start with a high translation-start score such that the whole region has high coding potential.

• At a true start, coding potential is strong on the right and low on the left.

Page 14: Computational Genome Annotation

Gene Length Distribution

• The length distribution of all known genes is not uniform.

• It can be approximated by an exponential or a gamma distribution.

• It is asymmetric, with a heavy tail on the right side.

Page 15: Computational Genome Annotation

G+C Composition

• Regions with different G+C compositions have different di-codon frequencies.

• Using a single set of di-codon RFs can lead to incorrect predictions.

• Remedies: separate di-codon frequency tables for different G+C ranges, or a normalization factor.

Page 16: Computational Genome Annotation

Regions of Repeats

• Repeat regions generally do not overlap with any genes.

• Reliable software programs exist for predicting them.

• These regions are masked out before running a gene-finding program.

Page 17: Computational Genome Annotation

Neural Networks

• A non-gene is a region in an ORF that does not overlap any coding region.

• Set A contains only genes and set B contains only non-genes.

• Examine the features common to each of sets A and B.

Page 18: Computational Genome Annotation

• Set A consists of a vector (C1, C2, T, G, L, 1) for each gene.

• Set B consists of a vector (C1, C2, T, G, L, 0) for each nongene.

• The final entry, 1 or 0, indicates whether the vector comes from the set of genes or the set of nongenes.

Page 19: Computational Genome Annotation

Back-propagation

• One or two hidden layers should suffice.

• Nodes are connected by edges.

• Training adjusts the edge weights.

• GRAIL uses this as its main prediction framework.
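To make the back-propagation idea concrete, the following is a small self-contained sketch of a network with one hidden layer trained on labeled vectors of the form (C1, C2, T, G, L, label). It is not GRAIL's network: the sigmoid activations, squared-error loss, learning rate, layer size, and the toy feature values (assumed to be scaled to the range 0 to 1) are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyNet:
    """One hidden layer, one output node, trained by back-propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row has one extra entry for the bias input.
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_h]
        hb = hidden + [1.0]
        out = sigmoid(sum(w * v for w, v in zip(self.w_o, hb)))
        return xb, hb, out

    def train_step(self, x, label, lr=0.5):
        xb, hb, out = self.forward(x)
        # Error signal at the output node (squared error, sigmoid output).
        delta_o = (out - label) * out * (1.0 - out)
        # Propagate the error back to the hidden nodes before updating w_o.
        delta_h = [delta_o * self.w_o[j] * hb[j] * (1.0 - hb[j])
                   for j in range(len(self.w_h))]
        for j in range(len(self.w_o)):
            self.w_o[j] -= lr * delta_o * hb[j]
        for j, row in enumerate(self.w_h):
            for i in range(len(row)):
                row[i] -= lr * delta_h[j] * xb[i]
        return 0.5 * (out - label) ** 2


def train(net, data, epochs=500, lr=0.5):
    """Repeatedly adjust the edge weights on every labeled vector."""
    for _ in range(epochs):
        for *features, label in data:
            net.train_step(features, label, lr)


# Hypothetical feature vectors (C1, C2, T, G, L, label).
set_a = [(0.9, 0.8, 0.7, 0.6, 0.8, 1), (0.8, 0.9, 0.8, 0.5, 0.7, 1)]
set_b = [(0.1, 0.2, 0.2, 0.5, 0.1, 0), (0.2, 0.1, 0.3, 0.6, 0.2, 0)]
net = TinyNet(n_in=5, n_hidden=3)
train(net, set_a + set_b)
```

After training, `net.forward(features)[2]` gives a score between 0 and 1 that can be read as the network's confidence that the vector describes a gene.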

Page 20: Computational Genome Annotation

[Figure: neural network with input nodes, a hidden layer, and an output node]

WEB SERVERS FOR GENOME ANNOTATION

Page 21: Computational Genome Annotation

3.2.4 Gene Identification through comparative analysis

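• Genes can be identified through high sequence similarity to known genes.

• BLAST is used for the similarity search.

• First use the comparative approach to find a subset of the genes,

• then use an ab initio method to find the rest of the genes in the genome.

• EST-based gene predictions.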

Page 22: Computational Genome Annotation

Identifying Conserved Regions across Multiple Genomes

• Tools for finding conserved (long) regions across multiple genomes:

(a) megaBLAST: very long sequence comparisons, where the sequences to be aligned are closely related.

(b) SENSEI: first finds short (size of 8) ungapped sequence matches, then extends them into longer gapped alignments.

(c) MUMmer: utilizes a suffix tree data structure to speed up computation and reduce the memory requirement.
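The seed-and-extend idea mentioned on this slide (find short exact matches of size 8 first, then grow them) can be sketched as follows. This is only an illustration of the general technique: it uses a plain dictionary index instead of a suffix tree, and it extends seeds only as exact matches, without the gapped alignment step that real tools perform.

```python
from collections import defaultdict


def find_seeds(a, b, k=8):
    """Return (i, j) pairs where a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            seeds.append((i, j))
    return seeds


def extend_exact(a, b, i, j, k=8):
    """Extend a seed in both directions while the bases keep matching.

    Returns (start_in_a, start_in_b, length) of the maximal exact match.
    """
    start_a, start_b = i, j
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = i + k, j + k
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return start_a, start_b, end_a - start_a


genome_a = "GGGGACGTTGCATCAGCCCC"  # hypothetical toy sequences
genome_b = "AAAACGTTGCATCAGTTTT"
matches = {extend_exact(genome_a, genome_b, i, j)
           for i, j in find_seeds(genome_a, genome_b)}
print(matches)
```

Indexing only one sequence and streaming the other keeps memory proportional to the indexed sequence, which is one reason seed-based methods scale to long genomes.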

Page 23: Computational Genome Annotation

PatternHunter

• Uses non-contiguous sequence matches.

• Requires much less time and memory than BLAST.

• DIALIGN predicts genes through genome-scale sequence comparison.

[Figure: genes identified by comparing Genome A and Genome B]

Page 24: Computational Genome Annotation

3.2.5 Interpretation of Gene Prediction

• GRAIL labels predictions with descriptors such as marginal, intermediate, or strong.

• All predictions are divided into bins based on their prediction scores.

• Genes with scores between 0 and 0.1 are put into the first bin,

• genes with scores between 0.1 and 0.2 into the second bin, etc.

Page 25: Computational Genome Annotation

Cont.,

• Different reliability thresholds are applied for different purposes.

• Gene validation: use a high reliability threshold.

• General screening: use a low reliability threshold.
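The binning and thresholding described on these two slides can be sketched as below. One assumption is added that the slides do not spell out: the reliability of a bin is estimated as the fraction of predictions in that bin that are confirmed genes in a labeled test set. The function names and toy scores are illustrative.

```python
def bin_index(score, n_bins=10):
    """Map a score in [0, 1] to a bin; 0-0.1 is bin 0, 0.1-0.2 is bin 1, etc."""
    return min(int(score * n_bins), n_bins - 1)


def bin_reliability(labeled_predictions, n_bins=10):
    """Fraction of true genes per bin; None for bins with no predictions.

    labeled_predictions: iterable of (score, is_true_gene) pairs.
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    for score, is_true in labeled_predictions:
        b = bin_index(score, n_bins)
        totals[b] += 1
        hits[b] += int(is_true)
    return [hits[b] / totals[b] if totals[b] else None for b in range(n_bins)]


def select_predictions(scores, reliability, threshold, n_bins=10):
    """Keep scores whose bin reliability meets the chosen threshold."""
    selected = []
    for s in scores:
        r = reliability[bin_index(s, n_bins)]
        if r is not None and r >= threshold:
            selected.append(s)
    return selected


# Hypothetical labeled predictions used to calibrate the bins.
labeled = [(0.05, False), (0.08, False), (0.15, True),
           (0.12, False), (0.95, True), (1.0, True)]
reliability = bin_reliability(labeled)
strict = select_predictions([0.05, 0.15, 0.97], reliability, threshold=0.9)
lenient = select_predictions([0.05, 0.15, 0.97], reliability, threshold=0.4)
```

A high threshold, as in `strict`, fits gene validation; a low one, as in `lenient`, fits general screening.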

Page 26: Computational Genome Annotation

Pseudogenes

• Pseudogenes contain frameshifts due to deletions/insertions.

• They are hard for a regular gene prediction program to detect.

• A specialized coding-region detection program is needed.

• Mycobacterium leprae has 1,100 predicted pseudogenes.

Page 27: Computational Genome Annotation

3.3 PREDICTION OF RNA-CODING GENES

• tRNA (transfer RNA), rRNA (ribosomal RNA), sRNA (small RNA), srpRNA (signal recognition particle RNA), etc.

• RNAs serve as catalysts and as information storage molecules.

• tRNAs are adapter molecules that decode the genetic code.

• rRNAs catalyze the synthesis of proteins.

Page 28: Computational Genome Annotation

Cont.,

• (1) RNA signals are a combination of sequence and structure motifs.

• Prediction programs are therefore designed to recognize particular types of RNA genes, for example tRNA genes.

Page 29: Computational Genome Annotation

Cont.,

• (2) Secondary structures within an RNA’s folded tertiary structure, such as stems, provide signals for RNA gene recognition.

• tRNAscan-SE:

• accuracy greater than 99%,

• false positive rate of one false prediction per 15 gigabases.

Page 30: Computational Genome Annotation

SECONDARY STRUCTURE

[Figure: RNA secondary structure, showing loops and a stem]

Page 31: Computational Genome Annotation

TERTIARY STRUCTURE

Page 32: Computational Genome Annotation

3.4 IDENTIFICATION OF PROMOTERS

• A genome contains coding regions and regulatory regions.

• Promoters regulate mRNA transcription.

• The transcription process is initiated by RNA polymerase.

Page 33: Computational Genome Annotation

3.4.1 Promoter Prediction through Feature Recognition

• Hidden Markov model (HMM): a statistical tool.

• Under the model, promoter sequences have higher probabilities than nonpromoter sequences.

• The model captures conserved sequence fragments and their spacing relationships.

Page 34: Computational Genome Annotation

Sequences recognized by the sigma-54 factor

Page 35: Computational Genome Annotation

CONSENSUS

• Searches for conserved k-mers.

• For each new sequence, determine whether it contains any k-mers similar to k-mers of the previous sequences.

• The result is summarized as a consensus matrix.

Page 36: Computational Genome Annotation

MEME

• Finds the maximum-likelihood conserved k-mers using the EM algorithm.

• Other tools: Signal Scan and NNPP.

• Promoter-gene structure, or the more general structure promoter-gene-gene- . . . -gene.

Page 37: Computational Genome Annotation

3.5 OPERON IDENTIFICATION

• An operon is a basic organizational unit of genes for transcriptional regulation.

• Genes in an operon are arranged in tandem and controlled by shared regulatory binding motifs.

Page 38: Computational Genome Annotation

Computational identification of an operon

(1) A predicted promoter region and a terminator,

(2) A set of genes arranged in tandem on the same strand,

(3) Functional information of the genes involved.

• Operon identification helps identify transcriptional regulatory networks.

Page 39: Computational Genome Annotation

Terminator Identification

• Two types: rho-dependent and rho-independent.

Three nucleic acid binding sites:

• A double-stranded DNA binding site,

• An RNA–DNA hybrid binding site,

• A single-stranded RNA binding site.

Page 40: Computational Genome Annotation

TransTerm

• Finds rho-independent transcription terminators (bacterial genomes).

• Genes in an operon often encode enzymes that catalyze successive reactions in metabolic pathways.

• http://genomics4.bu.edu/operons/

Page 41: Computational Genome Annotation

Cont.,

• lac operon.

• trp operon: biosynthesis of tryptophan.

• mhp operon: phenylpropionate catabolic pathway.

• Using these known operons, estimate:

(1) Intergenic distance within an operon vs. between operons,

(2) Distribution of the number of genes per operon.
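A simple way to use the intergenic-distance criterion is to group adjacent genes that lie on the same strand and are separated by a short gap. The sketch below does only that; the 50 bp cutoff is a hypothetical value chosen for illustration (in practice it would be derived from the distances observed in known operons), and promoter, terminator, and functional evidence are left out.

```python
def group_operons(genes, max_gap=50):
    """Group genes into candidate operons.

    genes: list of (name, start, end, strand) tuples sorted by start.
    Adjacent genes are grouped when they share a strand and the intergenic
    distance is at most max_gap (hypothetical cutoff).
    """
    operons = []
    current = []
    for gene in genes:
        if current:
            prev = current[-1]
            gap = gene[1] - prev[2]
            if gene[3] == prev[3] and gap <= max_gap:
                current.append(gene)
                continue
            operons.append(current)
        current = [gene]
    if current:
        operons.append(current)
    return [[g[0] for g in operon] for operon in operons]


# Hypothetical gene coordinates.
genes = [
    ("a", 100, 400, "+"),
    ("b", 420, 900, "+"),
    ("c", 1200, 1500, "+"),
    ("d", 1510, 1800, "-"),
    ("e", 1820, 2000, "-"),
]
print(group_operons(genes))  # [['a', 'b'], ['c'], ['d', 'e']]
```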

Page 42: Computational Genome Annotation

3.6 FUNCTIONAL CATEGORIES OF GENES

• EC classes for enzymes.

• Other genes are categorized in an ad hoc way.

• If the “metabolism” or “pathway” of a gene is known, the gene is labeled with that functional category.

Page 43: Computational Genome Annotation

Gene group of Methanosarcina barkeri

Page 44: Computational Genome Annotation

Functional assignments of genes in the “cell motility” pathway

Page 45: Computational Genome Annotation

3.7 CHARACTERIZATION OF OTHER FEATURES IN A GENOME

• G+C composition correlates with the density of genes.

• Within a genome, higher G+C composition implies higher gene density.

Page 46: Computational Genome Annotation

CpG Islands

• CpG islands are DNA regions with a higher frequency of CpG dinucleotides.

• They are often located near the transcriptional starts of genes.

• A commonly used threshold is 0.6.

• For the human genome, a threshold of 0.8 is used.
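The slide does not say which quantity the thresholds apply to. A common convention, assumed here, is the observed-to-expected CpG ratio of a window: (number of CG dinucleotides x window length) / (number of C x number of G). The sketch below computes that ratio and reports windows at or above a threshold; the window and step sizes are illustrative defaults.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio of a sequence (assumed definition)."""
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def cpg_windows(seq, window=200, step=50, threshold=0.6):
    """Return (start, ratio) for windows whose ratio meets the threshold."""
    hits = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        ratio = cpg_obs_exp(seq[start:start + window])
        if ratio >= threshold:
            hits.append((start, ratio))
    return hits


print(cpg_obs_exp("CGCGCGCG"))  # 2.0
print(cpg_obs_exp("CCCCGGGG"))  # 0.5
```

Published CpG island criteria usually combine this ratio with a minimum G+C content and a minimum length; those extra conditions are omitted here.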

Page 47: Computational Genome Annotation

Genomic Repeats

• Repeats occur in both prokaryotic and eukaryotic genomes.

• Transposons are mobile elements that move around a genome.

• Repeat identification is part of the genome annotation process.

• Gene density: number of genes per fixed length of genomic sequence.

Page 48: Computational Genome Annotation

Cont.,

• (a) Tandem Repeat Identification: Exact and

approximate string matching.

• (b) RepeatMasker: Matching all the repeat

sequences in its database against the DNA sequence.

• (c) RepeatFinder: Either exact or approximate

match, using a clustering technique.
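Exact tandem repeat identification, item (a) above, reduces to counting how many times a short unit is repeated back to back. The following sketch handles only exact copies; approximate matching, as used by real tools, is not attempted, and the unit-length and copy-number limits are illustrative parameters.

```python
def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (position, unit, copies) for exact tandem repeats.

    At each position the unit length covering the longest stretch wins,
    and scanning resumes after the reported repeat.
    """
    repeats = []
    i = 0
    while i < len(seq):
        best = None
        for unit in range(min_unit, max_unit + 1):
            motif = seq[i:i + unit]
            if len(motif) < unit:
                break
            copies = 1
            while seq[i + copies * unit:i + (copies + 1) * unit] == motif:
                copies += 1
            if copies >= min_copies and (
                best is None or copies * unit > best[1] * best[2]
            ):
                best = (motif, unit, copies)
        if best:
            repeats.append((i, best[0], best[2]))
            i += best[1] * best[2]
        else:
            i += 1
    return repeats


print(find_tandem_repeats("GGGACACACACTTT"))  # [(3, 'AC', 4)]
```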

Page 49: Computational Genome Annotation

3.8 GENOME-SCALE GENE MAPPING

• Genes unique to a genome: 20 to 30% of the genes in a genome are unique.

• Genome rearrangement: the locations of genes in one genome differ from those of their corresponding genes in another genome.

• Quantitative studies of genomes.

Page 50: Computational Genome Annotation

Cont.,

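• Reversal distance: the minimum number of reversals, each turning a segment (a, b) into (b, a), needed to transform a1, a2, . . . , an into b1, b2, . . . , bn, where b1, b2, . . . , bn is a permutation of a1, a2, . . . , an.

• Transposition distance: based on moving a block of genes from one position to another.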
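For very small gene orders, the reversal distance can be found by brute force: a breadth-first search over all orders reachable by reversing one contiguous segment at a time. The sketch below treats genes as unsigned (orientation is ignored), which is an assumption; its running time grows factorially, so it is meant only to illustrate the definition, not to be used on real genomes.

```python
from collections import deque


def reversal_distance(source, target):
    """Minimum number of segment reversals turning source into target.

    Unsigned model (gene orientation ignored). Exhaustive search: suitable
    only for short gene orders.
    """
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("target must be a permutation of source")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        order, dist = queue.popleft()
        if order == target:
            return dist
        for i in range(len(order)):
            for j in range(i + 2, len(order) + 1):
                # Reverse the segment order[i:j].
                nxt = order[:i] + order[i:j][::-1] + order[j:]
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached for valid permutations


print(reversal_distance([1, 2, 3, 4], [1, 3, 2, 4]))  # 1
print(reversal_distance([1, 2, 3], [3, 1, 2]))        # 2
```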

Page 51: Computational Genome Annotation

3.9 EXISTING GENOME ANNOTATION SYSTEMS

• Proteins, transfer RNAs (tRNAs), and phage

sequences

• Proteins are annotated in terms of

I. Physical attributes,

II. Molecular weight,

III. Membrane spanning regions,

IV. Structural domains, or three-dimensional structure.

Page 52: Computational Genome Annotation

Genome Channel

• Modeled Genes: FASTA including sequential positions, methods used for prediction, BLAST hits, etc.

• Functional Assignments of Genes:

I. KEGG pathways,

II. Pfam families,

III. EC classes,

IV. COG groups.

Page 53: Computational Genome Annotation

Cont.,

• Modeled Genes,

• Functional Assignments,

• RNA genes,

• Repeats,

• General Sequence Features.

Page 54: Computational Genome Annotation

Genome Channel

Page 55: Computational Genome Annotation

Multipurpose Automated Genome Project Investigation Environment

• An environment for each annotation.

• MAGPIE - a set of tables containing

genomic features associated with a

particular region of the genome.

Page 56: Computational Genome Annotation

Unique features

Page 57: Computational Genome Annotation

GenDB

• An open source annotation tool for microbial genomes.

Page 58: Computational Genome Annotation

3.10 SUMMARY

• Ab initio and comparative approaches,

• Models for prediction,

• Evaluation,

• Large-scale annotation efforts,

• RNA-coding genes and their prediction,

• Promoters - structure and function of each gene,

• Operons - basic unit of genes,

• Genome-scale gene mapping and pathway analysis.

Page 59: Computational Genome Annotation

THANK YOU

By Prabhakaran