computational genome annotation

Computational Genome Annotation

Chapter 3, Ying Xu

Introduction

• The DNA sequence of a genome encodes the organism’s entire functionality,

• Genome sizes range from millions (microbes) to billions (human) of base pairs,

• What information is encoded in a continuous string of A’s, C’s, G’s and T’s?

• Where is it located?

• What information is directly identifiable?

• How should the directly identified information be presented?

• Two approaches,

1. Ab initio approach,

2. Comparative approach.

• Ab initio -> predicts functional elements from statistical features; can be used to identify novel functional elements,

• Comparative approach -> identifies functional elements by sequence similarity to previously known ones.

3.2 Prediction of Protein-Coding genes

• The single largest set of functional elements in a genome consists of genes,

• 75-90% of a microbial genome consists of gene-coding regions,

• A sequence fragment between two stop codons in the same reading frame is called an open reading frame (ORF),
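The ORF definition above can be turned into a small six-frame scan. This is a minimal sketch, assuming the standard stop codons TAA, TAG and TGA and an input made only of A, C, G and T; the function names and the `min_len` parameter are illustrative, not taken from the chapter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame):
    """Return (start, end) spans lying between consecutive in-frame stop codons."""
    spans = []
    prev_stop_end = None
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            # Only report a span once a stop codon has been seen on its left.
            if prev_stop_end is not None and i > prev_stop_end:
                spans.append((prev_stop_end, i))
            prev_stop_end = i + 3
    return spans


def six_frame_orfs(seq, min_len=0):
    """Scan three frames on each strand; coordinates are relative to the scanned strand."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame):
                if end - start >= min_len:
                    found.append((strand, frame, start, end, s[start:end]))
    return found
```

Stretches that run off either end of the sequence without a bounding stop codon are not reported here; a real gene finder would need a policy for those.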

3.2.1 Evaluation of coding potential

• Ab initio prediction is based on di-codons, or six-mers,

• E.g., the di-codon GACTGC occurs more often in noncoding regions than in coding regions in Shewanella oneidensis,

• There are 4,096 different di-codons (4^6 = 4,096),

For each di-codon X

• Count the total numbers of occurrences of X in coding and noncoding regions.

• Relative frequency (RF) of X in coding regions = number of occurrences of X in coding regions / total number of di-codon occurrences in coding regions

• Estimate the RF of X in noncoding regions in a similar fashion.

Preference model

• Preference value of X = log(FC(X)/FN(X)),

• FC(X): X’s relative frequency in coding regions,

• FN(X): X’s relative frequency in noncoding regions,

• If X has the same RF in both, its preference value is zero.

• A positive value means X has a higher RF in coding than in noncoding regions;

• otherwise, the value is negative.

• Overall preference value of a region = sum of the preference values of all its di-codons.

• Positive overall preference value -> coding region

• Negative overall preference value -> noncoding region.

• GRAIL AND SORFIND,

• HIDDEN MARKOV MODELS,
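The di-codon preference model described above can be sketched in a few lines. Two choices here are assumptions rather than part of the slides: di-codons are read in steps of three bases from the start of each sequence, and a pseudocount is added so that unseen di-codons do not produce log(0). The training sequences are toy strings.

```python
import math
from collections import Counter

N_DICODONS = 4 ** 6  # 4,096 possible six-mers


def dicodons(seq, step=3):
    """Split a sequence into six-mers taken every `step` bases (assumed framing)."""
    return [seq[i:i + 6] for i in range(0, len(seq) - 5, step)]


def relative_frequency(seqs, pseudocount=1.0):
    """Return a function X -> RF(X) estimated from the training sequences."""
    counts = Counter(d for s in seqs for d in dicodons(s))
    total = sum(counts.values()) + pseudocount * N_DICODONS
    return lambda x: (counts[x] + pseudocount) / total


def preference(x, fc, fn):
    """Preference value log(FC(X)/FN(X)) of a single di-codon."""
    return math.log(fc(x) / fn(x))


def region_preference(seq, fc, fn):
    """Overall preference value: sum over the di-codons of the region."""
    return sum(preference(d, fc, fn) for d in dicodons(seq))


# Toy training data, for illustration only.
fc = relative_frequency(["ATGGCTATGGCT"])   # stands in for known coding regions
fn = relative_frequency(["GACTGCGACTGC"])   # stands in for known noncoding regions
```

A region whose `region_preference` is positive would be called coding under this model, and one whose value is negative would be called noncoding.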

Markov Chain Model

• The preference model assumes that consecutive 6-mers (di-codons) are independent,

• A Markov chain model instead captures the dependence relationships among consecutive di-codons,
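As an illustration of modeling dependence, the sketch below uses a simplified stand-in: a fixed-order Markov chain over single bases, where each base is conditioned on the preceding `order` bases. It is not necessarily the di-codon-level model the chapter has in mind. The pseudocount, the default order and the toy training strings are all assumptions.

```python
import math
from collections import defaultdict


def train_markov(seqs, order=2, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to the pseudocount
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, order=2):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


def log_odds(seq, coding_prob, noncoding_prob, order=2):
    """Positive values favor the coding model, negative the noncoding one."""
    return (log_likelihood(seq, coding_prob, order)
            - log_likelihood(seq, noncoding_prob, order))


# Toy training data, for illustration only.
coding_model = train_markov(["ATGATGATGATG"])
noncoding_model = train_markov(["AAAAAAAAAAAA"])
```

Unlike the preference model, the score of each position here depends on what came before it, which is the point of moving to a Markov chain.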

3.2.2 Identification of translation start

• Similar sequence patterns occur around the ATG start codon,

• Predict new translation starts based on previously known ones,

• Weight matrix,

• Flanking DNA sequence

Weight matrix
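A weight matrix for translation starts can be sketched as a position-specific log-odds table built from aligned windows around known starts. The example windows below are made up for illustration, and the uniform background of 0.25 and the pseudocount are assumptions. Sequences are assumed to contain only A, C, G and T.

```python
import math


def build_weight_matrix(sites, alphabet="ACGT", pseudocount=1.0, background=0.25):
    """Build per-position log2-odds weights from equal-length aligned sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(width):
        column = [s[pos] for s in sites]
        matrix.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in alphabet
        })
    return matrix


def score(window, matrix):
    """Sum the weights of the bases observed at each position of the window."""
    return sum(col[base] for col, base in zip(matrix, window))


def best_start(seq, matrix):
    """Return the offset of the highest-scoring window in seq."""
    width = len(matrix)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(seq[i:i + width], matrix))


# Hypothetical windows ending in ATG, used only to exercise the code.
known_sites = ["AGGATG", "AGGATG", "TGGATG", "AGAATG"]
matrix = build_weight_matrix(known_sites)
```

Calling `best_start("CCCCAGGATGCCCC", matrix)` picks the window that best matches the training sites; in practice the window would also cover flanking sequence on both sides of the ATG.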

3.2.3 Ab initio Gene Prediction through Information Fusion

• Identify all ORFs in six reading frames,

• Measure the coding potential of each,

• A gene start has a high translation-start score, and the whole region has high coding potential,

• There is strong coding potential on the right of the start and low coding potential on the left.

Gene Length Distribution

• The length distribution of all known genes is not uniform.

• It can be modeled by an exponential distribution or a gamma distribution.

• It is asymmetric, with a heavy tail on the right side.

G+C Composition

• Regions with different G+C compositions have different di-codon frequencies,

• Using one set of di-codon RFs can lead to incorrect predictions.

• Different di-codon frequency tables.

• Normalization factor.

Regions of Repeats

• Repeats generally do not overlap with any genes,

• Reliable repeat-prediction software programs exist,

• These regions are masked out before running a gene-finding program.

Neural Networks

• A non-gene is a region in an ORF that does not overlap any coding regions,

• Set A contains only genes and set B contains only non-genes,

• Examine the common features of sets A and B,

• Set A consists of a list of vectors (C1, C2, T, G, L, 1), one for each gene,

• Set B consists of a list of vectors (C1, C2, T, G, L, 0), one for each non-gene,

• The final 1 or 0 labels the set: one set consists of all genes and the other of all non-genes.

Back-propagation

• One or two hidden layers should suffice.

• Nodes are connected by edges.

• Training adjusts the edge weights.

• GRAIL uses a neural network as its main prediction framework.
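To make the back-propagation step concrete, here is a minimal pure-Python sketch of a network with one hidden layer trained on labeled six-element vectors shaped like the (C1, C2, T, G, L, label) rows above. It is not GRAIL's network: the layer size, learning rate, cross-entropy loss and all feature values are assumptions chosen only so the example runs.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNet:
    """One hidden layer and one sigmoid output node, trained by back-propagation."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight list carries one extra entry for a bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        inputs = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w1]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden + [1.0])))
        return hidden, out

    def train_step(self, x, y, lr):
        inputs = list(x) + [1.0]
        hidden, out = self.forward(x)
        # Error signals are computed before any edge weight is changed.
        delta_out = out - y  # gradient for cross-entropy loss with sigmoid output
        delta_hidden = [delta_out * self.w2[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden + [1.0]):
            self.w2[j] -= lr * delta_out * h
        for j, row in enumerate(self.w1):
            for i, v in enumerate(inputs):
                row[i] -= lr * delta_hidden[j] * v


def mean_loss(net, rows):
    """Average cross-entropy over rows whose last element is the 0/1 label."""
    total = 0.0
    for *x, y in rows:
        _, out = net.forward(x)
        total -= y * math.log(out) + (1 - y) * math.log(1.0 - out)
    return total / len(rows)


def train(net, rows, epochs=500, lr=0.5):
    for _ in range(epochs):
        for *x, y in rows:
            net.train_step(x, y, lr)


# Hypothetical, rescaled feature vectors; the last element is the label.
set_a = [(0.9, 0.8, 0.7, 0.5, 0.6, 1), (0.8, 0.9, 0.9, 0.6, 0.4, 1),
         (0.7, 0.8, 0.6, 0.4, 0.8, 1)]
set_b = [(0.1, 0.2, 0.3, 0.5, 0.1, 0), (0.2, 0.1, 0.2, 0.6, 0.2, 0),
         (0.3, 0.2, 0.1, 0.4, 0.1, 0)]
rows = set_a + set_b

net = TinyNet(n_in=5, n_hidden=3)
initial_loss = mean_loss(net, rows)
train(net, rows)
final_loss = mean_loss(net, rows)
```

The output of `net.forward(features)[1]` is a score between 0 and 1 that can be read as how gene-like the feature vector looks to the trained network.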

[Figure: neural network with input nodes, a hidden layer, and an output node]

Web Servers for Genome Annotation

3.2.4 Gene Identification through comparative analysis

• Genes can be identified through high sequence similarity to known genes,

• BLAST

• First use the comparative approach to find a subset of genes,

• Then use an ab initio method to find the rest of the genes in the genome.

• EST-based gene predictions

Identifying Conserved Regions across Multiple Genomes

• Conserved (long) regions across multiple genomes can be found with:

(a) megaBLAST: very long sequence comparisons; the sequences to be aligned are closely related.

(b) SENSEI: first finds short (size of 8) ungapped sequence matches, then extends them into longer gapped alignments.

(c) MUMmer: uses a suffix tree data structure to speed up computation and reduce the memory requirement.
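The idea of finding short exact matches first and then extending them can be sketched as follows. This is a generic seed-and-extend illustration using a hash index of k-mers and ungapped extension only; it is not the actual implementation of megaBLAST, SENSEI or MUMmer, and the two sequences are toy strings.

```python
from collections import defaultdict


def find_seeds(a, b, k=8):
    """Return (i, j) pairs where a[i:i+k] == b[j:j+k], ordered by position in b."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)
    return [(i, j)
            for j in range(len(b) - k + 1)
            for i in index.get(b[j:j + k], [])]


def extend_ungapped(a, b, i, j, k):
    """Grow an exact seed left and right while the flanking bases still match."""
    start_a, start_b = i, j
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = i + k, j + k
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return start_a, start_b, end_a - start_a  # start in a, start in b, match length


# Toy sequences sharing the 12-base block ACGTACGGATCC.
genome_a = "TTTTACGTACGGATCCAAAA"
genome_b = "GGGGACGTACGGATCCGGGG"
seeds = find_seeds(genome_a, genome_b, k=8)
```

A full aligner would go on to chain neighboring seeds and fill the gaps between them with a gapped alignment, which this sketch leaves out.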

PatternHunter

• Uses non-contiguous sequence matches.

• Requires much less time and memory than BLAST.

• DIALIGN - predicts genes through genome-scale sequence comparison

[Figure: genes predicted by comparing Genome A with Genome B]

3.2.5 Interpretation of Gene Prediction

• GRAIL labels predictions with marginal, intermediate, or strong descriptors,

• All predictions are divided into bins based on their prediction scores.

• Genes with scores between 0 and 0.1 are put into the first bin,

• Genes with scores between 0.1 and 0.2 are put into the second bin, etc.
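The binning scheme above is simple to express in code. The sketch below assumes scores lie in [0, 1] and that a score exactly on a boundary goes to the higher bin, with 1.0 placed in the last bin; the slides do not specify boundary handling, and the scores used are made up.

```python
def bin_predictions(scores, n_bins=10):
    """Group prediction scores in [0, 1] into n_bins equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
        index = min(int(s * n_bins), n_bins - 1)  # 1.0 goes into the last bin
        bins[index].append(s)
    return bins


def above_threshold(scores, threshold):
    """Keep only predictions whose score reaches the chosen threshold."""
    return [s for s in scores if s >= threshold]


# Hypothetical scores, for illustration only.
example_scores = [0.05, 0.15, 0.95, 1.0, 0.12]
example_bins = bin_predictions(example_scores)
```

The same list can then be filtered with a high threshold when only the most trustworthy predictions are wanted, or a low one for broad screening.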

Cont.,

• Different reliability thresholds are applied for different purposes.

• Gene validation: use a high reliability threshold,

• General screening: use a low reliability threshold.

Pseudogenes

• Pseudogenes contain frameshifts due to deletions/insertions,

• They are hard for a regular gene-prediction program to detect.

• A specialized coding-region detection program is needed,

• Mycobacterium leprae has 1,100 predicted pseudogenes

3.3 PREDICTION OF RNA-CODING GENES

• tRNA (transfer RNA), rRNA (ribosomal RNA), sRNA (small RNA), srpRNA (signal recognition particle RNA), etc.

• RNAs act as catalysts and information-storage molecules.

• tRNAs are adapter molecules that decode the genetic code.

• rRNAs catalyze the synthesis of proteins.

Cont.,

• (1) RNA signals are a combination of sequence and structure motifs.

• For example, programs for tRNA genes are designed to recognize particular types of RNA genes.

Cont.,

• (2) Secondary structures within the folded tertiary structure,

• Stems provide signals for RNA gene recognition,

• tRNAscan-SE,

• Accuracy greater than 99%,

• False positive rate of one false prediction per 15 gigabases.

[Figure: RNA secondary structure (loops and stem) and tertiary structure]

3.4 IDENTIFICATION OF PROMOTERS

• Coding regions and Regulatory regions,

• mRNA transcription,

• The transcription process is initiated by RNA polymerase.

3.4.1 Promoter Prediction through Feature Recognition

• Hidden Markov model (HMM) - a statistical tool,

• Under the model, promoter sequences have higher probabilities than nonpromoter sequences.

• Conserved sequence fragments and their spacing relationships.

Sequences recognized by the sigma-54 factor

CONSENSUS

• Finds conserved k-mers,

• Determines whether the current sequence contains any k-mers that are similar to any k-mers of the previous sequences,

• Builds a consensus matrix.

MEME

• Maximum-likelihood estimation of the conserved k-mers using the EM algorithm

• Signal Scan and NNPP

• Promoter-gene structure or the more general structure of promoter-gene-gene- . . . -gene

3.5 OPERON IDENTIFICATION

• A basic organizational unit of genes,

• Transcriptional regulation.

• Genes in an operon are arranged in tandem and controlled by shared regulatory binding motifs

Computational identification of an operon

(1) A predicted promoter region and a terminator,

(2) Set of genes arranged in tandem on the same strand,

(3) Functional information of the genes involved.

• Identify transcriptional regulatory networks

Terminator Identification

• rho-dependent and rho-independent,

Three nucleic acid binding sites :

• A double-stranded DNA binding site,

• An RNA–DNA hybrid binding site,

• A single-stranded RNA binding site.

TransTerm

• Finds rho-independent transcription terminators (bacterial genomes).

• Genes in an operon often encode enzymes that catalyze successive reactions in metabolic pathways,

• http://genomics4.bu.edu/operons/

Cont.,

• lac operon.

• trp operon: biosynthesis of tryptophan

• mhp operon: phenylpropionate catabolic pathway

• Using these known operons, one can derive:

(1) Intergenic distance within an operon vs. between operons,

(2) Distribution of the number of genes per operon
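The two criteria of tandem arrangement on the same strand and short intergenic distance suggest a simple grouping heuristic, sketched below. The 50 bp cutoff is a placeholder, not a value from the chapter; in practice it would be derived from the distance distributions observed in known operons. Gene names and coordinates are hypothetical.

```python
def predict_operons(genes, max_gap=50):
    """Group adjacent same-strand genes whose intergenic gap is at most max_gap.

    genes: iterable of (name, start, end, strand) tuples.
    Returns a list of operons, each a list of gene names in genomic order.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g[1]):
        if operons:
            previous = operons[-1][-1]
            same_strand = gene[3] == previous[3]
            gap = gene[1] - previous[2]  # intergenic distance to the previous gene
            if same_strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]


# Hypothetical gene coordinates, for illustration only.
example_genes = [
    ("g1", 100, 1000, "+"),
    ("g2", 1020, 2000, "+"),
    ("g3", 2030, 2900, "+"),
    ("g4", 3500, 4200, "+"),
    ("g5", 4210, 5000, "-"),
]
```

With these inputs, `predict_operons(example_genes)` groups the first three genes together and leaves the other two as singletons, because of a large gap and a strand change respectively. Promoter, terminator and functional evidence would be layered on top of this in a fuller predictor.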

3.6 FUNCTIONAL CATEGORIES OF GENES

• EC classes for enzymes,

• Assigned in an ad hoc way,

• If the “metabolism” or “pathway” of a gene is known, the gene is labeled with that functional category.

Gene group of Methanosarcina barkeri

Functional assignments of genes in the “cell motility” pathway

3.7 CHARACTERIZATION OF OTHER FEATURES IN A GENOME

• G + C Composition: Correlates with density

of genes,

• In a genome, higher G + C compositions

imply higher gene densities.

CpG Islands

• Regions of DNA with a higher frequency of CpG dinucleotides.

• Often found near the transcriptional starts of genes.

• A commonly used threshold is 0.6.

• For the human genome, the threshold is 0.8,
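One common way to quantify CpG enrichment is the observed/expected CpG ratio. The sketch below assumes the thresholds above refer to that ratio, which the slides do not state explicitly, and uses the usual definition: number of CpG dinucleotides times sequence length, divided by the product of the C count and the G count.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio is undefined without both bases; report zero
    return seq.count("CG") * len(seq) / (c_count * g_count)


def looks_like_cpg_island(seq, threshold=0.6):
    """Apply a ratio threshold; 0.6 is assumed here to be an obs/exp cutoff."""
    return cpg_obs_exp(seq) >= threshold
```

Real CpG island finders apply this test over sliding windows and usually add minimum length and G+C content criteria, which are omitted here.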

Genomic Repeats

• Found in both prokaryotic and eukaryotic genomes.

• Transposons - mobile elements that can move around a genome.

• Genome annotation process.

• Gene density: Number of genes per fixed

length of genomic sequence.

Cont.,

• (a) Tandem Repeat Identification: exact and approximate string matching.

• (b) RepeatMasker: matches all the repeat sequences in its database against the DNA sequence.

• (c) RepeatFinder: either exact or approximate match, using a clustering technique.
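Exact tandem repeat identification can be illustrated with a direct scan. This is a minimal sketch of exact string matching only and does not reproduce any of the tools named above. The unit-length range and minimum copy number are assumed parameters.

```python
def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats, scanning left to right."""
    repeats = []
    i = 0
    while i < len(seq):
        best = None  # (unit_length, copies) covering the longest span from i
        for unit_len in range(min_unit, max_unit + 1):
            unit = seq[i:i + unit_len]
            if len(unit) < unit_len:
                break  # not enough sequence left for this unit length
            copies = 1
            while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            if copies >= min_copies:
                if best is None or unit_len * copies > best[0] * best[1]:
                    best = (unit_len, copies)
        if best is None:
            i += 1
        else:
            unit_len, copies = best
            repeats.append((i, seq[i:i + unit_len], copies))
            i += unit_len * copies  # skip past the reported repeat
    return repeats
```

Approximate repeats, where copies differ by a few substitutions or indels, need an alignment-based comparison instead of the exact equality test used here.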

3.8 GENOME-SCALE GENE MAPPING

• Genes Unique to a Genome: 20 to 30% of

genes in a genome are unique.

• Genome Rearrangement: a gene’s location differs from that of its corresponding gene in another genome.

• Quantitative studies of genome rearrangement.

Cont.,

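• Reversal Distance: a reversal turns the order (a, b) into (b, a); the distance is the minimum number of reversals needed to turn a1, a2, . . . , an into b1, b2, . . . , bn, where b1, b2, . . . , bn is a permutation of a1, a2, . . . , an.

• Transposition Distance: based on moving a block of genes from one position to another.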
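For very small gene orders, the reversal distance can be computed exactly by breadth-first search over permutations, as sketched below. This brute-force approach treats genes as unsigned (orientation is ignored) and is only an illustration: the search space grows factorially, so it is not how genome-scale tools work.

```python
from collections import deque


def reversal_distance(source, target):
    """Minimum number of segment reversals turning source into target (unsigned)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("target must be a permutation of source")
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        order, distance = queue.popleft()
        if order == target:
            return distance
        n = len(order)
        for i in range(n):
            for j in range(i + 2, n + 1):  # reverse segments of length >= 2
                candidate = order[:i] + order[i:j][::-1] + order[j:]
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append((candidate, distance + 1))
    return None  # unreachable for valid permutations
```

The transposition distance could be explored the same way by replacing the reversal move with a block move, at a similar computational cost.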

3.9 EXISTING GENOME ANNOTATION SYSTEMS

• Proteins, transfer RNAs (tRNAs), and phage

sequences

• Proteins are annotated in terms of

I. Physical attributes,

II. Molecular weight,

III. Membrane spanning regions,

IV. Structural domains, or three-dimensional structure.

Genome Channel

• Modeled Genes: FASTA including sequential positions, methods used for prediction, BLAST hits, etc.

• Functional Assignments of Genes:

I. KEGG pathways,

II. Pfam families,

III. EC classes,

IV. COG groups.

Cont.,

• Modeled Genes,

• Functional Assignments,

• RNA genes,

• Repeats,

• General Sequence Features.

Genome Channel

Multipurpose Automated Genome Project Investigation Environment (MAGPIE)

• An environment for each annotation.

• MAGPIE - a set of tables containing genomic features associated with a particular region of the genome.

Unique features

GenDB

• An open source annotation tool for microbial genomes,

3.10 SUMMARY

• Ab initio and comparative approaches,

• Models for prediction,

• Evaluation,

• Large-scale annotation efforts,

• RNA-coding genes and their prediction,

• Promoter – structure and function of each gene

• Operon – basic unit of genes,

• Genome-Scale gene mapping and pathway analysis

THANK YOU

By Prabhakaran
