Page 1: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Promoter Prediction in E.coli using ANN

A.Krishnamachari

Bioinformatics Centre, JNU

chari@mail.jnu.ac.in

Page 2: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Definition of Bioinformatics

• Systematic development and application of computing and computational solution techniques to biological data, to investigate biological processes and make novel observations

Page 3: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Research in Biology

Organism, Functions, Cell, Chromosome, DNA, Sequences

General approach

Bioinformatics era

Page 4: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

[Figure: schematic of a gene region with the labelled elements below]

TF -> Transcription Factor Sites

TSS -> Transcription Start Sites

RBS -> Ribosome Binding Sites

CDS -> Coding Sequence (or) Gene

Other labels in the figure: intergenic, -10, -35


Statement of the problem

• Given a set of known sequences pertaining to a specific biological feature, develop a computational method to search for new members or sequences

Page 8: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Computational Methods

• Pattern Recognition

• Pattern classification

• Optimisation Methods

Page 9: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Sequence Analysis and Prediction Methods

• Consensus

• Position Weight Matrix (or) Profiles

• Machine Learning Methods

– Neural Networks

– Markov Models

– Support Vector Machines

– Decision Tree

– Optimization Methods

Statistical Learning

Page 10: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Consensus sequence: TATAAT

49%, 54% and 58%

14 sites out of 291 sequences [Lisser and Margalit]

Mismatches but which one?
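The consensus search on this slide can be sketched as a sliding-window comparison that tolerates a fixed number of mismatches. This is only an illustration: the function names, the example sequence and the mismatch limit are assumptions, not part of the original talk. Note that the count says how many positions differ but not which ones, which is the limitation the slide points at.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_consensus_sites(sequence, consensus="TATAAT", max_mismatches=1):
    """Return (start, window, mismatches) for every window close to the consensus."""
    k = len(consensus)
    hits = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        d = hamming(window, consensus)
        if d <= max_mismatches:
            hits.append((start, window, d))
    return hits


# Hypothetical test sequence containing one exact and one near match.
hits = find_consensus_sites("GGTATAATCCTATGATGG", max_mismatches=1)
for start, window, d in hits:
    print(start, window, d)
```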


Describing features using frequency matrices


• Goal: Describe a sequence feature (or motif) more quantitatively than possible using consensus sequences

• Need to describe how often particular bases are found in particular positions in a sequence feature

Page 15: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Describing features using frequency matrices


• Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature
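A minimal sketch of this definition for DNA (n = 4 characters) is given below. The aligned example sites and the function name `frequency_matrix` are hypothetical and chosen only for illustration.

```python
ALPHABET = "ACGT"


def frequency_matrix(aligned, alphabet=ALPHABET):
    """Build an n-by-m frequency matrix from aligned sequences of equal length m.

    The result maps each character to a list of m frequencies, one per column.
    """
    m = len(aligned[0])
    if any(len(s) != m for s in aligned):
        raise ValueError("all aligned sequences must have the same length")
    counts = {ch: [0] * m for ch in alphabet}
    for seq in aligned:
        for i, ch in enumerate(seq):
            counts[ch][i] += 1
    n_seqs = len(aligned)
    return {ch: [c / n_seqs for c in row] for ch, row in counts.items()}


# Hypothetical aligned sites resembling the -10 element.
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
freq = frequency_matrix(sites)
print(freq["T"])
```

Each column of the matrix sums to 1, since every aligned sequence contributes exactly one character per position.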

Page 16: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Frequency matrices (continued)

• Three uses of frequency matrices

– Describe a sequence feature

– Calculate probability of occurrence of feature in a random sequence

– Calculate degree of match between a new sequence and a feature

Page 17: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Frequency Matrices, PSSMs, and Profiles


• A frequency matrix can be converted to a Position-Specific Scoring Matrix (PSSM) by converting frequencies to scores

• PSSMs are also called Position Weight Matrices (PWMs) or Profiles

Page 18: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Methods for converting frequency matrices to PSSMs

• Using log ratio of observed to expected

$\mathrm{score}(i) = \log \dfrac{m(j,i)}{f(j)}$

where m(j,i) is the frequency of character j observed at position i and f(j) is the overall frequency of character j (usually in some large set of sequences)

• Using amino acid substitution matrix (Dayhoff similarity matrix) [see later]
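The log-ratio conversion can be sketched as follows. Two choices here are assumptions and not taken from the slides: the natural logarithm is used, and a small pseudocount smooths the observed frequencies so that a frequency of zero does not produce log(0). The names `log_odds_pssm`, `pseudo` and the toy matrix are hypothetical.

```python
import math


def log_odds_pssm(freq, background, pseudo=0.01):
    """Convert a frequency matrix into log(observed / expected) scores.

    freq maps each character j to its per-position frequencies m(j, i);
    background maps each character j to its overall frequency f(j).
    """
    n_chars = len(freq)
    pssm = {}
    for ch, row in freq.items():
        scores = []
        for m in row:
            # Assumption: pseudocount smoothing, renormalised so columns still sum to 1.
            smoothed = (m + pseudo) / (1.0 + pseudo * n_chars)
            scores.append(math.log(smoothed / background[ch]))
        pssm[ch] = scores
    return pssm


# Toy two-column matrix: position 0 is always A, position 1 is uniform.
toy_freq = {"A": [1.0, 0.25], "C": [0.0, 0.25], "G": [0.0, 0.25], "T": [0.0, 0.25]}
uniform = {ch: 0.25 for ch in "ACGT"}
toy_pssm = log_odds_pssm(toy_freq, uniform)
```

A character seen more often than expected gets a positive score, one seen less often gets a negative score, and a character at its background frequency scores about zero.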

Page 19: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Finding occurrences of a sequence feature using a Profile

• As with finding occurrences of a consensus sequence, we consider all positions in the target sequence as candidate matches

• For each position, we calculate a score by "looking up" the value corresponding to the base at that position


Positions (columns in alignment)

Nucleotide   1     2     3     4     5
A            x11   x21   x31   x41   x51
T            x12   x22   x32   x42   x52
G            x13   x23   x33   x43   x53
C            x14   x24   x34   x44   x54

Example sequence: TAGCT AGTGC

Score of the window TAGCT = x12 + x21 + x33 + x44 + x52

If the score is above a threshold, the window is a site
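The lookup-and-sum scoring shown above can be sketched as a scan over every window of the target sequence. The matrix values and the threshold below are invented for illustration and do not come from the talk; only the sequence TAGCTAGTGC is taken from the slide.

```python
def score_window(pssm, window):
    """Sum the matrix entries selected by the base found at each position."""
    return sum(pssm[base][i] for i, base in enumerate(window))


def scan(pssm, sequence, threshold):
    """Return (start, score) for every window whose score exceeds the threshold."""
    width = len(next(iter(pssm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score_window(pssm, sequence[start:start + width])
        if s > threshold:
            hits.append((start, s))
    return hits


# Hypothetical 5-column matrix that favours the window TAGCT.
toy = {
    "A": [-1, 2, -1, -1, -1],
    "T": [2, -1, -1, -1, 2],
    "G": [-1, -1, 2, -1, -1],
    "C": [-1, -1, -1, 2, -1],
}
print(score_window(toy, "TAGCT"))
print(scan(toy, "TAGCTAGTGC", threshold=5))
```

The choice of threshold controls the trade-off between missed sites and false hits; the value 5 here is arbitrary.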

Page 22: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Building a PSSM

[Diagram: Set of Aligned Sequence Features + Expected frequencies of each sequence element → PSSM builder → PSSM]

Page 23: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Searching for sequences related to a family with a PSSM

[Diagram: Set of Aligned Sequence Features + Expected frequencies of each sequence element → PSSM builder → PSSM. Then PSSM + Set of Sequences to search + Threshold → PSSM search → Sequences that match above threshold, with Positions and scores of matches]

Page 24: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Consensus sequences vs. frequency matrices

• Consensus sequence or frequency matrix: which one to use?

– If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence

• Example: Restriction enzyme recognition sites

– If some allowed characters are "better" than others, use a frequency matrix

• Example: Promoter sequences

Page 25: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Consensus sequences vs. frequency matrices

• Advantages of consensus sequences: smaller description, quicker comparison

• Disadvantage: lose quantitative information on preferences at certain locations

Page 26: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Linear Classification Problems

[Figure: plot with axes Measure 1 and Measure 2]

Page 27: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Nonlinear Classification Problem

[Figure: plot with axes Measure 1 and Measure 2]


(Artificial) Neural Network

Page 31: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

What Is A Neural Network?

A computational construct based on the biological neuron. A neural network can:

• learn by adapting its synaptic weights to changes in the surrounding environment;

• handle imprecise, fuzzy, noisy, and probabilistic information; and

• generalize from known tasks or examples to unknown ones.

Artificial neural networks (ANNs) attempt to mimic some, or all, of these characteristics.

Page 32: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Neural Network

• Characterised by:

- its pattern of connections between the neurons (Network Architecture)

- its method of determining the weights on the connections (training or learning algorithm)

Page 33: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Why Neural Network: Applications

- Little or incomplete understanding of the problem to be solved (very little theory)

- Abundant data available

Page 34: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Neural Networks: Applications

• Pattern classification

• Speech synthesis and recognition

• Image compression

• Clustering

• Medical Diagnosis

• Manufacturing

Page 35: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Neural Network: Bioinformatics

• Binding sites prediction

• Protein Secondary Structure prediction

• Protein folds

• Microarray data clustering

• Gene prediction

Page 36: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Neural Networks

• Supervised Learning

• Unsupervised Learning

Page 37: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

[Figure: two-layer network. Input nodes 1 and 2 in Layer 1 (input) connect to output node 3 in Layer 2 (output) through weights W1,3 and W2,3. Direction of information flow: inputs to output]

Page 38: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

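Summation Operation

$\sum_i x_i w_{ij} = x_1 w_{1j} + x_2 w_{2j} + x_3 w_{3j} + \dots + x_n w_{nj}$

Thresholding functions

Output = 0 if $\sum_i x_i w_{ij} < T$

Output = 1 if $\sum_i x_i w_{ij} > T$

[Figure: step-function plots of Output (0 to 1) against the summed input, one with Threshold = 0 and one with threshold T]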
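A threshold unit of this kind can be sketched in a few lines. The slide does not say what happens when the summed input equals T exactly; returning 0 in that case is an assumption made here. The weights and threshold in the example are hypothetical and happen to implement a logical AND of two binary inputs.

```python
def net_input(x, w):
    """Weighted sum of inputs: x1*w1 + x2*w2 + ... + xn*wn."""
    return sum(xi * wi for xi, wi in zip(x, w))


def threshold_unit(x, w, T=0.0):
    """Output 1 if the weighted sum exceeds T, otherwise 0 (assumed for the tie case)."""
    return 1 if net_input(x, w) > T else 0


# Hypothetical weights: both inputs must be on for the sum to exceed T = 1.0.
weights = [0.6, 0.6]
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, threshold_unit(x, weights, T=1.0))
```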

Page 39: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

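[Figure: Output plotted against Input]

Logistic Transfer function

$\text{Output} = \dfrac{1}{1 + e^{-x}}$, where x is the summed input to the unit

Weight updates

$w(k+1) = w(k) + \mu\,[T(k) - w(k)\,x(k)]\,x(k)$ for $0 \le k \le N-1$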
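The logistic function and the weight-update rule can be sketched as below. The update is read here as an error-correcting step: the target T(k) minus the current linear output w(k)x(k), scaled by the learning rate µ and the input. The training samples, the learning rate and the number of passes are hypothetical values chosen so the example converges quickly.

```python
import math


def logistic(net):
    """Logistic transfer function 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + math.exp(-net))


def update_weights(w, x, target, mu):
    """One step of w(k+1) = w(k) + mu * [T(k) - w(k).x(k)] * x(k)."""
    output = sum(wi * xi for wi, xi in zip(w, x))
    error = target - output
    return [wi + mu * error * xi for wi, xi in zip(w, x)]


# Hypothetical training data: (input vector, target output).
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]
w = [0.0, 0.0]
for _ in range(50):
    for x, t in samples:
        w = update_weights(w, x, t, mu=0.5)
print(w)
```

A small µ gives slow but stable learning; a large µ can overshoot, so in practice the value is tuned for the data at hand.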

Page 40: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Learning Concepts

• Generally

– The target output is 1 for +ve

– The target output is 0 for –ve

– But practically a (0.9, 0.1) combination is used

• Stopping criterion

– Based on a certain number of epochs or cycles

– Based on certain error estimates

Page 41: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

[Figure: input grid. Rows are the positions 1, 2, ..., K-1, K in a sequence of K nucleotides; columns are the nucleic acids A, T, G, C]

Page 42: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Let the following binary values represent each base:

A = 0001
C = 0010
G = 0100
T = 1000

Then:

G = 4
A or C = 0011 = 3
A, G or T = 1101 = 13
etc.
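The coding above maps naturally onto bit flags, with ambiguous positions obtained by OR-ing the bases together. The sketch below reproduces the three worked values from the slide; the function names `encode_set` and `encode_sequence` are hypothetical.

```python
# One bit per base, as listed on the slide.
CODE = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}


def encode_set(bases):
    """Combine one or more allowed bases into a single 4-bit value."""
    value = 0
    for base in bases:
        value |= CODE[base]
    return value


def encode_sequence(seq):
    """Flatten a sequence into a list of bits, four per nucleotide, for network input."""
    bits = []
    for base in seq:
        bits.extend(int(c) for c in format(CODE[base], "04b"))
    return bits


print(encode_set("G"), encode_set("AC"), encode_set("AGT"))
print(encode_sequence("AGT"))
```

A sequence of K nucleotides therefore becomes an input vector of 4K bits.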

Page 43: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

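[Figure: binary input coding for a sequence of K nucleotides. Column labels on the slide: A T G C (Nucleic Acid). Rows of four bits as shown: 0 0 1 0; 0 1 0 0; 0 1 0 0; 1 0 0 0. Position labels: 1 A, 2 G, ..., K-1 G, K T. Each input bit connects to the network through a weight Wi,j]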

Page 44: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

P=50

N=50

TP + TN =100

Note: Training and Test sequences are fixed length

Page 45: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

N=500

n=10

1 2 ………… 50    1 2 ………… 50

Page 46: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

TRAINING

C G T A G C T A T A G T G G G
T T T A A A C C C A A G A A T
T A T G G A A T T T G G A A G
T T T A G G A T A G C A C A G
G A T A A G G C C T A G A T A
T T T A T G C A T G A G A T G

TEST

C C T G A A C T G A G A T G A T A T A T A A G T G A A A T T C C G

[Diagram: Input → Prediction Method → Output]

Page 47: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

[Figure: multilayer network with an Input Layer (including a node labelled -1), a Hidden Layer and an Output Layer]


[Figure: Error function plotted against weight, with points A (local) and B (global)]

Page 52: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Example

Disease A diagnosis:

x – gene expression data (vector of numbers)

f(x) – A positive / A negative (boolean 0/1)

<x,f(x)> – set of known values

[Diagram: Learning: <x,f(x)> → NN. Recognition: x → NN → h(x)]

Page 53: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Evaluation Mechanism

• Sensitivity = TP / (TP + FN)

• Specificity = TN / (TN + FP)

• Correlation coefficient:

$C = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(FP+TN)(TN+FN)(FN+TP)}}$
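These measures can be computed directly from the four counts. The sketch below uses the square-root form of the correlation coefficient (the Matthews correlation coefficient), which is assumed to be the formula intended on the slide. The counts in the example are hypothetical, and returning 0 when the denominator is zero is a convention adopted here.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true sites that were predicted as sites."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-sites that were predicted as non-sites."""
    return tn / (tn + fp)


def correlation_coefficient(tp, tn, fp, fn):
    """(TP*TN - FP*FN) / sqrt((TP+FP)(FP+TN)(TN+FN)(FN+TP))."""
    denom = math.sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical test outcome: 50 positives and 50 negatives.
tp, fn, tn, fp = 45, 5, 40, 10
print(sensitivity(tp, fn), specificity(tn, fp))
print(round(correlation_coefficient(tp, tn, fp, fn), 4))
```

C ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction right).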

Page 54: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Cross - Validation

• Benchmarking the Network Performance

• Step 1: Divide the training set into "N" partitions

• Step 2: Train on "N-1" partitions and test on the left-out partition

• Step 3: Evaluate the performance
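The three steps can be sketched as a loop in which each partition is left out once. The `train` and `evaluate` callables below are hypothetical stand-ins that only report set sizes; in practice they would fit the network and compute measures such as sensitivity and specificity.

```python
def partitions(items, n):
    """Step 1: split the items into n partitions of near-equal size."""
    return [items[i::n] for i in range(n)]


def cross_validate(items, n, train, evaluate):
    """Steps 2 and 3: train on n-1 partitions, test on the left-out one, for each partition."""
    folds = partitions(items, n)
    scores = []
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return scores


# Hypothetical stand-ins for a real trainer and scorer.
def train(examples):
    return len(examples)


def evaluate(model, held_out):
    return (model, len(held_out))


scores = cross_validate(list(range(20)), 5, train, evaluate)
print(scores)
```

With real data the items are usually shuffled before partitioning, so that each partition contains a mix of positive and negative examples.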

Page 55: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

1. Nucleic Acids Res. 1992 Aug 25;20(16):4331-8.

An assessment of neural network and statistical approaches for prediction of E. coli promoter sites.

Horton PB, Kanehisa M.

2. Nucleic Acids Res. 1994 Jun 11;22(11):2158-65.

Analysis of E.coli promoter structures using neural networks.

Mahadevan I, Ghosh I.

Page 56: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

O'Neill MC. Training back-propagation neural networks to define and detect DNA-binding sites. Nucleic Acids Res. 1991 Jan 25;19(2):313-8.

O'Neill MC. Escherichia coli promoters: neural networks develop distinct descriptions in learning to search for promoters of different spacing classes. Nucleic Acids Res. 1992 Jul 11;20(13):3471-7.

O'Neill MC. A general procedure for locating and analyzing protein-binding sequence motifs in nucleic acids. Proc Natl Acad Sci U S A. 1998 Sep 1;95(18):10710-5.

Pedersen AG, Engelbrecht. Investigations of Escherichia coli promoter sequences with artificial neural networks: new signals discovered upstream of the transcriptional startpoint. Proc Int Conf Intell Syst Mol Biol. 1995;3:292-9.

Page 57: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Basic Gene Grammars and DNA-ChartParser for language processing of Escherichia coli promoter DNA sequences

Siu-wai Leung, Chris Mellish and Dave Robertson

R Hershberg, G Bejerano, A Santos-Zavaleta and H Margalit

PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites.

Nucleic Acids Research 2001, 29: 277

E. coli PROMOTER DATA

Page 58: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Earlier Studies

• Data set corresponding to the 17 spacing class

• Very high threshold values

Page 59: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Network Model | Performance | pBR322 (CW) | pBR322 (CCW)
17-1 | 34/35 | -5, 339, 477, 1584, 1657, 1970, 4130 | 125, 805, 1021, 1226, 4278
17-2 | 33/35 | …. | ….
17-3 | 34/35 | |
17-4 | 32/35 | |
Poll (4/4) | | -5, 1584, 1970, 4130 | 125, 807, 1226, 4278


Page 62: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Disadvantages of MLP

• i) Slow convergence,

• ii) Training relies heavily on the choice of the number of hidden layers, and

• iii) Mode of prediction is generally based on high threshold values.

Page 63: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Improvements in Prediction

• Preprocessing of the data based on DNA structure

• Clean Model

Page 64: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

Structural Atlas of E.coli

J Mol Biol. 2000 Jun 16;299(4):907-30.

A DNA structural atlas for Escherichia coli.

Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW.

Page 65: Promoter Prediction in E.coli using ANN A.Krishnamachari Bioinformatics Centre, JNU chari@mail.jnu.ac.in

1) The size of the positive data set may be increased by incorporating point mutations in non-sensitive positions.

2) Negative data sets are generated in several ways:

a) Shuffling or randomising the positive data set. This does not completely destroy the correlations between the bases.

b) Using random sequences with a biased composition.

c) Extracting sequences from gene coding segments.

3) For the learning phase, the numbers of positive and negative input vectors are generally not proportionate, and there is no standard prescription.

4) Convergence factor and predictive ability depend on the size and the number of input vectors.
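Options (a) and (b) for building a negative data set can be sketched as follows. The function names, the example sequence and the AT-rich composition are hypothetical; a seeded generator is used only so that runs are repeatable.

```python
import random


def shuffled_negative(seq, rng):
    """Option (a): shuffle a positive sequence, keeping its base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def biased_random_negative(length, composition, rng):
    """Option (b): draw a random sequence from a biased base composition."""
    letters = list(composition)
    weights = [composition[b] for b in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


rng = random.Random(0)  # fixed seed, for repeatability only
positive = "TTGACATATAATGC"  # hypothetical positive example
negative_a = shuffled_negative(positive, rng)
# Assumed AT-rich background composition.
negative_b = biased_random_negative(14, {"A": 0.3, "T": 0.3, "G": 0.2, "C": 0.2}, rng)
print(negative_a, negative_b)
```

Shuffling preserves the exact base counts of the original sequence, which is one reason a shuffled set can remain closer to the positives than a purely random one.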
