Page 1: Parsing A Bacterial Genome Mark Craven Department of Biostatistics & Medical Informatics University of Wisconsin U.S.A. craven@biostat.wisc.edu craven

Parsing A Bacterial Genome

Mark Craven

Department of Biostatistics & Medical Informatics

University of Wisconsin

U.S.A.

craven@biostat.wisc.edu

www.biostat.wisc.edu/~craven

The Task

Given: a bacterial genome

Do: use computational methods to predict a “parts list” of regulatory elements

Outline

1. background on bacterial gene regulation

2. background on probabilistic language models

3. predicting transcription units using probabilistic language models

4. augmenting training with “weakly” labeled examples

5. refining the structure of a stochastic context free grammar

The Central Dogma of Molecular Biology

Transcription in Bacteria

Operons in Bacteria

• operon: sequence of one or more genes transcribed as a unit under some conditions

• promoter: “signal” in DNA indicating where to start transcription

• terminator: “signal” indicating where to stop transcription

[Figure: diagram of an operon with the labels promoter, gene, gene, gene, terminator, and mRNA]

The Task Revisited

Given:

– DNA sequence of E. coli genome

– coordinates of known/predicted genes

– known instances of operons, promoters, terminators

Do:

– learn models from known instances

– predict complete catalog of operons, promoters, terminators for the genome

Our Approach: Probabilistic Language Models

1. write down a “grammar” for elements of interest (operons, promoters, terminators, etc.) and relations among them

2. learn probability parameters from known instances of these elements

3. predict new elements by “parsing” uncharacterized DNA sequence

Transformational Grammars

• a transformational grammar characterizes a set of legal strings

• the grammar consists of

– a set of abstract nonterminal symbols: $\{S, C_1, C_2, C_3, C_4\}$

– a set of terminal symbols (those that actually appear in strings): $\{a, c, g, t\}$

– a set of productions: $C_1 \to t\,C_2$, $C_2 \to a\,C_3$, $C_2 \to g\,C_4$, $C_3 \to g$, $C_3 \to a$, $C_4 \to a$

A Grammar for Stop Codons

• this grammar can generate the 3 stop codons: taa, tag, tga

• with a grammar we can ask questions like

– what strings are derivable from the grammar?

– can a particular string be derived from the grammar?

$S \to C_1$, $C_1 \to t\,C_2$, $C_2 \to a\,C_3$, $C_2 \to g\,C_4$, $C_3 \to g$, $C_3 \to a$, $C_4 \to a$
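
Both questions can be answered mechanically for a grammar this small. The following is a minimal Python sketch, not the method used in this work: it enumerates every string derivable from S by expanding the leftmost symbol first, which terminates only because this grammar has no recursion. The names `PRODUCTIONS`, `derive`, and `derivable` are illustrative.

```python
# Productions of the stop-codon grammar. Keys are nonterminals;
# any symbol that is not a key is treated as a terminal.
PRODUCTIONS = {
    "S": [["C1"]],
    "C1": [["t", "C2"]],
    "C2": [["a", "C3"], ["g", "C4"]],
    "C3": [["g"], ["a"]],
    "C4": [["a"]],
}


def derive(symbols):
    """Yield every terminal string derivable from a list of symbols."""
    if not symbols:
        yield ""
        return
    head, rest = symbols[0], symbols[1:]
    if head in PRODUCTIONS:
        # Nonterminal: try each production for it in turn.
        for rhs in PRODUCTIONS[head]:
            yield from derive(rhs + rest)
    else:
        # Terminal: emit it and continue with the remaining symbols.
        for tail in derive(rest):
            yield head + tail


def derivable(string):
    """Return True if the grammar can derive `string` from S."""
    return string in set(derive(["S"]))


language = sorted(derive(["S"]))
print(language)          # the full set of derivable strings
print(derivable("tga"))  # membership query for one string
```

Exhaustive enumeration only works for finite languages; grammars with recursive productions need a parsing algorithm instead.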

The Parse Tree for tag

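$S \to C_1$, $C_1 \to t\,C_2$, $C_2 \to a\,C_3$, $C_2 \to g\,C_4$, $C_3 \to g$, $C_3 \to a$, $C_4 \to a$

[Figure: parse tree for tag: $S$ expands to $C_1$, $C_1$ to $t\,C_2$, $C_2$ to $a\,C_3$, and $C_3$ to $g$]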

A Probabilistic Version of the Grammar

• each production has an associated probability

• the probabilities for productions with the same left-hand side sum to 1

• this grammar has a corresponding Markov chain model

$S \to C_1$, $C_1 \to t\,C_2$, $C_2 \to a\,C_3$, $C_2 \to g\,C_4$, $C_3 \to g$, $C_3 \to a$, $C_4 \to a$

production probabilities shown on the slide: 1.0, 1.0, 0.7, 0.3, 1.0, 0.2, 0.8
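
To make the probabilistic grammar concrete, here is a minimal Python sketch that multiplies production probabilities along each derivation to get the probability of each stop codon. The pairing of 0.7/0.3 with the two $C_2$ productions and of 0.2/0.8 with the two $C_3$ productions is an assumption made for illustration; only the constraint that each left-hand side sums to 1 is taken from the slide. `RULES` and `string_probs` are illustrative names.

```python
import math

# nonterminal -> list of (right-hand side, probability).
# ASSUMPTION: which C2 production gets 0.7 and which C3 production gets 0.2
# is chosen here for illustration only.
RULES = {
    "S": [(("C1",), 1.0)],
    "C1": [(("t", "C2"), 1.0)],
    "C2": [(("a", "C3"), 0.7), (("g", "C4"), 0.3)],
    "C3": [(("g",), 0.2), (("a",), 0.8)],
    "C4": [(("a",), 1.0)],
}


def string_probs(symbols=("S",), prob=1.0):
    """Yield (string, probability) for every derivation from `symbols`."""
    if not symbols:
        yield "", prob
        return
    head, rest = symbols[0], symbols[1:]
    if head in RULES:
        # Expand the nonterminal and accumulate the production probability.
        for rhs, p in RULES[head]:
            yield from string_probs(rhs + rest, prob * p)
    else:
        # Terminal: prepend it to every completion of the remaining symbols.
        for tail, p in string_probs(rest, prob):
            yield head + tail, p


# Each string has exactly one derivation in this grammar, so a dict suffices.
dist = dict(string_probs())
for codon, p in sorted(dist.items()):
    print(codon, round(p, 3))
assert math.isclose(sum(dist.values()), 1.0)
```

Because every production has at most one nonterminal, at the right end, the same distribution could be written as a Markov chain over the states $C_1$ to $C_4$.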

A Probabilistic Context Free Grammar for Terminators

START → PREFIX STEM_BOT1 SUFFIX

PREFIX → B B B B B B B B B

STEM_BOT1 → t_l STEM_BOT2 t_r

STEM_BOT2 → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*

STEM_MID → t_l* STEM_MID t_r* | t_l* STEM_TOP2 t_r*

STEM_TOP2 → t_l* STEM_TOP1 t_r*

STEM_TOP1 → t_l LOOP t_r

LOOP → B B LOOP_MID B B

LOOP_MID → B LOOP_MID | ε

SUFFIX → B B B B B B B B B

B → a | c | g | u

t = {a,c,g,u}, t* = {a,c,g,u,ε}

[Figure: an example terminator sequence annotated with its prefix, stem, loop, and suffix regions]

Inference with Probabilistic Grammars

• for a given string there may be many parses, but some are more probable than others

• we can do prediction by finding relatively high probability parses

• there are dynamic programming algorithms for finding the most probable parse efficiently
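
One such dynamic programming algorithm is the Viterbi variant of CYK, which fills a table of the best probability for every (span, nonterminal) pair. Below is a minimal Python sketch under stated assumptions: the toy grammar (a one-base loop closed by a-u or g-c pairs, written in Chomsky normal form) is hypothetical and is not the terminator grammar from this work, and the names `BINARY`, `LEXICAL`, and `viterbi_parse` are illustrative.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form.
BINARY = {  # lhs -> list of (left child, right child, probability)
    "S": [("A", "X1", 0.4), ("G", "X2", 0.4)],
    "X1": [("S", "U", 1.0)],
    "X2": [("S", "C", 1.0)],
}
LEXICAL = {  # lhs -> {terminal: probability}
    "S": {"a": 0.1, "c": 0.1},
    "A": {"a": 1.0},
    "C": {"c": 1.0},
    "G": {"g": 1.0},
    "U": {"u": 1.0},
}


def viterbi_parse(seq, start="S"):
    """Return (probability, tree) of the most probable parse of `seq`."""
    n = len(seq)
    best = defaultdict(float)  # (i, j, nonterminal) -> best prob for seq[i:j]
    back = {}                  # back-pointers for rebuilding the tree
    # Spans of length 1 come from the lexical productions.
    for i, ch in enumerate(seq):
        for lhs, emissions in LEXICAL.items():
            if ch in emissions:
                best[i, i + 1, lhs] = emissions[ch]
                back[i, i + 1, lhs] = ch
    # Longer spans combine two shorter spans at every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for lhs, rules in BINARY.items():
                for left, right, p in rules:
                    for k in range(i + 1, j):
                        score = p * best[i, k, left] * best[k, j, right]
                        if score > best[i, j, lhs]:
                            best[i, j, lhs] = score
                            back[i, j, lhs] = (k, left, right)

    def build(i, j, sym):
        entry = back[i, j, sym]
        if isinstance(entry, str):
            return (sym, entry)
        k, left, right = entry
        return (sym, build(i, k, left), build(k, j, right))

    prob = best[0, n, start]
    return prob, (build(0, n, start) if prob > 0 else None)


prob, tree = viterbi_parse("gacuc")
print(prob)
print(tree)
```

The table has O(n^2) cells per nonterminal and each cell scans O(n) split points, so the run time grows cubically with sequence length. Real sequence models usually work with log probabilities to avoid underflow.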

Learning with Probabilistic Grammars

• in this work, we write down the productions by hand, but learn the probability parameters

• to learn the probability parameters, we align sequences of a given class (e.g. terminators) with the relevant part of the grammar

• when there is hidden state (i.e. the correct parse is not known), we use Expectation Maximization (EM) algorithms
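
When the parse of every training sequence is known, learning the parameters reduces to counting. The sketch below is a minimal Python illustration of that fully observed case, not the training code used in this work: it normalizes production counts per left-hand side. The training data (four stop-codon derivations) and the name `estimate_probabilities` are made up for the example. EM follows the same recipe but replaces the observed counts with expected counts computed under the current parameters.

```python
from collections import Counter, defaultdict


def estimate_probabilities(observed_productions):
    """Maximum-likelihood production probabilities from observed parses.

    `observed_productions` is an iterable of (lhs, rhs) pairs, one per
    production use across all training parses.
    """
    counts = Counter(observed_productions)
    totals = defaultdict(int)
    for (lhs, _rhs), c in counts.items():
        totals[lhs] += c
    # Normalize so that productions sharing a left-hand side sum to 1.
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}


def stop_codon_parse(codon):
    """Productions used to derive one stop codon (hypothetical helper)."""
    second, third = codon[1], codon[2]
    next_nt = "C3" if second == "a" else "C4"
    return [
        ("S", "C1"),
        ("C1", "t C2"),
        ("C2", f"{second} {next_nt}"),
        (next_nt, third),
    ]


# Made-up training set: taa twice, tag once, tga once.
training = []
for codon in ["taa", "taa", "tag", "tga"]:
    training.extend(stop_codon_parse(codon))

probs = estimate_probabilities(training)
for rule, p in sorted(probs.items()):
    print(rule, round(p, 3))
```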

Outline

1. background on bacterial gene regulation

2. background on probabilistic language models

3. predicting transcription units using probabilistic language models [Bockhorst et al., ISMB/Bioinformatics ‘03]

4. augmenting training with “weakly” labeled examples

5. refining the structure of a stochastic context free grammar

A Model for Transcription Units

[Figure: state diagram of the transcription unit model, spanning an untranscribed region and a transcribed region. Legend: SCFG, position specific Markov model, semi-Markov model. State labels: start, start spacer, -35, spacer, -10, TSS, prom intern, post prom, UTR, ORF, intra ORF, last ORF, pre term, RIT prefix, RDT prefix, stem loop (two states), RIT suffix, RDT suffix, end spacer, end]

The Components of the Model

• stochastic context free grammars (SCFGs) represent variable-length sequences with long-range dependencies

• semi-Markov models represent variable-length sequences

$$\Pr(x_{i:j} \mid \theta) = \Pr(L \mid \theta_L)\,\Pr(x_i \mid \theta_0) \prod_{l=i+1}^{j} \Pr(x_l \mid x_{l-1}, \theta_0)$$

• position-specific Markov models represent fixed-length sequence motifs

$$\Pr(x_{i:j} \mid \theta) = \Pr(x_i \mid \theta_1) \prod_{l=i+1}^{j} \Pr(x_l \mid x_{l-1}, \theta_{l-i+1})$$
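
As a rough illustration of how a semi-Markov state scores a variable-length segment, the Python sketch below assumes the score factors into a length term and a first-order Markov chain over the bases: the probability of the segment length, times the probability of the first base, times one transition probability per subsequent base. The function name, the dictionary layout, and the uniform parameters are all assumptions for the example, not values from this work.

```python
import math


def semi_markov_log_prob(segment, length_probs, initial_probs, transition_probs):
    """log Pr(segment) = log Pr(L) + log Pr(x_1) + sum of log Pr(x_l | x_{l-1})."""
    length = len(segment)
    # Segments whose length has zero probability cannot be emitted.
    if length == 0 or length_probs.get(length, 0.0) == 0.0:
        return float("-inf")
    logp = math.log(length_probs[length]) + math.log(initial_probs[segment[0]])
    # One transition term for each adjacent pair of bases.
    for prev, cur in zip(segment, segment[1:]):
        logp += math.log(transition_probs[prev][cur])
    return logp


# Hypothetical parameters: uniform bases, segments of length 3 or 4 only.
BASES = "acgt"
initial = {b: 0.25 for b in BASES}
transition = {a: {b: 0.25 for b in BASES} for a in BASES}
lengths = {3: 0.5, 4: 0.5}

print(semi_markov_log_prob("acg", lengths, initial, transition))
print(semi_markov_log_prob("ac", lengths, initial, transition))  # -inf
```

A position-specific Markov model would drop the length term and look up a different transition table at each position of the motif.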

Gene Expression Data

• in addition to DNA sequence data, we also use expression data to make our parses

• microarrays enable the simultaneous measurement of the transcription levels of thousands of genes

[Figure: microarray expression matrix; one axis is genes/sequence positions, the other is experimental conditions]

Incorporating Expression Data

ACGTAGATAGACAGAATGACAGATAGAGACAGTTCGCTAGCTGACAGCTAGATCGATAGCTCGATAGCACGTGTACGTAGATAGACAGAATGACAGATAGAGACAGTTCGCT

• our models parse two sequences simultaneously

– the DNA sequence of the genome

– a sequence of expression measurements associated with particular sequence positions

• the expression data is useful because it provides information about which subsequences look like they are transcribed together

Predictive Accuracy for Operons

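[Figure: bar chart of sensitivity, specificity, and precision (axis from 0 to 100) for three models: sequence only, expression only, sequence+expression]

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP} \qquad \text{precision} = \frac{TP}{TP + FP}$$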
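
For reference, the three accuracy measures can be computed directly from the four confusion-matrix counts. This is a minimal Python sketch; the counts in the usage example are made up and are not results from this work, and the function assumes every denominator is nonzero.

```python
def accuracy_measures(tp, fp, tn, fn):
    """Sensitivity, specificity, and precision from confusion-matrix counts.

    Assumes tp + fn, tn + fp, and tp + fp are all nonzero.
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of true elements found
        "specificity": tn / (tn + fp),  # fraction of negatives rejected
        "precision": tp / (tp + fp),    # fraction of predictions that are right
    }


# Hypothetical counts, for illustration only.
measures = accuracy_measures(tp=80, fp=10, tn=90, fn=20)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```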

Predictive Accuracy for Promoters

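[Figure: bar chart of sensitivity, specificity, and precision (axis from 50 to 100) for three models: sequence only, expression only, sequence+expression]

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP} \qquad \text{precision} = \frac{TP}{TP + FP}$$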

Predictive Accuracy for Terminators

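[Figure: bar chart of sensitivity, specificity, and precision (axis from 50 to 100) for three models: sequence only, expression only, sequence+expression]

$$\text{sensitivity} = \frac{TP}{TP + FN} \qquad \text{specificity} = \frac{TN}{TN + FP} \qquad \text{precision} = \frac{TP}{TP + FP}$$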

Accuracy of Promoter & Terminator Localization

Terminator Predictive Accuracy

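[Figure: ROC curves (True Positive Rate vs. False Positive Rate, both from 0% to 100%) for four methods: SCFG; SCFG, no training; Complementarity Matrix; Interpolated Markov Model]

$$\text{True Positive Rate} = \frac{TP}{TP + FN} \qquad \text{False Positive Rate} = \frac{FP}{FP + TN}$$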

Outline

1. background on bacterial gene regulation

2. background on probabilistic language models

3. predicting transcription units using probabilistic language models

4. augmenting training data with “weakly” labeled examples [Bockhorst & Craven, ICML ’02]

5. refining the structure of a stochastic context free grammar

Key Idea: Weakly Labeled Examples

• regulatory elements are inter-related

– promoters precede operons

– terminators follow operons

– etc.

• relationships such as these can be exploited to augment training sets with “weakly labeled” examples

Inferring “Weakly” Labeled Examples

[Figure: five genes, g1 to g5, drawn along a double-stranded DNA sequence]

• if we know that an operon ends at g4, then there must be a terminator shortly downstream

• if we know that an operon begins at g2, then there must be a promoter shortly upstream

• we can exploit relations such as this to augment our training sets
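
The inference on this slide can be sketched as cutting out the region next to a known operon boundary. The Python below is a hedged illustration only: the window size of 100 bases, the 0-based forward-strand coordinates, and the function names are assumptions, since the slide says only "shortly" upstream or downstream. The result is a weakly labeled example because the window is believed to contain the element, but its exact extent inside the window is unknown.

```python
def weak_terminator_example(genome, operon_end, window=100):
    """Bases just downstream of an operon's end (forward strand).

    `operon_end` is a 0-based exclusive end coordinate; `window` is an
    assumed size, not a value from the original work.
    """
    return genome[operon_end:operon_end + window]


def weak_promoter_example(genome, operon_start, window=100):
    """Bases just upstream of an operon's start (forward strand)."""
    return genome[max(0, operon_start - window):operon_start]


# Toy genome: the "operon" occupies positions 10..19.
genome = "a" * 10 + "c" * 10 + "g" * 10
print(weak_promoter_example(genome, operon_start=10, window=5))   # upstream
print(weak_terminator_example(genome, operon_end=20, window=5))   # downstream
```

Operons on the reverse strand would need the coordinates mirrored and the sequence reverse-complemented, which this sketch leaves out.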

Strongly vs. Weakly Labeled Terminator Examples

strongly labeled terminator:

gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg

weakly labeled terminator:

gtccgttccgccactattcactcatgaaaatgagttcagagagccgcaagatttttaattttgcggtttttttgtatttgaattccaccatttctctgttcaatg

[Figure annotations on the sequences: extent of terminator; end of stem-loop; sub-class: rho-independent]

Training the Terminator Models: Strongly Labeled Examples

[Figure: training diagram with three example sets (rho-independent examples, rho-dependent examples, negative examples) and three models (rho-independent terminator model, rho-dependent terminator model, negative model)]

Training the Terminator Models: Weakly Labeled Examples

[Figure: training diagram with two example sets (weakly labeled examples, negative examples) and the rho-independent terminator model, rho-dependent terminator model, negative model, and combined terminator model]

Do Weakly Labeled Terminator Examples Help?

• task: classification of terminators (both sub-classes) in E. coli K-12

• train SCFG terminator model using:

– S strongly labeled examples and

– W weakly labeled examples

• evaluate using area under ROC curves
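
The area under an ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following minimal Python sketch computes it that way; the scores are made up for illustration and `roc_auc` is an illustrative name, not a function from this work.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores for terminators and non-terminators.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.3, 0.2]
print(roc_auc(positives, negatives))
```

An area of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the learning-curve axes in this section start at 0.5.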

Learning Curves using Weakly Labeled Terminators

[Figure: area under ROC curve (0.5 to 1) vs. number of strong positive examples (0 to 140), with one curve each for 0, 25, and 250 weak examples]

Are Weakly Labeled Examples Better than Unlabeled Examples?

• train SCFG terminator model using:

– S strongly labeled examples and

– U unlabeled examples

• vary S and U to obtain learning curves

Training the Terminator Models: Unlabeled Examples

[Figure: training diagram with one example set (unlabeled examples) and the rho-independent terminator model, rho-dependent terminator model, negative model, and combined model]

Learning Curves: Weak vs. Unlabeled

[Figure: two panels, Weakly Labeled and Unlabeled, each plotting area under ROC curve (0.6 to 1) vs. number of strong positive examples (0 to 120). The Weakly Labeled panel has curves for 0, 25, and 250 weak examples; the Unlabeled panel has curves for 0, 25, and 250 unlabeled examples]

Are Weakly Labeled Terminators from Predicted Operons Useful?

• train operon model with S labeled operons

• predict operons

• generate W weakly labeled terminators from W most confident predictions

• vary S and W

Learning Curves using Weakly Labeled Terminators

[Figure: area under ROC curve (0.5 to 1) vs. number of strong positive examples (0 to 160), with one curve each for 0, 25, 100, and 200 weak examples]

Outline

1. background on bacterial gene regulation

2. background on probabilistic language models

3. predicting transcription units using probabilistic language models

4. augmenting training with “weakly” labeled examples

5. refining the structure of a stochastic context free grammar [Bockhorst & Craven, IJCAI ’01]

Learning SCFGs

• given the productions of a grammar, can learn the probabilities using the Inside-Outside algorithm

• we have developed an algorithm that can add new nonterminals & productions to a grammar during learning

• basic idea:

– identify nonterminals that seem to be “overloaded”

– split these nonterminals into two; allow each to specialize

Refining the Grammar in a SCFG

• there are various “contexts” in which each grammar nonterminal may be used

• consider two contexts for the nonterminal $w_2$: $w_1 \to A\,w_2\,U$ and $w_1 \to C\,w_2\,G$

in one context: $w_2 \to A\,w_3\,U$ (0.4) $\mid$ $C\,w_3\,G$ (0.4) $\mid$ $G\,w_3\,C$ (0.1) $\mid$ $U\,w_3\,A$ (0.1)

in the other context: $w_2 \to A\,w_3\,U$ (0.1) $\mid$ $C\,w_3\,G$ (0.1) $\mid$ $G\,w_3\,C$ (0.4) $\mid$ $U\,w_3\,A$ (0.4)

• if the probabilities for $w_2$ look very different, depending on its context, we add a new nonterminal and specialize $w_2$

Refining the Grammar in a SCFG

• we can compare two probability distributions P and Q using Kullback-Leibler divergence

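$P$: $w_2 \to A\,w_3\,U$ (0.4) $\mid$ $C\,w_3\,G$ (0.4) $\mid$ $G\,w_3\,C$ (0.1) $\mid$ $U\,w_3\,A$ (0.1)

$Q$: $w_2 \to A\,w_3\,U$ (0.1) $\mid$ $C\,w_3\,G$ (0.1) $\mid$ $G\,w_3\,C$ (0.4) $\mid$ $U\,w_3\,A$ (0.4)

$$H(P \,\|\, Q) = \sum_i P(x_i) \log \frac{P(x_i)}{Q(x_i)}$$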
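
As a worked example, the Kullback-Leibler divergence between the two context-specific distributions over the productions of the nonterminal can be computed in a few lines. This Python sketch uses the natural logarithm; the choice of base and the name `kl_divergence` are assumptions, since the slide does not state them.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence H(P||Q) in nats.

    Terms with P(x) = 0 contribute nothing; Q(x) must be positive
    wherever P(x) is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Production probabilities in the two contexts, in the order
# A..U, C..G, G..C, U..A.
P = [0.4, 0.4, 0.1, 0.1]
Q = [0.1, 0.1, 0.4, 0.4]

print(kl_divergence(P, Q))                # nats
print(kl_divergence(P, Q) / math.log(2))  # the same value in bits
print(kl_divergence(P, P))                # identical distributions
```

For these two distributions the sum simplifies to 0.6 ln 4, roughly 0.83 nats or 1.2 bits. A large value like this signals that the nonterminal behaves differently in the two contexts, which is the cue for splitting it.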

Learning Terminator SCFGs

• extracted grammar from the literature (~ 120 productions)

• data set consists of 142 known E. coli terminators, 125 sequences that do not contain terminators

• learn parameters using Inside-Outside algorithm (an EM algorithm)

• consider adding nonterminals guided by three heuristics

– KL divergence

– chi-squared

– random

SCFG Accuracy After Adding 25 New Nonterminals

[Figure: ROC curves (True Positive Rate vs. False Positive Rate, both from 0% to 100%) for four grammars: Chi square, KL Divergence, Random, Original Grammar]

SCFG Accuracy vs. Nonterminals Added

[Figure: area under ROC curve (0.5 to 1) vs. number of additional nonterminals (0 to 25) for three heuristics: Chi square, KL Divergence, Random]

Conclusions

• summary

– we have developed an approach to predicting transcription units in bacterial genomes

– we have predicted a complete set of transcription units for the E. coli genome

• advantages of the probabilistic grammar approach

– can readily incorporate background knowledge

– can simultaneously get a coherent set of predictions for a set of related elements

– can be easily extended to incorporate other genomic elements

• current directions

– expanding the vocabulary of elements modeled (genes, transcription factor binding sites, etc.)

– handling overlapping elements

– making predictions for multiple related genomes

Acknowledgements

• Craven Lab: Joe Bockhorst, Keith Noto

• David Page, Jude Shavlik

• Blattner Lab: Fred Blattner, Jeremy Glasner, Mingzhu Liu, Yu Qiu

• funding from National Science Foundation, National Institutes of Health