discovery of regulatory elements by a phylogenetic footprinting algorithm mathieu blanchette martin...

23
Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm Mathieu Blanchette Martin Tompa Computer Science & Engineering University of Washington

Upload: stella-blair

Post on 01-Jan-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Discovery of Regulatory Elements by a Phylogenetic

Footprinting Algorithm

Mathieu BlanchetteMartin Tompa

Computer Science & EngineeringUniversity of Washington

2

Outline•How are genes regulated?

•What is phylogenetic footprinting?

•First solution

•Improvements and extensions

•Application to regulation of several important genes

3

Regulation of Genes

• What turns genes on and off?

• When is a gene turned on or off?

• Where (in which cells) is a gene turned on?

• How many copies of the gene product are produced?

4

Regulation of Genes

Coding regionRegulatory Element

RNA polymerase

Transcription Factor

DNA

5

RNA polymerase

Transcription Factor

DNA

Coding region

Regulation of Genes

Regulatory Element

6

GoalIdentify regulatory elements in DNA sequences. These are:

• Binding sites for proteins

• Short substrings (5-25 nucleotides)

• Up to 1000 nucleotides (or farther) from gene

• Inexactly repeating patterns (“motifs”)

7

Phylogenetic Footprinting(Tagle et al. 1988)

Functional sequences evolve slower than nonfunctional ones.

• Consider a set of orthologous sequences from different species

• Identify unusually well conserved regions

8

Substring Parsimony ProblemGiven:

• phylogenetic tree T,• set of orthologous sequences at leaves of T,• length k of motif• threshold d

Problem:

• Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d.

This problem is NP-hard.

9

Small Example

AGTCGTACGTGAC... (Human)

AGTAGACGTGCCG... (Chimp)

ACGTGAGATACGT... (Rabbit)

GAACGGAGTACGT... (Mouse)

TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4

10

Solution

Parsimony score: 1 mutation

AGTCGTACGTGAC...

AGTAGACGTGCCG...

ACGTGAGATACGT...

GAACGGAGTACGT...

TCGTGACGGTGAT...ACGGACGT

ACGT

ACGT

11

CLUSTALW multiple sequence alignment (rbcS gene)Cotton ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATTPea GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACATobacco TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACCIce-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACCTurnip ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGCWheat TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAADuckweed TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAALarch TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----APea C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------ATobacco AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGAIce-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAATurnip CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------AWheat GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------Duckweed ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATTLarch TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTAPea GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTATobacco GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATGIce-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGGTurnip CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATAWheat CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTGDuckweed TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATCLarch CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTACPea TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAACTobacco CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAAIce-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTACLarch TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCATurnip TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAGWheat GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCCDuckweed CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

12

An Exact Algorithm(generalizing Sankoff and Rousseau 1975)

Wu [s] = best parsimony score for subtree rooted at node u,

if u is labeled with string s.

AGTCGTACGTG

ACGGGACGTGC

ACGTGAGATAC

GAACGGAGTAC

TCGTGACGGTG

… ACGG: 2 ACGT: 1 ...

… ACGG: 0 ACGT: 2...

… ACGG: 1 ACGT: 1 ...

ACGG: + ACGT: 0

...

… ACGG: 1 ACGT: 0 ...

4k entries

… ACGG: 0 ACGT: + ...

… ACGG: ACGT :0 ...

… ACGG: ACGT :0 ...

… ACGG: ACGT :0 ...

13

Wu [s] = min ( Wv [t] + d(s, t) ) v: child t of u

Recurrence

14

O(k 42k ) time per node

Wu [s] = min ( Wv [t] + d(s, t) ) v: child t of u

Running Time

15

O(k 42k ) time per node

Number of species

Average sequence

length

Motif length

Total time O(n k (42k + l ))

Wu [s] = min ( Wv [t] + d(s, t) ) v: child t of u

Running Time

16

Improvements• Better algorithm reduces time from

O(n k (42k + l )) to O(n k (4k + l ))

• By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)

• Amenable to many useful extensions (e.g., allow insertions and deletions)

17

Application to -actin Gene

Gilthead sea bream (678 bp)

Medaka fish (1016 bp)

Common carp (696 bp)

Grass carp (917 bp)

Chicken (871 bp)

Human (646 bp)

Rabbit (636 bp)

Rat (966 bp)

Mouse (684 bp)

Hamster (1107 bp)

18

Common carpACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG

AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA

GACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC

ChickenACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTG

TTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA

GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA

TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT

HumanGCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTT

TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT

GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC

CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG

Parsimony score over 10 vertebrates: 0 1 2

19

Motifs Absent from Some Species

• Find motifs – with small parsimony score– that span a large part of the tree

• Example: in tree of 10 species spanning 760 Myrs, find all motifs with– score 0 spanning at least 250 Myrs– score 1 spanning at least 350 Myrs– score 2 spanning at least 450 Myrs– score 3 spanning at least 550 Myrs

20

Application to c-fos Gene

Asked for motifs of length 10, with 0 mutations over tree of

size 6 1 mutation over tree of size 11 2 mutations over tree of size 16 3 mutations over tree of size 21 4 mutations over tree of size 26

Puffer fish

Chicken

Pig

Mouse

Hamster

Human

10

2

7

2

2

21

0

1

1

Found: 0 mutations over tree of size 81 mutation over tree of size 163 mutations over tree of size 214 mutations over tree of size 28

21

Application to c-fos GeneMotif Score Conserved in Known?

CAGGTGCGAATGTTC 0 4 mammals

TTCCCGCCTCCCCTCCCC 0 4 mammals yes

GAGTTGGCTGcagcc 3 puffer + 4 mammals

GTTCCCGTCAATCcct 1 chicken + 4 mammals yes

CACAGGATGTcc 4 all 6 yes

AGGACATCTG 1 chicken + 4 mammals yes

GTCAGCAGGTTTCCACG 0 4 mammals yes

TACTCCAACCGC 0 4 mammals

22

Other Genes

Similar results for the following genes:•insulin•c-myc promoter and intron•growth hormone•interleukin-3•histone H1-globin•dihydrofolate reductase

•fibroin•myogenin•prolactin•thyroglobulin•γ-actin 3´ UTR•rbcS•rbcL

23

Conclusions• Guaranteed optimality for question posed

• Time linear in the number of species and the total sequence lengths, exponential in the parsimony score

• Practical on real biological data sets

• Discovered highly conserved regions, both known and not (yet) known

• Available at http://bio.cs.washington.edu/software.html