Regulatory genomics and epigenomics in fly and human
Manolis Kellis, MIT
MIT Computer Science & Artificial Intelligence Laboratory
Broad Institute of MIT and Harvard
Three areas of computational genomics
1. Genome annotation
– Evolutionary signatures for each function
– Discover proteins, RNAs, microRNAs
– New bio: read-thru, editing, miR*, miR-AS
2. Gene regulation
– Discover regulatory motifs pre/post-transcr.
– Identify gene targets using comp. genomics
– Epigenomics in development and disease
3. Genome evolution
– Evolution by whole-genome duplication
– The two forces of gene evolution
– Phylogenomics and neofunctionalization
[Slide background: raw genomic DNA sequence]
Genes: encode proteins
Regulatory motifs: control gene expression
Large-scale comparative genomics datasets
32 mammals, 17 fungi, 12 flies
[Figure: species trees; the fungal tree covers 9 yeasts (pre- and post-duplication) and 8 Candida (diploid and haploid)]
Comparative genomics and evolutionary signatures
• Comparative genomics can reveal functional elements
– For example: exons are deeply conserved to mouse, chicken, fish
– Many other elements are also strongly conserved: exons / regulatory?
• Can we also pinpoint specific functions of each region? Yes!
– Patterns of change distinguish different types of functional elements
– Specific function → selective pressures → patterns of mutation/insertion/deletion
• Develop evolutionary signatures characteristic of each function
Evolutionary signatures for diverse functions
Protein-coding genes
– Codon Substitution Frequencies
– Reading Frame Conservation
RNA structures
– Compensatory changes
– Silent G-U substitutions
microRNAs
– Shape of conservation profile
– Structural features: loops, pairs
– Relationship with 3’UTR motifs
Regulatory motifs
– Mutations preserve consensus
– Increased Branch Length Score
– Genome-wide conservation
Protein-coding genes
Mike Lin
Evolutionary signatures for protein-coding genes
[Figure labels: non-synonymous substitutions; synonymous codon substitutions; frame-shifting gaps; gaps that are multiples of 3]
• Same conservation levels, distinct patterns of divergence
– Gaps are multiples of three (preserve amino acid translation)
– Mutations are largely 3-periodic (silent codon substitutions)
– Specific triplets exchanged more frequently (conservative substs.)
– Conservation boundaries are sharp (pinpoint individual splicing signals)
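Two of these signatures are easy to make concrete. The sketch below is only an illustration, not the CSF or RFC method named in the talk: it takes a pairwise alignment (two equal-length strings with "-" for gaps), reports the fraction of gaps whose length is a multiple of three, and counts mismatches by codon position. The function names and the assumption that the reference starts in frame 0 are hypothetical.

```python
from collections import Counter


def gap_lengths(aligned):
    """Return the length of every run of '-' in one row of an alignment."""
    runs, run = [], 0
    for ch in aligned:
        if ch == "-":
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    return runs


def frame_preserving_fraction(ref, other):
    """Fraction of gaps (in either row) whose length is a multiple of three."""
    gaps = gap_lengths(ref) + gap_lengths(other)
    if not gaps:
        return 1.0
    return sum(g % 3 == 0 for g in gaps) / len(gaps)


def substitutions_by_codon_position(ref, other):
    """Count mismatches at codon positions 0, 1 and 2 in reference coordinates.

    Assumes (for illustration) that the ungapped reference starts in frame 0.
    Columns where either row has a gap are not counted as substitutions.
    """
    counts = Counter({0: 0, 1: 0, 2: 0})
    pos = 0  # index into the ungapped reference
    for r, o in zip(ref, other):
        if r == "-":
            continue
        if o != "-" and r != o:
            counts[pos % 3] += 1
        pos += 1
    return counts


# Toy coding-like alignment: one 3-base gap, mismatches only at third positions.
ref = "ATGGCTGAA---TTTAAA"
other = "ATGGCCGAGCCCTTCAAA"
profile = substitutions_by_codon_position(ref, other)
```

In a protein-coding region one would expect the gap fraction to be close to 1 and the mismatch counts to pile up at the third codon position; in conserved non-coding sequence neither pattern is expected.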
Protein-coding evolution vs nucleotide conservation
[Figure panels: protein-coding exons; highly conserved non-coding elements]
• Evolutionary signatures specific to each function
– Distinguish protein-coding from non-coding conservation
– Genome-wide run (CSF only): 81% sens., 91% precision
– Incorporating additional signatures: RFC, single-species…
Additional support for novel human exons
• Length distribution matches known exons (no excess of very small lengths)
• Supported by high-quality cDNA evidence
• Supported by independent curation efforts
• Extraordinary comparative support in some cases
Many new genes confirmed by chromatin domains
[Figure labels: missed exon; alt. spliced exon]
• Several hundred new exons, many in clusters
– Example: MM14qC3
• Supported by chromatin signatures (Guttman et al)
Mikkelsen et al
Genome-wide curation / experimental follow-up
PI: Tim Hubbard, Sanger Center. HAVANA curators, experimental validation.
• Novel candidate genes and exons
– Experimental cDNA sequencing and validation
– Curation of gene structures integrating evidence
• Revising existing annotations
– Identify dubious genes with non-protein-like evolution
– Refine boundaries and exon sets of existing genes
– Curation: evaluate evidence supporting that annotation
• Unusual gene structures
– Evolutionary evidence in absence of primary signals
– Reveal new and unusual biological mechanisms
Comparative evidence leads to genome reannotation
• 81% of 928 curated exons incorporated in FlyBase
• Surprise: 44% of intronic predictions independent of the surrounding gene
• Surprise: 42% of intergenic predictions extend existing genes
• Surprise: overlapping protein-coding exons on opposite strands
• Surprise: 414 rejected protein-coding genes (two are pre-miRNAs)
Lin et al, Genome Research 2007
Unusual protein-coding events
Mike Lin
When primary sequence signals are ignored
• Typical gene (MEF2A). Evolutionary signal stops at the stop codon.
• Unusual gene (GPX2). Protein-coding signal continues past the stop.
• GPX2 is a known selenoprotein! Additional candidates found.
Translational read-through in neuronal proteins
Novel candidate: OPRL1 neurotransmitter
[Figure labels: protein-coding conservation; stop codon read-through; continued protein-coding conservation; 2nd stop codon; no more conservation]
• New mechanism of post-transcriptional control
– Conserved in both mammals (~5 candidates) and flies (~150 candidates)
– Strongly enriched for neurotransmitters and brain-expressed proteins
– Read-through stop codon (& surrounding) shows increased conservation
• Many questions remain
– Role of editing? Cryptic splice sites? RNA secondary structure?
Lin et al, Genome Research 2007
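The read-through picture (coding signal continues past the annotated stop until a second stop codon) implies a simple bookkeeping step: locating the next in-frame stop downstream of the annotated one, which bounds the candidate read-through region. The helper below is a hypothetical sketch of that step only; it does not score conservation, and the function name and the DNA-alphabet input are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def next_in_frame_stop(seq, stop_index):
    """Return the index of the next in-frame stop codon after the one at
    stop_index, or None if the sequence ends first.

    seq is a DNA string on the coding strand; stop_index must point at the
    first base of the annotated stop codon.
    """
    if seq[stop_index:stop_index + 3] not in STOP_CODONS:
        raise ValueError("stop_index does not point at a stop codon")
    # Step codon by codon, staying in the frame of the annotated stop.
    for i in range(stop_index + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


# Toy transcript: ATG AAA TAG GGC CCA TGA TTT
toy = "ATGAAATAGGGCCCATGATTT"
second_stop = next_in_frame_stop(toy, 6)  # index of the TGA codon
```

The codons between the two stops (GGC CCA in the toy example) are the region where continued protein-coding conservation would be evaluated.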
Measuring excess constraint within protein-coding exons
Typical protein-coding exon (Numerous mutations, at each column)
Excess-conservation exon: conserved above and beyond the call of duty
Likely to have additional functions, overlapping selective pressures
Searching for excess-constraint coding sequence
(1) Build a model for expected substitution counts
– Syn. subs. correlate w/ degeneracy & CpG
– Distribution for each ancestral codon
(2) Score windows for depletion in syn. subst.
• Z-score: P(obs. subst | expected for each codon)
(3) Top candidate exons with excess constraint
• PCPB2: derived from ancestral transposon
• Hox B5 gene start: 52 AA before 1 syn. subst.
• C6orf111: predicted ORF on chr. 6
• EIF4G2: overlaps spliced EvoFold prediction
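Step (2) can be sketched as follows. The slides do not spell out the statistical model, so this is an assumption for illustration: each codon in a window is treated as an independent Bernoulli trial with its own expected probability of carrying a synonymous substitution (from the model in step 1), and the window score is the usual normal-approximation Z-score. Strongly negative values indicate depletion of synonymous substitutions. All names are hypothetical.

```python
import math


def synonymous_depletion_z(observed, expected):
    """Z-score of observed vs. expected synonymous substitutions in a window.

    observed: list of 0/1 flags, one per codon (1 = synonymous substitution seen)
    expected: list of per-codon probabilities from the background model
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    mean = sum(expected)
    variance = sum(p * (1.0 - p) for p in expected)
    if variance == 0.0:
        raise ValueError("window has zero variance under the model")
    return (sum(observed) - mean) / math.sqrt(variance)


def scan_windows(observed, expected, width):
    """Score every window of `width` consecutive codons."""
    return [
        synonymous_depletion_z(observed[i:i + width], expected[i:i + width])
        for i in range(len(observed) - width + 1)
    ]


# A 16-codon window with no synonymous substitutions where 8 were expected.
z = synonymous_depletion_z([0] * 16, [0.5] * 16)
```

With sixteen codons each expected to change half the time, observing no changes gives z = -4, the kind of window that would be flagged as excess constraint.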
Examples: Top candidate exons showing increased selection
• HoxB5: 52 amino acids before the first synonymous substitution
– Overlaps highly conserved RNA secondary structure
• C6orf111: predicted ORF, protein-coding, extremely conserved
• EIF4G2: several consecutive exons, conserved RNA struct.
Evidence of post-transcriptional RNA regulation
• New non-coding RNAs (introns / intergenic)
– Supported by independent expression in multiple tissues
• Roles in translational regulation (exons / 5’UTRs)
– Difficult to obtain experimentally: importance of evo. signal
– Role in translation initiation: overlap ATG, ribosomal proteins
• Roles in A-to-I editing (exon & intron pairing)
– Enriched in known ADAR targets, new editing candidates, cDNA support
• Roles in localization / targeting (3’UTRs)
– Primarily on coding strand (75% & 80%)
– Enriched in post-transcriptional regulators: feedback, auto-/cross-regulation
Jakob Pedersen, EvoFold. Stark et al, Nature 2007
microRNA genes
Alex Stark, Pouya Kheradpour
Evolutionary signatures for microRNA genes
(1) Conservation profile
miRNAs show characteristic conservation properties
Distinguishing true miRNAs from random hairpins
[Figure: features (1)-(6), grouped as evolutionary features and structural features, with per-feature enrichment and total performance]
Combination of features: > 4,500-fold enrichment
Stark et al, Genome Research 2007
Novel miRNAs validated by sequencing reads
[Figure: sequencing reads over novel miRNA hairpins, e.g. 348 reads and 16 reads (Ruby, Bartel, Lai)]
• In fly genome: 101 hairpins above 0.95 cutoff
– 60 of 74 (81%) known Rfam miRNAs rediscovered
– + 24 novel, expression-validated by 454 & Solexa (Bartel/Hannon)
– + 17 additional candidates show diverse evidence of function
• In mammals: combine experimental & evolutionary info
– Rely on reads for discovery, use evolutionary signal to study function
Ruby et al, Genome Research 2007
Stark et al, Genome Research 2007
Surprise 1: microRNA & microRNA* function
[Figure: Drosophila Hox]
• Both hairpin arms of a microRNA can be functional
– High scores, abundant processing, conserved targets
– Hox miRNAs miR-10 and miR-iab-4 as master Hox regulators
Stark et al, Genome Research 2007
Surprise 2: microRNA anti-sense function
[Figure: highly conserved Hox targets; sense and anti-sense]
• A single miRNA locus transcribed from both strands
• The two transcripts show distinct expression domains (mutually exclusive)
• Both processed to mature miRNAs: mir-iab-4, miR-iab-4AS (anti-sense)
Stark et al, Genes & Development 2007
miR-iab-4AS leads to homeotic transformations
[Figure panels: WT; sense; antisense; wing; haltere; sensory bristles; haltere w/ bristles. Note: C, D, E same magnification]
• Mis-expression of mir-iab-4S & AS: halteres → wings homeotic transform.
• Stronger phenotype for AS miRNA
• Sense/anti-sense pairs as general building blocks for miRNA regulation
• 10 sense/anti-sense miRNAs in mouse
Stark et al, Genes&Development 2007
Surprise 2: MicroRNAs and developmental control
• Illustrates miR/miR* and miR/miR-AS cooperation
Measuring selection
Michele Clamp, Manuel Garber, Xiaohui Xie
More mammals = more power
4 species (branch length 1 subs/site)
26 species (branch length 4 subs/site)
[Figure labels: 50 bp; 6 bp]
Comparative genomics of 29 eutherian mammals
(22 @ 2X coverage)
Detecting Purifying Selection (ω)
[Figure: ω for neutral sequence vs. constrained sequence]
Estimating intensity of constraint (ω):
• Probabilistic evolutionary model
• Maximum Likelihood (ML) estimation of ω
– sitewise (evaluate every k-long window)
– windows-based (increased power)
• Reports ω, and its log odds score (LODS)
• Theoretical p-value (LODS distributes χ² with df = 1)
Manuel Garber, Michele Clamp, Xiaohui Xie
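The last bullet, turning a log-odds score into a theoretical p-value, can be sketched with the standard library alone. For one degree of freedom the χ² survival function has a closed form, P(X > x) = erfc(√(x/2)). The sketch assumes the conventional likelihood-ratio statistic 2·(ln L_alt − ln L_null); whether the LODS reported in the talk already includes the factor of 2 is not stated, so that convention is an assumption, as are the function names.

```python
import math


def chi2_df1_pvalue(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    if statistic < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2.0))


def lrt_pvalue(loglik_alt, loglik_null):
    """Likelihood-ratio test p-value for one extra free parameter (e.g. omega).

    loglik_alt:  maximized natural-log likelihood with omega estimated
    loglik_null: natural-log likelihood under the neutral model
    """
    statistic = 2.0 * (loglik_alt - loglik_null)
    # Numerical noise can make the statistic slightly negative; clamp at 0.
    return chi2_df1_pvalue(max(statistic, 0.0))


p = chi2_df1_pvalue(3.841458820694124)  # the familiar 5% critical value
```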
Detecting unusual mutational patterns (π)
[Figure: per-site ω values 0, 0, 0.8, 0.5, 0.6, 3.2, 0, 0]
• Repeated C↔G transversion
• Has happened at least 4 times
• Very unlikely given neutral model
• Goal: identify sites with unlikely substitution pattern
• Approach: probabilistic method to detect a stationary distribution that is different from background
• Solution: implement ML estimator (π) of this vector
• Provides a Position Weight Matrix for any given k-mer in the genome
• Scores every base in the genome (LODS)
Manuel Garber, Michele Clamp, Xiaohui Xie
Using 29 mammals: ~7% of the genome appears to be constrained
• 60% of selected bases can be pinpointed to within 12 bp at 95% confidence
• We can detect 127 Mb (of the estimated 210 Mb)
[Figure: constraint score distributions for 5 species and 20 species]
How many of the selected bases have we not seen before?
[Venn diagram: 4 mammals (old) vs. 29 mammals]
– 4 mammals (old): 60 Mb, 500,000 elements
– 29 mammals: 127 Mb, 2,600,000 elements
– Old only: 3 Mb, 77,000 elements
– Shared: 58 Mb, 374,000 elements
– New: 69 Mb, 1,970,000 elements
[Figure annotations: 30% low alignment coverage; some are longer elements with weak conservation; 70% of exons; 20% of exons]
New elements primarily intergenic
[Figure panels: exons, old vs. new; old and new elements, non-coding only]
Individual binding sites are revealed
Chr 6 - HIST13A, 120 bp upstream (5’)
[Figure tracks: 29-mammal elements; 29-mammal constraint (12-mers); alignment information content; Transfac matches: CAAT, CAAT, TATA]
Over 20 histones show similar conservation patterns
Understanding genomes by their chromatin signatures
Jason Ernst
Combining evolutionary and chromatin signatures
• Evolutionary signatures & genome annotation
– Protein-coding genes + unusual gene structures
– miRNA gene discovery and characterization
– Functional anti-sense miRNAs and miRNA* arms
– Motif discovery + motif function
• Chromatin signatures & genome regulation
– De novo discovery of chromatin signatures
• New functional combinations of marks emerge
• A new class of long non-coding RNA regulators
– Regulatory motif and target prediction
• Drosophila developmental enhancers
• Human enhancer-specific motifs
• Dynamics across developmental time
Chromatin signatures for genome annotation
• Similarly to evolutionary signatures:
– chromatin signatures encode functional elements
• The difference
– this information is dynamic
• The epigenetic code hypothesis
– Distinct combinations of marks encode distinct chromatin states
• Can we discover them de novo?
Cartoon Illustration of the Model
[Figure: DNA track with enhancer, transcription start site and transcribed region; observed histone modifications per interval; most likely hidden state per interval (states 1-6); highly likely modifications in each state, with probabilities of 0.7-0.9]
Even though a modification was not observed at one location, the correct state can still be inferred from the neighboring locations: the state is likely of the same type as its neighboring states.
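The point of the cartoon, that a bin with a missing mark is still assigned the state of its neighbors, is what Viterbi decoding of a hidden Markov model does. The sketch below assumes an HMM in which each state emits each mark independently with some probability (values like the 0.7-0.9 in the figure); the two-state toy model, its parameters and the function name are made up for illustration and are not the model trained in the talk.

```python
import math


def viterbi(observations, start, trans, emit):
    """Most likely state path for a sequence of binary mark vectors.

    observations: list of tuples of 0/1, one entry per mark, one tuple per bin
    start[s]:     probability of starting in state s
    trans[s][t]:  probability of moving from state s to state t
    emit[s][m]:   probability that mark m is present in state s
    All probabilities must lie strictly between 0 and 1.
    """
    n_states = len(start)

    def log_emission(state, obs):
        return sum(
            math.log(p if seen else 1.0 - p)
            for p, seen in zip(emit[state], obs)
        )

    scores = [
        math.log(start[s]) + log_emission(s, observations[0])
        for s in range(n_states)
    ]
    back = []
    for obs in observations[1:]:
        new_scores, pointers = [], []
        for t in range(n_states):
            best = max(
                range(n_states),
                key=lambda s: scores[s] + math.log(trans[s][t]),
            )
            pointers.append(best)
            new_scores.append(
                scores[best] + math.log(trans[best][t]) + log_emission(t, obs)
            )
        scores = new_scores
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model: state 0 favors mark A, state 1 favors mark B; states are sticky.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.9, 0.1], [0.1, 0.9]]
# Third bin has no marks observed at all.
bins = [(1, 0), (1, 0), (0, 0), (1, 0), (0, 1), (0, 1)]
path = viterbi(bins, start, trans, emit)
```

In the toy run the empty third bin is equally likely under both states on its own, so the transition probabilities decide, and it inherits state 0 from its neighbors.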
Core Promoter States
State | Enriched category | Fold | Corrected p-value
State 2 | tRNA Metabolic | 4.4 | 0.003
State 3 | Cell Cycle | 2.7 | 2x10^-7
State 4 | Embryonic Development | 2.8 | 9x10^-23
State 5 | Chromatin | 2.2 | 2x10^-7
State 6 | Response to DNA Damage Stimulus | 2.1 | 2x10^-10
State 7 | RNA Processing | 2.6 | 9x10^-24
State 8 | T-cell Activation | 4.7 | 3x10^-7
[Figure axis: distance to TSS]
Different promoter states show distinct functional enrichment
Comparing chromatin states across cell types
[Figure: K562, HUVEC and NHEK compared pairwise; proportion of genome; pairwise state fold enrichments]
CTCF island state (State 9) highly stable across cell types
Comparing chromatin states across cell types
Top GO enrichment for TSS in active promoter state (1) in NHEK and HUVEC
GO category | P-value
biopolymer metabolic process | 1.60E-120
cellular biopolymer metabolic process | 6.60E-120
cellular metabolic process | 8.60E-120
cellular macromolecule metabolic process | 5.30E-119
macromolecule metabolic process | 5.50E-119
primary metabolic process | 3.40E-115
nucleic acid binding | 1.10E-105
RNA processing | 2.20E-99
nucleobase, nucleoside, nucleotide and nucleic acid metabolic process | 5.10E-93
Comparing chromatin states across cell types
Top GO enrichment for TSS in unmodified state (7) in NHEK and HUVEC
GO category | P-value
olfactory receptor activity | 3.70E-201
sensory perception of smell | 7.30E-175
sensory perception of chemical stimulus | 1.60E-170
Comparing chromatin states across cell types
GO enrichment for TSS in active promoter state (1) in NHEK and unmodified state (7) in HUVEC
GO category | P-value
ectoderm development | 2.90E-09
epidermis development | 1.80E-08
keratinocyte differentiation | 3.00E-06
tissue development | 3.20E-06
cell adhesion | 1.90E-05
Comparing chromatin states across cell types
GO enrichment for TSS in active promoter state (1) in HUVEC and unmodified state (7) in NHEK
GO category | P-value
blood vessel development | 2.60E-05
vasculature development | 3.00E-05
angiogenesis | 3.50E-05
blood vessel morphogenesis | 1.20E-04
Striking example: ~2,000 Large intergenic non‐coding RNAs (lincRNAs)
[Figure: H3K4me3 - H3K36me3 domains]
Our experiments confirm:
• These regions produce RNA molecules
• They have exon/intron structures
1. They are evolutionarily conserved
2. They show no coding potential, no evo. sign.
3. Their promoters and regulation are conserved
4. They play diverse roles in chromatin regulation
Mikkelsen et al. 2007
Guttman et al. Nature, Feb 2009
Combine chromatin signatures and regulatory motifs
New developmental enhancers in human and fly
• Chromatin signatures and evolutionary signatures are predictive of enhancer elements
• Experimental techniques developed for inferring expression domains in human and fly
• Large-scale databases mapping every element to its expression pattern emerge
• Ability to test new patterns and artificial elements in fly / mouse embryos
Zeitlinger et al, Genes & Development 2007
Visel, Pennacchio, Rubin, Ren, Nature 2008
Heintzman et al, Kellis, Ren, Nature 2008
Zeitlinger et al, Nature Genetics 2007
Regulatory motif discovery and function
Pouya Kheradpour, Alex Stark
Evolutionary signatures for regulatory motifs
[Figure: gene model with enhancers, promoters, 5’-UTR, exons, introns, 3’-UTRs; known engrailed site (footprint); species tree with D. mel, D. ere, D. ana, D. pse.]
D.mel CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.sim CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.sec CAGCT--AGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.yak CAGC--TAGCC-AACTCTCTAATTAGCGACTAAGTC-CAAGTC
D.ere CAGCGGTCGCCAAACTCTCTAATTAGCGACCAAGTC-CAAGTC
D.ana CACTAGTTCCTAGGCACTCTAATTAGCAAGTTAGTCTCTAGAG
• Individual motif instances are preferentially conserved
• Measure conservation across entire genome
– Over thousands of motif instances → increased discovery power
– Couple to rapid enumeration and rapid string search
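The "rapid string search" step needs a way to find every instance of a degenerate consensus in a genome. A minimal sketch, assuming consensus motifs are written with standard IUPAC ambiguity codes (as the consensus sequences in this talk are): translate the consensus into a regular expression and scan with a lookahead so overlapping instances are reported too. The function names are hypothetical, and this is not the enumeration method used in the cited papers.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regex that finds overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    # A zero-width lookahead lets finditer report overlapping matches.
    return re.compile("(?=(" + body + "))")


def find_instances(consensus, sequence):
    """Return the 0-based start of every motif instance on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Mef2-like consensus from the talk, searched in a made-up sequence.
hits = find_instances("YTAWWWWTAR", "GGCTAAAAATAGCC")
```

Conservation would then be measured per instance, for example by checking whether the aligned sequence in each other species also matches the consensus.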
De novo discovery of regulatory motifs
Kellis et al, Nature 2003
Xie et al, Nature 2005
Stark et al, Nature 2007
Power of evolutionary signatures for motif discovery
[Table; the trailing values fall under the slide's columns "Tissue specific target expression", "Promoters" and "Enhancers", and are listed as they survive in the transcript]
# | Consensus | MCS | Matches to known | Trailing values
1 | CTAATTAAA | 65.6 | engrailed (en) | 25.4 2
2 | TTKCAATTAA | 57.3 | reversed-polarity (repo) | 5.8 4.2
3 | WATTRATTK | 54.9 | araucan (ara) | 11.7 2.6
4 | AAATTTATGCK | 54.4 | paired (prd) | 4.5 16.5
5 | GCAATAAA | 51 | ventral veins lacking (vvl) | 13.2 0.3
6 | DTAATTTRYNR | 46.7 | Ultrabithorax (Ubx) | 16 3.3
7 | TGATTAAT | 45.7 | apterous (ap) | 7.1 1.7
8 | YMATTAAAA | 43.1 | abdominal A (abd-A) | 7 2.2
9 | AAACNNGTT | 41.2 | - | 20.1 4.3
10 | RATTKAATT | 40 | - | 3.9 0.7
11 | GCACGTGT | 39.5 | fushi tarazu (ftz) | 17.9
12 | AACASCTG | 38.8 | broad-Z3 (br-Z3) | 10.7
13 | AATTRMATTA | 38.2 | - | 19.5 1.2
14 | TATGCWAAT | 37.8 | - | 5.8 2
15 | TAATTATG | 37.5 | Antennapedia (Antp) | 14.1 5.4
16 | CATNAATCA | 36.9 | - | 1.8 1.7
17 | TTACATAA | 36.9 | - | 5.4
18 | RTAAATCAA | 36.3 | - | 3.2 2.8
19 | AATKNMATTT | 36 | - | 3.6 0
20 | ATGTCAAHT | 35.6 | - | 2.4 4.6
21 | ATAAAYAAA | 35.5 | - | 57.2 -0.5
22 | YYAATCAAA | 33.9 | - | 5.3 0.6
23 | WTTTTATG | 33.8 | Abdominal B (Abd-B) | 6.3 6
24 | TTTYMATTA | 33.6 | extradenticle (exd) | 6.7 1.7
25 | TGTMAATA | 33.2 | - | 8.9 1.6
26 | TAAYGAG | 33.1 | - | 4.7 2.7
27 | AAAKTGA | 32.9 | - | 7.6 0.3
28 | AAANNAAA | 32.9 | - | 449.7 0.8
29 | RTAAWTTAT | 32.9 | gooseberry-neuro (gsb-n) | 11 0.8
30 | TTATTTAYR | 32.9 | Deformed (Dfd) | 30.7
Ability to discover full dictionary of regulatory motifs de novo
Stark et al, Nature, 2007
Diverse lines of evidence to characterize novel motifs
1. Clustering of motif occurrences upstream of candidate genes
2. Tissue enrichment and avoidance
3. Positional constraints
[Figure labels: core promoter element (initiator); downstream promoter element; motif avoidance patterns; depletion; typical regulator (transcription factor)]
Functional clusters emerge
Recognizing functional motifs within coding exons
• Challenge: overlapping selective pressures
– Distinguish RNA-level motifs from protein-level motifs
– The two have distinct evolutionary characteristics
• Solution: frame-specific conservation
– Evaluate each reading-frame offset separately
– Motifs due to di-codon biases: only one frame
– Motifs due to RNA-level selection: all three frames
• Result: miRNA motifs in coding exons
– Top 20 motifs: 11 miRNA seeds (vs 11 in 200+)
• Conclusion: miRNA targeting in coding exons
– Specific selection for RNA function
• Similar to 3’UTR targeting
– Conservation profile of 7-mers
– Coding & 3’UTR show corr. > 0.9
Stark et al, Nature, 2007
Sequence determinants of TF binding
• Hundreds of proteins bind overlapping regions
• Regulatory motif analysis reveals sequence specificity
• Basis for understanding motif combinations & grammars
Example: insulator proteins in Drosophila
– CTCF, check
– GAF, check
– Su(Hw), check
– BEAF-32, variant
– CP190, novel
– Mod(mdg4), novel
Although insulator-bound regions overlap, each motif is specific to exactly one protein
Reliable target identification
Pouya Kheradpour, Alex Stark
Evolutionary signatures of individual motif instances
• Allow for motif movements
– Sequencing/alignment errors
– Loss, movement, divergence
• Measure branch-length score
– Sum evidence along branches
– Close species: little contribution
[Figure: Mef2 motif YTAWWWWTAR; example instances with BLS 25% and BLS 83%]
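A branch-length score can be sketched as the fraction of the total tree length covered by the smallest subtree connecting the species in which the motif instance is found. This is a simplified reading of "sum evidence along branches": it ignores the motif-movement tolerance mentioned above, and the tree encoding (nested `(label_or_children, branch_length)` tuples), the toy branch lengths and the function names are all assumptions for illustration.

```python
def _walk(node, present, edges):
    """Record (branch_length, n_present_below) for every branch.

    A node is (name, length) for a leaf or (list_of_children, length) for an
    internal node. Returns the number of `present` species below the node.
    """
    label, length = node
    if isinstance(label, str):
        n = 1 if label in present else 0
    else:
        n = sum(_walk(child, present, edges) for child in label)
    edges.append((length, n))
    return n


def branch_length_score(tree, present):
    """Fraction of total branch length spanned by the species in `present`."""
    edges = []
    n_present = _walk(tree, present, edges)
    total = sum(length for length, _ in edges)
    # A branch lies on the connecting subtree when it separates the present
    # species into two non-empty groups.
    spanned = sum(length for length, n in edges if 0 < n < n_present)
    return spanned / total if total > 0 else 0.0


# Toy tree: two close species (mel, sim) and one distant species (ana).
tree = ([([("mel", 0.1), ("sim", 0.1)], 0.2), ("ana", 0.6)], 0.0)
close_pair = branch_length_score(tree, {"mel", "sim"})
distant_pair = branch_length_score(tree, {"mel", "ana"})
```

On the toy tree, conservation between the two close species covers only a fifth of the tree, while conservation between distant species covers most of it, which is the sense in which close species contribute little.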
Motif confidence selects functional instances
[Figure panels, each plotted against increasing BLS / increasing confidence]
Transcription factor motifs
– Confidence selects functional regions
– Confidence selects in vivo bound sites
– High sensitivity
microRNA motifs
– Confidence selects functional regions
– Confidence selects positive strand
Kheradpour et al, Genome Research 2007
ChIP vs. conservation: similar power / complement
[Figure; data: Sandmann 2006 and 2007, Zeitlinger 2007]
ChIP vs. conservation:
– Similar functional enrichment
Amidst ChIP-bound regions:
– Subset with conserved motifs: best
– Subset lacking cons. motifs: worse
Conservation selects relevant targets
– Even for motifs outside ChIP
ChIP-grade regulatory network
Kheradpour et al, Genome Research 2007
Initial regulatory network for an animal genome
• ChIP-grade quality
– Similar functional enrichment
– High sens., high spec.
• Systems-level
– 81% of transc. factors
– 86% of microRNAs
– 8k + 2k targets
– 46k connections
• Lessons learned
– Pre- and post- are correlated (hihi/lolo)
– Regulators are heavily targeted, feedback loop
Kheradpour et al, Genome Research, 2007
Sushmita Roy
Network captures literature-supported connections
Kheradpour et al, Genome Research, 2007
Sushmita Roy
Network captures co-expression supported edges
[Figure legend: red = co-expressed; grey = not co-expressed; named / bold = literature-supported]
46% of edges are supported (P=10^-3)
Kheradpour et al, Genome Research 2007
Sushmita Roy
Motif role in chromatin dynamics
Pouya Kheradpour
Motif dynamics in Drosophila development
12 developmental time points
[Figure: stage t - 1, stage t, stage t + 1, with “New” regions and “Old” regions]
8 antibodies + gene expression
– H3K4me1: Enhancers
– H3K4me3: Promoters/enhancers
– H3K27ac: Activation
– H3K9ac: Activation
– H3K27me3: Repression
– H3K9me3: Heterochromatin
– Pol 2: Transcription/promoters
– CBP: HAT, enhancers
– Total RNA: Expression
For each of the antibodies and time points, we define three types of regions:
1. “Bound”: all the regions bound
2. “New”: regions that were not bound in the previous time, but now are
3. “Old”: regions that are bound at the current time, but won’t be at the next
Kevin White, Nicolas Nègre, Parantu Shah, Carolyn Morrison
Examples of enrichment following expression
[Figure: H3K27me3; fold enrichment or over-expression]
• abd-A motif is enriched in new H3K27me3 regions at L2
– Coincides with a drop in the expression of abd-A
– Model: sites gain H3K27me3 as abd-A binding lost
• Additional intriguing stories found, to be explored
Motifs and chromatin dynamics in Human
4 human cell types
– HUVEC: Umbilical vein endothelial
– NHEK: Keratinocytes
– GM12878: Lymphoblastoid
– K562: Myelogenous leukemia
11 antibodies + gene expression
– H3K4me1: Enhancers
– H3K4me2: Promoters/enhancers
– H3K4me3: Poised/active promoters
– H3K27ac: Activation
– H3K9ac: Activation
– H3K27me3: Repression
– H3K9me1: Activation
– H4K20me1: Activation
– H3K36me3: Transcription
– Pol 2: Transcription/promoters
– CTCF: Insulators
– RNA: Expression
[Figure: bound, unique and missing regions across the four cell types]
For each of the antibodies and cell types, we define three types of regions:
1. “Bound”: Bound in that cell type
2. “Unique”: Bound only in that cell type
3. “Missing”: Bound in all other cell types
Brad Bernstein, Tarjei Mikkelsen, Mitch Guttman, Charles Epstein, Noam Shoresh
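The three region types are plain set operations once bound regions are represented as sets of region identifiers (for example fixed genomic bins). The sketch below assumes that representation and reads "Missing" as bound in every other cell type but not in this one; the function name and the toy region IDs are made up for illustration.

```python
def classify_regions(bound_by_cell_type, cell_type):
    """Split the regions for one antibody into bound / unique / missing.

    bound_by_cell_type: dict mapping cell type name -> set of region IDs
    cell_type:          the cell type to classify against all the others
    """
    bound = bound_by_cell_type[cell_type]
    others = [
        regions
        for name, regions in bound_by_cell_type.items()
        if name != cell_type
    ]
    in_any_other = set().union(*others)
    in_all_others = set.intersection(*others) if others else set()
    return {
        "bound": set(bound),
        "unique": bound - in_any_other,
        "missing": in_all_others - bound,
    }


# Toy data for a single mark; integers stand in for genomic bins.
h3k4me2 = {
    "GM12878": {1, 2, 3},
    "K562": {2, 3, 4},
    "HUVEC": {3, 4, 5},
    "NHEK": {3, 4, 6},
}
gm = classify_regions(h3k4me2, "GM12878")
```

Motif enrichment would then be computed separately within each of the three region sets.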
Example: NF-κB a likely regulator of GM12878
[Figure: fold enrichment or over-expression of the NF-κB motif, for active marks and the repressive mark]
• The NF-κB motif is enriched in H3K4me2 regions found uniquely in GM12878 cells
• It is likewise enriched in the uniquely bound regions for other active marks
• Conversely, it is enriched in the uniquely unbound regions for the repressive mark H3K27me3
• We find that NF-κB is also over-expressed in GM12878, suggesting a causative explanation
Marks associated with activation
[Figure axis: activator association]
• By correlating the expression and enrichments of activating factors, we can rank each chromatin mark by its “Activator association”
• Correlation follows the expected trend
– H3K4me2, H3K4me3, H3K9ac and H3K27ac associated with activation
– H3K27me3 anti-correlated with activation
The grand challenge ahead
Annotations & images for all expression patterns
Binding sites of every developmental regulator
Sequence motifs for every regulator
– CTCF, check
– GAF, check
– Su(Hw), check
– BEAF-32, variant
– CP190, novel
– Mod(mdg4), novel
Expression domain primitives reveal underlying logic
[Figure axes: Dorsal-Ventral; Anterior-Posterior]
Understand regulatory logic specifying development
Summary: Regulatory genomics of flies and men
• Evolutionary signatures
– Systematic annotation of proteins, RNAs, miRNAs, motifs
– Reveal unusual genes, RNA structures, stop read-through
• Regulatory motifs
– Distinct motif sets in promoter vs. enhancer regions
– Unique signatures for exonic motifs, miRNA targets
• Epigenomics
– Fly: Integration of AP, DV developmental processes
– Human: Enhancer signatures, enhancer-specific motifs
• microRNAs
– Functionality of miRNA* and anti-sense miRNAs
– Implications for Hox cluster regulation: miR-10, miR-iab-4
• Targets
– Global, reliable identification of TF and miRNA targets
– Biochemically-active & selectively-neutral = non-functional?
Acknowledgements
Alex Stark, Mike Lin, Jason Ernst, Julia Zeitlinger, Pouya Kheradpour
12-flies: Andy Clark, Mike Eisen, Bill Gelbart, Doug Smith
Readthru: FlyBase, Bill Gelbart, Robert Reenan
miRNAs: Julius Brennecke, Graham Ruby, Greg Hannon, David Bartel
iab-4AS: Natascha Bushati, Julius Brennecke, Steve Cohen, Greg Hannon
TF binding: Julia Zeitlinger, Robert Zinzen, Mike Levine, Rick Young
ENCODE: Kevin White, Bing Ren, Jim Posakony, Brad Bernstein