The Segway annotation
of ENCODE data
Michael M. Hoffman
Department of Genome Sciences
University of Washington
Overview
1. ENCODE Project
2. Semi-automated genomic annotation
3. Chromatin
4. RNA-seq
Functional genomics
ENCODE Project Consortium 2011. PLoS Biol 9:e1001046.
Chromatin immunoprecipitation (ChIP)
Park PJ 2009. Nat Rev Genet 10:669.
ChIP-seq signal: Wiggler
• Extends tags in strand direction
• Extension length determined by cross-correlation peak
• Signal only in mappable regions
• 1-bp resolution
Anshul Kundaje
http://align2rawsignal.googlecode.com/
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
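The tag-extension idea in the bullets above can be sketched as follows. This is a simplified illustration, not Wiggler's implementation: the function name `extended_read_signal` and its parameters are hypothetical, the extension length is passed in rather than estimated from the cross-correlation peak, and the mappability filter is omitted.

```python
import numpy as np

def extended_read_signal(starts, strands, read_len, ext_len, chrom_len):
    """Count extended reads covering each base of one chromosome.

    starts: 0-based leftmost aligned coordinate of each read (assumed input format).
    strands: "+" or "-" for each read.
    """
    # Difference array: +1 where a fragment begins, -1 just past its end.
    diff = np.zeros(chrom_len + 1, dtype=np.int64)
    for start, strand in zip(starts, strands):
        if strand == "+":
            # Plus-strand tag: extend rightwards from its 5' end.
            lo, hi = start, start + ext_len
        else:
            # Minus-strand tag: its 5' end is the rightmost base, extend leftwards.
            hi = start + read_len
            lo = hi - ext_len
        lo, hi = max(lo, 0), min(hi, chrom_len)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    # The cumulative sum turns the boundaries into per-base counts (1-bp resolution).
    return np.cumsum(diff[:-1])

signal = extended_read_signal([2, 5, 10], ["+", "+", "-"],
                              read_len=4, ext_len=6, chrom_len=20)
```

The difference-array approach keeps the cost proportional to the number of reads plus the chromosome length, rather than to reads times extension length.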
Fine-scale data
[Figure: signal tracks over a 300 bp window, in extended reads per base. Histone modifications: H3K4me2, H3K27me3. Transcription factors: Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1.]
2685 data sets
Maher B 2012. Nature 489:46.
Now what?
Overview
1. ENCODE Project
2. Semi-automated genomic annotation
3. Chromatin
4. RNA-seq
Semi-automated annotation
signal tracks
interpretation
visualization
annotation
pattern discovery
Genomic segmentation
Nonoverlapping segments
Finite number of labels
Maximize similarity in labels
[Figure: example segmentations in which each segment carries one of the labels 0, 1, or 2.]
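As a minimal sketch of what a segmentation looks like as data, a per-position label sequence can be collapsed into nonoverlapping segments drawn from a finite label set. The function name `labels_to_segments` and the example labels are illustrative assumptions, not taken from the talk.

```python
from itertools import groupby

def labels_to_segments(labels):
    """Collapse per-position labels into (start, end, label) segments, end exclusive."""
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((pos, pos + length, label))
        pos += length
    return segments

# Hypothetical labels for nine consecutive positions.
segments = labels_to_segments([0, 0, 1, 1, 1, 0, 2, 2, 1])
```

Because each segment ends exactly where the next begins, the segments cover every position once and never overlap.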
Bayesian network for ChIP-seq
Qt: hidden random variable (discrete). Transcription factor (TF) present at position t?
0: transcription factor is not present
1: transcription factor is present
Xt: observed random variable (continuous). Signal at position t.
Conditional relationship: Xt depends on Qt.
Emission probability parameters: µ0, σ0 (for Qt = 0) and µ1, σ1 (for Qt = 1)
Xt | Qt = 0 ~ N(µ0, σ0)
Xt | Qt = 1 ~ N(µ1, σ1)
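The single-position model above can be sketched in a few lines. This assumes that σ denotes a standard deviation and adds a prior probability `prior1` for Qt = 1, which the slide does not specify; the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma taken as a standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_present(x, mu0, sigma0, mu1, sigma1, prior1=0.5):
    """P(Qt = 1 | Xt = x) by Bayes' rule; prior1 is an assumed prior, not from the slides."""
    joint0 = normal_pdf(x, mu0, sigma0) * (1.0 - prior1)
    joint1 = normal_pdf(x, mu1, sigma1) * prior1
    return joint1 / (joint0 + joint1)
```

With equal standard deviations and an even prior, a signal halfway between µ0 and µ1 gives a posterior of 0.5, and the posterior moves toward 1 as the signal approaches µ1.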
Bayesian network: 2 positions
Hidden random variables (discrete): Qt, Qt+1
Observed random variables (continuous): Xt, Xt+1
Conditional relationships: Xt depends on Qt, Xt+1 depends on Qt+1, and Qt+1 depends on Qt.
Emission probability parameters at both positions: µ0, σ0, µ1, σ1
Transition probability parameters: 00, 01, 10, 11
P(Qt+1 = 0 | Qt = 0) = 0.99
P(Qt+1 = 1 | Qt = 0) = 0.01
P(Qt+1 = 0 | Qt = 1) = 0.01
P(Qt+1 = 1 | Qt = 1) = 0.99
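The transition probabilities above form a 2 × 2 matrix whose rows each sum to 1. A short sketch shows what a self-transition probability of 0.99 implies for run lengths; the helper name `expected_run_length` is illustrative.

```python
import numpy as np

# Rows: current state Qt; columns: next state Qt+1.
transition = np.array([[0.99, 0.01],
                       [0.01, 0.99]])

def expected_run_length(transition, state):
    """Mean number of consecutive positions spent in `state` before leaving it.

    Run lengths are geometric, so the mean is 1 / (1 - P(stay)).
    """
    return 1.0 / (1.0 - transition[state, state])

rows_sum_to_one = bool(np.allclose(transition.sum(axis=1), 1.0))
mean_run = expected_run_length(transition, 0)
```

With these values a state persists for about 100 positions on average, which is why such a model favors contiguous runs over position-by-position flips.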
Dynamic Bayesian network (DBN)
Hidden random variables (discrete): Qt, Qt+1, Qt+2
Observed random variables (continuous): Xt, Xt+1, Xt+2
Emission probability parameters at every position: µ0, σ0, µ1, σ1
Transition probability parameters between adjacent positions: 00, 01, 10, 11
The same unit (Q, X, with its emission and transition parameters) repeats at each position.
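One standard way to recover the most probable hidden sequence in a chain like this is Viterbi decoding. The sketch below assumes the two-state Gaussian model from the preceding slides and an initial state distribution `initial`, which the slides do not give. It is a generic textbook decoder, not Segway's inference code.

```python
import numpy as np

def viterbi(x, mu, sigma, transition, initial):
    """Most probable state path for a Gaussian-emission chain, in log space."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    n_pos, n_states = len(x), len(mu)
    # log_emit[t, k] = log N(x[t]; mu[k], sigma[k])
    log_emit = (-0.5 * ((x[:, None] - mu) / sigma) ** 2
                - np.log(sigma * np.sqrt(2.0 * np.pi)))
    log_trans = np.log(transition)
    score = np.log(initial) + log_emit[0]
    backptr = np.zeros((n_pos, n_states), dtype=int)
    for t in range(1, n_pos):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        cand = score[:, None] + log_trans
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace back from the best final state.
    path = np.empty(n_pos, dtype=int)
    path[-1] = score.argmax()
    for t in range(n_pos - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path

transition = np.array([[0.99, 0.01], [0.01, 0.99]])
path = viterbi([0.1, -0.2, 0.0, 5.1, 4.9, 5.2, 0.1],
               mu=[0.0, 5.0], sigma=[1.0, 1.0],
               transition=transition, initial=[0.5, 0.5])
```

The sticky transition probabilities mean a sustained rise in signal changes the label, while an isolated moderate bump does not pay for the two transitions it would require.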
Dynamic BN for segmentation
Hidden random variable (discrete): segment label
Observed random variables (continuous): CTCF, H3K36me3, DNaseI
Conditional relationship: each track depends on the segment label.
Parameters: transition probability for the segment label, emission probability for each track
Heterogeneous missing data
Hoffman MM et al. 2012. Nat Methods 9:473.
Handling missing data
Hidden random variable (discrete): segment label
Observed random variables (continuous): CTCF, H3K36me3, DNaseI
Indicator variables (discrete, 1 or 0): present(CTCF), present(H3K36me3), present(DNaseI)
Conditional relationship: each track depends on the segment label.
Switching relationship: present(track) switches that track's emission on (1) or off (0) at each position.
Parameters: transition probability for the segment label, emission probability (µ0, σ0, µ1, σ1) for each track
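The switching idea above can be sketched as a per-position emission score that simply skips absent tracks. This assumes Gaussian emissions that are independent across tracks given the segment label; the function name `log_emission` and the array layout are illustrative, not Segway's API.

```python
import numpy as np

def log_emission(obs, present, mu, sigma):
    """Log-likelihood of one position under each label, ignoring absent tracks.

    obs:     (n_tracks,) signal values; entries may be NaN where data are missing.
    present: (n_tracks,) 1 where the track has data at this position, else 0.
    mu, sigma: (n_labels, n_tracks) Gaussian parameters per label and track.
    """
    present = np.asarray(present, dtype=bool)
    # Replace missing values with a placeholder so no NaN enters the arithmetic.
    obs = np.where(present, np.asarray(obs, dtype=float), 0.0)
    per_track = (-0.5 * ((obs - mu) / sigma) ** 2
                 - np.log(sigma * np.sqrt(2.0 * np.pi)))
    # Absent tracks contribute 0 to the log-likelihood (a factor of 1).
    return np.where(present, per_track, 0.0).sum(axis=1)

mu = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]])
sigma = np.ones((2, 3))
obs = np.array([2.0, 1.9, np.nan])   # third track missing at this position
present = np.array([1, 1, 0])
scores = log_emission(obs, present, mu, sigma)
```

Dropping a track this way gives the same score as a model that never contained the track, so positions with different subsets of available data remain comparable across labels.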
Length distribution
Variables: segment label, segment countdown, segment transition, ruler, frame index; tracks CTCF, H3K36me3, DNaseI with present(CTCF), present(H3K36me3), present(DNaseI)
• Minimum segment length
• Maximum segment length
• Trained geometric length distribution
• Dirichlet prior on segment length
• Weight of prior versus observed data
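To illustrate how minimum and maximum lengths reshape a geometric length distribution, the sketch below computes a geometric distribution truncated to a length range. It is an assumed simplification for intuition only: the name `truncated_geometric_pmf` and the parameter values are hypothetical, and it models neither the countdown mechanism nor the Dirichlet prior.

```python
import numpy as np

def truncated_geometric_pmf(p_stay, min_len, max_len):
    """Probability of each segment length in [min_len, max_len].

    Beyond min_len, each extra position is kept with probability p_stay;
    the weights are renormalized so they sum to 1 over the allowed range.
    """
    lengths = np.arange(min_len, max_len + 1)
    weights = p_stay ** (lengths - min_len) * (1.0 - p_stay)
    return lengths, weights / weights.sum()

lengths, pmf = truncated_geometric_pmf(p_stay=0.99, min_len=5, max_len=500)
```

Lengths below the minimum or above the maximum get zero probability, and within the range shorter segments remain more likely than longer ones.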
Segway
A way to segment the genome
http://noble.gs.washington.edu/proj/segway/
Hoffman MM et al. 2012. Nat Methods 9:473.
Overview
1. ENCODE Project
2. Semi-automated genomic annotation
3. Chromatin
4. RNA-seq
[Figure: developmental lineage of the cell types shown. Lineage nodes: embryoblast, mesendoderm, endoderm, mesoderm, intermediate mesoderm, lateral mesoderm, hemangioblast, blood vessel endothelium, hemocytoblast, myeloid progenitor, lymphoid progenitor, lymphoblast, liver, cervix. Cell types: H1 hESC embryonic stem cell; HeLa-S3 cervical carcinoma cell; HepG2 hepatocellular carcinoma cell; HUVEC umbilical vein endothelial cell; K562 chronic myeloid leukemia cell; GM12878 lymphoblastoid cell.]
Input tracks
49 tracks
• ENCODE K562: ChIP-seq, DNase-seq, FAIRE-seq
• 8 different labs
Picking the number of labels
25 labels (0 to 24)
Emission parameters
Each cell represents a Gaussian. Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue. Standard deviation is proportional to the length of the black bar.
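The row normalization described above can be sketched as min-max scaling of each track's means across labels. This assumes rows are tracks and columns are labels, and that the color scale runs linearly from the lowest mean (0, blue) to the highest (1, red); the function name `row_normalize` is illustrative.

```python
import numpy as np

def row_normalize(means):
    """Scale each row (track) so its lowest mean maps to 0 and its highest to 1."""
    means = np.asarray(means, dtype=float)
    lo = means.min(axis=1, keepdims=True)
    hi = means.max(axis=1, keepdims=True)
    # Guard against a constant row, which would otherwise divide by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (means - lo) / span

# Hypothetical means for 2 tracks (rows) across 3 labels (columns).
scaled = row_normalize([[0.5, 2.5, 1.5],
                        [10.0, 30.0, 20.0]])
```

Normalizing per track makes tracks with very different signal ranges comparable in one heat map, at the cost of hiding their absolute scales.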
TSS transcription start site
GS gene start
GM gene middle
GE gene end
E enhancer
I insulator
R repression
D dead
Transcription start site (TSS)
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
Rediscovering genes
Zooming out 10×
TSS segments occur near 5' ends of genes
TSS/G* segments missing in gene deserts
R*/D* segments occur more in gene deserts
3' gene ends
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
Jason Ernst
A puzzling region
Lots of genes but very few TSS/GS segments. Why?
Because these genes are not expressed in K562.
Experimental validation
http://switchgeargenomics.com/products/promoter-reporter-collection/
Testing <1000 bp sequences for promoter activity
• predicted + in K562
• predicted – in K562
• predicted + in GM12878
• predicted – in GM12878
Luciferase assay results
Hoffman MM et al. 2012. Nat Methods 9:473.
Comparison with GWAS catalog
Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
Bob Harris, Ross Hardison
Summary of results
Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:
• A simple annotation with a single label for each part of the genome.
• Visualization reducing multivariate data to a comprehensible representation.
• Interpretation of the context and potential regulatory impact of variants.
Software availability
• Segway: from data tracks to segmentation
Hoffman MM et al. 2012. Nat Methods 9:473.
http://noble.gs.washington.edu/proj/segway/
• Segtools: from segmentation to plots and summary statistics
Buske OJ et al. 2011. BMC Bioinformatics 12:415.
http://noble.gs.washington.edu/proj/segtools/
• Genomedata: efficient access to numeric data anchored to genome
Hoffman MM et al. 2010. Bioinformatics 26:1458.
http://noble.gs.washington.edu/proj/genomedata/
Acknowledgments
University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.
University of Massachusetts Medical School: Zhiping Weng.
SwitchGear Genomics: Patrick Collins.
Stanford University: Anshul Kundaje.
Pennsylvania State University: Ross Hardison, Bob Harris.
European Bioinformatics Institute: Ewan Birney, Ian Dunham.
University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.
Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.
CRG: Sarah Djebali.
RIKEN: Timo Lassmann.
ENCODE Project Consortium.
NIH/NHGRI: K99HG006259, U54HG004695.
Bill Noble, Jeff Bilmes, Orion Buske, Paul Ellenbogen