The Segway annotation of ENCODE data

Michael M. Hoffman

Department of Genome Sciences

University of Washington

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

Functional genomics

ENCODE Project Consortium 2011. PLoS Biol 9:e1001046.

Chromatin immunoprecipitation (ChIP)

Park PJ 2009. Nat Rev Genet 10:669.

ChIP-seq signal: Wiggler

• Extends tags in strand direction

• Extension length determined by cross-correlation peak

• Signal only in mappable regions

• 1-bp resolution
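The tag-extension step in the list above can be illustrated with a minimal sketch. This is not Wiggler's implementation: the function name `extended_read_signal` and its arguments are hypothetical, the extension length is assumed to be given (Wiggler estimates it from the cross-correlation peak), and mappability filtering is left out.

```python
import numpy as np

def extended_read_signal(starts, strands, extension, chrom_length):
    """Count extended reads per base at 1-bp resolution.

    starts: 0-based position of the 5' end of each tag.
    strands: '+' or '-' for each tag.
    extension: assumed fragment length (hypothetical input here).
    """
    # Difference array: +1 where an extended read begins, -1 just past its end.
    diff = np.zeros(chrom_length + 1, dtype=np.int64)
    for start, strand in zip(starts, strands):
        if strand == "+":
            lo, hi = start, start + extension        # extend downstream
        else:
            lo, hi = start - extension + 1, start + 1  # extend upstream
        lo, hi = max(lo, 0), min(hi, chrom_length)
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    # The running sum gives the number of extended reads covering each base.
    return np.cumsum(diff[:-1])

signal = extended_read_signal([2, 9], ["+", "-"], extension=4, chrom_length=12)
print(signal.tolist())
```

A forward-strand tag is extended to the right of its 5' end and a reverse-strand tag to the left, so both contribute signal over the fragment they came from.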

Anshul Kundaje

http://align2rawsignal.googlecode.com/

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Fine-scale data

[Figure: signal tracks over a 300 bp window, in extended reads per base. Tracks shown: H3K4me2, H3K27me3, Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1, grouped as histone modifications and transcription factors.]

Maher B 2012. Nature 489:46.

2685 data sets


Now what?

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

Semi-automated annotation

[Diagram: signal tracks, pattern discovery, annotation, visualization, interpretation]

Genomic segmentation

Nonoverlapping segments

Finite number of labels

Maximize similarity in labels

[Figure: example segmentations of a region using labels 0, 1, and 2]
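As a small illustration of what a segmentation is, the sketch below turns a per-position label sequence into nonoverlapping segments drawn from a finite label set. The function name `labels_to_segments` and the example labels are hypothetical and not taken from Segway.

```python
from itertools import groupby

def labels_to_segments(labels):
    """Collapse per-position labels into (start, end, label) segments.

    Coordinates are 0-based and half-open, so consecutive segments
    tile the region without overlapping.
    """
    segments = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((position, position + length, label))
        position += length
    return segments

segments = labels_to_segments([0, 0, 1, 1, 1, 0, 2, 2])
print(segments)
```

Each position belongs to exactly one segment, and the same label can be reused by segments that are far apart in the genome.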

Bayesian network for ChIP-seq

Xt: observed random variable (continuous), the signal at position t

Qt: hidden random variable (discrete), is the transcription factor (TF) present at position t?

0: transcription factor is not present

1: transcription factor is present

Emission probability parameters (the conditional relationship between Qt and Xt): µ0, σ0 and µ1, σ1

(Xt | Qt = 0) ~ N(µ0, σ0)

(Xt | Qt = 1) ~ N(µ1, σ1)
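To make the emission model concrete, here is a minimal sketch that applies Bayes' rule at a single position to get the probability that the transcription factor is present given the observed signal. The parameter values and the prior `prior1` are assumptions chosen for illustration; they do not come from the slides.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posterior_present(x, mu0, sigma0, mu1, sigma1, prior1=0.5):
    """P(Qt = 1 | Xt = x) for one position, ignoring neighbouring positions."""
    joint0 = normal_pdf(x, mu0, sigma0) * (1.0 - prior1)
    joint1 = normal_pdf(x, mu1, sigma1) * prior1
    return joint1 / (joint0 + joint1)

# Hypothetical parameters: background signal near 0, bound signal near 4.
for x in (0.0, 2.0, 4.0):
    print(x, posterior_present(x, mu0=0.0, sigma0=1.0, mu1=4.0, sigma1=1.0))
```

With equal standard deviations and an even prior, a signal exactly halfway between the two means is equally likely under either state.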

Bayesian network: 2 positions

[Diagram: Qt and Xt at position t, Qt+1 and Xt+1 at position t+1. Each observation uses the emission parameters µ0, σ0, µ1, σ1. Qt and Qt+1 are linked by transition probability parameters 00, 01, 10, 11.]

Transition probability parameters:

P(Qt+1 = 0 | Qt = 0) = 0.99

P(Qt+1 = 1 | Qt = 0) = 0.01

P(Qt+1 = 0 | Qt = 1) = 0.01

P(Qt+1 = 1 | Qt = 1) = 0.99
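One way to see how the transition probabilities tie neighbouring positions together is to decode the most probable label sequence with the Viterbi algorithm. The sketch below uses the 0.99/0.01 transition values from the slide; the Gaussian parameters, the uniform initial distribution, and the toy signal are assumptions, and this is a plain two-state hidden Markov model rather than Segway's full model.

```python
import numpy as np

def viterbi(signal, means, sds, transition, initial):
    """Most probable hidden label sequence under Gaussian emissions."""
    signal = np.asarray(signal, dtype=float)
    means = np.asarray(means, dtype=float)
    sds = np.asarray(sds, dtype=float)
    # Log density of each observation (rows) under each label (columns).
    z = (signal[:, None] - means[None, :]) / sds[None, :]
    log_emit = -0.5 * z**2 - np.log(sds[None, :] * np.sqrt(2.0 * np.pi))
    log_trans = np.log(transition)
    score = np.log(initial) + log_emit[0]
    backpointers = np.zeros((len(signal), len(means)), dtype=int)
    for t in range(1, len(signal)):
        # candidates[i, j]: best score ending in label i, then moving to j.
        candidates = score[:, None] + log_trans
        backpointers[t] = np.argmax(candidates, axis=0)
        score = candidates.max(axis=0) + log_emit[t]
    # Trace back from the best final label.
    path = [int(np.argmax(score))]
    for t in range(len(signal) - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]

transition = np.array([[0.99, 0.01],
                       [0.01, 0.99]])
path = viterbi([0.1, -0.2, 0.3, 4.1, 3.8, 4.3, 0.5, 0.0],
               means=[0.0, 4.0], sds=[1.0, 1.0],
               transition=transition, initial=[0.5, 0.5])
print(path)  # -> [0, 0, 0, 1, 1, 1, 0, 0]
```

Because switching labels costs a factor of 0.01, a single moderately high value is not enough to change the label; a sustained run of high signal is.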

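Dynamic Bayesian network (DBN)

[Diagram: the two-position network repeated over positions t, t+1, t+2, and so on, then drawn compactly as a single Q and X. The emission parameters (µ0, σ0, µ1, σ1) and transition parameters (00, 01, 10, 11) are the same at every position.]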

Dynamic BN for segmentation

[Diagram: a discrete segment label at each position, with continuous observed tracks CTCF, H3K36me3, and DNaseI.]

Heterogeneous missing data

Hoffman MM et al. 2012. Nat Methods 9:473.

Handling missing data

[Diagram: the segment label with observed tracks CTCF, H3K36me3, and DNaseI. Each track has a switching parent, present(CTCF), present(H3K36me3), and present(DNaseI), taking the value 1 or 0. Legend: discrete, continuous, switching, conditional.]
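The role of the present(track) variables can be sketched as follows: when a track has no data at a position, its emission term is switched off, so the position is scored using only the tracks that are present. This is an illustration of the idea under that assumption, not Segway's code; `log_emission`, the use of `None` for missing values, and all parameter values are hypothetical.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def log_emission(observations, params):
    """Log probability of one position's observations under one segment label.

    observations: dict mapping track name to a value, or None if missing.
    params: dict mapping track name to (mu, sigma) for this label.
    """
    total = 0.0
    for track, value in observations.items():
        present = value is not None  # plays the role of present(track)
        if present:
            mu, sigma = params[track]
            total += log_normal_pdf(value, mu, sigma)
        # A missing track contributes log(1) = 0, so it neither favours
        # nor penalizes any label.
    return total

# Hypothetical emission parameters for a single label.
params = {"CTCF": (3.0, 1.0), "H3K36me3": (0.5, 1.0), "DNaseI": (2.0, 1.5)}
full = log_emission({"CTCF": 2.8, "H3K36me3": 0.4, "DNaseI": 1.0}, params)
partial = log_emission({"CTCF": 2.8, "H3K36me3": 0.4, "DNaseI": None}, params)
print(full, partial)
```

Treating missing data this way avoids having to impute values or discard positions where only some tracks are defined.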

Length distribution

[Diagram: the segmentation network (segment label; CTCF, H3K36me3, DNaseI; present(CTCF), present(H3K36me3), present(DNaseI)) extended with segment countdown, segment transition, ruler, and frame index variables.]

• Minimum segment length

• Maximum segment length

• Trained geometric length distribution

• Dirichlet prior on segment length

• Weight of prior versus observed data
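The geometric length distribution in the list above follows from a fixed per-position probability of staying in the same label. The sketch below works through that arithmetic and shows one simple way a minimum segment length could shift the distribution. The shifting scheme and the function names are assumptions for illustration, not a description of how Segway enforces its length constraints.

```python
def geometric_length_pmf(length, stay_prob, min_length=1):
    """P(segment length = length) when, after the first min_length positions,
    the segment continues with probability stay_prob at each step."""
    if length < min_length:
        return 0.0
    return stay_prob ** (length - min_length) * (1.0 - stay_prob)

def expected_length(stay_prob, min_length=1):
    """Mean of the shifted geometric distribution above."""
    return min_length - 1 + 1.0 / (1.0 - stay_prob)

# A self-transition probability of 0.99 gives segments averaging about 100 bp.
print(expected_length(0.99))
# Adding a hypothetical minimum length of 50 bp shifts the mean by 49 bp.
print(expected_length(0.99, min_length=50))
```

The self-transition probability alone fixes the typical segment size, which is why additional constraints such as minimum and maximum lengths and a prior are useful for controlling the scale of the segmentation.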

Segway

A way to segment the genome

http://noble.gs.washington.edu/proj/segway/


Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

[Figure: developmental lineage of the cell types studied. Nodes shown: embryoblast, mesendoderm, endoderm, mesoderm, intermediate mesoderm, lateral mesoderm, hemangioblast, blood vessel endothelium, hemocytoblast, myeloid progenitor, lymphoid progenitor, liver, cervix, lymphoblast.]

H1 hESC: embryonic stem cell

HeLa-S3: cervical carcinoma cell

HepG2: hepatocellular carcinoma cell

HUVEC: umbilical vein endothelial cell

K562: chronic myeloid leukemia cell

GM12878: lymphoblastoid cell

Input tracks

49 tracks

• ENCODE K562: ChIP-seq, DNase-seq, FAIRE-seq

• 8 different labs

Picking the number of labels

25 labels (0 to 24)

Emission parameters

Each cell represents a Gaussian. Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue. Standard deviation is proportional to the length of the black bar.
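The row normalization described above can be sketched as a per-track min-max scaling. The slide only says that the highest mean in a row is red and the lowest is blue, so linear scaling to the range 0 to 1 is an assumption, as are the function name `row_normalize` and the toy values.

```python
import numpy as np

def row_normalize(means):
    """Scale each row (one track) so its lowest mean is 0 and highest is 1.

    means: 2-D array with one row per track and one column per label.
    """
    means = np.asarray(means, dtype=float)
    low = means.min(axis=1, keepdims=True)
    span = means.max(axis=1, keepdims=True) - low
    span[span == 0] = 1.0  # a constant row maps to all zeros
    return (means - low) / span

# Hypothetical means for 2 tracks across 3 labels.
normalized = row_normalize([[1.0, 2.0, 3.0],
                            [10.0, 10.0, 30.0]])
print(normalized)
```

Normalizing within each track makes tracks with very different signal ranges comparable in one heat map, at the cost of hiding absolute differences between tracks.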

TSS transcription start site

GS gene start

GM gene middle

GE gene end

E enhancer

I insulator

R repression

D dead

Transcription start site (TSS)

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Rediscovering genes

Zooming out 10×

TSS segments occur near 5' ends of genes

TSS/G* segments missing in gene deserts

R*/D* segments occur more in gene deserts

3' gene ends

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Jason Ernst

A puzzling region

Lots of genes but very few TSS/GS segments. Why? Because these genes are not expressed in K562.

Experimental validation

http://switchgeargenomics.com/products/promoter-reporter-collection/

Testing <1000 bp sequences for promoter activity

• predicted + in K562

• predicted – in K562

• predicted + in GM12878

• predicted – in GM12878

[Figure: luciferase assay results]

Comparison with GWAS catalog

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Bob Harris, Ross Hardison

Summary of results

Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:

• A simple annotation with a single label for each part of the genome.

• Visualization reducing multivariate data to a comprehensible representation.

• Interpretation of the context and potential regulatory impact of variants.

Software availability

• Segway: data tracks to segmentation

Hoffman MM et al. 2012. Nat Methods 9:473. http://noble.gs.washington.edu/proj/segway/

• Segtools: segmentation to plots and summary statistics

Buske OJ et al. 2011. BMC Bioinformatics 12:415. http://noble.gs.washington.edu/proj/segtools/

• Genomedata: efficient access to numeric data anchored to genome

Hoffman MM et al. 2010. Bioinformatics 26:1458. http://noble.gs.washington.edu/proj/genomedata/

Acknowledgments

University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.

University of Massachusetts Medical School: Zhiping Weng.

SwitchGear Genomics: Patrick Collins.

Stanford University: Anshul Kundaje.

Pennsylvania State University: Ross Hardison, Bob Harris.

European Bioinformatics Institute: Ewan Birney, Ian Dunham.

University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.

Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.

CRG: Sarah Djebali.

RIKEN: Timo Lassmann.

ENCODE Project Consortium.

NIH/NHGRI: K99HG006259, U54HG004695.

Bill Noble, Jeff Bilmes, Orion Buske, Paul Ellenbogen
