The Segway annotation of ENCODE data Michael M. Hoffman Department of Genome Sciences University of Washington


Post on 16-Oct-2021


Page 1: The Segway annotation of ENCODE data

The Segway annotation of ENCODE data

Michael M. Hoffman

Department of Genome Sciences

University of Washington

Page 2: The Segway annotation of ENCODE data

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

Page 3: The Segway annotation of ENCODE data

Functional genomics

ENCODE Project Consortium 2011. PLoS Biol 9:e1001046.

Page 4: The Segway annotation of ENCODE data

Chromatin immunoprecipitation (ChIP)

Park PJ 2009. Nat Rev Genet 10:669.

Page 5: The Segway annotation of ENCODE data

ChIP sequence

Page 6: The Segway annotation of ENCODE data

sequence signal: Wiggler

• Extends tags in strand direction

• Extension length determined by cross-correlation peak

• Signal only in mappable regions

• 1-bp resolution

Anshul Kundaje

http://align2rawsignal.googlecode.com/

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.
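
The tag-extension step listed on this slide can be sketched in a few lines. This is only an illustration of the idea, not Wiggler's implementation: the function names `extend_tags` and `mask_unmappable`, the 0-based coordinates, and the fixed `extension` argument are assumptions, and the extension length is taken as given rather than estimated from a cross-correlation peak.

```python
import numpy as np

def extend_tags(tags, chrom_length, extension):
    """Count, for each base, how many extended tags cover it.

    tags: iterable of (five_prime_position, strand) pairs, 0-based,
          with strand '+' or '-' (hypothetical input format).
    """
    diff = np.zeros(chrom_length + 1, dtype=np.int64)
    for pos, strand in tags:
        if strand == "+":
            start, end = pos, pos + extension          # extend toward higher coordinates
        else:
            start, end = pos - extension + 1, pos + 1  # extend toward lower coordinates
        start, end = max(start, 0), min(end, chrom_length)
        if start < end:
            diff[start] += 1   # coverage rises at the start of the extended tag
            diff[end] -= 1     # and falls just past its end
    return np.cumsum(diff[:-1])  # 1-bp resolution signal

def mask_unmappable(signal, mappable):
    """Keep signal only where `mappable` is True; mark the rest as missing (NaN)."""
    out = signal.astype(float)
    out[~np.asarray(mappable, dtype=bool)] = np.nan
    return out

signal = extend_tags([(2, "+"), (7, "-")], chrom_length=10, extension=3)
```

Marking unmappable bases as NaN rather than zero keeps "no data" distinct from "no signal", which matters for the missing-data handling discussed later in the talk.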

Page 7: The Segway annotation of ENCODE data

Fine-scale data

[Figure: signal tracks (extended reads per base) across a 300 bp window. Histone modifications: H3K4me2, H3K27me3. Transcription factors: Pol2b, Egr-1, GABP, Pol2 (Myers), Sin3Ak-20, TAF1.]

Page 8: The Segway annotation of ENCODE data

Maher B 2012. Nature 489:46.

2685 data sets

Page 9: The Segway annotation of ENCODE data

Maher B 2012. Nature 489:46.

2685 data sets

Now what?

Page 10: The Segway annotation of ENCODE data

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

Page 11: The Segway annotation of ENCODE data

Semi-automated annotation

signal tracks

interpretation

visualization

annotation

pattern discovery

Page 12: The Segway annotation of ENCODE data

Genomic segmentation

Page 13: The Segway annotation of ENCODE data

Nonoverlapping segments

Page 14: The Segway annotation of ENCODE data

Nonoverlapping segments

Page 15: The Segway annotation of ENCODE data

Finite number of labels

[Figure: segments labeled 0, 1, 0, 2, 1, 1]
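
A segmentation in this sense is a per-position labeling collapsed into nonoverlapping runs. The sketch below shows that collapse; the function name `labels_to_segments` and the half-open `(start, end, label)` tuple layout are assumptions made for illustration.

```python
from itertools import groupby

def labels_to_segments(labels):
    """Collapse per-position labels into nonoverlapping (start, end, label) runs.

    Coordinates are 0-based and half-open, so consecutive segments tile the
    sequence without gaps or overlaps.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((start, start + length, label))
        start += length
    return segments

segments = labels_to_segments([0, 0, 1, 1, 1, 0, 2, 1, 1])
```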

Page 16: The Segway annotation of ENCODE data

Maximize similarity in labels

[Figure: segments, each labeled 0, 1, or 2]

Page 17: The Segway annotation of ENCODE data

Bayesian network for ChIP-seq

Xt: observed random variable (continuous), the signal at position t

Page 18: The Segway annotation of ENCODE data

Bayesian network for ChIP-seq

Qt: hidden random variable (discrete). Is the transcription factor present at position t?

0: transcription factor is not present

1: transcription factor is present

Xt: observed random variable (continuous), the signal at position t

Page 19: The Segway annotation of ENCODE data

Bayesian network for ChIP-seq

Qt: hidden random variable (discrete). TF present at position t?

Xt: observed random variable (continuous), the signal at position t

Qt → Xt: conditional relationship

µ0 σ0, µ1 σ1: emission probability parameters

P(Xt | Qt = 0) ~ N(µ0, σ0)

P(Xt | Qt = 1) ~ N(µ1, σ1)
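
To make the emission model concrete, the sketch below evaluates the two Gaussian densities for one observed signal value. The parameter values in `EMISSIONS` are hypothetical, and σ is treated as a standard deviation, which the slide does not state explicitly.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x, with sigma as the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical emission parameters (mu_q, sigma_q) for Qt = 0 and Qt = 1.
EMISSIONS = {0: (0.0, 1.0), 1: (5.0, 2.0)}

def emission_likelihoods(x):
    """Return the density of Xt = x under each value of Qt."""
    return {q: gaussian_pdf(x, mu, sigma) for q, (mu, sigma) in EMISSIONS.items()}

low = emission_likelihoods(0.0)    # low signal: better explained by Qt = 0
high = emission_likelihoods(6.0)   # high signal: better explained by Qt = 1
```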

Page 20: The Segway annotation of ENCODE data

Bayesian network: 2 positions

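[Figure: the single-position network drawn at two positions, t and t+1. Hidden random variables Qt and Qt+1 (discrete); observed random variables Xt and Xt+1 (continuous); a conditional relationship from each Q to its X; emission probability parameters µ0 σ0 and µ1 σ1 at each position.]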

Page 21: The Segway annotation of ENCODE data

Bayesian network: 2 positions

[Figure: the two-position network with a conditional relationship from Qt to Qt+1. Hidden random variables Qt and Qt+1 (discrete); observed random variables Xt and Xt+1 (continuous); emission probability parameters µ0 σ0 and µ1 σ1; transition probability parameters in a 2×2 table with entries 00, 01, 10, 11.]

P(Qt+1 = 0 | Qt = 0) = 0.99

P(Qt+1 = 1 | Qt = 0) = 0.01

P(Qt+1 = 0 | Qt = 1) = 0.01

P(Qt+1 = 1 | Qt = 1) = 0.99
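
With these transition probabilities and Gaussian emissions, the most probable hidden sequence can be found by dynamic programming. The sketch below uses the Viterbi algorithm, which the slides do not name, so treat it as one standard option rather than the talk's stated method. The emission parameters `MU` and `SIGMA` and the uniform starting distribution `LOG_START` are assumptions.

```python
import numpy as np

# Transition probabilities from the slide: rows are Qt, columns are Qt+1.
LOG_TRANS = np.log(np.array([[0.99, 0.01],
                             [0.01, 0.99]]))
LOG_START = np.log(np.array([0.5, 0.5]))  # assumed uniform start
MU = np.array([0.0, 5.0])                 # hypothetical emission means
SIGMA = np.array([1.0, 1.0])              # hypothetical standard deviations

def log_emission(x):
    """Log N(x_t; mu_q, sigma_q) for every position t and state q, shape (T, 2)."""
    x = np.asarray(x, dtype=float)[:, None]
    return -0.5 * ((x - MU) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))

def viterbi(x):
    """Most probable state sequence for a non-empty sequence of observations."""
    log_e = log_emission(x)
    n = len(log_e)
    score = LOG_START + log_e[0]
    back = np.zeros((n, 2), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + LOG_TRANS      # cand[i, j]: best path ending in i, then i -> j
        back[t] = np.argmax(cand, axis=0)      # best predecessor for each state j
        score = cand[back[t], [0, 1]] + log_e[t]
    path = [int(np.argmax(score))]
    for t in range(n - 1, 0, -1):              # trace the best predecessors backward
        path.append(int(back[t][path[-1]]))
    return path[::-1]

peak = viterbi([0.1, -0.2, 0.0, 5.1, 4.8, 5.3, 0.2])
blip = viterbi([0.0, 0.0, 2.6, 0.0, 0.0])
```

The second call shows the effect of the 0.99 self-transition probability: a single moderately elevated value is not enough to justify switching state and back.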

Page 22: The Segway annotation of ENCODE data

Dynamic Bayesian network (DBN)

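[Figure: the network unrolled over positions t, t+1, and t+2. Hidden random variables Qt, Qt+1, Qt+2 (discrete); observed random variables Xt, Xt+1, Xt+2 (continuous); emission probability parameters µ0 σ0 and µ1 σ1 at each position; transition probability parameters (2×2 table with entries 00, 01, 10, 11) between consecutive positions. A compact form of the same network is drawn with a single Q, a single X, one set of emission parameters, and one transition table.]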

Page 23: The Segway annotation of ENCODE data

Dynamic BN for segmentation

[Figure: hidden discrete variable "segment label" with observed continuous variables CTCF, H3K36me3, and DNaseI. Legend: transition probability parameter, emission probability parameter, hidden random variable, conditional relationship, observed random variable, discrete, continuous.]

Page 24: The Segway annotation of ENCODE data

Heterogeneous missing data

Hoffman MM et al. 2012. Nat Methods 9:473.

Page 25: The Segway annotation of ENCODE data

Handling missing data

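[Figure: hidden discrete variable "segment" and observed continuous variable "DNaseI", with transition probability parameters (2×2 table with entries 00, 01, 10, 11) and emission probability parameters µ0 σ0 and µ1 σ1. A variable taking the values 1 and 0 is attached by a switching edge. Legend: transition probability parameter, emission probability parameter, hidden random variable, observed random variable, discrete, continuous, conditional, switching.]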

Page 26: The Segway annotation of ENCODE data

Handling missing data

[Figure: hidden variable "segment label" with observed continuous variables CTCF, H3K36me3, and DNaseI, each paired with an indicator: present(CTCF), present(H3K36me3), present(DNaseI). Legend: transition probability parameter, emission probability parameter, hidden random variable, observed random variable, discrete, continuous, conditional, switching.]
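
One way to read the present(track) indicators is that a track contributes to the emission likelihood only at positions where it has data. The sketch below implements that reading by summing per-track Gaussian log-densities over present tracks only; encoding missing values as NaN, treating tracks as independent given the label, and the function name are all assumptions for illustration.

```python
import numpy as np

def log_emission_with_missing(x, mu, sigma):
    """Per-label log-likelihood that ignores missing observations.

    x:     array (T, n_tracks), NaN where a track has no data
    mu:    array (n_labels, n_tracks) of Gaussian means
    sigma: array (n_labels, n_tracks) of standard deviations
    Returns an array of shape (T, n_labels).
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    present = ~np.isnan(x)                    # plays the role of present(track)
    x_filled = np.where(present, x, 0.0)      # placeholder value, masked out below
    z = (x_filled[:, None, :] - mu[None]) / sigma[None]
    log_pdf = -0.5 * z ** 2 - np.log(sigma[None] * np.sqrt(2 * np.pi))
    # A missing track adds 0 to the log-likelihood, i.e. it provides no evidence.
    return np.where(present[:, None, :], log_pdf, 0.0).sum(axis=2)

mu = np.array([[0.0, 0.0], [5.0, 5.0]])
sigma = np.ones((2, 2))
out = log_emission_with_missing([[0.0, np.nan], [np.nan, np.nan]], mu, sigma)
```

At a position where every track is missing, all labels score equally, so the label there is decided by the transition model alone.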

Page 27: The Segway annotation of ENCODE data

Length distribution

[Figure: hidden variable "segment label" with observed variables CTCF, H3K36me3, and DNaseI and their indicators present(CTCF), present(H3K36me3), present(DNaseI).]

Page 28: The Segway annotation of ENCODE data

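Length distribution

[Figure: the same network extended with the variables "segment countdown", "segment transition", "ruler", and "frame index", alongside "segment label", the observed variables CTCF, H3K36me3, and DNaseI, and their indicators present(CTCF), present(H3K36me3), present(DNaseI).]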

• Minimum segment length

• Maximum segment length

• Trained geometric length distribution

• Dirichlet prior on segment length

• Weight of prior versus observed data
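
The geometric length distribution in this list follows from a constant self-transition probability: if a label persists with probability p at each step, segment lengths are geometric with mean 1/(1 − p). The sketch below works this through for the 0.99 self-transition value quoted earlier in the talk. It covers only the plain geometric case; the minimum and maximum length constraints and the prior are not modeled, and the function names are assumptions.

```python
def geometric_length_pmf(k, stay):
    """P(segment length = k) when the label persists with probability `stay` per step."""
    return stay ** (k - 1) * (1 - stay)

def expected_length(stay):
    """Mean segment length implied by a constant self-transition probability."""
    return 1 / (1 - stay)

mean_len = expected_length(0.99)   # about 100 positions
```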

Page 29: The Segway annotation of ENCODE data

Segway

A way to segment the genome

http://noble.gs.washington.edu/proj/segway/

Hoffman MM et al. 2012. Nat Methods 9:473.

Page 30: The Segway annotation of ENCODE data

Overview

1. ENCODE Project

2. Semi-automated genomic annotation

3. Chromatin

4. RNA-seq

Page 31: The Segway annotation of ENCODE data

[Figure: cell lineage tree. Lineage nodes: embryoblast, mesendoderm, endoderm, mesoderm, intermediate mesoderm, lateral mesoderm, hemangioblast, blood vessel endothelium, hemocytoblast, myeloid progenitor, lymphoid progenitor, lymphoblast, liver, cervix.]

Cell types shown:

• H1 hESC: embryonic stem cell

• HeLa-S3: cervical carcinoma cell

• HepG2: hepatocellular carcinoma cell

• HUVEC: umbilical vein endothelial cell

• K562: chronic myeloid leukemia cell

• GM12878: lymphoblastoid cell

Page 32: The Segway annotation of ENCODE data

Input tracks

49 tracks

• ENCODE K562: ChIP-seq, DNase-seq, FAIRE-seq

• 8 different labs

Page 33: The Segway annotation of ENCODE data

Picking the number of labels

25 labels

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

Page 34: The Segway annotation of ENCODE data

Emission parameters

[Figure: heat map of emission parameters, one row per track and one column per label (0 to 24)]

Each cell represents a Gaussian. Means are row-normalized so the highest mean value for a track is red and the lowest mean value is blue. Standard deviation is proportional to the length of the black bar.
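
The row normalization described here can be sketched as a min-max rescaling of each track's means across labels. The slide only says that the highest mean per track maps to red and the lowest to blue, so the linear scaling in between is an assumption, as is the function name.

```python
import numpy as np

def row_normalize(means):
    """Rescale each row (track) so its lowest mean is 0 and its highest is 1.

    means: array (n_tracks, n_labels) of Gaussian means.
    Rows with no variation map to all zeros instead of dividing by zero.
    """
    means = np.asarray(means, dtype=float)
    lo = means.min(axis=1, keepdims=True)
    hi = means.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (means - lo) / span

scaled = row_normalize([[1.0, 3.0, 2.0],
                        [10.0, 10.0, 10.0]])
```

Normalizing per row makes tracks with very different dynamic ranges comparable in one heat map, at the cost of hiding absolute signal levels.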

Page 35: The Segway annotation of ENCODE data

TSS transcription start site

GS gene start

GM gene middle

GE gene end

E enhancer

I insulator

R repression

D dead

Page 36: The Segway annotation of ENCODE data

Transcription start site (TSS)

Hoffman MM et al. 2013. Nucleic Acids Res 41:827.

Page 37: The Segway annotation of ENCODE data

Rediscovering genes

Page 38: The Segway annotation of ENCODE data

Zooming out 10×

• TSS segments occur near 5' ends of genes

• TSS/G* segments missing in gene deserts

• R*/D* segments occur more in gene deserts

Page 39: The Segway annotation of ENCODE data

3' gene ends

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Jason Ernst

Page 40: The Segway annotation of ENCODE data

A puzzling region

Lots of genes but very few TSS/GS segments. Why?

Because these genes are not expressed in K562.

Page 41: The Segway annotation of ENCODE data

Experimental validation

http://switchgeargenomics.com/products/promoter-reporter-collection/

Testing <1000 bp sequences for promoter activity

• predicted + in K562

• predicted – in K562

• predicted + in GM12878

• predicted – in GM12878

Page 42: The Segway annotation of ENCODE data

Luciferase assay results

Hoffman MM et al. 2012. Nat Methods 9:473.

Page 43: The Segway annotation of ENCODE data

Comparison with GWAS catalog

Hoffman MM et al. 2013. Nucleic Acids Res 41:827. Bob Harris, Ross Hardison

Page 44: The Segway annotation of ENCODE data

Summary of results

Semi-automated genomic annotation begins with pattern discovery from multiple functional genomics data sets and enables:

• A simple annotation with a single label for each part of the genome.

• Visualization reducing multivariate data to a comprehensible representation.

• Interpretation of the context and potential regulatory impact of variants.

Page 45: The Segway annotation of ENCODE data

Software availability

• Segway: segmentation of data tracks

Hoffman MM et al. 2012. Nat Methods 9:473.

http://noble.gs.washington.edu/proj/segway/

• Segtools: segmentation plots and summary statistics

Buske OJ et al. 2011. BMC Bioinformatics 12:415.

http://noble.gs.washington.edu/proj/segtools/

• Genomedata: efficient access to numeric data anchored to genome

Hoffman MM et al. 2010. Bioinformatics 26:1458.

http://noble.gs.washington.edu/proj/genomedata/

Page 46: The Segway annotation of ENCODE data

Acknowledgments

University of Washington: Harshad Petwe, Meg Olson, Sheila Reynolds, Noble Research Group.

University of Massachusetts Medical School: Zhiping Weng.

SwitchGear Genomics: Patrick Collins.

Stanford University: Anshul Kundaje.

Pennsylvania State University: Ross Hardison, Bob Harris.

European Bioinformatics Institute: Ewan Birney, Ian Dunham.

University of California, Santa Cruz: Kate Rosenbloom, Brian Raney.

Cold Spring Harbor Laboratory: Tom Gingeras, Carrie Davis.

CRG: Sarah Djebali.

RIKEN: Timo Lassmann.

ENCODE Project Consortium.

NIH/NHGRI: K99HG006259, U54HG004695.

Bill Noble, Jeff Bilmes, Orion Buske, Paul Ellenbogen