
Page 1: PeakSeg: constrained optimal Segmentation and supervised

PeakSeg: constrained optimal Segmentation and supervised penalty learning for Peak detection in count data

Toby Dylan Hocking, [email protected]

joint work with Guillem Rigaill and Guillaume Bourque

7 July 2015

Page 2: PeakSeg: constrained optimal Segmentation and supervised

ChIP-seq data and previous work on peak detection

New PeakSeg model: constrained optimal segmentation

Train and test error results, conclusions

Page 3: PeakSeg: constrained optimal Segmentation and supervised

Problem: many false positives in unsupervised peak detectors

[Figure: UCSC genome browser view of chr11:118,095,000-118,125,000 (hg19), showing UCSC genes (AMICA1, MPZL3, MPZL2) and H3K4me3 coverage signal for four McGill samples: McGill0002 monocyte, McGill0004 CD4-positive helper T cell, McGill0091 B cell, and McGill0103 B cell.]

I Grey signal is noisy count data (≈ protein binding).

I Binary classification for every sample and position: negative class = background noise, positive class = “peaks.”

I Black bands show peak predictions of “Model-based analysis of ChIP-Seq” (MACS), Zhang et al, 2008 (default parameters).

Page 4: PeakSeg: constrained optimal Segmentation and supervised

Peak detector accuracy can be quantified using manually annotated region labels

Supervised learning framework (arXiv:1409.6209).

I Good peaks have 0 incorrect regions.

I Bad peaks have 7 incorrect regions.

I Goal: minimize number of incorrect test labels.

Page 5: PeakSeg: constrained optimal Segmentation and supervised

ChIP-seq data and previous work on peak detection

New PeakSeg model: constrained optimal segmentation

Train and test error results, conclusions

Page 6: PeakSeg: constrained optimal Segmentation and supervised

Maximum likelihood Poisson segmentation models

I Previous work: unconstrained maximum likelihood mean for s segments (s − 1 changes).

I This paper: constraint enforces up, down, up, down (and not up, up, down).

I Odd-numbered segments are background noise, even-numbered segments are peaks (see the sketch below).
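To make the peak definition concrete, here is a small illustrative sketch (not code from the paper; names are made up): given a piecewise-constant mean vector that satisfies the constraint, the peaks are simply the even-numbered segments.

```python
import numpy as np

def peaks_from_mean(mean):
    """Return (start, end) index pairs of the even-numbered segments (the peaks)
    of a piecewise-constant mean vector obeying the up, down, up, down constraint."""
    mean = np.asarray(mean, dtype=float)
    change = np.flatnonzero(np.diff(mean) != 0) + 1   # segment boundaries
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(mean)]))
    # segments are numbered 1, ..., s; the even-numbered ones are peaks
    return [(a, b) for k, (a, b) in enumerate(zip(starts, ends), start=1)
            if k % 2 == 0]

# Example: 5 segments going up, down, up, down -> segments 2 and 4 are peaks.
m = [1, 1, 5, 5, 2, 2, 7, 7, 1, 1]
print(peaks_from_mean(m))  # [(2, 4), (6, 8)]
```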

Page 7: PeakSeg: constrained optimal Segmentation and supervised

PeakSeg: constrained maximum likelihood segmentation

For each number of segments $s \in \{1, \dots, s_{\max}\}$, the PeakSeg model for the mean vector $\mathbf m$ is defined as

$$\mathbf m^s(\mathbf y) = \arg\min_{\mathbf m \in \mathbb{R}^d} \sum_{j=1}^d m_j - y_j \log m_j \qquad \text{(PoissonLoss)}$$

such that $\operatorname{Segments}(\mathbf m) = s$ and, for all $j \in \{1, \dots, d\}$, $P_j(\mathbf m) \in \{0, 1\}$ (the up, down, up, down constraint), where the peak indicator is $P_1(\mathbf m) = 0$ and, for $j > 1$,

$$P_j(\mathbf m) = \sum_{k=2}^j \operatorname{sign}(m_k - m_{k-1}).$$

We propose the cDPA, a constrained dynamic programming algorithm, which computes the $s_{\max}$ models in $O(s_{\max} d^2)$ time.
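The cDPA recursion itself is not spelled out on these slides; the following is a minimal sketch of one way such a constrained dynamic program can be written, assuming a Poisson segment cost computed in constant time from cumulative sums and checking the up/down constraint against the stored mean of the best previous segment. As the future-work slide suggests, this style of algorithm is a heuristic and is not guaranteed to recover the constrained optimum.

```python
import numpy as np

def cdpa(y, s_max):
    """Sketch of a constrained DP for PeakSeg-style segmentation in O(s_max * d^2).

    y     : 1-D array of non-negative counts (length d).
    s_max : maximum number of segments.
    Returns loss, where loss[s, t] is the best Poisson loss of s segments on
    y[0:t] under the up, down, up, down constraint (checked against the stored
    mean of the best previous segment, hence a heuristic rather than exact).
    """
    y = np.asarray(y, dtype=float)
    d = len(y)
    csum = np.concatenate(([0.0], np.cumsum(y)))

    def cost(a, b):
        # Optimal Poisson loss of one segment on y[a:b], attained at its mean.
        S, n = csum[b] - csum[a], b - a
        return S - S * np.log(S / n) if S > 0 else 0.0

    loss = np.full((s_max + 1, d + 1), np.inf)
    last_mean = np.zeros((s_max + 1, d + 1))
    for t in range(1, d + 1):                 # s = 1: a single segment (0, t]
        loss[1, t] = cost(0, t)
        last_mean[1, t] = csum[t] / t
    for s in range(2, s_max + 1):
        going_up = (s % 2 == 0)               # even-numbered segments are peaks
        for t in range(s, d + 1):
            for tp in range(s - 1, t):        # tp = last change-point
                if not np.isfinite(loss[s - 1, tp]):
                    continue
                m = (csum[t] - csum[tp]) / (t - tp)
                prev = last_mean[s - 1, tp]
                if (m > prev) if going_up else (m < prev):
                    cand = loss[s - 1, tp] + cost(tp, t)
                    if cand < loss[s, t]:
                        loss[s, t] = cand
                        last_mean[s, t] = m
    return loss
```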

Page 8: PeakSeg: constrained optimal Segmentation and supervised

ChIP-seq data and previous work on peak detection

New PeakSeg model: constrained optimal segmentation

Train and test error results, conclusions

Page 9: PeakSeg: constrained optimal Segmentation and supervised

Train error on H3K4me3 data

Page 10: PeakSeg: constrained optimal Segmentation and supervised

Train error on H3K36me3 data

Page 11: PeakSeg: constrained optimal Segmentation and supervised

Supervised learning of a penalty for choosing a profile-specific number of segments

Supervised learning method of Hocking, Rigaill, et al (ICML 2013).

We have labeled profiles $\mathbf y_1, \dots, \mathbf y_n$ and features $\mathbf x_1, \dots, \mathbf x_n$ (number of data points, mean, quantiles, etc).

Predicted number of segments for each profile $i$:

$$\hat s_i = \arg\min_s \operatorname{PoissonLoss}\left[\mathbf m^s(\mathbf y_i), \mathbf y_i\right] + \overbrace{\underbrace{h(s, d_i)}_{\text{given}}\;\underbrace{\lambda_i}_{\text{learned}}}^{\text{penalty}}$$

Main idea: learn $f(\mathbf x_i) = \log \lambda_i$ with minimal error on the train set.

Page 12: PeakSeg: constrained optimal Segmentation and supervised

AIC/BIC and oracle model complexity criteria

Predicted number of segments for each profile $i$:

$$\hat s_i = \arg\min_s \operatorname{PoissonLoss}\left[\mathbf m^s(\mathbf y_i), \mathbf y_i\right] + \overbrace{\underbrace{h(s, d_i)}_{\text{given}}\;\underbrace{\lambda_i}_{\text{learned}}}^{\text{penalty}}$$

name        model complexity h(s, di)
AIC/BIC.*   s
oracle.*    s (1 + 4 √(1.1 + log(di/s)))²

Oracle model complexity of Cleynen and Lebarbier (2014).
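To make the criterion concrete, here is a small sketch (illustrative, not code from the paper) of selecting the number of segments for one profile, with both model complexity functions from the table above:

```python
import numpy as np

def h_aic_bic(s, d):
    # AIC/BIC.* model complexity: linear in the number of segments.
    return s

def h_oracle(s, d):
    # Oracle model complexity of Cleynen and Lebarbier (2014).
    return s * (1 + 4 * np.sqrt(1.1 + np.log(d / s))) ** 2

def select_segments(poisson_loss, d, lam, h=h_oracle):
    """poisson_loss[s-1] = PoissonLoss of the model with s segments on d points.
    Return the s minimizing PoissonLoss + lam * h(s, d)."""
    s_values = np.arange(1, len(poisson_loss) + 1)
    crit = np.asarray(poisson_loss) + lam * h(s_values, d)
    return int(s_values[np.argmin(crit)])

# AIC corresponds to lam = 2 and BIC to lam = log(d) with h = h_aic_bic, e.g.
# s_hat = select_segments(loss_per_s, d=5735, lam=np.log(5735), h=h_aic_bic)
```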

Page 13: PeakSeg: constrained optimal Segmentation and supervised

Penalty function parameterizations

Predicted number of segments for each profile $i$:

$$\hat s_i = \arg\min_s \operatorname{PoissonLoss}\left[\mathbf m^s(\mathbf y_i), \mathbf y_i\right] + \overbrace{\underbrace{h(s, d_i)}_{\text{given}}\;\underbrace{\lambda_i}_{\text{learned}}}^{\text{penalty}}$$

name   learned λi                parameters           learning algorithm
*.0    AIC = 2, BIC = log di     none                 unsupervised
*.1    β                         β ∈ R+               grid search
*.3    e^β di^w1 (max yi)^w2     β, w1, w2 ∈ R        interval regression
*.41   exp(β + wᵀxi)             β ∈ R, w ∈ R⁴⁰       regularized interval regression
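For the *.1 rows, the learned constant can be found by a one-parameter grid search over β (so λ = e^β); a minimal sketch follows, with an illustrative data layout that is not the paper's implementation.

```python
import numpy as np

def label_error(prof, lam, h=lambda s, d: s):
    """Label error of one profile when the penalty is lam * h(s, d).
    prof is a dict: 'loss' and 'error' are arrays indexed by s - 1, 'd' is the
    number of data points."""
    s_values = np.arange(1, len(prof['loss']) + 1)
    crit = np.asarray(prof['loss']) + lam * h(s_values, prof['d'])
    return prof['error'][int(np.argmin(crit))]

def grid_search_penalty(train_profiles, beta_grid, h=lambda s, d: s):
    """Pick the constant beta (lam = exp(beta)) minimizing total train label error."""
    total = [sum(label_error(p, np.exp(b), h) for p in train_profiles)
             for b in beta_grid]
    return beta_grid[int(np.argmin(total))]

# e.g. beta_hat = grid_search_penalty(train_profiles, np.linspace(-5, 15, 200))
```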

Page 14: PeakSeg: constrained optimal Segmentation and supervised

Unsupervised constrained optimization algorithm works for both H3K36me3 and H3K4me3 data types

...except in the H3K4me3 XJ immune data set.

[Figure: dot plot of test error. One panel per data set (H3K36me3 AM immune, H3K36me3 TDH immune, H3K36me3 TDH other, H3K4me3 PGP immune, H3K4me3 TDH immune, H3K4me3 TDH other, H3K4me3 XJ immune); x-axis: percent incorrect peak region labels (test error); rows compare the unsupervised PeakSeg (cDPA) models AIC/BIC.0 and oracle.0 with the baselines macs.0 and hmcan.broad.0.]

Six train/test splits (open circles) and mean (shaded circle).

Page 15: PeakSeg: constrained optimal Segmentation and supervised

Training 1 parameter with grid search reduces test error

...except for macs, which has good defaults for 3 of 4 H3K4me3 data sets.

[Figure: dot plot of test error, same panels as above, comparing the unsupervised models (AIC/BIC.0, oracle.0, macs.0, hmcan.broad.0) with their grid-search-trained counterparts (AIC/BIC.1, oracle.1, macs.1, hmcan.broad.1).]

Six train/test splits (open circles) and mean (shaded circle).

Page 16: PeakSeg: constrained optimal Segmentation and supervised

Training several parameters with interval regression further reduces test error

...except when there are few training data (H3K36me3 TDH).

[Figure: dot plot of test error, same panels as above, comparing unsupervised (*.0), grid search (*.1), and interval regression (*.3, *.41) versions of the AIC/BIC and oracle penalties for PeakSeg (cDPA), plus the macs and hmcan.broad baselines.]

Six train/test splits (open circles) and mean (shaded circle).

Page 17: PeakSeg: constrained optimal Segmentation and supervised

Conclusions and future work

PeakSeg: Peak detection via constrained optimal Segmentation.

I New segmentation model with up, down, up, down constraint.

I First supervised peak detection algorithm.

I State-of-the-art peak detection for both H3K4me3 and H3K36me3 profiles.

I Oracle model complexity more accurate than AIC/BIC.

Future work:

I Constrained version of Pruned Dynamic Programming (Rigaill, arXiv:1004.0887) to compute in O(d log d) time.

I Efficient algorithm which provably computes the PeakSeg model?

I Theoretically optimal features for the penalty learning problem?

I Feature learning based on profile count data.

I Overlapping peaks at the same positions across samples (arXiv:1506.01286).

Page 18: PeakSeg: constrained optimal Segmentation and supervised

Thanks for your attention!

Write me at [email protected] to collaborate!

Source code for slides, figures, and paper online: https://github.com/tdhock/PeakSeg-paper

Supplementary slides appear after this one.

Page 19: PeakSeg: constrained optimal Segmentation and supervised

PeakSeg accuracy can be quantified using labels

I 1, 3, 5, 7 segments = 0, 1, 2, 3 peaks (2p + 1 = s).

I Models with s ∈ {1, 7} segments have 1 incorrect region.

I Models with s ∈ {3, 5} segments are perfect.

I Goal for i ∈ {1, . . . , n} profiles: predict a profile-specific number of segments si with minimum label errors.

Page 20: PeakSeg: constrained optimal Segmentation and supervised

Chromatin immunoprecipitation sequencing (ChIP-seq)

Analysis of DNA-protein interactions.

Source: “ChIP-sequencing,” Wikipedia.

Page 21: PeakSeg: constrained optimal Segmentation and supervised

Previous work in computer vision: look and add labels to...

           Photos                   Cell images                    Copy number profiles
Labels:    names                    phenotypes                     alterations
           CVPR 2013 (246 papers)   CellProfiler (873 citations)   SegAnnDB (Hocking et al. 2014)

Sources: http://en.wikipedia.org/wiki/Face_detection

Jones et al, PNAS 2009. Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning.

Page 22: PeakSeg: constrained optimal Segmentation and supervised

Computation of optimal loss $L_{s,t}$ for $s = 1$ segments up to data point $t$

$$L_{1,t} = \underbrace{c_{(0,t]}}_{\text{optimal loss of the 1st segment } (0,t]}$$
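As a concrete illustration (not from the slides), the whole first row of the table can be filled in linear time, because the optimal Poisson cost of a single segment depends only on the segment's total count and its length:

```python
import numpy as np

def first_row_losses(y):
    """L[1, t] = optimal Poisson loss of one segment on y[0:t], for t = 1..d.
    The optimal segment mean is the empirical mean, so the cost only needs the
    cumulative sum of the counts."""
    y = np.asarray(y, dtype=float)
    S = np.cumsum(y)                        # S[t-1] = sum of y[0:t]
    n = np.arange(1, len(y) + 1)
    with np.errstate(divide='ignore', invalid='ignore'):
        cost = S - S * np.log(S / n)
    return np.where(S > 0, cost, 0.0)       # convention: 0 * log 0 = 0
```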


Page 25: PeakSeg: constrained optimal Segmentation and supervised

Computation of optimal loss $L_{s,t}$ for $s = 2$ segments up to data point $t < d$

$$L_{2,t} = \min_{t' < t}\; \underbrace{L_{1,t'}}_{\text{optimal loss in 1 segment up to } t'} + \underbrace{c_{(t',t]}}_{\text{optimal loss of the 2nd segment } (t',t]}$$


Page 29: PeakSeg: constrained optimal Segmentation and supervised

Computation of optimal loss $L_{s,t}$ for $s = 2$ segments up to the last data point $t = d$

$$L_{2,t} = \min_{t' < t}\; \underbrace{L_{1,t'}}_{\text{optimal loss in 1 segment up to } t'} + \underbrace{c_{(t',t]}}_{\text{optimal loss of the 2nd segment } (t',t]}$$


Page 35: PeakSeg: constrained optimal Segmentation and supervised

Dynamic programming is faster than grid search for s > 2 segments

Computation time in the number of data points d:

segments s   grid search   dynamic programming
1            O(d)          O(d)
2            O(d²)         O(d²)
3            O(d³)         O(d²)
4            O(d⁴)         O(d²)
...          ...           ...

For example, d = 5735 data points to segment:
d² = 32,890,225
d³ = 188,625,440,375
...

Page 36: PeakSeg: constrained optimal Segmentation and supervised

Computation of optimal loss $L_{s,t}$ for $s = 3$ segments up to data point $t$

$$L_{3,t} = \min_{t' < t}\; \underbrace{L_{2,t'}}_{\text{optimal loss in 2 segments up to } t'} + \underbrace{c_{(t',t]}}_{\text{optimal loss of the 3rd segment } (t',t]}$$


Page 40: PeakSeg: constrained optimal Segmentation and supervised

Step 1: compute annotation error functions

I Inputs: for i ∈ {1, . . . , n} samples, genomic profiles yi, annotated regions Ri.

                       0 peaks    · · ·   pmax peaks
    segmentations      m0(yi)     · · ·   mpmax(yi)
    annotation error   ei(0)      · · ·   ei(pmax)

I The R package https://github.com/tdhock/PeakError/ computes the annotation error ei : {0, . . . , pmax} → Z+ (a simplified sketch follows below).

I TD Hocking et al. Visual annotations and a supervised learning approach for evaluating and calibrating ChIP-seq peak detectors (arXiv:1409.6209).
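As an illustration of what ei(p) counts, here is a heavily simplified sketch that only handles "peaks" and "noPeaks" labels; the actual rules, including the peakStart and peakEnd label types, are defined by the PeakError package and the arXiv paper above.

```python
def annotation_error(peaks, regions):
    """Count incorrect labeled regions for one set of predicted peaks.

    peaks   : list of (start, end) predicted peak intervals.
    regions : list of (start, end, label) with label in {"peaks", "noPeaks"}.
    Simplification: PeakError also defines stricter peakStart/peakEnd rules.
    """
    errors = 0
    for r_start, r_end, label in regions:
        overlapping = sum(1 for p_start, p_end in peaks
                          if p_start < r_end and r_start < p_end)
        if label == "peaks" and overlapping == 0:
            errors += 1      # false negative: no peak in a labeled peak region
        elif label == "noPeaks" and overlapping > 0:
            errors += 1      # false positive: a peak in a labeled background region
    return errors
```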

Page 41: PeakSeg: constrained optimal Segmentation and supervised

Step 2: compute model selection functions

For each sample/chromosome i ∈ {1, . . . , n}, for λ ∈ R+,

I The optimal number of peaks function is

$$p^*_i(\lambda) = \arg\min_{p \in \{0, \dots, p_{\max}\}} \alpha_i^p + \lambda p,$$

where $\alpha_i^p$ is the Poisson loss of the model with $p$ peaks.

I The penalized annotation error function is

$$E_i(\lambda) = e_i\left[p^*_i(\lambda)\right],$$

where $e_i(p)$ is the number of incorrect annotations for the model with $p$ peaks.

Peaks $p^*_i$ and error $E_i$ are non-convex, piecewise constant functions that can be computed exactly (see the sketch below).
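One standard way to compute p∗i(λ) exactly (a sketch, not necessarily the paper's implementation) is to follow the crossing points of the penalized criteria, starting from the model selected at λ = 0; the penalized error Ei(λ) then follows by composing with ei.

```python
import numpy as np

def model_selection_breakpoints(alpha):
    """Exact piecewise-constant p*(lambda) = argmin_p alpha[p] + lambda * p.

    alpha[p] = Poisson loss of the model with p peaks, p = 0, ..., p_max.
    Returns a list of (lambda_low, p) pairs: p is selected for all lambda in
    [lambda_low, next lambda_low).
    """
    alpha = np.asarray(alpha, dtype=float)
    p = int(np.argmin(alpha))             # model selected at lambda = 0
    path = [(0.0, p)]
    while p > 0:
        q = np.arange(p)                  # candidate models with fewer peaks
        crossing = (alpha[q] - alpha[p]) / (p - q)
        lam = crossing.min()              # next breakpoint
        p = int(q[crossing == lam][0])    # smallest q among ties
        path.append((float(lam), p))
    return path

# Example: alpha = [10, 3, 2] gives [(0.0, 2), (1.0, 1), (7.0, 0)].
```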

Page 42: PeakSeg: constrained optimal Segmentation and supervised

Step 3: learn a penalty function via interval regression

I Compute the target interval $(\underline{L}_i, \overline{L}_i)$.

I $\log \lambda_i \in (\underline{L}_i, \overline{L}_i) \Rightarrow$ optimal peak detection.

I Compute simple features $\mathbf x_i \in \mathbb{R}^m$, e.g. chromosome size, read counts, signal scale $\log \max \mathbf y_i$.

I Learn an optimal affine $f(\mathbf x_i) = \beta + \mathbf w^\intercal \mathbf x_i = \log \lambda_i$.

I Equivalent to learning a penalty $\lambda_i = \exp f(\mathbf x_i)$:

$$p^*_i[\exp f(\mathbf x_i)] = \arg\min_p \alpha_i^p + p \exp f(\mathbf x_i) = \arg\min_p \alpha_i^p + p\,(\max \mathbf y_i)^{w} e^\beta.$$

I Convex optimization problem, global optimum, variable selection (G Rigaill, TD Hocking, et al. ICML 2013). A minimal sketch of the interval regression step follows.
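Below is a minimal numpy sketch of the interval regression step (squared hinge loss on the target interval, minimized by plain gradient descent); the paper's method adds regularization and uses a dedicated convex solver, so this is only an illustration.

```python
import numpy as np

def interval_regression(X, lo, hi, margin=1.0, step=0.01, iterations=5000):
    """Learn f(x) = beta + w @ x so that log(lambda_i) = f(x_i) falls in (lo_i, hi_i).

    X      : (n, m) feature matrix.
    lo, hi : arrays of lower/upper target-interval limits (may be -inf / +inf).
    Minimizes a squared hinge loss on interval violations by gradient descent.
    """
    n, m = X.shape
    Z = np.hstack([np.ones((n, 1)), X])            # absorb the intercept beta
    theta = np.zeros(m + 1)
    for _ in range(iterations):
        f = Z @ theta
        low_viol = np.maximum(0.0, lo + margin - f)    # prediction too small
        high_viol = np.maximum(0.0, f - hi + margin)   # prediction too large
        grad = Z.T @ (2 * (high_viol - low_viol)) / n
        theta -= step * grad
    return theta[0], theta[1:]                     # beta, w
```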

Page 43: PeakSeg: constrained optimal Segmentation and supervised

Summary of supervised PeakSeg algorithm

I Fix the maximum number of peaks pmax = 10,000.

I For each sample/chromosome i ∈ {1, . . . , n},

I Unsupervised PeakSeg: compute constrained maximum likelihood segmentations m0(yi), . . . , mpmax(yi).

I Step 1: use annotated region labels to compute the annotation error ei(0), . . . , ei(pmax).

I Step 2: compute peaks p∗i(λ), error Ei(λ), and target interval $(\underline{L}_i, \overline{L}_i)$.

I Step 3: learn a penalty λi = exp f(xi) using features xi such as log max(yi).

I Given an unlabeled chromosome (x, y), we predict $\mathbf m^{p^*[\exp f(\mathbf x)]}(\mathbf y)$.

Page 44: PeakSeg: constrained optimal Segmentation and supervised

Benchmark: 7 annotated region data sets

http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/

I 4 annotators (AM, TDH, PGP, XJ).

I 8 cell types.

I 37 annotated H3K4me3 profiles (sharp peaks).

I 29 annotated H3K36me3 profiles (broadly enriched domains).

I 12,826 annotated regions in total.

I 2752 separate segmentation problems.

Used the cDPA on the annotated data.

I cDPA computed models with 0, . . . , 9 peaks (for 99.5% of problems).

I For the biggest problem, cDPA took 3 hours (d = 88,509 data points, 3.5 megabases).

I macs takes about 90 minutes for one whole genome.

Page 45: PeakSeg: constrained optimal Segmentation and supervised

Maximum likelihood segmentations

For a coverage profile $\mathbf y \in \mathbb{Z}_+^d$, find the mean vector $\mathbf m^s(\mathbf y) \in \mathbb{R}^d$ with maximum Poisson likelihood, given $s$ segments ($s - 1$ change-points).

Computed via the Segmentor3IsBack R package (Cleynen et al. 2014).

Page 46: PeakSeg: constrained optimal Segmentation and supervised

Previous work: maximum likelihood segmentation

I Let $\mathbf y = [y_1 \cdots y_d] \in \mathbb{Z}_+^d$ be the aligned read counts for one sample and one genomic region.

I Fix $s_{\max} = 19$, the maximum number of segments.

I For each number of segments $s \in \{1, \dots, s_{\max}\}$, we want

$$\mathbf m^s(\mathbf y) = \arg\min_{\mathbf m \in \mathbb{R}^d} \sum_{j=1}^d m_j - y_j \log m_j \qquad \text{(Poisson loss)}$$

such that $\operatorname{Segments}(\mathbf m) = s$.

I Pruned Dynamic Programming (Rigaill, arXiv:1004.0887) returns $\mathbf m^1(\mathbf y), \dots, \mathbf m^{s_{\max}}(\mathbf y)$ in $O(s_{\max} d \log d)$ time.

Page 47: PeakSeg: constrained optimal Segmentation and supervised

Maximum likelihood segmentations

Model with s = 5 segments changes up, up, down, down. How to define peaks? Introduce a threshold parameter?

Page 48: PeakSeg: constrained optimal Segmentation and supervised

Constrained maximum likelihood segmentations

Model with s = 5 segments changes up, down, up, down. Peaks are even-numbered segments.

Page 49: PeakSeg: constrained optimal Segmentation and supervised

Two annotators provide consistent labels, but different precision

I TDH peakStart/peakEnd more precise than AM peaks.

I AM noPeaks more precise than TDH no label.

Page 50: PeakSeg: constrained optimal Segmentation and supervised

Comparison on annotated McGill benchmark data set

Hocking et al, 2014, arXiv:1409.6209.

I Manually annotate regions with or without peaks: http://cbio.ensmp.fr/~thocking/chip-seq-chunk-db/

I Tune 1 parameter that affects the number of peaks.

I Choose the parameter that minimizes the annotation error.

Results:

I MACS best for H3K4me3 (sharp peak pattern),

I HMCan.broad best for H3K36me3 (broad peak pattern).

I Consistent across 4 annotators (PhD students, postdocs).

I 10–20% test error rates.

Page 51: PeakSeg: constrained optimal Segmentation and supervised

Can we do better than unsupervised peak detectors? Yes!

We propose PeakSeg, a new model with efficient algorithms for supervised peak detection.

I Input: several ChIP-seq profiles, manually annotated regions.

I New methods for peak detection:

I Constrained optimal segmentation.
I Efficient supervised learning using manually annotated regions.

I Output: predicted peaks for each profile.

State-of-the-art peak detection accuracy (on both sharp H3K4me3 and broad H3K36me3 profiles).


Page 53: PeakSeg: constrained optimal Segmentation and supervised

Existing peak detection algorithms

I Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008.

I SICER, Zang et al, 2009.

I HOMER findPeaks, Heinz et al, 2010.

I RSEG, Song and Smith, 2011.

I Histone modifications in cancer (HMCan), Ashoor et al, 2013.

I ... dozens of others.

Two big questions: how to choose the best...

I ...algorithm?

I ...parameters?

Page 54: PeakSeg: constrained optimal Segmentation and supervised

How to choose model parameters?

19 parameters for Model-based analysis of ChIP-Seq (MACS), Zhang et al, 2008.

[-g GSIZE]

[-s TSIZE] [--bw BW] [-m MFOLD MFOLD] [--fix-bimodal]

[--nomodel] [--extsize EXTSIZE | --shiftsize SHIFTSIZE]

[-q QVALUE | -p PVALUE | -F FOLDENRICHMENT] [--to-large]

[--down-sample] [--seed SEED] [--nolambda]

[--slocal SMALLLOCAL] [--llocal LARGELOCAL]

[--shift-control] [--half-ext] [--broad]

[--broad-cutoff BROADCUTOFF] [--call-summits]

10 parameters for Histone modifications in cancer (HMCan), Ashoor et al, 2013.

minLength 145

medLength 150

maxLength 155

smallBinLength 50

largeBinLength 100000

pvalueThreshold 0.01

mergeDistance 200

iterationThreshold 5

finalThreshold 0

maxIter 20