multiscale dna partitioning: statistical evidence for segments



Page 1: Multiscale DNA partitioning: statistical evidence for segments

Multiscale DNA partitioning: statistical evidence for segments

Andreas Futschik 1,∗, Thomas Hotz 2,∗, Axel Munk 3,4, Hannes Sieling 3

1 Department of Applied Statistics, JKU University Linz, A-4040 Linz, Austria.
2 Institute of Mathematics, Technische Universität Ilmenau, D-98693 Ilmenau, Germany.
3 Institute for Mathematical Stochastics and Felix Bernstein Institute for Mathematical Statistics in Biosciences, Georgia Augusta University of Goettingen, D-37077 Göttingen, Germany.
4 Max Planck Institute for Biophysical Chemistry, D-37077 Goettingen, Germany.

ABSTRACT

Motivation: DNA segmentation, i.e. the partitioning of DNA into compositionally homogeneous segments, is a basic task in bioinformatics. Different algorithms have been proposed for various partitioning criteria such as GC content, local ancestry in population genetics, or copy number variation. A critical component of any such method is the choice of an appropriate number of segments. Some methods use model selection criteria and do not provide suitable error control. Other methods that are based on simulating a statistic under a null model provide suitable error control only if the correct null model is chosen.

Results: Here, we focus on partitioning with respect to GC content and propose a new approach that provides statistical error control: as in statistical hypothesis testing, it guarantees with a user-specified probability $1 - \alpha$ that the number of identified segments does not exceed the number of actually present segments. The method is based on a statistical multiscale criterion, rendering this a segmentation method which searches for segments of any length (on all scales) simultaneously. It is also very accurate in localizing segments: under benchmark scenarios, our approach leads to a segmentation that is more accurate than the approaches discussed in the comparative review of Elhaik et al. (2010a). In our real data examples, we find segments that often correspond well to features taken from standard UCSC genome annotation tracks.

Availability: Our method is implemented in function smuceR of the R-package stepR, available from http://www.stochastik.math.uni-goettingen.de/smuce.

Contact: [email protected], [email protected], [email protected], [email protected]

1 INTRODUCTION

It was observed long ago (Sueoka, 1962) that DNA sequences are not composed homogeneously and that bases fluctuate in their frequency. These inhomogeneities often have an evolutionary or a functional interpretation, and can be relevant for the subsequent analysis of sequence data. Since it correlates with many features of interest, the GC content, i.e. the relative frequency of the bases G and C, is one of the most commonly studied sequence properties.

∗ to whom correspondence should be addressed

Large scale regions, typically 300kb (Bernardi, 2001), of approximately homogeneous GC content have been called isochores. In view of the somewhat vague notion of “approximate homogeneity”, and conceptual criticism in studies such as Cohen et al. (2005) or Elhaik et al. (2010a), there is less interest in isochores nowadays. There is no doubt about variation in GC content along genomes, however, and the search is instead for domains of any length exhibiting distinct local GC content; see for instance Elhaik et al. (2010b). Several factors can influence the GC content of a region. At larger scales, it correlates with the density of genes, with gene rich regions typically exhibiting an elevated GC content compared to regions of low gene density. At smaller scales, there is fluctuation in GC content, for instance due to repetitive elements and CpG islands. The GC content is also known to vary between exons and introns. Indeed, especially for long introns, their lower GC content seems to play a role in splice site recognition (Amit et al., 2012). There is also a correlation between GC content and the local recombination rate (Fullerton et al., 2001). For a further discussion of features correlated to the GC content, see Freudenberg et al. (2009).

In gene expression studies, regions of homogeneous GC content are of interest as the GC content of a region affects the number of reads mapped to this region. For DNA and RNA-seq experiments with the Illumina Genome Analyzer platform, this has been investigated for instance in Benjamini et al. (2012) and Risso et al. (2011).

Segmentation algorithms aim to partition a given DNA sequence into stretches that are homogeneous in their composition, but differ from neighboring segments. The classical approach of using moving windows is very simple and available for instance as an option with the UCSC and Ensembl genome browsers. It has some disadvantages, however. For instance, the choice of the window size is difficult as it implicitly defines a fixed scale at which segments will primarily be detected. Further, the involved smoothing blurs abrupt changes. Without additional statistical criteria, the method also does not tell us whether differences between neighboring windows are statistically significant. Therefore several more sophisticated approaches have been proposed. These methods include hidden Markov Models (Churchill (1989, 1992)) and walking Markov Models (Fickett et al.

© The Author (2014). Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Associate Editor: Dr. John Hancock

Bioinformatics Advance Access published April 21, 2014


(1992)). There are also change point methods available, see for instance Braun et al. (2000). A Bayesian approach that relies on the Gibbs sampler has been proposed by Keith (2006). An older approach based on information criteria can be found in Oliver et al. (1999). Furthermore, recently developed methods based on information criteria have been shown to perform particularly well, see Elhaik et al. (2010a), Elhaik et al. (2010b). A review of segmentation methods can be found in Braun et al. (1998), and for a more recent comparative evaluation of the more popular approaches see Elhaik et al. (2010a).

In this paper we focus on binary segmentation, where the four letter alphabet of a DNA sequence is converted into a two letter code. For GC content, we set the response to be “1” for G or C at a position, and zero for A/T; we use $Y_i$ to denote the response at position $i$, and summarize the responses for a sequence of length $n$ by

$$Y = (Y_1, Y_2, \ldots, Y_n). \tag{1}$$

We model the responses $Y_i$ to be independent and Bernoulli $\mathrm{Bin}(1, \pi_i)$ distributed, and also assume that there is a partition $0 = \tau_0 < \tau_1 < \cdots < \tau_K = n$ into $K$ segments on which the $\pi_i$ are piecewise constant, i.e. $\pi_i = p_j$ for $i \in I_j$. Here, $I_j := (\tau_{j-1}, \tau_j]$ denotes the $j$'th segment with response probability $p_j$ for $1 \le j \le K$, while $K$ is unknown and can be arbitrarily large.
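The binary coding described above can be sketched in a few lines of Python. This is only an illustration: the function names `gc_indicator` and `segment_means` are hypothetical, and rejecting symbols other than A, C, G, T is an assumption, since the text does not say how ambiguous bases are handled.

```python
def gc_indicator(sequence):
    """Map a DNA string to binary responses: 1 for G or C, 0 for A or T."""
    responses = []
    for base in sequence.upper():
        if base in "GC":
            responses.append(1)
        elif base in "AT":
            responses.append(0)
        else:
            # Assumption: ambiguous symbols such as N are rejected, not guessed.
            raise ValueError(f"unexpected base: {base!r}")
    return responses


def segment_means(responses, boundaries):
    """GC fraction on each segment (tau_{j-1}, tau_j], with 0 = tau_0 < ... < tau_K = n."""
    return [
        sum(responses[a:b]) / (b - a)
        for a, b in zip(boundaries[:-1], boundaries[1:])
    ]


y = gc_indicator("ATATGCGC")
print(y)                            # [0, 0, 0, 0, 1, 1, 1, 1]
print(segment_means(y, [0, 4, 8]))  # [0.0, 1.0]
```

Here the half-open segment $(\tau_{j-1}, \tau_j]$ in 1-based positions corresponds to the Python slice `responses[tau_prev:tau]`.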

A segmentation algorithm provides estimates $\hat K$ for the number of segments, for the internal segment boundaries,

$$0 = \hat\tau_0 < \hat\tau_1 < \hat\tau_2 < \cdots < \hat\tau_{\hat K - 1} < \hat\tau_{\hat K} = n, \tag{2}$$

and for the response probabilities, $\hat p_j$, on the estimated segments $\hat I_j := (\hat\tau_{j-1}, \hat\tau_j]$.

In the following we will identify a segmentation with $(p, I)$, where $p = (p_1, \ldots, p_K)$ and $I = (I_1, \ldots, I_K)$.

Our proposed algorithm provides a parsimonious estimate $\hat K$ for $K$: $\hat K$ will not exceed the actual number of segments $K$, except for a small user-specified error probability $\alpha$; as a default value, we suggest $\alpha = 5\%$, the error probability also chosen in our simulations and data analyses. Relaxing this significance level to a larger value, $\alpha = 20\%$ say, will typically lead to more identified segments, but at the cost of statistical accuracy.

2 METHOD

Our proposed multiscale segmentation provides estimates for the number of segments and their boundaries at the same time. We employ a certain multiscale statistic which will ensure that the estimator fits the data well on all segments simultaneously, i.e. the number of segments is not underestimated with high probability. This estimator is based on Frick et al. (2013) who proposed a general statistical multiscale change-point estimator (SMUCE) for exponential family models. Exponential families include many classes of well known distributions, such as the Gaussian (normal) class, the Poisson class or the Bernoulli class, which is of particular interest for this paper.

In the Gaussian setting, a related estimator has also been suggested in Davies et al. (2012). Forerunners of SMUCE are based on a penalized likelihood with a penalty that depends on the number of jumps, see e.g. Yao (1988); Braun et al. (2000); Winkler et al. (2002); Boysen et al. (2009) and the introduction in Frick et al. (2013) for a brief survey. The underlying multiscale statistic is based on the work of Duembgen et al. (2001), see also Duembgen et al. (2008) and Walther (2010). For a general description of the approach underlying the present work, its statistical interpretation, statistical optimality, and theoretical results we again refer to Frick et al. (2013). To ease understanding, in the following we elaborate in greater detail on the case of binary Bernoulli observations with success probabilities given as piecewise constant segments, as this model underlies the methodology for the segmentation problem at hand. In contrast to other approaches such as Hidden Markov Models, we require neither explicit nor implicit distributional assumptions on the segments and their lengths.

Let $\ell(Y_i, \pi_i)$ denote the likelihood of $Y_i$ under the parameter $\pi_i$, i.e. $\ell(Y_i, \pi_i) = \pi_i$ if $Y_i = 1$ and $\ell(Y_i, \pi_i) = 1 - \pi_i$ otherwise. We then define for a fixed interval $(i, j]$ with $0 \le i < j \le n$ the local likelihood ratio statistic

$$T_{(i,j]}(p_0) = \log\left( \max_{\tilde p_0 \in [0,1]} \prod_{l=i+1}^{j} \frac{\ell(Y_l, \tilde p_0)}{\ell(Y_l, p_0)} \right). \tag{3}$$

Roughly, this statistic indicates how well the data on the sub-interval $(i, j]$ are described by the constant response probability $p_0 \in [0, 1]$ of some segment under consideration, as opposed to choosing some $\tilde p_0 \in [0, 1]$ freely for that sub-interval.
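For Bernoulli data the maximum in the local likelihood ratio statistic is attained at the observed fraction of ones, so the statistic has a closed form. The following Python sketch computes it under that standard fact; the helper names `xlogy`, `log_lik` and `local_lr` are hypothetical and not taken from the stepR package.

```python
import math


def xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    if x == 0:
        return 0.0
    if y <= 0:
        return -math.inf
    return x * math.log(y)


def log_lik(k, m, p):
    """Bernoulli log-likelihood of k ones among m observations under probability p."""
    return xlogy(k, p) + xlogy(m - k, 1.0 - p)


def local_lr(y, i, j, p0):
    """Local log-likelihood ratio on (i, j], i.e. for Y_{i+1}, ..., Y_j (0 <= i < j <= n)."""
    m = j - i
    k = sum(y[i:j])
    # The free maximiser is the empirical fraction k / m.
    return log_lik(k, m, k / m) - log_lik(k, m, p0)


y = [1, 1, 0, 1]
print(local_lr(y, 0, 4, 0.75))  # 0.0, since 0.75 is the observed fraction
print(local_lr(y, 0, 4, 0.5))   # positive: 0.5 fits worse than the free choice
```

The value is zero exactly when $p_0$ equals the empirical fraction, and it becomes infinite if $p_0 \in \{0, 1\}$ contradicts the data.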

As a goodness-of-fit measure for the segmentation $(p, I)$ we consider the scale-calibrated multiresolution statistic

$$T_n(p, I) = \max_{1 \le l \le K} \; \max_{(i,j] \subseteq I_l} \left( \sqrt{2\, T_{(i,j]}(p_l)} - \sqrt{2 \log\left( \frac{e \cdot n}{j - i} \right)} \right). \tag{4}$$

(Here “e” denotes Euler's number.) This statistic may be interpreted as follows: for all segments $I_l$ in $I$ the response probability is assumed to be constant, and for every interval $(i, j]$ within such a segment, $T_n(p, I)$ measures whether the data are well described by the constant response probability $p_l$ on that interval. It thus checks the quality of fit on all scales simultaneously, hence the name. Note that the log-penalty term depends on the length $j - i$ of the interval that is currently checked for deviations from the model. Indeed, it takes the number of disjoint intervals of the considered length into account, thereby adjusting for multiple testing. For our final estimate we determine

1. the minimal number of segments $\hat K$, such that there exists a segmentation $(p, I)$ with $\hat K$ segments satisfying the multiresolution constraint $T_n(p, I) \le q$ for some pre-determined significance threshold $q$, and

2. the segmentation $(\hat p, \hat I)$ with maximal likelihood among all segmentations having $\hat K$ segments and satisfying the multiresolution constraint.
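To make the multiresolution constraint concrete, the sketch below evaluates the statistic for a given segmentation by brute force over all sub-intervals of every segment. It assumes the square-root calibration of SMUCE as described in Frick et al. (2013); the function names are hypothetical, and the cubic running time makes it suitable for toy inputs only.

```python
import math


def local_lr(k, m, p0):
    """Local log-likelihood ratio for k ones among m Bernoulli observations."""
    def log_lik(p):
        total = 0.0
        for count, prob in ((k, p), (m - k, 1.0 - p)):
            if count > 0:
                if prob <= 0.0:
                    return -math.inf
                total += count * math.log(prob)
        return total
    return log_lik(k / m) - log_lik(p0)


def multires_stat(y, boundaries, levels):
    """Largest penalised local statistic over all sub-intervals of all segments."""
    n = len(y)
    prefix = [0]
    for value in y:
        prefix.append(prefix[-1] + value)
    worst = -math.inf
    segments = zip(boundaries[:-1], boundaries[1:])
    for (a, b), p in zip(segments, levels):
        for i in range(a, b):
            for j in range(i + 1, b + 1):
                m = j - i
                t = max(local_lr(prefix[j] - prefix[i], m, p), 0.0)  # guard rounding
                penalty = math.sqrt(2.0 * math.log(math.e * n / m))
                worst = max(worst, math.sqrt(2.0 * t) - penalty)
    return worst


y = [0] * 8 + [1] * 8
print(multires_stat(y, [0, 8, 16], [0.0, 1.0]))  # negative: perfect fit on both segments
print(multires_stat(y, [0, 16], [0.5]))          # positive: one segment fits poorly
```

A segmentation is admissible for threshold $q$ whenever the returned value does not exceed $q$.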

To be more precise, let $C_K$ denote the set of segmentations with $K$ segments. Then, our estimate in the second step is

$$(\hat p, \hat I) = \operatorname*{argmax}_{(p,I) \in C_{\hat K},\; T_n(p,I) \le q} \ell(Y; p, I), \tag{5}$$

where argmax denotes a position $(p, I)$ at which the maximum is obtained, and $\ell(Y; p, I)$ denotes the likelihood of all data, if the segmentation $(p, I)$ with $K$ segments were true, i.e.

$$\ell(Y; p, I) = \prod_{\substack{1 \le l \le K \\ i \in I_l}} \ell(Y_i, p_l). \tag{6}$$

Following Frick et al. (2013), the general class of such estimators in exponential families has been denoted as SMUCE (Statistical MUltiscale Change point Estimator). We adopt this terminology and will denote the estimator in (5) for the Bernoulli and binomial case as B-SMUCE.

The threshold parameter $q$ determines the parsimony of the estimator: the larger $q$, the fewer segments will be included in B-SMUCE. Hence, the choice of $q$ is crucial. A statistically attractive feature of B-SMUCE is that $q$ can be chosen as the $(1 - \alpha)$-quantile of the distribution of $T_n(p, I)$ under the hypothesis that $(p, I)$ is the true model. In Frick et al. (2013) it has been proven that this choice ensures that the number of segments is not overestimated with probability at least $1 - \alpha$.

To be a little more precise, in Frick et al. (2013), Theorem 2.1, it was shown under some mild technical assumptions that for any $(p, I)$ the asymptotic distribution of the multiresolution statistic $T_n(p, I)$ can be bounded by the asymptotic distribution of the statistic for a signal with only one segment. Moreover, the latter distribution converges to the limit distribution of (4) for the case of i.i.d. zero-mean Gaussian observations. Therefore, we may (and we do in the following) simply use Monte Carlo simulations with i.i.d. zero-mean Gaussian data to determine bounds on the quantiles of the distribution of $T_n(p, I)$. In fact, in simulations we found the approximate quantiles so obtained to be rather conservative (i.e. the preassigned error probability $\alpha$ was not exceeded) even for small sample sizes. This adds support to the basic inequality

$$P(\hat K > K) \le P(T_n(p, I) > q) \le \alpha \tag{7}$$

in Section 1.2 of Frick et al. (2013), which renders SMUCE a method that statistically controls the error of overestimating the number of segments in the binary case, i.e. it provides the statistical validity of B-SMUCE in the above sense (7). The other way around, Theorem 2.2 in Frick et al. (2013) provides an exponential deviation bound for the error of underestimating the true number of segments, which explicitly depends on the smallest segment length and signal strength. Under prior information on these quantities, these two inequalities together even allow us to give a guarantee for the probability $P(\hat K = K)$ of specifying the number of segments correctly. Moreover, in the Gaussian case, it can be shown that SMUCE attains optimal detection rates (and even constants) over a large range of segment lengths, see Frick et al. (2013), Theorems 2.6, 2.7. Our simulations suggest a similar performance in the binary/binomial case, although we do not have a rigorous proof of this statement.
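The Monte Carlo step for the threshold $q$ can be sketched as follows. This is an illustration under stated assumptions, not the stepR implementation: it uses the fact that for unit-variance Gaussian data tested against mean zero the local likelihood ratio on an interval of length $m$ with sum $S$ equals $S^2/(2m)$, and the function names, the number of runs and the seed are all hypothetical choices.

```python
import math
import random


def gaussian_null_stat(z):
    """Multiscale statistic for i.i.d. N(0, 1) data treated as one segment at level 0."""
    n = len(z)
    prefix = [0.0]
    for value in z:
        prefix.append(prefix[-1] + value)
    best = -math.inf
    for m in range(1, n + 1):
        penalty = math.sqrt(2.0 * math.log(math.e * n / m))
        for i in range(n - m + 1):
            # For unit-variance Gaussians sqrt(2 T) equals |sum| / sqrt(m).
            value = abs(prefix[i + m] - prefix[i]) / math.sqrt(m) - penalty
            best = max(best, value)
    return best


def simulate_threshold(n, alpha=0.05, runs=200, seed=0):
    """Empirical (1 - alpha)-quantile of the null statistic over `runs` replicates."""
    rng = random.Random(seed)
    stats = sorted(
        gaussian_null_stat([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(runs)
    )
    index = min(runs - 1, max(0, math.ceil((1.0 - alpha) * runs) - 1))
    return stats[index]


q = simulate_threshold(n=100, alpha=0.05, runs=200, seed=1)
print(q)  # simulated threshold for sequences of length 100
```

Because the simulation does not depend on the data, such thresholds could be tabulated once per sequence length and error level.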

2.1 Details of the Algorithm

We follow Frick et al. (2013) and use a pruned version of dynamic programming to compute our estimated segmentation $\hat I$ and levels $\hat p$. This is possible since our multiresolution statistic $T_n$ considers only subintervals of the candidate segments of constant response probability. For related ideas, where dynamic programming has been used for other segmentation estimators, see e.g. Friedrich et al. (2008), Boysen et al. (2009), Davies et al. (2012) and in particular for pruning Killick et al. (2012). To describe the algorithm for computing B-SMUCE, we need the following notation, identifying a segmentation with $(p, I)$ again: for an interval $(i, j]$ we define the local costs of a response probability $p_0 \in [0, 1]$ as

$$c_{(i,j]}(p_0) = \begin{cases} -\sum_{l=i+1}^{j} \log \ell(Y_l, p_0) & \text{if } T_n\big(p_0, (i,j]\big) \le q, \\ \infty & \text{otherwise,} \end{cases} \tag{8}$$

where $T_n\big(p_0, (i,j]\big)$ denotes the statistic (4) evaluated for the single segment $(i, j]$ with level $p_0$. The costs are written as negative log-likelihoods so that they add up over segments.

Let $c^*_{(i,j]} = \min_{p_0 \in [0,1]} c_{(i,j]}(p_0)$ denote the minimal costs on $(i, j]$ for a constant response probability under the multiresolution constraint, while $p_{(i,j]} = \operatorname{argmin}_{p_0 \in [0,1]} c_{(i,j]}(p_0)$ denotes the corresponding optimal estimate.

Let us for the moment only consider the observations $Y_1, \ldots, Y_i$ for fixed $1 \le i \le n$, and denote by $c^*_{i,K}$ the optimal overall costs on $(0, i]$ using $K$ segments, i.e.

$$c^*_{i,K} = \min_{(p,I) \in C_{K,i},\; T_i(p,I) \le q} \; -\sum_{\substack{1 \le l \le K \\ j \in I_l}} \log \ell(Y_j, p_l), \tag{9}$$

cf. (5), where $C_{K,i}$ denotes the set of segmentations of $(0, i]$ with $K$ segments; if no segmentation $(p, I) \in C_{K,i}$ fulfills the multiresolution constraint $T_i(p, I) \le q$, we let $c^*_{i,K} = \infty$.

The algorithm for B-SMUCE is then based on the observation that for $K > 0$

$$c^*_{i,K} = \min_{1 \le l < i} \left( c^*_{l,K-1} + c^*_{(l,i]} \right). \tag{10}$$

In dynamic programming this is called the Bellman equation; it is the main ingredient for an efficient implementation, see line 11 in Algorithm 1.

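Algorithm 1 Dynamic Programming Algorithm for B-SMUCE

 1: $K \leftarrow 0$, $I_0 \leftarrow \emptyset$, $p_0 \leftarrow \emptyset$, $i \leftarrow 1$
 2: while $i \le n$ do
 3:   if $K = 0$ then
 4:     $l^* \leftarrow 0$
 5:     $c^*_{i,0} \leftarrow c^*_{(0,i]}$
 6:   else
 7:     for $l = i-1, \ldots, 1$ do
 8:       if $c^*_{(l,i]} = \infty$ then
 9:         goto 14
10:       else
11:         $c_i^l \leftarrow c^*_{l,K-1} + c^*_{(l,i]}$
12:       end if
13:     end for
14:     $l^* \leftarrow \operatorname{argmin}_{l \le j < i} c_i^j$
15:     $c^*_{i,K} \leftarrow c_i^{l^*}$
16:   end if
17:   if $c^*_{i,K} = \infty$ then
18:     $K \leftarrow K + 1$
19:     goto 3
20:   end if
21:   $I_i \leftarrow \left( I_{l^*}, (l^*, i] \right)$
22:   $p_i \leftarrow \left( p_{l^*}, p_{(l^*,i]} \right)$, $i \leftarrow i + 1$
23: end while
24: return $K$, $I_n$, $p_n$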

Note that we introduced two rules that permit early stopping of loops: if $c^*_{(l,i]} = \infty$, i.e. if the hypothesis of constancy on $(l, i]$ is rejected, then consequently this also happens on any larger interval, i.e. for any smaller $l$; this justifies lines 8–9. Similarly, if $K$ segments are insufficient to fulfill the multiresolution constraint on $(0, i]$, then a fortiori so for any larger $i$, whence lines 17–20. To the best of our knowledge, these shortcuts, which are possible due to the specific structure of the multiscale constraint, have not been used so far. Additional improvements were used in our implementation; these, however, are rather technical and thus omitted from the present paper.
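A deliberately simplified Python sketch of the dynamic program may help to see how the Bellman recursion and the multiresolution constraint interact. It is not the authors' implementation: candidate levels are restricted to a finite grid (an assumption made here so that admissible levels can be stored as a bitmask), costs are negative log-likelihoods, only the first early-stopping rule is used, and the threshold `q = 1.0` in the usage lines is an arbitrary illustrative value rather than a simulated quantile.

```python
import math


def neg_log_lik(k, m, p):
    """Negative Bernoulli log-likelihood of k ones among m observations."""
    total = 0.0
    for count, prob in ((k, p), (m - k, 1.0 - p)):
        if count > 0:
            if prob <= 0.0:
                return math.inf
            total -= count * math.log(prob)
    return total


def b_smuce_sketch(y, q, grid_size=101):
    """Return (boundaries, levels) with the fewest segments passing the constraint."""
    n = len(y)
    grid = [g / (grid_size - 1) for g in range(grid_size)]  # assumed candidate levels
    full = (1 << grid_size) - 1
    prefix = [0]
    for value in y:
        prefix.append(prefix[-1] + value)

    def accepted(a, b):
        """Bitmask of grid levels that pass the local test on (a, b] alone."""
        m, k = b - a, prefix[b] - prefix[a]
        bound = q + math.sqrt(2.0 * math.log(math.e * n / m))
        if bound < 0.0:
            return 0
        fitted = neg_log_lik(k, m, k / m)
        bits = 0
        for g, p in enumerate(grid):
            if 2.0 * (neg_log_lik(k, m, p) - fitted) <= bound * bound:
                bits |= 1 << g
        return bits

    # feas[a][b]: levels passing the local test on every sub-interval of (a, b].
    feas = [[0] * (n + 1) for _ in range(n + 1)]
    for a in range(n + 1):
        feas[a][a] = full
    seg_cost = [[math.inf] * (n + 1) for _ in range(n + 1)]
    seg_level = [[None] * (n + 1) for _ in range(n + 1)]
    for b in range(1, n + 1):
        column = full  # levels accepted on all (r, b] with r >= a
        for a in range(b - 1, -1, -1):
            column &= accepted(a, b)
            feas[a][b] = feas[a][b - 1] & column
            if feas[a][b] == 0:
                break  # early stop: every longer interval ending at b fails too
            k, m = prefix[b] - prefix[a], b - a
            for g, p in enumerate(grid):
                if (feas[a][b] >> g) & 1:
                    cost = neg_log_lik(k, m, p)
                    if cost < seg_cost[a][b]:
                        seg_cost[a][b], seg_level[a][b] = cost, p

    cost = [0.0] + [math.inf] * n  # zero segments cover only the empty prefix
    pointers = []
    for _ in range(n):  # try K = 1, 2, ... segments
        new_cost = [math.inf] * (n + 1)
        pointer = [None] * (n + 1)
        for i in range(1, n + 1):
            for l in range(i):  # Bellman step over the last boundary l
                candidate = cost[l] + seg_cost[l][i]
                if candidate < new_cost[i]:
                    new_cost[i], pointer[i] = candidate, l
        pointers.append(pointer)
        cost = new_cost
        if cost[n] < math.inf:
            break
    else:
        raise ValueError("no segmentation satisfies the constraint for this q")

    boundaries, levels, i = [n], [], n
    for pointer in reversed(pointers):
        l = pointer[i]
        boundaries.append(l)
        levels.append(seg_level[l][i])
        i = l
    return boundaries[::-1], levels[::-1]


print(b_smuce_sketch([0] * 20 + [1] * 20, q=1.0))  # ([0, 20, 40], [0.0, 1.0])
print(b_smuce_sketch([0, 1] * 20, q=1.0))          # ([0, 40], [0.5])
```

The sketch stores all interval costs and therefore needs quadratic memory; it is meant for short toy sequences, whereas the pruning rules described above are what make long sequences tractable.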

3 RESULTS

We evaluated our segmentation approach both on simulated data and on data taken from the human genome and the long known λ-phage. In our simulations, we used the benchmark scenarios proposed in Elhaik et al. (2010a). Since an extensive comparison of popular DNA segmentation algorithms under these benchmark scenarios is already available in Elhaik et al. (2010a), we provide a comparison of our approach with the method that performed best in Elhaik et al. (2010a), namely the one based on the Jensen-Shannon divergence. This recursive approach (called $D_{JS}$) splits one of the current segments in each step. This is done by adding a new break point such that the improvement in Jensen-Shannon divergence is maximized. The algorithm stops when the improvement does not reach a threshold value obtained via simulations. Here, we used the Matlab implementation Djsegmentation.m of the algorithm, which is publicly available as part of ISOPLOTTER 2.4 (http://code.google.com/p/isoplotter/). There, $5.8 \times 10^{-5}$ is taken as threshold, a value obtained from simulating long (1Mb) homogeneous sequences. While this value seems to work well for the considered benchmark scenarios and might also be useful to prevent false positives when searching for long homogeneous sequences, it might be less suitable for balancing false positives and false negatives under other scenarios. Therefore a modified version (called ISOPLOTTER) of $D_{JS}$ was proposed shortly afterwards (Elhaik et al., 2010b) that uses critical values dependent both on the segment length and the standard deviation of GC content. We therefore also report on the performance of ISOPLOTTER 2.4 (again under the default parameter settings), and provide detailed results in the supplement.

To facilitate comparison and to accelerate the computations for longer sequences, we binned the data and applied our algorithm to the resulting binomial frequencies. We chose the bin size equal to 32, which is the default value with the $D_{JS}$ and IsoPlotter software, and has also been used in Elhaik et al. (2010a). While binning the data clearly improves the speed, it should be noted that it is not essential for the algorithm to work.
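Binning amounts to replacing each run of 32 binary responses by a binomial count. A minimal sketch follows, with a hypothetical function name; keeping a shorter final bin together with its length is an assumption, since the text does not say how a remainder is treated.

```python
def bin_counts(y, bin_size=32):
    """Return (number of ones, bin length) for consecutive bins of the 0/1 responses."""
    bins = []
    for start in range(0, len(y), bin_size):
        chunk = y[start:start + bin_size]
        # Assumption: a shorter final bin is kept and reported with its true length.
        bins.append((sum(chunk), len(chunk)))
    return bins


y = [1] * 40 + [0] * 30
print(bin_counts(y))  # [(32, 32), (8, 32), (0, 6)]
```

Each pair can then be treated as a binomial observation with the bin length as the number of trials.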

The first scenario considered there involves sequences consisting of fixed-size homogeneous domains. Although not realistic in practice, this setup has been proposed by Elhaik et al. (2010a) as a minimum standard: a criterion that does not perform well on such data cannot be expected to perform well under more complex settings. The second scenario consists of sequences with domains of random length generated according to a power-law distribution. These sequences are reported to mimic mammalian genomes well, see Clay et al. (2001).

3.1 Performance measures

We measured performance both by a qualitative criterion proposed by Elhaik et al. (2010a), and by a new quantitative criterion. For the qualitative criterion, we classify an identified segment as true positive, if both segment boundaries are identified correctly within an error margin of 5000 bases, or 5% of the segment length, whichever is smaller. Thus an identified segment is considered to be a false positive, unless both detected boundaries were within 5000 bases (or 5%) from the boundaries of a true segment. Similarly, actual segments were taken as false negatives, if they were not detected correctly within the permitted tolerance. Let now $tp$, $fp$, and $fn$ denote the number of true positives, false positives and false negatives respectively. Following Elhaik et al. (2010a), we define a sensitivity rate as

$$r_s := \frac{tp}{tp + fn}, \tag{11}$$

and a precision rate as

$$r_p := \frac{tp}{tp + fp}. \tag{12}$$

We investigate the performance of our proposed approach based on these criteria.
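The qualitative criterion can be computed as in the following sketch. The function name `match_rates` is hypothetical, and taking the 5% tolerance relative to the length of the true segment is an assumption, since the text does not state which segment length is meant.

```python
def match_rates(true_bounds, est_bounds, max_tol=5000, rel_tol=0.05):
    """Sensitivity and precision of estimated segments given boundary lists 0 = t_0 < ... < t_K = n."""
    true_segs = list(zip(true_bounds[:-1], true_bounds[1:]))
    est_segs = list(zip(est_bounds[:-1], est_bounds[1:]))

    def matches(true_seg, est_seg):
        # Assumption: the relative tolerance refers to the true segment's length.
        tol = min(max_tol, rel_tol * (true_seg[1] - true_seg[0]))
        return (abs(true_seg[0] - est_seg[0]) <= tol
                and abs(true_seg[1] - est_seg[1]) <= tol)

    tp = sum(any(matches(t, e) for t in true_segs) for e in est_segs)
    fn = sum(not any(matches(t, e) for e in est_segs) for t in true_segs)
    fp = len(est_segs) - tp
    return tp / (tp + fn), tp / (tp + fp)


sensitivity, precision = match_rates(
    [0, 100000, 200000, 300000],  # three true segments
    [0, 101000, 300000],          # two estimated segments, one boundary missed
)
print(sensitivity, precision)     # one true positive: sensitivity 1/3, precision 0.5
```

In the usage example the first estimated segment is within tolerance of the first true segment, whereas the second estimated segment spans two true segments and therefore counts as a false positive.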

Furthermore, we defined two quantitative criteria that better reflect the accuracy of detecting segment boundaries. We denote them by false negative sensitive localization error (FNSLE) and false positive sensitive localization error (FPSLE). To introduce the FNSLE, consider a true segment $I_j := (\tau_{j-1}, \tau_j]$, and let $m_j = \frac{\tau_{j-1} + \tau_j}{2}$ denote the midpoint of the segment. We define the best matching estimated segment as the segment $\hat I_l = (\hat\tau_{l-1}, \hat\tau_l]$ with $m_j \in \hat I_l$. The FNSLE for segment $I_j$ is then defined as the mean distance

$$e_j^{(FNSL)} = \frac{1}{2} \left( |\tau_{j-1} - \hat\tau_{l-1}| + |\tau_j - \hat\tau_l| \right) \tag{13}$$

between true and estimated boundaries.

The overall FNSLE is then defined as the mean FNSLE taken over all true segments

$$e^{(FNSL)} := \frac{1}{K} \sum_{j=1}^{K} e_j^{(FNSL)}. \tag{14}$$

Analogously, the FPS localization error can be defined by measuring how closely an estimated segment matches one of the true segments. Starting with an estimated segment $\hat I_l$ and its midpoint $\hat m_l$, we look for the true segment satisfying $\hat m_l \in I_j$. With analogously defined errors $e_l^{(FPSL)}$, we call

$$e^{(FPSL)} := \frac{1}{\hat K} \sum_{l=1}^{\hat K} e_l^{(FPSL)} \tag{15}$$

the overall FPSLE.

These measures for the error may be interpreted as follows: assume the estimated segmentation agrees with the true segmentation in the number of segments, and that all boundaries have been determined with an error smaller than half the length of each neighboring segment. Then FPSLE and FNSLE agree: they essentially give the average distance between true and estimated boundaries.


These error measures behave differently, however, if the numbers of true and detected segments do not coincide: assume that the estimated segmentation is the true segmentation except that it has incorrectly split one true segment into two estimated segments. Then the FNSLE increases by the length of that true segment minus the length of the longer estimated segment, divided by $2K$, i.e. a spurious split is treated like an error in localizing that boundary. The FPSLE, however, will get rather large, as the length of that true segment divided by $2\hat K$ gets added. Similarly, if a true boundary is not detected, the FNSLE will be larger.
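Both localization errors can be computed with one helper, since the FPSLE is the FNSLE with the roles of true and estimated segmentation exchanged. The sketch below uses hypothetical function names and represents a segmentation by its boundary list from 0 to $n$; the usage lines reproduce the spurious-split example with a true segment of length 100 split into pieces of length 60 and 40.

```python
def _mean_boundary_error(reference, other):
    """Mean over reference segments of the average boundary distance to the
    segment of `other` that contains the reference segment's midpoint."""
    total = 0.0
    for a, b in zip(reference[:-1], reference[1:]):
        mid = (a + b) / 2
        for c, d in zip(other[:-1], other[1:]):
            if c < mid <= d:  # midpoint lies in the half-open segment (c, d]
                total += 0.5 * (abs(a - c) + abs(b - d))
                break
    return total / (len(reference) - 1)


def fnsle(true_bounds, est_bounds):
    """False negative sensitive localization error: average over true segments."""
    return _mean_boundary_error(true_bounds, est_bounds)


def fpsle(true_bounds, est_bounds):
    """False positive sensitive localization error: average over estimated segments."""
    return _mean_boundary_error(est_bounds, true_bounds)


true_bounds = [0, 100, 200]
est_bounds = [0, 100, 160, 200]        # second true segment split into 60 + 40
print(fnsle(true_bounds, est_bounds))  # 10.0 = (100 - 60) / (2 * 2)
print(fpsle(true_bounds, est_bounds))  # 100 / (2 * 3), about 16.67
```

The FNSLE charges only the distance to the longer piece, while the FPSLE charges the full length of the split segment, in line with the interpretation given in the text.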

3.2 Simulations

3.3 Segments of Equal Length

We first implemented scenario I of Elhaik et al. (2010a). Thus we simulated sequences that consist of 10 segments of equal length. We considered the following eight different segment lengths: 10kb, 50kb, 100kb, 200kb, 300kb, 500kb, 1Mb, 5Mb. Thus the longest sequences had total length 50Mb. For each sequence, we selected a global probability $p(t)$ for the response “1” at a position according to a uniform distribution on $[0.1, 0.9]$. Then we randomly modified this probability for each sequence segment $j$ by taking

$$p_j = p(t) + \sigma Z_j. \tag{16}$$

Here $Z_j$ denotes a standard normal random number, and $\sigma$ was chosen from $\{0, 0.025, 0.05, 0.075, 0.1\}$. The $p_j$ were conditioned to lie within $[0, 1]$, i.e. if $p_j$ did not turn out to be a proper probability, a new random number was generated. The individual observations $Y_i$ within a given segment $I_j$ were then chosen as independent Bernoulli random variables with expected value $p_j$.

We simulated 100 sequences for each combination of segment length and heterogeneity $\sigma$ of the segment specific response probabilities.
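The simulation design for scenario I can be sketched as below. The function name, its arguments and the seed handling are hypothetical; the sampling steps follow the description in the text (uniform global probability, Gaussian perturbation per segment with redrawing, independent Bernoulli responses).

```python
import random


def simulate_scenario_one(segment_length, sigma, n_segments=10, seed=None):
    """Simulate one sequence of equally long segments; returns (y, boundaries, levels)."""
    rng = random.Random(seed)
    p_global = rng.uniform(0.1, 0.9)
    levels = []
    while len(levels) < n_segments:
        p = p_global + sigma * rng.gauss(0.0, 1.0)
        if 0.0 <= p <= 1.0:  # redraw until p is a proper probability
            levels.append(p)
    y = [
        1 if rng.random() < p else 0
        for p in levels
        for _ in range(segment_length)
    ]
    boundaries = [j * segment_length for j in range(n_segments + 1)]
    return y, boundaries, levels


y, boundaries, levels = simulate_scenario_one(segment_length=1000, sigma=0.05, seed=3)
print(len(y), boundaries[:3])  # 10000 [0, 1000, 2000]
```

With `sigma=0` all segment levels coincide with the global probability, which gives the homogeneous sequences used to check for spurious boundaries.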

In Elhaik et al. (2010a), a detection threshold was introduced, and neighboring true segments for which the value of $p_j$ differed by less than this threshold have been merged and considered as a single segment in the subsequent performance analysis. We did not use such a threshold, however, since we did not want to penalize high sensitivity. Indeed, a correct detection of two neighboring segments with unequal but too similar levels $p_j$ would be counted as an error, if such a detection threshold was used.

The average sensitivity and precision rates of B-SMUCE and the method based on the Jensen-Shannon divergence ($D_{JS}$) are displayed in Figures 1 and 2 respectively. Especially for shorter segments, B-SMUCE performs better than $D_{JS}$, with higher sensitivity and precision. IsoPlotter performed worse than $D_{JS}$ under the considered scenarios and gave up to 40 segments on average for the very long sequences. B-SMUCE is also able to detect short segments while controlling the number of spuriously detected segments. Notice however that in the case of a homogeneous sequence without partitioning into segments ($\sigma = 0$), $D_{JS}$ always obtained the correct answer whereas B-SMUCE sometimes introduced spurious segment boundaries. This is to be expected, as the error of introducing spurious boundaries has been set to $\alpha = 5\%$ under such a model. Furthermore, under all scenarios, too many boundaries were estimated by B-SMUCE in less than 5% of the simulations, as predicted. We consider the ability to control this error to be a particular strength of our approach.

[Figure 1: four panels, $\sigma$ = 0.025, 0.05, 0.075 and 0.1. Horizontal axis: segment length (logarithmic scale, $10^4$ to $5 \times 10^6$); vertical axis: sensitivity rate (ticks 0.2 to 0.8); curves: B-SMUCE and $D_{JS}$.]

Fig. 1. Average sensitivity rate as defined in (11) for $D_{JS}$ (◦) and B-SMUCE (×). Results are based on simulations under scenario I (segments of equal length) for several values of $\sigma$. The sensitivity obtained with the multiresolution criterion tends to be higher. At the simulated segment lengths, 95% confidence intervals are given as error bars.

[Figure 2: four panels (σ = 0.025, 0.05, 0.075, 0.1); x-axis: segment length (logarithmic scale, 10^4 to 5 × 10^6); y-axis: precision rate; legend: B-SMUCE, D_JS.]

Fig. 2. Average precision rate as defined in (12) for D_JS (◦) and B-SMUCE (×). Results are based on simulations under scenario I (segments of equal length) for several values of σ. The precision rate obtained with the multiresolution criterion tends to be higher. At the simulated segment lengths, 95% confidence intervals are given as error bars.

Figures 3 and 4 show the average FNSLE (14) and FPSLE (15), localization errors that measure the accuracy of the segment detection. For these quantitative criteria, we only consider sequences that are non-homogeneous (σ > 0). Again, B-SMUCE shows better performance, leading to smaller errors on average. With increasing heterogeneity σ there will typically be larger differences between neighboring segments, and thus smaller errors. The graphs also seem to indicate that both FPSLE and FNSLE tend to get larger with

Downloaded from http://bioinformatics.oxfordjournals.org/ at Universidade do Porto on April 22, 2014

Table 1. Sensitivity rate and precision rate under scenario 2 for the Jensen-Shannon divergence method (D_JS) and B-SMUCE. Long and short segments were generated from a power law distribution and the total sequence length was n = 10^6. We provide averages (and in parentheses standard errors) over 100 simulation runs.

                             D_JS             B-SMUCE
avg. true number             27.51 (1.90)     27.51 (1.90)
avg. number of true pos.     23.26 (1.69)     24.68 (1.80)
avg. number of false pos.     2.51 (0.20)      1.60 (0.14)
avg. number of false neg.     4.25 (0.37)      2.83 (0.24)
avg. sensitivity              0.83 (0.016)     0.87 (0.015)
avg. precision                0.88 (0.013)     0.92 (0.013)

increasing sequence lengths. A closer inspection reveals that this is mostly caused by outliers, as the median accuracy of detection stays nearly the same for all segment lengths. These outliers occur when a segment is either missed or detected incorrectly, and such events lead to larger errors when the segments are longer.

3.4 Segments According to Power Law

In our second simulation setup, we generated 100 sequences consisting of segments with different lengths. As in Elhaik et al. (2010a), we chose the segment lengths according to a power law distribution:

\[ p(x) = C\,x^{-a}\,\mathbf{1}[x > x_0], \tag{17} \]

with a = 1.55 and x_0 = 10000. The parameter a was chosen to mimic segment lengths for mammalian, in particular human, DNA sequences; see Elhaik et al. (2010a), Clay et al. (2001), Cohen et al. (2005). Notice that the minimal segment length x_0 = 10000 was introduced to avoid very short segments that are difficult to detect. The total sequence length was taken to be n = 10^6. For even-numbered segments, we selected the response probability p_j according to a uniform distribution on [0.6, 1], whereas for odd-numbered segments we drew p_j uniformly from [0, 0.4].
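For a > 1 the density in (17) normalizes with C = (a − 1) x_0^(a−1), and its distribution function F(x) = 1 − (x/x_0)^(−(a−1)) can be inverted in closed form, so segment lengths can be drawn by inverse-CDF sampling. The sketch below illustrates this together with the alternating choice of p_j. It is not the authors' simulation code: the function names, the rounding of lengths to integers, the truncation of the last segment so that the lengths sum to n, and the convention that segments are numbered from 1 are assumptions made here.

```python
import random

def sample_power_law_length(a, x0, rng):
    """Inverse-CDF draw from p(x) = C * x**(-a) on x > x0 (requires a > 1)."""
    u = rng.random()  # uniform on [0, 1)
    return x0 * (1.0 - u) ** (-1.0 / (a - 1.0))

def simulate_power_law_segments(n=10**6, a=1.55, x0=10_000, seed=0):
    """Return a list of (length, p_j) pairs whose lengths sum to n."""
    rng = random.Random(seed)
    segments, total, j = [], 0, 1
    while total < n:
        length = int(sample_power_law_length(a, x0, rng))
        length = min(length, n - total)  # assumption: truncate the last segment
        if j % 2 == 0:
            p = rng.uniform(0.6, 1.0)    # even-numbered segments
        else:
            p = rng.uniform(0.0, 0.4)    # odd-numbered segments
        segments.append((length, p))
        total += length
        j += 1
    return segments
```

Because a < 2, the power law has no finite mean, so individual runs can contain a few very long segments; the truncation step keeps the total length at n regardless.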

Qualitatively, it turns out that B-SMUCE performs better than the Jensen-Shannon entropy criterion D_JS both in terms of sensitivity and precision rate; see Table 1. B-SMUCE detected 90% of all true segments within the desired margin of error. Furthermore, 94% of all detected segments were correct, again within the desired level of accuracy. Since we used B-SMUCE with a type I error probability of 5% for including too many segments, this implies that almost all of the detected jumps were detected within the required level of accuracy. Furthermore, when incorrect, our method usually detected not more than one spurious segment boundary. We also tried IsoPlotter on the simulated sequences and obtained 85.23 detected segments on average. Given an average number of 27.51 true segments (see Table 1), more than three times the true number of segments has been detected on average. More detailed results on IsoPlotter can be found in the supplementary material.
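The 90% and 94% quoted above appear to be ratios of the averaged counts in Table 1, rather than the per-run averages 0.87 and 0.92 listed in its last two rows. A quick check, assuming the usual definitions sensitivity = TP/(TP + FN) and precision = TP/(TP + FP) (definitions (11) and (12) may differ in detail):

```python
# Average counts for B-SMUCE, taken from Table 1.
true_pos, false_pos, false_neg = 24.68, 1.60, 2.83

sensitivity = true_pos / (true_pos + false_neg)  # share of true segments found
precision = true_pos / (true_pos + false_pos)    # share of detections that are correct

print(f"sensitivity = {sensitivity:.3f}, precision = {precision:.3f}")
# prints: sensitivity = 0.897, precision = 0.939
```

Note that the true positives and false negatives add up to 27.51, the average true number of segments reported in the same table.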

B-SMUCE also leads to good results in terms of the FPSLE and FNSLE; see Table 2. Thus, on average, detected segments and true segments match more closely with B-SMUCE than with the D_JS criterion.

Table 2. False negative sensitive localization errors (FNSLE) and false positive sensitive localization errors (FPSLE) under scenario 2 for the Jensen-Shannon divergence method (D_JS) and B-SMUCE. Long and short segments were generated under a power law distribution and the total sequence length was n = 10^6. The results are averages (and standard errors) over 100 simulation runs. The error rates were standardized according to the average segment length.

             FNSLE           FPSLE
D_JS         0.27 (0.436)    0.06 (0.103)
B-SMUCE      0.11 (0.164)    0.01 (0.014)

[Figure 3: four panels (σ = 0.025, 0.05, 0.075, 0.1); x-axis: segment length (log10 scale, 10^4 to 5 × 10^6); y-axis: FNS localisation error (log10); legend: B-SMUCE, D_JS.]

Fig. 3. Logarithm (base 10) of average FNSLE (false negative sensitive localization error), as defined in (14) for D_JS (◦) and B-SMUCE (×). Results are based on simulations under scenario I (segments of equal length) for several values of σ. The errors encountered with the multiresolution criterion tend to be smaller. At the simulated segment lengths, 95% confidence intervals are given as error bars.

3.5 Real Data

We applied our segmentation algorithm to three data sets. The first two examples, λ phage and human MHC complex, have previously been studied in the context of segmentation algorithms. As a further example, a 10Mb sequence chunk (hg19, chr1:50000002-60000000) has been arbitrarily chosen from human chromosome 1.

The genome of bacteriophage λ consists of 48502 bases and was one of the first completely sequenced genomes. Our segmentation led to 6 segments with boundaries 1, 22501, 27829, 33186, 39172, 46367, 48502. Notice that the same number of segments, although with slightly different boundaries, has been reported as the outcome of a segmentation using Hidden Markov Models in Chapter 4 of Cristianini and Hahn (2007).


[Figure 4: four panels (σ = 0.025, 0.05, 0.075, 0.1); x-axis: segment length (log10 scale, 10^4 to 5 × 10^6); y-axis: FPS localisation error (log10); legend: B-SMUCE, D_JS.]

Fig. 4. Logarithm (base 10) of average FPSLE, as defined in (15) for D_JS (◦) and B-SMUCE (×). Results are based on simulations under scenario I (segments of equal length) for several values of σ. The errors encountered with the multiresolution criterion tend to be smaller. At the simulated segment lengths, 95% confidence intervals are given as error bars.

We next investigate human genome data from chromosome 6p21.3 and 6p22.1 (hg19, chr6:29,677,952-33,289,874). This segment harbors the much studied human MHC complex. We found even more segments than reported in Elhaik et al. (2010b), contradicting once again the concept of homogeneous isochores. ([...] from the UCSC browser for this example.) We recoded G, C as "1" and A, T as "0", and applied B-SMUCE as well as D_JS and IsoPlotter to this data. With D_JS, we found 182 segments. With B-SMUCE and a type I error probability of α = 0.05, we identified 640 segments. (With α = 0.01, 528 segments were obtained, and choosing α = 0.1 led to 716 segments.)
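The recoding step is straightforward; a minimal sketch is given below. The function name is hypothetical, and skipping symbols other than A, C, G, T (such as N) is an assumption of this sketch, since the handling of ambiguous bases is not described here.

```python
def recode_gc(sequence):
    """Map G/C to 1 and A/T to 0.

    Symbols other than A, C, G, T are skipped (assumption of this sketch).
    """
    mapping = {"G": 1, "C": 1, "A": 0, "T": 0}
    return [mapping[base] for base in sequence.upper() if base in mapping]

# Example: GC content of a short toy sequence.
binary = recode_gc("GATTACA")
gc_content = sum(binary) / len(binary)
```

The resulting 0/1 sequence is the kind of binary response that is then partitioned into segments of approximately constant success probability.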

A natural question is whether the number of 640 or 182 segments is more appropriate. To address this issue, notice that the statistical error control associated with the multiresolution criterion suggests that there are (except for a small error probability) at least 640 segments. To check whether this finding is also compatible with the D_JS segmentation, we took the segmentation with 640 segments obtained with B-SMUCE as our null model. We simulated 100 data sets from this null model, and for 80% of all data sets D_JS led to a segmentation with at most 182 segments. With the number of segments taken as test statistic, this amounts to an estimated p-value of 0.80. Thus the segmentation based on the Jensen-Shannon (D_JS) criterion does not contradict the assumption of 640 segments, whereas the hypothesis of 182 segments is rejected by the multiscale criterion underlying B-SMUCE as not being compatible with the data.
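The simulation-based check described above follows the usual Monte Carlo recipe: simulate from the null model, apply the competing method, and record how often its segment count is at most the observed one. The sketch below shows the structure only. The interface `count_segments` (any callable that returns the number of segments a method finds) and all names are hypothetical; no actual D_JS implementation is assumed.

```python
import random

def simulate_from_segmentation(segments, rng):
    """Draw a 0/1 sequence from a piecewise-constant Bernoulli model.

    `segments` is a list of (length, probability) pairs, e.g. a fitted
    segmentation that serves as the null model.
    """
    data = []
    for length, p in segments:
        data.extend(1 if rng.random() < p else 0 for _ in range(length))
    return data

def monte_carlo_p_value(segments, count_segments, observed_count,
                        n_sim=100, seed=0):
    """Estimate P(segments found <= observed_count) under the null model."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        data = simulate_from_segmentation(segments, rng)
        if count_segments(data) <= observed_count:
            hits += 1
    return hits / n_sim
```

In the setting of the text, `segments` would hold the 640 fitted segments, `observed_count` would be 182, and 80 hits out of 100 simulated data sets give the estimated p-value of 0.80.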

We also applied IsoPlotter 2.4 to the data. With its adaptive detection threshold, 227 segments were identified. The homogeneity test (one-sided F-test) provided with the IsoPlotter software confirmed for 180 of these 227 segments that they are significantly more homogeneous than the entire considered DNA sequence. Although this observation does not give us the number of segments actually

Table 3. Run times (in seconds) of the B-SMUCE algorithm when applied to sequences of different length taken from the human chromosome 1. The computations were carried out on a single cluster core with 2.6 GHz and 8 GB RAM. The results are reported for α = 0.05 and for bins of lengths 32 and 10. For α = 0.01, very similar run times have been obtained.

sequence length          10^5    2·10^5   5·10^5   10^6    5·10^6   10^7
run time, bin size 32    0.37    1.0      3.3      9.4     51.9     102.9
run time, bin size 10    1.7     4.0      12.3     30.7    169.7    384.0

present, it seems interesting that the number of sufficiently homogeneous segments found by IsoPlotter is almost the same as the number of segments identified with the D_JS criterion.

Finally, we considered the region between 50Mb and 60Mb on the human chromosome 1. Here, we tried bins of both size 10 and size 32. It turned out that with the finer partition slightly more segments were detected than with the larger bins of size 32, although the difference (1096 versus 1041) was not very large. It seems plausible that fine scale variation can be detected more easily with shorter bins.
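One plausible way to form such bins from a 0/1 coded sequence is to sum the values over consecutive windows, so that each bin holds a GC count between 0 and the bin size. The sketch below is an illustration under that assumption; the function name is hypothetical, and dropping a trailing incomplete bin is a choice made here, not a documented detail of the method.

```python
def bin_counts(binary_sequence, bin_size):
    """Sum 0/1 values over consecutive bins of length `bin_size`.

    A trailing incomplete bin is dropped (assumption of this sketch).
    """
    n_bins = len(binary_sequence) // bin_size
    return [sum(binary_sequence[i * bin_size:(i + 1) * bin_size])
            for i in range(n_bins)]

# Example: 7 observations in bins of size 3 give two complete bins.
counts = bin_counts([1, 0, 1, 1, 0, 0, 1], 3)
```

Smaller bins keep more positional detail at the cost of a longer input sequence, which is consistent with the longer run times reported for bin size 10 in Table 3.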

To illustrate the run time behavior of the B-SMUCE algorithm with our default significance threshold α = 0.05, we considered several shorter sequences taken from the above mentioned 10Mb DNA sequence. Table 3 gives the run times of our algorithm (in seconds) as a function of the sequence length.
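To get a rough feel for how the run times in Table 3 grow with sequence length, one can fit a straight line to the values on a log-log scale. The following is only a descriptive summary of the six reported timings per bin size, not a complexity analysis of the algorithm; the helper name is hypothetical.

```python
import math

# Sequence lengths and run times (seconds) as reported in Table 3.
lengths = [1e5, 2e5, 5e5, 1e6, 5e6, 1e7]
run_times = {32: [0.37, 1.0, 3.3, 9.4, 51.9, 102.9],
             10: [1.7, 4.0, 12.3, 30.7, 169.7, 384.0]}

def loglog_slope(x, y):
    """Least-squares slope of log10(y) regressed on log10(x)."""
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

for bin_size, times in run_times.items():
    print(bin_size, round(loglog_slope(lengths, times), 2))
```

For both bin sizes the fitted slope comes out a little above 1, i.e. over this range the reported run times grow slightly faster than linearly in the sequence length.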

To give an idea of the run times of D_JS and IsoPlotter, we applied them to the same sequences. With the standard options (bin size: 32, shortest detectable domain size: 3008) the run times for the longest sequence (10^7 bases) were 6.2 sec. (D_JS) and 9.6 sec. (IsoPlotter). Notice however that B-SMUCE is designed to detect segments of any length, and the shortest segment detected by B-SMUCE in the context of our run time analysis was 80 bases long. We therefore also recorded the run times for D_JS and IsoPlotter with the minimum segment length changed from 3008 to 80. For a bin width of 32 and a sequence length of 10^7, the run time for D_JS remained unchanged, whereas the run time for IsoPlotter increased to 329.1 sec.

For the human genome data considered here, a cross check with the genome annotation revealed that several segments have an interpretation in terms of genes/exons, repetitive elements or CpG islands. As the GC content may depend on several functional and evolutionary factors, we do not expect simple explanations for many of the identified segments. Nevertheless, we explore the overlap of the identified segments with available annotation in the supplement.

4 CONCLUSION

We introduced a new method (B-SMUCE) for the segmentation of biological sequences. The segmentation is with respect to a binary response; here, we have considered GC content, but it might be interesting to apply the method to other problems involving binary responses (such as ancestral/derived state of alleles in population genetics). Our approach provides precise statistical error control: with a user-specified probability of 1 − α, the resulting parsimonious segmentation does not contain more segments than are actually present. A comparison under the benchmark scenarios taken from Elhaik et al. (2010a) suggests that


the proposed method B-SMUCE is more accurate than previously proposed approaches.

Interestingly, the difference from the popular Jensen-Shannon criterion in terms of the number of detected segments was particularly large for the human data.

ACKNOWLEDGEMENT

We are grateful to Dr. Marlies Dolezal for her helpful comments. The research of A.M. and H.S. is supported by DFG FOR 916, SFB 803 and the Volkswagen Foundation. The work by A.F. is supported by the FWF (Vienna Graduate School of Population Genetics).

REFERENCES

Amit, M., Donyo, M., Hollander, D., Goren, A., Kim, E., Gelfman, S., Lev-Maor, G., Burstein, D., Schwartz, S., Postolsky, B., Pupko, T., Ast, G. (2012) Differential GC content between exons and introns establishes distinct strategies of splice-site recognition. Cell Rep., 1(5), 543-56.

Benjamini, Y., Speed, T.P. (2012) Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res., 40, e72.

Bernardi, G. (1989) The isochore organization of the human genome. Annu Rev Genet, 23, 637-61.

Bernardi, G. (1995) The human genome: organization and evolutionary history. Annu Rev Genet, 29, 445-76.

Bernardi, G. (2001) Misunderstandings about isochores. Part I. Gene, 276, 3-13.

Boysen, L., Kempe, A., Liebscher, V., Munk, A., Wittich, O. (2009) Consistencies and rates of convergence of jump-penalized least squares estimators. Ann. Statist., 37, 157-183.

Braun, J.V., Muller, H.G. (1998) Statistical methods for DNA segmentation. Statistical Science, 13, 142-162.

Braun, J.V., Braun, R.K., Muller, H.G. (2000) Multiple change-point fitting via quasi-likelihood, with application to DNA sequence segmentation. Biometrika, 87, 301-314.

Cristianini, N., Hahn, M.W. (2007) Computational Genomics. Cambridge University Press, Cambridge, UK.

Churchill, G.A. (1989) Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51, 79-94.

Churchill, G.A. (1992) Hidden Markov chains and the analysis of genome structure. Computer and Chemistry, 16, 107-115.

Clay, O., Carels, N., Douady, C., Macaya, G., Bernardi, G. (2001) Compositional heterogeneity within and among isochores in mammalian genomes. I. CsCl and sequence analyses. Gene, 276, 15-24.

Cohen, N., Dagan, T., Stone, L., Graur, D. (2005) GC composition of the human genome: in search for isochores. Molecular Biology and Evolution, 22, 1260-1272.

Cuny, G., Soriano, P., Macaya, G., Bernardi, G. (1981) The major components of the mouse and human genomes. 1. Preparation, basic properties and compositional heterogeneity. European Journal of Biochemistry / FEBS, 115, 227-33.

Davies, L., Hohenrieder, C., Kramer, W. (2012) Recursive computation of piecewise constant volatilities. Computational Statistics & Data Analysis, 11, 3623-3631.

Dumbgen, L., Spokoiny, V.G. (2001) Multiscale testing of qualitative hypotheses. Ann. Stat., 29, 124-152.

Dumbgen, L., Walther, G. (2008) Multiscale inference about a density. Ann. Stat., 36, 1758-1785.

Elhaik, E., Graur, D., Josic, K. (2010a) Comparative testing of DNA segmentation algorithms using benchmark simulations. Molecular Biology and Evolution, 27, 1015-24.

Elhaik, E., Graur, D., Josic, K. (2010b) Identifying compositionally homogeneous and nonhomogeneous domains within the human genome using a novel segmentation algorithm. Nucl. Acids Res., 38, e158.

Fickett, J.W., Torney, D.C., Wolf, D.R. (1992) Base compositional structure of genomes. Genomics, 13, 1056-1064.

Frick, K., Munk, A., Sieling, H. (2013) Multiscale Change-Point Inference. ArXiv e-prints, 1301.7212; discussion paper with rejoinder by the authors, Journal of the Royal Statistical Society Ser. B, to appear.

Friedrich, F., Kempe, A., Liebscher, V., Winkler, G. (2008) Complexity Penalized M-Estimation: Fast Computation. Journal of Computational and Graphical Statistics, 17, 201-24.

Freudenberg, J., Wang, M., Yang, Y., Li, W. (2009) Partial correlation analysis indicates causal relationships between GC-content, exon density and recombination rate in the human genome. BMC Bioinformatics, 10 (Suppl 1), S66.

Fullerton, S.M., Carvalho, A.B., Clark, A.G. (2001) Local Rates of Recombination Are Positively Correlated with GC Content in the Human Genome. Mol. Biol. Evol., 18, 1139-1142.

Galtier, N., Piganeau, G., Mouchiroud, D., Duret, L. (2001) GC-content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics, 159, 907-911.

Keith, J.M. (2006) Segmenting Eukaryotic Genomes with the Generalized Gibbs Sampler. Journal of Computational Biology, 13, 1369-1383.

Killick, R., Fearnhead, P., Eckley, I.A. (2012) Optimal detection of changepoints with a linear computational cost. Journal of the American Statistical Association, 107, 1590-1598.

Oliver, J.L., Roman-Roldan, R., Perez, J., Bernaola-Galvan, P. (1999) SEGMENT: identifying compositional domains in DNA sequences. Bioinformatics, 15, 974-979.

Risso, D., Schwartz, K., Sherlock, G., Dudoit, S. (2011) GC-Content Normalization for RNA-Seq Data. BMC Bioinformatics, 12, 480.

Sueoka, N. (1962) On the genetic basis of variation and heterogeneity of DNA base composition. PNAS, 48, 582-92.

Walther, G. (2010) Optimal and fast detection of spatial clusters with scan statistics. Ann. Statist., 38, 1010-1033.

Winkler, G., Liebscher, V. (2002) Smoothers for discontinuous signals. J. Nonparametr. Stat., 14, 203-222. Math. Nachr., 281, 582-595.

Yao, Yi-Ching (1988) Estimating the number of change-points via Schwarz' criterion. Statist. Probab. Lett., 6, 181-189.
