
2012 IEEE International Symposium on Information Theory Proceedings, ISIT 2012, Cambridge, MA, USA, July 1-6, 2012.

Bounds on estimated Markov orders of individual sequences*

Luciana Vitale, Instituto de Computación, Universidad de la República, Montevideo, Uruguay. Email: [email protected]
Álvaro Martín, Instituto de Computación, Universidad de la República, Montevideo, Uruguay. Email: [email protected]
Gadiel Seroussi, HP Laboratories, Palo Alto, CA, USA, and U. de la República, Montevideo, Uruguay. Email: [email protected]

Abstract—We study the maximal values estimated by commonly used Markov model order estimators on individual sequences. We start with penalized maximum likelihood (PML) estimators with cost functions of the form −log P̂_k(x^n) + f(n)α^k, where P̂_k(x^n) is the ML probability of the input sequence x^n under a Markov model of order k, α is the size of the input alphabet, and f(n) is an increasing (penalization) function of n (the popular BIC estimator corresponds to f(n) = ((α−1)/2) log n). Comparison with a memoryless model yields a known upper bound k(n) on the maximum order that x^n can estimate. We show that, under mild conditions on f that are satisfied by commonly used penalization functions, this simple bound is not far from tight, in the following sense: for sufficiently large n, and any k < k(n), there are sequences x^n that estimate order k; moreover, for all but a vanishing fraction of the values of n such that k = k(n), there are sequences x^n that estimate order k. We also study KT-based MDL Markov order estimators, and show that in this case, there are sequences x^n that estimate order n^{1/2−ε}, which is much larger than the maximum log n/log α (1 + o(1)) attainable by BIC, or the order o(log n) required for consistency of the KT estimator. In fact, for these sequences, limiting the allowed estimated order might incur a significant asymptotic penalty in description length. All the results are constructive, and in each case we exhibit explicit sequences that attain the claimed estimated orders.

I. INTRODUCTION

Initially, we consider penalized maximum likelihood (PML) Markov model order estimators, where, given a sequence x^n over a finite alphabet A, of size α = |A|, and a candidate Markov order k, we define a cost¹

    C_k(x^n) = −log P̂_k(x^n) + f(n)α^k .    (1)

Here, P̂_k(x^n) is the maximum likelihood (ML) probability of x^n under a kth order Markov model (with appropriate conventions on the initial states), and f(n) is a positive penalization function satisfying some mild conditions to be detailed later. The order estimated for x^n is

    k̂(x^n) = arg min_{k≥0} C_k(x^n) .    (2)
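As an illustration, the cost (1) and the estimator (2) admit a direct, unoptimized transcription. This is a sketch under assumptions made here for concreteness: a binary alphabet with an all-zeros initial context, the BIC penalty, and a heuristic cap on the candidate orders (the paper derives a precise bound k(n) below).

```python
import math
from collections import defaultdict

def ml_term(x, k):
    """Empirical code length -log P_k(x^n) in bits, as in eq. (5).

    States are read off assuming an all-zeros semi-infinite past
    (an assumption of this sketch)."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[state][symbol]
    past = (0,) * k
    for symbol in x:
        counts[past][symbol] += 1
        past = (past + (symbol,))[-k:] if k > 0 else past
    total = 0.0
    for emissions in counts.values():
        n_s = sum(emissions.values())
        for n_sa in emissions.values():
            total -= n_sa * math.log2(n_sa / n_s)
    return total

def pml_order(x, f, alpha=2, k_max=None):
    """PML order estimate, eqs. (1)-(2), minimized over 0 <= k <= k_max."""
    n = len(x)
    if k_max is None:
        k_max = int(math.log(n, alpha))   # heuristic cap on candidate orders
    costs = [ml_term(x, k) + f(n) * alpha ** k for k in range(k_max + 1)]
    return min(range(k_max + 1), key=costs.__getitem__)

bic = lambda n: 0.5 * math.log2(n)        # f(n) = ((alpha-1)/2) log n, alpha = 2
```

For example, an all-zeros sequence estimates order 0 (all ML terms vanish and the penalty grows with k), while the alternating sequence 0101... estimates order 1.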

Different variants of PML estimators have been extensively studied (see, e.g., [1] and citations therein). When f(n) = ½(α−1) log n, we obtain the popular BIC estimator, which is usually regarded as an asymptotic approximation of a Minimum Description Length (MDL) estimator of the Markov order. We will also be interested in the latter type of order estimators, and, in particular, in the variant based on the Krichevskii-Trofimov (KT) probability assignment [2]. The cost function in this case does not include an explicit penalization term; instead, the contribution of the model size to the cost is amortized across actual occurrences of model states in the sequence under evaluation.

*Work supported in part by grant I+D CSIC-UdelaR.
¹All logarithms are taken to base 2 unless specified otherwise.

The range of values of the estimate k̂ has played an important role in the theoretical analysis of the above mentioned estimators. The first consistency results for the BIC estimator [3], for example, assumed a known bound on the Markov order. This assumption was removed in [1], where it is also shown that, if no bound is assumed, pure MDL Markov order estimators, be it in the KT or the normalized maximum likelihood (NML) versions, are not consistent. The consistency of the latter two was shown in [4], when the range of the Markov order k for the minimization in the estimation is bounded by o(log n) and c log n, respectively, with c < 1/log α. Similarly, in the case of the estimation of context trees [5], the consistency of BIC and (KT-based) MDL estimators was proved in [6], under the assumption of an upper bound of o(log n) on the depth of the candidate context trees considered for the minimization.

Imposing a-priori bounds on the estimated order may be useful in some cases to guarantee consistency, but might not be desirable in other applications. For example, in universal lossless data compression, we are interested in choosing the estimated order that yields the shortest description length for the given input sequence, regardless of where the input originated from. Similarly, in the universal simulation results of [7], given a sequence x^n, a Markov order k̂ is estimated using a PML estimator, and a "simulated" sequence y^n is obtained by drawing uniformly from the set of sequences in the k̂th order Markov type class of x^n that also estimate order k̂. No assumptions are made on the range of k̂, and, in the individual sequence setting, the choice of penalization function f(n) governs a trade-off between the statistical similarity of y^n to x^n, and the "richness" (entropy) of the space from which y^n is drawn. On the other hand, aside from the theoretical interest, obtaining inherent bounds on the possible outcome of the


978-1-4673-2579-0/12/$31.00 ©2012 IEEE


estimation procedure has practical computational implications, as a bound on the estimated order translates to a bound on the memory requirements of an algorithmic implementation of the estimator. Thus, in this paper, we study the maximum possible value of k̂(x^n), for any sequence x^n, when no a-priori bounds are imposed on the candidate orders.

For k = k̂(x^n), writing C_0(x^n) ≥ C_k(x^n), trivially bounding the ML probability, and rearranging terms, yields

    (n log α)/f(n) ≥ α^k − 1 ,    (3)

from which a uniform upper bound (log n − log f(n))/log α + O(1) on k̂(x^n), which we denote by k(n), was obtained in [7].
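For concreteness, the bound k(n) can be evaluated numerically as the largest k satisfying (3); the sketch below is a direct transcription of that inequality, with the BIC penalty as an example.

```python
import math

def k_bar(n, f, alpha=2):
    """Largest k with alpha^k - 1 <= n*log(alpha)/f(n), i.e., the bound from (3)."""
    budget = n * math.log2(alpha) / f(n)
    k = 0
    while alpha ** (k + 1) - 1 <= budget:
        k += 1
    return k

bic = lambda n: 0.5 * math.log2(n)   # BIC penalty for alpha = 2
```

For α = 2 and BIC, k_bar(64, bic) evaluates to 4 and k_bar(1024, bic) to 7, consistent with the log n/log α − log f(n)/log α + O(1) growth.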

We will show that, under mild conditions on f(n), which are satisfied by commonly used penalization functions, the bound k(n) is not far from tight, in the following sense: for any sufficiently large n, and any k < k(n), there are sequences of length n that estimate order k; moreover, for all sufficiently large k and all but a vanishing fraction of the values of n such that k = k(n), there are sequences of length n that estimate order k. After some preliminaries in Section II, these results are presented in Section III, by showing explicit constructions of sequences that attain the claimed estimated orders. The constructions rely on properties of de Bruijn sequences [8]. We initially present results for arbitrary values of α, and then show that these results can be tightened in the case α = 2 by exploiting properties of a special kind of binary de Bruijn sequence, the so-called Ford sequence [9]. In Section IV, we extend our study to the MDL estimator based on the KT probability assignment. We show that in this case, there exist sequences x^n that estimate order n^{1/2−ε} for any ε ∈ (0, 1/2). This order is much larger than the maximum possible order, k(n) = log n/log α + o(log n), attainable by a BIC estimator, and also than the order o(log n) required for consistency of the KT-based estimator [4]. In fact, we show that, in a universal lossless compression setting, for the constructed sequences, imposing an artificial upper bound on the allowed estimated order could incur a significant asymptotic penalty in overall description length. Similar results (with the same individual sequences) are obtained when a context tree [5], [6], rather than a plain Markov order, is estimated, using either the KT or the (tree) BIC estimator.

II. PRELIMINARIES AND PROPERTIES OF THE UPPER BOUND

We denote by u_i^j the string u_i u_{i+1} ... u_j over A, with u_i^j = λ, the empty string, when i > j. We omit the subscript when i = 1. We let |u| denote the length of a string u, and uv the concatenation of strings u and v. The terms string and sequence are used interchangeably.

We model a sequence x^n as the realization of a generic kth order Markov process, where k is unknown. We regard a string s ∈ A^k as a state of the Markov process and we say that a sequence y selects state s whenever s is a suffix of y. When k is not clear from the context, we explicitly refer to s as a k-state. For the purpose of selecting states, we assume that x^n is preceded by an arbitrary fixed semi-infinite string x_{−∞}^0. This convention uniquely determines a state selected by x^i, for each i, 0 ≤ i ≤ n, and for any order k. If x^i selects state s, 0 ≤ i < n, we say that x_{i+1} is emitted in state s and that s occurs (in position i) in x^n. We denote by n_s(x^n) the number of occurrences of s in x^n, and, for a ∈ A, we denote by n_s^{(a)}(x^n) the number of times a symbol x_i = a is emitted in state s. We omit the dependence on x^n of n_s^{(a)}, n_s, and other notations, when clear from the context.

The kth order ML probability of a sequence x^n is determined by the fixed initial state and the empirical probabilities P̂_k(·|s) conditioned on k-states s,

    P̂_k(a|s) = n_s^{(a)}/n_s ,  s ∈ A^k, a ∈ A ,    (4)

so that

    −log P̂_k(x^n) = −∑_{s∈A^k, a∈A} n_s^{(a)} log (n_s^{(a)}/n_s) .    (5)

The class of PML estimators of interest is defined by (1)–(2), where we assume that f(n) is positive and nondecreasing, with f(n) → ∞ and f(n)/n → 0 as n → ∞.² We refer to the first and second terms on the right-hand side of (1), respectively, as the ML term (specified in (5)) and the penalty term of order k.

The upper bound k(n) on k̂ is defined as the largest value of k satisfying (3) for a given n. Reciprocally, given k, the smallest integer n satisfying (3), denoted n(k), is a lower bound on the length of sequences that can estimate order k. In particular, from the definition of n(k), we have

    ((α^k − 1)/log α) f(n(k)) ≤ n(k) < ((α^k − 1)/log α) f(n(k)−1) + 1 .    (6)

The following lemma follows readily from the foregoing definitions, and from (6).

Lemma 1. Given a value of n, the inequality (3) holds for all k, 0 ≤ k ≤ k(n). We have k(n) → ∞ as n → ∞, n(k) → ∞ as k → ∞, n(k) is nondecreasing, and, moreover, for sufficiently large k,

    (α^k/log α) f(n(k+1)) − 1 < n(k+1) − n(k) < (α^{k+1}/log α) f(n(k+1)) .    (7)

III. SEQUENCES THAT MAXIMIZE PML-ESTIMATED ORDER

In this section we exhibit sequences of length n that get very close to, or even precisely attain, the bound k(n) of the previous section. The constructions will be based on de Bruijn sequences, whose properties we review next.

A kth order de Bruijn sequence [8] is a sequence b^{α^k}, of length α^k, k ≥ 0, such that the sliding window b_{i+1}b_{i+2}...b_{i+k}, with indices taken modulo α^k, exhausts all distinct k-tuples over A. De Bruijn sequences exist for every order k, and every cyclic permutation of a de Bruijn sequence is itself a de Bruijn sequence of the same order. We denote by B_k the (nonempty) set of de Bruijn sequences of order k that have x_{−k+1}^0 as a suffix (i.e., they match, cyclically, the

²In the case of penalization functions such as f(n) = c log n or f(n) = c log log n, we assume n is large enough so that f(n) is positive.


assumed fixed initial condition). For a sequence u, we denote by (u)^* the concatenation of an infinite number of copies of u, and, when |u| ≥ n, we denote by [u]^n the truncation of u to length n. Let B_k^n denote the set of sequences

    B_k^n = { [(b^{α^k})^*]^n | b^{α^k} ∈ B_k } .    (8)
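A de Bruijn sequence, and from it the periodic truncation of (8), can be generated with the standard Fredricksen-Maiorana (FKM) construction, which concatenates, in lexicographic order, the Lyndon words whose lengths divide k. This is a standard algorithm used here for illustration; it produces the lexicographically least de Bruijn sequence of each order.

```python
def de_bruijn(alpha, k):
    """Lexicographically least de Bruijn sequence of order k over {0,...,alpha-1},
    via the Fredricksen-Maiorana (FKM) construction."""
    seq = []
    a = [0] * (k + 1)

    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])     # emit the next Lyndon word
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, alpha):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return seq

def periodic_truncation(b, n):
    """[(b)^*]^n: the first n symbols of b repeated indefinitely, as in (8)."""
    return [b[i % len(b)] for i in range(n)]
```

Since the FKM output starts with 0^k, rotating it as b[k:] + b[:k] yields a de Bruijn sequence ending in 0^k, i.e., a member of B_k (and, after periodic truncation, of B_k^n) under an assumed all-zeros initial condition.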

The following lemma follows immediately from the definition (8) and the properties of de Bruijn sequences.

Lemma 2. Let x^n ∈ B_k^n. If k′ ≥ k, then a k′-state, when it occurs in x^n, always emits the same symbol. In particular, when n = mα^k for some integer m ≥ 0, then
(i) each k-state s occurs m times in x^n, and we have m = n_s = n_s^{(a)} for some a ∈ A (which depends on s);
(ii) if j < k, each possible j-state s occurs mα^{k−j} times in x^n and each symbol of A is emitted mα^{k−j−1} times in s, i.e., n_s^{(a)} = mα^{k−j−1} and n_s = mα^{k−j} for all a ∈ A and all s ∈ A^j.

Theorem 1. For sufficiently large n, if k < k(n) and x^n ∈ B_k^n, then k̂(x^n) = k.

Proof: By Lemma 2, if k′ ≥ k, then the ML term of order k′ of x^n is zero. Thus, since the penalty term grows with the order, we must have k̂(x^n) ≤ k. Let m = ⌊n/α^k⌋. If j < k, then by Lemma 2 (ii), we have, for all a ∈ A and all s ∈ A^j, n_s^{(a)}(x^n) ≤ (m+1)α^{k−j−1} and n_s(x^n) ≥ mα^{k−j}, where at least one inequality is strict, which implies that n_s^{(a)}/n_s < (m+1)/(mα). Therefore, by (5), we have

    −log P̂_j(x^n) > −∑_{s∈A^j, a∈A} n_s^{(a)} log ((m+1)/(mα)) = n log (mα/(m+1)) .    (9)

Using (9) and recalling that −log P̂_k(x^n) = 0, we obtain, for 0 ≤ j < k,

    C_j(x^n) − C_k(x^n) > n log (mα/(m+1)) + f(n)α^j − f(n)α^k
                        > n log (mα/(m+1)) + f(n)(1 − α^{k+1})/α
                        ≥ n log (mα/(m+1)) − (n log α)/α
                        = n log ( α^{1−1/α} m/(m+1) ) ≥ n log ( √2 m/(m+1) ) ,    (10)

where the third inequality follows from the first claim of Lemma 1 and the fact that k+1 ≤ k(n), and the last inequality holds since α ≥ 2. It follows from (10) that C_j(x^n) > C_k(x^n) when m ≥ 3. The latter condition, in turn, holds for all sufficiently large n, since, by (3), and with k > 0, we have n/α^k ≥ (1 − α^{−k})f(n)/log α ≥ f(n)/(2 log α), which is unbounded by our assumptions on f.
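Theorem 1 can be checked end to end on small instances. The following self-contained sketch re-implements the pieces under the same illustrative assumptions as before (binary alphabet, all-zeros initial context, BIC penalty, FKM de Bruijn generator); it is not the paper's exact setup.

```python
import math
from collections import defaultdict

def de_bruijn_bin(k):
    """Lex-least binary de Bruijn sequence of order k (FKM construction)."""
    seq, a = [], [0] * (k + 1)
    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            if a[t - p] == 0:
                a[t] = 1
                db(t + 1, t)
    db(1, 1)
    return seq

def bic_order(x, k_max):
    """BIC order estimate for a binary x^n; all-zeros past assumed."""
    n = len(x)
    f = 0.5 * math.log2(n)
    costs = []
    for k in range(k_max + 1):
        counts = defaultdict(lambda: [0, 0])
        past = (0,) * k
        for sym in x:
            counts[past][sym] += 1
            past = (past + (sym,))[1:]
        ml = sum(-c * math.log2(c / (n0 + n1))
                 for n0, n1 in counts.values() for c in (n0, n1) if c)
        costs.append(ml + f * 2 ** k)
    return min(range(k_max + 1), key=costs.__getitem__)

k, m = 3, 8                       # n = m * 2^k = 64; here k = 3 < k(64) = 4, m >= 3
b = de_bruijn_bin(k)
b = b[k:] + b[:k]                 # rotate so b ends in 0^k (all-zeros past)
x = [b[i % len(b)] for i in range(m * 2 ** k)]
```

With these parameters, bic_order(x, 6) returns 3, as Theorem 1 predicts: the ML term vanishes for all orders j ≥ 3, while lower orders pay a large ML term.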

Theorem 1 shows that for sufficiently large n, we can construct sequences that estimate any order k up to k(n) − 1. We next show that, with additional mild assumptions on f(n), for most values of n we can construct sequences that estimate precisely order k(n). We say that the function f is nice if it is defined over the positive reals, f is concave and differentiable over (z₀, ∞) for some z₀ ∈ ℝ, and

    z f′(z) < f(z) − α/2  for all z ∈ (z₀, ∞) .    (11)

It is readily verified that commonly used penalization functions are nice. In particular, this includes functions of the form f(n) = c log n, f(n) = c log log n, and f(n) = cn^β for positive constants c and β < 1. The following lemma is an immediate consequence of (11).

Lemma 3. If f is nice, then n/f(n) is strictly increasing with n in (z₀, ∞).

In the sequel, for a real number z and a positive integer N, we write ⌈z⌉_N as shorthand for N⌈z/N⌉, i.e., the smallest multiple of N that is not smaller than z.
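For instance, for f(z) = c log z the condition (11) can be checked directly:

```latex
z f'(z) \;=\; \frac{cz}{z\ln 2} \;=\; c\log e \quad\text{(a constant)},
\qquad
f(z) - \frac{\alpha}{2} \;=\; c\log z - \frac{\alpha}{2} \;\xrightarrow[z\to\infty]{}\; \infty ,
```

so (11) holds for all sufficiently large z. Similarly, for f(z) = cz^β one gets zf′(z) = βf(z), and f(z) − α/2 − βf(z) = (1−β)f(z) − α/2 → ∞ since β < 1.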

Theorem 2. Assume f is nice. Then, for sufficiently large k, if n > ⌈n(k)⌉_{α^k} and x^n ∈ B_k^n, then k̂(x^n) = k.

Remark: To interpret Theorem 2, we observe that for a given value of k, by Lemma 3, the set of integers n such that k(n) = k is given by the range n(k) ≤ n < n(k+1). The fraction of values of n in this range for which the theorem does not provide a sequence of length n that estimates order k(n) is upper-bounded by

    α^k/(n(k+1) − n(k)) < log α/( f(n(k+1)) − log α/α^k ) → 0 as k → ∞ ,

where the inequality follows from the leftmost inequality in (7), and the limit follows from the unboundedness of n(k) and f(n). Thus, Theorem 2 guarantees that for all but a vanishing fraction of values of n such that k(n) = k, there are sequences of length n that estimate order k. To prove Theorem 2, we rely on a series of lemmas.

Lemma 4 below follows immediately from [4, Lemma 4]. For a probability vector P = (p₁, ..., p_α), we denote by H(P) = −∑_{i=1}^α p_i log p_i the entropy of P.

Lemma 4. If P = (p₁, ..., p_α) is a probability vector satisfying 1/(2α) ≤ p_i ≤ 2/α for all i, 1 ≤ i ≤ α, then

    H(P) ≥ log α − α ∑_{i=1}^α (p_i − 1/α)² .

Lemma 5. Let x^n ∈ B_k^n, where n and k satisfy (3), let m = ⌊n/α^k⌋, r = n − α^k m, and assume m ≥ 1. Then,

    −log P̂_j(x^n) ≥ n log α − rα/(4m) ,  0 ≤ j < k .    (12)

Proof: Since −log P̂_j(x^n) is non-increasing with j, it suffices to consider j = k−1. For a (k−1)-state s, let R_s be the set of symbols of A that are emitted in state s in x_{mα^k+1}^n, the truncated (possibly empty) copy of a sequence from B_k at the end of x^n. Let r_s = |R_s|, and define T = { s ∈ A^{k−1} | r_s > 0 }. Clearly, ∑_s r_s = r, and |T| ≤ r. By Lemma 2 (ii), we have n_s^{(a)} = m + 1 if a ∈ R_s and n_s^{(a)} = m otherwise, so that n_s = mα + r_s. Thus, with m ≥ 1, we have, for all a ∈ A,

    1/(2α) ≤ m/(mα + r_s) ≤ n_s^{(a)}/n_s ≤ (m+1)/(mα + r_s) ≤ 2/α ,

and Lemma 4 applied to P_s = P̂_{k−1}(·|s) yields, together with some algebraic manipulations,

    H(P_s) ≥ log α − α ( ∑_{a∈R_s} ((m+1)/(mα+r_s) − 1/α)² + ∑_{a∈A∖R_s} (m/(mα+r_s) − 1/α)² )
           = log α − (α − r_s) r_s/(mα + r_s)² .    (13)

Now, writing the ML term of order k−1 in terms of state-conditioned empirical entropies, and applying (13), we obtain

    −log P̂_{k−1}(x^n) = ∑_{s∈A^{k−1}} n_s H(P_s)
                      ≥ ∑_{s∈A^{k−1}} n_s log α − ∑_{s∈T} n_s (α − r_s) r_s/(mα + r_s)²
                      = n log α − ∑_{s∈T} (α − r_s) r_s/(mα + r_s) ,    (14)

where we recall that r_s = 0 for s ∉ T, and, for the last equality, that n_s = mα + r_s. We claim that g(r_s) ≜ (α − r_s) r_s/(mα + r_s) is upper-bounded by α/(4m) for all s, which, by (14), would suffice to prove (12). Indeed, elementary analysis of the function g(ρ) for ρ ≥ 0 reveals that it has a global maximum at ρ* = α(√(m(m+1)) − m), with

    g(ρ*) = α(√(m+1) − √m)² ≤ α/(4m) ,

where the inequality is readily verified for m ≥ 1.

Proof of Theorem 2: Let k be large enough so that n(k) ≥ z₀. By Lemma 3, n and k satisfy (3) for all n ≥ n(k). Now, for x^n ∈ B_k^n, with n > ⌈n(k)⌉_{α^k}, Lemma 5 and (1) yield

    C_j(x^n) ≥ n log α − rα/(4m) + f(n) ,  0 ≤ j < k ,

with m and r as defined in the lemma. Thus, for 0 ≤ j < k, recalling that −log P̂_k(x^n) = 0, we have

    C_j(x^n) − C_k(x^n) ≥ n log α − rα/(4m) − f(n)(α^k − 1) .    (15)

Write μ = mα^k = n − r. Since μ ≥ ⌈n(k)⌉_{α^k}, μ and k satisfy (3) with μ in the role of n, i.e., we have α^k − 1 ≤ (μ/f(μ)) log α. Thus, from (15), we have, for 0 ≤ j < k,

    C_j(x^n) − C_k(x^n) ≥ (μ + r) log α − rα/(4m) − (f(n)/f(μ)) μ log α
                        = −((f(n) − f(μ))/f(μ)) μ log α − rα/(4m) + r log α
                        ≥ −(r f′(μ)/f(μ)) μ log α − rα/(4m) + r log α ,    (16)

where the last inequality follows from the fact that f is concave in (z₀, ∞) and μ ≥ ⌈n(k)⌉_{α^k} ≥ z₀. Now, since n and k satisfy (3), for k > 0 we have m ≥ f(n)/(2 log α), so, recalling also the monotonicity of f, it follows from (16) that

    C_j(x^n) − C_k(x^n) ≥ −(r f′(μ)/f(μ)) μ log α − rα log α/(2f(μ)) + r log α
                        = ( (−μ f′(μ) − α/2)/f(μ) + 1 ) r log α .    (17)

Since μ ≥ z₀, by (11), the right-hand side of (17) is positive if r > 0. If r = 0, since n > ⌈n(k)⌉_{α^k} ≥ n(k), by (3) and Lemma 3, the right-hand side of (15) is positive. Hence, C_j(x^n) > C_k(x^n) for j < k, and, thus, k̂(x^n) ≥ k. Furthermore, since −log P̂_k(x^n) = 0 and the penalty term increases with k, we must have, in fact, k̂(x^n) = k.

Let ñ(k) denote the least integer n in the interval n(k) ≤ n < n(k+1) such that for all n ≥ ñ(k) in that interval, there are sequences of length n that estimate order k. By Theorem 2, we have ñ(k) − n(k) ≤ α^k. We next show that, in the special case α = 2, we can exploit known properties of special binary de Bruijn sequences to reduce this gap to ñ(k) − n(k) = o(2^k). Specifically, the Ford sequence of order k ≥ 0, which will be denoted F_k = a_1^{2^k}, is constructed as follows: start with a^{k−1} = 0^{k−1}, and extend the sequence using the least-first greedy algorithm [9], where, for k−1 < i ≤ 2^k, given s = a_{i−k+1}^{i−1}, we set a_i = 0 if the k-tuple s0 has not yet occurred, and a_i = 1 otherwise (i.e., of the sibling k-tuples s0 and s1 we always choose s0 first). It is readily verified that a de Bruijn sequence of order k is indeed constructed this way, and that the sequence is lexicographically first among all binary de Bruijn sequences of order k. We denote by F_k^n the sequence [(F_k)^*]^n.

The following lemma is an immediate consequence of [10, Theorem 1] (n₀ and n₁ are interpreted as special cases of n_s).

Lemma 6. Let x^n = F_k^n. We have n₀(x^j) − n₁(x^j) = O(2^k log k/k) for all j, 1 ≤ j ≤ n.
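The Ford sequence is the lexicographically least binary de Bruijn sequence, so it can also be generated by the standard Fredricksen-Maiorana (FKM) construction; the sketch below uses that route and then examines the prefix discrepancy appearing in Lemma 6. The discrepancy values it reports are illustrative, not a proof of the O(2^k log k/k) bound.

```python
def ford(k):
    """Ford sequence F_k: the lex-least binary de Bruijn sequence of order k,
    generated here via the Fredricksen-Maiorana (FKM) construction
    (concatenation, in lex order, of Lyndon words with length dividing k)."""
    seq, a = [], [0] * (k + 1)
    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            if a[t - p] == 0:
                a[t] = 1
                db(t + 1, t)
    db(1, 1)
    return seq

def max_discrepancy(x):
    """max over prefixes x^j of |n_0(x^j) - n_1(x^j)|, cf. Lemma 6."""
    d, best = 0, 0
    for sym in x:
        d += 1 if sym == 0 else -1
        best = max(best, abs(d))
    return best
```

For k = 4 this yields F_4 = 0000100110101111, whose maximum prefix discrepancy is 5; by Lemma 6 the discrepancy grows like O(2^k log k/k), much more slowly than the trivial O(2^k).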

Theorem 3. Assume f is nice and x^n = F_k^n. For sufficiently large k and a well-characterized function g(k) = O(1), if

    n ≥ n(k) + g(k) 2^k log k/k ,    (18)

then k̂(x^n) = k.

Remark: It will turn out in the proof of Theorem 3 (given in the full paper) that for some penalization functions of interest we have, in fact, g(k) = o(1). In particular, if f(z) = c log z with c > 0, the second term on the right-hand side of (18) is O(2^k log k/k²), whereas for f(z) = cz^β, with 0 < β < 1, it is of the form O(2^{(1−β)k} log k/k).

IV. SEQUENCES WITH LARGE MDL-ESTIMATED ORDER

In this section, we consider the MDL Markov order estimator based on the KT probability assignment. We construct sequences x^n that estimate orders that are much larger than those attainable by a BIC estimator, or than the bound on the order required for consistency of the KT estimator. For notational simplicity, we focus on the case α = 2 (A = {0, 1}). The KT probability [2] of order 0 of a binary sequence x^n is defined as KT₀(λ) = 1 (for n = 0), and

    KT₀(x^n) = Γ(n₀(x^n) + ½) Γ(n₁(x^n) + ½) / (π Γ(n + 1)) ,  n > 0 ,    (19)

where Γ is the Gamma function. The KT probability of order k ≥ 0, in turn, is defined as

    KT_k(x^n) = ∏_{s∈A^k} KT₀(x^n[s]) ,    (20)

where x^n[s] denotes the subsequence of symbols from x^n that occur in state s. Using this distribution, one can construct a lossless description of x^n of length

    C_{KT,k}(x^n) = −log KT_k(x^n) + c(k) ,    (21)

where c(k) = O(log k) is the (non-decreasing) length of an efficient encoding of k, and we ignore integer constraints on code lengths. The estimated Markov order for a sequence x^n is the value of k that yields the shortest description, namely,

    k̂_{KT}(x^n) = arg min_{k≥0} C_{KT,k}(x^n) .    (22)

We will make use of the following relation, which follows from Stirling's approximation (see, e.g., [11]). For every sequence x^n, we have

    | −log KT₀(x^n) + log P̂₀(x^n) − ½ log n − ν | ≤ γ n^{−1} ,    (23)

where ν = log √(2π), and γ is a positive constant.

Define the sequence U_k^n = [(10^k)^*]^n, where 10^k is a string consisting of a 1 followed by k 0's. For simplicity, we assume that the sequence x_{−∞}^0 used to determine initial states is all 0's, and that n is a multiple of k + 1. These constraints can be easily removed.

Theorem 4. Let n = (k+1)m for some m ≥ 1, and x^n = U_k^n. If k and n satisfy

    4n/(k+1)² ≥ ½ log (n/(k+1)) + ν + (2k+1)(k+1)γ/n + c(k) ,    (24)

then k̂_{KT}(x^n) = k.
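As a numeric sanity check on Theorem 4, the KT-based MDL estimator of (19)-(22) can be sketched as follows. The all-zeros past and the particular Elias-gamma-style code-length c(k) are assumptions of this sketch; the paper only requires c(k) = O(log k) and nondecreasing.

```python
import math
from collections import defaultdict

def neg_log_kt0(n0, n1):
    """-log2 KT0 for a binary subsequence with n0 zeros and n1 ones, eq. (19)."""
    if n0 + n1 == 0:
        return 0.0                         # KT0(empty) = 1
    return -(math.lgamma(n0 + 0.5) + math.lgamma(n1 + 0.5)
             - math.log(math.pi) - math.lgamma(n0 + n1 + 1)) / math.log(2)

def kt_cost(x, k, c):
    """C_KT,k(x^n) = -log KT_k(x^n) + c(k), eqs. (20)-(21); all-zeros past assumed."""
    counts = defaultdict(lambda: [0, 0])   # per-state emission counts [n0, n1]
    past = (0,) * k
    for symbol in x:
        counts[past][symbol] += 1
        past = (past + (symbol,))[1:]
    return sum(neg_log_kt0(n0, n1) for n0, n1 in counts.values()) + c(k)

def kt_order(x, k_max, c=lambda k: 2 * int(math.log2(k + 1)) + 1):
    """MDL order estimate (22), searched over 0 <= k <= k_max."""
    return min(range(k_max + 1), key=lambda k: kt_cost(x, k, c))
```

For instance, with k = 3 and n = 400, the sequence U_k^n = (1000)(1000)... satisfies (24) comfortably, and kt_order(([1] + [0]*3) * 100, 12) indeed returns 3: at order 3 every occurring state is deterministic, while lower orders pay a large empirical-entropy term.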

Remark: It is readily verified that, for sufficiently large n, the condition (24) is satisfied by values of k as large as k = n^{1/2−ε} for any ε ∈ (0, 1/2). Notice that, since the KT cost penalizes only states that do occur in x^n, the estimated order for the sequence x^n = U_k^n would be the same if the model under estimation was a context tree [5]. In fact, it will be shown in the full paper that a similar result (with the same sequence x^n) holds also for context trees under the BIC estimator.

Proof of Theorem 4 (outline): For 0 ≤ j ≤ k, x^n contains occurrences of exactly j + 1 states, namely 0^j and 0^ℓ10^{j−1−ℓ}, 0 ≤ ℓ < j. When j = k, each such state occurs exactly m times, always followed by the same symbol. Thus, the conditional distribution for each occurring k-state is deterministic, and, from (20) and (23), we obtain

    −log KT_k(x^n) ≤ ((k+1)/2) log m + (k+1)(ν + γ/m) .    (25)

When j < k, states of the form 0^ℓ10^{j−1−ℓ} occur m times each, always followed by a 0. The state 0^j occurs (k−j+1)m times, m of them followed by a 1, and the rest followed by a 0. From (20) and (23), writing δ = k−j+1, and denoting the binary entropy function by h(·), we obtain

    −log KT_j(x^n) ≥ δm h(δ^{−1}) + (j/2) log m + ½ log(δm) + (j+1)ν − jγ/m − γ/(δm)
                   ≥ 4m(δ−1)δ^{−1} + ((j+1)/2) log m + (j+1)ν − kγ/m ,    (26)

where the second inequality follows by applying the uniform bound h(p) ≥ 4p(1−p), recalling that 0 ≤ j < k, and dropping some nonnegative terms. Now, by (21), (25), and (26), recalling that m = n/(k+1) and c(j) > 0, we obtain

    C_{KT,j}(x^n) − C_{KT,k}(x^n) > 4n(δ−1)/(k+1)² − ((δ−1)/2) log (n/(k+1)) − (δ−1)ν − (2k+1)(k+1)γ/n − c(k) .

Recalling that 1 < δ ≤ k+1, and factoring out δ−1, we verify that, for 0 ≤ j < k, we have C_{KT,j}(x^n) > C_{KT,k}(x^n) whenever (24) holds. It is readily shown in the full paper that C_{KT,j}(x^n) > C_{KT,k}(x^n) also when j > k. Therefore, under the conditions of the theorem, we have k̂_{KT}(x^n) = k.

Remark: Consider the case where an upper bound K(n) = o(n^{1/2−ε}) is imposed on the allowed estimated order of the KT-based MDL estimator (or the tree BIC estimator). Choose k = (1+ξ)K(n) for some ξ > 0, and x^n = U_k^n. By (25), recalling that m = n/(k+1), we obtain C_{KT,k}(x^n) = O(K(n) log n) for this choice of k. On the other hand, if j ≤ K(n), then, by (26), we have C_{KT,j}(x^n) = Ω(n/K(n)), and, hence, C_{KT,k}(x^n)/C_{KT,j}(x^n) → 0 as n → ∞. We conclude that limiting the allowed estimated order as assumed incurs a significant asymptotic penalty in the description length of the individual sequence x^n. The gap is more pronounced the smaller the upper bound K(n) is. In particular, with K(n) = O(log n) (as required for consistency of the MDL or tree estimator [4]), we obtain a code length Ω(n/log n) with the restricted estimator, as compared to O(log² n) with an unrestricted one.

REFERENCES

[1] I. Csiszár and P. C. Shields, "The consistency of the BIC Markov order estimator," Annals of Statistics, vol. 28, no. 6, pp. 1601–1619, 2000.
[2] R. E. Krichevskii and V. K. Trofimov, "The performance of universal encoding," IEEE Trans. Inform. Theory, vol. 27, pp. 199–207, Mar. 1981.
[3] L. Finesso, "Estimation of the order of a finite Markov chain," in Recent Advances in Mathematical Theory of Systems, Control, Networks and Signal Processing, H. Kimura and S. Kodama, Eds. Mita Press, 1992, pp. 643–645.
[4] I. Csiszár, "Large-scale typicality of Markov sample paths and consistency of MDL order estimators," IEEE Trans. Inform. Theory, vol. 48, no. 6, 2002.
[5] J. Rissanen, "A universal data compression system," IEEE Trans. Inform. Theory, vol. 29, pp. 656–664, Sep. 1983.
[6] I. Csiszár and Z. Talata, "Context tree estimation for not necessarily finite memory processes, via BIC and MDL," IEEE Trans. Inform. Theory, vol. 52, no. 3, pp. 1007–1016, Mar. 2006.
[7] Á. Martín, N. Merhav, G. Seroussi, and M. J. Weinberger, "Twice-universal simulation of Markov sources and individual sequences," IEEE Trans. Inform. Theory, vol. 56, no. 9, pp. 4245–4255, Sep. 2010.
[8] N. G. de Bruijn, "A combinatorial problem," Koninklijke Nederlandse Akademie van Wetenschappen, Proceedings, vol. 49, part 2, pp. 758–764, 1946.
[9] H. Fredricksen, "A survey of full length nonlinear shift register cycle algorithms," SIAM Review, vol. 24, no. 2, pp. 195–221, 1982.
[10] J. Cooper and C. Heitsch, "The discrepancy of the lex-least de Bruijn sequence," Discrete Mathematics, vol. 310, no. 6–7, pp. 1152–1159, 2010.
[11] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge, UK: Cambridge Univ. Press, 2006.
