2012 IEEE International Symposium on Information Theory (ISIT), Cambridge, MA, USA
Bounds on estimated Markov orders of individual sequences∗

Luciana Vitale
Instituto de Computación, Universidad de la República, Montevideo, Uruguay
Email: [email protected]

Álvaro Martín
Instituto de Computación, Universidad de la República, Montevideo, Uruguay
Email: [email protected]

Gadiel Seroussi
HP Laboratories, Palo Alto, CA, USA, and Universidad de la República, Montevideo, Uruguay
Email: [email protected]
Abstract—We study the maximal values estimated by commonly used Markov model order estimators on individual sequences. We start with penalized maximum likelihood (PML) estimators with cost functions of the form −log P_k(x^n) + f(n)α^k, where P_k(x^n) is the ML probability of the input sequence x^n under a Markov model of order k, α is the size of the input alphabet, and f(n) is an increasing (penalization) function of n (the popular BIC estimator corresponds to f(n) = ((α−1)/2) log n). Comparison with a memoryless model yields a known upper bound k(n) on the maximum order that x^n can estimate. We show that, under mild conditions on f that are satisfied by commonly used penalization functions, this simple bound is not far from tight, in the following sense: for sufficiently large n, and any k < k(n), there are sequences x^n that estimate order k; moreover, for all but a vanishing fraction of the values of n such that k = k(n), there are sequences x^n that estimate order k. We also study KT-based MDL Markov order estimators, and show that in this case, there are sequences x^n that estimate order n^{1/2−ε}, which is much larger than the maximum (log n/log α)(1 + o(1)) attainable by BIC, or the order o(log n) required for consistency of the KT estimator. In fact, for these sequences, limiting the allowed estimated order might incur a significant asymptotic penalty in description length. All the results are constructive, and in each case we exhibit explicit sequences that attain the claimed estimated orders.
I. INTRODUCTION
Initially, we consider penalized maximum likelihood (PML) Markov model order estimators, where, given a sequence x^n over a finite alphabet A, of size α = |A|, and a candidate Markov order k, we define a cost¹

C_k(x^n) = −log P_k(x^n) + f(n)α^k . (1)

Here, P_k(x^n) is the maximum likelihood (ML) probability of x^n under a kth order Markov model (with appropriate conventions on the initial states), and f(n) is a positive penalization function satisfying some mild conditions to be detailed later. The order estimated for x^n is

k(x^n) = argmin_{k≥0} C_k(x^n) . (2)
Different variants of PML estimators have been extensively studied (see, e.g., [1] and citations therein). When f(n) = ((α−1)/2) log n, we obtain the popular BIC estimator, which

∗Work supported in part by grant I+D CSIC-UdelaR.
¹All logarithms are taken to base 2 unless specified otherwise.
is usually regarded as an asymptotic approximation of a
Minimum Description Length (MDL) estimator of the Markov
order. We will also be interested in the latter type of order
estimators, and, in particular, in the variant based on the
Krichevskii-Trofimov (KT) probability assignment [2]. The
cost function in this case does not include an explicit penal-
ization term; instead, the contribution of the model size to the
cost is amortized across actual occurrences of model states in
the sequence under evaluation.
The range of values of the estimate k has played an
important role in the theoretical analysis of the above men-
tioned estimators. The first consistency results for the BIC
estimator [3], for example, assumed a known bound on the
Markov order. This assumption was removed in [1], where
it is also shown that, if no bound is assumed, pure MDL
Markov order estimators, whether in the KT or the normalized maximum likelihood (NML) version, are not consistent. The
consistency of the latter two was shown in [4], when the
range of the Markov order k for the minimization in the
estimation is bounded by o(log n) and c log n, respectively,
with c < 1/ logα. Similarly, in the case of the estimation of
context trees [5], the consistency of BIC and (KT-based) MDL
estimators was proved in [6], under the assumption of an upper
bound of o(log n) on the depth of the candidate context trees
considered for the minimization.
Imposing a-priori bounds on the estimated order may be
useful in some cases to guarantee consistency, but might not
be desirable in other applications. For example, in universal
lossless data compression, we are interested in choosing the
estimated order that yields the shortest description length
for the given input sequence, regardless of where the input
originated from. Similarly, in the universal simulation results
of [7], given a sequence xn, a Markov order k is estimated
using a PML estimator, and a “simulated” sequence yn is
obtained by drawing uniformly from the set of sequences in
the kth order Markov type class of xn that also estimate order
k. No assumptions are made on the range of k, and, in the
individual sequence setting, the choice of penalization function
f(n) governs a trade-off between the statistical similarity of yn
to xn, and the “richness” (entropy) of the space from which yn
is drawn. On the other hand, aside from the theoretical interest,
obtaining inherent bounds on the possible outcome of the
estimation procedure has practical computational implications,
as a bound on the estimated order translates to a bound on the
memory requirements of an algorithmic implementation of the
estimator. Thus, in this paper, we study the maximum possible
value of k(xn), for any sequence xn, when no a-priori bounds
are imposed on the candidate orders.
For k = k(x^n), writing C_0(x^n) ≥ C_k(x^n), trivially bounding the ML probability, and rearranging terms yields

n log α / f(n) ≥ α^k − 1 , (3)

from which a uniform upper bound (log n − log f(n))/log α + O(1) on k(x^n), which we denote by k(n), was obtained in [7].
We will show that, under mild conditions on f(n), which are
satisfied by commonly used penalization functions, the bound
k(n) is not far from tight, in the following sense: for any
sufficiently large n, and any k < k(n), there are sequences of
length n that estimate order k; moreover, for all sufficiently
large k and all but a vanishing fraction of the values of n such
that k = k(n), there are sequences of length n that estimate
order k. After some preliminaries in Section II, these results
are presented in Section III, by showing explicit constructions
of sequences that attain the claimed estimated orders. The
constructions rely on properties of de Bruijn sequences [8].
We initially present results for arbitrary values of α, and then
show that these results can be tightened in the case α = 2 by exploiting properties of a special kind of binary de Bruijn
sequence, the so-called Ford sequence [9]. In Section IV, we
extend our study to the MDL estimator based on the KT
probability assignment. We show that in this case, there exist
sequences x^n that estimate order n^{1/2−ε} for any ε ∈ (0, 1/2).
This order is much larger than the maximum possible order,
k(n) = log n/log α + o(log n), attainable by a BIC estimator, and also than the order o(log n) required for consistency of the
KT-based estimator [4]. In fact, we show that, in a universal
lossless compression setting, for the constructed sequences,
imposing an artificial upper-bound on the allowed estimated
order could incur a significant asymptotic penalty in overall
description length. Similar results (with the same individual
sequences) are obtained when a context tree [5], [6], rather
than a plain Markov order, is estimated, using either the KT
or the (tree) BIC estimator.
II. PRELIMINARIES AND PROPERTIES OF THE UPPER
BOUND
We denote by u_i^j the string u_i u_{i+1} ... u_j over A, with u_i^j = λ, the empty string, when i > j. We omit the subscript when i = 1. We let |u| denote the length of a string u, and uv the concatenation of strings u and v. The terms string and sequence are used interchangeably.
We model a sequence x^n as the realization of a generic kth order Markov process, where k is unknown. We regard a string s ∈ A^k as a state of the Markov process and we say that a sequence y selects state s whenever s is a suffix of y. When k is not clear from the context, we explicitly refer to s as a k-state. For the purpose of selecting states, we assume that x^n is preceded by an arbitrary fixed semi-infinite string x_{−∞}^0. This convention uniquely determines a state selected by x^i, for each i, 0 ≤ i ≤ n, and for any order k. If x^i selects state s, 0 ≤ i < n, we say that x_{i+1} is emitted in state s and that s occurs (in position i) in x^n. We denote by n_s(x^n) the number of occurrences of s in x^n and, for a ∈ A, by n_s^{(a)}(x^n) the number of times a symbol x_i = a is emitted in state s. We omit the dependence on x^n of n_s^{(a)}, n_s, and other notations when clear from the context.
The kth order ML probability of a sequence x^n is determined by the fixed initial state and the empirical probabilities P_k(·|s) conditioned on k-states s,

P_k(a|s) = n_s^{(a)} / n_s ,  s ∈ A^k, a ∈ A , (4)

so that

−log P_k(x^n) = −∑_{s∈A^k, a∈A} n_s^{(a)} log( n_s^{(a)} / n_s ) . (5)

The class of PML estimators of interest is defined by (1)–(2), where we assume that f(n) is positive and nondecreasing, with f(n) → ∞ and f(n)/n → 0 as n → ∞.² We refer to the first and second terms on the right-hand side of (1), respectively, as the ML term (specified in (5)) and the penalty term of order k.
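For concreteness, the estimator (1)–(2) can be implemented directly from state-conditioned counts. The sketch below is our own illustration (function names are ours; a binary alphabet over '0'/'1' and an all-zero initial context are assumed), computing the ML term (5) and minimizing the cost (1).

```python
import math
from collections import defaultdict

def ml_term(x, k):
    # -log2 of the k-th order ML probability of x (eq. (5)),
    # with an all-'0' initial context standing in for x_{-inf}^0.
    counts = defaultdict(lambda: defaultdict(int))  # state -> symbol -> count
    ctx = '0' * k
    for a in x:
        counts[ctx][a] += 1
        ctx = (ctx + a)[-k:] if k else ''
    bits = 0.0
    for sym_counts in counts.values():
        n_s = sum(sym_counts.values())
        for c in sym_counts.values():
            bits -= c * math.log2(c / n_s)
    return bits

def pml_order(x, f, alpha=2, k_max=12):
    # argmin over k of C_k(x) = -log2 P_k(x) + f(n) * alpha^k (eqs. (1)-(2)).
    n = len(x)
    costs = [ml_term(x, k) + f(n) * alpha ** k for k in range(k_max + 1)]
    return min(range(k_max + 1), key=costs.__getitem__)
```

With the BIC penalty f(n) = ½ log n (α = 2), the alternating sequence (01)^8 estimates order 1, while the all-zero sequence of the same length estimates order 0.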
The upper bound k(n) on k is defined as the largest value of k satisfying (3) for a given n. Reciprocally, given k, the smallest integer n satisfying (3), denoted n(k), is a lower bound on the length of sequences that can estimate order k. In particular, from the definition of n(k), we have

((α^k − 1)/log α) f(n(k)) ≤ n(k) < ((α^k − 1)/log α) f(n(k) − 1) + 1 . (6)

The following lemma follows readily from the foregoing definitions, and from (6).

Lemma 1. Given a value of n, the inequality (3) holds for all k, 0 ≤ k ≤ k(n). We have k(n) → ∞ as n → ∞, n(k) → ∞ as k → ∞, n(k) is nondecreasing, and, moreover, for sufficiently large k,

(α^k/log α) f(n(k+1)) − 1 < n(k+1) − n(k) < (α^{k+1}/log α) f(n(k+1)) . (7)
III. SEQUENCES THAT MAXIMIZE PML-ESTIMATED ORDER
In this section we exhibit sequences of length n that get
very close to, or even precisely attain, the bound k(n) of the
previous section. The constructions will be based on de Bruijn
sequences, whose properties we review next.
A kth order de Bruijn sequence [8] is a sequence b^{α^k}, of length α^k, k ≥ 0, such that the sliding window b_{i+1}b_{i+2}...b_{i+k}, with indices taken modulo α^k, exhausts all distinct k-tuples over A. De Bruijn sequences exist for every order k, and every cyclic permutation of a de Bruijn sequence is itself a de Bruijn sequence of the same order. We denote by B_k the (nonempty) set of de Bruijn sequences of order k that have x_{−k+1}^0 as a suffix (i.e., they match, cyclically, the assumed fixed initial condition). For a sequence u, we denote by (u)^∗ the concatenation of an infinite number of copies of u, and, when |u| ≥ n, we denote by [u]^n the truncation of u to length n. Let B_k^n denote the set of sequences

B_k^n = { [(b^{α^k})^∗]^n | b^{α^k} ∈ B_k } . (8)

²In the case of penalization functions such as f(n) = c log n or f(n) = c log log n, we assume n is large enough so that f(n) is positive.
The following lemma follows immediately from the defini-
tion (8) and the properties of de Bruijn sequences.
Lemma 2. Let x^n ∈ B_k^n. If k′ ≥ k, then a k′-state, when it occurs in x^n, always emits the same symbol. In particular, when n = mα^k for some integer m ≥ 0, then
(i) each k-state s occurs m times in x^n, and we have m = n_s = n_s^{(a)} for some a ∈ A (which depends on s);
(ii) if j < k, each possible j-state s occurs mα^{k−j} times in x^n and each symbol of A is emitted mα^{k−j−1} times in s, i.e., n_s^{(a)} = mα^{k−j−1} and n_s = mα^{k−j} for all a ∈ A and all s ∈ A^j.
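A de Bruijn sequence of any order and alphabet size can be generated, e.g., with the classical Lyndon-word (FKM) construction; the following sketch (our own helper, not part of the paper) produces one period and exhibits the window property behind Lemma 2.

```python
def de_bruijn(alpha, k):
    # FKM construction: concatenating the Lyndon words over {0,...,alpha-1}
    # whose lengths divide k yields one period of a de Bruijn cycle.
    a = [0] * (k + 1)
    seq = []
    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, alpha):
                a[t] = j
                db(t + 1, t)
    db(1, 1)
    return seq

b = de_bruijn(2, 3)
# The alpha^k cyclic k-windows of one period are all distinct:
windows = {tuple(b[(i + j) % len(b)] for j in range(3)) for i in range(len(b))}
```

Repeating one period m times yields a sequence of length mα^k in which each k-state occurs exactly m times and always emits the same symbol, as in Lemma 2 (i).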
Theorem 1. For sufficiently large n, if k < k(n) and x^n ∈ B_k^n, then k(x^n) = k.

Proof: By Lemma 2, if k′ ≥ k, then the ML term of order k′ of x^n is zero. Thus, since the penalty term grows with the order, we must have k(x^n) ≤ k. Let m = ⌊n/α^k⌋. If j < k, then by Lemma 2 (ii), we have, for all a ∈ A and all s ∈ A^j, n_s^{(a)}(x^n) ≤ (m+1)α^{k−j−1} and n_s(x^n) ≥ mα^{k−j}, where at least one inequality is strict, which implies that n_s^{(a)}/n_s < (m+1)/(mα). Therefore, by (5), we have

−log P_j(x^n) > −∑_{s∈A^j, a∈A} n_s^{(a)} log( (m+1)/(mα) ) = n log( mα/(m+1) ) . (9)

Using (9) and recalling that −log P_k(x^n) = 0, we obtain, for 0 ≤ j < k,

C_j(x^n) − C_k(x^n) > n log( mα/(m+1) ) + f(n)α^j − f(n)α^k
 > n log( mα/(m+1) ) + f(n)(1 − α^{k+1})/α
 ≥ n log( mα/(m+1) ) − (n log α)/α
 = n log( α^{1−1/α} m/(m+1) )
 ≥ n log( √2 m/(m+1) ) , (10)

where the third inequality follows from the first claim of Lemma 1 and the fact that k+1 ≤ k(n), and the last inequality holds since α ≥ 2. It follows from (10) that C_j(x^n) > C_k(x^n) when m ≥ 3. The latter condition, in turn, holds for all sufficiently large n, since, by (3), and with k > 0, we have n/α^k ≥ (1 − α^{−k}) f(n)/log α ≥ f(n)/(2 log α), which is unbounded by our assumptions on f.
Theorem 1 shows that for sufficiently large n, we can construct sequences that estimate any order k up to k(n) − 1. We next show that, with additional mild assumptions on f(n), for most values of n we can construct sequences that estimate precisely order k(n). We say that the function f is nice if it is defined over the positive reals, f is concave and differentiable over (z₀, ∞) for some z₀ ∈ ℝ, and

z f′(z) < f(z) − α/2  for all z ∈ (z₀, ∞) . (11)

It is readily verified that commonly used penalization functions are nice. In particular, this includes functions of the form f(n) = c log n, f(n) = c log log n, and f(n) = cn^β for positive constants c and β < 1. The following lemma is an immediate consequence of (11).

Lemma 3. If f is nice, then n/f(n) is strictly increasing with n in (z₀, ∞).

In the sequel, for a real number z and a positive integer N, we write ⌈z⌉_N as shorthand for N⌈z/N⌉, i.e., the smallest multiple of N that is not smaller than z.
Theorem 2. Assume f is nice. Then, for sufficiently large k, if n > ⌈n(k)⌉_{α^k} and x^n ∈ B_k^n, then k(x^n) = k.
Remark: To interpret Theorem 2, we observe that for a given value of k, by Lemma 3, the set of integers n such that k(n) = k is given by the range n(k) ≤ n < n(k+1). The fraction of values of n in this range for which the theorem does not provide a sequence of length n that estimates order k(n) is upper-bounded by

α^k / (n(k+1) − n(k)) < log α / ( f(n(k+1)) − α^{−k} log α ) → 0 as k → ∞ ,

where the inequality follows from the leftmost inequality in (7), and the limit follows from the unboundedness of n(k) and f(n). Thus, Theorem 2 guarantees that for all but a vanishing fraction of values of n such that k(n) = k, there are sequences of length n that estimate order k.

To prove Theorem 2, we rely on a series of lemmas.
Lemma 4 below follows immediately from [4, Lemma 4].
For a probability vector P = (p_1, ..., p_α), we denote by H(P) = −∑_{i=1}^{α} p_i log p_i the entropy of P.

Lemma 4. If P = (p_1, ..., p_α) is a probability vector satisfying 1/(2α) ≤ p_i ≤ 2/α for all i, 1 ≤ i ≤ α, then

H(P) ≥ log α − α ∑_{i=1}^{α} ( p_i − 1/α )² .
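Lemma 4 can be sanity-checked numerically; the snippet below (our own check, not from the paper) evaluates the gap between H(P) and the quadratic lower bound for vectors satisfying the hypothesis 1/(2α) ≤ p_i ≤ 2/α.

```python
import math

def entropy(p):
    # Entropy of a probability vector, in bits.
    return -sum(q * math.log2(q) for q in p if q > 0)

def lemma4_gap(p):
    # H(P) - (log alpha - alpha * sum_i (p_i - 1/alpha)^2); Lemma 4 asserts
    # this is nonnegative when 1/(2*alpha) <= p_i <= 2/alpha for all i.
    alpha = len(p)
    bound = math.log2(alpha) - alpha * sum((q - 1 / alpha) ** 2 for q in p)
    return entropy(p) - bound
```

The bound is tight at the uniform distribution, where both sides equal log α.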
Lemma 5. Let x^n ∈ B_k^n, where n and k satisfy (3), let m = ⌊n/α^k⌋, r = n − α^k m, and assume m ≥ 1. Then,

−log P_j(x^n) ≥ n log α − rα/(4m) ,  0 ≤ j < k . (12)

Proof: Since −log P_j(x^n) is non-increasing with j, it suffices to consider j = k−1. For a (k−1)-state s, let R_s be the set of symbols of A that are emitted in state s in x_{mα^k+1}^n, the truncated (possibly empty) copy of a sequence from B_k at the end of x^n. Let r_s = |R_s|, and define T = { s ∈ A^{k−1} | r_s > 0 }. Clearly, ∑_s r_s = r, and |T| ≤ r. By Lemma 2 (ii), we have n_s^{(a)} = m + 1 if a ∈ R_s and n_s^{(a)} = m otherwise, so that n_s = mα + r_s. Thus, with m ≥ 1, we have, for all a ∈ A,

1/(2α) ≤ m/(mα + r_s) ≤ n_s^{(a)}/n_s ≤ (m+1)/(mα + r_s) ≤ 2/α ,
and Lemma 4 applied to P_s = P_{k−1}(·|s) yields, together with some algebraic manipulations,

H(P_s) ≥ log α − α [ ∑_{a∈R_s} ( (m+1)/(mα+r_s) − 1/α )² + ∑_{a∈A\R_s} ( m/(mα+r_s) − 1/α )² ]
 = log α − (α − r_s) r_s / (mα + r_s)² . (13)

Now, writing the ML term of order k−1 in terms of state-conditioned empirical entropies, and applying (13), we obtain

−log P_{k−1}(x^n) = ∑_{s∈A^{k−1}} n_s H(P_s)
 ≥ ∑_{s∈A^{k−1}} n_s log α − ∑_{s∈T} n_s (α − r_s) r_s / (mα + r_s)²
 = n log α − ∑_{s∈T} (α − r_s) r_s / (mα + r_s) , (14)

where we recall that r_s = 0 for s ∉ T, and, for the last equality, that n_s = mα + r_s. We claim that g(r_s) ≜ (α − r_s) r_s / (mα + r_s) is upper-bounded by α/(4m) for all s, which, by (14), would suffice to prove (12). Indeed, elementary analysis of the function g(ρ) for ρ ≥ 0 reveals that it has a global maximum at ρ∗ = α(√(m(m+1)) − m), with

g(ρ∗) = α( √(m+1) − √m )² ≤ α/(4m) ,
where the inequality is readily verified for m ≥ 1.

Proof of Theorem 2: Let k be large enough so that n(k) ≥ z₀. By Lemma 3, n and k satisfy (3) for all n ≥ n(k). Now, for x^n ∈ B_k^n, with n > ⌈n(k)⌉_{α^k}, Lemma 5 and (1) yield

C_j(x^n) ≥ n log α − rα/(4m) + f(n) ,  0 ≤ j < k ,

with m and r as defined in the lemma. Thus, for 0 ≤ j < k, recalling that −log P_k(x^n) = 0, we have

C_j(x^n) − C_k(x^n) ≥ n log α − rα/(4m) − f(n)(α^k − 1) . (15)

Write μ = mα^k = n − r. Since μ ≥ ⌈n(k)⌉_{α^k}, μ and k satisfy (3) with μ in the role of n, i.e., we have α^k − 1 ≤ (μ/f(μ)) log α. Thus, from (15), we have, for 0 ≤ j < k,

C_j(x^n) − C_k(x^n) ≥ (μ + r) log α − rα/(4m) − (f(n)/f(μ)) μ log α
 = −( (f(n) − f(μ))/f(μ) ) μ log α − rα/(4m) + r log α
 ≥ −( r f′(μ)/f(μ) ) μ log α − rα/(4m) + r log α , (16)

where the last inequality follows from the fact that f is concave in (z₀, ∞) and μ ≥ ⌈n(k)⌉_{α^k} ≥ z₀. Now, since n and k satisfy (3), for k > 0 we have m ≥ f(n)/(2 log α), so, recalling also the monotonicity of f, it follows from (16) that

C_j(x^n) − C_k(x^n) ≥ −( r f′(μ)/f(μ) ) μ log α − rα log α/(2 f(μ)) + r log α
 = ( 1 − (μ f′(μ) + α/2)/f(μ) ) r log α . (17)

Since μ ≥ z₀, by (11), the right-hand side of (17) is positive if r > 0. If r = 0, since n > ⌈n(k)⌉_{α^k} ≥ n(k), by (3) and Lemma 3, the right-hand side of (15) is positive. Hence, C_j(x^n) > C_k(x^n) for j < k, and, thus, k(x^n) ≥ k. Furthermore, since −log P_k(x^n) = 0 and the penalty term increases with k,
we must have, in fact, k(x^n) = k.

Let n̂(k) denote the least integer n in the interval n(k) ≤ n < n(k+1) such that for all n ≥ n̂(k) in that interval, there are sequences of length n that estimate order k. By Theorem 2, we have n̂(k) − n(k) ≤ α^k. We next show that, in the special case α = 2, we can exploit known properties of special binary de Bruijn sequences to reduce this gap to n̂(k) − n(k) = o(2^k). Specifically, the Ford sequence of order k ≥ 0, which will be denoted F_k = a_1^{2^k}, is constructed as follows: start with a_1^{k−1} = 0^{k−1}, and extend the sequence using the least-first greedy algorithm [9], where, for k−1 < i ≤ 2^k, given s = a_{i−k+1}^{i−1}, we set a_i = 0 if n_{s0} = 0 and a_i = 1 otherwise (i.e., of the sibling k-tuples s0 and s1 we always choose s0 first). It is readily verified that a de Bruijn sequence of order k is indeed constructed this way, and that the sequence is lexicographically first among all binary de Bruijn sequences of order k. We denote by F_k^n the sequence [(F_k)^∗]^n.

The following lemma is an immediate consequence of [10, Theorem 1] (n_0 and n_1 are interpreted as special cases of n_s).
Lemma 6. Let x^n = F_k^n. We have n_0(x^j) − n_1(x^j) = O( 2^k log k / k ) for all j, 1 ≤ j ≤ n.
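Since the Ford sequence is the lexicographically least binary de Bruijn sequence, it can equivalently be obtained by concatenating, in lexicographic order, the binary Lyndon words whose length divides k (the Fredricksen–Maiorana construction surveyed in [9]). The sketch below (our own helpers, not code from the paper) generates F_k this way and measures the prefix discrepancy n_0 − n_1 bounded in Lemma 6.

```python
from itertools import product

def ford(k):
    # Lex-least (Ford) de Bruijn sequence of order k: concatenation, in
    # lexicographic order, of the binary Lyndon words of length dividing k.
    words = []
    for length in range(1, k + 1):
        if k % length:
            continue
        for w in product('01', repeat=length):
            w = ''.join(w)
            # keep w if it is a Lyndon word: strictly least among its rotations
            if all(w < w[i:] + w[:i] for i in range(1, length)):
                words.append(w)
    return ''.join(sorted(words))

def max_discrepancy(x):
    # max over prefixes x^j of |n_0(x^j) - n_1(x^j)| (cf. Lemma 6)
    d, worst = 0, 0
    for a in x:
        d += 1 if a == '0' else -1
        worst = max(worst, abs(d))
    return worst
```

For k = 4 this yields 0000100110101111, whose maximum prefix discrepancy is 5, small compared with the period 2^k, in line with the O(2^k log k / k) bound.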
Theorem 3. Assume f is nice and x^n = F_k^n. For sufficiently large k and a well-characterized function g(k) = O(1), if

n ≥ n(k) + g(k) 2^k log k / k , (18)

then k(x^n) = k.

Remark: It will turn out in the proof of Theorem 3 (given in the full paper) that for some penalization functions of interest we have, in fact, g(k) = o(1). In particular, if f(z) = c log z with c > 0, the second term on the right-hand side of (18) is O( 2^k log k / k² ), whereas for f(z) = cz^β, with 0 < β < 1, it is of the form O( 2^{(1−β)k} log k / k ).
IV. SEQUENCES WITH LARGE MDL-ESTIMATED ORDER
In this section, we consider the MDL Markov order esti-
mator based on the KT probability assignment. We construct
sequences xn that estimate orders that are much larger than
those attainable by a BIC estimator, or than the bound on the
order required for consistency of the KT estimator. For notational simplicity, we focus on the case α = 2 (A = {0, 1}).

The KT probability [2] of order 0 of a binary sequence x^n is defined as KT_0(λ) = 1 (for n = 0), and

KT_0(x^n) = Γ( n_0(x^n) + 1/2 ) Γ( n_1(x^n) + 1/2 ) / ( π Γ(n+1) ) ,  n > 0 , (19)

where Γ is the Gamma function. The KT probability of order k ≥ 0, in turn, is defined as

KT_k(x^n) = ∏_{s∈A^k} KT_0( x^n[s] ) , (20)
where x^n[s] denotes the subsequence of symbols from x^n that occur in state s. Using this distribution, one can construct a lossless description of x^n of length

C_{KT,k}(x^n) = −log KT_k(x^n) + c(k) , (21)

where c(k) = O(log k) is the (non-decreasing) length of an efficient encoding of k, and we ignore integer constraints on code lengths. The estimated Markov order for a sequence x^n is the value of k that yields the shortest description, namely,

k_{KT}(x^n) = argmin_{k≥0} C_{KT,k}(x^n) . (22)

We will make use of the following relation, which follows from Stirling's approximation (see, e.g., [11]). For every sequence x^n, we have

| −log KT_0(x^n) + log P_0(x^n) − (1/2) log n − ν | ≤ γ n^{−1} , (23)

where ν = log √(π/2), and γ is a positive constant.
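The order-0 KT code length in (19) is conveniently evaluated via log-Gamma functions. The sketch below (our own helpers) assumes the normalization 1/π = Γ(1/2)^{−2} in (19), which makes KT_0(λ) = 1, and cross-checks the closed form against the sequential KT rule p(a | x^t) = (n_a(x^t) + 1/2)/(t + 1).

```python
import math

def neg_log2_kt0(n0, n1):
    # -log2 KT_0 for a binary sequence with n0 zeros and n1 ones (eq. (19)),
    # via log-Gamma; log(pi) accounts for the 1/Gamma(1/2)^2 normalization.
    return (math.lgamma(n0 + n1 + 1) + math.log(math.pi)
            - math.lgamma(n0 + 0.5) - math.lgamma(n1 + 0.5)) / math.log(2)

def neg_log2_kt0_sequential(x):
    # Same quantity accumulated symbol by symbol with the KT update rule.
    counts, bits = [0, 0], 0.0
    for t, a in enumerate(x):
        bits -= math.log2((counts[int(a)] + 0.5) / (t + 1))
        counts[int(a)] += 1
    return bits
```

Both routes give, e.g., exactly 3 bits for the two-symbol sequence 01.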
Define the sequence U_k^n = [(10^k)^∗]^n, where 10^k is a string consisting of a 1 followed by k 0's. For simplicity, we assume that the sequence x_{−∞}^0 used to determine initial states is all 0's, and that n is a multiple of k + 1. These constraints can be easily removed.
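The mechanism behind Theorem 4 below can be observed numerically by evaluating the per-state KT costs (20)–(21) of U_k^n for each candidate order. In the sketch below (our own code; the concrete penalty c(j) = 2 log₂(j + 2) is a hypothetical choice with c(j) = O(log j)), the minimizing order for U_3^{256} is k = 3: all occurring 3-states are deterministic, while every lower order pays Θ(m) bits for the nondeterministic all-zeros state.

```python
import math
from collections import defaultdict

def neg_log2_kt0(n0, n1):
    # -log2 of the order-0 KT probability (eq. (19)), via log-Gamma.
    return (math.lgamma(n0 + n1 + 1) + math.log(math.pi)
            - math.lgamma(n0 + 0.5) - math.lgamma(n1 + 0.5)) / math.log(2)

def kt_cost(x, j, c):
    # C_{KT,j}(x) of (21): sum over occurring j-states of the KT code
    # lengths of the subsequences x[s] (eq. (20)), plus the penalty c(j);
    # states are read against an all-zero initial context.
    counts = defaultdict(lambda: [0, 0])  # state -> [n0, n1] of x[s]
    ctx = '0' * j
    for a in x:
        counts[ctx][int(a)] += 1
        ctx = (ctx + a)[-j:] if j else ''
    return sum(neg_log2_kt0(n0, n1) for n0, n1 in counts.values()) + c(j)

k, m = 3, 64
x = ('1' + '0' * k) * m               # U_k^n with n = (k+1)m = 256
c = lambda j: 2 * math.log2(j + 2)    # hypothetical penalty, c(j) = O(log j)
kt_order = min(range(7), key=lambda j: kt_cost(x, j, c))
```

Here kt_order equals 3, and the order-3 cost is smaller than the order-0 cost by roughly two orders of magnitude, previewing the description-length gap discussed at the end of this section.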
Theorem 4. Let n = (k+1)m for some m ≥ 1, and x^n = U_k^n. If k and n satisfy

4n/(k+1)² ≥ (1/2) log( n/(k+1) ) + ν + (2k+1)(k+1)γ/n + c(k) , (24)

then k_{KT}(x^n) = k.
Remark: It is readily verified that, for sufficiently large n, the condition (24) is satisfied by values of k as large as k = n^{1/2−ε} for any ε ∈ (0, 1/2). Notice that, since the KT cost penalizes only states that do occur in x^n, the estimated order for the sequence x^n = U_k^n would be the same if the model under estimation was a context tree [5]. In fact, it will be shown in the full paper that a similar result (with the same sequence x^n) holds also for context trees under the BIC estimator.

Proof of Theorem 4 (outline): For 0 ≤ j ≤ k, x^n
contains occurrences of exactly j + 1 states, namely 0^j and 0^ℓ 1 0^{j−1−ℓ}, 0 ≤ ℓ < j. When j = k, each such state occurs exactly m times, always followed by the same symbol. Thus, the conditional distribution for each occurring k-state is deterministic, and, from (20) and (23), we obtain

−log KT_k(x^n) ≤ ((k+1)/2) log m + (k+1)( ν + γ/m ) . (25)
When j < k, states of the form 0^ℓ 1 0^{j−1−ℓ} occur m times each, always followed by a 0. The state 0^j occurs (k−j+1)m times, m of them followed by a 1, and the rest followed by a 0. From (20) and (23), writing δ = k−j+1, and denoting the binary entropy function by h(·), we obtain

−log KT_j(x^n) ≥ δm h(δ^{−1}) + (j/2) log m + (1/2) log(δm) + (j+1)ν − jγ/m − γ/(δm)
 ≥ 4m(δ−1)δ^{−1} + ((j+1)/2) log m + (j+1)ν − kγ/m , (26)

where the second inequality follows by applying the uniform bound h(p) ≥ 4p(1 − p), recalling that 0 ≤ j < k, and dropping some nonnegative terms. Now, by (21), (25), and (26), recalling that m = n/(k+1) and c(j) > 0, we obtain
C_{KT,j}(x^n) − C_{KT,k}(x^n) > 4n(δ−1)/(k+1)² − ((δ−1)/2) log( n/(k+1) ) − (δ−1)ν − (2k+1)(k+1)γ/n − c(k) .

Recalling that 1 < δ ≤ k+1, and factoring out δ−1, we verify that, for 0 ≤ j < k, we have C_{KT,j}(x^n) > C_{KT,k}(x^n) whenever (24) holds. It is readily shown in the full paper that C_{KT,j}(x^n) > C_{KT,k}(x^n) also when j > k. Therefore, under the conditions of the theorem, we have k_{KT}(x^n) = k.
Remark: Consider the case where an upper bound K(n) = o(n^{1/2−ε}) is imposed on the allowed estimated order of the KT-based MDL estimator (or the tree BIC estimator). Choose k = (1+ξ)K(n) for some ξ > 0, and x^n = U_k^n. By (25), recalling that m = n/(k+1), we obtain C_{KT,k}(x^n) = O(K(n) log n) for this choice of k. On the other hand, if j ≤ K(n), then, by (26), we have C_{KT,j}(x^n) = Ω(n/K(n)), and, hence, C_{KT,k}(x^n)/C_{KT,j}(x^n) → 0 as n → ∞. We conclude that limiting the allowed estimated order as assumed incurs a significant asymptotic penalty in the description length of the individual sequence x^n. The gap is more pronounced the smaller the upper bound K(n) is. In particular, with K(n) = O(log n) (as required for consistency of the MDL or tree estimator [4]), we obtain a code length Ω(n/log n) with the restricted estimator, as compared to O(log² n) with an unrestricted one.
REFERENCES
[1] I. Csiszár and P. C. Shields, "The consistency of the BIC Markov order estimator," Annals of Statistics, vol. 28, no. 6, pp. 1601–1619, 2000.
[2] R. E. Krichevskii and V. K. Trofimov, "The performance of universal encoding," IEEE Trans. Inform. Theory, vol. 27, pp. 199–207, Mar. 1981.
[3] L. Finesso, "Estimation of the order of a finite Markov chain," in Recent Advances in Mathematical Theory of Systems, Control, Networks and Signal Processing, H. Kimura and S. Kodama, Eds. Mita Press, 1992, pp. 643–645.
[4] I. Csiszár, "Large-scale typicality of Markov sample paths and consistency of MDL order estimators," IEEE Trans. Inform. Theory, vol. 48, no. 6, 2002.
[5] J. Rissanen, "A universal data compression system," IEEE Trans. Inform. Theory, vol. 29, pp. 656–664, Sep. 1983.
[6] I. Csiszár and Z. Talata, "Context tree estimation for not necessarily finite memory processes, via BIC and MDL," IEEE Trans. Inform. Theory, vol. 52, no. 3, pp. 1007–1016, Mar. 2006.
[7] A. Martín, N. Merhav, G. Seroussi, and M. J. Weinberger, "Twice-universal simulation of Markov sources and individual sequences," IEEE Trans. Inform. Theory, vol. 56, no. 9, pp. 4245–4255, Sep. 2010.
[8] N. G. de Bruijn, "A combinatorial problem," Koninklijke Nederlandse Akademie van Wetenschappen, Proceedings, vol. 49, part 2, pp. 758–764, 1946.
[9] H. Fredricksen, "A survey of full length nonlinear shift register cycle algorithms," SIAM Review, vol. 24, no. 2, pp. 195–221, 1982.
[10] J. Cooper and C. Heitsch, "The discrepancy of the lex-least de Bruijn sequence," Discrete Mathematics, vol. 310, no. 6–7, pp. 1152–1159, 2010.
[11] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge: Cambridge Univ. Press, 2006.