
MAXIMUM LIKELIHOOD ESTIMATION AND EM FIXED POINT IDEALS FOR BINARY TENSORS

Daniel Lemke

Version of May 27, 2016

Contents

1 Introduction
  1.1 Maximum Likelihood Estimation
  1.2 Results

2 Background Math
  2.1 Nonnegative Rank
  2.2 Tensors of Bounded Nonnegative Rank
  2.3 Maximum Likelihood Estimation: A Closer Look
  2.4 EM Algorithm for Matrices
  2.5 Ideals, Varieties, and Algorithms

3 EM Fixed Point Ideal
  3.1 Extension of EM Algorithm to Tensors
  3.2 EM Fixed Point Ideal for Tensors

4 MLE Using Boundary Strata
  4.1 Boundary Stratification of Binary Tensors
  4.2 Experiments

5 Implementation
  5.1 Cellular Decomposition, Primality, and Primary Decomposition
  5.2 EM, MLE, and Boundary Strata Experiments

6 Conclusion

Bibliography


I would like to express my deepest gratitude to Serkan Hosten and Kaie Kubjas for providing the ideas and framework found in this thesis. Thanks go to Nathanael Aff for programming guidance and to Michelle Lemke Riggs and Matthias Beck for their special awareness of English grammar.


1 Introduction

The term Algebraic Statistics first appeared in the literature as the title of a 2001 book by Giovanni Pistone, Eva Riccomagno, and Henry Wynn [PRW]. Beginning with an introduction to Gröbner bases, it presents the application of polynomial algebra to statistics, discrete probability, and experimental design. In 2005 Lior Pachter and Bernd Sturmfels published a single-volume collection of works titled Algebraic Statistics for Computational Biology [PS]. It was written by an array of professionals and graduate students from the fields of algebra and computational biology. The book provides a thorough treatment of the basic principles of algebraic statistics and their relationship to computational biology, and presents an emerging dictionary between algebraic geometry and statistics.

Our research is a continuation of the work of Sturmfels et al. The story picks up with a well-known problem in statistics called Maximum Likelihood Estimation.

1.1 Maximum Likelihood Estimation

The likelihood of a set of data is the probability of observing that particular set of data, given some statistical model, which is just a family of probability distributions. The values of the parameters that maximize the sample likelihood function are known as the Maximum Likelihood Estimates or MLEs. MLEs have been studied since the dawn of the 20th century and were made popular by the statistician and biologist Sir Ronald Fisher [Wik]. Consider the following example from [PS] §1.1. Suppose we generate a DNA sequence by rolling three tetrahedral dice, each labelled A, C, G, and T, for nucleobases adenine, cytosine, guanine, and thymine. Two of the dice are unfair, one is fair, and suppose they have the associated probabilities of Table 1.1.

Table 1.1

              A     C     G     T
first die     0.15  0.33  0.36  0.16
second die    0.27  0.24  0.23  0.26
third die     0.25  0.25  0.25  0.25


We generate the DNA sequence

CTCACGTGATGAGAGCATTCTCAGACCGTGACGCGTGTAGCAGCGGCTC


by selecting the first die with probability θ1, the second with probability θ2, and the third with probability 1 − θ1 − θ2. We would like to determine the parameters θ1 and θ2 that were used to select the dice. This amounts to a problem in optimization. Let pA, pC, pG, and pT denote the probabilities of generating each of the four letters. The statistical model derived from Table 1.1 is written algebraically as follows:

pA = −0.10 · θ1 + 0.02 · θ2 + 0.25,
pC =  0.08 · θ1 − 0.01 · θ2 + 0.25,
pG =  0.11 · θ1 − 0.02 · θ2 + 0.25,
pT = −0.09 · θ1 + 0.01 · θ2 + 0.25.

We emphasize that these are polynomials in the unknowns θ1 and θ2. In statistical terminology these unknowns are called the model parameters. Each of the 49 characters was generated independently, so the likelihood of observing the above DNA sequence is the product of the probabilities of observing the individual letters:

L(θ1, θ2) = pC(θ1, θ2) · pT(θ1, θ2) · pC(θ1, θ2) · pA(θ1, θ2) ··· pC(θ1, θ2)

          = pA(θ1, θ2)^10 · pC(θ1, θ2)^14 · pG(θ1, θ2)^15 · pT(θ1, θ2)^10.

In the maximum likelihood framework we estimate the unknown parameters by those values in the parameter space which make the likelihood of observing the data as large as possible. The parameter space over which we maximize L(θ1, θ2) is the triangle

Θ = {(θ1, θ2) ∈ R2 : θ1 > 0 and θ2 > 0 and θ1 + θ2 ≤ 1}.

It is simpler and equivalent to maximize the log of the likelihood function, denoted ℓ(θ):

ℓ(θ) = log(L(θ1, θ2))

     = 10 log(pA(θ1, θ2)) + 14 log(pC(θ1, θ2)) + 15 log(pG(θ1, θ2)) + 10 log(pT(θ1, θ2)),

and we can obtain the solution to this optimization problem using techniques from calculus. Optimization yields the maximum likelihood estimate

(θ1, θ2) = (0.5191263945, 0.2172513326).
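This estimate is easy to reproduce numerically. The following minimal Julia sketch (for illustration only; the function name dice_mle is ours, and this is not code from the thesis) maximizes ℓ(θ1, θ2) by brute-force grid search over the triangle Θ:

# Model polynomials from Table 1.1 and the observed counts (10, 14, 15, 10).
pA(t1, t2) = -0.10t1 + 0.02t2 + 0.25
pC(t1, t2) =  0.08t1 - 0.01t2 + 0.25
pG(t1, t2) =  0.11t1 - 0.02t2 + 0.25
pT(t1, t2) = -0.09t1 + 0.01t2 + 0.25

loglik(t1, t2) = 10log(pA(t1, t2)) + 14log(pC(t1, t2)) +
                 15log(pG(t1, t2)) + 10log(pT(t1, t2))

# Brute-force grid search over Θ = {θ1 > 0, θ2 > 0, θ1 + θ2 ≤ 1}.
function dice_mle(step = 0.0005)
    best, arg = -Inf, (0.0, 0.0)
    for t1 in step:step:1-step, t2 in step:step:1-t1
        v = loglik(t1, t2)
        v > best && ((best, arg) = (v, (t1, t2)))
    end
    return arg
end

println(dice_mle())   # ≈ (0.5191, 0.2173)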

One of the drawbacks of MLEs, in terms of popular use, is that maximum likelihood estimation is in general a nonconvex optimization problem requiring solutions to complicated nonlinear systems of equations. It is common in practice to circumvent these issues by using the hill-climbing Expectation Maximization (EM) algorithm, one of the main topics of this thesis. However, any algorithm of this type is doomed to imperfection: it will inevitably run into the problem of being trapped in local maxima and will have no way of providing a certificate for having found the global optimum, which may or may not exist [KRS+, pg. 2].

1.2 Results

We analyze the behavior of the EM algorithm in the case where the model M, the space over which we are optimizing, consists of 2×2×2 data arrays of nonnegative rank ≤ 2 (cf. §2.1). M is a nonconvex, compact, semialgebraic subset of a 7-dimensional tetrahedron.


Figure 1.1: Representative picture of the 7-dimensional model M as a 3-dimensional nonconvex, nonlinear, compact subset of the 3-dimensional tetrahedron.

Since maximum likelihood estimation is an optimization problem, in order to locate the global optimum one restricts the objective function to the interior and to each boundary stratum, finds the maximum on each of these strata, and picks the best value among them. Allman, Hosten, Rhodes, and Zwiernik [AHRZ] give exact formulas for the maxima on each boundary stratum of M. [AHRZ] follows [ARSZ], in which M is realized as those probability distributions satisfying a special set of polynomial equalities and inequalities. We analyze the [AHRZ] formulas by determining how often they produce MLEs within M. We determine the strata of M that the EM algorithm is most attracted to, find the frequency with which the EM algorithm locates the global optimum, and count the number of times the EM algorithm must be run in order to find the MLE. We also compare the computation times of running the algorithm against using [AHRZ], and produce a picture of the behavior of the algorithm on a 3-dimensional slice of the 7-dimensional model M.

We also analyze an algebraic approach to the EM algorithm and MLE. The EM fixed points are all the points that the EM algorithm can potentially converge to. These points comprise the entire collection of maxima in the relative interior of M, as well as maximizers on the boundaries of M, and can be realized as the vanishing set of a collection of polynomials. We find these polynomials, following in the footsteps of [KRS+], and describe the set of all EM fixed points of maximum likelihood problems for two separate classes of 2×2×2 data arrays. In total we discuss and compare three approaches to the maximum likelihood problem on M; one is algorithmic, one is formulaic, and one is algebraic.

In Chapter 2 we cover the background math necessary to understand Chapters 3 and 4. These concepts include MLE, nonnegative rank, tensors of bounded nonnegative rank, and the EM algorithm for matrices. We also discuss ideals, varieties, and primary decomposition. In Chapter 3 we describe the EM fixed point ideal for binary tensors of nonnegative rank less than or equal to 2 and 3, we describe cellular decomposition, which was used to produce these ideals, and we provide tables completely characterizing these ideals. In Chapter 4 we provide results on MLE using the boundary strata given in [ARSZ] and [AHRZ].


2 Background Math

2.1 Nonnegative Rank

The nonnegative rank of a nonnegative matrix A ∈ R^{m×n}, denoted rank+(A), is the smallest r ∈ Z≥0 such that A = B · C for nonnegative B ∈ R^{m×r} and nonnegative C ∈ R^{r×n}. Equivalently, it is the smallest r such that A can be written as the sum of r nonnegative rank 1 matrices,

A = Σ_{i=1}^{r} xi yi,   with xi ∈ R^{m×1}_{≥0}, yi ∈ R^{1×n}_{≥0}.

Rank is always less than or equal to nonnegative rank. The smallest case for which rank and nonnegative rank disagree is m = n = 4; [CR] provides the standard example. It is shown there that the matrix

[ 1 1 0 0 ]
[ 1 0 1 0 ]
[ 0 1 0 1 ]
[ 0 0 1 1 ]

has rank+ = 4, but by observing linear dependence, or that

[ 1 1 0 0 ]   [ 1 1 0 ]
[ 1 0 1 0 ] = [ 1 0 1 ] · [ 1 0 0 −1 ]
[ 0 1 0 1 ]   [ 0 1 0 ]   [ 0 1 0  1 ]
[ 0 0 1 1 ]   [ 0 0 1 ]   [ 0 0 1  1 ],

we see that this matrix has rank 3 in the usual sense. Stephen Vavasis shows that nonnegative matrix factorization is NP-hard in [Vav].
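The ordinary rank is easy to verify computationally; the following is a small Julia check (ours, not code from the thesis) of the factorization above:

using LinearAlgebra

A = [1 1 0 0;
     1 0 1 0;
     0 1 0 1;
     0 0 1 1]
B = [1 1 0;
     1 0 1;
     0 1 0;
     0 0 1]
C = [1 0 0 -1;
     0 1 0  1;
     0 0 1  1]

@assert B * C == A    # the rank-3 factorization (note the negative entry in C)
println(rank(A))      # 3: the ordinary rank, while rank+(A) = 4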


2.2 Tensors of Bounded Nonnegative Rank

A real nonnegative tensor is a multidimensional array in R^{d1×d2×···×dn}_{≥0}. A vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and a 3-or-higher dimensional tensor is just a “tensor”.

Figure 2.1: A 3× 3× 3- and 2× 2× 2× 2-tensor. Image sources: [Kar] & [Wal].

The cells of a 3×3×3 Rubik’s Cube represent a 3×3×3-tensor, and a labelling of the vertices of a 4-dimensional cube represents a 2×2×2×2-tensor.

Example 2.1. Let a = (a1, a2), b = (b1, b2), c = (c1, c2) ∈ R²_{≥0}; then a ⊗ b ⊗ c is a nonnegative rank 1, 2×2×2-tensor and can be written in slices as

( a1b1c1  a1b1c2   a2b1c1  a2b1c2 )
( a1b2c1  a1b2c2   a2b2c1  a2b2c2 ).

This is just one view of the tensor (front-to-back), but note that it is rank 1 in the usual sense. Indeed, each slice is a scalar multiple of the other, independent of the viewpoint.

Figure 2.2: This 2×2×2-tensor can be written in different slices as viewed from the top-down, bottom-up, left-right, and right-left.
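The proportionality of slices is quick to see by building such a tensor explicitly. A minimal Julia sketch (ours, with made-up vectors a, b, c; not code from the thesis):

a, b, c = [0.3, 0.7], [0.5, 0.5], [0.9, 0.1]
P = [a[i] * b[j] * c[k] for i in 1:2, j in 1:2, k in 1:2]

# slices in the first direction: P[2,:,:] is a scalar multiple of P[1,:,:]
println(P[2, :, :] ≈ (a[2] / a[1]) * P[1, :, :])   # true
# the same holds for slices in the second (and third) direction
println(P[:, 2, :] ≈ (b[2] / b[1]) * P[:, 1, :])   # true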

A tensor P of format d1×d2×···×dn has nonnegative rank at most r if P can be written as the sum of r nonnegative rank 1 tensors, and nonnegative rank exactly r if r is the smallest such natural number. Thus


we can build tensors of arbitrary nonnegative rank by adding nonnegative rank 1 tensors. A rank+ r tensor of this form can be written

P = a11 ⊗ a12 ⊗ ··· ⊗ a1n + a21 ⊗ a22 ⊗ ··· ⊗ a2n + ··· + ar1 ⊗ ar2 ⊗ ··· ⊗ arn   (2.1)

with nonnegative vectors aij ∈ R^{dj}_{≥0}.

Example 2.2. Let P = [pijk] be a real 2×2×2-tensor. Then P has nonnegative rank 2 if there exist nonnegative 2×2-matrices

A = [ a11 a12 ],   B = [ b11 b12 ],   C = [ c11 c12 ]
    [ a21 a22 ]        [ b21 b22 ]        [ c21 c22 ]

such that

pijk = a1ib1jc1k + a2ib2jc2k.

Figure 2.3: Rank+ 2 tensor decomposition, adapted from [KB], depicting a general rank+ 2 tensor being constructed by adding rank+ 1 tensors, which are themselves built from the rows of the nonnegative matrices A, B, and C.

It is shown in [Lan, §5.5] that the set of real tensors P = [p_{i1 i2 ··· in}] of format d1×d2×···×dn of nonnegative rank ≤ 2 is a closed semialgebraic subset of dimension

2(d1 + d2 + ··· + dn) − 2(n − 1).

Throughout, we informally refer to the set of tensors of some dimensions and rank as a spaceof tensors of some dimensions and rank.

Definition 2.1 ([ARSZ]). Suppose P is of the form (2.1) with n ≥ 3, di ≥ 2, and r = 2. Pick any subset A of [n] = {1, 2, ..., n} with 1 ≤ |A| ≤ n − 1 and write the tensor P as an ordinary matrix with Π_{i∈A} di rows and Π_{j∉A} dj columns. The flattening rank of P is the maximal rank of any of these matrices.

Definition 2.2 ([ARSZ]). Fix a tuple π = (π1, π2, ..., πn) where πi is a permutation of {1, ..., di}. Then P is π-supermodular if

pi1i2···in · pj1j2···jn ≤ pk1k2···kn · pl1l2···ln   (2.2)

whenever {ir, jr} = {kr, lr} and πr(kr) ≤ πr(lr) hold for r = 1, 2, ..., n. A tensor P is called supermodular if it is π-supermodular for some π.


Theorem 2.1 ([ARSZ]). A nonnegative tensor P has nonnegative rank at most 2 if and only if P is supermodular and has flattening rank at most 2.

Example 2.3. Let P = [pijkℓ] be a real 2×2×2×2-tensor. P has flattening rank at most 2 exactly when all 3-minors of the matrices

[ p1111 p1112 p1121 p1122 ]   [ p1111 p1112 p1211 p1212 ]   [ p1111 p1121 p1211 p1221 ]
[ p1211 p1212 p1221 p1222 ]   [ p1121 p1122 p1221 p1222 ]   [ p1112 p1122 p1212 p1222 ]
[ p2111 p2112 p2121 p2122 ]   [ p2111 p2112 p2211 p2212 ]   [ p2111 p2121 p2211 p2221 ]
[ p2211 p2212 p2221 p2222 ],  [ p2121 p2122 p2221 p2222 ],  [ p2112 p2122 p2212 p2222 ]   (2.3)

vanish; these matrices are obtained by setting n = 4 and A = {1, 2}, A = {1, 3}, and A = {1, 4}, respectively, in Definition 2.1. Since A and A^c yield transpose matrices, and since A = {1} results in a 2×8-matrix, there are no other 3-minors to consider.
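These flattenings are straightforward to compute. Here is a small Julia sketch (ours, not code from the thesis) of Definition 2.1 for a 2×2×2×2-tensor, using reshape and permutedims so that the rows are indexed by the subset A:

using LinearAlgebra

U = rand(2, 2, 2, 2)                               # a random nonnegative tensor
F12 = reshape(U, 4, 4)                             # A = {1,2}: rows indexed by (i1,i2)
F13 = reshape(permutedims(U, (1, 3, 2, 4)), 4, 4)  # A = {1,3}
F14 = reshape(permutedims(U, (1, 4, 2, 3)), 4, 4)  # A = {1,4}
println(maximum(rank.([F12, F13, F14])))           # flattening rank; 4 for generic U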

Example 2.4 ([ARSZ]). Let P = [pijk] be a real 2× 2× 2-tensor. As in Example 2.2,

pijk = a1ib1jc1k + a2ib2jc2k.

In this case there are no flattening rank conditions, since for each flattening there are no 3-minors. For π = (id, id, id), the binomial inequalities for supermodularity are

p111p222 ≥ p112p221    p111p222 ≥ p121p212    p111p222 ≥ p211p122
p112p222 ≥ p122p212    p121p222 ≥ p122p221    p211p222 ≥ p212p221
p111p122 ≥ p112p121    p111p212 ≥ p112p211    p111p221 ≥ p121p211.   (2.4)

Nonnegative 2×2×2-tensors P that satisfy these nine inequalities lie in the set M^{id,id,id} = M^{(12),(12),(12)}. By label swapping 1 ↔ 2, we obtain three other sets M^{id,id,(12)}, M^{id,(12),id} = M^{(12),id,(12)}, and M^{(12),id,id} = M^{id,(12),(12)}. Thus, by definition, the semialgebraic set of all supermodular 2×2×2-tensors is the union

M = M^{id,id,id} ∪ M^{id,id,(12)} ∪ M^{id,(12),id} ∪ M^{(12),id,id}.   (2.5)

Theorem 2.1 states that P ∈ R^{2×2×2} has nonnegative rank ≤ 2 if and only if P lies in M.
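Theorem 2.1 makes membership in M easy to test numerically. Below is a minimal Julia sketch (ours, not code from the thesis): it checks the nine inequalities (2.4) for π = (id, id, id) and then for the three label swaps in (2.5); since there are no flattening rank conditions in the 2×2×2 case, this tests nonnegative rank ≤ 2 for a nonnegative tensor.

supermodular_id(p) =
    p[1,1,1]*p[2,2,2] ≥ p[1,1,2]*p[2,2,1] && p[1,1,1]*p[2,2,2] ≥ p[1,2,1]*p[2,1,2] &&
    p[1,1,1]*p[2,2,2] ≥ p[2,1,1]*p[1,2,2] && p[1,1,2]*p[2,2,2] ≥ p[1,2,2]*p[2,1,2] &&
    p[1,2,1]*p[2,2,2] ≥ p[1,2,2]*p[2,2,1] && p[2,1,1]*p[2,2,2] ≥ p[2,1,2]*p[2,2,1] &&
    p[1,1,1]*p[1,2,2] ≥ p[1,1,2]*p[1,2,1] && p[1,1,1]*p[2,1,2] ≥ p[1,1,2]*p[2,1,1] &&
    p[1,1,1]*p[2,2,1] ≥ p[1,2,1]*p[2,1,1]

# relabel 1 <-> 2 in the third, second, or first index and re-test
in_M(p) = supermodular_id(p) || supermodular_id(p[:, :, [2,1]]) ||
          supermodular_id(p[:, [2,1], :]) || supermodular_id(p[[2,1], :, :])

a, b, c = rand(2), rand(2), rand(2)
P1 = [a[i] * b[j] * c[k] for i in 1:2, j in 1:2, k in 1:2]
println(in_M(P1))   # true: every rank+ 1 tensor lies in M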

2.3 Maximum Likelihood Estimation: A Closer Look

When dealing with statistical models involving discrete data we may identify the sample space with the set of the first m positive integers,

[m] := {1, 2, ..., m}.

A probability distribution on the set [m] is a point in the probability simplex

∆_{m−1} := { (p1, ..., pm) ∈ R^m : Σ_{i=1}^{m} pi = 1 and pj ≥ 0 for all j }.


The algebraic statistical model is a natural generalization of the ordinary statistical model. It comes as the image of a polynomial map

f : R^d → R^m,   θ = (θ1, θ2, ..., θd) ↦ (f1(θ), f2(θ), ..., fm(θ)).   (2.6)

Each fi is a polynomial in R[θ1, ..., θd], and θ1, ..., θd are the model parameters. Furthermore, (θ1, θ2, ..., θd) is a point in Θ, a non-empty open subset of R^d called the parameter space of the model f. We assume that Θ satisfies

fi(θ) > 0 for all i ∈ [m] and θ ∈ Θ.

Since the data is discrete, it can be given in the form of a sequence of observations

i1, i2, ..., iN (2.7)

where each ij is an element of the sample space [m]. The integer N is the sample size. This data can be summarized in the data vector u = (u1, u2, ..., um), where uk is the number of indices j ∈ [N] such that ij = k. Hence u ∈ N^m, where N = {0, 1, 2, ...}, and u1 + u2 + ··· + um = N. The empirical distribution corresponding to the data (2.7) is the scaled vector (1/N)u, which is a point in the probability simplex ∆. We consider the model f to be a “good fit” for the data u if there exists a parameter vector θ ∈ Θ such that the probability distribution f(θ) is close, in a statistically meaningful way, to the empirical distribution (1/N)u. Were we to draw N times at random from the set [m] with respect to the probability distribution f(θ), then the probability of observing the sequence (2.7) gives the likelihood function

L(θ) = f_{i1}(θ) f_{i2}(θ) ··· f_{iN}(θ) = f1(θ)^{u1} f2(θ)^{u2} ··· fm(θ)^{um}.   (2.8)

Since u represents the observed data it is fixed, and L depends only on θ; therefore, L is a function from Θ to R>0. It is equivalent but simpler to deal with the log of the likelihood function, ℓ(θ). The problem of maximum likelihood estimation is to maximize ℓ(θ) as θ ranges over the parameter space Θ. Put plainly, we aim to solve the optimization problem:

maximize ℓ(θ) subject to θ ∈ Θ.   (2.9)

A solution to (2.9) is called a maximum likelihood estimate of θ with respect to the model f and the data u, and is denoted θ̂. For many statistical models, a maximum likelihood estimate may not exist, and if it does, there could be more than one global maximum; in fact, there can be infinitely many of them [PS]. Also, it may be difficult to find any one of these global maxima. This is where the Expectation Maximization (EM) algorithm enters the picture. It is a numerical method for finding solutions to (2.9), but it also gives insight, like shading paper over a leaf, into the topology of the model M. For a detailed treatment of maximum likelihood estimation in the context of computational biology, see [PS] §1.1, 1.3, and 3.3, from which the above exposition is derived. Let’s consider maximum likelihood estimation in a less general setting.

The rth mixture model M of two discrete random variables X and Y expresses the conditional independence statement X ⊥⊥ Y | Z, where Z is a hidden variable with r states¹. Now,

¹Imagine having data on hair length and height. The hidden variable is gender and has r states, depending on how one chooses to classify gender.


assuming X and Y have m and n states respectively, their joint distribution is written as an m×n-matrix of nonnegative rank ≤ r whose entries sum to 1. Let the nonnegative matrix

U = [ u11 ··· u1n ]
    [  ⋮   ⋱   ⋮  ]
    [ um1 ··· umn ]

be a collection of independent and identically distributed samples from a joint distribution. Here, uij is the number of observations in the sample with X = i and Y = j. The sample size is u++ = Σ_{i,j} uij. The EM algorithm attempts to maximize the log-likelihood function (2.12) of the model M. It approximates the data matrix U with a product of nonnegative matrices A and B, where A ∈ R^{m×r}_{≥0} and B ∈ R^{r×n}_{≥0}. As mentioned in the introduction, this is a nonconvex optimization problem, and any algorithm that attempts to solve it will run into a host of problems, of which the following dichotomy is most fundamental: either the MLE P̂ lies in the relative interior of the model M, or it lies in the boundary ∂M of the model. If P̂ lies in ∂M, then it is generally not a critical point for the likelihood function in the space of rank r matrices. It is shown in [KRS+] that for 8×8-matrices of nonnegative rank ≤ 5, 96% of data matrices have MLEs lying in the boundary ∂M.

Let ∆_{mn−1} denote the probability simplex of nonnegative m×n-matrices P = [pij]. The model M is the subset of ∆_{mn−1} consisting of all matrices of the form

P = A · Λ · B,   (2.10)

where A is a nonnegative m×r-matrix whose columns sum to 1, Λ is a nonnegative r×r diagonal matrix whose entries sum to 1, and B is a nonnegative r×n-matrix whose rows sum to 1. The kth column of A represents the conditional probability distribution of X given Z = k; the kth row of B represents the conditional probability distribution of Y given Z = k; and the diagonal of Λ is the probability distribution of Z. The parameter space in which (A, Λ, B) lies is the convex polytope

Θ = (∆_{m−1})^r × ∆_{r−1} × (∆_{n−1})^r.

The model M is the image of the trilinear map

Θ → ∆_{mn−1},   (A, Λ, B) ↦ P.

We aim to learn the model parameters (A, Λ, B) by maximizing the likelihood function

(u++ choose u) · Π_{i=1}^{m} Π_{j=1}^{n} pij^{uij}   (2.11)

or equivalently, by maximizing the log-likelihood function

ℓU = Σ_{i=1}^{m} Σ_{j=1}^{n} uij · log( Σ_{k=1}^{r} aik λk bkj )   (2.12)

over M.
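To make the parametrization concrete, here is a minimal Julia sketch (ours, not the code used in our experiments) that draws a random point of Θ, forms P = A·Λ·B as in (2.10), and evaluates the log-likelihood (2.12) for a made-up count matrix U:

using LinearAlgebra

m, n, r = 3, 4, 2
A = mapslices(col -> col ./ sum(col), rand(m, r); dims = 1)  # columns in ∆_{m-1}
B = mapslices(row -> row ./ sum(row), rand(r, n); dims = 2)  # rows in ∆_{n-1}
λ = normalize!(rand(r), 1)                                   # diagonal of Λ, in ∆_{r-1}

P = A * Diagonal(λ) * B                                      # a point of the model M
@assert sum(P) ≈ 1

U = rand(0:20, m, n)                                         # a data matrix of counts
ℓU = sum(U .* log.(P))                                       # log-likelihood (2.12)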


2.4 EM Algorithm for Matrices

The EM algorithm for m×n-matrices is an iterative method for finding local maxima of the log-likelihood function (2.12). Algorithm 1 presents the version in [PS], §1.3.

Algorithm 1 Function EM(U, r)

Select random a1, a2, ..., ar ∈ ∆_{m−1}, random λ ∈ ∆_{r−1}, and random b1, b2, ..., br ∈ ∆_{n−1}.
Run the following steps until the entries of the m×n-matrix P converge.
E-Step: Estimate the m×r×n-table that represents the expected hidden data:
    Set vikj := (aik λk bkj / Σ_{l=1}^{r} ail λl blj) · uij   for i = 1, ..., m, k = 1, ..., r, and j = 1, ..., n.
M-Step: Maximize the likelihood function of the model for the hidden data:
    Set λk := Σ_{i=1}^{m} Σ_{j=1}^{n} vikj / u++   for k = 1, ..., r.
    Set aik := Σ_{j=1}^{n} vikj / (u++ λk)   for k = 1, ..., r, i = 1, ..., m.
    Set bkj := Σ_{i=1}^{m} vikj / (u++ λk)   for k = 1, ..., r, j = 1, ..., n.
Update the estimate of the joint distribution for our mixture model:
    Set pij := Σ_{k=1}^{r} aik λk bkj   for i = 1, ..., m, j = 1, ..., n.
Return P.

The alternating sequence of estimation and maximization steps (E- and M-steps) defines trajectories in the parameter polytope Θ. The log-likelihood function (2.12) is nondecreasing along each trajectory (cf. [PS], Theorem 1.15). The value can remain unchanged only at a fixed point of the EM algorithm.
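For concreteness, here is a minimal Julia implementation of Algorithm 1 (a sketch for illustration, with a fixed iteration count standing in for the convergence test; this is not the code used in our experiments):

using LinearAlgebra

function em_matrix(U::Matrix, r::Int; iters::Int = 500)
    m, n = size(U)
    A = mapslices(col -> col ./ sum(col), rand(m, r); dims = 1)
    B = mapslices(row -> row ./ sum(row), rand(r, n); dims = 2)
    λ = normalize!(rand(r), 1)
    upp = sum(U)
    P = A * Diagonal(λ) * B
    for _ in 1:iters
        # E-step: expected hidden data v[i,k,j]
        v = [A[i,k] * λ[k] * B[k,j] / P[i,j] * U[i,j]
             for i in 1:m, k in 1:r, j in 1:n]
        # M-step
        λ = [sum(v[:, k, :]) for k in 1:r] ./ upp
        A = [sum(v[i, k, :]) / (upp * λ[k]) for i in 1:m, k in 1:r]
        B = [sum(v[:, k, j]) / (upp * λ[k]) for k in 1:r, j in 1:n]
        P = A * Diagonal(λ) * B
    end
    return P, A, λ, B
end

P, = em_matrix(rand(1:20, 4, 4), 2)   # one run from a random start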

Definition 2.3. An EM fixed point for a given table U is any point (A, Λ, B) in the polytope Θ = (∆_{m−1})^r × ∆_{r−1} × (∆_{n−1})^r to which the EM algorithm can converge if it is applied to (U, r).

Lemma 2.2 ([KRS+]). The following are equivalent for a point (A, Λ, B) in the parameter polytope Θ:

1. The point (A, Λ, B) is an EM fixed point.

2. If we start EM with (A, Λ, B) instead of a random point, then EM converges to (A, Λ, B).

3. The point (A, Λ, B) remains fixed after one E-step and one M-step.

Every global maximum P̂ of ℓU is among the EM fixed points. [KRS+] identify the polynomials whose roots represent all fixed points for the 4×4-matrix case. Since a point is EM fixed if and only if it stays fixed after an E-step and an M-step, we can write rational function equations for the EM fixed points in Θ. We examine this process in depth in Chapter 3.

2.5 Ideals, Varieties, and Algorithms

Let R = K[x1, ..., xn] be the ring of polynomials in n variables with coefficients in a subfieldK of the real numbers R, usually the rational numbers K = Q.


Definition 2.4. A subset I ⊆ R is an ideal in R if I is a subgroup of R under addition, and for every f ∈ I and every g ∈ R we have fg ∈ I. Equivalently, an ideal I is closed under taking linear combinations with coefficients in the ring R.

Definition 2.5. Let K be a field and let f1, ..., fs be polynomials in K[x1, ..., xn]. Then we set

V(f1, ..., fs) = {(a1, ..., an) ∈ K^n : fi(a1, ..., an) = 0 for all 1 ≤ i ≤ s}.

We call V(f1, ..., fs) the variety defined by f1, ..., fs.

Let T = {f1, ..., fs}. The ideal generated by T, denoted ⟨T⟩, is the smallest ideal in R containing T. We use V(T) in place of V(⟨T⟩). In computational algebra, we often replace T by a Gröbner basis of ⟨T⟩. This allows us to test ideal membership and to determine geometric properties of the variety V(T) [CLO].

Definition 2.6. A subset X ⊆ Cn is a variety if X = V (T ) for some T ⊆ R.

A variety X ⊆ C^n is irreducible if we cannot write X = X1 ∪ X2, where X1, X2 ⊊ X are strictly smaller varieties. An ideal I ⊆ R is prime if fg ∈ I implies f ∈ I or g ∈ I.

Proposition 2.3. The variety X is irreducible if and only if I(X) is prime.

An ideal is radical if it is an intersection of prime ideals.

Proposition 2.4. Every variety X can be written uniquely as X = X1 ∪ X2 ∪ ··· ∪ Xm, where X1, X2, ..., Xm are irreducible and none of these m components contains any other. Moreover,

I(X) = I(X1) ∩ I(X2) ∩ · · · ∩ I(Xm)

is the unique decomposition of radical ideal I(X) as an intersection of prime ideals.

A minimal prime of an ideal I is a prime ideal J such that V (J) is an irreduciblecomponent of V (I).

Definition 2.7. An ideal I in K[x1, ..., xn] is primary if fg ∈ I implies either f ∈ I or g^m ∈ I for some m > 0.

Lemma 2.5. If an ideal I is primary, then √I is prime, and it is the smallest prime ideal containing I.

All ideals I in R can be written as intersections of primary ideals; that is, there is a decomposition

I = Q1 ∩ Q2 ∩ ··· ∩ Qs

where each Qi is primary. The radical P = √Q of a primary ideal Q is prime, and Q is called P-primary. Primary ideals are more general than prime ideals, but they still define irreducible varieties, and geometrically primary ideals contain the same information as do their prime counterparts.

Definition 2.8. Let I ⊆ K[x1, ..., xn] be an ideal, and f ∈ K[x1, ..., xn]. Then the saturation of I with respect to f is the ideal

(I : f^∞) = ⟨ g ∈ K[x1, ..., xn] : g f^m ∈ I for some m > 0 ⟩.

Saturating an ideal I by a polynomial f geometrically means that we obtain a new ideal J = (I : f^∞) whose variety V(J) contains all components of V(I) except for the ones on which f vanishes. For more on these concepts see [CLO], from which this section is derived.


3 EM Fixed Point Ideal

3.1 Extension of EM Algorithm to Tensors

Maximum likelihood estimation and the EM algorithm for matrices extend naturally to data given in the form of a tensor, which is just a table of dimension higher than 2. Here we restate the MLE problem and the EM algorithm for 2×2×2-tensors of nonnegative rank ≤ 2 and describe the ideal of EM fixed points.

We begin by updating the parameter polytope Θ to (∆ × ∆ × ∆)² × ∆. A point in Θ is of the form (A, B, C, Λ), where A, B, C ∈ R^{2×2}_{≥0} are nonnegative and row stochastic, and Λ ∈ R^{2×2}_{≥0} is a nonnegative diagonal 2×2-matrix. The model M is the image of the quadrilinear map

Θ → ∆7,   (A, B, C, Λ) ↦ P.   (3.1)

We update the function ℓU to reflect the tensor U. Now we seek to maximize

(u+++ choose u) · Π_{i,j,k=1}^{2} pijk^{uijk}

where the uijk are the data and the unknowns P = [pijk] form a nonnegative 2×2×2-tensor of nonnegative rank 2 with p+++ = 1. Since we do not allow pijk = 0, such P form a strict subset of the probability simplex ∆7. Again, this is equivalent to maximizing the log-likelihood function

ℓU = Σ_{i,j,k} uijk · log(pijk) = Σ_{i,j,k} uijk · log( Σ_{ℓ=1}^{r} λℓ aℓi bℓj cℓk ).   (3.2)

In Algorithm 2 we update the EM algorithm for matrices to reflect the new format of the data.


Algorithm 2 Function EM(U, r), with i, j, k ∈ {1, 2}, r = 2, U = [uijk] ∈ R^{2×2×2}

Select random nonnegative stochastic matrices A, B, C in R^{2×2}_{+} and a nonnegative diagonal 2×2-matrix Λ. Define two nonnegative rank 1, 2×2×2-tensors [λ1 a1i b1j c1k] and [λ2 a2i b2j c2k].
Run the following steps until the entries of the 2×2×2-tensor P converge.
E-Step: Estimate the 2×2×2×2-table that represents the expected hidden data:
    Set vℓijk := (λℓ aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk) · uijk   for i, j, k, ℓ = 1, 2.
M-Step: Maximize the likelihood function of the model for the hidden data:
    Set λℓ := Σ_{i,j,k=1}^{2} vℓijk / u+++   for ℓ = 1, 2.
    Set aℓi := Σ_{j,k=1}^{2} vℓijk / (u+++ λℓ)   for ℓ, i = 1, 2.
    Set bℓj := Σ_{i,k=1}^{2} vℓijk / (u+++ λℓ)   for ℓ, j = 1, 2.
    Set cℓk := Σ_{i,j=1}^{2} vℓijk / (u+++ λℓ)   for ℓ, k = 1, 2.
Update the estimate of the joint distribution for our mixture model:
    Set pijk := Σ_{ℓ=1}^{2} λℓ aℓi bℓj cℓk   for i, j, k = 1, 2.
Return P.
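A minimal Julia version of Algorithm 2 (again a sketch for illustration, with a fixed iteration count in place of the convergence test; this is not the code used in our experiments) reads as follows:

function em_tensor(U::Array{<:Real,3}; r::Int = 2, iters::Int = 1000)
    rowstoch() = mapslices(w -> w ./ sum(w), rand(r, 2); dims = 2)
    A, B, C = rowstoch(), rowstoch(), rowstoch()
    λ = (w = rand(r); w ./ sum(w))
    uppp = sum(U)
    mix(A, B, C, λ) = [sum(λ[l] * A[l,i] * B[l,j] * C[l,k] for l in 1:r)
                       for i in 1:2, j in 1:2, k in 1:2]
    P = mix(A, B, C, λ)
    for _ in 1:iters
        # E-step: expected hidden data v[l,i,j,k]
        v = [λ[l] * A[l,i] * B[l,j] * C[l,k] / P[i,j,k] * U[i,j,k]
             for l in 1:r, i in 1:2, j in 1:2, k in 1:2]
        # M-step
        λ = [sum(v[l, :, :, :]) for l in 1:r] ./ uppp
        A = [sum(v[l, i, :, :]) / (uppp * λ[l]) for l in 1:r, i in 1:2]
        B = [sum(v[l, :, j, :]) / (uppp * λ[l]) for l in 1:r, j in 1:2]
        C = [sum(v[l, :, :, k]) / (uppp * λ[l]) for l in 1:r, k in 1:2]
        P = mix(A, B, C, λ)
    end
    return P
end

P = em_tensor(rand(1:20, 2, 2, 2))   # one EM run from a random start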

3.2 EM Fixed Point Ideal for Tensors

As in Section 2.4, if we could compute all EM fixed points, then this would reveal the global maximizer of ℓU. Since a point is EM fixed if and only if it stays fixed after an E-step and an M-step, we can write rational function equations for the EM fixed points in Θ:

λℓ = (1 / u+++) Σ_{i,j,k=1}^{2} ( λℓ aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   for ℓ = 1, 2,

aℓi = (1 / (u+++ λℓ)) Σ_{j,k=1}^{2} ( λℓ aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   for i, ℓ = 1, 2,

bℓj = (1 / (u+++ λℓ)) Σ_{i,k=1}^{2} ( λℓ aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   for j, ℓ = 1, 2,

cℓk = (1 / (u+++ λℓ)) Σ_{i,j=1}^{2} ( λℓ aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   for k, ℓ = 1, 2.

Our goal is to understand the solutions to these equations for a fixed tensor U. We seek to find the variety they define in the polytope Θ and the image of that variety in M. In the EM algorithm we usually start with aℓi, bℓj, cℓk, λℓ that are strictly positive. The aℓi, bℓj, cℓk may become zero in the limit, but the parameters λℓ always remain positive when the uijk are positive, since the rows of A, B, C sum to 1. This justifies cancelling the factors λℓ in our equations. After this, the first equation is implied by the other three. Therefore,


the set of all EM fixed points is a variety, and it is characterized by

aℓi = (1 / u+++) Σ_{j,k=1}^{2} ( aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   for i, ℓ = 1, 2,

bℓj = (1 / u+++) Σ_{i,k=1}^{2} ( aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   for j, ℓ = 1, 2,

cℓk = (1 / u+++) Σ_{i,j=1}^{2} ( aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   for k, ℓ = 1, 2.

These equations can be simplified further; for example,

aℓi = (1 / u+++) Σ_{j,k=1}^{2} ( aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   ⟹

aℓi u+++ = Σ_{j,k=1}^{2} ( aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   ⟹

aℓi u+++ Σ_{j,k=1}^{2} bℓj cℓk = Σ_{j,k=1}^{2} ( aℓi bℓj cℓk / Σ_{s=1}^{2} λs asi bsj csk ) · uijk   ⟹

aℓi ( Σ_{j,k=1}^{2} ( u+++ − uijk / Σ_{s=1}^{2} λs asi bsj csk ) · bℓj cℓk ) = 0   ⟹

aℓi ( Σ_{j,k=1}^{2} ( u+++ − uijk/pijk ) · bℓj cℓk ) = 0,   for ℓ, i = 1, 2.

Note that Σ_{j,k=1}^{2} bℓj cℓk = 1, and the last line, in part, follows from the identity

pijk = Σ_{ℓ=1}^{2} λℓ aℓi bℓj cℓk.

We can do this simplification symmetrically for bℓj and cℓk, yielding

aℓi ( Σ_{j,k=1}^{2} ( u+++ − uijk/pijk ) · bℓj cℓk ) = 0   for i, ℓ = 1, 2,

bℓj ( Σ_{i,k=1}^{2} ( u+++ − uijk/pijk ) · aℓi cℓk ) = 0   for j, ℓ = 1, 2,

cℓk ( Σ_{i,j=1}^{2} ( u+++ − uijk/pijk ) · aℓi bℓj ) = 0   for k, ℓ = 1, 2.

Therefore, the set of EM fixed points is a variety characterized by the above equations. We can simplify further, denoting by R the tensor with entries

rijk = u+++ − uijk/pijk,

and the fixed point equations become

aℓi ( Σ_{j,k=1}^{2} rijk bℓj cℓk ) = 0   for all ℓ, i = 1, 2,

bℓj ( Σ_{i,k=1}^{2} rijk aℓi cℓk ) = 0   for all ℓ, j = 1, 2,

cℓk ( Σ_{i,j=1}^{2} rijk aℓi bℓj ) = 0   for all ℓ, k = 1, 2.

This derivation yields the following theorem.

Theorem 3.1. The variety of EM fixed points for 2×2×2-tensors of rank+ ≤ 2 in the polytope Θ is defined by the equations

aℓi ( Σ_{j,k=1}^{2} rijk bℓj cℓk ) = 0   for all ℓ, i = 1, 2,

bℓj ( Σ_{i,k=1}^{2} rijk aℓi cℓk ) = 0   for all ℓ, j = 1, 2,

cℓk ( Σ_{i,j=1}^{2} rijk aℓi bℓj ) = 0   for all ℓ, k = 1, 2,

where [rijk] = [ u+++ − uijk/pijk ].

The variety defined in Theorem 3.1 is reducible.

Definition 3.1. Let F be the ideal of EM fixed points, as in Theorem 3.1. A minimal prime of F is called relevant if it contains none of the 8 polynomials pijk = Σ_{ℓ=1}^{2} aℓi bℓj cℓk.

Theorem 3.2. The ideal F of EM fixed points for 2×2×2-tensors of nonnegative rank ≤ 2 has precisely 52 minimal primes consisting of 9 orbital classes. Moreover, the ideal is radical; hence, it equals the intersection of its minimal primes.

Proof. While the ideal F is not a binomial ideal, we follow [KRS+] in using an approach based on the primary decomposition of binomial ideals given in [ES, §6]. Let F be the EM fixed point ideal

⟨ aℓi ( Σ_{j,k=1}^{2} rijk bℓj cℓk ),  bℓj ( Σ_{i,k=1}^{2} rijk aℓi cℓk ),  cℓk ( Σ_{i,j=1}^{2} rijk aℓi bℓj ) : i, j, k, ℓ = 1, 2 ⟩.

Any prime ideal containing F contains either aℓi or Σ_{j,k=1}^{2} rijk bℓj cℓk for ℓ, i ∈ {1, 2}, either bℓj or Σ_{i,k=1}^{2} rijk aℓi cℓk for ℓ, j ∈ {1, 2}, and either cℓk or Σ_{i,j=1}^{2} rijk aℓi bℓj for ℓ, k ∈ {1, 2}. We categorize all primes containing F according to the set S of unknowns aℓi, bℓj, and cℓk. There are 2^12 subsets, and the symmetry group acts on this power set by permuting the rows of A, B, and C simultaneously, the columns of A, B, and C separately, and the matrices A, B, and C themselves. We pick one representative S from each orbit that is relevant; that is, we exclude those orbits for which pijk = Σ_{ℓ=1}^{2} aℓi bℓj cℓk = 0. These are exactly the orbits containing an element pijk lying in the ideal ⟨S⟩. For each relevant representative S, we compute the cellular component FS = ((F + ⟨S⟩) : (Π S^c)^∞), where S^c = {a11, ..., a22, b11, ..., b22, c11, ..., c22} \ S. Next we minimize our cellular decomposition by removing all representatives S such that FT ⊂ FS for some representative T in another orbit. This leads to a list of 6 orbits comprising 11 ideals. Up to symmetry, each prime is uniquely determined by its attributes in Table 3.1. These are its set S, its degree and codimension, the ranks rA = rank(A), rB = rank(B), and rC = rank(C) at a generic point, the number of ideals in the orbit of S, and the number of elements in the primary decomposition. In each case, primality of the ideal was verified using either the Macaulay2 isPrime function or the linear elimination sequence in [GSS, Proposition 23(b)], which we discuss in detail in §5.1.

Table 3.1
Minimal primes of the EM fixed point ideal F for 2×2×2-tensors of rank+ 2.

Class S            |S|  a's  b's  c's  deg  codim  rA  rB  rC  |orbit|  #primes
{∅}                 0    0    0    0    60    7     1   1   1     1        5
                    0    0    0    0    48    7     2   2   1
                    0    0    0    0     1    8     2   2   2
{a11}               1    1    0    0    31    8     2   2   1     6        3
                    1    1    0    0     5    8     2   2   2
{a11, b11}          2    1    1    0    11    8     2   2   2    12        1
{a11, a12}          2    2    0    0    23    6     1   2   2     6        1
{a11, a22}          2    2    0    0    25    8     2   2   2     3        1
{a11, b11, c11}     3    1    1    1    23    8     2   2   2     8        1

In Table 3.1, while the ideals given by {∅} determine the fixed points in the interior of M, the ideals given by {a11}, {a11, b11}, {a11, a22}, and {a11, b11, c11} determine the fixed points on the non-interior boundary strata of M, as seen in [AHRZ]. That is, the map (3.1) sends any parameters (A, B, C, Λ) ∈ Θ at which the defining equations of these ideals vanish to a boundary stratum of M. The ideal given by {a11, a12} is degenerate because it yields a probability distribution outside of M: if a11 = a12 = 0, then A is not a stochastic matrix, since the first row of A does not sum to 1. While the #primes entry for {∅} is 5, we show the attributes of only three; the two minimal primes not appearing in the list can be obtained as group actions of the ones in the list. We refer to the minimal primes appearing in the list as representatives of orbital classes. The total number of minimal primes can be read off the table by summing the products of the columns |orbit| and #primes. Here |orbit| is the number of sets in the orbit of the element in column 1; for example, the orbit of {a11, a22} consists of {a11, a22}, {b11, b22}, and {c11, c22}, so |orbit| = 3.

The most concise ideal is the minimal prime defined by setting a11 = 0 and a22 = 0, corresponding to one of two 5-dimensional substrata of M. It has defining equations

⟨a11, a22,
b11r212 + b12r222,  b11r211 + b12r221,  b21r112 + b22r122,  b21r111 + b22r121,
c11r211 + c12r212,  c11r221 + c12r222,  c21r121 + c22r122,  c21r111 + c22r112,
r112r121 − r111r122,  r212r221 − r211r222⟩.   (3.3)

Recall that

rijk = u+++ − uijk/pijk = u+++ − uijk / Σ_{ℓ=1}^{2} λℓ aℓi bℓj cℓk;   (3.4)

thus the EM fixed points corresponding to a data tensor U = [uijk] defined by this ideal are obtained by substituting (3.4), clearing denominators, and saturating. In this case, the tensor R consists of two rank 1 slices. We also see that rA, rB, and rC are 2, since the determinants of A, B, and C do not appear in the decomposition.

We extend the computations of Theorem 3.2 to the case of 2×2×2-tensors of nonnegative rank ≤ 3. The boundary stratification of the space of 2×2×2-tensors of nonnegative rank ≤ 3 is not known, but the parameters that yield its stratification reside within the decomposition given in Table 3.2. We update the parameter polytope Θ = (A, B, C, Λ) with

A = [ a11 a12 ]    B = [ b11 b12 ]    C = [ c11 c12 ]    Λ = diag(λ1, λ2, λ3),
    [ a21 a22 ]        [ b21 b22 ]        [ c21 c22 ]
    [ a31 a32 ],       [ b31 b32 ],       [ c31 c32 ],

where A, B, C are stochastic matrices and Σ_{ℓ=1}^{3} λℓ = 1 with λℓ ≥ 0. We extend the EM algorithm in the natural way with

pijk = Σ_{ℓ=1}^{3} λℓ aℓi bℓj cℓk.

Theorem 3.3. The ideal F of EM fixed points for 2×2×2-tensors of nonnegative rank ≤ 3 has precisely 277 minimal primes consisting of 41 orbital classes. Up to symmetry, each prime is uniquely determined by its attributes in Table 3.2. Moreover, the ideal is not radical. The ideal corresponding to {∅} contains embedded components.


Table 3.2
Minimal primes of the EM fixed point ideal F for 2×2×2-tensors of rank+ 3.

Set S                             |S|  a's  b's  c's  deg  codim  rA  rB  rC  |o|  #p's
{∅}                                0    0    0    0     1    8     2   2   2   1    9
                                   0    0    0    0    27   10     1   2   2
                                   0    0    0    0     5   12     1   1   1
                                   0    0    0    0   162    9     2   2   1
                                   0    0    0    0    27   10     1   1   2
                                   0    0    0    0    35   10     1   2   2
                                   0    0    0    0    38   10     2   2   2
{a11}                              1    1    0    0   105   10     2   1   2   6    6
                                   1    1    0    0     1    9     2   2   2
                                   1    1    0    0    10   10     2   2   2
                                   1    1    0    0    38   11     2   2   2
{a11, b11}                         2    1    1    0    39   10     2   2   2  12    2
                                   2    1    1    0     1   10     2   2   2
{a11, b11, c11}                    3    1    1    1    60   11     2   2   2   8    5
                                   3    1    1    1     1   11     2   2   2
                                   3    1    1    1    39   11     2   2   2
{a11, a12}                         2    2    0    0    60    9     1   2   2   3    5
                                   2    2    0    0     1   10     2   2   2
                                   2    2    0    0    48    9     2   2   2
                                   2    2    0    0    48    9     1   2   2
{a11, a21}                         2    2    0    0    54   11     2   1   2   6    3
                                   2    2    0    0     5    9     2   2   2
{a11, a22}                         2    2    0    0    68   11     2   2   1   3    5
                                   2    2    0    0     1   10     2   2   2
                                   2    2    0    0    10   11     2   2   2
                                   2    2    0    0    68   11     2   1   2
{a11, a12, a21}                    3    3    0    0    31   10     2   2   2   3    3
                                   3    3    0    0     5   10     2   2   2
{a11, a12, a21, a22}               4    4    0    0    23    8     1   2   2   3    1
{a11, a12, b21}                    3    2    1    0    31   10     1   2   2  12    3
                                   3    2    1    0     5   10     2   2   2
                                   3    2    1    0    31   10     2   2   2
{a11, a12, b21, c21}               4    2    1    1    11   10     2   2   2  12    1
{a11, a12, b21, b22}               4    2    2    0    23    8     2   2   2   6    1
{a11, a21, b11, b21}               4    2    2    0    11   10     2   2   2  12    1
{a11, a21, b11, b21, c11, c21}     6    2    2    2    23   11     2   2   2   8    1
{a11, a21, b11, b22, c11, c22}     6    2    2    2    20   12     2   2   2   6    1
{a11, a22, b11, b22}               4    2    2    0     8   11     2   2   2   3    1
{a11, a22, b11, b22, c11, c22}     6    2    2    2    23   12     2   2   2   1    1
{a11, a12, a21, b21}               4    3    1    0    11   10     2   2   2  12    1
{a11, a12, a21, b21, c21}          5    3    1    1    23   10     2   2   2  12    1


4 MLE Using Boundary Strata

4.1 Boundary Stratification of Binary Tensors

Following [ARSZ], Allman, Hosten, Rhodes, and Zwiernik completely characterize the boundary stratification of binary tensors of nonnegative rank two. For 2×2×2-tensors of nonnegative rank ≤ 2, they give specific formulas for the ML estimate on each stratum. For example, in the 2×2×2 case there are 15 ridges of dimension 5. One of these strata is obtained as the image of those parameters (A, B, C, Λ) where a11 = 0 and a22 = 0. The resulting tensor pijk is of the form

λ1 ( 0  0  a12b11c11  a12b11c12 )   +   λ2 ( a21b21c21  a21b21c22  0  0 )
   ( 0  0  a12b12c11  a12b12c12 )        ( a21b22c21  a21b22c22  0  0 ).

[AHRZ] give the following ML formula corresponding to this 5-dimensional ridge, completely in terms of the data U = [uijk]:

p̂ijk = (uij+ · ui+k) / (ui++ · u+++),   i, j, k = 1, 2.

For each of the strata there is such a formula in terms of the data U. If, among all the formulas for all of the strata, this P̂ produces the maximum value of the log-likelihood function ℓU, and if P̂ is supermodular, then it is the MLE for U, and the MLE for this data lies on a 5-dimensional ridge of the model. Table 4.1 completely characterizes the boundary stratification of M.
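The formula above is a one-line computation over the marginals of U. A small Julia sketch (for illustration; the function name ridge_estimate is ours, and this is not the thesis code):

function ridge_estimate(U::Array{<:Real,3})
    total = sum(U)   # u_{+++}
    # p̂ijk = u_{ij+} · u_{i+k} / (u_{i++} · u_{+++}) for the ridge a11 = a22 = 0
    [sum(U[i, j, :]) * sum(U[i, :, k]) / (sum(U[i, :, :]) * total)
     for i in 1:2, j in 1:2, k in 1:2]
end

Phat = ridge_estimate(rand(1:20, 2, 2, 2))
@assert sum(Phat) ≈ 1   # the candidate estimate is a probability tensor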

For the 2×2×2 case, we show that the [AHRZ] formula computations for the MLE are faster than the EM algorithm. We use the results in [ARSZ] and [AHRZ] to perform this experiment, along with several others, giving insight into the EM algorithm for 2×2×2-tensors, maximum likelihood estimation, and the model M.


Table 4.1
Boundary stratification of 2×2×2-tensors of nonnegative rank ≤ 2.

# of Strata   Dimension       Zeros of (A, B, C, Λ)
1             7 (Interior)    {∅}
6             6               a11 = 0
12            5               a1i = 0 and b1j = 0
3             5               a11 = a22 = 0
8             4               a1i = 0 and b1j = 0 and c1k = 0
1             3               λi = 0 (rank+ 1 tensors)

4.2 Experiments

In the following experiments we implement all EM and MLE computations in Julia, a high-level dynamic programming language for technical computing. Julia is relatively new; it was developed by researchers at MIT and first appeared in 2012. All graphical modeling is done in R. We present the following experiments as a series of questions and answers.

Experiment 4.1. In which boundary stratum of the model M is the MLE most likely to occur for a random nonnegative data tensor?
We uniformly and randomly generate 10,000 nonnegative 2×2×2-tensors. For each tensor we use the [AHRZ] formulas to count the number of times the MLE lands on each stratum of the 7-dimensional space of 2×2×2-tensors of rank+ ≤ 2. There are 31 formulas to check, and for each formula we must verify that p̂ijk is supermodular. Checking supermodularity requires verification of between 9 and 36 inequalities. In total, the computations for this experiment required 0.4 seconds. Table 4.2 shows the percentage distribution among the strata.

Table 4.2
Experiment 4.1: Stratification attraction of the model M.

7-dim   6-dim   5a-dim   5b-dim   4-dim   3-dim
5.35    36.2    30.16    20.77    7.52    0.0

In R, we produce a 3-dimensional picture modelling this behavior on the 7-dimensional model M. We generate Figures 4.1 and 4.2 by running the EM algorithm on 200,000


randomly generated tensors [uijk] from the Jukes-Cantor slice given by

[ u111 u112 ] = [ x y ]     and     [ u211 u212 ] = [ w z ]
[ u121 u122 ]   [ z w ]             [ u221 u222 ]   [ y x ].

A normalized [uijk] in the Jukes-Cantor slice is a point in the 3-dimensional simplex. 99.9% of the MLEs for these tensors occur evenly distributed among the interior and the three substrata of the 5b stratum of M, labelled 5b1, 5b2, and 5b3. To each of these strata we associate a color, and to each of the points in the 3-d simplex we assign the color of the associated boundary stratum of the MLE at that point. This yields a partitioning of the 3-dimensional simplex. The 5b1, 5b2, and 5b3 subsets form the same shape, rotated by 60°. Observe the linear boundary between the 5bi subsets and the polynomial boundary of the interior.
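For reference, a tiny Julia helper (ours, not the thesis code) that produces a tensor in this slice from a point (x, y, z, w):

function jukes_cantor(x, y, z, w)
    U = Array{Float64}(undef, 2, 2, 2)
    U[1, :, :] = [x y; z w]   # slice i = 1
    U[2, :, :] = [w z; y x]   # slice i = 2, with labels reversed
    return U
end

U = jukes_cantor(rand(4)...)   # one random tensor in the Jukes-Cantor slice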


Figure 4.1: The simplex is partitioned based on the MLE for the tensor in the Jukes-Cantor slice. The MLE of the pink points is in the interior of the model. The MLE of the light and dark turquoise points lies on the 5b1 stratum. The second row depicts the 5b1 subset from two different angles. The blank space between cells is where the shapes interlock. The second picture in the first row shows the interior and the 5b1 subsets interlocked.


Figure 4.2: The figure on the left shows the 5b1, 5b2, and 5b3 subsets locked together. The figure on the right shows all 4 subsets: the interior, 5b1, 5b2, and 5b3. The orange is the interior of the model.

Experiment 4.2. How much does the EM algorithm vary for one data tensor U = [uijk]?
We uniformly and randomly generate 100 nonnegative data tensors U, and for each U we run the EM algorithm from 1,000 different starting parameters. Recall that starting parameters are elements (A, B, C, Λ) of the polytope Θ. We determine the frequency with which the EM algorithm spreads, or finds MLEs, across different strata.

Table 4.3
Experiment 4.2

Spreads across:   1 stratum   2 strata   3 strata   4 strata   5 strata   6 strata
% of time:        23          32         34         11         0          0

Table 4.3 says that, given 1,000 different starting parameters, the EM algorithm will find local maxima on one particular stratum 23% of the time; 32% of the time the algorithm will spread across 2 strata, and so on. This table does not address the density of these spreads. To address this, for each run of 1,000 we count the number of times EM lands in one stratum more than 90% of the time. We found that this occurs in 80% of the samples. Informally, this means that the EM algorithm is reliable in the sense that it tends to be attracted to one stratum most of the time.

Experiment 4.3. How often does the EM algorithm produce the actual MLE given one starting parameter?
We uniformly and randomly generate 10,000 nonnegative data tensors U and run the EM algorithm on each U from one starting parameter, with a maximum of 10,000 steps. We compute the actual MLE using the [AHRZ] formulas and compare. The EM algorithm converges in 80% of the samples and produces the actual MLE 76% of the time.


Experiment 4.4. How many times must the EM algorithm be run to find the actual MLE?
We input 1,000 uniformly and randomly generated nonnegative tensors U and compute the MLE for each one using the [AHRZ] formulas. We then run the EM algorithm on each U until it returns the MLE. We tally the number of different starting parameters required to hit the MLE. The EM algorithm finds the MLE given 1 starting parameter 75.5% of the time. It requires fewer than 10 different starting parameters to find the MLE 96.5% of the time.

Experiment 4.5. Is computing MLEs using the formulas faster than using the EM algorithm?
This is a valid question because computing MLEs with the formulas is not trivial. There are 31 formulas to check. Each formula has a canonical representative, as in a11 = 0, a22 = 0. To obtain the ML estimates on the parameters in the orbit of the canonical representative (b11 = b22 = 0 and c11 = c22 = 0), we permute the data tensor, perform the computation as if it were the canonical representative, and then permute the ML estimate in reverse. For each formula, supermodularity must be verified. We generate 1,000 uniformly random nonnegative tensors U. For each U we compute the MLE using the [AHRZ] formulas and the EM algorithm with a maximum of 10,000 steps, to ensure convergence 80% of the time. If the EM algorithm does not find the MLE given one set of starting parameters, we discard the trial. It must be noted that our EM implementation for tensors is most likely not optimally coded, so these speeds are only estimates of the efficiency of the algorithm. We compute the mean, median, maximum, and minimum run times for each method. Table 4.4 shows that in these trials the EM algorithm was never faster than the [AHRZ] formulas. In fact, the slowest formula time of 0.000250751 seconds beats the fastest EM time of 0.000315767 seconds.

Table 4.4
Experiment 4.5: run times in seconds.

           Formula           EM Algorithm
Mean:      7.768014×10^−5    0.01623088
Median:    7.43555×10^−5     0.007662911
Max:       0.000250751       0.095157257
Min:       3.6039×10^−5      0.000315767


5 Implementation

Here we give an overview of the computational methods used throughout this project. The computational body of work can be split into two categories: the EM fixed point ideal decompositions, and the EM algorithm along with the [AHRZ] formulas. The former consists of cellular decomposition, determining the primality of the cellular components, and decomposing the non-primary components. All of this is implemented in Macaulay2. The latter consists of the EM algorithm itself, the coding of the [AHRZ] formulas, and all of the support functions required for gathering data from these objects. Initially, we attempted to implement this in R, but Julia proved to be more than 10 times faster. All of the coding for this section is done in Julia, besides the modeling of Figures 4.1 and 4.2.

5.1 Cellular Decomposition, Primality, and Primary Decomposition

Macaulay2 is our primary tool for computing cellular decompositions, determining primality, and decomposing ideals into primary components. The most important theorem for determining primality is [GSS, Proposition 23]. It is stated therein without proof; for a concise proof, see [LS, pg. 3].

Lemma 5.1 ([GSS, Proposition 23]). Let J ⊂ R[x1, ..., xn] be an ideal containing a polynomial f = gx1 + h, with g, h not involving x1 and g a non-zero divisor modulo J. Let J1 = J ∩ R[x2, ..., xn] be the elimination ideal. Then J is prime if and only if J1 is prime.

Algorithm 3 Pseudocode implementation of Lemma 5.1.

Input an ideal I ⊂ R[x1, ..., xn].
Create LIST: a list of all variables in the ring.
Compute K: a list of generators of a Gröbner basis of I.
for i in 1:length(LIST)
    Set f = LIST[i]
    for j in 1:length(K)
        Set g = d(K[j])/df        # note that the variables appear linearly
        if (I : g) == I           # implying g is a non-zero divisor
            then I = eliminate(LIST[i], I)
Return I.


Following elimination, Algorithm 3 yields a simpler ideal, as measured by degree and codimension. After multiple eliminations, as in all of our cases, verification of primality using the Macaulay2 isPrime command takes only seconds. It is wise to update and maintain a sequence of strings representing the elimination sequence for fast verification. Consider the ideal (3.3) obtained by setting a11 = a22 = 0. In the syntax of Macaulay2, our algorithm will output the sequence of strings in Figure 5.1.

Figure 5.1: Elimination sequence for verifying primality of the EM fixed point ideal of 2× 2× 2-tensors of rank+ ≤ 2 corresponding to a11 = a22 = 0.

K = first entries gens gb I;

g = diff(a_(1,1), K#1); I : ideal(g) == I

I = eliminate(a_(1,1), I);

K = first entries gens gb I;

g = diff(a_(2,2), K#0); I : ideal(g) == I

I = eliminate(a_(2,2), I);

K = first entries gens gb I;

g = diff(b_(1,1), K#2); I : ideal(g) == I

I = eliminate(b_(1,1), I);

K = first entries gens gb I;

g = diff(b_(2,1), K#5); I : ideal(g) == I

I = eliminate(b_(2,1), I);

isPrime(I)

The degree of (3.3) is reduced from 25 to 9, the codimension drops from 8 to 4, and primality is verified by isPrime in less than 1 second.

Our most important method for finding minimal primes is found in the discussion following Proposition 23 in [GSS]. This method is based on splitting the ideal I that we wish to decompose into two parts. Given an ideal I, if there is an element f of its Gröbner basis that factors as f = f1 f2, then

√I = √⟨I, f1⟩ ∩ √(⟨I, f2⟩ : f1^∞).   (5.1)

In our case the ideals are radical, so we drop the radical signs in (5.1). We keep a list of ideals whose intersection is the same as I. For each ideal we keep a list of the elements we have inverted so far (for example, f1 in ⟨I, f2⟩ : f1^∞) and saturate at each step with these elements. Eventually, the ideals either split into one or two prime parts, which we verify as in Lemma 5.1, or the splits result in ideals that are decomposable in under 5 minutes using Macaulay2's built-in functionality. This method worked invariably for our decompositions.
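As a small illustration (ours, not an example from [GSS]): take I = ⟨x(y − 1)⟩ in Q[x, y], whose generator factors as f = x · (y − 1). The splitting (5.1) gives √I = √⟨x(y−1), x⟩ ∩ √(⟨x(y−1), y−1⟩ : x^∞) = ⟨x⟩ ∩ ⟨y − 1⟩, exhibiting the two prime components of I directly.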

Algorithm 4 presents an overview of our method for computing cellular decompositions of the EM fixed point ideal for 2×2×2-tensors of nonnegative rank ≤ 2. The code associated with this algorithm comprises the main body of our work with the EM fixed point ideals. For its implementation for 4×4-matrices, from which our code follows, see https://math.berkeley.edu/~bernd/EM/findPrimeIdeals44.m2.


Algorithm 4 Pseudocode for computing the cellular decomposition for 2×2×2-tensors of nonnegative rank ≤ 2.

Create zero_sets: a list of all possible sets of indices corresponding to zeros in (A, B, C).
Remove all sets in zero_sets that render pijk = 0 for some i, j, k = 1, 2.
    # For example, {{(1, 1)}, {(2, 1)}, {}} forces p111 = 0.
Remove all sets in zero_sets that are extra representatives in an orbit of the symmetry group.
    # For example, permuting the rows of A, B, C does not change the tensor.
Create poly_list:
    { aℓi, Σ_{j,k=1}^{2} rijk bℓj cℓk, bℓj, Σ_{i,k=1}^{2} rijk aℓi cℓk, cℓk, Σ_{i,j=1}^{2} rijk aℓi bℓj : i, j, k, ℓ = 1, 2 }.
Use zero_sets and poly_list to create cellular_ideals: a list of cellular ideals as in Theorem 3.2.
Remove ideals from cellular_ideals that are not minimal.
Determine primality of the ideals as in §5.1.
Compute minimal primes or primary decompositions as in the discussion following [GSS, Proposition 23].

5.2 EM, MLE, and Boundary Strata Experiments

All EM, MLE, and boundary strata experiments were done in Julia. The main body of work consists of three parts:

1. Technical functions, i.e. random tensor generation, the log-likelihood function, etc.

2. EM algorithm related functions, including the algorithm itself and the functions needed to collect data from it.

3. The [AHRZ] formulas and the machinery required to compute them by way of tensor permutations. These account for roughly 500 lines of code.

The code is flexible, and all experiments are accomplished via small alterations of the main body of work. Experiments 4.1-4.5 are divided among six files; Experiment 4.1 is divided between two files to account for the modeling.


6 Conclusion

The fundamental objective of our research was to obtain Tables 3.1 and 3.2, which completely characterize the EM fixed point ideal for 2×2×2-tensors of nonnegative rank ≤ 2 and nonnegative rank ≤ 3. The complete stratification of the model M for rank+ 2 is captured in Table 3.1, and we conjecture that the complete characterization of the boundary for rank+ 3 is contained in Table 3.2. Chapter 3 is devoted to extending the EM algorithm from matrices to tensors; we define the EM fixed point ideal for 2×2×2-tensors of nonnegative rank ≤ 2 and present our main theorems, 3.1 and 3.2, as well as their associated tables. Chapter 4 focuses on the boundary stratification of the model M of 2×2×2-tensors of nonnegative rank ≤ 2, and we examine MLEs via the MLE formulas given by Allman, Hosten, Rhodes, and Zwiernik. This chapter shows the complete boundary stratification of M and includes various experimental results that yield insight into the EM algorithm for M. It also gives insight into the geometry of M, showing 3-dimensional scatterplots that depict the behavior of maximum likelihood estimation for our 7-dimensional model. Chapter 5 contains an overview of the computational methods used in this project, with a focus on verifying primality and computing primary decompositions. We provide a hyperlink where the reader will find all of the code in Macaulay2, Julia, and R.

The next step is to explore the MLEs via the EM fixed point ideal given in Theorem 3.1. That is, given generic data, we would like to find the solutions to the system of equations given by the defining equations of the EM fixed point ideal for 2×2×2-tensors of nonnegative rank ≤ 2. If this can be done, then we will have effectively solved the MLE problem in this case in three different ways.

It is worth noting that our methods hit a computational wall for the 2×2×3 rank+ ≤ 2 and the 2×2×2×2 rank+ ≤ 2 cases. In both instances, for certain sets, it becomes computationally infeasible to compute saturation ideals during the cellular decomposition step. Thus, in these cases, one must explore different approaches. However, even at this time, it may yet be possible to formulate and prove conjectures related to these EM fixed point ideals. For example, the EM fixed point ideal for binary tensors of rank+ 2 may always contain a primary ideal corresponding to the set a11 = a22 = 0. We have found that this is true in the 2×2×2 and 2×2×2×2 cases. While we are unable to arrive at the full cellular decomposition of the EM fixed point ideal for 2×2×2×2-tensors of rank+ ≤ 2, we are able to determine the cellular ideal corresponding to a11 = a22 = 0. We have determined that it is in fact prime and has forty minimal generating polynomials. We show five of the forty generators of this ideal in (6.1); to see how this is similar to the 2×2×2 case, compare with (3.3).


⟨a11, a22,
b11r2122r2211 − b11r2121r2212 − b11r2112r2221 − 2b12r2212r2221 + b11r2111r2222 + 2b12r2211r2222,
c11r2122r2211 − c11r2121r2212 + c11r2112r2221 + 2c12r2122r2221 − c11r2111r2222 − 2c12r2121r2222,
d11r2122r2211 + d11r2121r2212 − d11r2112r2221 + 2d12r2122r2212 − d11r2111r2222 − 2d12r2112r2222,
···⟩   (6.1)

We would like to show that an ideal such as this will exist in general for binary tensors of rank+ 2.

We would also like this research to lead to a proposition mirroring Proposition 5.1 in [KRS+], which describes the number of minimal primes in the EM fixed point ideal corresponding to the empty set (cf. Tables 3.1, 3.2) for general d1×d2×d3-tensors of nonnegative rank ≤ r.


Bibliography

[AHRZ] E. S. Allman, S. Hosten, J. A. Rhodes, and P. Zwiernik. Boundary stratification of binary tensors of nonnegative rank two. Manuscript, 2016.

[ARSZ] Elizabeth S. Allman, John A. Rhodes, Bernd Sturmfels, and Piotr Zwiernik. Tensors of nonnegative rank two. Linear Algebra and its Applications, 2013.

[CLO] D. A. Cox, J. Little, and D. O'Shea. Ideals, Varieties, and Algorithms: An Introduction to Computational Algebraic Geometry and Commutative Algebra. Springer-Verlag, New York, 2007.

[CR] Joel E. Cohen and Uriel G. Rothblum. Nonnegative ranks, decompositions, and factorizations of nonnegative matrices. Linear Algebra and its Applications, 190:149–168, 1993.

[ES] David Eisenbud and Bernd Sturmfels. Binomial ideals. Duke Mathematical Journal, 84(1):1–45, 1996.

[GSS] Luis David Garcia, Michael Stillman, and Bernd Sturmfels. Algebraic geometry of Bayesian networks. Journal of Symbolic Computation, 39(3):331–355, 2005.

[Kar] Lars Karlsson. Rubik's Cube. (Image.)

[KB] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[KRS+] Kaie Kubjas, Elina Robeva, and Bernd Sturmfels. Fixed points of the EM algorithm and nonnegative rank boundaries. The Annals of Statistics, 43(1):422–461, 2015.

[Lan] J. M. Landsberg. Tensors: Geometry and Applications, volume 128 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2012.

[LS] Colby Long and Seth Sullivant. Tying up loose strands: Defining equations of the strand symmetric model. Journal of Algebraic Statistics, 6(1), 2015.

[PRW] G. Pistone, E. Riccomagno, and H. P. Wynn. Algebraic Statistics: Computational Commutative Algebra in Statistics. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press, 2000.

[PS] L. Pachter and B. Sturmfels. Algebraic Statistics for Computational Biology. Cambridge University Press, New York, 2005.

[Vav] S. A. Vavasis. On the complexity of nonnegative matrix factorization. ArXiv e-prints, August 2007.

[Wal] Robert Walker. Construction of the hyper cube (tesseract). (Image.)

[Wik] Wikipedia. Maximum likelihood — Wikipedia, The Free Encyclopedia, 2016. [Online; accessed 21-March-2016].
