
Efficient Probabilistic Inference with Partial Ranking Queries

Jonathan Huang
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Ashish Kapoor
Microsoft Research
1 Microsoft Way
Redmond, WA 98052

Carlos Guestrin
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

Distributions over rankings are used to model data in various settings such as preference analysis and political elections. The factorial size of the space of rankings, however, typically forces one to make structural assumptions, such as smoothness, sparsity, or probabilistic independence, about these underlying distributions. We approach the modeling problem from the computational principle that one should make structural assumptions which allow for efficient calculation of typical probabilistic queries. For ranking models, typical queries predominantly take the form of partial ranking queries (e.g., given a user's top-k favorite movies, what are his preferences over the remaining movies?). In this paper, we argue that riffled independence factorizations proposed in recent literature [7, 8] are a natural structural assumption for ranking distributions, allowing for particularly efficient processing of partial ranking queries.

1 Introduction

Rankings arise in a number of machine learning application settings such as preference analysis for movies and books [14] and political election analysis [6, 8]. In these applications, two central problems typically arise: (1) representation, how to efficiently build and represent a flexible statistical model, and (2) reasoning, how to efficiently use these statistical models to draw probabilistic inferences from observations. Both problems are challenging because, as the number of items being ranked increases, the number of possible rankings increases factorially.

The key to efficient representations and reasoning is to identify exploitable problem structure, and to this end, a number of structural assumptions have been proposed by the scientific community. These assumptions have typically been designed to reduce the number of necessary parameters of a model and have ranged from smoothness [10], to sparsity [11], to exponential family parameterizations [14].

We believe that these problems should be approached with the view that the two central challenges are intertwined: model structure should be chosen so that the most typical inference queries can be answered most efficiently. So what are the most typical inference queries? In this paper, we assume that for ranking data, the most useful and typical inference queries take the form of partial rankings. For example, in election data, given a voter's top-k favorite candidates in an election, we are interested in inferring his preferences over the remaining n − k candidates.

Partial rankings are ubiquitous and come in myriad forms, from top-k votes to approval ballots to rating data; probability models over rankings that cannot efficiently handle partial ranking data therefore have limited applicability. Ranking datasets are in fact often predominantly composed of partial rankings rather than full rankings, and in addition are often heterogeneous, containing partial ranking data of mixed types.

In this paper, we contend that the structural assumption of riffled independence is particularly well suited to answering probabilistic queries about partial rankings. Riffled independence, which was introduced in recent literature by Huang et al. [7, 8], is a generalized notion of probabilistic independence for ranked data. Like graphical model factorizations based on conditional independence, riffled independence factorizations allow for flexible modeling of distributions over rankings which can be learned with low sample complexity. We show, in particular, that when riffled independence assumptions are made about a prior distribution, partial ranking observations decompose in a way that allows for efficient conditioning. The main contributions of our work are as follows:

• When items satisfy the riffled independence relationship, we show that conditioning on partial rankings can be done efficiently, with running time linear in the number of model parameters.

• We show that, in a sense (which we formalize), it is impossible to efficiently condition on observations that do not take the form of partial rankings.

• We propose the first algorithm that is capable of efficiently estimating the structure and parameters of riffle independent models from heterogeneous collections of partially ranked data.

• We show results on real voting and preference data evidencing the effectiveness of our methods.

2 Riffled independence for rankings

A ranking, σ, of items in an item set Ω is a one-to-one mapping between Ω and a rank set R = {1, . . . , n} and is denoted using vertical bar notation as σ−1(1)|σ−1(2)| . . . |σ−1(n). We say that σ ranks item i1 before (or over) item i2 if the rank of i1 is less than the rank of i2. For example, Ω might be {Artichoke, Broccoli, Cherry, Date}, and the ranking Artichoke|Broccoli|Cherry|Date encodes a preference of Artichoke over Broccoli, which is in turn preferred over Cherry, and so on. The collection of all possible rankings of item set Ω is denoted by SΩ (or just Sn when Ω is implicit).

Since there are n! rankings of n items, it is intractable to estimate or even explicitly represent arbitrary distributions on Sn without making structural assumptions about the underlying distribution. While there are many possible simplifying assumptions that one can make, we focus on a recent approach [7, 8] in which the ranks of items are assumed to satisfy an intuitive generalized notion of probabilistic independence known as riffled independence. In this paper, we argue that riffled independence assumptions are particularly effective in settings where one would like to make queries taking the form of partial rankings. In the remainder of this section, we review riffled independence.

The riffled independence assumption posits that rankings over the item set Ω are generated by independently generating rankings of smaller disjoint item subsets (say, A and B) which partition Ω, and piecing together a full ranking by interleaving (or riffle shuffling) these smaller rankings together. For example, to rank our item set of foods, one might first rank the vegetables and fruits separately, then interleave the two subset rankings to form a full ranking. To formally define riffled independence, we use the notions of relative rankings and interleavings.

Definition 1 (Relative ranking map). Given a ranking σ ∈ SΩ and any subset A ⊂ Ω, the relative ranking of items in A, φA(σ), is a ranking, π ∈ SA, such that π(i) < π(j) if and only if σ(i) < σ(j).

Definition 2 (Interleaving map). Given a ranking σ ∈ SΩ and a partition of Ω into disjoint sets A and B, the interleaving of A and B in σ (denoted τAB(σ)) is a (binary) mapping from the rank set R = {1, . . . , n} to {A, B} indicating whether each rank in σ is occupied by an item of A or of B. As with rankings, we denote an interleaving by its vertical bar notation: [τAB(σ)](1)|[τAB(σ)](2)| . . . |[τAB(σ)](n).

Figure 1: An example of a hierarchy over six food items. The root set {A, B, C, D, E, F} splits into the healthy foods {A, B, C, D} and the subset {E, F} = {Eclair, Fondue}; the healthy foods split further into the vegetables {A, B} = {Artichoke, Broccoli} and the fruits {C, D} = {Cherry, Date}.

Example 3. Consider a partitioning of an item set Ω into vegetables A = {Artichoke, Broccoli} and fruits B = {Cherry, Date}, and a ranking over these four items σ = Artichoke|Date|Broccoli|Cherry. In this case, the relative ranking of vegetables in σ is φA(σ) = Artichoke|Broccoli and the relative ranking of fruits in σ is φB(σ) = Date|Cherry. The interleaving of vegetables and fruits in σ is τAB(σ) = A|B|A|B.

Definition 4 (Riffled independence). Let h be a distribution over SΩ and consider a subset of items A ⊂ Ω and its complement B. The sets A and B are said to be riffle independent if h decomposes (or factors) as:

h(σ) = mAB(τAB(σ)) · fA(φA(σ)) · gB(φB(σ)),

for distributions mAB, fA, and gB, defined over interleavings and relative rankings of A and B, respectively. We refer to mAB as the interleaving distribution and to fA and gB as the relative ranking distributions.

Riffled independence has been found to approximately hold in a number of real datasets [9]. When such relationships can be identified in data, then instead of exhaustively representing all n! ranking probabilities, one can represent just the factors mAB, fA, and gB, which are distributions over much smaller sets.
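To make the maps and the factorization concrete, here is a minimal Python sketch (our own illustration, not code from the paper); the dictionary-based factor tables m, f, and g are hypothetical stand-ins for the interleaving and relative ranking distributions.

    # Sketch of the relative-ranking and interleaving maps (Definitions 1 and 2)
    # applied to Example 3.  A ranking is written in vertical-bar form as a tuple
    # of items from first place to last place.
    def relative_ranking(sigma, A):
        """phi_A(sigma): the items of A in the order they appear in sigma."""
        return tuple(item for item in sigma if item in A)

    def interleaving(sigma, A, B):
        """tau_AB(sigma): for each rank, record whether it is held by A or B."""
        return tuple('A' if item in A else 'B' for item in sigma)

    A = ('Artichoke', 'Broccoli')                       # vegetables
    B = ('Cherry', 'Date')                              # fruits
    sigma = ('Artichoke', 'Date', 'Broccoli', 'Cherry')

    print(relative_ranking(sigma, A))                   # ('Artichoke', 'Broccoli')
    print(relative_ranking(sigma, B))                   # ('Date', 'Cherry')
    print(interleaving(sigma, A, B))                    # ('A', 'B', 'A', 'B')

    # Riffled independence (Definition 4): with hypothetical factor tables m, f, g
    # (dicts mapping interleavings / relative rankings to probabilities), the
    # probability of sigma is evaluated as the product of the three factors.
    def h(sigma, m, f, g, A, B):
        return (m[interleaving(sigma, A, B)]
                * f[relative_ranking(sigma, A)]
                * g[relative_ranking(sigma, B)])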

Hierarchical riffle independent models. The relative ranking factors fA and gB are themselves distributions over rankings. To further reduce the parameter space, it is natural to consider hierarchical decompositions of item sets into nested collections of partitions (like hierarchical clustering). For example, Figure 1 shows a hierarchical decomposition where the vegetables are riffle independent of the fruits among the healthy foods, and these healthy foods are riffle independent of the subset {Eclair, Fondue}.

For simplicity, we restrict consideration to binary hierarchies, defined as tuples of the form H = (HA, HB), where HA and HB are either (1) null, in which case H is called a leaf, or (2) hierarchies over item sets A and B, respectively. In this second case, A and B are assumed to form a nontrivial partition of the item set.

Definition 5. We say that a distribution h factors riffle independently with respect to a hierarchy H = (HA, HB) if the item sets A and B are riffle independent with respect to h, and both fA and gB factor riffle independently with respect to the subhierarchies HA and HB, respectively.

Like Bayesian networks, these hierarchies represent families of distributions obeying a certain set of (riffled) independence constraints and can be parameterized locally. To draw from such a model, one generates full rankings recursively, starting by drawing rankings of the leaf sets, then working up the tree, sequentially interleaving rankings until reaching the root. The parameters of these hierarchical models are simply the interleaving and relative ranking distributions at the internal nodes and leaves of the hierarchy, respectively.
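As an illustration of this generative process, the following Python sketch (ours; the hierarchy and its local distributions are hypothetical toy tables) draws a full ranking by recursing to the leaves and interleaving upward.

    import random

    def sample(node):
        # A leaf is ('leaf', f) with f a dict over relative rankings; an internal
        # node is ('split', m, left, right) with m a dict over 'A'/'B' interleaving
        # patterns of the items under the left and right subhierarchies.
        if node[0] == 'leaf':
            rankings, probs = zip(*node[1].items())
            return random.choices(rankings, probs)[0]
        _, m, left, right = node
        pi_A, pi_B = sample(left), sample(right)        # recurse into the subtrees
        patterns, probs = zip(*m.items())
        tau = random.choices(patterns, probs)[0]        # e.g. ('A', 'B', 'A', 'B')
        a, b = iter(pi_A), iter(pi_B)
        return tuple(next(a) if s == 'A' else next(b) for s in tau)

    # Toy single-split hierarchy: vegetables riffle independent of fruits.
    veg = ('leaf', {('Artichoke', 'Broccoli'): 0.7, ('Broccoli', 'Artichoke'): 0.3})
    fruit = ('leaf', {('Cherry', 'Date'): 0.5, ('Date', 'Cherry'): 0.5})
    model = ('split', {('A', 'A', 'B', 'B'): 0.6, ('A', 'B', 'A', 'B'): 0.4}, veg, fruit)
    print(sample(model))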

By decomposing distributions over rankings into small pieces (as Bayesian networks have done for other distributions), these hierarchical models allow for better interpretability, efficient probabilistic representation, low sample complexity, efficient MAP optimization, and, as we show in this paper, efficient inference.

3 Decomposable observations

Given a prior distribution, h, over rankings and an observation O, Bayes rule tells us that the posterior distribution, h(σ|O), is proportional to L(O|σ) · h(σ), where L(O|σ) is the likelihood function. This operation of conditioning h on an observation O is typically computationally intractable, since it requires multiplying two n!-dimensional functions, unless one can exploit structural decompositions of the problem. In this section, we describe a decomposition for a certain class of likelihood functions over the space of rankings in which the observations are 'factored' into simpler parts. When an observation O is decomposable in this way, we show that one can efficiently condition a riffle independent prior distribution on O. For simplicity, in this paper we focus on subset observations, whose likelihood functions encode membership in some subset of rankings in Sn.

Definition 6 (Observations). A subset observation O is an observation whose likelihood is proportional to the indicator function of some subset of Sn.

As a running example, we will consider the class of first place observations throughout the paper. The first place observation O = "Artichoke is ranked first", for example, is associated with the collection of rankings placing the item Artichoke in first place (O = {σ : σ(Artichoke) = 1}). In this paper, we are interested in computing h(σ|σ ∈ O). In the first place scenario, we are given a voter's top choice and we would like to infer his preferences over the remaining candidates.
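For contrast with the efficient approach developed below, here is a brute-force conditioning sketch (our own, purely illustrative) that applies Bayes rule directly: multiply a prior over all n! rankings by the indicator likelihood and renormalize.

    from itertools import permutations

    items = ('Artichoke', 'Broccoli', 'Cherry', 'Date')
    prior = {sigma: 1.0 / 24 for sigma in permutations(items)}   # toy uniform prior

    def condition(prior, in_observation):
        # Zero out rankings outside the observation, then renormalize.
        post = {s: (p if in_observation(s) else 0.0) for s, p in prior.items()}
        Z = sum(post.values())
        return {s: p / Z for s, p in post.items()}

    # First place observation: Artichoke is ranked first (sigma[0] is first place).
    posterior = condition(prior, lambda sigma: sigma[0] == 'Artichoke')

This enumeration over Sn is exactly the n! cost that the decompositions of this section avoid.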

Given a partitioning of the item set Ω into two subsets A and B, it is sometimes possible to decompose (or factor) a subset observation involving items in Ω into smaller subset observations involving A, B, and the interleavings of A and B independently. Such decompositions can often be exploited for efficient inference.

Example 7.

• The first place observation O = "Artichoke is ranked first" can be decomposed into two independent observations: (1) an observation on the relative ranking of Vegetables (OA = "Artichoke is ranked first among Vegetables"), and (2) an observation on the interleaving of Vegetables and Fruits (OA,B = "First place is occupied by a Vegetable"). To condition on O, one updates the relative ranking distribution over Vegetables (A) by zeroing out rankings of vegetables which do not place Artichoke in first place, updates the interleaving distribution by zeroing out interleavings which do not place a Vegetable in first place, and then normalizes the resulting distributions.

• An example of a nondecomposable observation is the observation O = "Artichoke is in third place". To see that O does not decompose (with respect to Vegetables and Fruits), it is enough to notice that the interleaving of Vegetables and Fruits is not independent of the relative ranking of Vegetables. If, for example, an element σ ∈ O interleaves A (Vegetables) and B (Fruits) as τAB(σ) = A|B|A|B, then since σ(Artichoke) = 3, the relative ranking of Vegetables is constrained to be φA(σ) = Broccoli|Artichoke. Since the interleavings and relative rankings are not independent, we see that O cannot be decomposable (the sketch after this list checks this numerically).
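The following small check (our own sketch, using the same tuple representation as the earlier snippets) enumerates O = "Artichoke is in third place" and confirms that the interleaving and the relative ranking of Vegetables are dependent under the uniform distribution over O, so no product decomposition exists for this partition.

    from itertools import permutations
    from collections import Counter

    A = ('Artichoke', 'Broccoli'); B = ('Cherry', 'Date')
    O = [s for s in permutations(A + B) if s[2] == 'Artichoke']   # Artichoke third

    joint = Counter()
    for s in O:
        tau = tuple('A' if x in A else 'B' for x in s)            # interleaving
        phi_A = tuple(x for x in s if x in A)                     # relative ranking of A
        joint[(tau, phi_A)] += 1

    tau_marg = Counter(t for t, _ in joint.elements())
    phi_marg = Counter(p for _, p in joint.elements())
    n = len(O)
    independent = all(abs(joint[(t, p)] / n - (tau_marg[t] / n) * (phi_marg[p] / n)) < 1e-9
                      for t in tau_marg for p in phi_marg)
    print(independent)    # False: the observation does not decompose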

Formally, we use riffle independent factorizations to define decomposability with respect to a hierarchy H of the item set.

Definition 8 (Decomposability). Given a hierarchy H over the item set, a subset observation O decomposes with respect to H if its likelihood function L(O|σ) factors riffle independently with respect to H.

When subset observations and the prior decompose according to the same hierarchy, we can show (as in Example 7) that the posterior also decomposes.

Proposition 9. Let H be a hierarchy over the item set. Given a prior distribution h and an observation O which both decompose with respect to H, the posterior distribution h(σ|O) also factors riffle independently with respect to H.

In fact, we can show that when the prior and observation both decompose with respect to the same hierarchy, inference operations can always be performed in time linear in the number of model parameters.

4 Complete decomposability

The condition of Proposition 9, that the prior and observation must decompose with respect to exactly the same hierarchy, is a sufficient one for efficient inference, but it might at first glance seem so restrictive as to render the proposition useless in practice. To overcome this limitation of hierarchy-specific decomposability, we explore a special family of observations (which we call completely decomposable) for which the property of decomposability does not depend specifically on a particular hierarchy, implying in particular that for these observations, efficient inference is always possible (as long as efficient representation of the prior distribution is also possible).

To illustrate how an observation can decompose with respect to multiple hierarchies, consider again the first place observation O = "Artichoke is ranked first". We argued in Example 7 that O is a decomposable observation. Notice, however, that decomposability for this particular observation does not depend on how the items are partitioned by the hierarchy. Specifically, if instead of Vegetables and Fruits, the sets A = {Artichoke, Cherry} and B = {Broccoli, Date} are riffle independent, a similar decomposition of O would continue to hold, with O decomposing as (1) an observation on the relative ranking of items in A ("Artichoke is first among items in A"), and (2) an observation on the interleaving of A and B ("First place is occupied by some element of A").

To formally capture this notion that an observation can decompose with respect to arbitrary underlying hierarchies, we define complete decomposability:

Definition 10 (Complete decomposability). We say that an observation O is completely decomposable if it decomposes with respect to every possible hierarchy over the item set Ω.

The property of complete decomposability is a guarantee, for an observation O, that one can always exploit any available factorized structure of the prior distribution in order to efficiently condition on O.

Proposition 11. Given a prior h which factorizes with respect to a hierarchy H, and a completely decomposable observation O, the posterior h(σ|O) also decomposes with respect to H and can be computed in time linear in the number of model parameters of h.

Given the stringent conditions in Definition 10, it is not obvious that nontrivial completely decomposable observations even exist. Nonetheless, there do exist nontrivial examples (such as the first place observations), and in the next section, we exhibit a rich and general class of completely decomposable observations.

5 Complete decomposability of partial ranking observations

In this section we discuss the mathematical problem of identifying all completely decomposable observations. Our main contribution in this section is to show that completely decomposable observations correspond exactly to partial rankings of the item set.

Partial rankings. We begin our discussion by introducing partial rankings, which allow for items to be tied with respect to a ranking σ by 'dropping' verticals from the vertical bar representation of σ.

Definition 12 (Partial ranking observation). Let Ω1, Ω2, . . . , Ωk be an ordered collection of subsets which partition Ω (i.e., ∪iΩi = Ω and Ωi ∩ Ωj = ∅ if i ≠ j). The partial ranking observation¹ corresponding to this partition is the collection of rankings which rank items in Ωi before items in Ωj whenever i < j. We denote this partial ranking as Ω1|Ω2| . . . |Ωk and say that it has type γ = (|Ω1|, |Ω2|, . . . , |Ωk|).

Given the type γ and any full ranking π ∈ SΩ, there is only one partial ranking of type γ containing π; thus we will also equivalently denote the partial ranking Ω1|Ω2| . . . |Ωk as Sγπ, where π is any element of Ω1|Ω2| . . . |Ωk.

The space of partial rankings as defined above captures a rich and natural class of observations. In particular, partial rankings encompass a number of commonly occurring special cases, which have traditionally been modeled in isolation, but in our work (as well as recent works such as [14, 13]) can be used in a unified setting.

Example 13. Partial ranking observations include:

• (First place, or top-1, observations): First place observations correspond to partial rankings of type γ = (1, n − 1). The observation that Artichoke is ranked first can be written as Artichoke|{Broccoli, Cherry, Date}.

• (Top-k observations): Top-k observations are partial rankings with type γ = (1, . . . , 1, n − k). These generalize the first place observations by specifying the items mapping to the first k ranks, leaving all n − k remaining items implicitly ranked behind.

• (Desired/less desired dichotomy): Partial rankings of type γ = (k, n − k) correspond to a subset of k items being preferred or desired over the remaining subset of n − k items. For example, partial rankings of type (k, n − k) might arise in approval voting, in which voters mark the subset of approved candidates, implicitly indicating disapproval of the remaining n − k candidates. (The sketch after this list illustrates these observation types.)
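A quick sketch (ours; the function names are hypothetical) of how each of these observation types can be written as an ordered block partition Ω1|Ω2| . . . |Ωk, represented as a list of item sets:

    items = {'Artichoke', 'Broccoli', 'Cherry', 'Date'}

    def top_k(ordered_prefix, items):
        # Top-k observation: k singleton blocks followed by one block of the rest.
        rest = items - set(ordered_prefix)
        return [{x} for x in ordered_prefix] + [rest]

    def approval(approved, items):
        # Desired/less-desired dichotomy of type (k, n - k).
        return [set(approved), items - set(approved)]

    print(top_k(['Artichoke'], items))               # first place (top-1) observation
    print(top_k(['Artichoke', 'Date'], items))       # top-2 observation
    print(approval({'Artichoke', 'Cherry'}, items))  # approval-style observation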

Single layer decomposition. To show how partial ranking observations decompose, we will exhibit an explicit factorization with respect to a hierarchy H over items. For simplicity, we begin by considering the single layer case, in which the items are partitioned into two leaf sets A and B. Our factorization depends on the following notions of consistency of relative rankings and interleavings with a partial ranking.

¹As in [1], we note that the term partial ranking used here should not be confused with two other standard objects: (1) a partial order, namely, a reflexive, transitive, antisymmetric binary relation; and (2) a ranking of a subset of Ω. In search engines, for example, although only the top-k elements of Ω are returned, the remaining n − k are implicitly assumed to be ranked behind.

Definition 14 (Restriction consistency). Given a partial ranking Sγπ = Ω1|Ω2| . . . |Ωk and any subset A ⊂ Ω, we define the restriction of Sγπ to A as the partial ranking on items in A obtained by intersecting each Ωi with A. Hence the restriction of Sγπ to A is:

[Sγπ]A = Ω1 ∩A|Ω2 ∩A| . . . |Ωk ∩A.

Given a ranking, σA, of items in A, we say that σA is consistent with the partial ranking Sγπ if σA is a member of [Sγπ]A.

Definition 15 (Interleaving consistency). Given an interleaving τAB of two sets A, B which partition Ω, we say that τAB is consistent with a partial ranking Sγπ = Ω1| . . . |Ωk (with type γ) if the first γ1 entries of τAB contain the same number of A's and B's as Ω1, the second γ2 entries of τAB contain the same number of A's and B's as Ω2, and so on. Given a partial ranking Sγπ, we denote the collection of consistent interleavings as [Sγπ]AB.

For example, consider the partial ranking Sγπ = {Artichoke, Cherry}|{Broccoli, Date}, which places a single vegetable and a single fruit in the first two ranks, and a single vegetable and a single fruit in the last two ranks. Alternatively, Sγπ partially specifies an interleaving AB|AB. The full interleavings A|B|B|A and B|A|B|A are consistent with Sγπ (by dropping vertical lines), while A|A|B|B is not consistent (since it places two vegetables in the first two ranks).
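The two consistency notions can be checked mechanically; the sketch below (our illustration, with partial rankings represented as lists of item sets) reproduces the restriction and the consistency checks from this example.

    def restrict(blocks, A):
        # [S_gamma pi]_A: intersect each block with A, dropping empty blocks.
        return [b & A for b in blocks if b & A]

    def interleaving_consistent(tau, blocks, A):
        # tau is a tuple of 'A'/'B' labels; check the block-by-block counts.
        pos = 0
        for block in blocks:
            segment = tau[pos:pos + len(block)]
            if segment.count('A') != len(block & A):
                return False
            pos += len(block)
        return True

    A = {'Artichoke', 'Broccoli'}                                    # vegetables
    blocks = [{'Artichoke', 'Cherry'}, {'Broccoli', 'Date'}]         # S_gamma pi

    print(restrict(blocks, A))                                       # [{'Artichoke'}, {'Broccoli'}]
    print(interleaving_consistent(('A', 'B', 'B', 'A'), blocks, A))  # True
    print(interleaving_consistent(('B', 'A', 'B', 'A'), blocks, A))  # True
    print(interleaving_consistent(('A', 'A', 'B', 'B'), blocks, A))  # False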

Using these notions of consistency with a partial ranking, we show that partial ranking observations are decomposable with respect to any binary partitioning (i.e., single layer hierarchy) of the item set.

Proposition 16 (Single layer hierarchy). For any partial ranking observation Sγπ and any binary partitioning (A, B) of the item set, the indicator function of Sγπ, δSγπ, factors riffle independently as:

δSγπ(σ) = mAB(τAB(σ)) · fA(φA(σ)) · gB(φB(σ)), (5.1)

where the factors mAB, fA, and gB are the indicator functions for consistent interleavings and relative rankings, [Sγπ]AB, [Sγπ]A, and [Sγπ]B, respectively.

General hierarchical decomposition. The single layer decomposition of Proposition 16 can be turned into a recursive decomposition for partial ranking observations over arbitrary binary hierarchies, which establishes our main result. In particular, given a partial ranking Sγπ and a prior distribution which factorizes according to a hierarchy H, we first condition the topmost interleaving distribution by zeroing out all parameters corresponding to interleavings which are not consistent with Sγπ, and normalizing the distribution.

We then need to condition the subhierarchies HA and HB on the relative rankings of A and B which are consistent with Sγπ, respectively. Since these consistent sets, [Sγπ]A and [Sγπ]B, are partial rankings themselves, the same algorithm for conditioning on a partial ranking can be applied recursively to each of the subhierarchies HA and HB.

Theorem 17. Every partial ranking is completely decomposable.

See Algorithm 1 for details on our recursive conditioning algorithm. As a consequence of Theorem 17 and Proposition 11, conditioning on partial ranking observations can be performed in linear time with respect to the number of model parameters.

It is interesting to consider what completely decomposable observations exist beyond partial rankings. One of our main contributions is to show that there are no such observations.

Theorem 18 (Converse of Theorem 17). Every completely decomposable observation takes the form of a partial ranking.

Together, Theorems 17 and 18 form a significant insight into the nature of rankings, showing that the notions of partial rankings and complete decomposability exactly coincide. In fact, our result shows that it is even possible to define partial rankings via complete decomposability!

As a practical matter, our results show that there is no algorithm based on simple multiplicative updates to the parameters which can exactly condition on observations that do not take the form of partial rankings. If one is interested in conditioning on such observations, Theorem 18 suggests that a slower or approximate inference approach might be necessary.

6 Model estimation from partially ranked data

In this section we use the efficient inference algorithm proposed in Section 5 to estimate a riffle independent model from partially ranked data. Because estimating a model using partially ranked data is typically considered to be more difficult than estimating one using only full rankings, a common practice (see for example [8]) has been to simply ignore the partial rankings in a dataset. The ability of a method to incorporate all of the available data, however, can lead to significantly improved model accuracy as well as wider applicability of that method. In this section, we propose the first efficient method for estimating the structure and parameters of a hierarchical riffle independent model from heterogeneous datasets consisting of arbitrary partial ranking types. Central to our approach is the idea that, given someone's partial preferences, we can use the efficient algorithms developed in the previous section to infer his full preferences and consequently apply previously proposed algorithms which are designed to work with full rankings.

function prcondition(Prior h_prior, Hierarchy H, Observation Sγπ = Ω1|Ω2| . . . |Ωk)
    if isLeaf(H) then
        forall σ do
            h_post(σ) ← h_prior(σ) if σ ∈ Sγπ, and 0 otherwise;
        Normalize(h_post);
        return h_post;
    else
        forall τ do
            m_post(τ) ← m_prior(τ) if τ ∈ [Sγπ]AB, and 0 otherwise;
        Normalize(m_post);
        f_post ← prcondition(f_prior, HA, [Sγπ]A);
        g_post ← prcondition(g_prior, HB, [Sγπ]B);
        return (m_post, f_post, g_post);

Algorithm 1: Pseudocode for prcondition, an algorithm for recursively conditioning a hierarchical riffle independent prior distribution on partial ranking observations. See Definitions 14 and 15 for [Sγπ]A, [Sγπ]B, and [Sγπ]AB. The runtime of prcondition is linear in the number of model parameters.
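A runnable Python rendering of prcondition (our own sketch, not the authors' implementation), using the same hypothetical dictionary-based hierarchy representation as the earlier snippets: a leaf is ('leaf', f) with f a distribution over relative rankings, and an internal node is ('split', m, left, right) with m a distribution over 'A'/'B' interleaving patterns.

    def normalize(d):
        Z = sum(d.values())
        return {k: v / Z for k, v in d.items()}

    def items_of(node):
        # The item set covered by a (sub)hierarchy, read off its leaf rankings.
        if node[0] == 'leaf':
            return set(next(iter(node[1])))
        return items_of(node[2]) | items_of(node[3])

    def restrict(blocks, A):
        # [S_gamma pi]_A (Definition 14).
        return [b & A for b in blocks if b & A]

    def interleaving_consistent(tau, blocks, A):
        # Membership in [S_gamma pi]_AB (Definition 15).
        pos, ok = 0, True
        for block in blocks:
            ok = ok and tau[pos:pos + len(block)].count('A') == len(block & A)
            pos += len(block)
        return ok

    def ranking_consistent(sigma, blocks):
        # sigma lies in the partial ranking Omega_1|...|Omega_k.
        block_of = {item: i for i, b in enumerate(blocks) for item in b}
        order = [block_of[item] for item in sigma]
        return order == sorted(order)

    def prcondition(node, blocks):
        # Condition on the partial ranking given as an ordered list of item sets.
        if node[0] == 'leaf':
            post = {s: p for s, p in node[1].items() if ranking_consistent(s, blocks)}
            return ('leaf', normalize(post))
        _, m, left, right = node
        A, B = items_of(left), items_of(right)
        m_post = normalize({t: p for t, p in m.items()
                            if interleaving_consistent(t, blocks, A)})
        return ('split', m_post,
                prcondition(left, restrict(blocks, A)),
                prcondition(right, restrict(blocks, B)))

For example, conditioning the toy vegetables/fruits model from the Section 2 sketch on the first place observation [{'Artichoke'}, {'Broccoli', 'Cherry', 'Date'}] zeroes out the inconsistent interleavings and vegetable rankings and renormalizes, exactly as described in Example 7.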


Censoring interpretations of partial rankings. The model estimation problem for full rankings is stated as follows. Given i.i.d. training examples σ(1), . . . , σ(m) (consisting of full rankings) drawn from a hierarchical riffle independent distribution h, recover the structure and parameters of h.

In the partial ranking setting, we again assume i.i.d. draws, but that each training example σ(i) undergoes a censoring process producing a partial ranking consistent with σ(i). For example, censoring might only allow the ranking of the top-k items of σ(i) to be observed. While we allow for arbitrary types of partial rankings to arise via censoring, we make the common assumption that the partial ranking type resulting from censoring σ(i) does not depend on σ(i) itself.

Algorithm. We treat the model estimation from partial rankings problem as a missing data problem. As with many such problems, if we could determine the full ranking corresponding to each observation in the data, then we could apply algorithms which work in the completely observed data setting. Since full rankings are not given, we utilize an Expectation-Maximization (EM) approach in which we use inference to compute a posterior distribution over full rankings given the observed partial ranking. In our case, we then apply the algorithms from [8, 9], which were designed to estimate the hierarchical structure of a model and its parameters from a dataset of full rankings.

Given an initial model h, our EM-based approach alternates between the following two steps until convergence is achieved (a skeleton of the resulting loop is sketched after the two steps below).

• (E-step): For each partial ranking, Sγπ, in the training examples, we use inference to compute a posterior distribution over the full rankings σ that could have generated Sγπ via censoring, h(σ|O = Sγπ). Since the observations take the form of partial rankings, we use the efficient algorithms in Section 5 to perform the E-step.

• (M-step): In the M-step, one maximizes the expected log-likelihood of the training data with respect to the model. When the hierarchical structure of the model has been provided, or is known beforehand, our M-step can be performed using standard methods for optimizing parameters. When the structure is unknown, we use a structural EM approach, which is analogous to methods from the graphical models literature for structure learning from incomplete data [4, 5].

Unfortunately, the (riffled independence) structure learning algorithm of [8] is unable to directly use the posterior distributions computed in the E-step. Instead, observing that sampling from riffle independent models can be done efficiently and exactly (as opposed to, for example, MCMC methods), we simply sample full rankings from the posterior distributions computed in the E-step and pass these full rankings to the structure learning algorithm of [8]. The number of samples that are necessary, instead of scaling factorially, scales according to the number of samples required to detect riffled independence (which under mild assumptions is polynomial in n [8]).
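Putting the pieces together, a high-level EM skeleton might look as follows (a sketch under the toy representation used above; prcondition and sample refer to the earlier snippets, and learn_structure_and_parameters is a hypothetical stand-in for the structure and parameter estimation routines of [8, 9]).

    def em_from_partial_rankings(partial_rankings, initial_model,
                                 n_iters=10, samples_per_example=10):
        model = initial_model
        for _ in range(n_iters):
            # E-step: condition the current model on each observed partial ranking
            # (a list of item-set blocks), then sample full rankings from the posterior.
            completed = []
            for blocks in partial_rankings:
                posterior = prcondition(model, blocks)
                completed.extend(sample(posterior) for _ in range(samples_per_example))
            # M-step: re-estimate the hierarchy and its parameters from the completed
            # full rankings, as in the fully observed setting (hypothetical routine).
            model = learn_structure_and_parameters(completed)
        return model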

Related work. There are several recent works that model partial rankings. Busse et al. [2] learned finite mixtures of Mallows models from top-k data (also using an EM approach). Lebanon and Mao [14] developed a nonparametric model, based (also) on Mallows models, which can handle arbitrary types of partial rankings. In both settings, a central problem is to marginalize a Mallows model over all full rankings which are consistent with a particular partial ranking. To do so efficiently, both papers rely on the fact (first shown in [3]) that this marginalization step can be performed in closed form. This closed form equation of [3], however, can be seen as a very special case of our setting, since Mallows models can always be shown to factor riffle independently according to a chain structure.² Moreover, instead of resorting to the more complicated mathematics based on inversion combinatorics, our theory of complete decomposability offers a simple conceptual way to understand why Mallows models can be conditioned efficiently on partial ranking observations.

²A chain structure is a hierarchy in which only a single item is partitioned out at each level of the hierarchy.

7 Experiments

We demonstrate our algorithms on simulated data as well as real datasets taken from two different domains.

The Meath dataset [6] is taken from a 2002 Irish Parliament election with over 60,000 top-k rankings of 14 candidates. Figure 2(a) plots, for each k ∈ {1, . . . , 14}, the number of ballots in the Meath data of length k. In particular, note that the vast majority of ballots in the dataset consist of partial rather than full rankings. We can run inference on over 5,000 top-k examples from the Meath data in 10 seconds on a dual 3.0 GHz Pentium machine with an unoptimized Python implementation. Using 'brute force' inference, we estimate that the same job would require roughly one hundred years.

We extracted a second dataset from a database of search trails collected by [15], in which the browsing sessions of roughly 2,000 users were logged during 2008-2009. In many cases, users are unlikely to read articles about the same story twice, and so it is often possible to think of the order in which a user reads through a collection of articles as a top-k ranking over the articles concerning a particular story/topic. The ability to model visit orderings would allow us to make long term predictions about user browsing behavior, or even recommend 'curriculums' of articles for users. We ran our algorithms on roughly 300 visit orderings for the eight most popular posts from www.huffingtonpost.com concerning 'Sarah Palin', a popular subject during the 2008 U.S. presidential election. Since no user visited every article, there are no full rankings in the data, and thus the method of 'ignoring' partial rankings does not work.

Structure discovery with EM. In all experiments, we initialize distributions to be uniform and do not use random restarts. Our experiments have led to several observations about using EM for learning with partial rankings. First, we observe that typical runs converge to a fixed structure quickly, with no more than three EM iterations. Figure 2(b) shows the progress of EM on the Sarah Palin data, whose structure converges by the third iteration. As expected, the log-likelihood increases at each iteration, and we remark that the structure becomes more interpretable; for example, the leaf set {0, 2, 3} corresponds to the three posts about Palin's wardrobe before the election, while the posts from the leaf set {1, 4, 6} were related to verbal gaffes made by Palin during the campaign.

Secondly, the number of EM iterations required to reach convergence in log-likelihood depends on the types of partial rankings observed. We ran our algorithm on subsets of the Meath dataset, each time training on m = 2000 rankings, all with length larger than some fixed k. Figure 2(c) shows the number of iterations required for convergence as a function of k (with 20 bootstrap trials for each k). We observe the fastest convergence for datasets consisting of almost-full rankings and the slowest convergence for those consisting of almost-empty rankings, with almost 25 iterations necessary if one trains using rankings of all types.

The value of partial rankings. We now show that using partial rankings in addition to full rankings allows us to achieve better density estimates. We first learned models from synthetic data drawn from a hierarchy, training using 343 full rankings plus varying numbers of partial ranking examples (ranging between 0 and 64,000). We repeat each setting with 20 bootstrap trials, and for evaluation, we compute the log-likelihood of a test set with 5,000 examples. For speed, we learn a structure H only once and fix H to learn parameters for each trial. Figure 2(d), which plots the test log-likelihood as a function of the number of partial rankings made available to the training set, shows that we are indeed able to learn more accurate distributions as more and more data are made available.

Comparing to a nonparametric model. Comparing the performance of riffle independent models to other approaches was not previously possible, since [8] could not handle partial rankings. Using our methods, we compare riffle independent models with the state-of-the-art nonparametric Lebanon-Mao (LM08) estimator of [14] on the same data (setting their regularization parameter to C = 1, 2, 5, or 10 via a validation set). Figure 2(d) shows (naturally) that when the data are drawn synthetically from a riffle independent model, our EM method significantly outperforms the LM08 estimator.

For the Meath data, which is only approximately riffle independent, we trained on subsets of size 5,000 and 25,000 (testing on the remaining data). For each subset, we evaluated our EM algorithm for learning a riffle independent model against the LM08 estimator when (1) using only full ranking data, and (2) using all data. As before, both methods do better when partial rankings are made available.

For the smaller training set, the riffle independent model performs as well as or better than the LM08 estimator. For the larger training set of 25,000, we see that the nonparametric method starts to perform slightly better on average, the advantage of a nonparametric model being that it is guaranteed to be consistent, converging to the correct model given enough data. The advantage of riffle independent models, however, is that they are simple, interpretable, and can highlight global structures hidden within the data.

Figure 2: Experimental results. (a) Histogram of top-k ballot lengths in the Irish election data; over half of voters provide only their top-3 or top-4 choices. (b) Iterations of structure EM for the Sarah Palin data, with structural changes at each iteration highlighted in red (the log-likelihood improves from -818.66 after the first iteration to -769.24 after the second and -767.28 after the third); this figure is best viewed in color. (c) Number of EM iterations required for convergence if the training set only contains rankings of length longer than k. (d) Density estimation from synthetic data: test log-likelihood when learning from 343 full rankings and between 0 and 64,000 additional partial rankings, comparing EM with decomposable conditioning against [Lebanon & Mao, '08]. (e) Density estimation from small (5,000 examples) and large (25,000 examples) subsets of the Meath data; we compare our method against [14] training (1) on all available data and (2) on the subset of full rankings.

8 Conclusion

In probabilistic reasoning problems, it is often the case that certain data types suggest certain distribution representations. For example, sparse dependency structure in the data often suggests a Markov random field (or other graphical model) representation [4, 5]. For low-order permutation observations (depending on only a few items at a time), recent work ([10, 12]) has shown that a Fourier domain representation is appropriate. Our work shows, on the other hand, that when the observed data takes the form of partial rankings, hierarchical riffle independent models are a natural representation.

As with conjugate priors, we showed that a riffle independent model is guaranteed to retain its factorization structure after conditioning on a partial ranking (which can be performed in linear time). Most surprisingly, our work shows that observations which do not take the form of partial rankings are not amenable to simple multiplicative update based conditioning algorithms. Finally, we showed that it is possible to learn hierarchical riffle independent models from partially ranked data, significantly extending the applicability of previous work.

Acknowledgements

This work is supported in part by ONR under MURI N000140710747, and ARO under MURI W911NF0810242. C. Guestrin was funded in part by NSF Career IIS-064422. We thank E. Horvitz, R. White, and D. Liebling for discussions. Much of this research was conducted while J. Huang was visiting Microsoft Research Redmond.

References

[1] N. Ailon. Aggregation of partial rankings, p-ratings and top-m lists. In SODA, 2007.

[2] L. M. Busse, P. Orbanz, and J. M. Buhmann. Cluster analysis of heterogeneous rank data. In ICML, 2007.

[3] M. Fligner and J. Verducci. Distance-based ranking models. J. Royal Stat. Soc., Series B, 83, 1986.

[4] N. Friedman. Learning belief networks in the presence of missing values and hidden variables. In ICML, 1997.

[5] N. Friedman. The Bayesian structural EM algorithm. In UAI, 1998.

[6] C. Gormley and B. Murphy. A latent space model for rank data. In ICML, 2006.

[7] J. Huang and C. Guestrin. Riffled independence for ranked data. In NIPS, 2009.

[8] J. Huang and C. Guestrin. Learning hierarchical riffle independent groupings from rankings. In ICML, 2010.

[9] J. Huang and C. Guestrin. Uncovering the riffled independence structure of ranked data. http://arxiv.org/abs/1006.1328, 2011.

[10] J. Huang, C. Guestrin, and L. Guibas. Fourier theoretic probabilistic inference over permutations. JMLR, 10, 2009.

[11] S. Jagabathula and D. Shah. Inferring rankings under constrained sensing. In NIPS, 2008.

[12] R. Kondor. Group Theoretical Methods in Machine Learning. PhD thesis, Columbia University, 2008.

[13] G. Lebanon and J. Lafferty. Conditional models on the ranking poset. In NIPS, 2002.

[14] G. Lebanon and Y. Mao. Non-parametric modeling of partially ranked data. In NIPS, 2008.

[15] R. White and S. Drucker. Investigating behavioral variability in web search. In WWW, 2007.