TRANSCRIPT
Devavrat Shah
Laboratory for Information and Decision Systems
Department of EECS
Massachusetts Institute of Technology
http://web.mit.edu/devavrat/www
(list of relevant references in the last set of slides)
o Ideally
  o Graphical models
  o Belief propagation
  o Connections to Probability, Statistics, EE, CS, …
o In reality
  o A set of very exciting (to me, maybe to others) questions at the interface of all of the above and more
  o Seemingly unrelated to graphical models
  o However, they provide fertile ground to understand everything about graphical models (algorithms, analysis)
o Recommendations
  o What movie to watch
  o Which restaurant to eat at
  o …
o Precisely:
  o Suggest what you may like, given what others have liked
  o By finding others like you and what they have liked
o Ranking
  o Players and/or teams, based on outcomes of games
  o Papers at a competitive conference, using reviews
  o Graduate admissions, from feedback of professors
o Precisely:
  o Global ranking of objects from partial preferences
o Partial preferences are revealed in different forms
  o Sports: wins and losses
  o Social: starred ratings
  o Conferences: scores
o All can be viewed as pair-wise comparisons
  o IND beats AUS: IND > AUS
  o Clio ***** vs. No 9 Park ****: Clio > No 9 Park
  o Ranking paper 9/10 vs. other paper 5/10: Ranking > Other
o Revealed preferences lead to a bag of pair-wise comparisons
o Questions of interest
  o Recommendations: suggest what you may like, given what others have liked
  o Ranking: global ranking of objects, given outcomes of games/…
o This requires understanding (computing) a choice model
  o What people like/dislike, from pair-wise comparisons
o Rational view: axiom of revealed preferences [Samuelson '37]
  o There is one ordering over all objects, consistent across the population
  o Unlikely (lack of transitivity in people's preferences)
o Meaningful view: "discrete choice model"
  o A distribution over orderings of objects
  o Consistent with the population's revealed preferences
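As a toy illustration of such a model (a minimal sketch; the orderings and probabilities are made-up values echoing the example on the next slide):

```python
# Toy discrete choice model: a distribution over orderings of {A, B, C}.
model = {("C", "B", "A"): 0.75,
         ("A", "C", "B"): 0.25}

def pairwise(model, x, y):
    """P(x is preferred to y) induced by the distribution over orderings."""
    return sum(p for order, p in model.items()
               if order.index(x) < order.index(y))

print(pairwise(model, "C", "B"))  # 1.0: both orderings place C above B
print(pairwise(model, "A", "C"))  # 0.25
```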
Data → Choice Model → Decision
[Figure: a bag of pair-wise comparisons (A > B, C > B, C > A, …) mapped to a distribution over orderings, with probabilities 0.25 and 0.75 on two orderings such as C > B > A]
o Object tracking (cf. Huang, Guestrin, Guibas '08)
  o Noisy observations of locations
  o Feasible to maintain partial information only: Q = [Qij], the first-order information
[Figure: bipartite graph of objects {1, 2, 3} and locations {P1, P2, P3} with edge weights Q11 = P(object 1 → location P1), Q12, Q13, …]
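As a sketch of working with such first-order information (an illustration of the general idea only, not the algorithm of Huang-Guestrin-Guibas, which works with Fourier-domain marginals): after conditioning on a noisy observation, the marginals can be projected back to doubly stochastic with Sinkhorn iterations.

```python
import numpy as np

def observe(Q, obj, lik):
    """Update first-order marginals Q[i, j] = P(object i at location j)
    given a noisy observation of object `obj` with location-likelihood
    vector `lik`, then Sinkhorn-project back to doubly stochastic
    (objects and locations are in bijection)."""
    Q = Q.copy()
    Q[obj] *= lik                      # Bayes reweighting of that object's row
    for _ in range(100):               # alternate row/column normalization
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= Q.sum(axis=0, keepdims=True)
    return Q

Q = np.full((3, 3), 1.0 / 3.0)                 # initially uniform marginals
Q = observe(Q, 0, np.array([0.8, 0.1, 0.1]))   # object 1 likely at P1
```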
Data → Choice Model → Decision
[Figure: the pipeline applied to tracking, first-order marginals Q over objects and locations lead to a distribution over assignments, and then a decision]
o Recommendation
o Ranking
o Object tracking
o Policy making
o Business operations (assortment optimization)
o Display advertising
o Polling, …
o Canonical question
  o Decision using a choice model learnt from partial preference data
o Q. Given a weighted bipartite graph G = (V, E, Q)
  o Find the matching of objects to positions
  o that is "most likely"
o Answer: maximum weight matching
  o The weight of a matching equals the sum of the Q-entries of the edges participating in the matching
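A minimal sketch of computing such a maximum weight matching with an off-the-shelf solver (the Q below is a made-up example; the tutorial's own route to this computation is belief propagation, discussed later):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical first-order marginals Q[i, a] = P(object i -> position a).
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Maximum weight matching: maximize the sum of Q-entries over matchings.
objects, positions = linear_sum_assignment(Q, maximize=True)
print(list(zip(objects, positions)))  # e.g. [(0, 0), (1, 1), (2, 2)]
```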
[Example: a bag of pair-wise comparisons, A > B, C > B, C > A, C > A]
o Q1. Given a weighted comparison graph G = (V, E, A)
  o Find a ranking of, or scores associated with, the objects
o Q2. When possible (e.g. conferences, crowd-sourcing), choose G so as to
  o Minimize the number of comparisons required to find the ranking/scores
[Figure: comparison graph on nodes 1-6 with directed edge weights A12, A21, …; Aij = # times i defeats j]
o Random walk on the comparison graph G = (V, E, A)
  o d = max (undirected) vertex degree of G
  o For each edge (i, j): Pij = (1/d) × (Aji + 1)/(Aij + Aji + 2)
  o For each node i: Pii = 1 − Σ_{j≠i} Pij
o Let G be connected
  o Let s be the unique stationary distribution of the random walk P, i.e. s^T = s^T P
o Ranking: use s as scores of the objects
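A minimal sketch of this construction (Rank Centrality), with the stationary distribution obtained by power iteration; the matrix A of pair-wise win counts is assumed given, e.g. from the sampling model described later:

```python
import numpy as np

def rank_centrality(A, iters=1000):
    """Scores from pair-wise win counts A[i, j] = # times i defeats j.
    Builds the random walk P described above and returns its stationary
    distribution via power iteration."""
    n = A.shape[0]
    compared = (A + A.T) > 0            # edges of the comparison graph
    d = compared.sum(axis=1).max()      # max vertex degree
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and compared[i, j]:
                # probability of stepping i -> j grows with j's wins over i
                P[i, j] = (A[j, i] + 1) / (A[i, j] + A[j, i] + 2) / d
        P[i, i] = 1 - P[i].sum()        # lazy self-loop
    s = np.ones(n) / n
    for _ in range(iters):              # power iteration: s^T = s^T P
        s = s @ P
    return s
```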
o Choice of the graph G
  o Subject to constraints, choose G so that the spectral gap of the natural random walk on G is maximized
  o SDP [Boyd, Diaconis, Xiao '04]
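A minimal sketch of this spectral-gap maximization, in the fastest-mixing-Markov-chain form of [Boyd, Diaconis, Xiao '04]; cvxpy and the small edge set below are assumptions for illustration:

```python
import cvxpy as cp
import numpy as np

# Fastest-mixing symmetric random walk on a fixed edge set: maximize the
# spectral gap of P subject to P being supported on the graph's edges.
n = 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]  # example

P = cp.Variable((n, n), symmetric=True)
ones = np.ones(n)
off_graph = np.ones((n, n)) - np.eye(n)
for i, j in edges:
    off_graph[i, j] = off_graph[j, i] = 0

constraints = [P @ ones == ones,                # stochastic
               P >= 0,
               cp.multiply(P, off_graph) == 0]  # steps only along edges
# Second-largest eigenvalue modulus = spectral norm of P - (1/n) 11^T.
slem = cp.norm(P - np.outer(ones, ones) / n, 2)
prob = cp.Problem(cp.Minimize(slem), constraints)
prob.solve()
print("maximized spectral gap:", 1 - prob.value)
```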
o Maximum weight matching
  o How to compute it? Belief propagation
  o Why does it make sense? Max-likelihood estimation w.r.t. an "exponential family"
o Rank Centrality
  o How to compute it? Power iteration
  o Why does it make sense? Mode of the Bradley-Terry-Luce (or MNL) model
(all of the below explained on the class board)
o Computation: belief propagation
  o The algorithm
  o Why it works
o Model
  o Maximum-entropy (max-ent) distribution consistent with the data
  o Maximum likelihood in an exponential family
o Maximum weight matching
  o A "first-order" approximation of the mode of this distribution
o Exact computation of the max-ent distribution
  o Via dual gradient
  o Belief propagation/MCMC to the rescue
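A minimal sketch of the max-product updates for maximum weight bipartite matching, in the spirit of [Bayati, Shah, Sharma '08]; the fixed-iteration form below is a simplification and assumes the maximum weight matching is unique:

```python
import numpy as np

def max_product_matching(Q, iters=100):
    """Max-product (belief propagation) for max-weight bipartite matching.
    Q[i, a] = weight of assigning object i to position a."""
    n = Q.shape[0]
    m_op = np.zeros((n, n))  # m_op[i, a]: message object i -> position a
    m_po = np.zeros((n, n))  # m_po[a, i]: message position a -> object i
    for _ in range(iters):
        new_op, new_po = np.empty((n, n)), np.empty((n, n))
        for i in range(n):
            for a in range(n):
                # subtract the best competing incoming message
                new_op[i, a] = Q[i, a] - max(m_po[b, i] for b in range(n) if b != a)
                new_po[a, i] = Q[i, a] - max(m_op[j, a] for j in range(n) if j != i)
        m_op, m_po = new_op, new_po
    # each object picks the position with the largest incoming message
    return m_po.argmax(axis=0)

Q = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
print(max_product_matching(Q))  # matched position for each object
```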
o Choice model (distribution over permutations): Bradley-Terry-Luce (BTL), or MNL, model (cf. McFadden)
  o Each object i has an associated weight wi > 0
  o When objects i and j are compared: P(i > j) = wi/(wi + wj)
o Sampling model
  o Edges E of the graph G are selected
  o For each (i, j) ∈ E, sample k pair-wise comparisons
o Error(s) = (1/||w||) ( Σ_{i>j} (wi − wj)^2 I{(s(i) − s(j))(wi − wj) < 0} )^{1/2}
o G: Erdős-Rényi graph with edge probability d/n
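A minimal sketch of this sampling model and error metric (the sizes n, d, k and the weight range are made-up toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 32                  # toy sizes
w = rng.uniform(1.0, 3.0, size=n)      # hypothetical BTL weights

# Erdos-Renyi comparison graph with edge probability d/n;
# on each edge, k comparisons with P(i beats j) = w_i / (w_i + w_j).
A = np.zeros((n, n)))
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < d / n:
            wins = rng.binomial(k, w[i] / (w[i] + w[j]))
            A[i, j], A[j, i] = wins, k - wins

def error(s, w):
    """Weighted measure of pairs whose order the scores s get wrong."""
    i, j = np.triu_indices(len(w), k=1)
    wrong = (s[i] - s[j]) * (w[i] - w[j]) < 0
    return np.sqrt(np.sum((w[i] - w[j]) ** 2 * wrong)) / np.linalg.norm(w)
```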
[Plots: Error(s) vs. edge probability d/n and vs. number of comparisons k, on log-log axes, comparing Ratio Matrix, L1 ranking, Rank Centrality, and the ML estimate]
o Theorem 1.
  o Let R = max_{i,j} wi/wj.
  o Let G be an Erdős-Rényi graph. Under Rank Centrality, with d = Ω(log n),
    ||s − w|| / ||w|| ≤ C ( R^5 log n / (k d) )^{1/2}
  o That is, it is sufficient to have O(R^5 n log n) samples
  o Optimal dependence on n for the ER graph
  o Dependence on R?
o Theorem 1 (continued).
  o Information-theoretic lower bound: for any algorithm,
    ||s − w|| / ||w|| ≥ C′ (1/(k d))^{1/2}
o Theorem 2.
  o Let R = max_{i,j} wi/wj.
  o Let G be any connected graph:
    o L = D^{-1} E, the transition matrix of the natural random walk (E the adjacency matrix, D the degree matrix)
    o Δ = 1 − λmax(L)
    o κ = dmax/dmin
  o Under Rank Centrality, with k d = Ω(log n),
    ||s − w|| / ||w|| ≤ (C κ/Δ) ( R^5 log n / (k d) )^{1/2}
  o That is, the number of samples required is O(R^5 κ^2 n log n × Δ^{-2})
  o The graph structure plays a role through its Laplacian
o Theorem 2 implies that the number of samples required is O(R^5 κ^2 n log n × Δ^{-2})
o Choice of the graph G
  o Subject to constraints, choose G so that the spectral gap Δ is maximized
  o SDP [Boyd, Diaconis, Xiao '04]
o Bound on the spectral gap Δ: use of a comparison theorem [Diaconis, Saloff-Coste '94]++
o Bound on the deviation of the empirical transition matrix from its expectation: use of a (modified) concentration-of-measure inequality for matrices
o Finally, use these to further bound Error(s)
Washington Post: Allourideas
o Ground truth: the algorithm's result with complete data
o Error: average position discrepancy
[Plot: average position discrepancy vs. fraction of data used, comparing L1 ranking and Rank Centrality]
o Input: complete preferences (not comparisons)
  o Axiomatic impossibility [Arrow '51]
o Some algorithms
  o Kemeny optimal: minimize disagreements
    o Extended Condorcet criterion
    o NP-hard; 2-approx algorithm [Dwork et al. '01]
  o Borda count: average position as the score
    o Simple
    o Useful axiomatic properties [Young '74]
[Example: two complete orderings, 2 > 3 > 4 > 1 > 5 > 6 and 6 > 2 > 5 > 1 > 4 > 3]
o Algorithms with partial data [Ammar-Shah '11, '12]
  o Suppose pair-wise data is available for all pairs
  o The Kemeny distance depends on the pair-wise marginals only:
    argmin_σ Σ_{i≠j} Aij I(σ(i) < σ(j))
  o The data is consistent with a distribution on permutations
    o For example, the one obtained as the max-ent approximation
  o The Kemeny optimum of this distribution is the same as the above
    o NP-hard
    o A 2-approx for this distribution acts as a 2-approx for the above
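A minimal sketch of the Kemeny objective and its brute-force minimizer for small n; the convention below (pos[i] = position of object i, disagreements counted when a comparison is inverted) is one common choice:

```python
import itertools
import numpy as np

def kemeny_cost(pos, A):
    """Weighted disagreements: comparisons i > j that the ranking inverts."""
    n = len(pos)
    return sum(A[i, j] for i in range(n) for j in range(n)
               if pos[i] > pos[j])  # i beat j, yet i is ranked below j

def kemeny_optimal(A):
    """Brute force over all permutations; only feasible for small n."""
    n = A.shape[0]
    return min(itertools.permutations(range(n)),
               key=lambda pos: kemeny_cost(pos, A))
```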
o Borda count [Ammar-Shah '11, '12]
  o Score = average position
  o But comparisons do not carry position information
  o Given pair-wise marginals pij for all i ≠ j:
    o For any distribution consistent with the pair-wise marginals, the Borda count is given by c(i) ∝ Σ_j pij
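A minimal sketch of this pair-wise Borda count (P below is an assumed matrix of marginals with P[i, j] = pij = P(i > j)):

```python
import numpy as np

def borda_scores(P):
    """Borda count from pair-wise marginals: c(i) proportional to sum_j p_ij."""
    P = P.copy()
    np.fill_diagonal(P, 0.0)  # ignore self-comparisons
    return P.sum(axis=1)

# ranking: objects sorted by decreasing score
P = np.array([[0.0, 0.8, 0.6], [0.2, 0.0, 0.4], [0.4, 0.6, 0.0]])
print(np.argsort(-borda_scores(P)))  # [0, 2, 1]
```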
o Finding the winner, under the BTL choice model [Adler, Gemmell, Harchol-Balter, Karp, Kenyon '87]
  o O(log n)-iteration adaptive algorithm, O(n) total comparisons
o Noisy sorting and the Mallows model [Braverman, Mossel '09]
  o O(n log n) samples in total (complete ordering)
o Average position (Borda count) algorithm++
  o Polynomial(n)-time algorithm
o Choice models
  o A powerful framework for tackling a range of questions
  o Many of which are in their infancy (e.g. recommendations)
  o The challenge being computation + statistics
  o An excellent "playground" for resolving the challenges of graphical models
o Two examples in this tutorial
  o Object tracking: learning from first-order marginals
  o Ranking: using pair-wise comparisons
o Open direction:
  o Learning graphical models efficiently, computationally and statistically
o A concrete question:
  o Given pair-wise comparison data, when can we learn the choice model efficiently?
  o For example, if exact pair-wise comparison marginals are available, then one can learn a 'sparse' choice model up to o(log n) sparsity [Farias + Jagabathula + S '09, '12]
  o But what about the noisy setting?
  o Or max-ent learning?
o Part I:
  o A. Ammar, D. Shah, "Efficient rank aggregation from partial data," Proceedings of ACM Sigmetrics, 2012.
  o M. Bayati, D. Shah, M. Sharma, "Max-product for maximum weight matching: convergence, correctness and LP duality," IEEE Transactions on Information Theory, 2008.
  o S. Jagabathula, D. Shah, "Inferring rankings using constrained sensing," IEEE Transactions on Information Theory, 2011.
  o S. Agrawal, Z. Wang, Y. Ye, "Parimutuel betting on permutations," Internet and Network Economics, 2008.
  o J. Huang, C. Guestrin, L. Guibas, "Fourier theoretic probabilistic inference over permutations," Journal of Machine Learning Research, 2009.
o Part II:
  o S. Negahban, S. Oh, D. Shah, "Iterative ranking from pair-wise comparisons," Proceedings of NIPS, 2012.
  o V. Farias, S. Jagabathula, D. Shah, "A data-driven approach to modeling choice," Proceedings of NIPS, 2009. Also Management Science, 2012 (and arXiv).
  o V. Farias, S. Jagabathula, D. Shah, "Sparse choice models," available on arXiv, 2012.
  o A. Ammar, D. Shah, "Compare, don't score," Proceedings of Allerton, 2011.
o At large:
  o H. Varian, "Revealed preferences," Samuelsonian Economics and the Twenty-First Century, 2006.
  o D. McFadden, "Disaggregate Behavioral Travel Demand's RUM Side: A 30-year retrospective," available online: http://emlab.berkeley.edu/pub/wp/mcfadden0300.pdf
  o P. Diaconis, "Group representations in probability and statistics," Lecture Notes-Monograph Series, 1988.
  o M. Wainwright, M. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, 2008.