TRANSCRIPT
Devavrat Shah
Laboratory for Information and Decision Systems
Department of EECS
Massachusetts Institute of Technology
http://web.mit.edu/devavrat/www
(list of relevant references in the last set of slides)
o Ideally
  o Graphical models
  o Belief propagation
  o Connections to Probability, Statistics, EE, CS, …
o In reality
  o A set of very exciting (to me, maybe to others) questions at the interface of all of the above and more
  o Seemingly unrelated to graphical models
  o However, they provide fertile ground to understand everything about graphical models (algorithms, analysis)
o Recommendations
  o What movie to watch
  o Which restaurant to eat at
  o …
o Precisely:
  o Suggest what you may like, given what others have liked
  o By finding others like you and what they have liked
o Ranking
  o Players and/or teams, based on outcomes of games
  o Papers at a competitive conference, using reviews
  o Graduate admissions, from feedback of professors
o Precisely:
  o Global ranking of objects from partial preferences
o Partial preferences are revealed in different forms
  o Sports: wins and losses
  o Social: starred ratings
  o Conferences: scores
o All can be viewed as pair-wise comparisons
  o IND beats AUS: IND > AUS
  o Clio ***** vs. No 9 Park ****: Clio > No 9 Park
  o Ranking paper 9/10 vs. other paper 5/10: Ranking > Other
o Revealed preferences lead to a bag of pair-wise comparisons
o Questions of interest
  o Recommendations: suggest what you may like, given what others have liked
  o Ranking: global ranking of objects, given outcomes of games/…
o This requires understanding (computing) a choice model
  o What people like/dislike, from pair-wise comparisons
o Rational view: axiom of revealed preferences [Samuelson '37]
  o There is one ordering over all objects, consistent across the population
  o Unlikely (lack of transitivity in people's preferences)
o Meaningful view: "discrete choice model"
  o A distribution over orderings of objects
  o Consistent with the population's revealed preferences
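As a toy illustration of such a model (a minimal sketch; the orderings and probabilities are made-up values echoing the example on the next slide):

```python
# Toy discrete choice model: a distribution over orderings of {A, B, C}.
model = {("C", "B", "A"): 0.75,
         ("A", "C", "B"): 0.25}

def pairwise(model, x, y):
    """P(x is preferred to y) induced by the distribution over orderings."""
    return sum(p for order, p in model.items()
               if order.index(x) < order.index(y))

print(pairwise(model, "C", "B"))  # 1.0: both orderings place C above B
print(pairwise(model, "A", "C"))  # 0.25
```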
Data → Choice Model → Decision
[Figure: a bag of pair-wise comparisons (A > B, C > B, C > A, …) mapped to a distribution over orderings, with probabilities 0.25 and 0.75 on two orderings such as C > B > A]
o Object tracking (cf. Huang, Guestrin, Guibas '08)
  o Noisy observations of locations
  o Feasible to maintain partial information only: Q = [Qij], the first-order information
[Figure: bipartite graph of objects {1, 2, 3} and locations {P1, P2, P3} with edge weights Q11 = P(object 1 → location P1), Q12, Q13, …]
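As a sketch of working with such first-order information (an illustration of the general idea only, not the algorithm of Huang-Guestrin-Guibas, which works with Fourier-domain marginals): after conditioning on a noisy observation, the marginals can be projected back to doubly stochastic with Sinkhorn iterations.

```python
import numpy as np

def observe(Q, obj, lik):
    """Update first-order marginals Q[i, j] = P(object i at location j)
    given a noisy observation of object `obj` with location-likelihood
    vector `lik`, then Sinkhorn-project back to doubly stochastic
    (objects and locations are in bijection)."""
    Q = Q.copy()
    Q[obj] *= lik                      # Bayes reweighting of that object's row
    for _ in range(100):               # alternate row/column normalization
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= Q.sum(axis=0, keepdims=True)
    return Q

Q = np.full((3, 3), 1.0 / 3.0)                 # initially uniform marginals
Q = observe(Q, 0, np.array([0.8, 0.1, 0.1]))   # object 1 likely at P1
```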
Data → Choice Model → Decision
[Figure: the pipeline applied to tracking, first-order marginals Q over objects and locations lead to a distribution over assignments, and then a decision]
o Recommendation
o Ranking
o Object tracking
o Policy making
o Business operations (assortment optimization)
o Display advertising
o Polling, …
o Canonical question
  o Decision using a choice model learnt from partial preference data
o Q. Given a weighted bipartite graph G = (V, E, Q)
  o Find the matching of objects to positions
  o that is "most likely"
o Answer: maximum weight matching
  o The weight of a matching equals the sum of the Q-entries of the edges participating in the matching
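A minimal sketch of computing such a maximum weight matching with an off-the-shelf solver (the Q below is a made-up example; the tutorial's own route to this computation is belief propagation, discussed later):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical first-order marginals Q[i, a] = P(object i -> position a).
Q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Maximum weight matching: maximize the sum of Q-entries over matchings.
objects, positions = linear_sum_assignment(Q, maximize=True)
print(list(zip(objects, positions)))  # e.g. [(0, 0), (1, 1), (2, 2)]
```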
[Example: a bag of pair-wise comparisons, A > B, C > B, C > A, C > A]
o Q1. Given a weighted comparison graph G = (V, E, A)
  o Find a ranking of, or scores associated with, the objects
o Q2. When possible (e.g. conferences, crowd-sourcing), choose G so as to
  o Minimize the number of comparisons required to find the ranking/scores
[Figure: comparison graph on nodes 1-6 with directed edge weights A12, A21, …; Aij = # times i defeats j]
o Random walk on the comparison graph G = (V, E, A)
  o d = max (undirected) vertex degree of G
  o For each edge (i, j): Pij = (1/d) × (Aji + 1)/(Aij + Aji + 2)
  o For each node i: Pii = 1 − Σ_{j≠i} Pij
o Let G be connected
  o Let s be the unique stationary distribution of the random walk P, i.e. s^T = s^T P
o Ranking: use s as scores of the objects
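A minimal sketch of this construction (Rank Centrality), with the stationary distribution obtained by power iteration; the matrix A of pair-wise win counts is assumed given, e.g. from the sampling model described later:

```python
import numpy as np

def rank_centrality(A, iters=1000):
    """Scores from pair-wise win counts A[i, j] = # times i defeats j.
    Builds the random walk P described above and returns its stationary
    distribution via power iteration."""
    n = A.shape[0]
    compared = (A + A.T) > 0            # edges of the comparison graph
    d = compared.sum(axis=1).max()      # max vertex degree
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and compared[i, j]:
                # probability of stepping i -> j grows with j's wins over i
                P[i, j] = (A[j, i] + 1) / (A[i, j] + A[j, i] + 2) / d
        P[i, i] = 1 - P[i].sum()        # lazy self-loop
    s = np.ones(n) / n
    for _ in range(iters):              # power iteration: s^T = s^T P
        s = s @ P
    return s
```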
o Choice of the graph G
  o Subject to constraints, choose G so that the spectral gap of the natural random walk on G is maximized
  o SDP [Boyd, Diaconis, Xiao '04]
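A minimal sketch of this spectral-gap maximization, in the fastest-mixing-Markov-chain form of [Boyd, Diaconis, Xiao '04]; cvxpy and the small edge set below are assumptions for illustration:

```python
import cvxpy as cp
import numpy as np

# Fastest-mixing symmetric random walk on a fixed edge set: maximize the
# spectral gap of P subject to P being supported on the graph's edges.
n = 6
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]  # example

P = cp.Variable((n, n), symmetric=True)
ones = np.ones(n)
off_graph = np.ones((n, n)) - np.eye(n)
for i, j in edges:
    off_graph[i, j] = off_graph[j, i] = 0

constraints = [P @ ones == ones,                # stochastic
               P >= 0,
               cp.multiply(P, off_graph) == 0]  # steps only along edges
# Second-largest eigenvalue modulus = spectral norm of P - (1/n) 11^T.
slem = cp.norm(P - np.outer(ones, ones) / n, 2)
prob = cp.Problem(cp.Minimize(slem), constraints)
prob.solve()
print("maximized spectral gap:", 1 - prob.value)
```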
o Maximum weight matching
  o How to compute it? Belief propagation
  o Why does it make sense? Max-likelihood estimation w.r.t. an "exponential family"
o Rank Centrality
  o How to compute it? Power iteration
  o Why does it make sense? Mode of the Bradley-Terry-Luce (or MNL) model
(all of the below explained on the class board)
o Computation: belief propagation
  o The algorithm
  o Why it works
o Model
  o Maximum-entropy (max-ent) distribution consistent with the data
  o Maximum likelihood in an exponential family
o Maximum weight matching
  o A "first-order" approximation of the mode of this distribution
o Exact computation of the max-ent distribution
  o Via dual gradient
  o Belief propagation/MCMC to the rescue
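A minimal sketch of the max-product updates for maximum weight bipartite matching, in the spirit of [Bayati, Shah, Sharma '08]; the fixed-iteration form below is a simplification and assumes the maximum weight matching is unique:

```python
import numpy as np

def max_product_matching(Q, iters=100):
    """Max-product (belief propagation) for max-weight bipartite matching.
    Q[i, a] = weight of assigning object i to position a."""
    n = Q.shape[0]
    m_op = np.zeros((n, n))  # m_op[i, a]: message object i -> position a
    m_po = np.zeros((n, n))  # m_po[a, i]: message position a -> object i
    for _ in range(iters):
        new_op, new_po = np.empty((n, n)), np.empty((n, n))
        for i in range(n):
            for a in range(n):
                # subtract the best competing incoming message
                new_op[i, a] = Q[i, a] - max(m_po[b, i] for b in range(n) if b != a)
                new_po[a, i] = Q[i, a] - max(m_op[j, a] for j in range(n) if j != i)
        m_op, m_po = new_op, new_po
    # each object picks the position with the largest incoming message
    return m_po.argmax(axis=0)

Q = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]])
print(max_product_matching(Q))  # matched position for each object
```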
o Choice model (distribution over permutations): Bradley-Terry-Luce (BTL), or MNL, model (cf. McFadden)
  o Each object i has an associated weight wi > 0
  o When objects i and j are compared: P(i > j) = wi/(wi + wj)
o Sampling model
  o Edges E of the graph G are selected
  o For each (i, j) ∈ E, sample k pair-wise comparisons
o Error(s) = (1/||w||) ( Σ_{i>j} (wi − wj)^2 I{(s(i) − s(j))(wi − wj) < 0} )^{1/2}
o G: Erdős-Rényi graph with edge probability d/n
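A minimal sketch of this sampling model and error metric (the sizes n, d, k and the weight range are made-up toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 20, 32                  # toy sizes
w = rng.uniform(1.0, 3.0, size=n)      # hypothetical BTL weights

# Erdos-Renyi comparison graph with edge probability d/n;
# on each edge, k comparisons with P(i beats j) = w_i / (w_i + w_j).
A = np.zeros((n, n)))
for i in range(n):
    for j in range(i + 1, n):
        if rng.random() < d / n:
            wins = rng.binomial(k, w[i] / (w[i] + w[j]))
            A[i, j], A[j, i] = wins, k - wins

def error(s, w):
    """Weighted measure of pairs whose order the scores s get wrong."""
    i, j = np.triu_indices(len(w), k=1)
    wrong = (s[i] - s[j]) * (w[i] - w[j]) < 0
    return np.sqrt(np.sum((w[i] - w[j]) ** 2 * wrong)) / np.linalg.norm(w)
```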
[Plots: Error(s) vs. edge probability d/n and vs. number of comparisons k, on log-log axes, comparing Ratio Matrix, L1 ranking, Rank Centrality, and the ML estimate]
o Theorem 1.
  o Let R = max_{i,j} wi/wj.
  o Let G be an Erdős-Rényi graph. Under Rank Centrality, with d = Ω(log n),
    ||s − w|| / ||w|| ≤ C ( R^5 log n / (k d) )^{1/2}
  o That is, it is sufficient to have O(R^5 n log n) samples
  o Optimal dependence on n for the ER graph
  o Dependence on R?
o Theorem 1 (continued).
  o Information-theoretic lower bound: for any algorithm,
    ||s − w|| / ||w|| ≥ C′ (1/(k d))^{1/2}
o Theorem 2.
  o Let R = max_{i,j} wi/wj.
  o Let G be any connected graph:
    o L = D^{-1} E, the transition matrix of the natural random walk (E the adjacency matrix, D the degree matrix)
    o Δ = 1 − λmax(L)
    o κ = dmax/dmin
  o Under Rank Centrality, with k d = Ω(log n),
    ||s − w|| / ||w|| ≤ (C κ/Δ) ( R^5 log n / (k d) )^{1/2}
  o That is, the number of samples required is O(R^5 κ^2 n log n × Δ^{-2})
  o The graph structure plays a role through its Laplacian
o Theorem 2 implies that the number of samples required is O(R^5 κ^2 n log n × Δ^{-2})
o Choice of the graph G
  o Subject to constraints, choose G so that the spectral gap Δ is maximized
  o SDP [Boyd, Diaconis, Xiao '04]
o Bound on the spectral gap Δ: use of a comparison theorem [Diaconis, Saloff-Coste '94]++
o Bound on the deviation of the empirical transition matrix from its expectation: use of a (modified) concentration-of-measure inequality for matrices
o Finally, use these to further bound Error(s)
Washington Post: Allourideas
o Ground truth: the algorithm's result with complete data
o Error: average position discrepancy
[Plot: average position discrepancy vs. fraction of data used, comparing L1 ranking and Rank Centrality]
o Input: complete preferences (not comparisons)
  o Axiomatic impossibility [Arrow '51]
o Some algorithms
  o Kemeny optimal: minimize disagreements
    o Extended Condorcet criterion
    o NP-hard; 2-approx algorithm [Dwork et al. '01]
  o Borda count: average position as the score
    o Simple
    o Useful axiomatic properties [Young '74]
[Example: two complete orderings, 2 > 3 > 4 > 1 > 5 > 6 and 6 > 2 > 5 > 1 > 4 > 3]
o Algorithms with partial data [Ammar-Shah '11, '12]
  o Suppose pair-wise data is available for all pairs
  o The Kemeny distance depends on the pair-wise marginals only:
    argmin_σ Σ_{i≠j} Aij I(σ(i) < σ(j))
  o The data is consistent with a distribution on permutations
    o For example, the one obtained as the max-ent approximation
  o The Kemeny optimum of this distribution is the same as the above
    o NP-hard
    o A 2-approx for this distribution acts as a 2-approx for the above
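A minimal sketch of the Kemeny objective and its brute-force minimizer for small n; the convention below (pos[i] = position of object i, disagreements counted when a comparison is inverted) is one common choice:

```python
import itertools
import numpy as np

def kemeny_cost(pos, A):
    """Weighted disagreements: comparisons i > j that the ranking inverts."""
    n = len(pos)
    return sum(A[i, j] for i in range(n) for j in range(n)
               if pos[i] > pos[j])  # i beat j, yet i is ranked below j

def kemeny_optimal(A):
    """Brute force over all permutations; only feasible for small n."""
    n = A.shape[0]
    return min(itertools.permutations(range(n)),
               key=lambda pos: kemeny_cost(pos, A))
```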
o Borda count [Ammar-Shah '11, '12]
  o Score = average position
  o But comparisons do not carry position information
  o Given pair-wise marginals pij for all i ≠ j:
    o For any distribution consistent with the pair-wise marginals, the Borda count is given by c(i) ∝ Σ_j pij
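A minimal sketch of this pair-wise Borda count (P below is an assumed matrix of marginals with P[i, j] = pij = P(i > j)):

```python
import numpy as np

def borda_scores(P):
    """Borda count from pair-wise marginals: c(i) proportional to sum_j p_ij."""
    P = P.copy()
    np.fill_diagonal(P, 0.0)  # ignore self-comparisons
    return P.sum(axis=1)

# ranking: objects sorted by decreasing score
P = np.array([[0.0, 0.8, 0.6], [0.2, 0.0, 0.4], [0.4, 0.6, 0.0]])
print(np.argsort(-borda_scores(P)))  # [0, 2, 1]
```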
o Finding the winner, under the BTL choice model [Adler, Gemmell, Harchol-Balter, Karp, Kenyon '87]
  o O(log n)-iteration adaptive algorithm, O(n) total comparisons
o Noisy sorting and the Mallows model [Braverman, Mossel '09]
  o O(n log n) samples in total (complete ordering)
o Average position (Borda count) algorithm++
  o Polynomial(n)-time algorithm
o Choice models
  o A powerful framework for tackling a range of questions
  o Many of which are in their infancy (e.g. recommendations)
  o The challenge being computation + statistics
  o An excellent "playground" for resolving the challenges of graphical models
o Two examples in this tutorial
  o Object tracking: learning from first-order marginals
  o Ranking: using pair-wise comparisons
o Open direction:
  o Learning graphical models efficiently, computationally and statistically
o A concrete question:
  o Given pair-wise comparison data, when can we learn the choice model efficiently?
  o For example, if exact pair-wise comparison marginals are available, then one can learn a 'sparse' choice model up to o(log n) sparsity [Farias + Jagabathula + S '09, '12]
  o But what about the noisy setting?
  o Or max-ent learning?
o Part I:
  o A. Ammar, D. Shah, "Efficient rank aggregation from partial data," Proceedings of ACM Sigmetrics, 2012.
  o M. Bayati, D. Shah, M. Sharma, "Max-product for maximum weight matching: convergence, correctness and LP duality," IEEE Transactions on Information Theory, 2008.
  o S. Jagabathula, D. Shah, "Inferring rankings using constrained sensing," IEEE Transactions on Information Theory, 2011.
  o S. Agrawal, Z. Wang, Y. Ye, "Parimutuel betting on permutations," Internet and Network Economics, 2008.
  o J. Huang, C. Guestrin, L. Guibas, "Fourier theoretic probabilistic inference over permutations," Journal of Machine Learning Research, 2009.
o Part II:
  o S. Negahban, S. Oh, D. Shah, "Iterative ranking from pair-wise comparisons," Proceedings of NIPS, 2012.
  o V. Farias, S. Jagabathula, D. Shah, "A data-driven approach to modeling choice," Proceedings of NIPS, 2009. Also Management Science, 2012 (and arXiv).
  o V. Farias, S. Jagabathula, D. Shah, "Sparse choice models," available on arXiv, 2012.
  o A. Ammar, D. Shah, "Compare, don't score," Proceedings of Allerton, 2011.
o At large:
  o H. Varian, "Revealed preferences," Samuelsonian Economics and the Twenty-First Century, 2006.
  o D. McFadden, "Disaggregate Behavioral Travel Demand's RUM Side: A 30-year retrospective," available online: http://emlab.berkeley.edu/pub/wp/mcfadden0300.pdf
  o P. Diaconis, "Group representations in probability and statistics," Lecture Notes-Monograph Series, 1988.
  o M. Wainwright, M. Jordan, "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, 2008.