A sextic algorithm for website design
Brent Heeringa ([email protected]) (joint work with Micah Adler)
21 October 2004, Union College
A website design problem (for example: a new kitchen store)
Given products, their popularity, and their organization:
How do we create a good website?
  Navigation is natural
  Access to information is timely
[Figure: the Knives category organized by Type (paring, chef, bread, steak) and by Maker (Wüstof, Henkels), with popularities 0.26, 0.33, 0.27, 0.14]
Good website: Natural Navigation
Organization is a DAG
The transitive closure (TC) of the DAG enumerates all viable categorical relationships and introduces shortcuts
A subgraph of the TC preserves the logical relationships between categories
[Figure: a DAG on nodes A, B, C, its transitive closure, and a subgraph of the TC]
Good website: Timely Access to Info
Two obstacles to finding info quickly:
  Time scanning a page for the correct link
  Time descending the DAG
Associate a cost with each obstacle:
  Page cost (function of the out-degree of a node)
  Path cost (sum of page costs on the path)
Good access structure:
  Minimize expected path cost
  Optimal subgraph is always a full tree
[Figure: example tree with a leaf of weight 1/2]
Page Cost = # links
Path Cost = 3 + 2 = 5
Weighted Path Cost = 5/2
Constrained Subtree Selection (CSS)
An instance of CSS is a triple (G, γ, w):
  G is a rooted DAG with n leaves (constraint graph)
  γ is a function of the out-degree of each internal node (degree cost)
  w is a probability distribution over the n leaves (weights)
A solution is any directed subtree of the transitive closure of G which includes the root and leaves
An optimal solution is one which minimizes the expected path cost
[Figure: example instance with leaves A, B, C, D, each with weight 1/4]
γ(x) = x    Cost: 4
Expected path cost: 1/4 (3 + 5 + 5 + 3) = 1/4 (16) = 4
[Figure: the same leaves A, B, C, D with weights 1/2, 1/6, 1/6, 1/6]
γ(x) = x    Cost: 3 1/2
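As a rough check of the two costs quoted above, the sketch below computes the expected path cost of a tree written as nested lists. The tree shapes are assumptions: the first is inferred from the path costs 3, 5, 5, 3, the second from the total of 3 1/2, since the figures themselves did not survive in the transcript. The function name `expected_path_cost`, the parameter `gamma` (the degree cost), and the choice of A as the heavy leaf are illustrative only.

```python
from fractions import Fraction


def expected_path_cost(tree, weights, gamma=lambda degree: degree):
    """Sum over leaves of (leaf weight) * (sum of page costs on its path).

    tree: a leaf is a string; an internal node is a list of children.
    gamma: degree cost; the default is the linear cost gamma(x) = x.
    """
    total = Fraction(0)

    def walk(node, cost_so_far):
        nonlocal total
        if isinstance(node, str):
            total += weights[node] * cost_so_far
            return
        page_cost = gamma(len(node))  # cost of scanning this page
        for child in node:
            walk(child, cost_so_far + page_cost)

    walk(tree, 0)
    return total


# Assumed shape: root with 3 links (A, an inner page, D); inner page links B and C.
uniform = {leaf: Fraction(1, 4) for leaf in "ABCD"}
tree_uniform = ["A", ["B", "C"], "D"]

# Assumed shape: binary root with the heavy leaf A next to a 3-link page.
skewed = {"A": Fraction(1, 2), "B": Fraction(1, 6),
          "C": Fraction(1, 6), "D": Fraction(1, 6)}
tree_skewed = ["A", ["B", "C", "D"]]

print(expected_path_cost(tree_uniform, uniform))  # 4
print(expected_path_cost(tree_skewed, skewed))    # 7/2
```

Passing a different `gamma`, for example `lambda d: d * d`, gives the cost of the same tree under another degree cost.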
Constraint-Free Graphs and k-favorability
Constraint-Free Graph
  Every directed, full tree with n leaves is a subtree of the TC
  CSS is no longer constrained by the graph
k-favorable degree cost
  Fix γ. There exists k > 1 such that, for any constraint-free instance of CSS under γ, an optimal tree has maximal out-degree k
Linear Degree Cost - γ(x)=x
• 5 paths w/ cost 5
• 3 paths w/ cost 5
• 2 paths w/ cost 4
• Unweighted path costs are never greater, so weighted path costs are never greater
• Generalization to n>6 paths is straightforward
Linear Degree Cost - γ(x)=x
• 4 paths w/ cost 4
• 4 paths w/ cost 4
• Prefer binary structure when a leaf has at least half the mass (figure label: > 1/2)
• Prefer ternary structure when mass is uniformly distributed
Linear Degree Cost - γ(x)=x
CSS with 2-favorable degree costs and constraint-free graphs is the Huffman coding problem
  Examples: quadratic, exp, ceiling of log
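To illustrate the Huffman connection, here is a minimal sketch of the standard Huffman merge using Python's `heapq`. It returns the expected leaf depth of an optimal binary tree. Under the assumption that the optimal tree is binary (the 2-favorable case), the expected path cost is that depth times the cost of a two-link page. The function names and the `gamma` parameter (the degree cost) are illustrative, not taken from the talk.

```python
import heapq
from fractions import Fraction


def huffman_expected_depth(weights):
    """Expected depth of the leaves in an optimal binary (Huffman) tree."""
    heap = list(weights)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        # Each merge pushes every leaf under it one level deeper.
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def binary_css_cost(weights, gamma):
    """Expected path cost when every internal node has out-degree 2."""
    return gamma(2) * huffman_expected_depth(weights)


weights = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
print(huffman_expected_depth(weights))              # 7/4
print(binary_css_cost(weights, lambda x: x * x))    # 7, quadratic degree cost
```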
Results
Complexity:
  NP-Complete for equal weights and many γ
  Sufficient condition on γ
  Hardness depends on constraint graph
Highlighted Algorithm:
  Theorem: O(n^6)-time DP algorithm
  γ(x)=x and G is constraint free
Other results:
  Characterizations of optimal trees for uniform probability distributions
  Theorem: poly-time constant-approximation: γ ≥ 1 and k-favorable; G has constant out-degree
  Approximate Hotlink Assignment - [Kranakis et al.]
Related Work
Adaptive Websites [Perkowitz & Etzioni]
  Challenge to the AI community
  Novel views of websites: Page synthesis problem
Hotlink Assignment [Kranakis, Krizanc, Shende, et al.]
  Add 1 hotlink per page to minimize expected distance from root to leaves
  Recently: pages have cost proportional to their size
  Hotlinks don’t change page cost
Optimal Prefix-Free Codes [Golin & Rote]
  Min code for n words with r symbols where symbol a_i has cost c_i
  Resembles CSS without a constraint graph
Dynamic Programming Review
Problems which exhibit:
Optimal substructure
  An optimal solution may be written in terms of optimal solutions to subproblems
  Inductive definition
Overlapping subproblems
  Different problem instances share subproblems
  Repeated computation
Dynamic Programming: Fib
Problem: What is the ith Fibonacci number?
  0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, …
Optimal substructure (inductive definition)
  Fib(0) = 0
  Fib(1) = 1
  Fib(i) = Fib(i-1) + Fib(i-2)
Overlapping subproblems
  Fib(7) = Fib(6) + Fib(5) (but Fib(6) calls Fib(5))
  We only need to calculate Fib(5) once
  Don’t repeat computations
  Idea: Store solutions to subproblems in a table
Dynamic Programming: Fib
General Approach
  Write inductive definition
  Range of parameters in definition defines table size
  Fill in table using definition
  Analysis: (Table size) * (# of lookups)
Fib(14): 0 ≤ i ≤ 14
  i:      0  1  2  3  4  5  6  …  12   13   14
  Fib(i): 0  1  1  2  3  5  8  …  144  233  377
Fib(0) = 0
Fib(1) = 1
Fib(i) = Fib(i-1) + Fib(i-2)
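The table-filling approach can be sketched in a few lines. This is a minimal illustration of the general recipe (one table cell per value of i, two lookups per cell), with the function name `fib_table` chosen here for the example.

```python
def fib_table(n):
    """Fill a table of Fib(0..n) bottom-up and return it."""
    table = [0] * (n + 1)
    if n >= 1:
        table[1] = 1
    for i in range(2, n + 1):
        # Each cell needs exactly two earlier cells, so the work is O(n).
        table[i] = table[i - 1] + table[i - 2]
    return table


table = fib_table(14)
print(table[:7])    # [0, 1, 1, 2, 3, 5, 8]
print(table[12:])   # [144, 233, 377]
```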
Dynamic Programming: Subset Sum
Subset Sum (SS): Given a set of n positive integers X=(x1,…,xn) and a positive integer T, is there a subset of X which sums to T?
Example: X={2, 3, 5, 9, 10, 15, 17} and T=28
  Yes: {2, 9, 17} and {3, 10, 15}
Inductive definition:
  Let Xi = (x1,…,xi) = the first i integers of X
  SS(t,i) = TRUE if there is a subset of Xi which sums to t
          = FALSE, otherwise
Dynamic Programming Review
[Figure: the DP table, indexed by t up to T and by i up to n, with cell (t,i) marked]
SS(0,i) = TRUE
SS(t,0) = FALSE for t > 0
SS(t,i) = SS(t-xi,i-1) OR SS(t,i-1)
  SS(t-xi,i-1): the ith element is in the subset
  SS(t,i-1): the ith element is not in the subset
Parameter Range: 0 ≤ t ≤ T, 0 ≤ i ≤ n
Table Size: T*n
Each cell (t,i) depends on 2 other cells
O(Tn) time for SS
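A minimal sketch of the Subset Sum table, following the recurrence SS(t,i) = SS(t-xi,i-1) OR SS(t,i-1). The guard `t >= x` is an addition needed in code so that the lookup SS(t-xi,i-1) never uses a negative sum; the function name `subset_sum` is illustrative.

```python
def subset_sum(xs, target):
    """Return True if some subset of xs sums to target (positive integers)."""
    n = len(xs)
    # ss[t][i] is True when some subset of the first i numbers sums to t.
    ss = [[False] * (n + 1) for _ in range(target + 1)]
    for i in range(n + 1):
        ss[0][i] = True  # the empty subset sums to 0
    for i in range(1, n + 1):
        x = xs[i - 1]
        for t in range(1, target + 1):
            take = t >= x and ss[t - x][i - 1]   # the ith element is in the subset
            skip = ss[t][i - 1]                  # the ith element is not in the subset
            ss[t][i] = take or skip
    return ss[target][n]


print(subset_sum([2, 3, 5, 9, 10, 15, 17], 28))  # True
```

The two nested loops touch each of the (T+1)(n+1) cells once and do two lookups per cell, which matches the O(Tn) analysis.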
Lopsided Trees
Recall: γ(x)=x (3-favorable) and G is constraint free
Node level = path cost
Adding an edge increases level
Grow lopsided trees level by level
Lopsided Trees
We know exact cost of tree up to the current level i:
  Exact cost of m leaves
  Remaining n-m leaves must have path cost at least i
Lopsided Trees: Cost
Exact cost of C: 3 • (1/3)=1
Remaining mass up to level 4: (2/3) • 4 = 8/3
Total: 1+8/3=11/3
Lopsided Trees: Cost
Tree cost at Level 5 in terms of tree cost at Level 4: add in the mass of the remaining leaves
Cost at Level 5 (no new leaves): 11/3 + 2/3 = 13/3
Cost updates don’t depend on level
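The level-by-level bookkeeping can be checked with exact fractions. This sketch assumes the setting of the worked numbers: one leaf of weight 1/3 placed at level 3 and a remaining mass of 2/3. The function names are hypothetical.

```python
from fractions import Fraction


def truncated_cost(placed, remaining_mass, level):
    """Cost of a tree truncated at `level`.

    placed: list of (weight, depth) for leaves at or above the frontier.
    remaining_mass: total weight of leaves not yet placed; each of them
    must end up at depth >= level, so it is charged `level` for now.
    """
    exact = sum(weight * depth for weight, depth in placed)
    return exact + remaining_mass * level


def advance_one_level(cost, remaining_mass):
    """Moving the frontier down one level adds the remaining mass once."""
    return cost + remaining_mass


placed = [(Fraction(1, 3), 3)]
remaining = Fraction(2, 3)

cost_level_4 = truncated_cost(placed, remaining, 4)
cost_level_5 = advance_one_level(cost_level_4, remaining)
print(cost_level_4, cost_level_5)  # 11/3 13/3
```

The update uses only the remaining mass, not the level number, which is why the cost of a subproblem can be stored without recording the level.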
Lopsided Trees
Equality on trees:
  Equal number of leaves at or above the frontier
  Equal number of leaves at each relative level below the frontier
Nodes have out-degree ≤ 3
Nodes are at most γ(3)=3 levels below the frontier
(m; l1, l2, l3) = signature
Example Signature: (2; 3, 2, 0)
  2: C and F are leaves
  3: G, H, I are 1 level past the frontier
  2: J and K are 2 levels past the frontier
Signature if F is an interior node with 3 children?
Inductive Definition
Let CSS(m,l1,l2,l3) = min cost tree with sig (m;l1, l2, l3)
Can we define CSS(m,l1,l2,l3) in terms of optimal solutions to subproblems?
Which trees, when grown by one level, have sig (m;l1,l2,l3)?
Which parent sigs (m’;l’1,l’2,l’3) lead to the child sig (m;l1,l2,l3)?
Different signatures lead to (2; 0, 2, 3)
  Different signatures: (0; 4, 0, 0) and (2; 2, 0, 0)
  Same signature: (2; 0, 2, 3)
Sig: (0; 2, 0, 0)
Sig: (1; 0, 0, 3)
Growing a tree only affects the frontier
Only l1 affects the next level
  Choose # of leaves
  The remaining nodes are internal
  Choose # of degree-2 nodes (d2)
  Remaining nodes are degree-3 (d3)
O(n^2) choices
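The growth step can be written as a small function on signatures. This is a sketch under the assumptions of the talk's setting (degree cost equal to out-degree, out-degree at most 3), so the children of a degree-2 node land two levels down and those of a degree-3 node land three levels down. The names `grow` and `grown_signatures` are illustrative.

```python
def grow(sig, d2, d3):
    """Move the frontier down one level.

    sig = (m, l1, l2, l3): m leaves at or above the frontier and
    l1, l2, l3 nodes at 1, 2, 3 levels below it.
    Of the l1 nodes reaching the frontier, d2 become degree-2 internal
    nodes, d3 become degree-3 internal nodes, and the rest become leaves.
    """
    m, l1, l2, l3 = sig
    leaves = l1 - d2 - d3
    if d2 < 0 or d3 < 0 or leaves < 0:
        raise ValueError("d2 + d3 cannot exceed l1")
    return (m + leaves, l2, l3 + 2 * d2, 3 * d3)


def grown_signatures(sig):
    """All signatures reachable in one step: O(l1^2) choices of (d2, d3)."""
    l1 = sig[1]
    return {grow(sig, d2, d3)
            for d2 in range(l1 + 1)
            for d3 in range(l1 - d2 + 1)}


print(grow((0, 4, 0, 0), d2=1, d3=1))  # (2, 0, 2, 3)
print(grow((2, 2, 0, 0), d2=1, d3=1))  # (2, 0, 2, 3)
print(grow((0, 2, 0, 0), d2=0, d3=1))  # (1, 0, 0, 3)
```

The first two calls reproduce the case of two different signatures growing into the same one.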
The other direction
(which signatures can a tree grow)
The original question (warning: here be symbols)
Which (m’;l’1,l’2,l’3) → (m;l1,l2,l3)? (PARENT → CHILD)
Suppose we know
  l’1 (the # of nodes one level below the frontier)
  d2 (the # of l’1 which are degree-2 interior nodes in (m;l1,l2,l3))
Let’s determine the values of the remaining variables
[Figure: the l’1 nodes one level below the frontier, d2 of them with degree 2; levels 1, 2, 3 marked]
m = m’ + l’1 - d2 - d3
  m: the new number of leaves
  m’: the old number of leaves
  l’1: nodes at one level below the frontier
  d2: internal nodes of degree 2
  d3: internal nodes of degree 3
m = m’ + l’1 - d2 - l3/3
  Each degree-3 node puts 3 children three levels below the frontier, so d3 = l3/3
l’2 = l1
  l’2: the old number of nodes 2 levels below the frontier
  l1: new nodes one level below the frontier
l2 = l’3 + 2d2
  l2: the new number of nodes 2 levels below the frontier
  l’3: the old nodes 3 levels below the frontier, now 2 levels below
  2d2: d2 nodes are binary so they contribute 2d2 to the frontier
l’1 and d2 are sufficient
l’1 and d2 are both O(n)
O(n^2) possibilities for (m’;l’1,l’2,l’3)
CSS(m,l1,l2,l3) = min cost tree with sig. (m;l1,l2,l3)
                = min { CSS(m’,l’1,l’2,l’3) + c_m’ } for 1≤d2≤l’1≤n
  (c_m’ is the sum of the smallest n-m’ weights)
CSS(n,0,0,0) = cost of optimal tree
Analysis:
  Table size = O(n^4)
  Each cell takes O(n^2) lookups
  O(n^6) algorithm
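The recurrence can be exercised on small inputs with the sketch below. It is not the table-filling algorithm of the talk: it explores signatures with a priority queue (Dijkstra's method on the graph of signatures), which gives the same optimum but makes no claim about the O(n^6) bound. It assumes degree cost equal to out-degree, out-degree at most 3, no constraint graph, at least two leaves, and that the heaviest leaves are placed first. All names are hypothetical.

```python
import heapq
from fractions import Fraction


def optimal_linear_css_cost(weights):
    """Min expected path cost when a page with x links costs x."""
    w = sorted((Fraction(x) for x in weights), reverse=True)
    n = len(w)
    if n < 2:
        raise ValueError("need at least two leaves")
    # tail[m] = total weight of the n - m lightest leaves.
    tail = [Fraction(0)] * (n + 1)
    for m in range(n - 1, -1, -1):
        tail[m] = tail[m + 1] + w[m]
    # Frontier at level 0: the root has 2 or 3 children, which sit
    # 2 or 3 levels below it. Signature = (m, l1, l2, l3).
    best = {}
    heap = []
    for sig in [(0, 0, 2, 0), (0, 0, 0, 3)]:
        if sum(sig) <= n:
            best[sig] = Fraction(0)
            heapq.heappush(heap, (Fraction(0), sig))
    while heap:
        cost, sig = heapq.heappop(heap)
        if cost > best[sig]:
            continue  # stale queue entry
        m, l1, l2, l3 = sig
        if m == n:
            return cost  # signature (n; 0, 0, 0)
        step = cost + tail[m]  # every unplaced leaf sinks one level
        for d2 in range(l1 + 1):
            for d3 in range(l1 - d2 + 1):
                child = (m + l1 - d2 - d3, l2, l3 + 2 * d2, 3 * d3)
                if sum(child) > n:
                    continue  # would need more than n leaves
                if child not in best or step < best[child]:
                    best[child] = step
                    heapq.heappush(heap, (step, child))
    raise RuntimeError("no tree found")


print(optimal_linear_css_cost([Fraction(1, 4)] * 4))                   # 4
print(optimal_linear_css_cost([Fraction(1, 2)] + [Fraction(1, 6)] * 3))  # 7/2
```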
Some Observations
Generalize algorithm:
  Theorem: O(n^(γ(k)+k))-time DP algorithm
  γ is positive, integer-valued, non-decreasing, k-favorable and G is constraint free
  Signatures = (γ(k)+1)-vectors
  Table size = O(n^(γ(k)+1))
  Each cell requires O(n^(k-1)) lookups
(extra slides follow)
Motivation and Lower Bound
Many constraint graphs have constant out-degree
Remains NP-Hard for many degree costs
Lemma 1: H(w)/log(k) is a lower bound on the cost of an optimal tree
  For any k-favorable degree cost γ with γ ≥ 1
  G is constraint-free
c(T) ≥ c’(T) ≥ c’(T’) ≥ H(w)/log(k) (Shannon)
[Figure: trees T and T’ with every node cost set to 1]
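The lower bound of Lemma 1 is easy to evaluate numerically. A minimal sketch, assuming the entropy and the logarithm use the same base (base 2 here); the function name `entropy_lower_bound` is illustrative.

```python
import math


def entropy_lower_bound(weights, k):
    """H(w) / log(k): lower bound on the cost of an optimal tree."""
    entropy = -sum(w * math.log2(w) for w in weights if w > 0)
    return entropy / math.log2(k)


uniform = [0.25] * 4
print(entropy_lower_bound(uniform, 2))  # 2.0
print(entropy_lower_bound(uniform, 3))
```

For four equally likely leaves and k = 3 the bound is about 1.26, well below the cost of 4 under the linear degree cost, so the bound can be loose when page costs exceed 1.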
A Simple Lemma
Lemma 2: For any tree with m weighted nodes there exists 1 node (splitter) which, when removed, divides the tree into subtrees with at most half the weight of the original tree.
[Figure: a splitter node whose removal leaves three subtrees, each with weight < 1/2]
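One way to find a splitter is to walk from the root toward the heavy side until no child subtree holds more than half of the total weight. The sketch below assumes a rooted tree given as a dictionary of child lists and a dictionary of node weights; the representation and the name `find_splitter` are assumptions for illustration.

```python
def find_splitter(children, weight, root):
    """Return a node whose removal leaves pieces of weight <= half the total.

    children: dict mapping a node to a list of its children.
    weight: dict mapping a node to its weight (missing nodes weigh 0).
    """
    # Preorder traversal, so every parent appears before its children.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))
    # Subtree weights, computed children-first.
    subtree = {}
    for node in reversed(order):
        subtree[node] = weight.get(node, 0) + sum(
            subtree[c] for c in children.get(node, []))
    total = subtree[root]
    node = root
    while True:
        heavy = [c for c in children.get(node, []) if 2 * subtree[c] > total]
        if not heavy:
            # Every child subtree is at most half; the part above `node`
            # is also under half because subtree[node] exceeded half.
            return node
        node = heavy[0]


children = {"r": ["a", "b"], "a": ["c", "d"]}
weight = {"b": 1, "c": 4, "d": 4}
print(find_splitter(children, weight, "r"))  # a
```

Removing node a leaves the pieces {c}, {d} and {r, b} with weights 4, 4 and 1 out of a total of 9.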
Approximation Algorithm
Let G be a DAG where the out-degree of every node is at most d
Choose a spanning tree T from G
Balance-Tree(T):
  Find a splitter node in T (Lemma 2)
  Stop if splitter is child of root
  Disconnect the splitter and reconnect it to the root
    root has degree at most d+1
  Call Balance-Tree on all subtrees
[Figure: the splitter reconnected to the root; mass of each subtree is at most half of the whole tree]
Approximation Algorithm
Analysis:
  Mass under any node is at most half of the mass under its grandparent
  Path length to leaf with weight wi is at most -2log(wi)
Theorem: O(m)-time O(log(k)γ(d+1))-approx to optimal solution
  For any DAG G with m nodes and out-degree d
  For every k-favorable degree cost γ ≥ 1
  [Annotations on the bound: Upper Bound on Node Cost; Weighted Path Length]
Open Problems
Theorem: There is an for any instance (G,,w) of CSS where G is constraint free, is k-favorable, maps the positive integers to the positive integers and is non-decreasing
Proof: c(T) ≥ c’(T) ≥ c’(T’) ≥ H(w)/log(k)
  T is the optimal tree for CSS with cost c
  T’ is the optimal tree for OPC with cost c’ for k symbols each with weight 1 (i.e. γ(x)=1)
  H is entropy
NO