
A sextic algorithm for website design

Brent Heeringa ([email protected])
(Joint work with Micah Adler)

21 October 2004
Union College

A website design problem (for example: a new kitchen store)

Given products, their popularity, and their organization:

How do we create a good website?
• Navigation is natural
• Access to information is timely

[Figure: a Knives category organized by Type (paring, chef, bread, steak) and by Maker (Wüstof, Henkels), with product popularities 0.26, 0.33, 0.27, 0.14]

Good website: Natural Navigation

Organization is a DAG

The transitive closure (TC) of the DAG enumerates all viable categorical relationships and introduces shortcuts

Subgraph of TC preserves logical relationship between categories

[Figure: a DAG on nodes A, B, C, its transitive closure (TC), and a subgraph of the TC]

Good website: Timely Access to Info

Two obstacles to finding info quickly:
• Time scanning a page for the correct link
• Time descending the DAG

Associate a cost with each obstacle:
• Page cost (function of out-degree of node)
• Path cost (sum of page costs on path)

Good access structure:
• Minimize expected path cost
• Optimal subgraph is always a full tree

Example (a leaf with weight 1/2):
• Page Cost = # links
• Path Cost = 3 + 2 = 5
• Weighted Path Cost = 5 · (1/2) = 5/2

Constrained Subtree Selection (CSS)

An instance of CSS is a triple (G, γ, w):
• G is a rooted DAG with n leaves (constraint graph)
• γ is a function of the out-degree of each internal node (degree cost)
• w is a probability distribution over the n leaves (weights)

A solution is any directed subtree of the transitive closure of G which includes the root and leaves

An optimal solution is one which minimizes the expected path cost

Example: leaves A, B, C, D with weights 1/4, 1/4, 1/4, 1/4 and γ(x) = x

[Figure: a tree over leaves A, B, C, D whose leaf path costs are 3, 5, 5, 3]

Weighted path costs: 3(1/4), 5(1/4), 5(1/4), 3(1/4)
Cost: 1/4(3+5+5+3) = 1/4(16) = 4

Example: leaves A, B, C, D with weights 1/2, 1/6, 1/6, 1/6 and γ(x) = x

[Figure: a tree over leaves A, B, C, D]

Cost: 3 1/2
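The two costs above can be checked mechanically. The following is a minimal sketch, not code from the talk: the nested-list tree encoding and the name `expected_path_cost` are illustrative assumptions, and the two tree shapes are assumed ones that reproduce the slide values (the figures did not survive, including which leaf carries weight 1/2).

```python
from fractions import Fraction

def expected_path_cost(tree, weights, degree_cost=lambda d: d, cost_so_far=0):
    """Expected path cost of a tree given as nested lists with string leaves.

    Every internal node (a list) charges degree_cost(out-degree) to each leaf
    below it; a leaf contributes weight * accumulated path cost.
    """
    if isinstance(tree, str):                      # leaf
        return weights[tree] * cost_so_far
    page = degree_cost(len(tree))                  # page cost of this node
    return sum(expected_path_cost(child, weights, degree_cost, cost_so_far + page)
               for child in tree)

# Assumed shape for the uniform example: leaf path costs 3, 5, 5, 3.
uniform_tree = ["A", ["B", "C"], "D"]
uniform = {leaf: Fraction(1, 4) for leaf in "ABCD"}

# Assumed shape for the skewed example: the heavy leaf sits next to a ternary node.
skewed_tree = ["A", ["B", "C", "D"]]
skewed = {"A": Fraction(1, 2), "B": Fraction(1, 6),
          "C": Fraction(1, 6), "D": Fraction(1, 6)}

print(expected_path_cost(uniform_tree, uniform))   # 4
print(expected_path_cost(skewed_tree, skewed))     # 7/2
```

Exact fractions are used so the results can be compared with the slide values without rounding.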

Constraint-Free Graphs and k-favorability

Constraint-Free Graph
• Every directed, full tree with n leaves is a subtree of the TC
• CSS is no longer constrained by the graph

k-favorable degree cost
• Fix γ. γ is k-favorable if there exists k>1 such that any constraint-free instance of CSS under γ has an optimal tree with maximal out-degree k

Linear Degree Cost: γ(x) = x

[Figure: two trees compared by their path costs; a tree with one leaf of mass > 1/2]

• 5 paths w/ cost 5
• 3 paths w/ cost 5
• 2 paths w/ cost 4
• Unweighted path costs are all less, so weighted path costs must all be less
• Generalization to n>6 paths is straightforward
• Prefer binary structure when a leaf has at least half the mass
• Prefer ternary structure when mass is uniformly distributed


CSS with 2-favorable degree costs and constraint-free (C.F.) graphs is the Huffman coding problem
Examples of 2-favorable degree costs: quadratic, exponential, ceiling of log
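To make the Huffman connection concrete: in a binary tree every page costs γ(2), so the expected path cost is γ(2) times the expected leaf depth, which Huffman's algorithm minimizes. The sketch below is an illustration under that observation; the function names and the choice γ(x) = x² are assumptions, not material from the talk.

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman_expected_depth(weights):
    """Minimum expected leaf depth over binary trees (Huffman's algorithm).

    Each merge pushes every leaf below it one level deeper, so the expected
    depth equals the sum of the merged weights.
    """
    tie = count()                                   # tie-breaker for equal weights
    heap = [(w, next(tie)) for w in weights]
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, _ = heapq.heappop(heap)
        b, _ = heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, (a + b, next(tie)))
    return total

def binary_css_cost(weights, degree_cost):
    """Expected path cost when the optimal tree is binary (2-favorable case)."""
    return degree_cost(2) * huffman_expected_depth(weights)

quadratic = lambda x: x * x                         # assumed 2-favorable example
print(binary_css_cost([Fraction(1, 4)] * 4, quadratic))                   # 8
print(binary_css_cost([Fraction(1, 2)] + [Fraction(1, 6)] * 3, quadratic))  # 22/3
```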

Results

Complexity:
• NP-Complete for equal weights and many γ
• Sufficient condition on γ
• Hardness depends on constraint graph

Highlighted Algorithm:
• Theorem: O(n^6)-time DP algorithm when γ(x)=x and G is constraint free

Other results:
• Characterizations of optimal trees for uniform probability distributions
• Theorem: poly-time constant-approximation when γ ≥ 1 and k-favorable; G has constant out-degree
• Approximate Hotlink Assignment [Kranakis et al.]

Related Work

Adaptive Websites [Perkowitz & Etzioni]
• Challenge to the AI community
• Novel views of websites: Page synthesis problem

Hotlink Assignment [Kranakis, Krizanc, Shende, et al.]
• Add 1 hotlink per page to minimize expected distance from root to leaves
• Recently: pages have cost proportional to their size
• Hotlinks don’t change page cost

Optimal Prefix-Free Codes [Golin & Rote]
• Min code for n words with r symbols where symbol ai has cost ci
• Resembles CSS without a constraint graph

Dynamic Programming Review

Problems which exhibit:
• Optimal substructure
  An optimal solution may be written in terms of optimal solutions to subproblems (inductive definition)
• Overlapping subproblems
  Different problem instances share subproblems (repeated computation)

Dynamic Programming: Fib

Problem: What is the ith Fibonacci number?
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, …

Optimal substructure (inductive definition):
Fib(0) = 0
Fib(1) = 1
Fib(i) = Fib(i-1) + Fib(i-2)

Overlapping subproblems:
• Fib(7) = Fib(6) + Fib(5) (but Fib(6) calls Fib(5))
• We only need to calculate Fib(5) once
• Don’t repeat computations
• Idea: Store solutions to subproblems in a table

Dynamic Programming: Fib

General Approach
• Write inductive definition
• Range of parameters in definition defines table size
• Fill in table using definition
• Analysis: (Table size) * (# of lookups)

Fib(14): 0 ≤ i ≤ 14

i:      0  1  2  3  4  5  6  …  12   13   14
Fib(i): 0  1  1  2  3  5  8  …  144  233  377

Fib(0) = 0
Fib(1) = 1
Fib(i) = Fib(i-1) + Fib(i-2)
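The general approach can be sketched directly for Fib. This is a minimal illustration; the function name `fib_table` is an assumption. The table has one cell per value of i, and each cell needs two lookups, so the running time is linear in i.

```python
def fib_table(n):
    """Fill the table Fib(0..n) from the inductive definition."""
    table = [0] * (n + 1)          # table size is set by the range 0 <= i <= n
    if n >= 1:
        table[1] = 1
    for i in range(2, n + 1):
        # Two lookups per cell; no subproblem is ever recomputed.
        table[i] = table[i - 1] + table[i - 2]
    return table

print(fib_table(14)[14])   # 377
```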

Dynamic Programming: Subset Sum

Subset Sum (SS): Given a set of n positive integers X=(x1,…,xn) and a positive integer T, is there a subset of X which sums to T?

Example: X={2, 3, 5, 9, 10, 15, 17} and T=28
Yes: {2, 9, 17} and {3, 10, 15}

Inductive definition:
Let Xi = (x1,…,xi) = the first i integers of X
SS(t,i) = TRUE if there is a subset of Xi which sums to t
        = FALSE, otherwise

Dynamic Programming Review

[Figure: a table with rows t = 0…T and columns i = 0…n; cell (t,i) holds SS(t,i)]

SS(0,i) = TRUE
SS(t,0) = FALSE for t > 0
SS(t,i) = SS(t-xi,i-1) OR SS(t,i-1)
• SS(t-xi,i-1): the ith element is in the subset (only when t ≥ xi)
• SS(t,i-1): the ith element is not in the subset

Parameter Range: 0 ≤ t ≤ T, 0 ≤ i ≤ n

Table Size: T*n
Each cell (t,i) depends on 2 other cells
O(Tn) time for SS
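The recurrence translates into a table fill almost line for line. The sketch below is illustrative; the name `subset_sum` and the list-of-lists table layout are assumptions. It guards the "element is in the subset" case with t ≥ xi so that the lookup never leaves the table.

```python
def subset_sum(xs, target):
    """Return True if some subset of the positive integers xs sums to target."""
    n = len(xs)
    # ss[t][i] is True when some subset of the first i integers sums to t.
    ss = [[False] * (n + 1) for _ in range(target + 1)]
    for i in range(n + 1):
        ss[0][i] = True                      # SS(0, i) = TRUE (empty subset)
    for i in range(1, n + 1):
        x = xs[i - 1]
        for t in range(1, target + 1):
            skip = ss[t][i - 1]                       # ith element not in the subset
            take = t >= x and ss[t - x][i - 1]        # ith element in the subset
            ss[t][i] = skip or take
    return ss[target][n]

print(subset_sum([2, 3, 5, 9, 10, 15, 17], 28))   # True
```

Each of the (T+1)(n+1) cells is filled with two lookups, which matches the O(Tn) bound.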

Lopsided Trees

Recall: γ(x)=x (3-favorable) and G is constraint free

Node level = path cost

Adding an edge increases level

Grow lopsided trees level by level

Lopsided Trees

We know the exact cost of the tree up to the current level i:
• Exact cost of m leaves
• Remaining n-m leaves must have path-cost at least i

Lopsided Trees: Cost

Exact cost of C: 3 • (1/3)=1

Remaining mass up to level 4: (2/3) • 4 = 8/3

Total: 1+8/3=11/3

Lopsided Trees: Cost

Tree cost at Level 5 in terms of tree cost at Level 4: add in the mass of the remaining leaves

Cost at Level 5 (no new leaves): 11/3 + 2/3 = 13/3

Cost updates don’t depend on level
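The level-4 and level-5 numbers can be reproduced with a few lines. This is a minimal sketch under the assumption that the truncated cost is the exact cost of the placed leaves plus the remaining mass charged at the current level; the name `truncated_cost` is illustrative.

```python
from fractions import Fraction

def truncated_cost(placed, remaining_mass, level):
    """Cost of a tree truncated at `level`.

    placed: (weight, path cost) pairs for leaves at or above the current level.
    remaining_mass: total weight of leaves not yet placed; each of them has
    path cost at least `level`.
    """
    exact = sum(w * cost for w, cost in placed)
    return exact + remaining_mass * level

placed = [(Fraction(1, 3), 3)]            # leaf C: weight 1/3 at path cost 3
remaining = Fraction(2, 3)

at_4 = truncated_cost(placed, remaining, 4)
at_5 = truncated_cost(placed, remaining, 5)
print(at_4, at_5, at_5 - at_4)            # 11/3 13/3 2/3
```

The difference between consecutive levels is just the remaining mass, which is why the update does not depend on the level itself.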

Lopsided Trees

Equality on trees:
• Equal number of leaves at or above the frontier
• Equal number of leaves at each relative level below the frontier

Signatures:
• Nodes have out-degree ≤ 3
• Nodes lie at most γ(3)=3 levels below the frontier
• (m; l1, l2, l3) = signature

Example Signature: (2; 3, 2, 0)
• 2: C and F are leaves
• 3: G, H, I are 1 level past the frontier
• 2: J and K are 2 levels past the frontier

Signature if F is an interior node with 3 children?

Inductive Definition

Let CSS(m,l1,l2,l3) = min cost tree with sig (m;l1, l2, l3)

Can we define CSS(m,l1,l2,l3) in terms of optimal solutions to subproblems?

Which trees, when grown by one level, have sig (m;l1,l2,l3)?

Which parent sigs (m’;l’1,l’2,l’3) lead to the child sig (m;l1,l2,l3)?

Different signatures lead to (2; 0, 2, 3)

[Figure: trees with different signatures, (0; 4, 0, 0) and (2; 2, 0, 0), grow into trees with the same signature (2; 0, 2, 3)]

The other direction (which signatures can a tree grow)

[Figure: a tree with sig (0; 2, 0, 0) grown into a tree with sig (1; 0, 0, 3)]

• Growing a tree only affects the frontier
• Only l1 affects the next level
• Choose # of leaves; the remaining nodes are internal
• Choose # of degree-2 nodes (d2); the remaining nodes are degree-3 (d3)
• O(n^2) choices

The original question (warning: here be symbols)

Which (m’;l’1,l’2,l’3) → (m;l1,l2,l3)?
(PARENT → CHILD)

Suppose we know
• l’1 (the # of nodes one level below the frontier)
• d2 (the # of the l’1 nodes which are degree-2 interior nodes in (m;l1,l2,l3))

Let’s determine the values of the remaining variables:

m = m’ + l’1 - d2 - d3
• m: the new number of leaves
• m’: the old number of leaves
• l’1: nodes at one level below the frontier
• d2: internal nodes of degree 2
• d3: internal nodes of degree 3

Since each degree-3 node contributes 3 nodes three levels down, d3 = l3/3, so
m = m’ + l’1 - d2 - l3/3

l’2 = l1
• The old number of nodes 2 levels below the frontier becomes the new number of nodes one level below the frontier

l2 = l’3 + 2d2
• The new number of nodes 2 levels below the frontier is the old number of nodes 3 levels below the frontier plus the children of the binary nodes
• d2 nodes are binary so they contribute 2d2 to the frontier

Which (m’;l’1,l’2,l’3) → (m;l1,l2,l3)?
• l’1 and d2 are sufficient
• l’1 and d2 are both O(n)
• O(n^2) possibilities for (m’;l’1,l’2,l’3)

CSS(m,l1,l2,l3) = min cost tree with sig. (m; l1, l2, l3)
                = min over 0≤d2≤l’1≤n of CSS(m’,l’1,l’2,l’3) + cm’

(cm’ is the sum of the smallest n-m’ weights)

CSS(n,0,0,0) = cost of optimal tree

Analysis:
• Table size = O(n^4)
• Each cell takes O(n^2) lookups
• O(n^6) algorithm
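The signature recurrence can be exercised on small inputs. The sketch below is not the table-filling algorithm from the talk: it runs a shortest-path (Dijkstra) search over signatures, using the same growth rule and the same "add the remaining mass" cost update, assuming γ(x)=x and a constraint-free graph. The function name and the forward-search formulation are assumptions made for brevity.

```python
import heapq
from fractions import Fraction

def optimal_linear_css_cost(weights):
    """Optimal expected path cost for gamma(x) = x on a constraint-free graph."""
    w = sorted(weights, reverse=True)   # heaviest leaves take the shallowest slots
    n = len(w)
    if n < 2:
        return 0
    # tail[m] = total weight of the n - m lightest leaves (not yet placed).
    tail = [sum(w[m:]) for m in range(n + 1)]
    goal = (n, 0, 0, 0)
    # The root is internal: a binary root puts 2 nodes two levels down,
    # a ternary root puts 3 nodes three levels down.
    heap = [(0, (0, 0, 2, 0)), (0, (0, 0, 0, 3))]
    done = set()
    while heap:
        cost, sig = heapq.heappop(heap)
        if sig == goal:
            return cost
        if sig in done or sum(sig) > n:     # more nodes than leaves: infeasible
            continue
        done.add(sig)
        m, l1, l2, l3 = sig
        step = cost + tail[m]               # advancing one level adds the remaining mass
        for leaves in range(l1 + 1):        # how many of the l1 nodes become leaves
            for d2 in range(l1 - leaves + 1):
                d3 = l1 - leaves - d2       # the rest are ternary
                child = (m + leaves, l2, l3 + 2 * d2, 3 * d3)
                if sum(child) <= n and child not in done:
                    heapq.heappush(heap, (step, child))
    return None

print(optimal_linear_css_cost([Fraction(1, 4)] * 4))                     # 4
print(optimal_linear_css_cost([Fraction(1, 2)] + [Fraction(1, 6)] * 3))  # 7/2
```

Both outputs agree with the costs of the two four-leaf examples earlier in the talk. A search like this explores the same O(n^4) signature space with O(n^2) choices per signature, so it illustrates the structure of the recurrence without changing its asymptotics.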

Some Observations

Generalize algorithm:
• Theorem: O(n^(γ(k)+k))-time DP algorithm when γ is positive, integer-valued, non-decreasing, k-favorable and G is constraint free
• Signatures = (γ(k)+1)-vectors
• Table size = O(n^(γ(k)+1))
• Each cell requires O(n^(k-1)) lookups

(extra slides follow)

Motivation and Lower Bound

• Many constraint graphs have constant out-degree
• Remains NP-Hard for many degree costs

Lemma 1: H(w)/log(k) is a lower bound on the cost of an optimal tree
• For any k-favorable degree cost γ, with γ ≥ 1
• G is constraint-free

c(T) ≥ c’(T) ≥ c’(T’) ≥ H(w)/log(k) (Shannon)

[Figure: trees T and T’ with unit costs]
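For a sense of scale, the Lemma 1 bound can be evaluated numerically. The sketch below is illustrative only; the function name is an assumption, and base-2 logarithms are used for both the entropy and log(k) so that the ratio does not depend on the base.

```python
import math

def entropy_lower_bound(weights, k):
    """H(w) / log(k): the Lemma 1 lower bound on the cost of an optimal tree."""
    entropy = -sum(w * math.log2(w) for w in weights if w > 0)
    return entropy / math.log2(k)

# Uniform weights on 4 leaves with the 3-favorable cost gamma(x) = x:
# the bound is 2 / log2(3), roughly 1.26, against an optimal cost of 4.
print(entropy_lower_bound([0.25] * 4, 3))
```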

A Simple Lemma

Lemma 2: For any tree with m weighted nodes there exists 1 node (splitter) which, when removed, divides the tree into subtrees with at most half the weight of the original tree.

[Figure: removing the splitter leaves subtrees each of weight < 1/2]
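One standard way to find such a node is to walk down from the root toward the child whose subtree holds more than half of the total weight until no such child exists. The sketch below assumes a rooted tree given as a child-list dictionary with a weight per node; the names are illustrative and this is not necessarily the procedure used in the talk.

```python
def find_splitter(children, weight, root):
    """Return a node whose removal leaves pieces of weight at most half the total."""
    # Pre-order traversal; reversing it visits children before their parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    sub = {}
    for v in reversed(order):
        sub[v] = weight[v] + sum(sub[c] for c in children.get(v, []))
    total = sub[root]
    v = root
    while True:
        # At most one child can hold more than half of the total weight.
        heavy = [c for c in children.get(v, []) if 2 * sub[c] > total]
        if not heavy:
            return v        # every child subtree, and the rest above v, is at most half
        v = heavy[0]

children = {"r": ["a", "b"], "b": ["c", "d"]}
weight = {"r": 0, "a": 1, "b": 0, "c": 3, "d": 2}
print(find_splitter(children, weight, "r"))   # b
```

Removing b leaves pieces of weight 3, 2 and 1 out of a total of 6, so each is at most half.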

Approximation Algorithm

Let G be a DAG where the out-degree of every node is at most d
Choose a spanning tree T from G
Balance-Tree(T):
• Find a splitter node in T (Lemma 2)
• Stop if the splitter is a child of the root
• Disconnect the splitter and reconnect it to the root (root has degree at most d+1)
• Call Balance-Tree on all subtrees

[Figure: the splitter; the mass of each subtree is at most half of the whole tree]

Approximation Algorithm

Analysis:
• Mass under any node is at most half of the mass under its grandparent
• Path length to leaf with weight wi is at most -2log(wi)

Theorem: O(m)-time O(log(k)γ(d+1))-approx to optimal solution
• For any DAG G with m nodes and out-degree d
• For every k-favorable degree cost γ ≥ 1

[Annotations on the bound: upper bound on node cost; weighted path length]

Open Problems

Theorem: There is an […] for any instance (G, γ, w) of CSS where G is constraint free, γ is k-favorable, maps the positive integers to the positive integers and is non-decreasing

Proof:
c(T) ≥ c’(T) ≥ c’(T’) ≥ H(w)/log(k)
• T is the optimal tree for CSS cost c
• T’ is the optimal tree for OPC cost c’ for k symbols each with weight 1 (i.e. γ(x)=1)
• H is entropy

NO
