A phylogenetic application of the combinatorial graph Laplacian Eric A. Stone Department of Statistics Bioinformatics Research Center North Carolina State University


Post on 27-Mar-2015


Page 1: A phylogenetic application of the combinatorial graph Laplacian Eric A. Stone Department of Statistics Bioinformatics Research Center North Carolina State

A phylogenetic application of the combinatorial graph Laplacian

Eric A. Stone

Department of Statistics
Bioinformatics Research Center
North Carolina State University

Page 2

My motivation for this project

• Trees in statistics or biology
– Often a latent branching structure relating some observed data

• Trees in mathematics
– Always a connected graph with no cycles

Page 3

My motivation for this project

• Trees in statistics or biology
– PROBLEM: Recover properties of latent branching structure

• Trees in mathematics
– Always a connected graph with no cycles

Page 4

My motivation for this project

• Trees in statistics or biology
– PROBLEM: Recover properties of latent branching structure

• Trees in mathematics
– Characterization of observed structure by spectral graph theory


Page 6

Bridging the gap

• Rectifying trees (latent branching structure in statistics or biology) and trees (connected acyclic graphs in mathematics)

• Can we use some powerful tools of spectral graph theory to recover latent structure?
– Natural relationship between trees and complete graphs?

Page 7

Tree and distance matrices

• The tree with vertex set {1,…,8} has distance matrix D

• The “phylogenetic tree” can only be observed at the leaves {1,…,5}
– We can only observe (estimate) the phylogenetic portion D*

The phylogenetic portion D*
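The relationship between D and D* can be sketched numerically. The topology below is inferred from the splits and cutpoints named elsewhere in these slides (leaves 1 to 5, internal nodes 6 to 8); the branch lengths are assumptions, since the slide figure is not part of the transcript, and `tree_distance_matrix` is a hypothetical helper name.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are ASSUMED for illustration.
EDGES = {(1, 6): 1.0, (2, 6): 1.0, (6, 8): 3.0, (5, 8): 1.0,
         (7, 8): 1.0, (3, 7): 1.0, (4, 7): 1.0}

def tree_distance_matrix(edges, n):
    """Path-length distances between all vertices 1..n of a weighted tree."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for (u, v), length in edges.items():
        D[u - 1, v - 1] = D[v - 1, u - 1] = length
    for k in range(n):                       # Floyd-Warshall relaxation
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

D = tree_distance_matrix(EDGES, n=8)   # full 8x8 matrix (not observable)
D_star = D[:5, :5]                     # phylogenetic portion: leaves only
print(D_star)
```

Only `D_star` would be available in practice; the full `D` is built here just to show where it comes from.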

Page 8

More motivation for this project

• Trees in statistics or biology
– PROBLEM: Recover properties of latent branching structure

• Given D* only, recover latent branching structure
– This is the problem of phylogenetic reconstruction (w/o error!)

The phylogenetic portion D*

Page 9

NJ finds (2,n-2) splits from D*

• A split is a bipartition of the leaf set (e.g. {1,2,3,4,5}) that can be induced by cutting a branch on the tree
– e.g. {{1,2},{3,4,5}} or {{1,2,5},{3,4}}

• Neighbor-joining criterion identifies (2,n-2) splits through the Q matrix

{{1,2},{3,4,5}} {{1,2,5},{3,4}}
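The criterion formula on this slide did not survive extraction. A minimal sketch, assuming it is the standard neighbor-joining criterion Q(i,j) = (n-2) d(i,j) - Σ_k d(i,k) - Σ_k d(j,k), whose minimizing pair is a cherry. The distance matrix is hypothetical (a five-leaf tree with cherries {1,2} and {3,4} and assumed branch lengths), and `nj_q_matrix` is an illustrative name.

```python
import numpy as np

# Hypothetical leaf distances; the branch lengths behind them are assumed.
D_star = np.array([[0, 2, 6, 6, 5],
                   [2, 0, 6, 6, 5],
                   [6, 6, 0, 2, 3],
                   [6, 6, 2, 0, 3],
                   [5, 5, 3, 3, 0]], dtype=float)

def nj_q_matrix(D):
    """Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k)."""
    n = D.shape[0]
    r = D.sum(axis=1)
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)     # a leaf is never paired with itself
    return Q

Q = nj_q_matrix(D_star)
i, j = np.unravel_index(np.argmin(Q), Q.shape)
cherry = {int(i) + 1, int(j) + 1}   # 1-based leaf labels
print(cherry, Q[i, j])
```

The pair returned is one side of a (2,n-2) split; the other side is the remaining n-2 leaves.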

Page 10

A recipe for tree reconstruction from D*

1. Find a split
– NJ relies on a theorem that guarantees a (2,n-2) split from the Q matrix

2. Use knowledge of split to reduce dimension
– NJ prunes the cherry (neighboring taxa) to reduce leaves by one

3. Iterate until tree has been fully reconstructed
– Tree topology specified by its split set

Page 11

Our narrow goal

1. Find a split
– NJ relies on a theorem that guarantees a (2,n-2) split from the Q matrix
– Hypothesize a criterion that identifies deeper splits and prove that it actually works

Page 12

Our solution

The phylogenetic portion D*

Page 13

Our solution

• Let H be the centering matrix: $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{T}$, with n the number of leaves

• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree

The phylogenetic portion D*
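The two steps above can be sketched in a few lines. The distance matrix is a hypothetical example (tree with splits {{1,2},{3,4,5}} and {{1,2,5},{3,4}}, assumed branch lengths), and `spectral_split` is an illustrative name rather than anything defined in the talk.

```python
import numpy as np

# Hypothetical leaf distances; branch lengths are assumed.
D_star = np.array([[0, 2, 6, 6, 5],
                   [2, 0, 6, 6, 5],
                   [6, 6, 0, 2, 3],
                   [6, 6, 2, 0, 3],
                   [5, 5, 3, 3, 0]], dtype=float)

def spectral_split(D):
    """Sign pattern of the eigenvector of H D H with the smallest eigenvalue."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    M = H @ D @ H
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)   # ascending eigenvalues
    y = eigvecs[:, 0]
    pos = frozenset(i + 1 for i in range(n) if y[i] > 0)
    neg = frozenset(range(1, n + 1)) - pos
    return eigvals[0], y, {pos, neg}

lam, y, split = spectral_split(D_star)
print(round(lam, 3), np.round(y, 4), split)
```

The overall sign of an eigenvector is arbitrary, so the split is returned as an unordered pair of leaf sets rather than as a "positive" and a "negative" side.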

Page 14

About the matrix HD*H

• Entries of HD*H are $D^*_{ij} - D^*_{i\cdot} - D^*_{\cdot j} + D^*_{\cdot\cdot}$, where a dot denotes the mean over that index

• HD*H is negative semidefinite
– Zero is a simple eigenvalue, with the all-ones vector as eigenvector
– The remaining eigenvectors have both + and - entries

• HD*H appears prominently in:
– Multidimensional scaling
– Principal coordinate analysis
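The properties listed above can be checked numerically on a small case. This is a sanity check on one hypothetical tree metric (assumed branch lengths), not a proof.

```python
import numpy as np

# Hypothetical leaf distances; branch lengths are assumed.
D_star = np.array([[0, 2, 6, 6, 5],
                   [2, 0, 6, 6, 5],
                   [6, 6, 0, 2, 3],
                   [6, 6, 2, 0, 3],
                   [5, 5, 3, 3, 0]], dtype=float)

n = D_star.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
M = H @ D_star @ H

# Entrywise form: subtract row and column means, add back the grand mean.
row_mean = D_star.mean(axis=1, keepdims=True)
col_mean = D_star.mean(axis=0, keepdims=True)
M_entrywise = D_star - row_mean - col_mean + D_star.mean()

eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
nonzero = [k for k in range(n) if eigvals[k] < -1e-9]
# Eigenvectors for nonzero eigenvalues are orthogonal to the all-ones
# vector, so their entries must take both signs.
mixed_signs = all(eigvecs[:, k].min() < 0 < eigvecs[:, k].max() for k in nonzero)
print(np.allclose(M, M_entrywise), np.round(eigvals, 3), mixed_signs)
```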

Page 15

Example of our solution

• Find eigenvector Y of HD*H with the smallest eigenvalue:
Y = (+0.5793, +0.4418, -0.0564, -0.4636, -0.5011)

• Signs of Y identify the split {{1,2},{3,4,5}}

Page 16

A real example (data from ToL)

• Two iterations

Page 17

Our solution

1. Find a split
– NJ relies on a theorem that guarantees a (2,n-2) split from the Q matrix
– Hypothesize a criterion that identifies deep splits and prove that it actually works

Page 18

Affinity and distance

• In phylogenetics, common to consider pairwise distances
• In graph theory, common to consider pairwise affinities

[Figure panels: Distance-based; Affinity-based]

Page 19

Distance matrix Laplacian matrix


Page 21

The genius of Miroslav Fiedler

• G connected ⇒ smallest eigenvalue of L, zero, is simple
– Smallest positive eigenvalue, λ, called algebraic connectivity of G

• Fiedler vectors Y satisfy LY = λY
– Fiedler cut is the sign-induced bipartition

• Fiedler cut here is
– {{1,2,6},{3,4,5,7,8}}

• Note that the cut implies a leaf split:
– {{1,2},{3,4,5}}

[Figure: the tree G labeled with its Fiedler vector entries; three are positive (+0.4840, +0.4038, +0.3449) and five are negative (-0.4047, -0.4277, -0.0223, -0.3653, -0.0158)]
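A Fiedler cut can be computed directly when the whole tree is known. A minimal sketch, assuming the topology implied by the slides, assumed branch lengths, and the convention that an edge's affinity is the reciprocal of its length; `tree_laplacian` and `fiedler_cut` are hypothetical helper names. The numbers differ from the slide's figure because the branch lengths differ.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are ASSUMED.
EDGES = {(1, 6): 1.0, (2, 6): 1.0, (6, 8): 3.0, (5, 8): 1.0,
         (7, 8): 1.0, (3, 7): 1.0, (4, 7): 1.0}

def tree_laplacian(edges, n):
    """Weighted Laplacian; edge affinity = 1 / branch length (assumed convention)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L

def fiedler_cut(L):
    """Algebraic connectivity, a Fiedler vector, and its sign-induced cut."""
    eigvals, eigvecs = np.linalg.eigh(L)     # ascending; eigvals[0] is ~0
    y = eigvecs[:, 1]
    n = len(y)
    pos = frozenset(i + 1 for i in range(n) if y[i] > 0)
    return eigvals[1], y, {pos, frozenset(range(1, n + 1)) - pos}

L = tree_laplacian(EDGES, n=8)
lam, y, cut = fiedler_cut(L)
leaf_split = {frozenset(v for v in side if v <= 5) for side in cut}
print(round(lam, 3), cut, leaf_split)
```

Restricting each side of the cut to the leaves 1 to 5 gives the implied leaf split.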

Page 22

Is this relevant here?

• We do not observe an 8x8 Laplacian matrix L
– All we get is a 5x5 matrix of between-leaf pairwise distances D*

• Where is the connection to graph theory?

The phylogenetic portion D*

Page 23

Recall: Our solution

• Let H be the centering matrix: $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{T}$, with n the number of leaves

• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree

The phylogenetic portion D*

Page 24

An extremely useful relationship

• Recall the centering matrix H
– The (Moore-Penrose) pseudoinverse of HDH is in fact -2L

• We have shown in the context of this formula
– Principal submatrices of D relate to Schur complements of L

• In particular, $(HD^*H)^{+} = -2L^* = -2(L/Z) = -2(W - XZ^{-1}Y)$, where $L = \begin{pmatrix} W & X \\ Y & Z \end{pmatrix}$
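The link between HD*H and the Schur complement L* = L/Z can be checked numerically. A sketch under stated assumptions: the tree topology is inferred from the slides, branch lengths are made up, and each edge's affinity is taken to be 1/length. With that convention the check comes out as HD*H = -2(L*)⁺, that is, (HD*H)⁺ = -L*/2; where the factor 2 sits depends on how L is normalized, and no scalar changes the eigenvectors or their signs.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are ASSUMED.
EDGES = {(1, 6): 1.0, (2, 6): 1.0, (6, 8): 3.0, (5, 8): 1.0,
         (7, 8): 1.0, (3, 7): 1.0, (4, 7): 1.0}
# Leaf-to-leaf path lengths on that tree.
D_star = np.array([[0, 2, 6, 6, 5],
                   [2, 0, 6, 6, 5],
                   [6, 6, 0, 2, 3],
                   [6, 6, 2, 0, 3],
                   [5, 5, 3, 3, 0]], dtype=float)

L = np.zeros((8, 8))
for (u, v), length in EDGES.items():       # affinity = 1 / length (assumed)
    w = 1.0 / length
    L[u - 1, u - 1] += w
    L[v - 1, v - 1] += w
    L[u - 1, v - 1] -= w
    L[v - 1, u - 1] -= w

W, X = L[:5, :5], L[:5, 5:]                # leaves first, interior last
Y, Z = L[5:, :5], L[5:, 5:]
L_star = W - X @ np.linalg.inv(Z) @ Y      # Schur complement L/Z

H = np.eye(5) - np.ones((5, 5)) / 5
M = H @ D_star @ H
M_pinv = np.linalg.pinv(M, rcond=1e-10)    # cutoff drops the numerically zero eigenvalue
print(np.allclose(M_pinv, -0.5 * L_star))
```

The explicit `rcond` is a design choice: the zero eigenvalue of HD*H is only zero up to rounding, and a loose cutoff keeps it from being inverted.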

Page 25

Recall: Our solution

• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree

• The smallest eigenvalue of HD*H (negative semidefinite) corresponds to the smallest positive eigenvalue of L*, with the same eigenvector

• In fact, L* can be seen as a graph Laplacian
– And our solution, Y, is the Fiedler vector of that graph!

• But what does this graph look like?

Page 26

Schur complementation of a vertex

• The vertices adjacent to 8 become adjacent to each other
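Eliminating one vertex by Schur complement can be sketched as follows. The tree topology is inferred from the slides, the branch lengths and the 1/length affinity convention are assumptions, and `schur_eliminate` is a hypothetical helper name.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are ASSUMED.
EDGES = {(1, 6): 1.0, (2, 6): 1.0, (6, 8): 3.0, (5, 8): 1.0,
         (7, 8): 1.0, (3, 7): 1.0, (4, 7): 1.0}

L = np.zeros((8, 8))
for (u, v), length in EDGES.items():       # affinity = 1 / length (assumed)
    w = 1.0 / length
    L[u - 1, u - 1] += w
    L[v - 1, v - 1] += w
    L[u - 1, v - 1] -= w
    L[v - 1, u - 1] -= w

def schur_eliminate(L, v):
    """Schur complement of L at 0-based index v (vertex v is removed)."""
    keep = [i for i in range(L.shape[0]) if i != v]
    return L[np.ix_(keep, keep)] - np.outer(L[keep, v], L[v, keep]) / L[v, v]

L_minus_8 = schur_eliminate(L, v=7)        # rows/columns are now vertices 1..7
# Affinities among the former neighbours of vertex 8, i.e. vertices 5, 6, 7.
print(np.round(-L_minus_8[4:7, 4:7], 4))
```

Vertices 5, 6 and 7 were pairwise non-adjacent in the tree; after the elimination each pair carries affinity w_i8 · w_j8 / L_88, and the result still has zero row sums.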

Page 27

Schur complementation of the interior

• The graph described by L* is fully connected
– All cuts yield connected subgraphs ⇒ No help from Fiedler
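That the interior-eliminated graph is complete can be seen on a small case. A sketch with assumed branch lengths and the 1/length affinity convention (both assumptions), using a topology inferred from the slides.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are ASSUMED.
EDGES = {(1, 6): 1.0, (2, 6): 1.0, (6, 8): 3.0, (5, 8): 1.0,
         (7, 8): 1.0, (3, 7): 1.0, (4, 7): 1.0}

L = np.zeros((8, 8))
for (u, v), length in EDGES.items():       # affinity = 1 / length (assumed)
    w = 1.0 / length
    L[u - 1, u - 1] += w
    L[v - 1, v - 1] += w
    L[u - 1, v - 1] -= w
    L[v - 1, u - 1] -= w

leaves, interior = [0, 1, 2, 3, 4], [5, 6, 7]
Z_inv = np.linalg.inv(L[np.ix_(interior, interior)])
L_star = (L[np.ix_(leaves, leaves)]
          - L[np.ix_(leaves, interior)] @ Z_inv @ L[np.ix_(interior, leaves)])

affinity = -L_star.copy()                  # off-diagonal entries of -L* are edge weights
np.fill_diagonal(affinity, 0.0)
print(np.round(41 * affinity, 6))          # 41/3 is det(Z) for these lengths
```

Every pair of leaves ends up with a positive affinity, so any bipartition of the leaves leaves both sides connected; the tree structure is hidden in the magnitudes of the weights rather than in which edges exist.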

Page 28

Recap thus far

• Given matrix D* of pairwise distances between leaves

• Find eigenvector Y of HD*H with the smallest eigenvalue
– Claim: The signs of the entries of Y identify a split of the tree

• Y shown to be a Fiedler vector of the Laplacian L*
– But graph of L* is fully connected, has no apparent structure

• Thus Fiedler says nothing about signs of entries of Y
– But claim requires signs to be consistent with structure of the tree

Page 29

Recap thus far

• Thus Fiedler says nothing about signs of entries of Y
– But claim requires signs to be consistent with structure of the tree

• How does L* inherit the structure of the tree?

[Figure: three panels labeled NO, NO, YES]

Page 30

The quotient rule inspires a “Schur tower”

Page 31

The quotient rule inspires a “Schur tower”

• How does this help?

Page 32

Cutpoints and connected components

• A point of articulation (or cutpoint) is a point r ∈ G whose deletion yields a subgraph with at least 2 connected components
– Cutpoints: 6,7,8
– Shown: {1}, {2}, {3,4,5,7,8} are connected components at 6

• The cutpoints of a tree are its internal nodes

Page 33

The key observation (i.e. theorem)

• Let L be the Laplacian of a graph G with some cutpoint v
– Let L{v} be the Laplacian of G{v} obtained by Schur complement at v

• Then the Fiedler cut of G{v} identifies a split of G
– Here the Fiedler cut of G{6} is {{1,2,5,8},{3,4,7}}
– Including 6 in {1,2,5,8} defines two connected components in G

[Figure: G and G{6}. The Fiedler vector of G{6} has four positive entries (+0.5828, +0.4660, +0.0570, +0.0380) and three negative entries (-0.3870, -0.4129, -0.3439); in G the same vertices carry the signs + and -, and vertex 6 is marked "?"]
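The theorem can be illustrated on a small case. A sketch with the topology inferred from the slides but assumed branch lengths, so the cut found here need not match the one in the slide's figure; the point is only that the cut, with the eliminated vertex placed on one side, divides the tree along a single edge. `crossing_edges` is a hypothetical helper name.

```python
import numpy as np

# Topology inferred from the slides; branch lengths are ASSUMED.
EDGES = {(1, 6): 1.0, (2, 6): 1.0, (6, 8): 3.0, (5, 8): 1.0,
         (7, 8): 1.0, (3, 7): 1.0, (4, 7): 1.0}

L = np.zeros((8, 8))
for (a, b), length in EDGES.items():       # affinity = 1 / length (assumed)
    w = 1.0 / length
    L[a - 1, a - 1] += w
    L[b - 1, b - 1] += w
    L[a - 1, b - 1] -= w
    L[b - 1, a - 1] -= w

v = 6                                       # cutpoint to eliminate
labels = [u for u in range(1, 9) if u != v]
idx = [u - 1 for u in labels]
L_v = L[np.ix_(idx, idx)] - np.outer(L[idx, v - 1], L[v - 1, idx]) / L[v - 1, v - 1]

eigvals, eigvecs = np.linalg.eigh(L_v)
y = eigvecs[:, 1]                           # Fiedler vector of G{v}
side_a = {u for u, yi in zip(labels, y) if yi > 0}
side_b = set(labels) - side_a

def crossing_edges(part):
    """Number of tree edges with exactly one endpoint in `part`."""
    return sum((a in part) != (b in part) for a, b in EDGES)

# A bipartition of a tree has both sides connected iff exactly one edge crosses it.
is_split = crossing_edges(side_a | {v}) == 1 or crossing_edges(side_b | {v}) == 1
print(side_a, side_b, is_split)
```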

Page 34

The quotient rule inspires a “Schur tower”

• How does this help? Look at Schur paths to graph with Laplacian L*

[Figure: Schur tower with L at one end and L* at the other]

Page 35

The punch line

• The graph with Laplacian L* can be obtained in three ways

• The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8}
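The statement that L* can be reached along several Schur paths follows from the quotient rule: eliminating vertices one at a time gives the same result in any order. A numerical sketch with assumed branch lengths and the 1/length affinity convention; `eliminate` is a hypothetical helper name.

```python
import itertools
import numpy as np

# Topology inferred from the slides; branch lengths are ASSUMED.
EDGES = {(1, 6): 1.0, (2, 6): 1.0, (6, 8): 3.0, (5, 8): 1.0,
         (7, 8): 1.0, (3, 7): 1.0, (4, 7): 1.0}

L = np.zeros((8, 8))
for (a, b), length in EDGES.items():       # affinity = 1 / length (assumed)
    w = 1.0 / length
    L[a - 1, a - 1] += w
    L[b - 1, b - 1] += w
    L[a - 1, b - 1] -= w
    L[b - 1, a - 1] -= w

def eliminate(M, labels, v):
    """Schur complement at the vertex labelled v; returns reduced matrix and labels."""
    i = labels.index(v)
    keep = [k for k in range(len(labels)) if k != i]
    M_new = M[np.ix_(keep, keep)] - np.outer(M[keep, i], M[i, keep]) / M[i, i]
    return M_new, [labels[k] for k in keep]

results = []
for order in itertools.permutations([6, 7, 8]):
    M, labels = L, list(range(1, 9))
    for v in order:                         # one rung of the Schur tower per step
        M, labels = eliminate(M, labels, v)
    results.append(M)

same = all(np.allclose(results[0], R) for R in results[1:])
print(same)
```

The three ways on the slide correspond to the last step: eliminating 8 from G{6,7}, 7 from G{6,8}, or 6 from G{7,8}.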


Page 37

Recall: Example

• Find eigenvector Y of HD*H with the smallest eigenvalue:
Y = (+0.5793, +0.4418, -0.0564, -0.4636, -0.5011)

• Signs of Y identify the split {{1,2},{3,4,5}}

Page 38

The punch line

• The graph with Laplacian L* can be obtained in three ways

• The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8}

• This implies that the cut splits the progenitor graph G!
– {{1,2,6},{3,4,5,7,8}}

Page 39

Our solution actually works

• Let H be the centering matrix: $H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{T}$, with n the number of leaves

• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree

The phylogenetic portion D*

Page 40

A recipe for tree reconstruction

1. Find a split
– NJ relies on a theorem that guarantees a (2,n-2) split from the Q matrix
– We have a theorem that guarantees splits from the HD*H matrix

2. Use knowledge of split to reduce dimension
– NJ prunes the cherry (neighboring taxa) to reduce leaves by one
– We use a divisive method that reduces to pairs of subtrees

3. Iterate until tree has been fully reconstructed
– Tree topology specified by its split set

Page 41

Reconstruction from the inside out

Page 42

Connections with Classical MDS and PCoA

• Classical solution to multidimensional scaling
– a.k.a. Principal coordinate analysis

• Recipe for dimension reduction given distance matrix D:
1. Construct matrix A from D entrywise: x → -x²/2
2. Double centering: B = HAH
3. Find k largest eigenvalues λi of B with corresponding eigenvectors Xi
4. Coordinates of point Pr given by row r of eigenvector entries

• k = 1 with sqrt of tree distance equivalent to our approach
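The recipe above, and its k = 1 equivalence with the sign-based split, can be sketched as follows. The distance matrix is hypothetical (assumed branch lengths), `classical_mds` is an illustrative name, and scaling each eigenvector by the square root of its eigenvalue is the conventional normalization, assumed here; it does not affect signs.

```python
import numpy as np

# Hypothetical tree distances between five leaves; branch lengths are assumed.
D_star = np.array([[0, 2, 6, 6, 5],
                   [2, 0, 6, 6, 5],
                   [6, 6, 0, 2, 3],
                   [6, 6, 2, 0, 3],
                   [5, 5, 3, 3, 0]], dtype=float)

def classical_mds(D, k):
    """Classical MDS / PCoA coordinates (assumes the top k eigenvalues are positive)."""
    n = D.shape[0]
    A = -0.5 * D ** 2                          # step 1: x -> -x^2 / 2
    H = np.eye(n) - np.ones((n, n)) / n
    B = H @ A @ H                              # step 2: double centering
    eigvals, eigvecs = np.linalg.eigh((B + B.T) / 2)
    top = np.argsort(eigvals)[::-1][:k]        # step 3: k largest eigenvalues
    return eigvecs[:, top] * np.sqrt(eigvals[top])   # step 4: row r = point r

# Feeding in sqrt(tree distance) makes A = -D*/2, so B = -(1/2) H D* H and the
# top eigenvector of B is the bottom eigenvector of H D* H.
coords = classical_mds(np.sqrt(D_star), k=1)
pos = frozenset(i + 1 for i in range(5) if coords[i, 0] > 0)
split = {pos, frozenset(range(1, 6)) - pos}
print(np.round(coords[:, 0], 4), split)
```

Splitting the first coordinate at 0 reproduces the leaf split obtained from the signs of Y.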

Page 43

Phylogenetic ordination

• PCoA on sequence data with k = 3:
– For appropriate distance, C1 (x-axis) guaranteed to split taxa at 0

• Our results support popular use of PCoA
– Provided that the right distance is considered…

Page 44

Conclusion I

• Natural connection between matrix of pairwise distances and the Laplacian of a complete graph

Page 45

Conclusion II

• Structure of tree embedded in complete graph and recoverable via spectral theory

• Notion of “Fiedler cut” extends concept to “Fiedler split”
– Inheritance propagated through Schur tower

[Figure: three panels labeled NO, NO, YES]

Page 46

Conclusion III

• Results inspire fast divisive tree reconstruction method

Page 47

Conclusion IV

• Provides guidance and justification for ordination approach

Page 48

Acknowledgements

• Alex Griffing (NCSU Bioinformatics)
• Carl Meyer (NCSU Math)
• Amy Langville (CoC Math)