CS 336 March 19, 2012
Tandy Warnow
Basic Graph Terminology
• Nodes, vertices, edges, degrees, paths, cycles, connected components, adjacency, isolated vertices, trees, forests
• Directed graphs: indegree, outdegree, trees
Advanced terminology
• Cliques
• Independent sets
• Chromatic number and vertex colorings
• Eulerian cycles and Eulerian paths
• Hamiltonian paths
• Matchings
• Dominating Set
• Vertex Cover
Paths, Connected Components, etc.
• A path is a sequence of vertices v1, v2, …, vn so that vi is adjacent to vi+1 for i=1,2,…,n-1. A simple path is one that does not have repeated vertices.
• A graph is connected if every pair of vertices in the graph is connected by some path.
• A connected component is a maximal subset of the vertices that is connected.
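The definitions above can be sketched in code. This is a minimal illustration, assuming the graph is stored as a dictionary mapping each vertex to a list of its neighbors; the names `connected_components`, `is_connected`, and `example` are made up for this sketch.

```python
from collections import deque

def connected_components(adj):
    """Return the connected components of an undirected graph.

    adj maps each vertex to an iterable of its neighbors (an assumed format).
    """
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search collects every vertex reachable from start.
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

def is_connected(adj):
    """A graph is connected if it has at most one connected component."""
    return len(connected_components(adj)) <= 1

# Three components: a path 1-2-3, an edge 4-5, and the isolated vertex 6.
example = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
```

Each component is maximal because the search only stops when no unseen neighbor remains.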
Cycles
• A cycle in a graph is a path that starts and ends at the same vertex.
• A simple cycle is a cycle that does not have any repeated vertices (other than the start and end vertex).
• A graph is acyclic if it has no simple cycles.
Trees
• Two types: rooted and unrooted
• Unrooted (simplest): acyclic connected graph
• Rooted: take an unrooted tree, pick one node to be the root, and direct all edges away from the root. Voila!
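As a rough sketch of both notions, the code below checks whether a simple undirected graph is an unrooted tree and then roots it by directing edges away from a chosen node. The adjacency-dictionary format and the names `is_tree` and `root_tree` are assumptions made for this example.

```python
def is_tree(adj):
    """Check that a simple undirected graph is connected and acyclic."""
    if not adj:
        return False
    start = next(iter(adj))
    parent = {start: None}
    stack = [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w == parent[v]:
                continue
            if w in parent:
                # w was already reached along another route, so there is a cycle.
                return False
            parent[w] = v
            stack.append(w)
    # Acyclic so far; it is a tree only if every vertex was reached.
    return len(parent) == len(adj)

def root_tree(adj, root):
    """Direct every edge of an unrooted tree away from root.

    Returns a dictionary mapping each node to the list of its children.
    """
    children = {v: [] for v in adj}
    stack = [(root, None)]
    while stack:
        v, parent = stack.pop()
        for w in adj[v]:
            if w != parent:
                children[v].append(w)
                stack.append((w, v))
    return children

star = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
```

Rooting `star` at `"a"` makes `"b"` the only child of `"a"`, with `"c"` and `"d"` as children of `"b"`.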
Theorems about trees
Let T be a connected acyclic graph (i.e., a tree) with n vertices (n>0). Then:
• T has at least one leaf (node with degree 0 or 1).
• T has n-1 edges.
• Every edge in T is a cut-edge.
• Every tree can be 2-colored.
Theorem: Every tree has at least one leaf (node of degree 0 or 1)
Theorem: For any tree T with at least one vertex, T has at least one leaf (node with degree 0 or 1).
Proof:
• If n=1, then T is a single vertex, which is a leaf.
• Else, n>1. Let P be a longest simple path in T, so P=v1,v2,…,vk.
• If vk has degree 1, we are done. Otherwise, vk has at least two neighbors, and so some neighbor w other than vk-1. If w is in P, then we have a simple cycle in T, contradicting that T is a tree. If w is not in P, then we can extend P and get a longer path, contradicting that P is a longest simple path in T.
• Hence, vk has degree 1, and we are done.
Theorem: Any tree with n>0 nodes has n-1 edges
• Proof: by induction on n.
• Base case: n=1 (trivial)
• Inductive hypothesis: for some positive integer n, any tree on n nodes has exactly n-1 edges.
• Let T be a tree on n+1 nodes. We want to show T has exactly n edges.
Proof (cont’d)
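• Let v be a node in T with degree 1. (One exists: T has a leaf by the previous theorem, and since T is connected with at least 2 nodes, that leaf cannot have degree 0.)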
• Remove v from T. The result is a tree T’ with n nodes, and hence n-1 edges (by the inductive hypothesis)
• T’ contains one fewer edge and one fewer vertex (node) than T, and so T has n edges.
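The inductive step can be mirrored in code by repeatedly removing a leaf and counting the edges removed along the way. This is only an illustrative sketch: it assumes the input really is a tree given as an adjacency dictionary, and `count_edges_by_peeling` is a name invented here.

```python
def count_edges_by_peeling(adj):
    """Count the edges of a tree by removing one leaf at a time.

    Assumes adj describes a tree; on other graphs the search for a
    degree-1 node may fail.
    """
    adj = {v: set(neighbors) for v, neighbors in adj.items()}  # private copy
    removed_edges = 0
    while len(adj) > 1:
        # A tree with at least 2 nodes always has a node of degree 1.
        leaf = next(v for v in adj if len(adj[v]) == 1)
        (neighbor,) = adj[leaf]
        adj[neighbor].discard(leaf)
        del adj[leaf]
        removed_edges += 1  # one node and one edge disappear together
    return removed_edges

tree = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
```

Each pass removes exactly one node and one edge until a single node remains, so a tree on n nodes yields n-1.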
Theorem: every edge in a tree is a cut-edge
Proof (by contradiction).
• Suppose T is a tree, and e=(v,w) is an edge in T that is not a cut-edge.
• Then G=T-{e} (but keeping v and w) is connected. Hence there is a simple path P from v to w in G. Since e is not in G, P does not include edge e.
• Therefore, we can form a simple cycle C by adding edge e to P.
• Since every edge in C is in T, this means that T is not acyclic, contradicting the assumption that T is a tree (connected acyclic graph).
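For a concrete check of this theorem, the sketch below tests whether an edge is a cut-edge by searching for a path from v to w that avoids the edge. It assumes a connected simple graph stored as an adjacency dictionary; `is_cut_edge` is a hypothetical helper name.

```python
def is_cut_edge(adj, v, w):
    """Return True if deleting the edge {v, w} leaves no path from v to w.

    Assumes a connected simple undirected graph containing the edge {v, w}.
    """
    seen = {v}
    stack = [v]
    while stack:
        x = stack.pop()
        for y in adj[x]:
            if {x, y} == {v, w}:
                continue  # behave as if the edge had been deleted
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return w not in seen

tree = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
```

Every edge of `tree` is reported as a cut-edge, while no edge of `triangle` is, since the other two edges of the cycle give an alternative path.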
Vertex Coloring
• A (proper) vertex coloring of a graph is a function c: V -> {1,2,…,k}, s.t. no two adjacent vertices are mapped to the same color.
• The chromatic number of a graph is the minimum number of colors needed to properly color the graph.
• How many colors does a tree need?
2-coloring a tree
• Theorem: every connected acyclic graph (i.e., tree) can be 2-colored.
• Proof: by induction on the number of vertices.
Proof that every tree can be 2-colored
• Let G be a tree on n vertices. The base case is n=1. Clearly every tree on 1 vertex can be 2-colored.
• The Inductive Hypothesis is that for some positive integer n, any tree on n vertices can be 2-colored.
• Let G be a tree with n+1 vertices. We want to show that G can be 2-colored.
Proof (cont’d)
• Let v be a node in G that has degree 1, and let w be its unique neighbor in G.
• Consider the graph G’ formed by deleting v (and its incident edge but not w) from G.
• G’ is also connected and acyclic (why?) and has n vertices.
• Therefore, by the inductive hypothesis, G’ can be 2-colored.
• We extend the coloring from G’ to G by letting c(v) be 1 if c(w)=2, and c(v)=2 if c(w)=1.
• Note that this coloring is proper for G.
• Hence G can be 2-colored.
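One way to compute such a coloring in practice is sketched below. Note that it uses a breadth-first traversal rather than the leaf-removal induction of the proof: each newly reached node gets the color its already-colored neighbor does not have. The adjacency-dictionary format and the names `two_color_tree` and `is_proper` are assumptions for this example.

```python
from collections import deque

def two_color_tree(adj):
    """Assign colors 1 and 2 to the vertices of a tree.

    Assumes adj is a non-empty tree; the result maps each vertex to its color.
    """
    start = next(iter(adj))
    color = {start: 1}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in color:
                color[w] = 3 - color[v]  # swaps 1 and 2
                queue.append(w)
    return color

def is_proper(adj, color):
    """A coloring is proper if no edge joins two vertices of the same color."""
    return all(color[v] != color[w] for v in adj for w in adj[v])

tree = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b"], "e": ["d"]}
tree["d"] = ["b", "e"]  # make the edge d-e symmetric
coloring = two_color_tree(tree)
```

On a graph with an odd cycle the same traversal would still return a coloring, but `is_proper` would reject it.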
Structural Induction
• This was a proof by structural induction.
• Proofs by structural induction can be applied more generally!
Theorem about rooted trees
• A rooted tree in which every node has 0 or 2 children is called a “binary tree”
• Theorem: every binary tree with n nodes has (n-1)/2 internal nodes (defined to be nodes with more than 0 children).
• Proof: by strong induction on n.
• Base case: n=1. Such a tree has no internal nodes, so it is true.
Proof, cont’d.
• Strong Inductive hypothesis: for some n>0, and for all positive integers k up to n, all rooted binary trees with k nodes have (k-1)/2 internal nodes.
• Let T have n+1 nodes, and let the children of the root be A and B. (We know the root has two children, since if it had no children, T would have 1 node, contradicting our hypothesis.)
We want to show Int(T) = n/2
• TA, the subtree of T rooted at A, is a binary tree; let nA be the number of nodes in TA
• TB, the subtree of T rooted at B, is a binary tree; let nB be the number of nodes in TB
• Let Int(T) be the number of internal nodes of T, and Int(TA) and Int(TB) be similarly defined.
• Then nA and nB are both at most n, and by the inductive hypothesis
Int(TA) = (nA-1)/2
Int(TB) = (nB-1)/2
• Therefore, since the internal nodes of T are the root together with the internal nodes of TA and TB,
Int(T) = (nA-1)/2 + (nB-1)/2 + 1
We have established that
Int(T) = (nA-1)/2 + (nB-1)/2 + 1
Simplifying this, we get
Int(T) = (nA-1 + nB -1 + 2)/2 = (nA + nB)/2
Note nT = nA + nB + 1
Therefore,
Int(T) = (nT - 1)/2
Recall that nT = n+1. Therefore,
Int(T) = n/2
Q.E.D.
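The recursion in this proof translates directly into code. The sketch below assumes a made-up representation: a leaf is the empty tuple `()` and an internal node is a pair `(left, right)` of subtrees; `count_nodes` is a name chosen for this example.

```python
def count_nodes(tree):
    """Return (number of nodes, number of internal nodes) of a binary tree.

    A leaf is (), an internal node is a pair (left_subtree, right_subtree).
    """
    if tree == ():
        return 1, 0  # base case: one node, no internal nodes
    left, right = tree
    n_left, int_left = count_nodes(left)
    n_right, int_right = count_nodes(right)
    # The root adds one node and one internal node to the two subtrees.
    return n_left + n_right + 1, int_left + int_right + 1

leaf = ()
example = ((leaf, leaf), (leaf, (leaf, leaf)))
n, internal = count_nodes(example)
```

For `example`, the tree has 9 nodes and 4 internal nodes, matching (n-1)/2.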
Genome Assembly
• Given a DNA sequence, technology can allow you to get a collection of k-mers (substrings of length k) that come from analyses of the sequence.
• From these k-mers, your objective is to come up with the sequence.
Genome Assembly
• Let X be a very long DNA sequence
• Consider all k-mers in X, with k big enough so that no k-mer appears two or more times
• Goal: reconstruct X from its set of k-mers
Genome Assembly, attempt #1
Approach 1:
• Make a node for each k-mer, and put a directed edge from v to w if the k-1 suffix of v is the k-1 prefix of w.
• Create the graph for the following string, using k=5: ACATAGGATTCAC
• Every such graph has a Hamiltonian Path, as long as no k-mer appears more than once!
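A small sketch of Approach 1, assuming k-mers are plain Python strings; the function names `kmers` and `overlap_graph` are invented for this illustration.

```python
def kmers(sequence, k):
    """List every substring of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def overlap_graph(kmer_set):
    """One node per k-mer; an arc v -> w when the (k-1)-suffix of v
    equals the (k-1)-prefix of w."""
    return {
        v: sorted(w for w in kmer_set if v[1:] == w[:-1])
        for v in kmer_set
    }

graph = overlap_graph(set(kmers("ACATAGGATTCAC", 5)))
```

For this string the nine 5-mers are all distinct, and each k-mer has an arc to the k-mer that follows it in the sequence.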
Hamiltonian Path
• A Hamiltonian Path in a graph visits every node exactly once
Genome Assembly, attempt #1
• Create the graph for the following string, using k=5: ACATAGGATTCAC
• Does the graph have a Hamiltonian Path?
• Is it unique?
• Can you reconstruct the sequence from the path?
Hamiltonian Path
• Determining if a graph has a Hamiltonian Path is NP-Complete
• So this approach to Genome Assembly is computationally intensive (infeasible)
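To make the cost concrete, here is a brute-force sketch that enumerates Hamiltonian paths by backtracking. It can take exponential time in the worst case, which is why it is only usable on tiny inputs. The names `hamiltonian_paths` and `spell` are hypothetical, and the graph is built inline from the 5-mers of the example string.

```python
def hamiltonian_paths(graph):
    """Yield every Hamiltonian path of a directed graph by backtracking.

    graph maps each node to a list of its successors. Worst-case running
    time is exponential in the number of nodes.
    """
    def extend(path, visited):
        if len(path) == len(graph):
            yield list(path)
            return
        for w in graph[path[-1]]:
            if w not in visited:
                visited.add(w)
                path.append(w)
                yield from extend(path, visited)
                path.pop()
                visited.remove(w)

    for start in graph:
        yield from extend([start], {start})

def spell(path):
    """Rebuild a sequence from consecutive k-mers that overlap in k-1 letters."""
    return path[0] + "".join(kmer[-1] for kmer in path[1:])

# Overlap graph on the 5-mers of the example string.
reads = {"ACATAGGATTCAC"[i:i + 5] for i in range(9)}
graph = {v: [w for w in reads if v[1:] == w[:-1]] for v in reads}
paths = list(hamiltonian_paths(graph))
```

On this small example the search finishes quickly because every node has at most one successor; in general the number of partial paths explored can grow exponentially.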
Eulerian Cycles
• An Eulerian cycle is one that goes through every edge exactly once
• It is easy to see that if a graph has an Eulerian cycle, then every node has even degree. The converse is also true for connected graphs, but a bit harder to prove.
• For directed graphs, the cycle will need to follow the direction of the edges (also called “arcs”). In this case, a connected graph has an Eulerian cycle if and only if the indegree is equal to the outdegree for every node.
Eulerian Paths
• An Eulerian path is one that goes through every edge exactly once
• It is easy to see that if a graph has an Eulerian path, then all but at most 2 nodes have even degree. The converse is also true for connected graphs, but a bit harder to prove.
• For directed graphs, the path will need to follow the direction of the edges (also called “arcs”). In this case, a connected graph has an Eulerian path that is not a cycle if and only if indegree(v)=outdegree(v) for all but 2 nodes (x and y), where indegree(x)=outdegree(x)+1, and indegree(y)=outdegree(y)-1.
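The degree conditions for directed graphs are cheap to test. The sketch below checks only those conditions and does not verify that the graph is connected; the arc-list format and the names `degree_imbalance` and `euler_degree_condition` are assumptions for this example.

```python
from collections import Counter

def degree_imbalance(arcs):
    """Map each node to outdegree minus indegree, given a list of (v, w) arcs."""
    balance = Counter()
    for v, w in arcs:
        balance[v] += 1
        balance[w] -= 1
    return balance

def euler_degree_condition(arcs):
    """Classify a directed graph by its degree conditions only.

    Returns "cycle" if indegree equals outdegree everywhere, "path" if
    exactly one node has one extra outgoing arc and one node has one extra
    incoming arc, and None otherwise. Connectivity is NOT checked.
    """
    unbalanced = [b for b in degree_imbalance(arcs).values() if b != 0]
    if not unbalanced:
        return "cycle"
    if sorted(unbalanced) == [-1, 1]:
        return "path"
    return None
```

In the notation of the slide, the node with balance +1 is y (where the path must start) and the node with balance -1 is x (where it must end).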
de Bruijn Graph
Input: the set of k-mers for the DNA sequence
Output: the de Bruijn Graph
• Vertices: the (k-1)-mers
• Directed edges: from v->w if the (k-2)-suffix of v is the (k-2)-prefix of w, and the k-mer formed by starting with v and ending with w is one of the k-mers in the input
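A minimal sketch of this construction, assuming the input k-mers are Python strings: each k-mer contributes one arc from its (k-1)-prefix to its (k-1)-suffix. The name `de_bruijn_graph` is chosen for this example.

```python
def de_bruijn_graph(kmer_set):
    """Build the de Bruijn graph of a set of k-mers.

    Nodes are (k-1)-mers; each input k-mer adds one arc from its
    (k-1)-prefix to its (k-1)-suffix. Returns {node: [successors]}.
    """
    graph = {}
    for kmer in sorted(kmer_set):
        prefix, suffix = kmer[:-1], kmer[1:]
        graph.setdefault(prefix, []).append(suffix)
        graph.setdefault(suffix, [])  # nodes without outgoing arcs still appear
    return graph

small = de_bruijn_graph({"ACATA", "CATAG"})
```

Here `small` has three nodes, ACAT, CATA and ATAG, joined by two arcs, one per input k-mer.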
de Bruijn Graph
• If the k-mer set comes from a sequence and no k-mer appears more than once in the sequence, then the de Bruijn graph has an Eulerian path!
Using de Bruijn Graphs
Given: set of k-mers from a DNA sequence
Algorithm:
• Construct the de Bruijn graph
• Find an Eulerian path in the graph
• The path defines a sequence with the same set of k-mers as the original
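The three steps can be sketched end to end as follows. The slides do not say how to find the Eulerian path; this example uses Hierholzer's algorithm as one standard choice, and it assumes an Eulerian path exists. All function names here are invented for the illustration.

```python
def de_bruijn_graph(kmer_set):
    """Nodes are (k-1)-mers; each k-mer is an arc from its prefix to its suffix."""
    graph = {}
    for kmer in sorted(kmer_set):
        graph.setdefault(kmer[:-1], []).append(kmer[1:])
        graph.setdefault(kmer[1:], [])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    remaining = {v: list(successors) for v, successors in graph.items()}
    indegree = {v: 0 for v in graph}
    for successors in graph.values():
        for w in successors:
            indegree[w] += 1
    # Start where outdegree exceeds indegree by one, if such a node exists;
    # otherwise (Eulerian cycle) start at any node with an outgoing arc.
    start = next((v for v in graph if len(graph[v]) == indegree[v] + 1), None)
    if start is None:
        start = next(v for v in graph if graph[v])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining[v]:
            stack.append(remaining[v].pop())  # follow an unused arc
        else:
            path.append(stack.pop())  # dead end: v is final in the path
    return path[::-1]

def reconstruct(kmer_set):
    """Spell the sequence defined by an Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(kmer_set))
    return path[0] + "".join(node[-1] for node in path[1:])

reads = {"ACATAGGATTCAC"[i:i + 5] for i in range(9)}
assembled = reconstruct(reads)
```

Unlike the Hamiltonian-path approach, each arc is handled a constant number of times, so the running time is linear in the number of k-mers once the graph is built.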
de Bruijn Graph
• Create the de Bruijn graph for the following string, using k=5: ACATAGGATTCAC
• Find the Eulerian path
• Is the Eulerian path unique?
• Reconstruct the sequence from this path