iddo-mccreight_slides on suffix tree updates

Upload: rickjdk1

Post on 06-Apr-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    1/75

    A SpaceA Space--Economical Suffix TreeEconomical Suffix Tree

    Construction AlgorithmConstruction AlgorithmEdward M. McCreight (1976)Edward M. McCreight (1976)

    {{ From Ukkonen to McCreight and Weiner: A Unifying }From Ukkonen to McCreight and Weiner: A Unifying }

    View of LinearView of Linear--Time Suffix Tree ConstructionTime Suffix Tree Construction

    R. Giegerich and S. Kurtz (1997)R. Giegerich and S. Kurtz (1997)

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    2/75

    OverviewOverview

    Algorithm for constructing auxiliary digitalsearch trees to help in search operations ofsubstrings.

    Advantages over other algorithms:Economical in Space.

    We describe the algorithm Incremental changing of the search tree

    corresponding to changes in the text.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    3/75

    MotivationMotivation

    Text editors Automatic command completion

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    4/75

    Constructing a Suffix TreeConstructing a Suffix Tree

    AlgorithmAlgorithm Given a string S, we build an index to S in the

    form of a search tree T, whose paths are thesuffixes of S.

    Each path starting from the root of this treerepresents a different suffix.

    An edge is labeled with a string.

    the concatenation of these labels on through apath gives us a suffix.

    Each leaf correspond uniquely to positions within

    S.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    5/75

    Mapping the string

    ababc into asuffix tree.

    ab

    abc

    bc

    abc c

    root

    c

    ExampleExample

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    6/75

    Constructing a Suffix TreeConstructing a Suffix Tree

    Algorithm by McCreight:Algorithm by McCreight:

    denoteddenoted mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    7/75

    AlgorithmAlgorithm mccmcc

    The algorithm requires that:

    S1. The final character of the string S shouldnot appear elsewhere in S.

    S1 yields:

    1. No suffix of S is a prefix of a differentsuffix of S.

    2. There is a leaf for each suffix of S.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    8/75

    AlgorithmAlgorithm mccmcc

    Constraints on the TreeConstraints on the Tree

    T1. An edge of T may represent any

    nonempty substring of S.

    T2. Each internal node of T, except the root,

    must have at least two outgoing edges.

    T3. Siblings edges represent substrings with

    different starting characters.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    9/75

    AlgorithmAlgorithm mccmcc

    Constraints on the TreeConstraints on the Tree

    Since every leaf maps uniquely to a suffix

    ofS, then T2 yields that the number oninternal nodes in Tn=|S| (since every

    branching yields another leaf). Proposition: The mapping ofSinto T,

    unique, up to order among siblings.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    10/75

    Mapping the string

    ababc into asuffix tree.

    ab

    abc

    bc

    abc c

    root

    c

    ExampleExample

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    11/75

    DefinitionsDefinitions the alphabet

    We use a,b,c,dto denote characters in .

    p, q, s, t, u, v, w, y,z to denote strings.

    Ift = uvw for some strings (possibly empty)u,v,w then u is aprefix oft, vis a t -word, and

    wis asuffix oft.

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    12/75

    DefinitionsDefinitionsA prefix or suffix oftisproper, if it is

    different from t.Bypath(k) we denote the concatenation of the

    edge labels on the path from the root ofTto thenode k.

    By T3path labels are unique and we can

    denote kby w, if and only ifpath(k) =w.

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    13/75

    DefinitionsDefinitionsAnother terminology (McCreight):

    Definition:

    Node kis called the locus of the string uv, ifthe path from the root to kdenotes uv.

    hence, the locus ofuv is uv.

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    14/75

    DefinitionsDefinitionsExample:

    AlgorithmAlgorithm mccmcc

    root

    u

    v

    uv

    u

    k

    Path(k) = uv k=uv

    The locus ofuv

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    15/75

    DefinitionsDefinitions The Extended Locus of a string u is the locus of the

    shortest extension ofu, uw (w is possibly empty), .s.t.uw is a node in T.

    Example: u=hello, w=p

    AlgorithmAlgorithm mccmcc

    roothell

    uw

    The extended

    locus of u

    op

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    16/75

    DefinitionsDefinitions The Contracted Locus of a string u is the locus of

    the longest prefix ofu,x (x is possibly empty), s.t.x

    is a node in T.

    Example: u=hello, x=hell

    AlgorithmAlgorithm mccmcc

    roothell

    uw

    The contracted

    locus of u

    op

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    17/75

    Let Sbe our main string Sufi is the suffix ofSbeginning at the i

    th

    position (position are counted from 1

    suf1 = S).

    headi is the longest prefix ofsufi , which is

    also a prefix ofsufj for somej

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    18/75

    Example:

    S=ababc, suf3=abc, head3=ab, tail3=c

    suf4=bc, head3=b, tail3=c

    Constraint S1 assures that taili is never empty.

    AlgorithmAlgorithm mccmcc

    DefinitionsDefinitions

    a b c

    a b a b c

    b c

    a b a b c

    S:

    suf3:

    S:

    suf4:

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    19/75

    To build the suffix tree forababc mcc inserts

    every step i the sufi into tree Ti-1:

    AlgorithmAlgorithm mccmcc

    Overview ofOverview of mccmcc

    a b a b c

    b a b c

    a b c

    b c

    c

    Step 1

    Step 2

    Step 3

    Step 4

    Step 5

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    20/75

    To do this we have to insert every step sufi without duplicating

    its prefix in the tree, so we need to find its longest prefix in the

    tree.

    Its longest prefix in the tree is by definition headi .

    Example:

    AlgorithmAlgorithm mccmcc

    Overview ofOverview of mccmcc

    So what we do is finding the extended locus of headi in Ti-1 and its incomingedge is split by a new node which spawns a new edge labeled taili.

    ababc

    T2

    babcSuf3=abc. Since we already have the word

    ab in the tree thus we need to start fromthere bulding our new suffix. Note that

    indeed ab=head3, taili=c.

    c

    T3

    Al i h

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    21/75

    Overview ofmccs operations via example ofababc:

    AlgorithmAlgorithm mccmcc

    Overview ofOverview of mccmcc

    Step i=1rootT0

    ababcbabc

    c

    abT1

    2T2

    3T3

    4b

    c

    T45 c

    T5

    abcabc

    Al i hAl ith

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    22/75

    Notice that headi is the

    longest prefix ofsufi thatits extended locus exists

    within Ti-1.

    AlgorithmAlgorithm mccmcc

    ababc babc

    root

    T2

    We have entered 2suffixes by now.

    Overview ofOverview of mccmcc

    The extended

    locus of head3

    Al ithAl ith

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    23/75

    For efficiency we would represent each label of anedge by 2 numbers denoting its starting andending position in the main string.

    AlgorithmAlgorithm mccmcc

    The Data StructureThe Data Structure

    ab

    abc c

    bc

    abc c

    root(0,0)

    (2,2) (5,5)(1,2)

    (3,5) (5,5) (3,5) (5,5)

    1 2 3 4 5

    a b a b cS:

    Al ithAl ith

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    24/75

    Thus, the actual insertion of an edge to the

    tree takes O(1). The introduction of a new internal node and

    tailitakes O(1) , hence,

    ifmcc could find the extended locus of headiin Ti-1 in constant time, in average over all

    steps, thenmcc is linear inn. This is done by exploiting the following

    lemma:

    AlgorithmAlgorithm mccmcc

    The Data StructureThe Data Structure

    Al ithAlgorithm mcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    25/75

    Lemma 1: Ifheadi-1 = xu for some characterx and some string u

    (possibly empty), then u is a prefix ofheadi .

    Proof. headi-1=xu, hence, there is aj

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    26/75

    S=bdababdc, head5=ab, head6=bd

    AlgorithmAlgorithm mccmcc

    The Data StructureThe Data Structure

    b d a b a b d $

    h a l y h a l $

    h a l $

    S:

    Suf5:

    a b d $

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    27/75

    To exploit this we introduce Suffix Links:

    From each internal nodexu , where |x|=1, weadd a pointer to the node u. root

    AlgorithmAlgorithm mccmcc

    TheThe

    Data StructureData Structure

    xu

    u

    Suffix link

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    28/75

    Our example:

    AlgorithmAlgorithm mccmcc

    The Data StructureThe Data Structure

    ab

    abc

    b

    c

    abc cc

    root(0,0)

    (2,2)(5,5)(1,2)

    (3,5) (5,5) (3,5) (5,5)

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    29/75

    Note: All suffix links areatomic in the sensethatxu is suffix linked to u where |x| =1.

    AlgorithmAlgorithm mccmcc

    The Data StructureThe Data Structure

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    30/75

    We shall present mcc and prove by inductionon i, the step number ofmcc, that

    P1: in Ti every internal node, except perhapsthe locus ofheadi(headi), has a valid suffix

    link.

    P2: in step i mcc visits the contracted locus ofhead

    iin Ti-1 .

    P2 yields that we can use the contracted locus ofheadi-1 to jump with the suffix link to some prefixof headi. P1 assure us that there is such suffix link.

    AlgorithmAlgorithm mccmcc

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    31/75

    i=1:

    AlgorithmAlgorithm mccmcc

    Base case for P1Base case for P1root

    ababc

    P1 holds since there is no internal nodes.

    (note that head1= (=root)).

    T1

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    32/75

    i=1:

    AlgorithmAlgorithm mccmcc

    Base case for P2Base case for P2

    P2 holds since head1= and in step 1 mcc

    visits the root which is the locus of(=root) in T0 .

    root

    ababc

    T1rootT0

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    33/75

    In this substep mcc will identify strings it hadalready dealt with in the previous steps, in

    order to make a shortcut leap to the

    middle of its current head.

    AlgorithmAlgorithm mccmcc

    mccmccsubstep Asubstep A

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    34/75

    Identify 3 strings:xuw s.t.

    1. headi-1=xuw

    2. xu is the contracted locus of headi-1

    in Ti-2, i.e.xu is a node in Ti-2 . If the contracted

    locus of headi-1 in Ti-2 is the root then u=.

    3. |x| 1.x= only if headi-1= .

    AlgorithmAlgorithm mccmcc

    mccmccsubstep Asubstep A

    AlgorithmAlgorithm mcc:mcc: substep Asubstep A

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    35/75

    Algorithmg cc substep ApIllustrating substep A in the ith step: the move

    from Ti-1 to Ti

    Ti-2T

    i-1Ti

    Step i-1Step i

    Headi-1= xuw

    Headi= uwv

    xurootu

    wy

    Contracted locus

    of xuw

    xu

    root

    u

    xurootu

    w

    ywv

    w

    c

    y

    taili-1

    taili-1

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    36/75

    Our goal here is to go directly to the

    locus ofu in the tree so that we couldseach for w (substep B) and then forv

    (substep C).

    Algorithmg mcc

    mccmccsubstep Asubstep A

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    37/75

    Notice that:

    In the previous step headi-1 was found.

    Since |x| 1 then by lemma 1:

    headi = uwv for some, yet to be discovered,string (possibly empty) v.

    By induction hyp P2, mcc visitedxu in theprevious step (i-1), hence it can identifyxu.

    gg

    mccmccsubstep Asubstep A

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    38/75

    Ifu= then c root(note that root = u )

    else, c{suffix link ofxu} (note that c=u)explanation:

    u thus by definitionxu existed (as the

    contracted locus ofxuw) in Ti-2

    hence by

    P1: the internal nodexu has a suffix link.

    By P2 we rememberxu from step i-1 and

    we can now follow its suffix link.

    gg

    mccmccsubstep Asubstep A

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    39/75

    uwv = headi hence from the definition ofhead, the extended locus ofuw exists in Ti-1.

    Now we can start going down the edges,from u to find the extended locus ofuw.

    gg

    mccmccsubstep Asubstep A

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    40/75

    To rescan w:

    Find the edge that starts with the first character ofw.

    Denote the edges labelzand the node it leads tof.

    If|w| > |z| then start a recursive rescan ofw-z(orw|z| )

    fromf. If|w| |z| , then w is a prefix ofz, and we found the

    extended locus ofuw.

    Construct a new node (if needed): uw .

    d uw

    MccMccsubstep B:substep B: RescanningRescanning

    AlgorithmAlgorithm mcc:mcc: substep Bsubstep B -- rescanningrescanning

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    41/75

    p gIllustrating substep B in the ith step: rescanning

    substring w.

    Ti-1

    Step iHeadi-1= xuwHeadi= uwv

    wv

    xurootu

    w

    y

    c

    w

    y

    Ti

    xu

    root

    u

    w

    f

    c

    v

    z

    z

    fd

    In this case uw

    already exists

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    42/75

    Make the suffix link of

    xuwpoint to d .Hence, we have defined a

    suffix link to the node

    constructed in step i-1.By this and induction

    hyp P1 holds in Ti .

    mccmccsubstep B: Scanningsubstep B: Scanning

    Ti

    xu

    root

    u

    w

    c

    v

    fd

    w

    xuw

    Headi= uwv

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    43/75

    Scan the edges from din order to find the extendedlocus ofuwv.

    Since we dont know yet what is v we must scaneach character in the path from ddownward,comparing it to tail

    i-1.

    When we fall out of the tree we have found v.

    The last node in this trek is the contracted locus of

    headi in Ti-1, which proves P2. When we reach the extended locus ofuwv we

    construct the new node uwv, if needed.

    Construct the new leaf edge taili .

    mccmccsubstepsubstep CC -- ScanningScanning

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    44/75

    mccmccsubstep C: Scanningsubstep C: ScanningrootScanning for the

    requested v.Comparing each

    character of the

    downward pathbeginning at dto

    taili. When the

    comparison fails wehave reached headi .

    Ti

    xuu

    vt

    d

    w

    Headi= uwv

    w

    v

    tailit

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    45/75

    We shall prove that when we add a new node in the end

    of substep B as the locus of uw then we obeyconstraint T2 that an internal node has at least 2 son

    edges.

    Maintaining T2Maintaining T2

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    46/75

    Maintaining T2Maintaining T2

    uw

    vz

    Ti

    Lemma: In step i, at the end of substep B we add a new nodeonly ifv is empty.

    Proof. In step i, Ifv is not empty then headi=uwv andheadi-1=xuw hence, w.l.g. we can write S as follows:

    S=xuwz..uwv..xuwvThus, we have 2 occurences ofuw with different

    extensions, uwv, uwz, that occur already in the tree.

    Hence, there is a branching node uw.

    Position i-1

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    47/75

    Corollary: In the ith step ifv is empty then we add

    an outgoing edge from the locus ofuv=headi.Thus the only case where we add a node we add

    an outgoing edge to it.

    Maintaining T2Maintaining T2

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    48/75

    Define: resi

    = wv{taili

    } in step i .

    Hence, resi is the suffix of S rescanned and

    scanned during step i.

    Observation: For every intermediate nodef

    encountered in the rescan phase ofstep i,

    the substringz, labeling the edge tof, iscontained in resibut not in resi+1.

    Time Complexity AnalysisTime Complexity Analysis

    Time Complexity AnalysisTime Complexity Analysis

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    49/75

    Illustrating substep B in the ith step: rescanning

    substring w.

    Step i Headi= uwv

    w

    y

    Ti

    sz1

    root

    u=xs

    w

    f

    c

    v

    z1

    z2

    fd

    p y y

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    50/75

    Explanation: if we encounter nodefin step i

    during the rescan phase of substep B then fmust bean internal node in Ti-1 hence P1 yields that in Ti+1,

    f has a suffix link.

    Assume w.l.g thatf = az

    This suffix link serves us in substep A of

    step i+1 to reach the nodez, hence we do

    not have to rescan substringzagain.

    Time Complexity AnalysisTime Complexity Analysis

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    51/75

    Example:

    Time Complexity AnalysisTime Complexity Analysis

    Suffix link

    z

    root

    az

    c

    f

    u

    u

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    52/75

    Define: inti = number of intermediate nodes (f ) rescanned

    during step i. The observation yields:

    |(resi+1)| |(resi)| inti

    Hence,1 = |(resn)| |(resn-1)| intn-1 |(res1)|

    ni=1inti ( since

    intn=0)

    1 |(res1)| n

    i=1inti = n - ni=1inti

    ni=1inti n -1

    i.e., the total number of intermediate nodes rescanned n .

    Time Complexity AnalysisTime Complexity Analysis

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    53/75

    The total number of characters scanned in substep C tolocate headi (the length ofv):

    In step i the number of characters scanned during stepC is

    |(headi)|-|(head

    i-1)| +1

    since we already rescanned w (the suffix of headi-1) insubstep B. +1 comes from the first character of headi-1).

    The number of characters scanned is:ni=1 [|(headi)|-|(headi-1)|+1] = |(headn)|-|(head0)|+n = n

    Therefore, the total time complexity is O(n+n)=O(n)

    Time Complexity AnalysisTime Complexity Analysis

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    54/75

    Updating the suffix treeUpdating the suffix treeWe shall see how to update the suffix

    tree (not online), when a substring of

    the main string is being replaced by

    another.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    55/75

    Updating the suffix treeUpdating the suffix treeGoal:

    Given a string S = uwv, and its

    corresponding suffix tree, we change S, so

    that: S = uzv .

    We wish to update the suffix tree to

    represent the change in S.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    56/75

    Updating the suffix treeUpdating the suffix tree In order to make it possible to update the tree

    effectively, i.e., not change the whole tree, wewould adopt a numbering scheme representing the

    positions ofS, in which a position number need

    never change after it has been assigned. Also, the position numbers are strictly monotonic.

    Hence, the suffixes ofv, for instance, need not to

    be changed, when we change uwv to uzv.

    This requires a large pool of position numbers.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    57/75

    Paths in need to be changedPaths in need to be changed We consider what kind of paths might need to

    change, by the change: uwv uzv. Denote u* as the longest suffix ofu that appears

    elsewhere in uwv.

    Definition: a w-splitters (w.r.t uwv uzv) are thestrings of the form tv, where tis a nonempty suffixofu*w.

    Equivalently,splitters are the paths whichproperly contain v and whose last edge do notcontain wv.

    Paths in need to be changedPaths in need to be changed

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    58/75

    Paths in need to be changedPaths in need to be changedIllustrating the paths not affected by the change:

    v

    root

    Suffix = v:

    Stays as it is.

    (1) root(2)Suffix = uu*wv:

    (by definition of

    u*) yx=uu*, for

    some y,x0 and

    |xy|>0, s.t. thebranching occurs at

    the end ofy. Stays

    as it is.

    xwv

    y

    Note that due to our numbering

    scheme and data structure the label

    xwv changes automatically toxzv.

    Paths in need to be changedPaths in need to be changed

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    59/75

    Paths in need to be changedPaths in need to be changedIllustrating paths that are affected by the change:Suffix = uwv, u=suf(u*):

    root(2) Need to update the

    last edge label, since

    the position index in

    the leaf changes.wv

    uroot(1) xy=w:

    need a change,

    since the

    positions indices

    of the leaf and its

    father change.

    yv

    ux

    y

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    60/75

    Overview of the algorithmOverview of the algorithm The updating algorithm removes all w-

    splitters paths and inserts all z-splitters

    paths,

    while preserving properties T1,T2,T3.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    61/75

    Overview of the algorithmOverview of the algorithm3 stages of the algorithm, umcc:

    1. Discover u*wv, the longest w-splitter.

    2. Delete all paths tv, t=suf(u*w), from thetree.

    3. Insert all paths sv, s=suf(u*z), into the

    tree.

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    62/75

    Stage 1Stage 1Phase 1:

    Denote u(i) the suffix ofu of length i.

    Examine the paths u(1)wv, u(2)wv, u(4)wv,

    u(8)wv, until a non w-splitter is discovered, say,u(k)wv.

    Every path u(i)wv examined takes O(i)time.

    Stage 1Stage 1

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    63/75

    Stage 1Stage 1Phase 2:

    Examine the paths u(k)wv, u(k-1)wv,

    until the longest w-splitteris discovered, u*wv.Time complexity:

    This search can take full advantage of the suffix

    links, as in mcc, since kis incremented by 1, eachstep, hence it takes O(k) time.

    |u*|>k/2

    Phase 1 takes O(1+2+4++k)

    Phase 2 takes O(k)

    Hence, stage 1 takes O(u*)

    Stage 2Stage 2

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    64/75

    Stage 2Stage 2

    Delete all paths tv, t=suf(u*w), |t|0, from the

    tree.

    The deletion is done in order, from the longest to

    the shortest.

    Suppose that for all suffixes s of u*w longerthan t, the deletion of sv has been already done.

    We now consider how to delete tv.

    Stage 2Stage 2

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    65/75

    gg

    The general case is illustrated

    delete the edge labeled q andits leaf.

    If nodefhas more than 2sons than this is enough.

    Otherwise, delete nodef, and

    make kthe son ofp; labelturn toyo.

    Potential problem: anexisting suffix link tof.

    root

    y

    x

    p

    qf

    k

    m

    tv

    o

    Potentialsuffix link

    axy

    Stage 2Stage 2

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    66/75

    gg

    Denote the last internal node in the pathxu*zvbys* where |x|=1.

    We show that this problem could ariseonly for a unique node,s*.

    Lemma 2:

    1. Whenever a nodef is deleted there is nosuffix link pointing to it, except perhaps that

    of nodes*.2. Every path in Thas a suffix path, except

    perhapsxu*zv.

    Stage 2Stage 2

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    67/75

    Stage 2Stage 2proof:

    Base: (1) is trivially true. (2) is true, since we havent

    change the tree, so the only path without a suffix path is

    root

    l

    r

    p

    qf

    k

    m

    tv

    o

    arl

    the path whose suffix path is the longest

    w-splitter,xu*zv.

    Induction: (1)

    assume m is a node having its suffix

    link point tof, than m could not havean outgoing edge labeled q, since arlq

    would be a longer splitter than rlq so it

    would have been already deleted.

    Stage 2Stage 2

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    68/75

    Stage 2Stage 2

    Thus, nodef has only one son edge that has a prefix pathin T. Hence, node m has exactly 2 son edges (otherwise

    there would be more than 1 paths in Twithout a suffixpath, in contrast to induction hyp), and the path having nosuffix path must pass through m, so node m is actuallys*.

    (2)

    If we delete u*wv then since we have already changedxu*wv toxu*zv, hence its prefix path doesnt existanyway.

    If we delete a proper suffix ofu*wv then, we havealready deleted its prefix path in T.

    In both cases we havent prevented any path in T of asuffix path.

    Stage 3Stage 3

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    69/75

    Insert all paths sv, s=suf(u*z), into the tree.

    We do it as if we are running mcc with a pre-initialized

    suffix tree that already contains all suffixes ofv. Denote

    that tree T(v), and this variant algorithm as umcc(v). We already have all the suffixes longer than u*zv, so we

    start running mcc from there:

    Denotej=|(u)|-|(u*)|+1

    Denote k=|(uz)|

    We will insert the paths u*zv, , dv (where dis the last

    character ofuz), by running mcc from stepj throughk+1s rescanning substep (in order to connect the a suffix

    link to headk)

    Stage 3Stage 3

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    70/75

    Stage 3g

    We remember nodes* and its father (this settles

    the problem of the suffix link ofs*).

    The following 2 observations, corresponding to

    P1,P2, enable us to start running mcc from the

    jth

    step, with T(v):1. s* or its father are the contracted locuses of

    head(v)j-1 in T(v)j-1.

    2. s* is the only internal node that might nothave a suffix link in T(v)j-1.

    Time Complexity AnalysisTime Complexity Analysis

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    71/75

    p y yp y y

    We saw that finding u* takes O(|u*|).

    Deleting all the paths of the form tv, where tis anonempty suffix ofu*w, requires finding the leafedge of each path and deleting its leaf. Deletingthe leaf is constant.

    Finding the leaf edges of all these paths can bedone in a similar manner ofmcc(v):

    Find the path u*wv; remember its last internal node;follow its suffix link to find the last internal node ofsufi(u*wv).

    Hence, deleting the paths takes O(|u*wv|).

    Time Complexity AnalysisTime Complexity Analysis

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    72/75

    p y yp y y

    Running mcc from stepj through step k+1:

    Everything but scanning and rescanning takesconstant time.

    Denote the last character ofu*wby d.

    Define v* as the longest prefix ofdv that occurselsewhere in uzv.

    Time Complexity AnalysisTime Complexity Analysis

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    73/75

    p y yp y yDuring rescanning (substep B) we encounter:

    k+1i=jint(v)i |(res(v)j)|- |(res(v)k+1)| +int(v)k+1

    |(res(v)j)| |(sufj)| = |(u*zv)| For all i: int(v)i |(w)| |(headi-1(v))|, where w is

    the substring rescanned in substep B.

    Hence, int(v)k+1 |(headk(v))|.

    For all i: |(res(v)i)| |(suf(v)i-1)|- |(headi-1(v))|. Hence, |(res(v)k+1)| |(suf(v)k)| - |(headk(v))|.

    Time Complexity AnalysisTime Complexity Analysis

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    74/75

    p y yp y yThus, k+1i=jint(v)i |(u*zv)| +|(headk(v))| -

    |(suf(v)k)|+ |(headk(v))|

    = |(u*zv)| +2 |(v*)| - |(dv)|Hence, rescanning takes O(|u*zv|+|v*|).

    Scanning (substep C):

    As in mcc analysis:

    The number of character scanned in stepsj through k

    is exactly (k-j+1)+|head(v)k| - |head(v)j-1| |head(v)k|= |v*| , |head(v)j-1| = 0 , hence

    Scanning takes O(|u*z|+|v*|)

    AlgorithmAlgorithm mccmcc

  • 8/3/2019 Iddo-McCreight_slides on Suffix Tree Updates

    75/75

    In total, updating the suffix tree takes

    O(|u*|+|w|+|z|+|v*|)

    Time Complexity AnalysisTime Complexity Analysis