Efficient Approximate Entity Extraction with Edit Distance Constraints

Presented by: Aneeta Kolhe

• Named Entity Recognition finds approximate matches in text.

• It is an important task for information extraction and integration, text mining, and web search.

Approximate dictionary matching

• Previous solution – Token-based similarity constraints

• Proposed solution – Neighborhood generation method

The previous solution uses the Jaccard coefficient over tokens.

• It may miss some matches.

• It may return too many matches.

For example, given the entity “al-qaida”, “al-qaeda” or “al-qa’ida” won’t be matched unless we use a low Jaccard similarity threshold of 0.33.

At that threshold, however, “al-qaida” will also match “al gore” as well as “al pacino”.

Hence we use edit distance.
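The threshold problem above can be checked with a few lines of Python. This is a minimal sketch that assumes tokens are obtained by splitting on whitespace and hyphens; the helper names `tokens` and `jaccard` are illustrative and do not come from the slides.

```python
import re

def tokens(s):
    """Split on whitespace and hyphens (an assumed tokenizer) and lower-case."""
    return {t for t in re.split(r"[\s\-]+", s.lower()) if t}

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

entity = "al-qaida"
for candidate in ["al-qaeda", "al gore", "al pacino"]:
    # Each candidate shares only the token "al" with the entity: 1 of 3 tokens.
    print(candidate, round(jaccard(entity, candidate), 2))
```

Under this tokenizer the misspelling and the two unrelated names all receive the same score, so no threshold can separate them.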

Problem Definition:

Given: a document D, a dictionary E of entities, and an edit distance threshold τ.

To find: all substrings in D that are within edit distance τ of one of the entities in E.

Solution: Iterate through all the valid substrings of the document D

Issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.

Consider each substring as a query segment.
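As a point of reference, the baseline can be sketched as a brute-force loop. This is only an illustration under simplifying assumptions: the similarity selection query is replaced by a direct scan of the dictionary, and the names `edit_distance` and `naive_extract` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def naive_extract(doc, entities, tau):
    """Return (start, substring, entity) triples with edit distance <= tau."""
    results = []
    for e in entities:
        # Only substrings whose length is within tau of |e| can match.
        for length in range(max(1, len(e) - tau), len(e) + tau + 1):
            for start in range(0, len(doc) - length + 1):
                seg = doc[start:start + length]
                if edit_distance(seg, e) <= tau:
                    results.append((start, seg, e))
    return results

print(naive_extract("the al-qaeda group", ["al-qaida"], 1))
```

Every query segment is compared against every entity, which is the cost the neighborhood generation method is designed to avoid.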

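Partition each string into kτ partitions. If two strings are within edit distance τ, there is at least one partition with at most one edit error.

Select kτ = ⌈(τ + 1)/2⌉.

Example: τ = 3, kτ = 2

s = [ abcdefghijkl ], partitioned as [ abcdef ], [ ghijkl ]

s’ = [ axxbcdefghxijkl ], partitioned as [ axxbcde ], [ fghxijkl ]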
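The partitioning step can be sketched as follows. The split rule used here (equal-sized parts, with the last part absorbing any remainder) is an assumption chosen because it reproduces the partitions in the example; the function names are illustrative.

```python
import math

def k_tau(tau):
    """Number of partitions: ceil((tau + 1) / 2)."""
    return math.ceil((tau + 1) / 2)

def partition(s, k):
    """Split s into k contiguous parts; the last part absorbs the remainder (assumed rule)."""
    size = len(s) // k
    parts = [s[i * size:(i + 1) * size] for i in range(k - 1)]
    parts.append(s[(k - 1) * size:])
    return parts

k = k_tau(3)
print(k, partition("abcdefghijkl", k), partition("axxbcdefghxijkl", k))
```

With τ edit errors spread over kτ partitions and 2kτ > τ, at least one partition must contain at most one error, which is why this value of kτ suffices.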

Example transformations: shifting the first partition of s by 2 gives [ cdefgh ]; scaling it by −1 then gives [ cdefg ].

Transformation rules

• First partition: we only need to consider scaling within the range [−2, 2].

• Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring).

• For the rest of the partitions, we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].

Number of variations per partition:

• 1st partition: 5 variations

• Intermediate partitions: 5 × (2τ + 1) variations

• Last partition: (2τ + 1) variations

Total number of 1-variants generated = O(mτ²).

s = [ abcdef ], [ ghijkl ]; s’ = [ axxbcde ], [ fghxijkl ]

Partition variations of s, written as <variation, partition id>:

<[ abcd ], 1> <[ abcde ], 1> <[ abcdef ], 1> <[ abcdefg ], 1> <[ abcdefgh ], 1>

<[ jkl ], 2> <[ ijkl ], 2> <[ hijkl ], 2> <[ ghijkl ], 2> <[ fghijkl ], 2> <[ efghijkl ], 2> <[ defghijkl ], 2>

The second partition of s’, [ fghxijkl ], has a 1-variant match with the partition variation [ fghijkl ] generated from the second partition of s.
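The variation list and the 1-variant match above can be reproduced with a short sketch. Two things are assumed here and are not stated on the slides: partitions are equal-sized with the last one absorbing the remainder, and a k-variant of a string is any string obtained by deleting at most k characters. The names `partition_variations` and `gen_variants` are illustrative.

```python
from itertools import combinations

def partition_variations(s, tau):
    """Return (variation, partition_id) pairs produced by the three transformation rules."""
    m = len(s)
    k = (tau + 2) // 2                      # equals ceil((tau + 1) / 2)
    if k == 1:
        return [(s, 1)]                     # single partition: the whole string
    size = m // k
    seen, out = set(), []
    for pid in range(1, k + 1):
        b = (pid - 1) * size
        e = m if pid == k else pid * size
        if pid == 1:                        # first: start fixed, end scaled by -2..2
            spans = [(b, e + sc) for sc in range(-2, 3)]
        elif pid == k:                      # last: start moves by -tau..tau, end fixed
            spans = [(b + sh, e) for sh in range(-tau, tau + 1)]
        else:                               # middle: shift by -tau..tau, scale by -2..2
            spans = [(b + sh, e + sh + sc)
                     for sh in range(-tau, tau + 1) for sc in range(-2, 3)]
        for lo, hi in spans:
            if 0 <= lo < hi <= m and (lo, hi, pid) not in seen:
                seen.add((lo, hi, pid))
                out.append((s[lo:hi], pid))
    return out

def gen_variants(s, k):
    """Assumed definition: all strings obtained from s by deleting at most k characters."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            out.add("".join(c for i, c in enumerate(s) if i not in drop))
    return out

variations = partition_variations("abcdefghijkl", 3)
print(len(variations))                                   # 5 + 7 variations
shared = gen_variants("fghxijkl", 1) & gen_variants("fghijkl", 1)
print(shared)                                            # the common 1-variant
```

Under the deletion-based definition, two strings within edit distance 1 always share a 1-variant, which is what makes the index probe sufficient for finding candidates.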

If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants.

Assume lp is set to 3. Then 1-variants are generated from only the following prefixes:

<[ abc ], 1> <[ ghi ], 2> <[ hij ], 2> <[ fgh ], 2>

By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).
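The effect of prefix pruning on a single partition variation can be illustrated as follows. The sketch assumes that a 1-variant is the string itself or the string with one character deleted; `one_variants` and `pruned_one_variants` are hypothetical helper names.

```python
def one_variants(s):
    """s itself plus every string obtained by deleting one character (assumed definition)."""
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

def pruned_one_variants(variation, lp):
    """Generate 1-variants from the lp-prefix only when the variation is longer than lp."""
    prefix = variation[:lp] if len(variation) > lp else variation
    return one_variants(prefix)

full = one_variants("fghijkl")
pruned = pruned_one_variants("fghijkl", 3)
print(len(full), len(pruned))   # prints "8 4"
print(sorted(pruned))
```

The number of 1-variants per variation then depends on lp rather than on the partition length, which is where the reduced bound comes from.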

Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong.

• For each entity whose length is smaller than kτ lp + τ, the τ-variant family of its lp-prefix is indexed in Ishort.

• For each entity whose length is at least kτ lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which is indexed in Ilong.

Algorithm : BuildIndex (E, τ, lp)
for each e ∈ E do
    if |e| < kτ lp + τ then
        V ← GenVariants(e[1 .. min(lp, |e|)], τ); /* The GenVariants(s, k) function generates the k-variant family of string s */
        for each v ∈ V do
            Ishort[v] ← Ishort[v] ∪ { e };
    if |e| ≥ kτ lp then
        P ← the set of kτ partitions of e;
        for each i-th partition p ∈ P do
            PT ← TransformPartition(p); /* according to the three transformation rules in Section 3.1 */
            for each partition variation pT ∈ PT do
                V ← GenVariants(pT[1 .. lp], 1);
                for each v ∈ V do
                    Ilong[v] ← Ilong[v] ∪ { <e, i> };
return (Ishort, Ilong)
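A runnable sketch of BuildIndex in Python is given below. It follows the pseudocode's two length tests but rests on assumptions the slides do not spell out: GenVariants is taken to be deletion-based (delete at most k characters), partitions are equal-sized with the last one absorbing the remainder, and the indexes are plain dictionaries keyed by variant. All function names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def gen_variants(s, k):
    """Assumed definition: all strings obtained from s by deleting at most k characters."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            out.add("".join(c for i, c in enumerate(s) if i not in drop))
    return out

def transform_partitions(e, tau, k):
    """Yield (variation, partition_id) pairs following the three transformation rules."""
    m, size = len(e), len(e) // k
    for pid in range(1, k + 1):
        b = (pid - 1) * size
        end = m if pid == k else pid * size
        if k == 1:
            spans = [(b, end)]
        elif pid == 1:
            spans = [(b, end + sc) for sc in range(-2, 3)]
        elif pid == k:
            spans = [(b + sh, end) for sh in range(-tau, tau + 1)]
        else:
            spans = [(b + sh, end + sh + sc)
                     for sh in range(-tau, tau + 1) for sc in range(-2, 3)]
        for lo, hi in spans:
            if 0 <= lo < hi <= m:
                yield e[lo:hi], pid

def build_index(entities, tau, lp):
    """Return (i_short, i_long): variant -> entities, and variant -> (entity, partition id)."""
    k = (tau + 2) // 2                       # equals ceil((tau + 1) / 2)
    i_short, i_long = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:            # short entity: tau-variants of its lp-prefix
            for v in gen_variants(e[:lp], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:                 # long entity: 1-variants of each variation prefix
            for p_t, pid in transform_partitions(e, tau, k):
                for v in gen_variants(p_t[:lp], 1):
                    i_long[v].add((e, pid))
    return i_short, i_long

i_short, i_long = build_index(["abcdefghijkl"], tau=3, lp=3)
print(sorted(i_long["fgh"]))
```

Note that the two length tests overlap, so an entity whose length falls in [kτ lp, kτ lp + τ) is placed in both indexes, as in the pseudocode.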

Algorithm : MatchDocument (D, E, τ)
for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong (D[p .. p + lp − 1], E, τ); /* matching entities no shorter than kτ lp */
    SearchShort (D[p .. p + lp − 1], E, τ); /* matching entities of length in [Lmin, kτ lp) */

Algorithm : SearchLong (s, E, τ)
R ← ∅; /* holds results */
C ← ∅; /* holds candidates */
V ← GenVariants(s, 1); /* gen 1-variant family */
for each v ∈ V do
    for each <e, pid> ∈ Ilong[v] do
        C ← C ∪ { <e, pid> }; /* duplicates removed */
for each <e, pid> ∈ C do
    S ← QuerySegmentInstantiation(e, pid); /* returns the set of query segment candidates for e */
    for each seg ∈ S do
        if Verify(seg, e) = true then
            R ← R ∪ { <seg, e> };
return R

SearchShort (s)

• We need to generate the τ-variant families for each possible length l between Lmin − τ and lp.

• If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified.

• Otherwise, we need to perform verification for 2τ + 1 possible query segments.

For example, enumerate the 1-variants of the string [ abcdef ] from left to right.

Suppose no variant in the index starts with abc. The algorithm still enumerates the other three 1-variants containing abc.

To avoid this, we introduce a parameter lpp, set to lp/2.

Consider 4 possible cases:

Prefix Match | Suffix Match | Action
True | True | enumerate all 1-variants of q[1 .. lp]
False | False | discard q as there is no match
False | True | enumerate all 1-variants of q[1 .. lpp]
True | False | enumerate all 1-variants of q[(lpp + 1) .. lp]
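The case analysis can be written as a small decision function. This sketch assumes that the two match flags are computed elsewhere, that lpp is lp/2 rounded down, and that the last case is the one where only the prefix matches; the name `variant_scope` is hypothetical.

```python
def variant_scope(q, lp, prefix_match, suffix_match):
    """Return the part of q whose 1-variants should be enumerated, or None to discard q."""
    lpp = lp // 2                      # assumed rounding for lp/2
    if prefix_match and suffix_match:
        return q[:lp]                  # q[1 .. lp]
    if suffix_match:
        return q[:lpp]                 # q[1 .. lpp]: only the suffix matched
    if prefix_match:
        return q[lpp:lp]               # q[(lpp + 1) .. lp]: only the prefix matched
    return None                        # neither half matches: no candidate

for flags in [(True, True), (False, False), (False, True), (True, False)]:
    print(flags, variant_scope("abcdef", 6, *flags))
```

When only one half matches, variants are enumerated for the other half alone, so the variants that share the unmatched half are never generated.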

Successfully reduced the size of neighborhood

Proposed an efficient query processing algorithm

Optimized the algorithm to share computation

Avoid unnecessary variant enumeration
