Efficient Approximate Entity Extraction with Edit Distance Constraints
Presented by: Aneeta Kolhe
Introduction
• Named Entity Recognition finds approximate matches in text.
• Important task for information extraction and integration, text mining and also for web search.
Problem
• Approximate dictionary matching.
• Previous solution: token-based similarity constraints.
• Proposed solution: neighborhood generation method.
For example, given the entity “al-qaida”:
• “al-qaeda” or “al-qa’ida” won’t be matched unless we use a low Jaccard similarity threshold of 0.33.
• At that threshold, the entity will also match “al gore” as well as “al pacino”.
• Hence we use edit distance.
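The contrast between the two similarity measures can be checked with a small sketch. The tokenizer (splitting on whitespace and hyphens only) is an assumption chosen so that the 0.33 figure from the example is reproduced; the function names are illustrative and do not come from the slides.

```python
import re

def tokens(s):
    # Assumed tokenizer: split on whitespace and hyphens only.
    return {t for t in re.split(r"[\s\-]+", s.lower()) if t}

def jaccard(a, b):
    # Jaccard similarity of the two token sets.
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

def edit_distance(a, b):
    # Classic dynamic programme, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

entity = "al-qaida"
for other in ["al-qaeda", "al-qa'ida", "al gore", "al pacino"]:
    print(other, round(jaccard(entity, other), 2), edit_distance(entity, other))
```

Under this tokenizer the true variants and “al gore” get the same Jaccard score, while the edit distance separates them.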
Problem Definition:
• Given: a document D, a dictionary E of entities, and an edit distance threshold τ.
• To find: all substrings in D that are within edit distance τ of one of the entities in E.
Solution:
• Iterate through all the valid substrings of the document D.
• Issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
• Consider each substring as a query segment.
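As a point of reference, the iterate-and-query scheme can be sketched as a brute-force baseline. This is not the indexed method of the talk. It assumes that a "valid" substring is one whose length lies within τ of some entity length, and the names `naive_extract` and `edit_distance` are illustrative.

```python
def edit_distance(a, b):
    # Classic dynamic programme, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def naive_extract(document, entities, tau):
    """Return (start, segment, entity) triples with edit distance <= tau."""
    results = []
    min_len = max(1, min(len(e) for e in entities) - tau)
    max_len = max(len(e) for e in entities) + tau
    for start in range(len(document)):
        for length in range(min_len, max_len + 1):
            if start + length > len(document):
                break
            segment = document[start:start + length]  # the query segment
            for e in entities:
                # A length difference above tau already rules the pair out.
                if abs(len(e) - length) <= tau and edit_distance(segment, e) <= tau:
                    results.append((start, segment, e))
    return results

hits = naive_extract("the al-qaeda report", ["al-qaida"], 1)
print(hits)
```

Every query segment is compared against every entity here, which is the cost the index-based approach tries to avoid.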
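• Partition each entity into kτ = ⌈(τ + 1)/2⌉ partitions. If a query segment is within edit distance τ of the entity, there is at least one partition with at most one edit error.
Example: s = [ abcdefghijkl ], s’ = [ axxbcdefghxijkl ], τ = 3, kτ = 2
s = [ abcdef ], [ ghijkl ]
s’ = [ axxbcde ], [ fghxijkl ]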
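A minimal sketch of the partitioning step follows. The slides do not say how the characters are distributed when the length is not divisible by kτ; giving the extra characters to the last partitions is an assumption made here because it reproduces the split shown for s’.

```python
def k_tau(tau):
    # ceil((tau + 1) / 2) in integer arithmetic
    return (tau + 2) // 2

def partition(s, k):
    """Split s into k contiguous parts whose lengths differ by at most one."""
    base, extra = divmod(len(s), k)
    parts, pos = [], 0
    for i in range(k):
        # Assumption: the last `extra` partitions receive one more character.
        size = base + (1 if i >= k - extra else 0)
        parts.append(s[pos:pos + size])
        pos += size
    return parts

tau = 3
print(k_tau(tau))
print(partition("abcdefghijkl", k_tau(tau)))
print(partition("axxbcdefghxijkl", k_tau(tau)))
```

With τ = 3 edits spread over 2 partitions, both partitions cannot contain two or more errors each, which is the pigeonhole argument behind the choice of kτ.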
• Shifting the first partition of s, [ abcdef ], by 2 => [ cdefgh ]
• Scaling it by −1 => [ cdefg ]
Transformation rules
• First partition: we only need to consider scaling within the range [−2, 2].
• Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring).
• For the rest of the partitions, we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].
Number of partition variations:
• 1st partition: 5 variations
• Intermediate partitions: 5 × (2τ + 1) variations
• Last partition: (2τ + 1) variations
Total amount of the 1-variants generated = O(m + 2).
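The three transformation rules can be sketched as a small generator. It reads shifting as moving the start position while keeping the length, and scaling as changing the length while keeping the start, so the last partition pairs a shift of x with a scale of −x to keep the final character. The function names, the partition layout and the handling of out-of-range variations are assumptions, and the sketch assumes at least two partitions.

```python
def k_tau(tau):
    return (tau + 2) // 2  # ceil((tau + 1) / 2)

def partition_spans(n, k):
    """(start, length) pairs of k near-equal partitions of a length-n string."""
    base, extra = divmod(n, k)
    spans, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        spans.append((pos, size))
        pos += size
    return spans

def partition_variations(s, tau):
    """List (substring, partition id) pairs following the three rules."""
    k = k_tau(tau)
    out = []
    for pid, (start, length) in enumerate(partition_spans(len(s), k), 1):
        if pid == 1:
            moves = [(0, c) for c in range(-2, 3)]            # scaling only
        elif pid == k:
            moves = [(x, -x) for x in range(-tau, tau + 1)]   # end stays fixed
        else:
            moves = [(x, c) for x in range(-tau, tau + 1) for c in range(-2, 3)]
        for shift, scale in moves:
            st, ln = start + shift, length + scale
            # Assumption: variations falling outside the string are dropped.
            if st >= 0 and ln > 0 and st + ln <= len(s):
                out.append((s[st:st + ln], pid))
    return out

variations = partition_variations("abcdefghijkl", 3)
for v in variations:
    print(v)
```

For s = [ abcdefghijkl ] and τ = 3 this yields 5 variations of the first partition and 2τ + 1 = 7 of the last, in line with the counts above.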
Example: s = [ abcdef ], [ ghijkl ]; s’ = [ axxbcde ], [ fghxijkl ]
Partition variations of s, written as <variation, partition id>:
• <[ abcd ], 1>, <[ abcde ], 1>, <[ abcdef ], 1>, <[ abcdefg ], 1>, <[ abcdefgh ], 1>
• <[ jkl ], 2>, <[ ijkl ], 2>, <[ hijkl ], 2>, <[ ghijkl ], 2>, <[ fghijkl ], 2>, <[ efghijkl ], 2>, <[ defghijkl ], 2>
The second partition of segment s’, [ fghxijkl ], has a 1-variant match with the partition variation [ fghijkl ] generated from the second partition of s.
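The 1-variant match in this example can be reproduced under one assumption: that the k-variant family of a string is the set of strings obtained by deleting at most k characters. The slides do not spell out this definition, so `gen_variants` below is a hypothetical stand-in.

```python
from itertools import combinations

def gen_variants(s, k):
    """Assumed definition: all strings obtained from s by deleting <= k characters."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            gone = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in gone))
    return out

query_partition = "fghxijkl"   # second partition of s'
entity_variation = "fghijkl"   # variation of the second partition of s

# Two strings have a 1-variant match when their 1-variant families intersect.
shared = gen_variants(query_partition, 1) & gen_variants(entity_variation, 1)
print(shared)
```

Deleting the extra "x" from the query partition gives exactly the entity's partition variation, so the two families share one string.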
• If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants.
• Assume lp is set to 3. Then 1-variants are generated from only the following prefixes: <[ abc ], 1>, <[ ghi ], 2>, <[ hij ], 2>, <[ fgh ], 2>
• By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).
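The effect of the prefix restriction can be illustrated by counting distinct 1-variants with and without it. The variation list is taken from the running example; the deletion-based `one_variants` helper is an assumption about how 1-variants are defined.

```python
def one_variants(s):
    # Assumed definition: s itself plus every string with one character deleted.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

variations = ["abcd", "abcde", "abcdef", "abcdefg", "abcdefgh",
              "jkl", "ijkl", "hijkl", "ghijkl", "fghijkl",
              "efghijkl", "defghijkl"]
lp = 3

# 1-variants generated from the full partition variations.
full = set().union(*(one_variants(v) for v in variations))
# 1-variants generated from the lp-prefixes only.
prefixed = set().union(*(one_variants(v[:lp]) for v in variations))

print(len(full), len(prefixed))
```

Many variations share the same lp-prefix (all five variations of the first partition start with "abc"), which is where most of the saving comes from in this small case.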
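• Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong.
• For each entity whose length is smaller than kτ lp + τ, the τ-variant family of its lp-prefix is indexed in Ishort.
• For each entity whose length is at least kτ lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which will be indexed in Ilong.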
Algorithm : BuildIndex (E, τ, lp)
for each e ∈ E do
    if |e| < kτ lp + τ then
        V ← GenVariants(e[1 .. min(lp, |e|)], τ); /* The GenVariants(s, k) function generates the k-variant family of string s */
        for each v ∈ V do
            Ishort[v] ← Ishort[v] ∪ { e };
    if |e| ≥ kτ lp then
        P ← the set of kτ partitions of e;
        for each i-th partition p ∈ P do
            PT ← TransformPartition(p); /* according to the three transformation rules in Section 3.1 */
            for each partition variation pT ∈ PT do
                V ← GenVariants(pT[1 .. lp], 1);
                for each v ∈ V do
                    Ilong[v] ← Ilong[v] ∪ { <e, i> };
return (Ishort, Ilong)
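One possible reading of BuildIndex in Python is sketched below. Several details are assumptions rather than facts from the slides: the k-variant family is taken to be deletion-based, partitions are near-equal with extra characters going to the last ones, out-of-range partition variations are dropped, and a single partition (kτ = 1) is treated with the last-partition rule.

```python
from collections import defaultdict
from itertools import combinations

def k_tau(tau):
    return (tau + 2) // 2  # ceil((tau + 1) / 2)

def gen_variants(s, k):
    """Assumed definition: delete at most k characters from s."""
    out = set()
    for d in range(min(k, len(s)) + 1):
        for drop in combinations(range(len(s)), d):
            gone = set(drop)
            out.add("".join(c for i, c in enumerate(s) if i not in gone))
    return out

def partition_spans(n, k):
    base, extra = divmod(n, k)
    spans, pos = [], 0
    for i in range(k):
        size = base + (1 if i >= k - extra else 0)
        spans.append((pos, size))
        pos += size
    return spans

def transform_partition(e, pid, k, start, length, tau):
    """Apply the shifting/scaling rules to one partition of e."""
    if pid == k:      # last partition (also used when k == 1, an assumption)
        moves = [(x, -x) for x in range(-tau, tau + 1)]
    elif pid == 1:    # first partition: scaling only
        moves = [(0, c) for c in range(-2, 3)]
    else:             # intermediate partitions
        moves = [(x, c) for x in range(-tau, tau + 1) for c in range(-2, 3)]
    out = []
    for shift, scale in moves:
        st, ln = start + shift, length + scale
        if st >= 0 and ln > 0 and st + ln <= len(e):
            out.append(e[st:st + ln])
    return out

def build_index(entities, tau, lp):
    k = k_tau(tau)
    i_short, i_long = defaultdict(set), defaultdict(set)
    for e in entities:
        if len(e) < k * lp + tau:   # short entity: index tau-variants of its prefix
            for v in gen_variants(e[:min(lp, len(e))], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:        # long entity: index 1-variants per partition variation
            for pid, (start, length) in enumerate(partition_spans(len(e), k), 1):
                for pt in transform_partition(e, pid, k, start, length, tau):
                    for v in gen_variants(pt[:lp], 1):
                        i_long[v].add((e, pid))
    return i_short, i_long

i_short, i_long = build_index(["abcdefghijkl", "abcde"], tau=3, lp=3)
print(sorted(i_long["fgh"]), sorted(i_short["ab"]))
```

Both indexes map a variant string to the entities (and, for Ilong, the partition ids) that produced it, so a lookup at query time is a dictionary probe.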
Algorithm : MatchDocument (D, E, τ)
for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong (D[p .. p + lp − 1], E, τ); /* matching entities no shorter than kτ lp */
    SearchShort (D[p .. p + lp − 1], E, τ); /* matching entities of length in [Lmin, kτ lp) */
Algorithm : SearchLong (s, E, τ)
R ← ∅; /* holds results */
C ← ∅; /* holds candidates */
V ← GenVariants(s, 1); /* gen 1-variant family */
for each v ∈ V do
    for each <e, pid> ∈ Ilong[v] do
        C ← C ∪ { <e, pid> }; /* duplicates removed */
for each <e, pid> ∈ C do
    S ← QuerySegmentInstantiation(e, pid); /* returns the set of query segment candidates for e */
    for each seg ∈ S do
        if Verify(seg, e) = true then
            R ← R ∪ { <seg, e> };
return R
SearchShort(s)
• We need to generate the τ-variant families for each possible length l between Lmin − τ and lp.
• If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified.
• Otherwise, we need to perform verification for 2τ + 1 possible query segments.
• For example, enumerate the 1-variants of the string [ abcdef ] from left to right.
• Even if no variant in the index starts with abc, the algorithm still enumerates the other three 1-variants containing abc.
• To avoid this, introduce a parameter lpp, set to lp/2.
Consider 4 possible cases:
Prefix Match | Suffix Match | Action
True | True | enumerate all 1-variants of q[1 .. lp]
False | False | discard q as there is no match
False | True | enumerate all 1-variants of q[1 .. lpp]
True | False | enumerate all 1-variants of q[(lpp + 1) .. lp]
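The case analysis can be sketched as a function that decides which variants are worth probing. It assumes deletion-based 1-variants, and a linear scan over the index keys stands in for whatever prefix and suffix lookup the real index provides. The names `variants_to_probe` and `index_keys` are hypothetical.

```python
def one_variants(s):
    # Assumed definition: s itself plus every string with one character deleted.
    return {s} | {s[:i] + s[i + 1:] for i in range(len(s))}

def variants_to_probe(q, lpp, index_keys):
    """Return the 1-variants of q that can still match an index key."""
    prefix, suffix = q[:lpp], q[lpp:]
    prefix_match = any(key.startswith(prefix) for key in index_keys)
    suffix_match = any(key.endswith(suffix) for key in index_keys)
    if prefix_match and suffix_match:
        return one_variants(q)
    if prefix_match:
        # Only keys with the intact prefix exist: the deletion must be in the suffix.
        return {prefix + v for v in one_variants(suffix)}
    if suffix_match:
        # Only keys with the intact suffix exist: the deletion must be in the prefix.
        return {v + suffix for v in one_variants(prefix)}
    return set()  # neither half matches: discard q

print(sorted(variants_to_probe("abcdef", 3, {"abcef"})))
print(sorted(variants_to_probe("abcdef", 3, {"bcdef"})))
print(sorted(variants_to_probe("abcdef", 3, {"xyz"})))
```

A single deletion leaves either the first half or the second half of q untouched, which is why checking the two halves separately is enough to skip variants that cannot be in the index.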
Successfully reduced the size of neighborhood
Proposed an efficient query processing algorithm
Optimized the algorithm to share computation
Avoid unnecessary variant enumeration