Discrete Point Based Signatures and Applications to Document Matching
Nemanja Spasojevic, Guillaume Poncin, and Dan Bloomberg
September 15th 2011, Ravenna, Italy
Overview
● Background
● Algorithm
● Applications
  ○ duplicate page detection
  ○ text image lookup
● Conclusion
Background (Duplicate Page Detection)
● Find duplicate pages in a set of scans of the physically same book. Assumptions:
  ○ has to handle a corpus of ~10k text pages at a time
  ○ pages rich in text
  ○ < 4° of rotation from image to image
  ○ some translation
  ○ needs to be quick
  ○ simple to use (discrete signatures, for easy indexing / lookup)
Background (Image Lookup)
● See how well we perform in image lookup mode. Test how robust the algorithm is for something it was not designed for:
  ○ index clean images
  ○ lookup by image taken with a cell phone camera
  ○ skew
  ○ rotation
  ○ blur
  ○ part of the original page
Other Approaches
● Image matching is a well-studied problem
  ○ SURF, SIFT, FIT work well at point matching across images and image lookup
  ○ do not work as well for repetitive patterns such as text documents
● Document page matching
  ○ Locally Likely Arrangement Hashing (LLAH), Nakai et al.
    ■ affine invariant
    ■ produces thousands of signatures per page
    ■ precise
    ■ handles 10k image corpus
Algorithm
Possible inputs:
● raw image (operate on word centroids)
● OCR-ed text with word bounding boxes (operate on word bounding box center)
● PDF with word bounding box info (operate on word bounding box center)
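For the OCR and PDF inputs, reducing each word to a single point can be sketched as below. This is a minimal illustration, not the authors' code: the `(left, top, right, bottom)` box layout and the name `box_centers` are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_centers(boxes: List[Box]) -> List[Point]:
    """Reduce each word bounding box to its center point."""
    return [((left + right) / 2.0, (top + bottom) / 2.0)
            for left, top, right, bottom in boxes]


# Two hypothetical word boxes on the same text line.
words = [(10, 20, 50, 40), (60, 20, 120, 40)]
centers = box_centers(words)
print(centers)  # [(30.0, 30.0), (90.0, 30.0)]
```

The resulting point cloud is what the signature generation operates on, regardless of which of the three inputs it came from.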
Signature Instability
Signature is composed of N sub-signatures: S = [s(0)][s(1)]...[s(N-1)]
Instability of signatures comes from:
● Small shifts may lead to changes in the discretized angle value (e.g. s(0) flipping from 13 to 14 due to small word position shifts)
● The order of sub-signatures may change (s(0) and s(1) swap when they have almost the same radial distance)
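Both sources of instability can be reproduced with a small sketch. The slides do not give the discretization details, so the 16 angle bins, the helper names, and the sample coordinates below are assumptions chosen only to mirror the "13 to 14" example.

```python
import math


def discretize_angle(dx, dy, bins=16):
    """Map the direction of the offset (dx, dy) to an integer bin in [0, bins)."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return int(angle / (2 * math.pi) * bins) % bins


def neighbor_at(degrees, radius=100.0):
    """Hypothetical neighbor offset at a given angle and distance."""
    rad = math.radians(degrees)
    return radius * math.cos(rad), radius * math.sin(rad)


def order_by_radius(neighbors):
    """Indices of neighbors sorted by radial distance from the center point."""
    return sorted(range(len(neighbors)),
                  key=lambda i: math.hypot(*neighbors[i]))


# 1) A one-degree shift across a bin boundary changes the discrete value.
before = discretize_angle(*neighbor_at(314.5))
after = discretize_angle(*neighbor_at(315.5))
print(before, after)  # 13 14

# 2) Two neighbors at nearly the same distance swap order after a small shift.
order_a = order_by_radius([(10.0, 0.0), (0.0, 10.1)])
order_b = order_by_radius([(10.2, 0.0), (0.0, 10.1)])
print(order_a, order_b)  # [0, 1] [1, 0]
```

In both cases the underlying geometry barely moves, yet the resulting discrete signature differs, which is why a stability estimate is useful.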
Superposition of Ambiguous Signatures
[s1][{s2,s'2}][s3][s4] => { [s1][s2][s3][s4], [s1][s'2][s3][s4] }
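The expansion above amounts to a Cartesian product over the alternatives at each position. A minimal sketch, assuming each sub-signature slot is given as a list of its possible values (the name `superpose` is hypothetical):

```python
from itertools import product


def superpose(slots):
    """Expand per-position alternatives into every concrete signature."""
    return [list(combo) for combo in product(*slots)]


# One ambiguous position (s2 or s'2) yields two signatures.
signatures = superpose([["s1"], ["s2", "s'2"], ["s3"], ["s4"]])
print(signatures)
# [['s1', 's2', 's3', 's4'], ['s1', "s'2", 's3', 's4']]
```

The number of emitted signatures is the product of the alternatives per slot, so it doubles with every two-way ambiguous position.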
Duplicate Page Detection (metrics)
● The similarity based on signature sets is calculated as:
● The similarity based on matched (aligned) word bounding boxes:
Image Lookup Result
1M pages index (32-bit signatures) stats:
● 386M signatures total
● stored as sorted array (<signature, book_id, page_pid, x, y>), fits in ~4GB of memory
● 0.8% of all signatures filtered (those repeated on 1k or more pages)
● each query on average returns 2000 candidates
Index Size [pages]    Accuracy    Signature Size [bits]
4.1k                  0.966       16
4.1k                  0.949       32
1M                    0.871       32
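The sorted-array index from the stats above can be sketched with binary search. This is an illustration under assumptions: the tuple layout follows the slide, but the function names are hypothetical, and ranking pages by a simple vote count is a stand-in for whatever scoring the authors used.

```python
from bisect import bisect_left, bisect_right
from collections import Counter, defaultdict


def build_index(entries, max_pages=1000):
    """Sort (signature, book_id, page_pid, x, y) tuples, dropping signatures
    that occur on max_pages or more distinct pages."""
    pages_per_sig = defaultdict(set)
    for sig, book_id, page_pid, _x, _y in entries:
        pages_per_sig[sig].add((book_id, page_pid))
    kept = [e for e in entries if len(pages_per_sig[e[0]]) < max_pages]
    kept.sort()
    keys = [e[0] for e in kept]  # parallel array of signatures for bisect
    return kept, keys


def lookup(index, keys, signature):
    """Return every index entry whose signature equals the query signature."""
    lo = bisect_left(keys, signature)
    hi = bisect_right(keys, signature)
    return index[lo:hi]


def best_page(index, keys, query_signatures):
    """Assumed ranking: the page hit by the most query signatures wins."""
    votes = Counter()
    for sig in query_signatures:
        for _sig, book_id, page_pid, _x, _y in lookup(index, keys, sig):
            votes[(book_id, page_pid)] += 1
    return votes.most_common(1)[0][0] if votes else None


# Tiny hypothetical corpus; max_pages=2 filters signature 5 (seen on 2 pages).
entries = [(5, 1, 1, 0, 0), (5, 1, 2, 0, 0), (7, 1, 2, 3, 4), (9, 2, 1, 0, 0)]
index, keys = build_index(entries, max_pages=2)
result = best_page(index, keys, [5, 7])
print(result)  # (1, 2)
```

A sorted array keeps per-entry overhead low, which matters when hundreds of millions of entries must fit in memory, and each lookup costs two binary searches.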
Conclusion
● Simple scheme for point cloud based discrete signature generation
● Filtering based on signature stability estimate
● Superposing signatures
● Duplicate page detection
● Image lookup by cell phone camera image query (87.1% accuracy for 1M pages indexed)