Discrete Point Based Signatures and Applications to Document Matching
Nemanja Spasojevic, Guillaume Poncin, and Dan Bloomberg
September 15th, 2011, Ravenna, Italy



Overview

● Background
● Algorithm
● Applications
  ○ duplicate page detection
  ○ text image lookup
● Conclusion

Background (Duplicate Page Detection)

● Find duplicate pages in a given set of scans of the same physical book. Assumptions:
  ○ has to handle a corpus of ~10k text pages at a time
  ○ pages rich in text
  ○ < 4° of rotation from image to image
  ○ some translation
  ○ needs to be quick
  ○ simple to use (discrete signatures, for easy indexing / lookup)

Background (example)

Background (Image Lookup)

● See how well we perform in image lookup mode. Test how robust the algorithm is for something it was not designed for:
  ○ index clean images
  ○ look up by an image taken with a cell phone camera
  ○ skew
  ○ rotation
  ○ blur
  ○ part of original page

Other Approaches

● Image matching is a well-studied problem
  ○ SURF, SIFT, FIT work well at point matching across images and image lookup
  ○ do not work as well for repetitive patterns such as text documents
● Document page matching
  ○ Locally Likely Arrangement Hashing (LLAH), Nakai et al.
    ■ affine invariant
    ■ produces thousands of signatures per page
    ■ precise
    ■ handles 10k image corpus

Algorithm

Possible inputs:
● raw image (operate on word centroids)
● OCR-ed text with word bounding boxes (operate on word bounding box center)
● PDF with word bounding box info (operate on word bounding box center)
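For the OCR and PDF inputs, the point cloud is just the set of word bounding box centers. A minimal sketch of that reduction follows; the `(left, top, right, bottom)` box layout and the name `box_centers` are assumptions for illustration, not taken from the slides.

```python
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom) in page coordinates.
Box = Tuple[float, float, float, float]
Point = Tuple[float, float]


def box_centers(boxes: List[Box]) -> List[Point]:
    """Reduce each word bounding box to its center point."""
    return [((left + right) / 2.0, (top + bottom) / 2.0)
            for left, top, right, bottom in boxes]


# Two hypothetical word boxes on one text line.
words = [(10, 20, 50, 30), (60, 20, 100, 30)]
centers = box_centers(words)
```

For raw images the same point cloud would come from word centroids found by image processing rather than from stored boxes.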

Algorithm (Image Processing)

Signature Generation Algorithm

Signature Instability

Signature is composed of N sub-signatures: S = [s(0)][s(1)]...[s(N-1)]

Instability of signatures comes from:
● small shifts may lead to changes in the discretized angle value (e.g. s(0) flipping from 13 to 14 due to small word position shifts)
● the order of sub-signatures may change (s(0) and s(1) swap as they had almost the same radial distance)
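Both failure modes can be reproduced with a toy model. The sketch below assumes each sub-signature is an angle quantized into 16 bins and that neighbors are ordered by radial distance from an anchor point. The bin count, the coordinates, and the helper names are hypothetical, and this is not necessarily the exact construction used in the talk.

```python
import math

NUM_BINS = 16  # assumed number of angle bins; the slides do not state it


def quantize_angle(anchor, neighbor, num_bins=NUM_BINS):
    """Map the direction anchor -> neighbor to an integer bin in [0, num_bins)."""
    angle = math.atan2(neighbor[1] - anchor[1],
                       neighbor[0] - anchor[0]) % (2 * math.pi)
    return int(angle / (2 * math.pi) * num_bins) % num_bins


def sub_signatures(anchor, neighbors, num_bins=NUM_BINS):
    """Order neighbors by radial distance, then quantize each angle."""
    ordered = sorted(neighbors, key=lambda p: math.dist(anchor, p))
    return [quantize_angle(anchor, p, num_bins) for p in ordered]


def point_at(degrees, radius=10.0):
    """Helper: a point at the given polar angle around the origin."""
    rad = math.radians(degrees)
    return (radius * math.cos(rad), radius * math.sin(rad))


anchor = (0.0, 0.0)

# Case 1: a 2 degree shift across a bin edge changes the discretized value.
before_flip = quantize_angle(anchor, point_at(314.0))
after_flip = quantize_angle(anchor, point_at(316.0))

# Case 2: two neighbors at nearly equal distance swap order after a small shift.
stable_order = sub_signatures(anchor, [(10.0, 1.0), (1.0, 10.2)])
swapped_order = sub_signatures(anchor, [(10.0, 1.0), (1.0, 9.9)])
```

With 16 bins each bin spans 22.5°, so the edge at 315° separates bins 13 and 14: `before_flip` is 13 and `after_flip` is 14. In the second case the sub-signature values stay the same but their order reverses, which changes the concatenated signature.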

Signature Filtering Based on Estimated Risk

Superposition of Ambiguous Signatures

[s1][{s2,s'2}][s3][s4] => { [s1][s2][s3][s4], [s1][s'2][s3][s4] }
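The expansion above is a Cartesian product over the candidate values of each sub-signature slot. A minimal sketch, assuming each slot is given as a list of candidates (the function name `expand_superposition` is hypothetical):

```python
from itertools import product


def expand_superposition(slots):
    """Expand per-slot candidate lists into every concrete signature."""
    return [list(combo) for combo in product(*slots)]


# One ambiguous slot with two candidates, as in the example above.
ambiguous = [["s1"], ["s2", "s'2"], ["s3"], ["s4"]]
expanded = expand_superposition(ambiguous)
```

The number of emitted signatures is the product of the slot sizes, so it doubles with every additional two-way ambiguous slot.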

Duplicate Page Detection (metrics)

● The similarity based on signature sets is calculated as:

● The similarity based on matched (aligned) word bounding boxes:
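The formulas themselves did not survive in this transcript. The `Js` / `Jb` notation used in the examples suggests a Jaccard-style ratio; under that assumption only, a signature-set similarity could be sketched as follows (the function name and sample values are hypothetical).

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


# Hypothetical signature sets for two pages.
page_1 = {0x0A, 0x1F, 0x2C}
page_2 = {0x1F, 0x2C, 0x3D}
js = jaccard(page_1, page_2)
```

Here two of four distinct signatures are shared, so `js` is 0.5.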

Duplicate Page Detection (example)

Js = 19% Jb = 93%

Duplicate Page Detection (example)

Js = 5% Jb = 37%

Image Lookup

Image Lookup Examples

Image Lookup Examples

Image Lookup Result

1M pages index (32-bit signatures) stats:
● 386M signatures total
● stored as sorted array (<signature, book_id, page_pid, x, y>), fits in ~4GB of memory
● 0.8% of all signatures filtered (those repeated on 1k or more pages)
● each query on average returns 2000 candidates
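A sorted array supports lookup by binary search on the signature. The sketch below assumes tuples laid out like the record above and ranks candidate pages by counting signature hits. The vote counting, the sample data, and the function names are assumptions for illustration, not the system described in the talk.

```python
from bisect import bisect_left, bisect_right
from collections import Counter

# Hypothetical entries: (signature, book_id, page_id, x, y), sorted by signature.
index = sorted([
    (0x0A, 1, 3, 10, 20),
    (0x0A, 2, 7, 15, 25),
    (0x1F, 1, 3, 40, 60),
    (0x2C, 2, 8, 5, 5),
])
signatures = [entry[0] for entry in index]  # parallel key array for bisect


def lookup(sig):
    """Return every index entry whose signature equals sig."""
    lo = bisect_left(signatures, sig)
    hi = bisect_right(signatures, sig)
    return index[lo:hi]


def rank_pages(query_signatures):
    """Count hits per (book_id, page_id) and sort by hit count, descending."""
    votes = Counter()
    for sig in query_signatures:
        for _, book_id, page_id, _, _ in lookup(sig):
            votes[(book_id, page_id)] += 1
    return votes.most_common()


ranking = rank_pages([0x0A, 0x1F])
```

In this toy index, page 3 of book 1 matches both query signatures and ranks first. Each lookup costs O(log n) plus the number of matching entries.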

Index Size [pages]   Accuracy   Signature Size [bits]
4.1k                 0.966      16
4.1k                 0.949      32
1M                   0.871      32

Conclusion

● Simple scheme for point cloud based discrete signature generation
● Filtering based on signature stability estimate
● Superposing signatures
● Duplicate page detection
● Image lookup by cell phone camera image query (87.1% accuracy for 1M pages indexed)

Q & A

Thank You!

Synthetic Data Evaluation

Synthetic Data Evaluation