Presentation of Hybrid Page Layout Analysis via Tab-Stop Detection Ray Smith, Proc. ICDAR2009, Barcelona, Spain, 2009. Javier de la Rosa {jdelaros at uwo dotca} CS 9883

DESCRIPTION

Presentation of the proceedings paper "Hybrid Page Layout Analysis via Tab-Stop Detection" by Ray Smith, for the Page Segmentation Competition held at ICDAR 2009.

TRANSCRIPT

Page 1: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

Presentation of Hybrid Page Layout Analysis via Tab-Stop Detection

Ray Smith, Proc. ICDAR2009, Barcelona, Spain, 2009.

Javier de la Rosa {jdelaros at uwo dotca} CS 9883

Page 2: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

2 | Internal use only2 of 25 | Javier de la Rosa | Hybrid Page Layout Analysis via Tab-Stop Detection | CS 9883

Index.

1. Context and background.

2. Introduction.

3. Page layout via tab-stop detection.

4. Preprocessing.

5. Finding tab positions as line segments.

6. Finding the column layout.

7. Finding the regions.

8. Testing and results.

9. Conclusion and further work.

10. Criticism.

11. References.

Page 3: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

1. Context and background.

• International Conference on Document Analysis and Recognition [1].

• Page Segmentation competitions: 2001, 2003, 2005, 2007 and 2009 [2].

• Tesseract, the OCR from Google [3].

[1] Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011) <http://www.icdar2011.org/>

[2] A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona, Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/>

[3] The Tesseract OCR <http://code.google.com/p/tesseract-ocr/>

Page 4: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

2. Introduction.

Physical page layout analysis:

• Bottom-up [4].

• Top-down [5].

• Whitespaces [6].

Logical page layout analysis:

• Voronoi.

• Smearing.

• Etc.

[4] M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm," SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408.

[5] G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents," Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, 1984, pp347-349.

[6] T.M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on Document Analysis Systems V, Springer-Verlag 2002, pp188-199.

Page 5: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

3. Tab-stop detection.

• Regions bounded by tab-stops.

• Fixed x-positions.

• Vertical alignment.

Phases:

1. Preprocessing.

2. Bottom-up tab-stop detection.

3. Finding the column layout.

4. Set of typed regions.

Page 6: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

4. Preprocessing

• Detection of vertical lines and image mask [7].

• Connected components (CCs) analysis.

• CCs filtering by width, w, and height, h:

– Small: h < 7 (at 300 ppi) or h < h75 / 2

– Large: h > 2h75 or w > 8h75

– Medium: the remainder.

[7] Leptonica image processing and analysis library <http://www.leptonica.com>
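
The size filter on this slide can be sketched in Python as follows. This is only an illustration, not the Tesseract implementation: it assumes that h75 denotes the 75th percentile of the CC heights, and the names `percentile_75` and `classify_cc` are hypothetical.

```python
def percentile_75(heights):
    """Return the 75th-percentile height (nearest rank).

    Assumption: this is what the slide's h75 stands for.
    """
    ordered = sorted(heights)
    rank = (3 * len(ordered) + 3) // 4 - 1  # ceil(0.75 * n) - 1
    return ordered[rank]


def classify_cc(w, h, h75, min_height=7):
    """Classify a connected component by width w and height h (pixels at 300 ppi)."""
    if h < min_height or h < h75 / 2:
        return "small"
    if h > 2 * h75 or w > 8 * h75:
        return "large"
    return "medium"


heights = [10, 20, 30, 40]
h75 = percentile_75(heights)
labels = [classify_cc(20, h, h75) for h in heights]
```

The threshold of 7 pixels is kept as a parameter because the slide ties it to 300 ppi; how it should scale at other resolutions is not stated.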

Page 7: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

5. Finding tab positions as line segments. (1/3)

• Candidate tab-stop components:

– A CC is a tab-stop by default.

– Look for aligned neighbours.

– Mark each CC as left tab, right tab or neither.

• Grouping candidate tabs:

– In lines and, if there are many, in groups.

– Least median of squares to fit the lines (left or right).

– Refit lines to the page-mean direction.
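
A least-median-of-squares fit of a near-vertical tab line can be sketched as below. This is a minimal, exhaustive version written for illustration; the function name `fit_tab_line_lmeds` and the parametrisation x = a*y + b are assumptions, and the slides do not say how the original implementation samples candidate lines.

```python
from itertools import combinations
from statistics import median


def fit_tab_line_lmeds(points):
    """Fit x = a*y + b to (x, y) points by least median of squares.

    Tries the line through every pair of points and keeps the one whose
    median squared x-residual over all points is smallest.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if y1 == y2:
            continue  # x cannot be written as a function of y for this pair
        a = (x2 - x1) / (y2 - y1)
        b = x1 - a * y1
        score = median([(x - (a * y + b)) ** 2 for x, y in points])
        if best is None or score < best[0]:
            best = (score, a, b)
    if best is None:
        raise ValueError("need at least two points with distinct y")
    return best[1], best[2]


# Four CC edges aligned at x = 100 and one outlier at x = 160.
a, b = fit_tab_line_lmeds([(100, 0), (100, 10), (100, 20), (100, 30), (160, 40)])
```

Unlike an ordinary least-squares fit, the median criterion lets the line ignore the outlier as long as more than half of the points agree.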

Page 8: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

5. Finding tab positions as line segments. (2/3)

• Tracking text lines to connect tab-stops:

– From one tab-stop to another.

– Associate tab-stops connected by text lines.

– Discard tab-stops with no connections.

– Record the most frequently occurring text line widths.
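
Recording the most frequent text line widths amounts to a histogram over quantised widths. The sketch below assumes integer pixel widths; the 10-pixel bucket and the name `common_line_widths` are illustrative choices, not taken from the paper.

```python
from collections import Counter


def common_line_widths(widths, bucket=10, top=2):
    """Return the `top` most frequent text-line widths, rounded to `bucket` pixels."""
    quantised = [(w + bucket // 2) // bucket * bucket for w in widths]
    return [width for width, _ in Counter(quantised).most_common(top)]


common = common_line_widths([598, 602, 601, 1204, 1198, 300])
```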

Page 9: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

5. Finding tab positions as line segments. (3/3)

• Cleaning up tab-stop ends:

– Make connected tab lines end at the same y coordinate, by moving the ends to a point between the last member CC and the first non-member CC.

• Reclassify CCs as “Text” or “Unknown”:

– A group of CCs of significant width forms a text line.

• Create artificial CCs from the image mask.

Page 10: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

6. Finding the column layout. (1/3)

• Scan CCs from left to right and top to bottom, gathering them into Column Partitions (CPs).

• A CP may not cross a tab-stop line.

• Collections of CPs are stored in Column Partition Sets (CPsets).

• Find the column layout → find an optimal set of CPsets that best “explains” all the CPsets on the page.

Page 11: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

6. Finding the column layout. (2/3)

• A good CP: it touches a tab line on both vertical edges.

• A good CP: its width is close to a frequently occurring width (slide 8).

• The coverage of CPset = total width of all the good CPs that it contains.

• A CPset A is better than CPset B if A has greater coverage.

• What does “explain” mean? In short:

– CPset A explains CPset B only if all of the following hold:

• B does not have more text than A.

• A has not split a column of common width.

• A does not have a different number of columns from B.

• A has not merged two columns of B.
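
The coverage measure defined on this slide can be sketched as follows. The `ColumnPartition` class, the 5% width tolerance, and the reading that either of the two "good CP" criteria is sufficient are assumptions made for the example; the slides do not specify them.

```python
from dataclasses import dataclass


@dataclass
class ColumnPartition:
    left: int
    right: int
    touches_left_tab: bool
    touches_right_tab: bool

    @property
    def width(self):
        return self.right - self.left


def is_good(cp, common_widths, tolerance=0.05):
    """A CP is good if it touches tab lines on both edges or has a common width."""
    if cp.touches_left_tab and cp.touches_right_tab:
        return True
    return any(abs(cp.width - w) <= tolerance * w for w in common_widths)


def coverage(cpset, common_widths):
    """Total width of all the good CPs in a CPset."""
    return sum(cp.width for cp in cpset if is_good(cp, common_widths))


def better(a, b, common_widths):
    """CPset a is better than CPset b if it has greater coverage."""
    return coverage(a, common_widths) > coverage(b, common_widths)


two_columns = [
    ColumnPartition(0, 600, True, True),
    ColumnPartition(650, 1250, False, False),
]
one_wide = [ColumnPartition(0, 1250, True, False)]
```

Here `two_columns` has coverage 1200 (both CPs are good), while `one_wide` has coverage 0, so the two-column CPset ranks first.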

Page 12: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

6. Finding the column layout. (3/3)

• Build a list from the set of CPsets on the page.

• Ordered with the best ones first.

• Duplicates are eliminated by the “A explains B” rules.

• Image CPs are ignored.

• Improve the candidates by adding new CPs.

Page 13: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

7. Finding the regions. (1/3)

• Create flows of CPs:

– Choose the best matching upper and lower partner.

– The number of partners is iteratively forced down to zero or one.

– Different rules for image CPs and text CPs.

– Each chain of CPs returned represents a candidate region:

• Text is blue.

• Heading text is cyan.

• Heading image is magenta.

• Pull-out image is orange.

Page 14: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

7. Finding the regions. (2/3)

• The rules to apply:

1. Type. If there are multiple types, text can only stay with its own (exact) type, whereas an image can stay with any other image type.

2. Transitive partner shortcuts are broken. If A has 2 partners B and C, and B also has C as a partner in the same direction, then delete C as a partner of A, leaving a clean chain A-B-C. Also, if A has a partner B, and B has a partner A in the same direction, break the cycle.

3. (Text only) If A still has 2 partners B and C, chase the partners of B and C to see which has the longest chain. Delete from A the partner that has the shortest chain, and convert the type of the shortest chain to pull-out.

4. (Image only) Choose the partner CP with the largest horizontal overlap.
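
Rule 2 (breaking transitive shortcuts) can be illustrated with a small sketch. The representation is hypothetical: `partners` maps each CP name to the set of its partners in one direction (for example, the partners below it). Cycle breaking and rules 1, 3 and 4 are not covered here.

```python
def break_transitive_shortcuts(partners):
    """Remove partner C from A when another partner B of A also has C as a partner."""
    cleaned = {}
    for a, direct in partners.items():
        shortcuts = {
            c
            for b in direct
            for c in partners.get(b, set())
            if c in direct and c != b
        }
        cleaned[a] = set(direct) - shortcuts
    return cleaned


chain = break_transitive_shortcuts({"A": {"B", "C"}, "B": {"C"}, "C": set()})
```

After the call, A keeps only B as a partner and B keeps C, which leaves the clean chain A-B-C described in the rule.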

Page 15: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

7. Finding the regions. (3/3)

• Determine the reading order:

1. Flowing blocks are ordered by y position within a column.

2. Pull-out blocks are ordered by y position in an imaginary column between the real columns that they touch.

3. A heading spans multiple columns and follows anything that is above it in the columns spanned, or between them.

4. A change in column layout works just like a heading.

5. Between headings, the content of columns is ordered from left to right.

• Find the polygon boundary for each region:

– Polygons are isothetic.

– Polygon edges are chosen to minimize the number of vertices.

– All CPs are contained within their region.
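
A simplified sketch of the reading-order rules above, covering only rules 1, 3 and 5. It assumes that every heading spans all columns and ignores pull-outs and layout changes; the tuple layout `(name, kind, column, y)` and the function name `reading_order` are hypothetical.

```python
def reading_order(blocks):
    """Order blocks given as (name, kind, column, y), with y increasing down the page.

    kind is "heading" or "flow". Headings split the page into horizontal bands;
    inside a band, columns are read left to right and blocks top to bottom.
    """
    headings = sorted((b for b in blocks if b[1] == "heading"), key=lambda b: b[3])
    flows = [b for b in blocks if b[1] == "flow"]
    ordered = []
    for heading in headings + [None]:
        limit = heading[3] if heading else float("inf")
        band = [b for b in flows if b[3] < limit]
        flows = [b for b in flows if b[3] >= limit]
        ordered += sorted(band, key=lambda b: (b[2], b[3]))
        if heading:
            ordered.append(heading)
    return [b[0] for b in ordered]


order = reading_order([
    ("title", "heading", 0, 0),
    ("c1a", "flow", 0, 10),
    ("c2a", "flow", 1, 12),
    ("c1b", "flow", 0, 50),
    ("mid", "heading", 0, 100),
    ("c1c", "flow", 0, 110),
    ("c2b", "flow", 1, 105),
])
```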

Page 16: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

8. Testing and results. (1/2)

• Algorithm implemented in C++.

• Part of Tesseract Open Source OCR system [3].

• Speed: one 8-megapixel image per second on a 3.4 GHz Pentium 4.

Page 17: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

8. Testing and results. (2/2)

[Figure: four bar charts (PRImA Metric, F-Measure, Recall, Precision), each comparing the methods 2007-Besus, 2007-TH1, 2007-TH2 and Tesseract on the measures Noise, Sep, Text, Image and Overall.]

ICDAR 2007 set [2, 8]

[8] A. Antonacopoulos, et al. "ICDAR2007 Page Segmentation Competition," Proc. 9th Int. Conf. on Doc. Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283.

Page 18: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

9. Conclusion and further work

• Tab-stops make an interesting and useful alternative to white rectangles.

• They enable page layout analysis to easily handle the complex non-rectangular layouts of modern magazines.

• Table detection will be added in the future.

Page 19: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

10. Criticism. (1/4)

• The idea is totally new and it works reasonably well, but

• No references.

• No formulas.

• No algorithms.

• No mathematical justification.

• Excess text and literature.

• The process is too long and, in many places, given without justification.

Page 20: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

10. Criticism. (2/4)

• An example:

– Preprocessing: Small CCs: h < 7 (@300ppi) ...

• Why 7?

• Does it only work at 300ppi?

• Only on magazine papers (10.5” x78.5”)?

Page 21: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

10. Criticism. (3/4)

• More:

– Reclassify CCs as “Text” or “Unknown”: A group of CCs of significant width forms a text line.

• What's a “significant width”?

– Find the polygon boundary for each region: Polygon edges are chosen to minimize the number of vertices.

• What's the algorithm or reference to do this?

Page 22: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

10. Criticism. (4/4)

ICDAR 2009 Results [2]

Page 23: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

11. References. (1/2)

1. Eleventh International Conference on Document Analysis and Recognition (ICDAR 2011) <http://www.icdar2011.org/>

2. A. Antonacopoulos, et al. ICDAR 2009 Page Segmentation Competition, Barcelona, Spain, 2009. <http://www.cse.salford.ac.uk/prima/ICDAR2009_pscomp/>

3. The Tesseract OCR <http://code.google.com/p/tesseract-ocr/>

4. M. Chen, X. Q. Ding, "Unified HMM-based Layout Analysis Framework and Algorithm," SCI CHINA Ser F, 46(6), Dec. 2003, pp401-408.

Page 24: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

11. References. (2/2)

5. G. Nagy, S.C. Seth, "Hierarchical Representation of Optically Scanned Documents," Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, 1984, pp347-349.

6. T. M. Breuel, "Two Geometric Algorithms for Layout Analysis," Proc. of the 5th Int. Workshop on Document Analysis Systems V, Springer-Verlag 2002, pp188-199.

7. Leptonica image processing and analysis library <http://www.leptonica.com>

8. A. Antonacopoulos, et al. "ICDAR2007 Page Segmentation Competition," Proc. 9th Int. Conf. on Doc. Analysis and Recognition, IEEE, Curitiba, Brazil, Sep 2007, pp1279-1283.

Page 25: Presentation of "Hybrid Page Layout Analysis via Tab-Stop Detection"

Questions?

Thank you