ITEC810 Final Report: Inferring Document Structure
Wieyen Lin / 41348133
Supervised by Jette Viethen
Post on 18-Dec-2015
Introduction (cont’d)
Research Objective
Analyze a document image and detect its logical structure with annotated labels
Project Scope
Focus on: Academic articles
Source Corpus: Association for Computational Linguistics (ACL) Anthology Corpus
Related Work
Physical Layout Analysis: top-down methods, bottom-up methods
Logical Structure Analysis: syntactic methods, rule-based methods
Methodology
1a. Grouping texts into lines
XML source by text
1b. Aggregating lines into blocks
XML source by line
Physical Structure
Phase I: Aggregation of Homogeneous Blocks
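The slides do not say how step 1a decides which texts share a line. As a minimal sketch only, the following assumes each text fragment from the XML source carries a top and a left coordinate, and groups fragments whose top coordinates differ by no more than a tolerance. The `Fragment` layout, the `group_into_lines` name, and the `tol` value are all hypothetical and not taken from the report.

```python
from typing import List, Tuple

# Hypothetical fragment layout: (top, left, text). The real XML schema
# used in the project is not shown on the slides.
Fragment = Tuple[float, float, str]


def group_into_lines(fragments: List[Fragment], tol: float = 2.0) -> List[str]:
    """Group text fragments into lines by vertical proximity (assumed rule)."""
    lines: List[List[Fragment]] = []
    # Visit fragments from top to bottom, then left to right.
    for frag in sorted(fragments, key=lambda f: (f[0], f[1])):
        # Same line if the top coordinate is close to the line's first fragment.
        if lines and abs(frag[0] - lines[-1][0][0]) <= tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, restore reading order by the left coordinate.
    return [
        " ".join(f[2] for f in sorted(line, key=lambda f: f[1]))
        for line in lines
    ]
```

Step 1b could then work on the resulting lines in the same spirit, comparing neighbouring lines rather than neighbouring fragments.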
Methodology (cont’d)
2. Annotating each block with a logical label
Logical Structure
XML source by block
1b. Aggregating lines into blocks
Phase II: Detection of Logical Structure
Methodology (cont’d)
Read in 3 lines at a time
Check dominant font size
  A1 A2 A3: check spacing
    s1 = s2: A1 A2 A3
    s1 > s2: A1 | A2 A3
    s1 < s2: A1 A2 | A3
  A A B: A A | B
  A B B: A | B B
  A1 B A2: A1 | B | A2
  A B C: A | B | C
A, B, C: lines of text with different dominant font sizes
A1, A2, A3: lines of text with the same dominant font size
s1: spacing between A1 and A2
s2: spacing between A2 and A3
|: block boundary; lines not separated by | belong to the same block
Algorithm for aggregating blocks in Phase II
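The decision made for one window of three lines can be sketched in Python as below. This is an illustration under stated assumptions, not the project's code: the `Line` fields, the `group_three` name, the downward-growing y axis, and the `tol` used to treat two spacings as equal are all hypothetical. How successive windows are combined is not stated on the slide and is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    font_size: float  # dominant font size of the line
    top: float        # y of the top edge (assumed field, y grows downward)
    bottom: float     # y of the bottom edge (assumed field)


def group_three(l1: Line, l2: Line, l3: Line, tol: float = 0.5) -> List[List[Line]]:
    """Split a window of three consecutive lines into blocks."""
    same12 = l1.font_size == l2.font_size
    same23 = l2.font_size == l3.font_size

    if same12 and same23:
        # Pattern A1 A2 A3: fall back to the vertical spacing.
        s1 = l2.top - l1.bottom  # spacing between A1 and A2
        s2 = l3.top - l2.bottom  # spacing between A2 and A3
        if abs(s1 - s2) <= tol:  # s1 = s2, within an assumed tolerance
            return [[l1, l2, l3]]
        if s1 > s2:
            return [[l1], [l2, l3]]
        return [[l1, l2], [l3]]
    if same12:
        return [[l1, l2], [l3]]   # pattern A A B
    if same23:
        return [[l1], [l2, l3]]   # pattern A B B
    return [[l1], [l2], [l3]]     # patterns A1 B A2 and A B C
```

Patterns A1 B A2 and A B C share a branch because both end with every line in its own block.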
Conclusion: Information Evaluation
Error Type                                                             Errors Found  Accuracy of Detection
Incorrect title or missing title                                       1             97.5% (39/40)
Incorrect Abstract heading or Missing Abstract heading                 4             90.0% (36/40)
Incorrect Abstract or Missing Abstract                                 4             90.0% (36/40)
Incorrect Affiliation(s) or Missing Affiliation(s)                     11            72.5% (29/40)
Missing >50% of Page number(s) or Erroneous Page number(s) found       15            62.5% (25/40)
Missing >50% Section heading(s) or Erroneous Section heading(s) found  11            72.5% (29/40)
Summary of detection results out of 40 randomly selected documents
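The accuracy figures follow from the error counts as (40 - errors found) / 40. A short check of that arithmetic, using the counts from the table; the dictionary keys and the `accuracy` helper are names chosen here for illustration:

```python
TOTAL_DOCS = 40  # number of randomly selected documents in the evaluation

# Error counts as reported in the summary table.
errors_found = {
    "title": 1,
    "abstract_heading": 4,
    "abstract": 4,
    "affiliations": 11,
    "page_numbers": 15,
    "section_headings": 11,
}


def accuracy(errors: int, total: int = TOTAL_DOCS) -> float:
    """Percentage of documents in which the element was detected correctly."""
    return 100.0 * (total - errors) / total


for name, errors in errors_found.items():
    correct = TOTAL_DOCS - errors
    print(f"{name}: {accuracy(errors):.1f}% ({correct}/{TOTAL_DOCS})")
```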