Logical Structure Analysis of Scientific Publications in Mathematics
DESCRIPTION
A talk at WIMS'11.
TRANSCRIPT
![Page 1: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/1.jpg)
Logical Structure Analysis of Scientific Publications in Mathematics
Valery Solovyev, Nikita Zhiltsov
Kazan (Volga Region) Federal University, Russia
1 / 44
![Page 2: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/2.jpg)
Overview
- LOD Cloud has been growing at 200-300% per year since 2007∗
- Prevalent domains: government (43%), geographic (22%) and life sciences (9%)
- However, it lacks data sets related to academic mathematics

∗ C. Bizer et al. State of the Web of Data. LDOW, WWW'11
![Page 3: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/3.jpg)
1 Background
2 Proposed Semantic Model
3 Analysis Methods
4 Experiments and Evaluation
5 Prototype
![Page 4: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/4.jpg)
Mathematical Scholarly Papers
Essential features
- Well-structured documents
- The presence of mathematical formulae
- Peculiar vocabulary (“mathematical vernacular”)
![Page 5: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/5.jpg)
Research Objectives
Current study
- Specification of the document logical structure
- Methods for extracting structural elements

Long-term goals
- A large corpus of semantically annotated papers
- Semantic search of mathematical papers
![Page 6: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/6.jpg)
Modelling the Structure of Scientific Publications
ABCDE format
- LaTeX-based format to represent the narrative structure of proceedings and workshop contributions
- Sections:
  - Annotations (Dublin Core metadata)
  - Background (e.g. description of research positioning)
  - Contribution (description of the presented work)
  - Discussion (e.g. comparison with other work)
  - Entities (citations)
![Page 7: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/7.jpg)
Modelling the Structure of Scientific Publications
SALT
- LaTeX-based authoring tool for generating semantically annotated PDF documents
- Three ontologies:
  - SALT Document Ontology
  - SALT Annotation Ontology
  - SALT Rhetorical Ontology
![Page 9: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/9.jpg)
Mathematical Knowledge Representation
- Languages for formalized mathematics
  - Mizar
  - Coq
  - Isabelle
- Semiformal math languages
  - HELM ontology
  - MathLang
  - OMDoc format (+ OMDoc ontology, sTeX)
- Presentation/authoring formats
  - PDF
  - LaTeX
![Page 11: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/11.jpg)
Trade-off Candidates
- arXMLiv format
  - XHTML+MathML
  - Marked-up theorem-like elements, sections, equations
  - Automatic conversion for LaTeX documents with styles of available bindings (LaTeXML)
  - 60% of arXiv.org has been converted into the format
- Present work
  - Follow the slides ⇒
![Page 14: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/14.jpg)
Proposed Semantic Model
- It is an ontology that captures the structural layout of mathematical scholarly papers (as in the LaTeX markup)
- The segment represents the finest level of granularity and has the properties:
  - starting and ending positions
  - the text or math contents
  - functional role
- Select most frequent segments from sample collections of genuine papers
- Consider synonyms as one concept (e.g. conjecture and hypothesis)
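A segment with the properties listed above could be modelled as a small record. This is a minimal sketch, not the authors' implementation: the class name `Segment`, its field names, and the `SYNONYMS` table are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical synonym table: variant names that map to one ontology concept.
SYNONYMS = {"hypothesis": "conjecture"}


def canonical_role(name: str) -> str:
    """Normalize a raw segment name and fold synonyms into one concept."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


@dataclass(frozen=True)
class Segment:
    start: int    # starting position in the document
    end: int      # ending position in the document
    content: str  # text or math contents
    role: str     # functional role, e.g. "theorem" or "proof"


seg = Segment(start=120, end=310, content="Every bounded ...",
              role=canonical_role("Hypothesis"))
```

Folding synonyms at construction time keeps every later step working with a single concept name per role.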
![Page 15: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/15.jpg)
Proposed Semantic Model (cont.)
- Select basic semantic relations between segments from the prior-art models
- Integration with SALT Document Ontology classes:
  - Publication
  - Section
  - Figure
  - Table
![Page 16: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/16.jpg)
Ontology Elements
http://cll.niimm.ksu.ru/ontologies/mocassin#
![Page 18: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/18.jpg)
Logical Structure Analysis
- The ontology specifies a controlled vocabulary for semantic analysis
- Two analysis tasks:
  - recognizing the types of document segments
  - recognizing the semantic relations between them
![Page 22: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/22.jpg)
Recognizing the Types of Document Segments
We exploit the LaTeX markup extensively
1. Extract a LaTeX environment
2. Associate it with a string that may be either the environment name or the environment title (if available)
3. Filter out standard formatting environments (e.g. center, align, itemize)
4. Compute string similarity between this string and the canonical names of ontology concepts
5. Accept the most similar concept only if its similarity passes a predefined threshold
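The five steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the library's code: the title is taken from `\newtheorem` declarations, similarity is the Dice coefficient over character bigrams (the slides only say that a q-gram measure was used), and the threshold `0.6` and all function names are hypothetical.

```python
import re

# Concept names follow the segment types listed in the evaluation table.
CONCEPTS = ["axiom", "claim", "conjecture", "corollary", "definition",
            "example", "lemma", "proof", "proposition", "remark", "theorem"]
# Step 3: standard formatting environments to skip (illustrative set).
FORMATTING = {"center", "align", "itemize"}
NEWTHEOREM = re.compile(r"\\newtheorem\*?\{(\w+)\}(?:\[\w+\])?\{([^}]*)\}")
BEGIN = re.compile(r"\\begin\{(\w+\*?)\}")


def qgrams(s, q=2):
    """Set of character q-grams of s, padded so word boundaries count."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def similarity(a, b):
    """Dice coefficient over q-gram sets (assumed measure)."""
    ga, gb = qgrams(a), qgrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def recognize(latex, threshold=0.6):
    """Return (environment name, concept or None) for each environment."""
    titles = dict(NEWTHEOREM.findall(latex))       # environment name -> title
    result = []
    for name in BEGIN.findall(latex):              # step 1
        if name in FORMATTING:                     # step 3
            continue
        label = titles.get(name, name)             # step 2
        best = max(CONCEPTS, key=lambda c: similarity(label, c))  # step 4
        accepted = similarity(label, best) >= threshold           # step 5
        result.append((name, best if accepted else None))
    return result


doc = r"""
\newtheorem{thm}{Theorem}
\newtheorem{lem}[thm]{Lemma}
\begin{thm} ... \end{thm}
\begin{center} ... \end{center}
\begin{lem} ... \end{lem}
\begin{hypothesis} ... \end{hypothesis}
"""
found = recognize(doc)
```

Here `thm` and `lem` resolve through their declared titles, `center` is filtered out, and `hypothesis` stays unmatched because no concept name is similar enough; a synonym table would be needed to map it to a conjecture.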
![Page 23: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/23.jpg)
Recognizing Navigational Relations
The dependsOn and refersTo relations are navigational
Assumption
Navigational relations are induced by referential sentences

Examples
- “By applying Lemma 1, we obtain ...” (dependsOn)
- “Theorem 2 provides an explicit algorithm ...” (refersTo)
![Page 24: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/24.jpg)
Recognizing Navigational Relations
Supervised method
1. Given a segment S, split its text into sentences, tokenize and do POS tagging
2. Referential sentences are ones that contain the \ref command entries
3. For each sentence:
  - find mentioned segments; each of them makes a pair with S (type feature)
  - for each pair, compute relative positions of segments normalized by the document size (distance feature)
  - build a boolean vector for its verbs (verb feature)
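The feature extraction described above might look like the following sketch. It makes several simplifying assumptions that are not in the slides: sentences are split with a naive regular expression, verbs are detected by prefix lookup in a fixed hypothetical vocabulary `VERBS` instead of POS tagging, and segments are plain dictionaries with `type`, `pos` and `text` keys.

```python
import re

# Hypothetical verb vocabulary; a real pipeline would derive verbs by POS tagging.
VERBS = ["add", "apply", "obtain", "provide"]
REF = re.compile(r"\\ref\{([^}]+)\}")


def features(segment, labels, doc_size):
    """Build one feature row per (segment, referenced segment) pair.

    segment: dict with "type", "pos", "text".
    labels:  maps a LaTeX label to a dict with "type" and "pos".
    """
    rows = []
    for sentence in re.split(r"(?<=[.!?])\s+", segment["text"]):
        targets = [t for t in REF.findall(sentence) if t in labels]
        if not targets:                  # not a referential sentence
            continue
        words = re.findall(r"[a-z]+", sentence.lower())
        # Verb feature: 1 if some word starts with the verb (crude stemming).
        verbs = [int(any(w.startswith(v) for w in words)) for v in VERBS]
        for label in targets:
            other = labels[label]
            rows.append({
                "t1": segment["type"],            # type feature
                "t2": other["type"],
                "d1": segment["pos"] / doc_size,  # distance feature
                "d2": other["pos"] / doc_size,
                "verbs": verbs,                   # verb feature
            })
    return rows


proof = {"type": "proof", "pos": 90,
         "text": r"By applying Lemma \ref{lem:1}, we obtain the bound. This is sharp."}
rows = features(proof, {"lem:1": {"type": "lemma", "pos": 270}}, doc_size=1000)
```

Only the first sentence contains a `\ref`, so it yields a single row pairing the proof with the lemma.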
![Page 25: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/25.jpg)
Recognizing Navigational Relations (cont.)
Supervised method
Example training instance

| t1 | t2 | d1 | d2 | add | ... | apply | ... | relation |
|---|---|---|---|---|---|---|---|---|
| proof | lemma | 0.09 | 0.27 | 0 | ... | 1 | ... | dependsOn |

- Train a learning model using these features and a labeled example set
- Apply the model to classify new induced relations
![Page 26: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/26.jpg)
Recognizing Restricted Relations
The hasConsequence, exemplifies and proves relations are restricted

Assumption
Restricted relations occur between consecutive segments
![Page 27: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/27.jpg)
Recognizing Restricted Relations (cont.)
Baseline method
According to the ontology, restricted relations involve instances of three types, separately: Corollary, Example and Proof
1. Seek a segment of one of these types
2. Find its preceding segments
3. Filter out segments of inappropriate types
4. Return the closest predecessor
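The baseline can be sketched as a backward scan over the segment types in document order. The admissible predecessor types in `RULES` and the direction of each relation are illustrative assumptions; the slides do not list them, and the ontology is the authority.

```python
# Target type -> (relation, admissible predecessor types).
# These sets are assumptions for illustration, not taken from the ontology.
RULES = {
    "corollary": ("hasConsequence", {"theorem", "lemma", "proposition"}),
    "example": ("exemplifies", {"definition", "theorem", "lemma", "proposition"}),
    "proof": ("proves", {"theorem", "lemma", "proposition", "corollary", "claim"}),
}


def restricted_relations(types):
    """types: segment types in document order.

    Returns (subject index, relation, object index) triples.
    """
    triples = []
    for i, seg_type in enumerate(types):
        if seg_type not in RULES:                    # step 1
            continue
        relation, admissible = RULES[seg_type]
        for j in range(i - 1, -1, -1):               # step 2: walk backwards
            if types[j] in admissible:               # step 3: filter by type
                if relation == "hasConsequence":
                    triples.append((j, relation, i))  # predecessor hasConsequence corollary
                else:
                    triples.append((i, relation, j))  # proof proves / example exemplifies
                break                                 # step 4: closest predecessor only
    return triples


order = ["definition", "theorem", "proof", "remark", "corollary", "example"]
relations = restricted_relations(order)
```

In this toy document every restricted relation attaches to the theorem at index 1, because it is the closest admissible predecessor in each case.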
![Page 29: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/29.jpg)
Experimental Setup
Collections
- 1355 papers of the “Izvestiya Vysshikh Uchebnykh Zavedenii. Matematika” journal
- A sample of 1031 papers from arXiv.org

Implementation
An open source Java library built upon:
- LaTeX-to-XML converters
- GATE framework
- Weka
- Jena

See http://code.google.com/p/mocassin
![Page 30: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/30.jpg)
Segment Recognition Evaluation
- Evaluation on the arXiv sample only
- Q-gram string matching algorithm was used
- The threshold value was optimized w.r.t. F1-score

| Type | # of true instances | F1-score |
|---|---|---|
| Axiom | 5 | 1.000 |
| Claim | 114 | 0.987 |
| Conjecture | 152 | 0.987 |
| Corollary | 1715 | 0.995 |
| Definition | 1838 | 1.000 |
| Example | 771 | 0.999 |
| Lemma | 4061 | 0.998 |
| Proof | 4943 | 0.997 |
| Proposition | 3052 | 0.999 |
| Remark | 2114 | 1.000 |
| Theorem | 4670 | 0.991 |
| other | 671 | 0.892 |
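Optimizing the threshold with respect to F1 can be illustrated as a simple sweep over candidate values. The scored pairs and candidate thresholds below are made-up toy data, and the function names are hypothetical; only the F1 definition itself is standard.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when nothing is positive."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_at(scored, threshold):
    """scored: (similarity, is_true_match) pairs; predict a match if similarity >= threshold."""
    tp = sum(1 for s, truth in scored if s >= threshold and truth)
    fp = sum(1 for s, truth in scored if s >= threshold and not truth)
    fn = sum(1 for s, truth in scored if s < threshold and truth)
    return f1_score(tp, fp, fn)


def best_threshold(scored, candidates):
    """Return the (threshold, F1) pair with the highest F1 among the candidates."""
    best = max(candidates, key=lambda t: f1_at(scored, t))
    return best, f1_at(scored, best)


# Toy data: similarity of an environment string to a concept, and whether
# the environment truly is an instance of that concept.
scored = [(1.0, True), (0.8, True), (0.55, True), (0.5, False), (0.3, False)]
threshold, score = best_threshold(scored, [0.4, 0.6, 0.9])
```

On this toy data the lowest candidate wins: it admits one false positive but recovers all three true matches.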
![Page 31: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/31.jpg)
Ontology Coverage Evaluation
- Evaluation on both entire collections (“Izvestiya” and arXiv)
- Equations are the most ubiquitous segments (52% and 69%, respectively)
- The ontology covers the types of 91.9% and 91.6% of segments (with the SALT Section class: 99.5% and 99.6%)
![Page 32: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/32.jpg)
Distribution of Segment Types
[Figure: bar chart of the percentage of segment occurrences (axis from 0% to 30%) per segment type (Theorem, Proof, Lemma, Remark, Corollary, Definition, Proposition, Example, others, Claim, Conjecture) for the Izvestiya and arXiv collections]
![Page 33: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/33.jpg)
Evaluation of Navigational Relation Recognition
- A paper contains 51.4 (Izvestiya) and 53.9 (arXiv) referential sentences on average
- 243 referential sentences were randomly selected and manually annotated
- 95% were true navigational relations
- A decision tree learner (C4.5) was trained
- The results were obtained with 10-fold cross validation

| Features | Accuracy | F1-score (refersTo) | F1-score (dependsOn) |
|---|---|---|---|
| type | 0.663 | 0.566 | 0.752 |
| type + distance | 0.658 | 0.663 | 0.704 |
| type + verb | 0.704 | 0.653 | 0.770 |
| type + distance + verb | 0.741 | 0.744 | 0.772 |
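The 10-fold partitioning of the 243 annotated sentences can be sketched as follows. The authors used Weka for training; this plain-Python sketch only shows how the folds might be formed, and the function name and seed are hypothetical.

```python
import random


def k_fold_splits(n_items, k=10, seed=0):
    """Yield (train indices, test indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


splits = list(k_fold_splits(243))
```

Each sentence lands in exactly one test fold, so every annotated instance is used for testing once and for training nine times.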
![Page 35: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/35.jpg)
Evaluation of Restricted Relation Recognition
- Evaluation on the arXiv sample only
- 10% of the documents which contain certain segments were randomly selected
- For each such segment, the corresponding relations were annotated manually
- Known issues: imported corollaries and examples for arbitrary text fragments

| Relation | # of instances | F1-score |
|---|---|---|
| hasConsequence | 178 | 0.687 |
| exemplifies | 62 | 0.613 |
| proves | 216 | 0.954 |
![Page 36: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/36.jpg)
Conclusion on Evaluation
- The ontology covers the largest part of the logical structure and appears to be feasible for automatic extraction methods
- The task of segment type recognition has been accomplished
- The method for recognizing navigational relations establishes ground truth; however, a large-scale evaluation and learning model selection are required
- The baseline method for recognizing restricted relations must be improved by leveraging additional information (discussed in the paper!)
![Page 38: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/38.jpg)
Prototype
A prototype:
- demonstrates our ongoing research on semantic search of mathematical papers
- incorporates the logical structure analysis methods
- is integrated with arXiv API
- enables enhanced search for arXiv papers and visualization of their logical structure
- publishes the semantic index as Linked Data via SPARQL endpoint
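A query against such a semantic index might be composed as in the sketch below. The namespace is the one shown on the "Ontology Elements" slide; the local names `Proof`, `Lemma` and `dependsOn` mirror the segment types and relations named in the talk, but their exact IRIs are assumptions. The endpoint address is not given in the slides, so the sketch only builds the query text and sends nothing.

```python
# Namespace from the "Ontology Elements" slide.
MOCASSIN = "http://cll.niimm.ksu.ru/ontologies/mocassin#"


def proofs_depending_on_lemmas(limit=10):
    """Build a SPARQL query for proofs that depend on a lemma.

    The class and property local names are assumed, not confirmed by the slides.
    """
    return f"""PREFIX moc: <{MOCASSIN}>
SELECT ?proof ?lemma WHERE {{
  ?proof a moc:Proof ;
         moc:dependsOn ?lemma .
  ?lemma a moc:Lemma .
}} LIMIT {int(limit)}"""


query = proofs_depending_on_lemmas(limit=5)
```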
![Page 39: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/39.jpg)
Search Interface
http://cll.niimm.ksu.ru/mocassin
![Page 40: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/40.jpg)
Formulating a Query
http://cll.niimm.ksu.ru/mocassin
![Page 41: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/41.jpg)
Search Results
http://cll.niimm.ksu.ru/mocassin
![Page 42: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/42.jpg)
Preview a Search Result
http://cll.niimm.ksu.ru/mocassin
![Page 43: Logical Structure Analysis of Scientific Publications in Mathematics](https://reader034.vdocuments.site/reader034/viewer/2022051314/55842296d8b42a79568b468e/html5/thumbnails/43.jpg)
Summary
- The proposed approach aims to analyze the structure of mathematical scholarly papers in an automatic way
- Our ontology provides a controlled vocabulary for analysis
- The methods elicit document segments in terms of the ontology
- The extracted semantic graph can be used for:
  - discovering important document parts
  - semantic search of theoretical results
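One simple way to look for important document parts in the extracted graph is to count incoming navigational relations. This is only a sketch: treating the number of incoming `dependsOn` and `refersTo` edges as importance is an assumption, not a method stated in the talk, and the segment identifiers are invented.

```python
from collections import Counter


def rank_by_incoming(triples, relations=("dependsOn", "refersTo")):
    """Rank segments by how many navigational relations point at them."""
    counts = Counter(obj for _, rel, obj in triples if rel in relations)
    return counts.most_common()


# Hypothetical (subject, relation, object) triples for one paper.
graph = [
    ("proof1", "dependsOn", "lemma1"),
    ("proof2", "dependsOn", "lemma1"),
    ("remark1", "refersTo", "theorem1"),
    ("proof1", "proves", "theorem1"),
]
ranking = rank_by_incoming(graph)
```

The `proves` edge is ignored here because it is a restricted relation rather than a navigational one.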