learning to mine definitions from slovene structured and unstructured knowledge - rich resources
DESCRIPTION
Darja Fišer, Senja Pollak, Špela Vintar University of Ljubljana, Dept. of Translation Studies {darja.fiser, spela.vintar}@guest.arnes.si, [email protected]. Learning to Mine Definitions from Slovene Structured and Unstructured Knowledge - Rich Resources. Aim. - PowerPoint PPT PresentationTRANSCRIPT
Darja Fišer, Senja Pollak, Špela VintarUniversity of Ljubljana, Dept. of Translation Studies{darja.fiser, spela.vintar}@guest.arnes.si, [email protected]
Extract definitions of specialised concepts from texts (journals, textbooks etc.). Use Wikipedia to learn rules that help
distinguish between proper definitions and non-definitions.
Extract candidate sentences from texts using 3 approaches:▪ patterns (A cell is the smallest living unit in an
organism)▪ automatic term recognition ▪ wordnet
Apply rules to select good definitions and discard non-definitions
LREC2010 Malta
title
definition
non-definition
LREC2010 Malta
Slovene Wikipedia (December 2009): 162,500 articles only well-formed pages retained morphosyntactic annotation and
lemmatization with ToTaLe (Erjavec et al. 2005)
structural parsing: 19,964 instances building a classification model in Weka
(Witten and Frank 2005) features: most frequent PoS and lemmata
LREC2010 Malta
best: J48 decision tree classifierexperimenting with full and merged
PoS, absolute frequency (AF) and binary values
10-fold cross-validationSETS
Instances
Attributes
NaiveBayes
J48 JRIP PART
ORIG 19964 260 66.91%82.13%
80.91% 82.56%
ORIG_bin
19964 260 73.85% 82.2% 80.6% 81.88%
MERGED
19964 188 62.64%82.72%
81.68% 82.72%
MERGED_bin
19964 188 72.39%82.44%
80.5% 81.79%
LREC2010 Malta
“unstructured texts”: subset of the FidaPlus corpus (http://www.fidaplus.net) knowledge-rich: textbooks, popular science
volumes (e.g. “All about mushrooms”) various domains: astronomy, physics,
geography, botany ... sloWNet – Slovene wordnet
(Fišer 2007, http://lojze.lugos.si/~darja/slownet.html) Automatic term recognition system for
Slovene (Vintar 2004, http://lojze.lugos.si/cgitest/extract.cgi)
LREC2010 Malta
The sentence is a definition candidate if: the sentence starts with a sloWNet literal and
contains at least one more literal from the same hyperonymy chain (i.e. its hyponym or its hypernym)
<term id=ENG20-13313485-n>Diabetes</term> je <term id=ENG20-13268088-n>bolezen</term>, ki je posledica pomanjkanja inzulina, hormona, ki skrbi, da celice v telesu dobivajo glukozo (sladkor).
[Diabetes is a disease resulting from insulin deficiency, the hormone providing glucose (sugar) for body cells.]
LREC2010 Malta
The sentence is a definition candidate if: the sentence contains at least two domain-
specific terms and the first term is in the nominative case
<term score=“80.45“>Ekvator</term> je najdaljši vzporednik,ki deli Zemljo na severno in <term score=”43.21”>južnopoloblo</term>.
[The Equator is the largest circle of latitude dividing the Earthinto the Northern and the Southern Hemispheres.]
LREC2010 Malta
The sentence is a definition candidate if:
the sentence contains a defining morphosyntactic pattern (NP[nominative] is_a NP [nominative]).
NP is_a NPCelica je strukturna in funkcionalna enota vseh živih organizmov.
[A cell is a structural and functional unit of all living organisms.]
LREC2010 Malta
Def. candidates
True definitions
Precision
sloWNet 104 41 0.39
ATR 629 118 0.19
Patterns 311 98 0.31
Total / Average
1044 257 0.29
• manual evaluation of all definition candidates
• sloWNet: best precision, ATR: best recall
• what is a definition??
LREC2010 Malta
sloWNet ATR Patterns
MERGED + J48 61.76% 69.79% 69.45%
MERGED_bin + J48 66.67% 71.06% 63.9%
ORIG_bin + J48 63.72% 65.98% 62.7%
For definitions only:
sloWNet ATR Patterns
Precision 0.63 0.46 0.514
Recall 0.415 0.441 0.551
F-measure 0.5 0.452 0.532
LREC2010 Malta
The Equator is an imaginary line on the Earth's surface equidistant from the North Pole and South Pole that divides the Earth into a Northern Hemisphere and a Southern Hemisphere.
An equator is the intersection of a sphere's surface with the plane perpendicular to the sphere's axis of rotation and containing the sphere's center of mass.
The longest of the five main circles of latitude on Earth (the others being the Arctic and Antarctic Circles and the Tropics of Cancer and Capricorn) is called the Equator.
LREC2010 Malta
Head lice are parasites that live in the hair and scalp of humans.
HEAD LICE, also called Pediculus Humanus Capitis are small blood-sucking, wingless insects found on the human scalp. They are approximately the size of a sesame seed and cannot jump or fly. They are six-legged creatures with claws, which help them cling to and crawl through human hair.
Head lice are an emerging social problem, not only in economically poor countries but also in practically all other societies.
LREC2010 Malta
Wikipedia can help us learn the properties of definitions,
Knowledge-rich texts are a good source of definitions,
A semantically-rich approach (using wordnet and ATR) yields many definitions and defining contexts.
Defining a definition is hard...
Encyclopaedic definitions differ from those found in running texts,
Future work: use other features in
learning, use active learning, redefine definitions
and possibly re-evaluate definition candidates
LREC2010 Malta