KWIC corpora as a source of specialized definitional information: a pilot study Antonio San Martín University of Granada, Spain


Post on 22-Jul-2015


KWIC corpora as a source of specialized definitional information: a pilot study

Antonio San Martín
University of Granada, Spain

1. Introduction

Motivation: definition writing

http://ecolexicon.ugr.es

• Definitions in other resources
• Corpus analysis

What should I include in my definitions?

Assumption

The lexical units that normally co-occur with another lexical unit are potentially important for defining it.

Hypothesis

Corpus of KWIC (Key Word In Context) concordances of the concept to define

Term list: potentially definitional terms for the concept to define

2. Methods


- Reference list
- Analysis list

2.1. Reference list

- Term list generated with TermoStat Web 3.0 (Drouin 2003): most frequent nouns, noun phrases and adjectives (4 or more occurrences)

- Source: English corpus of 133 specialized definitions of MAGMA.


- To minimize interference from terminological variation, terms in the reference list were categorized according to the conceptual proposition established with MAGMA.

- Any categorization has a certain degree of subjectivity. The configuration of our reference list is the result of certain choices.

Conceptual propositions and their instances from the list generated by TermoStat (frequencies in parentheses):

- magma is a rock: rock (163), molten rock (79), rock material (17), molten rock material (10), liquid rock (4)
- magma is a material: material (37), rock material (17), molten rock material (10), molten material (8)
- magma is (a) liquid / magma is a fluid: liquid (13), fluid (6), liquid rock (4)
- magma is a mixture / magma is made of a mixture: mixture (6)
- magma is molten: molten (105), molten rock (79), molten rock material (10), molten material (8), molten state (4)
- magma is hot: hot (18), temperature (6)
- magma is mobile: mobile (6)
- magma contains gas/bubbles: gas (25), bubble (4)
- magma contains crystals: crystal (24)
- magma contains silicate: silicate (9)
- magma contains volatiles: volatile (4)
- magma contains minerals: mineral (4)
- magma undergoes solidification: solidification (6), solid (5)
- magma undergoes (partial) melting: melting (7), partial melting (6)
- magma causes intrusion: intrusion (7)
- magma causes extrusion: extrusion (6)
- magma becomes igneous rock / magma is the raw material of igneous rocks: igneous (40), igneous rock (37), raw material (4)
- magma becomes lava: lava (38)
- magma is found under the Earth’s or a planet’s surface: earth (98), surface (63), planet (5), deep (6), depth (4), underground (5)
- magma is found deep in the Earth / at depth: deep (6), depth (4)
- magma is found in the (Earth’s) crust: crust (33)
- magma is found in the upper part of the (Earth’s) mantle: mantle (20), upper (5)
- magma is erupted from a volcano: volcano (7), volcanic (7)

2.2. Analysis lists

- Source: an English corpus of environmental texts (PANACEA corpus + LexiCon corpus) containing 359 occurrences of MAGMA.

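- WordSmith Tools (Scott 2008) was used to generate KWIC concordance lines with different context sizes (c = characters on each side of the keyword), plus a corpus of whole sentences:

  - 100c MAGMA 100c
  - 250c MAGMA 250c
  - 500c MAGMA 500c
  - 750c MAGMA 750c
  - Sentences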
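The fixed-width contexts above can be illustrated with a minimal sketch. The study used WordSmith Tools for this step; the Python below is only a stand-in, and the function name `kwic` and its parameters are hypothetical. It assumes the context is counted in characters on each side of the keyword and is simply cut off at the start or end of the text.

```python
import re


def kwic(text, keyword, window):
    """Return one context string per occurrence of `keyword`.

    Each string holds up to `window` characters on either side of the
    keyword (an assumption about how the 100c/250c/500c/750c contexts
    are measured). Matching is case-insensitive and whole-word.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - window)       # clip at start of text
        end = min(len(text), match.end() + window)   # clip at end of text
        lines.append(text[start:end])
    return lines


sample = ("Molten rock below the surface is called magma. "
          "When magma erupts it becomes lava.")
for line in kwic(sample, "magma", 10):
    print(repr(line))
```

Calling the function with `window` set to 100, 250, 500 and 750 over the same corpus would yield four KWIC corpora of increasing context size.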


- Each corpus was fed into TermoStat in order to obtain the most frequent nouns, noun phrases, and adjectives.
- The 50 and 100 terms with the highest raw frequency were retained for comparison with the reference list.
- Analysis lists:
  - 50-term 100c, 50-term 250c, 50-term 500c, 50-term 750c, 50-term sentence
  - 100-term 100c, 100-term 250c, 100-term 500c, 100-term 750c, 100-term sentence
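The cut-off by raw frequency can be sketched as follows. This assumes the term extractor's output is already available as a mapping from candidate term to raw frequency. The function name `top_terms` is hypothetical, and the alphabetical tie-breaking is an assumption, since the slides do not say how ties were handled. The frequencies in the example are borrowed from the reference list purely for illustration.

```python
def top_terms(frequencies, n):
    """Return the `n` terms with the highest raw frequency.

    `frequencies` maps each candidate term to its raw frequency.
    Ties are broken alphabetically (an assumption, for determinism).
    """
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


# Illustrative frequencies only, not an actual analysis list.
example = {"rock": 163, "molten": 105, "earth": 98, "lava": 38}
print(top_terms(example, 2))
```

Running the same function with `n=50` and `n=100` on each of the five corpora would give the ten analysis lists named above.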

2.3. Precision and recall

P = TP / (TP + FP)

R = TP / (TP + FN)

- TP (true positive): a term in the analysis list that matches any of the categories in the reference list. The result is expressed as a percentage.

- FP (false positive): a term in the analysis list that matches no category in the reference list. The result is expressed as a percentage.

- FN (false negative): a category in the reference list that is not matched by any of the terms in the analysis list. The result is expressed as a percentage.
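A minimal sketch of these definitions follows. It rests on two assumptions that the slides do not state: TP and FP are taken as percentages of the analysis list, FN as a percentage of the reference categories; and a term "matches" a category only if it appears literally among that category's instances, whereas the study matched terms to conceptual propositions by manual categorization. The function name and data layout are hypothetical.

```python
def precision_recall(analysis_terms, reference):
    """Compute precision and recall from TP, FP and FN percentages.

    `analysis_terms` is a list of terms; `reference` maps each category
    (conceptual proposition) to the set of terms that instantiate it.
    """
    if not analysis_terms or not reference:
        raise ValueError("both lists must be non-empty")

    all_reference_terms = set().union(*reference.values())
    matching = [t for t in analysis_terms if t in all_reference_terms]
    matched_categories = [
        category for category, terms in reference.items()
        if any(t in terms for t in analysis_terms)
    ]

    tp = 100 * len(matching) / len(analysis_terms)   # % of terms that match
    fp = 100 - tp                                    # % of terms that do not
    fn = 100 * (len(reference) - len(matched_categories)) / len(reference)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall


# Toy data for illustration only.
toy_reference = {
    "magma is molten": {"molten", "molten rock"},
    "magma is hot": {"hot", "temperature"},
    "magma becomes lava": {"lava"},
    "magma contains crystals": {"crystal"},
}
toy_analysis = ["molten", "lava", "chamber", "hot", "study"]
print(precision_recall(toy_analysis, toy_reference))
```

In the toy example three of five terms match (TP = 60 %, FP = 40 %) and one of four categories is unmatched (FN = 25 %), so precision is 0.6 and recall is 60 / 85.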


The F2-measure (Chinchor, 1992, 25), which gives twice the importance to recall as to precision, was also calculated. The formula used was the following:

F2 = (5 · P · R) / (4 · P + R)
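As a sketch of where the constants come from: the general F-beta measure is (1 + β²) · P · R / (β² · P + R), which for β = 2 gives 5 · P · R / (4 · P + R). The helper below is hypothetical and uses made-up input values; it only illustrates that, with β = 2, a list with high recall scores better than one with equally high precision.

```python
def f_beta(precision, recall, beta=2.0):
    """General F-beta measure; beta=2 weights recall twice as much as precision."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing matches
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Made-up values for illustration only.
print(f_beta(0.5, 1.0))  # high recall, low precision
print(f_beta(1.0, 0.5))  # high precision, low recall
```

Swapping precision and recall changes the score, which is not the case for the balanced F1 measure (β = 1).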

3. Results


- The 100-term 250c list performed best (F2-measure: 69.08 %). Its recall was also the highest (78.28 %).
- The highest precision corresponded to the 50-term 100c list, but its recall was 12 points below that of the 100-term 250c list.
- The sentence-corpus (SC) list obtained a lower F2 score than any of the KWIC lists.
- Beyond the 250-character context, longer contexts caused both precision and recall to decrease.

4. Conclusions and future work


‣Although the scope of this pilot study was limited, results indicate that a 250-character KWIC corpus coupled with a 100-term list generated from it could be a useful tool for definition writing.

‣The inevitable bias caused by the use of a reference list based on a manual classification does not invalidate the results.


‣This initial pilot study will subsequently be expanded to include new variables:

‣ other kinds of definienda
‣ verbs and adverbs in the term lists
‣ corpora of different levels of specialization
‣ more KWIC corpora with different character counts
‣ comparison of the output of TermoStat with other term extractors as well as a simple keyword generator


‣Our ultimate objective is to combine our approach with the application of knowledge-pattern-based techniques (Pearson, 1998; Meyer, 2001; Malaisé et al., 2005; Marshman and L’Homme 2006; Auger and Barrière, 2008, inter alia) to create a system of semi-automatic definitional information extraction.

Thank you


http://lexicon.ugr.es/sanmartin