Supporting e-learning with automatic glossary extraction

Experiments with Portuguese

Rosa Del Gaudio, António Branco

RANLP, Borovets 2007

Presentation Plan

● LT4eL project

● ILIAS

● Corpus

● Tool

● Grammars

  ● Copula

  ● Other Verbs

  ● Punctuation

● Results

● Conclusion

LT4eL

● Improve retrieval and accessibility of learning objects (LO) in learning management systems

● Employ language technology resources and tools for the semi-automatic generation of descriptive metadata

● Develop new functionalities, such as a keyword extractor, a glossary candidate detector and semantic search, tuned for the various languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian)

Objective

● Build a glossary automatically to support the e-learning process. In practice, this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages)

● Better access to information for students

● Accelerate the work of the tutor

ILIAS: Glossary Candidate Detector

The Corpus

• 274,000 tokens

• Tutorials

• PhD Thesis

• Scientific papers

• 3 Domains evenly represented

• e-learning

• Technology for non experts

• Calimera

XML format

<definingText continue="y" def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
  </markedTerm>
  <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
  <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
  <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
  <tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs" sp="y">desenvolvida</tok>
  <tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
  <tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms" sp="y">processamento</tok>
  <tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
  <tok base="informação" class="word" ctag="CN" id="t9040" msd="fp" sp="y">informações</tok>
  <tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
  <tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
  <tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
  <tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
  <tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
  <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
</definingText>
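To illustrate how this annotation can be consumed, here is a minimal Python sketch that reads a `definingText` element and pulls out the definition type, the marked term and the tokens. It uses only the standard library; the function name `read_definition` and the abbreviated sample string are illustrative assumptions, not part of the project tooling.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the annotated sentence (only the first four tokens are kept).
SAMPLE = """<definingText continue="y" def="m147" def_type1="is_def" id="d5">
<markedTerm dt="y" id="m147" kw="y"><tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok></markedTerm>
<tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
<tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
<tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
</definingText>"""


def read_definition(xml_text):
    """Return the definition type, the marked term and (form, lemma, tag) triples."""
    root = ET.fromstring(xml_text)
    # The defined term is the sequence of tokens wrapped in <markedTerm>.
    term = " ".join(t.text for t in root.findall("./markedTerm/tok"))
    # iter() walks every <tok> in document order, including those inside <markedTerm>.
    tokens = [(t.text, t.get("base"), t.get("ctag")) for t in root.iter("tok")]
    return {"type": root.get("def_type1"), "term": term, "tokens": tokens}


info = read_definition(SAMPLE)
print(info["type"], info["term"])
```

The triples keep the surface form, the lemma (`base`) and the part-of-speech tag (`ctag`), which are the attributes the grammar rules on the following slides query.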

LxTransduce

• Input: simple text or xml

• Regular expressions

• Substitution and markup

• Output the same file with changes

• Match tree using elements

• Quick

• Unicode friendly

• freeware

• Easy to integrate in other tools (java)

Rules in lxtransduce

<rule name="Conj">
  <query match="tok[@ctag = 'CJ']"/>
</rule>

<rule name="Coor">
  <first>
    <query match="tok[. = ',']"/>
    <ref name="Conj" mult="+"/>
  </first>
</rule>

<rule name="PARopen">
  <query match="tok[.~'^\($']"/>
</rule>

<rule name="PARcl">
  <query match="tok[.~'^\)$']"/>
</rule>

<rule name="parenthetic">
  <seq>
    <ref name="PARopen"/>
    <repeat-until name="tok">
      <ref name="PARcl"/>
    </repeat-until>
    <ref name="PARcl"/>
  </seq>
</rule>

First development phase

● Less than 50% of the corpus

● Focus on the verb

● Precision: correct automatic / all automatic

● Recall: correct automatic / manually marked

● F2: 3 * (precision * recall) / (2 * precision + recall)

         Precision  Recall  F2
Gr 00    0.14       0.44    0.26
Gr 01    0.31       0.20    0.22
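The evaluation measures on this slide can be written out as a short Python sketch. The function names are illustrative; the F2 form used here, 3PR / (2P + R), is the one that reproduces the figures reported for Gr 00 and Gr 01 up to rounding.

```python
def precision(correct, proposed):
    """Share of automatically proposed definitions that are correct."""
    return correct / proposed if proposed else 0.0


def recall(correct, gold):
    """Share of manually marked definitions that were found automatically."""
    return correct / gold if gold else 0.0


def f2(p, r):
    """F-measure weighting recall over precision: 3PR / (2P + R)."""
    denominator = 2 * p + r
    return 3 * p * r / denominator if denominator else 0.0


# Scores for the two first-phase grammars, from the reported precision and recall.
for name, p, r in [("Gr 00", 0.14, 0.44), ("Gr 01", 0.31, 0.20)]:
    print(name, round(f2(p, r), 3))
```

Because the inputs are themselves rounded to two decimals, the recomputed values can differ from the slide in the last digit.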

Second development phase

• 75% of the corpus for development

• 25% of the corpus for testing

• Specific grammar/rules for each type

Copula baseline grammar

<rule name="euristic">
  <seq>
    <repeat-until name="tok">
      <ref name="SERdef" mult="+"/>
    </repeat-until>
    <ref name="SERdef" mult="+"/>
    <not><ref name="PPA"/></not>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>

Verb “to be”, third person singular or plural, present indicative

<rule name="SERdef">
  <best>
    <ref name="Ser3"/>
    <ref name="PoderSer"/>
  </best>
</rule>
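The baseline heuristic can be paraphrased in Python as follows. This is only a rough sketch of one reading of the rule, not a port of lxtransduce: tokens are plain dictionaries carrying the `base`, `ctag` and `msd` attributes, only the `Ser3` alternative is modelled (`PoderSer` is left out), and the helper names are made up for the example.

```python
def is_ser3(tok):
    """True for 'ser' in the third person (singular or plural) of the present indicative."""
    return (
        tok.get("ctag") == "V"
        and tok.get("base") == "ser"
        and tok.get("msd", "").startswith("pi-3")
    )


def copula_candidate(tokens):
    """Flag a sentence whose first copula is not followed by a past participle (PPA)."""
    for i, tok in enumerate(tokens):
        if is_ser3(tok):
            # Skip a run of consecutive copula tokens (mult="+" in the rule).
            j = i
            while j < len(tokens) and is_ser3(tokens[j]):
                j += 1
            nxt = tokens[j] if j < len(tokens) else None
            return nxt is None or nxt.get("ctag") != "PPA"
    return False


sentence = [
    {"base": "intranet", "ctag": "PNM"},
    {"base": "ser", "ctag": "V", "msd": "pi-3s"},
    {"base": "uma", "ctag": "UM", "msd": "fs"},
    {"base": "rede", "ctag": "CN", "msd": "fs"},
]
print(copula_candidate(sentence))
```

A filter this permissive accepts almost any sentence containing "é" or "são", which is consistent with a baseline that favours recall at the expense of precision.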

Copula base result

• Sentence level results

• Problem with precision

Copula Grammar

Rules for is_type



<rule name="Serdef">
  <query match="tok[@ctag = 'V' and @base='ser' and (@msd[starts-with(.,'fi-3')] or @msd[starts-with(.,'pi-3')])]"/>
</rule>
....

<rule name="copula1">
  <seq>
    <ref name="SERdef"/>
    <best>
      <seq>
        <ref name="Art"/>
        <ref name="adj|adv|prep|" mult="*"/>
        <ref name="Noun" mult="+"/>
      </seq>
      ....
    </best>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>

Comparing Results

Include the patterns that were excluded

Try to gather the syntactic patterns of non-definitions and compare them with the syntactic patterns of definitions.

Other_Verbs grammar

• Collect verbs in a lexicon

• Three different categories: reflexive, active, passive

• 22 different verbs

<lex word="chamar"><cat>ref</cat></lex>
<lex word="chamar,chamado"><cat>pas</cat></lex>

<rule name="Vpas">
  <seq>
    <ref name="tok"/>
    <not><ref name="not"/></not>
    <ref name="tok" mult="?"/>
    <query match="tok[mylex(@base) and (@ctag='PPA')]" constraint="mylex(@base)/cat='pas'"/>
  </seq>
</rule>
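The lexicon lookup at the heart of the `Vpas` rule can be sketched in Python as below. The dictionary mirrors the two `lex` entries shown above; the function name and the token representation are assumptions made for illustration, and the surrounding sequence (preceding token, negation check, optional token) is not modelled.

```python
# Lexicon keyed by the token's base attribute, as in the <lex> entries above.
# "ref" = reflexive, "pas" = passive.
LEXICON = {
    "chamar": "ref",
    "chamar,chamado": "pas",
}


def is_passive_definition_verb(tok):
    """True for a past participle (PPA) whose base is listed as passive in the lexicon."""
    return tok.get("ctag") == "PPA" and LEXICON.get(tok.get("base")) == "pas"


participle = {"base": "chamar,chamado", "ctag": "PPA", "msd": "ms"}
print(is_passive_definition_verb(participle))
```

Keeping the verbs in a lexicon rather than in the rules means the 22 verbs can be extended or re-categorised without touching the grammar itself.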

Results for verb_type

• Analyze each verb separately, as with is_type

• Richer syntactic patterns

Punctuation Grammar

<rule name="punct_def">
  <seq>
    <start/>
    <ref name="CompmylexSN" mult="+"/>
    <query match="tok[.~'^:$']"/>
    <ref name="tok" mult="+"/>
    <end/>
  </seq>
</rule>

●Preliminary work

●Definition introduced by colon mark (most frequent)
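The colon pattern can be approximated in Python as shown here. This is a hedged sketch: the `CompmylexSN` rule is not shown on the slide, so the example simply assumes that a leading run of tokens tagged `CN` or `PNM` stands in for the noun phrase, and the function name is hypothetical.

```python
# Assumption: nominal tags that stand in for the CompmylexSN noun-phrase rule.
NOMINAL_TAGS = {"CN", "PNM"}


def split_colon_definition(tokens):
    """Split (form, tag) pairs into (term, definition) if they match 'NP+ : tok+'."""
    i = 0
    # Consume the leading nominal tokens (mult="+" means at least one is required).
    while i < len(tokens) and tokens[i][1] in NOMINAL_TAGS:
        i += 1
    if i == 0 or i >= len(tokens) or tokens[i][0] != ":":
        return None
    rest = tokens[i + 1:]
    if not rest:
        # The rule needs at least one token after the colon.
        return None
    term = " ".join(form for form, _ in tokens[:i])
    definition = " ".join(form for form, _ in rest)
    return term, definition


example = [("Intranet", "PNM"), (":", "PNT"), ("rede", "CN"), ("privada", "ADJ")]
print(split_colon_definition(example))
```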

All-in-one

• Combination of the previous grammars

• The type is not taken into account when calculating precision and recall
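One way to picture type-independent scoring is the following Python sketch, which merges the sentences flagged by each grammar and compares them with all manually marked sentences regardless of type. The sentence identifiers and the function name are invented for the example; the actual evaluation procedure is not detailed on the slide.

```python
def evaluate_all_in_one(predicted_by_type, gold_by_type):
    """Sentence-level precision and recall, ignoring the definition type."""
    # Merge the per-type sets: a sentence counts once, whatever grammar flagged it.
    predicted = set().union(*predicted_by_type.values())
    gold = set().union(*gold_by_type.values())
    correct = predicted & gold
    p = len(correct) / len(predicted) if predicted else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    return p, r


# Hypothetical sentence ids: sentence 2 is flagged as is_type but marked as verb_type.
predicted = {"is_type": {1, 2}, "verb_type": {3}}
gold = {"is_type": {1}, "verb_type": {2}, "punct_type": {4}}
print(evaluate_all_in_one(predicted, gold))
```

Under this scheme sentence 2 still counts as a hit even though the grammar that found it assigned a different type from the manual annotation.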

Conclusions and Future Work

• Overall results: Recall 86%, Precision 14%

• Differences among domains: the style of a document influences the results

• Improve the rules for verb_type and punc_type

• Combine with other techniques, such as machine learning (ML)
