Corpus based Creation and Extension of Domain-Specific Resources
Manuela Kunze, Dietmar Rösner
University of Magdeburg

Post on 05-Apr-2015

Overview

Background: Corpus Characteristics

Experiment 1: Context-related Derivation of Concepts

Experiment 2: Clustering of Values

Corpus: Forensic Autopsy Protocols

different document parts:
  findings
  histological findings
  background
  discussion
  …

Autopsy Protocols: Findings

short linguistic structures
typical attribute-value structures, expressed by
  noun phrases:
    Unterblutung des Gewebes. / Bleeding of tissue.
    Oberlippenbart. / Moustache (lit. upper lip beard).
  noun phrase + verb/adjective/noun phrase:
    Mund geschlossen. / Mouth closed.
    Nebennieren ohne Besonderheiten. / Adrenal glands without anomalies.

Usable for the extension of the resources in combination with GermaNet?

Corpus

400 protocols, parsed with a context-free grammar (ca. 40 rules)

focus of the analyses:
  complex noun phrases: derivation of concepts
  attribute-value structures: clustering of values

Overview

Corpus Characteristics

Experiment 1: Context-related Derivation of Concepts

Experiment 2: Clustering of Values

Approach

analysis of high-frequency complex noun phrases
example: Bruch des/der … (fracture of …)
  occurrences: 749
  types: 93
    known in GermaNet (31): Rippe/rib (254), Brustbein/sternum (65), Wirbelsäule/spine (58), Schambein/pubic bone (30), Schulterblatt/shoulder blade (23), …
    unknown (62): Schädeldach/calvarium (43), Oberschenkelknochen/femur (37), Schädelbasis/base of the skull (34), Schlüsselbein/clavicle (33), Brustwirbelsäule/thoracic spine (28), Halswirbelsäule/cervical spine (26), …
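
The counting step behind these figures can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the regular expression, the toy phrases and the `toy_lexicon` set (a stand-in for the nouns already covered by GermaNet) are assumptions, and no lemmatization is attempted.

```python
import re
from collections import Counter

# Matches the keyword "Bruch", a genitive article, and the complement noun.
PATTERN = re.compile(r"\bBruch (?:des|der) (\w+)")

def complement_counts(phrases):
    """Count how often each complement follows the keyword."""
    counts = Counter()
    for phrase in phrases:
        match = PATTERN.search(phrase)
        if match:
            counts[match.group(1)] += 1
    return counts

def split_known_unknown(counts, lexicon):
    """Partition complement types by whether the lexicon already covers them."""
    known = {c: n for c, n in counts.items() if c in lexicon}
    unknown = {c: n for c, n in counts.items() if c not in lexicon}
    return known, unknown

# Toy input; feminine nouns are used so that no genitive ending has to be stripped.
phrases = ["Bruch der Rippe", "Bruch der Rippe", "Bruch der Schädelbasis"]
toy_lexicon = {"Rippe"}  # hypothetical stand-in for GermaNet coverage

known, unknown = split_known_unknown(complement_counts(phrases), toy_lexicon)
print(known)    # {'Rippe': 2}
print(unknown)  # {'Schädelbasis': 1}
```

The number of keys corresponds to the type count and the sum of the values to the occurrence count.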

Idea: Analysis of Complex Noun Phrases

in corpus:
  fracture of <known>
  fracture of <unknown>
  (keyword: fracture; complement: <known> / <unknown>)

in GermaNet:
  class of <known>

deduce: class of <unknown> == class of <known>

Approach

known complement types of a keyword: 31 complement types
  known (31): Rippe/rib (254), Brustbein/sternum (65), Wirbelsäule/spine (58), Schambein/pubic bone (30), Schulterblatt/shoulder blade (23), …

1. collect all (GermaNet) senses: 36 senses
     …
     <nomen.Koerper> Finger
       <nomen.Koerper> => Gliedmaße, Extremität
     <nomen.Artefakt> Finger
       <nomen.Artefakt> => Computerprogramm, Programm
     <nomen.Koerper> Rippe
       <nomen.Koerper> => Knochen, Gebein
     …

2. determine the most frequent top level category
     [bar chart: high-frequency top level categories (as percentage); noun.body: 75; the remaining categories noun.artifact, noun.quantity and noun.food share 16.5, 5.5 and 3]
     top level category T: noun.body

3. remove senses which are not assigned to the preferred top level category: 36 senses => 27 senses

4. collect all semantic classes from the hypernym graph for each sense: 22 different semantic classes
     …
     <nomen.Koerper> Rippe,
       <nomen.Koerper> => Knochen, Gebein,
       <nomen.Koerper> => Hornsubstanz,
       <nomen.Koerper> => Körpersubstanz,
       <nomen.Substanz> => Stoff1, Substanz, Materie,
       <nomen.Tops> => Objekt,
       <nomen.Koerper> => Hornsubstanz,
       <nomen.Koerper> => Körpersubstanz,
       <nomen.Substanz> => Stoff1, Substanz, Materie,
       <nomen.Tops> => Objekt,
     …
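
The category vote and the subsequent sense filter might look like the sketch below. It is an illustration under assumptions: the three `(lemma, category)` pairs are the ones visible in the slide excerpt, not the full set of 36 senses, and the function names are hypothetical.

```python
from collections import Counter

# Toy sense inventory: (lemma, top level category), taken from the slide excerpt.
senses = [
    ("Finger", "nomen.Koerper"),
    ("Finger", "nomen.Artefakt"),
    ("Rippe", "nomen.Koerper"),
]

def preferred_category(senses):
    """Return the top level category carried by the most senses."""
    counts = Counter(category for _, category in senses)
    return counts.most_common(1)[0][0]

def keep_preferred(senses):
    """Drop every sense that is not assigned to the preferred category."""
    top = preferred_category(senses)
    return top, [sense for sense in senses if sense[1] == top]

top, kept = keep_preferred(senses)
print(top)   # nomen.Koerper
print(kept)  # [('Finger', 'nomen.Koerper'), ('Rippe', 'nomen.Koerper')]
```

Filtering by the majority category is what discards readings such as Finger as a computer program.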

Approach

collect all semantic classes from the hypernym graph: 22 different semantic classes

for each semantic class sc:
  • determine the level in the hypernym tree (f_sc)
  • count occurrences (n_sc)

select the maximum of (f_sc * n_sc) / N
  N: number of all semantic classes

most specific semantic class: Knochen

  …
  <nomen.Koerper> Rippe,
    <nomen.Koerper> => Knochen, Gebein,
    <nomen.Koerper> => Hornsubstanz,
    <nomen.Koerper> => Körpersubstanz,
    <nomen.Substanz> => Stoff1, Substanz, Materie,
    <nomen.Tops> => Objekt,
    <nomen.Koerper> => Hornsubstanz,
    <nomen.Koerper> => Körpersubstanz,
    <nomen.Substanz> => Stoff1, Substanz, Materie,
    <nomen.Tops> => Objekt,
  …
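
A small sketch of this selection, under stated assumptions: the level is counted from the root (root = 1), a class seen at several depths keeps its deepest level, and the input chains are toy data. Only the Rippe chain follows the slide; the Brustbein and Finger chains are invented for illustration.

```python
from collections import Counter

def most_specific_class(chains):
    """Score each semantic class by level * occurrences / number of classes."""
    levels = {}
    counts = Counter()
    for chain in chains:
        # Chains are ordered root first, so the index is the level f_sc.
        for depth, sc in enumerate(chain, start=1):
            levels[sc] = max(levels.get(sc, 0), depth)
            counts[sc] += 1  # occurrences n_sc
    n_classes = len(counts)  # N: number of distinct semantic classes
    scores = {sc: levels[sc] * counts[sc] / n_classes for sc in counts}
    return max(scores, key=scores.get), scores

toy_chains = [
    ["Objekt", "Stoff", "Körpersubstanz", "Hornsubstanz", "Knochen"],  # Rippe
    ["Objekt", "Stoff", "Körpersubstanz", "Hornsubstanz", "Knochen"],  # Brustbein (assumed)
    ["Objekt", "Körperteil", "Gliedmaße"],                             # Finger (assumed)
]

best, scores = most_specific_class(toy_chains)
print(best)  # Knochen
```

Multiplying level by frequency favours classes that are both deep in the tree and shared by many complements, which is why the root class Objekt does not win despite occurring in every chain.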

Results

85 % correct assignments (types)
94 % correct assignments (tokens)

erroneous cases:
  correct assignments to wrong complements
  wrong assignments to correct complements
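
The difference between the two figures is a matter of weighting: type accuracy counts each complement once, token accuracy weights it by its corpus frequency. The sketch below uses invented placeholder data and does not reproduce the reported 85 % and 94 %.

```python
def type_and_token_accuracy(assignments):
    """assignments: list of (complement, corpus_frequency, is_correct)."""
    types_total = len(assignments)
    types_correct = sum(1 for _, _, ok in assignments if ok)
    tokens_total = sum(freq for _, freq, _ in assignments)
    tokens_correct = sum(freq for _, freq, ok in assignments if ok)
    return types_correct / types_total, tokens_correct / tokens_total

# Invented example: the one wrong assignment hits the rarest type.
toy = [
    ("type_a", 40, True),
    ("type_b", 30, True),
    ("type_c", 20, True),
    ("type_d", 10, False),
]

type_acc, token_acc = type_and_token_accuracy(toy)
print(type_acc, token_acc)  # 0.75 0.9
```

A token score above the type score, as reported on the slide, suggests that the errors fall mainly on low-frequency complements.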

Results: Erroneous Cases

correct assignments to wrong complements:
  misspelling of tokens: „Oberschenkelknorren“
  erroneous fragments from the treatment of German truncations: „Bruch des Ober- und Unterarmes“
  erroneous syntactic analysis of the second NP: „Bruch der Wandung der …“

wrong assignments to correct complements:
  (complex) systems of bones, cartilages, connective tissues: „elbow joint“

Overview

Corpus Characteristics

Experiment 1: Context-related Derivation of Concepts

Experiment 2: Clustering of Values

Clustering of Values

conceptual analysis of linguistic structures:
  Mund geschlossen. / Mouth closed.
  Rachenschleimhaut duesterrot. / Mucosa of fauces dark red.
  Beckengeruest festgefuegt und unversehrt. / Pelvis closely joined and entire.
  Herzohren frei, ovales Vorhoffenster geschlossen. / Auricles of heart clear, oval atrium closed.
  Brustbein, Rippen und Wirbelsaeule intakt. / Sternum, ribs and spine intact.
  Brustkorb sehr schmal und leicht eindrueckbar. / Thorax very narrow and easily compressible.
  Nebennieren ohne Besonderheiten. / Adrenal glands without anomalies.
  …

1908 concepts:
  Mund / mouth
  Rachenschleimhaut / mucosa of fauces
  Beckengeruest / pelvis
  Herzohren, Vorhoffenster / auricles of heart, atrium
  Brustbein, Rippen, Wirbelsaeule / sternum, ribs, spine
  Brustkorb / thorax
  Nebennieren / adrenal glands

2098 different (linguistic) values:
  geschlossen / closed
  duesterrot / dark red
  festgefuegt, unversehrt / closely joined, entire
  frei, geschlossen / clear, closed
  intakt / intact
  sehr schmal, leicht eindrueckbar / very narrow, easily compressible
  ohne Besonderheiten / without anomalies

Do similar concepts have the same attributes?

What are the values for an attribute?

Relations Between Values

Do the values describe different attributes? (color, shape etc.)

If not, are the values
  paraphrases/synonyms?
  antonyms?
  values of an 'open' range?

Which lexical or conceptual relations exist between the values, e.g. synonyms, antonyms etc.?

clustering of values

Examples

Mund/mouth:
  deutlich geoeffnet
  fischmaulartig geoeffnet
  schlotartig geoeffnet
  ruesselartig geoeffnet
  froschmaulartig geoeffnet
  ovalaer geoeffnet
  geoeffnet
  spaltfoermig geoeffnet
  geschlossen

different kinds of 'opened' vs. closed

Examples

Milzgewebe/spleen tissue:
  nicht sehr blutreich
  fest
  deutlich gelockert
  stark gelockert
  relativ gelockert
  verhaertet
  gelockert
  leicht gelockert
  blutreich
  sehr blutarm
  faeulnisbedingt gelockert
  etwas faeulnisbedingt aufgelockert
  sehr blutreich

attributes described:
  concentration of blood
  consistency, form of tissue

Examples

Wirbelsaeule/spine:
  ebenfalls unversehrt
  ebenfalls intakt
  intakt
  unversehrt
  ohne Besonderheiten
  ohne Verletzungen

same findings

Approach

comparison of values of a concept: 33670 comparisons

comparison in several steps:
  1. character-based: via bigrams
  2. lexical-conceptual relations: available information in GermaNet
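
The slides do not name the bigram measure. As an assumption, the sketch below uses the Dice coefficient over character bigrams, with spaces kept and repeated bigrams counted. Under that assumption it yields 0.4286 for blutarm/blutreich, 0.4706 for sehr gross/sehr weit and 0.0 for feucht/sehr trocken, the same figures the deck reports for these pairs.

```python
from collections import Counter

def bigrams(text):
    """Multiset of adjacent character pairs, spaces included."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def dice(a, b):
    """Dice coefficient: 2 * shared bigrams / total bigrams of both strings."""
    bigrams_a, bigrams_b = bigrams(a), bigrams(b)
    total = sum(bigrams_a.values()) + sum(bigrams_b.values())
    if total == 0:
        return 0.0
    shared = sum((bigrams_a & bigrams_b).values())  # multiset intersection
    return 2 * shared / total

print(round(dice("blutarm", "blutreich"), 4))     # 0.4286
print(round(dice("sehr gross", "sehr weit"), 4))  # 0.4706
print(dice("feucht", "sehr trocken"))             # 0.0
```

A purely character-based score cannot tell antonyms from synonyms: blutarm and blutreich share the stem blut and therefore look similar, which is why a second, lexical step is needed.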

Approach

values of a concept
  → removing negations
      negations: 'kein', 'nicht', …
  → removing modifiers
      particles: sehr, sonst, ebenfalls
      adjectives with suffixes: '-artig', '-lich', '-ig'
      example: 'sonst unauffällig' → 'unauffällig'
  → 'corrected' values
  → bigrams of values
  → lexical/conceptual relations in GermaNet? compound?
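
A minimal sketch of this normalization, with several assumptions: the word lists contain only the items named on the slide plus the inflected form 'keine', values are split on whitespace, and a suffix adjective is treated as a modifier only when another token follows it (so a head such as 'unauffällig' survives). The function name and the returned negation flag are hypothetical.

```python
PARTICLES = {"sehr", "sonst", "ebenfalls"}
NEGATIONS = {"kein", "keine", "nicht"}  # 'keine' is an added assumption
MODIFIER_SUFFIXES = ("artig", "lich", "ig")

def normalize(value):
    """Return the 'corrected' value and whether a negation was removed."""
    tokens = value.split()
    negated = any(token in NEGATIONS for token in tokens)
    kept = [t for t in tokens if t not in NEGATIONS and t not in PARTICLES]
    # Keep the last token as head; drop suffix adjectives that precede it.
    core = [
        t for i, t in enumerate(kept)
        if i == len(kept) - 1 or not t.endswith(MODIFIER_SUFFIXES)
    ]
    return " ".join(core), negated

print(normalize("sonst unauffällig"))         # ('unauffällig', False)
print(normalize("nicht sehr blutreich"))      # ('blutreich', True)
print(normalize("fischmaulartig geoeffnet"))  # ('geoeffnet', False)
```

Keeping the negation as a separate flag lets 'glaenzend' and 'nicht glaenzend' fall into the same group while still being distinguishable.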

Results: Character-based Analysis

similar values with modifications (particles) and negations:
  selbst unauffaellig, sonst unauffaellig, unauffaellig
  glaenzend, nicht glaenzend
  geoeffnet, leicht geoeffnet, rundlich geoeffnet, spaltfoermig geoeffnet, spaltweit geoeffnet, froschmaulartig geoeffnet, …
  sehr muskelkraeftig, nicht sehr muskelstark, muskelkraeftig, nicht sehr muskelkraeftig, nicht muskelkraeftig
  blutreich, nicht-sehr-blutreich, sehr-blutreich
  blutarm, relativ-blutarm
  muskelschwach, sehr-muskelschwach
  geschlossen, spaltfoermig-geschlossen

Integration of GermaNet

search for relations between
  two tokens
  parts of tokens

queries about:
  coordinate terms
  synonyms, hypernyms, hyponyms
  antonyms
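
The lookup on tokens and on parts of tokens can be illustrated with a toy relation table. Everything here is an assumption for illustration: `TOY_RELATIONS` stands in for GermaNet (no real GermaNet API is called), and "parts of tokens" is approximated by stripping the common prefix of two compounds.

```python
# Hypothetical stand-in for GermaNet: unordered word pair -> relation.
TOY_RELATIONS = {
    frozenset({"kraeftig", "schwach"}): "antonym",
    frozenset({"arm", "reich"}): "antonym",
}

def lookup(a, b):
    """Relation between two whole words, or None if the table has none."""
    return TOY_RELATIONS.get(frozenset({a, b}))

def common_prefix_len(a, b):
    """Length of the shared leading substring."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def relation(a, b):
    """Try the whole tokens first, then the parts after the shared prefix."""
    direct = lookup(a, b)
    if direct:
        return direct
    k = common_prefix_len(a, b)
    if 0 < k < len(a) and k < len(b):
        return lookup(a[k:], b[k:])
    return None

print(relation("muskelkraeftig", "muskelschwach"))  # antonym
print(relation("blutarm", "blutreich"))             # antonym
print(relation("feucht", "intakt"))                 # None
```

Falling back to token parts is what allows compounds missing from the lexicon to inherit a relation from their second constituents.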

Results with GermaNet

sehr muskelkraeftig/very strong muscles vs. sehr muskelschwach/very weak muscles
  bigrams: 0.5882, 0.4167
  antonym: kraeftig vs. schwach

blutarm/low in blood vs. blutreich/rich in blood
  bigrams: 0.4286
  GermaNet: antonym: arm vs. reich

feucht/wet vs. sehr trocken/very dry
  bigrams: 0.0000
  GermaNet: coordinate terms, antonym

sehr gross/very large vs. sehr weit/very wide
  bigrams: 0.4706
  GermaNet: hypernym

frei/free vs. größtenteils vorhanden/mostly existent
  bigrams: 0.0833
  GermaNet: coordinate terms

keine Schwellung/no swelling vs. keine Verletzung/no injury
  bigrams: 0.42, 0.4
  GermaNet: hypernym

Results: Character-based + GermaNet

selbst unauffaellig, sonst unauffaellig, unauffaellig
glaenzend, nicht glaenzend
blutreich, nicht-sehr-blutreich, sehr-blutreich
blutarm, relativ-blutarm
sehr muskelkraeftig, nicht sehr muskelstark, muskelkraeftig, nicht sehr muskelkraeftig, nicht muskelkraeftig
muskelschwach, sehr-muskelschwach
geoeffnet, leicht geoeffnet, rundlich geoeffnet, spaltfoermig geoeffnet, spaltweit geoeffnet, froschmaulartig geoeffnet, …
geschlossen, spaltfoermig-geschlossen

Problem: Paraphrases

Wirbelsaeule/spine:
  intakt
  unversehrt
  ohne Besonderheiten
  ohne Verletzungen

same findings

future work

Idea: Detection of Paraphrases/Synonyms

document information + corpus information:
  analyse the value sets of a document
  compare the value sets of a concept described in different documents

values which are synonyms or antonyms do not occur together in one document

Example:
  Spine closely joined and entire.
  closely joined, entire: different attributes

Idea: Detection of Paraphrases/Synonyms

collect all values for a concept: candidates

values for the concept 'spine' in the autopsy protocols AP#1, AP#2, AP#3, AP#4, AP#5, …, AP#n:
  • entire, closely joined
  • broken
  • intact
  • intact
  • entire
  …

candidates: intact == broken == entire/closely joined == entire ?
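
The candidate step, combined with the earlier observation that synonyms or antonyms do not occur in the same document, might be sketched like this. The protocol identifiers and value sets are toy data and the function name is hypothetical; the sketch only removes pairs that co-occur in one document and makes no attempt to separate synonyms from antonyms.

```python
from itertools import combinations

def paraphrase_candidates(value_sets):
    """value_sets: document id -> set of values recorded for one concept."""
    all_values = set().union(*value_sets.values())
    # Every pair of values seen anywhere in the corpus starts as a candidate.
    candidates = {frozenset(pair) for pair in combinations(sorted(all_values), 2)}
    # Values used side by side in one document describe different attributes.
    for values in value_sets.values():
        for pair in combinations(sorted(values), 2):
            candidates.discard(frozenset(pair))
    return candidates

toy_spine = {
    "AP#1": {"entire", "closely joined"},
    "AP#2": {"broken"},
    "AP#3": {"intact"},
}

candidates = paraphrase_candidates(toy_spine)
print(frozenset({"entire", "closely joined"}) in candidates)  # False
print(frozenset({"entire", "intact"}) in candidates)          # True
print(frozenset({"broken", "intact"}) in candidates)          # True
```

The last line shows the limitation of this filter: an antonym pair such as broken/intact survives, so a further check is needed before treating candidates as paraphrases.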

Idea: Detection of Paraphrases/Synonyms

[bar chart: number of occurrences (axis 0 to 180) of the values intact, bleedings, closely joined, entire, without anomalies, without bleedings, without pathological findings]

removing of candidates:
  only one paraphrase
  bleedings vs. without bleedings: antonyms
  closely joined vs. entire: occur in the same document (for a concept)
    prefer: entire (number of occurrences)
    assumption: closely joined is an 'additional' attribute

selection of candidates (restrictions):
  only frequent values
  similar number of occurrences?

verification of results:
  obtain the value sets of other concepts which have similar values

Problems: Detection of Paraphrases

a value can be expressed by more than one value:
  'value 1' == 'value 2' + 'value 3'

result (set of paraphrases for a value) can contain antonyms

Detection of Paraphrases/Synonyms

solutions?
  integration of other resources: UMLS
  extension of GermaNet

1 sense of unversehrt
Sense 1
<adj.Koerper> unverletzt, unversehrt
  <adj.Koerper> => heil
  <adj.Koerper> => gesund
  <adj.Koerper> => ?krankheitsspezifisch
  <adj.Koerper> => ?körperzustandsspezifisch
  <adj.Koerper> => ?körperspezifisch

1 sense of intakt
Sense 1
<adj.Relation> intakt, ganz1, funktionstüchtig, funktionsfähig
  <adj.Relation> => ?funktionalitätsspezifisch
  <adj.Relation> => ?relationsspezifisch

same meaning?

Conclusion

experiments about corpus based semiautomatic extension of GermaNet:
  analysis of complex noun phrases: detection and transfer of GermaNet classes
  clustering of values: bigrams, using GermaNet information