SemanticMining WP20 meeting Freiburg, March 29 – 30, 2004

Page 1

SemanticMining WP20 meeting

Freiburg, March 29 – 30, 2004

Page 2

Agenda

March 29

12:30 - 13:30 Lunch
13:30 Welcome, discussion of agenda
14:00 - 14:35 Linköping presentation
14:35 - 15:10 Brighton presentation
15:10 - 15:45 Göteborg presentation
15:45 - 16:30 Coffee break
16:30 - 17:05 Stockholm presentation
17:05 - 17:40 Geneva presentation
17:40 - 18:25 Paris presentation
18:25 - 19:00 Freiburg presentation
20:00 Dinner

March 30

9:00 - 10:30 Discussion of the description of WP 20
10:30 - 10:45 Coffee break
11:00 - 12:45 Workplan for WP20: discussion and elaboration of deliverables
13:00 - 14:00 Lunch

Page 3

Multi-lingual Medical Dictionary
Description of Work (I)

The lack of a large-scale multi-lingual medical dictionary hampers the integration of European research activities in the medical field and, more seriously, the development of multi-lingual information retrieval services. One language technology suited to this problem is corpus-based machine translation. The aim of this project is to develop techniques and systems for generating lexical data from parallel corpora, and to develop and apply methods for evaluating machine translation systems. Parallel corpora exist, e.g., as translations from English into other European languages of the official WHO classifications and some other terminology systems. Several of the NoE partners have extensive experience in multilingual lexical resources and computational lexicography, while others are interested in applying such tools, e.g., to semi-automated translation, semi-automated coding and indexing, and advanced information retrieval systems.

Page 4

Multi-lingual Medical Dictionary
Description of Work (II)

Tasks

20.1 Facilitating short study visits of members of each other's groups
20.2 Sharing and exchange of methods, materials and collaboration on work in progress
20.3 Proposal for a common data structure for a multi-lingual medical dictionary
20.4 Generation of a multi-lingual medical lexicon in English, German, French, Portuguese, Italian, Spanish and Swedish, in a range of 4,000-40,000 entries per language

Deliverables

D20.1 Report Multi-lingual Medical Dictionary (m11)
D20.2 Report Multi-lingual Medical Dictionary (m17)

Page 5

Topics for Discussion

Lexeme features (morphology, syntax, semantics)
Application context (IR, NLG, …)
Linguistic framework (grammar theory)
Languages covered
Domain (sublanguages, general language)
Size of the lexicon
Implementation framework (sources, exchange templates, …)
Interfaces to terminological resources (UMLS, WordNet)
Methods for lexical acquisition (manual, semi-automatic)

Page 6

MorphoSaurus
Subword Lexicon & Thesaurus

Freiburg University Hospital, Department of Medical Informatics
Freiburg University, Computational Linguistics Lab

Page 7

Motivation – Intra- and Crosslingual Indexing for Information Retrieval

Requirements:

Elimination of inflectional and derivational variation:
{nucleus, nuclei}, {diagnosis, diagnoses, diagnostic}, {foot, feet}, {Lymphozyten, lymphozytär}

Decomposition of compound terms:
procto|sigmoid|o|scop|ie, para|sympath|ectomy, Rechts|herz|insuffizienz, psic|o|s|somát|ic|o

Resolution of synonyms and spelling variants:
{oesophagus, esophagus}, {leuko, leuco}, {cutis, skin}, {hemorrhage, bleeding}, {ascorbic, vitamin C}, {ancylostoma, hookworm}

Mapping of interlingual synonyms:
{blood, blut, sangue}, {liver, hepat..., fígado}, {kidney, nephr..., nefr..., nier..., ren, rim}

Page 8

What is a subword?

An atomic linguistic sense unit:

Morphemes: nephr, anti, thyr, scler, hepat, cardi
Morpheme aggregates: diaphys, ascorb, anabol, diagnost
Words: amyloid, bone, fever, liver
Exceptionally, noun groups: vitamin c, …

Taming the growth rates of lexical resources at a sublinear level

Page 9

Subword Delimitation Criteria

Semantic (compositionality):
Hyper | cholesterol | emia

Lexical (enabling synonym matching):
schleimhaut = mucosa (schleim | haut)

Data-driven (avoiding ambiguities and false segmentation), e.g.:
relationship, schwangerschaft (relation|ship, schwanger|schaft)

Page 10

The MorphoSaurus system

Extracts semantically relevant subwords from medical texts in different languages
Transforms IR-relevant content to concept-like semantic identifiers (MID = MorphoSaurus identifiers)

Page 11

Example

Original:
High TSH values suggest the diagnosis of primary hypo-thyroidism ...
Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose ...

Page 12

Example (continued)

Orthographic Normalization (using Orthographic Rules):
high tsh values suggest the diagnosis of primary hypo-thyroidism ...
erhoehte tsh-werte erlauben die diagnose einer primaeren hypo-thyreose ...
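The orthographic normalization shown on this slide can be approximated in a few lines of Python. This is only a minimal sketch and not the MorphoSaurus rule set: it assumes the rules amount to lowercasing plus transliteration of German umlauts, which is all the example reveals. The handling of "ß" is an added assumption, hyphens are left untouched, and the names UMLAUT_MAP and normalize_orthography are hypothetical.

```python
# Assumed rule set: transliterate German umlauts to ASCII digraphs.
# The mapping for "ß" is an assumption; it does not occur in the slide example.
UMLAUT_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize_orthography(text: str) -> str:
    """Lowercase the text, then replace umlauts using UMLAUT_MAP."""
    return text.lower().translate(UMLAUT_MAP)


print(normalize_orthography(
    "Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose"
))
# prints: erhoehte tsh-werte erlauben die diagnose einer primaeren hypo-thyreose
```

Lowercasing first means that capital umlauts are covered by the same four table entries.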

Page 13

Example (continued)

Morphosyntactic Parser (using Lexicon):
high tsh value s suggest the diagnos is of primar y hypo thyroid ism
er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose
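To make the segmentation step concrete, here is a minimal sketch of lexicon-driven subword segmentation. It is not the MorphoSaurus parser, whose algorithm the slides do not describe. Instead it uses a simple dynamic-programming search that splits a word into lexicon entries and prefers the split with the fewest parts. TOY_LEXICON and segment are hypothetical names, and the lexicon holds only a handful of entries taken from the example.

```python
from functools import lru_cache

# Toy lexicon with a few subwords from the slide example (assumption:
# the real lexicon is far larger and also carries morphosyntactic information).
TOY_LEXICON = frozenset(
    {"diagnos", "is", "primar", "y", "hypo", "thyroid", "ism", "value", "s"}
)


def segment(word, lexicon=TOY_LEXICON):
    """Split `word` into lexicon entries using the fewest parts; None if impossible."""

    @lru_cache(maxsize=None)
    def best(start):
        # An empty remainder is segmented by the empty tuple.
        if start == len(word):
            return ()
        candidates = []
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


print(segment("hypothyroidism"))  # ['hypo', 'thyroid', 'ism']
print(segment("diagnosis"))       # ['diagnos', 'is']
```

The word is passed without the hyphen that appears in the slide text. A fewest-parts preference is one simple way to favour long subwords over chains of short affixes.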

Page 14

Example (continued)

Semantic Normalization (using Thesaurus), MID-Representation:
upiiiij tsh valueiiqrij suggestiipzzr diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
upiiiij tsh valueiiqrij permitiji diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
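The semantic normalization step can be pictured as a lookup from subwords to identifiers, in which synonymous subwords of different languages share one identifier and subwords without an entry (affixes, function words) are dropped. The sketch below is an illustration under those assumptions. TOY_THESAURUS, PASS_THROUGH and to_mids are hypothetical names, and the "#..." strings are placeholders, not the identifiers used by the actual system.

```python
# Hypothetical subword -> MID table; identifiers are readable placeholders.
TOY_THESAURUS = {
    "high": "#up", "hoeh": "#up",
    "value": "#value", "wert": "#value",
    "suggest": "#suggest", "erlaub": "#permit",
    "diagnos": "#diagnost",
    "primar": "#primar", "primaer": "#primar",
    "hypo": "#small",
    "thyroid": "#thyre", "thyre": "#thyre",
}
# Tokens kept verbatim (assumption: acronyms such as "tsh" pass through unchanged).
PASS_THROUGH = {"tsh"}


def to_mids(subwords):
    """Map subwords to MIDs; drop anything that has no entry."""
    out = []
    for sw in subwords:
        if sw in TOY_THESAURUS:
            out.append(TOY_THESAURUS[sw])
        elif sw in PASS_THROUGH:
            out.append(sw)
    return out


english = "high tsh value s suggest the diagnos is of primar y hypo thyroid ism".split()
german = "er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose".split()

print(to_mids(english))
print(to_mids(german))
```

With this toy table the two sentences yield the same sequence except for one position, where "suggest" and "erlaub" map to different identifiers, mirroring the difference visible in the MID representation on the slide.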

Page 17

MorphoSaurus Thesaurus Features

Only two semantic relations:

Syntagmatic expansion:
nephrotomiiqwjja = nephriikwjza + tomyiiqjqqa
(to avoid known mis-segmentations, e.g. nephr + oto + mie)

Ambiguous readings:
seitiiyqyqa = lateraliijwira OR pagerijjrja
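The two thesaurus relations can be sketched as two lookup tables: one that expands an identifier into a sequence of identifiers, and one that lists alternative readings. This is an assumed representation for illustration only. EXPANSIONS, ALTERNATIVES and resolve are hypothetical names, and the "#..." identifiers are readable placeholders for the MIDs shown on the slide.

```python
# Syntagmatic expansion: one identifier stands for a sequence of identifiers.
EXPANSIONS = {"#nephrotomy": ["#nephr", "#tomy"]}
# Ambiguous readings: one identifier stands for several alternatives.
ALTERNATIVES = {"#seit": ["#lateral", "#page"]}


def resolve(mids):
    """Expand identifiers in order; ambiguous ones become a tuple of alternatives."""
    out = []
    for mid in mids:
        if mid in EXPANSIONS:
            # Expand recursively in case a component is itself expandable.
            out.extend(resolve(EXPANSIONS[mid]))
        elif mid in ALTERNATIVES:
            out.append(tuple(ALTERNATIVES[mid]))
        else:
            out.append(mid)
    return out


print(resolve(["#nephrotomy", "#seit", "#liver"]))
# ['#nephr', '#tomy', ('#lateral', '#page'), '#liver']
```

Keeping the alternatives together as a tuple leaves the choice between readings to a later step, for example to the search engine at query time.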

Page 18

Morphoedit
Lexicon Editor

Page 21

State of the Project

Domain: clinical language and lay expressions, partly

Validated entries:
21,397 English, 22,053 German, 15,029 Portuguese

Automatically generated entries:
8,992 Spanish subwords from Portuguese subwords

Page 22

CLIR Experiments (OHSUMED)

Manual translation of 106 English queries to German and Portuguese by medical experts

Baseline: machine translation/bilingual dictionaries (QTR)
  Google Translator to re-translate German/Portuguese queries to English
  Additional search in a bilingual lexeme dictionary derived from the UMLS Metathesaurus
  Stemmed by the Porter stemming algorithm / stop word elimination

MorphoSaurus: normalization of queries/documents (MSI)

Boolean search engine: frequency and adjacency measure

Results German: QTR: 68%, MSI: 93%
Results Portuguese: QTR: 54%, MSI: 62%

(RIAO'04)

Page 23

Multilingual MeSH Mapping

Morpho-semantic normalization of 35,000 English Medline abstracts with manual MeSH annotations
Statistical learning of indexing patterns
Using indexing patterns for mapping of normalized English/German/Portuguese texts

Results: agreement with gold standard (agreement with human indexers)
English: 33% (68%)
German: 30% (62%)
Portuguese: 27% (56%)

(RIAO'04)