SemanticMining WP20 meeting, Freiburg, March 29 – 30, 2004
TRANSCRIPT
Agenda
March 29
12:30 - 13:30 Lunch
13:30 Welcome, discussion of agenda
14:00 - 14:35 Linköping presentation
14:35 - 15:10 Brighton presentation
15:10 - 15:45 Göteborg presentation
15:45 - 16:30 Coffee break
16:30 - 17:05 Stockholm presentation
17:05 - 17:40 Geneva presentation
17:40 - 18:25 Paris presentation
18:25 - 19:00 Freiburg presentation
20:00 Dinner
March 30
9:00 - 10:30 Discussion of the description of WP 20
10:30 - 10:45 Coffee break
11:00 - 12:45 Workplan for WP20: discussion and elaboration of deliverables
13:00 - 14:00 Lunch
Multi-lingual Medical Dictionary
Description of Work (I)
The lack of a large-scale multi-lingual medical dictionary hampers the integration of European research activities in the medical field and, more seriously, the development of multi-lingual information retrieval services. A language technology well suited to this problem is corpus-based machine translation. The aim of this project is to develop techniques and systems for lexical data generation from parallel corpora, and to develop and apply methods for the evaluation of machine translation systems. Parallel corpora exist, e.g., as translations from English into other European languages of the official WHO classifications and some other terminology systems. Several of the NoE partners have extensive experience in multilingual lexical resources and computational lexicography, while others have an interest in applying such tools, e.g., for semi-automated translation, semi-automated coding and indexing, and advanced systems for information retrieval.
Multi-lingual Medical Dictionary
Description of Work (II)
Tasks
20.1 Facilitating short study visits of members of each other’s groups
20.2 Sharing and exchange of methods, materials and collaboration on work in progress
20.3 Proposal for a common data structure for a multi-lingual medical dictionary
20.4 Generation of a multi-lingual medical lexicon in English, German, French, Portuguese, Italian, Spanish, Swedish in a range of 4,000-40,000 entries per language
Deliverables
D20.1 Report Multi-lingual Medical Dictionary m11
D20.2 Report Multi-lingual Medical Dictionary m17
Topics for Discussion
Lexeme features (morphology, syntax, semantics)
Application context (IR, NLG, …)
Linguistic framework (grammar theory)
Languages covered
Domain (sublanguages, general language)
Size of the lexicon
Implementation framework (sources, exchange templates, …)
Interfaces to terminological resources (UMLS, WordNet)
Methods for lexical acquisition (manual, semi-automatic)
MorphoSaurus
Subword Lexicon & Thesaurus
Freiburg University Hospital, Department of Medical Informatics
Freiburg University, Computational Linguistics Lab
Motivation – Intra- and Crosslingual Indexing for Information Retrieval
Requirements:
Elimination of inflectional and derivational variation: {nucleus, nuclei}, {diagnosis, diagnoses, diagnostic}, {foot, feet}, {Lymphozyten, lymphozytär}
Decomposition of compound terms: procto|sigmoid|o|scop|ie, para|sympath|ectomy, Rechts|herz|insuffizienz, psic|o|s|somát|ic|o
Resolution of synonyms and spelling variants: {oesophagus, esophagus}, {leuko, leuco}, {cutis, skin}, {hemorrhage, bleeding}, {ascorbic, Vitamin C}, {ancylostoma, hookworm}
Mapping of interlingual synonyms: {blood, blut, sangue}, {liver, hepat..., fígado}, {kidney, nephr.., nefr.., nier.., ren, rim}
What is a subword?
An atomic linguistic sense unit:
Morphemes: nephr, anti, thyr, scler, hepat, cardi
Morpheme aggregates: diaphys, ascorb, anabol, diagnost
Words: amyloid, bone, fever, liver
Exceptionally, noun groups: vitamin c, …
Taming the growth rate of lexical resources, keeping it at a sublinear level
Subword Delimitation Criteria
Semantic (compositionality): hyper|cholesterol|emia
Lexical (enabling synonym matching): schleimhaut = mucosa (schleim|haut)
Data-driven (avoiding ambiguities and false segmentation), e.g. relationship, schwangerschaft (relation|ship, schwanger|schaft)
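The criteria above can be illustrated with a small dictionary-based segmenter. This is a minimal sketch, not the MorphoSaurus implementation: the lexicon `SUBWORDS`, the function name `segment`, and the "fewest pieces wins" heuristic are all assumptions made for illustration. It also assumes the reading that the parenthesized splits (e.g. relation|ship) are the segmentations to be avoided, so the whole word is listed as a lexicon entry of its own.

```python
from functools import lru_cache

# Hypothetical toy lexicon (illustration only, not MorphoSaurus data).
# "relationship" is listed as a whole entry so that it wins over
# the unwanted split relation|ship.
SUBWORDS = {"hyper", "cholesterol", "emia", "relation", "ship", "relationship"}


def segment(word, lexicon=frozenset(SUBWORDS)):
    """Split a word into lexicon subwords, preferring the fewest pieces.

    Returns a list of subwords, or None if the word cannot be fully
    covered by lexicon entries.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def best(start):
        if start == len(word):
            return ()
        candidates = []
        # Try longer prefixes first so ties favour the longest first piece.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in lexicon:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


print(segment("Hypercholesterolemia"))
print(segment("relationship"))
```

With this toy lexicon the compositional term is split into its three subwords, while `relationship` is returned as a single unit because the whole-word entry yields fewer pieces than the split.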
The MorphoSaurus system
Extracts semantically relevant subwords from medical texts in different languages
Transforms IR-relevant content to concept-like semantic identifiers (MID = MorphoSaurus identifiers)
Example:
Original
High TSH values suggest the diagnosis of primary hypo-thyroidism ...
Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypo-thyreose ...
Orthographic Normalization (Orthographic Rules)
high tsh values suggest the diagnosis of primary hypo-thyroidism ...
erhoehte tsh-werte erlauben die diagnose einer primaeren hypo-thyreose ...
Morphosyntactic Parser (Lexicon)
high tsh value s suggest the diagnos is of primar y hypo thyroid ism
er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose
Semantic Normalization (Thesaurus): MID-Representation
upiiiij tsh valueiiqrij suggestiipzzr diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
upiiiij tsh valueiiqrij permitiji diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
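The first and last steps of the example can be sketched in a few lines. This is only an illustration under stated assumptions: the orthographic rules are inferred from the example (lowercasing and umlaut expansion; the ß rule is an added guess), the `THESAURUS` table and its `#...` identifiers are invented placeholders rather than real MIDs, and `PASS_THROUGH` reflects that "tsh" appears unchanged in the MID representation shown on the slide. The segmented subword sequences are copied from the slide rather than produced by a parser.

```python
# Assumed orthographic rules, inferred from the slide example only.
ORTHO_RULES = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def normalize_orthography(text: str) -> str:
    """Lowercase the text and expand German umlauts."""
    return text.lower().translate(ORTHO_RULES)


# Hypothetical subword -> identifier table; identifiers are placeholders.
THESAURUS = {
    "high": "#up", "hoeh": "#up",
    "value": "#value", "wert": "#value",
    "suggest": "#suggest", "erlaub": "#permit",
    "diagnos": "#diagnost",
    "primar": "#primar", "primaer": "#primar",
    "hypo": "#small",
    "thyroid": "#thyre", "thyre": "#thyre",
}
# Tokens kept as they are (assumption based on "tsh" in the slide output).
PASS_THROUGH = {"tsh"}


def to_mids(subwords):
    """Map subwords to identifiers; affixes and function words are dropped."""
    mids = []
    for subword in subwords:
        if subword in THESAURUS:
            mids.append(THESAURUS[subword])
        elif subword in PASS_THROUGH:
            mids.append(subword)
    return mids


# Subword sequences as given on the slide (output of the parsing step).
english = "high tsh value s suggest the diagnos is of primar y hypo thyroid ism".split()
german = "er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose".split()

print(normalize_orthography("Erhöhte TSH-Werte erlauben die Diagnose"))
print(to_mids(english))
print(to_mids(german))
```

As on the slide, the two identifier sequences differ only in the entry for suggest/erlaub; every other position is shared across the two languages, which is what makes cross-lingual matching possible.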
MorphoSaurus Thesaurus Features
Only two semantic relations:
Syntagmatical expansion: nephrotomiiqwjja = nephriikwjza + tomyiiqjqqa (to avoid known mis-segmentations, e.g. nephr + oto + mie)
Ambiguous readings: seitiiyqyqa = lateraliijwira OR pagerijjrja
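The two relations can be pictured as two lookup tables applied in sequence. The sketch below is an assumption about how such relations might be represented, not the actual thesaurus format; the names `EXPANSION`, `READINGS`, `resolve` and the `#...` identifiers are hypothetical placeholders.

```python
# Syntagmatic expansion: one identifier stands for a sequence of identifiers.
EXPANSION = {"#nephrotom": ["#nephr", "#tomy"]}

# Ambiguous readings: one identifier stands for alternative identifiers (OR).
READINGS = {"#seit": ["#lateral", "#page"]}


def resolve(mids):
    """Return one slot per resulting position; each slot lists alternatives."""
    slots = []
    for mid in mids:
        # First expand composite identifiers into their parts ...
        for part in EXPANSION.get(mid, [mid]):
            # ... then replace ambiguous identifiers by their readings.
            slots.append(READINGS.get(part, [part]))
    return slots


print(resolve(["#nephrotom", "#seit"]))
```

A slot with more than one entry marks an ambiguous reading that a search engine could treat as a disjunction.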
Morphoedit
Lexicon Editor
State of the Project
Domain: clinical language and (partly) lay expressions
Validated entries: 21,397 English, 22,053 German, 15,029 Portuguese
Automatically generated entries: 8,992 Spanish subwords from Portuguese subwords
CLIR Experiments (OHSUMED)
Manual translation of 106 English queries to German and Portuguese by medical experts
Baseline (QTR): machine translation/bilingual dictionaries
Google-Translator to re-translate German/Portuguese queries to English
Additional search in a bilingual lexeme dictionary, derived from the UMLS-Metathesaurus
Stemmed by the Porter stemming algorithm / stop word elimination
MorphoSaurus (MSI): normalization of queries/documents
Boolean search engine: frequency and adjacency measure
Results German: QTR: 68%, MSI: 93%
Results Portuguese: QTR: 54%, MSI: 62%
(RIAO’04)
Multilingual MeSH Mapping
Morpho-semantic normalization of 35,000 English, manually MeSH-annotated Medline abstracts
Statistical learning of indexing patterns
Using indexing patterns for mapping of normalized English/German/Portuguese texts
Results (agreement with gold standard, agreement with human indexers):
English: 33% (68%)
German: 30% (62%)
Portuguese: 27% (56%)
(RIAO’04)