Resolving abbreviations to their senses in Medline

S. Gaudan, H. Kirsch and D. Rebholz-Schuhmann

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK

(Bioinformatics, Vol. 21, no. 18, 2005, pp. 3658-3664)

Abstract

Abbreviation resolution improves the accuracy of document retrieval engines and of information extraction (IE) systems.

The authors combine an automatic analysis of Medline abstracts with linguistic methods to build a dictionary of abbreviation/sense pairs.

Ambiguous global abbreviations are resolved using support vector machines (SVMs).

The system disambiguates abbreviations with a precision of 98.9% for a recall of 98.2%.

1. Introduction (1/2)

Global abbreviations appear in documents without the long form explicitly stated, whereas local abbreviations appear together with their long form in the document.

Common abbreviations become accepted as synonyms and represent important terms in their domain, whereas dynamic abbreviations are defined for convenience in only a particular paper.

The most problematic step in abbreviation resolution is retrieving the sense of a global abbreviation that is ambiguous.

1. Introduction (2/2)

Disambiguation schema

1. A lexicon is used to collect the abbreviations and their senses.

2. The method then computes the context of use for each sense.

3. A machine-learning algorithm is trained on the context of each sense.

4. To disambiguate an abbreviation in a document, its context in that document is computed.

5. The machine-learning algorithm then retrieves the most probable sense of the abbreviation given that context.

2. Dictionary of Abbreviations

Scanning all Medline abstracts available in August 2004 yields 5,250,259 long-form/abbreviation pairs found in 2,857,954 abstracts (referred to as D).

Example of long-form/abbreviation pairs:

The changes in adrenocorticotropin hormone (ACTH), cortisol and dehydroepiandrosterone (DHEA) in maternal and fetal plasma were estimated in two groups of women.
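The slides do not say how the pairs are detected, so the following is only a minimal sketch of one well-known heuristic (backward character matching in the style of Schwartz and Hearst), swapped in for illustration and not presented as the authors' method. The names `best_long_form` and `extract_pairs`, the regular expression and the window size are all assumptions.

```python
import re
from typing import List, Optional, Tuple

# Assumption: an abbreviation is a short parenthesized token with an upper-case letter.
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def best_long_form(short: str, text_before: str) -> Optional[str]:
    """Match the abbreviation's characters right-to-left against the preceding text."""
    s, l = len(short) - 1, len(text_before) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # abbreviation must additionally sit at the start of a word.
        while l >= 0 and (
            text_before[l].lower() != ch
            or (s == 0 and l > 0 and text_before[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = text_before.rfind(" ", 0, l + 1) + 1
    return text_before[start:]


def extract_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (long form, abbreviation) pairs found in one sentence."""
    pairs = []
    for match in PAREN.finditer(sentence):
        short = match.group(1)
        if not any(c.isupper() for c in short):
            continue
        words = sentence[: match.start()].split()
        # Assumed search window: a few words more than the abbreviation has letters.
        window = min(len(short) + 5, len(short) * 2)
        long_form = best_long_form(short, " ".join(words[-window:]))
        if long_form:
            pairs.append((long_form, short))
    return pairs


sentence = (
    "The changes in adrenocorticotropin hormone (ACTH), cortisol and "
    "dehydroepiandrosterone (DHEA) in maternal and fetal plasma were "
    "estimated in two groups of women."
)
pairs = extract_pairs(sentence)
```

Applied to the example sentence, this heuristic pairs ACTH with ‘adrenocorticotropin hormone’ and DHEA with ‘dehydroepiandrosterone’.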

2.1 Merging morphologically similar long forms (1/2)

E.g. ‘oestrogen receptor’ versus ‘estrogen-receptor’.

An n-gram similarity algorithm is used with a cut-off parameter (0.8) to merge similar long forms l1 and l2:

\[ \mathrm{similarity}(l_1, l_2, n) = \frac{|\mathrm{grams}_n(l_1) \cap \mathrm{grams}_n(l_2)|}{\sqrt{|\mathrm{grams}_n(l_1)| \cdot |\mathrm{grams}_n(l_2)|}} \]

E.g., grams3(‘hello’)={‘hel’, ‘ell’, ‘llo’}
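A minimal sketch of the n-gram comparison, assuming character n-grams as in the ‘hello’ example and assuming that the shared n-grams are normalized by the square root of the product of the two set sizes. The function names are illustrative only.

```python
from math import sqrt
from typing import Set


def grams(text: str, n: int = 3) -> Set[str]:
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(l1: str, l2: str, n: int = 3) -> float:
    """Shared n-grams, normalized by the geometric mean of the set sizes (assumed)."""
    g1, g2 = grams(l1, n), grams(l2, n)
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / sqrt(len(g1) * len(g2))


CUT_OFF = 0.8  # cut-off parameter given on the slide

score = ngram_similarity("Computed radiography", "Computed radiographic")
should_merge = score >= CUT_OFF
```

Two long forms would be merged into one group whenever their score reaches the cut-off.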

2.1 Merging morphologically similar long forms (2/2)

Long form 1 | Long form 2

Computed radiography | Computed radiographic

Compression ratios | Compression rate

Caloric restriction | Calorie-restricted

Thrombocytopenia with absent radii | Thrombocytopenia and absent radius

Transactivator responsive element | Trans-activator response element

2.2 Context based merging (1/3)

Some long forms can be morphologically quite different (e.g. ‘beta site APP-cleaving enzyme’ versus ‘beta site amyloid precursor protein-cleaving enzyme’) but still carry the same meaning.

The similarity between two sets of long forms (g1 and g2) is computed by considering the number of common words in the sets Dg1 and Dg2 of documents containing the long forms, normalized by the total number of words in the documents of the two sets:

2.2 Context based merging (2/3)

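\[ \mathrm{similarity}(g_1, g_2) = \frac{c(g_1, g_2)}{\sqrt{c(g_1, g_1) \cdot c(g_2, g_2)}} \]

with

\[ c(g_1, g_2) = \frac{1}{|D_{g_1}| \cdot |D_{g_2}|} \sum_{d_i \in D_{g_1},\; d_j \in D_{g_2},\; d_i \neq d_j} \frac{2\,|W(d_i) \cap W(d_j)|}{|W(d_i)| + |W(d_j)|} \]

if \(|D_{g_1}| > 1\) and \(|D_{g_2}| > 1\), where \(W(d_i)\) is the set of words in the document \(d_i\).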
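A minimal sketch of the context similarity between two groups of long forms. Each group is represented here by a dictionary from a hypothetical document identifier to that document's set of words. The pairwise word overlap is the Dice-style ratio (twice the common words over the total words of both documents); the square-root normalization of the final score is an assumption.

```python
from math import sqrt
from typing import Dict, Set

Docs = Dict[str, Set[str]]  # hypothetical document id -> set of words W(d)


def overlap(w1: Set[str], w2: Set[str]) -> float:
    """Twice the number of common words over the total number of words."""
    total = len(w1) + len(w2)
    return 2 * len(w1 & w2) / total if total else 0.0


def c(docs1: Docs, docs2: Docs) -> float:
    """Average pairwise overlap between two document sets, skipping identical documents."""
    total = sum(
        overlap(docs1[i], docs2[j])
        for i in docs1
        for j in docs2
        if i != j
    )
    return total / (len(docs1) * len(docs2))


def context_similarity(docs1: Docs, docs2: Docs) -> float:
    """Cross-group overlap normalized by the within-group overlaps (assumed form)."""
    norm = sqrt(c(docs1, docs1) * c(docs2, docs2))
    return c(docs1, docs2) / norm if norm else 0.0


# Toy data, invented for illustration only.
g1 = {"d1": {"a", "b"}, "d2": {"a", "c"}}
g2 = {"d3": {"a", "b"}, "d4": {"b", "c"}}
cross = c(g1, g2)
score = context_similarity(g1, g2)
```

Groups whose documents share much of their vocabulary score high even when the long forms themselves look nothing alike, which is what allows pairs such as an expanded chemical name and its short name to be merged.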

2.2 Context based merging (3/3)

Long form 1 | Long form 2

Alpha-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid receptors | AMPA receptors

Silver-stained nucleolar organizer regions | Argyrophilic staining of nucleolar organizer regions

Complete remission | Complete response

3. Disambiguation of Abbreviations

Whenever no long form associated with an ambiguous abbreviation is found, the context is used to identify the correct meaning of the abbreviation.

Which context words are generated to disambiguate abbreviations?

How is the classifier trained?

3.1 Context extraction

C-value algorithm (Frantzi and Ananiadou, 1999)

Extract a tuple of size n (55 on average) of relevant words for every document.

\[ \text{C-value}(w) = \log(|w|) \left( f(w) - \frac{1}{|T_w|} \sum_{v \in T_w} f(v) \right) \]

where \(w\) is the adjective-noun pattern candidate, \(|w|\) is the length (in words) of \(w\), \(f(w)\) is the frequency of \(w\) in the corpus and \(T_w\) is the set of adjective-noun patterns contained in the candidate \(w\).
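A minimal sketch of a C-value score for one candidate, assuming the usual form of the measure: the logarithm of the candidate's length in words, multiplied by its frequency minus the mean frequency of the related patterns in T_w. The natural logarithm and the handling of an empty T_w (no correction term) are assumptions, as are the function and parameter names.

```python
from math import log
from typing import Sequence


def c_value(length_in_words: int, freq: int, related_freqs: Sequence[int]) -> float:
    """Score a candidate from its length |w|, its frequency f(w) and the f(v) of T_w."""
    if not related_freqs:
        # Assumption: with an empty T_w there is nothing to subtract.
        return log(length_in_words) * freq
    mean_related = sum(related_freqs) / len(related_freqs)
    return log(length_in_words) * (freq - mean_related)


# Invented counts for illustration: a two-word candidate seen 10 times,
# with two related patterns seen 4 and 2 times.
score = c_value(2, 10, [4, 2])
```

Because of the logarithm of the length, a single-word candidate always scores zero under this form, so the measure favors multi-word patterns.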

3.2 The model (1/2)

S(a) is the set of senses for each abbreviation a in the dictionary.

Each sense s∈S(a) is illustrated by a set of documents Ds⊂D.

Ds is the set of documents containing the abbreviation/long-form pairs.

For each document d, the context words are extracted and the document is described by a vector v=g(d) with g: D →{0, 1}^n.

3.2 The model (2/2)

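The ith component of v, \(v_i\), is defined as

\[ v_i = \begin{cases} 1 & \text{if the word } w_i \text{ appears in the document } d, \\ 0 & \text{otherwise.} \end{cases} \]

The function \(\Gamma\) associates with each sense s a set of vectors \(\Gamma(s)\) as follows:

\[ \Gamma(s) = \{\, g(d) \mid d \in D_s \,\} \]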
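A minimal sketch of the document representation: g maps a document to a binary vector over a fixed list of context words, and each sense is represented by the vectors of its documents. The vocabulary and the documents below are invented for illustration, and `sense_vectors` is a hypothetical helper name.

```python
from typing import Iterable, List, Sequence


def g(document_words: Iterable[str], vocabulary: Sequence[str]) -> List[int]:
    """Binary vector: component i is 1 if vocabulary[i] appears in the document."""
    present = set(document_words)
    return [1 if word in present else 0 for word in vocabulary]


def sense_vectors(
    documents: Iterable[Iterable[str]], vocabulary: Sequence[str]
) -> List[List[int]]:
    """Vectors of all documents D_s that illustrate one sense s."""
    return [g(doc, vocabulary) for doc in documents]


vocabulary = ["receptor", "breast", "membrane", "protein"]
docs_for_sense = [
    ["breast", "cancer", "receptor"],
    ["receptor", "protein", "binding"],
]
vectors = sense_vectors(docs_for_sense, vocabulary)
```

Word order and word counts are discarded; only the presence or absence of each context word is kept.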

3.3 Disambiguation (1/2)

This problem can be described as a classification problem: assign g(d) to one of the classes represented by the vector sets \(\Gamma(s)\), where s∈S(a). It is solved with SVMs.

For each sense s of an abbreviation a:

The positive class is \(C^+(s) = \Gamma(s)\).

The negative class is \(C^-(s) = \bigcup_{s' \in S(a),\, s' \neq s} \Gamma(s')\).

3.3 Disambiguation (2/2)

An SVM is created for each sense s and trained with C+(s) and C-(s).

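The result is a function \(h_s: \{0, 1\}^n \rightarrow \mathbb{R}\) such that

\[ h_s(g(d)) \begin{cases} > 0 & \text{predicts } g(d) \in C^+(s), \\ < 0 & \text{predicts } g(d) \in C^-(s). \end{cases} \]

The sense prediction is

\[ \mathrm{sense}(a, d) = s \quad \text{if and only if} \quad \forall s' \in S(a):\ h_s(g(d)) \geq h_{s'}(g(d)), \ \text{and}\ h_s(g(d)) > 0. \]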
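A minimal sketch of the sense-selection rule, assuming the per-sense SVM outputs for one document are already available as a dictionary from sense to decision value. The function name and the example scores are invented; training the SVMs themselves is out of scope here.

```python
from typing import Dict, Optional


def predict_sense(scores: Dict[str, float]) -> Optional[str]:
    """Pick the sense with the highest decision value, provided it is positive."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # No sense is predicted when every classifier votes for its negative class.
    return best if scores[best] > 0 else None


# Invented decision values for two senses of one abbreviation.
chosen = predict_sense({"estrogen receptor": 1.3, "endoplasmic reticulum": -0.4})
rejected = predict_sense({"estrogen receptor": -0.2, "endoplasmic reticulum": -0.9})
```

Returning no sense when all values are negative is what lets the system trade recall for precision.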

4. Abbreviation Resolution

If a long form is found in the text, its most frequent form is kept.

If no long form can be retrieved from the document, then a look-up of the abbreviation in the dictionary is performed.

If only one sense is found, then the abbreviation is not ambiguous and the most frequent long form is kept.

If several senses are retrieved, then the disambiguation process is applied.
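The resolution steps on this slide can be sketched as a small decision procedure. The data structures are assumptions: `canonical` maps a long-form variant to the most frequent form of its group, `senses` maps an abbreviation to the most frequent long form of each of its senses, and `disambiguate` is a hypothetical callback standing in for the SVM-based step.

```python
from typing import Callable, Dict, List, Optional


def resolve(
    abbr: str,
    found_long_form: Optional[str],
    canonical: Dict[str, str],
    senses: Dict[str, List[str]],
    disambiguate: Callable[[str, List[str]], Optional[str]],
) -> Optional[str]:
    """Return the long form chosen for one abbreviation occurrence."""
    # Local abbreviation: the long form is in the text, keep its most frequent form.
    if found_long_form is not None:
        return canonical.get(found_long_form, found_long_form)
    # Global abbreviation: look it up in the dictionary.
    candidates = senses.get(abbr, [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]  # unambiguous
    return disambiguate(abbr, candidates)  # several senses: use the context


canonical = {"oestrogen receptor": "estrogen receptor"}
senses = {
    "ER": ["estrogen receptor", "endoplasmic reticulum"],
    "DHEA": ["dehydroepiandrosterone"],
}


def pick_first(abbr: str, candidates: List[str]) -> Optional[str]:
    """Placeholder for the context-based classifier."""
    return candidates[0]


local = resolve("ER", "oestrogen receptor", canonical, senses, pick_first)
single = resolve("DHEA", None, canonical, senses, pick_first)
ambiguous = resolve("ER", None, canonical, senses, pick_first)
```

Only the last branch needs the classifier, so most occurrences are resolved by look-up alone.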

5.1 Results: dictionary (1/2)

Medline abstracts between 1965 and 2004

5.1 Results: dictionary (2/2)

5.2 Results: disambiguation (1/2)

Disambiguation is required for abbreviations that have several senses and occur without the long form.

Considering abbreviations occurring in at least 40 documents, there are 7,806 abbreviations with 12,330 senses. Of these 7,806 abbreviations, 1,851 are polysemic, with 3.4 senses on average and a maximum of 32 senses for ‘PC’.

5.2 Results: disambiguation (2/2)

The SVMs were trained and tested using a 5-fold cross-validation scheme.

To avoid any explicit indication of the sense, the abbreviation long forms are removed from the text before the SVMs learn or classify the test documents.

The system achieves a precision of 98.9% for a recall of 98.2% (98.5% accuracy).
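A minimal sketch of the evaluation set-up described on this slide: the long form is masked out of each text, and the documents are split into five folds. How the folds were actually drawn is not stated, so the interleaved split below is an assumption, as are the function names.

```python
import re
from typing import Iterator, List, Tuple


def mask_long_form(text: str, long_form: str) -> str:
    """Remove the long form so that the classifier cannot read the sense directly."""
    return re.sub(re.escape(long_form), " ", text, flags=re.IGNORECASE)


def k_fold_indices(n_items: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each item is in the test set exactly once."""
    folds = [list(range(start, n_items, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [
            idx
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for idx in fold
        ]
        yield train, test


masked = mask_long_form("The estrogen receptor (ER) was measured.", "estrogen receptor")
splits = list(k_fold_indices(10, k=5))
```

Masking leaves the abbreviation itself in place, so the classifier sees the same kind of text it would meet for a global abbreviation.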

6. Discussion (1/3)

The dictionary of abbreviations, the context extraction and the disambiguation module are the three main components.

Dictionary

The dictionary has been generated from Medline so that its content is most suitable for abbreviation resolution in biomedical text.

A high-quality dictionary is crucial for resolving abbreviations with high precision and recall.

6. Discussion (2/3)

Context extraction

Context extraction is based on the text itself rather than on human annotations such as MeSH terms.

The context of a sense is represented with vectors that have on average 3000 non-empty features. In other words, each sense is represented with a considerable number of words.

6. Discussion (3/3)

Disambiguation module

The accuracy of the disambiguation method benefits from the high performance of SVMs, which have been used successfully in many text classification tasks.

Disambiguation of abbreviations is more accurate than word sense disambiguation of English words because the senses of an abbreviation are, on average, more distant from one another.
