towards a semantic lexicon for biological language processing

Comparative and Functional GenomicsComp Funct Genom 2005; 6: 61–66.Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.451

Conference Paper

Towards a semantic lexicon for biologicallanguage processing

Karin Verspoor*Los Alamos National Laboratory, PO Box 1663, MS B256, Los Alamos, NM 87545, USA

*Correspondence to:Karin Verspoor, Los AlamosNational Laboratory, PO Box1663, MS B256, Los Alamos,NM 87545, USA.E-mail: [email protected]

Received: 9 December 2004Accepted: 14 December 2004

AbstractThis paper explores the use of the resources in the National Library of Medicine’sUnified Medical Language System (UMLS) for the construction of a lexicon usefulfor processing texts in the field of molecular biology. A lexicon is constructed fromoverlapping terms in the UMLS SPECIALIST lexicon and the UMLS Metathesaurusto obtain both morphosyntactic and semantic information for terms, and the coverageof a domain corpus is assessed. Over 77% of tokens in the domain corpus are found inthe constructed lexicon, validating the lexicon’s coverage of the most frequent termsin the domain and indicating that the constructed lexicon is potentially an importantresource for biological text processing. Copyright 2005 John Wiley & Sons, Ltd.

Keywords: natural language processing; lexicon; unified medical language system

Introduction

It is well understood that natural language process-ing (NLP) applications require sophisticated lex-ical resources to support their processing goals.In the biomedical domain, we are privileged tohave access to extensive terminological resourcesin the form of controlled vocabularies and ontolo-gies, which have been integrated into the frame-work of the National Library of Medicine’s Uni-fied Medical Language System’s (UMLS) Metathe-saurus. However, the existence of such terminolog-ical resources does not guarantee their utility forNLP. In particular, we have two core requirementsfor lexical resources for NLP in addition to thebasic enumeration of important domain terms: rep-resentation of morphosyntactic information aboutthose terms, specifically part of speech informa-tion and inflectional patterns to support parsing andlemma assignment, and representation of semanticinformation indicating general categorical informa-tion about terms and significant relations betweenterms, to support text understanding and infer-ence (Hahn et al., 1999). Biomedical vocabulariesby and large commonly leave out morphosyntac-tic information, and where they address semantic

considerations, they often do so in an unprincipledmanner, e.g. by indicating a relation between twoconcepts without indicating the type of that rela-tion.

But all is not lost. The UMLS knowledgesources include two additional resources whichare relevant — the SPECIALIST lexicon, a lexi-con addressing our morphosyntactic requirements,and the Semantic Network, a representation of coreconceptual categories in the biomedical domain.The coverage of these two knowledge sourceswith respect to the full coverage of the Metathe-saurus is, however, not entirely clear. Furthermore,when our goals are specifically to process biologi-cal text — and often more specifically, text in themolecular biology domain — it is difficult to saywhether the coverage of these resources is mean-ingful. The utility of the UMLS knowledge sourcesfor medical language processing (MLP) has beenexplored (Johnson, 1999; Friedman et al., 2001);the time has now come to repeat these experi-ments with respect to biological language process-ing (BLP). To that end, this paper presents ananalysis of the UMLS resources, specifically withan eye towards constructing lexical resources suit-able for BLP. We follow the paradigm presented

Copyright 2005 John Wiley & Sons, Ltd.

62 K. Verspoor

in Johnson (1999) for medical language, exploringoverlap between the UMLS Metathesaurus andSPECIALIST lexicon to construct a morphosyn-tactic and semantically-specified lexicon, and thenfurther explore the overlap with a relevant domaincorpus for molecular biology.

The UMLS as a lexical knowledge source

There have been several investigations of theUMLS as a lexical knowledge source. McCrayet al. (2001) evaluated the nature of strings in theUMLS Metathesaurus with respect to their likeli-hood of appearing in a natural language corpus.They found that only 10% of the strings in theMetathesaurus occurred in their MEDLINE corpus(representing 1 year of MEDLINE abstracts), butwere able to identify some properties associatedwith the strings that could be used to filter outstrings that are unlikely to occur naturally in a cor-pus. While the authors suggest that occurrence of aterm in the Metathesaurus opens the possibility ofaccessing more extensive domain knowledge aboutthat term, they do not explore the nature of thatdomain knowledge for the terms they find in theircorpus, and do not explore the overlap of thoseterms with other UMLS resources.

Friedman et al. (2001) quantitatively compare alexicon developed manually for their MEDLEEsystem with a lexicon derived automatically fromthe UMLS, with respect to the task of processingclinical information in patient reports. They foundthe UMLS-derived lexicon led to poor performancerelative to their own lexicon. The results do not,however, invalidate the UMLS as an importantsource of lexical information, as they may simplybe a reflection of the completeness of the exist-ing MEDLEE lexicon for the task evaluated. Theauthors argue that using the UMLS can substan-tially reduce the manual effort in constructing alexicon.

Johnson (1999) explores the construction of alexical resource from the UMLS in support of pro-cessing of medical narrative, specifically utilizing acorpus of discharge summaries from hospital visits.Johnson explores the overlap between the Metathe-saurus, the SPECIALIST lexicon, and a domaincorpus, and presents some strategies for handlingsemantic ambiguities that arise during the mappingof terms in the different UMLS resources. Johnson

found that while 79% of the distinct lexical formsin his corpus occurred in the SPECIALIST lexicon,only 38% of those forms occurred in the semanticlexicon of more than 75 000 entries derived fromintersecting the Metathesaurus and the SPECIAL-IST lexicon — so only 38% of terms in the corpuscould be expected to have both morphosyntacticand semantic information derived from the UMLS.Johnson points out this may reflect the fact that theMetathesaurus may contain many complex medicalterms that should not be considered lexical items,or that may successfully be incorporated into thelexicon by assuming that they are nouns.

Methods

We follow Johnson (1999) and explore the over-lap in the UMLS Metathesaurus and the SPE-CIALIST lexicon to establish a baseline seman-tic lexicon, and then investigate its relevancefor a corpus in the molecular biology domain.We utilize the 2003AC UMLS release. As ourdomain corpus, we utilize 28 874 full text arti-cles from the Journal of Biological Chemistry(JBC), spanning the years 1998–2002, origi-nally obtained for the 2003 BioCreAtIvE compe-tition (http://www.mitre.org/public/biocreative/).While we realize that this is not a sample repre-sentative of the full domain of molecular biology,it is representative of a significant portion of thatdomain, and the results on JBC texts should beindicative of the coverage of our semantic lexi-con for this domain. We felt it preferable to usea corpus of full text articles rather than a corpusof abstracts derived from MEDLINE in order tomore completely assess coverage of the relevantlanguage.

The steps for building and evaluating our seman-tic lexicon are as follows:

• Lexemes in the SPECIALIST lexicon are match-ed to terms in the Metathesaurus. We load inall the strings represented in the SPECIALISTLRAGR file, and attempt to match Metathe-saurus strings extracted from the MRCON fileto these strings. This is done by considering dif-ferent kinds of matches:

° Exact match.

° Match after upper-casing the first letter of theSPECIALIST string.

Copyright 2005 John Wiley & Sons, Ltd. Comp Funct Genom 2005; 6: 61–66.

http://www.mitre.org/public/biocreative/

Towards a semantic lexicon for biological language processing 63

° Match after upper-casing the first letter of eachword of the SPECIALIST string.

° Match after upper-casing the entire SPECIAL-IST string.

° Other case-insensitive match.

° Match (any of the above types) after strippingthe Metathesaurus string of ‘, NOS’ or ‘〈1〉’,‘〈2〉’, etc. at the end.

° Finally, consider whether each of the con-stituent tokens of a multi-token (space con-taining) Metathesaurus string occurs in theSPECIALIST lexicon (after removal of wordsconsisting of all numbers or punctuation), inorder to assume a compositional analysis ofthe term.

• Filter the resulting lexicon (a subset of theoriginal SPECIALIST lexicon tied to specificMetathesaurus terms) by removing any terms forwhich the corresponding Metathesaurus stringis not associated with a semantic type throughone of its associated concepts. There may beconcepts for which the UMLS does not providesemantic information, and therefore they do notsatisfy our lexical constraints requiring bothmorphosyntactic and semantic information.

• Search the domain corpus for occurrences ofany lexical variant of each term in our semanticlexicon (obtaining lexical variants from theUMLS lexical tools), and track any matches inorder to establish the coverage of the lexicon.

Results

Our results on matching between the SPECIALISTlexicon and the Metathesaurus, shown in Table 1,indicate that the proportion of Metathesaurus

Table 1. Matches between the UMLS Metathesaurus termsand the SPECIALIST lexicon

n (%)

Exact matches 58 918 3.0First letter upper case 67 765 3.5First letter, all words upper case 13 922 0.7Entire string upper case 12 961 0.7Other case insensitive match 1 982 0.1Stripped term matches 5 945 0.3Total direct matches 161 493 8.2Constituent matches 1 548 389 79.0Total matches 1 709 882 87.3

terms directly occurring (through some matchingparadigm) in the SPECIALIST lexicon is in factslightly less than Johnson’s (1999) finding of 12%at 8.2%. This is due to the incredible growth inthe Metathesaurus in the past few years; John-son reports finding 630 658 unique strings in theMetathesaurus, while the version we worked withcontains 1 959 516 unique strings. The SPECIAL-IST lexicon has grown as well (from 164 850 dis-tinct lexical forms to 292 979), but clearly not atpace with the Metathesaurus. This result is in linewith Johnson’s observation that many of the termsin the Metathesaurus are probably not appropriatefor recording directly in the SPECIALIST lexicon.However, upon inspection of the constituent struc-ture of Metathesaurus terms, we found that for alarge proportion of terms (79%), each of the con-stituent members of the (multi-word) term couldbe found in the SPECIALIST lexicon. This opensthe possibility of a compositional analysis for manyMetathesaurus terms, although it does not addressthe assignment of semantic type to the term as awhole.

The number of unique SPECIALIST termsmatched by Metathesaurus terms was 108 295.These string matches were used to create a lex-icon containing 96 205 unique entries from theSPECIALIST lexicon (where a given term maycorrespond to multiple lexical entries due to themorphosyntactic ambiguity of the term, and a givenlexical entry may correspond to multiple termsdue to lexical variation) by identifying each ofthe lexical entries to which a matched string maycorrespond. This is 52% of the complete SPE-CIALIST lexicon (of 183 301 entries). Filteringthis lexicon according to the constraint of havinga semantic type for each had no impact whatso-ever — we found that each of the 78 595 uniqueMetathesaurus concepts matched to a SPECIALISTlexicon term was also associated with a seman-tic type in the Metathesaurus, so there was noreduction in the lexicon. It should be noted thatthe number of Metathesaurus concept matches issignificantly lower than the number of Metathe-saurus term matches, because several distinct termsin the Metathesaurus may correspond to the sameconcept.

We next explored the overlap of the resultinglexicon with our domain corpus, by looking formatches between tokens in the corpus and anylexical form associated with the 96 205 entries in


64 K. Verspoor

our subset of the SPECIALIST lexicon (whetheror not that exact form occurred in the initialMetathesaurus term set). We split each of the28 874 JBC files into tokens after stripping HTMLtags and converting HTML character entities. Weinvestigated several different ways in which a tokencould match a lexical entry:

• Exact single token match: the token occurs inthe lexicon exactly as it appears in the text.

• Case-insensitive single token match: the token inthe text matches a lexical entry when matchedcase insensitively.

• Exact multi-token term match: the token starts aphrase in the text that exactly matches a multi-token term in the lexicon.

• Case-insensitive multi-token term match: thetoken starts a phrase in the text that matchesa multi-token term in the lexicon when matchedcase insensitively.

• ‘Relaxed’ hyphenated token match: for singletokens containing hyphens that did not matcha lexical entry in some way, we generated avariant of the token with the hyphens replaced

by spaces, effectively generating a multi-tokenterm out of the original single token. Thefollowing matches were then attempted:

° Match (exact) of the relaxed token string to amulti-token term in the lexicon.

° Matching (exact and case-insensitive) wherethe relaxed token string starts a phrase in thetext matching a (longer) multi-token term inthe lexicon.

° If the relaxed token string did not match amulti-token term in the lexicon, attempt tomatch each of the constituent words of thestring to a lexeme.

The results appear in Table 2. We see that over77% of the tokens in the corpus match exactly to alexeme in the lexicon, with a total of 83% match-ing when case-insensitive matches are allowed.Although this corresponds to only approximately3% of the distinct tokens found in the corpus,the high coverage of the corpus as a whole indi-cates that this 3% corresponds to the most frequenttokens in the corpus. The lexicon therefore includesthe main content-bearing terms of the domain,

Table 2. Matches between the derived lexicon and the domain corpus

Count % of base set

Total number of files processed 28 874Basic token matches

Number of tokens 156 608 748Single token matches 121 552 230 77.6Additional matches with case insensitivity 9 419 429 6.0Multi-token term matches 2 866 226 1.8Additional multi-token term matches with case insensitivity 261 128 0.2

Number of unique tokens 1 898 320Unique unmatched tokens 1 836 148 97Unique unmatched numeric tokens 78 770 0.5

Matches for tokens following hyphenation relaxationNumber of tokens relaxed 6 869 993Number of constituent tokens 13 994 307

Relaxed tokens directly matching multi-token term 157 529 2.3Tokens starting (longer) multi-token term match 14 100 0.2Additional matches with case insensitivity 0 0.0Tokens with some constituent match 4 899 189 71.3Number of constituent tokens matching 7 396 976 52.9

Lexicon matchesUnique lexemes in lexicon 292 979Unique lexical entry IDs in lexicon 96 205Unique single token lexemes in lexicon 268 617Unique multi-token lexemes in lexicon 24 362Unique single token lexemes matched 62 172 23.1Unique multi-token lexemes matched 15 290 62.8Unique lexical entry IDs matched 59 199 61.5


Towards a semantic lexicon for biological language processing 65

in addition to the expected grammatical functionwords such as ‘and’, ‘the’, etc.

Inspection of the tokens which did not match anylexeme in the lexicon shows that a large proportionof the unmatched tokens are numeric tokens. Thisaccounts for an additional 7 969 674 of the tokens(5%), though they correspond to only 0.5% of alldistinct tokens. Other frequently unmatched tokentypes correspond to chemical formulas (e.g. ‘K+’),gene/protein names (e.g. ‘ERK2’), typographicalerrors (e.g. ‘negative’), protein sequences (e.g.‘CACAGAGGATGGGTTAACTCCCAG’), propernames, some tokens that only occur as part ofa multi-token term (e.g. ‘de’ in ‘de novo’ or‘vitro’ in ‘in vitro’), as well as many that seem toderive from errors in our tokenization or problemswith handling of UNICODE characters. Many ofthese could be handled by specific tokenization andtoken-tagging strategies, rather than requiring thatthe terms be enumerated in the lexicon.

The lexicon does contain a significant numberof terms which were not found in the corpus, sinceonly 62% of the lexical entries had a match (on atleast one of its lexical variants) in the corpus, but itdoes not necessarily follow that the remaining 38%of the lexicon is irrelevant for biological languageprocessing, as it could be that our corpus is notfully representative of the domain.

Conclusions

We have found sufficient overlap with our derivedsemantic lexicon to justify the use of the UMLSresources as a starting point for a lexicon forbiological language processing (BLP), on the basisof lexical overlap between a lexicon derived froma combination of the UMLS Metathesaurus andthe SPECIALIST lexicon, and the terms in adomain corpus. Over 77% of the tokens in thedomain corpus are found (through exact match)in the derived lexicon, though only 3% of theunique tokens in the corpus are covered. Thisshows that the terms captured in the derivedlexicon cover the most frequent, and probablythe most content-bearing, terms in the domaincorpus. Through augmentation with some domain-specific tokenization and named entity extraction,this lexicon can be extremely valuable for BLP.

There remain questions about the utility ofthe UMLS Semantic Network for BLP. Although

we have established a core lexicon for whichwe have the basic required lexical informa-tion — morphosyntactic and semantic informa-tion — we have not investigated any potentialshortcomings of the UMLS Semantic Network.There are 135 semantic types and 54 relation-ship types represented in the 2003AC version ofthe Semantic Network; the number of types isquite small given the complexity of the biomed-ical domain, and this begs the question of whetherit adequately characterizes the semantic distinc-tions needed for BLP. In contrast, the Gene Ontol-ogy resource (Ashburner et al., 2000) contains over16 000 concepts grouped hierarchically and there-fore in principle represents a much more fine-grained semantic breakdown of the domain. TheGENIA ontology under development (Ohta et al.,2002) is focused on cell signalling reactions inhumans and as such characterizes concepts spe-cific to those processes, again likely to be muchmore fine-grained than the broad UMLS ontology.The relative utility of different ontologies shouldbe investigated.

Furthermore, we have not explored whether thesemantic types associated with terms in our derivedlexicon are correct for the specific usages found inour corpus. This is an important issue to be inves-tigated, as a semantic lexicon is only useful to theextent that it captures the appropriate semantics.Finally, we have not assessed the impact of seman-tic ambiguity on our lexicon — how many of thelexical items are multiply ambiguous, how muchof this ambiguity is appropriate to the biologicallanguage, and how can we best deal with this ambi-guity? These are the questions that we must answerto fully assess the utility of a UMLS-based lexiconfor biological language processing.

AcknowledgementThis work was supported by the Department of Energyunder contract W-7405-ENG-36 to the University of Cali-fornia.

References

Ashburner M, Ball CA, Blake JA, et al. 2000. Gene Ontology:tool for the unification of biology. Nature Genet 25(1):25–29.

Friedman C, Liu H, Shagina L, Johnson S, Hripcsak G. 2001.Evaluating the UMLS as a source of lexical knowledge for med-ical language processing. Proc AMIA Symp, Washington, D.C.,189–193.


66 K. Verspoor

Hahn U, Romacker M, Schulz S. 1999. How knowledge drivesunderstanding — matching medical ontologies with the needsof medical language processing. Artif Intell Med 15: 25–51.

Johnson SB. 1999. A semantic lexicon for medical languageprocessing. J Am Med Inform Assoc 6(3): 205–218.

McCray AT, Bodenreider O, Malley JD, Browne AC. 2001.Evaluating UMLS strings for natural language processing. ProcAMIA Symp, Washington, D.C., 448–452.

Ohta T, Tateisi Y, Mima H, Tsujii J. 2002. GENIA Corpus: anannotated research abstract corpus in molecular biology domain.Proceedings of the Human Language Technology Conference(HLT 2002) San Diego, CA; 73–77.


towards a semantic lexicon for biological language processing

Documents