USING MACHINE LEARNING TO EXTRACT DRUG
AND GENE RELATIONSHIPS FROM TEXT
A DISSERTATION
SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Jeffrey T. Chang
September 2003
© Copyright by Jeffrey T. Chang 2004
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Russ B. Altman, Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Douglas L. Brutlag
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Serafim Batzoglou
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Hinrich Schutze
Approved for the University Committee on Graduate Studies.
Abstract
Interpatient variability in responses to drugs leads to millions of hospitaliza-
tions every year. To help prevent these failures, the discipline of pharmacoge-
nomics intends to characterize the genomic profiles that may lead to unde-
sirable drug responses. Pharmacogenomic scientists must integrate research
findings across the genomic, molecular, cellular, tissue, organ, and organismic
levels. To address this challenge, I have developed methods to extract infor-
mation relevant to pharmacogenomics from the literature. These methods can
serve as the foundation for powerful tools that help scientists synthesize infor-
mation and generate new biological hypotheses.
Specifically, this thesis covers novel applications and extensions of super-
vised machine learning algorithms to extract relationships between genes and
drugs automatically. This task comprises several problems that must be solved
separately. Thus, I have also developed algorithms to identify and score gene
names and their abbreviations from text. I have framed these tasks as classi-
fication problems, where the computer must integrate diverse evidence to pro-
duce a decision. I identified features that captured information relevant to the
problem and then encoded them into representations suitable for classification.
To extract a comprehensive list of gene-drug relationships, an algorithm
must find gene and protein names from text. Using such an algorithm, the
computer could identify newly coined gene names. My approach to this prob-
lem achieved 83% recall at 82% precision. Since many of these names were ab-
breviations, e.g. TPMT for Thiopurine Methyltransferase, I developed an abbre-
viation identification algorithm that found these concurrences with 84% recall
at 81% precision. The final algorithm classified relationships between genes
and drugs into five categories with 74% accuracy.
Finally, I have made these algorithms and other results available on the
internet at http://bionlp.stanford.edu/. These are available both as
human-accessible web pages and computer-accessible web services.
Acknowledgements
Paraphrasing someone else, it takes a village to produce a thesis. Thomas
Kuhn acknowledged that science is a social endeavour when he wrote that
“Scientific knowledge, like language, is intrinsically the common property of
a group or else nothing at all.” It is true that during my own education, I have
incurred many debts to those who have generously shared their knowledge and
wisdom, and also to those who have helped and encouraged me in other ways.
However, I will approach my acknowledgements differently. I hope to thank
people, rather than names, by illustrating the context in which they have im-
pacted my life and work.
My interests in science were undoubtedly inherited from my dad, who is
also a scientist. Ever since I can remember, Dad has kept books on his nightstand to be read before bed. He reads voraciously and consumes books covering
a broad span of subjects. When I began graduate school, his interests turned
toward the life sciences so that he could learn about the field I was pursuing.
Dad often called me to tell me about recent books he read or biological issues
he had been thinking about. In a sense, he has experienced my graduate edu-
cation with me.
Mom, on the other hand, taught me to do my best work ever since grade
school. Then, I often lost points on homework assignments when I would ne-
glect to copy the problems or circle the answers. To me, they were just details.
To my mother, those were the easy points. The real challenge was in the actual
problem, so why lose the easy points? After years of gentle chiding, I have fi-
nally begun to learn those lessons. Plus, the problems have become strikingly
more difficult.
My sister Katherine is not pursuing science and has instead been making
a living in New York. It is nice to hear from her once in a while to learn how
things are going in the real world.
Ten years ago, almost to the day, my parents sent me off to college. As an
undergraduate at Stanford, although I had officially studied biology, I also had
an interest in computers. Thus, I took on a job as a section leader (like an un-
dergraduate TA) for computer science classes. While teaching, I met Mehran
Sahami. Although he was a graduate student then, Mehran had been an un-
dergraduate at Stanford as well. Naturally, undergraduates regarded him with
a sense of awe due to his long tenure here. I think he still teaches classes at
Stanford.
In his time here, Mehran accumulated knowledge about everything at Stan-
ford, which would change my life. At the time, I had been frustrated because
I was unsure how to combine my interests in biology and computers. Noth-
ing I had studied in my courses seemed quite like what I wanted to do. As I
shared my thoughts with Mehran (on the steps of the Stanford Bookstore on a
sunny day), he directed me to Russ Altman. It was just like that scene from
Star Wars — “You seek Yoda.” The effect on my career and life was just as
profound, except that Mehran wasn’t Yoda, and neither was Russ, and I had no
midi-chlorians.
After talking to Mehran, I went home that day and looked Russ up in the
phone book. Remember that the year was 1995. Students called professors on
the phone, and initiating contact by email was practically unheard of. Most
students did not regularly use it.1 I was in an unusual demographic because I
checked my email almost every day.
Once I found the phone number, I called Russ and explained what I wanted
to do. The first question he asked was whether I had seen his web page. Web
page? I didn’t know professors had web pages.2 Then, he asked me to check his
web page and email him to set up a meeting. I knew immediately that I would
enjoy working in his lab.
Soon after, I met with Russ to talk about possible research projects. We
met in his office, which back then was a tiny room in the MSOB. As soon as
he started talking, however, his energy, excitement, and exuberance in his
research were palpable and permeated the entire room. Throughout the years,
I would enjoy meetings as ideas would burst forth from Russ like a stack dump.
While my formal classes made science rote, Russ made science fun.
During my first meeting with Russ, I chose a project analyzing protein
structures. At that time, Russ had a graduate student, Liping Wei, work-
ing in that area and introduced us. Liping was developing a creative and novel
approach to analyze protein sites using Bayesian statistics. She introduced me
to the power of statistical methods and machine learning that would form the
basis of this thesis.
1Back then, email addresses to us were just usernames and the @leland.stanford.edu was just understood. Very few students at Stanford had any other email account. Hotmail would not introduce the idea of web-based email until it launched in July 1996.
2Netscape 1.0 had been released less than a year earlier.
Because of my positive undergrad and subsequent experiences, I wanted to
continue studying informatics and returned to Stanford. My plan was to pre-
dict protein function by building and characterizing structural “motifs.” When
I started the graduate program, I undertook a series of rotations through sev-
eral labs. I did my first rotation with an expert in the area, Doug Brutlag.
As a graduate student, Doug studied DNA replication with Arthur Korn-
berg. It was an enviable pedigree and a fantastic launching pad for a career in
“classical” biochemistry. However, at some point, he became interested in com-
putational methods and shifted the focus of his lab so that he could investigate
them. At that time, few people were working in the area, and it turned out to
be an incredibly prescient move. It was that audacity that gave me the drive
to pursue a research topic that few people were investigating.
Next, I rotated through the lab of Michael Levitt, which was a shocking
experience. Michael held weekly lab meetings where, during each meeting, ev-
eryone would talk in turn about what research they did in the previous week.
Finally, as the meeting was winding to a close, Michael picked up a mop and
bucket and scrubbed the room clean for the next group! No, that’s a lie, but
what actually happened was equally unbelievable. He talked about the re-
search he did in the last week. Michael ran his own lab, and his own depart-
ment, but he still had time to write and debug his own C code. Incredible.
Nevertheless, I eventually rejoined Russ’ lab. While investigating meth-
ods to find protein structural motifs, I realized that one of the tough problems
that few people were looking at was the problem of describing what the motif
does. At the time, and still mostly true today, protein function is primarily
documented in the unstructured literature. Thus, to describe the function of a
structural motif, the computer must be able to “read” text and pull out the in-
formation about the function. Heading down this path, I eventually developed
an algorithm to predict protein functions more accurately by having the com-
puter “read” journal papers (Chang et al., 2001). Thus, I became interested in
the problem of using information in literature to analyze biological data, which
is the topic of this dissertation.
Fortunately, Russ was very supportive of my new research direction. Al-
though the subject complemented some of his earlier work on knowledge mod-
elling, at that time, he had no expertise, per se, in text mining. The project
might have died, except by a stroke of luck, in 1999, an expert in statistical
text mining algorithms, Chris Manning, began his appointment in the Department of Computer Science. He provided much of my initial introduction to this new area. Chris ultimately chaired my defense.
It was through Chris that I met someone who would advise me on analyzing
biological literature. Hinrich Schutze came to the Helix Group (as Russ likes
to call his lab) to do research in this area. Hinrich was particularly enthusiastic
about advising students, and was always very generous with his time. Because
of his close involvement with my work, he came to be a co-author on many of
my papers. It was natural for Hinrich to be on my thesis committee.
For the final member of my thesis committee, I sought out Serafim Bat-
zoglou. He was an expert on string algorithms due to his research in genome
analysis and assembly. I met Serafim one day when I dropped by his office.
As I started to tell him about my work, he “got it” instantly and immediately
started to give me new perspectives on my algorithms. Not only was he ex-
tremely sharp, Serafim was also very approachable. We often had informal
meetings when I occasionally bumped into him at the gym, where he would
bench press an impressive amount of weight. Adding Serafim to my committee
guaranteed that somebody’s name would be spelled wrong in the acknowledge-
ments slide of my defense.
Of course, all of this research occurred in the environment, or as Russ might
say gleefully, “milieu,” of the Biomedical Informatics (SBMI) training program.3
It is impossible to talk about this program without mentioning Darlene Vian,
who has been the lifeblood of the program since its beginning. Darlene kept me
moving through the program smoothly, taking care of many important things
behind the scenes (like my stipend checks). As she reduces her involvement
in the program for a much-deserved retirement, she will be sorely missed.
For now, she is continuing the tradition of hosting TGiTh parties at Tennyson
Manor, keeping the wine flowing.
Scientifically, however, SBMI was rich because of the diversity of its students and research scientists. Perhaps due to its interdisciplinary nature,
someone was an expert for nearly every problem. Instead of spending hours
trying to hunt down an answer to a problem, it was very easy to ask someone
who already knew it. For example, Daniel Rubin provided me with expertise
on medicine and pharmacology, Michael Hewett and Mike Liang knew everything about computers, and Soumya Raychaudhuri was an expert in math.
3Technically, I joined the Medical Information Sciences training program. To reflect the current state of the field, the name was updated to the Biomedical Informatics Training program. Therefore, I will invent an acronym here and refer to it as SBMI.
Every dissertation produced at SBMI is the product of many students’ collec-
tive efforts.
Outside of work though, there is some truth to Soumya's remark that SBMI
is a social club. I'm not sure whether he was making an observation
or a wish. Soumya himself had the habit of wandering around afternoons and
chatting with people who were trying to work. However, it is accurate that
SBMI students often socialize. We discovered that much of the research equip-
ment also had recreational uses. The projector for talks doubled as a big screen
movie theater, and the fast network supported games of Age of Empires after
hours.
Outside of lab, Mike Hewett noted that SBMI people tend to join the same
activities. It’s been true for me – at the time, three of us comprised the majority
of a sand volleyball class of about five students. As an avid runner, Mike also
formed the nucleus of a running group that regularly tortured me along with
George Scott and Elmer Bernstam. On off days, Elmer and I would often
play tennis. When Elmer finished the program, I started in a karate class
along with Serge Saxonov and Iwei Yeh. Even with these activities, Brian
Naughton and I still managed to keep up with way too many TV shows. I was
fortunate to have made so many friends in the program. They have also made
science fun.
Finally, while studying here, I met my girlfriend Zhen Lin. I have claimed
throughout these acknowledgements that science is a social endeavour. That’s
not totally true. Usually, pursuing science is isolating and lonely. I spend much
time by myself learning new concepts, pondering ideas, reading papers, and
writing code. At certain graduate school milestones, such as the orals, proposal
defense, and final thesis, the attention science demands had been a source of
tension in our relationship. Zhen has always kept me grounded and reminded
me what was most important. Stay tuned...
Enough stories. It is time to finish this thesis so that I can send it home,
where my father is eagerly waiting for it. I’m sure it will end up on his night-
stand.
Table of Contents
Abstract v
Acknowledgements vii
1 Introduction 1
1.1 Uses of Structured Data in Biology . . . 3
1.2 Unstructured Knowledge in Literature . . . 4
1.3 Discovering Knowledge from Text . . . 5
1.4 Extracting Pharmacogenomics Information . . . 7
1.5 Overview of Thesis . . . 10
2 Statistical Text Analysis 11
2.1 Modelling a Document in Vector Space . . . 11
2.2 Supervised Machine Learning . . . 12
2.2.1 k Nearest Neighbors . . . 14
2.2.2 Naïve Bayes . . . 15
2.2.3 Maximum Entropy . . . 17
2.2.4 Logistic Regression . . . 19
2.2.5 Support Vector Machines . . . 20
2.2.6 Feature Selection . . . 21
2.3 Categorizing Words . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Finding Abbreviations 28
3.1 An Algorithm to Identify Abbreviations . . . 31
3.1.1 Finding Abbreviation Candidates . . . 31
3.1.2 Aligning Abbreviations with their Prefixes . . . 33
3.1.3 Computing Features from Alignments . . . 33
3.1.4 Scoring Alignments with Logistic Regression . . . 34
3.1.5 Implementing the Algorithm . . . 35
3.2 Performance of the Abbreviation Identification Algorithm . . . 36
3.3 Clarifying and Reconciling Notions of Abbreviations . . . 38
3.3.1 Reannotating the Medstract Gold Standard . . . 38
3.3.2 Comparing the Medstract and Expert Gold Standards . . . 39
3.3.3 Defining Abbreviations . . . 40
3.4 Compiling the Abbreviations in MEDLINE . . . . . . . . . . . . . 42
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Identifying Gene Names 51
4.1 An Algorithm to Identify Gene and Protein Names . . . 57
4.1.1 Tokenizing the Sentences . . . 57
4.1.2 Filtering Recognized Words . . . 59
4.1.3 Scoring Words . . . 60
4.1.4 Extending to Noun Phrases . . . 68
4.1.5 Matching Abbreviations . . . 68
4.1.6 Implementation . . . 69
4.2 Performance of GAPSCORE . . . 69
4.3 Conclusions . . . 73
5 Extracting Gene-Drug Relationships 79
5.1 Information Extraction Systems in the NLP Community . . . 82
5.2 Identifying Biological Relationships . . . 84
5.2.1 Co-occurrence . . . 85
5.2.2 Keywords . . . 87
5.2.3 Machine Learning . . . 88
5.2.4 Natural Language Processing . . . 88
5.3 NLP Systems in Biomedicine . . . 89
5.4 Identifying Related Genes and Drugs . . . 92
5.5 Classifying Gene-Drug Relationships . . . 97
5.6 Predicting Gene and Drug Relationships . . . 102
5.7 Conclusions . . . 106
6 Distributing the Algorithms 108
6.1 Clustering Abbreviations to Aid Browsing . . . 110
6.2 Making Servers Computer-Friendly . . . 113
7 Conclusions 116
7.1 Summary of Contributions . . . 118
7.2 Future Work . . . 121
7.2.1 Disambiguating Gene Names . . . 122
7.2.2 Formal Descriptions of Data . . . 123
7.2.3 Annotated Text Data . . . 125
7.2.4 Full Text Articles . . . 125
7.3 Final Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A Training Maximum Entropy 127
B Sentence Boundary Disambiguation 131
C Gene Drug Relationships 133
D Classifying PharmGKB Relationships 138
E Gene Name Normalization 144
Bibliography 148
List of Tables
2.1 Overview of Machine Learning Algorithms . . . 22
2.2 Parameters for Machine Learning Algorithms . . . 22
2.3 Performance of Machine Learning Algorithms . . . 23
3.1 Features Used to Score Abbreviations . . . 34
3.2 Types of Abbreviations Missed . . . 37
4.1 Overview of Gene/Protein Name Algorithms . . . 52
4.2 MeSH Terms That End with -in . . . 54
4.3 Gene Name Appearance Features . . . 61
4.4 Morphologic Variations in Gene Names . . . 63
4.5 Gene Name Signal Words . . . 66
4.6 Parameters for Support Vector Machines . . . 71
4.7 Comparing Algorithms to Classify Gene Names . . . 72
4.8 Removing Modules Reduces GAPSCORE Performance . . . 72
5.1 Relationships Between Ten Genes and Drugs . . . 96
5.2 Pharmacogenomic Relationships in PharmGKB . . . 98
5.3 Informative Features for Gene-Drug Relationships . . . 101
6.1 Clusters of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . 111
B.1 Sentence Boundary Ambiguities . . . . . . . . . . . . . . . . . . . 132
C.1 Relationships Between Genes and Drugs . . . . . . . . . . . . . . 137
D.1 Gene-Drug Classification Results . . . . . . . . . . . . . . . . . . . 143
E.1 Classes of Words in Gene Names . . . . . . . . . . . . . . . . . . . 147
List of Figures
1.1 Growth of MEDLINE Citations . . . 5
1.2 Architecture for Finding Gene-Drug Relationships . . . 8
1.3 BioNLP Web Server Screenshot . . . 9
2.1 Vector Space Model . . . 12
2.2 Zipf's Law . . . 24
3.1 Abbreviation System Architecture . . . 32
3.2 Abbreviations Predicted in Medstract Gold Standard . . . 37
3.3 Growth of Abbreviations . . . 44
3.4 Scores of Abbreviations Found in China Medical Tribune . . . 45
3.5 Abbreviation Server Screenshot . . . 49
4.1 Recognizing Gene Names . . . 58
4.2 Performance of GAPSCORE . . . 74
5.1 Sample Relationships between Drugs and Genes . . . 85
5.2 Frequency of Gene-Drug Co-occurrences . . . 94
5.3 Scoring Gene-Drug Relationships . . . 100
5.4 Distribution of Relationship Scores . . . 103
5.5 Errors in Gene-Drug Relationships . . . 104
5.6 Common Co-occurrences Classified More Accurately . . . 105
B.1 Heuristics for Sentence Boundary Disambiguation . . . . . . . . . 132
CHAPTER 1
Introduction
According to McLeod and Evans, “Inter-patient variability in response to drug
therapy is the rule, not the exception, for almost all patients” (McLeod and
Evans, 2001). Even correctly prescribed drugs can lead to unexpected effects
such as adverse drug reactions. In 1994, there were 2,216,000 serious adverse
drug reactions, and 106,000 resulted in death, after excluding cases of inappro-
priate administration and use (Lazarou et al., 1998).
Adverse drug reactions occur for many reasons, such as poor patient
compliance and environmental factors. Since the 1950s, scientists have recog-
nized that genetic variations influence drug response, which suggests that ge-
netic tests may be able to predict and prevent adverse drug reactions (Meyer,
2000). Today, advances in sequencing technologies and the availability of un-
precedented quantities of genomic data have empowered pharmacogenomics
research.
Pharmacogenomics studies how variations in the genome, genetic polymor-
phisms, can cause people to respond differently to drugs. One well-studied ex-
ample is the thiopurine methyltransferase (TPMT) gene (McLeod and Evans,
2001). This enzyme inactivates the thiopurine drugs used to treat childhood
leukemia, rheumatoid arthritis, and dermatological disorders. However, 10%
of the population inherit a variant of TPMT that cannot metabolize those drugs
as efficiently. In those people, thiopurine accumulates to toxic levels and in-
creases the risk of secondary malignancies. Because of these types of problems,
scientists are investigating methods to correlate polymorphisms with drug responses, and to apply that knowledge in medical practice (Roses, 2000, 2001).
However, pharmacogenomics investigations are hindered by the vast quan-
tities of information available, produced by diverse disciplines over several
decades of research. Many genes influence the efficacy of drugs, and their
mechanism of action is often unknown (Evans and Relling, 1999). Most stud-
ies have concentrated on either finding variations in drug responses or finding
polymorphisms in genes; few have linked the two (Phillips et al., 2001). To un-
derstand the relationships between drugs and biological systems, researchers
must integrate knowledge covering many fields and draw inferences among
them (for example, to link the effects of drugs with possible molecular path-
ways). They must synthesize research findings across genomic, molecular,
cellular, tissue, organ, and organismic levels. Having tools to organize vast
amounts of diverse information will help scientists quickly mine the literature
and formulate new research hypotheses.
To stimulate the production of pharmacogenomics data, the National Institutes of Health is funding the Pharmacogenetics Research Network (http://www.nigms.nih.gov/pharmacogenetics/) to collect information related
to genotypes and phenotypes (Klein et al., 2001). To manage the data, it is
also funding the Pharmacogenomics Knowledge Base (PharmGKB) at Stan-
ford. The PharmGKB models and stores pharmacogenomics data for the research community. Since the data is available in a computationally accessible format, the data sets may also become the focus of intense bioinformatic research (Hewett et al., 2002).
1.1 Uses of Structured Data in Biology
Knowledge-based systems such as PharmGKB organize data according to on-
tologies. An ontology is a detailed representation of information that unam-
biguously defines 1) the types of data stored, and 2) the relevant relation-
ships between them. For example, PharmGKB contains entities for Gene and
Polymorphism. The fact that a Gene can have zero or more Polymorphisms
is indicated by a HAS-A relationship.
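As a concrete illustration, a HAS-A relationship like this could be modelled as follows. This is a minimal sketch, not the actual PharmGKB schema: the Gene and Polymorphism entity names come from the text above, but the fields shown here are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Polymorphism:
    # Illustrative field only; the real PharmGKB entity is far richer.
    name: str

@dataclass
class Gene:
    symbol: str
    # HAS-A: a Gene owns zero or more Polymorphisms.
    polymorphisms: List[Polymorphism] = field(default_factory=list)

tpmt = Gene(symbol="TPMT")
tpmt.polymorphisms.append(Polymorphism(name="TPMT*3A"))
```

Because the ontology fixes the entity types and their relationships up front, a program can traverse them unambiguously, which is precisely what makes knowledge bases computationally accessible.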
The structure in knowledge bases reduces ambiguity, ensures reliable transfer of data to other representations, and facilitates computational analysis.
Presently, knowledge bases have successfully improved the storage and re-
trieval of information. However, in the long term, knowledge bases may be
used to generate novel research hypotheses. Therefore, scientists have begun
efforts to model biological knowledge in knowledge bases (Karp et al., 1996;
Chen et al., 1997; Baker et al., 1998; Humphreys et al., 1998; Schulze-Kremer,
1998; Ashburner et al., 2000).
Unfortunately, developing knowledge bases and adding data are labor in-
tensive. Currently, experts in a problem domain (e.g. pharmacogenomics) de-
velop knowledge bases manually. They must specify the domain and insert
knowledge according to the ontology. To simplify this task, many researchers
are developing methods to add biological knowledge to knowledge bases auto-
matically.
1.2 Unstructured Knowledge in Literature
One rich source of biological knowledge is the published literature. The MED-
LINE database catalogs nearly all journals related to biology and medicine
(Hutchinson, 1998). It is available over the web using the PubMed interface.
MEDLINE contains 12 million citations and grows by over 400,000 per year1
(see Figure 1.1). 56% of those citations contain abstracts. In addition, journal
articles are increasingly becoming available online. Electronic publishers such
as HighWire Press and PubMed Central are starting to permit access to full
text articles (Roberts et al., 2001).
The knowledge in biomedical literature is undeniably valuable. However,
the vast amount and unstructured nature of the literature proffer challenges
for scientists. Many endeavor to develop computational tools to simplify ac-
cess, understanding, and application of that knowledge. However, the knowl-
edge residing in free text form is inaccessible for computation. Fortunately, the
field of natural language processing (NLP) has been investigating automated
techniques to understand and interpret unstructured literature (Manning and
Schutze, 1999).
A discipline within NLP, information extraction (IE), studies computational
representations and algorithms that can identify relevant information in text
1http://www.nlm.nih.gov/pubs/factsheets/medline.html
[Figure 1.1 chart: MEDLINE Abstracts, 1975-2000; y-axis: Millions of Abstracts]
Figure 1.1: Growth of MEDLINE Citations. MEDLINE contains 12 million citations and grows by 400,000 a year. Over half of the citations contain abstracts.
and map it into predefined structured forms. Ultimately, IE or other NLP tech-
niques will be able to read the literature and populate instances in a knowledge
base.
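To make the idea of a "predefined structured form" concrete, here is a toy sketch. It is not a method from this thesis: the pattern, sentence, and relation vocabulary are invented, and real IE systems use far more robust machinery than a single regular expression.

```python
import re

# Toy template: match sentences of the form "<gene> metabolizes|inactivates <drug>"
# and map them into a fixed (gene, relation, drug) record.
TEMPLATE = re.compile(r"(?P<gene>\w+) (?P<relation>metabolizes|inactivates) (?P<drug>\w+)")

def extract(sentence):
    """Return a structured record for the sentence, or None if no match."""
    m = TEMPLATE.search(sentence)
    return m.groupdict() if m else None

record = extract("TPMT inactivates thiopurine")
# record == {'gene': 'TPMT', 'relation': 'inactivates', 'drug': 'thiopurine'}
```

Once text is reduced to records like this, each record can be inserted as an instance in a knowledge base, which is the populating step described above.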
1.3 Discovering Knowledge from Text
In addition to helping scientists, analyses of the literature may one day be
able to leverage the vast knowledge in MEDLINE to produce novel hypotheses.
Swanson noticed that because of increasing specialization amidst an unman-
ageable amount of literature, researchers are often unaware of relevant infor-
mation or solutions to their problem, even if the knowledge is widely known
in another field (Smalheiser and Swanson, 1998). He theorized that comput-
ers may be able to identify such disconnects and combine knowledge to solve
problems across disciplines.
Swanson found possible relationships between two concepts by searching
for them in MEDLINE and finding words and phrases in the intersection of
the hits. Using this method, he could detect relationships not explicitly stated
in any article. Focusing on treating diseases, Swanson discovered a new treat-
ment for Raynaud’s disease by noticing that it has symptoms that are known
to be alleviated by fish oil.
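The intersection step of Swanson's method can be sketched in a few lines. This is a hypothetical simplification: the documents and terms below are invented to echo the Raynaud's/fish oil example, and his actual procedure involved expert-guided querying of MEDLINE rather than pre-tokenized term sets.

```python
def linking_terms(docs_a, docs_c):
    """Terms appearing both in documents about concept A and in documents
    about concept C; each document is represented as a set of terms."""
    terms_a = set().union(*docs_a)
    terms_c = set().union(*docs_c)
    return terms_a & terms_c

# Invented toy documents echoing the Raynaud's disease / fish oil example:
raynaud_docs = [{"raynaud", "vasoconstriction", "blood viscosity"},
                {"raynaud", "platelet aggregation"}]
fish_oil_docs = [{"fish oil", "blood viscosity"},
                 {"fish oil", "platelet aggregation"}]
bridges = linking_terms(raynaud_docs, fish_oil_docs)
# bridges == {"blood viscosity", "platelet aggregation"}
```

The bridging terms are exactly the relationships "not explicitly stated in any article": no single document mentions both Raynaud's disease and fish oil, yet the shared terms suggest a connection worth investigating.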
Similarly, some have argued that experimental biology also contains such
“undiscovered public knowledge” across subdisciplines. Blagosklonny and
Pardee have claimed that the information necessary to understand feedback
control of p53 function was available in MEDLINE in 1990, although the
mechanism was not elucidated for another 10 years (Blagosklonny and Pardee,
2002). Believing that similar insights are missed, they proposed building
systems to review the contents of MEDLINE automatically in search of other
similarly hidden discoveries.
However, systems that automatically scan text for novel biological hypothe-
ses do not yet exist. Swanson’s method required experts to generate the queries
and interpret results. Nevertheless, algorithms for identifying concepts and
relationships, and for performing inferences on them, are active research areas (Chang
and Altman, 2002). Thus, this thesis investigates methods to extract struc-
tured knowledge from unstructured literature.
1.4 Extracting Pharmacogenomics Information
This thesis describes novel methods that support efforts to create a database
of relationships between genes or proteins and drugs from literature in MED-
LINE. Such a database will be useful for researchers studying pharmacoge-
nomics, including the scientists in the Pharmacogenomics Research Network.
In the future, linking this data with other biological data, such as protein struc-
tures or molecular pathways, or for humans, single nucleotide polymorphisms
(SNPs) or clinical data, will lead to deeper insights into biological systems.
Finding relationships between genes and drugs from the literature requires
algorithms to recognize drug names, gene names, and the relationships be-
tween them. Drug names are relatively easy to recognize. There are a lim-
ited number of drugs, their development time is lengthy, and the nomenclature
is tightly controlled by a few drug developers. Conversely, gene names are
dynamic: new ones are coined weekly by many different scientists, and many
follow no standard. Because there is no standardization,
orthologs may have different names, and genes with the same name may not be
homologously related (Stein, 2003). Exacerbating the problem, many gene and
protein names are abbreviated, effectively increasing the number of names.
Identifying gene names from literature is an ongoing research problem.
Similarly, finding relationships between drugs and genes or proteins is dif-
ficult because of the many different types of relationships possible (e.g. protein
[Figure 1.2 diagram: Documents feed into a Drug Name Identifier, a Gene Name Identifier, and an Abbreviation Finder, whose outputs feed a Relationship Scorer that emits Gene-Drug Relationships.]

Figure 1.2: Architecture for Finding Gene-Drug Relationships. Finding relationships between drugs and genes requires components that can recognize the drug and gene names from the literature, and a component that can score possible relationships between drugs and genes.
metabolizes drug, drug is cofactor of gene, genetic variation causes physiologi-
cal change in drug effect, etc.) and the various ways they are expressed in text.
Fortunately, the problems of finding abbreviations, gene names, and gene/drug
relationships share similarities. They can all be framed as classification prob-
lems, where the computer integrates multiple types of information to resolve
an ambiguous decision (e.g. whether a word is a gene name).
Thus, I have approached the problems of finding abbreviations, identifying
gene names, and collecting gene/drug relationships as machine learning tasks,
where judicious choices of informative features help the classifier produce con-
fidence scores. Because these problems are difficult and ambiguous, algorithms
that produce confidence scores help users (and computer programs) prioritize
results and choose the quality of information desired. My methods can identify
abbreviations in text with 84% recall and 81% precision, find gene and protein
names with 83% recall and 82% precision, and classify relationships between
Figure 1.3: BioNLP Web Server Screenshot. The BioNLP web server at http://bionlp.stanford.edu/ provides biological NLP services to the community.
genes and drugs into five categories with 74% accuracy. I have built a BioNLP
web server to provide these services to the community (Figure 1.3). It is
available at http://bionlp.stanford.edu/.
1.5 Overview of Thesis
Chapter 1 introduces the motivations and scientific framework leading to this
work.
Chapter 2 reviews common algorithms in machine learning and statistical
natural language processing that are relevant to this work.
Chapter 3 presents an algorithm for finding abbreviations in text.
Chapter 4 presents an algorithm for identifying the gene and protein names
in text.
Chapter 5 presents methods for finding genes and drugs with relationships,
and then identifying the type of relationship.
Chapter 6 describes the development of a server to present the results of the
algorithms, and to make them computationally accessible with web ser-
vices.
Chapter 7 concludes the thesis with a summary of the work, as well as a
discussion on possible future work.
CHAPTER 2

Statistical Text Analysis
Increased computer power has permitted the analysis of large text data sets
(called corpora, sing. corpus) with statistical methods. Using statistical meth-
ods, computers can find patterns and quantify latent trends in the data. This
chapter provides a background on these methods, introducing the data struc-
tures and algorithms relevant to the developments in this thesis.
2.1 Modelling a Document in Vector Space
Text documents are sequences of words, spaces, and punctuation and must
be reduced to a form amenable to computational analysis. One simple and
effective representation of text is called the vector space model (VSM). Salton first
used VSM in an application to retrieve documents from a database, similar to
functionality now provided by PubMed or Google (Salton, 1968). In VSM, each
document is modelled as a vector of word counts.
$\vec{\text{Document}} = [w_1\ w_2\ \ldots\ w_n]$
“Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.” ⇒ acid 2, amino 2, analysis 1, comparison 1, control 1, environments 3, ..., our 1, protein 0

Figure 2.1: Vector Space Model. Documents can be represented as vectors of word counts. Each dimension of the vector contains the number of times a word appears in the document. If a word does not occur in the document, the value for the corresponding dimension of the vector is 0.
where $w_i$ is the number of times word $i$ appears in the document and $n$ is
the number of unique words in the corpus. Since a typical document contains
only a small subset of all the possible words, these vectors are almost always
sparse. Most of the values in the vector are zero.
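As a concrete illustration, the mapping from text to a count vector can be sketched in a few lines of Python. The vocabulary below is a hypothetical excerpt chosen to match the Figure 2.1 example; a real system would use every unique word in the corpus:

```python
from collections import Counter
import re

def to_vector(text, vocabulary):
    """Count how many times each vocabulary word appears in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

vocab = ["acid", "amino", "analysis", "comparison", "control",
         "environments", "our", "protein"]
doc = ("Our analysis includes comparison of amino acid environments "
       "with random control environments as well as with each of the "
       "other amino acid environments.")
print(to_vector(doc, vocab))  # [2, 2, 1, 1, 1, 3, 1, 0]
```

Note the final 0 for "protein": words absent from the document get zero counts, which is why these vectors are sparse.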
The VSM document representation is simple and discards all information
derived from the ordering of the words. No information about context, phrases,
modifiers, syntax or other structure is retained, leading some to call it a bag-of-
words model. Although the lack of structural information seems like an exigent
deficiency, in practice, VSM performs acceptably well for many applications.
2.2 Supervised Machine Learning
One advantage of the vector space model is that many statistical algorithms
require data in vector form. One such class of algorithms is supervised machine
learning. These algorithms discover patterns in data vectors that can
help distinguish among subsets of the data. For example, supervised machine
learning algorithms have been applied to distinguish spam from other email.
Generally, supervised machine learning consists of two steps. In the first step,
training, the algorithm constructs a model of the data using a training set of
data vectors and assignments of the vectors to classes. In the spam classification
example, the classes would be either spam or not-spam. In the second
step, classifying, the algorithm assigns new data vectors to classes based on
the model created. There are many supervised machine learning algorithms;
they differ based on their models and assumptions of the distributions of the
underlying data.
More rigorously, a supervised machine learning algorithm contains:
a set of $K$ classes $C_{1 \ldots K}$ (2.1)
a set of $M$ training data vectors $\vec{X}_{1 \ldots M}$ (2.2)
$N$ class assignments $Y_{1 \ldots N}$ (2.3)
where $Y_i \in C$ (2.4)
where each $\vec{X}_i$ is a vector of dimension $D$, and $C$ enumerates the possible
classes. Given a new data point $\vec{x}$, the classifier produces the most likely class in $C$.
The training data is a set of vectors, where each dimension describes a feature
of the data. For text classification, C would be the different categories of text
to classify, Xi is a (VSM) vector of word counts from a document, and Yi is the
category for document Xi. Therefore, D is the size of the vocabulary for the
training set.
For biology, one application of supervised machine learning is to find articles
in MEDLINE that contain information about regulatory networks in S. cerevisiae
(Usuzaka et al., 1998). Here, the authors created a training set of 758 articles
that they manually assigned into two classes, WITH-REGULATORY-INFORMATION and
WITHOUT-REGULATORY-INFORMATION. Then, they used the vector space model and
classified 35,000 yeast-related MEDLINE abstracts to find the ones containing
regulatory information. Their method achieved 90% recall.
2.2.1 k Nearest Neighbors
One simple machine learning algorithm is called k Nearest Neighbors (kNN)
(Manning and Schutze, 1999). This straightforward algorithm contains few
assumptions about the distribution of the data and often performs among the
top text categorization algorithms (Yang, 1999). kNN classifies a new data
vector based on its distance to the k most similar data vectors (the nearest
neighbors) in the training set. It assigns a class to the vector based on the
classes of the nearest neighbors.
$\text{class} = Y_i, \quad i = \operatorname{argmin}_{j \in 1 \ldots M} \operatorname{dist}(\vec{x}, \vec{X}_j)$ (2.5)
One popular distance metric, out of many reasonable ones, is Euclidean
distance. The Euclidean distance between two vectors ~x and ~y is:
$\operatorname{dist}(\vec{x}, \vec{y}) = \sqrt{\sum_i^D (x_i - y_i)^2}$ (2.6)
Another metric, the cosine distance, is also commonly used for comparing
documents. This is:

$\operatorname{dist}(\vec{x}, \vec{y}) = \frac{\sum_i^D x_i y_i}{\sqrt{\sum_i^D x_i^2}\,\sqrt{\sum_i^D y_i^2}}$ (2.7)
If the vectors x and y were both normalized to lengths of 1, both the Eu-
clidean and cosine distance result in the same relative ordering of data vectors
(Manning and Schutze, 1999).
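A minimal kNN sketch in Python, with both distance measures. Note that the cosine measure of Equation 2.7 is a similarity (larger means more alike), so the sketch uses one minus it as a distance; the training data format here is an illustrative choice:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation 2.6."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_distance(x, y):
    """One minus the cosine measure of Equation 2.7, so that smaller
    values mean more similar vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / norms

def knn_classify(x, training, k=3, dist=euclidean):
    """training is a list of (vector, class) pairs; the new vector x is
    assigned the majority class among its k nearest neighbors."""
    neighbors = sorted(training, key=lambda pair: dist(x, pair[0]))[:k]
    votes = Counter(cls for _, cls in neighbors)
    return votes.most_common(1)[0][0]
```

Classification requires a pass over all M training vectors per query, which is the O(M) cost noted below.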
kNN is robust because it does not impose a generalized model (e.g. a nor-
mality constraint) on the data. However, relative to other methods, the clas-
sification decision is slow. A straightforward implementation evaluates every
vector in the training data and requires O(M) time.
2.2.2 Naïve Bayes
Compared to kNN, naïve Bayes classifies faster, but imposes a stricter model
on the data (Manning and Schutze, 1999). However, it is also simple, effective,
and easy to implement. Naïve Bayes computes a probabilistic model for each
dimension of the data based on the conditional probability of observing the data
given a specific class. It assigns a class by choosing the one with the highest
probability. To calculate that:
$P(C = c \mid \vec{x}) = \frac{P(\vec{x} \mid C = c)\,P(C = c)}{P(\vec{x})}$ (2.8)

$= \frac{\prod_i^D P(x_i \mid C = c)\,P(C = c)}{P(\vec{x})}$ (2.9)

$\propto \prod_i^D P(x_i \mid C = c)\,P(C = c)$ (2.10)

$\propto \log P(C = c) + \sum_i^D \log P(x_i \mid C = c)$ (2.11)
Since P (~x) in Equation 2.9 does not affect the classification decision, it is
often dropped. Also, computations are typically performed in log space to avoid
numerical underflow problems.
P (xi|C = c) is estimated from the training data. The maximum likelihood
estimate (MLE) is computed by dividing the number of times a feature occurs
in a class by the total number of times it occurs in all classes. Note that if
a feature never occurs for a class in the training set, the MLE probability in
that model is 0. If that feature were observed in a subsequent data vector,
that vector would automatically be disqualified from that class, regardless of
the values of its other features. This can cause problems for text classification;
the vectors are sparse, and minor variations in word choice can lead to a word
(feature) not occurring in a dataset by chance. To avoid this situation, one
can add pseudo-counts to guarantee that some probability is assigned to every
feature. A simple strategy, Laplace’s law, is to add 1 to each count.
Also note that naïve Bayes assumes independence among features. It calculates
their probabilities separately and multiplies them together during
classification. This assumption is violated severely in text vectors, where many
words and phrases are correlated. For example, amino acid often occurs as
a phrase, and observing either word increases the likelihood of observing the
other. In a naïve Bayes model, such correlated words are over-counted in the
classification decision. Nevertheless, naïve Bayes is often effective in practice.
Finally, naïve Bayes calculates probabilities for specific observed values.
Thus, continuous data such as word counts are either treated as boolean (seen
or not seen) or binned into discrete ranges.
2.2.3 Maximum Entropy
Similarly to naïve Bayes, Maximum Entropy uses a probabilistic framework.
However, it uses the information theoretic measure of entropy to quantify the
amount of information (the opposite of disorder) in probability distributions
(Cover and Thomas, 1991). It then chooses the model that contains the least
amount of information (highest entropy) not present in the training set (Man-
ning and Schutze, 1999). This ensures that the model does not contain biases
that adversely affect its ability to classify new data vectors.
Maximum entropy classifiers use a loglinear model:
$P(\vec{x}, c) = \frac{1}{Z} \prod_{i=1}^{D} \alpha_i^{f_i(\vec{x}, c)}$ (2.12)

$Z$ normalizes the result so that it is always within a range of 0 to 1, ensuring a
probability. $\alpha$ is a $D$-dimensional vector of weights; there is one weight for each
of the $D$ features $f_i(\vec{x}, c)$.
To train a maximum entropy model, an α vector is found such that the ex-
pectation of each feature in the model matches the expectation in the training
data.
$E_{p_{\text{model}}}[f_i] = E_{p_{\text{data}}}[f_i]$ (2.13)
The most popular training algorithm is generalized iterative scaling (GIS),
which is an expectation maximization approach. It converges to an optimal
solution (Darroch and Ratcliff, 1972), but requires the features to be binary.
To handle continuous features, theoretically slower gradient descent optimiza-
tion methods have been applied (Malouf, 2002). For completeness, I provide a
description and derivation of this method in Appendix A.
Note that the features in maximum entropy differ from those in other ma-
chine learning algorithms. The features here are functions that take both an
input vector and a class and return a value. For a hypothetical example, to
classify whether a document concerns regulatory networks, one possible fea-
ture could return 1 if the word MAPK is in the document and the class is
REGULATORY-NETWORK. This flexibility allows authors to construct complex
feature functions to capture interesting nuances of their domain.
Thus, from Equation 2.13, this feature formulation implicitly includes the
probabilities of the features as well as the prior probabilities of the classes.
Because the statistics are calculated across all features, the learning algorithm
accounts for their dependencies, unlike naıve Bayes. To assign a document to
a class, the algorithm chooses the class that yields the highest score.
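The loglinear scoring of Equation 2.12 can be sketched with hand-picked weights. The feature functions and α values below are hypothetical (in practice GIS would fit them from data), and the MAPK example echoes the one above:

```python
import math

# Hypothetical binary feature functions f_i(x, c): each examines both the
# document x (a set of words) and the candidate class c.
features = [
    lambda x, c: 1 if "MAPK" in x and c == "REGULATORY-NETWORK" else 0,
    lambda x, c: 1 if "kinase" in x and c == "REGULATORY-NETWORK" else 0,
    lambda x, c: 1 if c == "OTHER" else 0,   # behaves like a class prior
]
alphas = [3.0, 2.0, 1.5]   # assumed weights; GIS would fit these from data

def score(x, c):
    """Unnormalized P(x, c) from Equation 2.12; the 1/Z factor cancels
    when comparing classes for the same x."""
    return math.prod(a ** f(x, c) for a, f in zip(alphas, features))

def classify(x, classes=("REGULATORY-NETWORK", "OTHER")):
    return max(classes, key=lambda c: score(x, c))
```

Because each feature sees both the input and the class, weights can encode class priors and word-class interactions in one uniform mechanism.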
For classifying academic and financial web pages, maximum entropy performs
comparably to naïve Bayes on some data sets and more accurately on
others (Nigam et al., 1999). Modifying the algorithm to incorporate a Gaussian
prior for the features improved the results further. However, for classifying
biological text according to Gene Ontology codes, maximum entropy was over
10% more accurate than naïve Bayes and kNN (Raychaudhuri et al., 2002).
2.2.4 Logistic Regression
Another loglinear model with a simpler formulation than maximum entropy is
logistic regression (Hastie et al., 2001). Binary logistic regression distinguishes
between two classes and fits the feature vectors to a log odds (logit) function:

$\log \frac{p}{1 - p} = \vec{\beta} \cdot \vec{x}$ (2.14)

With some manipulation:

$p = \frac{e^{\vec{\beta} \cdot \vec{x}}}{1 + e^{\vec{\beta} \cdot \vec{x}}}$ (2.15)
where $\vec{x}$ is the feature vector, $p$ is the probability that it belongs to the positive
class, and $\vec{\beta}$ is a vector of weights. The model is trained by finding the $\vec{\beta}$
vector that maximizes the log likelihood of the training set:

$\ell(\beta) = \sum_{i=1}^{n} \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)$ (2.16)
where yi is 1 if feature vector i is in one class and 0 otherwise.
This equation is optimized using Newton’s method. Although it is not guar-
anteed to converge, it usually does so in practice (Hastie et al., 2001). Because
the training algorithm does not scale to large dimension spaces, logistic regres-
sion is not commonly used for text classification.
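A sketch of binary logistic regression in Python. For brevity it maximizes Equation 2.16 by plain gradient ascent rather than Newton's method, which converges faster but requires inverting a Hessian; the step size and iteration count are illustrative:

```python
import math

def sigmoid(z):
    """Equation 2.15 for a precomputed score z = beta . x."""
    if z < -30:   # guard against overflow for extreme scores
        return 0.0
    if z > 30:
        return 1.0
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, steps=500, lr=0.1):
    """Maximize the log likelihood of Equation 2.16 by gradient ascent."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        # p_i = predicted probability for training vector i
        p = [sigmoid(sum(b * xj for b, xj in zip(beta, xi))) for xi in X]
        # gradient of the log likelihood: sum_i (y_i - p_i) * x_i
        grad = [sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, p))
                for j in range(len(beta))]
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta
```

Each training step costs a full pass over the data per dimension, which hints at why this approach struggles in the very high-dimensional spaces of text.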
2.2.5 Support Vector Machines
Finally, support vector machines (SVM) are binary classifiers grounded in
strong statistical theory (Hastie et al., 2001). In general, classifiers demar-
cate the feature space according to classes of the training vectors. An SVM
finds a hyperplane in the middle of the closest data points of the two classes.
Those data points are called the support vectors.
An SVM classifier constructs the following hyperplanes:
$(\vec{w} \cdot \vec{x}_{\text{class1}}) + b \geq +1$ (2.17)
$(\vec{w} \cdot \vec{x}_{\text{class2}}) + b \leq -1$ (2.18)
$(\vec{w} \cdot \vec{x}_{\text{hyper}}) + b = 0$ (2.19)

where data points in one class (the positive class) have values $\geq +1$ and data
points in the other (the negative class) have values $\leq -1$. Equation 2.19
represents the hyperplane in between, which is the decision boundary.
To classify a new data point, the SVM projects it onto the decision boundary
hyperplane. Points on one side of the boundary belong to one class, those on
the other side belong to the other class.
To find the optimally separating hyperplane, observe that the distance
between the two hyperplanes is $\frac{2}{\|\vec{w}\|}$, which follows from the difference between
Equation 2.17 and Equation 2.18. Thus, to maximize the separation between the
hyperplanes, minimize $\|\vec{w}\|$. This is a constrained optimization problem solvable
with quadratic programming. Details are described in (Burges, 1998).
Although the decision boundary is linear, the formulation allows a function,
called a kernel function, to project the vectors into a higher dimension space.
The kernel function returns the similarity between two vectors as calculated
in the higher-dimensional space. By projecting vectors in this way, an SVM can
classify data points that are not separable in the lower dimension. Although higher dimensions
generally lead to overfitting (the models capture idiosyncrasies specific to
the training set, rather than learning broad trends), SVMs are theoretically
robust because their decision boundaries do not characterize either class too
closely.
Because they behave well in high-dimensional spaces, SVMs outperform naïve Bayes and
kNN in text classification (Joachims, 1997). They have been applied in the
biomedical domain to classify literature about proteins according to their sub-
cellular localization (Stapley et al., 2002).
2.2.6 Feature Selection
A typical corpus contains a large number of unique words. However, many of
those words are used in few documents. The frequency of a word is inversely
related to its rank, when sorted in decreasing frequency. This is called a Zipfian
distribution:

$f \propto \frac{1}{r}$ (2.20)

where $f$ is the frequency of the word and $r$ is its rank when sorted by decreasing
frequency (Manning and Schutze, 1999). The product of the rank of a word
and its frequency is a constant called Zipf's constant, $Z = f \cdot r$.

        Probabilistic  Multiway        Handles     Continuous
        Scores         Classification  Dependence  Data
kNN          N              Y               Y          Y
NB           Y              Y               N          N
ME           Y              Y               Y          Y
LR           Y              N               Y          Y
SVM          Y              N               Y          Y

Table 2.1: Overview of Machine Learning Algorithms. KNN is k nearest neighbors, NB is naïve Bayes, ME is maximum entropy, LR is logistic regression, and SVM is support vector machine. Probabilistic Scores indicates whether the method yields probabilistically interpretable scores. Multiway Classification indicates whether the method can classify more than two classes at a time (this can also be emulated by implementing N one-versus-rest classifiers). Handles Dependence indicates whether the method correctly handles correlations among features. Continuous Data indicates whether the method can handle continuous data, rather than binned or binary data.

Method  Parameters
kNN     k, distance metric
NB      pseudocounts
ME      none
LR      none
SVM     kernel, error penalty

Table 2.2: Parameters for Machine Learning Algorithms. For optimal performance, most machine learning algorithms require parameter fitting.
Method  Training Algorithm        Speed   Classifying Algorithm  Speed
kNN     none                      none    look through data      slow
NB      counting data             fast    vector multiply        fast
ME      expectation maximization  slow    vector multiply        fast
LR      Newton's method           medium  vector multiply        fast
SVM     quadratic programming     slow    vector multiply        fast

Table 2.3: Performance of Machine Learning Algorithms. This contains a general description of the algorithms used for training and classification in various machine learning methods. The speed columns indicate relative algorithmic complexity.
Zipf's Law was noticed empirically and can be observed across domains,
including in biological literature (Figure 2.2).
Zipf's Law implies, for statistical classifiers, that many words in the corpus
are uninformative. Approximately half of the words will only appear once.
Also, some words are uninformative because they appear too frequently. For
example, the word the is the most common word in the SWISS-PROT corpus
(and in most corpora) and comprises 7% of the words in the corpus. The 10
most common words in these abstracts (the, of, and, a, in, to, that, is, with,
gene) constitute 25% of the corpus. Such words are often called stop words.
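The relation $Z = f \cdot r$ and the dominance of stop words can be checked on any small corpus with a few lines of Python (the toy sentence below is illustrative; the SWISS-PROT corpus itself is not reproduced here):

```python
from collections import Counter
import re

def zipf_table(text, top=5):
    """Rank words by frequency and report f * r, which Zipf's Law
    predicts is roughly constant across ranks."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    ranked = counts.most_common()
    return [(word, rank, freq, freq * rank)
            for rank, (word, freq) in enumerate(ranked[:top], start=1)]
```

On a real corpus the fourth column settles into a narrow band after the first few hundred ranks, as Figure 2.2 shows.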
Ignoring stop words in machine learning algorithms removes a source of
noise and simplifies the classification task by reducing the dimensionality of
the feature vectors. By using only the most informative features, the perfor-
mance of the algorithm can be improved.
Some common methods for feature selection are word count cutoffs, mutual
information, or $\chi^2$ tests. Using a word count cutoff is simplest. Here, the most
common words are considered the least informative and thus ignored.
Figure 2.2: Zipf's Law. The SWISS-PROT database links protein sequences to MEDLINE citations (Bairoch and Boeckmann, 1991). I collected a corpus of the MEDLINE abstracts cited in SWISS-PROT version 37. This plot shows that the words in this corpus roughly follow a Zipfian distribution. Zipf's constant, plotted on the vertical axis, is roughly constant starting from word 200. [Plot: Zipf's constant (0 to 1,400,000) versus word rank (1 to 1001).]
The other two methods, mutual information and $\chi^2$, examine the frequency
of words in a two-way contingency table $O$:

               Class 1    Class 2
Has Word       $O_{11}$   $O_{12}$
Without Word   $O_{21}$   $O_{22}$
Mutual information is grounded in information theory. It measures the as-
sociation between the word and the classes. If the occurrence of a word is
independent of the class of the document, the mutual information will be 0.
Conversely, high dependence yields high scores. To calculate the mutual infor-
mation:
$\sum_{i,j}^{2} P(O_{ij}) \log \frac{P(O_{ij})}{P(O_{i\cdot})\,P(O_{\cdot j})}$ (2.21)
The best features are the words with the highest mutual information scores.
However, mutual information uses probabilities and does not account for
the number of observations. Since 1/4 and 25/100 are both 25%, they are consid-
ered equal, even though the variances in these estimates are different. Thus,
mutual information is particularly sensitive to infrequent words, which, according
to Zipf's law, are very common.
In contrast, the $\chi^2$ test, grounded in statistical theory, does account for the
actual number of observations. It measures the probability that the word
distribution could be observed by random chance by comparing against the
expected distribution.[1] It computes the expected word counts from the marginal
probabilities of the word and class using:

$E_{ij} = P(O_{i\cdot}) \cdot P(O_{\cdot j}) \cdot N$ (2.22)

where $N$ is the sum of the observed matrix $O$. Then, it calculates a $\chi^2$ score
using:

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ (2.23)
[1] In other applications, some have argued that a log ratio test may produce more accurate statistics for rare events (Dunning, 1993).
The words whose distributions are most highly correlated with the classes have the
highest $\chi^2$ scores and are the best features.
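Both scores can be computed directly from the 2 × 2 contingency table; a sketch, with cell names $O_{11}$ through $O_{22}$ following the table above:

```python
import math

def feature_scores(o11, o12, o21, o22):
    """o11/o12: documents in class 1/2 containing the word;
    o21/o22: documents in class 1/2 without it.
    Returns (mutual information, chi-square)."""
    O = [[o11, o12], [o21, o22]]
    N = o11 + o12 + o21 + o22
    row = [o11 + o12, o21 + o22]          # with word / without word
    col = [o11 + o21, o12 + o22]          # class 1 / class 2
    mi = chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / N            # Equation 2.22
            if O[i][j] > 0:
                p = O[i][j] / N
                mi += p * math.log(p / ((row[i] / N) * (col[j] / N)))  # Eq. 2.21
            chi2 += (O[i][j] - expected) ** 2 / expected               # Eq. 2.23
    return mi, chi2
```

When the word occurs independently of the class (e.g. equal counts in every cell), both scores are zero; stronger dependence pushes both upward, but only $\chi^2$ grows with the number of observations.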
An extensive study comparing methods for feature selection found that the
$\chi^2$ test performed best (Yang and Pedersen, 1997). Surprisingly, using a word
count cutoff performed nearly as well. Because of its simplicity and low
computational demands, the word count cutoff is often used for feature selection. The
mutual information method performed poorly, due in part to its sensitivity to
infrequent words.

This study also showed that text classification algorithms can be accurate
even with many features discarded: $\chi^2$ performs well with up to 98% of the
features removed, and a word count cutoff with up to 90% removed.
All the methods presented thus far evaluate features independently. Handling
correlations among features requires an exhaustive search of all possible
subsets. Unfortunately, this is computationally intractable for the high-dimensional
feature spaces in text. However, some work has been done on approximating
good feature sets without an exhaustive search (Koller and Sahami, 1996).
2.3 Categorizing Words
Supervised machine learning methods have been applied to identify the meaning
of an unknown word from its context. The idea that context is informative
was immortalized by J. R. Firth, who coined the phrase "You shall know a word
by the company it keeps" (Firth, 1957).
In a machine learning framework, the possible meanings of a word are the
classes, and the features of the training vectors are the neighboring words. The
closest neighbors contain the most information, but distant words sometimes
also include information. Thus, the number of neighbors to examine varies and
is usually discovered empirically.
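A sketch of how neighboring words become features for this kind of classifier; the window size of 3 and the feature-naming scheme are illustrative choices:

```python
def context_features(tokens, i, window=3):
    """Features for disambiguating tokens[i]: the words within `window`
    positions, keyed by their offset from the ambiguous word."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats[f"word_at_{offset:+d}"] = tokens[j].lower()
    return feats

tokens = "the TPMT gene is expressed in liver".split()
# features for the ambiguous token "TPMT" at position 1
print(context_features(tokens, 1))
```

Such feature dictionaries plug directly into any of the classifiers above, with the word's possible meanings as the classes.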
Word sense disambiguation has been applied in bioinformatics to distin-
guish whether a word is a gene, protein, or mRNA. Since the name of a gene
can be the same as the mRNA or protein product, the algorithm must discover
its meaning based on context (Hatzivassiloglou et al., 2001). This study found
considerable ambiguity, but still achieved up to 85% accuracy.
CHAPTER 3

Finding Abbreviations
An algorithm that automatically finds and defines abbreviations can simplify
gene name identification, which is a necessary component of systems that can
identify gene-drug relationships. Many long protein and gene names are ab-
breviated, such as thiopurine methyltransferase (TPMT). Algorithms that rec-
ognize abbreviations can use both the long form and the abbreviation to help
identify a gene. In this example, methyltransferase is easily identified from its
appearance. From the abbreviation, it is clear that 1) TPMT is also a protein,
and 2) thiopurine is part of the protein name.
I define abbreviation broadly as a shortened form of a longer word or phrase
(the long form). An acronym is typically defined as a type of abbreviation in
which the short form is a conjunction of the initial letters of words in the long
form; some authors also require them to be pronounceable.
Using such a strict definition excludes many types of abbreviations that
appear in biomedical literature. Writers create abbreviations in many different
ways as summarized here:
Portions of this chapter have appeared in Chang et al. (2002).
Abbreviation  Definition               Description
VDR      ⇒   vitamin D receptor       The letters align to the beginnings of the words.
PTU      ⇒   propylthiouracil         The letters align to a subset of syllable boundaries.
JNK      ⇒   c-Jun N-terminal kinase  The letters align to punctuation boundaries.
IFN      ⇒   interferon               The letters align to some other place.
SULT     ⇒   sulfotransferase         The abbreviation contains contiguous characters from a word.
ATL      ⇒   adult T-cell leukemia    The long form contains words not in the abbreviation.
CREB-1   ⇒   CRE binding protein      The abbreviation contains letters not in the long form.
beta-EP  ⇒   beta-endorphin           The abbreviation contains complete words.
Nevertheless, the numerous lists of abbreviations covering many domains
attest to broad interest in identifying them. Opaui, a web portal for abbrevia-
tions, contains links to 152 lists alone (Opaui). Some are compiled by individu-
als or groups (Acronyms and Initialisms; Human Genome Acronym List). Oth-
ers accept submissions from users over the internet (Acronym Finder; Three-
Letter Abbreviations). For the medical domain, a manually-collected published
dictionary contains over 10,000 entries (Jablonski, 1998).
Because of the size and growth of the biomedical literature, manual compi-
lations of abbreviations suffer from problems of completeness and timeliness.
Automated methods for finding abbreviations are therefore of great potential
value. In general, these methods scan text for candidate abbreviations and
then apply an algorithm to match them with the surrounding text. Most ab-
breviation finders fall into one of three types.
The simplest type of algorithm matches the letters of an abbreviation to
the initial letters of the words around it. The algorithm for recognizing this is
relatively straightforward, although it must perform some special processing
to ignore common words. Taghva gives an example Office of Nuclear Waste
Isolation (ONWR) where the O can be matched with the initial letter of either
Office or of (Taghva and Gilbreth, 1995).
More complex methods relax the first letter requirement and allow matches
to other characters. These typically use heuristics to favor matches on the
first letter or syllable boundaries, upper case letters, length of acronym, etc.
(Yoshida et al., 2000) However, Yeates notes the challenge in finding optimal
weights for each heuristic and further posits that machine learning approaches
may help (Yeates, 1999).
Another approach recognizes that the alignment between an abbreviation
and its long form often follows a set of patterns (Larkey et al., 2000; Puste-
jovsky et al., 2001; Nenadic et al., 2002; Wren and Garner, 2002; Yu et al.,
2002b). Thus, a set of carefully and manually crafted rules governing allowed
patterns can recognize abbreviations. Furthermore, one can control the per-
formance of the system by adjusting the set of rules, trading off between the
leniency in which a rule allows matches and the number of errors that it intro-
duces. Also, good results have been reported from a system that simply looks
for letter matches close to the abbreviation (Schwartz and Hearst, 2003).
In their rule-based system, Pustejovsky et al. introduced an interesting in-
novation by including lexical information (Pustejovsky et al., 2001). Their in-
sight is that abbreviations are often composed from noun phrases, and that
constraining the search to definitions in the noun phrases closest to the abbre-
viation will improve precision. With the search constrained, they found that
they could further tune their rules to also improve recall.
Finally, there is one completely different approach to abbreviation search
based on compression (Yeates et al., 2000). The idea here is that a correct ab-
breviation gives better clues to the best compression model for the surrounding
text than an incorrect one. Thus, a normalized compression ratio built from the
abbreviation gives a score capable of distinguishing abbreviations.
3.1 An Algorithm to Identify Abbreviations
I decompose the abbreviation-finding problem into four components: 1) scan-
ning text for occurrences of possible abbreviations, 2) aligning the candidates
to the preceding text, 3) converting the abbreviations and alignments into a
feature vector, and 4) scoring the feature vector using a statistical machine
learning algorithm (Figure 3.1).
3.1.1 Finding Abbreviation Candidates
I searched for possible abbreviations inside parentheses, assuming that they
followed the pattern:
long form ( abbreviation )
For every pair of parentheses, I retrieved the words up to a comma or semi-
colon. I rejected candidates longer than two words, candidates without any
letters, and candidates that exactly matched the words in the preceding text.
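The scanning rules above can be sketched as follows. This is my own illustration, not the dissertation's original code; the function name, the regular expression, and the reading of "exactly matched the words in the preceding text" are all assumptions.

```python
import re

def find_candidates(sentence):
    """Scan for parenthesized abbreviation candidates following the
    pattern `long form ( abbreviation )`.

    Returns (prefix_words, candidate) pairs. Candidates longer than two
    words, without letters, or whose words all appear verbatim in the
    preceding text are rejected. The prefix is capped at 3N words,
    where N is the number of letters in the candidate.
    """
    candidates = []
    for match in re.finditer(r'\(([^()]*)\)', sentence):
        # Keep only the words up to a comma or semicolon.
        inner = re.split(r'[,;]', match.group(1))[0].strip()
        words = inner.split()
        if not words or len(words) > 2:
            continue  # reject candidates longer than two words
        if not any(c.isalpha() for c in inner):
            continue  # reject candidates without any letters
        prefix_words = sentence[:match.start()].split()
        if all(w in prefix_words for w in words):
            continue  # reject exact repeats of preceding words
        n_letters = sum(c.isalpha() for c in inner)
        candidates.append((prefix_words[-3 * n_letters:], inner))
    return candidates
```

For example, `find_candidates("the cyclin-dependent kinase (CDK) gene")` yields the candidate `CDK` paired with its three prefix words.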
Figure 3.1: Abbreviation System Architecture. I use a machine learning approach to find and score abbreviations. First, I scan text to find possible abbreviations, align them with their prefix strings, and then collect a feature vector based on 8 characteristics of the abbreviation and alignment. Finally, I apply binary logistic regression to generate a score from the feature vector.
For each abbreviation candidate, I saved the words before the open paren-
thesis (the prefix) so that I could search them for the long form of the abbre-
viation. Although I could have included every word from the beginning of the
sentence, as a computational optimization, I only used 3N words, where N was
the number of letters in the abbreviation. I chose this limit conservatively
based on an informal observation that I always found long forms well within
3N words.
3.1.2 Aligning Abbreviations with their Prefixes
For each pair of abbreviation candidate and prefix, I found the alignment of
the letters in the abbreviation with those in the prefix. This is a case of the
Longest Common Subsequence (LCS) problem studied in computer science and
adapted for biological sequence alignment in bioinformatics (Needleman and
Wunsch, 1970).
I found the optimal alignment between two strings X and Y using dynamic
programming in O(NM) time, where N and M were the lengths of the strings.
This algorithm is expressed as a recurrence relation:
M[i, j] = \begin{cases}
0 & i = 0 \text{ or } j = 0 \\
M[i-1, j-1] + 1 & i, j > 0 \text{ and } X_i = Y_j \\
\max(M[i, j-1],\, M[i-1, j]) & i, j > 0 \text{ and } X_i \neq Y_j
\end{cases}    (3.1)
M is a score matrix, and M [i, j] contains the total number of characters
aligned between the substrings X1...i and Y1...j. To recover the aligned char-
acters, I created a traceback parallel to the score matrix. This matrix stored
pointers to the indexes preceding M [i, j]. After generating these two matrices,
I recovered the alignment by following the pointers in the traceback matrix.
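The recurrence and traceback can be sketched as below. The case-insensitive letter comparison is my assumption; the dissertation does not specify how case is handled during alignment.

```python
def lcs_align(x, y):
    """Align abbreviation x against prefix y using the LCS recurrence
    (Equation 3.1), with a traceback matrix to recover which
    characters are aligned. Returns (score, list of (i, j) pairs)."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]       # score matrix
    back = [[None] * (m + 1) for _ in range(n + 1)]  # traceback pointers
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1].lower() == y[j - 1].lower():
                M[i][j] = M[i - 1][j - 1] + 1
                back[i][j] = (i - 1, j - 1)
            elif M[i][j - 1] >= M[i - 1][j]:
                M[i][j] = M[i][j - 1]
                back[i][j] = (i, j - 1)
            else:
                M[i][j] = M[i - 1][j]
                back[i][j] = (i - 1, j)
    # Follow the traceback pointers to recover the aligned characters.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:  # a diagonal move marks a match
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    return M[n][m], list(reversed(pairs))
```

Running `lcs_align("PKA", "protein kinase A")` aligns all three letters, matching P, K, and A to the word-initial characters of the long form.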
3.1.3 Computing Features from Alignments
Next, I calculated feature vectors that quantitatively described each candidate
abbreviation and the alignment to its prefix. For the abbreviation recognition
task, I used 9 features described in Table 3.1. Each feature constituted one
Feature            Description                                                      β

Describes the abbreviation
LowerAbbrev        Percent of letters in abbreviation in lowercase.             -1.21

Describes where the letters are aligned
WordBegin          Percent of letters aligned at the beginning of a word.        5.54
WordEnd            Percent of letters aligned at the end of a word.             -1.40
SyllableBoundary   Percent of letters aligned on a syllable boundary.            2.08
HasNeighbor        Percent of letters aligned immediately after another letter.  1.50

Describes the alignment
Aligned            Percent of letters in the abbreviation that are aligned.      3.67
UnusedWords        Number of words in the prefix not aligned to the             -5.82
                   abbreviation.
AlignsPerWord      Average number of aligned characters per word.                0.70

Miscellaneous
CONSTANT           Normalization constant for logistic regression.              -9.70
Table 3.1: Features Used to Score Abbreviations. These features are used to calculate the score of an alignment using Equation 2.15. I identified syllable boundaries using the algorithm used in TeX (Knuth, 1986). The β column indicates the weight given to each feature. The sign of the weight indicates whether that feature is favorably associated with real abbreviations.
dimension of a 9-dimensional feature vector.
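A few of the Table 3.1 features can be computed from an alignment as sketched below. This is a simplified illustration under my own naming; the full system uses all 9 features, including the syllable-boundary feature, which requires the TeX hyphenation algorithm and is omitted here.

```python
def alignment_features(abbrev, prefix, pairs):
    """Compute a subset of the Table 3.1 features.

    `pairs` lists (abbrev_index, prefix_index) positions aligned
    between the abbreviation and its prefix string.
    """
    n = len(pairs) or 1
    aligned = set(pairs)
    # Indices in the prefix where a word begins.
    word_starts = {0} | {i + 1 for i, c in enumerate(prefix) if c == ' '}
    return {
        # Percent of letters in the abbreviation in lowercase.
        'LowerAbbrev': sum(c.islower() for c in abbrev) / len(abbrev),
        # Percent of aligned letters at the beginning of a word.
        'WordBegin': sum(j in word_starts for _, j in pairs) / n,
        # Percent of abbreviation letters that are aligned.
        'Aligned': len(pairs) / len(abbrev),
        # Percent of letters aligned immediately after another letter.
        'HasNeighbor': sum((i - 1, j - 1) in aligned for i, j in pairs) / n,
    }
```

For the alignment of PKA to "protein kinase A", every letter aligns at a word beginning, so `WordBegin` and `Aligned` are both 1.0.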
3.1.4 Scoring Alignments with Logistic Regression
Finally, I used a supervised machine learning algorithm to recognize abbrevia-
tions. To train this algorithm, I created a training set of 1000 randomly-chosen
candidates identified from a set of MEDLINE abstracts pertaining to human
genes, which I had compiled for another purpose. For the 93 real abbreviations,
I hand-annotated the alignment between the abbreviation and prefix.
Next, I generated all possible alignments between the abbreviations and
prefixes in my set of 1000. This yielded my complete training set, which con-
sisted of 1) alignments of incorrect abbreviations, 2) correct alignments of cor-
rect abbreviations, and 3) incorrect alignments of correct abbreviations. I con-
verted these alignments into feature vectors.
Using these feature vectors, I trained a binary logistic regression classifier
(Hastie et al., 2001). I chose this classifier based on its lack of assumptions
on the data model, ability to handle continuous data, speed in classification,
and probabilistically interpretable scores. To alleviate singularity problems, I
removed all the duplicate vectors from the training set.
Finally, the score of an alignment is the probability calculated from Equa-
tion 2.15 using the optimal β vector. The score of an abbreviation is the maxi-
mum score of all the alignments to its prefix.
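The scoring step can be sketched with the fitted weights from Table 3.1, assuming Equation 2.15 (from an earlier chapter, not shown here) is the standard binary logistic function; the function and dictionary names are mine.

```python
import math

# Fitted weights (beta column of Table 3.1); CONSTANT is the intercept.
BETA = {
    'LowerAbbrev': -1.21, 'WordBegin': 5.54, 'WordEnd': -1.40,
    'SyllableBoundary': 2.08, 'HasNeighbor': 1.50, 'Aligned': 3.67,
    'UnusedWords': -5.82, 'AlignsPerWord': 0.70, 'CONSTANT': -9.70,
}

def score(features):
    """Logistic regression score: P(real abbreviation | features)."""
    z = BETA['CONSTANT'] + sum(BETA[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

A fully aligned candidate whose letters sit at word beginnings and syllable boundaries scores near 1, while an all-zero feature vector scores near 0, reflecting the large negative intercept.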
3.1.5 Implementing the Algorithm
I implemented the code in Python 2.2 (Lutz et al., 1999) and C with the Biopy-
thon 1.00a4 and mxTextTools 2.0.3 libraries. The website was built with Red Hat Linux 7.2, MySQL 3.23.46, and Zope 2.5.0 on a Dell workstation with a 1.5 GHz Pentium IV and 512 MB of RAM.
3.2 Performance of the Abbreviation Identifica-
tion Algorithm
I evaluated the performance of the algorithm on the Medstract acronym
gold standard (Pustejovsky et al., 2001). It contains MEDLINE abstracts
with expert-annotated abbreviations and forms the basis of the evaluation
of Acromed. The gold standard is publicly available as an XML file at
http://www.medstract.org/gold-standards.html.
I ran my algorithm against the Medstract gold standard (after correcting
6 typographical errors in the XML file) and generated a list of the predicted
abbreviations, long forms, and their scores. With these predictions, I calculated
the recall and precision at every possible score cutoff, generating a
recall/precision curve.
I counted an abbreviation/long form pair correct if it matched the gold stan-
dard exactly, considering only the highest scoring pair for each abbreviation. To
be consistent with Acromed’s evaluation on Medstract, I allowed mismatches
in 10 cases where the long form contained words not indicated in the abbrevia-
tion. For example, I accepted protein kinase A for PKA and did not require the
full cAMP-dependent protein kinase A indicated in the gold standard.
I ran my algorithm against the Medstract gold standard and calculated the
recall and precision at various score cutoffs (Figure 3.2). Identifying 140 out of
168 correctly, it obtained a maximum recall of 83% at 80% precision (Table 3.2).
The recall/precision curve plateaued at two levels of precision, 97% at 22%
recall (score=0.88) and 95% at 75% recall (score=0.14).
[Figure: recall/precision curve. The x-axis is Recall (0 to 1); the y-axis is Interpolated Precision (0.5 to 1). Score cutoffs of 0.88, 0.14, and 0.03 are marked on my algorithm's curve, with the two Acromed variants (regular expression; syntactic information) plotted for comparison.]

Figure 3.2: Abbreviations Predicted in Medstract Gold Standard. I calculated the recall and precision at every score cutoff and plotted the resulting curve. I marked the scores at various points on the curve. The performance of the Acromed system is shown for comparison.
#   Description                                        Example
12  Abbreviation and long form are synonyms.           apoptosis ⇒ programmed cell death
7   Abbreviation is outside parentheses.
3   Best alignment score yields incorrect long form.   FasL ⇒ Fas and Fas ligand
3   Letters in abbreviation are out of order.          ATN ⇒ anterior thalamus
25  TOTAL

Table 3.2: Types of Abbreviations Missed. My algorithm failed to find 25 total abbreviations in the Medstract gold standard. This table categorizes the types of abbreviations and the number of each type missed.
At a score cutoff of 0.14, the algorithm made 8 errors. 7 of those errors
were abbreviations missing from the gold standard: primary ethylene response
element (PERE), basic helix-loop-helix (bHLH), intermediate neuroblasts de-
fective (ind), Ca2+-sensing receptor (CaSR), GABA(B) receptor (GABA(B)R1),
polymerase II (Pol II), and GABAB receptor (GABA(B)R2). The final error oc-
curred where an unfortunate sequence of words in the prefix yielded a higher
scoring alignment than the long form: Fas and Fas ligand (FasL).
3.3 Clarifying and Reconciling Notions of Ab-
breviations
Although my algorithm could find abbreviations from MEDLINE accurately,
the evaluation against Medstract remains unsatisfying. The evaluation uncov-
ered some subtleties and ambiguities in the abbreviation identification prob-
lem. To gain a better understanding of the problem, I had an expert reanno-
tate the Medstract gold standard to reveal differences among expert notions of
abbreviations.
3.3.1 Reannotating the Medstract Gold Standard
The goals for reannotating Medstract are twofold. First, the comparison of the
new annotations to the original ones should reveal differences in definitions of
abbreviations; therefore, the notion of abbreviation should not be biased by the
one in Medstract. Second, to generate a more complete standard, the new data
set should not omit any correct annotations already found in the original.
To address these two goals, I used a two-pass approach with an Expert
(Daniel Rubin) not involved in the development of abbreviation identification
algorithms, but familiar with the problem. The Expert was a board certified
physician with postdoctoral training in Biomedical Informatics. In the first
pass, both the Expert and I identified the abbreviations in Medstract. I asked
the Expert to mark the abbreviations in the gold standard and did not give
further definition of abbreviation.
The second pass resolved differences due to inconsistent annotation and hu-
man error. Here, the Expert alone had the authority to resolve the differences
between his annotations and those in Medstract and my list. I presented the
Expert each difference and asked whether he wanted to change his annota-
tions. There were four possible types of differences: 1) the Expert had an ab-
breviation not annotated elsewhere, 2) the Expert was missing an abbreviation
annotated elsewhere, 3) the long forms annotated had incongruent boundaries,
and 4) the long forms annotated were completely different. The resolved ab-
breviations yielded the Expert gold standard.
3.3.2 Comparing the Medstract and Expert Gold Stan-
dards
During the initial markup, the Expert identified 154 abbreviations in Med-
stract, which is fewer than the 168 marked in the gold standard. During the
adjudication step, the Expert added an additional 14 abbreviations, removed 3,
and changed the long form boundaries on 2. There were no instances where the
same abbreviation was annotated with different long forms.
Then, I compared the final Expert gold standard against Medstract. Dis-
regarding differences in the long form boundaries, the gold standards agreed
on 151 abbreviations. Expert had 10 abbreviations not in Medstract, and con-
versely, Medstract had 13 abbreviations not in Expert. Thus, the inter-observer
agreement was 87%, calculated as:
# same abbreviations / (# same + # differences)
The Expert disagreed with Medstract on 12 borders.
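Plugging in the counts above confirms the arithmetic; a quick sketch:

```python
# Agreement between the Expert and Medstract gold standards:
# 151 shared abbreviations, 10 only in Expert, 13 only in Medstract.
same, expert_only, medstract_only = 151, 10, 13
agreement = same / (same + expert_only + medstract_only)
print(f"inter-observer agreement: {agreement:.0%}")  # → 87%
```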
Finally, I compared the results of my abbreviation identification algorithm
against the Expert gold standard. The algorithm I used here has been modified
from the one reported above, so that it also recognizes abbreviations in the
form:
abbreviation ( long form )
in addition to the pattern handled previously. Here, the algorithm attained
a precision and recall of 98.7% and 95.2%, in contrast to the 94.3% and 88.1%
on Medstract.
3.3.3 Defining Abbreviations
Comparing the results of the annotations revealed latent assumptions in the
definition of abbreviations. There are three main types of differences in the
markups: differing definitions of abbreviations, disagreement on the bound-
aries of long forms, and overlooked abbreviations.
First, there is considerable variation in the definition of an abbreviation.
In the broadest sense, an abbreviation is a shortened version of a longer word
or phrase. For example, IFN is constructed from the letters in interferon. An
acronym is a special case of an abbreviation where the letters are constructed
from the first letters of the words, such as NAT for N-Acetyl Transferase.
Although many abbreviations are synonyms for their long forms, some are
instead hypernyms or hyponyms. They can be more general or more specific
than the long form. One example is HOT-SPOT (HOT1). The abbreviation
contains more information and indicates a specific variant of the HOT-SPOT
gene. While this construction might technically be a parenthetical statement,
it also falls within my definition of abbreviation because the abbreviation is
constructed from the letters in the long form.
Similarly, the long forms can contain different amounts of information.
There is often ambiguity in the boundaries of the long form. The disagree-
ments stem from domain knowledge, where an expert may include more words
than indicated in the abbreviation, based on their expert knowledge. For ex-
ample, a strictly letter matching heuristic on RNA Polymerase I (Pol I) would
indicate a long form of Polymerase I. In this case, many experts would include
the word RNA because of their expert understanding of Pol I. Another example
is lateral arcuate nucleus (Arc). Experts knowledgeable in anatomy would in-
clude the word lateral. To address this, future algorithms will need to predict
boundaries based on the usage of phrases in the text.
Finally, Medstract also includes synonyms where neither entity is con-
structed from the letters of the other. These fall under the broad class of en-
tities called acronym-meaning pairs in Medstract. Some pairs found in Med-
stract are ommatidia and dorsal and ventral; and apoptosis and programmed
cell death. These are synonyms and have a short and long form, but are not
abbreviations because the short form is not constructed from the letters found
in the long form.
The inclusion of acronym-meaning (non-abbreviation) pairs in Medstract
led to many differences when compared against the Expert gold standard. Out
of those 168 acronym-meaning pairs in Medstract, 13 were not abbreviations.
These were the same 13 that Expert did not annotate.
The final source of discordance with the Medstract gold standard stemmed
from missing abbreviations. This is likely due to human error; the Expert
overlooked 14 in his first pass. This problem in particular complicates eval-
uations. Algorithms evaluated against Medstract can be unfairly penalized
for correct answers, leading to situations where better algorithms can receive
lower scores. When an algorithm identifies a correct abbreviation not anno-
tated in the gold standard, the precision drops. In addition to difficulty mea-
suring precision, the recall is inflated because the formula does not account for
the missing abbreviations.
3.4 Compiling the Abbreviations in MEDLINE
Nevertheless, I applied the algorithm and scanned for abbreviations in all
MEDLINE abstracts through the year 2001. I kept only the predictions that
scored at least 0.001. This computation required 70 hours using 5 processors
on a Sun Enterprise E3500 running Solaris 2.6. In all, I processed 6,426,981
MEDLINE abstracts (only about half the 11,447,996 citations had abstracts)
at an average rate of 25.5 abstracts/second.
From this scan, I identified a total of 1,948,246 abbreviations, and 20.7% of
them were defined in more than one abstract. 2.7% were found in 5 or more
abstracts. 2,748,848 (42.8%) of the abstracts contained at least 1 abbreviation
and 23.7% of them contained 2 or more.
Out of the nearly two million abbreviation/definition pairs, there were only
719,813 distinct abbreviations because many of them had different definitions;
e.g., AR can be autosomal recessive, androgen receptor, amphiregulin, aortic
regurgitation, aldose reductase, etc. 156,202 (21.7%) abbreviations had more
than one definition.
The average number of definitions for abbreviations with 6 characters or
fewer was 4.61, higher than the 2.28 reported by (Liu et al., 2001). One possible
reason for this discrepancy is that Liu’s method correctly counts morphological
variants of the same definition. Both methods, however, overcount definitions
that have the same meaning, but different words. I found that 37.5% of
the abbreviations with 6 characters or fewer had multiple definitions, which
concurs with Liu's 33.1%.
781,632 of the abbreviations had a score of at least 0.14. Of those,
328,874 (42.1%) were acronyms, i.e. they were composed of the first letters
of words.
[Figure: yearly counts (0 to 400,000) of new abstracts and new abbreviations added to MEDLINE, 1975-2000.]

Figure 3.3: Growth of Abbreviations. The number of abstracts and abbreviations added to MEDLINE steadily increases.
The growth rate of both abstracts in MEDLINE and new abbreviation def-
initions is increasing (Figure 3.3). 64,262 new abbreviations were introduced
last year, and there is an average of 1 new abbreviation in every 5-10 abstracts.
To evaluate the coverage of the database of predicted abbreviations from
MEDLINE, I used a list of abbreviations from the China Medical Tribune, a
weekly Chinese language newspaper covering medical news from Chinese jour-
nals (China Medical Tribune). The web site includes a dictionary of 452 com-
monly used English medical abbreviations with their long forms. I searched
the database for these abbreviations (after correcting 21 spelling errors) and
[Figure: histogram of the number of abbreviations (0 to 350) by score (0.05 to 0.95).]

Figure 3.4: Scores of Abbreviations Found in China Medical Tribune. Using a score cutoff of 0.90 yields a recall of 68%; a cutoff of 0.14, 87%; and a cutoff of 0.03, 88%.
calculated the recall as:
recall = # long forms identified / # abbreviations (= 452)    (3.2)
I searched for abbreviations from the China Medical Tribune against my
database of all MEDLINE abbreviations. Allowing differences in capitalization
and punctuation, I matched 399 of the 452 abbreviations to their correct long
forms for a maximum recall of 88% (Figure 3.4). Using a score cutoff of 0.14
yields a recall of 395/452 = 87%.
Out of the 53 abbreviations missed, 11 of them were in the database as a
close variation, such as Elective Repeat Caesarean-Section instead of Elective
Repeat C-Section. Also, the algorithm could identify all but 8 of the 53 with a
score cutoff of 0.14.
During validation, I found that the server contained 88% of the abbrevia-
tions from the dictionary in the China Medical Tribune.
Since the abbreviation list from the China Medical Tribune was created
independently of MEDLINE, the results suggest that the database contains
nearly all biomedical abbreviations. To improve the recall even further, Yu
has shown that linking to external dictionaries of abbreviations can augment
the ability of automated methods to assign definitions that are not indicated
in the text (Yu et al., 2002b). Nevertheless, this shows that my abbreviation
identification algorithm can successfully identify abbreviations, and also that
MEDLINE abstracts are a rich source of biomedical abbreviations.
3.5 Conclusions
Due to the enormous number of abbreviations currently in MEDLINE and the
rate at which prolific authors define new ones, maintaining a current dictio-
nary of abbreviation definitions clearly requires automated methods. Since
nearly half of MEDLINE abstracts contain abbreviations, computer programs
analyzing this text will frequently encounter them and can benefit from their
identification. Since fewer than half of all abbreviations are formed from the
initial letters of words, automated methods must handle more sophisticated
and non-standard constructs.
Thus, I used machine learning to create a method robust to varied abbrevi-
ating patterns. I evaluated it against the Medstract gold standard because it
was easily available, it eliminated the need to develop an alternate standard,
and it provided a reference point to compare methods.
The majority of the errors on this data set (see Table 3.2) occurred because
the gold standard included synonyms, words and phrases with identical mean-
ings, in addition to abbreviations. In these cases, the algorithm could not find
the correspondences between letters, indicating a fundamental limitation of
letter matching techniques.
My precision in this evaluation was hurt by abbreviations missing from the
gold standard. My algorithm identified 8 of these, and 7 had scores higher
than 0.14. Disregarding these cases yields a precision of 99% at 75% recall,
which is comparable to Acromed at 98% and 72%.
It is important for a gold standard to accurately differentiate the performance
of algorithms. To ensure this, gold standards themselves should be
carefully reviewed. I therefore reviewed the Medstract gold standard by
reannotating it and analyzing the differences. This re-analysis has revealed
ambiguities and latent assumptions in the definition of an abbreviation. Many
of these are not handled explicitly in the first generation of abbreviation iden-
tification algorithms. Gaining a deeper understanding of abbreviations has led
to an improved gold standard and created a blueprint for the development of
second generation systems.
Finally, I applied the algorithm to search for abbreviations in all abstracts
in MEDLINE. Although this run completed in a reasonable amount of time,
under 3 days, the algorithm could be optimized by reducing the number of
alignments between abbreviations and prefixes that must be scored. One way
to do this is to encode the features into the alignment step to discard poor
alignments. The current algorithm uses dynamic programming to align abbre-
viations to possible long forms, giving equal weight to all matches. However,
assigning more weight to important positions, such as the initial letter of the
word, can help differentiate high-quality alignments from others. Doing so,
however, would require a new method for discovering suitable weightings for
different positions. Although many of the current features can be encoded this way, those
that depend on other aligned characters (e.g. HasNeighbor) violate assump-
tions in dynamic programming and must be handled separately.
I stored the predicted abbreviations into a relational database and built an
abbreviation server, a web server that, given queries by abbreviation or word,
returns abbreviations and their definitions. The server can also search for
abbreviations in text provided by the user (Figure 3.5).
I note that using the abbreviation server to look up definitions must be done
carefully. Since about a fifth of all abbreviations were degenerate, the correct
one must be disambiguated using the abbreviation’s context. Pustejovsky has
shown the suitability of the vector-space model for this task (Pustejovsky et al.,
2001).
I am making the abbreviation server available at
http://abbreviation.stanford.edu/. This server contains all the abbreviations in MEDLINE and
also includes an interface that will identify abbreviations from user-specified
Figure 3.5: Abbreviation Server Screenshot. My abbreviation server supports queries by abbreviation or keyword.
text. I hope that this server will also be useful for the general biomedical com-
munity. I describe the creation of this server in Chapter 6.
CHAPTER 4
Identifying Gene Names
Building biological databases, such as a pharmacogenomics database, requires
methods to identify the names of entities, such as genes and proteins, accu-
rately. Errors in this step can lead to problems in downstream algorithms;
overlooked gene names accounted for 85% of the missed interactions in the
protein-protein interaction database PubGene (Jenssen et al., 2001).
Computationally finding gene and protein names in natural language text
is difficult. The lack of uniform nomenclature standards has resulted in dis-
cordant naming practices (Jan, 1997; White et al., 2002). To handle the result-
ing diversity of the names, gene and protein name identification algorithms
use combinations of approaches including: Dictionary, searching from a list
of known gene names; Appearance, deducing word type based on its makeup
of characters; Syntax, filtering words based on parts of speech; Context, using
nearby words to infer gene and protein names; and Abbreviation, using abbre-
viations in text to help identify names (Table 4.1).
Perhaps the simplest approach is to create a dictionary of all known gene
and protein names. Krauthammer invented such a method by adapting BLAST
Portions of this chapter have appeared in Chang et al. (Accepted).
System             Dictionary        Appear.  Syntax  Context   Abbr.
Fukuda (1)                           ✓        ✓       keywords
Proux (2)                            ✓        ✓
Rindflesch (3)     UMLS                               keywords
Krauthammer (4)    GenBank
Kazama (5)                           ✓        ✓       ✓
Tanabe (6)                           ✓        ✓       ✓
Franzen (7)                          ✓        ✓
Hanisch (8)        HUGO, SP/TREMBL   ✓
Narayanaswamy (9)                    ✓        ✓       ✓
Hou (10)                             ✓        ✓       ✓
Lee (11)                             ✓        ✓       ✓         ✓
Morgan (12)                          ✓        ✓
Shen (13)                            ✓        ✓       ✓         ✓
Tsuruoka (14)                        ✓        ✓

Table 4.1: Overview of Gene/Protein Name Algorithms. Each row describes a previous gene and protein name identification algorithm. The columns show the types of data used to identify those names. ((1) Fukuda et al. (1998); (2) Proux et al. (1998); (3) Rindflesch et al. (2000); (4) Krauthammer et al. (2000); (5) Kazama et al. (2002); (6) Tanabe and Wilbur (2002b); (7) Franzen et al. (2002); (8) Hanisch et al. (2003); (9) Narayanaswamy et al. (2003); (10) Hou and Chen (2003); (11) Lee et al. (2003); (12) Morgan et al. (2003); (13) Shen et al. (2003); (14) Tsuruoka and Tsujii (2003))
to search a database of gene names, rather than DNA sequences (Krautham-
mer et al., 2000; Altschul et al., 1990). Because BLAST allowed approximate
matches, this method could also detect small variations of the gene names in
the dictionary. Although such dictionary-based methods are easy to under-
stand and relatively simple to implement, maintaining dictionaries is difficult
given the rapid rate of genome research. The Mouse Genome Database alone
logged 166 name additions and withdrawals in a single week (Mouse Genome
Database; Friedman et al., 2003).
One insight that decreased the reliance on dictionaries is that, despite their
diversity, many gene names look like other gene names. The appearance of a
word, its suffixes, prefixes, capitalization, or numbers, can help identify it as
a gene or protein (Fukuda et al., 1998). One particularly strong clue that a
word may be a protein is the suffix -ase, which the Nomenclature Committee of
the International Union of Biochemistry and Molecular Biology (NC IUBMB)
has standardized for naming enzymes (Webb, 1992). Another commonly used
heuristic is the -in suffix. However, although many protein names end with -in,
that suffix is also common among technical words, such as penicillin, heparin,
or serotonin (Table 4.2). Appearance clues can mislead when scientific naming
conventions, such as those for cell lines or viruses, are similar to those of genes
(Tanabe and Wilbur, 2002a).
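The appearance clues described above can be sketched as simple predicates on a word; the function and feature names here are my own illustration, not any published system's feature set.

```python
import re

def appearance_clues(word):
    """Appearance-based clues that a word may be a gene/protein name:
    suffixes, embedded digits, and mixed capitalization. The -ase
    suffix is standardized for enzymes; -in is suggestive but also
    common in drug names such as penicillin."""
    return {
        'ase_suffix': word.lower().endswith('ase'),
        'in_suffix': word.lower().endswith('in'),
        'has_digit': any(c.isdigit() for c in word),
        # Lowercase-then-uppercase, or more than one uppercase letter
        # with lowercase between, as in CaSR.
        'mixed_case': bool(re.search(r'[a-z][A-Z]|[A-Z][a-z].*[A-Z]', word)),
    }
```

For example, "kinase" triggers the -ase clue, "cdk4" the digit clue, and "CaSR" the mixed-case clue, while "penicillin" shows why the -in clue alone is unreliable.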
Fortunately, leveraging the syntax of a sentence can alleviate some errors.
Since all names are nouns, a part of speech tagger can restrict the domain
of words and eliminate the possibility of erroneously identifying words that
have other parts of speech. Unfortunately, there are no taggers optimized for
biological literature. Using taggers developed for other corpora can result in
errors. One gene name identification algorithm compensates for tagging errors
by using a dictionary and appearance rules to recover lost names (Proux et al.,
1998).
Another use of syntax structure is to define the local context of a putative
gene or protein name. A noun phrase with a gene or protein name often con-
tains related words, such as those that describe molecular function or inter-
actions. One system, EDGAR, contains a contextual identification module that
MH    Name                                                      # Proteins  # Other
D01   Inorganic Chemicals                                                        3
D02   Organic Chemicals                                                        134
D03   Heterocyclic Compounds                                                    99
D04   Polycyclic Hydrocarbons                                                   27
D06   Hormones, Hormone Substitutes, and Hormone Antagonists        19          23
D08   Enzymes, Coenzymes, and Enzyme Inhibitors                     37           2
D09   Carbohydrates and Hypoglycemic Agents                                     46
D10   Lipids and Antilipemic Agents                                             10
D11   Growth Substances, Pigments, and Vitamins                                 32
D12   Amino Acids, Peptides, and Proteins                          202           1
D13   Nucleic Acids, Nucleotides, and Nucleosides                                8
D20   Anti-Infective Agents                                                      4
D24   Immunologic and Biological Factors                                        39
      OTHER                                                                     22
      TOTAL                                                        258         450

Table 4.2: MeSH Terms That End with -in. This table shows the distribution of words that end with -in across MeSH. The first column is the MeSH Heading ID. Nearly all the terms in MeSH that end with -in occur under D. Chemicals and Drugs. The final two columns show the number of -in words that are proteins and non-proteins, respectively. Although protein names constitute a majority of words that end with -in, many other technical terms, such as organic chemicals, also share the suffix.
uses the signal words directly before gene names such as activated, expres-
sion, mutated, or gene (Rindflesch et al., 2000). Other systems consider more
distant words (Fukuda et al., 1998; Narayanaswamy et al., 2003). However,
such heuristics miss the many occurrences of gene names without context clues
(Tanabe and Wilbur, 2002b).
One final characteristic of gene names that has not yet been fully exploited
is morphology, the derivation and formation of words. Biologists sometimes
indicate relationships among genes and proteins by varying their prefixes or
suffixes. For example, the cdk4 and cdk7 genes both share the stem cdk and
are both involved in cell cycle regulation. Since biologists name many genes
similarly, examining the variants of a word stem can help classify it as a gene
or protein name. Morphology is analogous to appearance because they both
scrutinize the patterns of characters in a word. However, while the appearance
of a word can be examined by itself, my notion of morphology compares the
appearance of a word to the other words in the lexicon.
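The morphology idea above, grouping words that share a stem and differ only in a numeric suffix, can be sketched as follows. The crude stemming rule (strip trailing digits) is my own simplification for illustration.

```python
import re
from collections import defaultdict

def stem_variants(lexicon):
    """Group words into morphological families like cdk4/cdk7 by a
    crude stem (trailing digits stripped, lowercased). A stem with
    several numbered variants suggests a gene family."""
    families = defaultdict(set)
    for word in lexicon:
        stem = re.sub(r'\d+$', '', word.lower())
        if stem and stem != word.lower():  # keep only numbered variants
            families[stem].add(word.lower())
    return {s: v for s, v in families.items() if len(v) >= 2}
```

Given a lexicon containing cdk4, cdk7, IL2, and IL6, this yields the families cdk and il, while a singleton like p53 is discarded.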
To handle the diversity of gene and protein names, I have implemented
a method called GAPSCORE that combines syntax, appearance, context, and
morphology. My notion of context, however, differs from that of previous ap-
proaches. To identify a single word gene name that occurs without context
clues, I use all information about the word in MEDLINE. I combine these char-
acteristics using supervised machine learning.
In my supervised machine learning framework, a classifier learns a model
by fitting parameters based on information from a training set of labelled
genes and non-genes. I quantify the appearance, morphology, and context
of each gene or non-gene as a numerical feature vector. Then, the classifier
can identify new words by scoring them based on similarities to the previously
observed training set. There are many well-studied machine learning clas-
sifiers that learn different models. Since no classifier performs best over all
types of data, I tested a simple classifier, Naïve Bayes, against two more
complex classifiers known for high accuracy, Maximum Entropy and Support
Vector Machines (Manning and Schütze, 1999; Ratnaparkhi, 1998; Burges, 1998;
Joachims, 1997).
After developing my system, I evaluated its performance on the publicly
available Yapex text collection (Franzen et al., 2002). The Yapex collection
consists of a training set of 99 abstracts from MEDLINE related to protein
binding, and a test set of 101 abstracts, of which 48 are relevant to protein
binding, and the rest were chosen randomly from the GENIA corpus (Ohta
et al., 2002).
Evaluating gene and protein name identification algorithms, however, is dif-
ficult. Problems stem from equivocal distinctions between genes (both genomic
and transcribed mRNA) and proteins and disagreements in the definition of
protein. When reading the same text, experts agree on whether a name refers
to a gene, protein, or mRNA only 77% of the time (Hatzivassiloglou et al., 2001).
Furthermore, experts only agree on whether a word is even a gene or protein
69% of the time (Krauthammer et al., 2000). The Yapex text collection ad-
dresses this ambiguity by specifically excluding peptides and protein families
(Franzen et al., 2002).
Because these ambiguities have not been explicitly resolved, algorithms of-
ten contain differing notions of protein names, which hinders direct compari-
son. In addition, implicit assumptions about the text also impede attempts to
compare. Algorithms often perform worse when applied to a different corpus.
Proux found that the precision of his method dropped from 91% to 70% when
transferred from a corpus of sentences from FlyBase to a more general set of
MEDLINE articles (Proux et al., 1998). Tanabe addressed this problem by ap-
plying a Bayesian statistical method to filter articles that were not likely to
contain a gene name (Tanabe and Wilbur, 2002b).
Therefore, to obtain an accurate measure of performance, I developed the
features used by my machine learning classifier on a corpus we created inde-
pendent from Yapex. I also fit the parameters of the classifier on this data set.
To reconcile differences in definitions of protein names, I used the Yapex train-
ing set to create a list of stop words that did not match the stricter definition of
protein name in Yapex. Finally, I evaluated my algorithm on the Yapex test set.
4.1 An Algorithm to Identify Gene and Protein
Names
GAPSCORE scores gene and protein names in written natural language text.
Since it does not distinguish between genes and proteins, we use gene generi-
cally to mean both. The algorithm consists of five steps: (1) TOKENIZE: I split
the document into sentences and words. (2) FILTER: I remove from consid-
eration any word that is clearly not a gene name. (3) SCORE: I score words
using a machine learning classifier. (4) EXTEND: I extend each word to the
full gene name. (5) MATCH ABBREVIATION: Finally, we score abbreviations
of the gene names identified (Figure 4.1).
4.1.1 Tokenizing the Sentences
The TOKENIZE step identifies the sentences and words in a document. I iden-
tify the sentence boundaries using a simple set of heuristics. I start by assum-
ing that any period, question mark, or exclamation point followed by a space
and then a capitalized letter is a sentence boundary. Periods that occur as part
of e.g. are exceptions to this rule.
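A minimal sketch of this sentence-splitting heuristic, assuming a small illustrative exception list (the actual implementation's exceptions are not fully enumerated in the text):

```python
import re

# Abbreviations whose trailing periods should not end a sentence;
# an illustrative list, not the system's actual exception set.
ABBREVS = ("e.g.", "i.e.", "et al.")

def split_sentences(text):
    """Split at a period, question mark, or exclamation point followed by
    a space and a capitalized letter, except after known abbreviations."""
    sentences, start = [], 0
    for m in re.finditer(r'[.?!](?= [A-Z])', text):
        chunk = text[start:m.end()]
        if chunk.rstrip().endswith(ABBREVS):
            continue  # the period belongs to an abbreviation like "e.g."
        sentences.append(chunk.strip())
        start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

The lookahead keeps the terminal punctuation attached to each sentence while splitting on the following space.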
[Figure: the five-step pipeline (Tokenize, Filter, Score, Extend, Match
Abbreviations) applied to the fragment "coactivate human keratin 4 (K4)
promoter", ending with "Gene Found: human keratin, K4 Score: 0.97".]

Figure 4.1: Recognizing Gene Names. I scan through text one word at a time,
filtering words that we immediately recognize to not be gene names. Then, I
score the remaining words with a machine learning classifier, extend multi-word
gene names, and score their abbreviations.
Within each sentence, I define a word as a string of alphanumeric char-
acters. Any space and most punctuation are word boundaries. We handle
dashes separately since many gene names contain them (e.g. c-jun, IL-2, IGF-
I). Dashes are not boundaries when the previous token is a single letter, or the
next token is a number or Roman numeral.
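The dash rules can be sketched as follows; this is a simplified reconstruction, not the original tokenizer, and the Roman-numeral check is deliberately loose:

```python
import re

ROMAN = re.compile(r'^[IVX]+$')  # simplified Roman numeral test

def tokenize_words(sentence):
    """Split a sentence into alphanumeric tokens, treating a dash as part
    of a word when the preceding piece is a single letter or the following
    piece is a number or Roman numeral (e.g. c-jun, IL-2, IGF-I)."""
    pieces = re.findall(r'[A-Za-z0-9]+|-', sentence)
    tokens, i = [], 0
    while i < len(pieces):
        piece = pieces[i]
        if piece == '-' and tokens and i + 1 < len(pieces) and pieces[i + 1] != '-':
            prev, nxt = tokens[-1], pieces[i + 1]
            prev_tail = prev.rsplit('-', 1)[-1]
            # Keep the dash inside the token per the rules above.
            if (len(prev_tail) == 1 and prev_tail.isalpha()) \
                    or nxt.isdigit() or ROMAN.match(nxt):
                tokens[-1] = prev + '-' + nxt
                i += 2
                continue
        if piece != '-':
            tokens.append(piece)
        i += 1
    return tokens
```

A dash that fails both conditions, as in "tumor-suppressor", acts as an ordinary word boundary.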
4.1.2 Filtering Recognized Words
The FILTER step removes known non-gene words to increase the overall ac-
curacy and performance of the algorithm. I discard words that are not gene
names, but may be part of a gene name. For example, the 1 and alpha in
Interleukin-1 alpha are discarded, but recovered later in the EXTEND step.
First, I apply Brill’s tagger and remove from consideration words that are
not nouns, adjectives, participles, proper nouns, or foreign words (Brill, 1994).
I use the default settings for the tagger and did not customize it for my corpus.
I discard numbers, Roman numerals (I-X), Greek letters, amino acids, 7 virus
names, and 13 chemical compounds. Because virus names and chemical com-
pounds resemble gene symbols and may indicate genes in certain contexts, we
conservatively discard only the ones prevalent in my training set. I also
discard names of organisms found in the SWISS-PROT database (Bairoch and
Boeckmann, 1991).
Finally, I discard words from a manually created list of 49 regular expres-
sion patterns. I compiled this list by running the algorithm on the Yapex train-
ing set and scanning the results for high-scoring technical terms. These pat-
terns include 7 that match words that indicate genes and proteins (e.g. protein,
DNA, gene); 17 subunits, parts, or complexes of genes and proteins (e.g. pep-
tide, chain, motif, complex); 5 related molecules (e.g. ATP, cAMP); and 20 types
or descriptions of genes (e.g. receptor, expressed).
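A rough sketch of the FILTER step; the stop lists and patterns below are small illustrative stand-ins for the full lists described above, and the part-of-speech filtering with Brill's tagger is omitted:

```python
import re

# Illustrative stand-ins for the filter lists described in the text;
# the real system uses larger, manually compiled lists.
GREEK = {"alpha", "beta", "gamma", "delta"}
AMINO_ACIDS = {"alanine", "glycine", "serine"}
ROMAN_I_TO_X = {"I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"}
STOP_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r'^proteins?$', r'^dna$', r'^genes?$',
              r'^peptides?$', r'^receptors?$', r'^complex(es)?$')
]

def passes_filter(word):
    """Return False for words that are clearly not gene names."""
    if word.isdigit() or word in ROMAN_I_TO_X:
        return False
    if word.lower() in GREEK or word.lower() in AMINO_ACIDS:
        return False
    return not any(p.match(word) for p in STOP_PATTERNS)
```

Discarded words such as "1" or "alpha" can still be recovered later by the EXTEND step when they are part of a longer name.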
4.1.3 Scoring Words
My algorithm scores most remaining unfiltered words using a machine learn-
ing classifier. I score separately two classes of proteins that are common and
easy to recognize unambiguously: enzyme names and Cytochrome P-450 pro-
teins. Gene names in these two classes automatically receive the highest pos-
sible score.
To distinguish gene names that end with -ase, I compiled a dictionary of all
known non-gene words that also end with -ase (e.g. kilobase, disease). I selected
the 327 words that end with -ase or -ases from Webster’s Second International
dictionary. Then, I manually removed gene names from the list and added
one word, gases. This resulted in a list of 196 words that are not gene names.
There were no ambiguous words that had both an enzyme and a non-enzyme
definition.
I also score separately the Cytochrome P450 proteins because they follow
a regular nomenclature (Cytochrome P450 Homepage). I use four regular ex-
pression patterns to recognize names with the form: cytochrome P450 2D6,
p450 IID6, CYP2d6, or just CYPs. Once a regular expression matches a Cy-
tochrome P450 protein name, the algorithm also identifies in the document
other short forms of the same family, e.g. 2D6, as proteins.
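The four patterns can be approximated with regular expressions like these (illustrative reconstructions; the original patterns are not reproduced in the text):

```python
import re

# Illustrative approximations of the four Cytochrome P450 patterns for
# forms such as "cytochrome P450 2D6", "p450 IID6", "CYP2d6", and "CYPs".
CYP_PATTERNS = [
    re.compile(r'^cytochrome[- ]p-?450([- ]\d+[a-z]\d+)?$', re.IGNORECASE),
    re.compile(r'^p-?450[- ][ivx]+[a-z]?\d*$', re.IGNORECASE),
    re.compile(r'^cyp\d+[a-z]\d+$', re.IGNORECASE),
    re.compile(r'^cyps?$', re.IGNORECASE),
]

def is_cyp_name(phrase):
    """Return True if the phrase matches a Cytochrome P450 name pattern."""
    return any(p.match(phrase) for p in CYP_PATTERNS)
```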
Most words, however, do not match these two special cases. For these, I
encode their appearance, morphology, and context as a feature vector for a
machine learning classifier.
Length:
  Word is 1 Letter Long
  Word is 2 Letters Long
  Word is 3-5 Letters Long
  Word is 6+ Letters Long

Presence of Numbers:
  Word Starts with Digits
  Word Ends with Digits
  Word Ends with a Roman Numeral

Case:
  Word is Capitalized
  Word Ends With an Upper Case Letter
  Word Has Mixed Upper and Lower Case
  Word Ends with Upper Case Letter and Number

Other:
  Word Has Greek Letter
  Word Has Dash

Table 4.3: Gene Name Appearance Features. These features encode a
13-dimension vector that describes the appearance of a word. For a specific
word, the value for each feature is 1 if it describes the word and 0 otherwise.
Appearance
I model the appearance of two types of genes, gene symbols (such as TPMT or
NAT1) and gene names that end with -in (e.g. insulin). For gene symbols, I
compute a feature vector from the features described in Table 4.3. The value of
each feature is 1 if the symbol has the characteristic and 0 otherwise.
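The features of Table 4.3 translate into a 13-dimensional 0/1 vector; the sketch below makes some simplifying assumptions, e.g. Greek letters are detected as spelled-out names from a small illustrative list:

```python
import re

GREEK = ("alpha", "beta", "gamma", "delta", "kappa")  # illustrative subset
ROMAN_END = re.compile(r'(I|II|III|IV|V|VI|VII|VIII|IX|X)$')

def appearance_features(word):
    """Encode the 13 binary appearance features of Table 4.3 as 0/1 values."""
    n = len(word)
    return [
        int(n == 1),                                 # 1 letter long
        int(n == 2),                                 # 2 letters long
        int(3 <= n <= 5),                            # 3-5 letters long
        int(n >= 6),                                 # 6+ letters long
        int(word[:1].isdigit()),                     # starts with digits
        int(word[-1:].isdigit()),                    # ends with digits
        int(bool(ROMAN_END.search(word))),           # ends with Roman numeral
        int(word[:1].isupper()),                     # capitalized
        int(word[-1:].isupper()),                    # ends with upper case
        int(any(c.isupper() for c in word)
            and any(c.islower() for c in word)),     # mixed case
        int(bool(re.search(r'[A-Z]\d+$', word))),    # upper case letter + number
        int(any(g in word.lower() for g in GREEK)),  # has Greek letter
        int('-' in word),                            # has dash
    ]
```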
I recognize gene names that end with -in based on the hypothesis that those
names have characteristic patterns of letters that distinguish them from other
words. To find those patterns, I use a generic statistical model that learns
variable length N-grams to classify phrases, described in (Smarr and Manning,
2002).
To train the N-gram model, I created a training set of words that
end with -in from Medical Subject Headings (MeSH) (Humphreys et al.,
1998). MeSH is a hierarchy of 21,973 phrases used to index MEDLINE
citations. I normalized each phrase by removing numbers, single let-
ters, and Roman numerals. Then, I discarded multi-word phrases, leav-
ing only single words. We stemmed each word by removing the termi-
nal -s if the resulting stem occurs as a word in MEDLINE. Finally, I re-
moved any word that did not end with -in, leaving a list of 708 unique
words. A word was a protein if it belonged to one of 15 MeSH classes:
Cytochromes, DNA Restriction-Modification Enzymes, Holoenzymes,
Isoenzymes, Isomerases, Ligases, Lyases, Neuropeptides, Oxidoreductases,
Peptide Hormones, Peptides, Proteins, Receptors, Immunologic, and
Transferases. The remaining words were not proteins.
I trained the N-gram classifier on the MeSH training set. For words that
do not end with -in, I automatically assign them the lowest possible score of 0.
For words that do end with -in, I use the score from the classifier; those scores
constitute the final dimension of the appearance feature vector.
Morphology
I model morphology by quantifying the tendency for a word to vary in ways
similar to gene names. The morphology feature vector consists of scores for
8 types of variations (Table 4.4). I calculate the score based on the number
of times a stem and its variants appear in MEDLINE. The stem is the word
without the prefixes and suffixes, and the variant includes them. For example,
one type of variation counts word stems with numbers appended. Many genes
vary this way, such as ced with its variants ced1, ced3, and ced9.
Prefix:
  Greek Letter + stem
  "apo" or "holo" + stem

Suffix:
  stem + Greek Letter
  stem + Roman Numeral
  stem + Upper Case Letter
  stem + Number
  stem + Upper Case Letter + Number
  stem + lower case letter

Table 4.4: Morphologic Variations in Gene Names. This table shows variations
of gene and protein names that I score in a feature vector. Each variant is
either a prefix or suffix of the word stem.
The value of each morphology feature is:

    log max(1/1000, # Vars / # Stems)

where # Stems is the number of times a stem appears by itself in MEDLINE,
and # Vars is the total number of times the stem appears with a variation.
Empirically, the ratio of these counts, when plotted for all words in MEDLINE,
follows an exponential distribution. Therefore, to improve discrimination in
machine learning, I take the log of that ratio.
The final piece of the equation handles typographical and spelling errors
that become significant over all of MEDLINE. For example, with1 occurs
9 times. To alleviate the effects of such errors, we set a minimum cutoff to
ignore variants that appear less than 1 time in 1000. I found this cutoff empir-
ically based on cross-validation on my training set. I did not set a maximum
ratio cutoff for variants that outnumber stems. Finally, I did not score a variant
if the stem never appears in MEDLINE.
As an example, the ced word contains many variations that match the pat-
tern “stem+Number.” The stem ced occurs 182 times in MEDLINE, and the
variants, ced1, ced3, and ced9, occur 1, 3, and 5 times. The score for ced and its
3 variants is thus log(9/182) = −1.31. On the other hand, with appears 5,193,871
times and all its variants (including with1) occur 82 times. Since the ratio is
less than my minimum cutoff, the score for those words is −3.
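This scoring rule can be written directly; the base-10 logarithm reproduces the −1.31 and −3 values in the examples:

```python
import math

MIN_RATIO = 1.0 / 1000  # cutoff that suppresses typo-level variant counts

def morphology_score(n_variants, n_stems):
    """Score one morphologic variation: log10 of the variant-to-stem
    count ratio, floored at 1/1000."""
    if n_stems == 0:
        return None  # the stem never appears in MEDLINE; do not score
    return math.log10(max(MIN_RATIO, n_variants / n_stems))

# The ced example: the stem appears 182 times, its variants 9 times.
ced_score = morphology_score(9, 182)            # ≈ -1.31
# The with example: 82 variant occurrences against 5,193,871 stems
# falls below the 1/1000 cutoff.
with_score = morphology_score(82, 5193871)      # = -3.0
```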
I precompute the morphology score for every word in MEDLINE. For each
type of variation, I first calculate the score assuming that the word is a stem. I
look for all other words in MEDLINE that match that variation. For example,
for the variation requiring Greek letter suffixes, I look for words in MEDLINE
that are composed of the stem word followed by any Greek letter. For the Ro-
man numeral variation, I only consider the first ten Roman numerals since
they are the most common. If a word can be either a variation or a stem, I use
the higher of the two scores.
Context
Finally, I model context features based on the observation that gene names
often occur next to strong positive and negative signal words. For example,
the word directly before gene is frequently a gene name, but the word directly
after within is rarely one. Thus, my approach is similar to earlier systems that
only consider immediate neighbors. It differs by using negative signal words
as well as positive. I searched for both types of signal words by calculating the
correlation between each word in MEDLINE with the presence of gene names
directly before or after the word. Then, for an unknown word, the distribution
of its occurrences around signal words comprises its context feature vector.
Gene names should appear most often next to positive signals and least next
to negative ones.
To find the signal words, I created a training set of 1025 words, which in-
cluded 574 gene names. I randomly chose 500 nouns that appeared in year
2001 MEDLINE abstracts containing the word gene or protein. To increase the
prevalence of gene names, I added 525 more words that appeared before gene,
protein, or mrna. I chose these words randomly to ensure that there would be
no bias toward making these three words signal words.
Then, for each word in MEDLINE, I tallied the number of times it occurred
next to my labelled words in a 2x2 contingency table:
              expression    NOT expression
GENE          (A) 253       (B) 321
NON-GENE      (C) 111       (D) 340
In this example, cell (A) contains the number of genes from my training
set found before expression anywhere in MEDLINE, cell (B) is the number of
genes never found before expression, cell (C) is the number of non-genes found
before expression, and cell (D) is the number of non-genes never found before
expression. I similarly counted the occurrences of expression after gene names.
If expression is a strong signal that the previous word is a gene name, then
the ratio of genes to non-genes would be higher in the first column, the expres-
sion column, than the second. We calculated the significance of the difference
in the ratio using a χ² test. Out of the 287,680 words from MEDLINE that
appeared next to a word from my training set, 2567 were significant with a
p ≤ 1 × 10−7, which is roughly a p-value cutoff of 0.05 with a Bonferroni
correction for multiple tests.

Previous Word    Next Word        p
gene name        gene             0.0E+00
gene name        mrna             1.2E-20
gene name        protein          4.8E-13
gene name        promoter         1.3E-13
gene             gene name        1.7E-10
gene name        genes            1.5E-10
gene name        expression       4.5E-09
gene name        transcripts      3.8E-08
gene name        mrnas            3.4E-07
or               non-gene name    3.3E-27
by               non-gene name    3.5E-21
non-gene name    or               1.8E-16
with             non-gene name    2.3E-12
to               non-gene name    1.5E-11
in               non-gene name    1.5E-10
non-gene name    were             9.0E-09
non-gene name    to               8.2E-09
for              non-gene name    2.0E-08
…

Table 4.5: Gene Name Signal Words. This table shows a list of words that occur
next to gene names either more frequently or more rarely than expected. The
word pairs in the top half of the table occur more often than expected, implying
that those bold-faced words are strong indicators of gene names. The pairs in
the bottom half of the table occur more than expected for words that are not
gene names. The p column shows the statistical significance of the association.
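The significance computation on such a 2×2 table can be sketched with the standard χ² formula (shown without continuity correction, which the text does not specify):

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]],
    without continuity correction."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# The 'expression' table from the text: 253/321 genes and 111/340
# non-genes observed before the word 'expression'.
chi2 = chi_square_2x2(253, 321, 111, 340)
```

For these counts the statistic is roughly 41.8, which at one degree of freedom corresponds to a p-value far below the cutoff, so expression qualifies as a signal word.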
Although the 2567 words I found were all statistically significant, useful
signal words should also be ubiquitous. Obscure words could only discriminate
a small number of names. Therefore, I further narrowed my list by selecting
the most common signal words. Since only 9 of the 2,567 words were positive
signal words, I kept them all. Then, for balance, I chose the 9 negative signal
words that appeared with the greatest number of words in my training set.
This resulted in the 18 signal words that are listed in Table 4.5.
Then, I used these signal words to encode the context of a word into feature
vectors. Each feature is the number of times that a word occurs with each
signal word across all of MEDLINE. I calculated the distribution across signal
words by normalizing the feature vector to 1.0.
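Assembling the context feature vector can be sketched as follows; the signal-word lists are small illustrative subsets of the 18 words in Table 4.5:

```python
# Illustrative subsets of the 18 signal words from Table 4.5; the real
# lists were derived from the chi-square screen over all of MEDLINE.
POSITIVE_SIGNALS = {"gene", "mrna", "protein", "promoter"}
NEGATIVE_SIGNALS = {"or", "by", "with", "to"}

def count_signal_contexts(word, tokenized_sentences):
    """Count how often `word` occurs directly before or after each
    signal word."""
    signals = sorted(POSITIVE_SIGNALS | NEGATIVE_SIGNALS)
    index = {s: i for i, s in enumerate(signals)}
    counts = [0] * len(signals)
    for sent in tokenized_sentences:
        for i, tok in enumerate(sent):
            if tok != word:
                continue
            # Examine only the immediate neighbors of the word.
            for neighbor in sent[max(0, i - 1):i] + sent[i + 1:i + 2]:
                if neighbor in index:
                    counts[index[neighbor]] += 1
    return counts

def context_features(counts):
    """Normalize raw co-occurrence counts into a distribution that sums
    to 1.0 (all-zero counts stay all zeros)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)
```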
Classifier
Finally, I concatenated the appearance, morphology, and context vectors to cre-
ate the final combined vector.
To train the machine learning classifiers, I created a training set of words
from 634 MEDLINE abstracts found from searches on regulatory elements
and 101 MEDLINE abstracts cited by a review article on pharmacogenomics
(Evans and Relling, 1999). I manually categorized each word from these ab-
stracts as either a gene name or non-gene. For a multiple word gene name,
I labelled as genes only the core gene-meaning words. I labelled ambiguous
words, those that have a dominant non-gene meaning, non-genes. In addi-
tion, I included 8,617 words from MeSH that I identified using the criteria
described above in the Morphology section. This resulted in a training set of
19,952 unique labelled words.
I used these words and trained 3 types of classifiers: Naïve Bayes, Maximum
Entropy, and Support Vector Machines. Since Naïve Bayes required categorical
features, I binned the features. For each dimension in the feature
vector, I assigned the values into 5 bins evenly spaced between the lowest and
highest values. For Maximum Entropy, I estimated the parameters using a con-
jugate gradient descent method that has been found to converge quickly and
accurately (Malouf, 2002). I trained Support Vector Machines using the linear,
polynomial, and radial basis function kernels. I varied the C error penalty pa-
rameter and chose the parameters that performed best on the Yapex training
set.
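The five-bin discretization used for Naïve Bayes can be sketched as:

```python
def bin_feature(values, n_bins=5):
    """Assign each continuous value to one of n_bins evenly spaced bins
    between the minimum and maximum observed values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls in the last bin, not a sixth one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```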
4.1.4 Extending to Noun Phrase
After a word is scored, to identify multi-word gene names, I extend the name
using heuristics similar to the ones described in (Fukuda et al., 1998). Using
the parts of speech from Brill’s tagger, I include the nouns, adjectives, and
participles preceding the putative gene name. Then, I lengthen the name to
include the following words that are single letters, Greek letters, and Roman
numerals. Finally, I remove extraneous punctuation at the beginning or end of
the name, except for open or close parenthesis characters required to complete
a pair.
4.1.5 Matching Abbreviations
After the algorithm establishes the full gene names, it searches for abbrevia-
tions in the document using the algorithm described in Chapter 3. If the long
form of an abbreviation has a higher score, it transfers that score to the ab-
breviation. The algorithm likewise transfers higher scores from abbreviations
back to the long forms.
4.1.6 Implementation
I implemented GAPSCORE in the Python 2.2 and C languages using the Biopy-
thon 1.10 library (Lutz et al., 1999; Kernighan and Ritchie, 1988; Biopython).
The Naïve Bayes code is available as part of Biopython. We implemented the
Maximum Entropy code using the conjugate gradient descent code from Nu-
merical Recipes in C, including the fixes from the website (Press et al., 1993;
Numerical Recipes Home Page). I used the libsvm implementation of Support
Vector Machines (Chang and Lin, 2001). To improve performance, I cache word
scores as well as various intermediate computations into a MySQL database
(MySQL).
4.2 Performance of GAPSCORE
I evaluated the performance of my algorithm against the Yapex test gold stan-
dard. To obtain an accurate result, I did not look at or run my algorithm against
this data set until after I had finalized it and fitted all parameters. To compare,
I ran the Yapex algorithm, available from their web site, on that data set on
4/6/2003. When the Yapex algorithm predicted overlapping gene names, I used
only the longest one.
I quantified the performance of the algorithms using recall, precision, and
F-score. Recall,

    Recall = (# correctly predicted gene names) / (# gene names),

measures how thoroughly a method can identify gene names. Precision,

    Precision = (# correctly predicted gene names) / (# predictions),

indicates the rate at which an algorithm produces errors. F-score,

    F-score = (2 × Recall × Precision) / (Recall + Precision),

combines recall and precision into a single number.
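The three measures translate directly into code:

```python
def recall(n_correct, n_gold):
    """Fraction of true gene names that were found."""
    return n_correct / n_gold

def precision(n_correct, n_predicted):
    """Fraction of predictions that were correct."""
    return n_correct / n_predicted

def f_score(r, p):
    """Harmonic mean of recall and precision."""
    return 2 * r * p / (r + p) if (r + p) else 0.0
```

For example, a sloppy-match recall and precision of 83.3% and 81.8% give f_score(0.833, 0.818) ≈ 0.825.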
I assessed the performance of the algorithms on exact and sloppy gene name
matches using the definitions described in (Franzen et al., 2002). Using exact
match, the predicted gene name must be equivalent to the corresponding name
in the gold standard. Using sloppy match, a predicted gene name only needs to
overlap the name in the gold standard. However, if two predicted genes overlap
the same multi-word gene name, only one is considered correct and the other
is incorrect.
Since my algorithm could produce scores, I calculated the recall and preci-
sion at every score cutoff. The resultant curve illustrates the tradeoff between
recall and precision. The user can choose a strict cutoff for applications that
require high precision, or a more lenient one for applications that require high
recall.
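Sweeping the score cutoff can be sketched as follows (a simplification that treats each prediction as its own cutoff, without grouping tied scores):

```python
def pr_curve(scored_labels, n_gold):
    """Compute (cutoff, recall, precision) at every prediction score.

    scored_labels: list of (score, is_gene) predictions.
    n_gold: total number of true gene names in the gold standard, which
            may exceed the number of predictions if some names are missed.
    """
    points, tp = [], 0
    ordered = sorted(scored_labels, key=lambda x: -x[0])
    for i, (score, is_gene) in enumerate(ordered, start=1):
        tp += int(is_gene)
        points.append((score, tp / n_gold, tp / i))
    return points
```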
I trained Naïve Bayes (NB), Maximum Entropy (ME), and Support Vector
Machine (SVM) classifiers on my own training set. Since this training set con-
sisted of words and not phrases, there was no distinction between sloppy or
exact match. Thus, a gene name prediction was correct if the word was la-
belled a gene in the data set. Since SVMs required user-specified parameters,
  C      Linear    Poly1     Poly2     Poly3     RBF
  1      78.11%    78.12%    76.48%    75.67%    77.82%
  5      78.14%    78.02%    76.98%    76.31%    76.92%
  10     78.14%    78.08%    76.36%    76.62%    76.11%
  50     78.14%    78.11%    74.24%    76.37%    76.00%
  100    78.05%    78.14%    74.35%    76.12%    75.68%

Table 4.6: Parameters for Support Vector Machines. This table shows the
F-scores achieved by SVMs with different parameters on the Yapex training set.
The columns contain kernels (linear, first degree polynomial, second degree
polynomial, third degree polynomial, and radial basis function) and the rows
contain different values for C, an error penalty parameter. Higher values of C
result in models with stronger support vectors. For each set of parameters, I
chose the score cutoff that resulted in the highest F-score.
I tested various combinations of kernels and C error penalty parameters (Ta-
ble 4.6). (See (Burges, 1998) for a description of the parameters.) The F-scores
of these parameters varied from 74% to 78%; the maximum was attained by a
linear kernel with C = 5.
Using optimal parameters, I compared the performance of NB, ME, and
SVM on the Yapex training set, scoring with sloppy matches (Table 4.7). Al-
though all classifiers performed comparably, the F-score for SVM was slightly
higher than the rest. At its maximum F-score, SVM scored 3.5% higher preci-
sion with 2.1% loss in recall compared to ME, the next best performing classi-
fier. However, at the same recall as ME, 81.4%, SVM only improved precision
by a marginal 0.4%.
Then, with the best classifier parameters, I tested the algorithm with vari-
ous modules disabled (Table 4.8). Leaving out the negative filter had the most
detrimental impact on performance, resulting in an 11.8% decrease in F-score,
                    Recall    Precision    F-Score
  Naïve Bayes       81.1%     72.5%        76.6%
  Maximum Entropy   81.4%     73.5%        77.3%
  SVM               79.3%     77.0%        78.1%

Table 4.7: Comparing Algorithms to Classify Gene Names. This table compares
the recall, precision, and F-score of three different classifiers when
searching for gene names.
                      Recall    Precision    F-score    ∆ F-score    Prec/75
  No Filter           67.1%     65.6%        66.3%      -11.8%       56.5%
  No Appearance       68.1%     72.4%        70.2%      -7.9%        61.4%
  No -in Feature      68.2%     75.4%        71.6%      -6.5%        65.8%
  No -ase Classifier  74.3%     76.0%        75.1%      -3.0%        74.7%
  No Abbreviation     74.7%     76.4%        75.5%      -2.6%        76.0%
  No Morphology       71.5%     82.4%        76.6%      -1.6%        78.1%
  No Context          85.0%     71.5%        77.7%      -0.5%        76.9%
  ALL                 79.3%     77.0%        78.1%      N/A          79.2%

Table 4.8: Removing Modules Reduces GAPSCORE Performance. This table shows
the performance of GAPSCORE (using sloppy match scoring on the Yapex training
set) with different modules disabled. The modules are sorted by increasing
F-score. The first three columns show the greatest F-score that the reduced
algorithm can achieve, as well as the recall and precision at that F-score.
∆ F-score is the decrease in F-score compared to the complete algorithm.
Prec/75 is the precision at 75% recall.
while leaving out context features had the least effect, leading to a 0.5% de-
crease. I also tested the context feature with only the positive and only the
negative signal words. Leaving out the positive signal words decreased the
F-score by 0.07%, and leaving out the negative ones led to a larger decrease of
0.14%.
Finally, I ran GAPSCORE on the Yapex test set and compared the per-
formance against the Yapex algorithm (Figure 4.2). On sloppy match, the
Yapex algorithm received an F-score of 75.4% at recall and precision of 70.3%
and 81.4%. In comparison, GAPSCORE achieved 82.5% F-score at 83.3% and
81.8%. The performance on exact match was closer, with Yapex receiving 54.3%
F-score (50.1% recall, 59.3% precision) and GAPSCORE 57.6% (58.5% recall,
56.7% precision).
The performance of Yapex and GAPSCORE differed with respect to the
length of the gene name. The Yapex test set contains 1,967 gene names; 1,546
of those names consist of only a single word. On sloppy match, GAPSCORE
found 85% of the one word names but only 76% of the multi-word ones. Yapex
exhibited a smaller discrepancy in performance and found 71% of the one word
names and 67% of the others.
4.3 Conclusions
I have created a method GAPSCORE that identifies protein and gene names
from text. It uses a word-based approach and scores the confidence that a
word may be a gene based on appearance, morphology, and context criteria
that includes information from all of MEDLINE. To identify the boundaries
of multi-word gene names, GAPSCORE extends the name using heuristics on
part of speech tags.
GAPSCORE scores new words using a Support Vector Machine. With care-
ful parameter tuning, this algorithm outperformed Maximum Entropy, and
Naıve Bayes. The performance of the linear kernel exceeded that of the more
complex ones including radial basis function. However, the differences among
the various parameters and classifiers were largely insubstantial.
In contrast, leaving out different modules led to more dramatic impacts
[Figure: two precision-recall curves (sloppy match and exact match), with
recall on the x-axis and precision on the y-axis, comparing GAPSCORE against
Yapex.]

Figure 4.2: Performance of GAPSCORE. I compared the performance of GAPSCORE
and Yapex on the Yapex test set using sloppy and exact match scoring. For
sloppy match, a gene name prediction is correct if it partially matches the
actual gene name. For exact match, the predicted name must match the whole
gene name.
on performance. Disabling the filter reduced F-score by 11.8%, because many
unspecific gene terms scored highly. The word protein occurred 199 times. For-
tunately, the manually constructed filter successfully discarded those terms in
the complete system. The next largest reduction in performance came from
removing appearance features, which confirms the approaches of some earlier
methods. However, even without appearance, the classifier could still achieve
an F-score of 70.2%, suggesting that the other features contain much informa-
tion about genes.
When comparing performance loss between the two types of signal words
in the context features, it is counterintuitive that removing the negative signal
words had a greater impact than removing the positive. After all, positive
signal words have been commonly used as gene name markers, while negative
signal words have not been used conversely. This performance can be explained
by the greater prevalence of the negative signal words. They are much more
common than the positive signal words, and thus affect a greater number of
words overall. Nevertheless, the performance difference is small.
Using all features, GAPSCORE outperformed Yapex by an F-score of 7.1%
on sloppy match and 3.3% on exact match. In addition, GAPSCORE found a
relatively larger portion of single word genes than Yapex. These differences
in performance indicate 1) more sophisticated analysis of single words can
help overall accuracy, and 2) deeper syntactic analysis can help find the cor-
rect boundaries for the names.
Nevertheless, methods that analyze single words will never be able to iden-
tify some phrases that indicate genes. For example, parathyroid hormone is
a peptide hormone encoded by a gene. However, neither parathyroid nor hor-
mone would indicate a gene by itself. Identifying these phrases would require
scoring phrases or collocations rather than single words.
Fortunately, such phrases that signify genes were not a significant source
of error. More substantial errors for GAPSCORE arose from differences in no-
tions of gene. The 10 highest scoring false positives were Kunitz-type protease,
PTK, alpha2, beta-globin, branched-chain alpha-ketoacid dehydrogenase, con-
ditional tyrosine kinase, elevated tyrosine kinase, endogenous 5-lipoxygenase,
globin, and glycoprotein. These proteins and genes may be missing from Yapex
because its definition excludes protein families. GAPSCORE was more sensi-
tive and identified many names that did not indicate a single identifiable gene.
When evaluating the names found by decreasing score, the first name that was
not a gene was COS-1, the 550th name at recall and precision of 25% and 90%.
Thus, these results underscore the importance of developing clear definitions
of protein names.
Some of the names from the list of false positives, and the relative decrease
in performance on exact match, suggest that deeper analysis is required to
correctly identify name boundaries. For example conditional tyrosine kinase
should not include the word conditional. In the future, I will investigate more
sophisticated methods for finding boundaries than part of speech heuristics.
One possible approach is to use Markov models that quantify the tendencies of
certain words to appear together (Majoros et al., 2003).
I also have not yet directly addressed ambiguous names – those that mean
genes in some contexts and non-genes in others. My current strategy of
labelling those words in my training set as non-genes adds noise to my data and
could have hurt my performance. These may need to be handled separately. In
addition, the context could be used to update the confidence of ambiguous gene
name predictions.
Finally, several difficult problems remain in gene name identification.
There is still considerable ambiguity in the definition of the task; related en-
tities must be differentiated, for example gene and gene products, gene struc-
ture, gene families, protein domains, protein complexes, and alleles. Also con-
founding the task is the inconsistent naming of many genes. These differences
may be small variations in tokenization or word order, or the names may be un-
related synonyms (Yu et al., 2002a; Hanisch et al., 2003). Therefore, methods
must be developed to normalize synonyms and other variants before these al-
gorithms are generally useful for unambiguous indexing and extraction tasks.
Nevertheless, I have begun using my algorithm to identify gene and pro-
tein names and their relationships to drugs. Preliminary timings indicate that
my current implementation, running on a (busy) single-processor 1.5 GHz Intel
Xeon with 512 MB of RAM, requires an average of 15 seconds to tag each
MEDLINE abstract. Another technical limitation to such a data-intensive ap-
proach is that the method may need retraining as new words and word senses
are introduced into the literature. It is unclear how rapidly the literature is
changing, and how these changes may affect the performance of the algorithm.
In conclusion, I have developed a new method, GAPSCORE, for finding gene
and protein names by combining novel formulations of features in a machine
learning framework. I found that Support Vector Machines slightly outper-
form other popular methods. When applied to the Yapex text collection, my
method achieves high performance due to its sophisticated analysis of single
words and the high prevalence of single word gene names. The algorithm
produces confidence scores that can be adjusted for either high recall or pre-
cision. GAPSCORE is available on the web at http://bionlp.stanford.edu/gapscore/.
CHAPTER 5
Extracting Gene-Drug Relationships
Much biological insight originates from the identification and characterization
of relationships among macromolecules. These interactions drive higher level
processes. Therefore, scientists have devoted much effort to developing tech-
nologies to elucidate those interactions, such as yeast two-hybrid screens, di-
rected genetic crosses, and expression analysis to infer networks.
Many interesting interactions are reported in unstructured free text, and
thus, unfortunately, are unavailable for high-throughput analysis. Because of
the vast number of molecules and relationships, identifying them manually is
daunting. Therefore, to extract interactions, researchers are investigating the
suitability of text processing algorithms.
The natural language processing (NLP) community, in particular, has stud-
ied the problem of identifying relationships from text. Spurred by the DARPA-
sponsored series of Message Understanding Conferences (MUC), the commu-
nity developed a technology called information extraction (IE). IE addressed
the problem of identifying pre-designated relationships in text. Because it
narrows the problem by specifying the relationships of interest, IE was more
tractable than the grandiose ambition of natural language understanding. IE
also differed from information retrieval, because its goal was to identify facts
rather than documents.
The MUC conferences provided a single data set and a uniform evaluation
metric, which made it possible to compare the performance of different algorithms.
The extraction tasks were typically complex, involving the relationships among
many different types of entities. For example, the task for MUC-6 pertained to
corporation management changes. Identifying all the information related to a
change required the identification of people in the organization, their job titles,
the location of the business, etc. However, not all extraction tasks must be this
complicated. Binary relationships, those that involve only two entities, are a
common special case. In the most recent MUC in 1997, the organizers recog-
nized this and created a separate category of binary extraction tasks (SAIC
Information Extraction). Binary relationships are particularly important for
the biological community. Much work has been focused on such relationships,
for example, the interactions between proteins.
Similar to the MUC conferences, there have also been open evaluations of
biological information extraction algorithms. The Knowledge Discovery and
Data Mining (KDD) Challenge Cup was an open evaluation to test algorithms
for data mining. Although the KDD tasks traditionally concerned neither text
nor biological applications, the 2002 contest focused on mining biological text
(Zaki et al., 2002). The contest included two tasks. The first was to identify
papers that contained experimentally derived functions of Drosophila genes
(and to identify the relevant gene) (Yeh et al., 2002, 2003). The winner of this
task used a traditional NLP approach, creating a rule-based system that could
recognize experiments and gene products (Regev et al., 2002). The number of
rules developed was not reported. However, statistical approaches performed
competitively: the two runners-up that published their methods both used
statistical classifiers (Keerthi et al., 2002; Ghanem et al., 2002).
The second KDD task was to predict whether a knockout gene would affect a
signalling pathway (Craven, 2002). This was presented outright as a statistical
classification problem containing, in addition to text data, information on gene
function, protein localization, and protein interactions. The winning strategy
used support vector machines (Kowalczyk and Raskutti, 2002). It is notable,
however, that papers in both tasks stressed the importance of a careful choice
of features (Forman, 2002; Keerthi et al., 2002).
Another community-based effort to compare algorithms on biological
datasets has recently begun at the Text REtrieval Conference (TREC)
(Voorhees, 2002a). In 2003, TREC included an information extraction task
in its Genomics track. The task was to automate the assignment of GeneRIF
(Gene References Into Function) annotations in LocusLink, a database of ge-
netic loci, including genes (Pruitt et al., 2000). A GeneRIF is “a concise phrase
describing a function or functions” for a gene (or technically, a genetic locus)
(LocusLink). Currently, manual annotators scour MEDLINE literature to as-
sign GeneRIFs for genes in LocusLink. However, because of the vast number
of genes and volume of literature, the annotations are incomplete. Thus, there
is considerable interest in developing information extraction methods that can
reproduce these automatically. At this moment, although the entries for the
TREC task have been submitted, the results have not yet been released or
published.
Finally, there is one more community-based evaluation called BioCreAtIve,
for the Critical Assessment of Information Extraction systems in Biology. This
ongoing competition consists of two tasks. The first one is to identify the genes
and proteins in text. Although this task is nominally similar to the one in
Chapter 4, the goal of the task is more comprehensive. Here, a successful
system must produce a list of unique genes, with each linked to all synonyms
used in the texts. Thus, in addition to identifying the names, the system must
also resolve synonyms.
The second problem in BioCreAtIve is to annotate proteins with Gene Ontology
(GO) codes, identifiers for an ontology of gene function (Ashburner et al.,
2000). Successful systems here will identify the text in full-text journal articles
that provides evidence of protein function (similar to the TREC task), and then
annotate the function with a GO code. The training data for the BioCreAtIve
tasks have been made available; the final test data will not be available until
November 2003.
5.1 Information Extraction Systems in the NLP
Community
Traditional IE systems identified relationships in text by looking for distinctive
patterns. These patterns could be either rule-based or statistical. Rule-based
patterns were either regular languages (Hobbs et al., 1996; Soderland, 1999) or
ad hoc (Fisher et al., 1995; Kim and Moldovan, 1993; Lehnert et al., 1992;
Yangarber and Grishman, 2000), consisting of words, parts of speech, or
semantic classes from a domain ontology.
Rule-based patterns were initially developed manually, with an expert creating
and tuning a set of rules so that they would work accurately for a specific
problem. This process was labor intensive and required domain expertise.
Also, the resulting patterns were specific to a particular problem and hard to
adapt to another. Once this shortcoming was recognized, the adaptability of
IE systems became an explicit concern in MUC-7.
The need to alleviate the difficulty of developing patterns spurred research
into automated inference of patterns. One early approach employed a feedback
system where the algorithm would propose possible patterns for an expert to
accept or reject (Riloff, 1993). However, this was still labor intensive. Subsequent
systems used training sets of annotated relationships and induced
rules that accurately identified them (Soderland et al., 1995; Huffmann, 1996;
Aseltine, 1999; Califf and Mooney, 1999; Freitag, 1998; Catala et al., 2000; Kim
and Moldovan, 1993). In these systems, the expert would create a training set
rather than tune the patterns. Finally, to further reduce manual work, one sys-
tem induced patterns from a training set where the documents were labelled,
rather than the specific relationships (Riloff, 1996). As research increased,
methods for rule induction became more accurate. In 1996, Freitag noted that
automatically generated systems were approaching the performance of hand-
crafted ones (Freitag, 1996).
More recently, however, the community has also investigated using hidden
Markov models (HMMs) to find relationships (Freitag, 1996). These models
scanned a sequence of words and computed the probability that each word
belonged to a relationship. Since these were hidden Markov models, the
meanings of the states were inferred from the data. However, the topology of
the states was predetermined, and there has been work on finding topologies
suitable for IE (Freitag and McCallum, 2000) and on estimating accurate
transition probabilities from sparse data (Freitag and McCallum, 1999). Also,
McCallum proposed a variant of Markov models trained with maximum
entropy (McCallum et al., 2000).
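To illustrate the general idea (not any published system), the sketch below runs log-space Viterbi decoding over a two-state HMM in which each word is emitted either by an IN state (part of a relationship mention) or an OUT state. All probabilities and the tiny vocabulary are invented for illustration:

```python
import math

# Two hidden states: IN (word belongs to a relationship mention) and OUT.
states = ("OUT", "IN")
start = {"OUT": 0.8, "IN": 0.2}
trans = {"OUT": {"OUT": 0.8, "IN": 0.2}, "IN": {"OUT": 0.3, "IN": 0.7}}
emit = {
    "OUT": {"the": 0.4, "strongly": 0.45, "inhibits": 0.05, "p53": 0.05, "mdm2": 0.05},
    "IN":  {"the": 0.05, "strongly": 0.05, "inhibits": 0.4, "p53": 0.25, "mdm2": 0.25},
}

def viterbi(words):
    """Most probable IN/OUT state sequence, computed in log space."""
    v = [{s: math.log(start[s]) + math.log(emit[s][words[0]]) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            row[s] = v[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][w])
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "p53", "inhibits", "mdm2"]))  # ['OUT', 'IN', 'IN', 'IN']
```

In a real system the transition and emission probabilities are learned from annotated text, and the state topology is richer than two states.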
5.2 Identifying Biological Relationships
The biological research community also has built or adapted information ex-
traction systems to identify relationships. However, nearly all these systems
have been optimized to recognize specific relationships between two entities,
such as proteins. Binary relationships are particularly important in biology.
Some groups have developed algorithms to analyze text and automatically con-
struct databases of protein-protein interactions (Blaschke et al., 1999; Ng and
Wong, 1999; Thomas et al., 2000; Jenssen et al., 2001; Ono et al., 2001; Park
et al., 2001; Wong, 2001; Yakushiji et al., 2001), protein cellular localization
(Craven and Kumlien, 1999), metabolic enzymes (Humphreys et al., 2000),
gene-drug interactions (Craven and Kumlien, 1999; Rindflesch et al., 2000),
gene and gene products (Sekimizu et al., 1998), diseases associated with pro-
teins or keywords (Andrade and Bork, 2000; Craven and Kumlien, 1999), and
other relationships (Hirschman et al., 2002b).
• Renal cells accumulate the osmolyte sorbitol through increased transcription of the aldose reductase gene.
• This increase was markedly inhibited by addition of sulfaphenazole, a selective inhibitor of CYP2C9.
• It is not known whether VDR genotype influences bone accretion or loss, or how it is related to calcium metabolism.
• The results are consistent with recent findings showing that CYP1A2, rather than CYP2D6, is the major enzyme responsible for the metabolism of clozapine.
Figure 5.1: Sample Relationships between Drugs and Genes. Identifying relationships from text is difficult. There are many types of relationships and the vocabulary for describing them is diverse. In addition, language describing uncertainty or negation can confound analyses.
5.2.1 Co-occurrence
The simplest algorithms leveraged the idea that if two entities appeared in the
same sentence or abstract, they may be related. This idea can also be refined
in two ways. First, entities that appear closer may be more likely to be related.
Second, entities appearing together more frequently may also be more likely to
be related (Andrade and Bork, 2000; Stapley and Benoit, 2000; Jenssen et al.,
2001; Stephens et al., 2001).
The co-occurrence algorithm was the basis for PubGene, a database contain-
ing gene-gene interactions (Jenssen et al., 2001). In this study, Jenssen et al.
looked for MEDLINE abstracts that contained at least two gene names from
their dictionary of human genes. They scanned all of MEDLINE for genes
that co-occur. Then, they used human experts to evaluate whether the co-
occurrence was biologically meaningful. This resulted in 60% precision and
51% recall. Also, Jenssen discovered that the gene-gene pairs occurring multi-
ple times were more likely to interact. When considering only the relationships
appearing in five or more articles, the precision increased to 72%. However,
many pairs that occurred fewer than five times also interacted. During the eval-
uation, the scientists found that nearly all the gene-gene interactions missed
were due to failures in gene name identification.
Although Jenssen looked for pairs of genes that occurred in the same abstract,
other groups have used methods that find co-occurrences in the same
phrase (Ono et al., 2001) or sentence (Rindflesch et al., 2000). Intuitively, the
larger the scope, the more relationships the algorithm would detect. However,
the relationships may also be more likely to be incorrect. One comprehensive
study quantified this and found the recall and precision for co-occurrences in
abstracts to be 100% and 57%, respectively; performance on sentences was
85% and 64%, and for phrases, 62% and 74% (Ding et al., 2002). One inter-
esting finding from this study was that long range co-occurrences identified
relationships more effectively if one of the terms was a general name that was
used more specifically in the rest of the document. For example, the study
showed that flavonoid was used in the first sentence, and more specific words
for flavonoids were used subsequently. Without the benefit of an ontology (or
deep semantic analysis), flavonoid relationships were best found by consider-
ing the whole abstract.
Although co-occurrence methods could successfully find relationships, they
provided no insight into the characteristics of the relationships. Co-occurrences
between genes could indicate direct physical relationships such as binding, or
more abstract relationships such as mutual involvement in a biological process
(Stapley and Benoit, 2000). Further processing was necessary to characterize
the relationship.
5.2.2 Keywords
To identify the types of relationships, algorithms must examine relevant infor-
mation, such as the neighboring words in the sentence or abstract. A simple
heuristic identifies key words or phrases that can discriminate particular types
of relationships. One way to make this method more specific is to use patterns
of words.
In pattern-based methods, researchers developed patterns of biological en-
tities and regular words that distinguished particular types of relationships.
These patterns were typically simple: they required neither part-of-speech
information nor complex semantics, and usually employed only a few general
patterns. For example, one system used only <protein A> <action>
<protein B> where <action> consisted of a list of 14 possible words and
their variants (Blaschke et al., 1999). Other systems used 5 (Ng and Wong,
1999) and 20 (Ono et al., 2001) patterns. One group also developed a method
that could score relationships. For each co-occurrence, the algorithm scored
the words in that sentence against a list of words typically found with different
types of relationships (Stephens et al., 2001).
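The single-pattern approach can be re-created in sketch form with a regular expression; the action list here is a short, invented stand-in for the 14 words and variants used by Blaschke et al.:

```python
import re

actions = ["activates", "inhibits", "binds", "phosphorylates"]
proteins = ["Raf", "MEK", "ERK"]

protein_re = "|".join(map(re.escape, proteins))
action_re = "|".join(map(re.escape, actions))
# The pattern <protein A> <action> <protein B>
pattern = re.compile(
    rf"\b(?P<a>{protein_re})\s+(?P<action>{action_re})\s+(?P<b>{protein_re})\b"
)

m = pattern.search("We show that Raf phosphorylates MEK in vitro.")
if m:
    print(m.group("a"), m.group("action"), m.group("b"))  # Raf phosphorylates MEK
```

In practice the protein alternation would come from a name identifier such as the one in Chapter 4 rather than a fixed list.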
5.2.3 Machine Learning
Instead of requiring patterns or keywords, IE has also been framed as a ma-
chine learning problem. Formulated thus, the sentences with co-occurrences
are represented in a vector space model, and then the classifier scores the like-
lihood that the sentence contains a relationship.
Craven and Kumlien implemented such a classification system to determine
the subcellular localization of proteins (Craven and Kumlien, 1999). Using experts,
they created a training set of sentences describing proteins localized to subcellular
compartments. Then, they trained a naïve Bayes classifier and scored
new sentences containing localization information. They found that the method
had higher precision than co-occurrence methods, but lower recall.
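A from-scratch sketch of this style of formulation, with invented training sentences: each sentence becomes a bag of words, and a naive Bayes classifier with Laplace smoothing scores new sentences:

```python
import math
from collections import Counter

# Invented training sentences; 1 = describes subcellular localization.
train = [
    ("the protein is localized to the mitochondria", 1),
    ("ubc6 localizes to the endoplasmic reticulum", 1),
    ("the protein was purified by affinity chromatography", 0),
    ("expression was measured by northern blot", 0),
]

word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
vocab = set()
for sentence, label in train:
    words = sentence.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def predict(sentence):
    """Return the class with the higher smoothed log-likelihood."""
    words = sentence.split()
    scores = {}
    for label in (0, 1):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in words:
            # Laplace smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("gfp fusion localized to the mitochondria"))  # 1
```

The log-probability scores themselves can serve as confidence values, trading recall against precision by thresholding.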
5.2.4 Natural Language Processing
In addition to identifying the type of relationship, identifying the subject and
object of a relationship was also useful. For example, in regulatory networks,
one gene influences the expression of the next. In these cases, it was useful to
distinguish the regulator gene from the gene being regulated. This required a
more sophisticated language model. NLP systems addressed this by including
a domain ontology (semantics) and a structured model of the sentence (syntax).
Researchers have developed or adapted some NLP systems to extract bio-
logical information (Hishiki et al., 1998; Humphreys et al., 2000; Thomas et al.,
2000). Although technologies to model knowledge and parse syntax were still
under development, they often performed well enough for information extrac-
tion (Sekimizu et al., 1998). Although these systems have been built on tech-
nologies developed in the NLP community, all needed to be adapted for the
biological domain.
The use of NLP techniques could be as simple as just using the part of
speech to score relationships (Thomas et al., 2000). For example, in a protein-
protein relationship, both proteins should be nouns in the sentence. Slightly
more complicated systems used shallow parsing to determine the subject and
object of known verbs (Sekimizu et al., 1998; Proux et al., 2000; Wong, 2001).
Finally, systems that performed full parsing could ascertain the relationships
among all components in a sentence (Yakushiji et al., 2001; Park et al., 2001;
Park, 2001). Although these full parsers suffered from parsing ambiguities
(Yakushiji et al., 2001), they nevertheless achieved up to 48% recall with 80%
precision (Park et al., 2001).
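A nearest-noun heuristic for finding the subject and object of a known verb can be sketched as follows, assuming part-of-speech tags are already available (here they are supplied by hand, and the tag set and verb list are illustrative):

```python
tagged = [
    ("Sulfaphenazole", "NN"), ("selectively", "RB"),
    ("inhibits", "VB"), ("CYP2C9", "NN"),
]
known_verbs = {"inhibits", "activates", "metabolizes"}

def extract_svo(tagged_sentence):
    """Take the nearest noun before and after a known verb as the
    subject and object of the relationship."""
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag == "VB" and word in known_verbs:
            subj = next((w for w, t in reversed(tagged_sentence[:i]) if t == "NN"), None)
            obj = next((w for w, t in tagged_sentence[i + 1:] if t == "NN"), None)
            return subj, word, obj
    return None

print(extract_svo(tagged))  # ('Sulfaphenazole', 'inhibits', 'CYP2C9')
```

The heuristic breaks on constructions such as "CYP1A2, rather than CYP2D6, metabolizes clozapine", where the nearest noun is not the semantic subject, which is one motivation for full parsing.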
5.3 NLP Systems in Biomedicine
Given the methodological advances in the NLP IE community, it is somewhat
surprising that those technologies have not been adopted more widely in biology.
One possibility is that bioinformaticians are unaware of the work of the
NLP community. However, this is not likely, as evidenced by the systems that
have been adapted. Other possible reasons may be related to idiosyncrasies
in the biological domain, difficulties in adapting NLP systems, and differing
expectations of accuracy in biology and general text.
One important difference that distinguishes IE in biology from IE in gen-
eral text is the method for recognizing entities. The characteristics of entity
names, and thus the methods for identifying them, differ. In general text, en-
tities include names of people and companies from news reports. Biological
systems must recognize names such as genes, proteins, drugs, or subcellular
locations. Biological names include domain-specific idiosyncrasies, such as
unusual patterns in the prefixes and suffixes of words (Hishiki et al., 1998). Also,
tokenizers must handle the mixed punctuation and numbers that occur in gene
names (Thomas et al., 2000). Although algorithms can identify entities from
regular text with 93–95% accuracy, biological names are generally recognized
with 75–80% accuracy (Hirschman et al., 2002a).
Also, it is difficult to adapt existing NLP systems. As one author wrote
about the entity recognition problem,
The process is weakly domain dependent . . . changing from news to
scientific papers would involve quite large changes
(Cunningham, 1999). Since the overall performance of the system depends
on the quality of the entity recognizer, this portion of an IE system must be
accurate (Fukuda et al., 1998).
In addition to the named entity recognizer, the ontology must be adapted
to the biological domain. To adapt the EmpathIE system, Humphreys cre-
ated a new ontology specifically to support extraction of metabolic pathways
(Humphreys et al., 2000). This ontology consisted of a lexicon of 52 categories
and 25,000 terms used to describe pathways. Even at this size, they reported
that further refinements of the ontology could still lead to performance in-
creases.
Nevertheless, one interesting application of NLP technologies to the
biomedical domain is MedLEE, developed at Columbia University
(Friedman et al., 1994; Friedman, 2000). This system was originally
created to extract information from clinical documents to support the Columbia-
Presbyterian Medical Center. It uses traditional NLP technologies, employ-
ing a domain-specific lexicon and syntactic and semantic grammars. Their ap-
proach is to develop the system for a single domain and then adapt it to differ-
ent ones (Friedman, 1997). In medicine, MedLEE has been applied successfully
to different types of text, including radiology reports, mammography, discharge
summaries, electrocardiography, echocardiography, and pathology.
In addition to clinical work, MedLEE is now being adapted to the biologi-
cal domain. This involves developing a new concept recognizer, creating new
ontologies, and developing new patterns. Their final system for extracting
molecular pathways, GENIES, uses only the tokenizer and parser from MedLEE
(Friedman et al., 2001). GENIES includes new modules for gene name recog-
nition and gene name disambiguation (Krauthammer et al., 2000; Hatzivas-
siloglou et al., 2001). To retrieve signal transduction pathways, the MedLEE
researchers adapted an ontology by extending the UMLS to handle basic bio-
logical concepts relevant for regulatory networks (Rzhetsky et al., 2000). For
this task, they reported 53% recall and 100% precision on identifying binary
relationships.
Hatzivassiloglou observed differing expectations in accuracy between tra-
ditional knowledge-based NLP systems and the systems currently built for
biology. Traditional systems, such as MedLEE, are focused heavily on high
specificity (Hatzivassiloglou et al., 2001); these systems emphasize proposing
relationships that are correct. This concurs with the observation Thomas
made, that Highlight was "tuned to produce high precision but lower recall"
(Thomas et al., 2000). However, many of the systems developed specifically for
biology seem to prefer high recall and tolerate lower precision. Thus, adapting
MedLEE also required supplementing it with statistical approaches to
meet the expectations of performance in the biological community (Hatzivas-
siloglou et al., 2001).
5.4 Identifying Related Genes and Drugs
To support pharmacogenomics research, I applied information extraction tech-
niques to search for evidence of relationships between genes and drugs in the
literature. In 1999, Evans and Relling published a review article describing ge-
netic polymorphisms in “drug-metabolizing enzymes, transporters, receptors,
and other drug targets” that led to variations in responses to drugs (Evans and
Relling, 1999). Their work documented 215 relationships among 62 genes and
127 drugs.
Using the relationships from Evans & Relling, I tested whether a co-
occurrence method could find evidence of the 215 relationships in MEDLINE
abstracts. For my analysis, I compiled a list of the genes and drugs found in
the article. Then, I searched MEDLINE for gene and drug names from the list.
I used a local version of MEDLINE that contained citations only until the end
of the year 2001.
I searched for genes and drugs in the abstract or title of a MEDLINE cita-
tion. Each gene or drug word (or phrase) had to appear in the text, preceded
and followed by either whitespace or punctuation; this ensured that spurious
partial-word matches were not counted. I allowed variation in cap-
italization. If the word was an abbreviation or long form defined in the text,
I also searched for the corresponding form in that citation only. I recognized
abbreviations using the algorithm described in Chapter 3 with a score cutoff of 0.03.
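The boundary rule above can be sketched with a regular expression whose lookaround assertions reject matches embedded inside longer alphanumeric tokens; the helper name and example text are hypothetical:

```python
import re

def find_name(name, text):
    """Return (start, end) spans where name appears with whitespace or
    punctuation (or the start/end of the text) on both sides."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,  # allow variation in capitalization
    )
    return [m.span() for m in pattern.finditer(text)]

text = "ERK activation was observed; ER expression increased."
# Only the standalone 'ER' matches; the 'ER' inside 'ERK' is rejected.
spans = find_name("ER", text)
print([text[s:e] for s, e in spans])  # ['ER']
```

The spans returned here also feed naturally into the overlap check described below for co-occurrence counting.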
Then, for each gene-drug pair, I counted the number of abstracts and sen-
tences where both the gene and drug occurred. I treated the title as the first
sentence of the abstract. To split text into sentences, I used a sentence bound-
ary detection heuristic described in Appendix B. I did not count co-occurrences
where the gene and drug names overlapped, e.g., if the drug vitamin D used
the same words as the gene vitamin D receptor.
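The overlap rule can be sketched as a check over matched character spans; the span values below are illustrative:

```python
def overlaps(a, b):
    """True if two (start, end) spans share any characters."""
    return a[0] < b[1] and b[0] < a[1]

def cooccurs(gene_spans, drug_spans):
    """Count a co-occurrence only if some gene span and some
    drug span are disjoint."""
    return any(not overlaps(g, d) for g in gene_spans for d in drug_spans)

# "vitamin D receptor" at span (0, 18) contains "vitamin D" at (0, 9):
print(cooccurs([(0, 18)], [(0, 9)]))            # False: the only match overlaps
print(cooccurs([(0, 18)], [(0, 9), (40, 49)]))  # True: a disjoint match exists
```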
Out of 7,874 possible gene-drug pairs (62 genes × 127 drugs), 1,462 (19%)
occurred in at least one abstract, and 489 (6%) occurred in five or more. Out of
the 215 pairs from Evans & Relling, 167 (78%) appeared in at least one abstract
and 113 (53%) in at least five (Figure 5.2). Twenty-two percent of the known
gene-drug relationships did not appear in any abstract, and 32% did not appear in any
sentence together.
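As a quick arithmetic check of the counts above:

```python
# Verify the pair count and the rounded percentages quoted in the text.
pairs = 62 * 127
print(pairs)                      # 7874
print(round(1462 / pairs * 100))  # 19
print(round(489 / pairs * 100))   # 6
print(round(167 / 215 * 100))     # 78
print(round(113 / 215 * 100))     # 53
print(round(48 / 215 * 100))      # 22
```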
This method missed nearly a quarter of the relationships described in Evans
& Relling. Many of these omissions occurred due to the minimal handling of
[Figure: bar chart of the number of gene/drug pairs (y-axis) against co-occurrence frequency bins 0, 1, 2-5, 6-10, 11-50, 51-100, 101-1000, and more (x-axis), with separate bars for abstracts and sentences.]
Figure 5.2: Frequency of Gene-Drug Co-occurrences. This figure shows the frequency with which a gene and drug, with a relationship identified in Evans & Relling, occur in the same abstract and sentence.
synonyms; I only considered abbreviations. Since many genes and proteins
have multiple names, the lack of a thesaurus of synonyms caused any mention
using an alternate name to be missed (Yu et al., 2002a; Yu and Agichtein, 2003).
In nearly half the missing gene-drug pairs (22 out of 48), the gene was
a cytochrome P450 protein listed in Evans & Relling as CYP1A2, CYP2C9,
CYP2D6, CYP3A5, and CYP3A7. However, in the literature there was con-
siderable variation in cytochrome P450 nomenclature. These proteins can be
written in many ways, including CYP1A1/2, CYP2D family, or more gener-
ally cytochrome p450 protein. Identifying these phrases, and resolving them as
individual genes, would require more sophisticated strategies for tokenization
and name resolution.
Similarly, there were many ways to refer to drugs. Drugs were commonly
classified into categories. As noted in Section 5.2.1, authors often refer to
classes of drugs rather than specific ones. For example, the drugs in Evans
& Relling included classes such as steroids, antipsychotics, and calcium chan-
nel blockers. Knowledge about classes of drugs may have been inferred based
on data pertaining to individual drugs. Resolving such discrepancies computa-
tionally would require a hierarchical classification of drugs.
Finally, this algorithm missed the relationships from Evans & Relling that
were not indicated in the MEDLINE abstracts.
In addition to missing relationships, the co-occurrence-based algorithm also
identified 1,295 gene-drug pairs that were not described in the Evans & Relling
article. To see whether these co-occurrences indicated real relationships be-
tween genes and drugs, I manually checked 100 pairs: the fifty pairs that
occurred in the greatest number of citations, plus fifty pairs chosen at random
from those that occurred in only one citation each. I read the citations and
found that 70 of these gene-drug pairs, the majority, shared some relationship.
Of the remaining 30, the text specifically documented for 8 pairs that they had
no relationship; the rest were errors in the co-occurrence algorithm.
Ten of these pairs are shown in Table 5.1, and the full list can be found in Ap-
pendix C. This was consistent with previous studies on gene-gene relationships
reporting that co-occurrences between correctly identified gene names indicate
some type of biological relationship (Stapley and Benoit, 2000). The major
Gene                              Drug             Abs   Relationship
Aldehyde dehydrogenase            Ethanol          291   Gene catalyzes drug metabolism. (1)
Glutathione S-transferase         Glutathione      277   Drug is substrate of gene. (2)
Angiotensin-converting enzyme     Insulin          178   Gene affects sensitivity to drug. (3)
Glucocorticoid receptor           Steroids         165   Drug is substrate of gene. (4)
CYP1A1                            Ethoxyresorufin  151   Gene metabolizes drug. (5)
Dihydropyrimidine dehydrogenase   Bilirubin          1   Disabled gene leads to increased level of drug. (6)
N-Acetyltransferase               Insulin            1   Drug leads to activation of gene. (7)
NAT1                              Phenacetin         1   Drug exposure does not lead to gene polymorphisms. (8)
Glucocorticoid receptor           Tacrolimus         1   Drug is substrate of gene. (9)
CYP1A1                            Lovastatin         1   Drug does not influence gene activity. (10)

Table 5.1: Relationships Between Ten Genes and Drugs. This table describes the relationship between 10 genes and drugs found to co-occur in the literature but not identified in the review by Evans & Relling. The first five are the genes and drugs that appear in the greatest number of abstracts. The last five are randomly chosen genes and drugs that appear in one abstract only. The Abs column gives the number of abstracts that contained both the gene and drug; the Relationship column describes the relationship between them. ((1) Agarwal (2001); (2) Jakoby (1978); (3) Hamilton (1990); (4) Baxter (1978); (5) Kitchin (1983); (6) Tateishi et al. (1999); (7) Namboodiri et al. (1981); (8) Bringuier et al. (1998); (9) Oyanagui (1998); (10) Cohen et al. (2000))
source of errors in co-occurrence algorithms arose from errors in identifying
the gene names (Jenssen et al., 2001).
There are many reasons that gene-drug relationships could have been omit-
ted from the Evans & Relling article. First, the review article concentrated
mainly on the genes with polymorphisms that could affect drug response. Sec-
ond, 30 of the 100 gene-drug pairs in Appendix C indicated no relationship.
Third, the authors may not have intended to catalog all known gene-drug rela-
tionships. Finally, the review article contained only the relationships known to
occur in humans. Since the MEDLINE search was not limited to a specific or-
ganism, many of the results described studies performed in model organisms,
notably rats.
5.5 Classifying Gene-Drug Relationships
The previous section showed that a co-occurrence approach could identify
many related genes and drugs from the literature. Once I had established that a
gene and drug were related, I began to analyze the literature to identify char-
acteristics of the relationship. Genes and drugs can interact in many ways,
from direct substrate-ligand binding relationships to more abstract ones, e.g.
“variations in the MDR1 gene reduce the plasma concentration of digoxin.” In-
teractions between genes and drugs were organized in the Pharmacogenomics
Knowledge Base (PharmGKB) according to five broad categories relevant for
pharmacogenomic researchers (Table 5.2).
PharmGKB collected information about related genes and drugs using a
community-based submission tool. Researchers on the internet could directly
submit, pending approval by a PharmGKB annotator, gene-drug relationships
classified according to the five categories. Often, multiple categories applied to
a specific gene and drug. In particular, the consequences of Genotype varia-
tions were often revealed in the other categories. For example, a mutation in
the TPMT gene could lead to blood toxicity of 6-thioguanine, indicating both
Genotype and Pharmacodynamic relationships. Currently, PharmGKB
annotators have approved submissions on 325 drug-gene pairs exhibiting 515
relationships. That is, if a gene and drug were related both by Genotype and
Clinical Outcome: Genetic variations in the response to drugs can cause measurable differences in clinical endpoints such as rates of cure, morbidity, side effects, and death. Data in this category demonstrate that genetic variability in the context of a drug effect significantly changes medical outcomes. These data sets are different from pharmacodynamics data sets, which may show a difference that is not sufficiently significant to alter practice or policy.

Pharmacodynamics and Drug Response: Genetic variation in drug targets can cause measurable differences in the response of an organism to a drug. Data in this category document that the biological or physiological response to a drug varies, and that this variation can be associated with the variation of one or more genes. This variation is often measured at the whole-organism level. The measured variables may be surrogates for clinical responses, but do not constitute outcomes themselves.

Pharmacokinetics: Genetic variation in processes involved in the absorption, distribution, metabolism, or elimination of a drug can result in changes in drug availability. Data in this category are primarily concerned with demonstrating that genetic polymorphisms lead to variations in the levels or concentrations of drugs or their metabolites at the site of action.

Molecular and Cellular Functional Assays: Genetic variation can alter results of molecular and cellular functional assays, and this may correlate with variations in the organism's drug response. Data in this category demonstrate associations between genetic variation and laboratory assays of function at the molecular or cellular level. These assays may test the molecular properties of drug targets or drug metabolizing enzymes, or may test the cellular properties of cells involved in the response to a drug (such as whole-cell gene expression).

Genotype: Genotype is the internally coded, heritable information carried by the organism. Variation in genotype is fundamental to pharmacogenetics and is measured as sequence variation in individual genes--the type and location of the variation, and the frequency of the variation in the populations of interest. This genetic variation is independent of individual drugs, but forms the basis for variation in response to drugs.

Table 5.2: Pharmacogenomic Relationships in PharmGKB. PharmGKB identifies five types of relationships between genes and drugs that are of interest to pharmacogenomic researchers. The relationships and their definitions are duplicated here from their website (PharmGKB).
Pharmacodynamics, that gene-drug pair exhibited two relationships.
For each gene-drug pair, I retrieved the MEDLINE citations (until the end
of the year 2001) that contained both the gene and the drug. I matched names
to the text using the heuristics described in Section 5.4, looking for exact
word matches, allowing abbreviations, and ignoring overlapping gene and drug
names. With the citations found, I created a data set of the sentences that con-
tained a gene and drug. The title was also considered a sentence.
From each sentence, I removed occurrences of gene and drug names. First, I discarded any name that appeared in either the PharmGKB or
Evans & Relling lists. Then, I examined the remaining words in the sentences
and removed any residual gene and drug names manually.
Then, I converted each document into a vector representation suitable for a
machine learning classifier. Each document was a vector of words:

    \vec{Document} = [w_1 \; w_2 \; \ldots \; w_n]    (5.1)

where w_i is 1 if the word occurred in the document and 0 otherwise, and n is the
total number of unique words that occurred in the corpus.
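Equation 5.1 can be sketched directly. The corpus below is illustrative, not the thesis data set:

```python
def build_vocabulary(documents):
    """Collect the n unique words that occur anywhere in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_binary_vector(document, vocab):
    """Map a document to [w1, w2, ..., wn], where wi is 1 if word i occurs."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocab]

corpus = [
    "deficiency is associated with profound toxicity",
    "profound toxicity after thiopurine therapy",
]
vocab = build_vocabulary(corpus)
vectors = [to_binary_vector(doc, vocab) for doc in corpus]
```

Note that word order and counts are discarded; only the presence or absence of each vocabulary word is kept.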
Then, to eliminate uninformative features, I performed feature selection on
the sentences (without gene or drug names) for each type of relationship. I
ranked the words using the χ2 test. This test produced a p-value that indicated
the strength of the association between each word and relationship. I kept
the 100 features that were most strongly related (or technically, least likely
to be unrelated) to the relationship. I also kept the 100 features most strongly related to no relationship, resulting in a total of 200 features. I chose this number to balance the competing demands of keeping the dimensionality small for computational tractability while including enough features to capture the information relevant to the classification decision. The most informative features for each relationship are shown in Table 5.3.

Figure 5.3: Scoring Gene-Drug Relationships. I scored the pharmacogenomic relationships for sentences that contain a gene and drug. First, I removed the genes and drugs from the sentence. Then, using the high-scoring features from the sentence, machine learning classifiers scored each type of relationship. For the example sentence "TPMT deficiency is associated with profound toxicity after thiopurine therapy," the classifiers produced scores of 0.50 (Clinical Outcome), 0.05 (Pharmacodynamics), 0.26 (Pharmacokinetics), 0.07 (Functional Assay), and 0.05 (Genotype).
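The χ² ranking described above can be sketched as follows. The statistic is computed for each word against a relationship label over a 2×2 contingency table (for one degree of freedom, ranking by the statistic is equivalent to ranking by p-value); the sentences and labels here are illustrative only:

```python
def chi2_statistic(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 table: n11 counts sentences with the
    word and the relationship, n10 with the word only, n01 with the
    relationship only, and n00 with neither."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0

def rank_features(sentences, labels, k):
    """Return the k words most strongly associated with the label."""
    vocab = {w for s in sentences for w in s}
    scores = {}
    for w in vocab:
        n11 = sum(1 for s, y in zip(sentences, labels) if w in s and y)
        n10 = sum(1 for s, y in zip(sentences, labels) if w in s and not y)
        n01 = sum(1 for s, y in zip(sentences, labels) if w not in s and y)
        n00 = len(sentences) - n11 - n10 - n01
        scores[w] = chi2_statistic(n11, n10, n01, n00)
    return sorted(scores, key=scores.get, reverse=True)[:k]

sentences = [{"toxicity", "therapy"}, {"toxicity", "dose"},
             {"binding", "assay"}, {"assay", "cells"}]
labels = [1, 1, 0, 0]  # 1 = sentence describes the relationship
top = rank_features(sentences, labels, k=2)
```

In the thesis the top 100 words for the relationship and the top 100 for no relationship were retained; the same function serves both by flipping the labels.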
Using the reduced-size vectors, I trained supervised machine learning clas-
sifiers to score the relationships described in a sentence (Figure 5.3). Because
the sentence could describe multiple types of relationships, I trained one clas-
sifier for each type. Using 5-way cross validation, I trained each classifier on
80% of the sentences and withheld the rest for testing. This way, I compiled
the scores for every sentence in the data set, without having used the same
sentence for training.
To classify, I used a Maximum Entropy classifier. Although Support Vector
Table 5.3: Informative Features for Gene-Drug Relationships. This table shows the 10 most informative features for classifying each type of gene-drug relationship (Clinical Outcome; Pharmacodynamics and Drug Response; Pharmacokinetics; Molecular & Cellular Functional Assays; Genotype). The Indicates column contains either REL, if the presence of the Token indicates a relationship, or NON otherwise. These features were calculated from sentences that contain a gene and drug.
Machines may ultimately outperform Maximum Entropy, Maximum Entropy
performs well even without manually fitted parameters (see Chapter 4).
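For binary decisions over binary word features, a Maximum Entropy classifier reduces to logistic regression. Below is a stdlib-only sketch trained by stochastic gradient ascent on toy data; one such classifier would be trained per relationship type, and the features here are illustrative, not the thesis feature set:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(X, y, epochs=200, lr=0.5):
    """Fit weights for P(rel | x) = sigmoid(b + sum_i w_i * x_i)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            b += lr * err
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def score(x, w, b):
    """Score between 0 and 1 for one relationship type."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy vectors over the features [toxicity, therapy, assay, cells]
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y = [1, 1, 0, 0]  # 1 = the sentence describes this relationship
w, b = train_maxent(X, y)
```

Because each sentence can describe several relationship types, five independent binary scorers like this one are trained rather than a single five-way classifier.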
When applied to the data sets, the classifier achieved high performance in
categorizing each sentence according to the relationships described. That is,
sentences that described a relationship received a high score from that classi-
fier, and those that did not received a low score (Figure 5.4).
The sentences with the Pharmacodynamics and Drug Responses re-
lationship were easiest to identify, with 41% of the results receiving a score
of 0.95 or higher. Conversely, Clinical Outcome was the hardest to predict
with only 2% of the correct results receiving a score of at least 0.95. This cate-
gory may have been difficult to predict due to the great breadth of vocabulary
that described various clinical outcomes. 6% of the sentences contained no in-
dicative words identified by the feature selection and thus received scores of
exactly 0.5. Interestingly, Genotype was hard both to predict (3% of correct results received a score of 0.95 or higher) and to rule out (9% of non-Genotype sentences received a score of 0.05 or lower).
5.6 Predicting Gene and Drug Relationships
Finally, I applied the sentence-relationship scores calculated in the previous
section to identify the types of relationships for each gene-drug pair. For each
pair, I collected the scores for every relationship of every sentence containing
the gene and drug. Then, for each relationship, I averaged those scores to be
the relationship score for that gene-drug pair. Finally, I assigned a relationship
to a pair if the average score was at least 0.5.
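The aggregation step can be sketched as follows; the relationship names come from Table 5.2, and the scores are illustrative:

```python
def pair_relationships(sentence_scores, threshold=0.5):
    """Average the per-sentence scores for each relationship type of one
    gene-drug pair, and assign a relationship when its average score
    reaches the threshold."""
    types = sentence_scores[0].keys()
    averaged = {t: sum(s[t] for s in sentence_scores) / len(sentence_scores)
                for t in types}
    assigned = sorted(t for t, avg in averaged.items() if avg >= threshold)
    return averaged, assigned

# Scores for two sentences that both mention the same gene-drug pair
scores = [{"Pharmacokinetics": 0.9, "Genotype": 0.2},
          {"Pharmacokinetics": 0.7, "Genotype": 0.5}]
averaged, assigned = pair_relationships(scores)
```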
I found 187 pairs of genes and drugs from PharmGKB that occurred
Figure 5.4: Distribution of Relationship Scores. These histograms, one for each relationship (Genotype; Molecular & Cellular Functional Assays; Pharmacokinetics; Pharmacodynamics and Drug Responses; Clinical Outcome), show the distribution of scores for sentences derived from genes and drugs in PharmGKB. The dark bars show the scores for sentences describing that relationship, and the hashed bars show the scores for other sentences. The scores on the horizontal axis are split into 20 evenly spaced bins from 0 to 1; the vertical axes run from 0% to 50%. To preserve space, the scale is not shown.
Figure 5.5: Errors in Gene-Drug Relationships. This chart shows the number of errors made for each gene-drug pair: No Errors 27%, 1 Error 32%, 2 Errors 27%, 3 Errors 11%, 4 Errors 3%, 5 Errors 0%. No Errors indicates that all 5 possible types of relationships for a pair were classified correctly.
in the same sentences. This resulted in 935 classification decisions (187 pairs × 5 relationship types). Out of those, 690 (74%) were predicted correctly. Of the 187 gene-drug pairs, 50 (27%) were predicted perfectly; that is, the state of all 5 relationships was classified correctly. Conversely, 5 gene-drug pairs had 4 or more errors (Figure 5.5). A complete list of the gene-drug pairs and relationship predictions appears in Appendix D.
Figure 5.6: Common Co-occurrences Classified More Accurately. This plot shows the number of errors for gene-drug pairs observed in different numbers of sentences. Each gene-drug pair was classified into the 5 PharmGKB pharmacogenomics categories. The number of errors is the number of categories assigned incorrectly, i.e., between 0 and 5. Average Errors is the average number of errors for all gene-drug pairs found in at most Number of Sentences. The number of sentences is plotted on a log scale for clarity.
The number of errors for each of the gene-drug pairs seemed to be depen-
dent on the amount of data available (Figure 5.6). Gene-drug pairs that co-
occurred in only one sentence could be classified into the 5 types of relation-
ships with an average of 1.89 errors. However, the error rate decreased precip-
itously to 1.45 in pairs that co-occurred in up to 10 sentences, and 1.36 in those
co-occurring in up to 100.
The 5 gene-drug pairs that were classified with 4 or more errors were
CYP2D6-interferon alpha, CYP3A5-midazolam, CYP3A5-xenobiotics, CYP4B1-
xenobiotics, and TPMT-sulfasalazine. Four of these involved cytochrome P450 proteins. As noted in Section 5.4, this protein family was difficult to recognize because of tokenization and semantic problems. A more sophisticated tokenizer
that handled P450 nomenclature may have been able to recognize analogous
references to the token from the gene name list. For example, the CYP3A5-
midazolam co-occurrence appeared in one abstract as “. . . CYP1A2, 2A6, 2B6,
2C9, 2C19, 2D6, 2E1, 3A4 and 3A5, three (CYP2B6, 3A4 and 3A5) showed mi-
dazolam 1’-hydroxylation activity . . . ” (Hamaoka et al., 2001). Tokenizers that
could recognize these and other forms of the P450 nomenclature could increase
the amount of data available for classifying particular gene-drug pairs.
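A tokenizer along these lines might rewrite elliptical P450 references as full CYP symbols. Below is a rough sketch that handles only the simple pattern in the example above; real P450 nomenclature has many more variants:

```python
import re

# Matches a bare P450 subfamily token such as "2D6" or "3A4".
BARE = re.compile(r"\b(\d+[A-Z]\d+)\b")

def expand_cyp_references(text):
    """After an explicit CYPnXn symbol, prefix later bare nXn tokens with CYP."""
    out = []
    seen_cyp = False
    # Split on non-word runs, keeping the separators so the text reassembles.
    for token in re.split(r"(\W+)", text):
        if token.startswith("CYP"):
            seen_cyp = True
            out.append(token)
        elif seen_cyp and BARE.fullmatch(token):
            out.append("CYP" + token)
        else:
            out.append(token)
    return "".join(out)

sentence = "CYP1A2, 2A6, 2B6 and 3A5 showed midazolam 1'-hydroxylation activity"
expanded = expand_cyp_references(sentence)
```

With such an expansion, the bare tokens 2A6, 2B6, and 3A5 become matchable against a gene name list that stores full CYP symbols.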
5.7 Conclusions
Recognizing relationships between genes and drugs is a complex problem that
requires further investigation. In particular, as noted in previous studies,
recognition of the basic entities, the genes and drugs, should be improved.
Future algorithms should handle familial and categorical relationships (hypernymy), as well as synonymy within gene names. In addition, the relationships may need to be annotated with other useful types of classes, such as
source organism.
One of the major barriers for such statistical approaches is the dearth of
training data. The PharmGKB data set is relatively small, including 442 sub-
missions covering 325 drug/gene pairs. (Some of the submissions involved the
same gene and drug.) Also, over 70% of the submissions were contributed by
four submitters, with the top three coming from the same institution. Such homogeneity in the data submissions could have led to individual (or institutional) biases, where the choice of articles or submissions skewed the data set.
One modification that will increase the amount of training data (but unfor-
tunately not the diversity) is to examine co-occurrences in abstracts instead of
sentences. Statistically, this approach will examine more words. This will in-
crease accuracy when the whole abstract contains language that is distinct for
a particular class, such as some abstracts describing outcomes from clinical tri-
als. In addition, this will detect long range co-occurrences, which can arise due
to hypernymy and other types of indirect references, e.g., "that drug." However,
as documented above, previous studies have shown that although reviewing
more text will increase recall, it also decreases precision. It is unclear how
abstract-based co-occurrence will affect the extraction of gene-drug relations.
Nevertheless, this study has shown promising results, indicating that the
gene-drug relationship problem is tractable and may be solvable soon.
CHAPTER 6

Distributing the Algorithms
The potential value in computationally accessing knowledge from the biomed-
ical literature has driven the development of natural language algorithms
(Hirschman et al., 2002b). Concurrently, technologies for distributing those
algorithms are also in development (Stein, 2003). Distributing software is useful so that other researchers may evaluate, use, and build upon the work. The
dominant methods for distributing software include compiled binaries, source
code, web sites, or web services.
Each option for distributing programs has advantages and disadvantages.
Compiled binaries force the end user to use a particular platform. Distribut-
ing source code affords more flexibility in the platform and also allows the end
user to modify the algorithm according to their needs. However, source code
may be difficult to install and also can disclose the intellectual property in the
algorithms or implementation. One compromise is to conceal the code behind
a web interface. This allows platform-independent access, but places the bur-
den of maintenance on the operator. The computation is run on the server,
which may be problematic if the application is heavily used. An extension of the
web interface is a web service, which allows access from computer programs
(Jamison, 2003).
In particular, distributing the algorithms described in Chapters 3 and 4 is
challenging because of third-party dependencies, intellectual property issues,
and data requirements. Although my code is available as open source, it re-
quires programs that I cannot redistribute, such as Brill’s tagger and Numeri-
cal Recipes in C (Brill, 1994; Press et al., 1993). In addition, the code depends
on many libraries (e.g. Biopython, Numerical Python, libsvm) (Biopython; Nu-
merical Python; Chang and Lin, 2001). While these libraries are freely avail-
able, the requirement still complicates the installation process for the end user.
Finally, the algorithm depends on data stored in a relational database. Distributing only the source code is thus impractical because the difficult installation process prevents casual use.
Therefore, I am distributing the algorithm as both open source code and as
a web page (and web service). This allows both casual use through the web
interfaces and more dedicated use and extension with local installation of the
source code.
In this chapter, I describe the development of the web page and web service
as a BioNLP text mining server. I report on an algorithm to simplify browsing
lists of abbreviations by aggregating similar long forms. I also report on a web
service interface that allows users to access algorithms to identify abbrevia-
tions and gene names from their programs over the internet.
6.1 Clustering Abbreviations to Aid Browsing
In Chapter 3, I described an algorithm to identify abbreviations from text.
Given the string “Activation of Jun N-kinase (JNK) is a cellular response to
stress”, the algorithm identifies JNK as an abbreviation for Jun N-kinase, with
a score of 0.86.
I applied the algorithm on all abstracts from MEDLINE and found
1,948,246 abbreviation/long form pairs. Out of these pairs, many long forms
were similar. For example, JNK was an abbreviation for 155 different long
forms, including Jun N-kinase, Jun N-kinases, and Jun NH2 kinase. This led
to long and redundant results when users searched for an abbreviation on the
abbreviation web server.
To improve the presentation of abbreviations in a website, I developed a
heuristic to cluster similar abbreviations. The heuristic is based on the idea
that the long forms with small variations can be aggregated if their abbrevi-
ations are the same. For each abbreviation found in MEDLINE, I first sort
the long forms in alphabetical order. Then, I consider each abbreviation/long
form pair sequentially and aggregate it with a previous pair if they meet two
conditions:
• The abbreviations are the same, or they differ because one of them has an
-s (or -es) appended to the end.
• The edit distance between the alpha-numeric characters (ignoring spaces
and punctuation) in the long forms is at most 1.
If, according to these conditions, the abbreviation/long form pair is similar
    N     # Clusters    Edit Distance
    X     609,162       14
    2     461,822       83
    3     510,955       81
    5     555,813       47
    7     576,524       33
    10    592,912       23
    15    609,105       21

Table 6.1: Clusters of Abbreviations. The abbreviations are clustered allowing 1 mismatch per N alpha-numeric characters in the long form. The first row, X, is the clustering obtained when only 1 mismatch is allowed regardless of the length of the long form. The Edit Distance column is the maximum edit distance between two long forms in the same cluster.
to two or more other pairs, I cluster them all together. Note that because of
this transitivity, the final clusters can contain long forms with edit distances
greater than 1.
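The heuristic can be sketched as follows. This version compares each incoming pair against the most recently added member of each cluster; the helper names and the sample pairs are illustrative, not the production implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize_abbrev(abbrev):
    """Strip a trailing -s or -es so plural abbreviations compare equal."""
    for suffix in ("es", "s"):
        if abbrev.endswith(suffix):
            return abbrev[:-len(suffix)]
    return abbrev

def alnum(text):
    """Keep only alpha-numeric characters, ignoring spaces and punctuation."""
    return "".join(c for c in text.lower() if c.isalnum())

def cluster_pairs(pairs):
    """Aggregate (abbreviation, long form) pairs, taken in alphabetical order
    of long form, with an earlier cluster when both conditions hold."""
    clusters = []
    for abbrev, long_form in sorted(pairs, key=lambda p: p[1]):
        for cluster in clusters:
            a0, l0 = cluster[-1]  # compare against the most recent member
            if (normalize_abbrev(a0) == normalize_abbrev(abbrev)
                    and edit_distance(alnum(l0), alnum(long_form)) <= 1):
                cluster.append((abbrev, long_form))
                break
        else:
            clusters.append([(abbrev, long_form)])
    return clusters

pairs = [("JNK", "Jun N-kinase"), ("JNKs", "Jun N-kinases"),
         ("JNK", "Jun NH2 kinase"), ("MAPK", "mitogen-activated protein kinase")]
clusters = cluster_pairs(pairs)
```

Because joining is transitive through the chain of recent members, a final cluster can span long forms whose pairwise edit distance exceeds 1, as noted above.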
An alternate, more lenient, second condition allows more mismatches de-
pending on the length of the string. I experimented with conditions that
allowed 1 mismatch per N characters (Table 6.1). As expected, more strin-
gent mismatch requirements (greater N) led to increased numbers of clusters.
Fewer abbreviation/long form pairs could be clustered together.
Although the number of clusters varies, the method appears robust and clusters together similar long forms. For example, when using the most lenient strategy, allowing a mismatch every other character, the two most distant long forms were:
1: colony-forming units, erythroid burst-forming units, and granulocyte
   erythrocyte macrophage megakaryocyte colony-forming units

2: colony-forming unit-granulocyte, macrophage, erythroid, megakaryocyte
Requiring the abbreviations to be the same (except for a possible -s at the end)
provides a constraint on the long forms that may be clustered together. How-
ever, such a simple heuristic can cluster long forms with similar letters but
different meaning. One example is:
1: androgen receptor
2: adrenergic receptor
However, the frequency or significance of such errors for the user is unclear.
For the final clustering used on the web site, I used the computationally
cheapest strategy and allowed 1 mismatch between any two long forms. This
heuristic reduced 1,948,246 abbreviations into 609,162 clusters.
Finally, to display on web pages, I chose to represent each cluster by the abbreviation/long form pair that had the highest score. If multiple pairs shared the highest score, I chose the one that appeared most frequently in MEDLINE.
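The representative choice can be sketched in one function; the scores and counts below are illustrative:

```python
def cluster_representative(cluster, medline_counts):
    """Pick the pair with the highest score; break ties by the number of
    times the pair appeared in MEDLINE.
    cluster: list of (abbrev, long_form, score) tuples."""
    return max(cluster,
               key=lambda p: (p[2], medline_counts.get((p[0], p[1]), 0)))

cluster = [("JNK", "Jun N-kinase", 0.9),
           ("JNKs", "Jun N-kinases", 0.9),
           ("JNK", "Jun NH2 kinase", 0.7)]
counts = {("JNK", "Jun N-kinase"): 120, ("JNKs", "Jun N-kinases"): 45}
representative = cluster_representative(cluster, counts)
```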
6.2 Making Servers Computer-Friendly
Another important aspect of a web interface is to allow computational access
as well as human access. Developers create human-readable web pages us-
ing HTML (HyperText Markup Language), which encodes visual layout rather
than the semantics that computers require (Musciano and Kennedy, 2002).
Therefore, protocols called web services allow computers to access code on other
computers over a network.
For a web service protocol, my requirement was that it 1) allows clients
to access code over a network, 2) is supported across a wide range of oper-
ating systems and programming languages, 3) is relatively simple to install,
and 4) works with normal firewall configurations. Out of the many web ser-
vice protocols available, such as the Common Object Request Broker Architecture (CORBA), the Component Object Model (COM), Enterprise JavaBeans (EJB), XML Remote Procedure Call (XML-RPC), and the Simple Object Access Protocol (SOAP), the one that most closely fit our needs was XML-RPC (Bolton,
2001; Templeman and Mueller, 2003; Englander, 1997; Laurent et al., 2001;
Snell et al., 2001). Libraries for XML-RPC are available for many languages
including C, C++, C#, Java, LISP, PERL, Python, Ruby, Scheme, Tcl, .NET, and
Visual Basic.
I developed an XML-RPC interface to the BioNLP server. The server exports
two functions: find_abbreviations and find_gene_and_protein_names.
The documentation and usage for these functions are as follows:
find_abbreviations
    INPUT: string
    OUTPUT: array

    This function will search for the abbreviations in a string and return
    an array of the abbreviations found. Each element of the returned
    array is itself an array of:
        [string long_form, string abbreviation, double score]

find_gene_and_protein_names
    INPUT: string
    OUTPUT: array

    This function will search for the gene and protein names in a string
    and return an array of the names found. Each element of the returned
    array is itself an array of:
        [string name, int start, int end, double score]

    start and end are indexes into the input string that describe where
    the name was found. The indexes begin at 0, and the end index is
    exclusive.
These functions can be accessed natively from many languages. For exam-
ple, in Perl, to access the BioNLP web service to find gene names in a string:
# RPC::XML::Client is the PERL module that handles XML-RPC client
# requests. It is available from CPAN.
use RPC::XML::Client;

# Create a new XML-RPC client for the BioNLP server.
$client = new RPC::XML::Client "http://bionlp.stanford.edu/xmlrpc";

# Call the 'find_gene_and_protein_names' function on the BioNLP
# server and save the response.
$resp = $client->send_request(
    'find_gene_and_protein_names',
    "We observed an increase in mitogen-activated \
protein kinase (MAPK) activity.");

# Save the return value of the function. @results is an array of
# information for each gene or protein name found. Each name found is
# itself an array of [name, start index, end index, score].
@results = @{$resp->value};

# Print the name and score for each gene found.
foreach $i (0..$#results) {
    my @data = @{$results[$i]};
    print "NAME=", $data[0], "\n";
    print "SCORE=", $data[3], "\n";
    print "\n";
}
Only two lines of code are necessary to access the web service: one line to
create the connection, and one to call the function. Simple access to algorithms
is important to bioinformatics, and I hope to see more web-service-enabled
servers in the future.
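For comparison, the same call from Python needs only the standard library. The sketch below also shows the marshalled XML-RPC request built without any network traffic; the bionlp.stanford.edu endpoint is the one named above and may no longer be live:

```python
import xmlrpc.client

BIONLP_URL = "http://bionlp.stanford.edu/xmlrpc"

def find_gene_names(text, url=BIONLP_URL):
    """Call the server's find_gene_and_protein_names function.
    Each result is an array of [name, start, end, score]."""
    server = xmlrpc.client.ServerProxy(url)
    return server.find_gene_and_protein_names(text)

# The request body the client would send over HTTP, built locally:
request = xmlrpc.client.dumps(
    ("We observed an increase in mitogen-activated protein kinase "
     "(MAPK) activity.",),
    methodname="find_gene_and_protein_names")
```

As in the Perl example, the actual remote call is a single method invocation on the proxy object.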
CHAPTER 7

Conclusions
This thesis has described work to develop methods that automatically identify
relationships between drugs and genes from the literature. The final algorithm
can be used to create a database, which will be useful for pharmacogenomics re-
searchers and allow new biological insights. However, the initial investigations
into a gene-drug relationship algorithm uncovered several other areas of sci-
entific research that were not sufficiently advanced to support my endeavors.
These areas also required attention. Thus, for this thesis, I have addressed the
problems of: 1) identifying abbreviations in text (Chapter 3), 2) finding gene
names (Chapter 4), and 3) recognizing related genes and drugs (Chapter 5).
I have framed the main challenges in this thesis as machine learning tasks.
Each one is a problem of classification, where the computer must decide be-
tween two alternatives, e.g. abbreviation or not, gene name or not, or type of
relationship or not. The computer reaches its decision based on pieces of evi-
dence, the features.
Although the algorithms depend heavily on advances in machine learning,
the work in this thesis is somewhat atypical. Rather than focusing on the machinery supporting machine learning algorithms, I have found larger performance gains by concentrating on developing informative features and data representations. In two of the chapters in this thesis, the problems presented were not amenable to classification with generic features, which, for text, are typically words. Chapter 4 showed that the choice of features influenced the final
performance more than the machine learning algorithm. In this case, there
was little difference between the most and least accurate algorithms, and it
seems unlikely that further algorithm development will enhance performance.
However, the development of features is labor intensive. It requires domain
knowledge, and the features may not be immediately obvious. Good features
capture information relevant to the classification decision. Also, a representation of the feature must be discovered that is suitable for machine learning algorithms. Although much work has been done on developing algorithms, features are not as well developed.
As machine learning algorithms mature and are applied to different fields,
scientists must find ways to adapt the methodologies to their domain. Perhaps
the science of machine learning will turn toward developing new theories on
automatic development of features, or finding rigorous formulations of good
features. Currently, feature selection methods can find sets of features to help
distinguish classes. However, there should also be formalisms to assess the
quality of a feature based on its distribution, shape, or prevalence. In my work,
I have addressed these issues manually.
Thus, I have discovered features and developed methods necessary to iden-
tify relevant genes and their relationships with drugs. I have shown that the
relationship scoring algorithm performs well when applied to a static list of
genes that a user specifies. However, the literature is dynamic, with new genes
being named and altered constantly. The scoring algorithm would be robust
to these changes if it were applied to a current list of gene names found by
the gene name identification algorithm. Thus, the work in Chapter 5 may be
repeated by replacing the static list of gene names with a dynamic one. The
method would then be able to discover relationships of drugs involving genes
that the user may not have known about or previously considered.
However, as discussed in Section 4.3, developing a gene name identifica-
tion algorithm is only the first step to creating a list of gene names. Further
problems that need to be solved include handling variations in gene names,
synonymous gene names, and gene families. I will discuss these issues in Sec-
tion 7.2.
7.1 Summary of Contributions
My dissertation addressed three separate, but related, challenges in extracting information from biological text: identifying abbreviations, finding gene
names, and classifying the relationships between genes and drugs. In addi-
tion, I have made technical contributions in the delivery and accessibility of
my algorithms. Here are my contributions grouped according to these four ar-
eas.
Identifying Abbreviations
• I have developed an abbreviation identification algorithm that obtains
higher recall and precision than previous methods. It uses a dynamic-
programming algorithm to find possible alignments and scores the like-
lihood using logistic regression. It is robust to missing characters and
locations of letter alignments.
• I have created a database of all predicted abbreviations from MEDLINE.
The database contains nearly all biomedical abbreviations.
• I have developed an algorithm for aggregating similar abbreviations to
simplify online navigation.
• I have characterized the inter-observer variability and error rate in man-
ually identifying abbreviations in text.
• I have distinguished types of abbreviations that should be handled in sec-
ond generation abbreviation identification algorithms.
Finding Gene Names
• I have developed an algorithm to recognize names of genes and proteins
from free text. The algorithm produces a score that can be adjusted to
trade recall for precision. However, a tradeoff can be chosen such that
both precision and recall exceed that of previous approaches.
• I have developed a novel feature, morphology, for recognizing gene names.
• I have automatically discovered a list of signal and non-signal words associated with gene names. The signals include, as high-scoring words, those from previous hand-generated lists. The use of non-signal words is novel; they have not been used previously.
• I have discovered that deep analysis on single gene symbols can yield
better performance than a method that requires information from several
words.
• I have compared the performance of machine learning algorithms on this task and found that support vector machines perform best.
Classifying Gene-Drug Relationships
• I have applied the co-occurrence algorithm to a new problem domain —
gene-drug relationships. Similar to previous reports on protein-protein
interactions, co-occurring entities appear to have some biological relation-
ship.
• I have classified sentences with gene-drug co-occurrences into classes de-
fined by the PharmGKB.
• I have predicted the relationships of the genes and drugs in PharmGKB
by combining the scores from the sentences in which they co-occur.
Online Access to Algorithms
• I have created a web site for delivering information about abbreviations
in MEDLINE.
• I have developed XML-RPC web services for hosting algorithms over the
internet. I have demonstrated that web services can allow cross platform
and language independent access to code.
7.2 Future Work
An online resource of gene-drug relationships will be useful for pharmacoge-
nomics scientists. Although the data itself is beneficial, having it in a struc-
tured format allows scientists to generate and test more sophisticated biologi-
cal queries by linking the data to those in other databases. For example, understanding pharmacogenomic relationships (genetic variations that cause adverse reactions) requires information about polymorphisms, such as the data
stored in dbSNP (Smigielski et al., 2000). Genes that have many polymor-
phisms and are also known to interact with many drugs may be possible can-
didates to include in a clinical test for drug sensitivity. Similarly, information
about the mechanism of drug relationships may exist in the structure or path-
way of the gene product.
Ultimately, such a database of gene-drug relationships may be compiled automatically using the tools described in this thesis. It is important to note that many of the methods and ideas are not specific to pharmacogenomics. For
example, methodologies for identifying relationships can be applied to classify
gene-gene interactions. Also, the general types of features may be useful for
information extraction algorithms in other domains.
Although this thesis presents methodological advances toward the goal of
automatic identification of gene-drug relationships, many areas of the problem
still warrant further investigation.
7.2.1 Disambiguating Gene Names
Considerable ambiguity still exists in the definition of a gene name. Previous studies have found substantial disagreement among experts on whether a word is a gene (or protein) name (Krauthammer et al., 2000). In common
use, there can be little or no distinction between genes and their products,
types of genes, gene families, complexes, peptides, motifs, domains, alleles, and
gene structure such as introns and exons. This causes considerable difficulty
in the development of gene name identification algorithms. Without a clear
delineation of the different entities, comparing the performance of algorithms
is difficult.
Another issue related to gene and protein names is synonymy. Many genes have multiple names; for example, the BRCA1 gene is also known as PSCP and RNF53.
Algorithms that extract information about genes from literature must handle
synonyms to resolve references to the same gene.
However, even if a gene does not have synonyms, its name may still
vary in ways that confound simple string matching algorithms. For example,
hemoglobin beta may be written hemoglobin B or, more rarely in MEDLINE
abstracts, hemoglobin β. This is also the same gene as beta hemoglobin. Thus,
matching gene names is more complicated than matching strings. One possible way to recognize this is to normalize the gene name into a structured form in which each of the names appears similarly:

Core: hemoglobin
Specifier: beta

I present a possible approach to gene name normalization, without evaluation, in Appendix E.
7.2.2 Formal Descriptions of Data
One way to resolve the ambiguities around gene names is to organize those
and related terms in an ontology. Such a formalism complements work in in-
formation extraction. Ontologies could define the different types of biological
entities related to genes (such as proteins or gene families), and the relationships between them. For example, a protein is a product of a gene, an exon is a part of a gene, or, more specifically, hemoglobin is a type of globin. Using strict formal definitions provides informative distinctions for the user and
enables more specific characterization of the performance of algorithms. As de-
scribed in Section 3.3.3, even seemingly simple concepts such as abbreviations
may have subtle ambiguities that lead to disagreements and cause difficulties
in evaluating algorithms.
Using ontologies would also be important for inferring the family or type
of gene or drug. My current methods do not account for classes of genes or
drugs. In the linguistics literature, this relationship is called hypernymy. If the text analyzed contains references to broad classes of drugs, the algorithms
would not recognize that those may include specific names in the drug lexicon.
For example, they would not recognize that an abstract describing information
about anti-hypertensive drugs may also apply to Propranolol. To handle these
distinctions, algorithms should employ existing hierarchical lists of drugs, such
as the one produced by the commercial company Apelon. However, for gene
names, new methods will need to be developed to recognize hypernyms in text.
Another important use for ontologies is to guide the development of stan-
dardized data sets. If a data set is annotated according to an ontology, it will
contain the distinctions in the ontology and thus be useful for other users also
interested in those distinctions. However, without precise definitions, notions
of distinctions may be different enough so that the data set would have to be
reannotated. The Yapex test collection introduced in Chapter 4 contains a por-
tion of text from the GENIA data set (Franzen et al., 2002; Ohta et al., 2002).
However, the Yapex notion of protein was generally broader than the one in GENIA. For example, Yapex includes c-jun as a protein, although GENIA describes it more specifically as a DNA domain or region. Conversely, not all GENIA DNA domain or region names, such as promoter and proximal region, were proteins in Yapex. Presumably because of this type of inconsistency in definitions, the researchers reannotated the text in Yapex that was derived from GENIA.
Even if the community could agree on standard definitions of terms relevant
to biological text mining, there will be ambiguity in the definitions. Thus, an ontology should also include the expected inter-observer variability of different
distinctions in a manner similar to that reported in Section 3.3.3. This would
be a reasonable upper bound on the performance of algorithms.
7.2.3 Annotated Text Data
Several projects, notably GENIA, have produced high-quality annotated data
sets. However, as the knowledge domain becomes more complex and the dis-
tinctions more subtle, the amount of data needed to develop algorithms, to eval-
uate them meaningfully, and to distinguish them with statistical significance
will grow. The lack of annotated data sets can hinder research into algorithm development.
Developing many of my methods required significant labor to produce anno-
tated training sets. In addition, the lack of a proper gold standard significantly
hindered my ability to develop methods in Section 5.4 to score the correctness
of gene-drug relationships.
Although developing data sets requires manual verification, leveraging automated methods may mitigate some of the labor costs. Current machine learning methods such as active learning or co-training can generate labeled data sets from little training data. Thus, these methods may be able to simplify
the task by generating an initial low resolution “draft” of the annotated data
set. Some have argued that algorithms generated from noisy training data
may perform competitively with those trained on higher quality data (Morgan
et al., 2003). Assuming that verifying annotations is simpler than composing
them, these methods may be able to help efficiently develop larger quantities
of annotated data.
7.2.4 Full Text Articles
The final limitation of the work in this thesis is the reliance on information
in MEDLINE abstracts. Although abstracts should contain a summary of the
important findings in the paper, clearly there is more detailed and useful infor-
mation in the full text. Analyzing the full document should increase the recall
performance of algorithms. Fortunately, full text documents may soon be per-
vasive due to efforts such as PubScience and communication technologies such
as the internet (Roberts et al., 2001).
However, full text documents may introduce challenges not present when
analyzing abstracts. Journal articles contain many sections with information
other than new research findings. For example, Methods sections contain in-
formation about the experimental protocol that may be irrelevant to the re-
search hypothesis. Furthermore, Discussion sections often contain specula-
tions that may not yet be proven. Therefore, when full text articles are readily
available, the legitimacy of the information extracted should be analyzed with
respect to its source section.
7.3 Final Conclusions
Having structured knowledge in computable form will accelerate research in
pharmacogenomics and other health disciplines. Using tools that can automat-
ically extract the knowledge from unstructured literature will ensure access to
the most current known information. Once these knowledge bases are avail-
able, researchers will be able to develop algorithms that can link diverse types
of biological data to propose novel, and perhaps clever, biological hypotheses.
Only at that point will the scientific community have met the challenge pre-
sented by Swanson, when he threw down the gauntlet and charged the com-
munity to find the “undiscovered public knowledge.”
APPENDIX A

Training Maximum Entropy
The maximum entropy classifier has a statistical formulation that is provably
equivalent to the information theoretic one (Berger et al., 1996). Viewed this way, maximum entropy searches for the feature weightings that best explain the training set. This becomes an optimization problem of maximizing the likelihood of the model, which can be solved with generic algorithms such as conjugate gradient ascent (equivalently, descent on the negated objective). Since I have not been able to locate the derivation of this formulation, I provide it here for completeness.
The function to maximize is defined:

L(\alpha) = \sum_{x,y} p(x,y) \log p(y|x)

This is reminiscent of the conditional entropy formula. Here, x is an observation vector and y is a class. Thus, p(x,y) is the empirical probability that the observation occurs with the class, and p(y|x) is the probability of the class given an observation, calculated from the model.
The formula for the model is, according to the maximum entropy formulation:

p_\alpha(y|x) = \frac{1}{Z(x)} \prod_i e^{\alpha_i f_i(x,y)}

Z(x) is a normalization factor that ensures the result is a probability:

Z(x) = \sum_y \prod_i e^{\alpha_i f_i(x,y)}
Substituting the model into L(\alpha) yields the likelihood of the parameters \alpha in terms of the weighted feature values f_i(x,y):

L(\alpha) = \sum_{x,y} p(x,y) \log \frac{1}{Z(x)} \prod_i e^{\alpha_i f_i(x,y)}

This likelihood is calculated by summing over all combinations of x and y from the training set. p(x,y) is the probability of seeing a particular combination and provides a weight for the features based on their prevalence in the training set.
Distributing the log:

L(\alpha) = \sum_{x,y} p(x,y) \Big[ \sum_i \alpha_i f_i(x,y) - \log Z(x) \Big]

L(\alpha) = \sum_{x,y} p(x,y) \sum_i \alpha_i f_i(x,y) - \sum_{x,y} p(x,y) \log Z(x)

and expanding Z(x):

L(\alpha) = \sum_{x,y} p(x,y) \sum_i \alpha_i f_i(x,y) - \sum_{x,y} p(x,y) \log \Big( \sum_y \prod_i e^{\alpha_i f_i(x,y)} \Big)
Summing through the y variable in the second term (since \log Z(x) does not depend on y):

L(\alpha) = \sum_{x,y} p(x,y) \sum_i \alpha_i f_i(x,y) - \sum_x p(x) \log \Big( \sum_y \prod_i e^{\alpha_i f_i(x,y)} \Big)

yields the final version of the function to be maximized.
Maximizing this function with conjugate gradient ascent requires the partial derivatives with respect to each of the \alpha_i parameters. The partial derivative is:

\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_x p(x) \frac{1}{\sum_y \prod_i e^{\alpha_i f_i(x,y)}} \sum_y \Big[ \Big( \prod_i e^{\alpha_i f_i(x,y)} \Big) f_i(x,y) \Big]
Moving \sum_y out:

\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_{x,y} p(x) \frac{1}{\sum_y \prod_i e^{\alpha_i f_i(x,y)}} \Big( \prod_i e^{\alpha_i f_i(x,y)} \Big) f_i(x,y)
\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_{x,y} p(x) \frac{1}{Z(x)} \Big( \prod_i e^{\alpha_i f_i(x,y)} \Big) f_i(x,y)

\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_{x,y} p(x) \, p(y|x) f_i(x,y)
The partial derivatives can be interpreted as the difference between the feature expectations under the training data and under the model. Thus, the extreme value of the likelihood function is found where this gradient is 0, i.e. where there is no difference between the training data and the model.
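The derivation can be checked numerically. The sketch below implements the likelihood and its gradient for a toy two-class problem; the empirical distribution and the two feature functions are invented for illustration, and the analytic gradient is compared against a finite-difference estimate.

```python
import math

# Toy training distribution: observations x in {0, 1}, classes y in {0, 1},
# with empirical joint probabilities p(x, y).  The data and the two binary
# feature functions f_i(x, y) are invented for illustration.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
features = [lambda x, y: 1.0 if x == y else 0.0,
            lambda x, y: 1.0 if y == 1 else 0.0]

def p_model(y, x, alpha):
    """p_alpha(y|x): exponentiated weighted features, normalized by Z(x)."""
    unnorm = lambda yy: math.exp(sum(a * f(x, yy) for a, f in zip(alpha, features)))
    return unnorm(y) / (unnorm(0) + unnorm(1))

def log_likelihood(alpha):
    """L(alpha) = sum_{x,y} p(x,y) log p_alpha(y|x)."""
    return sum(p * math.log(p_model(y, x, alpha)) for (x, y), p in p_xy.items())

def gradient(alpha):
    """dL/dalpha_i = E_data[f_i] - E_model[f_i], the final formula above."""
    p_x = {0: p_xy[(0, 0)] + p_xy[(0, 1)], 1: p_xy[(1, 0)] + p_xy[(1, 1)]}
    grad = []
    for f in features:
        e_data = sum(p * f(x, y) for (x, y), p in p_xy.items())
        e_model = sum(p_x[x] * p_model(y, x, alpha) * f(x, y)
                      for x in (0, 1) for y in (0, 1))
        grad.append(e_data - e_model)
    return grad

# The analytic gradient agrees with a finite-difference estimate.
alpha, eps = [0.3, -0.2], 1e-6
for i in range(len(alpha)):
    shifted = list(alpha)
    shifted[i] += eps
    numeric = (log_likelihood(shifted) - log_likelihood(alpha)) / eps
    assert abs(numeric - gradient(alpha)[i]) < 1e-4
print("gradient check passed")
```

Any gradient-based optimizer, conjugate gradient included, only needs these two functions to fit the weights.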
APPENDIX B

Sentence Boundary Disambiguation
Sentence boundary identification is useful for text processing. Many tagging
and parsing applications require separate sentences as input.
Unfortunately, finding sentence boundaries is not straightforward because
of ambiguities in sentence-ending punctuation. Periods appear in many contexts, such as 0.05, N.I.H., or G. Bush (Table B.1). Therefore, determining
whether punctuation indicates a sentence boundary requires special process-
ing. Researchers have approached this problem both with heuristics and sta-
tistical models such as neural networks and maximum entropy models (Palmer
and Hearst, 1994; Reynar and Ratnaparkhi, 1997).
Fortunately, MEDLINE abstracts are relatively well structured. In general,
the text is regular and the sentences are well-formed. Thus, I use a simple set
of heuristics to find sentence boundaries (Figure B.1).
• Only '.', '?', '!', and '"' can be sentence boundaries.
• There is always a sentence boundary at the end of the text.
• A sentence boundary cannot precede another sentence boundary.
• Sentence boundaries always precede whitespace.
• Question marks and exclamation points mark the end of the sentence, as long as they are not followed by quotes.
• Quotation marks are sentence boundaries if they follow a sentence boundary character.
• A period followed by whitespace followed by a capital letter is a sentence boundary.

Figure B.1: Heuristics for Sentence Boundary Disambiguation. This heuristic does a reasonable job of finding sentence boundaries in MEDLINE abstracts.
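A rough rendering of the core of these heuristics follows. It implements only the main rule (sentence-ending punctuation followed by whitespace and a capital letter), not the quote-handling rules, and is a sketch rather than the code used in the thesis.

```python
import re

# A '.', '?', or '!' followed by whitespace and then a capital letter is
# taken as a sentence boundary; periods inside tokens (14.2), species
# names (E. coli), and abbreviations before lowercase words are left alone.
_BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")

def split_sentences(text):
    """Split a MEDLINE-style abstract into sentences."""
    return _BOUNDARY.split(text.strip())

abstract = ("We measured activity at 14.2 nM. The E. coli strain grew "
            "normally. No inhibition was observed.")
for sentence in split_sentences(abstract):
    print(sentence)
```

As Table B.1 suggests, this simple rule still fails when the next sentence starts with a number or a lowercase-initial gene name.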
Handled? | Description | Examples
√ | Period is inside a token | 14.2
√ | Punctuation repeated for emphasis | !!!
√ | Abbreviation | 300 ng i.p., N.I.H.-approved
√ | Period is inside a quote | ."
√ | Species name | E. coli
√ | E.C. numbers | EC 1.6.2.4
  | Next sentence starts with a number | "observed. 5′-Deletion"
  | Next sentence starts with a lower case letter | p-Nitrophenol, hCG
  | Numbering a list | 1. 2. 3.
  | Abbreviation in a name | Mol. Brain Res.

Table B.1: Sentence Boundary Ambiguities. This table shows some cases where the ends of sentences are ambiguous. The first column indicates whether each case is correctly handled by my heuristic.
APPENDIX C

Gene Drug Relationships
I manually identified the relationships between 100 pairs of genes and drugs.
See Section 5.4 for more information.
Gene | Drug | Abs | Relationship | PMID
Aldehyde dehydrogenase | Ethanol | 291 | Gene catalyzes drug metabolism. | 11762132
Glutathione S-transferase | Glutathione | 277 | Drug is substrate of gene. | 345769
Angiotensin-converting enzyme | Insulin | 178 | Gene affects sensitivity to drug. | 2220797
Glucocorticoid receptor | Steroids | 165 | Drug is substrate of gene. | 366226
CYP1A1 | ethoxyresorufin | 151 | Gene metabolizes drug. | 6422171
CYP2D6 | Quinidine | 145 | Drug inhibits gene. | 12867484
CYP2D6 | Mephenytoin | 115 | Drug is substrate of related gene. | 8861658
Angiotensin-converting enzyme | Beta blockers | 109 | Drug treats condition related to gene. | 191300
Angiotensin-converting enzyme | hydralazine | 88 | Drug inhibits gene. | 2416221
CYP3A4 | Mephenytoin | 84 | Drug is substrate of related gene. | 8861658
CYP1A2 | Mephenytoin | 82 | Drug is substrate of related gene. | 8861658
CYP2C9 | Mephenytoin | 81 | Drug is substrate of related gene. | 8861658
Angiotensin-converting enzyme | propranolol | 78 | Drug treats condition related to gene. | 2531184
MRP | Conjugates | 76 | Drug is non-specific. |
Angiotensin-II receptor | Enalapril | 75 | Drug inhibits gene in the pathway of the gene. | 10076917
Alcohol dehydrogenase | Glutathione | 73 | Drug is metabolite of gene. | 12631283
Angiotensin-II receptor | captopril | 68 | Drug inhibits gene in the pathway of the gene. | 10052650
Angiotensin-converting enzyme | calcium channel blockers | 65 | Drug treats condition related to gene. | 2487803
CYP1A2 | phenacetin | 61 | Gene metabolizes drug. | 7678502
CYP1A2 | Quinidine | 60 | Drug inhibits related gene. | 7895609
CYP2E1 | Glutathione | 60 | Gene activity influenced by drug concentration. | 9101035
CYP3A4 | dextromethorphan | 60 | Gene metabolizes drug. | 9352574
Angiotensin-converting enzyme | digoxin | 58 | Drug treats condition related to gene. | 7503006
CYP1A2 | Benzo(a)pyrene | 57 | Drug is substrate of gene. | 8819302
CYP1A2 | dextromethorphan | 57 | Related gene metabolizes drug. | 9352574
CYP1A2 | Tolbutamide | 56 | Related gene metabolizes drug. | 9431831
CYP3A4 | Tolbutamide | 56 | Related gene metabolizes drug. | 9431831
CYP2D6 | Tolbutamide | 55 | Related gene metabolizes drug. | 9431831
CYP2E1 | Mephenytoin | 55 | Related gene metabolizes drug. | 11523064
DT-diaphorase | Glutathione | 55 | Drug transferase and gene activity were sometimes similar. | 2391358
CYP2C9 | Quinidine | 54 | Related gene metabolizes drug. | 7720520
CYP2D6 | Fluvoxamine | 54 | Gene metabolism of drug is not clinically significant. | 8846617
CYP2E1 | ethoxyresorufin | 53 | Related gene metabolizes drug. | 10383922
CYP2C19 | Tolbutamide | 51 | Gene metabolizes drug. | 10411572
CYP2C19 | dextromethorphan | 50 | Related gene metabolizes drug. | 10859141
CYP2C9 | dextromethorphan | 50 | Related gene metabolizes drug. | 10859141
CYP2D6 | caffeine | 50 | Related gene metabolizes drug. | 9867310
Glucocorticoid receptor | Insulin | 50 | Low drug levels associated with gene polymorphism. | 12351458
MDR1 | Glutathione | 50 | Levels of drug transferase and gene are correlated. | 1348425
CYP2D6 | Coumarin | 49 | Related gene metabolizes drug. | 10828259
CYP2E1 | Coumarin | 49 | Related gene metabolizes drug. | 10828259
CYP1A2 | Coumarin | 47 | Related gene metabolizes drug. | 10828259
CYP3A4 | Coumarin | 46 | Related gene metabolizes drug. | 10828259
CYP2C19 | Quinidine | 44 | Related gene metabolizes drug. | 7640150
ALDH2 | Ethanol | 43 | Gene metabolizes drug. | 3067025
CYP3A4 | ethoxyresorufin | 42 | Related gene metabolizes drug. | 10725303
Catechol-O-methyltransferase | dopamine | 42 | Gene activity reduces amount of drug available. | 8190296
CYP1A1 | Glutathione | 41 | Drug reduces inhibition of gene. | 12519694
CYP1A2 | Glutathione | 41 | Inactivation of gene not affected by drug. | 11714871
CYP2C19 | Fluvoxamine | 41 | Drug inhibits gene. | 10674711
ALDH2 | Glutathione | 1 | No relationship. |
Angiotensin-II receptor | androgens | 1 | No relationship. |
CYP1A1 | Fluorouracil | 1 | No relationship. |
CYP1A1 | Lovastatin | 1 | Drug does not influence gene activity. | 11523064
CYP1A1 | lidocaine | 1 | No relationship. |
CYP1A2 | Histamine | 1 | Competitor for drug had no effect on gene activity. | 9485522
CYP1A2 | Macrolides | 1 | Drug inhibits related gene. | 8119047
CYP1B1 | omeprazole | 1 | No relationship. |
CYP1B1 | phenytoin | 1 | Drug is substrate of gene family. | 9493761
CYP2A6 | Oral contraceptives | 1 | Drug does not affect metabolic rate of gene. | 9653923
CYP2A6 | glucuronide | 1 | Drug is substrate of related gene. | 11377097
CYP2A6 | procainamide | 1 | Related gene metabolizes drug. | 9352574
CYP2C19 | Steroids | 1 | Gene family metabolizes drug. | 7704034
CYP2C19 | halothane | 1 | No relationship. |
CYP2C9 | ascorbic acid | 1 | No relationship. |
CYP2C9 | codeine | 1 | Gene does not affect rate of drug metabolism. | 9113345
CYP2C9 | flecainide | 1 | Drug inhibits gene. | 8801060
CYP2D6 | hydralazine | 1 | No relationship. |
CYP2E1 | dopamine | 1 | Gene is expressed in cells containing drug. | 9881865
CYP3A5 | aflatoxin | 1 | Gene activates drug. | 10224324
CYP3A7 | phenytoin | 1 | Drug is substrate for family of gene. | 9493761
DT-diaphorase | glucuronide | 1 | No relationship. |
Dihydropyrimidine dehydrogenase | Bilirubin | 1 | Disabled gene leads to increased level of drug. | 10348793
Dihydropyrimidine dehydrogenase | Glutathione | 1 | No relationship. |
GSTM1 | vinyl chloride | 1 | Drug may be metabolized by gene. | 9838066
GSTP1 | Cyclosporin A | 1 | Drug activity leads to reduced gene function. | 11108662
Glucocorticoid receptor | Quinidine | 1 | No relationship. |
Glucocorticoid receptor | Tacrolimus | 1 | Drug is substrate of gene. | 9600660
Glucocorticoid receptor | menadione | 1 | Drug binds to gene. | 11311319
Glutathione S-transferase | Histamine | 1 | Gene influences drug action. | 10372823
Glutathione S-transferase | paclitaxel | 1 | Drug did not enhance influence of NSAIDs on gene activity. | 9849488
MRP | Benzo(a)pyrene | 1 | No relationship. |
MRP | ethoxyresorufin | 1 | No relationship. |
MRP | naproxen | 1 | Increased toxicity observed with active drug and overexpressed gene. | 9849488
MRP | propranolol | 1 | No relationship. |
MRP | tamoxifen | 1 | No relationship. |
N-Acetyltransferase | Conjugates | 1 | No relationship. |
N-Acetyltransferase | Insulin | 1 | Drug leads to activation of gene. | 7017937
N-Acetyltransferase | vinyl chloride | 1 | Gene does not contribute to disease caused by gene. | 1458774
NAT1 | Phenacetin | 1 | Drug exposure does not lead to gene polymorphisms. | 9761125
NAT2 | Irinotecan | 1 | No relationship. |
Peroxisome proliferator-activated receptor | Glutathione | 1 | No relationship. |
Peroxisome proliferator-activated receptor | naringenin | 1 | Drug had no effect on gene expression. | 11245597
Prothrombin | p-aminobenzoic acid | 1 | No relationship. |
Stromelysin | Dexamethasone | 1 | No relationship. |
Sulfotransferase | Benzo(a)pyrene | 1 | No relationship. |
Sulfotransferase | Oral contraceptives | 1 | Drug may lead to changes in gene activity. | 6934248
UGT1A1 | Ethanol | 1 | Drug influences activity of gene. | 11091029
Vitamin D receptor | tamoxifen | 1 | No relationship. |
uridine diphosphate-glucuronosyltransferase | glucuronide | 1 | Drug is substrate of gene. | 9054958

Table C.1: Relationships Between Genes and Drugs. This table describes the relationships between 100 gene-drug pairs found to co-occur in the literature but not identified in a review by Evans & Relling. The first 50 are the genes and drugs that appear in the greatest number of abstracts. The last 50 are randomly chosen genes and drugs that appear in one abstract only. The Abs column gives the number of abstracts that contained both the gene and drug. The Relationship column describes the relationship between the gene and drug.
APPENDIX D

Classifying PharmGKB Relationships
Chapter 5 presented an algorithm for predicting the relationships between
genes and drugs. I applied this algorithm to a list of genes and drugs from
the Pharmacogenomics Knowledge Base. The following table contains a com-
plete list of the predictions.
Gene | Drug | Actual Rel. | Predicted Rel.
ABCC8 | tolbutamide | PD | PD
ACE | ace inhibitors | Gn PD | Gn PD
ACE | atenolol | PD | Gn PD
ACE | captopril | PD | PD
ACE | enalapril | PD | Gn PD
ACE | enalaprilat | Gn PD | PD
ACE | fluvastatin | Gn PD CO | Gn PD
ACE | fosinopril | Gn FA PD CO | Gn PD
ACE | gemfibrozil | Gn PD | Gn PD
ACE | imidapril | PD | PD
ACE | lisinopril | Gn PD | Gn PD
ALDH2 | ethanol | Gn PD | Gn PD
ALDH2 | vinyl chloride | Gn PD | Gn PK PD
APOA1 | testosterone | PD | Gn
APOE | choline | PD | FA
APOE | donepezil | PD | PD
APOE | estrogens | Gn | PD
APOE | fenofibrate | PD | FA PD
APOE | gemfibrozil | Gn PD | PD
APOE | simvastatin | Gn FA PD CO | PD
APOE | tacrine | Gn PD CO | Gn PD
BCHE | succinylcholine | CO | Gn PD
C3 | gemfibrozil | Gn PD | Gn PD
CETP | pravastatin | PD CO | PD
CFTR | cpx | Gn FA |
COMT | methyldopa | PK PD | PK PD
CYP1A2 | amiodarone | PK | PK
CYP1A2 | caffeine | PK | PK
CYP1A2 | clozapine | PK | PK PD
CYP1A2 | estradiol | PK | PD
CYP1A2 | fluvoxamine | PK | PK PD
CYP1A2 | haloperidol | PK | PK PD
CYP1A2 | imipramine | PK | PK
CYP1A2 | modafinil | PK | PK PD
CYP1A2 | naproxen | PK | PK
CYP1A2 | ondansetron | PK | PK
CYP1A2 | propranolol | PK | PK
CYP1A2 | riluzole | PK | PK
CYP1A2 | ropivacaine | PK | PK PD
CYP1A2 | tacrine | PK | PK
CYP1A2 | theophylline | PK | PK
CYP1A2 | ticlopidine | PK | PK PD
CYP1A2 | verapamil | PK | PK
CYP1A2 | zolmitriptan | PK | PK
CYP1A2 | zoxazolamine | PD | PK
CYP1B1 | estrogens | CO | Gn PK
CYP2A6 | fadrozole | Gn | Gn PD CO
CYP2A6 | fluorouracil | PK | PK
CYP2A6 | rifampin | FA |
CYP2A6 | tegafur | PK | PK
CYP2B6 | aflatoxin b1 | FA PD | PK
CYP2B6 | bupropion | PK | PK
CYP2B6 | cyclophosphamide | FA PK PD | PK
CYP2B6 | ifosfamide | PK | PK
CYP2B6 | phenobarbital | PK | FA
CYP2B6 | rifampin | PK | PK
CYP2C19 | amitriptyline | PK | PK
CYP2C19 | cyclophosphamide | PK | Gn PK PD
CYP2C19 | diazepam | Gn PK PD | Gn PK PD
CYP2C19 | fluoxetine | PK | PK PD
CYP2C19 | fluvoxamine | PK | PK PD
CYP2C19 | hexobarbital | PK | Gn PK PD
CYP2C19 | lansoprazole | Gn PK CO | Gn PK PD
CYP2C19 | mephenytoin | FA | PK
CYP2C19 | modafinil | PK | PK PD
CYP2C19 | nelfinavir | PK | PK PD
CYP2C19 | omeprazole | Gn PK PD CO | Gn PK PD
CYP2C19 | pantoprazole | PK | Gn PK
CYP2C19 | proguanil | PK | PK
CYP2C19 | ticlopidine | FA | PK PD
CYP2C8 | paclitaxel | PK PD CO | PK
CYP2C8 | rifampin | FA |
CYP2C9 | acenocoumarol | Gn PD | PK
CYP2C9 | amiodarone | PK | PK PD
CYP2C9 | celecoxib | PK | PK PD
CYP2C9 | diclofenac | PK | PK PD
CYP2C9 | fluconazole | PK | Gn PD
CYP2C9 | fluoxetine | PK | PK PD
CYP2C9 | fluvastatin | PK | Gn PK PD
CYP2C9 | fluvoxamine | PK | PK
CYP2C9 | glimepiride | PK | PD
CYP2C9 | glyburide | PK | PK PD
CYP2C9 | ibuprofen | CO | PK
CYP2C9 | isoniazid | PK | PK PD
CYP2C9 | losartan | PK | Gn PK PD
CYP2C9 | phenylbutazone | PK | PD CO
CYP2C9 | phenytoin | PK PD CO | PK PD
CYP2C9 | rifampin | FA |
CYP2C9 | tolbutamide | PK PD CO | PK PD
CYP2C9 | torsemide | PK | PK
CYP2C9 | warfarin | Gn PK PD CO | Gn PK PD
CYP2D6 | amitriptyline | PK | PK
CYP2D6 | cimetidine | PK | PK
CYP2D6 | clomipramine | PK | PK
CYP2D6 | clozapine | FA | PK
CYP2D6 | cocaine | PK | PK
CYP2D6 | codeine | Gn PK PD CO | PK PD
CYP2D6 | debrisoquine | Gn PK | PK
CYP2D6 | desipramine | PK | PK PD
CYP2D6 | dextromethorphan | PK | PK
CYP2D6 | diltiazem | PK | PK
CYP2D6 | flecainide | PK | Gn PK PD
CYP2D6 | fluoxetine | Gn PK CO | PK PD
CYP2D6 | fluvoxamine | PK | PK PD
CYP2D6 | haloperidol | Gn PK PD | PK PD
CYP2D6 | imipramine | PK | PK
CYP2D6 | interferon alpha | PK | Gn PD CO
CYP2D6 | metoprolol | Gn PK PD | PK
CYP2D6 | mexiletine | Gn PK PD | PK
CYP2D6 | morphine | CO | PK PD
CYP2D6 | paroxetine | PK | PK PD
CYP2D6 | perphenazine | PK | Gn PK PD
CYP2D6 | propafenone | PK PD | PK PD
CYP2D6 | risperidone | Gn PK PD | Gn PK PD
CYP2D6 | ritonavir | PK | PD
CYP2D6 | sparteine | PD | PK
CYP2D6 | thioridazine | PK | PK PD
CYP2D6 | timolol | PK | PD
CYP2D6 | tramadol | PK PD | PK
CYP2D6 | xenobiotics | Gn PK | Gn PK
CYP2D6 | zuclopenthixol | PK | Gn PK PD
CYP2E | alcohol | PK PD |
CYP2E | ethanol | Gn FA PD | PD
CYP2E | xenobiotics | Gn FA PK PD | FA
CYP3A4 | alprazolam | PK | PK PD
CYP3A4 | epipodophyllotoxin | Gn PD | Gn PK PD
CYP3A4 | nifedipine | Gn PK | PK
CYP3A4 | omeprazole | PK PD | PK PD
CYP3A4 | testosterone | Gn PD | PK
CYP3A4 | xenobiotics | Gn FA | Gn PK
CYP3A5 | midazolam | Gn FA PD | PK
CYP3A5 | tacrolimus | PK | PK CO
CYP3A5 | xenobiotics | Gn PD | FA PK
CYP4B1 | xenobiotics | Gn | FA PK PD
DHFR | methotrexate | PK | PK
DRD3 | clozapine | Gn PD | PD
DRD3 | neuroleptics | Gn PD CO | PD
DRD4 | antipsychotics | Gn | CO
F2 | estrogens | CO | PD
F2 | oral contraceptives | Gn CO | Gn PD
GSTA1 | busulfan | Gn FA PK | PK
GSTA1 | xenobiotics | Gn | FA PD
GSTM1 | tacrine | PK PD | Gn PD
GSTM1 | xenobiotics | CO | Gn PK
GSTT1 | xenobiotics | CO | PK
HLA-DRB1 | cyclosporine | PD | Gn PD
HNMT | histamine | Gn FA | Gn PD
HTR2A | clozapine | Gn PD CO | Gn PD
HTR2C | clozapine | PD | Gn
HTR6 | clozapine | Gn | PD
INPP1 | lithium | Gn PD | PD
LPL | fenofibrate | PD | PD
MGMT | carmustine | PD |
NAT1 | sulfamethoxazole | FA | PK PD
NAT1 | xenobiotics | Gn | Gn FA PK
NAT2 | caffeine | Gn PK | PK
NAT2 | hydralazine | PK PD | Gn PK PD
NAT2 | isoniazid | PK | Gn PK PD
NAT2 | procainamide | Gn PK PD | PK PD
NAT2 | sulfamethoxazole | PK | PK PD
NAT2 | xenobiotics | Gn FA CO | Gn PK
NQO1 | benzene | PK PD CO | PK
NQO1 | mitomycin c | CO | PK
NQO1 | xenobiotics | Gn FA | FA
NR1I2 | xenobiotics | Gn FA | FA
OPRM1 | opiates | FA | Gn FA PD
RYR1 | halothane | PD |
SCN5A | mexiletine | PD | Gn PD
SULT | xenobiotics | FA | Gn
SULT1A1 | phenol | Gn FA | PK
SULT2A1 | dehydroepiandrosterone | Gn FA | PK
TNF | thalidomide | CO |
TPH | fluvoxamine | PD | PD
TPH | paroxetine | PD | PD
TPMT | 6-mercaptopurine | Gn PK PD CO | Gn PK PD CO
TPMT | 6-thioguanine | Gn PD | Gn FA PK PD CO
TPMT | azathioprine | Gn FA PK PD CO | Gn PD
TPMT | cefazolin | Gn PD | Gn PK PD CO
TPMT | mercaptopurine | Gn FA PK PD CO | Gn FA PK PD CO
TPMT | sulfasalazine | PK | Gn FA PK PD CO
TPMT | thioguanine | Gn FA PK PD CO | Gn FA PK PD CO
TPMT | thiopurines | Gn FA PD | Gn FA PK PD CO
UGT1A1 | irinotecan | PK PD CO | PK PD
UGT1A9 | flavopiridol | FA | PK PD
UGT2B7 | epirubicin | FA PK | PK
VDR | vitamin d | FA | FA
XRCC1 | alcohol | PD | Gn PK PD

Table D.1: Gene-Drug Classification Results. I predicted the type of relationship for gene-drug pairs. The Actual Rel. column lists the relationships identified by a PharmGKB curator; the Predicted Rel. column lists the relationships my algorithm predicted. See Section 5.6 for a description of the algorithm.
APPENDIX E

Gene Name Normalization
The names of many genes can exhibit diverse variations. For example, beta
hemoglobin and hemoglobin B are the same gene. Thus, one way to recognize
these as synonyms of the same gene is to normalize them to a common form.
In this form, they will be easier to compare. Matching these variations in
gene names is important for information retrieval engines. If the engine cannot recognize a variation of a name, it will not be able to recognize a relevant document.
As part of a submission to the information retrieval task of the 2003 TREC
Genomics Track, I developed a gene name normalization heuristic using word
matching (Voorhees, 2002b). Assuming that a limited number of words appear
peripherally to the core words of the gene name, it is possible to enumerate
those words nearly comprehensively. The words are then classified into types
useful for normalization. For example, two gene names that differ by a function
word (e.g. inhibitor) likely indicate different genes. On the other hand, if they
differ only by a specifier (e.g. the number 5), then they may be in the same
family.
144
APPENDIX E. GENE NAME NORMALIZATION
I started with an initial set of word classes defined in ProMiner (Hanisch
et al., 2003). I applied the algorithm to a list of gene names derived from the TREC training set. This produced a list of words assigned to types according to the classification. Because the classification was not complete, many of the words were unassigned. I assigned types to the most frequently occurring
words, creating new types when necessary. Then, I repeated these steps until
I was satisfied with the coverage of the classification.
In the tables below, I present the classification of the words in gene names
into types. Every word that does not appear in this classification is a putative
Core gene name.
Entity: molecule, protein, gene, oncogene, dna, cdna, rna, trna, mrna, fragment, fragments, peptide, polypeptide, neuropeptide, precursor, product, transcript, clone, factor, factors, antigen, isoform

Function: activator, inactivator, inhibitor, inhibitory, binding, interacting, converting, modulating, repair, sorting, transporter, transporting, regulating, regulator, regulatory, silencer, suppressor

Structure: complex, component, subcomponent, chain, subunit, domain, subdomain, region, promoter, repeat

Location: membrane, golgi

Relationship: associated, coupled, interactor, interaction, dependent, regulated, receptor, receptors, ligand, substrate, substrates, carrier, agonist, antagonist, member, family, superfamily, homolog, related, like

Modifier: activated, induced, inducible, catalytic, pending, putative, variant, partial

Specifier: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, i, ii, iii, iv, v, vi, vii, viii, ix, x, xi, xii, xiii, xiv, xv, xvi, xvii, xviii, xix, xx, alpha, beta, gamma, epsilon, kappa, delta, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, 1a, 2d6

Common Word: and, of, at, in, by, include, included, see, also, small, heavy, stiff, major

Not Gene: cocaine, substance, fibroblast, hormone, disease, syndrome, synthesis, transcription, lymphoproliferative, containing, differential, display, generalized, growth, progressive, signal, transducer

Table E.1: Classes of Words in Gene Names. This table lists the words that appear in gene names, organized by type of word.
Clearly, this is not a comprehensive list of all words that can appear in a
gene name. Also, note that many words have morphologic variations, such as
fragment and fragments. A final system should use a morphologic stemmer to
recognize variants, even if they do not appear in this list. Also, this list includes
many sequences, such as numbers or Greek letters. The final algorithm should
recognize these without requiring a comprehensive enumeration.
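The word-class heuristic can be sketched as follows. The class table is abbreviated to a few illustrative entries from Table E.1, and the naive plural stripping stands in for the morphologic stemmer suggested above; any word not in the table is taken as a Core word.

```python
# Abbreviated word-class table from Table E.1 (an illustrative subset).
WORD_CLASSES = {
    "gene": "Entity", "protein": "Entity", "receptor": "Relationship",
    "inhibitor": "Function", "domain": "Structure", "of": "Common Word",
    "beta": "Specifier", "b": "Specifier", "2": "Specifier",
}

def normalize(name):
    """Map a gene name to {word class: [words]}; unknown words are Core."""
    normalized = {}
    for word in name.lower().split():
        # Naive plural stripping stands in for a real morphologic stemmer.
        if word.endswith("s") and word[:-1] in WORD_CLASSES:
            word = word[:-1]
        cls = WORD_CLASSES.get(word, "Core")
        normalized.setdefault(cls, []).append(word)
    return normalized

# 'hemoglobin beta' and 'hemoglobin B' share the same Core word and
# differ only by a Specifier, so they may name the same gene or family.
print(normalize("hemoglobin beta"))
print(normalize("hemoglobin B"))
```

Two normalized names that agree on Core and Function words but differ by a Specifier can then be flagged as members of the same family.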
Bibliography
Acronym Finder. Acronym finder. http://www.acronymfinder.com/, 2002.

Acronyms and Initialisms. Acronyms and initialisms for health information resources. http://www.geocities.com/~mlshams/acronym/acr.htm, 2002.

DP Agarwal. Genetic polymorphisms of alcohol metabolizing enzymes. Pathologie-biologie, 49(9):703–709, Nov 2001.

SF Altschul, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, Oct 1990.

MA Andrade and P Bork. Automated extraction of information in molecular biology. FEBS Letters, 476:12–7, 2000.

Jonathan H. Aseltine. Wave: An incremental algorithm for information extraction. In Proceedings of the AAAI 1999 Workshop on Machine Learning for Information Extraction, 1999.

M Ashburner, CA Ball, JA Blake, D Botstein, H Butler, JM Cherry, AP Davis, K Dolinski, SS Dwight, JT Eppig, MA Harris, DP Hill, L Issel-Tarver, A Kasarskis, S Lewis, JC Matese, JE Richardson, M Ringwald, GM Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25:25–9, 2000.

A Bairoch and B Boeckmann. The swiss-prot protein sequence data bank. Nucleic Acids Research, 19 Suppl:2247–2249, Apr 1991.

PG Baker, A Brass, S Bechhofer, C Goble, N Paton, and R Stevens. Tambis—transparent access to multiple bioinformatics information sources. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 6:25–34, 1998.

JD Baxter. Mechanisms of glucocorticoid inhibition of growth. Kidney International, 14(4):330–333, Oct 1978.

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996. URL citeseer.nj.nec.com/berger96maximum.html.
Biopython. Biopython. http://www.biopython.org/, 1998.

MV Blagosklonny and AB Pardee. Conceptual biology: Unearthing the gems. Nature, 416(6879):373, Mar 2002.

C Blaschke, MA Andrade, C Ouzounis, and A Valencia. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 60–7, 1999.

Fintan Bolton. Pure CORBA. SAMS, 2001.

E. Brill. Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), 1994.

PP Bringuier, M McCredie, G Sauter, M Bilous, J Stewart, MJ Mihatsch, P Kleihues, and H Ohgaki. Carcinomas of the renal pelvis associated with smoking and phenacetin abuse: p53 mutations and polymorphism of carcinogen-metabolising enzymes. International Journal of Cancer, 79(5):531–536, Oct 1998.

Christopher Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

M. Califf and R. Mooney. Relational learning of pattern-match rules for information extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99). AAAI Press / MIT Press, 1999.

N. Catala, N. Castell, and M. Martin. Essence: a portable methodology for acquiring information extraction patterns. In Proceedings of the 14th European Conference on Artificial Intelligence, Berlin, 2000.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Jeffrey T Chang, Hinrich Schutze, and Russ B Altman. Gapscore: Finding gene and protein names one word at a time. Bioinformatics, Accepted.

JT Chang and RB Altman. Promises of text processing: natural language processing meets ai. Drug Discovery Today, 7(19):992–993, Oct 2002.
JT Chang, S Raychaudhuri, and RB Altman. Including biological literature improves homology search. Pacific Symposium on Biocomputing, pages 374–383, 2001.

JT Chang, H Schutze, and RB Altman. Creating an online dictionary of abbreviations from medline. Journal of the American Medical Informatics Association, 9(6):612–620, Nov–Dec 2002.

RO Chen, R Felciano, and RB Altman. Riboweb: linking structural computations to a knowledge base of published experimental data. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5:84–87, 1997.

China Medical Tribune. China medical tribune. http://www.cmt.com.cn/, 2002.

LH Cohen, RE van Leeuwen, GC van Thiel, JF van Pelt, and SH Yap. Equally potent inhibitors of cholesterol synthesis in human hepatocytes have distinguishable effects on different cytochrome p450 enzymes. Biopharmaceutics and Drug Disposition, 21(9):353–364, Dec 2000.

Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley-Interscience, 1991.

M Craven and J Kumlien. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 77–86, 1999.

Mark Craven. The genomics of a signaling pathway: A kdd cup challenge task. Technical report, University of Wisconsin, December 2002.

Hamish Cunningham. Information extraction – a user guide. Technical report, University of Sheffield, 1999.

Cytochrome P450 Homepage. Cytochrome p450 homepage. http://drnelson.utmem.edu/CytochromeP450.html, 2003.

JN Darroch and D Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470–1480, 1972.

J Ding, D Berleant, D Nettleton, and E Wurtele. Mining medline: abstracts, sentences, or phrases? Pacific Symposium on Biocomputing, pages 326–337, 2002.
T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 1993.

Robert Englander. Developing Java Beans. O’Reilly & Associates, 1997.

WE Evans and MV Relling. Pharmacogenomics: translating functional genomics into rational therapeutics. Science, 286(5439):487–491, Oct 1999.

J Firth. A Synopsis of Linguistic Theory 1930-1955. Philological Society, Oxford, 1957.

D Fisher, S Soderland, J McCarthy, F Feng, and W Lehnert. Description of the umass system as used for muc-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 127–140, San Francisco, CA, 1995. Morgan Kaufmann.

George Forman. Feature engineering for a gene regulation prediction task. Technical report, HP Labs, December 2002.

K Franzen, G Eriksson, F Olsson, L Asker, P Liden, and J Coster. Protein names and how to find them. International Journal of Medical Informatics, 67(1–3):49–61, Dec 2002.

D. Freitag. Machine learning for information extraction from online documents, 1996. URL citeseer.nj.nec.com/freitag96machine.html.

D. Freitag and A. McCallum. Information extraction with hmms and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.

Dayne Freitag. Machine Learning for Information Extraction in Informal Domains. PhD thesis, Carnegie Mellon University, November 1998.

Dayne Freitag and Andrew McCallum. Information extraction with hmm structures learned by stochastic optimization. In Proceedings of the Eighteenth Conference on Artificial Intelligence (AAAI-98), 2000.

C Friedman. Towards a comprehensive medical language processing system: methods and issues. Proceedings of the AMIA Annual Symposium, pages 595–599, 1997.

C Friedman. A broad-coverage natural language processing system. Proceedings of the AMIA Annual Symposium, pages 270–4, 2000.
C Friedman, PO Alderson, JH Austin, JJ Cimino, and SB Johnson. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association, 1:161–74, 1994.

C Friedman, L Hirschman, R McEntire, and C Wu. Linking biological language, information and knowledge. Pacific Symposium on Biocomputing, 8:388–390, 2003.

C Friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17 Suppl 1:S74–S82, Jun 2001.

K Fukuda, A Tamura, T Tsunoda, and T Takagi. Toward information extraction: identifying protein names from biological papers. In Pacific Symposium on Biocomputing, volume 3, pages 707–18, 1998.

Moustafa M. Ghanem, Yike Guo, Huma Lodhi, and Yong Zhang. Automatic scientific text classification using local patterns: Kdd cup 2002 (task 1). Technical report, Imperial College of Science Technology & Medicine, December 2002.

N Hamaoka, Y Oda, I Hase, and A Asada. Cytochrome p4502b6 and 2c9 do not metabolize midazolam: kinetic analysis and inhibition study with monoclonal antibodies. British Journal of Anaesthesia, 86(4):540–544, Apr 2001.

BP Hamilton. Diabetes mellitus and hypertension. American Journal of Kidney Diseases, 16(4 Suppl 1):20–29, Oct 1990.

D Hanisch, J Fluck, HT Mevissen, and R Zimmer. Playing biology’s name game: identifying protein names in scientific text. Pacific Symposium on Biocomputing, 8:403–414, 2003.

T Hastie, R Tibshirani, and J Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2001.

V Hatzivassiloglou, PA Duboue, and A Rzhetsky. Disambiguating proteins, genes, and rna in text: a machine learning approach. Bioinformatics, 17 Suppl 1:S97–S106, Jun 2001.

M Hewett, DE Oliver, DL Rubin, KL Easton, JM Stuart, RB Altman, and TE Klein. Pharmgkb: the pharmacogenetics knowledge base. Nucleic Acids Research, 30(1):163–165, Jan 2002.
L Hirschman, AA Morgan, and AS Yeh. Rutabaga by any other name: extracting biological names. Journal of Biomedical Informatics, 35(4):247–259, Aug 2002a.

L Hirschman, JC Park, J Tsujii, L Wong, and CH Wu. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12):1553–1561, Dec 2002b.

T Hishiki, N Collier, C Nobata, T Okazaki-Ohta, N Ogata, T Sekimizu, R Steiner, HS Park, and J Tsujii. Developing nlp tools for genome informatics: An information extraction perspective. In Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, volume 9, pages 81–90, 1998.

J. Hobbs, R. Douglas, E. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. MIT Press, Cambridge, MA, 1996.

Wen-Juan Hou and Hsin-Hsi Chen. Enhancing performance of protein name recognizers using collocation. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 25–32, 2003.

S Huffmann. Learning information extraction patterns from examples, pages 246–260. Springer-Verlag, Berlin, 1996.

Human Genome Acronym List. Human genome acronym list. http://www.ornl.gov/hgmis/acronym.html, 2002.

BL Humphreys, DA Lindberg, HM Schoolman, and GO Barnett. The unified medical language system: an informatics research collaboration. Journal of the American Medical Informatics Association, 5(1):1–11, Jan–Feb 1998.

K Humphreys, G Demetriou, and R Gaizauskas. Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. In Pacific Symposium on Biocomputing, volume 5, pages 505–16, 2000.

David Hutchinson. Medline for Health Professionals: How to Search PubMed on the Internet. New Wind, Sacramento, 1998.

Stanley Jablonski, editor. Dictionary of Medical Acronyms & Abbreviations. Hanley & Belfus, 1998.
WB Jakoby. The glutathione s-transferases: a group of multifunctional detoxification proteins. Advances in Enzymology and Related Areas in Molecular Biology, 46:383–414, 1978.

DC Jamison. Open bioinformatics. Bioinformatics, 19(6):679–680, Apr 2003.

YN Jan. Pre-empting the arrival of a dark lord. Nature, 389(6652):665, Oct 1997.

TK Jenssen, A Laegreid, J Komorowski, and E Hovig. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28(1):21–28, May 2001.

Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical report, Universitat Dortmund, 1997.

PD Karp, M Riley, SM Paley, and A Pelligrini-Toole. Ecocyc: an encyclopedia of escherichia coli genes and metabolism. Nucleic Acids Research, 24(1):32–39, Jan 1996.

Jun’ichi Kazama, Takaki Makino, Yoshihiro Ohta, and Jun’ichi Tsujii. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the ACL 2002 Workshop on Natural Language Processing in Biomedicine, pages 1–8, 2002.

S. Sathiya Keerthi, Chong Jin Ong, Keng Boon Siah, David B.L. Lim, Wei Chu, Min Shi, David S. Edwin, Rakesh Menon, Lixiang Shen, Jonathan Y.K. Lim, and Han Tong Loh. A machine learning approach for the curation of biomedical literature – kdd cup 2002 (task 1). Technical report, National University of Singapore, December 2002.

BW Kernighan and DM Ritchie. The C Programming Language. Prentice Hall, Upper Saddle River, NJ, 1988.

Jun-Tae Kim and Dan I. Moldovan. Palka: A system for lexical knowledge acquisition. In Proceedings of the International Conference on Information and Knowledge Management, pages 124–131, 1993.

KT Kitchin. Laboratory methods for ten hepatic toxification/detoxification parameters. Methods Find Exp Clin Pharmacol, 5(7):439–448, Sep 1983.
TE Klein, JT Chang, MK Cho, KL Easton, R Fergerson, M Hewett, Z Lin, Y Liu, S Liu, DE Oliver, DL Rubin, F Shafa, JM Stuart, and RB Altman. Integrating genotype and phenotype information: an overview of the pharmgkb project. Pharmacogenetics research network and knowledge base. Pharmacogenomics Journal, 1(3):167–170, 2001.

D Knuth. The TeXbook. Addison-Wesley, Reading, Massachusetts, 1986.

Daphne Koller and Mehran Sahami. Toward optimal feature selection. In International Conference on Machine Learning, pages 284–292, 1996. URL citeseer.nj.nec.com/koller96toward.html.

Adam Kowalczyk and Bhavani Raskutti. One class svm for yeast regulation prediction. Technical report, Telstra Research Laboratories, December 2002.

M Krauthammer, A Rzhetsky, P Morozov, and C Friedman. Using blast for identifying gene and protein names in journal articles. Gene, 259:245–252, 2000.

Leah S. Larkey, Paul Ogilvie, M. Andrew Price, and Brenden Tamilio. Acrophile: an automated acronym extractor and server. In ACM DL, pages 205–214, 2000. URL citeseer.nj.nec.com/larkey00acrophile.html.

Simon St. Laurent, Edd Dumbill, and Joe Johnston. Programming Web Services with XML-RPC. O’Reilly & Associates, 2001.

J Lazarou, BH Pomeranz, and PN Corey. Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies. Journal of the American Medical Association, 279(15):1200–1205, Apr 1998.

Ki-Joong Lee, Young-Sook Hwang, and Hae-Chang Rim. Two-phase biomedical ne recognition based on svms. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 33–40, 2003.

W Lehnert, C Cardie, D Fisher, J McCarthy, E Riloff, and S Soderland. University of Massachusetts: Description of the circus system as used in muc. In Proceedings of the Fourth Message Understanding Conference (MUC-4), pages 282–288, San Mateo, CA, 1992. Morgan Kaufmann.

H Liu, YA Lussier, and C Friedman. A study of abbreviations in the umls. Proceedings of the AMIA Annual Symposium, pages 393–397, 2001.

LocusLink. Locuslink. http://www.ncbi.nlm.nih.gov/LocusLink/GeneRIFhelp.html, 2003.
M Lutz, D Ascher, and F Willison. Learning Python. O’Reilly, Sebastopol, CA, 1999.

WH Majoros, GM Subramanian, and MD Yandell. Identification of key concepts in biomedical literature using a modified markov heuristic. Bioinformatics, 19(3):402–407, Feb 2003.

R Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning, pages 49–55, 2002.

Christopher D Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

A. McCallum, D. Freitag, and F. Pereira. Maximum entropy markov models for information extraction and segmentation. In Proceedings of ICML-2000, 2000.

HL McLeod and WE Evans. Pharmacogenomics: unlocking the human genome for better drug therapy. Annual Review of Pharmacology and Toxicology, 41:101–121, 2001.

UA Meyer. Pharmacogenetics and adverse drug reactions. Lancet, 356(9242):1667–1671, Nov 2000.

Alex Morgan, Lynette Hirschman, Alexander Yeh, and Marc Colosimo. Gene name extraction using flybase resources. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 1–8, 2003.

Mouse Genome Database. Mouse genome database. http://www.informatics.jax.org/mgihome/nomen/short_gene.shtml, 2002.

Chuck Musciano and Bill Kennedy. HTML & XHTML. O’Reilly & Associates, 2002.

MySQL. Mysql. http://www.mysql.com/, 2003.

MA Namboodiri, JT Favilla, and DC Klein. Pineal n-acetyltransferase is inactivated by disulfide-containing peptides: insulin is the most potent. Science, 213(4507):571–573, Jul 1981.

M Narayanaswamy, KE Ravikumar, and K Vijay-Shanker. A biological named entity recognizer. Pacific Symposium on Biocomputing, 8:427–438, 2003.
SB Needleman and CD Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, Mar 1970.

G Nenadic, I Spasic, and S Ananiadou. Automatic acronym acquisition and management with domain specific texts. In Proceedings of LREC-3, 3rd International Conference on Language, Resources and Evaluation, pages 2155–2162, 2002.

S. Ng and M. Wong. Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Informatics, 10:104–112, 1999.

K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999. URL citeseer.nj.nec.com/nigam99using.html.

Numerical Python. Numerical python. http://numpy.sourceforge.net/, 2003.

Numerical Recipes Home Page. Numerical recipes home page. http://www.nr.com/, 2003.

T Ohta, Y Tateisi, H Mima, and J Tsujii. Genia corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the Human Language Technology Conference, 2002.

T Ono, H Hishigaki, A Tanigami, and T Takagi. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics, 17(2):155–161, Feb 2001.

Opaui. Opaui guide to lists of acronyms, abbreviations, and initialisms on the world wide web. http://www.opaui.com/acro.html, 2002.

Y Oyanagui. Immunosuppressants enhance superoxide radical/nitric oxide-dependent dexamethasone suppression of ischemic paw edema in mice. European Journal of Pharmacology, 344(2–3):241–249, Mar 1998.

David D. Palmer and Marti A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 78–83, Stuttgart, 1994. Morgan Kaufmann.

Jong C Park. Using combinatory categorial grammar to extract biomedical information. IEEE Intelligent Systems, November/December:62–67, 2001.
Jong C Park, Hyun Sook Kim, and Jung Jae Kim. Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. In Pacific Symposium on Biocomputing, volume 6, pages 396–407, 2001.

PharmGKB. The pharmacogenomics knowledge base. http://www.pharmgkb.org/, 2003.

KA Phillips, DL Veenstra, E Oren, JK Lee, and W Sadee. Potential role of pharmacogenomics in reducing adverse drug reactions: a systematic review. JAMA, 286(18):2270–2279, Nov 2001.

WH Press, BP Flannery, SA Teukolsky, and WT Vetterling. Numerical Recipes in C. Cambridge University Press, New York, NY, 1993.

D Proux, F Rechenmann, and L Julliard. A pragmatic information extraction strategy for gathering data on genetic interactions. In ISMB, volume 8, pages 279–85, 2000.

D Proux, F Rechenmann, L Julliard, V Pillet, and B Jacq. Detecting gene symbols and names in biological texts: A first step toward pertinent information extraction. In Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, volume 9, pages 72–80, 1998.

KD Pruitt, KS Katz, H Sicotte, and DR Maglott. Introducing refseq and locuslink: curated human genome resources at the ncbi. Trends in Genetics, 16(1):44–47, Jan 2000.

J Pustejovsky, J Castanno, B Cochran, M Kotecki, and M Morrell. Automatic extraction of acronym-meaning pairs from medline databases. Medinfo, 10(Pt 1):371–375, 2001.

Adwait Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998.

S Raychaudhuri, JT Chang, PD Sutphin, and RB Altman. Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Research, 12(1):203–214, Jan 2002.

Yizhar Regev, Michal Finkelstein-Landau, Ronen Feldman, Maya Gorodetsky, Xin Zheng, Samuel Levy, Rosane Charlab, Charles Lawrence, Ross A. Lippert, Qing Zhang, and Hagit Shatkay. Rule-based extraction of experimental evidence in the biomedical domain – the kdd cup 2002 (task 1). Technical report, ClearForest and Celera, December 2002.
J. Reynar and A. Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16–19, Washington D.C., 1997.

Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), pages 811–816. AAAI Press / MIT Press, 1993.

Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996.

TC Rindflesch, L Tanabe, JN Weinstein, and L Hunter. Edgar: extraction of drugs, genes and relations from the biomedical literature. In Pacific Symposium on Biocomputing, volume 5, pages 517–28, 2000.

RJ Roberts, HE Varmus, M Ashburner, PO Brown, MB Eisen, C Khosla, M Kirschner, R Nusse, M Scott, and B Wold. Information access. Building a "genbank" of the published literature. Science, 291(5512):2318–2319, Mar 2001.

AD Roses. Pharmacogenetics and the practice of medicine. Nature, 405(6788):857–865, Jun 2000.

AD Roses. Pharmacogenetics. Human Molecular Genetics, 10(20):2261–2267, Oct 2001.

A Rzhetsky, T Koike, S Kalachikov, SM Gomez, M Krauthammer, SH Kaplan, P Kra, JJ Russo, and C Friedman. A knowledge model for analysis and simulation of regulatory networks. Bioinformatics, 16:1120–1128, 2000.

SAIC Information Extraction. Introduction to information extraction. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/, 2003.

Gerard Salton. Automatic Information Organization and Retrieval. McGraw-Hill, 1968.

S Schulze-Kremer. Ontologies for molecular biology. Pacific Symposium on Biocomputing, pages 695–706, 1998.

AS Schwartz and MA Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, pages 451–462, 2003.
T Sekimizu, HS Park, and J Tsujii. Identifying the interaction between genes and gene products based on frequently seen verbs in medline abstracts. In Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, volume 9, pages 62–71, 1998.

Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. Effective adaptation of hidden markov model-based named entity recognizer for biomedical domain. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 49–56, 2003.

NR Smalheiser and DR Swanson. Using arrowsmith: a computer-assisted approach to formulating and assessing scientific hypotheses. Computer Methods and Programs in Biomedicine, 57(3):149–153, Nov 1998.

Joseph Smarr and Christopher Manning. Classifying unknown proper noun phrases without context. Technical report, Stanford University, 2002.

EM Smigielski, K Sirotkin, M Ward, and ST Sherry. dbsnp: a database of single nucleotide polymorphisms. Nucleic Acids Research, 28(1):352–355, Jan 2000.

James Snell, Doug Tidwell, and Pavel Kulchenko. Programming Web Services with SOAP. O’Reilly & Associates, 2001.

S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34:1–44, 1999.

S Soderland, D Fisher, J Aseltine, and W Lehnert. Crystal: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314–1319, 1995.

BJ Stapley and G Benoit. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in medline abstracts. In Pacific Symposium on Biocomputing, pages 529–40, 2000.

BJ Stapley, LA Kelley, and MJ Sternberg. Predicting the sub-cellular location of proteins from text using support vector machines. Pacific Symposium on Biocomputing, pages 374–385, 2002.

LD Stein. Integrating biological databases. Nature Reviews Genetics, 4(5):337–345, May 2003.

M Stephens, M Palakal, S Mukhopadhyay, and R Raje. Detecting gene relations from medline abstracts. In Pacific Symposium on Biocomputing, 2001.
Kazem Taghva and Jeff Gilbreth. Recognizing acronyms and their definitions. Technical report, ISRI (Information Science Research Institute), UNLV, June 1995.

L Tanabe and W John Wilbur. Tagging gene and protein names in full text articles. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pages 9–13, 2002a.

L Tanabe and WJ Wilbur. Tagging gene and protein names in biomedical text. Bioinformatics, 18(8):1124–1132, Aug 2002b.

T Tateishi, M Watanabe, H Nakura, M Tanaka, T Kumai, SF Sakata, N Tamaki, K Ogura, T Nishiyama, T Watabe, and S Kobayashi. Dihydropyrimidine dehydrogenase activity and fluorouracil pharmacokinetics with liver damage induced by bile duct ligation in rats. Drug Metabolism and Disposition, 27(6):651–654, Jun 1999.

Julian Templeman and John Paul Mueller. COM Programming with Microsoft .NET. Microsoft Press, 2003.

J Thomas, D Milward, C Ouzounis, S Pulman, and M Carroll. Automatic extraction of protein interactions from scientific abstracts. In Pacific Symposium on Biocomputing, volume 5, pages 541–52, 2000.

Three-Letter Abbreviations. The great three-letter abbreviation hunt. http://www.atomiser.demon.co.uk/abbrev/, 2002.

Yoshimasa Tsuruoka and Jun’ichi Tsujii. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 41–48, 2003.

Si Usuzaka, KL Sim, M Tanaka, H Matsuno, and S Miyano. A machine learning approach to reducing the work of experts in article selection from database: A case study for regulatory relations of s. cerevisiae genes in medline. Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, 9:91–101, 1998.

Ellen M. Voorhees. Overview of trec 2002. In The Eleventh Text Retrieval Conference, 2002a.

Ellen M. Voorhees. Overview of trec 2002. In The Eleventh Text Retrieval Conference (TREC 2002), 2002b.
Edwin C Webb. Enzyme Nomenclature 1992. Academic Press, 1992.

Julia A White, Lois J Maltais, and Daniel W Nebert. An increasingly urgent need for standardized gene nomenclature. Technical report, University of London, 2002.

Limsoon Wong. Pies, a protein interaction extraction system. In Pacific Symposium on Biocomputing, volume 6, pages 520–531, 2001.

JD Wren and HR Garner. Heuristics for identification of acronym-definition patterns within text: towards an automated construction of comprehensive acronym-definition dictionaries. Methods for Information in Medicine, 41(5):426–434, 2002.

Akane Yakushiji, Yuka Tateisi, Yusuke Miyao, and Jun’ichi Tsujii. Event extraction from biomedical papers using a full parser. In Pacific Symposium on Biocomputing, volume 6, pages 408–419, 2001.

Yiming Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69–90, 1999. URL citeseer.nj.nec.com/yang97evaluation.html.

Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, pages 412–420, 1997. URL citeseer.nj.nec.com/yang97comparative.html.

Roman Yangarber and Ralph Grishman. Machine learning of extraction patterns from unannotated corpora. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000) Workshop on Machine Learning for Information Extraction, Berlin, Germany, 2000.

Stuart Yeates. Automatic extraction of acronyms from text. In New Zealand Computer Science Research Students’ Conference, pages 117–124, 1999. URL citeseer.nj.nec.com/yeates99automatic.html.

Stuart Yeates, David Bainbridge, and Ian H. Witten. Using compression to identify acronyms in text. In Data Compression Conference, page 582, 2000. URL citeseer.nj.nec.com/288921.html.

Alexander Yeh, Lynette Hirschman, and Alexander Morgan. Background and overview for kdd cup 2002 task 1: Information extraction from biomedical articles. Technical report, The MITRE Corporation, December 2002.
AS Yeh, L Hirschman, and AA Morgan. Evaluation of text data mining for database curation: lessons learned from the kdd challenge cup. Bioinformatics, 19 Suppl 1:I331–I339, Jul 2003.

M Yoshida, K Fukuda, and T Takagi. Pnad-css: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics, 16:169–75, 2000.

H Yu and E Agichtein. Extracting synonymous gene and protein terms from biological literature. Bioinformatics, 19 Suppl 1:I340–I349, Jul 2003.

H Yu, V Hatzivassiloglou, C Friedman, A Rzhetsky, and WJ Wilbur. Automatic extraction of gene and protein synonyms from medline and journal articles. Proceedings of the AMIA Annual Symposium, pages 919–923, 2002a.

H Yu, G Hripcsak, and C Friedman. Mapping abbreviations to full forms in biomedical articles. Journal of the American Medical Informatics Association, 9:262–272, 2002b.

Mohammed J. Zaki, Jason T.L. Wang, and Hannu T.T. Toivonen. Biokdd 2002: Recent advances in data mining for bioinformatics. Technical report, Rensselaer Polytechnic Institute, New Jersey Institute of Technology, and University of Helsinki, December 2002.