USING MACHINE LEARNING TO EXTRACT DRUG
AND GENE RELATIONSHIPS FROM TEXT
A DISSERTATION
SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Jeffrey T. Chang
September 2003
© Copyright by Jeffrey T. Chang 2004
All Rights Reserved
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Russ B. Altman, Principal Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Douglas L. Brutlag
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Serafim Batzoglou
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Hinrich Schutze
Approved for the University Committee on Graduate Studies.
Abstract
Interpatient variability in responses to drugs leads to millions of hospitaliza-
tions every year. To help prevent these failures, the discipline of pharmacoge-
nomics intends to characterize the genomic profiles that may lead to unde-
sirable drug responses. Pharmacogenomic scientists must integrate research
findings across the genomic, molecular, cellular, tissue, organ, and organismic
levels. To address this challenge, I have developed methods to extract infor-
mation relevant to pharmacogenomics from the literature. These methods can
serve as the foundation for powerful tools that help scientists synthesize infor-
mation and generate new biological hypotheses.
Specifically, this thesis covers novel applications and extensions of super-
vised machine learning algorithms to extract relationships between genes and
drugs automatically. This task comprises several problems that must be solved
separately. Thus, I have also developed algorithms to identify and score gene
names and their abbreviations from text. I have framed these tasks as classi-
fication problems, where the computer must integrate diverse evidence to pro-
duce a decision. I identified features that captured information relevant to the
problem and then encoded them into representations suitable for classification.
To extract a comprehensive list of gene-drug relationships, an algorithm
must find gene and protein names from text. Using such an algorithm, the
computer could identify newly coined gene names. My approach to this prob-
lem achieved 83% recall at 82% precision. Since many of these names were ab-
breviations, e.g. TPMT for Thiopurine Methyltransferase, I developed an abbre-
viation identification algorithm that found these concurrences with 84% recall
at 81% precision. The final algorithm classified relationships between genes
and drugs into five categories with 74% accuracy.
Finally, I have made these algorithms and other results available on the
internet at http://bionlp.stanford.edu/. These are available both as
human-accessible web pages and computer-accessible web services.
Acknowledgements
Paraphrasing someone else, it takes a village to produce a thesis. Thomas
Kuhn acknowledged that science is a social endeavour when he wrote that
“Scientific knowledge, like language, is intrinsically the common property of
a group or else nothing at all.” It is true that during my own education, I have
incurred many debts to those who have generously shared their knowledge and
wisdom, and also to those who have helped and encouraged me in other ways.
However, I will approach my acknowledgements differently. I hope to thank
people, rather than names, by illustrating the context in which they have im-
pacted my life and work.
My interests in science were undoubtedly inherited from my dad, who is
also a scientist. Ever since I can remember, Dad has kept books on his nightstand to be read before bed. He reads voraciously and consumes books covering
a broad span of subjects. When I began graduate school, his interests turned
toward the life sciences so that he could learn about the field I was pursuing.
Dad often called me to tell me about recent books he read or biological issues
he had been thinking about. In a sense, he has experienced my graduate edu-
cation with me.
Mom, on the other hand, taught me to do my best work ever since grade
school. Then, I often lost points on homework assignments when I would ne-
glect to copy the problems or circle the answers. To me, they were just details.
To my mother, those were the easy points. The real challenge was in the actual
problem, so why lose the easy points? After years of gentle chiding, I have fi-
nally begun to learn those lessons. Plus, the problems have become strikingly
more difficult.
My sister Katherine is not pursuing science and has instead been making
a living in New York. It is nice to hear from her once in a while to learn how
things are going in the real world.
Ten years ago, almost to the day, my parents sent me off to college. As an
undergraduate at Stanford, although I had officially studied biology, I also had
an interest in computers. Thus, I took on a job as a section leader (like an un-
dergraduate TA) for computer science classes. While teaching, I met Mehran
Sahami. Although he was a graduate student then, Mehran had been an un-
dergraduate at Stanford as well. Naturally, undergraduates regarded him with
a sense of awe due to his long tenure here. I think he still teaches classes at
Stanford.
In his time here, Mehran accumulated knowledge about everything at Stan-
ford, which would change my life. At the time, I had been frustrated because
I was unsure how to combine my interests in biology and computers. Noth-
ing I had studied in my courses seemed quite like what I wanted to do. As I
shared my thoughts with Mehran (on the steps of the Stanford Bookstore on a
sunny day), he directed me to Russ Altman. It was just like that scene from
Star Wars — “You seek Yoda.” The effect on my career and life was just as
profound, except that Mehran wasn’t Yoda, and neither was Russ, and I had no
midi-chlorians.
After talking to Mehran, I went home that day and looked Russ up in the
phone book. Remember that the year was 1995. Students called professors on
the phone, and initiating contact by email was practically unheard of. Most
students did not regularly use it.1 I was in an unusual demographic because I
checked my email almost every day.
Once I found the phone number, I called Russ and explained what I wanted
to do. The first question he asked was whether I had seen his web page. Web
page? I didn’t know professors had web pages.2 Then, he asked me to check his
web page and email him to set up a meeting. I knew immediately that I would
enjoy working in his lab.
Soon after, I met with Russ to talk about possible research projects. We
met in his office, which back then was a tiny room in the MSOB. As soon as
he started talking, however, his energy, excitement, and exuberance in his
research were palpable and permeated the entire room. Throughout the years,
I would enjoy meetings as ideas would burst forth from Russ like a stack dump.
While my formal classes made science rote, Russ made science fun.
During my first meeting with Russ, I chose a project analyzing protein
structures. At that time, Russ had a graduate student, Liping Wei, work-
ing in that area and introduced us. Liping was developing a creative and novel
approach to analyze protein sites using Bayesian statistics. She introduced me
to the power of statistical methods and machine learning that would form the
basis of this thesis.
1Back then, email addresses to us were just usernames and the @leland.stanford.edu was just understood. Very few students at Stanford had any other email account. Hotmail would not introduce the idea of web-based email until it launched in July 1996.
2Netscape 1.0 had been released less than a year earlier.
Because of my positive undergrad and subsequent experiences, I wanted to
continue studying informatics and returned to Stanford. My plan was to pre-
dict protein function by building and characterizing structural “motifs.” When
I started the graduate program, I undertook a series of rotations through sev-
eral labs. I did my first rotation with an expert in the area, Doug Brutlag.
As a graduate student, Doug studied DNA replication with Arthur Korn-
berg. It was an enviable pedigree and a fantastic launching pad for a career in
“classical” biochemistry. However, at some point, he became interested in com-
putational methods and shifted the focus of his lab so that he could investigate
them. At that time, few people were working in the area, and it turned out to
be an incredibly prescient move. It was that audacity that gave me the drive
to pursue a research topic that few people were investigating.
Next, I rotated through the lab of Michael Levitt, which was a shocking
experience. Michael held weekly lab meetings where, during each meeting, ev-
eryone would talk in turn about what research they did in the previous week.
Finally, as the meeting was winding to a close, Michael picked up a mop and
bucket and scrubbed the room clean for the next group! No, that’s a lie, but
what actually happened was equally unbelievable. He talked about the re-
search he did in the last week. Michael ran his own lab, and his own depart-
ment, but he still had time to write and debug his own C code. Incredible.
Nevertheless, I eventually rejoined Russ’ lab. While investigating meth-
ods to find protein structural motifs, I realized that one of the tough problems
that few people were looking at was the problem of describing what the motif
does. At the time, and still mostly true today, protein function is primarily
documented in the unstructured literature. Thus, to describe the function of a
structural motif, the computer must be able to “read” text and pull out the in-
formation about the function. Heading down this path, I eventually developed
an algorithm to predict protein functions more accurately by having the com-
puter “read” journal papers (Chang et al., 2001). Thus, I became interested in
the problem of using information in literature to analyze biological data, which
is the topic of this dissertation.
Fortunately, Russ was very supportive of my new research direction. Al-
though the subject complemented some of his earlier work on knowledge mod-
elling, at that time, he had no expertise, per se, in text mining. The project
might have died, except by a stroke of luck, in 1999, an expert in statistical
text mining algorithms, Chris Manning, began his appointment in the Department of Computer Science. He provided much of my initial introduction to this new area. Chris ultimately chaired my defense.
It was through Chris that I met someone who would advise me on analyzing
biological literature. Hinrich Schutze came to the Helix Group (as Russ likes
to call his lab) to do research in this area. Hinrich was particularly enthusiastic
about advising students, and was always very generous with his time. Because
of his close involvement with my work, he came to be a co-author on many of
my papers. It was natural for Hinrich to be on my thesis committee.
For the final member of my thesis committee, I sought out Serafim Bat-
zoglou. He was an expert on string algorithms due to his research in genome
analysis and assembly. I met Serafim one day when I dropped by his office.
As I started to tell him about my work, he “got it” instantly and immediately
started to give me new perspectives on my algorithms. Not only was he ex-
tremely sharp, Serafim was also very approachable. We often had informal
meetings when I occasionally bumped into him at the gym, where he would
bench press an impressive amount of weight. Adding Serafim to my committee
guaranteed that somebody’s name would be spelled wrong in the acknowledge-
ments slide of my defense.
Of course, all of this research occurred in the environment, or as Russ might
say gleefully, “milieu,” of the Biomedical Informatics (SBMI) training program.3
It is impossible to talk about this program without mentioning Darlene Vian,
who has been the lifeblood of the program since its beginning. Darlene kept me
moving through the program smoothly, taking care of many important things
behind the scenes (like my stipend checks). As she reduces her involvement
in the program for a much-deserved retirement, she will be sorely missed.
For now, she is continuing the tradition of hosting TGiTh parties at Tennyson
Manor, keeping the wine flowing.
Scientifically, however, SBMI was rich because of the diversity of its students and research scientists. Perhaps due to its interdisciplinary nature,
someone was an expert for nearly every problem. Instead of spending hours
trying to hunt down an answer to a problem, it was very easy to ask someone
who already knew it. For example, Daniel Rubin provided me with expertise
on medicine and pharmacology, Michael Hewett and Mike Liang knew everything about computers, and Soumya Raychaudhuri was an expert in math.
3Technically, I joined the Medical Information Sciences training program. To reflect the current state of the field, the name was updated to the Biomedical Informatics Training program. Therefore, I will invent an acronym here and refer to it as SBMI.
Every dissertation produced at SBMI is the product of many students’ collec-
tive efforts.
Outside of work though, there is some truth to Soumya's remark that SBMI
is a social club. I'm not sure whether he was making an observation
or a wish. Soumya himself had the habit of wandering around afternoons and
chatting with people who were trying to work. However, it is accurate that
SBMI students often socialize. We discovered that much of the research equip-
ment also had recreational uses. The projector for talks doubled as a big screen
movie theater, and the fast network supported games of Age of Empires after
hours.
Outside of lab, Mike Hewett noted that SBMI people tend to join the same
activities. It’s been true for me – at the time, three of us comprised the majority
of a sand volleyball class of about five students. As an avid runner, Mike also
formed the nucleus of a running group that regularly tortured me along with
George Scott and Elmer Bernstam. On off days, Elmer and I would often
play tennis. When Elmer finished the program, I started in a karate class
along with Serge Saxonov and Iwei Yeh. Even with these activities, Brian
Naughton and I still managed to keep up with way too many TV shows. I was
fortunate to have made so many friends in the program. They have also made
science fun.
Finally, while studying here, I met my girlfriend Zhen Lin. I have claimed
throughout these acknowledgements that science is a social endeavour. That’s
not totally true. Usually, pursuing science is isolating and lonely. I spend much
time by myself learning new concepts, pondering ideas, reading papers, and
writing code. At certain graduate school milestones, such as the orals, proposal
defense, and final thesis, the attention science demands had been a source of
tension in our relationship. Zhen has always kept me grounded and reminded
me what was most important. Stay tuned...
Enough stories. It is time to finish this thesis so that I can send it home,
where my father is eagerly waiting for it. I’m sure it will end up on his night-
stand.
Table of Contents
Abstract v
Acknowledgements vii
1 Introduction 1
1.1 Uses of Structured Data in Biology . . . 3
1.2 Unstructured Knowledge in Literature . . . 4
1.3 Discovering Knowledge from Text . . . 5
1.4 Extracting Pharmacogenomics Information . . . 7
1.5 Overview of Thesis . . . 10
2 Statistical Text Analysis 11
2.1 Modelling a Document in Vector Space . . . 11
2.2 Supervised Machine Learning . . . 12
2.2.1 k Nearest Neighbors . . . 14
2.2.2 Naïve Bayes . . . 15
2.2.3 Maximum Entropy . . . 17
2.2.4 Logistic Regression . . . 19
2.2.5 Support Vector Machines . . . 20
2.2.6 Feature Selection . . . 21
2.3 Categorizing Words . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3 Finding Abbreviations 28
3.1 An Algorithm to Identify Abbreviations . . . 31
3.1.1 Finding Abbreviation Candidates . . . 31
3.1.2 Aligning Abbreviations with their Prefixes . . . 33
3.1.3 Computing Features from Alignments . . . 33
3.1.4 Scoring Alignments with Logistic Regression . . . 34
3.1.5 Implementing the Algorithm . . . 35
3.2 Performance of the Abbreviation Identification Algorithm . . . 36
3.3 Clarifying and Reconciling Notions of Abbreviations . . . 38
3.3.1 Reannotating the Medstract Gold Standard . . . 38
3.3.2 Comparing the Medstract and Expert Gold Standards . . . 39
3.3.3 Defining Abbreviations . . . 40
3.4 Compiling the Abbreviations in MEDLINE . . . . . . . . . . . . . 42
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 Identifying Gene Names 51
4.1 An Algorithm to Identify Gene and Protein Names . . . 57
4.1.1 Tokenizing the Sentences . . . 57
4.1.2 Filtering Recognized Words . . . 59
4.1.3 Scoring Words . . . 60
4.1.4 Extending to Noun Phrases . . . 68
4.1.5 Matching Abbreviations . . . 68
4.1.6 Implementation . . . 69
4.2 Performance of GAPSCORE . . . 69
4.3 Conclusions . . . 73
5 Extracting Gene-Drug Relationships 79
5.1 Information Extraction Systems in the NLP Community . . . 82
5.2 Identifying Biological Relationships . . . 84
5.2.1 Co-occurrence . . . 85
5.2.2 Keywords . . . 87
5.2.3 Machine Learning . . . 88
5.2.4 Natural Language Processing . . . 88
5.3 NLP Systems in Biomedicine . . . 89
5.4 Identifying Related Genes and Drugs . . . 92
5.5 Classifying Gene-Drug Relationships . . . 97
5.6 Predicting Gene and Drug Relationships . . . 102
5.7 Conclusions . . . 106
6 Distributing the Algorithms 108
6.1 Clustering Abbreviations to Aid Browsing . . . 110
6.2 Making Servers Computer-Friendly . . . 113
7 Conclusions 116
7.1 Summary of Contributions . . . 118
7.2 Future Work . . . 121
7.2.1 Disambiguating Gene Names . . . 122
7.2.2 Formal Descriptions of Data . . . 123
7.2.3 Annotated Text Data . . . 125
7.2.4 Full Text Articles . . . 125
7.3 Final Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
A Training Maximum Entropy 127
B Sentence Boundary Disambiguation 131
C Gene Drug Relationships 133
D Classifying PharmGKB Relationships 138
E Gene Name Normalization 144
Bibliography 148
List of Tables
2.1 Overview of Machine Learning Algorithms . . . 22
2.2 Parameters for Machine Learning Algorithms . . . 22
2.3 Performance of Machine Learning Algorithms . . . 23
3.1 Features Used to Score Abbreviations . . . 34
3.2 Types of Abbreviations Missed . . . 37
4.1 Overview of Gene/Protein Name Algorithms . . . 52
4.2 MeSH Terms That End with -in . . . 54
4.3 Gene Name Appearance Features . . . 61
4.4 Morphologic Variations in Gene Names . . . 63
4.5 Gene Name Signal Words . . . 66
4.6 Parameters for Support Vector Machines . . . 71
4.7 Comparing Algorithms to Classify Gene Names . . . 72
4.8 Removing Modules Reduces GAPSCORE Performance . . . 72
5.1 Relationships Between Ten Genes and Drugs . . . 96
5.2 Pharmacogenomic Relationships in PharmGKB . . . 98
5.3 Informative Features for Gene-Drug Relationships . . . 101
6.1 Clusters of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . 111
B.1 Sentence Boundary Ambiguities . . . . . . . . . . . . . . . . . . . 132
C.1 Relationships Between Genes and Drugs . . . . . . . . . . . . . . 137
D.1 Gene-Drug Classification Results . . . . . . . . . . . . . . . . . . . 143
E.1 Classes of Words in Gene Names . . . . . . . . . . . . . . . . . . . 147
List of Figures
1.1 Growth of MEDLINE Citations . . . 5
1.2 Architecture for Finding Gene-Drug Relationships . . . 8
1.3 BioNLP Web Server Screenshot . . . 9
2.1 Vector Space Model . . . 12
2.2 Zipf's Law . . . 24
3.1 Abbreviation System Architecture . . . 32
3.2 Abbreviations Predicted in Medstract Gold Standard . . . 37
3.3 Growth of Abbreviations . . . 44
3.4 Scores of Abbreviations Found in China Medical Tribune . . . 45
3.5 Abbreviation Server Screenshot . . . 49
4.1 Recognizing Gene Names . . . 58
4.2 Performance of GAPSCORE . . . 74
5.1 Sample Relationships between Drugs and Genes . . . 85
5.2 Frequency of Gene-Drug Co-occurrences . . . 94
5.3 Scoring Gene-Drug Relationships . . . 100
5.4 Distribution of Relationship Scores . . . 103
5.5 Errors in Gene-Drug Relationships . . . 104
5.6 Common Co-occurrences Classified More Accurately . . . 105
B.1 Heuristics for Sentence Boundary Disambiguation . . . . . . . . . 132
CHAPTER 1
Introduction
According to McLeod and Evans, “Inter-patient variability in response to drug
therapy is the rule, not the exception, for almost all patients” (McLeod and
Evans, 2001). Even correctly prescribed drugs can lead to unexpected effects
such as adverse drug reactions. In 1994, there were 2,216,000 serious adverse
drug reactions, and 106,000 resulted in death, after excluding cases of inappro-
priate administration and use (Lazarou et al., 1998).
Adverse drug reactions occur for many reasons, such as poor patient
compliance and environmental factors. Since the 1950s, scientists have recog-
nized that genetic variations influence drug response, which suggests that ge-
netic tests may be able to predict and prevent adverse drug reactions (Meyer,
2000). Today, advances in sequencing technologies and the availability of un-
precedented quantities of genomic data have empowered pharmacogenomics
research.
Pharmacogenomics studies how variations in the genome, genetic polymor-
phisms, can cause people to respond differently to drugs. One well-studied ex-
ample is the thiopurine methyltransferase (TPMT) gene (McLeod and Evans,
2001). This enzyme inactivates the thiopurine drugs used to treat childhood
leukemia, rheumatoid arthritis, and dermatological disorders. However, 10%
of the population inherit a variant of TPMT that cannot metabolize those drugs
as efficiently. In those people, thiopurine accumulates to toxic levels and in-
creases the risk of secondary malignancies. Because of these types of problems,
scientists are investigating methods to correlate polymorphisms with drug responses, and to apply that knowledge in medical practice (Roses, 2000, 2001).
However, pharmacogenomics investigations are hindered by the vast quan-
tities of information available, produced by diverse disciplines over several
decades of research. Many genes influence the efficacy of drugs, and their
mechanism of action is often unknown (Evans and Relling, 1999). Most stud-
ies have concentrated on either finding variations in drug responses or finding
polymorphisms in genes; few have linked the two (Phillips et al., 2001). To un-
derstand the relationships between drugs and biological systems, researchers
must integrate knowledge covering many fields and draw inferences among
them (for example, to link the effects of drugs with possible molecular path-
ways). They must synthesize research findings across genomic, molecular,
cellular, tissue, organ, and organismic levels. Having tools to organize vast
amounts of diverse information will help scientists quickly mine the literature
and formulate new research hypotheses.
To stimulate the production of pharmacogenomics data, the National Institutes of Health is funding the Pharmacogenetics Research Network (http://www.nigms.nih.gov/pharmacogenetics/) to collect information related
to genotypes and phenotypes (Klein et al., 2001). To manage the data, it is
also funding the Pharmacogenomics Knowledge Base (PharmGKB) at Stan-
ford. The PharmGKB models and stores pharmacogenomics data for the research community. Since the data is available in a computationally accessible format, the data sets may also become the focus of intense bioinformatic research (Hewett et al., 2002).
1.1 Uses of Structured Data in Biology
Knowledge-based systems such as PharmGKB organize data according to on-
tologies. An ontology is a detailed representation of information that unam-
biguously defines 1) the types of data stored, and 2) the relevant relation-
ships between them. For example, PharmGKB contains entities for Gene and
Polymorphism. The fact that a Gene can have zero or more Polymorphisms
is indicated by a HAS-A relationship.
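As a concrete illustration, a HAS-A relationship like this could be modelled as follows. This is a minimal sketch, not the actual PharmGKB schema: the Gene and Polymorphism entity names come from the text above, but the fields shown here are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Polymorphism:
    # Illustrative field only; the real PharmGKB entity is far richer.
    name: str

@dataclass
class Gene:
    symbol: str
    # HAS-A: a Gene owns zero or more Polymorphisms.
    polymorphisms: List[Polymorphism] = field(default_factory=list)

tpmt = Gene(symbol="TPMT")
tpmt.polymorphisms.append(Polymorphism(name="TPMT*3A"))
```

Because the ontology fixes the entity types and their relationships up front, a program can traverse them unambiguously, which is precisely what makes knowledge bases computationally accessible.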
The structure in knowledge bases reduces ambiguity, ensures reliable transfer of data to other representations, and facilitates computational analysis.
Presently, knowledge bases have successfully improved the storage and re-
trieval of information. However, in the long term, knowledge bases may be
used to generate novel research hypotheses. Therefore, scientists have begun
efforts to model biological knowledge in knowledge bases (Karp et al., 1996;
Chen et al., 1997; Baker et al., 1998; Humphreys et al., 1998; Schulze-Kremer,
1998; Ashburner et al., 2000).
Unfortunately, developing knowledge bases and adding data are labor in-
tensive. Currently, experts in a problem domain (e.g. pharmacogenomics) de-
velop knowledge bases manually. They must specify the domain and insert
knowledge according to the ontology. To simplify this task, many researchers
are developing methods to add biological knowledge to knowledge bases auto-
matically.
1.2 Unstructured Knowledge in Literature
One rich source of biological knowledge is the published literature. The MED-
LINE database catalogs nearly all journals related to biology and medicine
(Hutchinson, 1998). It is available over the web using the PubMed interface.
MEDLINE contains 12 million citations and grows by over 400,000 per year1
(see Figure 1.1). 56% of those citations contain abstracts. In addition, journal
articles are increasingly becoming available online. Electronic publishers such
as HighWire Press and PubMed Central are starting to permit access to full
text articles (Roberts et al., 2001).
The knowledge in biomedical literature is undeniably valuable. However,
the vast amount and unstructured nature of the literature proffer challenges
for scientists. Many endeavor to develop computational tools to simplify ac-
cess, understanding, and application of that knowledge. However, the knowl-
edge residing in free text form is inaccessible for computation. Fortunately, the
field of natural language processing (NLP) has been investigating automated
techniques to understand and interpret unstructured literature (Manning and
Schutze, 1999).
A discipline within NLP, information extraction (IE), studies computational
representations and algorithms that can identify relevant information in text
1http://www.nlm.nih.gov/pubs/factsheets/medline.html
[Figure 1.1 chart: MEDLINE Abstracts, 1975-2000; y-axis: Millions of Abstracts]
Figure 1.1: Growth of MEDLINE Citations. MEDLINE contains 12 million citations and grows by 400,000 a year. Over half of the citations contain abstracts.
and map it into predefined structured forms. Ultimately, IE or other NLP tech-
niques will be able to read the literature and populate instances in a knowledge
base.
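To make the idea of a "predefined structured form" concrete, here is a toy sketch. It is not a method from this thesis: the pattern, sentence, and relation vocabulary are invented, and real IE systems use far more robust machinery than a single regular expression.

```python
import re

# Toy template: match sentences of the form "<gene> metabolizes|inactivates <drug>"
# and map them into a fixed (gene, relation, drug) record.
TEMPLATE = re.compile(r"(?P<gene>\w+) (?P<relation>metabolizes|inactivates) (?P<drug>\w+)")

def extract(sentence):
    """Return a structured record for the sentence, or None if no match."""
    m = TEMPLATE.search(sentence)
    return m.groupdict() if m else None

record = extract("TPMT inactivates thiopurine")
# record == {'gene': 'TPMT', 'relation': 'inactivates', 'drug': 'thiopurine'}
```

Once text is reduced to records like this, each record can be inserted as an instance in a knowledge base, which is the populating step described above.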
1.3 Discovering Knowledge from Text
In addition to helping scientists, analyses of the literature may one day be
able to leverage the vast knowledge in MEDLINE to produce novel hypotheses.
Swanson noticed that because of increasing specialization amidst an unman-
ageable amount of literature, researchers are often unaware of relevant infor-
mation or solutions to their problem, even if the knowledge is widely known
in another field (Smalheiser and Swanson, 1998). He theorized that comput-
ers may be able to identify such disconnects and combine knowledge to solve
problems across disciplines.
Swanson found possible relationships between two concepts by searching
for them in MEDLINE and finding words and phrases in the intersection of
the hits. Using this method, he could detect relationships not explicitly stated
in any article. Focusing on treating diseases, Swanson discovered a new treat-
ment for Raynaud’s disease by noticing that it has symptoms that are known
to be alleviated by fish oil.
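The intersection step of Swanson's method can be sketched in a few lines. This is a hypothetical simplification: the documents and terms below are invented to echo the Raynaud's/fish oil example, and his actual procedure involved expert-guided querying of MEDLINE rather than pre-tokenized term sets.

```python
def linking_terms(docs_a, docs_c):
    """Terms appearing both in documents about concept A and in documents
    about concept C; each document is represented as a set of terms."""
    terms_a = set().union(*docs_a)
    terms_c = set().union(*docs_c)
    return terms_a & terms_c

# Invented toy documents echoing the Raynaud's disease / fish oil example:
raynaud_docs = [{"raynaud", "vasoconstriction", "blood viscosity"},
                {"raynaud", "platelet aggregation"}]
fish_oil_docs = [{"fish oil", "blood viscosity"},
                 {"fish oil", "platelet aggregation"}]
bridges = linking_terms(raynaud_docs, fish_oil_docs)
# bridges == {"blood viscosity", "platelet aggregation"}
```

The bridging terms are exactly the relationships "not explicitly stated in any article": no single document mentions both Raynaud's disease and fish oil, yet the shared terms suggest a connection worth investigating.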
Similarly, some have argued that experimental biology also contains such
“undiscovered public knowledge” across subdisciplines. Blagosklonny and
Pardee have claimed that the information necessary to understand feedback
control of p53 function was available in MEDLINE in 1990, although the
mechanism was not elucidated for another 10 years (Blagosklonny and Pardee,
2002). Believing that similar insights are missed, they proposed building
systems to review the contents of MEDLINE automatically in search of other
similarly hidden discoveries.
However, systems that automatically scan text for novel biological hypothe-
ses do not yet exist. Swanson’s method required experts to generate the queries
and interpret results. Nevertheless, algorithms for identifying concepts and
relationships, and for performing inferences on them, are active research areas (Chang
and Altman, 2002). Thus, this thesis investigates methods to extract struc-
tured knowledge from unstructured literature.
1.4 Extracting Pharmacogenomics Information
This thesis describes novel methods that support efforts to create a database
of relationships between genes or proteins and drugs from literature in MED-
LINE. Such a database will be useful for researchers studying pharmacoge-
nomics, including the scientists in the Pharmacogenomics Research Network.
In the future, linking this data with other biological data, such as protein struc-
tures or molecular pathways, or for humans, single nucleotide polymorphisms
(SNPs) or clinical data, will lead to deeper insights into biological systems.
Finding relationships between genes and drugs from the literature requires
algorithms to recognize drug names, gene names, and the relationships be-
tween them. Drug names are relatively easy to recognize. There are a lim-
ited number of drugs, their development time is lengthy, and the nomenclature
is tightly controlled by a few drug developers. Conversely, gene names are
dynamic: new ones are coined weekly by many different scientists, and many
follow no standard. Because there is no standardization,
orthologs may have different names, and genes with the same name may not be
homologously related (Stein, 2003). Exacerbating the problem, many gene and
protein names are abbreviated, effectively increasing the number of names.
Identifying gene names from literature is an ongoing research problem.
Similarly, finding relationships between drugs and genes or proteins is dif-
ficult because of the many different types of relationships possible (e.g. protein
[Figure 1.2 diagram: Documents feed into a Drug Name Identifier, a Gene Name Identifier, and an Abbreviation Finder, whose outputs feed a Relationship Scorer that emits Gene-Drug Relationships.]

Figure 1.2: Architecture for Finding Gene-Drug Relationships. Finding relationships between drugs and genes requires components that can recognize the drug and gene names from the literature, and a component that can score possible relationships between drugs and genes.
metabolizes drug, drug is cofactor of gene, genetic variation causes physiologi-
cal change in drug effect, etc.) and the various ways they are expressed in text.
Fortunately, the problems of finding abbreviations, gene names, and gene/drug
relationships share similarities. They can all be framed as classification prob-
lems, where the computer integrates multiple types of information to resolve
an ambiguous decision (e.g. whether a word is a gene name).
Thus, I have approached the problems of finding abbreviations, identifying
gene names, and collecting gene/drug relationships as machine learning tasks,
where judicious choices of informative features help the classifier produce con-
fidence scores. Because these problems are difficult and ambiguous, algorithms
that produce confidence scores help users (and computer programs) prioritize
results and choose the quality of information desired. My methods can identify
abbreviations in text with 84% recall and 81% precision, find gene and protein
names with 83% recall and 82% precision, and classify relationships between
Figure 1.3: BioNLP Web Server Screenshot. The BioNLP web server at http://bionlp.stanford.edu/ provides biological NLP services to the community.
genes and drugs into five categories with 74% accuracy. I have built a BioNLP
web server to provide these services to the community (Figure 1.3). It is
available at http://bionlp.stanford.edu/.
1.5 Overview of Thesis
Chapter 1 introduces the motivations and scientific framework leading to this
work.
Chapter 2 reviews common algorithms in machine learning and statistical
natural language processing that are relevant to this work.
Chapter 3 presents an algorithm for finding abbreviations in text.
Chapter 4 presents an algorithm for identifying the gene and protein names
in text.
Chapter 5 presents methods for finding genes and drugs with relationships,
and then identifying the type of relationship.
Chapter 6 describes the development of a server to present the results of the
algorithms, and to make them computationally accessible with web ser-
vices.
Chapter 7 concludes the thesis with a summary of the work, as well as a
discussion on possible future work.
CHAPTER 2

Statistical Text Analysis
Increased computer power has permitted the analysis of large text data sets
(called corpora, sing. corpus) with statistical methods. Using statistical meth-
ods, computers can find patterns and quantify latent trends in the data. This
chapter provides a background on these methods, introducing the data struc-
tures and algorithms relevant to the developments in this thesis.
2.1 Modelling a Document in Vector Space
Text documents are sequences of words, spaces, and punctuation and must
be reduced to a form amenable to computational analysis. One simple and
effective representation of text is called the vector space model (VSM). Salton first
used VSM in an application to retrieve documents from a database, similar to
functionality now provided by PubMed or Google (Salton, 1968). In VSM, each
document is modelled as a vector of word counts.
$\vec{\text{Document}} = [w_1\ w_2\ \ldots\ w_n]$
“Our analysis includes comparison of amino acid environments with random control environments as well as with each of the other amino acid environments.” ⇒ acid 2, amino 2, analysis 1, comparison 1, control 1, environments 3, ..., our 1, protein 0

Figure 2.1: Vector Space Model. Documents can be represented as vectors of word counts. Each dimension of the vector contains the number of times a word appears in the document. If a word does not occur in the document, the value for the corresponding dimension of the vector is 0.
where $w_i$ is the number of times word $i$ appears in the document and $n$ is
the number of unique words in the corpus. Since a typical document contains
only a small subset of all the possible words, these vectors are almost always
sparse. Most of the values in the vector are zero.
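As a concrete illustration, the mapping from text to a count vector can be sketched in a few lines of Python. The vocabulary below is a hypothetical excerpt chosen to match the Figure 2.1 example; a real system would use every unique word in the corpus:

```python
from collections import Counter
import re

def to_vector(text, vocabulary):
    """Count how many times each vocabulary word appears in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

vocab = ["acid", "amino", "analysis", "comparison", "control",
         "environments", "our", "protein"]
doc = ("Our analysis includes comparison of amino acid environments "
       "with random control environments as well as with each of the "
       "other amino acid environments.")
print(to_vector(doc, vocab))  # [2, 2, 1, 1, 1, 3, 1, 0]
```

Note the final 0 for "protein": words absent from the document get zero counts, which is why these vectors are sparse.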
The VSM document representation is simple and discards all information
derived from the ordering of the words. No information about context, phrases,
modifiers, syntax or other structure is retained, leading some to call it a bag-of-
words model. Although the lack of structural information seems like an exigent
deficiency, in practice, VSM performs acceptably well for many applications.
2.2 Supervised Machine Learning
One advantage of the vector space model is that many statistical algorithms
require data in vector form. One such class of algorithms is supervised machine
learning. These algorithms discover patterns in data vectors that can
help distinguish among subsets of the data. For example, supervised machine
learning algorithms have been applied to distinguish spam from other email.
Generally, supervised machine learning consists of two steps. In the first step,
training, the algorithm constructs a model of the data using a training set of
data vectors and assignments of the vectors to classes. In the spam classification
example, the classes would be either spam or not-spam. In the second
step, classifying, the algorithm assigns new data vectors to classes based on
the model created. There are many supervised machine learning algorithms;
they differ based on their models and assumptions of the distributions of the
underlying data.
More rigorously, a supervised machine learning algorithm contains:
a set of $K$ classes $C_{1 \ldots K}$ (2.1)
a set of $M$ training data vectors $\vec{X}_{1 \ldots M}$ (2.2)
$N$ class assignments $Y_{1 \ldots N}$ (2.3)
where $Y_i \in C$ (2.4)
where each $\vec{X}_i$ is a vector of dimension $D$, and $C$ enumerates the possible
classes. Given a new data point $\vec{x}$, the classifier produces the most likely class in $C$.
The training data is a set of vectors, where each dimension describes a feature
of the data. For text classification, C would be the different categories of text
to classify, Xi is a (VSM) vector of word counts from a document, and Yi is the
category for document Xi. Therefore, D is the size of the vocabulary for the
training set.
For biology, one application of supervised machine learning is to find articles
in MEDLINE that contain information about regulatory networks in S. cerevisiae
(Usuzaka et al., 1998). Here, the authors created a training set of 758 articles
that they manually assigned into two classes, WITH-REGULATORY-INFORMATION and
WITHOUT-REGULATORY-INFORMATION. Then, they used the vector space model and
classified 35,000 yeast-related MEDLINE abstracts to find the ones containing
regulatory information. Their method achieved 90% recall.
2.2.1 k Nearest Neighbors
One simple machine learning algorithm is called k Nearest Neighbors (kNN)
(Manning and Schutze, 1999). This straightforward algorithm contains few
assumptions about the distribution of the data and often performs among the
top text categorization algorithms (Yang, 1999). kNN classifies a new data
vector based on its distance to the k most similar data vectors (the nearest
neighbors) in the training set. It assigns a class to the vector based on the
classes of the nearest neighbors.
$\text{class} = Y_i, \quad i = \operatorname{argmin}_{j \in 1 \ldots M} \operatorname{dist}(\vec{x}, \vec{X}_j)$ (2.5)
One popular distance metric, out of many reasonable ones, is Euclidean
distance. The Euclidean distance between two vectors ~x and ~y is:
$\operatorname{dist}(\vec{x}, \vec{y}) = \sqrt{\sum_i^D (x_i - y_i)^2}$ (2.6)
Another metric, the cosine distance, is also commonly used for comparing
documents. This is:

$\operatorname{dist}(\vec{x}, \vec{y}) = \frac{\sum_i^D x_i y_i}{\sqrt{\sum_i^D x_i^2}\,\sqrt{\sum_i^D y_i^2}}$ (2.7)
If the vectors x and y were both normalized to lengths of 1, both the Eu-
clidean and cosine distance result in the same relative ordering of data vectors
(Manning and Schutze, 1999).
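A minimal kNN sketch in Python, with both distance measures. Note that the cosine measure of Equation 2.7 is a similarity (larger means more alike), so the sketch uses one minus it as a distance; the training data format here is an illustrative choice:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Equation 2.6."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_distance(x, y):
    """One minus the cosine measure of Equation 2.7, so that smaller
    values mean more similar vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / norms

def knn_classify(x, training, k=3, dist=euclidean):
    """training is a list of (vector, class) pairs; the new vector x is
    assigned the majority class among its k nearest neighbors."""
    neighbors = sorted(training, key=lambda pair: dist(x, pair[0]))[:k]
    votes = Counter(cls for _, cls in neighbors)
    return votes.most_common(1)[0][0]
```

Classification requires a pass over all M training vectors per query, which is the O(M) cost noted below.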
kNN is robust because it does not impose a generalized model (e.g. a nor-
mality constraint) on the data. However, relative to other methods, the clas-
sification decision is slow. A straightforward implementation evaluates every
vector in the training data and requires O(M) time.
2.2.2 Naïve Bayes
Compared to kNN, naïve Bayes classifies faster, but imposes a stricter model
on the data (Manning and Schutze, 1999). However, it is also simple, effective,
and easy to implement. Naïve Bayes computes a probabilistic model for each
dimension of the data based on the conditional probability of observing the data
given a specific class. It assigns a class by choosing the one with the highest
probability. To calculate that:
$P(C = c \mid \vec{x}) = \frac{P(\vec{x} \mid C = c)\,P(C = c)}{P(\vec{x})}$ (2.8)

$= \frac{\prod_i^D P(x_i \mid C = c)\,P(C = c)}{P(\vec{x})}$ (2.9)

$\propto \prod_i^D P(x_i \mid C = c)\,P(C = c)$ (2.10)

$\propto \log P(C = c) + \sum_i^D \log P(x_i \mid C = c)$ (2.11)
Since P (~x) in Equation 2.9 does not affect the classification decision, it is
often dropped. Also, computations are typically performed in log space to avoid
numerical underflow problems.
P (xi|C = c) is estimated from the training data. The maximum likelihood
estimate (MLE) is computed by dividing the number of times a feature occurs
in a class by the total number of times it occurs in all classes. Note that if
a feature never occurs for a class in the training set, the MLE probability in
that model is 0. If that feature were observed in a subsequent data vector,
that vector would automatically be disqualified from that class, regardless of
the values of its other features. This can cause problems for text classification;
the vectors are sparse, and minor variations in word choice can lead to a word
(feature) not occurring in a dataset by chance. To avoid this situation, one
can add pseudo-counts to guarantee that some probability is assigned to every
feature. A simple strategy, Laplace’s law, is to add 1 to each count.
Also note that naïve Bayes assumes independence among features. It calculates
their probabilities separately and multiplies them together during
classification. This assumption is violated severely in text vectors, where many
words and phrases are correlated. For example, amino acid often occurs as
a phrase, and observing either word increases the likelihood of observing the
other. In a naïve Bayes model, such correlated words are over-counted in the
classification decision. Nevertheless, naïve Bayes is often effective in practice.
Finally, naïve Bayes calculates probabilities for specific observed values.
Thus, continuous data such as word counts are either treated as boolean (seen
or not seen) or binned into discrete ranges.
2.2.3 Maximum Entropy
Similarly to naïve Bayes, Maximum Entropy uses a probabilistic framework.
However, it uses the information theoretic measure of entropy to quantify the
amount of information (the opposite of disorder) in probability distributions
(Cover and Thomas, 1991). It then chooses the model that contains the least
amount of information (highest entropy) not present in the training set (Man-
ning and Schutze, 1999). This ensures that the model does not contain biases
that adversely affect its ability to classify new data vectors.
Maximum entropy classifiers use a loglinear model:
$P(\vec{x}, c) = \frac{1}{Z} \prod_{i=1}^{D} \alpha_i^{f_i(\vec{x}, c)}$ (2.12)

$Z$ normalizes the result so that it is always within a range of 0 to 1, ensuring a
probability. $\alpha$ is a $D$-dimensional vector of weights; there is one weight for each
of the $D$ features $f_i(\vec{x}, c)$.
To train a maximum entropy model, an α vector is found such that the ex-
pectation of each feature in the model matches the expectation in the training
data.
$E_{p_{\text{model}}}[f_i] = E_{p_{\text{data}}}[f_i]$ (2.13)
The most popular training algorithm is generalized iterative scaling (GIS),
which is an expectation maximization approach. It converges to an optimal
solution (Darroch and Ratcliff, 1972), but requires the features to be binary.
To handle continuous features, theoretically slower gradient descent optimiza-
tion methods have been applied (Malouf, 2002). For completeness, I provide a
description and derivation of this method in Appendix A.
Note that the features in maximum entropy differ from those in other ma-
chine learning algorithms. The features here are functions that take both an
input vector and a class and return a value. For a hypothetical example, to
classify whether a document concerns regulatory networks, one possible fea-
ture could return 1 if the word MAPK is in the document and the class is
REGULATORY-NETWORK. This flexibility allows authors to construct complex
feature functions to capture interesting nuances of their domain.
Thus, from Equation 2.13, this feature formulation implicitly includes the
probabilities of the features as well as the prior probabilities of the classes.
Because the statistics are calculated across all features, the learning algorithm
accounts for their dependencies, unlike naıve Bayes. To assign a document to
a class, the algorithm chooses the class that yields the highest score.
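The loglinear scoring of Equation 2.12 can be sketched with hand-picked weights. The feature functions and α values below are hypothetical (in practice GIS would fit them from data), and the MAPK example echoes the one above:

```python
import math

# Hypothetical binary feature functions f_i(x, c): each examines both the
# document x (a set of words) and the candidate class c.
features = [
    lambda x, c: 1 if "MAPK" in x and c == "REGULATORY-NETWORK" else 0,
    lambda x, c: 1 if "kinase" in x and c == "REGULATORY-NETWORK" else 0,
    lambda x, c: 1 if c == "OTHER" else 0,   # behaves like a class prior
]
alphas = [3.0, 2.0, 1.5]   # assumed weights; GIS would fit these from data

def score(x, c):
    """Unnormalized P(x, c) from Equation 2.12; the 1/Z factor cancels
    when comparing classes for the same x."""
    return math.prod(a ** f(x, c) for a, f in zip(alphas, features))

def classify(x, classes=("REGULATORY-NETWORK", "OTHER")):
    return max(classes, key=lambda c: score(x, c))
```

Because each feature sees both the input and the class, weights can encode class priors and word-class interactions in one uniform mechanism.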
For classifying academic and financial web pages, maximum entropy performs
comparably to naïve Bayes on some data sets and more accurately on
others (Nigam et al., 1999). Modifying the algorithm to incorporate a Gaussian
prior for the features improved the results further. However, for classifying
biological text according to Gene Ontology codes, maximum entropy was over
10% more accurate than naïve Bayes and kNN (Raychaudhuri et al., 2002).
2.2.4 Logistic Regression
Another loglinear model with a simpler formulation than maximum entropy is
logistic regression (Hastie et al., 2001). Binary logistic regression distinguishes
between two classes and fits the feature vectors to a log odds (logit) function:

$\log \frac{p}{1 - p} = \vec{\beta} \cdot \vec{x}$ (2.14)

With some manipulation:

$p = \frac{e^{\vec{\beta} \cdot \vec{x}}}{1 + e^{\vec{\beta} \cdot \vec{x}}}$ (2.15)
where $\vec{x}$ is the feature vector, $p$ is the probability that it belongs to the positive
class, and $\vec{\beta}$ is a vector of weights. The model is trained by finding the $\vec{\beta}$
vector that maximizes the log likelihood of the training set:

$\ell(\beta) = \sum_{i=1}^{n} \left( y_i \log p_i + (1 - y_i) \log (1 - p_i) \right)$ (2.16)
where yi is 1 if feature vector i is in one class and 0 otherwise.
This equation is optimized using Newton’s method. Although it is not guar-
anteed to converge, it usually does so in practice (Hastie et al., 2001). Because
the training algorithm does not scale to large dimension spaces, logistic regres-
sion is not commonly used for text classification.
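A sketch of binary logistic regression in Python. For brevity it maximizes Equation 2.16 by plain gradient ascent rather than Newton's method, which converges faster but requires inverting a Hessian; the step size and iteration count are illustrative:

```python
import math

def sigmoid(z):
    """Equation 2.15 for a precomputed score z = beta . x."""
    if z < -30:   # guard against overflow for extreme scores
        return 0.0
    if z > 30:
        return 1.0
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, steps=500, lr=0.1):
    """Maximize the log likelihood of Equation 2.16 by gradient ascent."""
    beta = [0.0] * len(X[0])
    for _ in range(steps):
        # p_i = predicted probability for training vector i
        p = [sigmoid(sum(b * xj for b, xj in zip(beta, xi))) for xi in X]
        # gradient of the log likelihood: sum_i (y_i - p_i) * x_i
        grad = [sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, p))
                for j in range(len(beta))]
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta
```

Each training step costs a full pass over the data per dimension, which hints at why this approach struggles in the very high-dimensional spaces of text.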
2.2.5 Support Vector Machines
Finally, support vector machines (SVM) are binary classifiers grounded in
strong statistical theory (Hastie et al., 2001). In general, classifiers demar-
cate the feature space according to classes of the training vectors. An SVM
finds a hyperplane in the middle of the closest data points of the two classes.
Those data points are called the support vectors.
An SVM classifier constructs the following hyperplanes:
$(\vec{w} \cdot \vec{x}_{\text{class1}}) + b \geq +1$ (2.17)
$(\vec{w} \cdot \vec{x}_{\text{class2}}) + b \leq -1$ (2.18)
$(\vec{w} \cdot \vec{x}_{\text{hyper}}) + b = 0$ (2.19)

where data points in one class (the positive class) have values $\geq +1$ and data
points in the other (the negative class) have values $\leq -1$. Equation 2.19
represents the hyperplane in between, which is the decision boundary.
To classify a new data point, the SVM projects it onto the decision boundary
hyperplane. Points on one side of the boundary belong to one class, those on
the other side belong to the other class.
To find the optimally separating hyperplane, observe that the distance
between the two hyperplanes is $\frac{2}{\|\vec{w}\|}$, which follows from the difference between
Equation 2.17 and Equation 2.18. Thus, to maximize the separation between the
hyperplanes, minimize $\|\vec{w}\|$. This is a constrained optimization problem solvable
with quadratic programming. Details are described in (Burges, 1998).
Although the decision boundary is linear, the formulation allows a function,
called a kernel function, to project the vectors into a higher dimension space.
The kernel function returns the similarity between two vectors as calculated
in the higher-dimensional space. By projecting vectors in this way, an SVM can
classify data points that are not separable in the lower dimension. Although higher dimensions
generally lead to overfitting (the models capture idiosyncrasies specific to
the training set, rather than learning broad trends), SVMs are theoretically
robust because their decision boundaries do not characterize either class too
closely.
Because they behave well in high-dimensional spaces, SVMs outperform naïve Bayes and
kNN in text classification (Joachims, 1997). They have been applied in the
biomedical domain to classify literature about proteins according to their sub-
cellular localization (Stapley et al., 2002).
2.2.6 Feature Selection
A typical corpus contains a large number of unique words. However, many of
those words are used in few documents. The frequency of a word is inversely
related to its rank, when sorted in decreasing frequency. This is called a Zipfian
distribution:

$f \propto \frac{1}{r}$ (2.20)

where $f$ is the frequency of the word and $r$ is its rank when sorted by decreasing
frequency (Manning and Schutze, 1999). The product of the rank of a word
and its frequency is a constant called Zipf's constant, $Z = f \cdot r$.

        Probabilistic  Multiway        Handles     Continuous
        Scores         Classification  Dependence  Data
kNN          N              Y               Y          Y
NB           Y              Y               N          N
ME           Y              Y               Y          Y
LR           Y              N               Y          Y
SVM          Y              N               Y          Y

Table 2.1: Overview of Machine Learning Algorithms. KNN is k nearest neighbors, NB is naïve Bayes, ME is maximum entropy, LR is logistic regression, and SVM is support vector machine. Probabilistic Scores indicates whether the method yields probabilistically interpretable scores. Multiway Classification indicates whether the method can classify more than two classes at a time (this can also be emulated by implementing N one-versus-rest classifiers). Handles Dependence indicates whether the method correctly handles correlations among features. Continuous Data indicates whether the method can handle continuous data, rather than binned or binary data.

Method  Parameters
kNN     k, distance metric
NB      pseudocounts
ME      none
LR      none
SVM     kernel, error penalty

Table 2.2: Parameters for Machine Learning Algorithms. For optimal performance, most machine learning algorithms require parameter fitting.
Method  Training Algorithm        Speed   Classifying Algorithm  Speed
kNN     none                      none    look through data      slow
NB      counting data             fast    vector multiply        fast
ME      expectation maximization  slow    vector multiply        fast
LR      Newton's method           medium  vector multiply        fast
SVM     quadratic programming     slow    vector multiply        fast

Table 2.3: Performance of Machine Learning Algorithms. This contains a general description of the algorithms used for training and classification in various machine learning methods. The speed columns indicate relative algorithmic complexity.
Zipf's Law was noticed empirically and can be observed across domains,
including in biological literature (Figure 2.2).
Zipf's Law implies, for statistical classifiers, that many words in the corpus
are uninformative. Approximately half of the words will only appear once.
Also, some words are uninformative because they appear too frequently. For
example, the word the is the most common word in the SWISS-PROT corpus
(and in most corpora) and comprises 7% of the words in the corpus. The 10
most common words in these abstracts (the, of, and, a, in, to, that, is, with,
gene) constitute 25% of the corpus. Such words are often called stop words.
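The relation $Z = f \cdot r$ and the dominance of stop words can be checked on any small corpus with a few lines of Python (the toy sentence below is illustrative; the SWISS-PROT corpus itself is not reproduced here):

```python
from collections import Counter
import re

def zipf_table(text, top=5):
    """Rank words by frequency and report f * r, which Zipf's Law
    predicts is roughly constant across ranks."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    ranked = counts.most_common()
    return [(word, rank, freq, freq * rank)
            for rank, (word, freq) in enumerate(ranked[:top], start=1)]
```

On a real corpus the fourth column settles into a narrow band after the first few hundred ranks, as Figure 2.2 shows.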
Ignoring stop words in machine learning algorithms removes a source of
noise and simplifies the classification task by reducing the dimensionality of
the feature vectors. By using only the most informative features, the perfor-
mance of the algorithm can be improved.
Some common methods for feature selection are word count cutoffs, mutual
information, or $\chi^2$ tests. Using a word count cutoff is simplest. Here, the most
common words are considered the least informative and thus ignored.
Figure 2.2: Zipf's Law. The SWISS-PROT database links protein sequences to MEDLINE citations (Bairoch and Boeckmann, 1991). I collected a corpus of the MEDLINE abstracts cited in SWISS-PROT version 37. This plot shows that the words in this corpus roughly follow a Zipfian distribution. Zipf's constant, plotted on the vertical axis, is roughly constant starting from word 200. [Plot: Zipf's constant (0 to 1,400,000) versus word rank (1 to 1001).]
The other two methods, mutual information and $\chi^2$, examine the frequency
of words in a two-way contingency table $O$:

               Class 1    Class 2
Has Word       $O_{11}$   $O_{12}$
Without Word   $O_{21}$   $O_{22}$
Mutual information is grounded in information theory. It measures the as-
sociation between the word and the classes. If the occurrence of a word is
independent of the class of the document, the mutual information will be 0.
Conversely, high dependence yields high scores. To calculate the mutual infor-
mation:
$\sum_{i,j}^{2} P(O_{ij}) \log \frac{P(O_{ij})}{P(O_{i\cdot})\,P(O_{\cdot j})}$ (2.21)
The best features are the words with the highest mutual information scores.
However, mutual information uses probabilities and does not account for
the number of observations. Since 1/4 and 25/100 are both 25%, they are consid-
ered equal, even though the variances in these estimates are different. Thus,
mutual information is particularly sensitive to infrequent words, which, according
to Zipf's law, are very common.
In contrast, the $\chi^2$ test, grounded in statistical theory, does account for the
actual number of observations. It measures the probability that the word
distribution could be observed by random chance by comparing against the
expected distribution.[1] It computes the expected word counts from the marginal
probabilities of the word and class using:

$E_{ij} = P(O_{i\cdot}) \cdot P(O_{\cdot j}) \cdot N$ (2.22)

where $N$ is the sum of the observed matrix $O$. Then, it calculates a $\chi^2$ score
using:

$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$ (2.23)
[1] In other applications, some have argued that a log ratio test may produce more accurate statistics for rare events (Dunning, 1993).
The words whose distributions are most highly correlated with the classes have the
highest $\chi^2$ scores and are the best features.
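Both scores can be computed directly from the 2 × 2 contingency table; a sketch, with cell names $O_{11}$ through $O_{22}$ following the table above:

```python
import math

def feature_scores(o11, o12, o21, o22):
    """o11/o12: documents in class 1/2 containing the word;
    o21/o22: documents in class 1/2 without it.
    Returns (mutual information, chi-square)."""
    O = [[o11, o12], [o21, o22]]
    N = o11 + o12 + o21 + o22
    row = [o11 + o12, o21 + o22]          # with word / without word
    col = [o11 + o21, o12 + o22]          # class 1 / class 2
    mi = chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / N            # Equation 2.22
            if O[i][j] > 0:
                p = O[i][j] / N
                mi += p * math.log(p / ((row[i] / N) * (col[j] / N)))  # Eq. 2.21
            chi2 += (O[i][j] - expected) ** 2 / expected               # Eq. 2.23
    return mi, chi2
```

When the word occurs independently of the class (e.g. equal counts in every cell), both scores are zero; stronger dependence pushes both upward, but only $\chi^2$ grows with the number of observations.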
An extensive study comparing methods for feature selection found that the
$\chi^2$ test performed best (Yang and Pedersen, 1997). Surprisingly, using a word
count cutoff performed nearly as well. Because of its simplicity and low
computational demands, the word count cutoff is often used for feature selection. The
mutual information method performed poorly, due in part to its sensitivity to
infrequent words.

This study also showed that text classification algorithms can be accurate
even with many features discarded: $\chi^2$ performs well with up to 98% of the
features removed, and a word count cutoff with up to 90% removed.
All the methods presented thus far evaluate features independently. Handling
correlations among features requires an exhaustive search of all possible
subsets. Unfortunately, this is computationally intractable for the high-dimensional
feature spaces in text. However, some work has been done on approximating
good feature sets without an exhaustive search (Koller and Sahami, 1996).
2.3 Categorizing Words
Supervised machine learning methods have been applied to identify the meaning
of an unknown word from its context. The idea that context is informative
was immortalized by J. R. Firth, who coined the phrase "You shall know a word
by the company it keeps" (Firth, 1957).
In a machine learning framework, the possible meanings of a word are the
classes, and the features of the training vectors are the neighboring words. The
closest neighbors contain the most information, but distant words sometimes
also include information. Thus, the number of neighbors to examine varies and
is usually discovered empirically.
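A sketch of how neighboring words become features for this kind of classifier; the window size of 3 and the feature-naming scheme are illustrative choices:

```python
def context_features(tokens, i, window=3):
    """Features for disambiguating tokens[i]: the words within `window`
    positions, keyed by their offset from the ambiguous word."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats[f"word_at_{offset:+d}"] = tokens[j].lower()
    return feats

tokens = "the TPMT gene is expressed in liver".split()
# features for the ambiguous token "TPMT" at position 1
print(context_features(tokens, 1))
```

Such feature dictionaries plug directly into any of the classifiers above, with the word's possible meanings as the classes.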
Word sense disambiguation has been applied in bioinformatics to distin-
guish whether a word is a gene, protein, or mRNA. Since the name of a gene
can be the same as the mRNA or protein product, the algorithm must discover
its meaning based on context (Hatzivassiloglou et al., 2001). This study found
considerable ambiguity, but still achieved up to 85% accuracy.
CHAPTER 3

Finding Abbreviations
An algorithm that automatically finds and defines abbreviations can simplify
gene name identification, which is a necessary component of systems that can
identify gene-drug relationships. Many long protein and gene names are ab-
breviated, such as thiopurine methyltransferase (TPMT). Algorithms that rec-
ognize abbreviations can use both the long form and the abbreviation to help
identify a gene. In this example, methyltransferase is easily identified from its
appearance. From the abbreviation, it is clear that 1) TPMT is also a protein,
and 2) thiopurine is part of the protein name.
I define abbreviation broadly as a shortened form of a longer word or phrase
(the long form). An acronym is typically defined as a type of abbreviation in
which the short form is a conjunction of the initial letters of words in the long
form; some authors also require them to be pronounceable.
Using such a strict definition excludes many types of abbreviations that
appear in biomedical literature. Writers create abbreviations in many different
ways as summarized here:
Portions of this chapter have appeared in Chang et al. (2002).
Abbreviation  Definition               Description
VDR      ⇒   vitamin D receptor       The letters align to the beginnings of the words.
PTU      ⇒   propylthiouracil         The letters align to a subset of syllable boundaries.
JNK      ⇒   c-Jun N-terminal kinase  The letters align to punctuation boundaries.
IFN      ⇒   interferon               The letters align to some other place.
SULT     ⇒   sulfotransferase         The abbreviation contains contiguous characters from a word.
ATL      ⇒   adult T-cell leukemia    The long form contains words not in the abbreviation.
CREB-1   ⇒   CRE binding protein      The abbreviation contains letters not in the long form.
beta-EP  ⇒   beta-endorphin           The abbreviation contains complete words.
Nevertheless, the numerous lists of abbreviations covering many domains
attest to broad interest in identifying them. Opaui, a web portal for abbrevia-
tions, contains links to 152 lists alone (Opaui). Some are compiled by individu-
als or groups (Acronyms and Initialisms; Human Genome Acronym List). Oth-
ers accept submissions from users over the internet (Acronym Finder; Three-
Letter Abbreviations). For the medical domain, a manually-collected published
dictionary contains over 10,000 entries (Jablonski, 1998).
Because of the size and growth of the biomedical literature, manual compi-
lations of abbreviations suffer from problems of completeness and timeliness.
Automated methods for finding abbreviations are therefore of great potential
value. In general, these methods scan text for candidate abbreviations and
then apply an algorithm to match them with the surrounding text. Most ab-
breviation finders fall into one of three types.
The simplest type of algorithm matches the letters of an abbreviation to
the initial letters of the words around it. The algorithm for recognizing this is
relatively straightforward, although it must perform some special processing
to ignore common words. Taghva gives an example Office of Nuclear Waste
Isolation (ONWR) where the O can be matched with the initial letter of either
Office or of (Taghva and Gilbreth, 1995).
More complex methods relax the first letter requirement and allow matches
to other characters. These typically use heuristics to favor matches on the
first letter or syllable boundaries, upper case letters, length of acronym, etc.
(Yoshida et al., 2000) However, Yeates notes the challenge in finding optimal
weights for each heuristic and further posits that machine learning approaches
may help (Yeates, 1999).
Another approach recognizes that the alignment between an abbreviation
and its long form often follows a set of patterns (Larkey et al., 2000; Puste-
jovsky et al., 2001; Nenadic et al., 2002; Wren and Garner, 2002; Yu et al.,
2002b). Thus, a set of carefully and manually crafted rules governing allowed
patterns can recognize abbreviations. Furthermore, one can control the per-
formance of the system by adjusting the set of rules, trading off between the
leniency in which a rule allows matches and the number of errors that it intro-
duces. Also, good results have been reported from a system that simply looks
for letter matches close to the abbreviation (Schwartz and Hearst, 2003).
In their rule-based system, Pustejovsky et al. introduced an interesting in-
novation by including lexical information (Pustejovsky et al., 2001). Their in-
sight is that abbreviations are often composed from noun phrases, and that
constraining the search to definitions in the noun phrases closest to the abbre-
viation will improve precision. With the search constrained, they found that
they could further tune their rules to also improve recall.
Finally, there is one completely different approach to abbreviation search
based on compression (Yeates et al., 2000). The idea here is that a correct ab-
breviation gives better clues to the best compression model for the surrounding
text than an incorrect one. Thus, a normalized compression ratio built from the
abbreviation gives a score capable of distinguishing abbreviations.
3.1 An Algorithm to Identify Abbreviations
I decompose the abbreviation-finding problem into four components: 1) scan-
ning text for occurrences of possible abbreviations, 2) aligning the candidates
to the preceding text, 3) converting the abbreviations and alignments into a
feature vector, and 4) scoring the feature vector using a statistical machine
learning algorithm (Figure 3.1).
3.1.1 Finding Abbreviation Candidates
I searched for possible abbreviations inside parentheses, assuming that they
followed the pattern:
long form ( abbreviation )
For every pair of parentheses, I retrieved the words up to a comma or semi-
colon. I rejected candidates longer than two words, candidates without any
letters, and candidates that exactly matched the words in the preceding text.
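The scanning rules above can be sketched as follows. This is my own illustration, not the dissertation's original code; the function name, the regular expression, and the reading of "exactly matched the words in the preceding text" are all assumptions.

```python
import re

def find_candidates(sentence):
    """Scan for parenthesized abbreviation candidates following the
    pattern `long form ( abbreviation )`.

    Returns (prefix_words, candidate) pairs. Candidates longer than two
    words, without letters, or whose words all appear verbatim in the
    preceding text are rejected. The prefix is capped at 3N words,
    where N is the number of letters in the candidate.
    """
    candidates = []
    for match in re.finditer(r'\(([^()]*)\)', sentence):
        # Keep only the words up to a comma or semicolon.
        inner = re.split(r'[,;]', match.group(1))[0].strip()
        words = inner.split()
        if not words or len(words) > 2:
            continue  # reject candidates longer than two words
        if not any(c.isalpha() for c in inner):
            continue  # reject candidates without any letters
        prefix_words = sentence[:match.start()].split()
        if all(w in prefix_words for w in words):
            continue  # reject exact repeats of preceding words
        n_letters = sum(c.isalpha() for c in inner)
        candidates.append((prefix_words[-3 * n_letters:], inner))
    return candidates
```

For example, `find_candidates("the cyclin-dependent kinase (CDK) gene")` yields the candidate `CDK` paired with its three prefix words.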
Figure 3.1: Abbreviation System Architecture. I use a machine learning approach to find and score abbreviations. First, I scan text to find possible abbreviations, align them with their prefix strings, and then collect a feature vector based on 8 characteristics of the abbreviation and alignment. Finally, I apply binary logistic regression to generate a score from the feature vector.
For each abbreviation candidate, I saved the words before the open paren-
thesis (the prefix) so that I could search them for the long form of the abbre-
viation. Although I could have included every word from the beginning of the
sentence, as a computational optimization, I only used 3N words, where N was
the number of letters in the abbreviation. I chose this limit conservatively
based on an informal observation that I always found long forms well within
3N words.
3.1.2 Aligning Abbreviations with their Prefixes
For each pair of abbreviation candidate and prefix, I found the alignment of
the letters in the abbreviation with those in the prefix. This is a case of the
Longest Common Subsequence (LCS) problem studied in computer science and
adapted for biological sequence alignment in bioinformatics (Needleman and
Wunsch, 1970).
I found the optimal alignment between two strings X and Y using dynamic
programming in O(NM) time, where N and M were the lengths of the strings.
This algorithm is expressed as a recurrence relation:
M[i, j] = \begin{cases}
0 & i = 0 \text{ or } j = 0 \\
M[i-1, j-1] + 1 & i, j > 0 \text{ and } X_i = Y_j \\
\max(M[i, j-1],\, M[i-1, j]) & i, j > 0 \text{ and } X_i \neq Y_j
\end{cases}    (3.1)
M is a score matrix, and M [i, j] contains the total number of characters
aligned between the substrings X1...i and Y1...j. To recover the aligned char-
acters, I created a traceback parallel to the score matrix. This matrix stored
pointers to the indexes preceding M [i, j]. After generating these two matrices,
I recovered the alignment by following the pointers in the traceback matrix.
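The recurrence and traceback can be sketched as below. The case-insensitive letter comparison is my assumption; the dissertation does not specify how case is handled during alignment.

```python
def lcs_align(x, y):
    """Align abbreviation x against prefix y using the LCS recurrence
    (Equation 3.1), with a traceback matrix to recover which
    characters are aligned. Returns (score, list of (i, j) pairs)."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]       # score matrix
    back = [[None] * (m + 1) for _ in range(n + 1)]  # traceback pointers
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1].lower() == y[j - 1].lower():
                M[i][j] = M[i - 1][j - 1] + 1
                back[i][j] = (i - 1, j - 1)
            elif M[i][j - 1] >= M[i - 1][j]:
                M[i][j] = M[i][j - 1]
                back[i][j] = (i, j - 1)
            else:
                M[i][j] = M[i - 1][j]
                back[i][j] = (i - 1, j)
    # Follow the traceback pointers to recover the aligned characters.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:  # a diagonal move marks a match
            pairs.append((i - 1, j - 1))
        i, j = pi, pj
    return M[n][m], list(reversed(pairs))
```

Running `lcs_align("PKA", "protein kinase A")` aligns all three letters, matching P, K, and A to the word-initial characters of the long form.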
3.1.3 Computing Features from Alignments
Next, I calculated feature vectors that quantitatively described each candidate
abbreviation and the alignment to its prefix. For the abbreviation recognition
task, I used 9 features described in Table 3.1. Each feature constituted one
Feature            Description                                                      β

Describes the abbreviation
LowerAbbrev        Percent of letters in abbreviation in lowercase.             -1.21

Describes where the letters are aligned
WordBegin          Percent of letters aligned at the beginning of a word.        5.54
WordEnd            Percent of letters aligned at the end of a word.             -1.40
SyllableBoundary   Percent of letters aligned on a syllable boundary.            2.08
HasNeighbor        Percent of letters aligned immediately after another letter.  1.50

Describes the alignment
Aligned            Percent of letters in the abbreviation that are aligned.      3.67
UnusedWords        Number of words in the prefix not aligned to the             -5.82
                   abbreviation.
AlignsPerWord      Average number of aligned characters per word.                0.70

Miscellaneous
CONSTANT           Normalization constant for logistic regression.              -9.70
Table 3.1: Features Used to Score Abbreviations. These features are used to calculate the score of an alignment using Equation 2.15. I identified syllable boundaries using the algorithm used in TeX (Knuth, 1986). The β column indicates the weight given to each feature. The sign of the weight indicates whether that feature is favorably associated with real abbreviations.
dimension of a 9-dimensional feature vector.
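A few of the Table 3.1 features can be computed from an alignment as sketched below. This is a simplified illustration under my own naming; the full system uses all 9 features, including the syllable-boundary feature, which requires the TeX hyphenation algorithm and is omitted here.

```python
def alignment_features(abbrev, prefix, pairs):
    """Compute a subset of the Table 3.1 features.

    `pairs` lists (abbrev_index, prefix_index) positions aligned
    between the abbreviation and its prefix string.
    """
    n = len(pairs) or 1
    aligned = set(pairs)
    # Indices in the prefix where a word begins.
    word_starts = {0} | {i + 1 for i, c in enumerate(prefix) if c == ' '}
    return {
        # Percent of letters in the abbreviation in lowercase.
        'LowerAbbrev': sum(c.islower() for c in abbrev) / len(abbrev),
        # Percent of aligned letters at the beginning of a word.
        'WordBegin': sum(j in word_starts for _, j in pairs) / n,
        # Percent of abbreviation letters that are aligned.
        'Aligned': len(pairs) / len(abbrev),
        # Percent of letters aligned immediately after another letter.
        'HasNeighbor': sum((i - 1, j - 1) in aligned for i, j in pairs) / n,
    }
```

For the alignment of PKA to "protein kinase A", every letter aligns at a word beginning, so `WordBegin` and `Aligned` are both 1.0.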
3.1.4 Scoring Alignments with Logistic Regression
Finally, I used a supervised machine learning algorithm to recognize abbrevia-
tions. To train this algorithm, I created a training set of 1000 randomly-chosen
candidates identified from a set of MEDLINE abstracts pertaining to human
genes, which I had compiled for another purpose. For the 93 real abbreviations,
I hand-annotated the alignment between the abbreviation and prefix.
Next, I generated all possible alignments between the abbreviations and
prefixes in my set of 1000. This yielded my complete training set, which con-
sisted of 1) alignments of incorrect abbreviations, 2) correct alignments of cor-
rect abbreviations, and 3) incorrect alignments of correct abbreviations. I con-
verted these alignments into feature vectors.
Using these feature vectors, I trained a binary logistic regression classifier
(Hastie et al., 2001). I chose this classifier based on its lack of assumptions
on the data model, ability to handle continuous data, speed in classification,
and probabilistically interpretable scores. To alleviate singularity problems, I
removed all the duplicate vectors from the training set.
Finally, the score of an alignment is the probability calculated from Equa-
tion 2.15 using the optimal β vector. The score of an abbreviation is the maxi-
mum score of all the alignments to its prefix.
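The scoring step can be sketched with the fitted weights from Table 3.1, assuming Equation 2.15 (from an earlier chapter, not shown here) is the standard binary logistic function; the function and dictionary names are mine.

```python
import math

# Fitted weights (beta column of Table 3.1); CONSTANT is the intercept.
BETA = {
    'LowerAbbrev': -1.21, 'WordBegin': 5.54, 'WordEnd': -1.40,
    'SyllableBoundary': 2.08, 'HasNeighbor': 1.50, 'Aligned': 3.67,
    'UnusedWords': -5.82, 'AlignsPerWord': 0.70, 'CONSTANT': -9.70,
}

def score(features):
    """Logistic regression score: P(real abbreviation | features)."""
    z = BETA['CONSTANT'] + sum(BETA[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

A fully aligned candidate whose letters sit at word beginnings and syllable boundaries scores near 1, while an all-zero feature vector scores near 0, reflecting the large negative intercept.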
3.1.5 Implementing the Algorithm
I implemented the code in Python 2.2 (Lutz et al., 1999) and C with the Biopy-
thon 1.00a4 and mxTextTools 2.0.3 libraries. The website was built with Red Hat Linux 7.2, MySQL 3.23.46, and Zope 2.5.0 on a Dell workstation with a 1.5 GHz Pentium IV and 512 MB of RAM.
3.2 Performance of the Abbreviation Identifica-
tion Algorithm
I evaluated the performance of the algorithm on the Medstract acronym
gold standard (Pustejovsky et al., 2001). It contains MEDLINE abstracts
with expert-annotated abbreviations and forms the basis of the evaluation
of Acromed. The gold standard is publicly available as an XML file at
http://www.medstract.org/gold-standards.html.
I ran my algorithm against the Medstract gold standard (after correcting
6 typographical errors in the XML file) and generated a list of the predicted
abbreviations, long forms, and their scores. With these predictions, I calculated
the recall and precision at every possible score cutoff, generating a
recall/precision curve.
I counted an abbreviation/long form pair correct if it matched the gold stan-
dard exactly, considering only the highest scoring pair for each abbreviation. To
be consistent with Acromed’s evaluation on Medstract, I allowed mismatches
in 10 cases where the long form contained words not indicated in the abbrevia-
tion. For example, I accepted protein kinase A for PKA and did not require the
full cAMP-dependent protein kinase A indicated in the gold standard.
I ran my algorithm against the Medstract gold standard and calculated the
recall and precision at various score cutoffs (Figure 3.2). Identifying 140 out of
168 correctly, it obtained a maximum recall of 83% at 80% precision (Table 3.2).
The recall/precision curve plateaued at two levels of precision, 97% at 22%
recall (score=0.88) and 95% at 75% recall (score=0.14).
[Figure: recall/precision curve. The x-axis is Recall (0 to 1); the y-axis is Interpolated Precision (0.5 to 1). Score cutoffs of 0.88, 0.14, and 0.03 are marked on my algorithm's curve, with the two Acromed variants (regular expression; syntactic information) plotted for comparison.]

Figure 3.2: Abbreviations Predicted in Medstract Gold Standard. I calculated the recall and precision at every score cutoff and plotted the resulting curve. I marked the scores at various points on the curve. The performance of the Acromed system is shown for comparison.
#   Description                                        Example
12  Abbreviation and long form are synonyms.           apoptosis ⇒ programmed cell death
7   Abbreviation is outside parentheses.
3   Best alignment score yields incorrect long form.   FasL ⇒ Fas and Fas ligand
3   Letters in abbreviation are out of order.          ATN ⇒ anterior thalamus
25  TOTAL

Table 3.2: Types of Abbreviations Missed. My algorithm failed to find 25 total abbreviations in the Medstract gold standard. This table categorizes the types of abbreviations and the number of each type missed.
At a score cutoff of 0.14, the algorithm made 8 errors. 7 of those errors
were abbreviations missing from the gold standard: primary ethylene response
element (PERE), basic helix-loop-helix (bHLH), intermediate neuroblasts de-
fective (ind), Ca2+-sensing receptor (CaSR), GABA(B) receptor (GABA(B)R1),
polymerase II (Pol II), and GABAB receptor (GABA(B)R2). The final error oc-
curred where an unfortunate sequence of words in the prefix yielded a higher
scoring alignment than the long form: Fas and Fas ligand (FasL).
3.3 Clarifying and Reconciling Notions of Ab-
breviations
Although my algorithm could find abbreviations from MEDLINE accurately,
the evaluation against Medstract remains unsatisfying. The evaluation uncov-
ered some subtleties and ambiguities in the abbreviation identification prob-
lem. To gain a better understanding of the problem, I had an expert reanno-
tate the Medstract gold standard to reveal differences among expert notions of
abbreviations.
3.3.1 Reannotating the Medstract Gold Standard
The goals for reannotating Medstract are twofold. First, the comparison of the
new annotations to the original ones should reveal differences in definitions of
abbreviations; therefore, the notion of abbreviation should not be biased by the
one in Medstract. Second, to generate a more complete standard, the new data
set should not omit any correct annotations already found in the original.
To address these two goals, I used a two-pass approach with an Expert
(Daniel Rubin) not involved in the development of abbreviation identification
algorithms, but familiar with the problem. The Expert was a board certified
physician with postdoctoral training in Biomedical Informatics. In the first
pass, both the Expert and I identified the abbreviations in Medstract. I asked
the Expert to mark the abbreviations in the gold standard and did not give
further definition of abbreviation.
The second pass resolved differences due to inconsistent annotation and hu-
man error. Here, the Expert alone had the authority to resolve the differences
between his annotations and those in Medstract and my list. I presented the
Expert each difference and asked whether he wanted to change his annota-
tions. There were four possible types of differences: 1) the Expert had an ab-
breviation not annotated elsewhere, 2) the Expert was missing an abbreviation
annotated elsewhere, 3) the long forms annotated had incongruent boundaries,
and 4) the long forms annotated were completely different. The resolved ab-
breviations yielded the Expert gold standard.
3.3.2 Comparing the Medstract and Expert Gold Stan-
dards
During the initial markup, the Expert identified 154 abbreviations in Med-
stract, which is fewer than the 168 marked in the gold standard. During the
adjudication step, the Expert added an additional 14 abbreviations, removed 3,
and changed the long form boundaries on 2. There were no instances where the
same abbreviation was annotated with different long forms.
Then, I compared the final Expert gold standard against Medstract. Dis-
regarding differences in the long form boundaries, the gold standards agreed
on 151 abbreviations. Expert had 10 abbreviations not in Medstract, and con-
versely, Medstract had 13 abbreviations not in Expert. Thus, the inter-observer
agreement was 87%, calculated as:
# same abbreviations / (# same + # differences)
The Expert disagreed with Medstract on 12 borders.
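Plugging in the counts above confirms the arithmetic; a quick sketch:

```python
# Agreement between the Expert and Medstract gold standards:
# 151 shared abbreviations, 10 only in Expert, 13 only in Medstract.
same, expert_only, medstract_only = 151, 10, 13
agreement = same / (same + expert_only + medstract_only)
print(f"inter-observer agreement: {agreement:.0%}")  # → 87%
```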
Finally, I compared the results of my abbreviation identification algorithm
against the Expert gold standard. The algorithm I used here has been modified
from the one reported above, so that it also recognizes abbreviations in the
form:
abbreviation ( long form )
in addition to the pattern handled previously. Here, the algorithm attained
a precision and recall of 98.7% and 95.2%, in contrast to the 94.3% and 88.1%
on Medstract.
3.3.3 Defining Abbreviations
Comparing the results of the annotations revealed latent assumptions in the
definition of abbreviations. There are three main types of differences in the
markups: differing definitions of abbreviations, disagreement on the bound-
aries of long forms, and overlooked abbreviations.
First, there is considerable variation in the definition of an abbreviation.
In the broadest sense, an abbreviation is a shortened version of a longer word
or phrase. For example, IFN is constructed from the letters in interferon. An
acronym is a special case of an abbreviation where the letters are constructed
from the first letters of the words, such as NAT for N-Acetyl Transferase.
Although many abbreviations are synonyms for their long forms, some are
instead hypernyms or hyponyms. They can be more general or more specific
than the long form. One example is HOT-SPOT (HOT1). The abbreviation
contains more information and indicates a specific variant of the HOT-SPOT
gene. While this construction might technically be a parenthetical statement,
it also falls within my definition of abbreviation because the abbreviation is
constructed from the letters in the long form.
Similarly, the long forms can contain different amounts of information.
There is often ambiguity in the boundaries of the long form. The disagree-
ments stem from domain knowledge, where an expert may include more words
than indicated in the abbreviation, based on their expert knowledge. For ex-
ample, a strictly letter matching heuristic on RNA Polymerase I (Pol I) would
indicate a long form of Polymerase I. In this case, many experts would include
the word RNA because of their expert understanding of Pol I. Another example
is lateral arcuate nucleus (Arc). Experts knowledgeable in anatomy would in-
clude the word lateral. To address this, future algorithms will need to predict
boundaries based on the usage of phrases in the text.
Finally, Medstract also includes synonyms where neither entity is con-
structed from the letters of the other. These fall under the broad class of en-
tities called acronym-meaning pairs in Medstract. Some pairs found in Med-
stract are ommatidia and dorsal and ventral; and apoptosis and programmed
cell death. These are synonyms and have a short and long form, but are not
abbreviations because the short form is not constructed from the letters found
in the long form.
The inclusion of acronym-meaning (non-abbreviation) pairs in Medstract
led to many differences when compared against the Expert gold standard. Out
of those 168 acronym-meaning pairs in Medstract, 13 were not abbreviations.
These were the same 13 that Expert did not annotate.
The final source of discordance with the Medstract gold standard stemmed
from missing abbreviations. This is likely due to human error; the Expert
overlooked 14 in his first pass. This problem in particular complicates eval-
uations. Algorithms evaluated against Medstract can be unfairly penalized
for correct answers, leading to situations where better algorithms can receive
lower scores. When an algorithm identifies a correct abbreviation not anno-
tated in the gold standard, the precision drops. In addition to difficulty mea-
suring precision, the recall is inflated because the formula does not account for
the missing abbreviations.
3.4 Compiling the Abbreviations in MEDLINE
Nevertheless, I applied the algorithm and scanned for abbreviations in all
MEDLINE abstracts through the year 2001. I kept only the predictions that
scored at least 0.001. This computation required 70 hours using 5 processors
on a Sun Enterprise E3500 running Solaris 2.6. In all, I processed 6,426,981
MEDLINE abstracts (only about half the 11,447,996 citations had abstracts)
at an average rate of 25.5 abstracts/second.
From this scan, I identified a total of 1,948,246 abbreviations, and 20.7% of
them were defined in more than one abstract. 2.7% were found in 5 or more
abstracts. 2,748,848 (42.8%) of the abstracts contained at least 1 abbreviation
and 23.7% of them contained 2 or more.
Out of the nearly two million abbreviation/definition pairs, there were only
719,813 distinct abbreviations because many of them had different definitions;
e.g., AR can be autosomal recessive, androgen receptor, amphiregulin, aortic
regurgitation, aldose reductase, etc. 156,202 (21.7%) abbreviations had more
than one definition.
The average number of definitions for abbreviations with 6 characters or
fewer was 4.61, higher than the 2.28 reported by (Liu et al., 2001). One possible
reason for this discrepancy is that Liu’s method correctly counts morphological
variants of the same definition. Both methods, however, overcount definitions
that have the same meaning, but different words. I found that 37.5% of
the abbreviations with 6 characters or fewer had multiple definitions, which
concurs with Liu's 33.1%.
781,632 of the abbreviations had a score of at least 0.14. Of those,
328,874 (42.1%) were acronyms, i.e. they were composed of the first letters
of words.
[Figure: yearly counts (0 to 400,000) of new abstracts and new abbreviations added to MEDLINE, 1975-2000.]

Figure 3.3: Growth of Abbreviations. The number of abstracts and abbreviations added to MEDLINE steadily increases.
The growth rate of both abstracts in MEDLINE and new abbreviation def-
initions is increasing (Figure 3.3). 64,262 new abbreviations were introduced
last year, and there is an average of 1 new abbreviation in every 5-10 abstracts.
To evaluate the coverage of the database of predicted abbreviations from
MEDLINE, I used a list of abbreviations from the China Medical Tribune, a
weekly Chinese language newspaper covering medical news from Chinese jour-
nals (China Medical Tribune). The web site includes a dictionary of 452 com-
monly used English medical abbreviations with their long forms. I searched
the database for these abbreviations (after correcting 21 spelling errors) and
[Figure: histogram of the number of abbreviations (0 to 350) by score (0.05 to 0.95).]

Figure 3.4: Scores of Abbreviations Found in China Medical Tribune. Using a score cutoff of 0.90 yields a recall of 68%; a cutoff of 0.14, 87%; and a cutoff of 0.03, 88%.
calculated the recall as:
recall = # long forms identified / # abbreviations (= 452)    (3.2)
I searched for abbreviations from the China Medical Tribune against my
database of all MEDLINE abbreviations. Allowing differences in capitalization
and punctuation, I matched 399 of the 452 abbreviations to their correct long
forms for a maximum recall of 88% (Figure 3.4). Using a score cutoff of 0.14
yields a recall of 395/452 = 87%.
Out of the 53 abbreviations missed, 11 of them were in the database as a
close variation, such as Elective Repeat Caesarean-Section instead of Elective
Repeat C-Section. Also, the algorithm could identify all but 8 of the 53 with a
score cutoff of 0.14.
During validation, I found that the server contained 88% of the abbrevia-
tions from the dictionary in the China Medical Tribune.
Since the abbreviation list from the China Medical Tribune was created
independently of MEDLINE, the results suggest that the database contains
nearly all biomedical abbreviations. To improve the recall even further, Yu
has shown that linking to external dictionaries of abbreviations can augment
the ability of automated methods to assign definitions that are not indicated
in the text (Yu et al., 2002b). Nevertheless, this shows that my abbreviation
identification algorithm can successfully identify abbreviations, and also that
MEDLINE abstracts are a rich source of biomedical abbreviations.
3.5 Conclusions
Due to the enormous number of abbreviations currently in MEDLINE and the
rate at which prolific authors define new ones, maintaining a current dictio-
nary of abbreviation definitions clearly requires automated methods. Since
nearly half of MEDLINE abstracts contain abbreviations, computer programs
analyzing this text will frequently encounter them and can benefit from their
identification. Since fewer than half of all abbreviations are formed from the
initial letters of words, automated methods must handle more sophisticated
and non-standard constructs.
Thus, I used machine learning to create a method robust to varied abbrevi-
ating patterns. I evaluated it against the Medstract gold standard because it
was easily available, it eliminated the need to develop an alternate standard,
and it provided a reference point to compare methods.
The majority of the errors on this data set (see Table 3.2) occurred because
the gold standard included synonyms, words and phrases with identical mean-
ings, in addition to abbreviations. In these cases, the algorithm could not find
the correspondences between letters, indicating a fundamental limitation of
letter matching techniques.
My precision in this evaluation was hurt by abbreviations missing from the
gold standard. My algorithm identified 8 of these, and 7 had scores higher
than 0.14. Disregarding these cases yields a precision of 99% at 75% recall,
which is comparable to Acromed at 98% and 72%.
It is important for a gold standard to accurately differentiate the performance
of algorithms. To ensure this, gold standards themselves should be
carefully reviewed. I therefore reviewed the Medstract gold standard by
reannotating it and analyzing the differences. This re-analysis has revealed
ambiguities and latent assumptions in the definition of an abbreviation. Many
of these are not handled explicitly in the first generation of abbreviation iden-
tification algorithms. Gaining a deeper understanding of abbreviations has led
to an improved gold standard and created a blueprint for the development of
second generation systems.
Finally, I applied the algorithm to search for abbreviations in all abstracts
in MEDLINE. Although this run completed in a reasonable amount of time,
under 3 days, the algorithm could be optimized by reducing the number of
alignments between abbreviations and prefixes that must be scored. One way
to do this is to encode the features into the alignment step to discard poor
alignments. The current algorithm uses dynamic programming to align abbre-
viations to possible long forms, giving equal weight to all matches. However,
assigning more weight to important positions, such as the initial letter of the
word, can help differentiate high-quality alignments from others. Doing so,
however, would require a new method for discovering suitable weightings for
different positions. Although many of the current features can be encoded this way, those
that depend on other aligned characters (e.g. HasNeighbor) violate assump-
tions in dynamic programming and must be handled separately.
I stored the predicted abbreviations into a relational database and built an
abbreviation server, a web server that, given queries by abbreviation or word,
returns abbreviations and their definitions. The server can also search for
abbreviations in text provided by the user (Figure 3.5).
I note that using the abbreviation server to look up definitions must be done
carefully. Since about a fifth of all abbreviations were degenerate, the correct
one must be disambiguated using the abbreviation’s context. Pustejovsky has
shown the suitability of the vector-space model for this task (Pustejovsky et al.,
2001).
I am making the abbreviation server available at
http://abbreviation.stanford.edu/. This server contains all the abbreviations in MEDLINE and
also includes an interface that will identify abbreviations from user-specified
Figure 3.5: Abbreviation Server Screenshot. My abbreviation server supports queries by abbreviation or keyword.
text. I hope that this server will also be useful for the general biomedical com-
munity. I describe the creation of this server in Chapter 6.
CHAPTER 4
Identifying Gene Names
Building biological databases, such as a pharmacogenomics database, requires
methods to identify the names of entities, such as genes and proteins, accu-
rately. Errors in this step can lead to problems in downstream algorithms;
overlooked gene names accounted for 85% of the missed interactions in the
protein-protein interaction database PubGene (Jenssen et al., 2001).
Computationally finding gene and protein names in natural language text
is difficult. The lack of uniform nomenclature standards has resulted in dis-
cordant naming practices (Jan, 1997; White et al., 2002). To handle the result-
ing diversity of the names, gene and protein name identification algorithms
use combinations of approaches including: Dictionary, searching from a list
of known gene names; Appearance, deducing word type based on its makeup
of characters; Syntax, filtering words based on parts of speech; Context, using
nearby words to infer gene and protein names; and Abbreviation, using abbre-
viations in text to help identify names (Table 4.1).
Perhaps the simplest approach is to create a dictionary of all known gene
and protein names. Krauthammer invented such a method by adapting BLAST
Portions of this chapter have appeared in Chang et al. (Accepted).
System             Dictionary        Appear.  Syntax  Context   Abbr.
Fukuda (1)                           ✓        ✓       keywords
Proux (2)                            ✓        ✓
Rindflesch (3)     UMLS                               keywords
Krauthammer (4)    GenBank
Kazama (5)                           ✓        ✓       ✓
Tanabe (6)                           ✓        ✓       ✓
Franzen (7)                          ✓        ✓
Hanisch (8)        HUGO, SP/TREMBL   ✓
Narayanaswamy (9)                    ✓        ✓       ✓
Hou (10)                             ✓        ✓       ✓
Lee (11)                             ✓        ✓       ✓         ✓
Morgan (12)                          ✓        ✓
Shen (13)                            ✓        ✓       ✓         ✓
Tsuruoka (14)                        ✓        ✓

Table 4.1: Overview of Gene/Protein Name Algorithms. Each row describes a previous gene and protein name identification algorithm. The columns show the types of data used to identify those names. ((1) Fukuda et al. (1998); (2) Proux et al. (1998); (3) Rindflesch et al. (2000); (4) Krauthammer et al. (2000); (5) Kazama et al. (2002); (6) Tanabe and Wilbur (2002b); (7) Franzen et al. (2002); (8) Hanisch et al. (2003); (9) Narayanaswamy et al. (2003); (10) Hou and Chen (2003); (11) Lee et al. (2003); (12) Morgan et al. (2003); (13) Shen et al. (2003); (14) Tsuruoka and Tsujii (2003))
to search a database of gene names, rather than DNA sequences (Krautham-
mer et al., 2000; Altschul et al., 1990). Because BLAST allowed approximate
matches, this method could also detect small variations of the gene names in
the dictionary. Although such dictionary-based methods are easy to under-
stand and relatively simple to implement, maintaining dictionaries is difficult
given the rapid rate of genome research. The Mouse Genome Database alone
logged 166 name additions and withdrawals in a single week (Mouse Genome
Database; Friedman et al., 2003).
One insight that decreased the reliance on dictionaries is that, despite their
diversity, many gene names look like other gene names. The appearance of a
word, its suffixes, prefixes, capitalization, or numbers, can help identify it as
a gene or protein (Fukuda et al., 1998). One particularly strong clue that a
word may be a protein is the suffix -ase, which the Nomenclature Committee of
the International Union of Biochemistry and Molecular Biology (NC IUBMB)
has standardized for naming enzymes (Webb, 1992). Another commonly used
heuristic is the -in suffix. However, although many protein names end with -in,
that suffix is also common among technical words, such as penicillin, heparin,
or serotonin (Table 4.2). Appearance clues can mislead when scientific naming
conventions, such as those for cell lines or viruses, are similar to those of genes
(Tanabe and Wilbur, 2002a).
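The appearance clues described above can be sketched as simple predicates on a word; the function and feature names here are my own illustration, not any published system's feature set.

```python
import re

def appearance_clues(word):
    """Appearance-based clues that a word may be a gene/protein name:
    suffixes, embedded digits, and mixed capitalization. The -ase
    suffix is standardized for enzymes; -in is suggestive but also
    common in drug names such as penicillin."""
    return {
        'ase_suffix': word.lower().endswith('ase'),
        'in_suffix': word.lower().endswith('in'),
        'has_digit': any(c.isdigit() for c in word),
        # Lowercase-then-uppercase, or more than one uppercase letter
        # with lowercase between, as in CaSR.
        'mixed_case': bool(re.search(r'[a-z][A-Z]|[A-Z][a-z].*[A-Z]', word)),
    }
```

For example, "kinase" triggers the -ase clue, "cdk4" the digit clue, and "CaSR" the mixed-case clue, while "penicillin" shows why the -in clue alone is unreliable.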
Fortunately, leveraging the syntax of a sentence can alleviate some errors.
Since all names are nouns, a part of speech tagger can restrict the domain
of words and eliminate the possibility of erroneously identifying words that
have other parts of speech. Unfortunately, there are no taggers optimized for
biological literature. Using taggers developed for other corpora can result in
errors. One gene name identification algorithm compensates for tagging errors
by using a dictionary and appearance rules to recover lost names (Proux et al.,
1998).
Another use of syntax structure is to define the local context of a putative
gene or protein name. A noun phrase with a gene or protein name often con-
tains related words, such as those that describe molecular function or inter-
actions. One system, EDGAR, contains a contextual identification module that
MH    Name                                                      # Proteins  # Other
D01   Inorganic Chemicals                                                        3
D02   Organic Chemicals                                                        134
D03   Heterocyclic Compounds                                                    99
D04   Polycyclic Hydrocarbons                                                   27
D06   Hormones, Hormone Substitutes, and Hormone Antagonists        19          23
D08   Enzymes, Coenzymes, and Enzyme Inhibitors                     37           2
D09   Carbohydrates and Hypoglycemic Agents                                     46
D10   Lipids and Antilipemic Agents                                             10
D11   Growth Substances, Pigments, and Vitamins                                 32
D12   Amino Acids, Peptides, and Proteins                          202           1
D13   Nucleic Acids, Nucleotides, and Nucleosides                                8
D20   Anti-Infective Agents                                                      4
D24   Immunologic and Biological Factors                                        39
      OTHER                                                                     22
      TOTAL                                                        258         450

Table 4.2: MeSH Terms That End with -in. This table shows the distribution of words that end with -in across MeSH. The first column is the MeSH Heading ID. Nearly all the terms in MeSH that end with -in occur under D. Chemicals and Drugs. The final two columns show the number of -in words that are proteins and non-proteins, respectively. Although protein names constitute a majority of words that end with -in, many other technical terms, such as organic chemicals, also share the suffix.
uses the signal words directly before gene names such as activated, expres-
sion, mutated, or gene (Rindflesch et al., 2000). Other systems consider more
distant words (Fukuda et al., 1998; Narayanaswamy et al., 2003). However,
such heuristics miss the many occurrences of gene names without context clues
(Tanabe and Wilbur, 2002b).
One final characteristic of gene names that has not yet been fully exploited
is morphology, the derivation and formation of words. Biologists sometimes
indicate relationships among genes and proteins by varying their prefixes or
suffixes. For example, the cdk4 and cdk7 genes both share the stem cdk and
are both involved in cell cycle regulation. Since biologists name many genes
similarly, examining the variants of a word stem can help classify it as a gene
or protein name. Morphology is analogous to appearance because they both
scrutinize the patterns of characters in a word. However, while the appearance
of a word can be examined by itself, my notion of morphology compares the
appearance of a word to the other words in the lexicon.
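The morphology idea above, grouping words that share a stem and differ only in a numeric suffix, can be sketched as follows. The crude stemming rule (strip trailing digits) is my own simplification for illustration.

```python
import re
from collections import defaultdict

def stem_variants(lexicon):
    """Group words into morphological families like cdk4/cdk7 by a
    crude stem (trailing digits stripped, lowercased). A stem with
    several numbered variants suggests a gene family."""
    families = defaultdict(set)
    for word in lexicon:
        stem = re.sub(r'\d+$', '', word.lower())
        if stem and stem != word.lower():  # keep only numbered variants
            families[stem].add(word.lower())
    return {s: v for s, v in families.items() if len(v) >= 2}
```

Given a lexicon containing cdk4, cdk7, IL2, and IL6, this yields the families cdk and il, while a singleton like p53 is discarded.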
To handle the diversity of gene and protein names, I have implemented
a method called GAPSCORE that combines syntax, appearance, context, and
morphology. My notion of context, however, differs from that of previous ap-
proaches. To identify a single word gene name that occurs without context
clues, I use all information about the word in MEDLINE. I combine these char-
acteristics using supervised machine learning.
In my supervised machine learning framework, a classifier learns a model
by fitting parameters based on information from a training set of labelled
genes and non-genes. I quantify the appearance, morphology, and context
of each gene or non-gene as a numerical feature vector. Then, the classifier
can identify new words by scoring them based on similarities to the previously
observed training set. There are many well-studied machine learning clas-
sifiers that learn different models. Since no classifier performs best over all
types of data, I tested a simple classifier, Naïve Bayes, against two more
complex classifiers known for high accuracy, Maximum Entropy and Support
Vector Machines (Manning and Schütze, 1999; Ratnaparkhi, 1998; Burges, 1998;
Joachims, 1997).
After developing my system, I evaluated its performance on the publicly
available Yapex text collection (Franzen et al., 2002). The Yapex collection
consists of a training set of 99 abstracts from MEDLINE related to protein
binding, and a test set of 101 abstracts, of which 48 are relevant to protein
binding, and the rest were chosen randomly from the GENIA corpus (Ohta
et al., 2002).
Evaluating gene and protein name identification algorithms, however, is dif-
ficult. Problems stem from equivocal distinctions between genes (both genomic
and transcribed mRNA) and proteins and disagreements in the definition of
protein. When reading the same text, experts agree on whether a name refers
to a gene, protein, or mRNA only 77% of the time (Hatzivassiloglou et al., 2001).
Furthermore, experts only agree on whether a word is even a gene or protein
69% of the time (Krauthammer et al., 2000). The Yapex text collection ad-
dresses this ambiguity by specifically excluding peptides and protein families
(Franzen et al., 2002).
Because these ambiguities have not been explicitly resolved, algorithms of-
ten contain differing notions of protein names, which hinders direct compari-
son. In addition, implicit assumptions about the text also impede attempts to
compare. Algorithms often perform worse when applied to a different corpus.
Proux found that the precision of his method dropped from 91% to 70% when
transferred from a corpus of sentences from FlyBase to a more general set of
MEDLINE articles (Proux et al., 1998). Tanabe addressed this problem by ap-
plying a Bayesian statistical method to filter articles that were not likely to
contain a gene name (Tanabe and Wilbur, 2002b).
Therefore, to obtain an accurate measure of performance, I developed the
features used by my machine learning classifier on a corpus we created inde-
pendent from Yapex. I also fit the parameters of the classifier on this data set.
To reconcile differences in definitions of protein names, I used the Yapex train-
ing set to create a list of stop words that did not match the stricter definition of
protein name in Yapex. Finally, I evaluated my algorithm on the Yapex test set.
4.1 An Algorithm to Identify Gene and Protein
Names
GAPSCORE scores gene and protein names in written natural language text.
Since it does not distinguish between genes and proteins, we use gene generi-
cally to mean both. The algorithm consists of five steps: (1) TOKENIZE: I split
the document into sentences and words. (2) FILTER: I remove from consid-
eration any word that is clearly not a gene name. (3) SCORE: I score words
using a machine learning classifier. (4) EXTEND: I extend each word to the
full gene name. (5) MATCH ABBREVIATION: Finally, we score abbreviations
of the gene names identified (Figure 4.1).
4.1.1 Tokenizing the Sentences
The TOKENIZE step identifies the sentences and words in a document. I iden-
tify the sentence boundaries using a simple set of heuristics. I start by assum-
ing that any period, question mark, or exclamation point followed by a space
and then a capitalized letter is a sentence boundary. Periods that occur as part
of e.g. are exceptions to this rule.
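A minimal sketch of this sentence-splitting heuristic, assuming a small illustrative exception list (the actual implementation's exceptions are not fully enumerated in the text):

```python
import re

# Abbreviations whose trailing periods should not end a sentence;
# an illustrative list, not the system's actual exception set.
ABBREVS = ("e.g.", "i.e.", "et al.")

def split_sentences(text):
    """Split at a period, question mark, or exclamation point followed by
    a space and a capitalized letter, except after known abbreviations."""
    sentences, start = [], 0
    for m in re.finditer(r'[.?!](?= [A-Z])', text):
        chunk = text[start:m.end()]
        if chunk.rstrip().endswith(ABBREVS):
            continue  # the period belongs to an abbreviation like "e.g."
        sentences.append(chunk.strip())
        start = m.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences
```

The lookahead keeps the terminal punctuation attached to each sentence while splitting on the following space.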
[Figure: the five-step pipeline (Tokenize, Filter, Score, Extend, Match
Abbreviations) applied to the fragment "coactivate human keratin 4 (K4)
promoter", ending with "Gene Found: human keratin, K4 Score: 0.97".]

Figure 4.1: Recognizing Gene Names. I scan through text one word at a time,
filtering words that we immediately recognize to not be gene names. Then, I
score the remaining words with a machine learning classifier, extend multi-word
gene names, and score their abbreviations.
Within each sentence, I define a word as a string of alphanumeric char-
acters. Any space and most punctuation are word boundaries. We handle
dashes separately since many gene names contain them (e.g. c-jun, IL-2, IGF-
I). Dashes are not boundaries when the previous token is a single letter, or the
next token is a number or Roman numeral.
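The dash rules can be sketched as follows; this is a simplified reconstruction, not the original tokenizer, and the Roman-numeral check is deliberately loose:

```python
import re

ROMAN = re.compile(r'^[IVX]+$')  # simplified Roman numeral test

def tokenize_words(sentence):
    """Split a sentence into alphanumeric tokens, treating a dash as part
    of a word when the preceding piece is a single letter or the following
    piece is a number or Roman numeral (e.g. c-jun, IL-2, IGF-I)."""
    pieces = re.findall(r'[A-Za-z0-9]+|-', sentence)
    tokens, i = [], 0
    while i < len(pieces):
        piece = pieces[i]
        if piece == '-' and tokens and i + 1 < len(pieces) and pieces[i + 1] != '-':
            prev, nxt = tokens[-1], pieces[i + 1]
            prev_tail = prev.rsplit('-', 1)[-1]
            # Keep the dash inside the token per the rules above.
            if (len(prev_tail) == 1 and prev_tail.isalpha()) \
                    or nxt.isdigit() or ROMAN.match(nxt):
                tokens[-1] = prev + '-' + nxt
                i += 2
                continue
        if piece != '-':
            tokens.append(piece)
        i += 1
    return tokens
```

A dash that fails both conditions, as in "tumor-suppressor", acts as an ordinary word boundary.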
4.1.2 Filtering Recognized Words
The FILTER step removes known non-gene words to increase the overall ac-
curacy and performance of the algorithm. I discard words that are not gene
names, but may be part of a gene name. For example, the 1 and alpha in
Interleukin-1 alpha are discarded, but recovered later in the EXTEND step.
First, I apply Brill’s tagger and remove from consideration words that are
not nouns, adjectives, participles, proper nouns, or foreign words (Brill, 1994).
I use the default settings for the tagger and did not customize it for my corpus.
I discard numbers, Roman numerals (I-X), Greek letters, amino acids, 7 virus
names, and 13 chemical compounds. Because virus names and chemical com-
pounds resemble gene symbols and may indicate genes in certain contexts, we
conservatively discard only the ones prevalent in my training set. I also
discard names of organisms found in the SWISS-PROT database (Bairoch and
Boeckmann, 1991).
Finally, I discard words from a manually created list of 49 regular expres-
sion patterns. I compiled this list by running the algorithm on the Yapex train-
ing set and scanning the results for high-scoring technical terms. These pat-
terns include 7 that match words that indicate genes and proteins (e.g. protein,
DNA, gene); 17 subunits, parts, or complexes of genes and proteins (e.g. pep-
tide, chain, motif, complex); 5 related molecules (e.g. ATP, cAMP); and 20 types
or descriptions of genes (e.g. receptor, expressed).
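A rough sketch of the FILTER step; the stop lists and patterns below are small illustrative stand-ins for the full lists described above, and the part-of-speech filtering with Brill's tagger is omitted:

```python
import re

# Illustrative stand-ins for the filter lists described in the text;
# the real system uses larger, manually compiled lists.
GREEK = {"alpha", "beta", "gamma", "delta"}
AMINO_ACIDS = {"alanine", "glycine", "serine"}
ROMAN_I_TO_X = {"I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"}
STOP_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r'^proteins?$', r'^dna$', r'^genes?$',
              r'^peptides?$', r'^receptors?$', r'^complex(es)?$')
]

def passes_filter(word):
    """Return False for words that are clearly not gene names."""
    if word.isdigit() or word in ROMAN_I_TO_X:
        return False
    if word.lower() in GREEK or word.lower() in AMINO_ACIDS:
        return False
    return not any(p.match(word) for p in STOP_PATTERNS)
```

Discarded words such as "1" or "alpha" can still be recovered later by the EXTEND step when they are part of a longer name.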
4.1.3 Scoring Words
My algorithm scores most remaining unfiltered words using a machine learn-
ing classifier. I score separately two classes of proteins that are common and
easy to recognize unambiguously: enzyme names and Cytochrome P-450 pro-
teins. Gene names in these two classes automatically receive the highest pos-
sible score.
To distinguish gene names that end with -ase, I compiled a dictionary of all
known non-gene words that also end with -ase (e.g. kilobase, disease). I selected
the 327 words that end with -ase or -ases from Webster’s Second International
dictionary. Then, I manually removed gene names from the list and added
one word, gases. This resulted in a list of 196 words that are not gene names.
There were no ambiguous words that had both an enzyme and a non-enzyme
definition.
I also score separately the Cytochrome P450 proteins because they follow
a regular nomenclature (Cytochrome P450 Homepage). I use four regular ex-
pression patterns to recognize names with the form: cytochrome P450 2D6,
p450 IID6, CYP2d6, or just CYPs. Once a regular expression matches a Cy-
tochrome P450 protein name, the algorithm also identifies in the document
other short forms of the same family, e.g. 2D6, as proteins.
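The four patterns can be approximated with regular expressions like these (illustrative reconstructions; the original patterns are not reproduced in the text):

```python
import re

# Illustrative approximations of the four Cytochrome P450 patterns for
# forms such as "cytochrome P450 2D6", "p450 IID6", "CYP2d6", and "CYPs".
CYP_PATTERNS = [
    re.compile(r'^cytochrome[- ]p-?450([- ]\d+[a-z]\d+)?$', re.IGNORECASE),
    re.compile(r'^p-?450[- ][ivx]+[a-z]?\d*$', re.IGNORECASE),
    re.compile(r'^cyp\d+[a-z]\d+$', re.IGNORECASE),
    re.compile(r'^cyps?$', re.IGNORECASE),
]

def is_cyp_name(phrase):
    """Return True if the phrase matches a Cytochrome P450 name pattern."""
    return any(p.match(phrase) for p in CYP_PATTERNS)
```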
Most words, however, do not match these two special cases. For these, I
encode their appearance, morphology, and context as a feature vector for a
machine learning classifier.
Length:
  Word is 1 Letter Long
  Word is 2 Letters Long
  Word is 3-5 Letters Long
  Word is 6+ Letters Long

Presence of Numbers:
  Word Starts with Digits
  Word Ends with Digits
  Word Ends with a Roman Numeral

Case:
  Word is Capitalized
  Word Ends With an Upper Case Letter
  Word Has Mixed Upper and Lower Case
  Word Ends with Upper Case Letter and Number

Other:
  Word Has Greek Letter
  Word Has Dash

Table 4.3: Gene Name Appearance Features. These features encode a
13-dimension vector that describes the appearance of a word. For a specific
word, the value for each feature is 1 if it describes the word and 0 otherwise.
Appearance
I model the appearance of two types of genes, gene symbols (such as TPMT or
NAT1) and gene names that end with -in (e.g. insulin). For gene symbols, I
compute a feature vector from the features described in Table 4.3. The value of
each feature is 1 if the symbol has the characteristic and 0 otherwise.
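The features of Table 4.3 translate into a 13-dimensional 0/1 vector; the sketch below makes some simplifying assumptions, e.g. Greek letters are detected as spelled-out names from a small illustrative list:

```python
import re

GREEK = ("alpha", "beta", "gamma", "delta", "kappa")  # illustrative subset
ROMAN_END = re.compile(r'(I|II|III|IV|V|VI|VII|VIII|IX|X)$')

def appearance_features(word):
    """Encode the 13 binary appearance features of Table 4.3 as 0/1 values."""
    n = len(word)
    return [
        int(n == 1),                                 # 1 letter long
        int(n == 2),                                 # 2 letters long
        int(3 <= n <= 5),                            # 3-5 letters long
        int(n >= 6),                                 # 6+ letters long
        int(word[:1].isdigit()),                     # starts with digits
        int(word[-1:].isdigit()),                    # ends with digits
        int(bool(ROMAN_END.search(word))),           # ends with Roman numeral
        int(word[:1].isupper()),                     # capitalized
        int(word[-1:].isupper()),                    # ends with upper case
        int(any(c.isupper() for c in word)
            and any(c.islower() for c in word)),     # mixed case
        int(bool(re.search(r'[A-Z]\d+$', word))),    # upper case letter + number
        int(any(g in word.lower() for g in GREEK)),  # has Greek letter
        int('-' in word),                            # has dash
    ]
```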
I recognize gene names that end with -in based on the hypothesis that those
names have characteristic patterns of letters that distinguish them from other
words. To find those patterns, I use a generic statistical model that learns
variable length N-grams to classify phrases, described in (Smarr and Manning,
2002).
To train the N-gram model, I created a training set of words that
end with -in from Medical Subject Headings (MeSH) (Humphreys et al.,
1998). MeSH is a hierarchy of 21,973 phrases used to index MEDLINE
citations. I normalized each phrase by removing numbers, single let-
ters, and Roman numerals. Then, I discarded multi-word phrases, leav-
ing only single words. We stemmed each word by removing the termi-
nal -s if the resulting stem occurs as a word in MEDLINE. Finally, I re-
moved any word that did not end with -in, leaving a list of 708 unique
words. A word was a protein if it belonged to one of 15 MeSH classes:
Cytochromes, DNA Restriction-Modification Enzymes, Holoenzymes,
Isoenzymes, Isomerases, Ligases, Lyases, Neuropeptides, Oxidoreductases,
Peptide Hormones, Peptides, Proteins, Receptors, Immunologic, and
Transferases. The remaining words were not proteins.
I trained the N-gram classifier on the MeSH training set. For words that
do not end with -in, I automatically assign them the lowest possible score of 0.
For words that do end with -in, I use the score from the classifier; those scores
constitute the final dimension of the appearance feature vector.
Morphology
I model morphology by quantifying the tendency for a word to vary in ways
similar to gene names. The morphology feature vector consists of scores for
8 types of variations (Table 4.4). I calculate the score based on the number
of times a stem and its variants appear in MEDLINE. The stem is the word
without the prefixes and suffixes, and the variant includes them. For example,
one type of variation counts word stems with numbers appended. Many genes
vary this way, such as ced with its variants ced1, ced3, and ced9.
Prefix:
  Greek Letter + stem
  "apo" or "holo" + stem

Suffix:
  stem + Greek Letter
  stem + Roman Numeral
  stem + Upper Case Letter
  stem + Number
  stem + Upper Case Letter + Number
  stem + lower case letter

Table 4.4: Morphologic Variations in Gene Names. This table shows variations
of gene and protein names that I score in a feature vector. Each variant is
either a prefix or suffix of the word stem.
The value of each morphology feature is:

    log max(1/1000, # Vars / # Stems)

where # Stems is the number of times a stem appears by itself in MEDLINE,
and # Vars is the total number of times the stem appears with a variation.
Empirically, the ratio of these counts, when plotted for all words in MEDLINE,
follows an exponential distribution. Therefore, to improve discrimination in
machine learning, I take the log of that ratio.
The final piece of the equation handles typographical and spelling errors
that become significant over all of MEDLINE. For example, with1 occurs
9 times. To alleviate the effects of such errors, we set a minimum cutoff to
ignore variants that appear less than 1 time in 1000. I found this cutoff empir-
ically based on cross-validation on my training set. I did not set a maximum
ratio cutoff for variants that outnumber stems. Finally, I did not score a variant
if the stem never appears in MEDLINE.
As an example, the ced word contains many variations that match the pat-
tern “stem+Number.” The stem ced occurs 182 times in MEDLINE, and the
variants, ced1, ced3, and ced9, occur 1, 3, and 5 times. The score for ced and its
3 variants is thus log(9/182) = −1.31. On the other hand, with appears 5,193,871
times and all its variants (including with1) occur 82 times. Since the ratio is
less than my minimum cutoff, the score for those words is −3.
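This scoring rule can be written directly; the base-10 logarithm reproduces the −1.31 and −3 values in the examples:

```python
import math

MIN_RATIO = 1.0 / 1000  # cutoff that suppresses typo-level variant counts

def morphology_score(n_variants, n_stems):
    """Score one morphologic variation: log10 of the variant-to-stem
    count ratio, floored at 1/1000."""
    if n_stems == 0:
        return None  # the stem never appears in MEDLINE; do not score
    return math.log10(max(MIN_RATIO, n_variants / n_stems))

# The ced example: the stem appears 182 times, its variants 9 times.
ced_score = morphology_score(9, 182)            # ≈ -1.31
# The with example: 82 variant occurrences against 5,193,871 stems
# falls below the 1/1000 cutoff.
with_score = morphology_score(82, 5193871)      # = -3.0
```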
I precompute the morphology score for every word in MEDLINE. For each
type of variation, I first calculate the score assuming that the word is a stem. I
look for all other words in MEDLINE that match that variation. For example,
for the variation requiring Greek letter suffixes, I look for words in MEDLINE
that are composed of the stem word followed by any Greek letter. For the Ro-
man numeral variation, I only consider the first ten Roman numerals since
they are the most common. If a word can be either a variation or a stem, I use
the higher of the two scores.
Context
Finally, I model context features based on the observation that gene names
often occur next to strong positive and negative signal words. For example,
the word directly before gene is frequently a gene name, but the word directly
after within is rarely one. Thus, my approach is similar to earlier systems that
only consider immediate neighbors. It differs by using negative signal words
as well as positive. I searched for both types of signal words by calculating the
correlation between each word in MEDLINE with the presence of gene names
directly before or after the word. Then, for an unknown word, the distribution
of its occurrences around signal words comprises its context feature vector.
Gene names should appear most often next to positive signals and least next
to negative ones.
To find the signal words, I created a training set of 1025 words, which in-
cluded 574 gene names. I randomly chose 500 nouns that appeared in year
2001 MEDLINE abstracts containing the word gene or protein. To increase the
prevalence of gene names, I added 525 more words that appeared before gene,
protein, or mrna. I chose these words randomly to ensure that there would be
no bias toward making these three words signal words.
Then, for each word in MEDLINE, I tallied the number of times it occurred
next to my labelled words in a 2x2 contingency table:
              expression    NOT expression
GENE          (A) 253       (B) 321
NON-GENE      (C) 111       (D) 340
In this example, cell (A) contains the number of genes from my training
set found before expression anywhere in MEDLINE, cell (B) is the number of
genes never found before expression, cell (C) is the number of non-genes found
before expression, and cell (D) is the number of non-genes never found before
expression. I similarly counted the occurrences of expression after gene names.
If expression is a strong signal that the previous word is a gene name, then
the ratio of genes to non-genes would be higher in the first column, the expres-
sion column, than the second. We calculated the significance of the difference
in the ratio using a χ² test. Out of the 287,680 words from MEDLINE that
appeared next to a word from my training set, 2567 were significant with a
p ≤ 1 × 10−7, which is roughly a p-value cutoff of 0.05 with a Bonferroni
correction for multiple tests.

Previous Word    Next Word        p
gene name        gene             0.0E+00
gene name        mrna             1.2E-20
gene name        protein          4.8E-13
gene name        promoter         1.3E-13
gene             gene name        1.7E-10
gene name        genes            1.5E-10
gene name        expression       4.5E-09
gene name        transcripts      3.8E-08
gene name        mrnas            3.4E-07
or               non-gene name    3.3E-27
by               non-gene name    3.5E-21
non-gene name    or               1.8E-16
with             non-gene name    2.3E-12
to               non-gene name    1.5E-11
in               non-gene name    1.5E-10
non-gene name    were             9.0E-09
non-gene name    to               8.2E-09
for              non-gene name    2.0E-08
…

Table 4.5: Gene Name Signal Words. This table shows a list of words that occur
next to gene names either more frequently or more rarely than expected. The
word pairs in the top half of the table occur more often than expected, implying
that those bold-faced words are strong indicators of gene names. The pairs in
the bottom half of the table occur more than expected for words that are not
gene names. The p column shows the statistical significance of the association.
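The significance computation on such a 2×2 table can be sketched with the standard χ² formula (shown without continuity correction, which the text does not specify):

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table [[a, b], [c, d]],
    without continuity correction."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# The 'expression' table from the text: 253/321 genes and 111/340
# non-genes observed before the word 'expression'.
chi2 = chi_square_2x2(253, 321, 111, 340)
```

For these counts the statistic is roughly 41.8, which at one degree of freedom corresponds to a p-value far below the cutoff, so expression qualifies as a signal word.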
Although the 2567 words I found were all statistically significant, useful
signal words should also be ubiquitous. Obscure words could only discriminate
a small number of names. Therefore, I further narrowed my list by selecting
the most common signal words. Since only 9 of the 2,567 words were positive
signal words, I kept them all. Then, for balance, I chose the 9 negative signal
words that appeared with the greatest number of words in my training set.
This resulted in the 18 signal words that are listed in Table 4.5.
Then, I used these signal words to encode the context of a word into feature
vectors. Each feature is the number of times that a word occurs with each
signal word across all of MEDLINE. I calculated the distribution across signal
words by normalizing the feature vector to 1.0.
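Assembling the context feature vector can be sketched as follows; the signal-word lists are small illustrative subsets of the 18 words in Table 4.5:

```python
# Illustrative subsets of the 18 signal words from Table 4.5; the real
# lists were derived from the chi-square screen over all of MEDLINE.
POSITIVE_SIGNALS = {"gene", "mrna", "protein", "promoter"}
NEGATIVE_SIGNALS = {"or", "by", "with", "to"}

def count_signal_contexts(word, tokenized_sentences):
    """Count how often `word` occurs directly before or after each
    signal word."""
    signals = sorted(POSITIVE_SIGNALS | NEGATIVE_SIGNALS)
    index = {s: i for i, s in enumerate(signals)}
    counts = [0] * len(signals)
    for sent in tokenized_sentences:
        for i, tok in enumerate(sent):
            if tok != word:
                continue
            # Examine only the immediate neighbors of the word.
            for neighbor in sent[max(0, i - 1):i] + sent[i + 1:i + 2]:
                if neighbor in index:
                    counts[index[neighbor]] += 1
    return counts

def context_features(counts):
    """Normalize raw co-occurrence counts into a distribution that sums
    to 1.0 (all-zero counts stay all zeros)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)
```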
Classifier
Finally, I concatenated the appearance, morphology, and context vectors to cre-
ate the final combined vector.
To train the machine learning classifiers, I created a training set of words
from 634 MEDLINE abstracts found from searches on regulatory elements
and 101 MEDLINE abstracts cited by a review article on pharmacogenomics
(Evans and Relling, 1999). I manually categorized each word from these ab-
stracts as either a gene name or non-gene. For a multiple word gene name,
I labelled as genes only the core gene-meaning words. I labelled ambiguous
words, those that have a dominant non-gene meaning, non-genes. In addi-
tion, I included 8,617 words from MeSH that I identified using the criteria
described above in the Morphology section. This resulted in a training set of
19,952 unique labelled words.
I used these words and trained 3 types of classifiers: Naïve Bayes, Maximum
Entropy, and Support Vector Machines. Since Naïve Bayes required categorical
features, I binned the features. For each dimension in the feature
vector, I assigned the values into 5 bins evenly spaced between the lowest and
highest values. For Maximum Entropy, I estimated the parameters using a con-
jugate gradient descent method that has been found to converge quickly and
accurately (Malouf, 2002). I trained Support Vector Machines using the linear,
polynomial, and radial basis function kernels. I varied the C error penalty pa-
rameter and chose the parameters that performed best on the Yapex training
set.
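The five-bin discretization used for Naïve Bayes can be sketched as:

```python
def bin_feature(values, n_bins=5):
    """Assign each continuous value to one of n_bins evenly spaced bins
    between the minimum and maximum observed values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp so the maximum value falls in the last bin, not a sixth one.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```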
4.1.4 Extending to Noun Phrase
After a word is scored, to identify multi-word gene names, I extend the name
using heuristics similar to the ones described in (Fukuda et al., 1998). Using
the parts of speech from Brill’s tagger, I include the nouns, adjectives, and
participles preceding the putative gene name. Then, I lengthen the name to
include the following words that are single letters, Greek letters, and Roman
numerals. Finally, I remove extraneous punctuation at the beginning or end of
the name, except for open or close parenthesis characters required to complete
a pair.
4.1.5 Matching Abbreviations
After the algorithm establishes the full gene names, it searches for abbrevia-
tions in the document using the algorithm described in Chapter 3. If the long
form of an abbreviation has a higher score, it transfers that score to the ab-
breviation. The algorithm likewise transfers higher scores from abbreviations
back to the long forms.
4.1.6 Implementation
I implemented GAPSCORE in the Python 2.2 and C languages using the Biopy-
thon 1.10 library (Lutz et al., 1999; Kernighan and Ritchie, 1988; Biopython).
The Naïve Bayes code is available as part of Biopython. We implemented the
Maximum Entropy code using the conjugate gradient descent code from Nu-
merical Recipes in C, including the fixes from the website (Press et al., 1993;
Numerical Recipes Home Page). I used the libsvm implementation of Support
Vector Machines (Chang and Lin, 2001). To improve performance, I cache word
scores as well as various intermediate computations into a MySQL database
(MySQL).
4.2 Performance of GAPSCORE
I evaluated the performance of my algorithm against the Yapex test gold stan-
dard. To obtain an accurate result, I did not look at or run my algorithm against
this data set until after I had finalized it and fitted all parameters. To compare,
I ran the Yapex algorithm, available from their web site, on that data set on
4/6/2003. When the Yapex algorithm predicted overlapping gene names, I used
only the longest one.
I quantified the performance of the algorithms using recall, precision, and
F-score. Recall,

    Recall = (# correctly predicted gene names) / (# gene names),

measures how thoroughly a method can identify gene names. Precision,

    Precision = (# correctly predicted gene names) / (# predictions),

indicates the rate at which an algorithm produces errors. F-score,

    F-score = (2 × Recall × Precision) / (Recall + Precision),

combines recall and precision into a single number.
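The three measures translate directly into code:

```python
def recall(n_correct, n_gold):
    """Fraction of true gene names that were found."""
    return n_correct / n_gold

def precision(n_correct, n_predicted):
    """Fraction of predictions that were correct."""
    return n_correct / n_predicted

def f_score(r, p):
    """Harmonic mean of recall and precision."""
    return 2 * r * p / (r + p) if (r + p) else 0.0
```

For example, a sloppy-match recall and precision of 83.3% and 81.8% give f_score(0.833, 0.818) ≈ 0.825.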
I assessed the performance of the algorithms on exact and sloppy gene name
matches using the definitions described in (Franzen et al., 2002). Using exact
match, the predicted gene name must be equivalent to the corresponding name
in the gold standard. Using sloppy match, a predicted gene name only needs to
overlap the name in the gold standard. However, if two predicted genes overlap
the same multi-word gene name, only one is considered correct and the other
is incorrect.
Since my algorithm could produce scores, I calculated the recall and preci-
sion at every score cutoff. The resultant curve illustrates the tradeoff between
recall and precision. The user can choose a strict cutoff for applications that
require high precision, or a more lenient one for applications that require high
recall.
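Sweeping the score cutoff can be sketched as follows (a simplification that treats each prediction as its own cutoff, without grouping tied scores):

```python
def pr_curve(scored_labels, n_gold):
    """Compute (cutoff, recall, precision) at every prediction score.

    scored_labels: list of (score, is_gene) predictions.
    n_gold: total number of true gene names in the gold standard, which
            may exceed the number of predictions if some names are missed.
    """
    points, tp = [], 0
    ordered = sorted(scored_labels, key=lambda x: -x[0])
    for i, (score, is_gene) in enumerate(ordered, start=1):
        tp += int(is_gene)
        points.append((score, tp / n_gold, tp / i))
    return points
```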
I trained Naïve Bayes (NB), Maximum Entropy (ME), and Support Vector
Machine (SVM) classifiers on my own training set. Since this training set con-
sisted of words and not phrases, there was no distinction between sloppy or
exact match. Thus, a gene name prediction was correct if the word was la-
belled a gene in the data set. Since SVMs required user-specified parameters,
  C      Linear    Poly1     Poly2     Poly3     RBF
  1      78.11%    78.12%    76.48%    75.67%    77.82%
  5      78.14%    78.02%    76.98%    76.31%    76.92%
  10     78.14%    78.08%    76.36%    76.62%    76.11%
  50     78.14%    78.11%    74.24%    76.37%    76.00%
  100    78.05%    78.14%    74.35%    76.12%    75.68%

Table 4.6: Parameters for Support Vector Machines. This table shows the
F-scores achieved by SVMs with different parameters on the Yapex training set.
The columns contain kernels (linear, first degree polynomial, second degree
polynomial, third degree polynomial, and radial basis function) and the rows
contain different values for C, an error penalty parameter. Higher values of C
result in models with stronger support vectors. For each set of parameters, I
chose the score cutoff that resulted in the highest F-score.
I tested various combinations of kernels and C error penalty parameters (Ta-
ble 4.6). (See (Burges, 1998) for a description of the parameters.) The F-scores
of these parameters varied from 74% to 78%; the maximum was attained by a
linear kernel with C = 5.
Using optimal parameters, I compared the performance of NB, ME, and
SVM on the Yapex training set, scoring with sloppy matches (Table 4.7). Al-
though all classifiers performed comparably, the F-score for SVM was slightly
higher than the rest. At its maximum F-score, SVM scored 3.5% higher preci-
sion with 2.1% loss in recall compared to ME, the next best performing classi-
fier. However, at the same recall as ME, 81.4%, SVM only improved precision
by a marginal 0.4%.
Then, with the best classifier parameters, I tested the algorithm with vari-
ous modules disabled (Table 4.8). Leaving out the negative filter had the most
detrimental impact on performance, resulting in an 11.8% decrease in F-score,
                    Recall    Precision    F-Score
  Naïve Bayes       81.1%     72.5%        76.6%
  Maximum Entropy   81.4%     73.5%        77.3%
  SVM               79.3%     77.0%        78.1%

Table 4.7: Comparing Algorithms to Classify Gene Names. This table compares
the recall, precision, and F-score of three different classifiers when
searching for gene names.
                      Recall    Precision    F-score    ∆ F-score    Prec/75
  No Filter           67.1%     65.6%        66.3%      -11.8%       56.5%
  No Appearance       68.1%     72.4%        70.2%      -7.9%        61.4%
  No -in Feature      68.2%     75.4%        71.6%      -6.5%        65.8%
  No -ase Classifier  74.3%     76.0%        75.1%      -3.0%        74.7%
  No Abbreviation     74.7%     76.4%        75.5%      -2.6%        76.0%
  No Morphology       71.5%     82.4%        76.6%      -1.6%        78.1%
  No Context          85.0%     71.5%        77.7%      -0.5%        76.9%
  ALL                 79.3%     77.0%        78.1%      N/A          79.2%

Table 4.8: Removing Modules Reduces GAPSCORE Performance. This table shows
the performance of GAPSCORE (using sloppy match scoring on the Yapex training
set) with different modules disabled. The modules are sorted by increasing
F-score. The first three columns show the greatest F-score that the reduced
algorithm can achieve, as well as the recall and precision at that F-score.
∆ F-score is the decrease in F-score compared to the complete algorithm.
Prec/75 is the precision at 75% recall.
while leaving out context features had the least effect, leading to a 0.5% de-
crease. I also tested the context feature with only the positive and only the
negative signal words. Leaving out the positive signal words decreased the
F-score by 0.07%, and leaving out the negative ones led to a larger decrease of
0.14%.
Finally, I ran GAPSCORE on the Yapex test set and compared the per-
formance against the Yapex algorithm (Figure 4.2). On sloppy match, the
Yapex algorithm received an F-score of 75.4% at recall and precision of 70.3%
and 81.4%. In comparison, GAPSCORE achieved 82.5% F-score at 83.3% and
81.8%. The performance on exact match was closer, with Yapex receiving 54.3%
F-score (50.1% recall, 59.3% precision) and GAPSCORE 57.6% (58.5% recall,
56.7% precision).
The performance of Yapex and GAPSCORE differed with respect to the
length of the gene name. The Yapex test set contains 1,967 gene names; 1,546
of those names consist of only a single word. On sloppy match, GAPSCORE
found 85% of the one word names but only 76% of the multi-word ones. Yapex
exhibited a smaller discrepancy in performance and found 71% of the one word
names and 67% of the others.
4.3 Conclusions
I have created a method GAPSCORE that identifies protein and gene names
from text. It uses a word-based approach and scores the confidence that a
word may be a gene based on appearance, morphology, and context criteria
that includes information from all of MEDLINE. To identify the boundaries
of multi-word gene names, GAPSCORE extends the name using heuristics on
part of speech tags.
GAPSCORE scores new words using a Support Vector Machine. With care-
ful parameter tuning, this algorithm outperformed Maximum Entropy, and
Naıve Bayes. The performance of the linear kernel exceeded that of the more
complex ones including radial basis function. However, the differences among
the various parameters and classifiers were largely insubstantial.
In contrast, leaving out different modules led to more dramatic impacts
[Figure: two precision-recall curves (sloppy match and exact match), with
recall on the x-axis and precision on the y-axis, comparing GAPSCORE against
Yapex.]

Figure 4.2: Performance of GAPSCORE. I compared the performance of GAPSCORE
and Yapex on the Yapex test set using sloppy and exact match scoring. For
sloppy match, a gene name prediction is correct if it partially matches the
actual gene name. For exact match, the predicted name must match the whole
gene name.
on performance. Disabling the filter reduced F-score by 11.8%, because many
unspecific gene terms scored highly. The word protein occurred 199 times. For-
tunately, the manually constructed filter successfully discarded those terms in
the complete system. The next largest reduction in performance came from
removing appearance features, which confirms the approaches of some earlier
methods. However, even without appearance, the classifier could still achieve
an F-score of 70.2%, suggesting that the other features contain much informa-
tion about genes.
When comparing performance loss between the two types of signal words
in the context features, it is counterintuitive that removing the negative signal
words had a greater impact than removing the positive. After all, positive
signal words have been commonly used as gene name markers, while negative
signal words have not been used conversely. This performance can be explained
by the greater prevalence of the negative signal words. They are much more
common than the positive signal words, and thus affect a greater number of
words overall. Nevertheless, the performance difference is small.
Using all features, GAPSCORE outperformed Yapex by an F-score of 7.1%
on sloppy match and 3.3% on exact match. In addition, GAPSCORE found a
relatively larger portion of single word genes than Yapex. These differences
in performance indicate 1) more sophisticated analysis of single words can
help overall accuracy, and 2) deeper syntactic analysis can help find the cor-
rect boundaries for the names.
Nevertheless, methods that analyze single words will never be able to iden-
tify some phrases that indicate genes. For example, parathyroid hormone is
a peptide hormone encoded by a gene. However, neither parathyroid nor hor-
mone would indicate a gene by itself. Identifying these phrases would require
scoring phrases or collocations rather than single words.
Fortunately, such phrases that signify genes were not a significant source
of error. More substantial errors for GAPSCORE arose from differences in no-
tions of gene. The 10 highest scoring false positives were Kunitz-type protease,
PTK, alpha2, beta-globin, branched-chain alpha-ketoacid dehydrogenase, con-
ditional tyrosine kinase, elevated tyrosine kinase, endogenous 5-lipoxygenase,
globin, and glycoprotein. These proteins and genes may be missing from Yapex
because its definition excludes protein families. GAPSCORE was more sensi-
tive and identified many names that did not indicate a single identifiable gene.
When evaluating the names found by decreasing score, the first name that was
not a gene was COS-1, the 550th name at recall and precision of 25% and 90%.
Thus, these results underscore the importance of developing clear definitions
of protein names.
Some of the names from the list of false positives, and the relative decrease
in performance on exact match, suggest that deeper analysis is required to
correctly identify name boundaries. For example conditional tyrosine kinase
should not include the word conditional. In the future, I will investigate more
sophisticated methods for finding boundaries than part of speech heuristics.
One possible approach is to use Markov models that quantify the tendencies of
certain words to appear together (Majoros et al., 2003).
I also have not yet directly addressed ambiguous names – those that mean
genes in some contexts and non-genes in others. My current strategy of
labelling those words in my training set as non-genes adds noise to my data and
could have hurt my performance. These may need to be handled separately. In
addition, the context could be used to update the confidence of ambiguous gene
name predictions.
Finally, several difficult problems remain in gene name identification.
There is still considerable ambiguity in the definition of the task; related en-
tities must be differentiated, for example gene and gene products, gene struc-
ture, gene families, protein domains, protein complexes, and alleles. Also con-
founding the task is the inconsistent naming of many genes. These differences
may be small variations in tokenization or word order, or the names may be un-
related synonyms (Yu et al., 2002a; Hanisch et al., 2003). Therefore, methods
must be developed to normalize synonyms and other variants before these al-
gorithms are generally useful for unambiguous indexing and extraction tasks.
Nevertheless, I have begun using my algorithm to identify gene and pro-
tein names and their relationships to drugs. Preliminary timings indicate that
my current implementation, running on a (busy) single-processor 1.5 GHz Intel
Xeon with 512 MB of RAM, requires an average of 15 seconds to tag each
MEDLINE abstract. Another technical limitation to such a data-intensive ap-
proach is that the method may need retraining as new words and word senses
are introduced into the literature. It is unclear how rapidly the literature is
changing, and how these changes may affect the performance of the algorithm.
In conclusion, I have developed a new method, GAPSCORE, for finding gene
and protein names by combining novel formulations of features in a machine
learning framework. I found that Support Vector Machines slightly outper-
form other popular methods. When applied to the Yapex text collection, my
method achieves high performance due to its sophisticated analysis of single
words and the high prevalence of single word gene names. The algorithm
produces confidence scores that can be adjusted for either high recall or pre-
cision. GAPSCORE is available on the web at http://bionlp.stanford.edu/gapscore/.
CHAPTER 5
Extracting Gene-Drug Relationships
Much biological insight originates from the identification and characterization
of relationships among macromolecules. These interactions drive higher level
processes. Therefore, scientists have devoted much effort to developing tech-
nologies to elucidate those interactions, such as yeast two-hybrid screens, di-
rected genetic crosses, and expression analysis to infer networks.
Many interesting interactions are reported in unstructured free text, and
thus, unfortunately, are unavailable for high-throughput analysis. Because of
the vast number of molecules and relationships, identifying them manually is
daunting. Therefore, to extract interactions, researchers are investigating the
suitability of text processing algorithms.
The natural language processing (NLP) community, in particular, has stud-
ied the problem of identifying relationships from text. Spurred by the DARPA-
sponsored series of Message Understanding Conferences (MUC), the commu-
nity developed a technology called information extraction (IE). IE addressed
the problem of identifying pre-designated relationships in text. Because it
narrows the problem by specifying the relationships of interest, IE was more
tractable than the grandiose ambition of natural language understanding. IE
also differed from information retrieval, because its goal was to identify facts
rather than documents.
The MUC conferences provided a single data set and a uniform evaluation
metric, which made it possible to compare the performance of different algorithms.
The extraction tasks were typically complex, involving the relationships among
many different types of entities. For example, the task for MUC-6 pertained to
corporation management changes. Identifying all the information related to a
change required the identification of people in the organization, their job titles,
the location of the business, etc. However, not all extraction tasks must be this
complicated. Binary relationships, those that involve only two entities, are a
common special case. In the most recent MUC in 1997, the organizers recog-
nized this and created a separate category of binary extraction tasks (SAIC
Information Extraction). Binary relationships are particularly important for
the biological community. Much work has been focused on such relationships,
for example, the interactions between proteins.
Similar to the MUC conferences, there have also been open evaluations of
biological information extraction algorithms. The Knowledge Discovery and
Data Mining (KDD) Challenge Cup was an open evaluation to test algorithms
for data mining. Although the KDD tasks traditionally concerned neither text
nor biological applications, the 2002 contest focused on mining biological text
(Zaki et al., 2002). The contest included two tasks. The first was to identify
papers that contained experimentally derived functions of Drosophila genes
(and to identify the relevant gene) (Yeh et al., 2002, 2003). The winner of this
task used a traditional NLP approach, creating a rule-based system that could
recognize experiments and gene products (Regev et al., 2002). The number of
rules developed was not reported. However, statistical approaches performed
competitively: the two runners-up that published their methods both used
statistical classifiers (Keerthi et al., 2002; Ghanem et al., 2002).
The second KDD task was to predict whether a knockout gene would affect a
signalling pathway (Craven, 2002). This was presented outright as a statistical
classification problem containing, in addition to text data, information on gene
function, protein localization, and protein interactions. The winning strategy
used support vector machines (Kowalczyk and Raskutti, 2002). It is notable,
however, that papers in both tasks stressed the importance of a careful choice
of features (Forman, 2002; Keerthi et al., 2002).
Another community-based effort to compare algorithms on biological
datasets has recently begun at the Text REtrieval Conference (TREC)
(Voorhees, 2002a). In 2003, TREC included an information extraction task
in its Genomics track. The task was to automate the assignment of GeneRIF
(Gene References Into Function) annotations in LocusLink, a database of ge-
netic loci, including genes (Pruitt et al., 2000). A GeneRIF is “a concise phrase
describing a function or functions” for a gene (or technically, a genetic locus)
(LocusLink). Currently, manual annotators scour MEDLINE literature to as-
sign GeneRIFs for genes in LocusLink. However, because of the vast number
of genes and volume of literature, the annotations are incomplete. Thus, there
is considerable interest in developing information extraction methods that can
reproduce these automatically. At this moment, although the entries for the
TREC task have been submitted, the results have not yet been released or
published.
Finally, there is one more community-based evaluation called BioCreAtIve,
for the Critical Assessment of Information Extraction systems in Biology. This
ongoing competition consists of two tasks. The first one is to identify the genes
and proteins in text. Although this task is nominally similar to the one in
Chapter 4, the goal of the task is more comprehensive. Here, a successful
system must produce a list of unique genes, with each linked to all synonyms
used in the texts. Thus, in addition to identifying the names, the system must
also resolve synonyms.
The second problem in BioCreAtIve is to annotate proteins with Gene Ontology
(GO) codes, identifiers for an ontology of gene function (Ashburner et al.,
2000). Successful systems here will identify the text in full-text journal articles
that provides evidence of protein function (similar to the TREC task), and then
annotate the function with a GO code. The training data for the BioCreAtIve
tasks have been made available; the final test data will not be available until
November 2003.
5.1 Information Extraction Systems in the NLP
Community
Traditional IE systems identified relationships in text by looking for distinctive
patterns. These patterns could be either rule-based or statistical. Rule-based
patterns were either regular languages (Hobbs et al., 1996; Soderland, 1999) or
ad hoc (Fisher et al., 1995; Kim and Moldovan, 1993; Lehnert et al., 1992;
Yangarber and Grishman, 2000), consisting of words, parts of speech, or
semantic classes from a domain ontology.
Rule-based patterns were initially developed manually, with an expert creating
and tuning a set of rules so that they would work accurately for a specific
problem. This process was labor intensive and required domain expertise.
Also, the resulting patterns were specific to a particular problem and hard to
adapt to another. Once this shortcoming was recognized, the adaptability of
IE systems became an explicit concern in MUC-7.
The need to alleviate the difficulty of developing patterns spurred research
into automated inference of patterns. One early approach employed a feedback
system where the algorithm would propose possible patterns for an expert to
accept or reject (Riloff, 1993). However, this was still labor intensive. Subsequent
systems used training sets of annotated relationships and induced
rules that accurately identified them (Soderland et al., 1995; Huffmann, 1996;
Aseltine, 1999; Califf and Mooney, 1999; Freitag, 1998; Catala et al., 2000; Kim
and Moldovan, 1993). In these systems, the expert would create a training set
rather than tune the patterns. Finally, to further reduce manual work, one sys-
tem induced patterns from a training set where the documents were labelled,
rather than the specific relationships (Riloff, 1996). As research increased,
methods for rule induction became more accurate. In 1996, Freitag noted that
automatically generated systems were approaching the performance of hand-
crafted ones (Freitag, 1996).
More recently, however, the community has also investigated using hidden
Markov models (HMMs) to find relationships (Freitag, 1996). These models
scanned a sequence of words and computed the probability that each word
belonged to a relationship. Since these were hidden Markov models, the
meanings of the states were inferred from the data. However, the topology of
the states was predetermined, and there has been work on finding topologies
suitable for IE (Freitag and McCallum, 2000) and on estimating accurate
transition probabilities from sparse data (Freitag and McCallum, 1999). Also,
McCallum proposed a variant of Markov models trained with maximum
entropy (McCallum et al., 2000).
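To illustrate the general idea (not any published system), the sketch below runs log-space Viterbi decoding over a two-state HMM in which each word is emitted either by an IN state (part of a relationship mention) or an OUT state. All probabilities and the tiny vocabulary are invented for illustration:

```python
import math

# Two hidden states: IN (word belongs to a relationship mention) and OUT.
states = ("OUT", "IN")
start = {"OUT": 0.8, "IN": 0.2}
trans = {"OUT": {"OUT": 0.8, "IN": 0.2}, "IN": {"OUT": 0.3, "IN": 0.7}}
emit = {
    "OUT": {"the": 0.4, "strongly": 0.45, "inhibits": 0.05, "p53": 0.05, "mdm2": 0.05},
    "IN":  {"the": 0.05, "strongly": 0.05, "inhibits": 0.4, "p53": 0.25, "mdm2": 0.25},
}

def viterbi(words):
    """Most probable IN/OUT state sequence, computed in log space."""
    v = [{s: math.log(start[s]) + math.log(emit[s][words[0]]) for s in states}]
    back = []
    for w in words[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[-1][p] + math.log(trans[p][s]))
            row[s] = v[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][w])
            ptr[s] = prev
        v.append(row)
        back.append(ptr)
    best = max(states, key=lambda s: v[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "p53", "inhibits", "mdm2"]))  # ['OUT', 'IN', 'IN', 'IN']
```

In a real system the transition and emission probabilities are learned from annotated text, and the state topology is richer than two states.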
5.2 Identifying Biological Relationships
The biological research community also has built or adapted information ex-
traction systems to identify relationships. However, nearly all these systems
have been optimized to recognize specific relationships between two entities,
such as proteins. Binary relationships are particularly important in biology.
Some groups have developed algorithms to analyze text and automatically con-
struct databases of protein-protein interactions (Blaschke et al., 1999; Ng and
Wong, 1999; Thomas et al., 2000; Jenssen et al., 2001; Ono et al., 2001; Park
et al., 2001; Wong, 2001; Yakushiji et al., 2001), protein cellular localization
(Craven and Kumlien, 1999), metabolic enzymes (Humphreys et al., 2000),
gene-drug interactions (Craven and Kumlien, 1999; Rindflesch et al., 2000),
gene and gene products (Sekimizu et al., 1998), diseases associated with pro-
teins or keywords (Andrade and Bork, 2000; Craven and Kumlien, 1999), and
other relationships (Hirschman et al., 2002b).
• Renal cells accumulate the osmolyte sorbitol through increased transcription of the aldose reductase gene.
• This increase was markedly inhibited by addition of sulfaphenazole, a selective inhibitor of CYP2C9.
• It is not known whether VDR genotype influences bone accretion or loss, or how it is related to calcium metabolism.
• The results are consistent with recent findings showing that CYP1A2, rather than CYP2D6, is the major enzyme responsible for the metabolism of clozapine.
Figure 5.1: Sample Relationships between Drugs and Genes. Identifying relationships from text is difficult. There are many types of relationships and the vocabulary for describing them is diverse. In addition, language describing uncertainty or negation can confound analyses.
5.2.1 Co-occurrence
The simplest algorithms leveraged the idea that if two entities appeared in the
same sentence or abstract, they may be related. This idea can also be refined
in two ways. First, entities that appear closer may be more likely to be related.
Second, entities appearing together more frequently may also be more likely to
be related (Andrade and Bork, 2000; Stapley and Benoit, 2000; Jenssen et al.,
2001; Stephens et al., 2001).
The co-occurrence algorithm was the basis for PubGene, a database contain-
ing gene-gene interactions (Jenssen et al., 2001). In this study, Jenssen et al.
looked for MEDLINE abstracts that contained at least two gene names from
their dictionary of human genes. They scanned all of MEDLINE for genes
that co-occur. Then, they used human experts to evaluate whether the co-
occurrence was biologically meaningful. This resulted in 60% precision and
51% recall. Also, Jenssen discovered that the gene-gene pairs occurring multi-
ple times were more likely to interact. When considering only the relationships
appearing in five or more articles, the precision increased to 72%. However,
many pairs that occurred fewer than five times also interacted. During the eval-
uation, the scientists found that nearly all the gene-gene interactions missed
were due to failures in gene name identification.
Although Jenssen looked for pairs of genes that occurred in the same abstract,
other groups have used methods that find co-occurrences in the same
phrase (Ono et al., 2001) or sentence (Rindflesch et al., 2000). Intuitively, the
larger the scope, the more relationships the algorithm would detect. However,
the relationships may also be more likely to be incorrect. One comprehensive
study quantified this and found the recall and precision for co-occurrences in
abstracts to be 100% and 57%, respectively; performance on sentences was
85% and 64%, and for phrases, 62% and 74% (Ding et al., 2002). One inter-
esting finding from this study was that long range co-occurrences identified
relationships more effectively if one of the terms was a general name that was
used more specifically in the rest of the document. For example, the study
showed that flavonoid was used in the first sentence, and more specific words
for flavonoids were used subsequently. Without the benefit of an ontology (or
deep semantic analysis), flavonoid relationships were best found by consider-
ing the whole abstract.
Although co-occurrence methods could successfully find relationships, they
provided no insight into the characteristics of the relationships. Co-occurrences
between genes could indicate direct physical relationships such as binding, or
more abstract relationships such as mutual involvement in a biological process
(Stapley and Benoit, 2000). Further processing was necessary to characterize
the relationship.
5.2.2 Keywords
To identify the types of relationships, algorithms must examine relevant infor-
mation, such as the neighboring words in the sentence or abstract. A simple
heuristic identifies key words or phrases that can discriminate particular types
of relationships. One way to make this method more specific is to use patterns
of words.
In pattern-based methods, researchers developed patterns of biological en-
tities and regular words that distinguished particular types of relationships.
These patterns were typically simple: they required neither part-of-speech
information nor complex semantics, and usually employed only a few general
patterns. For example, one system used only <protein A> <action>
<protein B> where <action> consisted of a list of 14 possible words and
their variants (Blaschke et al., 1999). Other systems used 5 (Ng and Wong,
1999) and 20 (Ono et al., 2001) patterns. One group also developed a method
that could score relationships. For each co-occurrence, the algorithm scored
the words in that sentence against a list of words typically found with different
types of relationships (Stephens et al., 2001).
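The single-pattern approach can be re-created in sketch form with a regular expression; the action list here is a short, invented stand-in for the 14 words and variants used by Blaschke et al.:

```python
import re

actions = ["activates", "inhibits", "binds", "phosphorylates"]
proteins = ["Raf", "MEK", "ERK"]

protein_re = "|".join(map(re.escape, proteins))
action_re = "|".join(map(re.escape, actions))
# The pattern <protein A> <action> <protein B>
pattern = re.compile(
    rf"\b(?P<a>{protein_re})\s+(?P<action>{action_re})\s+(?P<b>{protein_re})\b"
)

m = pattern.search("We show that Raf phosphorylates MEK in vitro.")
if m:
    print(m.group("a"), m.group("action"), m.group("b"))  # Raf phosphorylates MEK
```

In practice the protein alternation would come from a name identifier such as the one in Chapter 4 rather than a fixed list.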
5.2.3 Machine Learning
Instead of requiring patterns or keywords, IE has also been framed as a ma-
chine learning problem. Formulated thus, the sentences with co-occurrences
are represented in a vector space model, and then the classifier scores the like-
lihood that the sentence contains a relationship.
Craven and Kumlien implemented such a classification system to determine
the subcellular localization of proteins (Craven and Kumlien, 1999). Using experts,
they created a training set of sentences describing proteins localized to subcellular
compartments. Then, they trained a naïve Bayes classifier and scored
new sentences containing localization information. They found that the method
had higher precision than co-occurrence methods, but lower recall.
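A from-scratch sketch of this style of formulation, with invented training sentences: each sentence becomes a bag of words, and a naive Bayes classifier with Laplace smoothing scores new sentences:

```python
import math
from collections import Counter

# Invented training sentences; 1 = describes subcellular localization.
train = [
    ("the protein is localized to the mitochondria", 1),
    ("ubc6 localizes to the endoplasmic reticulum", 1),
    ("the protein was purified by affinity chromatography", 0),
    ("expression was measured by northern blot", 0),
]

word_counts = {0: Counter(), 1: Counter()}
class_counts = Counter()
vocab = set()
for sentence, label in train:
    words = sentence.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def predict(sentence):
    """Return the class with the higher smoothed log-likelihood."""
    words = sentence.split()
    scores = {}
    for label in (0, 1):
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for w in words:
            # Laplace smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("gfp fusion localized to the mitochondria"))  # 1
```

The log-probability scores themselves can serve as confidence values, trading recall against precision by thresholding.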
5.2.4 Natural Language Processing
In addition to identifying the type of relationship, identifying the subject and
object of a relationship was also useful. For example, in regulatory networks,
one gene influences the expression of the next. In these cases, it was useful to
distinguish the regulator gene from the gene being regulated. This required a
more sophisticated language model. NLP systems addressed this by including
a domain ontology (semantics) and a structured model of the sentence (syntax).
Researchers have developed or adapted some NLP systems to extract bio-
logical information (Hishiki et al., 1998; Humphreys et al., 2000; Thomas et al.,
2000). Although technologies to model knowledge and parse syntax were still
under development, they often performed well enough for information extrac-
tion (Sekimizu et al., 1998). Although these systems have been built on tech-
nologies developed in the NLP community, all needed to be adapted for the
biological domain.
The use of NLP techniques could be as simple as just using the part of
speech to score relationships (Thomas et al., 2000). For example, in a protein-
protein relationship, both proteins should be nouns in the sentence. Slightly
more complicated systems used shallow parsing to determine the subject and
object of known verbs (Sekimizu et al., 1998; Proux et al., 2000; Wong, 2001).
Finally, systems that performed full parsing could ascertain the relationships
among all components in a sentence (Yakushiji et al., 2001; Park et al., 2001;
Park, 2001). Although these full parsers suffered from parsing ambiguities
(Yakushiji et al., 2001), they nevertheless achieved up to 48% recall with 80%
precision (Park et al., 2001).
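A nearest-noun heuristic for finding the subject and object of a known verb can be sketched as follows, assuming part-of-speech tags are already available (here they are supplied by hand, and the tag set and verb list are illustrative):

```python
tagged = [
    ("Sulfaphenazole", "NN"), ("selectively", "RB"),
    ("inhibits", "VB"), ("CYP2C9", "NN"),
]
known_verbs = {"inhibits", "activates", "metabolizes"}

def extract_svo(tagged_sentence):
    """Take the nearest noun before and after a known verb as the
    subject and object of the relationship."""
    for i, (word, tag) in enumerate(tagged_sentence):
        if tag == "VB" and word in known_verbs:
            subj = next((w for w, t in reversed(tagged_sentence[:i]) if t == "NN"), None)
            obj = next((w for w, t in tagged_sentence[i + 1:] if t == "NN"), None)
            return subj, word, obj
    return None

print(extract_svo(tagged))  # ('Sulfaphenazole', 'inhibits', 'CYP2C9')
```

The heuristic breaks on constructions such as "CYP1A2, rather than CYP2D6, metabolizes clozapine", where the nearest noun is not the semantic subject, which is one motivation for full parsing.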
5.3 NLP Systems in Biomedicine
Given the methodological advances in the NLP IE community, it is somewhat
surprising that those technologies have not been adopted more widely in biology.
One possibility is that bioinformaticians are unaware of the work of the
NLP community. However, this is not likely, as evidenced by the systems that
have been adapted. Other possible reasons may be related to idiosyncrasies
in the biological domain, difficulties in adapting NLP systems, and differing
expectations of accuracy in biology and general text.
One important difference that distinguishes IE in biology from IE in gen-
eral text is the method for recognizing entities. The characteristics of entity
names, and thus the methods for identifying them, differ. In general text, en-
tities include names of people and companies from news reports. Biological
systems must recognize names such as genes, proteins, drugs, or subcellular
locations. Biological names include domain-specific idiosyncrasies, such as
unusual patterns in the prefixes and suffixes of words (Hishiki et al., 1998). Also,
tokenizers must handle the mixed punctuation and numbers that occur in gene
names (Thomas et al., 2000). Although algorithms can identify entities from
regular text with 93–95% accuracy, biological names are generally recognized
with 75–80% accuracy (Hirschman et al., 2002a).
Also, it is difficult to adapt existing NLP systems. As one author wrote
about the entity recognition problem,
The process is weakly domain dependent . . . changing from news to
scientific papers would involve quite large changes
(Cunningham, 1999). Since the overall performance of the system depends
on the quality of the entity recognizer, this portion of an IE system must be
accurate (Fukuda et al., 1998).
In addition to the named entity recognizer, the ontology must be adapted
to the biological domain. To adapt the EmpathIE system, Humphreys cre-
ated a new ontology specifically to support extraction of metabolic pathways
(Humphreys et al., 2000). This ontology consisted of a lexicon of 52 categories
and 25,000 terms used to describe pathways. Even at this size, they reported
that further refinements of the ontology could still lead to performance in-
creases.
Nevertheless, one interesting application of NLP technologies to the
biomedical domain is MedLEE, developed at Columbia University
(Friedman et al., 1994; Friedman, 2000). This system was originally
created to extract information from clinical documents to support the Columbia-
Presbyterian Medical Center. It uses traditional NLP technologies, employ-
ing a domain-specific lexicon and syntactic and semantic grammars. Their ap-
proach is to develop the system for a single domain and then adapt it to differ-
ent ones (Friedman, 1997). In medicine, MedLEE has been applied successfully
to different types of text, including radiology reports, mammography, discharge
summaries, electrocardiography, echocardiography, and pathology.
In addition to clinical work, MedLEE is now being adapted to the biologi-
cal domain. This involves developing a new concept recognizer, creating new
ontologies, and developing new patterns. Their final system for extracting
molecular pathways, GENIES, uses only the tokenizer and parser from MedLEE
(Friedman et al., 2001). GENIES includes new modules for gene name recog-
nition and gene name disambiguation (Krauthammer et al., 2000; Hatzivas-
siloglou et al., 2001). To retrieve signal transduction pathways, the MedLEE
researchers adapted an ontology by extending the UMLS to handle basic bio-
logical concepts relevant for regulatory networks (Rzhetsky et al., 2000). For
this task, they reported 53% recall and 100% precision on identifying binary
relationships.
Hatzivassiloglou observed differing expectations in accuracy between tra-
ditional knowledge-based NLP systems and the systems currently built for
biology. Traditional systems, such as MedLEE, are focused heavily on high
specificity (Hatzivassiloglou et al., 2001); these systems emphasize proposing
relationships that are correct. This concurs with the observation Thomas
made, that Highlight was "tuned to produce high precision but lower recall"
(Thomas et al., 2000). However, many of the systems developed specifically for
biology seem to prefer high recall and tolerate lower precision. Thus, adapting
MedLEE also required supplementing it with statistical approaches to
meet the expectations of performance in the biological community (Hatzivas-
siloglou et al., 2001).
5.4 Identifying Related Genes and Drugs
To support pharmacogenomics research, I applied information extraction tech-
niques to search for evidence of relationships between genes and drugs in the
literature. In 1999, Evans and Relling published a review article describing ge-
netic polymorphisms in “drug-metabolizing enzymes, transporters, receptors,
and other drug targets” that led to variations in responses to drugs (Evans and
Relling, 1999). Their work documented 215 relationships among 62 genes and
127 drugs.
Using the relationships from Evans & Relling, I tested whether a co-
occurrence method could find evidence of the 215 relationships in MEDLINE
abstracts. For my analysis, I compiled a list of the genes and drugs found in
the article. Then, I searched MEDLINE for gene and drug names from the list.
I used a local version of MEDLINE that contained citations only until the end
of the year 2001.
I searched for genes and drugs in the abstract or title of a MEDLINE cita-
tion. Each gene or drug word (or phrase) had to appear in the text, preceded
and followed by either whitespace or punctuation; this ensured that spurious
partial-word matches were not counted. I allowed variation in cap-
italization. If the word was an abbreviation or long form defined in the text,
I also searched for the corresponding form in that citation only. I recognized
abbreviations using the algorithm described in Chapter 3 with a score cutoff of 0.03.
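The boundary rule above can be sketched with a regular expression whose lookaround assertions reject matches embedded inside longer alphanumeric tokens; the helper name and example text are hypothetical:

```python
import re

def find_name(name, text):
    """Return (start, end) spans where name appears with whitespace or
    punctuation (or the start/end of the text) on both sides."""
    pattern = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(name) + r"(?![A-Za-z0-9])",
        re.IGNORECASE,  # allow variation in capitalization
    )
    return [m.span() for m in pattern.finditer(text)]

text = "ERK activation was observed; ER expression increased."
# Only the standalone 'ER' matches; the 'ER' inside 'ERK' is rejected.
spans = find_name("ER", text)
print([text[s:e] for s, e in spans])  # ['ER']
```

The spans returned here also feed naturally into the overlap check described below for co-occurrence counting.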
Then, for each gene-drug pair, I counted the number of abstracts and sen-
tences where both the gene and drug occurred. I treated the title as the first
sentence of the abstract. To split text into sentences, I used a sentence bound-
ary detection heuristic described in Appendix B. I did not count co-occurrences
where the gene and drug names overlapped, e.g., if the drug vitamin D used
the same words as the gene vitamin D receptor.
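The overlap rule can be sketched as a check over matched character spans; the span values below are illustrative:

```python
def overlaps(a, b):
    """True if two (start, end) spans share any characters."""
    return a[0] < b[1] and b[0] < a[1]

def cooccurs(gene_spans, drug_spans):
    """Count a co-occurrence only if some gene span and some
    drug span are disjoint."""
    return any(not overlaps(g, d) for g in gene_spans for d in drug_spans)

# "vitamin D receptor" at span (0, 18) contains "vitamin D" at (0, 9):
print(cooccurs([(0, 18)], [(0, 9)]))            # False: the only match overlaps
print(cooccurs([(0, 18)], [(0, 9), (40, 49)]))  # True: a disjoint match exists
```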
Out of 7,874 possible gene-drug pairs (62 genes × 127 drugs), 1,462 (19%)
occurred in at least one abstract, and 489 (6%) occurred in five or more. Out of
the 215 pairs from Evans & Relling, 167 (78%) appeared in at least one abstract
and 113 (53%) in at least five (Figure 5.2). Twenty-two percent of the known
gene-drug relationships did not appear in any abstract, and 32% did not appear in any
sentence together.
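As a quick arithmetic check of the counts above:

```python
# Verify the pair count and the rounded percentages quoted in the text.
pairs = 62 * 127
print(pairs)                      # 7874
print(round(1462 / pairs * 100))  # 19
print(round(489 / pairs * 100))   # 6
print(round(167 / 215 * 100))     # 78
print(round(113 / 215 * 100))     # 53
print(round(48 / 215 * 100))      # 22
```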
This method missed nearly a quarter of the relationships described in Evans
& Relling. Many of these omissions occurred due to the minimal handling of
[Figure: bar chart of the number of gene/drug pairs (y-axis) against co-occurrence frequency bins 0, 1, 2-5, 6-10, 11-50, 51-100, 101-1000, and more (x-axis), with separate bars for abstracts and sentences.]
Figure 5.2: Frequency of Gene-Drug Co-occurrences. This figure shows the frequency with which a gene and drug, with a relationship identified in Evans & Relling, occur in the same abstract and sentence.
synonyms; I only considered abbreviations. Since many genes and proteins
have multiple names, the lack of a thesaurus of synonyms caused any mention
using an alternate name to be missed (Yu et al., 2002a; Yu and Agichtein, 2003).
In nearly half the missing gene-drug pairs (22 out of 48), the gene was
a cytochrome P450 protein listed in Evans & Relling as CYP1A2, CYP2C9,
CYP2D6, CYP3A5, and CYP3A7. However, in the literature there was con-
siderable variation in cytochrome P450 nomenclature. These proteins can be
written in many ways, including CYP1A1/2, CYP2D family, or more gener-
ally cytochrome p450 protein. Identifying these phrases, and resolving them as
individual genes, would require more sophisticated strategies for tokenization
and name resolution.
Similarly, there were many ways to refer to drugs. Drugs were commonly
classified into categories. As noted in Section 5.2.1, authors often refer to
classes of drugs rather than specific ones. For example, the drugs in Evans
& Relling included classes such as steroids, antipsychotics, and calcium chan-
nel blockers. Knowledge about classes of drugs may have been inferred based
on data pertaining to individual drugs. Resolving such discrepancies computa-
tionally would require a hierarchical classification of drugs.
Finally, this algorithm missed the relationships from Evans & Relling that
were not indicated in the MEDLINE abstracts.
In addition to missing relationships, the co-occurrence-based algorithm also
identified 1,295 gene-drug pairs that were not described in the Evans & Relling
article. To see whether these co-occurrences indicated real relationships be-
tween genes and drugs, I manually checked 100 pairs: the fifty pairs that
occurred in the greatest number of citations, plus fifty pairs chosen at random
from those that occurred in only one citation each. I read the citations and
found that 70 of these gene-drug pairs, the majority, shared some relationship.
Of the remaining 30, the text specifically documented for 8 pairs that they had
no relationship; the rest were errors in the co-occurrence algorithm.
Ten of these pairs are shown in Table 5.1, and the full list can be found in Ap-
pendix C. This was consistent with previous studies on gene-gene relationships
reporting that co-occurrences between correctly identified gene names indicate
some type of biological relationship (Stapley and Benoit, 2000). The major
Gene                              Drug             Abs   Relationship
Aldehyde dehydrogenase            Ethanol          291   Gene catalyzes drug metabolism. (1)
Glutathione S-transferase         Glutathione      277   Drug is substrate of gene. (2)
Angiotensin-converting enzyme     Insulin          178   Gene affects sensitivity to drug. (3)
Glucocorticoid receptor           Steroids         165   Drug is substrate of gene. (4)
CYP1A1                            Ethoxyresorufin  151   Gene metabolizes drug. (5)
Dihydropyrimidine dehydrogenase   Bilirubin          1   Disabled gene leads to increased level of drug. (6)
N-Acetyltransferase               Insulin            1   Drug leads to activation of gene. (7)
NAT1                              Phenacetin         1   Drug exposure does not lead to gene polymorphisms. (8)
Glucocorticoid receptor           Tacrolimus         1   Drug is substrate of gene. (9)
CYP1A1                            Lovastatin         1   Drug does not influence gene activity. (10)

Table 5.1: Relationships Between Ten Genes and Drugs. This table describes the relationship between 10 genes and drugs found to co-occur in the literature but not identified in the review by Evans & Relling. The first five are the genes and drugs that appear in the greatest number of abstracts. The last five are randomly chosen genes and drugs that appear in one abstract only. The Abs column gives the number of abstracts that contained both the gene and drug; the Relationship column describes the relationship between them. ((1) Agarwal (2001); (2) Jakoby (1978); (3) Hamilton (1990); (4) Baxter (1978); (5) Kitchin (1983); (6) Tateishi et al. (1999); (7) Namboodiri et al. (1981); (8) Bringuier et al. (1998); (9) Oyanagui (1998); (10) Cohen et al. (2000))
source of errors in co-occurrence algorithms arose from errors in identifying
the gene names (Jenssen et al., 2001).
There are many reasons that gene-drug relationships could have been omit-
ted from the Evans & Relling article. First, the review article concentrated
mainly on the genes with polymorphisms that could affect drug response. Sec-
ond, 30 of the 100 gene-drug pairs in Appendix C indicated no relationship.
Third, the authors may not have intended to catalog all known gene-drug rela-
tionships. Finally, the review article contained only the relationships known to
occur in humans. Since the MEDLINE search was not limited to a specific or-
ganism, many of the results described studies performed in model organisms,
notably rats.
5.5 Classifying Gene-Drug Relationships
The previous section showed that a co-occurrence approach could identify
many related genes and drugs from the literature. Once I had established that a
gene and drug were related, I began to analyze the literature to identify char-
acteristics of the relationship. Genes and drugs can interact in many ways,
from direct substrate-ligand binding relationships to more abstract ones, e.g.
“variations in the MDR1 gene reduce the plasma concentration of digoxin.” In-
teractions between genes and drugs were organized in the Pharmacogenomics
Knowledge Base (PharmGKB) according to five broad categories relevant for
pharmacogenomic researchers (Table 5.2).
PharmGKB collected information about related genes and drugs using a
community-based submission tool. Researchers on the internet could directly
submit, pending approval by a PharmGKB annotator, gene-drug relationships
classified according to the five categories. Often, multiple categories applied to
a specific gene and drug. In particular, the consequences of Genotype varia-
tions were often revealed in the other categories. For example, a mutation in
the TPMT gene could lead to blood toxicity of 6-thioguanine, indicating both
Genotype and Pharmacodynamic relationships. Currently, PharmGKB
annotators have approved submissions on 325 drug-gene pairs exhibiting 515
relationships. That is, if a gene and drug were related both by Genotype and
Clinical Outcome: Genetic variations in the response to drugs can cause measurable differences in clinical endpoints such as rates of cure, morbidity, side effects, and death. Data in this category demonstrate that genetic variability in the context of a drug effect significantly changes medical outcomes. These data sets are different from pharmacodynamics data sets, which may show a difference that is not sufficiently significant to alter practice or policy.

Pharmacodynamics and Drug Response: Genetic variation in drug targets can cause measurable differences in the response of an organism to a drug. Data in this category document that the biological or physiological response to a drug varies, and that this variation can be associated with the variation of one or more genes. This variation is often measured at the whole-organism level. The measured variables may be surrogates for clinical responses, but do not constitute outcomes themselves.

Pharmacokinetics: Genetic variation in processes involved in the absorption, distribution, metabolism, or elimination of a drug can result in changes in drug availability. Data in this category are primarily concerned with demonstrating that genetic polymorphisms lead to variations in the levels or concentrations of drugs or their metabolites at the site of action.

Molecular and Cellular Functional Assays: Genetic variation can alter results of molecular and cellular functional assays, and this may correlate with variations in the organism's drug response. Data in this category demonstrate associations between genetic variation and laboratory assays of function at the molecular or cellular level. These assays may test the molecular properties of drug targets or drug metabolizing enzymes, or may test the cellular properties of cells involved in the response to a drug (such as whole-cell gene expression).

Genotype: Genotype is the internally coded, heritable information carried by the organism. Variation in genotype is fundamental to pharmacogenetics and is measured as sequence variation in individual genes--the type and location of the variation, and the frequency of the variation in the populations of interest. This genetic variation is independent of individual drugs, but forms the basis for variation in response to drugs.

Table 5.2: Pharmacogenomic Relationships in PharmGKB. PharmGKB identifies five types of relationships between genes and drugs that are of interest to pharmacogenomic researchers. The relationships and their definitions are duplicated here from their website (PharmGKB).
Pharmacodynamics, that gene-drug pair exhibited two relationships.
For each gene-drug pair, I retrieved the MEDLINE citations (until the end
of the year 2001) that contained both the gene and the drug. I matched names
to the text using the heuristics described in Section 5.4, looking for exact
word matches, allowing abbreviations, and ignoring overlapping gene and drug
names. With the citations found, I created a data set of the sentences that con-
tained a gene and drug. The title was also considered a sentence.
From each sentence, I removed occurrences of gene and drug names. First, I discarded any name that appeared in either the PharmGKB or
Evans & Relling lists. Then, I examined the remaining words in the sentences
and removed any residual gene and drug names manually.
Then, I converted each document into a vector representation suitable for a
machine learning classifier. Each document was a vector of words:

    \vec{Document} = [w_1 \; w_2 \; \ldots \; w_n]    (5.1)

where w_i is 1 if the word occurred in the document and 0 otherwise, and n is the
total number of unique words that occurred in the corpus.
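Equation 5.1 can be sketched directly. The corpus below is illustrative, not the thesis data set:

```python
def build_vocabulary(documents):
    """Collect the n unique words that occur anywhere in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def to_binary_vector(document, vocab):
    """Map a document to [w1, w2, ..., wn], where wi is 1 if word i occurs."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocab]

corpus = [
    "deficiency is associated with profound toxicity",
    "profound toxicity after thiopurine therapy",
]
vocab = build_vocabulary(corpus)
vectors = [to_binary_vector(doc, vocab) for doc in corpus]
```

Note that word order and counts are discarded; only the presence or absence of each vocabulary word is kept.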
Then, to eliminate uninformative features, I performed feature selection on
the sentences (without gene or drug names) for each type of relationship. I
ranked the words using the χ2 test. This test produced a p-value that indicated
the strength of the association between each word and relationship. I kept
the 100 features that were most strongly related (or technically, least likely
to be unrelated) to the relationship. I also kept the 100 features most strongly related to no relationship, resulting in a total of 200 features. I chose this number to balance the competing demands of keeping the dimensionality small for computational tractability while including enough features to capture the information relevant to the classification decision. The most informative features for each relationship are shown in Table 5.3.

Figure 5.3: Scoring Gene-Drug Relationships. I scored the pharmacogenomic relationships for sentences that contain a gene and drug. First, I removed the genes and drugs from the sentence. Then, using the high-scoring features from the sentence, machine learning classifiers scored each type of relationship. For the example sentence "TPMT deficiency is associated with profound toxicity after thiopurine therapy," the classifiers produced scores of 0.50 (Clinical Outcome), 0.05 (Pharmacodynamics), 0.26 (Pharmacokinetics), 0.07 (Functional Assay), and 0.05 (Genotype).
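The χ² ranking described above can be sketched as follows. The statistic is computed for each word against a relationship label over a 2×2 contingency table (for one degree of freedom, ranking by the statistic is equivalent to ranking by p-value); the sentences and labels here are illustrative only:

```python
def chi2_statistic(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 table: n11 counts sentences with the
    word and the relationship, n10 with the word only, n01 with the
    relationship only, and n00 with neither."""
    n = n11 + n10 + n01 + n00
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return n * (n11 * n00 - n10 * n01) ** 2 / den if den else 0.0

def rank_features(sentences, labels, k):
    """Return the k words most strongly associated with the label."""
    vocab = {w for s in sentences for w in s}
    scores = {}
    for w in vocab:
        n11 = sum(1 for s, y in zip(sentences, labels) if w in s and y)
        n10 = sum(1 for s, y in zip(sentences, labels) if w in s and not y)
        n01 = sum(1 for s, y in zip(sentences, labels) if w not in s and y)
        n00 = len(sentences) - n11 - n10 - n01
        scores[w] = chi2_statistic(n11, n10, n01, n00)
    return sorted(scores, key=scores.get, reverse=True)[:k]

sentences = [{"toxicity", "therapy"}, {"toxicity", "dose"},
             {"binding", "assay"}, {"assay", "cells"}]
labels = [1, 1, 0, 0]  # 1 = sentence describes the relationship
top = rank_features(sentences, labels, k=2)
```

In the thesis the top 100 words for the relationship and the top 100 for no relationship were retained; the same function serves both by flipping the labels.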
Using the reduced-size vectors, I trained supervised machine learning clas-
sifiers to score the relationships described in a sentence (Figure 5.3). Because
the sentence could describe multiple types of relationships, I trained one clas-
sifier for each type. Using 5-way cross validation, I trained each classifier on
80% of the sentences and withheld the rest for testing. This way, I compiled
the scores for every sentence in the data set, without having used the same
sentence for training.
To classify, I used a Maximum Entropy classifier. Although Support Vector
Table 5.3: Informative Features for Gene-Drug Relationships. This table shows the 10 most informative features for classifying each type of gene-drug relationship (Clinical Outcome; Pharmacodynamics and Drug Response; Pharmacokinetics; Molecular & Cellular Functional Assays; Genotype). The Indicates column contains either REL, if the presence of the Token indicates a relationship, or NON otherwise. These features were calculated from sentences that contain a gene and drug.
Machines may ultimately outperform Maximum Entropy, Maximum Entropy
performs well even without manually fitted parameters (see Chapter 4).
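For binary decisions over binary word features, a Maximum Entropy classifier reduces to logistic regression. Below is a stdlib-only sketch trained by stochastic gradient ascent on toy data; one such classifier would be trained per relationship type, and the features here are illustrative, not the thesis feature set:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(X, y, epochs=200, lr=0.5):
    """Fit weights for P(rel | x) = sigmoid(b + sum_i w_i * x_i)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            b += lr * err
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
    return w, b

def score(x, w, b):
    """Score between 0 and 1 for one relationship type."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Toy vectors over the features [toxicity, therapy, assay, cells]
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y = [1, 1, 0, 0]  # 1 = the sentence describes this relationship
w, b = train_maxent(X, y)
```

Because each sentence can describe several relationship types, five independent binary scorers like this one are trained rather than a single five-way classifier.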
When applied to the data sets, the classifier achieved high performance in
categorizing each sentence according to the relationships described. That is,
sentences that described a relationship received a high score from that classi-
fier, and those that did not received a low score (Figure 5.4).
The sentences with the Pharmacodynamics and Drug Responses re-
lationship were easiest to identify, with 41% of the results receiving a score
of 0.95 or higher. Conversely, Clinical Outcome was the hardest to predict
with only 2% of the correct results receiving a score of at least 0.95. This cate-
gory may have been difficult to predict due to the great breadth of vocabulary
that described various clinical outcomes. 6% of the sentences contained no in-
dicative words identified by the feature selection and thus received scores of
exactly 0.5. Interestingly, Genotype was hard both to predict (3% of correct results received a score of 0.95 or higher) and to rule out (9% of non-Genotype sentences received a score of 0.05 or lower).
5.6 Predicting Gene and Drug Relationships
Finally, I applied the sentence-relationship scores calculated in the previous
section to identify the types of relationships for each gene-drug pair. For each
pair, I collected the scores for every relationship of every sentence containing
the gene and drug. Then, for each relationship, I averaged those scores to be
the relationship score for that gene-drug pair. Finally, I assigned a relationship
to a pair if the average score was at least 0.5.
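The aggregation step can be sketched as follows; the relationship names come from Table 5.2, and the scores are illustrative:

```python
def pair_relationships(sentence_scores, threshold=0.5):
    """Average the per-sentence scores for each relationship type of one
    gene-drug pair, and assign a relationship when its average score
    reaches the threshold."""
    types = sentence_scores[0].keys()
    averaged = {t: sum(s[t] for s in sentence_scores) / len(sentence_scores)
                for t in types}
    assigned = sorted(t for t, avg in averaged.items() if avg >= threshold)
    return averaged, assigned

# Scores for two sentences that both mention the same gene-drug pair
scores = [{"Pharmacokinetics": 0.9, "Genotype": 0.2},
          {"Pharmacokinetics": 0.7, "Genotype": 0.5}]
averaged, assigned = pair_relationships(scores)
```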
I found 187 pairs of genes and drugs from PharmGKB that occurred
Figure 5.4: Distribution of Relationship Scores. These histograms, one for each relationship (Genotype; Molecular & Cellular Functional Assays; Pharmacokinetics; Pharmacodynamics and Drug Responses; Clinical Outcome), show the distribution of scores for sentences derived from genes and drugs in PharmGKB. The dark bars show the scores for sentences describing that relationship, and the hashed bars show the scores for other sentences. The scores on the horizontal axis are split into 20 evenly spaced bins from 0 to 1; the vertical axes run from 0% to 50%. To preserve space, the scale is not shown.
Figure 5.5: Errors in Gene-Drug Relationships. This chart shows the number of errors made for each gene-drug pair: No Errors 27%, 1 Error 32%, 2 Errors 27%, 3 Errors 11%, 4 Errors 3%, 5 Errors 0%. No Errors indicates that all 5 possible types of relationships for a pair were classified correctly.
in the same sentences. This resulted in 935 classification decisions (187 pairs × 5 relationship types). Out of those, 690 (74%) were predicted correctly. Of the 187 gene-drug pairs, 50 (27%) were predicted perfectly; that is, the state of all 5 relationships was classified correctly. Conversely, 5 gene-drug pairs had 4 or more errors (Figure 5.5). A complete list of the gene-drug pairs and relationship predictions appears in Appendix D.
Figure 5.6: Common Co-occurrences Classified More Accurately. This plot shows the number of errors for gene-drug pairs observed in different numbers of sentences. Each gene-drug pair was classified into the 5 PharmGKB pharmacogenomics categories. The number of errors is the number of categories assigned incorrectly, i.e., between 0 and 5. Average Errors is the average number of errors for all gene-drug pairs found in at most Number of Sentences. The number of sentences is plotted on a log scale for clarity.
The number of errors for each of the gene-drug pairs seemed to be depen-
dent on the amount of data available (Figure 5.6). Gene-drug pairs that co-
occurred in only one sentence could be classified into the 5 types of relation-
ships with an average of 1.89 errors. However, the error rate decreased precip-
itously to 1.45 in pairs that co-occurred in up to 10 sentences, and 1.36 in those
co-occurring in up to 100.
The 5 gene-drug pairs that were classified with 4 or more errors were
CYP2D6-interferon alpha, CYP3A5-midazolam, CYP3A5-xenobiotics, CYP4B1-
xenobiotics, and TPMT-sulfasalazine. Four of these involved cytochrome P450 proteins. As noted in Section 5.4, this protein family was difficult to recognize because of tokenization and semantic problems. A more sophisticated tokenizer
that handled P450 nomenclature may have been able to recognize analogous
references to the token from the gene name list. For example, the CYP3A5-
midazolam co-occurrence appeared in one abstract as “. . . CYP1A2, 2A6, 2B6,
2C9, 2C19, 2D6, 2E1, 3A4 and 3A5, three (CYP2B6, 3A4 and 3A5) showed mi-
dazolam 1’-hydroxylation activity . . . ” (Hamaoka et al., 2001). Tokenizers that
could recognize these and other forms of the P450 nomenclature could increase
the amount of data available for classifying particular gene-drug pairs.
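A tokenizer along these lines might rewrite elliptical P450 references as full CYP symbols. Below is a rough sketch that handles only the simple pattern in the example above; real P450 nomenclature has many more variants:

```python
import re

# Matches a bare P450 subfamily token such as "2D6" or "3A4".
BARE = re.compile(r"\b(\d+[A-Z]\d+)\b")

def expand_cyp_references(text):
    """After an explicit CYPnXn symbol, prefix later bare nXn tokens with CYP."""
    out = []
    seen_cyp = False
    # Split on non-word runs, keeping the separators so the text reassembles.
    for token in re.split(r"(\W+)", text):
        if token.startswith("CYP"):
            seen_cyp = True
            out.append(token)
        elif seen_cyp and BARE.fullmatch(token):
            out.append("CYP" + token)
        else:
            out.append(token)
    return "".join(out)

sentence = "CYP1A2, 2A6, 2B6 and 3A5 showed midazolam 1'-hydroxylation activity"
expanded = expand_cyp_references(sentence)
```

With such an expansion, the bare tokens 2A6, 2B6, and 3A5 become matchable against a gene name list that stores full CYP symbols.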
5.7 Conclusions
Recognizing relationships between genes and drugs is a complex problem that
requires further investigation. In particular, as noted in previous studies,
recognition of the basic entities, the genes and drugs, should be improved.
Future algorithms should handle familial and categorical relationships (hypernymy), as well as synonymy within gene names. In addition, the relationships may need to be annotated with other useful types of classes, such as
source organism.
One of the major barriers for such statistical approaches is the dearth of
training data. The PharmGKB data set is relatively small, including 442 sub-
missions covering 325 drug/gene pairs. (Some of the submissions involved the
same gene and drug.) Also, over 70% of the submissions were contributed by
four submitters, with the top three coming from the same institution. Such homogeneity in the data submissions could have led to individual (or institutional) biases, where the choice of articles or submissions skewed the data set.
One modification that will increase the amount of training data (but unfor-
tunately not the diversity) is to examine co-occurrences in abstracts instead of
sentences. Statistically, this approach will examine more words. This will in-
crease accuracy when the whole abstract contains language that is distinct for
a particular class, such as some abstracts describing outcomes from clinical tri-
als. In addition, this will detect long range co-occurrences, which can arise due
to hypernymy and other types of indirect references, e.g., "that drug." However,
as documented above, previous studies have shown that although reviewing
more text will increase recall, it also decreases precision. It is unclear how
abstract-based co-occurrence will affect the extraction of gene-drug relations.
Nevertheless, this study has shown promising results, indicating that the
gene-drug relationship problem is tractable and may be solvable soon.
CHAPTER 6

Distributing the Algorithms
The potential value in computationally accessing knowledge from the biomed-
ical literature has driven the development of natural language algorithms
(Hirschman et al., 2002b). Concurrently, technologies for distributing those
algorithms are also in development (Stein, 2003). Distributing software is useful so that other researchers may evaluate, use, and build upon the work. The
dominant methods for distributing software include compiled binaries, source
code, web sites, or web services.
Each option for distributing programs has advantages and disadvantages.
Compiled binaries force the end user to use a particular platform. Distribut-
ing source code affords more flexibility in the platform and also allows the end
user to modify the algorithm according to their needs. However, source code
may be difficult to install and also can disclose the intellectual property in the
algorithms or implementation. One compromise is to conceal the code behind
a web interface. This allows platform-independent access, but places the bur-
den of maintenance on the operator. The computation is run on the server,
which may be problematic if the application is heavily used. An extension of the
web interface is a web service, which allows access from computer programs
(Jamison, 2003).
In particular, distributing the algorithms described in Chapters 3 and 4 is
challenging because of third-party dependencies, intellectual property issues,
and data requirements. Although my code is available as open source, it re-
quires programs that I cannot redistribute, such as Brill’s tagger and Numeri-
cal Recipes in C (Brill, 1994; Press et al., 1993). In addition, the code depends
on many libraries (e.g. Biopython, Numerical Python, libsvm) (Biopython; Nu-
merical Python; Chang and Lin, 2001). While these libraries are freely avail-
able, the requirement still complicates the installation process for the end user.
Finally, the algorithm depends on data stored in a relational database. Distributing only the source code is thus impractical because the difficult installation process prevents casual use.
Therefore, I am distributing the algorithm as both open source code and as
a web page (and web service). This allows both casual use through the web
interfaces and more dedicated use and extension with local installation of the
source code.
In this chapter, I describe the development of the web page and web service
as a BioNLP text mining server. I report on an algorithm to simplify browsing
lists of abbreviations by aggregating similar long forms. I also report on a web
service interface that allows users to access algorithms to identify abbrevia-
tions and gene names from their programs over the internet.
6.1 Clustering Abbreviations to Aid Browsing
In Chapter 3, I described an algorithm to identify abbreviations from text.
Given the string “Activation of Jun N-kinase (JNK) is a cellular response to
stress”, the algorithm identifies JNK as an abbreviation for Jun N-kinase, with
a score of 0.86.
I applied the algorithm on all abstracts from MEDLINE and found
1,948,246 abbreviation/long form pairs. Out of these pairs, many long forms
were similar. For example, JNK was an abbreviation for 155 different long
forms, including Jun N-kinase, Jun N-kinases, and Jun NH2 kinase. This led
to long and redundant results when users searched for an abbreviation on the
abbreviation web server.
To improve the presentation of abbreviations in a website, I developed a
heuristic to cluster similar abbreviations. The heuristic is based on the idea
that the long forms with small variations can be aggregated if their abbrevi-
ations are the same. For each abbreviation found in MEDLINE, I first sort
the long forms in alphabetical order. Then, I consider each abbreviation/long
form pair sequentially and aggregate it with a previous pair if they meet two
conditions:
• The abbreviations are the same, or they differ because one of them has an
-s (or -es) appended to the end.
• The edit distance between the alpha-numeric characters (ignoring spaces
and punctuation) in the long forms is at most 1.
If, according to these conditions, the abbreviation/long form pair is similar
    N     # Clusters    Edit Distance
    X     609,162       14
    2     461,822       83
    3     510,955       81
    5     555,813       47
    7     576,524       33
    10    592,912       23
    15    609,105       21

Table 6.1: Clusters of Abbreviations. The abbreviations are clustered allowing 1 mismatch per N alpha-numeric characters in the long form. The first row, X, is the clustering obtained when only 1 mismatch is allowed regardless of the length of the long form. The Edit Distance column is the maximum edit distance between two long forms in the same cluster.
to two or more other pairs, I cluster them all together. Note that because of
this transitivity, the final clusters can contain long forms with edit distances
greater than 1.
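The heuristic can be sketched as follows. This version compares each incoming pair against the most recently added member of each cluster; the helper names and the sample pairs are illustrative, not the production implementation:

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize_abbrev(abbrev):
    """Strip a trailing -s or -es so plural abbreviations compare equal."""
    for suffix in ("es", "s"):
        if abbrev.endswith(suffix):
            return abbrev[:-len(suffix)]
    return abbrev

def alnum(text):
    """Keep only alpha-numeric characters, ignoring spaces and punctuation."""
    return "".join(c for c in text.lower() if c.isalnum())

def cluster_pairs(pairs):
    """Aggregate (abbreviation, long form) pairs, taken in alphabetical order
    of long form, with an earlier cluster when both conditions hold."""
    clusters = []
    for abbrev, long_form in sorted(pairs, key=lambda p: p[1]):
        for cluster in clusters:
            a0, l0 = cluster[-1]  # compare against the most recent member
            if (normalize_abbrev(a0) == normalize_abbrev(abbrev)
                    and edit_distance(alnum(l0), alnum(long_form)) <= 1):
                cluster.append((abbrev, long_form))
                break
        else:
            clusters.append([(abbrev, long_form)])
    return clusters

pairs = [("JNK", "Jun N-kinase"), ("JNKs", "Jun N-kinases"),
         ("JNK", "Jun NH2 kinase"), ("MAPK", "mitogen-activated protein kinase")]
clusters = cluster_pairs(pairs)
```

Because joining is transitive through the chain of recent members, a final cluster can span long forms whose pairwise edit distance exceeds 1, as noted above.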
An alternate, more lenient, second condition allows more mismatches de-
pending on the length of the string. I experimented with conditions that
allowed 1 mismatch per N characters (Table 6.1). As expected, more strin-
gent mismatch requirements (greater N) led to increased numbers of clusters.
Fewer abbreviation/long form pairs could be clustered together.
Although the number of clusters varies, the method appears robust and clusters together similar long forms. For example, when using the most lenient strategy, allowing a mismatch every other character, the two most distant long forms were:
1: colony-forming units, erythroid burst-forming units, and granulocyte
   erythrocyte macrophage megakaryocyte colony-forming units

2: colony-forming unit-granulocyte, macrophage, erythroid, megakaryocyte
Requiring the abbreviations to be the same (except for a possible -s at the end)
provides a constraint on the long forms that may be clustered together. How-
ever, such a simple heuristic can cluster long forms with similar letters but
different meaning. One example is:
1: androgen receptor
2: adrenergic receptor
However, the frequency or significance of such errors for the user is unclear.
For the final clustering used on the web site, I used the computationally
cheapest strategy and allowed 1 mismatch between any two long forms. This
heuristic reduced 1,948,246 abbreviations into 609,162 clusters.
Finally, to display on web pages, I chose to represent each cluster by the abbreviation/long form pair that had the highest score. If multiple pairs shared the highest score, I chose the one that appeared most frequently in MEDLINE.
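The representative choice can be sketched in one function; the scores and counts below are illustrative:

```python
def cluster_representative(cluster, medline_counts):
    """Pick the pair with the highest score; break ties by the number of
    times the pair appeared in MEDLINE.
    cluster: list of (abbrev, long_form, score) tuples."""
    return max(cluster,
               key=lambda p: (p[2], medline_counts.get((p[0], p[1]), 0)))

cluster = [("JNK", "Jun N-kinase", 0.9),
           ("JNKs", "Jun N-kinases", 0.9),
           ("JNK", "Jun NH2 kinase", 0.7)]
counts = {("JNK", "Jun N-kinase"): 120, ("JNKs", "Jun N-kinases"): 45}
representative = cluster_representative(cluster, counts)
```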
6.2 Making Servers Computer-Friendly
Another important aspect of a web interface is to allow computational access
as well as human access. Developers create human-readable web pages us-
ing HTML (HyperText Markup Language), which encodes visual layout rather
than the semantics that computers require (Musciano and Kennedy, 2002).
Therefore, protocols called web services allow computers to access code on other
computers over a network.
For a web service protocol, my requirement was that it 1) allows clients
to access code over a network, 2) is supported across a wide range of oper-
ating systems and programming languages, 3) is relatively simple to install,
and 4) works with normal firewall configurations. Out of the many web ser-
vice protocols available, such as the Common Object Request Broker Architecture (CORBA), the Component Object Model (COM), Enterprise JavaBeans (EJB), XML Remote Procedure Call (XML-RPC), and the Simple Object Access Protocol (SOAP), the one that most closely fit our needs was XML-RPC (Bolton,
2001; Templeman and Mueller, 2003; Englander, 1997; Laurent et al., 2001;
Snell et al., 2001). Libraries for XML-RPC are available for many languages
including C, C++, C#, Java, LISP, PERL, Python, Ruby, Scheme, Tcl, .NET, and
Visual Basic.
I developed an XML-RPC interface to the BioNLP server. The server exports
two functions: find_abbreviations and find_gene_and_protein_names.
The documentation and usage for these functions are as follows:
find_abbreviations
    INPUT: string
    OUTPUT: array

    This function will search for the abbreviations in a string and return
    an array of the abbreviations found. Each element of the returned
    array is itself an array of:
        [string long_form, string abbreviation, double score]

find_gene_and_protein_names
    INPUT: string
    OUTPUT: array

    This function will search for the gene and protein names in a string
    and return an array of the names found. Each element of the returned
    array is itself an array of:
        [string name, int start, int end, double score]

    start and end are indexes into the input string that describe where
    the name was found. The indexes begin at 0, and the end index is
    exclusive.
These functions can be accessed natively from many languages. For exam-
ple, in Perl, to access the BioNLP web service to find gene names in a string:
# RPC::XML::Client is the PERL module that handles XML-RPC client
# requests. It is available from CPAN.
use RPC::XML::Client;

# Create a new XML-RPC client for the BioNLP server.
$client = new RPC::XML::Client "http://bionlp.stanford.edu/xmlrpc";

# Call the 'find_gene_and_protein_names' function on the BioNLP
# server and save the response.
$resp = $client->send_request(
    'find_gene_and_protein_names',
    "We observed an increase in mitogen-activated \
protein kinase (MAPK) activity.");

# Save the return value of the function. @results is an array of
# information for each gene or protein name found. Each name found is
# itself an array of [name, start index, end index, score].
@results = @{$resp->value};

# Print the name and score for each gene found.
foreach $i (0..$#results) {
    my @data = @{$results[$i]};
    print "NAME=", $data[0], "\n";
    print "SCORE=", $data[3], "\n";
    print "\n";
}
Only two lines of code are necessary to access the web service: one line to
create the connection, and one to call the function. Simple access to algorithms
is important to bioinformatics, and I hope to see more web-service-enabled
servers in the future.
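For comparison, the same call from Python needs only the standard library. The sketch below also shows the marshalled XML-RPC request built without any network traffic; the bionlp.stanford.edu endpoint is the one named above and may no longer be live:

```python
import xmlrpc.client

BIONLP_URL = "http://bionlp.stanford.edu/xmlrpc"

def find_gene_names(text, url=BIONLP_URL):
    """Call the server's find_gene_and_protein_names function.
    Each result is an array of [name, start, end, score]."""
    server = xmlrpc.client.ServerProxy(url)
    return server.find_gene_and_protein_names(text)

# The request body the client would send over HTTP, built locally:
request = xmlrpc.client.dumps(
    ("We observed an increase in mitogen-activated protein kinase "
     "(MAPK) activity.",),
    methodname="find_gene_and_protein_names")
```

As in the Perl example, the actual remote call is a single method invocation on the proxy object.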
CHAPTER 7

Conclusions
This thesis has described work to develop methods that automatically identify
relationships between drugs and genes from the literature. The final algorithm
can be used to create a database, which will be useful for pharmacogenomics re-
searchers and allow new biological insights. However, the initial investigations
into a gene-drug relationship algorithm uncovered several other areas of sci-
entific research that were not sufficiently advanced to support my endeavors.
These areas also required attention. Thus, for this thesis, I have addressed the
problems of: 1) identifying abbreviations in text (Chapter 3), 2) finding gene
names (Chapter 4), and 3) recognizing related genes and drugs (Chapter 5).
I have framed the main challenges in this thesis as machine learning tasks.
Each one is a problem of classification, where the computer must decide be-
tween two alternatives, e.g. abbreviation or not, gene name or not, or type of
relationship or not. The computer reaches its decision based on pieces of evi-
dence, the features.
Although the algorithms depend heavily on advances in machine learning,
the work in this thesis is somewhat atypical. Rather than focusing on the machinery supporting machine learning algorithms, I have found larger performance gains by concentrating on developing informative features and data representations. In two of the chapters in this thesis, the problems presented were not amenable to classification with generic features, which, for text, are typically words. Chapter 4 showed that the choice of features influenced the final
performance more than the machine learning algorithm. In this case, there
was little difference between the most and least accurate algorithms, and it
seems unlikely that further algorithm development will enhance performance.
However, the development of features is labor intensive. It requires domain
knowledge, and the features may not be immediately obvious. Good features
capture information relevant to the classification decision. Also, a representation of the feature must be discovered that is suitable for machine learning algorithms. Although much work has been done on developing algorithms, features are not as well developed.
As machine learning algorithms mature and are applied to different fields,
scientists must find ways to adapt the methodologies to their domain. Perhaps
the science of machine learning will turn toward developing new theories on
automatic development of features, or finding rigorous formulations of good
features. Currently, feature selection methods can find sets of features to help
distinguish classes. However, there should also be formalisms to assess the
quality of a feature based on its distribution, shape, or prevalence. In my work,
I have addressed these issues manually.
Thus, I have discovered features and developed methods necessary to iden-
tify relevant genes and their relationships with drugs. I have shown that the
relationship scoring algorithm performs well when applied to a static list of
genes that a user specifies. However, the literature is dynamic, with new genes
being named and altered constantly. The scoring algorithm would be robust
to these changes if it were applied to a current list of gene names found by
the gene name identification algorithm. Thus, the work in Chapter 5 may be
repeated by replacing the static list of gene names with a dynamic one. The
method would then be able to discover relationships of drugs involving genes
that the user may not have known about or previously considered.
However, as discussed in Section 4.3, developing a gene name identifica-
tion algorithm is only the first step to creating a list of gene names. Further
problems that need to be solved include handling variations in gene names,
synonymous gene names, and gene families. I will discuss these issues in Sec-
tion 7.2.
7.1 Summary of Contributions
My dissertation addressed three separate, but related, challenges in extracting information from biological text: identifying abbreviations, finding gene
names, and classifying the relationships between genes and drugs. In addi-
tion, I have made technical contributions in the delivery and accessibility of
my algorithms. Here are my contributions grouped according to these four ar-
eas.
Identifying Abbreviations
• I have developed an abbreviation identification algorithm that obtains
higher recall and precision than previous methods. It uses a dynamic-
programming algorithm to find possible alignments and scores the like-
lihood using logistic regression. It is robust to missing characters and
locations of letter alignments.
• I have created a database of all predicted abbreviations from MEDLINE.
The database contains nearly all biomedical abbreviations.
• I have developed an algorithm for aggregating similar abbreviations to
simplify online navigation.
• I have characterized the inter-observer variability and error rate in man-
ually identifying abbreviations in text.
• I have distinguished types of abbreviations that should be handled in sec-
ond generation abbreviation identification algorithms.
Finding Gene Names
• I have developed an algorithm to recognize names of genes and proteins
from free text. The algorithm produces a score that can be adjusted to
trade recall for precision. However, a tradeoff can be chosen such that
both precision and recall exceed that of previous approaches.
• I have developed a novel feature, morphology, for recognizing gene names.
• I have automatically discovered a list of signal and non-signal words associated with gene names. The signals include, as high-scoring words, those from previous hand-generated lists. The use of non-signal words is novel; they have not been used previously.
• I have discovered that deep analysis on single gene symbols can yield
better performance than a method that requires information from several
words.
• I have compared the performance of machine learning algorithms on this task and found that support vector machines perform best.
Classifying Gene-Drug Relationships
• I have applied the co-occurrence algorithm to a new problem domain —
gene-drug relationships. Similar to previous reports on protein-protein
interactions, co-occurring entities appear to have some biological relation-
ship.
• I have classified sentences with gene-drug co-occurrences into classes de-
fined by the PharmGKB.
• I have predicted the relationships of the genes and drugs in PharmGKB
by combining the scores from the sentences in which they co-occur.
Online Access to Algorithms
• I have created a web site for delivering information about abbreviations
in MEDLINE.
• I have developed XML-RPC web services for hosting algorithms over the
internet. I have demonstrated that web services can allow cross platform
and language independent access to code.
7.2 Future Work
An online resource of gene-drug relationships will be useful for pharmacoge-
nomics scientists. Although the data itself is beneficial, having it in a struc-
tured format allows scientists to generate and test more sophisticated biologi-
cal queries by linking the data to those in other databases. For example, understanding pharmacogenomic relationships (genetic variations that cause adverse reactions) requires information about polymorphisms, such as the data
stored in dbSNP (Smigielski et al., 2000). Genes that have many polymor-
phisms and are also known to interact with many drugs may be possible can-
didates to include in a clinical test for drug sensitivity. Similarly, information
about the mechanism of drug relationships may exist in the structure or path-
way of the gene product.
Ultimately, such a database of gene-drug relationships may be compiled automatically using the tools described in this thesis. It is important to note that many of the methods and ideas are not specific to pharmacogenomics. For
example, methodologies for identifying relationships can be applied to classify
gene-gene interactions. Also, the general types of features may be useful for
information extraction algorithms in other domains.
Although this thesis presents methodological advances toward the goal of
automatic identification of gene-drug relationships, many areas of the problem
still warrant further investigation.
7.2.1 Disambiguating Gene Names
Considerable ambiguity still exists in the definition of a gene name. Previous studies have found substantial disagreement among experts on whether a word is a gene (or protein) name (Krauthammer et al., 2000). In common
use, there can be little or no distinction between genes and their products,
types of genes, gene families, complexes, peptides, motifs, domains, alleles, and
gene structure such as introns and exons. This causes considerable difficulty
in the development of gene name identification algorithms. Without a clear
delineation of the different entities, comparing the performance of algorithms
is difficult.
Another issue related to gene and protein names is synonymy. Many genes have multiple names; for example, the BRCA1 gene is also known as PSCP and RNF53.
Algorithms that extract information about genes from literature must handle
synonyms to resolve references to the same gene.
However, even if a gene does not have synonyms, its name may still
vary in ways that confound simple string matching algorithms. For example,
hemoglobin beta may be written hemoglobin B or, more rarely in MEDLINE
abstracts, hemoglobin β. This is also the same gene as beta hemoglobin. Thus,
matching gene names is more complicated than matching strings. One possible way to recognize this is to normalize the gene name into a structured form in which each of the names appears similarly:

Core: hemoglobin
Specifier: beta

I present a possible approach to gene name normalization, without evaluation, in Appendix E.
7.2.2 Formal Descriptions of Data
One way to resolve the ambiguities around gene names is to organize those
and related terms in an ontology. Such a formalism complements work in in-
formation extraction. Ontologies could define the different types of biological
entities related to genes (such as proteins or gene families), and the relationships between them. For example, a protein is a product of a gene, an exon is a part of a gene, or, more specifically, hemoglobin is a type of globin. Using strict formal definitions provides informative distinctions for the user and
enables more specific characterization of the performance of algorithms. As de-
scribed in Section 3.3.3, even seemingly simple concepts such as abbreviations
may have subtle ambiguities that lead to disagreements and cause difficulties
in evaluating algorithms.
Using ontologies would also be important for inferring the family or type
of gene or drug. My current methods do not account for classes of genes or
drugs. In the linguistics literature, this relationship is called hypernymy. If the text analyzed contains references to broad classes of drugs, the algorithms
would not recognize that those may include specific names in the drug lexicon.
For example, they would not recognize that an abstract describing information
about anti-hypertensive drugs may also apply to Propranolol. To handle these
distinctions, algorithms should employ existing hierarchical lists of drugs, such
as the one produced by the commercial company Apelon. However, for gene
names, new methods will need to be developed to recognize hypernyms in text.
Another important use for ontologies is to guide the development of stan-
dardized data sets. If a data set is annotated according to an ontology, it will
contain the distinctions in the ontology and thus be useful for other users also
interested in those distinctions. However, without precise definitions, notions
of distinctions may be different enough so that the data set would have to be
reannotated. The Yapex test collection introduced in Chapter 4 contains a por-
tion of text from the GENIA data set (Franzen et al., 2002; Ohta et al., 2002).
However, the Yapex notion of protein was generally broader than the one in GENIA. For example, Yapex includes c-jun as a protein, although GENIA describes it more specifically as a DNA domain or region. Conversely, not all GENIA DNA domain or region names, such as promoter and proximal region, were proteins in Yapex. Presumably because of this type of inconsistency in definitions, the researchers reannotated the text in Yapex that was derived from GENIA.
Even if the community could agree on standard definitions of terms relevant
to biological text mining, there will be ambiguity in the definitions. Thus, an ontology should also include the expected inter-observer variability of different
distinctions in a manner similar to that reported in Section 3.3.3. This would
be a reasonable upper bound on the performance of algorithms.
7.2.3 Annotated Text Data
Several projects, notably GENIA, have produced high-quality annotated data
sets. However, as the knowledge domain becomes more complex and the dis-
tinctions more subtle, the amount of data needed to develop algorithms, to eval-
uate them meaningfully, and to distinguish them with statistical significance
will grow. The lack of annotated data sets can hinder research into algorithm development.
Developing many of my methods required significant labor to produce anno-
tated training sets. In addition, the lack of a proper gold standard significantly
hindered my ability to develop methods in Section 5.4 to score the correctness
of gene-drug relationships.
Although developing data sets requires manual verification, leveraging automated methods may mitigate some of the labor costs. Current machine learning methods such as active learning or co-training can generate labeled data sets from little training data. Thus, these methods may be able to simplify
the task by generating an initial low resolution “draft” of the annotated data
set. Some have argued that algorithms generated from noisy training data
may perform competitively with those trained on higher quality data (Morgan
et al., 2003). Assuming that verifying annotations is simpler than composing
them, these methods may be able to help efficiently develop larger quantities
of annotated data.
7.2.4 Full Text Articles
The final limitation of the work in this thesis is the reliance on information
in MEDLINE abstracts. Although abstracts should contain a summary of the
important findings in the paper, clearly there is more detailed and useful infor-
mation in the full text. Analyzing the full document should increase the recall
performance of algorithms. Fortunately, full text documents may soon be per-
vasive due to efforts such as PubScience and communication technologies such
as the internet (Roberts et al., 2001).
However, full text documents may introduce challenges not present when
analyzing abstracts. Journal articles contain many sections with information
other than new research findings. For example, Methods sections contain in-
formation about the experimental protocol that may be irrelevant to the re-
search hypothesis. Furthermore, Discussion sections often contain specula-
tions that may not yet be proven. Therefore, when full text articles are readily
available, the legitimacy of the information extracted should be analyzed with
respect to its source section.
7.3 Final Conclusions
Having structured knowledge in computable form will accelerate research in
pharmacogenomics and other health disciplines. Using tools that can automat-
ically extract the knowledge from unstructured literature will ensure access to
the most current known information. Once these knowledge bases are avail-
able, researchers will be able to develop algorithms that can link diverse types
of biological data to propose novel, and perhaps clever, biological hypotheses.
Only at that point will the scientific community have met the challenge pre-
sented by Swanson, when he threw down the gauntlet and charged the com-
munity to find the “undiscovered public knowledge.”
APPENDIX A

Training Maximum Entropy
The maximum entropy classifier has a statistical formulation that is provably
equivalent to the information theoretic one (Berger et al., 1996). Viewed this way, maximum entropy searches for the feature weightings that best explain the training set. This becomes an optimization problem of maximizing the likelihood of the model, which can be solved with generic algorithms such as conjugate gradient ascent (equivalently, descent on the negated objective). Since I have not been able to locate the derivation of this formulation, I provide it here for completeness.
The function to maximize is defined:

L(\alpha) = \sum_{x,y} p(x,y) \log p(y|x)

This is reminiscent of the conditional entropy formula. Here, x is an observation vector and y is a class. Thus, p(x,y) is the empirical probability that the observation occurs with the class, and p(y|x) is the probability of the class given an observation, calculated from the model.
The formula for the model is, according to the maximum entropy formulation:

p_\alpha(y|x) = \frac{1}{Z(x)} \prod_i e^{\alpha_i f_i(x,y)}

Z(x) is a normalization factor that ensures the result is a probability:

Z(x) = \sum_y \prod_i e^{\alpha_i f_i(x,y)}
Substituting the model into L(\alpha) yields the likelihood of the parameters \alpha in terms of the weighted feature values f_i(x,y):

L(\alpha) = \sum_{x,y} p(x,y) \log \frac{1}{Z(x)} \prod_i e^{\alpha_i f_i(x,y)}

This likelihood is calculated by summing over all combinations of x and y from the training set. p(x,y) is the probability of seeing a particular combination and provides a weight for the features based on their prevalence in the training set.
Distributing the log:

L(\alpha) = \sum_{x,y} p(x,y) \Big[ \sum_i \alpha_i f_i(x,y) - \log Z(x) \Big]

L(\alpha) = \sum_{x,y} p(x,y) \sum_i \alpha_i f_i(x,y) - \sum_{x,y} p(x,y) \log Z(x)

and expanding Z(x):

L(\alpha) = \sum_{x,y} p(x,y) \sum_i \alpha_i f_i(x,y) - \sum_{x,y} p(x,y) \log \Big( \sum_y \prod_i e^{\alpha_i f_i(x,y)} \Big)
Summing through the y variable in the second term (since \log Z(x) does not depend on y):

L(\alpha) = \sum_{x,y} p(x,y) \sum_i \alpha_i f_i(x,y) - \sum_x p(x) \log \Big( \sum_y \prod_i e^{\alpha_i f_i(x,y)} \Big)

yields the final version of the function to be maximized.
Maximizing this function with conjugate gradient ascent requires the partial derivatives with respect to each of the \alpha_i parameters. The partial derivative is:

\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_x p(x) \frac{1}{\sum_y \prod_i e^{\alpha_i f_i(x,y)}} \sum_y \Big[ \Big( \prod_i e^{\alpha_i f_i(x,y)} \Big) f_i(x,y) \Big]
Moving \sum_y out:

\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_{x,y} p(x) \frac{1}{\sum_y \prod_i e^{\alpha_i f_i(x,y)}} \Big( \prod_i e^{\alpha_i f_i(x,y)} \Big) f_i(x,y)
\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_{x,y} p(x) \frac{1}{Z(x)} \Big( \prod_i e^{\alpha_i f_i(x,y)} \Big) f_i(x,y)

\frac{dL(\alpha)}{d\alpha_i} = \sum_{x,y} p(x,y) f_i(x,y) - \sum_{x,y} p(x) \, p(y|x) f_i(x,y)
The partial derivatives can be interpreted as the difference between the feature expectations under the training data and under the model. Thus, the extreme value of the likelihood function is found where this gradient is 0, i.e. where there is no difference between the training data and the model.
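The derivation can be checked numerically. The sketch below implements the likelihood and its gradient for a toy two-class problem; the empirical distribution and the two feature functions are invented for illustration, and the analytic gradient is compared against a finite-difference estimate.

```python
import math

# Toy training distribution: observations x in {0, 1}, classes y in {0, 1},
# with empirical joint probabilities p(x, y).  The data and the two binary
# feature functions f_i(x, y) are invented for illustration.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
features = [lambda x, y: 1.0 if x == y else 0.0,
            lambda x, y: 1.0 if y == 1 else 0.0]

def p_model(y, x, alpha):
    """p_alpha(y|x): exponentiated weighted features, normalized by Z(x)."""
    unnorm = lambda yy: math.exp(sum(a * f(x, yy) for a, f in zip(alpha, features)))
    return unnorm(y) / (unnorm(0) + unnorm(1))

def log_likelihood(alpha):
    """L(alpha) = sum_{x,y} p(x,y) log p_alpha(y|x)."""
    return sum(p * math.log(p_model(y, x, alpha)) for (x, y), p in p_xy.items())

def gradient(alpha):
    """dL/dalpha_i = E_data[f_i] - E_model[f_i], the final formula above."""
    p_x = {0: p_xy[(0, 0)] + p_xy[(0, 1)], 1: p_xy[(1, 0)] + p_xy[(1, 1)]}
    grad = []
    for f in features:
        e_data = sum(p * f(x, y) for (x, y), p in p_xy.items())
        e_model = sum(p_x[x] * p_model(y, x, alpha) * f(x, y)
                      for x in (0, 1) for y in (0, 1))
        grad.append(e_data - e_model)
    return grad

# The analytic gradient agrees with a finite-difference estimate.
alpha, eps = [0.3, -0.2], 1e-6
for i in range(len(alpha)):
    shifted = list(alpha)
    shifted[i] += eps
    numeric = (log_likelihood(shifted) - log_likelihood(alpha)) / eps
    assert abs(numeric - gradient(alpha)[i]) < 1e-4
print("gradient check passed")
```

Any gradient-based optimizer, conjugate gradient included, only needs these two functions to fit the weights.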
APPENDIX B

Sentence Boundary Disambiguation
Sentence boundary identification is useful for text processing. Many tagging
and parsing applications require separate sentences as input.
Unfortunately, finding sentence boundaries is not straightforward because
of ambiguities in sentence-ending punctuation. Periods appear in many contexts, such as 0.05, N.I.H., or G. Bush (Table B.1). Therefore, determining
whether punctuation indicates a sentence boundary requires special process-
ing. Researchers have approached this problem both with heuristics and sta-
tistical models such as neural networks and maximum entropy models (Palmer
and Hearst, 1994; Reynar and Ratnaparkhi, 1997).
Fortunately, MEDLINE abstracts are relatively well structured. In general,
the text is regular and the sentences are well-formed. Thus, I use a simple set
of heuristics to find sentence boundaries (Figure B.1).
• Only '.', '?', '!', and '"' can be sentence boundaries.
• There is always a sentence boundary at the end of the text.
• A sentence boundary cannot precede another sentence boundary.
• Sentence boundaries always precede whitespace.
• Question marks and exclamation points mark the end of the sentence, as long as they are not followed by quotes.
• Quotation marks are sentence boundaries if they follow a sentence boundary character.
• A period followed by whitespace followed by a capital letter is a sentence boundary.

Figure B.1: Heuristics for Sentence Boundary Disambiguation. This heuristic does a reasonable job of finding sentence boundaries in MEDLINE abstracts.
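A rough rendering of the core of these heuristics follows. It implements only the main rule (sentence-ending punctuation followed by whitespace and a capital letter), not the quote-handling rules, and is a sketch rather than the code used in the thesis.

```python
import re

# A '.', '?', or '!' followed by whitespace and then a capital letter is
# taken as a sentence boundary; periods inside tokens (14.2), species
# names (E. coli), and abbreviations before lowercase words are left alone.
_BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")

def split_sentences(text):
    """Split a MEDLINE-style abstract into sentences."""
    return _BOUNDARY.split(text.strip())

abstract = ("We measured activity at 14.2 nM. The E. coli strain grew "
            "normally. No inhibition was observed.")
for sentence in split_sentences(abstract):
    print(sentence)
```

As Table B.1 suggests, this simple rule still fails when the next sentence starts with a number or a lowercase-initial gene name.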
Handled? | Description | Examples
√ | Period is inside a token | 14.2
√ | Punctuation repeated for emphasis | !!!
√ | Abbreviation | 300 ng i.p., N.I.H.-approved
√ | Period is inside a quote | ."
√ | Species name | E. coli
√ | E.C. numbers | EC 1.6.2.4
  | Next sentence starts with a number | "observed. 5′-Deletion"
  | Next sentence starts with a lower case letter | p-Nitrophenol, hCG
  | Numbering a list | 1. 2. 3.
  | Abbreviation in a name | Mol. Brain Res.

Table B.1: Sentence Boundary Ambiguities. This table shows some cases where the ends of sentences are ambiguous. The first column indicates whether each case is correctly handled by my heuristic.
APPENDIX C

Gene Drug Relationships
I manually identified the relationships between 100 pairs of genes and drugs.
See Section 5.4 for more information.
Gene | Drug | Abs | Relationship | PMID
Aldehyde dehydrogenase | Ethanol | 291 | Gene catalyzes drug metabolism. | 11762132
Glutathione S-transferase | Glutathione | 277 | Drug is substrate of gene. | 345769
Angiotensin-converting enzyme | Insulin | 178 | Gene affects sensitivity to drug. | 2220797
Glucocorticoid receptor | Steroids | 165 | Drug is substrate of gene. | 366226
CYP1A1 | ethoxyresorufin | 151 | Gene metabolizes drug. | 6422171
CYP2D6 | Quinidine | 145 | Drug inhibits gene. | 12867484
CYP2D6 | Mephenytoin | 115 | Drug is substrate of related gene. | 8861658
Angiotensin-converting enzyme | Beta blockers | 109 | Drug treats condition related to gene. | 191300
Angiotensin-converting enzyme | hydralazine | 88 | Drug inhibits gene. | 2416221
CYP3A4 | Mephenytoin | 84 | Drug is substrate of related gene. | 8861658
CYP1A2 | Mephenytoin | 82 | Drug is substrate of related gene. | 8861658
CYP2C9 | Mephenytoin | 81 | Drug is substrate of related gene. | 8861658
Angiotensin-converting enzyme | propranolol | 78 | Drug treats condition related to gene. | 2531184
MRP | Conjugates | 76 | Drug is non-specific. |
Angiotensin-II receptor | Enalapril | 75 | Drug inhibits gene in the pathway of the gene. | 10076917
Alcohol dehydrogenase | Glutathione | 73 | Drug is metabolite of gene. | 12631283
Angiotensin-II receptor | captopril | 68 | Drug inhibits gene in the pathway of the gene. | 10052650
Angiotensin-converting enzyme | calcium channel blockers | 65 | Drug treats condition related to gene. | 2487803
CYP1A2 | phenacetin | 61 | Gene metabolizes drug. | 7678502
CYP1A2 | Quinidine | 60 | Drug inhibits related gene. | 7895609
CYP2E1 | Glutathione | 60 | Gene activity influenced by drug concentration. | 9101035
CYP3A4 | dextromethorphan | 60 | Gene metabolizes drug. | 9352574
Angiotensin-converting enzyme | digoxin | 58 | Drug treats condition related to gene. | 7503006
CYP1A2 | Benzo(a)pyrene | 57 | Drug is substrate of gene. | 8819302
CYP1A2 | dextromethorphan | 57 | Related gene metabolizes drug. | 9352574
CYP1A2 | Tolbutamide | 56 | Related gene metabolizes drug. | 9431831
CYP3A4 | Tolbutamide | 56 | Related gene metabolizes drug. | 9431831
CYP2D6 | Tolbutamide | 55 | Related gene metabolizes drug. | 9431831
CYP2E1 | Mephenytoin | 55 | Related gene metabolizes drug. | 11523064
DT-diaphorase | Glutathione | 55 | Drug transferase and gene activity were sometimes similar. | 2391358
CYP2C9 | Quinidine | 54 | Related gene metabolizes drug. | 7720520
CYP2D6 | Fluvoxamine | 54 | Gene metabolism of drug is not clinically significant. | 8846617
CYP2E1 | ethoxyresorufin | 53 | Related gene metabolizes drug. | 10383922
CYP2C19 | Tolbutamide | 51 | Gene metabolizes drug. | 10411572
CYP2C19 | dextromethorphan | 50 | Related gene metabolizes drug. | 10859141
CYP2C9 | dextromethorphan | 50 | Related gene metabolizes drug. | 10859141
CYP2D6 | caffeine | 50 | Related gene metabolizes drug. | 9867310
Glucocorticoid receptor | Insulin | 50 | Low drug levels associated with gene polymorphism. | 12351458
MDR1 | Glutathione | 50 | Levels of drug transferase and gene are correlated. | 1348425
CYP2D6 | Coumarin | 49 | Related gene metabolizes drug. | 10828259
CYP2E1 | Coumarin | 49 | Related gene metabolizes drug. | 10828259
CYP1A2 | Coumarin | 47 | Related gene metabolizes drug. | 10828259
CYP3A4 | Coumarin | 46 | Related gene metabolizes drug. | 10828259
CYP2C19 | Quinidine | 44 | Related gene metabolizes drug. | 7640150
ALDH2 | Ethanol | 43 | Gene metabolizes drug. | 3067025
CYP3A4 | ethoxyresorufin | 42 | Related gene metabolizes drug. | 10725303
Catechol-O-methyltransferase | dopamine | 42 | Gene activity reduces amount of drug available. | 8190296
CYP1A1 | Glutathione | 41 | Drug reduces inhibition of gene. | 12519694
CYP1A2 | Glutathione | 41 | Inactivation of gene not affected by drug. | 11714871
CYP2C19 | Fluvoxamine | 41 | Drug inhibits gene. | 10674711
ALDH2 | Glutathione | 1 | No relationship. |
Angiotensin-II receptor | androgens | 1 | No relationship. |
CYP1A1 | Fluorouracil | 1 | No relationship. |
CYP1A1 | Lovastatin | 1 | Drug does not influence gene activity. | 11523064
CYP1A1 | lidocaine | 1 | No relationship. |
CYP1A2 | Histamine | 1 | Competitor for drug had no effect on gene activity. | 9485522
CYP1A2 | Macrolides | 1 | Drug inhibits related gene. | 8119047
CYP1B1 | omeprazole | 1 | No relationship. |
CYP1B1 | phenytoin | 1 | Drug is substrate of gene family. | 9493761
CYP2A6 | Oral contraceptives | 1 | Drug does not affect metabolic rate of gene. | 9653923
CYP2A6 | glucuronide | 1 | Drug is substrate of related gene. | 11377097
CYP2A6 | procainamide | 1 | Related gene metabolizes drug. | 9352574
CYP2C19 | Steroids | 1 | Gene family metabolizes drug. | 7704034
CYP2C19 | halothane | 1 | No relationship. |
CYP2C9 | ascorbic acid | 1 | No relationship. |
CYP2C9 | codeine | 1 | Gene does not affect rate of drug metabolism. | 9113345
CYP2C9 | flecainide | 1 | Drug inhibits gene. | 8801060
CYP2D6 | hydralazine | 1 | No relationship. |
CYP2E1 | dopamine | 1 | Gene is expressed in cells containing drug. | 9881865
CYP3A5 | aflatoxin | 1 | Gene activates drug. | 10224324
CYP3A7 | phenytoin | 1 | Drug is substrate for family of gene. | 9493761
DT-diaphorase | glucuronide | 1 | No relationship. |
Dihydropyrimidine dehydrogenase | Bilirubin | 1 | Disabled gene leads to increased level of drug. | 10348793
Dihydropyrimidine dehydrogenase | Glutathione | 1 | No relationship. |
GSTM1 | vinyl chloride | 1 | Drug may be metabolized by gene. | 9838066
GSTP1 | Cyclosporin A | 1 | Drug activity leads to reduced gene function. | 11108662
Glucocorticoid receptor | Quinidine | 1 | No relationship. |
Glucocorticoid receptor | Tacrolimus | 1 | Drug is substrate of gene. | 9600660
Glucocorticoid receptor | menadione | 1 | Drug binds to gene. | 11311319
Glutathione S-transferase | Histamine | 1 | Gene influences drug action. | 10372823
Glutathione S-transferase | paclitaxel | 1 | Drug did not enhance influence of NSAIDs on gene activity. | 9849488
MRP | Benzo(a)pyrene | 1 | No relationship. |
MRP | ethoxyresorufin | 1 | No relationship. |
MRP | naproxen | 1 | Increased toxicity observed with active drug and overexpressed gene. | 9849488
MRP | propranolol | 1 | No relationship. |
MRP | tamoxifen | 1 | No relationship. |
N-Acetyltransferase | Conjugates | 1 | No relationship. |
N-Acetyltransferase | Insulin | 1 | Drug leads to activation of gene. | 7017937
N-Acetyltransferase | vinyl chloride | 1 | Gene does not contribute to disease caused by gene. | 1458774
NAT1 | Phenacetin | 1 | Drug exposure does not lead to gene polymorphisms. | 9761125
NAT2 | Irinotecan | 1 | No relationship. |
Peroxisome proliferator-activated receptor | Glutathione | 1 | No relationship. |
Peroxisome proliferator-activated receptor | naringenin | 1 | Drug had no effect on gene expression. | 11245597
Prothrombin | p-aminobenzoic acid | 1 | No relationship. |
Stromelysin | Dexamethasone | 1 | No relationship. |
Sulfotransferase | Benzo(a)pyrene | 1 | No relationship. |
Sulfotransferase | Oral contraceptives | 1 | Drug may lead to changes in gene activity. | 6934248
UGT1A1 | Ethanol | 1 | Drug influences activity of gene. | 11091029
Vitamin D receptor | tamoxifen | 1 | No relationship. |
uridine diphosphate-glucuronosyltransferase | glucuronide | 1 | Drug is substrate of gene. | 9054958

Table C.1: Relationships Between Genes and Drugs. This table describes the relationships between 100 gene-drug pairs found to co-occur in the literature but not identified in a review by Evans & Relling. The first 50 are the genes and drugs that appear in the greatest number of abstracts. The last 50 are randomly chosen genes and drugs that appear in one abstract only. The Abs column gives the number of abstracts that contained both the gene and drug. The Relationship column describes the relationship between the gene and drug.
APPENDIX D

Classifying PharmGKB Relationships
Chapter 5 presented an algorithm for predicting the relationships between
genes and drugs. I applied this algorithm to a list of genes and drugs from
the Pharmacogenomics Knowledge Base. The following table contains a com-
plete list of the predictions.
Gene | Drug | Actual Rel. | Predicted Rel.
ABCC8 | tolbutamide | PD | PD
ACE | ace inhibitors | Gn PD | Gn PD
ACE | atenolol | PD | Gn PD
ACE | captopril | PD | PD
ACE | enalapril | PD | Gn PD
ACE | enalaprilat | Gn PD | PD
ACE | fluvastatin | Gn PD CO | Gn PD
ACE | fosinopril | Gn FA PD CO | Gn PD
ACE | gemfibrozil | Gn PD | Gn PD
ACE | imidapril | PD | PD
ACE | lisinopril | Gn PD | Gn PD
ALDH2 | ethanol | Gn PD | Gn PD
ALDH2 | vinyl chloride | Gn PD | Gn PK PD
APOA1 | testosterone | PD | Gn
APOE | choline | PD | FA
APOE | donepezil | PD | PD
APOE | estrogens | Gn | PD
APOE | fenofibrate | PD | FA PD
APOE | gemfibrozil | Gn PD | PD
APOE | simvastatin | Gn FA PD CO | PD
APOE | tacrine | Gn PD CO | Gn PD
BCHE | succinylcholine | CO | Gn PD
C3 | gemfibrozil | Gn PD | Gn PD
CETP | pravastatin | PD CO | PD
CFTR | cpx | Gn FA |
COMT | methyldopa | PK PD | PK PD
CYP1A2 | amiodarone | PK | PK
CYP1A2 | caffeine | PK | PK
CYP1A2 | clozapine | PK | PK PD
CYP1A2 | estradiol | PK | PD
CYP1A2 | fluvoxamine | PK | PK PD
CYP1A2 | haloperidol | PK | PK PD
CYP1A2 | imipramine | PK | PK
CYP1A2 | modafinil | PK | PK PD
CYP1A2 | naproxen | PK | PK
CYP1A2 | ondansetron | PK | PK
CYP1A2 | propranolol | PK | PK
CYP1A2 | riluzole | PK | PK
CYP1A2 | ropivacaine | PK | PK PD
CYP1A2 | tacrine | PK | PK
CYP1A2 | theophylline | PK | PK
CYP1A2 | ticlopidine | PK | PK PD
CYP1A2 | verapamil | PK | PK
CYP1A2 | zolmitriptan | PK | PK
CYP1A2 | zoxazolamine | PD | PK
CYP1B1 | estrogens | CO | Gn PK
CYP2A6 | fadrozole | Gn | Gn PD CO
CYP2A6 | fluorouracil | PK | PK
CYP2A6 | rifampin | FA |
CYP2A6 | tegafur | PK | PK
CYP2B6 | aflatoxin b1 | FA PD | PK
CYP2B6 | bupropion | PK | PK
CYP2B6 | cyclophosphamide | FA PK PD | PK
CYP2B6 | ifosfamide | PK | PK
CYP2B6 | phenobarbital | PK | FA
CYP2B6 | rifampin | PK | PK
CYP2C19 | amitriptyline | PK | PK
CYP2C19 | cyclophosphamide | PK | Gn PK PD
CYP2C19 | diazepam | Gn PK PD | Gn PK PD
CYP2C19 | fluoxetine | PK | PK PD
CYP2C19 | fluvoxamine | PK | PK PD
CYP2C19 | hexobarbital | PK | Gn PK PD
CYP2C19 | lansoprazole | Gn PK CO | Gn PK PD
CYP2C19 | mephenytoin | FA | PK
CYP2C19 | modafinil | PK | PK PD
CYP2C19 | nelfinavir | PK | PK PD
CYP2C19 | omeprazole | Gn PK PD CO | Gn PK PD
CYP2C19 | pantoprazole | PK | Gn PK
CYP2C19 | proguanil | PK | PK
CYP2C19 | ticlopidine | FA | PK PD
CYP2C8 | paclitaxel | PK PD CO | PK
CYP2C8 | rifampin | FA |
CYP2C9 | acenocoumarol | Gn PD | PK
CYP2C9 | amiodarone | PK | PK PD
CYP2C9 | celecoxib | PK | PK PD
CYP2C9 | diclofenac | PK | PK PD
CYP2C9 | fluconazole | PK | Gn PD
CYP2C9 | fluoxetine | PK | PK PD
CYP2C9 | fluvastatin | PK | Gn PK PD
CYP2C9 | fluvoxamine | PK | PK
CYP2C9 | glimepiride | PK | PD
CYP2C9 | glyburide | PK | PK PD
CYP2C9 | ibuprofen | CO | PK
CYP2C9 | isoniazid | PK | PK PD
CYP2C9 | losartan | PK | Gn PK PD
CYP2C9 | phenylbutazone | PK | PD CO
CYP2C9 | phenytoin | PK PD CO | PK PD
CYP2C9 | rifampin | FA |
CYP2C9 | tolbutamide | PK PD CO | PK PD
CYP2C9 | torsemide | PK | PK
CYP2C9 | warfarin | Gn PK PD CO | Gn PK PD
CYP2D6 | amitriptyline | PK | PK
CYP2D6 | cimetidine | PK | PK
CYP2D6 | clomipramine | PK | PK
CYP2D6 | clozapine | FA | PK
CYP2D6 | cocaine | PK | PK
CYP2D6 | codeine | Gn PK PD CO | PK PD
CYP2D6 | debrisoquine | Gn PK | PK
CYP2D6 | desipramine | PK | PK PD
CYP2D6 | dextromethorphan | PK | PK
CYP2D6 | diltiazem | PK | PK
CYP2D6 | flecainide | PK | Gn PK PD
CYP2D6 | fluoxetine | Gn PK CO | PK PD
CYP2D6 | fluvoxamine | PK | PK PD
CYP2D6 | haloperidol | Gn PK PD | PK PD
CYP2D6 | imipramine | PK | PK
CYP2D6 | interferon alpha | PK | Gn PD CO
CYP2D6 | metoprolol | Gn PK PD | PK
CYP2D6 | mexiletine | Gn PK PD | PK
CYP2D6 | morphine | CO | PK PD
CYP2D6 | paroxetine | PK | PK PD
CYP2D6 | perphenazine | PK | Gn PK PD
CYP2D6 | propafenone | PK PD | PK PD
CYP2D6 | risperidone | Gn PK PD | Gn PK PD
CYP2D6 | ritonavir | PK | PD
CYP2D6 | sparteine | PD | PK
CYP2D6 | thioridazine | PK | PK PD
CYP2D6 | timolol | PK | PD
CYP2D6 | tramadol | PK PD | PK
CYP2D6 | xenobiotics | Gn PK | Gn PK
CYP2D6 | zuclopenthixol | PK | Gn PK PD
CYP2E | alcohol | PK PD |
CYP2E | ethanol | Gn FA PD | PD
CYP2E | xenobiotics | Gn FA PK PD | FA
CYP3A4 | alprazolam | PK | PK PD
CYP3A4 | epipodophyllotoxin | Gn PD | Gn PK PD
CYP3A4 | nifedipine | Gn PK | PK
CYP3A4 | omeprazole | PK PD | PK PD
CYP3A4 | testosterone | Gn PD | PK
CYP3A4 | xenobiotics | Gn FA | Gn PK
CYP3A5 | midazolam | Gn FA PD | PK
CYP3A5 | tacrolimus | PK | PK CO
CYP3A5 | xenobiotics | Gn PD | FA PK
CYP4B1 | xenobiotics | Gn | FA PK PD
DHFR | methotrexate | PK | PK
DRD3 | clozapine | Gn PD | PD
DRD3 | neuroleptics | Gn PD CO | PD
DRD4 | antipsychotics | Gn | CO
F2 | estrogens | CO | PD
F2 | oral contraceptives | Gn CO | Gn PD
GSTA1 | busulfan | Gn FA PK | PK
GSTA1 | xenobiotics | Gn | FA PD
GSTM1 | tacrine | PK PD | Gn PD
GSTM1 | xenobiotics | CO | Gn PK
GSTT1 | xenobiotics | CO | PK
HLA-DRB1 | cyclosporine | PD | Gn PD
HNMT | histamine | Gn FA | Gn PD
HTR2A | clozapine | Gn PD CO | Gn PD
HTR2C | clozapine | PD | Gn
HTR6 | clozapine | Gn | PD
INPP1 | lithium | Gn PD | PD
LPL | fenofibrate | PD | PD
MGMT | carmustine | PD |
NAT1 | sulfamethoxazole | FA | PK PD
NAT1 | xenobiotics | Gn | Gn FA PK
NAT2 | caffeine | Gn PK | PK
NAT2 | hydralazine | PK PD | Gn PK PD
NAT2 | isoniazid | PK | Gn PK PD
NAT2 | procainamide | Gn PK PD | PK PD
NAT2 | sulfamethoxazole | PK | PK PD
NAT2 | xenobiotics | Gn FA CO | Gn PK
NQO1 | benzene | PK PD CO | PK
NQO1 | mitomycin c | CO | PK
NQO1 | xenobiotics | Gn FA | FA
NR1I2 | xenobiotics | Gn FA | FA
OPRM1 | opiates | FA | Gn FA PD
RYR1 | halothane | PD |
SCN5A | mexiletine | PD | Gn PD
SULT | xenobiotics | FA | Gn
SULT1A1 | phenol | Gn FA | PK
SULT2A1 | dehydroepiandrosterone | Gn FA | PK
TNF | thalidomide | CO |
TPH | fluvoxamine | PD | PD
TPH | paroxetine | PD | PD
TPMT | 6-mercaptopurine | Gn PK PD CO | Gn PK PD CO
TPMT | 6-thioguanine | Gn PD | Gn FA PK PD CO
TPMT | azathioprine | Gn FA PK PD CO | Gn PD
TPMT | cefazolin | Gn PD | Gn PK PD CO
TPMT | mercaptopurine | Gn FA PK PD CO | Gn FA PK PD CO
TPMT | sulfasalazine | PK | Gn FA PK PD CO
TPMT | thioguanine | Gn FA PK PD CO | Gn FA PK PD CO
TPMT | thiopurines | Gn FA PD | Gn FA PK PD CO
UGT1A1 | irinotecan | PK PD CO | PK PD
UGT1A9 | flavopiridol | FA | PK PD
UGT2B7 | epirubicin | FA PK | PK
VDR | vitamin d | FA | FA
XRCC1 | alcohol | PD | Gn PK PD

Table D.1: Gene-Drug Classification Results. I predicted the type of relationship for gene-drug pairs. The Actual Rel. column lists the relationships identified by a PharmGKB curator; the Predicted Rel. column lists the relationships my algorithm predicted. See Section 5.6 for a description of the algorithm.
APPENDIX E

Gene Name Normalization
The names of many genes can exhibit diverse variations. For example, beta
hemoglobin and hemoglobin B are the same gene. Thus, one way to recognize
these as synonyms of the same gene is to normalize them to a common form.
In this form, they will be easier to compare. Matching these variations in
gene names is important for information retrieval engines. If the engine cannot recognize a variation of a name, it will not be able to recognize a relevant document.
As part of a submission to the information retrieval task of the 2003 TREC
Genomics Track, I developed a gene name normalization heuristic using word
matching (Voorhees, 2002b). Assuming that a limited number of words appear
peripherally to the core words of the gene name, it is possible to enumerate
those words nearly comprehensively. The words are then classified into types
useful for normalization. For example, two gene names that differ by a function
word (e.g. inhibitor) likely indicate different genes. On the other hand, if they
differ only by a specifier (e.g. the number 5), then they may be in the same
family.
144
APPENDIX E. GENE NAME NORMALIZATION
I started with an initial set of word classes defined in ProMiner (Hanisch
et al., 2003). I applied the algorithm to a list of gene names derived from the TREC training set. This produced a list of words assigned to types according to the classification. Because the classification was not complete, many of the words were unassigned. I assigned types to the most frequently occurring
words, creating new types when necessary. Then, I repeated these steps until
I was satisfied with the coverage of the classification.
In the tables below, I present the classification of the words in gene names
into types. Every word that does not appear in this classification is a putative
Core gene name.
Entity: molecule, protein, gene, oncogene, dna, cdna, rna, trna, mrna, fragment, fragments, peptide, polypeptide, neuropeptide, precursor, product, transcript, clone, factor, factors, antigen, isoform

Function: activator, inactivator, inhibitor, inhibitory, binding, interacting, converting, modulating, repair, sorting, transporter, transporting, regulating, regulator, regulatory, silencer, suppressor

Structure: complex, component, subcomponent, chain, subunit, domain, subdomain, region, promoter, repeat

Location: membrane, golgi

Relationship: associated, coupled, interactor, interaction, dependent, regulated, receptor, receptors, ligand, substrate, substrates, carrier, agonist, antagonist, member, family, superfamily, homolog, related, like

Modifier: activated, induced, inducible, catalytic, pending, putative, variant, partial

Specifier: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, i, ii, iii, iv, v, vi, vii, viii, ix, x, xi, xii, xiii, xiv, xv, xvi, xvii, xviii, xix, xx, alpha, beta, gamma, epsilon, kappa, delta, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, 1a, 2d6

Common Word: and, of, at, in, by, include, included, see, also, small, heavy, stiff, major

Not Gene: cocaine, substance, fibroblast, hormone, disease, syndrome, synthesis, transcription, lymphoproliferative, containing, differential, display, generalized, growth, progressive, signal, transducer

Table E.1: Classes of Words in Gene Names. This table lists the words that appear in gene names, organized by type of word.
Clearly, this is not a comprehensive list of all words that can appear in a
gene name. Also, note that many words have morphologic variations, such as
fragment and fragments. A final system should use a morphologic stemmer to
recognize variants, even if they do not appear in this list. Also, this list includes
many sequences, such as numbers or Greek letters. The final algorithm should
recognize these without requiring a comprehensive enumeration.
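The word-class heuristic can be sketched as follows. The class table is abbreviated to a few illustrative entries from Table E.1, and the naive plural stripping stands in for the morphologic stemmer suggested above; any word not in the table is taken as a Core word.

```python
# Abbreviated word-class table from Table E.1 (an illustrative subset).
WORD_CLASSES = {
    "gene": "Entity", "protein": "Entity", "receptor": "Relationship",
    "inhibitor": "Function", "domain": "Structure", "of": "Common Word",
    "beta": "Specifier", "b": "Specifier", "2": "Specifier",
}

def normalize(name):
    """Map a gene name to {word class: [words]}; unknown words are Core."""
    normalized = {}
    for word in name.lower().split():
        # Naive plural stripping stands in for a real morphologic stemmer.
        if word.endswith("s") and word[:-1] in WORD_CLASSES:
            word = word[:-1]
        cls = WORD_CLASSES.get(word, "Core")
        normalized.setdefault(cls, []).append(word)
    return normalized

# 'hemoglobin beta' and 'hemoglobin B' share the same Core word and
# differ only by a Specifier, so they may name the same gene or family.
print(normalize("hemoglobin beta"))
print(normalize("hemoglobin B"))
```

Two normalized names that agree on Core and Function words but differ by a Specifier can then be flagged as members of the same family.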
Bibliography
Acronym Finder. Acronym finder. http://www.acronymfinder.com/, 2002.

Acronyms and Initialisms. Acronyms and initialisms for health information resources. http://www.geocities.com/~mlshams/acronym/acr.htm, 2002.

DP Agarwal. Genetic polymorphisms of alcohol metabolizing enzymes. Pathologie-biologie, 49(9):703–709, Nov 2001.

SF Altschul, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215(3):403–410, Oct 1990.

MA Andrade and P Bork. Automated extraction of information in molecular biology. FEBS Letters, 476:12–7, 2000.

Jonathan H. Aseltine. Wave: An incremental algorithm for information extraction. In Proceedings of the AAAI 1999 Workshop on Machine Learning for Information Extraction, 1999.

M Ashburner, CA Ball, JA Blake, D Botstein, H Butler, JM Cherry, AP Davis, K Dolinski, SS Dwight, JT Eppig, MA Harris, DP Hill, L Issel-Tarver, A Kasarskis, S Lewis, JC Matese, JE Richardson, M Ringwald, GM Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genetics, 25:25–9, 2000.

A Bairoch and B Boeckmann. The swiss-prot protein sequence data bank. Nucleic Acids Research, 19 Suppl:2247–2249, Apr 1991.

PG Baker, A Brass, S Bechhofer, C Goble, N Paton, and R Stevens. Tambis—transparent access to multiple bioinformatics information sources. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 6:25–34, 1998.

JD Baxter. Mechanisms of glucocorticoid inhibition of growth. Kidney International, 14(4):330–333, Oct 1978.

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996. URL citeseer.nj.nec.com/berger96maximum.html.
Biopython. Biopython. http://www.biopython.org/, 1998.

MV Blagosklonny and AB Pardee. Conceptual biology: Unearthing the gems. Nature, 416(6879):373, Mar 2002.

C Blaschke, MA Andrade, C Ouzounis, and A Valencia. Automatic extraction of biological information from scientific text: protein-protein interactions. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 60–7, 1999.

Fintan Bolton. Pure CORBA. SAMS, 2001.

E. Brill. Some advances in transformation-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), 1994.

PP Bringuier, M McCredie, G Sauter, M Bilous, J Stewart, MJ Mihatsch, P Kleihues, and H Ohgaki. Carcinomas of the renal pelvis associated with smoking and phenacetin abuse: p53 mutations and polymorphism of carcinogen-metabolising enzymes. International Journal of Cancer, 79(5):531–536, Oct 1998.

Christopher Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.

M. Califf and R. Mooney. Relational learning of pattern-match rules for information extraction. In Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99). AAAI Press / MIT Press, 1999.

N. Catala, N. Castell, and M. Martin. Essence: a portable methodology for acquiring information extraction patterns. In Proceedings of the 14th European Conference on Artificial Intelligence, Berlin, 2000.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Jeffrey T Chang, Hinrich Schutze, and Russ B Altman. Gapscore: Finding gene and protein names one word at a time. Bioinformatics, Accepted.

JT Chang and RB Altman. Promises of text processing: natural language processing meets ai. Drug Discovery Today, 7(19):992–993, Oct 2002.
JT Chang, S Raychaudhuri, and RB Altman. Including biological literature improves homology search. Pacific Symposium on Biocomputing, pages 374–383, 2001.

JT Chang, H Schutze, and RB Altman. Creating an online dictionary of abbreviations from medline. Journal of the American Medical Informatics Association, 9(6):612–620, Nov–Dec 2002.

RO Chen, R Felciano, and RB Altman. Riboweb: linking structural computations to a knowledge base of published experimental data. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5:84–87, 1997.

China Medical Tribune. China medical tribune. http://www.cmt.com.cn/, 2002.

LH Cohen, RE van Leeuwen, GC van Thiel, JF van Pelt, and SH Yap. Equally potent inhibitors of cholesterol synthesis in human hepatocytes have distinguishable effects on different cytochrome p450 enzymes. Biopharmaceutics and Drug Disposition, 21(9):353–364, Dec 2000.

Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley-Interscience, 1991.

M Craven and J Kumlien. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, pages 77–86, 1999.

Mark Craven. The genomics of a signaling pathway: A kdd cup challenge task. Technical report, University of Wisconsin, December 2002.

Hamish Cunningham. Information extraction – a user guide. Technical report, University of Sheffield, 1999.

Cytochrome P450 Homepage. Cytochrome p450 homepage. http://drnelson.utmem.edu/CytochromeP450.html, 2003.

JN Darroch and D Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43:1470–1480, 1972.

J Ding, D Berleant, D Nettleton, and E Wurtele. Mining medline: abstracts, sentences, or phrases? Pacific Symposium on Biocomputing, pages 326–337, 2002.
T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 1993.

Robert Englander. Developing Java Beans. O’Reilly & Associates, 1997.

WE Evans and MV Relling. Pharmacogenomics: translating functional genomics into rational therapeutics. Science, 286(5439):487–491, Oct 1999.

J Firth. A Synopsis of Linguistic Theory 1930-1955. Philological Society, Oxford, 1957.

D Fisher, S Soderland, J McCarthy, F Feng, and W Lehnert. Description of the umass system as used for muc-6. In Proceedings of the Sixth Message Understanding Conference (MUC-6), pages 127–140, San Francisco, CA, 1995. Morgan Kaufmann.

George Forman. Feature engineering for a gene regulation prediction task. Technical report, HP Labs, December 2002.

K Franzen, G Eriksson, F Olsson, L Asker, P Liden, and J Coster. Protein names and how to find them. International Journal of Medical Informatics, 67(1–3):49–61, Dec 2002.

D. Freitag. Machine learning for information extraction from online documents, 1996. URL citeseer.nj.nec.com/freitag96machine.html.

D. Freitag and A. McCallum. Information extraction with hmms and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.

Dayne Freitag. Machine Learning for Information Extraction in Informal Domains. PhD thesis, Carnegie Mellon University, November 1998.

Dayne Freitag and Andrew McCallum. Information extraction with hmm structures learned by stochastic optimization. In Proceedings of the Eighteenth Conference on Artificial Intelligence (AAAI-98), 2000.

C Friedman. Towards a comprehensive medical language processing system: methods and issues. Proceedings of the AMIA Annual Symposium, pages 595–599, 1997.

C Friedman. A broad-coverage natural language processing system. Proceedings of the AMIA Annual Symposium, pages 270–4, 2000.
C Friedman, PO Alderson, JH Austin, JJ Cimino, and SB Johnson. A general natural-language text processor for clinical radiology. Journal of the American Medical Informatics Association, 1:161–74, 1994.

C Friedman, L Hirschman, R McEntire, and C Wu. Linking biological language, information and knowledge. Pacific Symposium on Biocomputing, 8:388–390, 2003.

C Friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17 Suppl 1:S74–S82, Jun 2001.

K Fukuda, A Tamura, T Tsunoda, and T Takagi. Toward information extraction: identifying protein names from biological papers. In Pacific Symposium on Biocomputing, volume 3, pages 707–18, 1998.

Moustafa M. Ghanem, Yike Guo, Huma Lodhi, and Yong Zhang. Automatic scientific text classification using local patterns: Kdd cup 2002 (task 1). Technical report, Imperial College of Science Technology & Medicine, December 2002.

N Hamaoka, Y Oda, I Hase, and A Asada. Cytochrome p4502b6 and 2c9 do not metabolize midazolam: kinetic analysis and inhibition study with monoclonal antibodies. British Journal of Anaesthesia, 86(4):540–544, Apr 2001.

BP Hamilton. Diabetes mellitus and hypertension. American Journal of Kidney Diseases, 16(4 Suppl 1):20–29, Oct 1990.

D Hanisch, J Fluck, HT Mevissen, and R Zimmer. Playing biology’s name game: identifying protein names in scientific text. Pacific Symposium on Biocomputing, 8:403–414, 2003.

T Hastie, R Tibshirani, and J Friedman. The Elements of Statistical Learning. Springer-Verlag, New York, 2001.

V Hatzivassiloglou, PA Duboue, and A Rzhetsky. Disambiguating proteins, genes, and rna in text: a machine learning approach. Bioinformatics, 17 Suppl 1:S97–S106, Jun 2001.

M Hewett, DE Oliver, DL Rubin, KL Easton, JM Stuart, RB Altman, and TE Klein. Pharmgkb: the pharmacogenetics knowledge base. Nucleic Acids Research, 30(1):163–165, Jan 2002.
L Hirschman, AA Morgan, and AS Yeh. Rutabaga by any other name: extracting biological names. Journal of Biomedical Informatics, 35(4):247–259, Aug 2002a.

L Hirschman, JC Park, J Tsujii, L Wong, and CH Wu. Accomplishments and challenges in literature data mining for biology. Bioinformatics, 18(12):1553–1561, Dec 2002b.

T Hishiki, N Collier, C Nobata, T Okazaki-Ohta, N Ogata, T Sekimizu, R Steiner, HS Park, and J Tsujii. Developing nlp tools for genome informatics: An information extraction perspective. In Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, volume 9, pages 81–90, 1998.

J. Hobbs, R. Douglas, E. Appelt, J. Bear, D. Israel, M. Kameyama, M. Stickel, and M. Tyson. FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. MIT Press, Cambridge, MA, 1996.

Wen-Juan Hou and Hsin-Hsi Chen. Enhancing performance of protein name recognizers using collocation. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 25–32, 2003.

S Huffmann. Learning information extraction patterns from examples, pages 246–260. Springer-Verlag, Berlin, 1996.

Human Genome Acronym List. Human genome acronym list. http://www.ornl.gov/hgmis/acronym.html, 2002.

BL Humphreys, DA Lindberg, HM Schoolman, and GO Barnett. The unified medical language system: an informatics research collaboration. Journal of the American Medical Informatics Association, 5(1):1–11, Jan–Feb 1998.

K Humphreys, G Demetriou, and R Gaizauskas. Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. In Pacific Symposium on Biocomputing, volume 5, pages 505–16, 2000.

David Hutchinson. Medline for Health Professionals: How to Search PubMed on the Internet. New Wind, Sacramento, 1998.

Stanley Jablonski, editor. Dictionary of Medical Acronyms & Abbreviations. Hanley & Belfus, 1998.
WB Jakoby. The glutathione s-transferases: a group of multifunctional detoxification proteins. Advances in Enzymology and Related Areas in Molecular Biology, 46:383–414, 1978.

DC Jamison. Open bioinformatics. Bioinformatics, 19(6):679–680, Apr 2003.

YN Jan. Pre-empting the arrival of a dark lord. Nature, 389(6652):665, Oct 1997.

TK Jenssen, A Laegreid, J Komorowski, and E Hovig. A literature network of human genes for high-throughput analysis of gene expression. Nature Genetics, 28(1):21–28, May 2001.

Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. Technical report, Universitat Dortmund, 1997.

PD Karp, M Riley, SM Paley, and A Pelligrini-Toole. Ecocyc: an encyclopedia of escherichia coli genes and metabolism. Nucleic Acids Research, 24(1):32–39, Jan 1996.

Jun’ichi Kazama, Takaki Makino, Yoshihiro Ohta, and Jun’ichi Tsujii. Tuning support vector machines for biomedical named entity recognition. In Proceedings of the ACL 2002 Workshop on Natural Language Processing in Biomedicine, pages 1–8, 2002.

S. Sathiya Keerthi, Chong Jin Ong, Keng Boon Siah, David B.L. Lim, Wei Chu, Min Shi, David S. Edwin, Rakesh Menon, Lixiang Shen, Jonathan Y.K. Lim, and Han Tong Loh. A machine learning approach for the curation of biomedical literature – kdd cup 2002 (task 1). Technical report, National University of Singapore, December 2002.

BW Kernighan and DM Ritchie. The C Programming Language. Prentice Hall, Upper Saddle River, NJ, 1988.

Jun-Tae Kim and Dan I. Moldovan. Palka: A system for lexical knowledge acquisition. In Proceedings of the International Conference on Information and Knowledge Management, pages 124–131, 1993.

KT Kitchin. Laboratory methods for ten hepatic toxification/detoxification parameters. Methods Find Exp Clin Pharmacol, 5(7):439–448, Sep 1983.
TE Klein, JT Chang, MK Cho, KL Easton, R Fergerson, M Hewett, Z Lin, Y Liu, S Liu, DE Oliver, DL Rubin, F Shafa, JM Stuart, and RB Altman. Integrating genotype and phenotype information: an overview of the pharmgkb project. Pharmacogenetics research network and knowledge base. Pharmacogenomics Journal, 1(3):167–170, 2001.

D Knuth. The TeXbook. Addison-Wesley, Reading, Massachusetts, 1986.

Daphne Koller and Mehran Sahami. Toward optimal feature selection. In International Conference on Machine Learning, pages 284–292, 1996. URL citeseer.nj.nec.com/koller96toward.html.

Adam Kowalczyk and Bhavani Raskutti. One class svm for yeast regulation prediction. Technical report, Telstra Research Laboratories, December 2002.

M Krauthammer, A Rzhetsky, P Morozov, and C Friedman. Using blast for identifying gene and protein names in journal articles. Gene, 259:245–252, 2000.

Leah S. Larkey, Paul Ogilvie, M. Andrew Price, and Brenden Tamilio. Acrophile: an automated acronym extractor and server. In ACM DL, pages 205–214, 2000. URL citeseer.nj.nec.com/larkey00acrophile.html.

Simon St. Laurent, Edd Dumbill, and Joe Johnston. Programming Web Services with XML-RPC. O’Reilly & Associates, 2001.

J Lazarou, BH Pomeranz, and PN Corey. Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies. Journal of the American Medical Association, 279(15):1200–1205, Apr 1998.

Ki-Joong Lee, Young-Sook Hwang, and Hae-Chang Rim. Two-phase biomedical ne recognition based on svms. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 33–40, 2003.

W Lehnert, C Cardie, D Fisher, J McCarthy, E Riloff, and S Soderland. University of Massachusetts: Description of the circus system as used in muc. In Proceedings of the Fourth Message Understanding Conference (MUC-4), pages 282–288, San Mateo, CA, 1992. Morgan Kaufmann.

H Liu, YA Lussier, and C Friedman. A study of abbreviations in the umls. Proceedings of the AMIA Annual Symposium, pages 393–397, 2001.

LocusLink. Locuslink. http://www.ncbi.nlm.nih.gov/LocusLink/GeneRIFhelp.html, 2003.
M Lutz, D Ascher, and F Willison. Learning Python. O’Reilly, Sebastopol, CA, 1999.

WH Majoros, GM Subramanian, and MD Yandell. Identification of key concepts in biomedical literature using a modified markov heuristic. Bioinformatics, 19(3):402–407, Feb 2003.

R Malouf. A comparison of algorithms for maximum entropy parameter estimation. In Proceedings of the Sixth Conference on Natural Language Learning, pages 49–55, 2002.

Christopher D Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

A. McCallum, D. Freitag, and F. Pereira. Maximum entropy markov models for information extraction and segmentation. In Proceedings of ICML-2000, 2000.

HL McLeod and WE Evans. Pharmacogenomics: unlocking the human genome for better drug therapy. Annual Review of Pharmacology and Toxicology, 41:101–121, 2001.

UA Meyer. Pharmacogenetics and adverse drug reactions. Lancet, 356(9242):1667–1671, Nov 2000.

Alex Morgan, Lynette Hirschman, Alexander Yeh, and Marc Colosimo. Gene name extraction using flybase resources. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 1–8, 2003.

Mouse Genome Database. Mouse genome database. http://www.informatics.jax.org/mgihome/nomen/short_gene.shtml, 2002.

Chuck Musciano and Bill Kennedy. HTML & XHTML. O’Reilly & Associates, 2002.

MySQL. Mysql. http://www.mysql.com/, 2003.

MA Namboodiri, JT Favilla, and DC Klein. Pineal n-acetyltransferase is inactivated by disulfide-containing peptides: insulin is the most potent. Science, 213(4507):571–573, Jul 1981.

M Narayanaswamy, KE Ravikumar, and K Vijay-Shanker. A biological named entity recognizer. Pacific Symposium on Biocomputing, 8:427–438, 2003.
SB Needleman and CD Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, Mar 1970.

G Nenadic, I Spasic, and S Ananiadou. Automatic acronym acquisition and management with domain specific texts. In Proceedings of LREC-3, 3rd International Conference on Language, Resources and Evaluation, pages 2155–2162, 2002.

S. Ng and M. Wong. Toward routine automatic pathway discovery from on-line scientific text abstracts. Genome Informatics, 10:104–112, 1999.

K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999. URL citeseer.nj.nec.com/nigam99using.html.

Numerical Python. Numerical python. http://numpy.sourceforge.net/, 2003.

Numerical Recipes Home Page. Numerical recipes home page. http://www.nr.com/, 2003.

T Ohta, Y Tateisi, H Mima, and J Tsujii. Genia corpus: an annotated research abstract corpus in molecular biology domain. In Proceedings of the Human Language Technology Conference, 2002.

T Ono, H Hishigaki, A Tanigami, and T Takagi. Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics, 17(2):155–161, Feb 2001.

Opaui. Opaui guide to lists of acronyms, abbreviations, and initialisms on the world wide web. http://www.opaui.com/acro.html, 2002.

Y Oyanagui. Immunosuppressants enhance superoxide radical/nitric oxide-dependent dexamethasone suppression of ischemic paw edema in mice. European Journal of Pharmacology, 344(2–3):241–249, Mar 1998.

David D. Palmer and Marti A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the Fourth ACL Conference on Applied Natural Language Processing, pages 78–83, Stuttgart, 1994. Morgan Kaufmann.

Jong C Park. Using combinatory categorial grammar to extract biomedical information. IEEE Intelligent Systems, November/December:62–67, 2001.
Jong C Park, Hyun Sook Kim, and Jung Jae Kim. Bidirectional incremental parsing for automatic pathway identification with combinatory categorial grammar. In Pacific Symposium on Biocomputing, volume 6, pages 396–407, 2001.

PharmGKB. The pharmacogenomics knowledge base. http://www.pharmgkb.org/, 2003.

KA Phillips, DL Veenstra, E Oren, JK Lee, and W Sadee. Potential role of pharmacogenomics in reducing adverse drug reactions: a systematic review. JAMA, 286(18):2270–2279, Nov 2001.

WH Press, BP Flannery, SA Teukolsky, and WT Vetterling. Numerical Recipes in C. Cambridge University Press, New York, NY, 1993.

D Proux, F Rechenmann, and L Julliard. A pragmatic information extraction strategy for gathering data on genetic interactions. In ISMB, volume 8, pages 279–85, 2000.

D Proux, F Rechenmann, L Julliard, V Pillet, and B Jacq. Detecting gene symbols and names in biological texts: A first step toward pertinent information extraction. In Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, volume 9, pages 72–80, 1998.

KD Pruitt, KS Katz, H Sicotte, and DR Maglott. Introducing refseq and locuslink: curated human genome resources at the ncbi. Trends in Genetics, 16(1):44–47, Jan 2000.

J Pustejovsky, J Castanno, B Cochran, M Kotecki, and M Morrell. Automatic extraction of acronym-meaning pairs from medline databases. Medinfo, 10(Pt 1):371–375, 2001.

Adwait Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD thesis, University of Pennsylvania, 1998.

S Raychaudhuri, JT Chang, PD Sutphin, and RB Altman. Associating genes with gene ontology codes using a maximum entropy analysis of biomedical literature. Genome Research, 12(1):203–214, Jan 2002.

Yizhar Regev, Michal Finkelstein-Landau, Ronen Feldman, Maya Gorodetsky, Xin Zheng, Samuel Levy, Rosane Charlab, Charles Lawrence, Ross A. Lippert, Qing Zhang, and Hagit Shatkay. Rule-based extraction of experimental evidence in the biomedical domain – the kdd cup 2002 (task 1). Technical report, ClearForest and Celera, December 2002.
J. Reynar and A. Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16–19, Washington D.C., 1997.

Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence (AAAI-93), pages 811–816. AAAI Press / MIT Press, 1993.

Ellen Riloff. Automatically generating extraction patterns from untagged text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996.

TC Rindflesch, L Tanabe, JN Weinstein, and L Hunter. Edgar: extraction of drugs, genes and relations from the biomedical literature. In Pacific Symposium on Biocomputing, volume 5, pages 517–28, 2000.

RJ Roberts, HE Varmus, M Ashburner, PO Brown, MB Eisen, C Khosla, M Kirschner, R Nusse, M Scott, and B Wold. Information access. Building a "genbank" of the published literature. Science, 291(5512):2318–2319, Mar 2001.

AD Roses. Pharmacogenetics and the practice of medicine. Nature, 405(6788):857–865, Jun 2000.

AD Roses. Pharmacogenetics. Human Molecular Genetics, 10(20):2261–2267, Oct 2001.

A Rzhetsky, T Koike, S Kalachikov, SM Gomez, M Krauthammer, SH Kaplan, P Kra, JJ Russo, and C Friedman. A knowledge model for analysis and simulation of regulatory networks. Bioinformatics, 16:1120–1128, 2000.

SAIC Information Extraction. Introduction to information extraction. http://www.itl.nist.gov/iaui/894.02/related_projects/muc/, 2003.

Gerard Salton. Automatic Information Organization and Retrieval. McGraw-Hill, 1968.

S Schulze-Kremer. Ontologies for molecular biology. Pacific Symposium on Biocomputing, pages 695–706, 1998.

AS Schwartz and MA Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. Pacific Symposium on Biocomputing, pages 451–462, 2003.
T Sekimizu, HS Park, and J Tsujii. Identifying the interaction between genes and gene products based on frequently seen verbs in medline abstracts. In Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, volume 9, pages 62–71, 1998.

Dan Shen, Jie Zhang, Guodong Zhou, Jian Su, and Chew-Lim Tan. Effective adaptation of hidden markov model-based named entity recognizer for biomedical domain. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 49–56, 2003.

NR Smalheiser and DR Swanson. Using arrowsmith: a computer-assisted approach to formulating and assessing scientific hypotheses. Computer Methods and Programs in Biomedicine, 57(3):149–153, Nov 1998.

Joseph Smarr and Christopher Manning. Classifying unknown proper noun phrases without context. Technical report, Stanford University, 2002.

EM Smigielski, K Sirotkin, M Ward, and ST Sherry. dbsnp: a database of single nucleotide polymorphisms. Nucleic Acids Research, 28(1):352–355, Jan 2000.

James Snell, Doug Tidwell, and Pavel Kulchenko. Programming Web Services with SOAP. O’Reilly & Associates, 2001.

S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34:1–44, 1999.

S Soderland, D Fisher, J Aseltine, and W Lehnert. Crystal: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1314–1319, 1995.

BJ Stapley and G Benoit. Biobibliometrics: information retrieval and visualization from co-occurrences of gene names in medline abstracts. In Pacific Symposium on Biocomputing, pages 529–40, 2000.

BJ Stapley, LA Kelley, and MJ Sternberg. Predicting the sub-cellular location of proteins from text using support vector machines. Pacific Symposium on Biocomputing, pages 374–385, 2002.

LD Stein. Integrating biological databases. Nature Reviews Genetics, 4(5):337–345, May 2003.

M Stephens, M Palakal, S Mukhopadhyay, and R Raje. Detecting gene relations from medline abstracts. In Pacific Symposium on Biocomputing, 2001.
Kazem Taghva and Jeff Gilbreth. Recognizing acronyms and their definitions. Technical report, ISRI (Information Science Research Institute), UNLV, June 1995.

L Tanabe and W John Wilbur. Tagging gene and protein names in full text articles. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pages 9–13, 2002a.

L Tanabe and WJ Wilbur. Tagging gene and protein names in biomedical text. Bioinformatics, 18(8):1124–1132, Aug 2002b.

T Tateishi, M Watanabe, H Nakura, M Tanaka, T Kumai, SF Sakata, N Tamaki, K Ogura, T Nishiyama, T Watabe, and S Kobayashi. Dihydropyrimidine dehydrogenase activity and fluorouracil pharmacokinetics with liver damage induced by bile duct ligation in rats. Drug Metabolism and Disposition, 27(6):651–654, Jun 1999.

Julian Templeman and John Paul Mueller. COM Programming with Microsoft .NET. Microsoft Press, 2003.

J Thomas, D Milward, C Ouzounis, S Pulman, and M Carroll. Automatic extraction of protein interactions from scientific abstracts. In Pacific Symposium on Biocomputing, volume 5, pages 541–52, 2000.

Three-Letter Abbreviations. The great three-letter abbreviation hunt. http://www.atomiser.demon.co.uk/abbrev/, 2002.

Yoshimasa Tsuruoka and Jun’ichi Tsujii. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 41–48, 2003.

Si Usuzaka, KL Sim, M Tanaka, H Matsuno, and S Miyano. A machine learning approach to reducing the work of experts in article selection from database: A case study for regulatory relations of s. cerevisiae genes in medline. Genome Informatics Series: Proceedings of the Workshop on Genome Informatics, 9:91–101, 1998.

Ellen M. Voorhees. Overview of trec 2002. In The Eleventh Text Retrieval Conference, 2002a.

Ellen M. Voorhees. Overview of trec 2002. In The Eleventh Text Retrieval Conference (TREC 2002), 2002b.
Edwin C Webb. Enzyme Nomenclature 1992. Academic Press, 1992.

Julia A White, Lois J Maltais, and Daniel W Nebert. An increasingly urgent need for standardized gene nomenclature. Technical report, University of London, 2002.

Limsoon Wong. Pies, a protein interaction extraction system. In Pacific Symposium on Biocomputing, volume 6, pages 520–531, 2001.

JD Wren and HR Garner. Heuristics for identification of acronym-definition patterns within text: towards an automated construction of comprehensive acronym-definition dictionaries. Methods for Information in Medicine, 41(5):426–434, 2002.

Akane Yakushiji, Yuka Tateisi, Yusuke Miyao, and Jun’ichi Tsujii. Event extraction from biomedical papers using a full parser. In Pacific Symposium on Biocomputing, volume 6, pages 408–419, 2001.

Yiming Yang. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69–90, 1999. URL citeseer.nj.nec.com/yang97evaluation.html.

Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In International Conference on Machine Learning, pages 412–420, 1997. URL citeseer.nj.nec.com/yang97comparative.html.

Roman Yangarber and Ralph Grishman. Machine learning of extraction patterns from unannotated corpora. In Proceedings of the 14th European Conference on Artificial Intelligence (ECAI 2000) Workshop on Machine Learning for Information Extraction, Berlin, Germany, 2000.

Stuart Yeates. Automatic extraction of acronyms from text. In New Zealand Computer Science Research Students’ Conference, pages 117–124, 1999. URL citeseer.nj.nec.com/yeates99automatic.html.

Stuart Yeates, David Bainbridge, and Ian H. Witten. Using compression to identify acronyms in text. In Data Compression Conference, page 582, 2000. URL citeseer.nj.nec.com/288921.html.

Alexander Yeh, Lynette Hirschman, and Alexander Morgan. Background and overview for kdd cup 2002 task 1: Information extraction from biomedical articles. Technical report, The MITRE Corporation, December 2002.
AS Yeh, L Hirschman, and AA Morgan. Evaluation of text data mining for database curation: lessons learned from the kdd challenge cup. Bioinformatics, 19 Suppl 1:I331–I339, Jul 2003.

M Yoshida, K Fukuda, and T Takagi. Pnad-css: a workbench for constructing a protein name abbreviation dictionary. Bioinformatics, 16:169–75, 2000.

H Yu and E Agichtein. Extracting synonymous gene and protein terms from biological literature. Bioinformatics, 19 Suppl 1:I340–I349, Jul 2003.

H Yu, V Hatzivassiloglou, C Friedman, A Rzhetsky, and WJ Wilbur. Automatic extraction of gene and protein synonyms from medline and journal articles. Proceedings of the AMIA Annual Symposium, pages 919–923, 2002a.

H Yu, G Hripcsak, and C Friedman. Mapping abbreviations to full forms in biomedical articles. Journal of the American Medical Informatics Association, 9:262–272, 2002b.

Mohammed J. Zaki, Jason T.L. Wang, and Hannu T.T. Toivonen. Biokdd 2002: Recent advances in data mining for bioinformatics. Technical report, Rensselaer Polytechnic Institute, New Jersey Institute of Technology, and University of Helsinki, December 2002.