Information Extraction
A dissertation submitted to The University of Manchester for the degree of Master of
Science in the Faculty of Engineering and Physical Sciences
2013
Wafa Al Showaib
School of Computer Science
Table of Contents
Abstract 5
Declaration 6
COPYRIGHT STATEMENT 7
ACKNOWLEDGEMENTS 8
1. Introduction 9
1.1. Aim of the Project 9
1.2. Project Objectives 9
1.3. Report Structure 10
2. Background 12
2.1. Information Extraction 12
2.2. Defining Information Extraction 12
2.3. IE and Other Technologies 14
2.4. Brief History 15
2.5. Challenges of IE 16
2.6. Information Extraction Tasks 17
2.7. IE System Evaluation Forums 18
2.8. Evaluation of IE Systems 21
2.9. IE Overall Process 24
2.10. Information Extraction System Design Approach 28
2.11. Examples of IE Systems 32
2.12. IE System Performance 35
2.13. Summary 36
3. CAFETIERE System 37
3.1. CAFETIERE 37
3.2. System Components 37
3.3. Notation of the Rules 39
3.4. Exploiting the rule notation 40
3.5. Gazetteer 42
3.6. Summary 43
4. Requirement Analysis 44
4.1. Domain Selection 44
4.2. Motivation for Domain Selection 44
4.3. Structure of the text 47
4.4. Entities, relationships and events identification 48
4.5. Project Methodology 50
4.6. Development Methodology 50
4.7. Summary 51
5. Design, Implementation and Testing 52
5.1. General design issues 52
5.2. Entity extraction 53
5.3. Rules for extracting the publishing date 53
5.4. Rules for extracting the announcement date 54
5.5. Rules for extracting country name 56
5.6. Rules for extracting the name of an outbreak 57
5.7. Rules for extracting affected cities and provinces 60
5.8. Relationship extraction 63
5.9. Rules for extracting the name of the reporting health authority 63
5.10. Events extraction 65
5.11. Rules for extracting an outbreak event 65
5.12. Rules for extracting the total number of cases and deaths 70
5.13. Discussion 71
6. System Evaluation 74
6.1. System evaluation metrics 74
6.2. System evaluation process 75
6.3. Results analysis 78
7. Conclusion 84
References 86
Appendix A: Extraction rules 89
Appendix B: Gazetteer entries 114
Word Count: 22,867
List of Figures Page
Figure 2.1: IE position within information retrieval and text understanding 15
Figure 2.2: Confusion matrix 23
Figure 2.3: Information extraction overall process 25
Figure 3.1: CAFETIERE overall analysis and query model 37
Figure 4.1: The development life-cycle model for information extraction 51
Figure 5.1: Announcement date extraction 55
Figure 5.2: Extracting groups of outbreak locations 61
Figure 5.3: Reporting health authority extraction 65
Figure 6.1: Manual annotation 76
Figure 6.2: System annotation 77
List of Tables Page
Table 2.1: Summary of MUC topics from 1987 to 1997 19
Table 2.2: Decision factors for IE approaches 32
Table 2.3: Top scoring in MUC-7 36
Table 3.1: Examples of the values that can be assigned to syn feature 40
Table 4.1: Named entities in outbreak reports for IE 49
Table 6.1: Breakdown of the counting results of the training corpus 78
Table 6.2: Breakdown of the evaluation metrics of the training corpus 79
Table 6.3: Breakdown of the counting results of the test corpus 81
Table 6.4: Breakdown of the evaluation metrics of the test corpus 81
Table 6.5: Number of occurrences of each entity type 82
Abstract

Information extraction (IE) is a technology that facilitates the movement of data from their
initial manifestation in natural texts into structured representation, usually in the form of
databases, to facilitate their use in further analysis. IE systems serve as the front end and
core stage in different natural language processing tasks. Although IE is a relatively new
area, the field has witnessed rapid development; this report gives a brief review of IE
history and the IE-focused conferences that have influenced its growth over the last two
decades. The overall system structure is also detailed. Two approaches will be discussed
and compared: the knowledge engineering approach, and the automatic training
approach.
The practical work in this project was based on the knowledge engineering approach, also
known as the rule-based approach. The extraction system used was CAFETIERE, which was designed by The National Centre for Text Mining. As IE has proved its efficiency in
domain-specific tasks, this project focused on one domain: disease outbreak reports. Several reports from the World Health Organization were carefully examined to formulate
the extraction tasks: named-entities, such as disease name, date and location; the location
of the reporting authority; and the outbreak incident. Extraction rules were then designed, based on a study of the textual expressions and elements found in the text that appeared
before and after the target text.
The experiment resulted in very high performance scores across all the tasks. The training corpora and the testing corpora were tested separately. The system performed with higher accuracy on entity and event extraction than on relationship extraction.
It can be concluded that the rule-based approach has been proven capable of delivering
reliable IE, with extremely high accuracy and coverage results. However, this approach
requires an extensive, time-consuming, manual study of word classes and phrases.
Declaration

No portion of the work referred to in the dissertation has been submitted in support of an
application for another degree or qualification of this or any other university or other
institute of learning.
COPYRIGHT STATEMENT

i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he has
given The University of Manchester certain rights to use such Copyright, including for
administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance
with licensing agreements which the University has entered into. This page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the dissertation, for example graphs and tables (“Reproductions”), which may be described in this dissertation,
may not be owned by the author and may be owned by third parties. Such Intellectual
Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or
Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see
http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library’s
regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
ACKNOWLEDGEMENTS

I would like to express my sincerest appreciation to my supervisor, Mr. Jock McNaught, for
his invaluable support and contribution. I would like to extend special thanks to KACST for
this great scholarship opportunity, which allowed me to be here to pursue this great and
exciting challenge.
I dedicate this work to my parents for their continued support during my studies.
1. Introduction

Over the last two decades, the World Wide Web (Web) has played a key role in the rapid
proliferation of information that is available to humans. Digital information is available in a
myriad of forms in different locations on the internet and intranet. A significant amount of
data is available in the form of news, blogs, annual reports and social media. This has
resulted in a growing need for effective techniques to analyse and manipulate natural text data in order to uncover relevant and valuable knowledge. One obvious solution is to convert free text into tabular form (databases) so that it can be used in computerized systems. This need has resulted in the emergence of information
extraction (IE) technologies.
IE is one of the natural language processing (NLP) techniques, and is the process of
extracting structured information from semi-structured or unstructured documents. It is an
emerging technology that is used to tackle the problem of information that is growing very
quickly, while the development of automated NLP techniques is relatively slower. It is
commonly the first process in text mining, in which a collection of documents representing the area of interest is skimmed to extract relevant pieces of information using a pre-defined set of rules applied to specific texts.
1.1. Aim of the Project
The main aim of this project is to gain a deeper understanding of current IE techniques and principles, and more specifically of how a rule-based IE system is built to retrieve specific data from unstructured (natural language) texts and form a structured representation.
1.2. Project Objectives
To achieve the aim of this project, it is necessary to:

• Study the ‘state of the art’ in the IE field, and in particular identify the research efforts that have been made thus far. Reviewing the history of IE will provide a clear understanding of IE, which is necessary so as not to confuse it with other NLP techniques, such as information retrieval.

• Choose the extraction domain.

• Explore and learn the extraction system formalism, i.e., the development of extraction rules for extracting relevant information from a given text.

• Design, implement and test the extraction rules.

• Evaluate the overall performance of the extraction rules by calculating recall, precision and F-measure.
Limitations of this project include:
1. The researcher will play two simultaneous roles, as both rule developer (which includes requirements analysis and design) and tester.

2. The text source to be examined is drawn from natural texts (unstructured), written in the English language.

3. The time dedicated to this project is seven months in total; for the first four months the project will be pursued part-time, and for the remaining three, full-time.

4. Because we are focusing on extracting information from natural texts, only specific information, i.e. entities and relationships, will be extracted from a text.
There are two main deliverables of this project, in addition to the dissertation report itself: a domain-specific gazetteer and a set of extraction rules.
1.3. Report Structure
This report is organised as follows:
Chapter 2, Background: The chapter starts with providing a clear definition for IE,
then distinguishing it from various NLP tasks, and identifying where it is positioned
among them. A brief history of IE will follow, which will include the influence of
Message Understanding Conferences (MUCs) and automatic extraction forums. An
important part of this chapter is dedicated to a discussion of the general architecture of
an IE system and the two approaches that have been followed in almost all of
these. This chapter will conclude with a discussion of the evaluation methods and
metrics, and finally, a brief discussion of how the performance of an IE system could
be measured.
Chapter 3, CAFETIERE System: This chapter discusses the software components
of the IE system adapted for this project (CAFETIERE system).
Chapter 4, Requirement Analysis: The aim of this chapter is to describe the
domain focus and how the corpus will be gathered for analysis. A preliminary
analysis of some texts is presented. The entities and relationships that we are looking for are listed at the end of the chapter. Finally, an overview of the methodology adopted for this project is given.
Chapter 5, Design, Implementation and Testing: This chapter gives an overview
of the process that was followed in designing the extraction rules for all entities,
relationships and events. The primary textual patterns that influenced the final
extraction decisions are discussed, in addition to a number of examples that
demonstrate the final analysis findings.
Chapter 6, Evaluation: The evaluation metrics that were used to assess the
system performance are discussed, in addition to some basic definitions that were
considered while validating the extraction outputs. A demonstration of how each
report was evaluated is also shown. Finally, precision, recall and F-measure were
calculated for both training and testing sets to conclude the main findings of this
project.
Chapter 7, Conclusion: This chapter concludes the project. It also provides
several suggestions for future research proposals.
2. Background

This chapter gives an overview of the basics of IE, beginning with a definition of IE and
stating the position of IE within various NLP technologies. A brief history and a discussion
of previous IE research leads into the two main approaches to building an IE system. The evaluation of an IE system and the measurement of its performance are also discussed in this chapter.
2.1. Information Extraction

With the tremendous amount of data that accumulates on the web every second, the
need for automatic technologies that read, analyse, classify and populate data has grown. Humans cannot read and memorise megabytes of data on a daily basis. As a result, historical, archival information risks being lost or discarded. Information that currently seems to have no value may prove valuable for future needs. Information also runs the risk of being overlooked or missed because it was not presented in a specific manner or was surrounded by additional misleading data.
Lost opportunities and limited human abilities have spurred researchers to explore
and create strategies to manage this text ‘wilderness’. In recent decades, researchers have mainly worked on natural language techniques. Since human language is complex and follows many different writing styles, Natural Language Processing (NLP) technologies cannot be classified under one domain only.
Different stages of processing comprise the NLP field, and each stage is a unique
science and field of research. IE systems serve as the front-end and core stage in
different NLP techniques.
2.2. Defining Information Extraction

In the literature, different researchers give different descriptions of the term
‘Information Extraction’ (IE). One of the oldest definitions was proposed by Cowie
and Lehnert (1996), who define it as any process that extracts relevant information from a given text and then pieces the extracted information together in a coherent structure. De Sitter et al. (2004) suggest that IE can take a different definition according to the purpose of the system:
One best per document approach: the IE system is a system for filling one template structure per document;

All occurrences approach: the task of the IE system is to find every occurrence of a certain item in a document.
However, De Sitter’s definition lacks the part about recognizing relationships and
facts. Moens (2006) suggests a very comprehensive definition:
“Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, providing additional aids to access and interpret the unstructured data by information systems.” (Moens, 2006)
It seems that in recent IE manuscripts, researchers partially agree with similar
descriptions. Saggion et al. (2010) described an IE system as a technology for extracting snippets of data from natural language. A similar definition is provided by
Ling (2010), who stated that IE is the problem of distilling relational data from
unstructured texts. Acharya and Parija (2010) suggested another definition, which is
to reduce the size of text to a tabular form by identifying only subsets of instances of
a specific class of relationships or events from a natural language document, and
the extraction of the arguments related to the event or relationship.
Before continuing with the discussion in this report, it seems essential to state the definition that has been adopted for this project. We agree with Moens’ (2006)
definition that additional aids are needed to access the primary data, and also with De Sitter et al. in that IE can take more than one definition depending on the aim of the system.
The definition that seems most comprehensive for this project is that IE is the process of extracting predefined entities from natural texts and identifying the relationships between those entities, converting them into accessible formats that can be used later in further applications, with the help of evidence deduced from particular words in the text or from the context.
2.3. IE and Other Technologies

In most cases, IE is not a final goal, but rather assists in going forward with other natural text processing tasks such as information retrieval, text understanding and
data mining (Moens, 2006). The aim of this section is to define the differences
between IE and information retrieval (IR), and between IE and text understanding.
2.3.1. Information Retrieval

For the purpose of this project, it seems suitable to distinguish between IR and IE, as the two are often confused with one another. Information retrieval focuses
on collecting documents with relevant text from a set of articles available in
newspapers, journals and the Web. IE starts with texts collected in advance and then digests them into a more readily analysable form. It discards irrelevant fragments, keeps only the relevant information from the text, and then gathers the targeted information in a comprehensible framework.
When compared with IR, IE has some advantages and disadvantages. IE
tasks are generally more difficult and the systems themselves are more knowledge-intensive, relying on resources such as built-in dictionaries. From a performance point
of view, IE consumes more computation power than IR. However, in the case
of a large volume of text, IE is potentially more efficient than IR because of
its ability to summarise the text in a dramatically short time when compared
to the time spent by people to read the same information. Also, where results
need to be translated instantly into other languages, IE conversions are
simpler and more straightforward compared to IR, where whole retrieved
documents must be provided with full translation tasks (Cunningham, 2006).
2.3.2. Text Understanding

IE is a very important milestone towards real text understanding (Moens,
2006). A similar view is given by Riloff (1999), who sees that extracted data
can be very useful to represent more complex semantic classes. IE is usually
the first and cheapest task in overall text understanding tasks. Appelt et al.
(1993) distinguish between IE and text understanding. According to these
researchers, in text understanding:
• The main goal of the system is to make sense of the entire text.
• The target text must reflect the full complexity of a language.
• One should be able to recognise the nuances of meaning and the author’s aim.
While in IE:
• Only part of the text is relevant and accommodates the final goal; for
example, in MUC-4, in analysing terrorist reports, only 10 per cent of
the text was relevant.
• The final representation of the data is predefined, in most cases a
database.
Figure 2.1: IE position within information retrieval and text understanding
Finally, in the spectrum of natural language processing, it can be said that IE is situated midway between information retrieval and text understanding (Appelt, 1999).
2.4. Brief History

The idea of IE can be traced back to 1964, when papers appeared that discussed attempts to fill templates by searching pieces of text, but these ideas depended purely on human work and were not combined with any computational power (Wilks, 1997). The earliest practical work in the field is that of
Sager in the 1970s, conducted at New York University in the medical field.
Basically, Sager extracted patients’ information to fill in forms that could be used as
inputs to a traditional Conference on Data Systems Languages (CODASYL)
database (Cowie and Lehnert, 1996). Although the work was based on handcrafted
structures and a limited number of techniques, it was highly effective (Wilks, 1997).
Subsequently, in the late 1970s, De Jong at Yale University developed the Fast
Reading, Understanding and Memory Program (FRUMP), which was one of the
earliest artificial intelligence systems (Jong, 1977). FRUMP was designed to work
with unstructured texts by skimming newspaper stories using a computer to fill
predefined slots in structures (Cowie and Wilks, 2000). The goal was to find the
important points of the news stories without reading them in detail (Wilks, 1997).
This work can be considered the basis of many IE systems that emerged in the
1980s, such as TRANS (Wilks, 1997).
In the mid-1980s, the Carnegie Group developed one of the earliest commercial IE
systems called JASPER. As in the case of the early systems, it relied on a high
degree of manual intervention, mainly using very complicated templates generated
by analysts and with complicated extraction tasks. As with earlier attempts,
JASPER had limited access to lexicons and dictionary resources and with no
learning algorithms. JASPER was designed for Reuters, and unlike FRUMP,
JASPER was benchmarked and seriously evaluated for its performance (Wilks,
1997). However, the movement in IE was motivated by a growing trend towards
more practical approaches, using computational power to manipulate large volumes of text and depending less on handcrafted linguistic templates.
2.5. Challenges of IE

The challenges of IE can be summarised by the following three areas:
1 - IE is a domain-specific task. The target types of objects and facts relate to
a specific domain; for example, the tasks of extracting information about
products, companies and release dates are different from those for another domain, for example, natural disasters.
2 - Accuracy: The primary challenge facing this research area is the design
of an extraction model that would achieve a high level of accuracy in the
execution of extraction tasks.
3 - Running time: it is necessary to consider the cost of the processing steps through which the selected text must pass.
Apart from these challenges, IE is an interesting field of research due to the
following reasons (Cowie and Lehnert, 1996):
• Tasks of IE are clear and well-defined.
• IE applies to real-world texts.
• IE poses challenging NLP problems.
• The performance of IE can be compared to the human benchmark for the same
task.
2.6. Information Extraction Tasks

The prime goal of IE has been divided into several tasks. The tasks are of increasing difficulty, starting with identifying names in natural texts and then moving on to finding relationships and events.
2.6.1. Named Entity Recognition
The term named entity recognition (NER) was first introduced in MUC-6
(Grishman and Sundheim, 1996). A key element of any extraction system is
to identify the occurrence of specific entities to be extracted. It is the simplest
and most reliable IE subtask (Cunningham, 2006). Entities typically are noun
elements that can be found within text, and they usually consist of one to a
few words. In early work in the field, more specifically at the beginning of the
MUC and Automatic Content Extraction (ACE) competitions, the most
common entities were named entities, such as names of persons, locations,
companies and organizations, numeric expressions, e.g. $1 million, and
absolute temporal terms, e.g. September 2001. Now, named entities have
been expanded to include other generic names, such as names of diseases,
proteins, article titles and journals. More than 100 entity types have been
introduced in the ACE competition for named entity and relationship
extraction from natural language documents (Sarawagi, 2007).
The NER task not only focuses on detecting names, but it can also include
descriptive properties from the text about the extracted entities. For instance,
in the case of person names, it can extract the title, age, nationality, gender,
position and any other related attributes (Esparcia et al., 2010).
There is now a wide range of systems designed for NER, such as the
Stanford Named Entity Recognizer 1. Regarding the performance of these subsystems, accuracy has reached 95 per cent. However, this accuracy only applies to domain-dependent systems; to use a system for extracting entities of other types, changes must be made (Cunningham, 2006).

1 Stanford Named Entity Recognizer website: http://nlp.stanford.edu/software/CRF-NER.shtml [Last Accessed: 28 April 2013]
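To make the NER subtask concrete, the following is a minimal sketch of a handcrafted recogniser for two of the entity classes mentioned above (monetary and temporal expressions). The patterns, function name and entity labels are illustrative assumptions, not part of any system discussed in this report; a real NER system would use much richer patterns, gazetteers or statistical models.

```python
import re

# Hypothetical patterns for two entity classes discussed above.
# Real systems would use far richer patterns or trained models.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?(?:\s+(?:million|billion))?"),
    "DATE": re.compile(r"(?:January|February|March|April|May|June|July|"
                       r"August|September|October|November|December)\s+\d{4}"),
}

def recognise_entities(text):
    """Return (entity_type, matched_text, start, end) tuples, sorted by offset."""
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity[2])

print(recognise_entities("The firm raised $1 million in September 2001."))
# [('MONEY', '$1 million', 16, 26), ('DATE', 'September 2001', 30, 44)]
```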
2.6.2. Relationship Extraction
Another task of the IE system is to identify the connecting properties of
entities. This can be done by annotating relationships that are usually
defined between two or more entities. An example of this is ‘is an employee of’, which describes the relationship between an employee and a company; ‘is
caused by’ is a relationship between an illness and a virus (Sarawagi, 2007).
Although the number of relations between entities that may be of interest can
generally be unlimited, in IE, they are fixed and previously defined, and this
is considered part of achieving a well-specified task (Piskorski and
Yangarber, 2012). The extraction of relations differs completely from entity
extraction. This is because entities are found in the text as sequences of
annotated words, whereas associations are expressed between two
separate snippets of data representing the target entities (Sarawagi, 2007).
2.6.3. Event Extraction
Extracting events in unstructured texts refers to identifying detailed
information about entities. These tasks require the extraction of several
named entities and the relationships between them. Mainly, events can be
detected by knowing who did what, when, for whom and where.
2.7. IE System Evaluation Forums

Specialized information extraction conferences have played a key role in the rapid development of IE systems and their underlying models and technologies. Following is a discussion of three main forums: the Message Understanding Conferences, Automatic Content Extraction and, finally, Knowledge Base Population.
2.7.1. Message Understanding Conference
The MUC series was initiated and funded by the Defense Advanced
Research Projects Agency (DARPA) (Cardie, 1997). The main goal of the
MUCs was to foster research in automating the extraction of information from
texts. However, one of the main outputs was the definition of evaluation standards for IE systems, for instance quantitative metrics such as precision and recall. In total there were seven conferences, with the first, MUC-1, taking place in 1987 and the last, MUC-7, in 1997.
Although they are called conferences, they are also widely known as
competitions, because research groups that wanted to attend the conferences were asked to submit their systems for evaluation in order to be accepted for participation. For each conference, participants were given
sample messages and a set of carefully defined instructions on the type of
the information to be extracted. At a later date, before the start of the conference, participants received a set of 100 previously unseen texts (Appelt, 1999) to run on their developed systems without making any
changes to them. The processed output was tested against a model answer
manually prepared by experts (Grishman and Sundheim, 1996). The
comparison was done on the basis of a scoring system that rated the output
summary of each system according to metrics of recall and precision
(Cardie, 1997). Table 2.1 gives a summary of the seven MUC events (Grishman and Sundheim, 1997; Grishman et al., 2002; Chinchor, 2001).
Table 2.1: Summary of MUC topics from 1987 to 1997

MUC   | Year | Text source                                    | Evaluation task
MUC-1 | 1987 | Naval reports                                  | Scoring system undefined
MUC-2 | 1989 | Naval reports                                  | Undefined, large training corpus
MUC-3 | 1991 | Newswire on terrorism in Latin America         | Semi-automated scoring program; recall and precision introduced
MUC-4 | 1992 | Newswire on terrorism in Latin America         | Semi-automated scoring program; further increase in template complexity in comparison to MUC-3
MUC-5 | 1993 | Joint ventures and microelectronic fabrication | Multilingual (English and Japanese)
MUC-6 | 1995 | Management succession                          | Named-entity recognition, co-reference, template element filling
MUC-7 | 1997 | Airplane crashes                               | Named-entity recognition, co-reference, template element filling, template relation filling
Table 2.1 shows that between the first MUC and the last, the number and nature of the tasks changed completely. In MUC-7 the tasks to be evaluated increased in both number and complexity. There were five tasks to be evaluated: named-entity recognition, co-reference resolution, template element filling, template relation filling and scenario template filling.
Before MUCs, templates were completely prepared by experts and were of
limited size for only limited texts. According to Cowie and Lehnert (1996), if those templates had been examined with the measurements followed in later MUC competitions, they would only have earned scores of between 60 and 80 per cent for overall system accuracy, which is far less than expected.
The MUC proceedings were considered very influential in fostering the
development of the field of IE systems. They were a significant reference
resource for understanding how to evaluate IE systems. In addition, these
conferences helped in understanding the current state-of-the-art (Appelt,
1999).
2.7.2. Automatic Content Extraction (ACE)

The next series of conferences was ACE, which aimed to cover texts from broader domains, such as general news stories, including politics and air accidents. ACE was an annual conference that started in 2003 and continued evaluating different systems until 2008. Later ACE events
included testing of multilingual systems for texts of languages other than
English, such as Chinese and Arabic. The evaluation of extraction systems
was done separately by testing their ability to correctly extract the required
information from each text (Grishman, 2012).
2.7.3. Knowledge Base Population
The Text Analysis Conference (TAC), funded by the National Institute of
Standards and Technology (NIST), was a series of workshops initiated in
2008 and held annually thereafter. The aim was to provide an evaluation
infrastructure for NLP technologies 1. Part of TAC is the Knowledge Base
Population (KBP) workshops. The aim of KBP was to motivate research in
the field of named entity extraction from large test collections of unstructured
texts. The organisers provided competitors with common procedures, and each system was tested on finding given names in articles from newswires and blog posts. This workshop raised questions about the context in which
the word occurs; for instance, ‘Apple’ is the name of a company and also the
name of a fruit. Other questions involve finding redundant words and
connecting properties. Most of these questions were not addressed in past events (Grishman, 2012).

1 Text Analysis Conference webpage: http://www.nist.gov/tac/ [Last Accessed: 12 April 2013]
2.8. Evaluation of IE Systems

The findings of the MUCs have demonstrated that the evaluation of IE systems is
rather challenging, even for trained professionals, when compared with other
systems that manage natural language.
In IE, the element of difficulty resides in the fact that there are no clear guidelines
on the correctness of the items extracted by an extraction algorithm. To be able to
give a precise declaration of the desired output class is a difficult task; for example,
for a straightforward entity class such as ‘Country’, results such as ‘Palestine’ and
‘Europe’ are considered countries in some contexts, and in others they are not
(Moens, 2006).
The annotation tasks for IE are very complex, and they require the development of
a formal guideline document that clearly describes the desired output, alongside
several examples to guide the annotation process. For example, when comparing
the work of two people on the same tasks (human annotation), the Linguistic Data Consortium (LDC) reported an agreement rate of 92.6 per cent for entities and 70.2 per cent for
much more complicated tasks, such as relationships (Ramshaw and Weischedel, 2005). These difficulties can clearly be seen in human inter-annotator agreements, where the expected rate is between 70 and 85 per cent, meaning that there is room for disagreement about correctness (De Sitter et al., 2004).
De Sitter et al. (2004) pointed out that the lack of standardisation is actually a result
of the following three main issues:
• There is ambiguity with regard to the definition of IE: is the goal to fill a template, or to find all the occurrences of an instance in a document?
• How is it decided whether an extracted item is correct or not? Since inter-annotator agreement on a standard IE task reaches only 70 to 85 per cent, there is evidently room for disagreement about correctness.
• What statistical metrics should be used to measure the effectiveness of an
IE system? Most researchers use F-measures, while others use recall and
precision measures.
Although the literature shows that there is no definitive framework for evaluating IE systems, some metrics have been used widely. For counting results, a confusion matrix is a typical device for evaluating an IE system. Figure 2.2 presents a confusion matrix. The evaluation is done first at the entity level.
2.2 presents a confusion matrix. The evaluation will be done first on the entity level.
If it is correctly classified, it is a true positive; otherwise, it is a false positive. The
number of entities that the system should have extracted but failed to is labelled as
false negative; the true negative is rarely used in IE evaluations (Moens, 2006). The
confusion matrix is a very useful tool to determine how successful the system is in
classifying data. Hence, measures such as recall and precision can easily be computed from the confusion matrix.
                          Predicted class
                          Yes                  No
Actual class    Yes       True Positive        False Negative
                No        False Positive       True Negative

Figure 2.2: Confusion matrix
Also, there are two well-known metrics that cannot be neglected. They are widely
accepted and adopted in the research community (De Sitter et al., 2004). These
measures are precision and recall.
Definition 1:

Recall = True Positive / (True Positive + False Negative)

Definition 2:

Precision = True Positive / (True Positive + False Positive)
The recall measure indicates the percentage of the items that should have been detected that the system actually extracted. If high recall is achieved, then almost all of the information that had to be extracted was indeed extracted. Precision indicates the percentage of the items produced by the system that are correct. If high precision is achieved, then almost all of
the extracted information is correct and there are few or no errors (Moens, 2006).
There is a clear trade-off between recall and precision and, in most cases there is
an inverse relationship. For the intended end-user, it is important to identify the aim of the application; in other words, whether it is more important to obtain high recall (avoiding false negatives) or high precision (avoiding false positives) (Appelt and Israel, 1999).
According to Sarawagi (2007), in most systems, achieving high precision is a much
easier task than high recall. This is mainly because mistakes can be easily detected
from the extraction output, so they can be corrected manually, and the model can
be tuned until there are no errors. However, achieving high recall is more
challenging, because it requires extensive annotation of data in the document; if this is not achieved, then the identification of missed data from a large unstructured corpus is relatively infeasible (Sarawagi, 2007).
Because of this unavoidable trade-off between recall and precision, an average
measure is usually reported, namely, the F-measure, which is a combination of both
recall and precision.
Definition 3:

F-measure = (2 × Recall × Precision) / (Recall + Precision)

Definition 4:

F_β-measure = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)

where β is a non-negative real number that indicates the relative importance of precision and recall; when β = 1, both are of equal importance (Moens, 2006). Thus, F_β is a weighted measure of both recall and precision (Han et al., 2012).
Although the F-measure is widely used to compare IE systems by means of a single measure, in most cases it is essential to know the
individual score values of recall and precision to be able to fully compare one
system against another.
For example, if one system has achieved a recall of 80 per cent and precision of 20
per cent, the obtained F-measure score will be the same as a system that scored
20 per cent recall and 80 per cent precision, even though the two systems are very
different. Therefore, in such cases, it is impossible to tell whether one system is
better than the other (De Sitter et al., 2004).
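These definitions translate directly into a few lines of code. The sketch below is purely illustrative rather than part of the evaluation machinery used in this project: it computes recall, precision and F-β from confusion-matrix counts following Definitions 1-4, and reproduces the 80/20 versus 20/80 example above using hypothetical counts.

```python
def evaluate(true_positive, false_positive, false_negative, beta=1.0):
    """Recall, precision and F-beta from confusion-matrix counts
    (Definitions 1-4 above; beta = 1 gives the plain F-measure)."""
    recall = true_positive / (true_positive + false_negative)
    precision = true_positive / (true_positive + false_positive)
    f_beta = ((1 + beta**2) * recall * precision) / (beta**2 * recall + precision)
    return recall, precision, f_beta

# Two hypothetical systems: 80% recall / 20% precision versus
# 20% recall / 80% precision. Both yield the same F-measure (0.32),
# even though they behave very differently in practice.
print(evaluate(true_positive=80, false_positive=320, false_negative=20))
# recall 0.8, precision 0.2, F ≈ 0.32
print(evaluate(true_positive=20, false_positive=5, false_negative=80))
# recall 0.2, precision 0.8, F ≈ 0.32
```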
2.9. IE Overall Process

In general, IE systems have been developed to accomplish different tasks in different domains, and they differ from each other in a myriad of ways; however, there are basic components that are found in nearly every extraction system that deals with natural texts. Most IE system designs were influenced by the design that was
defined by Hobbs in MUC-5 (Hobbs, 1993). Hobbs’ general architecture is based on
the idea of ‘cascading’ independent modules or engines separating the overall
processes into several smaller stages. The earlier stages deal with small linguistic
objects and work in a predefined domain-independent manner. At each step,
structure is added to the document and irrelevant information is filtered out. Some
authors combine several steps in a bigger stage, while others tend to divide one
step into smaller ones (Cowie and Wilks 2000; Appelt and Israel 1999; Turmo et al. 2006; Acharya and Parija 2010; Piskorski and Yangarber 2012).
Reviewing the historical framework, most current systems, to a greater or lesser
degree, follow six major system functionalities:
• Document pre-processing
• Morphological and lexical processing
• Syntactic parsing
• Semantic interpreter
• Co-reference resolver
• Template generator
Figure 2.3: Information extraction overall process
2.9.1. Document pre-processing
The first task is to prepare the input corpora for processing. This can be
achieved by the use of three primary modules:
Text zoner: Also known as text splitters, the text zoner splits the
document into sets of appropriate zones or segments, usually
sentences.
Filters: These take a set of sentences, keep the relevant sentences and discard the irrelevant ones. The prime consideration of
this module is to speed up the processing time by blocking unwanted
text.
Tokenization: This process mainly identifies lexical units. For
languages such as English, French and Arabic, tokenization is a trivial
problem, and words can be identified by whitespace characters. In
newspaper stories and articles of any type, punctuation mostly indicates
a sentence boundary. However, in some languages, such as Japanese
and Chinese, where orthography cannot indicate word boundaries,
additional, mostly complex, processing modules are added to identify
word segments.
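As an illustration of these pre-processing modules, the sketch below implements a naive text zoner and tokenizer for a whitespace-delimited language such as English. The splitting heuristics are deliberately simple assumptions (they would mishandle abbreviations such as ‘Dr.’), not the behaviour of any particular IE system.

```python
import re

def zone_text(document):
    """Text zoner: split a document into sentence segments, assuming a
    sentence ends at '.', '!' or '?' followed by whitespace. Roughly
    adequate for newswire text; abbreviations need extra handling."""
    return re.split(r"(?<=[.!?])\s+", document.strip())

def tokenize(sentence):
    """Tokenizer: identify lexical units. Runs of word characters stay
    together; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

for sentence in zone_text("WHO reported 12 cases. Three patients died."):
    print(tokenize(sentence))
# ['WHO', 'reported', '12', 'cases', '.']
# ['Three', 'patients', 'died', '.']
```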
2.9.2. Morphological and lexical processing
This stage manages the task of building small-scale determinable structures
from sequences of lexical units. At this stage, system developers take
advantage of available resources to do a sufficient part of the work. Here
part-of-speech taggers, stemmers, dictionaries and lexicon resources are all
used. The main modules that fall within this stage are:
Proper name extraction: Some authors believe that this step must
be done in the pre-processing stage (Cowie and Wilks, 2000). One of
the most important tasks in IE systems is to extract enumerable units (dates, spelled-out numbers) and named entities. As mentioned
earlier, named entity recognition (NER) is the process of identifying
and classifying domain-dependent names in cases where the system is designed to extract data from specific domains, and otherwise common, domain-independent names. Although extracting names is
a critical task, it may represent some difficulty if the entity is of a large
class. For example, there are many cities in the world in different
locations that have the same name.
Part-of-speech taggers: IE systems perform different kinds of tagging, one of which is part-of-speech tagging. It works by assigning a part-of-speech symbol (proper noun, verb) to each word from the corpus. Taggers work on the basis of statistical methods derived from training on pre-tagged texts, and they can be built independently of the IE system (Wilks, 1997). Generally, the effort of constructing part-of-speech taggers varies, especially when extracting information from highly specific domains, where the tagger will need more effort and time to train (Appelt and Israel 1999).
Word sense tagging: Also known as lexical disambiguator, this
assigns lexical units to one and only one lexical tag (Cowie and Wilks,
2000). It is important for resolving noun ambiguity in the predicted fragments.
2.9.3. Syntactic parsing
Full document parsing raises issues related to performance and time
consumption. At the beginning of the MUC competitions, IE systems were
designed to implement full parsing of the input texts. However, in MUC-3, the
system that achieved the best score was built on the basis of partial parsing
by using finite-state grammars. This system was developed by Lehnert
(Turmo et al., 2006). This system was soon followed by Hobbs’ group (MUC-4, 1993) which, like Lehnert’s, followed an approach using a simplified parser (Turmo et al., 2006). The idea of using simpler language models, i.e. finite-state grammars, was advocated by Church in 1980. Church contended that finite-state grammars are adequate to achieve good performance in human linguistic systems (Appelt et al., 1993).
Hence, in most current IE systems, a shallow parser based on a finite-state approach is sufficient to distinguish the syntactic structure of a sentence and
its main components. However, for some domains, a full parser may be
desirable (Appelt and Israel 1999).
2.9.4. Semantic interpreter

Generating a semantic structure or an event from a syntactic structure is the goal of this module (Acharya and Parija 2010). Simple approaches, such as
verb sub-categorisation and identifying appositive types are usually used
(Cowie and Wilks 2000): for example, ‘John Smith’ (person name),
‘CEO’ (occupation) of ‘ABC Corp’ (company name). This task is usually
limited to finding predefined argument structures.
2.9.5. Co-reference resolver
In MUC-6, this became an important part of the system evaluation process.
Its importance was derived from the need to combine structures to produce
fewer ones by generating larger, and hence, complete templates. In natural
texts, entities are referred to by different names and in different ways. For
example, ‘International Business Machines’ and ‘IBM’ are both references to the same entity; one author may mention the full company name at the
beginning of the article, but only use the acronym in later positions. IE
systems must be able to identify co-references for successful processing
(Cowie and Wilks 2000).
2.9.6. Template generator
The final stage is to produce a semantic structure that can be evaluated. It is
crucial to ensure that the output template is (i) in the proposed format and
contains lexical units extracted from the original text (Cowie and Wilks 2000),
(ii) a representational device, and (iii) designed to be an input to further processing by humans, programs or both (Appelt and Israel, 1999).
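To illustrate what such an output template might look like, the record below fills hypothetical slots for the ‘John Smith, CEO of ABC Corp’ example from section 2.9.4. The slot names are assumptions made for illustration; the point is simply that the template is a fixed, predefined structure holding lexical units extracted from the original text, ready for a database or a downstream program.

```python
# A hypothetical filled template for the example in section 2.9.4.
# Slot names are illustrative; a real system defines them in advance.
template = {
    "event_type": "management_position",
    "person": "John Smith",
    "position": "CEO",
    "organisation": "ABC Corp",
    "source_span": "John Smith, CEO of ABC Corp",
}

# Such a record can be loaded into a relational table or handed to
# another program, the role Appelt and Israel (1999) describe.
print(template["person"], "is", template["position"], "of", template["organisation"])
```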
2.10. Information Extraction System Design Approach

Two major approaches exist for the design of IE systems, namely the knowledge engineering approach and the automatic training approach.
2.10.1. The Knowledge Engineering Approach
This approach can be identified by its fundamental characteristics, and
requires the involvement of the human factor to write the extraction rules.
Because it relies on handcrafted rules, the knowledge engineering approach
is also known as the rule-based approach (Sarawagi, 2007). The main
component of the IE system is a domain-specific grammar, which is usually
written or supervised by a domain expert. Appelt and Israel (1999) labelled the person who works on the IE system under this approach a ‘knowledge engineer’, mainly because this approach requires a person who is familiar with the IE system mechanism and able to familiarise himself with the target domain, and who then, by himself or with the help of an expert in the application domain, writes the extraction rules from natural texts. In this methodology, the knowledge engineer plays a fundamental role in determining the level at which the system performs. In addition to the need for skill and
high familiarity with the system and the domain, this approach is very labour-
intensive. In this case, it is obvious that achieving a high performing system
is an iterative process, where the written rules will be continually modified
until satisfactory accuracy scores are reached.
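As a flavour of what a handcrafted rule looks like, the sketch below encodes one toy rule as a regular expression over raw text. The pattern and the example sentence are illustrative assumptions only; the CAFETIERE notation actually used in this project operates over annotated tokens and is described in Chapter 3.

```python
import re

# A toy handcrafted extraction rule: capture expressions such as
# '120 confirmed cases of cholera' by looking at the words that
# appear around the target entities. Purely illustrative; real
# rule-based systems match over annotated tokens, not raw strings.
CASES_RULE = re.compile(
    r"(?P<count>\d+)\s+(?:confirmed\s+)?cases\s+of\s+(?P<disease>[a-z]+)",
    re.IGNORECASE,
)

def apply_rule(text):
    return [(m.group("count"), m.group("disease")) for m in CASES_RULE.finditer(text)]

print(apply_rule("The ministry reported 120 confirmed cases of cholera."))
# [('120', 'cholera')]
```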
2.10.2. The Automatic Training Approach
In MUC-5, researchers were tempted by the idea of using a statistically based system to learn extraction rules instead of writing them manually (Turmo et al., 2006). The principal motivation behind this model was to
reduce the workload by shifting away from knowledge-based systems
towards an automatic system designed upon machine learning algorithms
(Piskorski and Yangarber, 2012).
By following this model, there is no need for a knowledge engineer to write
the extraction rules. Instead, the rules are derived automatically through intensive training of the system over the input corpus of texts. The
only human intervention required to accomplish the task relates to annotating
the input texts for information to be extracted. Once the training corpus is
analysed and the required information marked, the training algorithm is run; this then generates statistical information that will be used in the processing
of novel texts. Users are allowed to examine the results to check whether the
system hypotheses are correct or not. Thus, in the case of negative findings,
the system can modify its rules to respond to the new information (Appelt
and Israel 1999). Some well-known machine learning statistical models were
applied in this area, such as the hidden Markov model (HMM) and
conditional random fields (CRFs) (Piskorski and Yangarber, 2012).
Several types of learning have been used within the automatic training
approach, including supervised, semi-supervised and unsupervised learning
(Appelt and Israel 1999; Turmo et al. 2006; Nadeau and Sekine 2007;
Grishman 2005; Piskorski and Yangarber 2012).
Supervised learning approach
The term ‘supervised learning’ is applied in a situation where the entire
input into the training system is annotated manually. In preparing the input,
the annotation process will be in sequence, typically going through the
corpus document by document. Supervised learning makes sense only if
the large training set is already annotated; otherwise, the whole procedure
is extremely expensive.
Semi-supervised learning approach
An alternative is the ‘semi supervised’ approach, also known as the
‘weakly supervised’ learning approach (Nadeau and Sekine, 2007). The
semi-supervised approach uses a combination of a limited amount of
labelled data with a large amount of unlabelled data (Grishman, 2005). It
requires limited human supervision, mainly in the initial stage to provide
the input, and in the last stage to check the output.
Unsupervised learning approach
The key idea of unsupervised learning is the use of clustering. For
instance, named entities can be grouped based on the similarity of the
underlying context. This technique relies on using lexical resources for
training the system, such as WordNet, and on lexical statistics generated
from large unlabelled texts (Nadeau and Sekine, 2007).
According to Appelt and Israel (1999), a project conducted by Ralph Weischedel under TIPSTER in MUC-7, which adopted a hybrid approach using a combination of these learning strategies (supervised, semi-supervised and unsupervised), achieved high performance results.
In general, although these two approaches were devised long ago and were used in IE systems in the 1980s and 1990s, both are still being used in
parallel depending on the nature and the purpose of the system’s extraction tasks
(Sarawagi, 2007).
Appelt and Israel (1999) highlighted some principal considerations that can drive the decision of which approach to choose:
1 - The availability of training texts: If there is an adequate quantity, or if obtaining the training data is cheap, then the question of which approach to choose comes down, to some extent, to the difficulty of the domain. If the named entities are obvious and easy, the required skill of the annotator is relatively modest. However, for more difficult domains, the inter-annotator agreement is lower and the process is slower, much more complex and may require a higher level of experience. In most cases, annotations are much cheaper to produce than rules, but with a complex domain this cost advantage is arguable.
2 - The availability of lexical resources: If lexical resources, such as
dictionaries and lexicons, are available for the target language, then the
knowledge engineering approach is more suitable. Otherwise, it is necessary
to use the trainable approach and annotated corpus.
3 - The availability of a knowledge engineer: This factor is a prerequisite for
following the knowledge engineering approach.
4 - The stability of the system specification: It is often easier to make a minor change to the rules than to re-annotate a corpus in order to respond to changed requirements.
5 - The required performance degree: Results from MUC-6 and MUC-7 show
that rules produced by humans achieved an error rate approximately 30 per
cent lower than the automatic trainable systems.
Therefore, there is no single winning model, and each approach has its appropriate
use. For example, for closed domains, knowledge engineering systems are more
appropriate; for less specific domains, statistical methods are preferable. Table 2.2
summarises these factors.
Table 2.2 Decision factors for IE approaches
The Knowledge Engineering Approach | The Automatic Training Approach
Lexical resources are available | Lexical resources are unavailable
Rule-writers are available | No knowledge engineer is available
Training resources are rare or difficult to obtain | Training resources are available and cheap
Extraction tasks are not stable and likely to change | Extraction tasks are stable
The required performance degree is high | Good performance scores are sufficient for the task
2.11.Examples of IE system
Subsequent achievements in the area of rule-based systems inclined
researchers to move toward general-purpose IE systems that could ease
adaptation to new domains and languages (Piskorski and Yangarber, 2012).
One of the earliest IE systems with a comprehensive design was FASTUS: a Finite-
State Processor for Information Extraction from Real-world Text. The system was
developed at SRI International’s Artificial Intelligence Center (AIC) in 1993 and was
tested in MUC-4 (Appelt et al., 1993). Its main aim was to address the need for a
system that extracts predefined information from natural texts with high speed and
accuracy (Appelt et al., 1993). FASTUS was able to process texts in English and
Japanese, and it achieved one of the top scores in the MUC competitions. It was
designed as a set of cascaded finite-state transducers, where the output of each
stage serves as the input to the next; each stage of the processing model is itself a
finite-state device (Piskorski and Yangarber, 2012). The architecture of FASTUS
consisted of four stages: (i) triggering on words; (ii) recognising phrases; (iii)
recognising patterns; and (iv) merging incidents (Appelt et al., 1993). The system
achieved a recall of 44% and a precision of 55% when tested against a previously
unseen set of 100 texts, and at the time of MUC-4 it was considered very fast.
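The cascade design can be illustrated with a short Python sketch. The stage functions below are crude placeholders, not FASTUS’s actual transducers; the point is only that each stage’s output becomes the next stage’s input.

def recognise_phrases(tokens):
    # Placeholder stage: group runs of capitalised tokens into name phrases.
    phrases, current = [], []
    for tok in tokens + [""]:
        if tok[:1].isupper():
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
                current = []
            if tok:
                phrases.append(tok)
    return phrases

def recognise_patterns(phrases):
    # Placeholder stage: mark "<name> reported <number>" sequences as incidents.
    return [(phrases[i - 1], phrases[i + 1])
            for i, p in enumerate(phrases)
            if p == "reported" and 0 < i < len(phrases) - 1]

def merge_incidents(incidents):
    # Placeholder stage: deduplicate incidents describing the same event.
    return sorted(set(incidents))

# The cascade: each stage's output is the next stage's input.
data = "The Ministry of Health reported 34 cases".split()
for stage in (recognise_phrases, recognise_patterns, merge_incidents):
    data = stage(data)
print(data)  # [('Health', '34')] -- a crude 'incident' record

Because every stage consumes the previous stage’s output, the ordering of the cascade determines what the later, more abstract stages can see.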
Most systems designed around the same time as FASTUS were built as a single
block, where most of the processing modules, including stemmers, parsers, POS
taggers and tokenizers, were usually a one-person or one-group effort (Hahn et al.,
2008). Most were stand-alone systems, making interoperability of these components
very hard to achieve. At that time, reusability of system components was not a
consideration; rather, modules were designed to work only for specific tasks,
resulting in a low level of abstraction in their specifications. Consequently, to
develop a new IE system, developers had to start from scratch (Hahn et al., 2008).
Researchers wanted to combine their work with that of others in an effort to
deliver innovative and coherent solutions, and this was also the idea that
motivated researchers at IBM to develop the Unstructured Information Management
Architecture (UIMA). UIMA is both an architecture and an implementation, together
known as a framework (Ferrucci and Lally, 2005).
The high-level architecture of UIMA centres on a text analysis engine (available in
both the C++ and Java programming languages) and consists of two fundamental
phases: analysis and delivery. The analysis phase is carried out by the text
analysis engine (TAE), which includes different processing modules such as
tokenization, stemming, syntactic parsing and dictionary look-up. In the delivery
phase, the analysis results can be presented to the user in different ways; one
way is a query interface (a semantic search engine) for retrieving documents that
contain particular tokens, entities or relationships (Ferrucci and Lally, 2005).
Document analysis in UIMA works on the basis of a data structure holding the
original document and its associated metadata. This data structure is referred to
as the Common Analysis System (CAS). The TAE takes the CAS, performs the
required analysis, then produces the updated document with additional metadata
containing the detected data, such as names of persons, locations and organizations
(Ferrucci and Lally, 2005).
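The CAS idea can be sketched in a few lines of Python. This is an illustrative stand-in, not the real UIMA API: a CAS-like object carries the original text plus a growing list of stand-off annotations, and each analysis engine reads and enriches the same object.

from dataclasses import dataclass, field

@dataclass
class CAS:
    """Illustrative stand-in for UIMA's Common Analysis System:
    the original text plus accumulated stand-off metadata."""
    text: str
    annotations: list = field(default_factory=list)  # (begin, end, type, value)

def toy_name_annotator(cas):
    """A stand-in analysis engine: reads the CAS, adds annotations, returns it."""
    for name in ("Rio de Janeiro", "Brazil"):
        start = cas.text.find(name)
        if start != -1:
            cas.annotations.append((start, start + len(name), "Location", name))
    return cas

cas = CAS("Dengue cases were reported in Rio de Janeiro, Brazil.")
cas = toy_name_annotator(cas)  # each engine enriches the same CAS
print(cas.annotations)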
UIMA has been used in various IE systems, such as the Mayo Clinical Text Analysis
and Knowledge Extraction System (cTAKES)1 and the AVATAR extraction system2. A
system similar to UIMA is GATE, where the core language processing algorithms are
isolated from system services such as data storage, communication between
components and results visualisation.
GATE is the acronym of the “General Architecture for Text Engineering”. It was
developed at the University of Sheffield in 1996. The system was assessed in two
MUCs and achieved high scores for named entity tasks (Cowie and Wilks, 2000).
The UIMA and GATE frameworks, as well as their tools and resources, represent the
state of the art in tools for natural language text processing (Dietl et al., 2008).
Cowie and Wilks (2000) highlighted the three main objectives of the GATE
framework:
•To allow the data to pass through different system modules at the highest
common level.
•To support the integration of system modules developed in any
programming language and, thereby, to be available on any applicable
platform.
•To provide a common, easy-to-use interface to facilitate the evaluation
and refinement of system modules, and to manage input text and linguistic
resources.
GATE is open source, written entirely in the Java programming language, and
supports PostgreSQL and Oracle databases (Dietl et al., 2008). Considering IE
technology in particular, many developers of IE systems have opted for GATE
because of its high reliability and robustness, and because it incorporates the
shallow processing approach, which has proved its suitability for the IE domain
(Cowie and Wilks, 2000).
1 Open Health Natural Language Processing (OHNLP) Consortium. http://www.ohnlp.org [Last Accessed: 20 April 2013]
2 Avatar Information Extraction System. http://libra.msra.cn/Publication/2175452/avatar-information-extraction-system [Last Accessed: 20 April 2013]
GATE comprises three main components: (i) Language Resources (LRs): lexicons,
textual documents, ontologies; (ii) Processing Resources (PRs): parser algorithms,
generators, modellers; and (iii) Visual Resources (VRs): graphical user interface
components (Cowie and Wilks 2000; Dietl et al. 2008). GATE Processing Resources
are analogous to the TAEs of the UIMA architecture.
The existence of this development pipeline, under the guidance of software
engineering principles and practices, has led to incremental development in the
field and has created opportunities for building complicated NLP systems.
Accordingly, an increasing number of institutes concerned with NLP technologies
have adopted software development that complies with UIMA specifications to meet
the emerging standards (Hahn et al., 2008).
2.12.IE System Performance
Measuring the overall performance of IE systems is an aggregate process that
depends on multiple factors. The most important are (i) the level of the logical
structure to be detected, such as named entities, relationships, events and
co-references; (ii) the type of system input, e.g. newspaper articles, corporate
reports, database tuples, or short text messages from social media or mobile
phones; (iii) the focus of the domain, e.g. political, medical, financial, natural
disasters; and (iv) the language of the input texts, e.g. English, or a more
morphologically sophisticated language, such as Arabic (Piskorski and
Yangarber, 2012).
The relative complexity of assessing the performance of an IE system can be
managed by noting the scores obtained in the MUC competitions. In MUC-7, in
which the domain focus was aircraft accidents in English newspaper articles, the
system that achieved the highest overall score obtained a different score for
each subtask. Scores for both recall and precision are presented in Table 2.3
(Piskorski and Yangarber, 2012). These figures provide a glimpse of what to
expect from an IE system: the best performance is achieved in NER, while the
lowest scores reflect the most difficult task of event extraction.
Table 2.3 Top scoring in MUC-7
Task | Recall score | Precision score
NER | 95 | 95
Relationships | 70 | 85
Events | 50 | 70
Co-reference | 80 | 60
2.13.Summary
IE is a cornerstone of many NLP applications; consequently, it has progressed
rapidly over the last two decades. During and after the MUCs, researchers applied
IE techniques in a myriad of domains, resulting in massive evolution of the field.
The technologies and techniques underwent many developments: IE started out
relying heavily on handcrafted rule writing, and the advent of machine learning
techniques then advanced the progress of IE systems. However, many challenges
remain; the current state-of-the-art systems show that IE performance has not
reached the accuracy of human extraction. In addition, an IE system is an intricate
system consisting of several components that vary in complexity, and their
combined performance has a massive effect on the final results. For example, in
order to extract a relationship, the NER task must first perform with relatively
accurate results. The examples of IE systems have demonstrated how the field of
NLP in general, and IE in particular, has been influenced over the last decade by
the evolution of software engineering principles and design frameworks. The
development of IE systems has moved from standalone systems to much more
interoperable components.
3. CAFETIERE System
The aim of this chapter is to give an overview of the extraction engine adopted in this
project. The system framework and the rule grammar are explained. Some examples are
added for further elaboration.
3.1.CAFETIERE
The CAFETIERE system is a rule-based system for the detection and extraction of
basic semantic elements. CAFETIERE is an abbreviation of Conceptual Annotation
for Facts, Events, Terms, Individual Entities and RElations. It is an information
extraction system developed by the National Centre for Text Mining at the
University of Manchester. The engine adopts a knowledge engineering approach to
its extraction tasks, and it is the extraction engine used in this project.
Figure 3.1: CAFETIERE overall analysis and query model
3.2.System Components
Before applying the extraction rules, the input goes through several
preprocessing tasks. The main preprocessing framework comprises a UIMA
analysis pipeline with the following stages:
3.2.1.Document capture and zoning
In this task structural annotation is performed: the text is partitioned
into structural segments using the Common Annotation Scheme (CAS), an
XML-based annotation scheme. This step splits the text into title and main
body components; the body is then split into paragraphs.
3.2.2.Tokenization
Text is seen by the extraction engine as a sequence of elements. Every
basic element within a paragraph is marked: words, numbers,
punctuation marks and special symbols. These elements are called tokens.
Tokens are saved in the system as objects with associated attributes that
represent the token’s position in the text and its orthographic features.
These features are encoded at an earlier stage as a number of codes that
can be used later in writing the rules. For example, the orthography code
“capitalized” means that only the first character of the word is capitalized,
while the rest are lowercase (Black et al., 2005).
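As an illustration, the following Python sketch shows how tokens might be stored as objects carrying a position and an orthography code. The class and function names are hypothetical, not CAFETIERE’s internals, but the codes mirror those described above (note the mixed form ‘H1N1’, which falls under ‘other’, as discussed later in this report).

from dataclasses import dataclass

@dataclass
class Token:
    text: str
    position: int  # index of the token within the paragraph
    orth: str      # orthography code used later by the rules

def orthography(word):
    """Assign an illustrative orthography code to a word."""
    if word.isdigit():
        return "number"
    if word[0].isupper() and word[1:].islower():
        return "capitalized"   # only the first character is upper case
    if word.islower():
        return "lowercase"
    return "other"             # mixed forms such as 'H1N1'

tokens = [Token(w, i, orthography(w))
          for i, w in enumerate("The H1N1 outbreak killed 10 victims".split())]
print(tokens[1])  # Token(text='H1N1', position=1, orth='other')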
3.2.3.Tagging
This is the task of lexical annotation of the text, where every token is
labelled with its part of speech (POS) by the tagger. The tagger is a
supporting engine that has previously been trained to identify POS using
the Brill algorithm, and it uses the Penn Treebank tag set1 for its tags. The
accuracy of an IE system’s output depends highly on the performance of
the tagger. Tokens are thus identified by their grammatical categories:
nouns, proper nouns, verbs, adjectives, etc. (Black et al., 2005).
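For readers who want to see Penn Treebank tags in practice, the snippet below uses NLTK’s default tagger (a perceptron tagger, not the Brill tagger that CAFETIERE uses) purely as an illustration; it assumes NLTK and its tagger models are installed.

import nltk

# One-time model downloads (package names as of NLTK 3.x):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The Brazilian health authorities reported 647 cases.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('Brazilian', 'JJ'), ('health', 'NN'),
#       ('authorities', 'NNS'), ('reported', 'VBD'), ('647', 'CD'),
#       ('cases', 'NNS'), ('.', '.')]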
3.2.4.Gazetteer lookup
This process mainly performs semantic annotation. In CAFETIERE,
users can upload a lexical resource consisting of a collection of words
and phrases related to the domain in focus. This collection and its
representation is called a gazetteer, and it is held in an Access or MySQL
database. The role of the gazetteer is to identify proper names such as people’s names,
1 Automatic Mapping Among Lexico-Grammatical Annotation Models. http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html [Last accessed: 23 April 2013]
locations, titles, months and any other domain-specific words (Black et al.,
2005).
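A gazetteer lookup can be approximated as a longest-match scan over the token stream, as in the following Python sketch; the entries and semantic class names are illustrative, loosely echoing the classes used later in this report.

GAZETTEER = {
    ("dengue",): "disease",
    ("rio", "de", "janeiro"): "geoname/PPL",
    ("april",): "temporal/interval/month",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def lookup(tokens):
    """Scan the token stream, preferring the longest gazetteer match."""
    i, annotations = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                annotations.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return annotations

print(lookup("10 April : dengue fever in Rio de Janeiro".split()))
# [(1, 2, 'temporal/interval/month'), (3, 4, 'disease'), (6, 9, 'geoname/PPL')]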
3.3.Notation of the Rules
After the preprocessing phase finishes and the system has recognized every token,
the document is ready to undergo the extraction phase, in which the rules are
applied. Rules in the CAFETIERE system are written in a high-level language
(Black et al., 2005).
The general rule formalism in CAFETIERE is (Black et al. 2005):
Y => A \ X / B
where Y represents the phrase to be extracted and X the elements that are part
of the phrase. A represents the context that may appear immediately before X in
the text, and B the context that may appear immediately after X; both A and B
are optional (i.e. they may be null).
Many rules lack A or B or both; therefore, a rule may take one of the
following forms (Black et al., 2005):
Y => \ X / B
Y => A \ X /
Y => \ X /
Rules are context-sensitive: the constituents present in the text before and
after the phrase are reflected in the left context (A) and the right context (B)
of the rule. A rule must have at least one constituent (X) and can have more
if required.
These rules define phrases (Y) and their constituents (A, X, B) as pairs of features
and their corresponding values, for example:
Y A X B
[syn=np, sem=date] => \ [syn=CD], [sem=temporal/interval/month], [syn=CD]
This example is a context-free rule in which both A and B are null. The phrase
and the constituent parts are written as sequences of features and values
enclosed in square brackets, [Feature Operator Value]. If there is more than one
feature, they are separated by commas, as can be seen in the feature bundle
for Y. A brief description of each part follows:
Feature: Denotes an attribute of the phrase to be extracted. The most commonly
used features are syn, sem and orth, where syn is syntactic, sem is semantic and
orth is orthographic. Features are written as sequences of atomic symbols.
For example, some of the values (and their meanings) that may be assigned to
the feature syn are listed in Table 3.1 (Black et al., 2005).
Table 3.1: Examples of the values that can be assigned to syn feature
Tag | Category | Example
CD | Cardinal number | 4, four
NNP | Proper noun | London
NN | Common noun | girl, boy
JJ | Adjective | happy, sad
Although these features are built into the system, there is no restriction on the name
of the features on the left-hand side of the rule.
Operator: Denotes the function applied to the attribute and the predicated value.
The operators that can be used in the system are >, >=, <, <=, =, != and ~; all
have their usual meanings, and the tilde operator matches a text unit against a
pattern.
Value: Expresses a literal value that may be a string, a number or a combination
of both; values may be quoted or unquoted.
3.4.Exploiting the rule notation
In general, the rule grammar of the CAFETIERE system is diverse, and its devices
have different uses. The following are some of the main grammar devices that may
be helpful in assembling an appropriate rule (Black et al., 2005).
3.4.1.Regular expressions
Regular expression operators such as ?, * and + can enhance a rule’s
coverage. They can be appended to constituents on the right-hand side of
the rule with the following meanings:
? means optional
* matches zero or more occurrences
+ matches one or more occurrences
Numbers or characters in square brackets, e.g. [A-Z], can specify precisely
the start and the end of a range (Black et al., 2005).
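These quantifiers behave like their counterparts in conventional regular expression libraries, except that in CAFETIERE they apply to whole token constituents rather than characters. As a character-level illustration in Python:

import re

# ? optional, * zero or more, + one or more, [A-Z] a character range
date_like = re.compile(r"[A-Z][a-z]+(,)? [0-9]+")
print(bool(date_like.fullmatch("April 2008")))    # True: optional comma absent
print(bool(date_like.fullmatch("April, 2008")))   # True: optional comma present

# * allows a lowercase list of any length, as in 'fever, cough, chills'
word_list = re.compile(r"[a-z]+(, [a-z]+)*")
print(bool(word_list.fullmatch("fever, cough, chills")))  # True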
3.4.2.Grouping of subsequent words
Constituents can be grouped in round brackets. For example, a list of words
written in lowercase characters and separated by commas can be identified
by the following features enclosed in brackets (Black et al., 2005).
[syn=NN, sem=list] =>\ [orth=lowercase]+, ([token=","], [orth=lowercase]+)+ / ;
3.4.3.The use of variables
Assume that the semantic category of the target element is unknown; in
this case identification would be problematic. One solution is to let the
‘sem’ feature range across all semantic values by assigning a variable to
it: in the left-hand side of the rule, sem=_sem, while in the right-hand
side of the rule the value assigned to the lookup1 feature is _sem. This
means that whenever a match is found according to the sequence of features
on the right-hand side, the value of lookup is saved as the value of sem.
Variables are written as unquoted strings with an initial underscore, e.g. _var
(Black et al., 2005).
1 This feature is only assigned to words identified by the gazetteer or the lexical resource.
3.4.4.Disjunction of values
From the perspective of efficiency, to avoid writing several similar rules that
differ only in one feature value, a single value expression can be used to
denote alternative values. This is done by placing the pipe symbol “|”
between the alternatives, as in [token="reported"|"announced"] (Black et
al., 2005).
3.4.5.Partial matching
Partial matching is very useful when the rule-writer is looking for a string
that begins or ends with certain characters. For example, many surnames
start with Mac or Mc, but the rest of the name may differ; here the wildcard
“*” is essential. The following example is taken from Black et al. (2005).
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name3] =>
\ [token= "*virus", token=__o], [sem= disease_type]/;
3.4.6.Rules Order
Rules are invoked in sequence by the system processor; therefore, the
analysis is deterministic (Black et al., 2012). It is the author’s responsibility
to order the rules, because once words and phrases are identified by a
certain rule they are no longer available to subsequent rules. To illustrate,
assume there are two rules:
Rule 1: extract title, first name, initials, surname.
Rule 2: extract title and surname.
Target text: Ph.D John N. Scott
If rule 2 is executed before rule 1, then rule 2 will match Ph.D and John,
leaving N. Scott remaining. When rule 1 then fires, it will not detect
N. Scott, because the other parts have already been consumed.
To avoid such a situation, when writing rules to match token sequences
that vary in length, the longest-match rule must be executed first.
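The following Python sketch imitates this behaviour, with character-level regular expressions standing in for token patterns: rules fire in order, and matched spans are consumed so that later rules cannot reuse them. The rules themselves are illustrative.

import re

# Rule 1 (longest): title + first name + initial + surname.
# Rule 2 (shorter): title + one following name.
RULES = [
    ("full_name",  re.compile(r"Ph\.D [A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+")),
    ("short_name", re.compile(r"Ph\.D [A-Z][a-z]+")),
]

def apply_rules(text):
    consumed, results = [], []
    for rulid, pattern in RULES:          # rules fire strictly in order
        for m in pattern.finditer(text):
            span = range(m.start(), m.end())
            if not any(i in used for used in consumed for i in span):
                results.append((rulid, m.group()))
                consumed.append(set(span))  # matched text is hidden from later rules
    return results

print(apply_rules("Ph.D John N. Scott"))
# [('full_name', 'Ph.D John N. Scott')] -- reverse the rule order and only
# 'Ph.D John' would match, leaving 'N. Scott' undetected.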
3.5.Gazetteer
In addition to the system’s built-in gazetteer, users can upload their own
gazetteers to look up words and expressions from the domain they are focusing
on. The system recognizes a plain text file with the extension .gaz as a gazetteer
file; it is then added to the existing gazetteer, which is a relational database table
(Black et al., 2012). The lookup mechanism works by identifying all the strings
that appear in the gazetteer, loading the relevant data from the system’s lexical
base before the rules are applied. This process may be time-consuming, because
every token is looked up to see whether the gazetteer holds any information
about it (Black et al., 2012).
3.6.Summary
The CAFETIERE system has been explained in detail, starting from the analysis
pipeline, which determines the different preprocessing phases that the text
undergoes, and moving on to the rule grammar and specific features such as the
use of variables and regular expressions.
4. Requirement Analysis
The aim of this chapter is to describe the domain focus and how the corpus will be
gathered. A preliminary analysis of some texts will give the guidelines for rule design and
implementation.
4.1.Domain Selection
The capability of processing natural texts, extracting specific entities and
identifying relationships is the principal aim of most IE systems. Many systems
have been developed specifically to extract information from news topics under
the influence of the MUCs. In the last two decades, IE techniques have been
applied in different areas, such as terrorist attack reports, business news and
financial reports, drug discovery and biomedical domains. For this project,
disease outbreak reports have been chosen as the extraction domain.
Because each domain differs in context and language patterns, this project will
begin with a complete analysis of texts from the chosen domain.
4.2.Motivation for Domain Selection
Once an infectious disease begins to spread somewhere in the world, the
general public instantly becomes very concerned about the severity of the
disease, its origins, how many people have been infected and how many have
died. This drives the need for a system that is able to analyse data from natural
texts, such as news articles, in a timely manner.
Furthermore, it is necessary to complement clinical reporting systems by
enriching their databases with information extracted from disease outbreak
reports. In many emergency situations it is essential to go back and study the
history of a certain disease. Information related to disease outbreaks is often
written as free text and is therefore difficult to use in computerised systems;
confining such information to this format makes rapid access to it very difficult.
According to the World Health Organization (WHO), analysing the information from
disease epidemic reports can be used for:
• The identification of disease clusters and patterns;
• Facilitating the tracking and follow-up of the spread of a disease outbreak;
• Estimating the potential increase in the number of infected people in a further
spread;
• Providing an early warning in the case of an increase in the number of incidents;
• Helping in strategic decision-making as to whether control measures are working
effectively.
In addition to these factors, historically very few information systems have been
designed to extract information from disease outbreak reports; a notable example
designed specifically for disease outbreaks is Proteus-BIO in 2002 (Grishman et
al., 2002). All of these are motivating factors for choosing to study this domain.
The intention of the work proposed in this project is to extract information about
disease outbreaks from a variety of natural texts. A number of news websites
produce reports about disease outbreaks. Some of these are annual reports,
where one report presents a summary of all disease epidemics recorded in one
year; for example, the reports found on the Centers for Disease Control and
Prevention website1 are all historical reports related to food-borne diseases.
However, for the purposes of this project, the aim is to analyse texts in different
formats, where one particular disease is discussed in many reports. Therefore, a
decision has been taken to mainly analyse news reports from the WHO. The
WHO website2 represents an ideal source, containing archives of news classified
into different categories by country, year or disease.
In almost every case, the authors of these news stories report the same
information; however, from report to report the style of writing differs slightly.
This provides an opportunity to be exposed to a variety of writing styles in the
reporting of disease data.
The following are examples of news items taken from the WHO:
1 http://www.cdc.gov/outbreaknet/surveillance_data.html
2 http://www.who.int/csr/don/en/
•Example 1: A sample of reporting incidents in Brazil diagnosed with dengue haemorrhagic fever:
“10 April 2008 - As of 28 March, 2008, the Brazilian health authorities have reported a national total of 120 570 cases of dengue including 647 dengue haemorrhagic fever (DHF) cases, with 48 deaths.
On 2 April 2008, the State of Rio de Janeiro reported 57 010 cases of dengue fever (DF) including 67 confirmed deaths and 58 deaths currently under investigation. Rio de Janeiro, where DEN-3 has been the predominant circulating serotype for the past 5 years since the major DEN-3 epidemic in 2002, is now experiencing the renewed circulation of DEN-2. This has led to an increase in severe dengue cases in children and about 50% of the deaths, so far, have been children of 0-13 years of age.
The Ministry of Health (MoH) is working closely with the Rio de Janeiro branch of the Centro de Informações Estratégicas em Vigilância em Saúde (CIEVS) to implement the required control measures and identify priority areas for intervention. The MoH has already mobilized health professionals to the federal hospitals of Rio de Janeiro to support patient management activities, including clinical case management and laboratory diagnosis.
Additionally public health and emergency services professionals have been recruited to assist community-based interventions. Vector control activities were implemented throughout the State and especially in the Municipality of Rio. The Fire Department, military, and health inspectors of Funasa (Fundacao Nacional de Saude, MoH) are assisting in these activities.”1
•Example 2: A sample of reporting incidents in Turkey diagnosed with Avian influenza:
“30 January 2006 - A WHO collaborating laboratory in the United Kingdom has now confirmed 12 of the 21 cases of H5N1 avian influenza previously announced by the Turkish Ministry of Health. All four fatalities are among the 12 confirmed cases. Samples from the remaining 9 patients, confirmed as H5 positive in the Ankara laboratory, are undergoing further joint investigation by the Ankara and UK laboratories. Testing for H5N1 infection is technically challenging, particularly under the conditions of an outbreak where large numbers of samples are submitted for testing and rapid results are needed to guide clinical decisions. Additional testing in a WHO collaborating laboratory may produce inconclusive or only weakly positive results. In such cases, clinical data about the patient are used to make a final assessment.”2
•Example 3: A sample of reporting incidents in Sierra Leone diagnosed with a cholera outbreak:
“18 September 2012 - As of 16 September 2012, a cumulative total of 18,508 cases including 271 deaths (with a case fatality ratio of 1.5%) has been reported in the ongoing cholera outbreak in Sierra Leone since the beginning of the year.
The highest numbers of cases are reported from the Western area of the country where the capital city of Freetown is located.
1 http://www.who.int/csr/don/2008_04_10/en/index.html
2 http://www.who.int/csr/don/2006_01_30/en/index.html
The Ministry of Health and Sanitation (MOHS) is closely working with partners at national and international levels to step up response to the cholera outbreak. The ongoing activities at the field level include case management; communication and social mobilization; water, sanitation and hygiene promotion; surveillance and data management.” 1
4.3.Structure of the text
After reviewing 20 outbreak reports from the WHO website, it can be said that most
of them follow a general scheme. Reports chosen for this project will range from
100 to 300 words in length, because those over 300 words usually contain
additional information, such as recommendations or medical treatments, which is
out of the scope of this study.
All the documents on the WHO website start with a date string indicating the date of
publication. Some elements are common to all texts, such as information about the
number of people affected by an outbreak, the name of the disease and the
location where it is spreading. The structure usually consists of the following points:
•The first sentence, after the title, contains a date string giving the date of
publication on the website, always presented in the same format, e.g. 2 April
2011.
•Some of the reports contain another date, the announcement date; this is
always given after the first date in the text, in the second sentence in most
cases.
•In most instances the disease is reported by a health agency in a country,
e.g. “The Brazilian health authorities”. This can be a very useful piece of
information, since it has been noticed that in some reports the name of the
country is not mentioned, and the name of the national health agency is
enough to indicate the location.
•Disease names are not capitalized, but they are sometimes accompanied
by indicating words such as fever, outbreak, influenza, etc. Some disease
names are combinations of characters and numbers, like H5N1 influenza.
•The report identifies the number of suspected and confirmed disease cases.
1 http://www.who.int/csr/don/2012_09_18b/en/index.html
•The total number of people affected by the disease from when it was initially
discovered to the date of announcement is sometimes reported.
•Infected cases are reported individually per state, as in: “the State of Rio de
Janeiro reported 57 010 cases of dengue fever (DF) including 67 confirmed
deaths and 58 deaths currently under investigation”.
•A pattern of “health authority of a country reported victims” can be found in
the text: “the Brazilian health authorities have reported a national total of 120
570”. In some cases the word “reported” is replaced by synonyms such as
“announced”. This pattern can also be found in other documents in the
reverse order, “victims reported by health agency of country”.
4.4.Entities, relationships and events identification
The heart of this project is to find outbreak events within outbreak reports.
Extracting these events requires finding outbreak patterns in the text. We found
that the reporting of an outbreak event sometimes follows a very simple pattern,
e.g. “H1N1 killed 10 victims”, whereas in some reports the event can involve
complex phrases, e.g. “the total of 18,508 cases including 271 deaths has been
reported in the cholera outbreak”.
Rather than trying to build one very complex rule, a set of smaller tasks must be
accomplished in order to compute the final event.
Information about a particular incident must be drawn from several constituent
elements within the text: the publication date, announcement date, disease name,
country of the outbreak, specific location (cities, states), number of infected people,
status of victims (sick, dead). The organization of this information often differs from
one report to another.
To make event extraction easier, we need to distinguish between relationships and
events. Relationship extraction relies on identifying a single piece of information,
such as the nationality of the reporting authority, while the event is the outbreak
incident itself (the number of people infected by the disease in the country).
In sum, the extraction task will involve detecting the following information elements:
Entities:
• Publication date
• Announcement date
• Disease name
• Disease code
• Country
• Locations of the outbreak (cities, villages, ...)
Relationships:
• Nationality of the reporting authority
Events:
• Number of cases and deaths of an outbreak
• Total number of affected cases
Table 4.1 Named entities in outbreak reports for IE
Entity | Position in the text
Report date | Actual string from document
Disease name | Clue words: fever, outbreak, syndrome, influenza
Health agency name | Actual string from document; clues: ministry, agency
Country | Actual string from document, or computed from the agency name
Location | States, cities, towns
Number of victims | Numeric value of cases mentioned in the text, or computed by counting the cases in different locations
To comply with the rule ordering, rules to extract entities must come before
rules for relationships and events. Based on the initial analysis, the following
assumptions have been made: extracting the named entities will be a relatively
straightforward task, since they are usually accompanied by clues occurring
immediately before or after their appearance in a text (see Table 4.1). At the
same time, extracting location information such as a country name is much more
difficult, as it may not appear in the text at all and thus has to be computed from
the name of the agency. Finally, extracting outbreak events may be the most
challenging part of this project.
4.5.Project Methodology
This project follows the model proposed by the knowledge engineering
approach. As discussed in Chapter 2, the rule-based approach has proved its
efficiency in extracting information from various domains. The model will use a
set of features that facilitate the extraction process, including proper name
identification, special characters and punctuation.
4.6.Development Methodology
Delivering an information system that fulfils the proposed requirements
within a specific timeframe can be achieved by following different strategies.
Before deciding which approach to choose, it seems appropriate to give a
brief description of two major approaches used in most system development
processes.
4.6.1.The Waterfall Approach
The most traditional approach to system development was proposed in
1970 by Royce (Massey and Satao, 2012). It follows a linear framework
consisting of development stages performed in sequential order:
requirements gathering, system design, implementation and development,
testing and, finally, maintenance. In this model, the preceding stage must be
completed before moving to the next. Each stage has a set of goals and
deliverables, stated in advance, that must be met. This approach is usually
followed if the system requirements are clear and stable before the start of
the project (Avison and Fitzgerald, 2003).
4.6.2.The Prototyping Approach
One of the relatively recent approaches is the prototyping approach, in
which modification of the information system model is much simpler. An
initial prototype is developed quickly and at minimal expense to give the
client a glimpse of how the system will look and work. The prototype can
then be critically analysed and enhanced before actual deployment starts.
Thus, an iterative process is followed until the client or end user is satisfied
with the generated deliverables (Moscove, 2001).
Due to the nature of this project, a combination of these two approaches will be
employed. By incorporating the strengths of the waterfall and prototyping
approaches, their potential difficulties can be mitigated. The waterfall will provide
the overall structure needed to accomplish the project, while the prototyping
model will be followed at certain stages: mainly, there will be iteration between
implementing the extraction rules and testing them, until the target results are
finally reached. According to Sarawagi (2007), the most successful rule-based
systems have followed a hybrid model in which, after the most common patterns
have been extracted, rules are modified and repeatedly tuned to obtain optimal
results.
Figure 4.1: The development life-cycle model for information extraction
4.7.Summary
The domain chosen and the motivation behind the selection have been explained.
Samples of the texts upon which the project will focus were provided. The entity
types, relations, and events to be extracted have been clearly identified. Finally, the
project methodology has been described.
5. Design, Implementation and Testing
This chapter gives an overview of the process that was followed in designing the extraction
rules for all entities, relationships and events. The primary textual patterns that influenced
the final extraction decisions are discussed, together with a number of examples that
demonstrate the final analysis findings.
5.1.General design issues
The process of designing the extraction rules is based on studying the textual
expressions and elements found in the text, so for every entity, relationship and
event, a similar approach has been followed. The following are the factors that
influenced the design of the majority of the rules:
• Every textual element is recognized and captured using linguistic features
(e.g. syntactic, semantic, orthography). For example, to extract a token of
type number such as ‘45’, the rule should contain the syntactic feature
‘syn=CD’. (CD refers to Cardinal Number.)
• For each extraction task, the span of text that appears before and after the
target text is collected and studied to find common patterns that may help in
identifying the correct element. This task of studying the context surrounding
the element is the heart of this work as it is the only way to avoid false
matches.
• Patterns can be very simple, such as ‘prepositional phrase + noun’, or very
complex, such as the patterns used to look for outbreak events, where the
pattern is a whole sentence: ‘verbs + prepositional phrases + nouns +
punctuation’. Not all the constituents mentioned in a pattern will be
extracted - only the required ones.
• Rule order is very important. If there are two elements to be extracted and
the first element depends on the existence of the second element in the
sentence, then the first element should be extracted before the second one.
This is because when an element is recognized by a rule, it will be hidden
from the rest of the rules; therefore, each element is only extracted once.
Although the project follows a rule-based approach, some extraction tasks required
additional entries in the system gazetteer. Therefore, one of the initial steps was to
collect domain-specific vocabulary and add it under the appropriate semantic class,
or create new semantic classes where needed. Not only have domain terminologies
been added to the gazetteer; some commonly used verbs and nouns have also
been collected, added and categorized.
5.2.Entity extraction
5.2.1.Rules for extracting the publishing date
Extracting the publishing date of a report on the WHO website is essential.
When an outbreak hits a country, the WHO will release a periodic report
about the status of the outbreak; thus, reports are ordered by their publishing
date on the WHO website.
The date in the training corpora is expressed in two formats: ‘D Month
YYYY’ (‘6 June 2010’) or ‘D Month, YYYY’ (‘6 June, 2010’). Month names
already exist in the default gazetteer. The date rule is given below:
[syn=np, sem=date, type=entity,key=__t, month=_mno, day=_day, year=_year, rulid=publish_date1] => \ [syn=CD, orth=number, token=__t, token=_day, sent<=1], [sem="temporal/interval/month", monthno=_mno, key=__t, sent<=1],
[token=","]?, [syn=CD,orth=number, token~"19??"|"20??", token=__t, token=_year, sent<=1] /;
More than one date can be found for any randomly chosen outbreak report,
each of which may refer to something different (such as the date of the first
suspected ill person). To resolve this issue, we found that the publishing
dates are usually mentioned in the first or second sentence; therefore, the
feature ‘sent’ in the CAFETIERE system that indicates the sentence number
has been used. This has only been used to tag the date pattern in sentence
0 or 1; the annotation system starts counting sentences from 0.
Though this rule successfully captured almost all the dates in the training
corpora, months written in capital letters, such as ‘JULY’, were not
identified by the gazetteer. For this reason, the list of months written in all
capital letters has been added to the gazetteer and assigned the same
semantic class as those in the default gazetteer:
JANUARY:instance=january,class=temporal/interval/month,monthno=01
FEBRUARY:instance=february,class=temporal/interval/month,monthno=02
MARCH:instance=march,class=temporal/interval/month,monthno=03
...
DECEMBER:instance=december,class=temporal/interval/month,monthno=12
By adding these entries to the gazetteer, we have eliminated the need to
design another rule for each month written in capital letters.
5.2.2.Rules for extracting the announcement date
The announcement date is similar to the publishing date but has a different
meaning: it is the date on which the national health authority reported an
outbreak to the WHO. Unlike the publishing date, it appears in three
formats: (1) a day with a month (‘DD Month’), (2) a month with a year
(‘Month YYYY’), or (3) a full date (‘DD Month YYYY’). The rules for the first
and second patterns are presented below; the rule for the last pattern is
similar to the publishing date rule:
[syn=np, sem=date, type=entity,key=__t, month=_mno, year=_year, rulid= announcement_date2, sentence=_s] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"] \ [sem="temporal/interval/month", monthno=_mno, key=__t, sent=2], [token=","]?, [syn=CD,orth=number, token~"19??"|"20??", token=__t, token=_year, sent=2] /;
[syn=np, sem=date, type=entity,key=__t, month=_mno, rulid= announcement_date3] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=2], [sem="temporal/interval/month", monthno=_mno, key=__t, sent=2] /;
Some constraints have been added to the rules to avoid extracting dates with
different meanings; these constraints are the result of closely studying all the
training texts. The first constraint is that announcement dates are usually
mentioned in sent=2 and in some cases in sent=3; as a result, variants of the
rules were also written to handle announcement dates mentioned in the fourth
sentence (sent=3).
Expressions such as ‘As of 6 July 2002, the ministry of health has
reported . . .’ and ‘On 4 March, the Gabonese Ministry of Public Health
reported . . .’ are used to report the news; therefore, the left constituent of the
rule was designed to look for ‘As of’ and ‘On’ before capturing the reporting
date.
Moreover, another pattern has been found in the form of ‘During 1-26
January 2003’; this expression is used to identify the period that the outbreak
report is covering. Figure 5.1 shows the extraction task for such a pattern:
Figure 5.1: Announcement date extraction
5.2.3.Rules for extracting country name
All outbreak reports contain information about the country affected by the
disease. In addition to the country of the outbreak, other countries may
appear in the text; therefore, the process of finding the country of the
outbreak had to be very explicit and precise. A separate gazetteer of
GeoNames places comes by default with the CAFETIERE system. When a
country or city occurrence in the input text matches an entry in the
GeoNames database, a phrasal annotation is made. Therefore, to extract
the correct country of the outbreak, rules must be designed to classify the
tagged locations.
First of all, we found that the country of the outbreak is usually mentioned in
the first three sentences, but this alone is not enough of a constraint, since
other countries can be mentioned as well. A very common pattern indicates
the country of the collaborating laboratory used to examine a virus, for
example: ‘Tests conducted at a WHO collaborating laboratory in the United
Kingdom . . .’. Therefore, phrases such as ‘collaborating laboratory’ and
‘laboratory centre’ were collected and added under the semantic class
‘collaboration’; whenever a country is mentioned after these words, it will
not be extracted as the country of the outbreak.
The strongest indication of the country of the outbreak is when a country is
mentioned in the title of the report (sent=0); this can be extracted by the
following rule:
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name2] => \ [sem>="geoname/COUNTRY", token=_c, sent=0] /;
Country names are categorized in the GeoName database under the
semantic class ‘geoname/COUNTRY’.
The second pattern is when the country name is preceded by the phrase
‘Situation in’, which is very common in reports on the WHO website. Since no
similar phrases are found in other texts, there was no need to add it to the
gazetteer. The following rule has been created to capture this pattern:
[syn=NNP, sem=country_of_the_outbreak, type=entity, country=_c, rulid=country_name] =>
[token="Situation"|"situation", sem!="collaboration"], [token="in"] \ [syn=DT]?, [sem>="geoname/COUNTRY", token=_c] /;
If the country of the outbreak cannot be found using the previous rules, a
third pattern may be employed: any country name preceded by the
preposition ‘in’ or by ‘,’ that occurs in the first three sentences and is not
preceded by any phrase indicating a collaboration will be captured by this rule:
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name1] =>
[sem!="collaboration"], [token="in"|","] \ [sem>="geoname/COUNTRY", token=_c, sent<=2] /;
Rule order is especially important in finding the country of the outbreak,
because outbreak reports often mention countries that are not affected by
the disease but appear in other information.
5.2.4.Rules for extracting the name of an outbreak
In order to extract the name of an outbreak, it is essential to know as many
disease names as possible. One way to do this is to incorporate a list of
disease names, types and symptoms into the extraction system; the medical
domain in particular is rich in specific and generic dictionaries and
terminology lists, including names and categories of diseases such as the
ICD-101 disease lists. However, the emphasis of this project is to enable the
system to identify disease names of various patterns automatically, so that it
can extract recently discovered disease names and codes not found in
predefined lists.
1 ICD-10: The International Classification of Diseases standard.
One of the main problems making the identification of disease names more
complex than that of classic entities (such as person names and countries) is
that they are in most cases written in lowercase; therefore, other features in
the text had to be recognized. Another problem arises when diseases are
mentioned in the text but are not the outbreak itself, e.g. when the symptoms
of an outbreak are similar to those of another disease. To avoid this
situation, words indicating the occurrence of such a case were gathered and
added to the gazetteer under the semantic class ‘symptoms’.
The first pattern recognized is when the disease name is mentioned in the
title, with the form:
orth(capitalized) sem(preposition, punctuation) sem(GeoName)
E.g.: ‘Cholera in Chile’.
This pattern is only suitable for diseases mentioned in the title (sent=0). The
following rule was subsequently developed:
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name1] =>
[token!="in"] \ [sem=disease, orth="capitalized", token=_o, syn!=DT, sent=0], [sem="disease_type"]? / [token="in"|","], [sem="geoname/COUNTRY"];
Nouns that indicate disease types (usually mentioned after the disease
name), such as virus, infection and syndrome, were all collected and added
to the gazetteer under the semantic class ‘disease_types’.
Many of the diseases are in the form of compound nouns where multiple
words are used to describe one disease entity. A typical disease name may
consist of the following:
sem(disease condition) sem(disease) sem(disease type)
‘Disease conditions’ is a new category created to cover all health conditions,
such as ‘acute’, ‘paralysis’ and ‘wild’ that are used as disease descriptors. An
example of this pattern is as follows:
‘acute poliomyelitis outbreak’
The following rule was developed to extract this pattern when mentioned in
the text body:
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name6] => [sem!="symptoms", sent>0] \ [sem="disease_condition", token=__o], [sem=disease, token=__o], [sem=disease_type]? /;
Another disease pattern is when the word ‘virus’ is attached to the end of the
name, such as ‘Coronavirus’. The following rule has been designed accordingly:
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name3] =>
\ [token="*virus", token=_o], [sem=disease_type] /;
Diseases that are always mentioned in the text without clues, such as
disease types and conditions, were collected from the WHO website and
added to the gazetteer under the semantic class ‘disease’. Examples of this
class are Malaria, plague, E.coli and Japanese encephalitis. They are
extracted by this rule:
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name2] => [sem!="symptoms", sent>0] \ [sem=disease, token=_o], [sem="disease_type"]? /;
In addition to this, extraction rules have been designed to identify disease
codes such as H1N1. The CAFETIERE system recognizes the word ‘H1N1’
as one token and assigns the value ‘other’ to the orthography feature
because it contains characters and numbers; thus, the rules were designed
based on this finding:
[syn=np, sem=Disease_code, key=__s, rulid=disease_code2] => \ [orth="other", token=__s], [sem="disease_type", token=__s] /;
The disease-extraction patterns designed and implemented total eighteen;
patterns not presented in this chapter can be found in Appendix A.
5.2.5.Rules for extracting affected cities and provinces
When an outbreak hits a country, one or more cities will be affected. The goal
of this task is to extract the affected locations by studying the context of the
texts. With the existence of the default GeoName database, it will be easy to
capture any geographical location mentioned in the text as a location of the
outbreak; however, in some reports, the cities mentioned are not affected by a
disease but instead are mentioned as part of other details. As a result, the
context that precedes the city names was carefully studied.
After examining the texts, we found that not all cities, areas, provinces and
states are present in the GeoNames database, with the result that some are
not identified. This is due either to transliteration problems or to the places
not being well known. The problem was partially solved by studying the
expressions used to represent locations in the text.
The corpora were studied to collect all the words that may indicate a location,
such as ‘city’, ‘province’, ‘district’, ‘village’ or ‘region’, and these terms were
added to the gazetteer under the semantic class ‘areas’, in all the forms in
which they can occur (singular, plural and capitalized).
In some texts, the names of the affected areas occur in expressions like ‘54
cases have been reported in the provinces of Velasco’ and ‘Cases also
reported from Niari’. The first rule was designed to look for the following
pattern:
sem(reporting_verbs) sem(preposition) sem(GeoName)
To identify locations other than country names, two categories from the
GeoNames database - geoname/PPL1 and geoname/ADM22 - were used.
Thus, the following rule was constructed for both the passive and the active
expression:
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations] =>
[sem="haveverb"]?, [token="today"|"also"]?, [token="been"]?, [sem="reporting_verbs"], [token="in"|"from"|","|"In"], [token~"^[a-z]+$"]{0,5} \ [sem="geoname/PPL"|"geoname/ADM2", token=_c] [sem="areas"]? /;
We designed a rule to capture groups of entities for situations where the
outbreak hits more than one location. The maximum number of entities it
captures is eight. Figure 5.2 shows the text after extracting the outbreak
locations.
Figure 5.2: Extracting groups of outbreak locations
1 PPL: “A city, town, village, or other agglomeration of buildings where people live and work”. Source: The GeoNames geographical database. Available from: http://www.geonames.org/export/codes.html [Last accessed: 25 July 2013]
2 ADM2: “A subdivision of a first-order administrative division”. Source: The GeoNames geographical database. Available from: http://www.geonames.org/export/codes.html [Last accessed: 25 July 2013]
For locations not identified by the gazetteer, a number of rules have been
designed to extract locations mentioned within explicit expressions. These
expressions usually contain an indicating word, such as ‘city’, ‘province’ or
‘state’, after the location name:
sem(reporting_verbs) sem(preposition) Orth(capitalized) sem(areas)
This pattern conforms with expressions such as:
‘ . . . reported from the Oromiya region’.
Additionally, we have designed a rule to capture locations mentioned after
reporting the incidence of a single case:
‘ . . . female from Kampong Speu Province’.
‘ . . . male from Kong Pisey district’.
Words indicating a single person, such as ‘male’, ‘female’, ‘child’, ‘man’
and ‘woman’ were added under the semantic class ‘people’. The
following rule has been designed:
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations10] => [sem="people"], [token="from"|"for"]?, [token~"^[a-z]+$"]{0,2} \ [sem="geoname/PPL"|"geoname/ADM2", token=_c] / [sem="areas"]?;
Identifying the name of a location appears to be the most challenging task,
especially when the place name is not identified by the gazetteer. Location
phrases can take various forms and can occur anywhere in the text. Simple
rules (e.g. finding location prepositions such as ‘in’ and ‘from’) may extract
all the locations in the text, but they may also increase the number of false
matches.
5.3.Relationship extraction
5.3.1.Rules for extracting the name of the reporting health authority
In the domain studied in this project, the most interesting relationship is the
name of the health authority reporting the outbreak to the WHO. This is a
binary relationship of type “located in” (e.g. Ministry of Health, Afghanistan).
The task is especially important because in some reports the name of the
country is not mentioned at the beginning but is instead implied in the name
of the authority.
When we began our analysis, the name of the health authority was
considered part of the outbreak event, because it was usually mentioned at
the beginning or at the end of the outbreak event clause. The outbreak
event, as will be seen later in this report, is a very complex task, and
extracting the name of the reporting authority is not straightforward: it can
take various forms and can appear in different positions in a sentence,
depending on the voice used (passive or active). For this reason, we decided
to extract it using separate extraction rules.
According to the texts under study, the reporting authority always takes the
form of the relevant country’s health authority, where the name of the health
authority is adjacent to the country name. The most common form is:
sem(health authority) sem(preposition) sem(GeoName)
All the names that might refer to a health authority were collected and added
to the gazetteer under the semantic class ‘health_agency’. The most
common authority reporting outbreaks was a country’s ‘ministry of health’;
however, other names such as ‘The Ministry of Health and Population’ and
‘The National Health and Family Planning Commission’ were also found.
Pattern 1:
sem(health_agency) sem(preposition) sem(GeoName)
This will capture
‘Ministry of Health (MoH) of Egypt’
with the following rule:
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority2] =>
\ [sem="health_agency", token=__d, token=_h], [token="("], [syn="NN"]?, [token=")"], [token="of"|"in", token=__d], [token="the", token=__d]?, [sem>="geoname", token=__d, token=_c] / [sem="haveverb"]?, [sem="reporting_verbs"];
Pattern 2:
The nationality of the country is mentioned instead of the country name:
orth(DT) NNP(nationality) NP(health authority)
where DT refers to determiners such as ‘the’. This pattern conforms with the
following example:
‘The Afghan Ministry of Public Health’.
Pattern 3:
Punctuation is used to refer to the authority name:
sem(health authority) sem(punctuation) sem(GeoName)
to capture:
‘The National Health and Family Planning Commission, China’.
Some health authority names take forms such as:
‘The Ministry of Health and Care Services of Norway’
‘The Ministry of Health and Population, Angola’
Here the first part of the health authority name (‘The Ministry of Health’) is
already in the gazetteer, but the second part (‘and Care Services’ or ‘and
Population’) is new and can vary. To avoid the whole relationship going
unrecognised in such cases, the following pattern was added to the other
relationship patterns:
sem(health authority) token(and) orth(capitalized){1,4} sem(GeoName)
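For illustration only, this pattern can be approximated in Python with a regular expression and a single hard-coded gazetteer entry (‘Ministry of Health’); this is our sketch, not the CAFETIERE notation, and it covers only the comma variant seen in the Bangladesh example:

import re

# Approximation of: sem(health authority) token(and) orth(capitalized){1,4} sem(GeoName)
# 'Ministry of Health' stands in for a gazetteer lookup; the country name is
# simplified to a single capitalised word.
pattern = re.compile(
    r"(?:The\s+)?Ministry of Health"            # gazetteer entry (health_agency)
    r"(?:\s+and(?:\s+[A-Z][a-z]+){1,4})?"       # 'and' + capitalized{1,4}, optional
    r",\s*(?P<country>[A-Z][a-z]+)"             # punctuation + GeoName
)

m = pattern.search("The Ministry of Health and Family Welfare, Bangladesh")
if m:
    print(m.group(0), "->", m.group("country"))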
Figure 5.3: Reporting health authority extraction
In the text shown in Figure 5.3, ‘The Ministry of Health and Family
Welfare, Bangladesh’ has been recognised as the name of the reporting
agency.
However, requiring exact matches for relationship extraction may result in
no information being reported. In some cases, the health authority of the
country is the reporting authority but the country name is not adjacent, or the
country name alone is used to report an outbreak, as in
“China has reported to WHO 5 deaths..”. To avoid missing these cases
entirely, it was decided to extract the health authority or the country name as
the reporting authority whenever a reporting pattern appears in the text, even
if it does not align with the complete relation extraction pattern.
5.4. Events extraction
As in the ACE model, only events with interesting arguments should be extracted,
such as life events (born, die) and transactions (money transfers) (Ahn, 2006).
Consequently, events found in the outbreak reports are life events because a
certain number of people are affected by a disease in a specific country.
5.4.1. Rules for extracting an outbreak event
After examining 25 reports, we found that the patterns used to report an
outbreak event are in the form of the number of victims of an outbreak
reported by an authority. To avoid increasing the complexity of the event
rules, the authority name is extracted in advance. Typically, the simplest
event will be in the following form:
sem(GeoName) sem(reporting verbs) orth(CD) token (“cases”)
This will capture a sentence in the following form:
‘China reported 34 cases’.
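For illustration only, this simplest pattern can be approximated with a Python regular expression (our sketch, not the system's rule notation; the ‘reporting verbs’ class is reduced here to the three verbs named later in this section):

import re

# Approximation of: sem(GeoName) sem(reporting verbs) orth(CD) token("cases")
REPORTING_VERBS = r"(?:reported|identified|confirmed)"
simple_event = re.compile(
    r"(?P<country>[A-Z][a-z]+)\s+" + REPORTING_VERBS + r"\s+(?P<cases>\d+)\s+cases"
)

m = simple_event.search("China reported 34 cases")
if m:
    print(m.group("country"), m.group("cases"))  # China 34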
However, as we analysed more texts, more interesting arguments
emerged, such as case classification and fatal cases. Therefore, before
designing the event rules, we took the following key considerations into account:
5.4.1.1. Number of cases
The number of cases and deaths are usually in digit form, such as
‘134 cases’. Alternatively, they can be in written form: ‘five cases’. We
have also found that the form of ‘twenty-five cases’, where a dash is
inserted between two numbers, is also used in some reports. Another
issue related to extracting the numbers arises when a number
consists of four or more digits and a space is used to group the digits
in threes (e.g. ‘45 100’). To overcome this problem, the
number can simply be read as a whole string, ‘45 100’, but if we want
it to be saved as a proper integer value, the following arithmetic
expression solves the problem: total = “(+ (* _a 1000) _b)”.
CAFETIERE interprets this calculation by multiplying the first
number by 1000 and then adding the second number.
For example, ‘45’ is the first number token ‘a’ and ‘100’ is the second
token ‘b’, thus,
45 * 1000 = 45000
45000 + 100 = 45100.
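As a minimal sketch (ours, not CAFETIERE code; parse_spaced_number is a hypothetical name), the same arithmetic can be written in Python:

def parse_spaced_number(a, b):
    # Mirrors the rule's arithmetic total = (+ (* _a 1000) _b):
    # multiply the first group by 1000, then add the second group.
    # Assumes the second group is exactly three digits, as in '45 100'.
    return int(a) * 1000 + int(b)

print(parse_spaced_number("45", "100"))  # 45100
print(parse_spaced_number("2", "249"))   # 2249, as in '2 249 cases'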
5.4.1.2. Case classification
Cases of infection from a disease are usually classified as either
suspect, probable or confirmed cases to identify the degree of
certainty of an outbreak. Those terms are known as ‘case
classification’ and are often used in outbreak reports. Therefore, case
classification has been added as a feature to the event extraction
rules. All of the terms that fall under the case classification have been
added to the gazetteer under the class ‘case-classification’. Terms
such as ‘laboratory confirmed’ and ‘epidemiologically linked’ are types
of confirmed cases that have been added.
5.4.1.3. Fatal cases
In addition to the typical classification of reported cases mentioned
above, reports usually contain information about the number of fatal
cases and deaths. To distinguish these cases from the others, their
semantic class is ‘fatal cases’; to distinguish the fatal but not yet dead
from the deaths, the feature ‘dead’ is used, holding a value
of either ‘yes’ or ‘no’ according to the terms used in the texts to
describe the situation.
The simple event pattern ‘China reported 34 cases’ is very common;
however, other similar patterns can be found:
‘China reported 34 new suspected cases and 4 new deaths’.
‘China reported 34 new suspected SARS cases and 4 new deaths’.
So to broaden the coverage of similar patterns, verbs that indicate
reporting, such as ‘reported’, ‘identified’ and ‘confirmed’, were added
to the gazetteer along with their different tenses. Their semantic class
is ‘reporting verbs’.
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak14] =>
[sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
\ [syn=CD, token=_a], [syn=CD, token=_b], [token="new"]?,
[sem="case_classification", classification=_c]?, [token="cases"],
[token="of"|"Of"]?, [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3}, [sem="disease_type"]?,
[token=","]?, [token="and"|"including"], [syn=CD, token=_d], [token="new"]?,
[sem="fatal_cases", dead=_answer] /;
In addition to the active voice, passive patterns like
syn(CD) token (“cases”) sem(haveverb +beverbs) sem(reporting_verbs)
are also used widely in outbreak reports; therefore, the verb groups
such as ‘have been’ and ‘has been’ were added to the rules to capture
the following type of pattern:
‘2 249 cases have been reported ...’
Another example of an outbreak event pattern is:
orth(CD) sem(case_classification) token(“cases”) sem(preposition) syn(NN)
This will pick up sentences such as:
‘130 laboratory-confirmed cases of avian influenza’.
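Again for illustration (our sketch; the capture groups are loose approximations of the rule's constituents, not CAFETIERE notation):

import re

# Approximation of: orth(CD) sem(case_classification) token("cases") sem(preposition) syn(NN)
classified_event = re.compile(
    r"(?P<cases>\d+)\s+(?P<classification>[a-z]+(?:-[a-z]+)?)\s+cases\s+of\s+(?P<disease>[\w\s()-]+)"
)

m = classified_event.search("130 laboratory-confirmed cases of avian influenza")
if m:
    print(m.group("cases"), m.group("classification"), m.group("disease"))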
In addition to this, temporal and locative information may appear in
different positions in the sentence or clause:
‘Since 2005, 20 cases reported, 18 of which have been fatal’
‘20 cases reported since 2005- 18 have been fatal’
‘Of the 20 cases reported, 18 have been fatal since 2005’
More complex patterns can be found when both the temporal and
locative information are mentioned in the same sentence:
“20 cases reported in Cambodia since 2005, 18 have been fatal.”
This can be captured by the following rule:
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak11] =>
\ [syn=CD, token=_n], [token="cases"], [sem="haveverb"]?, [sem="beverb"]?,
[sem="reporting_verbs"], [token="in"], [sem>="geoname"], [token="since"]?,
[token~"19??"|"20??"]?, [token=","|"."]?, [token="of"]?, [token="which"|"these"]?,
[syn=CD, token=_d], [sem="haveverb"]?, [sem="beverb"]?,
[sem="fatal_cases", dead=_answer] /;
Similarly, the phrase ‘has reported’ can occur anywhere in the reporting
clause. The adjunct clause will be used to extract fatal cases such as
number of deaths. We have found the following sub-pattern in the WHO
reports:
tokens(“including” | “and” | “of which” | “of these”) orth(CD) sem(haveverb)? sem(NN)
will match
‘ . . . including 31 deaths’.
Another pattern occurs when the passive voice is used:
tokens(“including” | “and” | “of which” | “of these”) orth(CD) sem(haveverb + beverbs)? sem(NN)
This will capture clauses such as:
‘ . . .and 18 have been fatal’.
5.4.2. Rules for extracting the total number of cases and deaths
In many of the texts chosen for this study, we found that in addition to
reporting the number of cases and deaths, the total or the cumulative total of
cases on a particular date was usually mentioned in the same report. All the
patterns mentioned in the previous outbreak reporting event apply to this
task; however, the only difference for these events is when texts use phrases
like ‘the total number is’ and ‘the cumulative total is’.
For example, to capture the total number mentioned in this fragment:
‘ . . . total number of children affected to be 59. Of these, 52 have died’, we
have designed the following rule:
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak15] =>
\ [token="since"|"In"|"in"]?, [syn=CD, orth=number, token~"19??"|"20??", token=_year]?,
[orth=punct]?, [token="the"], [token="total"], [token="number"]?, [token="of"],
[token="children"|"people"|"human"]?, [token="affected"]?, [token="to"]?,
[sem="beverb"]?, [syn=CD, token=_n], [token="cases"]?, [token=","|"."]?,
[token="of"|"Of"]?, [token="which"|"these"]?, [token=","]?, [syn=CD, token=_d],
[sem="haveverb"]?, [sem="beverb"]?,
[sem="fatal_cases", dead=_answer] /;
5.5. Discussion
This chapter has provided a comprehensive overview of how the training corpus was
analysed to capture disease outbreak details, setting out the expressions we
found helpful in building linguistic patterns. We have discussed all of the
entities, relationships and events that were treated as extraction tasks in
this project and have given examples of the rules and the output results. Some of
the patterns used to capture certain entities and events are not mentioned here;
the complete file containing the rules for all the patterns is attached in Appendix A.
Even though the texts chosen for this study belong to one domain, challenges caused by
linguistic variation do exist. By linguistic variation we mean that different
expressions may be used to convey the same idea. Extracting information from texts
can be approached either by writing a few general patterns (which may lead to
information being tagged under incorrect semantic classes) or by writing as many
specific rules as possible (which leads to an extensive workload, with a rule written
for each pattern, even those rarely found in natural texts). Given the time
constraints, a mixture of generic and specific rules was written to cover
as many patterns for entities, relationships and events as possible.
Writing generic rules for extracting entities seemed like an efficient way to capture
as many patterns as possible. This means that many of the constituents that
comprise the main part of the rule are optional values; however, this can be a
problem if the mandatory fields are few and not very specific, which may end up
only capturing the information of very common patterns. Therefore, we have found
that designing many rules with a limited number of optional values has reduced the
amount of information that is captured incorrectly.
Regarding entity extraction, at the beginning we assumed that extracting the
entities would be the most straightforward part of the project. This assumption
has proven true for extracting the dates, as they are always mentioned in the
same way, and also for extracting the country of the outbreak. The only
problem is with countries that are mentioned in the text but have no further reporting
of a disease outbreak.
E.g.: ‘Argentina and Peru have been notified of the cases that occurred
earlier this month in Chile’.
The countries ‘Argentina’ and ‘Peru’ are not disease outbreak locations - ‘Chile’ is
the outbreak location. So for this task, the work has been focused on the sentence
level; countries mentioned in the first sentences are only captured if they conform to
specific patterns, as it has been found that in the disease outbreak reports, the
important information related to the actual outbreak event is always presented first
and the secondary information is presented later.
Extracting the outbreak name was a relatively challenging task, as these names do not
conform to common patterns; even the orthographic features are not obvious. Many
diseases are named after the person who discovered them or after the location
where they first appeared, which can cause confusion when extracting them. For
example, the word ‘Avian’ was tagged by the gazetteer as the name of a location
and not as part of the name of a disease. Some reports discuss the symptoms of an
outbreak, which can be problematic if their sentences match a pattern designed to
capture an outbreak disease. All of these reasons complicated the extraction
process. Extracting the locations of an outbreak was a very challenging task,
especially for locations that were not tagged in the GeoName database; it was
therefore essential to discover as many expressions as possible.
Conversely, extracting the ‘located in’ relationship was relatively straightforward.
This is because the reporting authorities have a limited number of patterns.
We initially assumed that events extraction would be difficult because the outbreak
events are usually very long and consist of other information that may be extracted
in advance; however, after closely examining and testing the patterns, we decided
to treat each clause or sentence as a number of constituents indicating certain
features. The longest event clause that can occur is when all the features are
mentioned in the same sentence. By features, we mean that case classification,
locative and temporal details and the number of cases are reported in the same
sentence or clause. Other information that may appear in the event clause, such as
the disease name and the country of the outbreak, is read only as part of the
linguistic pattern and is not itself extracted, because it is always extracted
beforehand using separate rules. For example:
‘130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus including 31 deaths’.
The information extracted is:
Number of cases = 130
Case classification = laboratory-confirmed
Number of fatal cases = 31
dead_cases = yes
The name of the disease will not be extracted and is only used to formalize one of
the outbreak event patterns. In doing this, the extraction process will be facilitated
as there is no reason to extract the same information again.
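Conceptually, the filled event template for this sentence amounts to the following (a minimal sketch in Python; the field names mirror the rule attributes used earlier, such as number_of_cases and dead_cases):

# Filled event template for the H7N9 example; the disease name is not a slot
# here because it is extracted beforehand by the entity rules.
outbreak_event = {
    "number_of_cases": 130,
    "case_class": "laboratory-confirmed",
    "number_of_deaths": 31,
    "dead_cases": "yes",
}
print(outbreak_event)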
Most of the difficulties we encountered when designing the rules were due to rule
ordering. The rules file is ordered with the rules for extracting
entities first, followed by relationships and finally events. If a rule for
entity extraction captures information from the clause containing the event
information, the whole event will not be recognised. This problem was partially
solved by defining the ‘before’ and ‘after’ context constituents: the more conditions
added, the more potential overlaps between patterns are avoided.
6. System Evaluation
At the beginning of this chapter, the evaluation metrics that were used to assess the system performance are discussed, in addition to some basic definitions that were considered while validating the extraction outputs. A demonstration of how each report was evaluated is also shown. Finally, precision, recall and F-measure were calculated for both the training and testing sets to conclude the main findings of this project.
6.1. System evaluation metrics
To aid in assessing the level of performance achieved by the designed system, a
complete system evaluation was performed. As discussed in chapter 2, the main
findings of the MUCs include the measures of precision and recall, as well as the F-
measure, which is the harmonic mean of precision and recall. Those metrics were adopted
in this project. Precision indicates how many of the elements extracted by the
system are correct (accuracy), while recall indicates how many of the elements that
should have been extracted were actually extracted (coverage). Although these
measures have occasionally been changed slightly, they are the most commonly
used by researchers to compare the results of their systems with those of other IE
systems (Maynard et al., 2006).
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
F-measure = (2 × Precision × Recall) / (Precision + Recall)
In preparation for the evaluation, the sets of texts run through the system were also
manually annotated. The evaluation process was based on a comparison of the
manual extractions with the system’s output. Elements extracted by the system
were identified as:
• Correct (true positive): Elements extracted by the system align with the value
and type of those extracted manually.
• Spurious (false positive): Elements extracted by the system do not
match any of those extracted manually.
• Missing (false negative): The system did not extract elements that were
extracted manually.
• Partial: The extracted elements are correct, but the system did not capture
the entire range. For example, from the sentence “China today reported 39
new SARS cases and four new deaths”, the system should extract the
number of cases and deaths, but in this instance, it extracted only the
number of cases. This case is a partial extraction and would be allocated a
half weight, resulting in the coefficient 0.5. Another coefficient could be used
to obtain more accurate results. For example, if the majority of an element is
extracted, then a coefficient of 0.75 or higher can be used, but if only a small
part of the element is extracted, a coefficient of 0.40 or less can be used. All
the MUCs assigned partial scores for incomplete but correct elements
(Grishman, 2012).
Therefore, the measures of precision and recall can be calculated as follows:
Precision = (Correct + 0.5 × Partial) / (Correct + Partial + Spurious) = (Correct + 0.5 × Partial) / N
Recall = (Correct + 0.5 × Partial) / (Correct + Partial + Missing) = (Correct + 0.5 × Partial) / M
Where:
N = Total number of elements extracted by the system
M = Total number of manually extracted elements
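As a minimal sketch (ours, not part of any evaluation tooling used in the project), these weighted measures can be computed from the per-text counts in Python; with C = 5, P = 1, S = 0, M = 1 it reproduces, up to rounding, the entity scores of text 2 in Tables 6.1 and 6.2:

def evaluate(correct, partial, spurious, missing, weight=0.5):
    # n = total elements extracted by the system; m = total extracted manually.
    n = correct + partial + spurious
    m = correct + partial + missing
    precision = (correct + weight * partial) / n if n else 0.0
    recall = (correct + weight * partial) / m if m else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(evaluate(5, 1, 0, 1))  # approx. (0.917, 0.786, 0.846)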
6.2. System evaluation process
The manually annotated entities were employed as the answer keys used to
validate the system output. Validation against the training corpus utilised in
designing the rules should yield high scores for both recall and precision. Therefore,
10 texts new to the system were selected from the WHO website. In the training
phase, a set of 25 texts was chosen randomly. Summary reports of disease
outbreaks in different countries were excluded from both the training and the test
set because they typically are constructed differently and contain significantly
different textual patterns.
The system tagged elements either because they were captured by the extraction
rules or they matched a gazetteer entry. The main goal of this project was to test
the extraction rules’ ability to identify elements of the desired value and type.
Elements tagged by the gazetteer consistently possessed the correct value and
type and, if assigned a score, would receive the full score of 1. Therefore, it was
decided to count only the elements captured by the extraction rules.
For example, a text was annotated manually to identify the elements that the
system should extract (see Figure 6.1.).
Figure 6.1: Manual annotation
Entities:
Outbreak name: Meningococcal disease
Country: Burkina Faso
Publish date: 4 February 2003
Report date start: 1 January 2003
Report date end: 26 January 2003
Outbreak locations: Batie, Kossodo, Manga and Tenkodogo
Meningococcal disease in Burkina Faso.4 February 2003.Disease Outbreak Reported.During 1-26 January 2003, the Ministry of Health of Burkina Faso has reported 980 cases and 196 deaths (case-fatality rate, 20%) in the country. On 26 January 2003, 4 districts, Batie, Kossodo, Manga and Tenkodogo, were in the alert phase, although none had crossed the epidemic threshold.
For more details about the epidemic threshold principle, see the article, "Detecting meningococcal meningitis epidemics in highly-endemic African countries" in the Weekly Epidemiological Record.Of a total of 28 specimens collected in 3 districts (Nanoro, Paul VI, Pissy), the National Public Health Laboratory has confirmed Neisseria meningitidis serogroup W135 in 10 samples, Streptococcus pneumoniae in 8 and Haemophilus influenzae type b in 4. The Ministry of Health is implementing control measures to contain the outbreak, including the pre-positioning of laboratory materials and oily chloramphenicol at district level, enhanced epidemiological surveillance, training of health personnel and social mobilization in communities.
Relationship:
Reporting authority: Ministry of Health of Burkina Faso
Event:
Outbreak event: 980 cases and 196 deaths
The same text was run through the system (see Figure 6.2.).
Figure 6.2: System annotation
As can be seen, the elements were correctly extracted and assigned to the
appropriate type (class). The system extracted more elements than the manual
process; the additional elements were tagged by the system gazetteer. In this
particular example, the Ministry of Health is mentioned twice in the text. In the first
instance, the phrase is followed by a country name. This pattern conforms to the
reporting authority rules and therefore was tagged by the system. The second
mention, however, did not follow a pattern recognised by any of the rules; therefore,
it was tagged only by the gazetteer.
In addition, some tags that did not come from the gazetteer can be found; in
this case, they are considered spurious elements. The same process of analysis was
undertaken for both the training and the test sets.
6.3. Results analysis
Table 6.1 shows the extraction results from the training corpus of 25 texts. The table
displays the number of entities, relationships and events correctly identified, those
partially identified by the system and spurious elements. The actual entities, actual
relations and actual events columns present the total number of elements extracted
manually. C, P, S and T refer to the correct, partial, spurious and total elements,
respectively.
Table 6.1: Breakdown of the counting results of the training corpus
Text | Entities extracted: C P S T | Actual entities | Relations extracted: C P S T | Actual relations | Events extracted: C P S T | Actual events
1 4 4 4 1 1 1
2 5 1 (50%)
6 7 1 1 1 1 1 1
3 4 4 5 0 1 1 1 1
4 4 4 5 1 1 1 1 1 1
5 3 1 4 4 0 1 1 1 1
6 4 4 4 1 1 1 1 (80%) 1 1
7 4 4 4 1 1 1 1 1 1
8 10 10 12 1 1 1 1 1 1
9 4 4 4 1 2 (50%) 3 3
10 7 2 9 8 1 1 1 1 (50%) 1 1
11 3 1 4 8 1 1 2 1
12 4 4 4 1 1 1 1 1 1
13 4 4 4 1 1 1 1 1 1
14 5 1 6 6 1 1 1 1 1 1
15 5 5 5 1 1 1
16 4 4 7 1 1 1 2 2 2
17 4 4 5 3 1 4 3
18 4 4 7 1 1 1 1 (80%) 1 1
19 3 3 3 1 1 1 1 (50%) 1 1
20 8 8 10 1 1 1 1 1 1
21 4 4 5 0 1 1 1 1
22 6 1 7 8 0 1 1 1 1
23 4 4 4 1 1 1 1 1 1
24 4 4 5 1 1 1
25 5 1 6 5 1 1 1 2 2 3
The precision, recall and f-measures were calculated based on the results
presented in Table 6.1. The calculation results are shown in Table 6.2.
Table 6.2: Breakdown of the evaluation metrics of the training corpus
Text | Entities: Precision Recall F-measure | Relations: Precision Recall F-measure | Events: Precision Recall F-measure
1 1.00 1.00 1.00 1.00 1.00 1.00
2 0.91 0.78 0.85 1.00 1.00 1.00 1.00 1.00 1.00
3 1.00 0.80 0.90 0.00 0.00 1.00 1.00 1.00
4 1.00 0.80 0.90 1.00 1.00 1.00 1.00 1.00 1.00
5 0.75 0.75 0.75 0.00 0.00 1.00 1.00 1.00
6 1.00 1.00 1.00 1.00 1.00 1.00 0.8 0.8 0.80
7 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
8 1.00 0.83 0.92 1.00 1.00 1.00 1.00 1.00 1.00
9 1.00 1.00 1.00 0.70 0.70 0.70
10 0.80 0.90 0.85 1.00 1.00 1.00 0.5 0.5 0.50
11 0.80 0.40 0.60 0.50 1.00 0.75
12 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
13 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
14 0.83 0.83 0.83 1.00 1.00 1.00 1.00 1.00 1.00
15 1.00 1.00 1.00 1.00 1.00 1.00
16 1.00 0.60 0.80 1.00 1.00 1.00 1.00 1.00 1.00
17 1.00 0.8 0.90 0.80 1.00 0.90
18 1.00 0.60 0.80 1.00 1.00 1.00 0.80 0.80 0.80
19 1.00 1.00 1.00 1.00 1.00 1.00 0.5 0.5 0.50
20 1.00 0.8 0.90 1.00 1.00 1.00 1.00 1.00 1.00
21 1.00 0.8 0.90 0.00 0.00 1.00 1.00 1.00
22 0.93 0.80 0.87 0.00 0.00 1.00 1.00 1.00
23 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
24 1.00 0.8 0.90 1.00 1.00 1.00
25 0.83 1.00 0.92 1.00 1.00 1.00 1.00 0.70 0.85
Average 0.95 0.85 0.90 0.80 0.80 0.80 0.90 0.92 0.91
An initial observation that emerges from both tables is that the system delivers a high level
of performance when extracting outbreak events. Extremely high precision and recall were
achieved in all events; in 17 of 25 cases, precision and recall equalled 1. These high
values were produced not only because the texts used were the actual training set, but
also because the patterns utilised to extract the event were studied extensively in order to
design additional rules for never-before-seen patterns. These additional patterns were
predicted based on the knowledge that many tokens can separate adjacent pieces of
information (the number of cases and deaths). The word ‘tokens’ here describes elements
that can refer to temporal and location information, or to disease names.
The results of entities extraction also demonstrate extremely high performance. Recall is
slightly lower than precision but is still considered high, with an average of 0.85.
The relationships were either correctly extracted or not extracted at all. Although the task
of designing the relationships rules was relatively straightforward, it produced the lowest
precision and recall, primarily because all the constituents of the relationship rule were
made mandatory fields during the design phase, which prevented partial extraction.
Enabling optional fields could lead to many false alerts because the names of health
authorities often occur several times in a single outbreak report. This restriction resulted in a
score of 1 for both recall and precision in 80% of the relations mentioned in all the reports,
indicating that the rules succeeded in accurately extracting all the relationships in a report.
The test corpus results are shown in Tables 6.3 and 6.4.
Table 6.3: Breakdown of the counting results of the test corpus
Text | Entities extracted: C P S T | Actual entities | Relations extracted: C P S T | Actual relations | Events extracted: C P S T | Actual events
1 5 5 6 1 1 2
2 8 8 10 0 1 1 1 1
3 4 4 5 1 1 1 1 2 1
4 3 3 4 1 1
5 8 8 8 1 1 1 1 1
6 4 4 5 0 1 1 1 2
7 3 3 9
8 4 4 5 1 1 1 1 1 1
9 2 2 5 1 1 2
10 6 6 6 1 1 1 1 1 1
Table 6.4: Breakdown of the evaluation metrics of the test corpus
Text | Entities: Precision Recall F-measure | Relations: Precision Recall F-measure | Events: Precision Recall F-measure
1 1.00 0.83 0.92 1.00 0.5 0.75
2 1.00 0.8 0.90 0 0 1.00 1.00 1.00
3 1.00 0.8 0.90 1.00 1.00 1.00 0.5 1.00 0.75
4 1.00 0.75 0.88 1.00 1.00 1.00
5 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
6 1.00 0.8 0.90 0 0 1.00 0.5 0.75
7 1.00 0.33 0.67
8 1.00 0.8 0.90 1.00 1.00 1.00 1.00 1.00 1.00
9 1.00 0.4 0.70 1.00 0.5 0.75
10 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Average 1.00 0.75 0.88 0.70 0.70 0.70 0.90 0.80 0.85
The most noticeable result is that all entities extracted from the test corpus were correct.
The perfect precision indicates that extraction produced few errors; the lower recall reflects
missed elements, and achieving high recall is generally more challenging than achieving
high precision (Sarawagi, 2007).
As in the training corpus, the relationship extraction had the lowest performance level. In
addition to the possible cause discussed earlier, the rules did not take into account the
new reporting patterns.
Event extraction for the test set also achieved a high level of performance even for new
patterns. The results show an average precision of 90% and recall of 80%, reasonably
high considering the complexity of the task.
It is necessary to determine the number of occurrences of each entity type in order to
assess their level of difficulty. Not all the entities have the same frequency, necessitating
accurate measurement of the performance of the rules.
Table 6.5: Number of occurrences of each entity type
Entity | Training set: C P S M | Testing set: C P S M
Published date 24 10
Report date 14 1 6 5 1
Country 22 1 10
Disease 23 2 1 6 4
Locations 28 2 3 17 15 11
Disease code 5 1
Table 6.5 shows that, in both the training and the testing sets, the location entity has the
greatest number of missing elements. This result is not surprising because the extraction
of locations was the most challenging task during the design phase. Locations can be
mentioned anywhere in a report and do not conform to obvious patterns. In addition, unlike
the other entities, locations can be mentioned within a group of other locations as, for
example, in the statement “Cases have also been reported in Larnaca, Famagusta,
Nicosia and Paphos”. Problematically, in such patterns, the number of locations that can
be mentioned within a clause may remain undetermined. In addition, the preceding and
following sentences can present various patterns. Those factors make location one of the
most difficult entities to handle within outbreak reports.
Another observation can be made about disease names. Although the design of the
extraction rules for diseases was extremely challenging and required both a deep analysis
of various disease names and the linguistic analysis of the context in order to prove that
an entity actually was an outbreak, the results show highly accurate identification and few
errors. These results are acceptable considering the difficulty of the task.
The extraction results, though, can be improved. In particular, the results for the location
and disease name entities could be improved significantly by using up-to-date official
datasets of location and disease names. Doing so would allow most effort to be focused on
analysing linguistic patterns rather than positing potential name structures and
combinations.
Taking into account the limited time allocated for the project and the time-consuming
nature of building, adjusting and testing the extraction rules, the performance results
show an extremely high degree of accuracy.
7. Conclusion
The main aim of this project was to acquire basic knowledge of IE technology and
methodologies. A number of objectives were defined in accordance with this general aim.
The background study was conducted to satisfy the first objective, which was to assess
the current state of the art of IE. We reviewed the basic definitions of IE and its position in
the NLP spectrum. The history of IE was briefly described, including the most influential
events such as MUC and ACE, in order to reveal the basics of these systems. The two IE
approaches, the knowledge engineering approach and the automatic training approach,
were examined, and the reasons for preferring one were explained. The evaluation metrics—
recall and precision—were reviewed. Finally, several examples of IE systems were given
and the influence of modern software engineering approaches on system designs
considered.
A rule-based approach—the knowledge engineering approach—was selected as the
methodology for this project. To determine the success of the designed rules, a hybrid
model involving both the prototype and the waterfall approach was adopted as the overall
development strategy; therefore, rules were designed, implemented and tested
simultaneously.
Along with the theoretical work, the majority of the practical work consisted of extracting a
predefined set of information from the online WHO disease outbreak reports that contained
similar information. Extraction rules thus were designed to capture the repeated
information. The information extraction system CAFETIERE was used to implement the
rules, while the rules writing was based on extensive study of the linguistic patterns
preceding and following the target element. Essential to the rule-based approach is
studying the classes of words and phrases used in the arguments of such information.
The evaluation of the extraction rules yielded high precision and recall scores, close to
those of state-of-the-art IE. The experiments were conducted independently with two
subset corpora (the training and testing sets). The sets delivered similar system
performance, although the training corpus had higher accuracy, particularly for relationship
extraction. Event extraction, surprisingly, yielded very high scores. The approach that
helped achieve these scores was to consider what pieces of information may form the
event clause itself: instead of only capturing the number of cases and deaths caused by
the outbreak, other information, such as case classification, fatality status, year and total
numbers, was also included in the task. Those constituents helped in building many of
the linguistic patterns that comprise the outbreak events.
It can be concluded that the rule-based approach has been proven capable of delivering
reliable information extraction with extremely high accuracy and coverage results. This
approach, though, requires an extensive, time-consuming, manual study of word classes
and phrases.
In the future, this research could be expanded in various directions. For instance,
information about individual cases affected by an outbreak could be extracted, such as the
gender, age, province, village and initial symptoms of a particular case. It would be useful
to investigate how to use co-references in multiple sentences. In addition, the identification
of location entities could be improved by combining the different levels of a location into a
single relation; for example, Halifax, Nova Scotia, could be extracted as a location
relationship. Finally, study should be directed toward reports on outbreaks affecting plants
and animals.
References
ACHARYA, S. and PARIJA, S. (2010) The process of information extraction through natural language processing. International Journal of Logic and Computation, 1(1), pp. 40-51.
AHN, D. (2006) The stages of event extraction. ARTE '06 Proceedings of the Workshop on Annotating and Reasoning about Time and Events, Sydney, Australia, pp.1-8.
APPELT, D. et al. (1993) FASTUS: A finite-state processor for information extraction from real-world text. Proceedings of the 13th International Joint Conference on Artificial Intelligence, (IJCAI-93), pp. 1172-1178.
APPELT, D. and ISRAEL, D. (1999) Introduction to information extraction. Artificial Intelligence Communications, 12 (3), pp. 161–172.
AVISON, D. and FITZGERALD, G. (2003) Where now for development methodologies? Communications of the ACM, 46 (1), pp. 78–82.
BANKO, M. and ETZIONI, O. (2008) The tradeoffs between open and traditional relation extraction. Proceedings of ACL-08: HLT, Columbus, Ohio, Association for Computational Linguistics, June 2008, pp. 28-36.
BLACK, W.J. et al. (2005) Parmenides Technical Report. Available from: http://www.nactem.ac.uk/files/phatfile/cafetiere-report.pdf [Accessed 29/04/13].
BLACK, W.J. et al. (2012) A data and analysis resource for an experiment in text mining a collection of micro-blogs on a political topic. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2083-2088. Available from: http://www.lrec-conf.org/proceedings/lrec2012/pdf/1056_Paper.pdf [Accessed 29/04/2013].
CARDIE, C. (1997) Empirical methods in information extraction. AI Magazine, 18 (4), pp. 65-79.
CHINCHOR, N. (2001) Overview of MUC-7. Science Applications International Corporation. Available from: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/overview.html [Accessed 28/04/13].
COWIE, J. and LEHNERT, W. (1996) Information Extraction. Communications of the ACM, 39 (1), pp. 80–91.
COWIE, J. and WILKS, Y. (2000) Information extraction. In: DALE, R., MOISL, H. and SOMERS, H., eds. Handbook of natural language processing. New York: Marcel Dekker, pp. 241-269.
CUNNINGHAM, H. (2006) Information extraction, automatic. In: Encyclopedia of language and linguistics, 2nd ed. Amsterdam: Elsevier Science, 5, pp. 665-677.
NADEAU, D. and SEKINE, S. (2007) A survey of named entity recognition and classification. Linguisticae Investigationes, 30, pp. 3–26.
De SITTER, A. et al. (2004) A formal framework for evaluation of information extraction. Technical report no. 2004-4. Antwerp: University of Antwerp Dept. of Mathematics and Computer Science, 2004.
DIETL, R. et al. (2008) Project deliverable report: Deliverable D2.1 services approach & overview general tools and resources. Available from: http://dspace.ou.nl/bitstream/1820/1707/1/D2.1%20final%20EC.pdf [Accessed 28/04/13].
ESPARCIA, S. et al. (2010) Integrating information extraction agents into a tourism recommender system. Proceedings of EAIS2010, Springer LNAI 6077, pp.193 – 200.
FERRUCCI, D. and LALLY A. (2005) Building an example application with the unstructured information management architecture. IBM Systems Journal, 43, pp. 455-475.
FRIEDMAN, C. et al. (1995) Natural language processing in an operational clinical information system. Natural Language Engineering, 1, pp. 1-28.
GRISHMAN, R. et al. (2002) Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35 (4), pp. 236-246.
GRISHMAN, R. (2005) NLP: An information extraction perspective. In ANGELOVA, G. et al., eds. Recent Advances in Natural Language Processing (RANLP). Borovets, Bulgaria: INCOMA, pp. 1-4.
GRISHMAN, R. (2012) Information extraction: Capabilities and challenges. Notes prepared for the 2012 International Winter School in Language and Speech Technologies, Rovira i Virgili University, Tarragona, Spain.
GRISHMAN, R. and SUNDHEIM, B. (1996) Message understanding conference - 6: A brief history. Proceedings of the 16th International Conference on Computational Linguistics (COLING 96), Copenhagen, August 1996.
HAHN, U. et al. (2008) An overview of JCoRe, the JULIE Lab UIMA component repository. Proceedings of the LREC 2008 Workshop on UIMA for NLP, pp. 1-7.
HAUG, P.J. et al. (1997) A natural language parsing system for encoding admitting diagnoses. Proceedings of the AMIA Annual Fall Symposium, pp. 814-8.
JONG, G. (1977) FRUMP. ACM SIGART Bulletin, 61, pp. 54-55.
MASSEY, V. and SATAO, K. (2012) Comparing various SDLC models and the new proposed model on the basis of available methodology. International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), 2, pp. 170-177.
MAYNARD, D. et al. (2006) Metrics for evaluation of ontology-based information extraction. Proceedings of the WWW 2006 Workshop on Evaluation of Ontologies for the Web (EON), Edinburgh, May 2006.
MOENS, M. (2006) Information extraction: Algorithms and prospects in a retrieval context. The Information Retrieval Series 21. New York: Springer.
MOSCOVE, S. (2001) Prototyping: An alternative approach to systems development work. Review Of Business Information Systems, 5 (3), pp. 65-72.
PISKORSKI, J. and YANGARBER, R. (2012) Information extraction: Past, present and future. In POIBEAU, T. et al. Multi-source, multilingual information extraction and summarization. Theory and Applications of Natural Language Processing series. Berlin and Heidelberg: Springer-Verlag, pp. 23-49.
RAMSHAW, L. and WEISCHEDEL, R. (2005) Information extraction. Proceedings of IEEE ICASSP, Philadelphia, PA, 2005. IEEE digital library, 5, pp. 969–972.
SAGER, N., et al. (1987) Medical language processing: Computer management of narrative data. Reading, MA: Addison-Wesley.
SARAWAGI, S. (2007) Information extraction. Foundations and Trends Databases, 1 (3), pp. 261-377.
TURMO, J. et al. (2006) Adaptive information extraction. ACM Computing Surveys 38 (2), pp. 1–47.
WILKS, Y. (1997) Information extraction as a core language technology. In: PAZIENZ, M. Information Extraction A Multidisciplinary Approach to an Emerging Information Technology. Berlin and Heidelberg: Springer-Verlag, 1299, pp. 1-9.
Appendix A: Extraction rules

# Ex: Cholera in Chile
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name1] =>
[token!="in"]
\ [sem=disease, orth="capitalized", token=_o, syn!=DT, sent=0], [sem="disease_type"]? /
[token="in"|","], [sem="geoname/COUNTRY"];

# Ex: Malaria
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name2] =>
[sem!="symptoms", sent>0]
\ [sem=disease, token=_o], [sem="disease_type"]? /;

# Ex: Coronavirus
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name3] =>
\ [token="*virus", token=_o], [sem=disease_type] /;

# Ex: Herpes type 1
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name4] =>
\ [sem=disease, token=__o], [token="type", token=__o]?, [token="-", token=__o]?, [orth=number, token=__o]? /;

# Ex: acute poliomyelitis outbreak
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name5] =>
\ [sem="disease_condition", token=__o], [sem=disease, token=__o], [sem=disease_type]? /;

# Ex: acute poliomyelitis outbreak
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name6] =>
[sem!="symptoms", sent>0]
\ [sem="disease_condition", token=__o], [sem=disease, token=__o], [sem=disease_type]? /;

# Ex: Severe acute respiratory syndrome
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name7] =>
[sem!="symptoms", sent>0]
\ [sem="disease_condition", token=__o], [token=__o]{1,2} /
[sem=disease_type];
# Ex: outbreak of Shigellosis
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name8] =>
[token="outbreak"|"outbreaks"], [token="of"]
\ [token=__o]{1,3}, [sem=disease_type, token=__o] /;

# Ex: avian influenza A(H7N9) virus
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name9] =>
[token="infection"], [token="with"]
\ [token=__o, syn!=DT], [sem=disease_type, token=__o], [token=__o]{1,2}, [sem=disease_type] /
[token!="Outbreak"|"of"|"including"|"and"|","], [token!="Reported"];

# Ex: acute poliomyelitis outbreak
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name10] =>
\ [sem="disease_condition", token=__o], [token=__o]{1,2}, [sem=disease_type] /
[token!="Outbreak"|"of"|"including"|"and"|","], [token!="Reported"];

# Ex: Acute haemorrhagic fever
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name11] =>
\ [token=__o, sem="disease_condition", orth=capitalized], [token=__o, sem="disease_condition"]*, [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: Acute haemorrhagic fever
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name12] =>
[sem!="symptoms", sent>0]
\ [token=__o, sem="disease_condition", orth=capitalized], [token=__o, sem="disease_condition"]*, [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: SARS virus
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name13] =>
\ [token~"^[A-Z]+$", token=__o], [sem=disease_type, token=__o] /;

# Ex: avian influenza
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name14] =>
[token="infection"], [token="with"]
\ [token=__o, syn!=DT], [sem=disease_type, token=__o] /
[token!="Outbreak"|"of"|"including"|"and"|","], [token!="Reported"];

# Ex: Dengue/dengue haemorrhagic fever
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name15] =>
[sem!="symptoms"]
\ [token=__o, orth=capitalized, syn!=DT], [token=__o]{1,2}, [token=__o, sem="disease_condition"], [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: Dengue/dengue haemorrhagic fever from title
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name16] =>
\ [token=__o, orth=capitalized, syn!=DT, sent=0], [token=__o]{1,2}, [token=__o, sem="disease_condition"], [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: Yellow fever from title
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name17] =>
\ [orth=capitalized, token=__o, sent=0], [token=__o, sent=0]?, [sem=disease_type, token=__o] /;

# Ex: Ebola haemorrhagic fever from title
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name18] =>
\ [token=__o, orth=capitalized, syn!=DT, sent=0], [token=__o]{0,2}, [token=__o, sem="disease_condition"], [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];
# Ex: China
[syn=NNP, sem=country_of_the_outbreak, type=entity, country=_c, rulid=country_name] =>
[token="Situation"|"situation", sem!="collaboration"], [token="in"]
\ [syn=DT]?, [sem>="geoname/COUNTRY", token=_c] /;

# Ex: China
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name1] =>
[sem!="collaboration"], [token="in"|","]
\ [sem>="geoname/COUNTRY", token=_c, sent<=2] /;

# Ex: China from title
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name2] =>
\ [sem>="geoname/COUNTRY", token=_c, sent=0] /;
# Ex: reported from Bengo, Malange, and Luanda
[syn=NNP, sem=outbreak_locations, type=entity, loc1=_l1, loc2=_l2, loc3=_l3, loc4=_l4, loc5=_l5, loc6=_l6, loc7=_l7, loc8=_l8, rulid=locations1] =>
[sem="reporting_verbs"], [token="in"|"from"|","|"In"], [token~"^[a-z]+$"]{0,5}
\ [sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l1],
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l2]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l3]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l4]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l5]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l6]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l7]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l8]?, [sem="areas"]? /;

[syn=NNP, sem=outbreak_locations, type=entity, loc1=_l1, loc2=_l2, loc3=_l3, loc4=_l4, loc5=_l5, loc6=_l6, loc7=_l7, loc8=_l8, rulid=locations2] =>
[sem="disease_type"], [token~"^[a-z]+$"]{0,2}, [token="in"|","|"In"|"from"]
\ [syn=DT]?, [sem="geoname/PPL"|"geoname/ADM2", token=_l1],
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l2]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l3]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l4]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l5]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l6]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l7]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l8]?, [sem="areas"]? /;

[syn=NNP, sem=outbreak_locations, type=entity, loc1=_l1, loc2=_l2, loc3=_l3, loc4=_l4, loc5=_l5, loc6=_l6, loc7=_l7, loc8=_l8, rulid=locations3] =>
[token="outbreak"|"Outbreak"], [token~"^[a-z]+$"]{0,2}, [token="in"|","|"In"|"from"]
\ [syn=DT]?, [sem="geoname/PPL"|"geoname/ADM2", token=_l1],
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l2]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l3]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l4]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l5]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l6]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l7]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l8]?, [sem="areas"]? /;
# Ex: Cases have also been reported for the first time in the provinces of Velasco
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations4] =>
[sem="haveverb"]?, [token="today"|"also"]?, [token="been"]?, [sem="reporting_verbs"],
[token="in"|"from"|","|"In"], [token~"^[a-z]+$"]{0,5}
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c], [sem="areas"]? /;
# Ex: male from Phnom
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations5] =>
[sem="people"], [token="from"|"for"]?, [token~"^[a-z]+$"]{0,2}
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="areas"]?;

# Ex: Beijing has reported
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations6] =>
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="haveverb"]?, [token="today"]?, [sem="reporting_verbs"];

# Ex: 4 cases from Beijing
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations7] =>
[token="cases"|"deaths"|"case"], [token~"^[a-z]+$"]{0,2}, [token="in"|"from"|","]
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="areas"]?;

# Ex: Beijing
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations8] =>
[sem="outbreak_locations"], [token="of"]
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="areas"]?;
# Ex: 4 cities out of 10, ()
[syn=NP, sem=outbreak_locations, type=entity, loc1=__l1, loc2=__l2, loc3=__l3, loc4=__l4, loc5=__l5, loc6=__l6, loc7=__l7, loc8=__l8, rulid=locations9] =>
[syn=CD]?, [sem="areas"], [token="out"]?, [token="of"]?, [syn=CD]?, [orth=punct]?
\ [token="("]?, [orth=capitalized, token=__l1]{1,2},
[token=","]?, [token="and"]?, [orth=capitalized, token=__l2]{1,2},
[token=","]?, [token="and"]?, [orth=capitalized, token=__l3]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l4]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l5]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l6]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l7]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l8]*,
[token=")"]? /;

# Ex: Ogooue-Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations10] =>
[token="cases"|"deaths"|"case"], [token="in"|","|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [token="-"], [token=__c] /
[sem="areas"];

# Ex: Ogooue Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations11] =>
[token="cases"|"deaths"|"case"], [token="in"|","|"In"|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [orth=capitalized, token=__c]? /
[sem="areas"];

# Ex: Ogooue Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations12] =>
[sem="reporting_verbs"], [token="in"|","|"In"|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [orth=capitalized, token=__c]? /
[sem="areas"];

# Ex: Ogooue Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations13] =>
[sem="disease_type"], [token~"^[a-z]+$"]{0,2}, [token="in"|","|"In"|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [orth=capitalized, token=__c]? /
[sem="areas"];

# Ex: old male from Kong Pisey district
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations14] =>
[sem="people"], [token="from"|"for"]?, [token~"^[a-z]+$"]{0,2}
\ [orth=capitalized, token=__c], [token="-", token=__c]?, [orth=capitalized, token=__c]? /
[sem="areas"]?;
# ex: 2 November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, day=_day, year=_year, rulid=publish_date1] =>
\ [syn=CD, orth=number, token=__t, token=_day, sent<=1],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent<=1],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent<=1] /;

# ex: 2 November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, day=_day, year=_year, rulid=report_date1] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=2],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=2] /;

# ex: November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, year=_year, rulid=report_date2] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [sem="temporal/interval/month", monthno=_mno, key=__t],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=2] /;

# ex: 21 November
[syn=np, sem=date, type=entity, key=__t, month=_mno, rulid=report_date3] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=2],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=2] /;

# ex: 2 November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, day=_day, year=_year, rulid=report_date4] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=3],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=3] /;
# ex: November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, year=_year, rulid=report_date5] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [sem="temporal/interval/month", monthno=_mno, key=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=3] /;

# ex: November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, rulid=report_date6] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=3],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=3] /;

# ex: During 1-26 January 2003 in sentence 2
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, month=_mno, year=_year, rulid=report_date7] =>
[token="during"|"During"|"From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=2],
[token="-"|"/"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=2],
[sem="temporal/interval/month", monthno=_mno, token=__tt, token=__t, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=__t, token=_year, sent=2] /;

# ex: During 1-26 January 2003 in sentence 3
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, month=_mno, year=_year, rulid=report_date8] =>
[token="during"|"During"|"From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=3],
[token="-"|"/"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=3],
[sem="temporal/interval/month", monthno=_mno, token=__tt, token=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=__t, token=_year, sent=3] /;
# ex: From 1 January 2003 to 27 January 2003 in sentence 2
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, from_month=_mm, to_month=_m, from_year=_yy, to_year=_y, rulid=report_date9] =>
[token="From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=2],
[sem="temporal/interval/month", token=_mm, token=__tt, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=_yy, sent=2]?,
[token="-"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=2],
[sem="temporal/interval/month", token=_m, token=__t, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_y, sent=2] /;

# ex: From 1 January 2003 to 27 January 2003 in sentence 3
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, from_month=_mm, to_month=_m, from_year=_yy, to_year=_y, rulid=report_date10] =>
[token="From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=3],
[sem="temporal/interval/month", token=_mm, token=__tt, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=_yy, sent=3]?,
[token="-"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=3],
[sem="temporal/interval/month", token=_m, token=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_y, sent=3] /;
#Ex: The Department of Santa Cruz [syn=clause, sem=reporting_authority, type=relationship, country=_c, health_agency=__h, rulid=reporting_authority1] =>! \! [token="the"|"The",token=__h], [token="department"|"Department", token=__h],[token="of", token=__h],[sem>="geoname", token=__h, token=_c]! /! [sem="haveverb"]?,! [sem="reporting_verbs"];
# Ex: Ministry of Health (MoH) of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority2] =>
  \ [sem="health_agency", token=__d, token=_h],
    [token="("], [syn="NN"]?, [token=")"],
    [token="of"|"in", token=__d],
    [token="the", token=__d]?,
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: Ministry of Health of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority3] =>
  \ [sem="health_agency", token=__d, token=_h],
    [token="of"|"in", token=__d],
    [token="the", token=__d]?,
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The Afghan Ministry of Public Health
[syn=clause, sem=reporting_authority, type=relationship, key=__d, nationality=_n, health_agency=_h, rulid=reporting_authority4] =>
  \ [syn=DT, token=__d],
    [orth=capitalized, token=_n, token=__d],
    [sem="health_agency", token=__d, token=_h]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: Ministry of Health (MoH) of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority5] =>
  \ [sem="health_agency", token=__d, token=_h],
    [token="("]?, [syn="NN"]?, [token=")"]?,
    [token=",", token=__d], [token="the", token=__d]?,
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The Human Services Department, Public Health Division of the Government of Victoria
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=__h, rulid=reporting_authority6] =>
  \ [syn=DT, token=__d, sent<=4],
    [orth=capitalized, token=__h, token=__d],
    [orth=capitalized, token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [token=","|"in"|"and"|"of", token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [token="of"|"in"|",", token=__d],
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The Ministry of Health and Public Health Division, Bangladesh
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=__h, rulid=reporting_authority7] =>
  \ [syn=DT, token=__d, sent<=4],
    [sem="health_agency", token=__d, token=__h],
    [token="and", token=__d, token=__h],
    [token=__h, token=__d],
    [token=__h, token=__d],
    [token=__h, token=__d]?,
    [token=__h, token=__d]?,
    [token="of"|"in"|",", token=__d],
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: Ministry of Health (MoH) of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, country=_c, health_agency=_h, rulid=reporting_authority8] =>
  \ [sem="health_agency", token=_h] /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The ministry of health has reported
[syn=clause, sem=reporting_authority, type=relationship, health_authority=_h, rulid=reporting_authority9] =>
  \ [sem>="geoname", token=_h] /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: a total of 167 230 cases have been confirmed in Egypt, of which 60 have been fatal
# Ex: 20 111 cases reported in Cambodia since 2005, 18 have been fatal.
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak1] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="cases"],
    [sem="haveverb"]?, [sem="beverb"]?, [sem="reporting_verbs"],
    [token="in"], [sem>="geoname"],
    [token="since"]?, [token~"19??"|"20??"]?,
    [token=","|"."]?, [token="of"]?, [token="which"|"these"]?,
    [syn=CD, token=_d],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 20 cases reported in Cambodia since 2005, 18 have been fatal.
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak2] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="cases"],
    [sem="haveverb"]?, [sem="beverb"]?, [sem="reporting_verbs"],
    [token="in"], [sem>="geoname"],
    [token="since"]?, [token~"19??"|"20??"]?,
    [token=","|"."]?, [token="of"]?, [token="which"|"these"]?,
    [syn=CD, token=_d],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak3] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [token=","]?, [token="("]?,
    [token="and"|"including"|"of"|"with"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [token=")"]
  /;
# Ex: 826 cases of dengue (3 deaths)
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak4] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [token=","]?, [token="("]?,
    [token="and"|"including"|"of"|"with"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [token=")"]
  /;
# Ex: the total in Azerbaijan to 8
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak5] =>
  [token="total"], [token="in"]?, [sem="geoname/COUNTRY"], [token="to"]
  \ [syn=CD, token=_n],
    [token="."|","],
    [syn=CD, token=_d],
    [token="of"], [token="these"], [token="cases"],
    [sem="beverb"],
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 13 100 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, disease_name=__dn, disease_code=__s, dead_cases=_answer, year=_year, rulid=reporting_outbreak6] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"],
    [token=__dn]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token="and"|"including"],
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 13 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, disease_name=__dn, disease_code=__s, dead_cases=_answer, year=_year, rulid=reporting_outbreak7] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"],
    [token=__dn]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token="and"|"including"],
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 27 confirmed cases with 13 deaths have been reported
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak8] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="with"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: 27 confirmed cases with 13 deaths have been reported
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak9] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="with"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: total of 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak10] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,2},
    [sem=disease_type]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token=","]?, [token="of"]?, [token="which"]?,
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: total of 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak11] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,2},
    [sem=disease_type]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token=","]?, [token="of"]?, [token="which"]?,
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: a total of 132 cases, including 37 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak12] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: total number of children affected to be 59. Of these, 52 have died
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak13] =>
  \ [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?,
    [orth=punct]?,
    [token="the"], [token="total"], [token="number"]?, [token="of"],
    [token="children"|"people"|"human"]?,
    [token="affected"]?, [token="to"]?, [sem="beverb"]?,
    [syn=CD, token=_n],
    [token="cases"]?,
    [token=","|"."]?, [token="of"|"Of"]?, [token="which"|"these"]?, [token=","]?,
    [syn=CD, token=_d],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 2 249 cases of dengue fever including 6 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak14] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 2 249 cases of dengue fever including 6 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak15] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 291 suspected cases of Rift Valley fever, including 64 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak16] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [orth=capitalized]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 291 suspected cases of Rift Valley fever, including 64 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak17] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [orth=capitalized]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case1_class=_c, case2_class=_c1, case3_class=_c2, number_of_deaths=_d, dead_cases=_answer, number_of_cases2=_n2, number_of_cases3=_n3, rulid=reporting_outbreak18] =>
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [orth=punct]?,
    [token="("],
    [token=_n2],
    [token="by"]?,
    [sem="case_classification", classification=_c1],
    [token=","|"and"],
    [token=_n3],
    [token="by"]?,
    [sem="case_classification", classification=_c2],
    [token=")"],
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 30 confirmed cases (14 laboratory and 16 epidemiologically linked)
[syn=clause, sem=outbreak_event, type=event, number_of_cases1=__n, case_class1=_c, case_class2=_a, case_class3=_b, number_of_deaths=_d, dead_cases=_answer, number_of_cases2=_n2, number_of_cases3=_n3, rulid=reporting_outbreak19] =>
  \ [syn=CD, token=__n],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [orth=punct]?,
    [token="("],
    [token=_n2],
    [token="by"]?,
    [sem="case_classification", classification=_a],
    [token=","|"and"],
    [token=_n3],
    [token="by"]?,
    [sem="case_classification", classification=_b],
    [token=")"],
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: has today reported 39 new probable SARS cases and 4 new deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak20] =>
  [sem="haveverb"]?, [sem="beverb"]?, [token="today"]?
  \ [sem="reporting_verbs"],
    [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token~"^[a-z]+$"]?,
    [token="cases"],
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"|"including"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 130 confirmed cases of avian influenza including 31 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, disease_name=_dn, year=_year, rulid=reporting_outbreak21] =>
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token=_dn]?,
    [sem=disease_type]?,
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: 130 confirmed cases of avian influenza A(H7N9) virus including 31 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, disease_name=_dn, rulid=reporting_outbreak22] =>
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", token=_c]?,
    [token="cases"],
    [token="of"]?,
    [token=_dn]?,
    [sem=disease_type]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token=","|"and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?
  /;
# Ex: reported five new human cases of avian influenza
[syn=clause, sem=outbreak_event, type=event, number_of_cases=__n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak23] =>
  [sem="haveverb"]?, [sem="beverb"]?, [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=__n], [syn=CD, token=__n]?,
    [sem="case_classification", classification=_c]?,
    [token="new"]?,
    [token="human"]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[a-z]+$"]{0,2},
    [orth=capitalized]{0,2},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 84 cases of acute flaccid paralysis and 85 deaths have been reported
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak24] =>
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [sem=disease_condition]?, [token=__o]{0,2}, [sem=disease_condition]?,
    [token="and"|"including"]?,
    [syn=CD, token=_d],
    [sem="fatal_cases", dead=_answer],
    [sem="haveverb"]?, [sem="beverb"]?, [token="today"]?,
    [sem="reporting_verbs"]
  /;
# Ex: 25 suspect cases including 15 deaths were identified
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak25] =>
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [sem=disease_condition]?, [token=__o]{0,2}, [sem=disease_condition]?,
    [token="and"|"including"]?,
    [syn=CD, token=_d],
    [sem="fatal_cases", dead=_answer],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="reporting_verbs"]
  /;
# Ex: Twenty-five suspect cases including 15 deaths were identified
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak26] =>
  \ [syn=CD, orth=caphyphenated, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[a-z]+$"]{0,3},
    [token="and"|"including"]?,
    [syn=CD, token=_d],
    [sem="fatal_cases", dead=_answer],
    [sem="beverb"],
    [token="today"]?,
    [sem="reporting_verbs"]?
  /;
# Ex: has today reported 39 new probable SARS cases and 4 new deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak27] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]?,
    [token="cases"],
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 39 new probable SARS cases and 4 new deaths has been reported
[syn=clause, sem=outbreak_event, type=event, number_of_cases=__n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak28] =>
  \ [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?,
    [orth=punct]?,
    [syn=CD, token=__n], [syn=CD, token=__n]?,
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]?,
    [token="cases"],
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer],
    [sem="haveverb"], [token="been"]?, [sem="reporting_verbs"]
  /;
# Ex: has reported an outbreak of diarrhoeal diseases of 6 691 cases, including 3 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak29] =>
  [sem="haveverb"]?, [token="today"]?, [sem="reporting_verbs"]?,
  [token="an"], [token="outbreak"], [token="of"],
  [token~"^[A-Z]+$"|"^[a-z]+$"]?,
  [sem="disease_type"]?,
  [token="of"]
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="cases"],
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: Eight cases have been laboratory confirmed, of which 2 have died.
[syn=clause, sem=outbreak_event, type=event, number_of_cases=__n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak30] =>
  \ [syn=CD, token=__n], [syn=CD, token=__n]?,
    [token="new"|"additional"]?,
    [token="cases"],
    [sem="haveverb"]?, [token="been"]?,
    [sem="case_classification", token=_c],
    [token=","]?, [token="of"]?,
    [token="and"|"including"|"which"]?,
    [syn=CD, token=_d]?,
    [sem="haveverb"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 8 laboratory-confirmed cases and 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak31] =>
  \ [syn=CD, token=_n, sent<=3],
    [token="new"|"additional"]?,
    [sem="case_classification", token=_c]?,
    [token="cases"],
    [token="and"|","]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: H1N1 infection
[syn=np, sem=Disease_code, key=__s, rulid=disease_code1] =>
  \ [orth="other", token=__s] /
  [sem="disease_type"];
# Ex: A(H1N1)
[syn=np, sem=Disease_code, key1=__s, rulid=disease_code2] =>
  [sem!="symptoms"]
  \ [orth=uppercase, token=__s]?,
    [token="("],
    [token=__s],
    [token=")"]
  /
  [token="disease"|"fever"|"outbreak"|"outbreaks"|"syndrome"|"influenza"|"illness"|"epidemic"|"virus"|"infection"|"infections"|"*virus"|"intoxication"];
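Note: rule disease_code2 shows the use of right context: the bracketed code (e.g. "A(H1N1)") is consumed between \ and /, while the disease word after it must be present but is left for other rules to consume. A rough Python analogue (an approximation only, not the CAFETIERE matcher) achieves the same effect with a lookahead:

    import re

    # The code is captured; the disease word after it is checked by a
    # lookahead and therefore not consumed, mirroring the / right context.
    code = re.compile(
        r"[A-Z]*\((?P<code>[^)]+)\)"
        r"(?=\s*(?:disease|fever|outbreak|outbreaks|syndrome|influenza|"
        r"illness|epidemic|virus|infection|infections|intoxication))"
    )

    m = code.search("avian influenza A(H7N9) virus")
    print(m.group(0), m.group("code"))  # A(H7N9) H7N9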
Appendix B: Gazetteer entries

JANUARY:instance=january,class=temporal/interval/month,monthno=01
FEBRUARY:instance=february,class=temporal/interval/month,monthno=02
MARCH:instance=march,class=temporal/interval/month,monthno=03
APRIL:instance=april,class=temporal/interval/month,monthno=04
MAY:instance=may,class=temporal/interval/month,monthno=05
JUNE:instance=june,class=temporal/interval/month,monthno=06
JULY:instance=july,class=temporal/interval/month,monthno=07
AUGUST:instance=august,class=temporal/interval/month,monthno=08
SEPTEMBER:instance=september,class=temporal/interval/month,monthno=09
OCTOBER:instance=october,class=temporal/interval/month,monthno=10
NOVEMBER:instance=november,class=temporal/interval/month,monthno=11
DECEMBER:instance=december,class=temporal/interval/month,monthno=12
Anthrax:instance=anthrax,class=disease
botulism:instance=botulism,class=disease
Buffalopox:instance=buffalopox,class=disease
Chikungunya:instance=chikungunya,class=disease
Cholera:instance=cholera,class=disease
Coccidioidomycosis:instance=coccidioidomycosis,class=disease
E.coli O157:instance=e.coli O157,class=disease
Hepatitis E:instance=hepatitis e,class=disease
Japanese Encephalitis:instance=japanese encephalitis,class=disease
Legionellosis:instance=legionellosis,class=disease
Leishmaniasis:instance=leishmaniasis,class=disease
Leptospirosis:instance=leptospirosis,class=disease
Listeria:instance=listeria,class=disease
Louseborne typhus:instance=louseborne typhus,class=disease
Malaria:instance=malaria,class=disease
Measles:instance=measles,class=disease
Monkeypox:instance=monkeypox,class=disease
Pertussis:instance=pertussis,class=disease
Plague:instance=plague,class=disease
Rabies:instance=rabies,class=disease
Tularemia:instance=tularemia,class=disease
flaccid:instance=flaccid,class=disease
poliovirus:instance=poliovirus,class=disease
hand, foot and mouth disease:instance=hand, foot and mouth disease,class=disease
acute:instance=acute,class=disease_condition
Acute:instance=acute,class=disease_condition
paralysis:instance=paralysis,class=disease_condition
haemorrhagic:instance=haemorrhagic,class=disease_condition
watery diarrhoeal:instance=watery diarrhoeal,class=disease_condition
wild:instance=wild,class=disease_condition
Wild:instance=wild,class=disease_condition
influenza:instance=influenza,class=disease_type
Virus:instance=virus,class=disease_type
intoxication:instance=intoxication,class=disease_type
infections:instance=infections,class=disease_type
epidemic:instance=epidemic,class=disease_type
Outbreaks:instance=outbreaks,class=disease_type
disease:instance=disease,class=disease_type
diseases:instance=diseases,class=disease_type
illness:instance=illness,class=disease_type
fever:instance=fever,class=disease_type
syndrome:instance=syndrome,class=disease_type
report:instance=report,class=reporting_verbs
reports:instance=reports,class=reporting_verbs
identified:instance=identified,class=reporting_verbs
confirmed:instance=confirmed,class=reporting_verbs
reported:instance=reported,class=reporting_verbs
occurred:instance=occurred,class=reporting_verbs
occurring:instance=occurring,class=reporting_verbs
affecting:instance=affecting,class=reporting_verbs
Reported:instance=reported,class=reporting_verbs
notified:instance=notified,class=reporting_verbs
announced:instance=announced,class=reporting_verbs
informed:instance=informed,class=reporting_verbs
fatal:instance=fatal,class=fatal_cases,dead=no
deaths:instance=deaths,class=fatal_cases,dead=yes
death:instance=death,class=fatal_cases,dead=yes
died:instance=died,class=fatal_cases,dead=yes
fatal cases:instance=fatal cases,class=fatal_cases,dead=no
province:instance=province,class=areas
Province:instance=province,class=areas
Provinces:instance=provinces,class=areas
provinces:instance=provinces,class=areas
regions:instance=regions,class=areas
region:instance=region,class=areas
Region:instance=region,class=areas
cities:instance=cities,class=areas
city:instance=city,class=areas
City:instance=city,class=areas
state:instance=state,class=areas
states:instance=states,class=areas
State:instance=state,class=areas
District:instance=district,class=areas
Districts:instance=districts,class=areas
district:instance=district,class=areas
districts:instance=districts,class=areas
male:instance=male,class=people
female:instance=female,class=people
child:instance=child,class=people
man:instance=man,class=people
woman:instance=woman,class=people
Ministry of Health:instance=ministry of health,class=health_agency
The National Health and Family Planning Commission:instance=The National Health,class=health_agency
The Ministry of Health and Population:instance=The Ministry of Health and Population,class=health_agency
MoH:instance=ministry of health,class=health_agency
Ministry of Public Health:instance=ministry of public health,class=health_agency
suspect:instance=suspect,class=case_classification,classification=suspected
confirmed:instance=confirmed,class=case_classification,classification=confirmed
suspected:instance=suspected,class=case_classification,classification=suspected
probable:instance=probable,class=case_classification,classification=probable
epidemiologically linked:instance=epidemiologically linked,class=case_classification,classification=confirmed
laboratory-confirmed:instance=laboratory confirmed,class=case_classification,classification=laboratory-confirmed
epidemiologically-linked:instance=epidemiologically-linked,class=case_classification,classification=epidemiologically-linked
laboratory confirmed:instance=laboratory confirmed,class=case_classification,classification=laboratory confirmed
laboratory testing:instance=laboratory testing,class=case_classification,classification=confirmed
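Note: every entry above follows the same line format: a surface form, a colon, then comma-separated feature=value pairs, where class carries the semantic label tested by the rules (e.g. sem="fatal_cases"). The following Python fragment is a minimal sketch of a reader for this format; it is deliberately naive and would mis-split the few instance values that themselves contain commas, such as "hand, foot and mouth disease".

    # Minimal sketch: parse one gazetteer line into a key and a feature dict.
    def parse_entry(line: str) -> tuple[str, dict]:
        key, _, feats = line.partition(":")
        features = dict(f.split("=", 1) for f in feats.split(","))
        return key.strip(), features

    key, feats = parse_entry(
        "JANUARY:instance=january,class=temporal/interval/month,monthno=01")
    print(key, feats["class"], feats["monthno"])
    # JANUARY temporal/interval/month 01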