Information Extraction
A dissertation submitted to The University of Manchester for the degree of Master of
Science in the Faculty of Engineering and Physical Sciences
2013
Wafa Al Showaib
School of Computer Science
Table of Contents
Abstract 5
Declaration 6
COPYRIGHT STATEMENT 7
ACKNOWLEDGEMENTS 8
1. Introduction 9
1.1. Aim of the Project 9
1.2. Project Objectives 9
1.3. Report Structure 10
2. Background 12
2.1. Information Extraction 12
2.2. Defining Information Extraction 12
2.3. IE and Other Technologies 14
2.4. Brief History 15
2.5. Challenges of IE 16
2.6. Information Extraction Tasks 17
2.7. IE System Evaluation Forums 18
2.8. Evaluation of IE Systems 21
2.9. IE Overall Process 24
2.10. Information Extraction System Design Approach 28
2.11. Examples of IE Systems 32
2.12. IE System Performance 35
2.13. Summary 36
3. CAFETIERE System 37
3.1. CAFETIERE 37
3.2. System Components 37
3.3. Notation of the Rules 39
3.4. Exploiting the rule notation 40
3.5. Gazetteer 42
3.6. Summary 43
4. Requirement Analysis 44
4.1. Domain Selection 44
4.2. Motivation for Domain Selection 44
4.3. Structure of the text 47
4.4. Entities, relationships and events identification 48
4.5. Project Methodology 50
4.6. Development Methodology 50
4.7. Summary 51
5. Design, Implementation and Testing 52
5.1. General design issues 52
5.2. Entity extraction 53
5.3. Rules for extracting the publishing date 53
5.4. Rules for extracting the announcement date 54
5.5. Rules for extracting country name 56
5.6. Rules for extracting the name of an outbreak 57
5.7. Rules for extracting affected cities and provinces 60
5.8. Relationship extraction 63
5.9. Rules for extracting the name of the reporting health authority 63
5.10. Events extraction 65
5.11. Rules for extracting an outbreak event 65
5.12. Rules for extracting the total number of cases and deaths 70
5.13. Discussion 71
6. System Evaluation 74
6.1. System evaluation metrics 74
6.2. System evaluation process 75
6.3. Results analysis 78
7. Conclusion 84
References 86
Appendix A: Extraction rules 89
Appendix B: Gazetteer entries 114
Word Count: 22,867
List of Figures Page
Figure 2.1: IE position within information retrieval and text understanding 15
Figure 2.2: Confusion matrix 23
Figure 2.3: Information extraction overall process 25
Figure 3.1: CAFETIERE overall analysis and query model 37
Figure 4.1: The development life-cycle model for information extraction 51
Figure 5.1: Announcement date extraction 55
Figure 5.2: Extracting groups of outbreak locations 61
Figure 5.3: Reporting health authority extraction 65
Figure 6.1: Manual annotation 76
Figure 6.2: System annotation 77
List of Tables Page
Table 2.1: Summary of MUC topics from 1987 to 1997 19
Table 2.2: Decision factors for IE approaches 32
Table 2.3: Top scoring in MUC-7 36
Table 3.1: Examples of the values that can be assigned to syn feature 40
Table 4.1: Named entities in outbreak reports for IE 49
Table 6.1: Breakdown of the counting results of the training corpus 78
Table 6.2: Breakdown of the evaluation metrics of the training corpus 79
Table 6.3: Breakdown of the counting results of the test corpus 81
Table 6.4: Breakdown of the evaluation metrics of the test corpus 81
Table 6.5: Number of occurrences of each entity type 82
Abstract

Information extraction (IE) is a technology that facilitates the movement of data from their
initial manifestation in natural texts into structured representation, usually in the form of
databases, to facilitate their use in further analysis. IE systems serve as the front end and
core stage in different natural language processing tasks. Although IE is a relatively new
area, the field has witnessed rapid development; this report gives a brief review of IE
history and the IE-focused conferences that have influenced its growth over the last two
decades. The overall system structure is also detailed. Two approaches will be discussed
and compared: the knowledge engineering approach, and the automatic training
approach.
The practical work in this project was based on the knowledge engineering approach, also
known as the rule-based approach. The extraction system used was CAFETIERE, which was designed by The National Centre for Text Mining. As IE has proved its efficiency in
domain-specific tasks, this project focused on one domain: disease outbreak reports. Several reports from the World Health Organization were carefully examined to formulate
the extraction tasks: named-entities, such as disease name, date and location; the location
of the reporting authority; and the outbreak incident. Extraction rules were then designed, based on a study of the textual expressions and elements found in the text that appeared
before and after the target text.
The experiment resulted in very high performance scores across all the tasks. The training corpora and the testing corpora were tested separately. The system performed with higher accuracy on entity and event extraction than on relationship extraction.
It can be concluded that the rule-based approach has been proven capable of delivering
reliable IE, with extremely high accuracy and coverage results. However, this approach
requires an extensive, time-consuming, manual study of word classes and phrases.
Declaration

No portion of the work referred to in the dissertation has been submitted in support of an
application for another degree or qualification of this or any other university or other
institute of learning.
COPYRIGHT STATEMENT

i. The author of this dissertation (including any appendices and/or schedules to this
dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he has
given The University of Manchester certain rights to use such Copyright, including for
administrative purposes.
ii. Copies of this dissertation, either in full or in extracts and whether in hard or electronic
copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance
with licensing agreements which the University has entered into. This page must form part of any such copies made.
iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the dissertation, for example graphs and tables (“Reproductions”), which may be described in this dissertation,
may not be owned by the author and may be owned by third parties. Such Intellectual
Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or
Reproductions.
iv. Further information on the conditions under which disclosure, publication and
commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see
http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library’s
regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The
University’s Guidance for the Presentation of Dissertations.
ACKNOWLEDGEMENTS

I would like to express my sincerest appreciation to my supervisor, Mr. Jock McNaught, for
his invaluable support and contribution. I would like to extend special thanks to KACST for
this great scholarship opportunity, which allowed me to be here to pursue this great and
exciting challenge.
I dedicate this work to my parents for their continued support during my studies.
1. Introduction

Over the last two decades, the World Wide Web (Web) has played a key role in the rapid
proliferation of information that is available to humans. Digital information is available in a
myriad of forms in different locations on the internet and intranet. A significant amount of
data is available in the form of news, blogs, annual reports and social media. This has
resulted in a growing need for effective techniques to analyse and manipulate natural text data in order to uncover relevant and valuable knowledge. One obvious solution is to convert free text into tabular form (databases) so that it can be used in computerized systems. This need has resulted in the emergence of information
extraction (IE) technologies.
IE is one of the natural language processing (NLP) techniques, and is the process of
extracting structured information from semi-structured or unstructured documents. It is an
emerging technology that is used to tackle the problem of information that is growing very
quickly, while the development of automated NLP techniques is relatively slower. It is
commonly the first process in text mining, in which a collection of documents representing the area of interest is skimmed to extract relevant pieces of information using a pre-defined set of rules applied to specific texts.
1.1. Aim of the Project
The main aim of this project is to gain a deeper understanding of current IE techniques and principles, and more specifically of how a rule-based IE system is built to retrieve specific data from unstructured (natural language) texts and form a structured representation.
1.2. Project Objectives
To achieve the aim of this project, it is necessary to:

• Study the ‘state of the art’ in the IE field, and in particular identify the research efforts that have been made thus far. Reviewing the history of IE will provide a clear understanding of IE, which is necessary so as not to confuse it with other NLP techniques, such as information retrieval.

• Choose the extraction domain.

• Explore and learn the extraction system formalism, i.e., the development of extraction rules for extracting relevant information from a given text.

• Design, implement and test the extraction rules.

• Evaluate the overall performance of the extraction rules by calculating recall, precision and F-measure.
Limitations of this project include:
1. The researcher will play two simultaneous roles, as both rule developer (which includes requirements analysis and design) and tester.

2. The text source to be examined is drawn from natural texts (unstructured), written in the English language.

3. The time dedicated to this project is seven months in total; for the first four months the project will be pursued part-time, and for the remaining three, full-time.

4. Because we are focusing on extracting information from natural texts, only specific information, i.e. entities and relationships, will be extracted from a text.
There are two main deliverables of this project, in addition to the dissertation report itself: a domain-specific gazetteer and a set of extraction rules.
1.3. Report Structure
This report is organised as follows:
Chapter 2, Background: The chapter starts with providing a clear definition for IE,
then distinguishing it from various NLP tasks, and identifying where it is positioned
among them. A brief history of IE will follow, which will include the influence of
Message Understanding Conferences (MUCs) and automatic extraction forums. An
important part of this chapter is dedicated to a discussion of the general architecture of
an IE system and the two approaches that have been followed in almost all of
these. This chapter will conclude with a discussion of the evaluation methods and
metrics, and finally, a brief discussion of how the performance of an IE system could
be measured.
Chapter 3, CAFETIERE System: This chapter discusses the software components
of the IE system adapted for this project (CAFETIERE system).
Chapter 4, Requirement Analysis: The aim of this chapter is to describe the
domain focus and how the corpus will be gathered for analysis. A preliminary
analysis of some texts is presented. The entities and relationships that we are looking for are listed at the end of the chapter. Finally, an overview of the methodology adopted for this project is given.
Chapter 5, Design, Implementation and Testing: This chapter gives an overview
of the process that was followed in designing the extraction rules for all entities,
relationships and events. The primary textual patterns that influenced the final
extraction decisions are discussed, in addition to a number of examples that
demonstrate the final analysis findings.
Chapter 6, Evaluation: The evaluation metrics that were used to assess the
system performance are discussed, in addition to some basic definitions that were
considered while validating the extraction outputs. A demonstration of how each
report was evaluated is also shown. Finally, precision, recall and F-measure were
calculated for both training and testing sets to conclude the main findings of this
project.
Chapter 7, Conclusion: This chapter concludes the project. It also provides
several suggestions for future research proposals.
2. Background

This chapter gives an overview of the basics of IE, beginning with a definition of IE and
stating the position of IE within various NLP technologies. A brief history and a discussion
of previous IE research leads into the two main approaches to building an IE system. The evaluation of an IE system and the measurement of its performance are also discussed in this chapter.
2.1. Information Extraction

With the tremendous amount of data that accumulates on the web every second, the
need for automatic technologies that read, analyse, classify and populate data has grown. Humans cannot read and memorise megabytes of data on a daily basis. As a result, historical, archival information risks being lost or discarded. Information that currently seems to have no value may prove valuable for future needs. Information also runs the risk of being overlooked or missed because it was not presented in a specific manner or was surrounded by additional misleading data.
Lost opportunities and limited human abilities have spurred researchers to explore
and create strategies to manage this text ‘wilderness’. In recent decades, researchers have mainly worked on natural language techniques. Since human language is complex and follows many different writing styles, Natural Language Processing (NLP) technologies cannot be classified under one domain only.
Different stages of processing comprise the NLP field, and each stage is a unique
science and field of research. IE systems serve as the front-end and core stage in
different NLP techniques.
2.2. Defining Information Extraction

In the literature, different researchers give different descriptions of the term
‘Information Extraction’ (IE). One of the oldest definitions was proposed by Cowie
and Lehnert (1996), who define it as any process that extracts relevant information from a given text and then pieces the extracted information together in a coherent structure. De Sitter et al. (2004) suggest that IE can take a different definition according to the purpose of the system:
One best per document approach: the IE system is a system for filling one template structure per document;

All occurrences approach: the task of the IE system is to find every occurrence of a certain item in a document.
However, De Sitter’s definition lacks the part about recognizing relationships and
facts. Moens (2006) suggests a very comprehensive definition:
“Information extraction is the identification, and consequent or concurrent classification and structuring into semantic classes, of specific information found in unstructured data sources, such as natural language text, providing additional aids to access and interpret the unstructured data by information systems.” (Moens, 2006)
It seems that in recent IE manuscripts, researchers partially agree with similar
descriptions. Saggion et al. (2010) described an IE system as a technology for extracting snippets of data from natural language. A similar definition is provided by
Ling (2010), who stated that IE is the problem of distilling relational data from
unstructured texts. Acharya and Parija (2010) suggested another definition, which is
to reduce the size of text to a tabular form by identifying only subsets of instances of
a specific class of relationships or events from a natural language document, and
the extraction of the arguments related to the event or relationship.
Before continuing with the discussion in this report, it seems essential to state the definition that has been adopted for this project. We agree with Moens’ (2006)
definition that additional aids are needed to access the primary data, and also with De Sitter et al. in that IE can take more than one definition depending on the aim of the system.
The definition that seems most comprehensive for this project is that IE is the process of extracting predefined entities from natural texts and identifying the relationships between those entities, converting them into accessible formats that can be used later in further applications, with the help of evidence deduced from particular words in the text or from the context.
2.3. IE and Other Technologies

In most cases, IE is not a final goal, but rather assists in going forward with other natural text processing tasks such as information retrieval, text understanding and
data mining (Moens, 2006). The aim of this section is to define the differences
between IE and information retrieval (IR), and between IE and text understanding.
2.3.1. Information Retrieval

For the purpose of this project, it seems suitable to distinguish between IR and IE, as the two are often confused with one another. Information retrieval focuses
on collecting documents with relevant text from a set of articles available in
newspapers, journals and the Web. IE starts with texts collected in advance and then digests them into a more readily analysable form. It discards irrelevant fragments, keeps only the relevant information from the text, and then gathers the targeted information in a comprehensible framework.
When compared with IR, IE has some advantages and disadvantages. IE
tasks are generally more difficult and the systems themselves are more knowledge-intensive, relying on resources such as built-in dictionaries. From a performance point
of view, IE consumes more computation power than IR. However, in the case
of a large volume of text, IE is potentially more efficient than IR because of
its ability to summarise the text in a dramatically short time when compared
to the time spent by people to read the same information. Also, where results
need to be translated instantly into other languages, IE conversions are
simpler and more straightforward compared to IR, where whole retrieved
documents must be provided with full translation tasks (Cunningham, 2006).
2.3.2. Text Understanding

IE is a very important milestone towards real text understanding (Moens,
2006). A similar view is given by Riloff (1999), who sees that extracted data
can be very useful to represent more complex semantic classes. IE is usually
the first and cheapest task in overall text understanding tasks. Appelt et al.
(1993) distinguish between IE and text understanding. According to these
researchers, in text understanding:
• The main goal of the system is to make sense of the entire text.
• The target text must reflect the full complexity of a language.
• One should be able to recognise the nuances of meaning and the author’s aim.
While in IE:
• Only part of the text is relevant and accommodates the final goal; for
example, in MUC-4, in analysing terrorist reports, only 10 per cent of
the text was relevant.
• The final representation of the data is predefined, in most cases a
database.
Figure 2.1: IE position within information retrieval and text understanding
Finally, in the spectrum of natural language processing, it can be said that IE is situated midway between information retrieval and text understanding (Appelt, 1999).
2.4. Brief History

The idea of IE can be traced back to 1964, when papers appeared that discussed attempts to fill templates by searching pieces of text, but these ideas depended purely on human work and were not combined with any computational power (Wilks, 1997). The earliest practical work in the field is that of
Sager in the 1970s, conducted at New York University in the medical field.
Basically, Sager extracted patients’ information to fill in forms that could be used as
inputs to a traditional Conference on Data Systems Languages (CODASYL)
database (Cowie and Lehnert, 1996). Although the work was based on handcrafted
structures and a limited number of techniques, it was highly effective (Wilks, 1997).
Subsequently, in the late 1970s, De Jong at Yale University developed the Fast
Reading, Understanding and Memory Program (FRUMP), which was one of the
earliest artificial intelligence systems (Jong, 1977). FRUMP was designed to work
with unstructured texts by skimming newspaper stories using a computer to fill
predefined slots in structures (Cowie and Wilks, 2000). The goal was to find the
important points of the news stories without reading them in detail (Wilks, 1997).
This work can be considered the basis of many IE systems that emerged in the
1980s, such as TRANS (Wilks, 1997).
In the mid-1980s, the Carnegie Group developed one of the earliest commercial IE
systems called JASPER. As in the case of the early systems, it relied on a high
degree of manual intervention, mainly using very complicated templates generated
by analysts and with complicated extraction tasks. As with earlier attempts,
JASPER had limited access to lexicons and dictionary resources and with no
learning algorithms. JASPER was designed for Reuters, and unlike FRUMP,
JASPER was benchmarked and seriously evaluated for its performance (Wilks,
1997). However, the movement in IE was motivated by a growing trend towards
more practical approaches, using computational power to manipulate large volumes of text and depending less on handcrafted linguistic templates.
2.5. Challenges of IE

The challenges of IE can be summarised by the following three areas:
1 - IE is a domain-specific task. The target types of objects and facts relate to
a specific domain; for example, the tasks of extracting information about
products, companies and release dates are different from those for another domain, for example, natural disasters.
2 - Accuracy: The primary challenge facing this research area is the design
of an extraction model that would achieve a high level of accuracy in the
execution of extraction tasks.
3 - Running time: it is necessary to consider the cost of the processing steps through which the selected text must pass.
Apart from these challenges, IE is an interesting field of research due to the
following reasons (Cowie and Lehnert, 1996):
• Tasks of IE are clear and well-defined.
• IE applies to real-world texts.
• IE poses challenging NLP problems.
• The performance of IE can be compared to the human benchmark for the same
task.
2.6. Information Extraction Tasks

The prime goal of IE has been divided into several tasks. The tasks are of increasing difficulty, starting with identifying names in natural texts and then moving on to finding relationships and events.
2.6.1. Named Entity Recognition
The term named entity recognition (NER) was first introduced in MUC-6
(Grishman and Sundheim, 1996). A key element of any extraction system is
to identify the occurrence of specific entities to be extracted. It is the simplest
and most reliable IE subtask (Cunningham, 2006). Entities typically are noun
elements that can be found within text, and they usually consist of one to a
few words. In early work in the field, more specifically at the beginning of the
MUC and Automatic Content Extraction (ACE) competitions, the most
common entities were named entities, such as names of persons, locations,
companies and organizations, numeric expressions, e.g. $1 million, and
absolute temporal terms, e.g. September 2001. Now, named entities have
been expanded to include other generic names, such as names of diseases,
proteins, article titles and journals. More than 100 entity types have been
introduced in the ACE competition for named entity and relationship
extraction from natural language documents (Sarawagi, 2007).
The NER task not only focuses on detecting names, but it can also include
descriptive properties from the text about the extracted entities. For instance,
in the case of person names, it can extract the title, age, nationality, gender,
position and any other related attributes (Esparcia et al., 2010).
There is now a wide range of systems designed for NER, such as the
Stanford Named Entity Recognizer 1. Regarding the performance of these subsystems, accuracy has reached 95 per cent. However, this accuracy only applies to domain-dependent systems; to use a system for extracting entities of other types, changes must be made (Cunningham, 2006).

1 Stanford Named Entity Recognizer website: http://nlp.stanford.edu/software/CRF-NER.shtml [Last Accessed: 28 April 2013]
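To make the NER subtask concrete, the following is a minimal sketch of a handcrafted recogniser for two of the entity classes mentioned above (monetary and temporal expressions). The patterns, function name and entity labels are illustrative assumptions, not part of any system discussed in this report; a real NER system would use much richer patterns, gazetteers or statistical models.

```python
import re

# Hypothetical patterns for two entity classes discussed above.
# Real systems would use far richer patterns or trained models.
PATTERNS = {
    "MONEY": re.compile(r"\$\d+(?:\.\d+)?(?:\s+(?:million|billion))?"),
    "DATE": re.compile(r"(?:January|February|March|April|May|June|July|"
                       r"August|September|October|November|December)\s+\d{4}"),
}

def recognise_entities(text):
    """Return (entity_type, matched_text, start, end) tuples, sorted by offset."""
    entities = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((entity_type, match.group(), match.start(), match.end()))
    return sorted(entities, key=lambda entity: entity[2])

print(recognise_entities("The firm raised $1 million in September 2001."))
# [('MONEY', '$1 million', 16, 26), ('DATE', 'September 2001', 30, 44)]
```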
2.6.2. Relationship Extraction
Another task of the IE system is to identify the connecting properties of
entities. This can be done by annotating relationships that are usually
defined between two or more entities. An example of this is ‘is an employee of’, which describes the relationship between an employee and a company; ‘is
caused by’ is a relationship between an illness and a virus (Sarawagi, 2007).
Although the number of relations between entities that may be of interest can
generally be unlimited, in IE, they are fixed and previously defined, and this
is considered part of achieving a well-specified task (Piskorski and
Yangarber, 2012). The extraction of relations differs completely from entity
extraction. This is because entities are found in the text as sequences of
annotated words, whereas associations are expressed between two
separate snippets of data representing the target entities (Sarawagi, 2007).
2.6.3. Event Extraction
Extracting events in unstructured texts refers to identifying detailed
information about entities. These tasks require the extraction of several
named entities and the relationships between them. Mainly, events can be
detected by knowing who did what, when, for whom and where.
2.7. IE System Evaluation Forums

Specialized information extraction conferences have played a key role in the rapid development of IE systems and their underlying models and technologies. Following is a discussion of three main forums: the Message Understanding Conferences, Automatic Content Extraction and, finally, Knowledge Base Population.
2.7.1. Message Understanding Conference
The MUC series was initiated and funded by the Defense Advanced
Research Projects Agency (DARPA) (Cardie, 1997). The main goal of the
MUCs was to foster research in automating the extraction of information from
texts. However, one of the main outputs was the definition of evaluation standards for IE systems, for instance quantitative metrics such as precision and recall. In total there were seven conferences, with the first, MUC-1, taking place in 1987 and the last, MUC-7, in 1997.
Although they are called conferences, they are also widely known as
competitions, because research groups that wanted to attend the conferences were asked to submit their systems for evaluation in order to be accepted for participation. For each conference, participants were given
sample messages and a set of carefully defined instructions on the type of
the information to be extracted. At a later date, before the start of the conference, participants received a set of 100 previously unseen texts (Appelt, 1999) to run on their developed systems without making any
changes to them. The processed output was tested against a model answer
manually prepared by experts (Grishman and Sundheim, 1996). The
comparison was done on the basis of a scoring system that rated the output
summary of each system according to metrics of recall and precision
(Cardie, 1997). Table 2.1 gives a summary of the seven MUC events (Grishman and Sundheim, 1997; Grishman et al., 2002; Chinchor, 2001).
Table 2.1: Summary of MUC topics from 1987 to 1997

MUC   | Year | Text source                                    | Evaluation task
MUC-1 | 1987 | Naval reports                                  | Scoring system undefined
MUC-2 | 1989 | Naval reports                                  | Undefined, large training corpus
MUC-3 | 1991 | Newswire on terrorism in Latin America         | Semi-automated scoring program; recall and precision introduced
MUC-4 | 1992 | Newswire on terrorism in Latin America         | Semi-automated scoring program; further increase in template complexity in comparison to MUC-3
MUC-5 | 1993 | Joint ventures and microelectronic fabrication | Multilingual (English and Japanese)
MUC-6 | 1995 | Management succession                          | Named-entity recognition, co-reference, template element filling
MUC-7 | 1997 | Airplane crashes                               | Named-entity recognition, co-reference, template element filling, template relation filling
Table 2.1 shows that between the first MUC and the last, the number and nature of the tasks changed completely. In MUC-7 the tasks to be evaluated increased in both number and complexity. There were five tasks to be evaluated: named-entity recognition, co-reference resolution, template element filling, template relation filling and scenario template filling.
Before MUCs, templates were completely prepared by experts and were of
limited size for only limited texts. According to Cowie and Lehnert (1996), if those templates had been examined with the measurements followed in later MUC competitions, they would only have earned scores of between 60 and 80 per cent for overall system accuracy, which is far less than expected.
The MUC proceedings were considered very influential in fostering the
development of the field of IE systems. They were a significant reference
resource for understanding how to evaluate IE systems. In addition, these
conferences helped in understanding the current state-of-the-art (Appelt,
1999).
2.7.2. Automatic Content Extraction (ACE)

The next series of conferences was ACE, which aimed to cover texts from broader domains, such as general news stories, including politics and air accidents. ACE was an annual conference that started in 2003 and continued evaluating different systems until 2008. Later ACE events
included testing of multilingual systems for texts of languages other than
English, such as Chinese and Arabic. The evaluation of extraction systems
was done separately by testing their ability to correctly extract the required
information from each text (Grishman, 2012).
2.7.3. Knowledge Base Population
The Text Analysis Conference (TAC), funded by the National Institute of
Standards and Technology (NIST), was a series of workshops initiated in
2008 and held annually thereafter. The aim was to provide an evaluation
infrastructure for NLP technologies 1. Part of TAC is the Knowledge Base
Population (KBP) workshops. The aim of KBP was to motivate research in
the field of named entity extraction from large test collections of unstructured
texts. The organisers provided competitors with common procedures, and each system was tested on finding given names in articles from newswires and blog posts. This workshop raised questions about the context in which
the word occurs; for instance, ‘Apple’ is the name of a company and also the
name of a fruit. Other questions involve finding redundant words and
connecting properties. Most of these questions were not addressed in past events (Grishman, 2012).

1 Text Analysis Conference webpage: http://www.nist.gov/tac/ [Last Accessed: 12 April 2013]
2.8. Evaluation of IE Systems

The findings of the MUCs have demonstrated that the evaluation of IE systems is
rather challenging, even for trained professionals, when compared with other
systems that manage natural language.
In IE, the element of difficulty resides in the fact that there are no clear guidelines
on the correctness of the items extracted by an extraction algorithm. To be able to
give a precise declaration of the desired output class is a difficult task; for example,
for a straightforward entity class such as ‘Country’, results such as ‘Palestine’ and
‘Europe’ are considered countries in some contexts, and in others they are not
(Moens, 2006).
The annotation tasks for IE are very complex, and they require the development of
a formal guideline document that clearly describes the desired output, alongside
several examples to guide the annotation process. For example, when comparing
the work of two people on the same tasks (human annotation), the Linguistic Data Consortium (LDC) reported an agreement rate of 92.6 per cent for entities and 70.2 per cent for
much more complicated tasks, such as relationships (Ramshaw and Weischedel, 2005). These difficulties can clearly be seen in human inter-annotator agreements, where the expected rate is between 70 and 85 per cent, meaning that there is room for disagreement about correctness (De Sitter et al., 2004).
De Sitter et al. (2004) pointed out that the lack of standardisation is actually a result
of the following three main issues:
• There is ambiguity with regard to the definition of IE: is the goal to fill a template, or to find all the occurrences of an instance in a document?
• How is it decided whether an extracted item is correct or not? Since inter-annotator agreement on a standard IE task reaches only 70 to 85 per cent, there is evidently room for disagreement about correctness.
• What statistical metrics should be used to measure the effectiveness of an
IE system? Most researchers use F-measures, while others use recall and
precision measures.
Although the literature shows that there is no definitive framework for evaluating IE systems, some metrics have been used widely. For counting results, a confusion matrix is a typical device for evaluating an IE system. Figure 2.2 presents a confusion matrix. The evaluation is done first at the entity level.
2.2 presents a confusion matrix. The evaluation will be done first on the entity level.
If it is correctly classified, it is a true positive; otherwise, it is a false positive. The
number of entities that the system should have extracted but failed to is labelled as
false negative; the true negative is rarely used in IE evaluations (Moens, 2006). The
confusion matrix is a very useful tool to determine how successful the system is in
classifying data. Hence, measures such as recall and precision can easily be computed from the confusion matrix.
                          Predicted class
                          Yes                  No
Actual class    Yes       True Positive        False Negative
                No        False Positive       True Negative

Figure 2.2: Confusion matrix
Also, there are two well-known metrics that cannot be neglected. They are widely
accepted and adopted in the research community (De Sitter et al., 2004). These
measures are precision and recall.
Definition 1:

Recall = True Positive / (True Positive + False Negative)

Definition 2:

Precision = True Positive / (True Positive + False Positive)
The recall measure indicates the percentage of the items that should have been detected that the system actually extracted. If high recall is achieved, then almost all of the information that had to be extracted was indeed extracted. Precision indicates the percentage of the items produced by the system that are correct. If high precision is achieved, then almost all of
the extracted information is correct and there are few or no errors (Moens, 2006).
There is a clear trade-off between recall and precision and, in most cases there is
an inverse relationship. For the intended end-user, it is important to identify the aim of the application; in other words, whether it is more important to obtain high recall (avoiding false negatives) or high precision (avoiding false positives) (Appelt and Israel, 1999).
According to Sarawagi (2007), in most systems, achieving high precision is a much
easier task than high recall. This is mainly because mistakes can be easily detected
from the extraction output, so they can be corrected manually, and the model can
be tuned until there are no errors. However, achieving high recall is more
challenging, because it requires extensive annotation of data in the document; if this is not achieved, then the identification of missed data from a large unstructured corpus is relatively infeasible (Sarawagi, 2007).
Because of this unavoidable trade-off between recall and precision, an average
measure is usually reported, namely, the F-measure, which is a combination of both
recall and precision.
Definition 3:

F-measure = (2 × Recall × Precision) / (Recall + Precision)

Definition 4:

F_β-measure = ((1 + β²) × Recall × Precision) / (β² × Recall + Precision)

where β is a non-negative real number that indicates the relative importance of precision and recall; when β = 1, both are of equal importance (Moens, 2006). Thus, F_β is a weighted measure of both recall and precision (Han et al., 2012).
Although the F-measure is widely used to compare IE systems by means of a single measure, in most cases it is essential to know the
individual score values of recall and precision to be able to fully compare one
system against another.
For example, if one system has achieved a recall of 80 per cent and precision of 20
per cent, the obtained F-measure score will be the same as a system that scored
20 per cent recall and 80 per cent precision, even though the two systems are very
different. Therefore, in such cases, it is impossible to tell whether one system is
better than the other (De Sitter et al., 2004).
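These definitions translate directly into a few lines of code. The sketch below is purely illustrative rather than part of the evaluation machinery used in this project: it computes recall, precision and F-β from confusion-matrix counts following Definitions 1-4, and reproduces the 80/20 versus 20/80 example above using hypothetical counts.

```python
def evaluate(true_positive, false_positive, false_negative, beta=1.0):
    """Recall, precision and F-beta from confusion-matrix counts
    (Definitions 1-4 above; beta = 1 gives the plain F-measure)."""
    recall = true_positive / (true_positive + false_negative)
    precision = true_positive / (true_positive + false_positive)
    f_beta = ((1 + beta**2) * recall * precision) / (beta**2 * recall + precision)
    return recall, precision, f_beta

# Two hypothetical systems: 80% recall / 20% precision versus
# 20% recall / 80% precision. Both yield the same F-measure (0.32),
# even though they behave very differently in practice.
print(evaluate(true_positive=80, false_positive=320, false_negative=20))
# recall 0.8, precision 0.2, F ≈ 0.32
print(evaluate(true_positive=20, false_positive=5, false_negative=80))
# recall 0.2, precision 0.8, F ≈ 0.32
```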
2.9. IE Overall Process

In general, IE systems have been developed to accomplish different tasks in different domains, and they differ from each other in a myriad of ways; however, there are basic components that are found in nearly every extraction system that deals with natural texts. Most IE system designs were influenced by the design that was
defined by Hobbs in MUC-5 (Hobbs, 1993). Hobbs’ general architecture is based on
the idea of ‘cascading’ independent modules or engines separating the overall
processes into several smaller stages. The earlier stages deal with small linguistic
objects and work in a predefined domain-independent manner. At each step,
structure is added to the document and irrelevant information is filtered out. Some
authors combine several steps in a bigger stage, while others tend to divide one
step into smaller ones (Cowie and Wilks 2000; Appelt and Israel 1999; Turmo et al. 2006; Acharya and Parija 2010; Piskorski and Yangarber 2012).
Reviewing the historical framework, most current systems, to a greater or lesser
degree, follow six major system functionalities:
• Document pre-processing
• Morphological and lexical processing
• Syntactic parsing
• Semantic interpreter
• Co-reference resolver
• Template generator
Figure 2.3: Information extraction overall process
2.9.1. Document pre-processing
The first task is to prepare the input corpora for processing. This can be
achieved by the use of three primary modules:
Text zoner: Also known as text splitters, the text zoner splits the
document into sets of appropriate zones or segments, usually
sentences.
Filters: These take a set of sentences, keep the relevant sentences and discard the irrelevant ones. The prime consideration of
this module is to speed up the processing time by blocking unwanted
text.
Tokenization: This process mainly identifies lexical units. For
languages such as English, French and Arabic, tokenization is a trivial
problem, and words can be identified by whitespace characters. In
newspaper stories and articles of any type, punctuation mostly indicates
a sentence boundary. However, in some languages, such as Japanese
and Chinese, where orthography cannot indicate word boundaries,
additional, mostly complex, processing modules are added to identify
word segments.
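As an illustration of these pre-processing modules, the sketch below implements a naive text zoner and tokenizer for a whitespace-delimited language such as English. The splitting heuristics are deliberately simple assumptions (they would mishandle abbreviations such as ‘Dr.’), not the behaviour of any particular IE system.

```python
import re

def zone_text(document):
    """Text zoner: split a document into sentence segments, assuming a
    sentence ends at '.', '!' or '?' followed by whitespace. Roughly
    adequate for newswire text; abbreviations need extra handling."""
    return re.split(r"(?<=[.!?])\s+", document.strip())

def tokenize(sentence):
    """Tokenizer: identify lexical units. Runs of word characters stay
    together; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

for sentence in zone_text("WHO reported 12 cases. Three patients died."):
    print(tokenize(sentence))
# ['WHO', 'reported', '12', 'cases', '.']
# ['Three', 'patients', 'died', '.']
```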
2.9.2. Morphological and lexical processing
This stage manages the task of building small-scale determinable structures
from sequences of lexical units. At this stage, system developers take
advantage of available resources to do a sufficient part of the work. Here
part-of-speech taggers, stemmers, dictionaries and lexicon resources are all
used. The main modules that fall within this stage are:
Proper name extraction: Some authors believe that this step must
be done in the pre-processing stage (Cowie and Wilks, 2000). One of
the most important tasks in IE systems is to extract enumerable units (dates, spelled-out numbers) and named entities. As mentioned
earlier, named entity recognition (NER) is the process of identifying
and classifying domain-dependent names in cases where the system is designed to extract data from specific domains, and otherwise common, domain-independent names. Although extracting names is
a critical task, it may represent some difficulty if the entity is of a large
class. For example, there are many cities in the world in different
locations that have the same name.
Part-of-speech taggers: IE systems perform different kinds of tagging, one of which is part-of-speech tagging. It works by assigning a part-of-speech symbol (proper noun, verb) to each word from the corpus. Taggers work on the basis of statistical methods derived from training on pre-tagged texts, and they can be built independently of the IE system (Wilks, 1997). Generally, the effort of constructing part-of-speech taggers varies, especially when extracting information from highly specific domains, where the tagger will need more effort and time to train (Appelt and Israel 1999).
Word sense tagging: Also known as lexical disambiguator, this
assigns lexical units to one and only one lexical tag (Cowie and Wilks,
2000). It is important for resolving noun ambiguity in the predicted fragments.
2.9.3. Syntactic parsing
Full document parsing raises issues related to performance and time
consumption. At the beginning of the MUC competitions, IE systems were
designed to implement full parsing of the input texts. However, in MUC-3, the
system that achieved the best score was built on the basis of partial parsing
by using finite-state grammars. This system was developed by Lehnert
(Turmo et al., 2006). This system was soon followed by Hobbs’ group (MUC-4, 1993) which, like Lehnert’s, followed an approach using a simplified parser (Turmo et al., 2006). The idea of using simpler language models, i.e. finite-state grammars, was advocated by Church in 1980. Church contended that finite-state grammars are adequate to achieve good performance in human linguistic systems (Appelt et al., 1993).
Hence, in most current IE systems, a shallow parser based on a finite-state approach is sufficient to distinguish the syntactic structure of a sentence and
its main components. However, for some domains, a full parser may be
desirable (Appelt and Israel 1999).
2.9.4. Semantic interpreter

Generating a semantic structure or an event from a syntactic structure is the goal of this module (Acharya and Parija 2010). Simple approaches, such as
verb sub-categorisation and identifying appositive types are usually used
(Cowie and Wilks 2000): for example, ‘John Smith’ (person name),
‘CEO’ (occupation) of ‘ABC Corp’ (company name). This task is usually
limited to finding predefined argument structures.
2.9.5. Co-reference resolver
In MUC-6, this became an important part of the system evaluation process.
Its importance was derived from the need to combine structures to produce
fewer ones by generating larger, and hence, complete templates. In natural
texts, entities are referred to by different names and in different ways. For
example, ‘International Business Machines’ and ‘IBM’ are both references to the same entity; one author may mention the full company name at the
beginning of the article, but only use the acronym in later positions. IE
systems must be able to identify co-references for successful processing
(Cowie and Wilks 2000).
2.9.6. Template generator
The final stage is to produce a semantic structure that can be evaluated. It is
crucial to ensure that the output template is (i) in the proposed format and
contains lexical units extracted from the original text (Cowie and Wilks 2000),
(ii) a representational device, and (iii) designed to be an input to further processing by humans, programs or both (Appelt and Israel, 1999).
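To illustrate what such an output template might look like, the record below fills hypothetical slots for the ‘John Smith, CEO of ABC Corp’ example from section 2.9.4. The slot names are assumptions made for illustration; the point is simply that the template is a fixed, predefined structure holding lexical units extracted from the original text, ready for a database or a downstream program.

```python
# A hypothetical filled template for the example in section 2.9.4.
# Slot names are illustrative; a real system defines them in advance.
template = {
    "event_type": "management_position",
    "person": "John Smith",
    "position": "CEO",
    "organisation": "ABC Corp",
    "source_span": "John Smith, CEO of ABC Corp",
}

# Such a record can be loaded into a relational table or handed to
# another program, the role Appelt and Israel (1999) describe.
print(template["person"], "is", template["position"], "of", template["organisation"])
```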
2.10. Information Extraction System Design Approach

Two major approaches exist for the design of IE systems, namely the knowledge engineering approach and the automatic training approach.
2.10.1. The Knowledge Engineering Approach
This approach can be identified by its fundamental characteristics, and
requires the involvement of the human factor to write the extraction rules.
Because it relies on handcrafted rules, the knowledge engineering approach
is also known as the rule-based approach (Sarawagi, 2007). The main
component of the IE system is a domain-specific grammar, which is usually
written or supervised by a domain expert. Appelt and Israel (1999) labelled the person who works on the IE system under this approach a ‘knowledge engineer’, mainly because this approach requires a person who is familiar with the IE system mechanism and able to familiarise himself with the target domain, and who then, by himself or with the help of an expert in the application domain, writes the extraction rules from natural texts. In this methodology, the knowledge engineer plays a fundamental role in determining the level at which the system performs. In addition to the need for skill and
high familiarity with the system and the domain, this approach is very labour-
intensive. In this case, it is obvious that achieving a high performing system
is an iterative process, where the written rules will be continually modified
until satisfactory accuracy scores are reached.
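As a flavour of what a handcrafted rule looks like, the sketch below encodes one toy rule as a regular expression over raw text. The pattern and the example sentence are illustrative assumptions only; the CAFETIERE notation actually used in this project operates over annotated tokens and is described in Chapter 3.

```python
import re

# A toy handcrafted extraction rule: capture expressions such as
# '120 confirmed cases of cholera' by looking at the words that
# appear around the target entities. Purely illustrative; real
# rule-based systems match over annotated tokens, not raw strings.
CASES_RULE = re.compile(
    r"(?P<count>\d+)\s+(?:confirmed\s+)?cases\s+of\s+(?P<disease>[a-z]+)",
    re.IGNORECASE,
)

def apply_rule(text):
    return [(m.group("count"), m.group("disease")) for m in CASES_RULE.finditer(text)]

print(apply_rule("The ministry reported 120 confirmed cases of cholera."))
# [('120', 'cholera')]
```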
2.10.2. The Automatic Training Approach
In MUC-5, researchers were tempted by the idea of using a statistically based system to learn extraction rules instead of writing them manually (Turmo et al., 2006). The principal motivation behind this model was to
reduce the workload by shifting away from knowledge-based systems
towards an automatic system designed upon machine learning algorithms
(Piskorski and Yangarber, 2012).
By following this model, there is no need for a knowledge engineer to write
the extraction rules. Instead, the rules are derived automatically through intensive training of the system over the input corpus of texts. The
only human intervention required to accomplish the task relates to annotating
the input texts for information to be extracted. Once the training corpus is
analysed and the required information marked, the training algorithm is run; this then generates statistical information that will be used in the processing
of novel texts. Users are allowed to examine the results to check whether the
system hypotheses are correct or not. Thus, in the case of negative findings,
the system can modify its rules to respond to the new information (Appelt
and Israel 1999). Some well-known machine learning statistical models were
applied in this area, such as the hidden Markov model (HMM) and
conditional random fields (CRFs) (Piskorski and Yangarber, 2012).
Several types of learning have been used within the automatic training
approach, including supervised, semi-supervised and unsupervised learning
(Appelt and Israel 1999; Turmo et al. 2006; Nadeau and Sekine 2007;
Grishman 2005; Piskorski and Yangarber 2012).
Supervised learning approach
The term ‘supervised learning’ is applied in a situation where the entire
input into the training system is annotated manually. In preparing the input,
the annotation process will be in sequence, typically going through the
corpus document by document. Supervised learning makes sense only if
the large training set is already annotated; otherwise, the whole procedure
is extremely expensive.
Semi-supervised learning approach
An alternative is the ‘semi supervised’ approach, also known as the
‘weakly supervised’ learning approach (Nadeau and Sekine, 2007). The
semi-supervised approach uses a combination of a limited amount of
labelled data with a large amount of unlabelled data (Grishman, 2005). It
requires limited human supervision, mainly in the initial stage to provide
the input, and in the last stage to check the output.
Unsupervised learning approach
The key idea of unsupervised learning is the use of clustering. For
instance, named entities can be grouped based on the similarity of the
underlying context. This technique relies on using lexical resources for
training the system, such as WordNet, and on lexical statistics generated
from large unlabelled texts (Nadeau and Sekine, 2007).
According to Appelt and Israel (1999), a project conducted by Ralph Weischedel under TIPSTER in MUC-7, which adopted a hybrid approach using a combination of these learning strategies (supervised, semi-supervised and unsupervised), achieved high performance results.
In general, although these two approaches were devised long ago and were used in IE systems in the 1980s and 1990s, both are still being used in
parallel depending on the nature and the purpose of the system’s extraction tasks
(Sarawagi, 2007).
Appelt and Israel (1999) highlighted some principal considerations that can drive the decision of which approach to choose:
1 - The availability of training texts: If there is an adequate quantity, or if obtaining the training data is cheap, then the question of which approach to choose comes down, to some extent, to the difficulty of the domain. If the named entities are obvious and easy, the required skill of the annotator is relatively modest. However, for more difficult domains, the inter-annotator agreement is lower and the process is slower, much more complex and may require a higher level of experience. In most cases, annotations are much cheaper to produce than rules, but with a complex domain this cost advantage is arguable.
2 - The availability of lexical resources: If lexical resources, such as
dictionaries and lexicons, are available for the target language, then the
knowledge engineering approach is more suitable. Otherwise, it is necessary
to use the trainable approach and annotated corpus.
3 - The availability of a knowledge engineer: This factor is a prerequisite for
following the knowledge engineering approach.
4 - The stability of the system specification: It is often easier to make a minor change to the rules than to re-annotate a corpus in order to respond to changed requirements.
5 - The required performance degree: Results from MUC-6 and MUC-7 show
that rules produced by humans achieved an error rate approximately 30 per
cent lower than the automatic trainable systems.
Therefore, there is no single winning model, and each approach has its appropriate
use. For example, for closed domains, knowledge engineering systems are more
appropriate; for less specific domains, statistical methods are preferable. Table 2.2
summarises these factors.
Table 2.2 Decision factors for IE approaches
The Knowledge Engineering Approach | The Automatic Training Approach
Lexical resources are available | Lexical resources are unavailable
Rule-writers are available | No knowledge engineer is available
Training resources are rare or difficult to obtain | Training resources are available and cheap
Extraction tasks are not stable and likely to change | Extraction tasks are stable
The required performance degree is high | Good performance scores are sufficient for the task
2.11.Examples of IE system
Subsequent achievements in the area of rule-based systems inclined
researchers to move toward general-purpose IE systems that could ease
adaptation to new domains and languages (Piskorski and Yangarber, 2012).
One of the earliest IE systems with a comprehensive design was FASTUS: a Finite-
State Processor for Information Extraction from Real-world Text. The system was
developed at SRI International’s Artificial Intelligence Center (AIC) in 1993 and was
tested in MUC-4 (Appelt et al., 1993). Its main aim was to address the need for a
system that extracts predefined information from natural texts with high speed and
accuracy (Appelt et al., 1993). FASTUS was able to process texts in English and
Japanese, and it achieved one of the top scores in the MUC competitions. It was
designed as a set of cascaded finite-state transducers, where the output of each
stage serves as the input to the next; each stage of the processing model is itself a
finite-state device (Piskorski and Yangarber, 2012). The architecture of FASTUS
consisted of four stages: (i) triggering on words; (ii) recognising phrases; (iii)
recognising patterns; and (iv) merging incidents (Appelt et al., 1993). The system
achieved a recall of 44% and a precision of 55% when tested against a previously
unseen set of 100 texts, and at the time of MUC-4 it was considered very fast.
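The cascade design can be illustrated with a short Python sketch. The stage functions below are crude placeholders, not FASTUS’s actual transducers; the point is only that each stage’s output becomes the next stage’s input.

def recognise_phrases(tokens):
    # Placeholder stage: group runs of capitalised tokens into name phrases.
    phrases, current = [], []
    for tok in tokens + [""]:
        if tok[:1].isupper():
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
                current = []
            if tok:
                phrases.append(tok)
    return phrases

def recognise_patterns(phrases):
    # Placeholder stage: mark "<name> reported <number>" sequences as incidents.
    return [(phrases[i - 1], phrases[i + 1])
            for i, p in enumerate(phrases)
            if p == "reported" and 0 < i < len(phrases) - 1]

def merge_incidents(incidents):
    # Placeholder stage: deduplicate incidents describing the same event.
    return sorted(set(incidents))

# The cascade: each stage's output is the next stage's input.
data = "The Ministry of Health reported 34 cases".split()
for stage in (recognise_phrases, recognise_patterns, merge_incidents):
    data = stage(data)
print(data)  # [('Health', '34')] -- a crude 'incident' record

Because every stage consumes the previous stage’s output, the ordering of the cascade determines what the later, more abstract stages can see.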
Most systems designed around the same time as FASTUS were built as a single
block, where most of the processing modules, including stemmers, parsers, POS
taggers and tokenizers, were usually a one-person or one-group effort (Hahn et al.,
2008). Most were stand-alone systems, making interoperability of these components
very hard to achieve. At that time, reusability of system components was not a
consideration; rather, modules were designed to work only for specific tasks,
resulting in a low level of abstraction in their specifications. Consequently, to
develop a new IE system, developers had to start from scratch (Hahn et al., 2008).
Researchers wanted to combine their work with that of others in an effort to
deliver innovative and coherent solutions, and this was also the idea that
motivated researchers at IBM to develop the Unstructured Information Management
Architecture (UIMA). UIMA is both an architecture and an implementation, together
known as a framework (Ferrucci and Lally, 2005).
The high-level architecture of UIMA centres on a text analysis engine (available in
both the C++ and Java programming languages) and consists of two fundamental
phases: analysis and delivery. The analysis phase is carried out by the text
analysis engine (TAE), which includes different processing modules such as
tokenization, stemming, syntactic parsing and dictionary look-up. In the delivery
phase, the analysis results can be presented to the user in different ways; one
way is a query interface (a semantic search engine) for retrieving documents that
contain particular tokens, entities or relationships (Ferrucci and Lally, 2005).
Document analysis in UIMA works on the basis of a data structure holding the
original document and its associated metadata. This data structure is referred to
as the Common Analysis System (CAS). The TAE takes the CAS, performs the
required analysis, then produces the updated document with additional metadata
containing the detected data, such as names of persons, locations and organizations
(Ferrucci and Lally, 2005).
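The CAS idea can be sketched in a few lines of Python. This is an illustrative stand-in, not the real UIMA API: a CAS-like object carries the original text plus a growing list of stand-off annotations, and each analysis engine reads and enriches the same object.

from dataclasses import dataclass, field

@dataclass
class CAS:
    """Illustrative stand-in for UIMA's Common Analysis System:
    the original text plus accumulated stand-off metadata."""
    text: str
    annotations: list = field(default_factory=list)  # (begin, end, type, value)

def toy_name_annotator(cas):
    """A stand-in analysis engine: reads the CAS, adds annotations, returns it."""
    for name in ("Rio de Janeiro", "Brazil"):
        start = cas.text.find(name)
        if start != -1:
            cas.annotations.append((start, start + len(name), "Location", name))
    return cas

cas = CAS("Dengue cases were reported in Rio de Janeiro, Brazil.")
cas = toy_name_annotator(cas)  # each engine enriches the same CAS
print(cas.annotations)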
UIMA has been used in various IE systems, such as the Mayo Clinical Text Analysis
and Knowledge Extraction System (cTAKES)1 and the AVATAR extraction system2. A
system similar to UIMA is GATE, where the core language processing algorithms are
isolated from system services such as data storage, communication between
components and results visualisation.
GATE is the acronym of the “General Architecture for Text Engineering”. It was
developed at the University of Sheffield in 1996. The system was assessed in two
MUCs and achieved high scores for named entity tasks (Cowie and Wilks, 2000).
The UIMA and GATE frameworks, as well as their tools and resources, represent the
state of the art in tools for natural language text processing (Dietl et al., 2008).
Cowie and Wilks (2000) highlighted the three main objectives of the GATE
framework:
•To allow the data to pass through different system modules at the highest
common level.
•To support the integration of system modules developed in any
programming language and, thereby, to be available on any applicable
platform.
•To provide a common, easy-to-use interface to facilitate the evaluation
and refinement of system modules, and to manage input text and linguistic
resources.
GATE is open source, written entirely in the Java programming language, and
supports PostgreSQL and Oracle databases (Dietl et al., 2008). Considering IE
technology in particular, many developers of IE systems have opted for GATE
because of its high reliability and robustness, and because it incorporates the
shallow processing approach, which has proved its suitability for the IE domain
(Cowie and Wilks, 2000).
1 Open Health Natural Language Processing (OHNLP) Consortium. http://www.ohnlp.org [Last Accessed: 20 April 2013]
2 Avatar Information Extraction System. http://libra.msra.cn/Publication/2175452/avatar-information-extraction-system [Last Accessed: 20 April 2013]
GATE comprises three main components: (i) Language Resources (LRs): lexicons,
textual documents, ontologies; (ii) Processing Resources (PRs): parser algorithms,
generators, modellers; and (iii) Visual Resources (VRs): graphical user interface
components (Cowie and Wilks 2000; Dietl et al. 2008). GATE Processing Resources
are analogous to the TAEs of the UIMA architecture.
The existence of this development pipeline, under the guidance of software
engineering principles and practices, has led to incremental development in the
field and has created opportunities for building complicated NLP systems.
Accordingly, an increasing number of institutes concerned with NLP technologies
have adopted software development that complies with UIMA specifications to meet
the emerging standards (Hahn et al., 2008).
2.12.IE System Performance
Measuring the overall performance of IE systems is an aggregate process that
depends on multiple factors. The most important are (i) the level of the logical
structure to be detected, such as named entities, relationships, events and
co-references; (ii) the type of system input, e.g. newspaper articles, corporate
reports, database tuples, or short text messages from social media or mobile
phones; (iii) the focus of the domain, e.g. political, medical, financial, natural
disasters; and (iv) the language of the input texts, e.g. English, or a more
morphologically sophisticated language, such as Arabic (Piskorski and
Yangarber, 2012).
The relative complexity of assessing the performance of an IE system can be
managed by noting the scores obtained in the MUC competitions. In MUC-7, in
which the domain focus was aircraft accidents in English newspaper articles, the
system that achieved the highest overall score obtained a different score for
each subtask. Scores for both recall and precision are presented in Table 2.3
(Piskorski and Yangarber, 2012). These figures provide a glimpse of what to
expect from an IE system: the best performance is achieved in NER, while the
lowest scores reflect the most difficult task of event extraction.
Table 2.3 Top scoring in MUC-7
Task | Recall score | Precision score
NER | 95 | 95
Relationships | 70 | 85
Events | 50 | 70
Co-reference | 80 | 60
2.13.Summary
IE is a cornerstone of many NLP applications; consequently, it has progressed
rapidly over the last two decades. During and after the MUCs, researchers applied
IE techniques in a myriad of domains, resulting in massive evolution of the field.
The technologies and techniques underwent many developments: IE started out
relying heavily on handcrafted rule writing, and the advent of machine learning
techniques then advanced the progress of IE systems. However, many challenges
remain; the current state-of-the-art systems show that IE performance has not
reached the accuracy of human extraction. In addition, an IE system is an intricate
system consisting of several components that vary in complexity, and their
combined performance has a massive effect on the final results. For example, in
order to extract a relationship, the NER task must first perform with relatively
accurate results. The examples of IE systems have demonstrated how the field of
NLP in general, and IE in particular, has been influenced over the last decade by
the evolution of software engineering principles and design frameworks. The
development of IE systems has moved from standalone systems to much more
interoperable components.
3. CAFETIERE System
The aim of this chapter is to give an overview of the extraction engine adopted in this
project. The system framework and the rule grammar are explained. Some examples are
added for further elaboration.
3.1.CAFETIERE
The CAFETIERE system is a rule-based system for the detection and extraction of
basic semantic elements. CAFETIERE is an abbreviation of Conceptual Annotation
for Facts, Events, Terms, Individual Entities and RElations. It is an information
extraction system developed by the National Centre for Text Mining at the
University of Manchester. The engine adopts a knowledge engineering approach to
its extraction tasks, and it is the extraction engine used in this project.
Figure 3.1: CAFETIERE overall analysis and query model
3.2.System Components
Before applying the extraction rules, the input goes through several
preprocessing tasks. The main preprocessing framework comprises a UIMA
analysis pipeline with the following stages:
3.2.1.Document capture and zoning
In this task structural annotation is performed: the text is partitioned
into structural segments using the Common Annotation Scheme (CAS), an
XML-based annotation scheme. This step splits the text into title and main
body components; the body is then split into paragraphs.
3.2.2.Tokenization
Text is seen by the extraction engine as a sequence of elements. Every
basic element within a paragraph is marked: words, numbers,
punctuation marks and special symbols. These elements are called tokens.
Tokens are saved in the system as objects with associated attributes that
represent the token’s position in the text and its orthographic features.
These features are encoded at an earlier stage as a number of codes that
can be used later in writing the rules. For example, the orthography code
“capitalized” means that only the first character of the word is capitalized,
while the rest are lowercase (Black et al., 2005).
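As an illustration, the following Python sketch shows how tokens might be stored as objects carrying a position and an orthography code. The class and function names are hypothetical, not CAFETIERE’s internals, but the codes mirror those described above (note the mixed form ‘H1N1’, which falls under ‘other’, as discussed later in this report).

from dataclasses import dataclass

@dataclass
class Token:
    text: str
    position: int  # index of the token within the paragraph
    orth: str      # orthography code used later by the rules

def orthography(word):
    """Assign an illustrative orthography code to a word."""
    if word.isdigit():
        return "number"
    if word[0].isupper() and word[1:].islower():
        return "capitalized"   # only the first character is upper case
    if word.islower():
        return "lowercase"
    return "other"             # mixed forms such as 'H1N1'

tokens = [Token(w, i, orthography(w))
          for i, w in enumerate("The H1N1 outbreak killed 10 victims".split())]
print(tokens[1])  # Token(text='H1N1', position=1, orth='other')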
3.2.3.Tagging
This is the task of lexical annotation of the text, where every token is
labelled with its part of speech (POS) by the tagger. The tagger is a
supporting engine that has previously been trained to identify POS using
the Brill algorithm, and it uses the Penn Treebank tag set1 for its tags. The
accuracy of an IE system’s output depends highly on the performance of
the tagger. Tokens are thus identified by their grammatical categories:
nouns, proper nouns, verbs, adjectives, etc. (Black et al., 2005).
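For readers who want to see Penn Treebank tags in practice, the snippet below uses NLTK’s default tagger (a perceptron tagger, not the Brill tagger that CAFETIERE uses) purely as an illustration; it assumes NLTK and its tagger models are installed.

import nltk

# One-time model downloads (package names as of NLTK 3.x):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The Brazilian health authorities reported 647 cases.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('Brazilian', 'JJ'), ('health', 'NN'),
#       ('authorities', 'NNS'), ('reported', 'VBD'), ('647', 'CD'),
#       ('cases', 'NNS'), ('.', '.')]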
3.2.4.Gazetteer lookup
This process mainly performs semantic annotation. In CAFETIERE,
users can upload a lexical resource consisting of a collection of words
and phrases related to the domain in focus. This collection and its
representation is called a gazetteer, and it is held in an Access or MySQL
database. The role of the gazetteer is to identify proper names such as people’s names,
1 Automatic Mapping Among Lexico-Grammatical Annotation Models. http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html [Last accessed: 23 April 2013]
locations, titles, months and any other domain-specific words (Black et al.,
2005).
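A gazetteer lookup can be approximated as a longest-match scan over the token stream, as in the following Python sketch; the entries and semantic class names are illustrative, loosely echoing the classes used later in this report.

GAZETTEER = {
    ("dengue",): "disease",
    ("rio", "de", "janeiro"): "geoname/PPL",
    ("april",): "temporal/interval/month",
}
MAX_LEN = max(len(k) for k in GAZETTEER)

def lookup(tokens):
    """Scan the token stream, preferring the longest gazetteer match."""
    i, annotations = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                annotations.append((i, i + n, GAZETTEER[key]))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return annotations

print(lookup("10 April : dengue fever in Rio de Janeiro".split()))
# [(1, 2, 'temporal/interval/month'), (3, 4, 'disease'), (6, 9, 'geoname/PPL')]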
3.3.Notation of the Rules
After the preprocessing phase finishes and the system has recognized every token,
the document is ready to undergo the extraction phase, in which the rules are
applied. Rules in the CAFETIERE system are written in a high-level language
(Black et al., 2005).
The general rule formalism in CAFETIERE is (Black et al. 2005):
Y => A \ X / B
where Y represents the phrase to be extracted and X the elements that are part
of the phrase. A represents the context that may appear immediately before X in
the text, and B the context that may appear immediately after X; both A and B
are optional (i.e. they may be null).
Many rules lack A or B or both; therefore, a rule may take one of the
following forms (Black et al., 2005):
Y => \ X / B
Y => A \ X /
Y => \ X /
Rules are context-sensitive: the constituents present in the text before and
after the phrase are reflected in the left context (A) and the right context (B)
of the rule. A rule must have at least one constituent (X) and can have more
if required.
These rules define phrases (Y) and their constituents (A, X, B) as pairs of features
and their corresponding values, for example:
Y A X B
[syn=np, sem=date] => \ [syn=CD], [sem=temporal/interval/month], [syn=CD]
This example is a context-free rule in which both A and B are null. The phrase
and the constituent parts are written as sequences of features and values
enclosed in square brackets, [Feature Operator Value]. If there is more than one
feature, they are separated by commas, as can be seen in the feature bundle
for Y. A brief description of each part follows:
Feature: Denotes an attribute of the phrase to be extracted. The most commonly
used features are syn, sem and orth, where syn is syntactic, sem is semantic and
orth is orthographic. Features are written as sequences of atomic symbols.
For example, some of the values (and their meanings) that may be assigned to
the feature syn are listed in Table 3.1 (Black et al., 2005).
Table 3.1: Examples of the values that can be assigned to syn feature
Tag | Category | Example
CD | Cardinal number | 4, four
NNP | Proper noun | London
NN | Common noun | girl, boy
JJ | Adjective | happy, sad
Although these features are built into the system, there is no restriction on the name
of the features on the left-hand side of the rule.
Operator: Denotes the function applied to the attribute and the predicated value.
The operators that can be used in the system are >, >=, <, <=, =, != and ~; all
have their usual meanings, and the tilde operator matches a text unit against a
pattern.
Value: Expresses a literal value that may be a string, a number or a combination
of both; values may be quoted or unquoted.
3.4.Exploiting the rule notation
In general, the rule grammar of the CAFETIERE system is diverse, and its devices
have different uses. The following are some of the main grammar devices that may
be helpful in assembling an appropriate rule (Black et al., 2005).
3.4.1.Regular expressions
Regular expression operators such as ?, * and + can enhance a rule’s
coverage. They can be appended to constituents on the right-hand side of
the rule with the following meanings:
? means optional
* matches zero or more occurrences
+ matches one or more occurrences
Numbers or characters in square brackets, e.g. [A-Z], can specify precisely
the start and the end of a range (Black et al., 2005).
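These quantifiers behave like their counterparts in conventional regular expression libraries, except that in CAFETIERE they apply to whole token constituents rather than characters. As a character-level illustration in Python:

import re

# ? optional, * zero or more, + one or more, [A-Z] a character range
date_like = re.compile(r"[A-Z][a-z]+(,)? [0-9]+")
print(bool(date_like.fullmatch("April 2008")))    # True: optional comma absent
print(bool(date_like.fullmatch("April, 2008")))   # True: optional comma present

# * allows a lowercase list of any length, as in 'fever, cough, chills'
word_list = re.compile(r"[a-z]+(, [a-z]+)*")
print(bool(word_list.fullmatch("fever, cough, chills")))  # True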
3.4.2.Grouping of subsequent words
Constituents can be grouped in round brackets. For example, a list of words
written in lowercase characters and separated by commas can be identified
by the following features enclosed in brackets (Black et al., 2005).
[syn=NN, sem=list] =>\ [orth=lowercase]+, ([token=","], [orth=lowercase]+)+ / ;
3.4.3.The use of variables
Assume that the semantic category of the target element is unknown; in
this case identification would be problematic. One solution is to let the
‘sem’ feature range across all semantic values by assigning a variable to
it: in the left-hand side of the rule, sem=_sem, while in the right-hand
side of the rule the value assigned to the lookup1 feature is _sem. This
means that whenever a match is found according to the sequence of features
on the right-hand side, the value of lookup is saved as the value of sem.
Variables are written as unquoted strings with an initial underscore, e.g. _var
(Black et al., 2005).
1 This feature is only assigned to words identified by the gazetteer or the lexical resource.
3.4.4.Disjunction of values
From the perspective of efficiency, to avoid writing several similar rules that
differ only in one feature value, a single value expression can be used to
denote alternative values. This is done by placing the pipe symbol “|”
between the alternatives, as in [token="reported"|"announced"] (Black et
al., 2005).
3.4.5.Partial matching
Partial matching is very useful when the rule-writer is looking for a string
that begins or ends with certain characters. For example, many surnames
start with Mac or Mc, but the rest of the name may differ; here the wildcard
“*” is essential. The following example is taken from Black et al. (2005).
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name3] =>
\ [token= "*virus", token=__o], [sem= disease_type]/;
3.4.6.Rules Order
Rules are invoked in sequence by the system processor; therefore, the
analysis is deterministic (Black et al., 2012). It is the author’s responsibility
to order the rules, because once words and phrases are identified by a
certain rule they are no longer available to subsequent rules. To illustrate,
assume there are two rules:
Rule 1: extract title, first name, initials, surname.
Rule 2: extract title and surname.
Target text: Ph.D John N. Scott
If rule 2 is executed before rule 1, then rule 2 will match Ph.D and John,
leaving N. Scott remaining. When rule 1 then fires, it will not detect
N. Scott, because the other parts have already been consumed.
To avoid such a situation, when writing rules to match token sequences
that vary in length, the longest-match rule must be executed first.
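The following Python sketch imitates this behaviour, with character-level regular expressions standing in for token patterns: rules fire in order, and matched spans are consumed so that later rules cannot reuse them. The rules themselves are illustrative.

import re

# Rule 1 (longest): title + first name + initial + surname.
# Rule 2 (shorter): title + one following name.
RULES = [
    ("full_name",  re.compile(r"Ph\.D [A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+")),
    ("short_name", re.compile(r"Ph\.D [A-Z][a-z]+")),
]

def apply_rules(text):
    consumed, results = [], []
    for rulid, pattern in RULES:          # rules fire strictly in order
        for m in pattern.finditer(text):
            span = range(m.start(), m.end())
            if not any(i in used for used in consumed for i in span):
                results.append((rulid, m.group()))
                consumed.append(set(span))  # matched text is hidden from later rules
    return results

print(apply_rules("Ph.D John N. Scott"))
# [('full_name', 'Ph.D John N. Scott')] -- reverse the rule order and only
# 'Ph.D John' would match, leaving 'N. Scott' undetected.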
3.5.Gazetteer
In addition to the system’s built-in gazetteer, users can upload their own
gazetteers to look up words and expressions from the domain they are focusing
on. The system recognizes a plain text file with the extension .gaz as a gazetteer
file; it is then added to the existing gazetteer, which is a relational database table
(Black et al., 2012). The lookup mechanism works by identifying all the strings
that appear in the gazetteer, loading the relevant data from the system’s lexical
base before the rules are applied. This process may be time-consuming, because
every token is looked up to see whether the gazetteer holds any information
about it (Black et al., 2012).
3.6.Summary
The CAFETIERE system has been explained in detail, starting from the analysis
pipeline, which determines the different preprocessing phases that the text
undergoes, and moving on to the rule grammar and specific features such as the
use of variables and regular expressions.
4. Requirement Analysis
The aim of this chapter is to describe the domain focus and how the corpus will be
gathered. A preliminary analysis of some texts will give the guidelines for rule design and
implementation.
4.1.Domain Selection
The capability of processing natural texts, extracting specific entities and
identifying relationships is the principal aim of most IE systems. Many systems
have been developed specifically to extract information from news topics under
the influence of the MUCs. In the last two decades, IE techniques have been
applied in different areas, such as terrorist attack reports, business news and
financial reports, drug discovery and biomedical domains. For this project,
disease outbreak reports have been chosen as the extraction domain.
Because each domain differs in context and language patterns, this project will
begin with a complete analysis of texts from the chosen domain.
4.2.Motivation for Domain Selection
Once an infectious disease begins to spread somewhere in the world, the
general public instantly becomes very concerned about the severity of the
disease, its origins, how many people have been infected and how many have
died. This drives the need for a system that is able to analyse data from natural
texts, such as news articles, in a timely manner.
Furthermore, it is necessary to complement clinical reporting systems by
enriching their databases with information extracted from disease outbreak
reports. In many emergency situations it is essential to go back and study the
history of a certain disease. Information related to disease outbreaks is often
written as free text and is therefore difficult to use in computerised systems;
confining such information to this format makes rapid access to it very difficult.
According to the World Health Organization (WHO), analysing the information from
disease epidemic reports can be used for:
• The identification of disease clusters and patterns;
• Facilitating the tracking and follow-up of the spread of a disease outbreak;
• Estimating the potential increase in the number of infected people in a further
spread;
• Providing an early warning in the case of an increase in the number of incidents;
• Helping in strategic decision-making as to whether control measures are working
effectively.
In addition to these factors, historically very few information systems have been
designed to extract information from disease outbreak reports; a notable example
designed specifically for disease outbreaks is Proteus-BIO in 2002 (Grishman et
al., 2002). All of these are motivating factors for choosing to study this domain.
The intention of the work proposed in this project is to extract information about
disease outbreaks from a variety of natural texts. A number of news websites
produce reports about disease outbreaks. Some of these are annual reports,
where one report presents a summary of all disease epidemics recorded in one
year; for example, the reports found on the Centers for Disease Control and
Prevention website1 are all historical reports related to food-borne diseases.
However, for the purposes of this project, the aim is to analyse texts in different
formats, where one particular disease is discussed in many reports. Therefore, a
decision has been taken to mainly analyse news reports from the WHO. The
WHO website2 represents an ideal source, containing archives of news classified
into different categories by country, year or disease.
In almost every case, the authors of these news stories report the same
information; however, from report to report the style of writing differs slightly.
This provides an opportunity to be exposed to a variety of writing styles in the
reporting of disease data.
The following are examples of news items taken from the WHO:
1 http://www.cdc.gov/outbreaknet/surveillance_data.html
2 http://www.who.int/csr/don/en/
•Example 1: A sample of reporting incidents in Brazil diagnosed with dengue haemorrhagic fever:
“10 April 2008 - As of 28 March, 2008, the Brazilian health authorities have reported a national total of 120 570 cases of dengue including 647 dengue haemorrhagic fever (DHF) cases, with 48 deaths.
On 2 April 2008, the State of Rio de Janeiro reported 57 010 cases of dengue fever (DF) including 67 confirmed deaths and 58 deaths currently under investigation. Rio de Janeiro, where DEN-3 has been the predominant circulating serotype for the past 5 years since the major DEN-3 epidemic in 2002, is now experiencing the renewed circulation of DEN-2. This has led to an increase in severe dengue cases in children and about 50% of the deaths, so far, have been children of 0-13 years of age.
The Ministry of Health (MoH) is working closely with the Rio de Janeiro branch of the Centro de Informações Estratégicas em Vigilância em Saúde (CIEVS) to implement the required control measures and identify priority areas for intervention. The MoH has already mobilized health professionals to the federal hospitals of Rio de Janeiro to support patient management activities, including clinical case management and laboratory diagnosis.
Additionally public health and emergency services professionals have been recruited to assist community-based interventions. Vector control activities were implemented throughout the State and especially in the Municipality of Rio. The Fire Department, military, and health inspectors of Funasa (Fundacao Nacional de Saude, MoH) are assisting in these activities.”1
•Example 2: A sample of reporting incidents in Turkey diagnosed with Avian influenza:
“30 January 2006 - A WHO collaborating laboratory in the United Kingdom has now confirmed 12 of the 21 cases of H5N1 avian influenza previously announced by the Turkish Ministry of Health. All four fatalities are among the 12 confirmed cases. Samples from the remaining 9 patients, confirmed as H5 positive in the Ankara laboratory, are undergoing further joint investigation by the Ankara and UK laboratories. Testing for H5N1 infection is technically challenging, particularly under the conditions of an outbreak where large numbers of samples are submitted for testing and rapid results are needed to guide clinical decisions. Additional testing in a WHO collaborating laboratory may produce inconclusive or only weakly positive results. In such cases, clinical data about the patient are used to make a final assessment.”2
•Example 3: A sample of reporting incidents in Sierra Leone diagnosed with a cholera outbreak:
“18 September 2012 - As of 16 September 2012, a cumulative total of 18,508 cases including 271 deaths (with a case fatality ratio of 1.5%) has been reported in the ongoing cholera outbreak in Sierra Leone since the beginning of the year.
The highest numbers of cases are reported from the Western area of the country where the capital city of Freetown is located.
1 http://www.who.int/csr/don/2008_04_10/en/index.html
2 http://www.who.int/csr/don/2006_01_30/en/index.html
The Ministry of Health and Sanitation (MOHS) is closely working with partners at national and international levels to step up response to the cholera outbreak. The ongoing activities at the field level include case management; communication and social mobilization; water, sanitation and hygiene promotion; surveillance and data management.” 1
4.3.Structure of the text
After reviewing 20 outbreak reports from the WHO website, it can be said that most
of them follow a general scheme. Reports chosen for this project will range from
100 to 300 words in length, because those over 300 words usually contain
additional information, such as recommendations or medical treatments, which is
out of the scope of this study.
All the documents on the WHO website start with a date string indicating the date of
publication. Some elements are common to all texts, such as information about the
number of people affected by an outbreak, the name of the disease and the
location where it is spreading. The structure usually consists of the following points:
•The first sentence, after the title, contains a date string giving the date of
publication on the website, always presented in the same format, e.g. 2 April
2011.
•Some of the reports contain another date, the announcement date; this is
always given after the first date in the text, in the second sentence in most
cases.
•In most instances the disease is reported by a health agency in a country,
e.g. “The Brazilian health authorities”. This can be a very useful piece of
information, since it has been noticed that in some reports the name of the
country is not mentioned, and the name of the national health agency is
enough to indicate the location.
•Disease names are not capitalized, but they are sometimes accompanied
by indicating words such as fever, outbreak, influenza, etc. Some disease
names are combinations of characters and numbers, like H5N1 influenza.
•The report identifies the number of suspected and confirmed disease cases.
1 http://www.who.int/csr/don/2012_09_18b/en/index.html
•The total number of people affected by the disease from when it was initially
discovered to the date of announcement is sometimes reported.
•Infected cases are reported individually per state, as in: “the State of Rio de
Janeiro reported 57 010 cases of dengue fever (DF) including 67 confirmed
deaths and 58 deaths currently under investigation”.
•A pattern of “health authority of a country reported victims” can be found in
the text: “the Brazilian health authorities have reported a national total of 120
570”. In some cases the word “reported” is replaced by synonyms such as
“announced”. This pattern can also be found in other documents in the
reverse order, “victims reported by health agency of country”.
4.4.Entities, relationships and events identification
The heart of this project is to find outbreak events within outbreak reports.
Extracting these events requires finding outbreak patterns in the text. We found
that the reporting of an outbreak event sometimes follows a very simple pattern,
e.g. “H1N1 killed 10 victims”, whereas in some reports the event can involve
complex phrases, e.g. “the total of 18,508 cases including 271 deaths has been
reported in the cholera outbreak”.
Rather than trying to build one very complex rule, a set of smaller tasks must be
accomplished in order to compute the final event.
Information about a particular incident must be drawn from several constituent
elements within the text: the publication date, announcement date, disease name,
country of the outbreak, specific location (cities, states), number of infected people,
status of victims (sick, dead). The organization of this information often differs from
one report to another.
To make event extraction easier, we need to distinguish between relationships and
events. Relationship extraction relies on identifying a single piece of information,
such as the nationality of the reporting authority, while the event is the outbreak
incident itself (the number of people infected by the disease in the country).
In sum, the extraction task will involve detecting the following information elements:
Entities:
• Publication date
• Announcement date
• Disease name
• Disease code
• Country
• Locations of the outbreak (cities, villages, ...)
Relationships:
• Nationality of the reporting authority
Events:
• Number of cases and deaths of an outbreak
• Total number of affected cases
Table 4.1 Named entities in outbreak reports for IE
Entity | Position in the text
Report date | Actual string from document
Disease name | Clue words: fever, outbreak, syndrome, influenza
Health agency name | Actual string from document; clues: ministry, agency
Country | Actual string from document, or computed from the agency name
Location | States, cities, towns
Number of victims | Numeric value of cases mentioned in the text, or computed by counting the cases in different locations
To comply with the rule ordering, rules to extract entities must come before
rules for relationships and events. Based on the initial analysis, the following
assumptions have been made: extracting the named entities will be a relatively
straightforward task, since they are usually accompanied by clues occurring
immediately before or after their appearance in a text (see Table 4.1). At the
same time, extracting location information such as a country name is much more
difficult, as it may not appear in the text at all and thus has to be computed from
the name of the agency. Finally, extracting outbreak events may be the most
challenging part of this project.
4.5.Project Methodology
This project follows the model proposed by the knowledge engineering
approach. As discussed in Chapter 2, the rule-based approach has proved its
efficiency in extracting information from various domains. The model will use a
set of features that facilitate the extraction process, including proper name
identification, special characters and punctuation.
4.6.Development Methodology
Delivering an information system that fulfils the proposed requirements
within a specific timeframe can be achieved by following different strategies.
Before deciding which approach to choose, it seems appropriate to give a
brief description of two major approaches used in most system development
processes.
4.6.1.The Waterfall Approach
The most traditional approach to system development was proposed in
1970 by Royce (Massey and Satao, 2012). It follows a linear framework
consisting of development stages performed in sequential order:
requirements gathering, system design, implementation and development,
testing and, finally, maintenance. In this model, the preceding stage must be
completed before moving to the next. Each stage has a set of goals and
deliverables, stated in advance, that must be met. This approach is usually
followed if the system requirements are clear and stable before the start of
the project (Avison and Fitzgerald, 2003).
4.6.2.The Prototyping Approach
One of the relatively recent approaches is the prototyping approach, in
which modification of the information system model is much simpler. An
initial prototype is developed quickly and at minimal expense to give the
client a glimpse of how the system will look and work. The prototype can
then be critically analysed and enhanced before actual deployment starts.
Thus, an iterative process is followed until the client or end user is satisfied
with the generated deliverables (Moscove, 2001).
Due to the nature of this project, a combination of these two approaches will be
employed. By incorporating the strengths of the waterfall and prototyping
approaches, their potential difficulties can be mitigated. The waterfall will provide
the overall structure needed to accomplish the project, while the prototyping
model will be followed at certain stages: mainly, there will be iteration between
implementing the extraction rules and testing them, until the target results are
finally reached. According to Sarawagi (2007), the most successful rule-based
systems have followed a hybrid model in which, after the most common patterns
have been extracted, rules are modified and repeatedly tuned to obtain optimal
results.
Figure 4.1: The development life-cycle model for information extraction
4.7.Summary
The domain chosen and the motivation behind the selection have been explained.
Samples of the texts upon which the project will focus were provided. The entity
types, relations, and events to be extracted have been clearly identified. Finally, the
project methodology has been described.
5. Design, Implementation and Testing
This chapter gives an overview of the process that was followed in designing the extraction
rules for all entities, relationships and events. The primary textual patterns that influenced
the final extraction decisions are discussed, together with a number of examples that
demonstrate the final analysis findings.
5.1.General design issues
The process of designing the extraction rules is based on studying the textual
expressions and elements found in the text, so for every entity, relationship and
event, a similar approach has been followed. The following are the factors that
influenced the design of the majority of the rules:
• Every textual element is recognized and captured using linguistic features
(e.g. syntactic, semantic, orthography). For example, to extract a token of
type number such as ‘45’, the rule should contain the syntactic feature
‘syn=CD’. (CD refers to Cardinal Number.)
• For each extraction task, the span of text that appears before and after the
target text is collected and studied to find common patterns that may help in
identifying the correct element. This task of studying the context surrounding
the element is the heart of this work as it is the only way to avoid false
matches.
• Patterns can be very simple, such as ‘prepositional phrase + noun’, or very
complex, such as the patterns used to look for outbreak events, where the
pattern is a whole sentence: ‘verbs + prepositional phrases + nouns +
punctuation’. Not all the constituents mentioned in a pattern will be
extracted - only the required ones.
• Rule order is very important. If there are two elements to be extracted and
the first element depends on the existence of the second element in the
sentence, then the first element should be extracted before the second one.
This is because when an element is recognized by a rule, it will be hidden
from the rest of the rules; therefore, each element is only extracted once.
Although the project follows a rule-based approach, some extraction tasks required
additional entries in the system gazetteer. Therefore, one of the initial steps was to
collect domain-specific vocabulary and add it under the appropriate semantic class,
or create new semantic classes where needed. Not only have domain terminologies
been added to the gazetteer; some commonly used verbs and nouns have also
been collected, added and categorized.
5.2.Entity extraction
5.2.1.Rules for extracting the publishing date
Extracting the publishing date of a report on the WHO website is essential.
When an outbreak hits a country, the WHO will release a periodic report
about the status of the outbreak; thus, reports are ordered by their publishing
date on the WHO website.
The date in the training corpora is expressed in two formats: ‘D Month
YYYY’ (‘6 June 2010’) or ‘D Month, YYYY’ (‘6 June, 2010’). Month names
already exist in the default gazetteer. The date rule is given below:
[syn=np, sem=date, type=entity,key=__t, month=_mno, day=_day, year=_year, rulid=publish_date1] => \ [syn=CD, orth=number, token=__t, token=_day, sent<=1], [sem="temporal/interval/month", monthno=_mno, key=__t, sent<=1],
[token=","]?, [syn=CD,orth=number, token~"19??"|"20??", token=__t, token=_year, sent<=1] /;
More than one date can be found for any randomly chosen outbreak report,
each of which may refer to something different (such as the date of the first
suspected ill person). To resolve this issue, we found that the publishing
dates are usually mentioned in the first or second sentence; therefore, the
feature ‘sent’ in the CAFETIERE system that indicates the sentence number
has been used. This has only been used to tag the date pattern in sentence
0 or 1; the annotation system starts counting sentences from 0.
Though this rule successfully captured almost all the dates in the training
corpora, months written in capital letters, such as ‘JULY’, were not
identified by the gazetteer. For this reason, the list of months written in all
capital letters has been added to the gazetteer and assigned the same
semantic class as those in the default gazetteer:
JANUARY:instance=january,class=temporal/interval/month,monthno=01
FEBRUARY:instance=february,class=temporal/interval/month,monthno=02
MARCH:instance=march,class=temporal/interval/month,monthno=03
...
DECEMBER:instance=december,class=temporal/interval/month,monthno=12
By adding these entries to the gazetteer, we have eliminated the need to
design another rule for each month written in capital letters.
5.2.2.Rules for extracting the announcement date
The announcement date is similar to the publishing date but has a different
meaning: it is the date on which the national health authority reported an
outbreak to the WHO. Unlike the publishing date, it appears in three
formats: (1) a day with a month (‘DD Month’), (2) a month with a year
(‘Month YYYY’), or (3) a full date (‘DD Month YYYY’). The rules for the first
and second patterns are presented below; the rule for the last pattern is
similar to the publishing date rule:
[syn=np, sem=date, type=entity,key=__t, month=_mno, year=_year, rulid= announcement_date2, sentence=_s] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"] \ [sem="temporal/interval/month", monthno=_mno, key=__t, sent=2], [token=","]?, [syn=CD,orth=number, token~"19??"|"20??", token=__t, token=_year, sent=2] /;
[syn=np, sem=date, type=entity,key=__t, month=_mno, rulid= announcement_date3] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=2], [sem="temporal/interval/month", monthno=_mno, key=__t, sent=2] /;
Some constraints have been added to the rules to avoid extracting dates with
different meanings; these constraints are the result of closely studying all the
training texts. The first constraint is that announcement dates are usually
mentioned in sent=2 and in some cases in sent=3; as a result, variants of the
rules were also written to handle announcement dates mentioned in the fourth
sentence (sent=3).
Expressions such as ‘As of 6 July 2002, the ministry of health has
reported . . .’ and ‘On 4 March, the Gabonese Ministry of Public Health
reported . . .’ are used to report the news; therefore, the left constituent of the
rule was designed to look for ‘As of’ and ‘On’ before capturing the reporting
date.
Moreover, another pattern has been found in the form of ‘During 1-26
January 2003’; this expression is used to identify the period that the outbreak
report is covering. Figure 5.1 shows the extraction task for such a pattern:
Figure 5.1: Announcement date extraction
5.2.3.Rules for extracting country name
All outbreak reports contain information about the country affected by the
disease. In addition to the country of the outbreak, other countries may
appear in the text; therefore, the process of finding the country of the
outbreak had to be very explicit and precise. A separate gazetteer of
GeoNames places comes by default with the CAFETIERE system. When a
country or city occurrence in the input text matches an entry in the
GeoNames database, a phrasal annotation is made. Therefore, to extract
the correct country of the outbreak, rules must be designed to classify the
tagged locations.
First of all, we found that the country of the outbreak is usually mentioned in
the first three sentences, but this alone is not enough of a constraint, since
other countries can be mentioned as well. A very common pattern indicates
the country of the collaborating laboratory used to examine a virus, for
example: ‘Tests conducted at a WHO collaborating laboratory in the United
Kingdom . . .’. Therefore, phrases such as ‘collaborating laboratory’ and
‘laboratory centre’ were collected and added under the semantic class
‘collaboration’; whenever a country is mentioned after these words, it will
not be extracted as the country of the outbreak.
The strongest indication of the country of the outbreak is when a country is
mentioned in the title of the report (sent=0); this can be extracted by the
following rule:
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name2] => \ [sem>="geoname/COUNTRY", token=_c, sent=0] /;
Country names are categorized in the GeoName database under the
semantic class ‘geoname/COUNTRY’.
The second pattern is when the country name is preceded by the phrase
‘Situation in’, which is very common in reports on the WHO website. Since no
similar phrases are found in other texts, there was no need to add it to the
gazetteer. The following rule has been created to capture this pattern:
[syn=NNP, sem=country_of_the_outbreak, type=entity, country=_c, rulid=country_name] =>
[token="Situation"|"situation", sem!="collaboration"], [token="in"] \ [syn=DT]?, [sem>="geoname/COUNTRY", token=_c] /;
If the country of the outbreak cannot be found using the previous rules, a
third pattern may be employed: any country name preceded by the
preposition ‘in’ or by ‘,’ that occurs in the first three sentences and is not
preceded by any phrase indicating a collaboration will be captured by this rule:
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name1] =>
[sem!="collaboration"], [token="in"|","] \ [sem>="geoname/COUNTRY", token=_c, sent<=2] /;
Rule order is especially important in finding the country of the outbreak,
because outbreak reports often mention countries that are not affected by
the disease but appear in other information.
5.2.4.Rules for extracting the name of an outbreak
In order to extract the name of an outbreak, it is essential to know as many
disease names as possible. One way to do this is to incorporate a list of
disease names, types and symptoms into the extraction system; the medical
domain in particular is rich in specific and generic dictionaries and
terminology lists, including names and categories of diseases such as the
ICD-101 disease lists. However, the emphasis of this project is to enable the
system to identify disease names of various patterns automatically, so that it
can extract recently discovered disease names and codes not found in
predefined lists.
1 ICD-10: The International Classification of Diseases standard.
One of the main problems making the identification of disease names more
complex than that of classic entities (such as person names and countries) is
that they are in most cases written in lowercase; therefore, other features in
the text had to be recognized. Another problem arises when diseases are
mentioned in the text but are not the outbreak itself, e.g. when the symptoms
of an outbreak are similar to those of another disease. To avoid this
situation, words indicating the occurrence of such a case were gathered and
added to the gazetteer under the semantic class ‘symptoms’.
The first pattern recognized is when the disease name is mentioned in the
title, with the form:
orth(capitalized) sem(preposition, punctuation) sem(GeoName)
E.g.: ‘Cholera in Chile’.
This pattern is only suitable for diseases mentioned in the title (sent=0). The
following rule was subsequently developed:
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name1] =>
[token!="in"] \ [sem=disease, orth="capitalized", token=_o, syn!=DT, sent=0], [sem="disease_type"]? / [token="in"|","], [sem="geoname/COUNTRY"];
Nouns that indicate disease types (usually mentioned after the disease
name), such as virus, infection and syndrome, were all collected and added
to the gazetteer under the semantic class ‘disease_types’.
Many of the diseases are in the form of compound nouns where multiple
words are used to describe one disease entity. A typical disease name may
consist of the following:
sem(disease condition) sem(disease) sem(disease type)
‘Disease conditions’ is a new category created to cover all health conditions,
such as ‘acute’, ‘paralysis’ and ‘wild’ that are used as disease descriptors. An
example of this pattern is as follows:
‘acute poliomyelitis outbreak’
The following rule was developed to extract this pattern when mentioned in
the text body:
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name6] => [sem!="symptoms", sent>0] \ [sem="disease_condition", token=__o], [sem=disease, token=__o], [sem=disease_type]? /;
Another disease pattern is when the word ‘virus’ is attached to the end of the
name, such as ‘Coronavirus’. The following rule has been designed accordingly:
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name3] =>
\ [token="*virus", token=_o], [sem=disease_type] /;
Diseases that are always mentioned in the text without clues, such as
disease types and conditions, were collected from the WHO website and
added to the gazetteer under the semantic class ‘disease’. Examples of this
class are Malaria, plague, E.coli and Japanese encephalitis. They are
extracted by this rule:
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name2] => [sem!="symptoms", sent>0] \ [sem=disease, token=_o], [sem="disease_type"]? /;
In addition to this, extraction rules have been designed to identify disease
codes such as H1N1. The CAFETIERE system recognizes the word ‘H1N1’
as one token and assigns the value ‘other’ to the orthography feature
because it contains characters and numbers; thus, the rules were designed
based on this finding:
[syn=np, sem=Disease_code, key=__s, rulid=disease_code2] => \ [orth="other", token=__s], [sem="disease_type", token=__s] /;
The disease-extraction patterns designed and implemented total eighteen;
patterns not presented in this chapter can be found in Appendix A.
5.2.5.Rules for extracting affected cities and provinces
When an outbreak hits a country, one or more cities will be affected. The goal
of this task is to extract the affected locations by studying the context of the
texts. With the existence of the default GeoName database, it will be easy to
capture any geographical location mentioned in the text as a location of the
outbreak; however, in some reports, the cities mentioned are not affected by a
disease but instead are mentioned as part of other details. As a result, the
context that precedes the city names was carefully studied.
After examining the texts, we found that not all cities, areas, provinces and
states are present in the GeoNames database, with the result that some are
not identified. This is due either to transliteration problems or to the places
not being well known. The problem was partially solved by studying the
expressions used to represent locations in the text.
The corpora were studied to collect all the words that may indicate a location,
such as ‘city’, ‘province’, ‘district’, ‘village’ or ‘region’, and these terms were
added to the gazetteer under the semantic class ‘areas’, in all the forms in
which they can occur (singular, plural and capitalized).
In some texts, the names of the affected areas occur in expressions like ‘54
cases have been reported in the provinces of Velasco’ and ‘Cases also
reported from Niari’. The first rule was designed to look for the following
pattern:
sem(reporting_verbs) sem(preposition) sem(GeoName)
To identify locations other than country names, two categories from the
GeoNames database - geoname/PPL1 and geoname/ADM22 - were used.
Thus, the following rule was constructed for both the passive and the active
expression:
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations] =>
[sem="haveverb"]?, [token="today"|"also"]?, [token="been"]?, [sem="reporting_verbs"], [token="in"|"from"|","|"In"], [token~"^[a-z]+$"]{0,5} \ [sem="geoname/PPL"|"geoname/ADM2", token=_c] [sem="areas"]? /;
We designed a rule to capture groups of entities for situations where the
outbreak hits more than one location. The maximum number of entities it
captures is eight. Figure 5.2 shows the text after extracting the outbreak
locations.
Figure 5.2: Extracting groups of outbreak locations
1 PPL: “A city, town, village, or other agglomeration of buildings where people live and work”. Source: The GeoNames geographical database. Available from: http://www.geonames.org/export/codes.html [Last accessed: 25 July 2013]
2 ADM2: “A subdivision of a first-order administrative division”. Source: The GeoNames geographical database. Available from: http://www.geonames.org/export/codes.html [Last accessed: 25 July 2013]
For locations not identified by the gazetteer, a number of rules have been
designed to extract locations mentioned within explicit expressions. These
expressions usually contain an indicating word, such as ‘city’, ‘province’ or
‘state’, after the location name:
sem(reporting_verbs) sem(preposition) Orth(capitalized) sem(areas)
This pattern conforms with expressions such as:
‘ . . . reported from the Oromiya region’.
Additionally, we have designed a rule to capture locations mentioned after
reporting the incidence of a single case:
‘ . . . female from Kampong Speu Province’.
‘ . . . male from Kong Pisey district’.
Words indicating a single person, such as ‘male’, ‘female’, ‘child’, ‘man’
and ‘woman’ were added under the semantic class ‘people’. The
following rule has been designed:
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations10] => [sem="people"], [token="from"|"for"]?, [token~"^[a-z]+$"]{0,2} \ [sem="geoname/PPL"|"geoname/ADM2", token=_c] / [sem="areas"]?;
Identifying the name of a location appears to be the most challenging task,
especially when the place name is not identified by the gazetteer. Location
phrases can take various forms and can occur anywhere in the text. Simple
rules (e.g. finding location prepositions such as ‘in’ and ‘from’) may extract
all the locations in the text, but they may also increase the number of false
matches.
5.3.Relationship extraction
5.3.1.Rules for extracting the name of the reporting health authority
In the domain studied in this project, the most interesting relationship is the
name of the health authority reporting the outbreak to the WHO. This is a
binary relationship of type “located in” (e.g. Ministry of Health, Afghanistan).
The task is especially important because in some reports the name of the
country is not mentioned at the beginning but is instead implied in the name
of the authority.
When we began our analysis, the name of the health authority was
considered part of the outbreak event, because it was usually mentioned at
the beginning or at the end of the outbreak event clause. The outbreak
event, as will be seen later in this report, is a very complex task, and
extracting the name of the reporting authority is not straightforward: it can
take various forms and can appear in different positions in a sentence,
depending on the voice used (passive or active). For this reason, we decided
to extract it using separate extraction rules.
According to the texts under study, the reporting authority always takes the
form of the relevant country’s health authority, where the name of the health
authority is adjacent to the country name. The most common form is:
sem(health authority) sem(preposition) sem(GeoName)
All the names that might refer to a health authority were collected and added
to the gazetteer under the semantic class ‘health_agency’. The most
common authority reporting outbreaks was a country’s ‘ministry of health’;
however, other names such as ‘The Ministry of Health and Population’ and
‘The National Health and Family Planning Commission’ were also found.
Pattern 1:
sem(health_agency) sem(preposition) sem(GeoName)
This will capture
‘Ministry of Health (MoH) of Egypt’
with the following rule:
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority2] =>
\ [sem="health_agency", token=__d, token=_h], [token="("], [syn="NN"]?, [token=")"], [token="of"|"in", token=__d], [token="the", token=__d]?, [sem>="geoname", token=__d, token=_c] / [sem="haveverb"]?, [sem="reporting_verbs"];
Pattern 2:
The nationality of the country is mentioned instead of the country name:
orth(DT) NNP(nationality) NP(health authority)
where DT refers to determiners such as ‘the’. This pattern conforms with the
following example:
‘The Afghan Ministry of Public Health’.
Pattern 3:
Punctuation is used to refer to the authority name:
sem(health authority) sem(punctuation) sem(GeoName)
to capture:
‘The National Health and Family Planning Commission, China’.
Some health authority names take forms such as:
‘The Ministry of Health and Care Services of Norway’
‘The Ministry of Health and Population, Angola’
Here the first part of the health authority name (‘The Ministry of Health’) is
already in the gazetteer, but the second part (‘and Care Services’ or ‘and
Population’) is new and can vary. To avoid the whole relationship going
unrecognised in such cases, the following pattern was added to the other
relationship patterns:
sem(health authority) token(and) orth(capitalized){1,4} sem(GeoName)
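For illustration only, this pattern can be approximated in Python with a regular expression and a single hard-coded gazetteer entry (‘Ministry of Health’); this is our sketch, not the CAFETIERE notation, and it covers only the comma variant seen in the Bangladesh example:

import re

# Approximation of: sem(health authority) token(and) orth(capitalized){1,4} sem(GeoName)
# 'Ministry of Health' stands in for a gazetteer lookup; the country name is
# simplified to a single capitalised word.
pattern = re.compile(
    r"(?:The\s+)?Ministry of Health"            # gazetteer entry (health_agency)
    r"(?:\s+and(?:\s+[A-Z][a-z]+){1,4})?"       # 'and' + capitalized{1,4}, optional
    r",\s*(?P<country>[A-Z][a-z]+)"             # punctuation + GeoName
)

m = pattern.search("The Ministry of Health and Family Welfare, Bangladesh")
if m:
    print(m.group(0), "->", m.group("country"))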
Figure 5.3: Reporting health authority extraction
In the text shown in Figure 5.3, ‘The Ministry of Health and Family
Welfare, Bangladesh’ has been recognised as the name of the reporting
agency.
However, requiring exact matches for relationship extraction may result in
no information being reported. In some cases, the health authority of the
country is the reporting authority but the country name is not adjacent, or the
country name alone is used to report an outbreak, as in
“China has reported to WHO 5 deaths..”. To avoid missing these cases
entirely, it was decided to extract the health authority or the country name as
the reporting authority whenever a reporting pattern appears in the text, even
if it does not align with the complete relation extraction pattern.
5.4. Events extraction
As in the ACE model, only events with interesting arguments should be extracted,
such as life events (born, die) and transactions (money transfers) (Ahn, 2006).
Consequently, events found in the outbreak reports are life events because a
certain number of people are affected by a disease in a specific country.
5.4.1. Rules for extracting an outbreak event
After examining 25 reports, we found that the patterns used to report an
outbreak event are in the form of the number of victims of an outbreak
reported by an authority. To avoid increasing the complexity of the event
rules, the authority name is extracted in advance. Typically, the simplest
event will be in the following form:
sem(GeoName) sem(reporting verbs) orth(CD) token (“cases”)
This will capture a sentence in the following form:
‘China reported 34 cases’.
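For illustration only, this simplest pattern can be approximated with a Python regular expression (our sketch, not the system's rule notation; the ‘reporting verbs’ class is reduced here to the three verbs named later in this section):

import re

# Approximation of: sem(GeoName) sem(reporting verbs) orth(CD) token("cases")
REPORTING_VERBS = r"(?:reported|identified|confirmed)"
simple_event = re.compile(
    r"(?P<country>[A-Z][a-z]+)\s+" + REPORTING_VERBS + r"\s+(?P<cases>\d+)\s+cases"
)

m = simple_event.search("China reported 34 cases")
if m:
    print(m.group("country"), m.group("cases"))  # China 34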
However, as we analysed more texts, more interesting arguments
emerged, such as case classification and fatal cases. Therefore, before
designing the event rules, we took the following key considerations into account:
5.4.1.1. Number of cases
The number of cases and deaths are usually in digit form, such as
‘134 cases’. Alternatively, they can be in written form: ‘five cases’. We
have also found that the form of ‘twenty-five cases’, where a dash is
inserted between two numbers, is also used in some reports. Another
issue related to extracting the numbers arises when a number
consists of four or more digits and a space is used to group the digits
in threes (e.g. ‘45 100’). To overcome this problem, the
number can simply be read as a whole string, ‘45 100’, but if we want
it to be saved as a proper integer value, the following arithmetic
expression solves the problem: total = “(+ (* _a 1000) _b)”.
CAFETIERE interprets this calculation by multiplying the first
number by 1000 and then adding the second number.
For example, ‘45’ is the first number token ‘a’ and ‘100’ is the second
token ‘b’, thus,
45 * 1000 = 45000
45000 + 100 = 45100.
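As a minimal sketch (ours, not CAFETIERE code; parse_spaced_number is a hypothetical name), the same arithmetic can be written in Python:

def parse_spaced_number(a, b):
    # Mirrors the rule's arithmetic total = (+ (* _a 1000) _b):
    # multiply the first group by 1000, then add the second group.
    # Assumes the second group is exactly three digits, as in '45 100'.
    return int(a) * 1000 + int(b)

print(parse_spaced_number("45", "100"))  # 45100
print(parse_spaced_number("2", "249"))   # 2249, as in '2 249 cases'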
5.4.1.2. Case classification
Cases of infection from a disease are usually classified as either
suspect, probable or confirmed cases to identify the degree of
certainty of an outbreak. Those terms are known as ‘case
classification’ and are often used in outbreak reports. Therefore, case
classification has been added as a feature to the event extraction
rules. All of the terms that fall under the case classification have been
added to the gazetteer under the class ‘case-classification’. Terms
such as ‘laboratory confirmed’ and ‘epidemiologically linked’ are types
of confirmed cases that have been added.
5.4.1.3. Fatal cases
In addition to the typical classification of reported cases mentioned
above, reports usually contain information about the number of fatal
cases and deaths. To distinguish these cases from the others, their
semantic class is ‘fatal cases’; to distinguish the fatal but not yet dead
from the deaths, the feature ‘dead’ is used, holding a value
of either ‘yes’ or ‘no’ according to the terms used in the texts to
describe the situation.
The simple event pattern ‘China reported 34 cases’ is very common;
however, other similar patterns can be found:
‘China reported 34 new suspected cases and 4 new deaths’.
‘China reported 34 new suspected SARS cases and 4 new deaths’.
So to broaden the coverage of similar patterns, verbs that indicate
reporting, such as ‘reported’, ‘identified’ and ‘confirmed’, were added
to the gazetteer along with their different tenses. Their semantic class
is ‘reporting verbs’.
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak14] =>
[sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
\ [syn=CD, token=_a], [syn=CD, token=_b], [token="new"]?,
[sem="case_classification", classification=_c]?, [token="cases"],
[token="of"|"Of"]?, [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3}, [sem="disease_type"]?,
[token=","]?, [token="and"|"including"], [syn=CD, token=_d], [token="new"]?,
[sem="fatal_cases", dead=_answer] /;
In addition to the active voice, passive patterns like
syn(CD) token (“cases”) sem(haveverb +beverbs) sem(reporting_verbs)
are also used widely in outbreak reports; therefore, the verb groups
such as ‘have been’ and ‘has been’ were added to the rules to capture
the following type of pattern:
‘2 249 cases have been reported ...’
Another example of an outbreak event pattern is:
orth(CD) sem(case_classification) token(“cases”) sem(preposition) syn(NN)
This will pick up sentences such as:
‘130 laboratory-confirmed cases of avian influenza’.
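Again for illustration (our sketch; the capture groups are loose approximations of the rule's constituents, not CAFETIERE notation):

import re

# Approximation of: orth(CD) sem(case_classification) token("cases") sem(preposition) syn(NN)
classified_event = re.compile(
    r"(?P<cases>\d+)\s+(?P<classification>[a-z]+(?:-[a-z]+)?)\s+cases\s+of\s+(?P<disease>[\w\s()-]+)"
)

m = classified_event.search("130 laboratory-confirmed cases of avian influenza")
if m:
    print(m.group("cases"), m.group("classification"), m.group("disease"))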
In addition to this, temporal and locative information may appear in
different positions in the sentence or clause:
‘Since 2005, 20 cases reported, 18 of which have been fatal’
‘20 cases reported since 2005- 18 have been fatal’
‘Of the 20 cases reported, 18 have been fatal since 2005’
More complex patterns can be found when both the temporal and
locative information are mentioned in the same sentence:
“20 cases reported in Cambodia since 2005, 18 have been fatal.”
This can be captured by the following rule:
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak11] =>
\ [syn=CD, token=_n], [token="cases"], [sem="haveverb"]?, [sem="beverb"]?,
[sem="reporting_verbs"], [token="in"], [sem>="geoname"], [token="since"]?,
[token~"19??"|"20??"]?, [token=","|"."]?, [token="of"]?, [token="which"|"these"]?,
[syn=CD, token=_d], [sem="haveverb"]?, [sem="beverb"]?,
[sem="fatal_cases", dead=_answer] /;
Similarly, the phrase ‘has reported’ can occur anywhere in the reporting
clause. The adjunct clause will be used to extract fatal cases such as
number of deaths. We have found the following sub-pattern in the WHO
reports:
tokens(“including” | “and” | “of which” | “of these”) orth(CD) sem(haveverb)? sem(NN)
will match
‘ . . . including 31 deaths’.
Another pattern occurs when the passive voice is used:
tokens(“including” | “and” | “of which” | “of these”) orth(CD) sem(haveverb + beverbs)? sem(NN)
This will capture clauses such as:
‘ . . .and 18 have been fatal’.
5.4.2. Rules for extracting the total number of cases and deaths
In many of the texts chosen for this study, we found that in addition to
reporting the number of cases and deaths, the total or the cumulative total of
cases on a particular date was usually mentioned in the same report. All the
patterns mentioned in the previous outbreak reporting event apply to this
task; however, the only difference for these events is when texts use phrases
like ‘the total number is’ and ‘the cumulative total is’.
For example, to capture the total number mentioned in this fragment:
‘ . . . total number of children affected to be 59. Of these, 52 have died’, we
have designed the following rule:
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak15] =>
\ [token="since"|"In"|"in"]?, [syn=CD, orth=number, token~"19??"|"20??", token=_year]?,
[orth=punct]?, [token="the"], [token="total"], [token="number"]?, [token="of"],
[token="children"|"people"|"human"]?, [token="affected"]?, [token="to"]?,
[sem="beverb"]?, [syn=CD, token=_n], [token="cases"]?, [token=","|"."]?,
[token="of"|"Of"]?, [token="which"|"these"]?, [token=","]?, [syn=CD, token=_d],
[sem="haveverb"]?, [sem="beverb"]?,
[sem="fatal_cases", dead=_answer] /;
5.5. Discussion
This chapter has provided a comprehensive overview of how the training corpus was
analysed to capture disease outbreak details, setting out the expressions we
found helpful in building linguistic patterns. We have discussed all of the
entities, relationships and events that were treated as extraction tasks in
this project and have given examples of the rules and the output results. Some of
the patterns used to capture certain entities and events are not mentioned here;
the complete file containing the rules for all the patterns is attached in Appendix A.
Even though the texts chosen for this study belong to one domain, challenges caused by
linguistic variation do exist. By linguistic variation we mean that different
expressions may be used to convey the same idea. Extracting information from texts
can be approached either by writing a few general patterns (which may lead to
information being tagged under incorrect semantic classes) or by writing as many
specific rules as possible (which leads to an extensive workload, with a rule written
for each pattern, even those rarely found in natural texts). Given the time
constraints, a mixture of generic and specific rules was written to cover
as many patterns for entities, relationships and events as possible.
Writing generic rules for extracting entities seemed like an efficient way to capture
as many patterns as possible. This means that many of the constituents that
comprise the main part of the rule are optional values; however, this can be a
problem if the mandatory fields are few and not very specific, which may end up
only capturing the information of very common patterns. Therefore, we have found
that designing many rules with a limited number of optional values has reduced the
amount of information that is captured incorrectly.
Regarding entity extraction, at the beginning we assumed that extracting the
entities would be the most straightforward part of the project. This assumption
has proven true for extracting the dates, as they are always mentioned in the
same way, and also for extracting the country of the outbreak. The only
problem is with countries that are mentioned in the text but have no further reporting
of a disease outbreak.
E.g.: ‘Argentina and Peru have been notified of the cases that occurred
earlier this month in Chile’.
The countries ‘Argentina’ and ‘Peru’ are not disease outbreak locations - ‘Chile’ is
the outbreak location. So for this task, the work has been focused on the sentence
level; countries mentioned in the first sentences are only captured if they conform to
specific patterns, as it has been found that in the disease outbreak reports, the
important information related to the actual outbreak event is always presented first
and the secondary information is presented later.
Extracting the outbreak name was a relatively challenging task, as these names do not
conform to common patterns; even the orthographic features are not obvious. Many
diseases are named after the person who discovered them or after the location
where they first appeared, which can cause confusion when extracting them. For
example, the word ‘Avian’ was tagged by the gazetteer as the name of a location
and not as part of the name of a disease. Some reports discuss the symptoms of an
outbreak, which can be problematic if their sentences match a pattern designed to
capture an outbreak disease. All of these reasons complicated the extraction
process. Extracting the locations of an outbreak was a very challenging task,
especially for locations that were not tagged in the GeoName database; it was
therefore essential to discover as many expressions as possible.
Conversely, extracting the ‘located in’ relationship was relatively straightforward.
This is because the reporting authorities have a limited number of patterns.
We initially assumed that events extraction would be difficult because the outbreak
events are usually very long and consist of other information that may be extracted
in advance; however, after closely examining and testing the patterns, we decided
to treat each clause or sentence as a number of constituents indicating certain
features. The longest event clause that can occur is when all the features are
mentioned in the same sentence. By features, we mean that case classification,
locative and temporal details and the number of cases are reported in the same
sentence or clause. Other information that may appear in the event clause, such as
the disease name and the country of the outbreak, is read only as part of the
linguistic pattern and is not itself extracted, because it is always extracted
beforehand using separate rules. For example:
‘130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus including 31 deaths’.
The information extracted is:
Number of cases = 130
Case classification = laboratory-confirmed
Number of fatal cases = 31
dead_cases = yes
The name of the disease will not be extracted and is only used to formalize one of
the outbreak event patterns. In doing this, the extraction process will be facilitated
as there is no reason to extract the same information again.
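Conceptually, the filled event template for this sentence amounts to the following (a minimal sketch in Python; the field names mirror the rule attributes used earlier, such as number_of_cases and dead_cases):

# Filled event template for the H7N9 example; the disease name is not a slot
# here because it is extracted beforehand by the entity rules.
outbreak_event = {
    "number_of_cases": 130,
    "case_class": "laboratory-confirmed",
    "number_of_deaths": 31,
    "dead_cases": "yes",
}
print(outbreak_event)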
Most of the difficulties we encountered when designing the rules were due to rule
ordering. The rules file is ordered with the rules for extracting
entities first, followed by relationships and finally events. If a rule for
entity extraction captures information from the clause containing the event
information, the whole event will not be recognised. This problem was partially
solved by defining the ‘before’ and ‘after’ context constituents: the more conditions
added, the more potential overlaps between patterns are avoided.
6. System Evaluation
At the beginning of this chapter, the evaluation metrics that were used to assess the system performance are discussed, in addition to some basic definitions that were considered while validating the extraction outputs. A demonstration of how each report was evaluated is also shown. Finally, precision, recall and F-measure were calculated for both the training and testing sets to conclude the main findings of this project.
6.1. System evaluation metrics
To aid in assessing the level of performance achieved by the designed system, a
complete system evaluation was performed. As discussed in chapter 2, the main
findings of the MUCs include the measures of precision and recall, as well as the F-
measure, which is the harmonic mean of precision and recall. Those metrics were adopted
in this project. Precision indicates how many of the elements extracted by the
system are correct (accuracy), while recall indicates how many of the elements that
should have been extracted were actually extracted (coverage). Although these
measures have occasionally been changed slightly, they are the most commonly
used by researchers to compare the results of their systems with those of other IE
systems (Maynard et al., 2006).
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
F-measure = (2 × Precision × Recall) / (Precision + Recall)
In preparation for the evaluation, the sets of texts run through the system were also
manually annotated. The evaluation process was based on a comparison of the
manual extractions with the system’s output. Elements extracted by the system
were identified as:
• Correct (true positive): Elements extracted by the system align with the value
and type of those extracted manually.
• Spurious (false positive): Elements extracted by the system do not
match any of those extracted manually.
• Missing (false negative): The system did not extract elements that were
extracted manually.
• Partial: The extracted elements are correct, but the system did not capture
the entire range. For example, from the sentence “China today reported 39
new SARS cases and four new deaths”, the system should extract the
number of cases and deaths, but in this instance, it extracted only the
number of cases. This case is a partial extraction and would be allocated a
half weight, resulting in the coefficient 0.5. Another coefficient could be used
to obtain more accurate results. For example, if the majority of an element is
extracted, then a coefficient of 0.75 or higher can be used, but if only a small
part of the element is extracted, a coefficient of 0.40 or less can be used. All
the MUCs assigned partial scores for incomplete but correct elements
(Grishman, 2012).
Therefore, the measures of precision and recall can be calculated as follows:
Precision = (Correct + 0.5 × Partial) / (Correct + Partial + Spurious) = (Correct + 0.5 × Partial) / N
Recall = (Correct + 0.5 × Partial) / (Correct + Partial + Missing) = (Correct + 0.5 × Partial) / M
Where:
N = Total number of elements extracted by the system
M = Total number of manually extracted elements
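As a minimal sketch (ours, not part of any evaluation tooling used in the project), these weighted measures can be computed from the per-text counts in Python; with C = 5, P = 1, S = 0, M = 1 it reproduces, up to rounding, the entity scores of text 2 in Tables 6.1 and 6.2:

def evaluate(correct, partial, spurious, missing, weight=0.5):
    # n = total elements extracted by the system; m = total extracted manually.
    n = correct + partial + spurious
    m = correct + partial + missing
    precision = (correct + weight * partial) / n if n else 0.0
    recall = (correct + weight * partial) / m if m else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(evaluate(5, 1, 0, 1))  # approx. (0.917, 0.786, 0.846)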
6.2. System evaluation process
The manually annotated entities were employed as the answer keys used to
validate the system output. Validation against the training corpus utilised in
designing the rules should yield high scores for both recall and precision. Therefore,
10 texts new to the system were selected from the WHO website. In the training
phase, a set of 25 texts was chosen randomly. Summary reports of disease
outbreaks in different countries were excluded from both the training and the test
set because they typically are constructed differently and contain significantly
different textual patterns.
The system tagged elements either because they were captured by the extraction
rules or they matched a gazetteer entry. The main goal of this project was to test
the extraction rules’ ability to identify elements of the desired value and type.
Elements tagged by the gazetteer consistently possessed the correct value and
type and, if assigned a score, would receive the full score of 1. Therefore, it was
decided to count only the elements captured by the extraction rules.
For example, a text was annotated manually to identify the elements that the
system should extract (see Figure 6.1.).
Figure 6.1: Manual annotation
Entities:
Outbreak name: Meningococcal disease
Country: Burkina Faso
Publish date: 4 February 2003
Report date start: 1 January 2003
Report date end: 26 January 2003
Outbreak locations: Batie, Kossodo, Manga and Tenkodogo
Meningococcal disease in Burkina Faso.4 February 2003.Disease Outbreak Reported.During 1-26 January 2003, the Ministry of Health of Burkina Faso has reported 980 cases and 196 deaths (case-fatality rate, 20%) in the country. On 26 January 2003, 4 districts, Batie, Kossodo, Manga and Tenkodogo, were in the alert phase, although none had crossed the epidemic threshold.
For more details about the epidemic threshold principle, see the article, "Detecting meningococcal meningitis epidemics in highly-endemic African countries" in the Weekly Epidemiological Record.Of a total of 28 specimens collected in 3 districts (Nanoro, Paul VI, Pissy), the National Public Health Laboratory has confirmed Neisseria meningitidis serogroup W135 in 10 samples, Streptococcus pneumoniae in 8 and Haemophilus influenzae type b in 4. The Ministry of Health is implementing control measures to contain the outbreak, including the pre-positioning of laboratory materials and oily chloramphenicol at district level, enhanced epidemiological surveillance, training of health personnel and social mobilization in communities.
Relationship:
Reporting authority: Ministry of Health of Burkina Faso
Event:
Outbreak event: 980 cases and 196 deaths
The same text was run through the system (see Figure 6.2.).
Figure 6.2: System annotation
As can be seen, the elements were correctly extracted and assigned to the
appropriate type (class). The system extracted more elements than the manual
process; the additional elements were tagged by the system gazetteer. In this
particular example, the Ministry of Health is mentioned twice in the text. In the first
instance, the phrase is followed by a country name. This pattern conforms to the
reporting authority rules and therefore was tagged by the system. The second
mention, however, did not follow a pattern recognised by any of the rules; therefore,
it was tagged only by the gazetteer.
In addition, some tags that did not come from the gazetteer can be found; in
this case, they are considered spurious elements. The same process of analysis was
undertaken for both the training and the test sets.
6.3. Results analysis
Table 6.1 shows the extraction results from the training corpus of 25 texts. The table
displays the number of entities, relationships and events correctly identified, those
partially identified by the system and spurious elements. The actual entities, actual
relations and actual events columns present the total number of elements extracted
manually. C, P, S and T refer to the correct, partial, spurious and total elements,
respectively.
Table 6.1: Breakdown of the counting results of the training corpus
Text | Entities extracted: C P S T | Actual entities | Relations extracted: C P S T | Actual relations | Events extracted: C P S T | Actual events
1 4 4 4 1 1 1
2 5 1 (50%)
6 7 1 1 1 1 1 1
3 4 4 5 0 1 1 1 1
4 4 4 5 1 1 1 1 1 1
5 3 1 4 4 0 1 1 1 1
6 4 4 4 1 1 1 1 (80%) 1 1
7 4 4 4 1 1 1 1 1 1
8 10 10 12 1 1 1 1 1 1
9 4 4 4 1 2 (50%) 3 3
10 7 2 9 8 1 1 1 1 (50%) 1 1
11 3 1 4 8 1 1 2 1
12 4 4 4 1 1 1 1 1 1
13 4 4 4 1 1 1 1 1 1
14 5 1 6 6 1 1 1 1 1 1
15 5 5 5 1 1 1
16 4 4 7 1 1 1 2 2 2
17 4 4 5 3 1 4 3
18 4 4 7 1 1 1 1 (80%) 1 1
19 3 3 3 1 1 1 1 (50%) 1 1
20 8 8 10 1 1 1 1 1 1
21 4 4 5 0 1 1 1 1
22 6 1 7 8 0 1 1 1 1
23 4 4 4 1 1 1 1 1 1
24 4 4 5 1 1 1
25 5 1 6 5 1 1 1 2 2 3
The precision, recall and f-measures were calculated based on the results
presented in Table 6.1. The calculation results are shown in Table 6.2.
Table 6.2: Breakdown of the evaluation metrics of the training corpus
Text | Entities: Precision Recall F-measure | Relations: Precision Recall F-measure | Events: Precision Recall F-measure
1 1.00 1.00 1.00 1.00 1.00 1.00
2 0.91 0.78 0.85 1.00 1.00 1.00 1.00 1.00 1.00
3 1.00 0.80 0.90 0.00 0.00 1.00 1.00 1.00
4 1.00 0.80 0.90 1.00 1.00 1.00 1.00 1.00 1.00
5 0.75 0.75 0.75 0.00 0.00 1.00 1.00 1.00
6 1.00 1.00 1.00 1.00 1.00 1.00 0.8 0.8 0.80
7 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
8 1.00 0.83 0.92 1.00 1.00 1.00 1.00 1.00 1.00
9 1.00 1.00 1.00 0.70 0.70 0.70
10 0.80 0.90 0.85 1.00 1.00 1.00 0.5 0.5 0.50
11 0.80 0.40 0.60 0.50 1.00 0.75
12 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
13 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
14 0.83 0.83 0.83 1.00 1.00 1.00 1.00 1.00 1.00
15 1.00 1.00 1.00 1.00 1.00 1.00
16 1.00 0.60 0.80 1.00 1.00 1.00 1.00 1.00 1.00
17 1.00 0.8 0.90 0.80 1.00 0.90
18 1.00 0.60 0.80 1.00 1.00 1.00 0.80 0.80 0.80
19 1.00 1.00 1.00 1.00 1.00 1.00 0.5 0.5 0.50
20 1.00 0.8 0.90 1.00 1.00 1.00 1.00 1.00 1.00
21 1.00 0.8 0.90 0.00 0.00 1.00 1.00 1.00
22 0.93 0.80 0.87 0.00 0.00 1.00 1.00 1.00
23 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
24 1.00 0.8 0.90 1.00 1.00 1.00
25 0.83 1.00 0.92 1.00 1.00 1.00 1.00 0.70 0.85
Average 0.95 0.85 0.90 0.80 0.80 0.80 0.90 0.92 0.91
An initial observation that emerges from both tables is that the system delivers a high level
of performance when extracting outbreak events. Extremely high precision and recall were
achieved in all events; in 17 of 25 cases, precision and recall equalled 1. These high
values were produced not only because the texts used were the actual training set, but
also because the patterns utilised to extract the event were studied extensively in order to
design additional rules for never-before-seen patterns. These additional patterns were
predicted based on the knowledge that many tokens can separate adjacent pieces of
information (the number of cases and deaths). The word ‘tokens’ here describes elements
that can refer to temporal and location information, or to disease names.
The results of entities extraction also demonstrate extremely high performance. Recall is
slightly lower than precision but is still considered high, with an average of 0.85.
The relationships were either correctly extracted or not extracted at all. Although the task
of designing the relationships rules was relatively straightforward, it produced the lowest
precision and recall, primarily because all the constituents of the relationship rule were
made mandatory fields during the design phase, which prevented partial extraction.
Enabling optional fields could lead to many false alerts because the names of health
authorities often occur several times in a single outbreak report. This restriction resulted in a
score of 1 for both recall and precision in 80% of the relations mentioned in all the reports,
indicating that the rules succeeded in accurately extracting all the relationships in a report.
The test corpus results are shown in Tables 6.3 and 6.4.
Table 6.3: Breakdown of the counting results of the test corpus
Text | Entities extracted: C P S T | Actual entities | Relations extracted: C P S T | Actual relations | Events extracted: C P S T | Actual events
1 5 5 6 1 1 2
2 8 8 10 0 1 1 1 1
3 4 4 5 1 1 1 1 2 1
4 3 3 4 1 1
5 8 8 8 1 1 1 1 1
6 4 4 5 0 1 1 1 2
7 3 3 9
8 4 4 5 1 1 1 1 1 1
9 2 2 5 1 1 2
10 6 6 6 1 1 1 1 1 1
Table 6.4: Breakdown of the evaluation metrics of the test corpus
Text | Entities: Precision Recall F-measure | Relations: Precision Recall F-measure | Events: Precision Recall F-measure
1 1.00 0.83 0.92 1.00 0.5 0.75
2 1.00 0.8 0.90 0 0 1.00 1.00 1.00
3 1.00 0.8 0.90 1.00 1.00 1.00 0.5 1.00 0.75
4 1.00 0.75 0.88 1.00 1.00 1.00
5 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
6 1.00 0.8 0.90 0 0 1.00 0.5 0.75
7 1.00 0.33 0.67
8 1.00 0.8 0.90 1.00 1.00 1.00 1.00 1.00 1.00
9 1.00 0.4 0.70 1.00 0.5 0.75
10 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Average 1.00 0.75 0.88 0.70 0.70 0.70 0.90 0.80 0.85
The most noticeable result is that all entities extracted from the test corpus were correct.
The perfect precision indicates that extraction produced few errors; the lower recall reflects
missed elements, and achieving high recall is generally more challenging than achieving
high precision (Sarawagi, 2007).
As in the training corpus, the relationship extraction had the lowest performance level. In
addition to the possible cause discussed earlier, the rules did not take into account the
new reporting patterns.
Event extraction for the test set also achieved a high level of performance even for new
patterns. The results show an average precision of 90% and recall of 80%, reasonably
high considering the complexity of the task.
It is necessary to determine the number of occurrences of each entity type in order to
assess their level of difficulty. Not all the entities have the same frequency, necessitating
accurate measurement of the performance of the rules.
Table 6.5: Number of occurrences of each entity type
Entity | Training set: C P S M | Testing set: C P S M
Published date 24 10
Report date 14 1 6 5 1
Country 22 1 10
Disease 23 2 1 6 4
Locations 28 2 3 17 15 11
Disease code 5 1
Table 6.5 shows that, in both the training and the testing sets, the location entity has the
greatest number of missing elements. This result is not surprising because the extraction
of locations was the most challenging task during the design phase. Locations can be
mentioned anywhere in a report and do not conform to obvious patterns. In addition, unlike
the other entities, locations can be mentioned within a group of other locations as, for
example, in the statement “Cases have also been reported in Larnaca, Famagusta,
Nicosia and Paphos”. Problematically, in such patterns, the number of locations that can
be mentioned within a clause may remain undetermined. In addition, the preceding and
following sentences can present various patterns. Those factors make location one of the
most difficult entities to handle within outbreak reports.
Another observation can be made about disease names. Although the design of the
extraction rules for diseases was extremely challenging and required both a deep analysis
of various disease names and the linguistic analysis of the context in order to prove that
an entity actually was an outbreak, the results show highly accurate identification and few
errors. These results are acceptable considering the difficulty of the task.
The extraction results, though, can be improved. In particular, the results for the location
and disease name entities could be improved significantly by using up-to-date official
datasets of location and disease names. Doing so would allow most effort to be focused on
analysing linguistic patterns rather than positing potential name structures and
combinations.
Taking into account the limited time allocated for the project and the time-consuming
nature of building, adjusting and testing the extraction rules, the performance results
show an extremely high degree of accuracy.
7. Conclusion
The main aim of this project was to acquire basic knowledge of IE technology and
methodologies. A number of objectives were defined in accordance with this general aim.
The background study was conducted to satisfy the first objective, which was to assess
the current state of the art of IE. We reviewed the basic definitions of IE and its position in
the NLP spectrum. The history of IE was briefly described, including the most influential
events such as MUC and ACE, in order to reveal the basics of these systems. The two IE
approaches, the knowledge engineering approach and the automatic training approach,
were examined, and the reasons for preferring one were explained. The evaluation metrics—
recall and precision—were reviewed. Finally, several examples of IE systems were given
and the influence of modern software engineering approaches on system designs
considered.
A rule-based approach—the knowledge engineering approach—was selected as the
methodology for this project. To determine the success of the designed rules, a hybrid
model involving both the prototype and the waterfall approach was adopted as the overall
development strategy; therefore, rules were designed, implemented and tested
simultaneously.
Along with the theoretical work, the majority of the practical work consisted of extracting a
predefined set of information from the online WHO disease outbreak reports that contained
similar information. Extraction rules thus were designed to capture the repeated
information. The information extraction system CAFETIERE was used to implement the
rules, while the rules writing was based on extensive study of the linguistic patterns
preceding and following the target element. Essential to the rule-based approach is
studying the classes of words and phrases used in the arguments of such information.
The evaluation of the extraction rules yielded high precision and recall scores, close to
those of state-of-the-art IE. The experiments were conducted independently with two
subset corpora (the training and testing sets). The sets delivered similar system
performance, although the training corpus had higher accuracy, particularly for relationship
extraction. Event extraction, surprisingly, yielded very high scores. The approach that
helped achieve these scores was to consider what pieces of information may form the
event clause itself: instead of only capturing the number of cases and deaths caused by
the outbreak, other information, such as case classification, fatality status, year and total
numbers, was also included in the task. Those constituents helped in building many of
the linguistic patterns that comprise the outbreak events.
It can be concluded that the rule-based approach has been proven capable of delivering
reliable information extraction with extremely high accuracy and coverage results. This
approach, though, requires an extensive, time-consuming, manual study of word classes
and phrases.
In the future, this research could be expanded in various directions. For instance,
information about individual cases affected by an outbreak could be extracted, such as the
gender, age, province, village and initial symptoms of a particular case. It would be useful
to investigate how to use co-references in multiple sentences. In addition, the identification
of location entities could be improved by combining the different levels of a location into a
single relation; for example, Halifax, Nova Scotia, could be extracted as a location
relationship. Finally, study should be directed toward reports on outbreaks affecting plants
and animals.
References
ACHARYA, S. and PARIJA, S. (2010) The process of information extraction through natural language processing. International Journal of Logic and Computation, 1(1), pp. 40-51.
AHN, D. (2006) The stages of event extraction. ARTE '06 Proceedings of the Workshop on Annotating and Reasoning about Time and Events, Sydney, Australia, pp.1-8.
APPELT, D. et al. (1993) FASTUS: A finite-state processor for information extraction from real-world text. Proceedings of the 13th International Joint Conference on Artificial Intelligence, (IJCAI-93), pp. 1172-1178.
APPELT, D. and ISRAEL, D. (1999) Introduction to information extraction. Artificial Intelligence Communications, 12 (3), pp. 161–172.
AVISON, D. and FITZGERALD, G. (2003) Where now for development methodologies? Communications of the ACM, 46 (1), pp. 78–82.
BANKO, M. and ETZIONI, O. (2008) The tradeoffs between open and traditional relation extraction. Proceedings of ACL-08: HLT, Columbus, Ohio, Association for Computational Linguistics, June 2008, pp. 28-36.
BLACK, W.J. et al. (2005) Parmenides Technical Report. Available from: http://www.nactem.ac.uk/files/phatfile/cafetiere-report.pdf [Accessed 29/04/13].
BLACK, W.J. et al. (2012) A data and analysis resource for an experiment in text mining a collection of micro-blogs on a political topic. In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2083-2088. Available from: http://www.lrec-conf.org/proceedings/lrec2012/pdf/1056_Paper.pdf [Accessed 29/04/2013].
CARDIE, C. (1997) Empirical methods in information extraction. AI Magazine, 18 (4), pp. 65-79.
CHINCHOR, N. (2001) Overview of MUC-7. Science Applications International Corporation. Available from: http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/overview.html [Accessed 28/04/13].
COWIE, J. and LEHNERT, W. (1996) Information Extraction. Communications of the ACM, 39 (1), pp. 80–91.
COWIE, J. and WILKS, Y. (2000) Information extraction. In: DALE, R., MOISL, H. and SOMERS, H., eds. Handbook of natural language processing. New York: Marcel Dekker, pp. 241-269.
CUNNINGHAM, H. (2006) Information extraction, automatic. In: Encyclopedia of language and linguistics, 2nd ed. Amsterdam: Elsevier Science, 5, pp. 665-677.
NADEAU, D. and SEKINE, S. (2007) A survey of named entity recognition and classification. Linguisticae Investigationes, 30, pp. 3–26.
De SITTER, A. et al. (2004) A formal framework for evaluation of information extraction. Technical report no. 2004-4. Antwerp: University of Antwerp Dept. of Mathematics and Computer Science, 2004.
DIETL, R. et al. (2008) Project deliverable report: Deliverable D2.1 services approach & overview general tools and resources. Available from: http://dspace.ou.nl/bitstream/1820/1707/1/D2.1%20final%20EC.pdf [Accessed 28/04/13].
ESPARCIA, S. et al. (2010) Integrating information extraction agents into a tourism recommender system. Proceedings of EAIS2010, Springer LNAI 6077, pp.193 – 200.
FERRUCCI, D. and LALLY A. (2005) Building an example application with the unstructured information management architecture. IBM Systems Journal, 43, pp. 455-475.
FRIEDMAN, C. et al. (1995) Natural language processing in an operational clinical information system. Natural Language Engineering, 1, pp. 1-28.
GRISHMAN, R. et al. (2002) Information extraction for enhanced access to disease outbreak reports. Journal of Biomedical Informatics, 35 (4), pp. 236-246.
GRISHMAN, R. (2005) NLP: An information extraction perspective. In ANGELOVA, G. et al., eds. Recent Advances in Natural Language Processing (RANLP). Borovets, Bulgaria: INCOMA, pp. 1-4.
GRISHMAN, R. (2012) Information extraction: Capabilities and challenges. Notes prepared for the 2012 International Winter School in Language and Speech Technologies, Rovira i Virgili University, Tarragona, Spain.
GRISHMAN, R. and SUNDHEIM, B. (1996) Message understanding conference - 6: A brief history. Proceedings of the 16th International Conference on Computational Linguistics (COLING 96), Copenhagen, August 1996.
HAHN, U. et al. (2008) An overview of JCoRe, the JULIE Lab UIMA component repository. Proceedings of the LREC 2008 Workshop on UIMA for NLP, pp. 1-7.
HAUG, P.J. et al. (1997) A natural language parsing system for encoding admitting diagnoses. Proceedings of the AMIA Annual Fall Symposium, pp. 814-8.
JONG, G. (1977) FRUMP. ACM SIGART Bulletin, 61, pp. 54-55.
MASSEY, V. and SATAO, K. (2012) Comparing various SDLC models and the new proposed model on the basis of available methodology. International Journal of Advanced Research in Computer Science and Software Engineering (IJARCSSE), 2, pp. 170-177.
MAYNARD, D. et al. (2006) Metrics for evaluation of ontology-based information extraction. Proceedings of the WWW 2006 Workshop on Evaluation of Ontologies for the Web (EON), Edinburgh, May 2006.
MOENS, M. (2006) Information extraction: Algorithms and prospects in a retrieval context. The Information Retrieval Series 21. New York: Springer.
MOSCOVE, S. (2001) Prototyping: An alternative approach to systems development work. Review Of Business Information Systems, 5 (3), pp. 65-72.
PISKORSKI, J. and YANGARBER, R. (2012) Information extraction: Past, present and future. In POIBEAU, T. et al. Multi-source, multilingual information extraction and summarization. Theory and Applications of Natural Language Processing series. Berlin and Heidelberg: Springer-Verlag, pp. 23-49.
RAMSHAW, L. and WEISCHEDEL, R. (2005) Information extraction. Proceedings of IEEE ICASSP, Philadelphia, PA, 2005. IEEE digital library, 5, pp. 969–972.
SAGER, N., et al. (1987) Medical language processing: Computer management of narrative data. Reading, MA: Addison-Wesley.
SARAWAGI, S. (2007) Information extraction. Foundations and Trends Databases, 1 (3), pp. 261-377.
TURMO, J. et al. (2006) Adaptive information extraction. ACM Computing Surveys 38 (2), pp. 1–47.
WILKS, Y. (1997) Information extraction as a core language technology. In: PAZIENZ, M. Information Extraction A Multidisciplinary Approach to an Emerging Information Technology. Berlin and Heidelberg: Springer-Verlag, 1299, pp. 1-9.
Appendix A: Extraction rules

# Ex: Cholera in Chile
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name1] =>
[token!="in"]
\ [sem=disease, orth="capitalized", token=_o, syn!=DT, sent=0], [sem="disease_type"]? /
[token="in"|","], [sem="geoname/COUNTRY"];

# Ex: Malaria
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name2] =>
[sem!="symptoms", sent>0]
\ [sem=disease, token=_o], [sem="disease_type"]? /;

# Ex: Coronavirus
[syn=np, sem=outbreak, type=entity, key=_o, rulid=outbreak_name3] =>
\ [token="*virus", token=_o], [sem=disease_type] /;

# Ex: Herpes type 1
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name4] =>
\ [sem=disease, token=__o], [token="type", token=__o]?, [token="-", token=__o]?, [orth=number, token=__o]? /;

# Ex: acute poliomyelitis outbreak
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name5] =>
\ [sem="disease_condition", token=__o], [sem=disease, token=__o], [sem=disease_type]? /;

# Ex: acute poliomyelitis outbreak
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name6] =>
[sem!="symptoms", sent>0]
\ [sem="disease_condition", token=__o], [sem=disease, token=__o], [sem=disease_type]? /;

# Ex: Severe acute respiratory syndrome
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name7] =>
[sem!="symptoms", sent>0]
\ [sem="disease_condition", token=__o], [token=__o]{1,2} /
[sem=disease_type];
# Ex: outbreak of Shigellosis
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name8] =>
[token="outbreak"|"outbreaks"], [token="of"]
\ [token=__o]{1,3}, [sem=disease_type, token=__o] /;

# Ex: avian influenza A(H7N9) virus
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name9] =>
[token="infection"], [token="with"]
\ [token=__o, syn!=DT], [sem=disease_type, token=__o], [token=__o]{1,2}, [sem=disease_type] /
[token!="Outbreak"|"of"|"including"|"and"|","], [token!="Reported"];

# Ex: acute poliomyelitis outbreak
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name10] =>
\ [sem="disease_condition", token=__o], [token=__o]{1,2}, [sem=disease_type] /
[token!="Outbreak"|"of"|"including"|"and"|","], [token!="Reported"];

# Ex: Acute haemorrhagic fever
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name11] =>
\ [token=__o, sem="disease_condition", orth=capitalized], [token=__o, sem="disease_condition"]*, [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: Acute haemorrhagic fever
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name12] =>
[sem!="symptoms", sent>0]
\ [token=__o, sem="disease_condition", orth=capitalized], [token=__o, sem="disease_condition"]*, [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: SARS virus
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name13] =>
\ [token~"^[A-Z]+$", token=__o], [sem=disease_type, token=__o] /;

# Ex: avian influenza
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name14] =>
[token="infection"], [token="with"]
\ [token=__o, syn!=DT], [sem=disease_type, token=__o] /
[token!="Outbreak"|"of"|"including"|"and"|","], [token!="Reported"];

# Ex: Dengue/dengue haemorrhagic fever
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name15] =>
[sem!="symptoms"]
\ [token=__o, orth=capitalized, syn!=DT], [token=__o]{1,2}, [token=__o, sem="disease_condition"], [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: Dengue/dengue haemorrhagic fever from title
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name16] =>
\ [token=__o, orth=capitalized, syn!=DT, sent=0], [token=__o]{1,2}, [token=__o, sem="disease_condition"], [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];

# Ex: Yellow fever from title
[syn=np, sem=outbreak, type=entity, key=__o, rulid=outbreak_name17] =>
\ [orth=capitalized, token=__o, sent=0], [token=__o, sent=0]?, [sem=disease_type, token=__o] /;

# Ex: Ebola haemorrhagic fever from title
[syn=clause, sem=outbreak, type=entity, key=__o, rulid=outbreak_name18] =>
\ [token=__o, orth=capitalized, syn!=DT, sent=0], [token=__o]{0,2}, [token=__o, sem="disease_condition"], [sem="disease_type", token=__o]{1,2} /
[token!="Outbreak"|"of"|"including"|"and"|","|"outbreak"], [token!="Reported"];
# Ex: China
[syn=NNP, sem=country_of_the_outbreak, type=entity, country=_c, rulid=country_name] =>
[token="Situation"|"situation", sem!="collaboration"], [token="in"]
\ [syn=DT]?, [sem>="geoname/COUNTRY", token=_c] /;

# Ex: China
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name1] =>
[sem!="collaboration"], [token="in"|","]
\ [sem>="geoname/COUNTRY", token=_c, sent<=2] /;

# Ex: China from title
[syn=NNP, sem=country_of_the_outbreak, type=entity, key=_c, rulid=country_name2] =>
\ [sem>="geoname/COUNTRY", token=_c, sent=0] /;
# Ex: reported from Bengo, Malange, and Luanda
[syn=NNP, sem=outbreak_locations, type=entity, loc1=_l1, loc2=_l2, loc3=_l3, loc4=_l4, loc5=_l5, loc6=_l6, loc7=_l7, loc8=_l8, rulid=locations1] =>
[sem="reporting_verbs"], [token="in"|"from"|","|"In"], [token~"^[a-z]+$"]{0,5}
\ [sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l1],
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l2]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l3]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l4]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l5]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l6]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l7]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2"|"geoname/PT", token=_l8]?, [sem="areas"]? /;

[syn=NNP, sem=outbreak_locations, type=entity, loc1=_l1, loc2=_l2, loc3=_l3, loc4=_l4, loc5=_l5, loc6=_l6, loc7=_l7, loc8=_l8, rulid=locations2] =>
[sem="disease_type"], [token~"^[a-z]+$"]{0,2}, [token="in"|","|"In"|"from"]
\ [syn=DT]?, [sem="geoname/PPL"|"geoname/ADM2", token=_l1],
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l2]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l3]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l4]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l5]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l6]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l7]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l8]?, [sem="areas"]? /;

[syn=NNP, sem=outbreak_locations, type=entity, loc1=_l1, loc2=_l2, loc3=_l3, loc4=_l4, loc5=_l5, loc6=_l6, loc7=_l7, loc8=_l8, rulid=locations3] =>
[token="outbreak"|"Outbreak"], [token~"^[a-z]+$"]{0,2}, [token="in"|","|"In"|"from"]
\ [syn=DT]?, [sem="geoname/PPL"|"geoname/ADM2", token=_l1],
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l2]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l3]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l4]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l5]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l6]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l7]?,
[sem="areas"]?, [token=","|"and"]?, [token~"^[a-z]+$"]{0,2},
[sem="geoname/PPL"|"geoname/ADM2", token=_l8]?, [sem="areas"]? /;
# Ex: Cases have also been reported for the first time in the provinces of Velasco
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations4] =>
[sem="haveverb"]?, [token="today"|"also"]?, [token="been"]?, [sem="reporting_verbs"],
[token="in"|"from"|","|"In"], [token~"^[a-z]+$"]{0,5}
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c], [sem="areas"]? /;
# Ex: male from Phnom
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations5] =>
[sem="people"], [token="from"|"for"]?, [token~"^[a-z]+$"]{0,2}
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="areas"]?;

# Ex: Beijing has reported
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations6] =>
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="haveverb"]?, [token="today"]?, [sem="reporting_verbs"];

# Ex: 4 cases from Beijing
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations7] =>
[token="cases"|"deaths"|"case"], [token~"^[a-z]+$"]{0,2}, [token="in"|"from"|","]
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="areas"]?;

# Ex: Beijing
[syn=NNP, sem=outbreak_locations, type=entity, key=_c, rulid=locations8] =>
[sem="outbreak_locations"], [token="of"]
\ [sem="geoname/PPL"|"geoname/ADM2", token=_c] /
[sem="areas"]?;
# Ex: 4 cities out of 10, ()
[syn=NP, sem=outbreak_locations, type=entity, loc1=__l1, loc2=__l2, loc3=__l3, loc4=__l4, loc5=__l5, loc6=__l6, loc7=__l7, loc8=__l8, rulid=locations9] =>
[syn=CD]?, [sem="areas"], [token="out"]?, [token="of"]?, [syn=CD]?, [orth=punct]?
\ [token="("]?, [orth=capitalized, token=__l1]{1,2},
[token=","]?, [token="and"]?, [orth=capitalized, token=__l2]{1,2},
[token=","]?, [token="and"]?, [orth=capitalized, token=__l3]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l4]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l5]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l6]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l7]*,
[token=","]?, [token="and"]?, [orth=capitalized, token=__l8]*,
[token=")"]? /;

# Ex: Ogooue-Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations10] =>
[token="cases"|"deaths"|"case"], [token="in"|","|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [token="-"], [token=__c] /
[sem="areas"];

# Ex: Ogooue Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations11] =>
[token="cases"|"deaths"|"case"], [token="in"|","|"In"|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [orth=capitalized, token=__c]? /
[sem="areas"];

# Ex: Ogooue Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations12] =>
[sem="reporting_verbs"], [token="in"|","|"In"|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [orth=capitalized, token=__c]? /
[sem="areas"];

# Ex: Ogooue Ivindo
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations13] =>
[sem="disease_type"], [token~"^[a-z]+$"]{0,2}, [token="in"|","|"In"|"from"]
\ [syn=DT]?, [orth=capitalized, token=__c], [orth=capitalized, token=__c]? /
[sem="areas"];

# Ex: old male from Kong Pisey district
[syn=NNP, sem=outbreak_locations, type=entity, key=__c, rulid=locations14] =>
[sem="people"], [token="from"|"for"]?, [token~"^[a-z]+$"]{0,2}
\ [orth=capitalized, token=__c], [token="-", token=__c]?, [orth=capitalized, token=__c]? /
[sem="areas"]?;
# ex: 2 November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, day=_day, year=_year, rulid=publish_date1] =>
\ [syn=CD, orth=number, token=__t, token=_day, sent<=1],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent<=1],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent<=1] /;

# ex: 2 November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, day=_day, year=_year, rulid=report_date1] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=2],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=2] /;

# ex: November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, year=_year, rulid=report_date2] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [sem="temporal/interval/month", monthno=_mno, key=__t],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=2] /;

# ex: 21 November
[syn=np, sem=date, type=entity, key=__t, month=_mno, rulid=report_date3] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=2],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=2] /;

# ex: 2 November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, day=_day, year=_year, rulid=report_date4] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=3],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=3] /;
# ex: November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, year=_year, rulid=report_date5] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [sem="temporal/interval/month", monthno=_mno, key=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_year, sent=3] /;

# ex: November 2003
[syn=np, sem=date, type=entity, key=__t, month=_mno, rulid=report_date6] =>
[token="As"|"as"]?, [token="of"|"on"|"On"|"in"|"since"]
\ [syn=CD, orth=number, token=__t, token=_day, sent=3],
[sem="temporal/interval/month", monthno=_mno, key=__t, sent=3] /;

# ex: During 1-26 January 2003 in sentence 2
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, month=_mno, year=_year, rulid=report_date7] =>
[token="during"|"During"|"From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=2],
[token="-"|"/"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=2],
[sem="temporal/interval/month", monthno=_mno, token=__tt, token=__t, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=__t, token=_year, sent=2] /;

# ex: During 1-26 January 2003 in sentence 3
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, month=_mno, year=_year, rulid=report_date8] =>
[token="during"|"During"|"From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=3],
[token="-"|"/"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=3],
[sem="temporal/interval/month", monthno=_mno, token=__tt, token=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=__t, token=_year, sent=3] /;
# ex: From 1 January 2003 to 27 January 2003 in sentence 2
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, from_month=_mm, to_month=_m, from_year=_yy, to_year=_y, rulid=report_date9] =>
[token="From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=2],
[sem="temporal/interval/month", token=_mm, token=__tt, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=_yy, sent=2]?,
[token="-"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=2],
[sem="temporal/interval/month", token=_m, token=__t, sent=2],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_y, sent=2] /;

# ex: From 1 January 2003 to 27 January 2003 in sentence 3
[syn=np, sem=date, type=entity, from_date=__tt, to_date=__t, day1=_dd, day2=_d, from_month=_mm, to_month=_m, from_year=_yy, to_year=_y, rulid=report_date10] =>
[token="From"|"from"]
\ [syn=CD, orth=number, token=__tt, token=_dd, sent=3],
[sem="temporal/interval/month", token=_mm, token=__tt, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__tt, token=_yy, sent=3]?,
[token="-"|"to"],
[syn=CD, orth=number, token=__t, token=_d, sent=3],
[sem="temporal/interval/month", token=_m, token=__t, sent=3],
[token=","]?,
[syn=CD, orth=number, token~"19??"|"20??", token=__t, token=_y, sent=3] /;
#Ex: The Department of Santa Cruz [syn=clause, sem=reporting_authority, type=relationship, country=_c, health_agency=__h, rulid=reporting_authority1] =>! \! [token="the"|"The",token=__h], [token="department"|"Department", token=__h],[token="of", token=__h],[sem>="geoname", token=__h, token=_c]! /! [sem="haveverb"]?,! [sem="reporting_verbs"];
# Ex: Ministry of Health (MoH) of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority2] =>
  \ [sem="health_agency", token=__d, token=_h],
    [token="("], [syn="NN"]?, [token=")"],
    [token="of"|"in", token=__d],
    [token="the", token=__d]?,
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: Ministry of Health of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority3] =>
  \ [sem="health_agency", token=__d, token=_h],
    [token="of"|"in", token=__d],
    [token="the", token=__d]?,
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The Afghan Ministry of Public Health
[syn=clause, sem=reporting_authority, type=relationship, key=__d, nationality=_n, health_agency=_h, rulid=reporting_authority4] =>
  \ [syn=DT, token=__d],
    [orth=capitalized, token=_n, token=__d],
    [sem="health_agency", token=__d, token=_h]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: Ministry of Health (MoH) of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=_h, rulid=reporting_authority5] =>
  \ [sem="health_agency", token=__d, token=_h],
    [token="("]?, [syn="NN"]?, [token=")"]?,
    [token=",", token=__d], [token="the", token=__d]?,
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The Human Services Department, Public Health Division of the Government of Victoria
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=__h, rulid=reporting_authority6] =>
  \ [syn=DT, token=__d, sent<=4],
    [orth=capitalized, token=__h, token=__d],
    [orth=capitalized, token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [token=","|"in"|"and"|"of", token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [orth=capitalized, token=__h, token=__d]?,
    [token="of"|"in"|",", token=__d],
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The Ministry of Health and Public Health Division, Bangladesh
[syn=clause, sem=reporting_authority, type=relationship, key=__d, country=_c, health_agency=__h, rulid=reporting_authority7] =>
  \ [syn=DT, token=__d, sent<=4],
    [sem="health_agency", token=__d, token=__h],
    [token="and", token=__d, token=__h],
    [token=__h, token=__d],
    [token=__h, token=__d],
    [token=__h, token=__d]?,
    [token=__h, token=__d]?,
    [token="of"|"in"|",", token=__d],
    [sem>="geoname", token=__d, token=_c]
  /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: Ministry of Health (MoH) of the Kingdom of Cambodia
[syn=clause, sem=reporting_authority, type=relationship, country=_c, health_agency=_h, rulid=reporting_authority8] =>
  \ [sem="health_agency", token=_h] /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: The ministry of health has reported
[syn=clause, sem=reporting_authority, type=relationship, health_authority=_h, rulid=reporting_authority9] =>
  \ [sem>="geoname", token=_h] /
  [sem="haveverb"]?, [sem="reporting_verbs"];
# Ex: a total of 167 230 cases have been confirmed in Egypt, of which 60 have been fatal
# Ex: 20 111 cases reported in Cambodia since 2005, 18 have been fatal.
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak1] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="cases"],
    [sem="haveverb"]?, [sem="beverb"]?, [sem="reporting_verbs"],
    [token="in"], [sem>="geoname"],
    [token="since"]?, [token~"19??"|"20??"]?,
    [token=","|"."]?, [token="of"]?, [token="which"|"these"]?,
    [syn=CD, token=_d],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 20 cases reported in Cambodia since 2005, 18 have been fatal.
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak2] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="cases"],
    [sem="haveverb"]?, [sem="beverb"]?, [sem="reporting_verbs"],
    [token="in"], [sem>="geoname"],
    [token="since"]?, [token~"19??"|"20??"]?,
    [token=","|"."]?, [token="of"]?, [token="which"|"these"]?,
    [syn=CD, token=_d],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak3] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [token=","]?, [token="("]?,
    [token="and"|"including"|"of"|"with"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [token=")"]
  /;
# Ex: 826 cases of dengue (3 deaths)
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak4] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [token=","]?, [token="("]?,
    [token="and"|"including"|"of"|"with"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [token=")"]
  /;
# Ex: the total in Azerbaijan to 8
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak5] =>
  [token="total"], [token="in"]?, [sem="geoname/COUNTRY"], [token="to"]
  \ [syn=CD, token=_n],
    [token="."|","],
    [syn=CD, token=_d],
    [token="of"], [token="these"], [token="cases"],
    [sem="beverb"],
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 13 100 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, disease_name=__dn, disease_code=__s, dead_cases=_answer, year=_year, rulid=reporting_outbreak6] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"],
    [token=__dn]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token="and"|"including"],
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 13 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, disease_name=__dn, disease_code=__s, dead_cases=_answer, year=_year, rulid=reporting_outbreak7] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?, [token="human"]?, [token="infection"]?, [token="with"],
    [token=__dn]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token="and"|"including"],
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 27 confirmed cases with 13 deaths have been reported
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak8] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="with"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: 27 confirmed cases with 13 deaths have been reported
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak9] =>
  [token="total"]?, [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="with"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: total of 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak10] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,2},
    [sem=disease_type]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token=","]?, [token="of"]?, [token="which"]?,
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: total of 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak11] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,2},
    [sem=disease_type]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token=","]?, [token="of"]?, [token="which"]?,
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: a total of 132 cases, including 37 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak12] =>
  [token="total"], [token="of"|"to"]?
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: total number of children affected to be 59. Of these, 52 have died
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak13] =>
  \ [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?,
    [orth=punct]?,
    [token="the"], [token="total"], [token="number"]?, [token="of"],
    [token="children"|"people"|"human"]?,
    [token="affected"]?, [token="to"]?, [sem="beverb"]?,
    [syn=CD, token=_n],
    [token="cases"]?,
    [token=","|"."]?, [token="of"|"Of"]?, [token="which"|"these"]?, [token=","]?,
    [syn=CD, token=_d],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 2 249 cases of dengue fever including 6 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak14] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 2 249 cases of dengue fever including 6 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak15] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 291 suspected cases of Rift Valley fever, including 64 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak16] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [orth=capitalized]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: has reported 291 suspected cases of Rift Valley fever, including 64 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak17] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"|"Of"]?,
    [orth=capitalized]{0,3},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 130 laboratory-confirmed cases of human infection with avian influenza A(H7N9) virus
# Ex: including 31 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case1_class=_c, case2_class=_c1, case3_class=_c2, number_of_deaths=_d, dead_cases=_answer, number_of_cases2=_n2, number_of_cases3=_n3, rulid=reporting_outbreak18] =>
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [orth=punct]?,
    [token="("],
    [token=_n2],
    [token="by"]?,
    [sem="case_classification", classification=_c1],
    [token=","|"and"],
    [token=_n3],
    [token="by"]?,
    [sem="case_classification", classification=_c2],
    [token=")"],
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 30 confirmed cases (14 laboratory and 16 epidemiologically linked)
[syn=clause, sem=outbreak_event, type=event, number_of_cases1=__n, case_class1=_c, case_class2=_a, case_class3=_b, number_of_deaths=_d, dead_cases=_answer, number_of_cases2=_n2, number_of_cases3=_n3, rulid=reporting_outbreak19] =>
  \ [syn=CD, token=__n],
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [orth=punct]?,
    [token="("],
    [token=_n2],
    [token="by"]?,
    [sem="case_classification", classification=_a],
    [token=","|"and"],
    [token=_n3],
    [token="by"]?,
    [sem="case_classification", classification=_b],
    [token=")"],
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: has today reported 39 new probable SARS cases and 4 new deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak20] =>
  [sem="haveverb"]?, [sem="beverb"]?, [token="today"]?
  \ [sem="reporting_verbs"],
    [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token~"^[a-z]+$"]?,
    [token="cases"],
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"|"including"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 130 confirmed cases of avian influenza including 31 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, disease_name=_dn, year=_year, rulid=reporting_outbreak21] =>
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"|"additional"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token=_dn]?,
    [sem=disease_type]?,
    [token=","]?,
    [token="and"|"including"|"of"]?, [token="which"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?,
    [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?
  /;
# Ex: 130 confirmed cases of avian influenza A(H7N9) virus including 31 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, disease_name=_dn, rulid=reporting_outbreak22] =>
  \ [syn=CD, token=_n],
    [token="new"|"additional"]?,
    [sem="case_classification", token=_c]?,
    [token="cases"],
    [token="of"]?,
    [token=_dn]?,
    [sem=disease_type]?,
    [orth=uppercase]?, [token="("]?, [token=__s]?, [token=")"]?,
    [sem="disease_type"]?,
    [token=","|"and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?,
    [sem="haveverb"]?, [token="been"]?, [sem="reporting_verbs"]?
  /;
# Ex: reported five new human cases of avian influenza
[syn=clause, sem=outbreak_event, type=event, number_of_cases=__n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak23] =>
  [sem="haveverb"]?, [sem="beverb"]?, [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=__n], [syn=CD, token=__n]?,
    [sem="case_classification", classification=_c]?,
    [token="new"]?,
    [token="human"]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[a-z]+$"]{0,2},
    [orth=capitalized]{0,2},
    [sem="disease_type"]?,
    [token=","]?,
    [token="and"|"including"]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 84 cases of acute flaccid paralysis and 85 deaths have been reported
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak24] =>
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [sem=disease_condition]?, [token=__o]{0,2}, [sem=disease_condition]?,
    [token="and"|"including"]?,
    [syn=CD, token=_d],
    [sem="fatal_cases", dead=_answer],
    [sem="haveverb"]?, [sem="beverb"]?, [token="today"]?,
    [sem="reporting_verbs"]
  /;
# Ex: 25 suspect cases including 15 deaths were identified
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak25] =>
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [sem=disease_condition]?, [token=__o]{0,2}, [sem=disease_condition]?,
    [token="and"|"including"]?,
    [syn=CD, token=_d],
    [sem="fatal_cases", dead=_answer],
    [sem="haveverb"]?, [sem="beverb"]?,
    [sem="reporting_verbs"]
  /;
# Ex: Twenty-five suspect cases including 15 deaths were identified
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak26] =>
  \ [syn=CD, orth=caphyphenated, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token="cases"],
    [token="of"]?,
    [token~"^[a-z]+$"]{0,3},
    [token="and"|"including"]?,
    [syn=CD, token=_d],
    [sem="fatal_cases", dead=_answer],
    [sem="beverb"],
    [token="today"]?,
    [sem="reporting_verbs"]?
  /;
# Ex: has today reported 39 new probable SARS cases and 4 new deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak27] =>
  [sem="haveverb"], [token="today"]?, [sem="reporting_verbs"]
  \ [syn=CD, token=_n],
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]?,
    [token="cases"],
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: 39 new probable SARS cases and 4 new deaths has been reported
[syn=clause, sem=outbreak_event, type=event, number_of_cases=__n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, year=_year, rulid=reporting_outbreak28] =>
  \ [token="since"|"In"|"in"]?,
    [syn=CD, orth=number, token~"19??"|"20??", token=_year]?,
    [orth=punct]?,
    [syn=CD, token=__n], [syn=CD, token=__n]?,
    [token="new"]?,
    [sem="case_classification", classification=_c]?,
    [token~"^[A-Z]+$"|"^[a-z]+$"]?,
    [token="cases"],
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer],
    [sem="haveverb"], [token="been"]?, [sem="reporting_verbs"]
  /;
# Ex: has reported an outbreak of diarrhoeal diseases of 6 691 cases, including 3 deaths
[syn=clause, sem=outbreak_event, type=event, number_of_cases="(+ (* _a 1000) _b)", case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak29] =>
  [sem="haveverb"]?, [token="today"]?, [sem="reporting_verbs"]?,
  [token="an"], [token="outbreak"], [token="of"],
  [token~"^[A-Z]+$"|"^[a-z]+$"]?,
  [sem="disease_type"]?,
  [token="of"]
  \ [syn=CD, token=_a], [syn=CD, token=_b],
    [token="cases"],
    [token=","]?,
    [token="and"|"including"],
    [syn=CD, token=_d],
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]
  /;
# Ex: Eight cases have been laboratory confirmed, of which 2 have died.
[syn=clause, sem=outbreak_event, type=event, number_of_cases=__n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak30] =>
  \ [syn=CD, token=__n], [syn=CD, token=__n]?,
    [token="new"|"additional"]?,
    [token="cases"],
    [sem="haveverb"]?, [token="been"]?,
    [sem="case_classification", token=_c],
    [token=","]?, [token="of"]?,
    [token="and"|"including"|"which"]?,
    [syn=CD, token=_d]?,
    [sem="haveverb"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: 8 laboratory-confirmed cases and 31 deaths
[syn=clause, sem=cumulative_incidence, type=event, number_of_cases=_n, case_class=_c, number_of_deaths=_d, dead_cases=_answer, rulid=reporting_outbreak31] =>
  \ [syn=CD, token=_n, sent<=3],
    [token="new"|"additional"]?,
    [sem="case_classification", token=_c]?,
    [token="cases"],
    [token="and"|","]?,
    [syn=CD, token=_d]?,
    [token="new"]?,
    [sem="fatal_cases", dead=_answer]?
  /;
# Ex: H1N1 infection
[syn=np, sem=Disease_code, key=__s, rulid=disease_code1] =>
  \ [orth="other", token=__s] /
  [sem="disease_type"];
# Ex: A(H1N1)
[syn=np, sem=Disease_code, key1=__s, rulid=disease_code2] =>
  [sem!="symptoms"]
  \ [orth=uppercase, token=__s]?,
    [token="("],
    [token=__s],
    [token=")"]
  /
  [token="disease"|"fever"|"outbreak"|"outbreaks"|"syndrome"|"influenza"|"illness"|"epidemic"|"virus"|"infection"|"infections"|"*virus"|"intoxication"];
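Note: rule disease_code2 shows the use of right context: the bracketed code (e.g. "A(H1N1)") is consumed between \ and /, while the disease word after it must be present but is left for other rules to consume. A rough Python analogue (an approximation only, not the CAFETIERE matcher) achieves the same effect with a lookahead:

    import re

    # The code is captured; the disease word after it is checked by a
    # lookahead and therefore not consumed, mirroring the / right context.
    code = re.compile(
        r"[A-Z]*\((?P<code>[^)]+)\)"
        r"(?=\s*(?:disease|fever|outbreak|outbreaks|syndrome|influenza|"
        r"illness|epidemic|virus|infection|infections|intoxication))"
    )

    m = code.search("avian influenza A(H7N9) virus")
    print(m.group(0), m.group("code"))  # A(H7N9) H7N9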
Appendix B: Gazetteer entries

JANUARY:instance=january,class=temporal/interval/month,monthno=01
FEBRUARY:instance=february,class=temporal/interval/month,monthno=02
MARCH:instance=march,class=temporal/interval/month,monthno=03
APRIL:instance=april,class=temporal/interval/month,monthno=04
MAY:instance=may,class=temporal/interval/month,monthno=05
JUNE:instance=june,class=temporal/interval/month,monthno=06
JULY:instance=july,class=temporal/interval/month,monthno=07
AUGUST:instance=august,class=temporal/interval/month,monthno=08
SEPTEMBER:instance=september,class=temporal/interval/month,monthno=09
OCTOBER:instance=october,class=temporal/interval/month,monthno=10
NOVEMBER:instance=november,class=temporal/interval/month,monthno=11
DECEMBER:instance=december,class=temporal/interval/month,monthno=12
Anthrax:instance=anthrax,class=disease
botulism:instance=botulism,class=disease
Buffalopox:instance=buffalopox,class=disease
Chikungunya:instance=chikungunya,class=disease
Cholera:instance=cholera,class=disease
Coccidioidomycosis:instance=coccidioidomycosis,class=disease
E.coli O157:instance=e.coli O157,class=disease
Hepatitis E:instance=hepatitis e,class=disease
Japanese Encephalitis:instance=japanese encephalitis,class=disease
Legionellosis:instance=legionellosis,class=disease
Leishmaniasis:instance=leishmaniasis,class=disease
Leptospirosis:instance=leptospirosis,class=disease
Listeria:instance=listeria,class=disease
Louseborne typhus:instance=louseborne typhus,class=disease
Malaria:instance=malaria,class=disease
Measles:instance=measles,class=disease
Monkeypox:instance=monkeypox,class=disease
Pertussis:instance=pertussis,class=disease
Plague:instance=plague,class=disease
Rabies:instance=rabies,class=disease
Tularemia:instance=tularemia,class=disease
flaccid:instance=flaccid,class=disease
poliovirus:instance=poliovirus,class=disease
hand, foot and mouth disease:instance=hand, foot and mouth disease,class=disease
acute:instance=acute,class=disease_condition
Acute:instance=acute,class=disease_condition
paralysis:instance=paralysis,class=disease_condition
haemorrhagic:instance=haemorrhagic,class=disease_condition
watery diarrhoeal:instance=watery diarrhoeal,class=disease_condition
wild:instance=wild,class=disease_condition
Wild:instance=wild,class=disease_condition
influenza:instance=influenza,class=disease_type
Virus:instance=virus,class=disease_type
intoxication:instance=intoxication,class=disease_type
infections:instance=infections,class=disease_type
epidemic:instance=epidemic,class=disease_type
Outbreaks:instance=outbreaks,class=disease_type
disease:instance=disease,class=disease_type
diseases:instance=diseases,class=disease_type
illness:instance=illness,class=disease_type
fever:instance=fever,class=disease_type
syndrome:instance=syndrome,class=disease_type
report:instance=report,class=reporting_verbs
reports:instance=reports,class=reporting_verbs
identified:instance=identified,class=reporting_verbs
confirmed:instance=confirmed,class=reporting_verbs
reported:instance=reported,class=reporting_verbs
occurred:instance=occurred,class=reporting_verbs
occurring:instance=occurring,class=reporting_verbs
affecting:instance=affecting,class=reporting_verbs
Reported:instance=reported,class=reporting_verbs
notified:instance=notified,class=reporting_verbs
announced:instance=announced,class=reporting_verbs
informed:instance=informed,class=reporting_verbs
fatal:instance=fatal,class=fatal_cases,dead=no
deaths:instance=deaths,class=fatal_cases,dead=yes
death:instance=death,class=fatal_cases,dead=yes
died:instance=died,class=fatal_cases,dead=yes
fatal cases:instance=fatal cases,class=fatal_cases,dead=no
province:instance=province,class=areas
Province:instance=province,class=areas
Provinces:instance=provinces,class=areas
provinces:instance=provinces,class=areas
regions:instance=regions,class=areas
region:instance=region,class=areas
Region:instance=region,class=areas
cities:instance=cities,class=areas
city:instance=city,class=areas
City:instance=city,class=areas
state:instance=state,class=areas
states:instance=states,class=areas
State:instance=state,class=areas
District:instance=district,class=areas
Districts:instance=districts,class=areas
district:instance=district,class=areas
districts:instance=districts,class=areas
male:instance=male,class=people
female:instance=female,class=people
child:instance=child,class=people
man:instance=man,class=people
woman:instance=woman,class=people
Ministry of Health:instance=ministry of health,class=health_agency
The National Health and Family Planning Commission:instance=The National Health,class=health_agency
The Ministry of Health and Population:instance=The Ministry of Health and Population,class=health_agency
MoH:instance=ministry of health,class=health_agency
Ministry of Public Health:instance=ministry of public health,class=health_agency
suspect:instance=suspect,class=case_classification,classification=suspected
confirmed:instance=confirmed,class=case_classification,classification=confirmed
suspected:instance=suspected,class=case_classification,classification=suspected
probable:instance=probable,class=case_classification,classification=probable
epidemiologically linked:instance=epidemiologically linked,class=case_classification,classification=confirmed
laboratory-confirmed:instance=laboratory confirmed,class=case_classification,classification=laboratory-confirmed
epidemiologically-linked:instance=epidemiologically-linked,class=case_classification,classification=epidemiologically-linked
laboratory confirmed:instance=laboratory confirmed,class=case_classification,classification=laboratory confirmed
laboratory testing:instance=laboratory testing,class=case_classification,classification=confirmed
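Note: every entry above follows the same line format: a surface form, a colon, then comma-separated feature=value pairs, where class carries the semantic label tested by the rules (e.g. sem="fatal_cases"). The following Python fragment is a minimal sketch of a reader for this format; it is deliberately naive and would mis-split the few instance values that themselves contain commas, such as "hand, foot and mouth disease".

    # Minimal sketch: parse one gazetteer line into a key and a feature dict.
    def parse_entry(line: str) -> tuple[str, dict]:
        key, _, feats = line.partition(":")
        features = dict(f.split("=", 1) for f in feats.split(","))
        return key.strip(), features

    key, feats = parse_entry(
        "JANUARY:instance=january,class=temporal/interval/month,monthno=01")
    print(key, feats["class"], feats["monthno"])
    # JANUARY temporal/interval/month 01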