Information Extraction from Myanmar Text using Conditional Random Fields

Win Thuzar Kyaw, Khin Mar Soe, and Hla Hla Htay
University of Computer Studies, Yangon, Republic of the Union of Myanmar
(e-mail: [email protected], [email protected], [email protected])

Abstract— Information Extraction (IE) is the process of deriving structured text from unstructured or semi-structured text by annotating semantic information. It is becoming increasingly important because of the explosion of unstructured electronic text on the Web, such as newswire, blogs, and email communications. IE is applied in a variety of areas, including the news, financial, and biomedical domains. Moreover, it serves as a key component of Natural Language Processing (NLP) applications, including Machine Translation, Automatic Text Summarization, and Question Answering. Because of the effects of global warming, natural disasters are occurring around the world in growing numbers. To support data analysis, comparison, and decision making at the management level, it is important to view natural disaster news in summary form. Automatic Myanmar Text Summarization saves time by helping users get control over the information flood, and it reduces the tedious job of manually extracting the main facts from text documents. In text summarization, it is beneficial to pass the templates output by template-driven Information Extraction to the summary generation module. This paper therefore presents Information Extraction based on Conditional Random Fields (CRFs), treating the IE task as a sequence labeling task, to serve as a component of an Automatic Myanmar Text Summarization System. The proposed approach was evaluated on different kinds of information and achieved an F-measure of 0.82.
Keywords— Information Extraction, Automatic Text
Summarization, Conditional Random Fields
I. INTRODUCTION
Information Extraction (IE) can be defined as locating instances of facts in unstructured or semi-structured text such as Web pages, news articles, calls for papers (CFP), e-mail, and blogs, and producing them in a structured representation such as a relational database. It can also be described as a template filling process in which a predefined set of slots in a template for a specific domain is filled with suitable values from the text and the completed template is delivered. Information Extraction is a component of other natural language processing applications, including Automatic Text Summarization, Question Answering (QA), Machine Translation, and document indexing.
Information Extraction (IE) is mainly used in Named Entity
Recognition (NER), which can stand independently as an application or serve as a component of other natural language processing applications such as Machine Translation and Question Answering. NER means the identification of proper nouns (names of organizations, persons, locations, etc.), dates, identification numbers, phone numbers, e-mail addresses, and so on.
IE is used in a variety of domains. In biomedical mining, it
is important to identify genes, proteins or other biomedical
entities automatically from a large set of scientific
publications. Similarly, intelligence analysts need to extract information about terrorist events, the people involved, the weapons used, and the targets of the events automatically from a huge volume of
text documents. Information Extraction can also assist in
advanced search technology of search engines such as entity
search, structured search and question answering. In addition,
Information Extraction is used to automatically update a
natural disaster database by extracting relevant facts from
natural disaster news reports. The intention of both Information Extraction and Text Summarization is to extract relevant facts from documents of interest to the user. However, the two differ in how the output is presented: IE delivers its output as templates or structured information (e.g., databases), whereas Text Summarization outputs a summary, either as running text or in a visualized form such as tables, charts, or graphs. Because the goal of both tasks is to retrieve relevant facts, feeding the templates obtained from template-based IE into a summary generation task that uses templates to develop summaries enhances the summarization job.
The Information Extraction process can be performed with various methodologies. The three most popular supervised machine learning approaches for the IE task are the rule learning based method, the classification model based method, and the sequence labeling based method. These methods have two main phases: training and extraction. The training stage treats the input text as a sequence of words or tokens and outputs a model that identifies the subsequences that need to be extracted. In the extraction stage, the resulting model is used to extract the data and annotate it as particular information according to predefined metadata.
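To make this two-phase structure concrete, the following minimal Python sketch (not from the paper; the interface and names are hypothetical) shows the shape such a supervised extractor takes:

from typing import List, Tuple

Sentence = List[str]  # a sequence of words/tokens
Tags = List[str]      # one tag per token, e.g. "B-Place", "I-Place", "O"

class SequenceLabeler:
    """Hypothetical interface for a supervised IE model."""

    def train(self, corpus: List[Tuple[Sentence, Tags]]) -> None:
        # Training phase: learn a model from manually annotated sentences.
        raise NotImplementedError

    def extract(self, sentence: Sentence) -> Tags:
        # Extraction phase: tag a new sentence with the learned model.
        raise NotImplementedError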
The rest of the paper is organized as follows. Section II describes related work on information extraction. Section III identifies Information Extraction approaches. Section IV presents a detailed explanation of the proposed approach. Experimental results and an evaluation of the proposed approach are presented in Section V. The paper is
concluded with summary and future work directions in Section
VI.
II. RELATED WORK
C. Zhang et al. implemented automatic keyword extraction from documents based on Conditional Random Fields, treating keyword extraction as a string labeling task [4]. They compared their performance with other machine learning methods such as support vector machines and multiple linear regression, and found that CRF performs better than those methods.
K. M. Schneider applied Conditional Random Fields, regarding the information extraction problem as a token classification task, to extract important information such as conference names, titles, dates, locations, and submission deadlines from calls for papers (CFP) about academic conferences, workshops, etc. received via email [11]. Generic token classes, domain-specific dictionaries, and layout features were used together, and the layout features were found to improve accuracy.
F. Peng and A. McCallum presented a paper that applies Conditional Random Fields (CRFs) to the task of extracting various common fields from the headers and citations of research papers [8]. They described a large collection of experimental results on two traditional benchmark data sets, obtaining dramatic improvements over previous SVM and HMM based results.
A.T. Valero et al. proposed a system using a machine
learning approach to extract online news reports and
automatically populate a natural disaster database [3].
Although their system can be easily adapted to specific domains and languages, because it uses only lexical features without depending on complex syntactic attributes, it has two drawbacks. The first is that it cannot extract information from documents that describe more than one disaster. The second is that the system is unable to group the extracted data of the same kind of natural disaster across different documents. They evaluated their system with Boolean and term weightings combined with the SVM, Naïve Bayes, and C4.5 learning algorithms, and found that the combination of Boolean weighting and the SVM algorithm achieved better accuracy than the other two supervised learning approaches.
M. Hatmi et al. proposed a French named entity recognition (NER) system, presenting a multi-level methodology based on conditional random fields (CRFs) [12]. In order to handle structured tagging, they defined three levels of annotation. The first level consists of annotating the 32 categories in a flat way. The second level deals with the annotation of components. The last level allows overlapping annotation when a category includes another category. They trained a CRF model for each level of annotation using the CRF++ toolkit, an open source implementation of CRFs, to implement the different models.
III. INFORMATION EXTRACTION APPROACHES
A. Rule Learning based Approach
This approach can be categorized into three groups:
dictionary based method, rule based method, and wrapper
induction. Traditional information extraction systems, also called pattern-based systems, construct a pattern (template) dictionary and then use that dictionary to extract the needed information from new untagged text. AutoSlog [6], AutoSlog-TS [7], and
CRYSTAL [14] are dictionary based extraction systems.
In the rule-based method, information extraction grammars are developed manually by linguistic and domain experts to recognize entities or relations. There are two main rule learning strategies in rule-based information systems: the bottom-up method, which learns rules from special cases to general ones, and the top-down method, which learns rules from general cases to special ones.
Wrapper induction is another type of rule based method. A wrapper is an extraction procedure consisting of a set of extraction rules and the program code required to apply these rules. Wrapper induction is a technique for automatically learning such wrappers: given a training data set, the induction algorithm learns a wrapper for extracting the target information. Typical wrapper systems include WIEN [13], Stalker [9], and BWI [5].
Although this approach is simple and fast to construct given skill and experience, the collection and maintenance of rules is a laborious and tedious process, and it cannot resolve ambiguity arising from the variety of forms and contexts in the source text.
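As a toy illustration of the kind of pattern such dictionaries hold, the hypothetical Python rule below pulls an earthquake magnitude out of English text; learned systems such as AutoSlog acquire far richer lexico-syntactic patterns than this:

import re

# One hand-written extraction rule: a pattern plus the slot it fills.
MAGNITUDE_RULE = re.compile(r"magnitude\s+(?P<value>\d+(?:\.\d+)?)", re.IGNORECASE)

def apply_rule(text: str) -> dict:
    m = MAGNITUDE_RULE.search(text)
    return {"Magnitude": m.group("value")} if m else {}

print(apply_rule("A magnitude 6.8 earthquake struck on Monday."))
# -> {'Magnitude': '6.8'}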
B. Classification Model based Approach
This method formalizes the IE problem as a classification problem. One of the most popular methods for classification is Support Vector Machines (SVMs). This type of IE system consists of two distinct phases: learning and extraction. In the learning phase, the system uses a set of labeled documents to generate models that can be used for future predictions. The extraction phase applies the learned models to new unlabelled documents to generate extractions. Many approaches can be used to train the classification models, for example, Support Vector Machines [15] and Maximum Entropy [1]. This approach has more generalization capability than the rule based method, and in several real-world applications it can outperform it. Its drawback is that its model is usually complex and difficult for the general user to understand (e.g., the feature definition). Thus, the performance of extraction differs from application to application.
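As a minimal sketch of this classification view (assuming scikit-learn; the features and toy data are invented), each token becomes one training instance and an SVM predicts its tag independently, with no sequence model:

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def token_features(tokens, i):
    # Simple lexical features for the token at position i.
    return {
        "word": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<BOS>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }

# Learning phase: one (features, tag) pair per token of the labeled data.
train = [(["quake", "hit", "Bago"], ["O", "O", "B-Place"])]
X = [token_features(toks, i) for toks, tags in train for i in range(len(toks))]
y = [tag for _, tags in train for tag in tags]
clf = make_pipeline(DictVectorizer(), LinearSVC())
clf.fit(X, y)

# Extraction phase: apply the learned model to an unlabeled sentence.
test = ["flood", "hit", "Bago"]
print(clf.predict([token_features(test, i) for i in range(len(test))]))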
C. Sequence Labeling based Approach
IE tasks are also considered as a sequence labeling problem, in which each word (token) in the text is annotated with a tag chosen from a predefined set of tags by using statistical sequence models such as Hidden Markov Models (HMMs) [16], Maximum Entropy Markov Models (MEMMs) [2], and Conditional Random Fields (CRFs) [10]. HMMs have been successfully applied to Information Extraction by considering IE as a sequence labeling task. An HMM models the joint probability distribution P(label sequence y, observation sequence x), and
it cannot represent overlapping or non-independent features of the observed elements. The Conditional Random Field (CRF), a discriminative model of the conditional distribution P(label sequence y | observation sequence x), has proven advantageous over the HMM. CRFs allow arbitrary, non-independent features on the observation sequence X. They solve the problem of complex dependencies among features, the main difficulty of Hidden Markov Models (HMMs), and avoid the label bias problem, which is a limitation of Maximum Entropy Markov Models (MEMMs). Statistical approaches need large amounts of training data, which are very expensive to acquire, and re-annotation of large quantities of training data is required when the specification changes. However, system expertise is not required for such changes, and domain portability is quite easy.
IV. INFORMATION EXTRACTION USING CRF
A. Introduction to CRF
A Conditional Random Field (CRF), a variant of the Markov random field, can be viewed as an undirected graphical model. It combines classification and graphical modeling for segmenting and labeling sequential data, and it has therefore been widely used in many natural language processing tasks. Because a CRF directly models a conditional probability distribution, it is able to solve the problem of complex dependencies among features, the main difficulty of Hidden Markov Models (HMMs), which define a joint probability distribution. Moreover, a CRF avoids the label bias problem, which is a limitation of Maximum Entropy Markov Models (MEMMs). Let D = (D1, D2, …, Dn) be the observation sequence and L = (L1, L2, …, Ln) be the corresponding label sequence. A linear chain CRF can be defined as follows:
P(L \mid D) = \frac{1}{Z_D} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(l_{i-1}, l_i, D, i) \Big)    (1)

in which Z_D is a normalization factor defined as

Z_D = \sum_{L} \exp\Big( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(l_{i-1}, l_i, D, i) \Big)    (2)

and f_j(l_{i-1}, l_i, D, i) is a feature function and \lambda_j is the weight for feature f_j.
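The toy Python sketch below (with invented feature functions and weights) illustrates Eqs. (1) and (2) by brute force: it scores every possible label sequence for a short observation sequence and normalizes by Z_D. Real CRF toolkits use dynamic programming instead of enumeration:

import itertools, math

LABELS = ["O", "B-Place"]
D = ["hit", "Bago"]  # toy observation sequence

def f0(prev, cur, D, i):  # observation feature: 'Bago' labeled as a place
    return 1.0 if D[i] == "Bago" and cur == "B-Place" else 0.0

def f1(prev, cur, D, i):  # transition feature: O followed by B-Place
    return 1.0 if prev == "O" and cur == "B-Place" else 0.0

FEATURES = [(f0, 2.0), (f1, 0.5)]  # pairs (f_j, lambda_j)

def score(L):
    # exp of the weighted feature sum in Eq. (1), before normalization
    s = sum(w * f(L[i - 1] if i > 0 else "<START>", L[i], D, i)
            for i in range(len(D)) for f, w in FEATURES)
    return math.exp(s)

Z = sum(score(L) for L in itertools.product(LABELS, repeat=len(D)))  # Eq. (2)
for L in itertools.product(LABELS, repeat=len(D)):
    print(L, score(L) / Z)  # conditional probability P(L | D), Eq. (1)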
B. CRF based Information Extraction Process
1) Preprocessing
Myanmar natural disaster news articles collected from official Myanmar newspapers are accepted as input. In the Myanmar language, as in other Asian languages such as Japanese, Thai, and Korean, there are no boundaries between words. Thus, word segmentation is an important preprocessing stage for most Natural Language Processing applications. Before word segmentation, syllabification is performed: the process of grouping characters into syllables, where a syllable is a unit of sound composed of a central peak of sonority (usually a vowel) and the consonants that cluster around this central peak. The syllabification phase is required because word segmentation works better with syllables than with characters. After preprocessing, segmented words are produced as output.
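The paper does not give its syllabification rules, so the sketch below is a deliberately simplified heuristic for Myanmar Unicode text, not the system's actual method: it breaks before each consonant letter unless that consonant is the lower member of a stack (preceded by U+1039) or is killed by a following asat (U+103A) and therefore closes the previous syllable. A real syllabifier needs further rules for kinzi, stacked consonants, digits, and punctuation:

import re

# Zero-width syllable boundary (requires Python 3.7+ for zero-width split):
# before a consonant (U+1000-U+1021) that is not preceded by U+1039 and
# not followed by the asat U+103A.
BOUNDARY = re.compile(r"(?<!\u1039)(?=[\u1000-\u1021](?!\u103a))")

def syllabify(text: str) -> list:
    return [s for s in BOUNDARY.split(text) if s]

print(syllabify("မြန်မာ"))  # -> ['မြန်', 'မာ']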
2) Training for Information Extraction Model
We used the CRF++ tool, a simple, customizable, open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data, treating Information Extraction as a sequence labeling task. To train a CRF model, we had to develop manually annotated data. In this annotation, each word representing information that needs to be extracted is tagged with the type of that information. For example, the sequence of Myanmar words naming the place of a natural disaster is tagged "B-Place", "I-Place", "I-Place", …, marking it with the tag 'Place'. The tags follow the IOB2 format: 'B' marks the beginning word of a piece of information, 'I' refers to an intermediate word, and 'O' marks a word that belongs to no answer tag. The tag sets for each type of information are described in Table I; a sketch of the resulting training-file format follows the table.
TABLE I
TAG SETS OF THE RESPECTIVE NATURAL DISASTERS FOR THE CRF MODEL TO EXTRACT THE REQUIRED INFORMATION

Type of Natural Disaster | Tags
Earthquake | Date, Time, Place, Magnitude, Epicenter, Latitude, Longitude, Depth, Fatalities, Injuries, H, Damage
Landslides | Date, Time, Place, Cause, Volume, Fatalities, Injuries, Missing People, Damage
Floods | Date, Time, Place, Cause, Rainfall, Fatalities, Injuries, Missing People, Damage
Volcanic Eruption | Date, Time, Place, Volcano, Fatalities, Injuries, Missing People, Damage
Forest Fire | Date, Time, Place, Size, Fatalities, Injuries, Missing People, Damage
Tornado | Date, Time, Place, Fatalities, Injuries, Missing People, Damage
Storms | Date, Time, Place, Name, Type, Rate, Fatalities, Injuries, Missing People, Damage
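CRF++ expects training data with one token per line, whitespace-separated feature columns, the answer tag in the last column, and a blank line between sentences. The sketch below writes such a file; the tokens are placeholders for segmented Myanmar words:

def write_crfpp_file(sentences, path):
    # sentences: list of sentences, each a list of (token, tag) pairs
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for token, tag in sentence:
                f.write(f"{token}\t{tag}\n")
            f.write("\n")  # blank line separates sentences in CRF++

write_crfpp_file(
    [[("<word1>", "B-Place"), ("<word2>", "I-Place"), ("<word3>", "O")]],
    "train.data",
)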
In CRF++, feature functions are produced according to predefined feature templates. The templates defined in our system are shown in Table II. In a template, the first character 'U' denotes a unigram template and 'B' denotes a bigram template. The macro %x[row,col] specifies a token in the input data: row gives the position relative to the current token and col gives the absolute position of the column. Here, the current word, its neighboring words, and combinations of the current word with its neighbors are used as features; a sketch of how these macros expand follows the table.
TABLE II
FEATURE TEMPLATE
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,0]/%x[-1,0]/%x[0,0]
U06:%x[-1,0]/%x[0,0]/%x[1,0]
U07:%x[0,0]/%x[1,0]/%x[2,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
# Bigram
B
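As an illustration of how CRF++ resolves these macros internally (a sketch, not the toolkit's code), the function below expands the %x[row,col] references of a unigram template against a token sequence:

def expand(template_id, offsets, tokens, i, col=0):
    # Resolve each %x[row,col] relative to the current position i;
    # out-of-range positions get sentinel values such as _B-1 / _B+1.
    parts = []
    for row in offsets:
        j = i + row
        parts.append(tokens[j][col] if 0 <= j < len(tokens) else f"_B{row:+d}")
    return template_id + ":" + "/".join(parts)

tokens = [("w0",), ("w1",), ("w2",), ("w3",)]
print(expand("U05", [-2, -1, 0], tokens, 2))  # -> U05:w0/w1/w2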
After the training phase, the CRF model for information extraction is ready to be used in the testing phase.
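CRF++ itself is driven from the command line with the crf_learn and crf_test programs; a minimal sketch of the two stages (the file names are assumptions) is:

import subprocess

# Training: learn a model from the feature template and annotated data.
subprocess.run(["crf_learn", "template", "train.data", "model"], check=True)

# Testing: tag new, preprocessed articles with the learned model.
subprocess.run(["crf_test", "-m", "model", "test.data"], check=True)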
3) Information Extraction as a Template
When new natural disaster news articles are received, they are preprocessed and features are extracted. After that, the predicted answer tags are produced using the CRF model obtained from the training stage. The information extracted from a sample sentence can be depicted as a template like the one in Table III; a sketch of this conversion step follows the table.
TABLE III
TEMPLATE OF EARTHQUAKE NEWS
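A sketch of this template-filling step (tokens are placeholders; slot values are concatenated without spaces, since segmented Myanmar words carry none): consecutive B-/I- tokens of the same type are merged into one slot value:

def tags_to_template(tokens, tags):
    template, cur_type, cur_span = {}, None, []

    def flush():
        # close the currently open slot, if any
        if cur_type:
            template.setdefault(cur_type, []).append("".join(cur_span))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            cur_type, cur_span = tag[2:], [token]
        elif tag.startswith("I-") and cur_type == tag[2:]:
            cur_span.append(token)
        else:
            flush()
            cur_type, cur_span = None, []
    flush()
    return template

print(tags_to_template(["<w1>", "<w2>", "<w3>"], ["B-Place", "I-Place", "O"]))
# -> {'Place': ['<w1><w2>']}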
V. EXPERIMENTAL RESULTS
A. Natural Disaster News Corpus
We collected around 600 Myanmar news articles about seven types of natural disasters, including earthquakes, floods, landslides, volcanic eruptions, tornadoes, hurricanes, and wildfires, from official Myanmar newspapers. Since CRF is a supervised machine learning approach, the collected news had to be tagged manually.
B. Evaluation Measures
The results of CRF-based Information Extraction are evaluated with the standard measures used in Information Retrieval evaluation. Precision, Recall, and F-Measure are defined as follows:
\text{Precision} = \frac{TP}{TP + FP}    (3)

\text{Recall} = \frac{TP}{TP + FN}    (4)

\text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (5)

where TP is the number of correctly extracted items, FP the number of extracted items that are incorrect, and FN the number of items that should have been extracted but were missed.
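A small sketch of Eqs. (3)-(5) computed from gold and predicted slot values (toy data; the exact counting policy used in the paper is not specified, so exact set matching is an assumption):

def prf(gold: set, predicted: set):
    tp = len(gold & predicted)                      # correctly extracted items
    p = tp / len(predicted) if predicted else 0.0   # Eq. (3)
    r = tp / len(gold) if gold else 0.0             # Eq. (4)
    f = 2 * p * r / (p + r) if p + r else 0.0       # Eq. (5)
    return p, r, f

print(prf({"6.8", "Bago"}, {"6.8", "Yangon"}))  # -> (0.5, 0.5, 0.5)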
C. Experimental Results and Discussion
TABLE IV
EVALUATION RESULTS FOR DIFFERENT TYPES OF INFORMATION

Type of Information | Precision | Recall | F-Measure
Date of Disaster | 0.96 | 0.96 | 0.96
Time of Disaster | 0.93 | 1.0 | 0.96
Place of Disaster | 0.92 | 0.98 | 0.95
Fatalities | 0.95 | 0.98 | 0.96
Injuries | 0.44 | 0.47 | 0.46
Missing People | 0.46 | 0.67 | 0.54
Physical Damage | 0.82 | 0.93 | 0.87
Magnitude (Earthquake) | 0.95 | 0.95 | 0.95
Epicenter (Earthquake) | 0.75 | 0.86 | 0.80
Latitude (Earthquake) | 0.93 | 1.0 | 0.97
Longitude (Earthquake) | 0.93 | 1.0 | 0.97
Depth (Earthquake) | 0.8 | 0.5 | 0.62
Cause (Landslide) | 0.87 | 1.0 | 0.93
Name of Volcano | 0.8 | 0.8 | 0.8
Size (Forest Fire) | 0.6 | 0.75 | 0.67
Type (Storms) | 0.92 | 1.0 | 0.96
Rate (Storms) | 0.71 | 0.83 | 0.77
Name of Storms | 0.64 | 0.69 | 0.67
Average | 0.80 | 0.86 | 0.82
Table IV shows the evaluation results for the different types of extracted information. As the table shows, there is a variety of categories, and some categories are very similar to one another; for these similar kinds of information, the performance of the approach is poor. For example, the values of the attributes fatalities, injuries, and missing people are all cardinalities. Thus, although fatalities are extracted well, the precision for injuries and missing people is poor. This is because fatalities occur much more often in the training set than the other two. To improve this performance, it will be necessary to collect a larger training corpus. It is interesting to notice that for all categories the recall rates are better than the precision scores. This fact
indicates that our system can extract most of the relevant information from the natural disaster news, but that it also extracts several irrelevant items.
VI. CONCLUSION AND FUTURE WORK
The use of the templates produced by template-driven Information Extraction in the summary generation module enhances an automatic text summarization system. This paper proposed Information Extraction from Myanmar text using Conditional Random Fields, a supervised machine learning approach, by considering information extraction as a sequence labeling task. The proposed approach was applied to natural disaster news collected from official Myanmar newspapers. About 600 news articles covering seven disaster types were used as training data, and over 80 documents covering the seven disaster types were used for testing. The experimental results are reported by type of extracted information, with an overall F-measure of 0.82. Since the approach is based on lexical features without complex syntactic attributes, it is easily adaptable to specific domains and languages. To achieve higher performance, it will be necessary to expand the training set and to add other features such as part-of-speech tags.
REFERENCES
[1] A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra, "A maximum entropy approach to natural language processing," Computational Linguistics, vol. 22, no. 1, pp. 39-71, 1996.
[2] A. McCallum, D. Freitag, and F. Pereira, "Maximum entropy Markov models for information extraction and segmentation," in Proceedings of the 17th International Conference on Machine Learning (ICML'00), 2000, pp. 591-598.
[3] A. T. Valero, M. M. Gomez, and L. V. Pineda, "Using machine learning for extracting information from natural disaster news reports," Computación y Sistemas, vol. 13, no. 1, pp. 33-44, 2009, ISSN 1405-5546.
[4] C. Zhang, H. Wang, Y. Liu, D. Wu, Y. Liao, and B. Wang, "Automatic keyword extraction from documents using conditional random fields," Journal of Computational Information Systems, vol. 4, no. 3, pp. 1169-1180, 2008.
[5] D. Freitag and N. Kushmerick, "Boosted wrapper induction," in Proceedings of the 17th National Conference on Artificial Intelligence, 2000, pp. 577-58.
[6] E. Riloff, "Automatically constructing a dictionary for information extraction tasks," in Proceedings of the 11th National Conference on Artificial Intelligence, 1993, pp. 811-816.
[7] E. Riloff, "Automatically generating extraction patterns from untagged text," in Proceedings of the 13th National Conference on Artificial Intelligence, 1996, pp. 1044-1049.
[8] F. Peng and A. McCallum, "Accurate information extraction from research papers using conditional random fields," in Proceedings of HLT-NAACL, 2004, pp. 329-336.
[9] I. Muslea, S. Minton, and C. Knoblock, "STALKER: Learning extraction rules for semi-structured, web-based information sources," in Proceedings of the AAAI Workshop on AI and Information Integration, 1998, pp. 74-81.
[10] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proceedings of the 18th International Conference on Machine Learning (ICML'01), 2001, pp. 282-289.
[11] K. M. Schneider, "Information extraction from calls for papers with conditional random fields and layout features," Artificial Intelligence Review, vol. 25, no. 1-2, pp. 67-77, Apr. 2006.
[12] M. Hatmi, C. Jacquin, E. Morin, and S. Meignier, "Named entity recognition in speech transcripts following an extended taxonomy," in Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM), Marseille, France, Aug. 2013.
[13] N. Kushmerick, D. S. Weld, and R. Doorenbos, "Wrapper induction for information extraction," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI'97), 1997, pp. 729-737.
[14] S. Soderland, D. Fisher, J. Aseltine, and W. Lehnert, "CRYSTAL: Inducing a conceptual dictionary," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI'95), 1995, pp. 1314-1319.
[15] V. Vapnik, Statistical Learning Theory. New York: Springer-Verlag, 1998.
[16] Z. Ghahramani and M. I. Jordan, "Factorial hidden Markov models," Machine Learning, vol. 29, pp. 245-273, 1997.