
Analytics for Noisy Unstructured Text Data, Hyderabad, 08/01/2007

A Supervised Machine Learning Approach to Conjunction Disambiguation in Named Entities

Paweł Mazur (University of Technology, Wrocław, Poland)

[email protected]

and

Robert Dale (Macquarie University, Sydney, Australia)

[email protected]


Agenda

• Conjunction in Named Entities
• Our approach
• Experiments
• Results of the experiment
• Results analysis
• Conclusions
• Further work


Conjunction in Candidate Named Entity String

Fujitsu Australia and New Zealand
Australia and New Zealand Banking Group Limited
Peter Smith and Ann Arbor Software Council

• Candidate named entity string:
– Any sequence of words starting with initial capitals
– Containing a single conjunction, in the form of either the word and or &
• In 45 documents out of 13460, 5.7% of candidate named entity strings contained a conjunction; in some documents the frequency is as high as 23%; in MUC-7 it is 4.5%
• Many candidate named entity strings in this domain contain company names and person names
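The definition of a candidate string can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the extraction code used in the experiments: the regular expression, the name `find_candidates`, and the treatment of a capitalised word as an uppercase letter followed by letters are all hypothetical, and the sketch ignores the optional words of, a, an, the.

```python
import re

CONJUNCTIONS = {"and", "&"}
# Assumption: a maximal run of capitalised words, possibly interleaved
# with "and" / "&", approximates the span of one candidate string.
RUN = re.compile(r"\b[A-Z][A-Za-z]*(?: (?:and\b|&|[A-Z][A-Za-z]*))*")

def find_candidates(text):
    """Return capitalised word sequences containing exactly one conjunction."""
    candidates = []
    for match in RUN.finditer(text):
        tokens = match.group(0).split()
        # A run may end on a dangling conjunction ("Smith and the ..."); drop it.
        while tokens and tokens[-1] in CONJUNCTIONS:
            tokens.pop()
        # Keep only strings with a single instance of "and" or "&".
        if sum(tok in CONJUNCTIONS for tok in tokens) == 1:
            candidates.append(" ".join(tokens))
    return candidates
```

Strings with more than one conjunction are skipped entirely here, which matches the single-conjunction restriction of the definition.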


Our Approach - A Classification Task

We distinguish 4 categories of a conjunction in a candidate NE string:

– Category A: Name Internal Conjunction
Copper Mines and Metals Limited; Herbert P Cooper & Son; Ernst and Young

– Category B: Name External Conjunction
Proxy Form and Explanatory Memorandum; Hardware & Operating Systems; EchoStar and News Corporation

– Category C: Right-Copy Separator
William and Alma Ford; Connel and Bent Streets; Eastern and Western Australia

– Category D: Left-Copy Separator
Hospital Equipment & Systems; J H Blair Company Secretary & Corporate Counsel

Categories C and D could be seen as one linguistic category.

Category B (Name External) is the most common.


Our Approach - Candidate NE String Pattern

String: Australia and New Zealand Banking Group Limited
Pattern: (Loc and Loc Org CompDesig)

String: Peter Smith and Ann Arbor Software Council
Pattern: (GivenName FamilyName and GivenName FamilyName Noun Org)

Patterns are created using gazetteers and simple keyword-based heuristics.
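A gazetteer lookup of this kind might look roughly as follows. The gazetteer contents, the longest-match strategy, and the names `GAZETTEER` and `tag_pattern` are assumptions made for illustration; the real lists and heuristics are not given on the slides.

```python
# Hypothetical miniature gazetteers; the real lists are far larger.
GAZETTEER = {
    "Loc": {"Australia", "New Zealand"},
    "Org": {"Banking Group"},
    "CompDesig": {"Limited", "Ltd", "Pty Ltd"},
}
CONJUNCTIONS = {"and", "&"}
MAX_ENTRY_LEN = 2  # longest gazetteer entry above, in tokens

def tag_pattern(string):
    """Map a candidate string to a list of tags, keeping the conjunction."""
    tokens = string.split()
    tags, i = [], 0
    while i < len(tokens):
        if tokens[i] in CONJUNCTIONS:
            tags.append(tokens[i])
            i += 1
            continue
        # Assumption: prefer the longest gazetteer entry starting here.
        for length in range(min(MAX_ENTRY_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + length])
            tag = next((t for t, entries in GAZETTEER.items()
                        if phrase in entries), None)
            if tag:
                tags.append(tag)
                i += length
                break
        else:
            # Capitalised word found in no list: fall back to InitCapped.
            tags.append("InitCapped")
            i += 1
    return tags
```

With these toy lists, the first example string above yields the tags Loc, and, Loc, Org, CompDesig.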


Tag Set

Tag         Count  Share
InitCapped  925    42.24%
Loc         245    11.19%
Org         175    7.99%
FamilyName  164    7.49%
CompDesig   138    6.30%
Initial     108    4.93%
CompPos     99     4.52%
GivenName   89     4.06%
Of          76     3.47%
Abbrev      73     3.33%
PersDesig   39     1.78%
Det         31     1.42%
Dir         12     0.55%
Son         7      0.32%
Month       6      0.27%
AlphaNum    3      0.14%

PersDesig: Mr, Mrs, Ms, Miss, Dr, Prof, Sir, Madam, Messrs, and Jnr.

CompDesig: Ltd, Limited, Pty Ltd, GmbH, plc and many more, and Investments Pty Ltd, Management Pty Ltd, Corporate Pty Ltd, Associates Pty Ltd, Family Trust, Co Limited, Partners, Partners Limited, Capital Limited, and Capital Pty Ltd.

CompPos: Director, Secretary, Manager, Counsel, Managing Director, Member, Chairman, Chief Executive, Chief Executive Officer, and CEO, and also some bodies within organizations, such as Board and Committee.


Data Encoding

Each instance is encoded with 33 attributes:

• 1 binary attribute for each tag for each conjunct, signaling its presence in the string (2x16=32 attributes in total)

• 1 binary attribute ConjType encoding the lexical form of the conjunction in the string (0 for &, 1 for and)
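The encoding can be sketched as below. The order of the attributes in the vector and the function name `encode` are assumptions; only the 16 tags, the two conjuncts, and the ConjType convention (0 for &, 1 for and) come from the slides.

```python
TAGS = ["InitCapped", "Loc", "Org", "FamilyName", "CompDesig", "Initial",
        "CompPos", "GivenName", "Of", "Abbrev", "PersDesig", "Det",
        "Dir", "Son", "Month", "AlphaNum"]

def encode(pattern):
    """Encode a tag pattern (list with one 'and' or '&') as 33 binary values."""
    conj_index = next(i for i, t in enumerate(pattern) if t in ("and", "&"))
    left = set(pattern[:conj_index])        # tags in the left conjunct
    right = set(pattern[conj_index + 1:])   # tags in the right conjunct
    features = [int(tag in left) for tag in TAGS]     # attributes 1-16
    features += [int(tag in right) for tag in TAGS]   # attributes 17-32
    features.append(1 if pattern[conj_index] == "and" else 0)  # ConjType
    return features
```

Note that the encoding records only the presence of a tag in a conjunct, so the order and the number of repetitions of tags are lost.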


Corpus & Data Sets

• Corpus: 13460 text documents, from 8 to 1000 lines long
• Our corpus is a subcorpus drawn from a collection of company announcements from the Australian Stock Exchange
• Selection of candidate named entity strings: a sequence of initcapped words and a single conjunction (and or &), optionally also containing of, a, an, the
• We got a set of 10925 strings, 6437 of which are unique
• Hand elimination of wrongly identified strings due to typographic features of the documents (tables)
• Random selection of 600 examples from the unique set

Category       Examples  Share
Name Internal  185       30.8%
Name External  350       58.3%
Right-Copy     39        6.5%
Left-Copy      26        4.3%
Sum            600       100%


Machine-learned Classifiers

• Naïve Bayes
• Multilayered Perceptron
• IBk
• K*
• Random Tree
• Logistic Model Trees (LMT)
• J4.8
• SMO

Implementations in WEKA (Waikato Environment for Knowledge Analysis), University of Waikato in New Zealand
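IBk is WEKA's k-nearest-neighbour learner. To give a feel for how such a classifier treats the binary encoding, here is a minimal pure-Python sketch; it is not the WEKA implementation, and the Hamming distance, the value of k, and the toy training vectors are assumptions chosen for illustration.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    """train: list of (feature_vector, label) pairs; returns the majority
    label among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical 3-attribute toy data (real instances have 33 attributes).
toy_train = [
    ([1, 0, 1], "Name Internal"),
    ([0, 1, 1], "Name External"),
    ([0, 1, 0], "Name External"),
]
```

With this toy data, `knn_predict(toy_train, [1, 0, 0], k=1)` picks the single closest vector, while `k=3` lets the two Name External neighbours outvote it.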


Baseline

• Determined with the 0-R algorithm: always assigns the most common category (Name External) – 58.33%

• Better baseline is given by 1-R algorithm:

IF ConjForm=& THEN PredCat←Internal
IF ConjForm=and THEN PredCat←External

baseline = 69.83%
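Both baselines are simple enough to write out. The sketch below is illustrative only: the function names are hypothetical, and the label list is rebuilt from the category counts of the data set (185, 350, 39, 26) rather than from the actual examples. The 1-R figure of 69.83% cannot be recomputed here, because the slides do not give the category counts per conjunction form.

```python
from collections import Counter

def zero_r(train_labels):
    """0-R: ignore all attributes and always predict the majority category."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda conj_form: majority

def one_r(conj_form):
    """1-R rule from the slide: decide on the conjunction form alone."""
    return "Name Internal" if conj_form == "&" else "Name External"

# Labels reconstructed from the category counts of the 600 examples.
labels = (["Name Internal"] * 185 + ["Name External"] * 350
          + ["Right-Copy"] * 39 + ["Left-Copy"] * 26)
predict = zero_r(labels)
zero_r_accuracy = sum(predict(None) == y for y in labels) / len(labels)
```

Here `zero_r_accuracy` comes out at 350/600, the 58.33% quoted for the 0-R baseline.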


Results

Algorithm               Correctly classified
IBk                     504 (84.00%)
Random Tree             503 (83.83%)
K*                      501 (83.50%)
SMO - quadratic kernel  494 (82.33%)
Mult. Perceptron        493 (82.17%)
LMT                     487 (81.17%)
J4.8                    477 (79.50%)
SMO - linear kernel     468 (78.00%)
Naïve Bayes             424 (70.67%)
Baseline                419 (69.83%)


Accuracy by Conjunction Category

Category       Precision  Recall  F-Measure
Name Internal  0.814      0.876   0.844
Name External  0.872      0.897   0.885
Right-Copy     0.615      0.410   0.492
Left-Copy      0.800      0.462   0.585
weighted mean  0.834      0.840   0.833


Confusion Matrix

Name Internal  Name External  Right-Copy  Left-Copy  Classified as:
162            28             6           3          Name Internal
18             314            17          11         Name External
4              6              16          0          Right-Copy
1              2              0           12         Left-Copy
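The per-category precision, recall and F-measure can be recomputed from this matrix. The sketch assumes that each row holds the examples assigned to a category and each column the examples that truly belong to it; this reading is supported by the column totals (185, 350, 39, 26), which equal the category counts of the data set, and by the diagonal, which sums to the 504 correct classifications reported for IBk. The function names are hypothetical.

```python
LABELS = ["Name Internal", "Name External", "Right-Copy", "Left-Copy"]
# Rows: category assigned by the classifier; columns: actual category.
MATRIX = [
    [162, 28, 6, 3],
    [18, 314, 17, 11],
    [4, 6, 16, 0],
    [1, 2, 0, 12],
]

def per_class_metrics(matrix):
    """Return (precision, recall, f_measure, support) for each category."""
    metrics = []
    for i in range(len(matrix)):
        true_pos = matrix[i][i]
        predicted = sum(matrix[i])              # row total
        actual = sum(row[i] for row in matrix)  # column total
        precision = true_pos / predicted
        recall = true_pos / actual
        f_measure = 2 * precision * recall / (precision + recall)
        metrics.append((precision, recall, f_measure, actual))
    return metrics

def weighted_mean(metrics, column):
    """Mean of one metric column, weighted by the size of each category."""
    total = sum(m[3] for m in metrics)
    return sum(m[column] * m[3] for m in metrics) / total

for label, (p, r, f, n) in zip(LABELS, per_class_metrics(MATRIX)):
    print(f"{label:14s} {p:.3f} {r:.3f} {f:.3f}")
```

Rounded to three decimals, the values agree with the precision, recall and F-measure figures reported per category, including the weighted F-measure of 0.833.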


Results Analysis: Conjunction Cat. Indicators

For Name External conjunction:
- Month & X
- X & Month
- CompDesig & X
- X & PersDesig
- X & GivenName
- X & Dir
- X & Det
- Abbrev & X
- X & Abbrev

For Name Internal conjunction:
- X & Son (note: Sons of Gwalia Ltd and Gwalia Consolidated Ltd)


Error Analysis: InitCapped

38 of the 96 misclassified examples (~40%) have patterns consisting only of InitCapped tags.

In these cases the classification ended up being determined on the basis of the ConjForm attribute alone (just as the baseline was determined).

There were 134 InitCapped-only patterns in the data set; 96 of them (71.64%) were classified correctly (compared with the overall baseline result of 69.83%).

There were also 11 misclassified examples consisting mainly of InitCapped tags. Example:
Australian Labor Party and Independent Members
Loc InitCapped Org and InitCapped InitCapped
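The number 96 appears twice on this slide with two different meanings, which a short arithmetic check makes explicit. The check assumes the error analysis refers to the best classifier, with 504 of 600 examples correct; the variable names are illustrative.

```python
total, correct = 600, 504
misclassified = total - correct           # all errors of the classifier

initcapped_only = 134                     # patterns made only of InitCapped
initcapped_correct = 96                   # of those, classified correctly
initcapped_wrong = initcapped_only - initcapped_correct

share_correct = initcapped_correct / initcapped_only  # InitCapped-only accuracy
share_of_errors = initcapped_wrong / misclassified    # part of all errors
print(f"{share_correct:.2%} correct, {share_of_errors:.1%} of all errors")
```

So the 38 InitCapped-only errors are the remainder of the 134 InitCapped-only patterns, and they make up roughly 40% of the 96 errors overall.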


Error Analysis: Long Patterns

In 2 cases the misclassification was due to the long patterns of the examples:

Fellow of the Australian Institute of Geoscientists and The Australasian Institute of Mining

CompPos Of Det Loc Org Of InitCapped and Det Loc Org Of InitCapped

(Left-Copy => Name Internal)

Fellow of the Royal College of Pathologists of Australasia and Chairman of Scientific Services Limited

CompPos Of Det Org Of InitCapped Of Loc and CompPos Of InitCapped InitCapped CompDesig

(Name External => Name Internal)


Error Analysis: Other Cases

• 2 cases of extended patterns, where a pattern is built as another (common) pattern plus an additional tag:
WD & HO Wills Holdings Limited
Initial Initial & Initial Initial FamilyName CompDesig (Name Internal) vs
Initial Initial & Initial Initial FamilyName (Right-Copy)
• A conjunction of a person name and a company name, e.g. Wayne Jones and Topsfield Pty Ltd: ambiguous even for humans without contextual information
• A conjunction of two person names: in our domain there is only one case where this is of the Name External type
• There are around 20 examples where it is difficult to judge the reason for the misclassification; perhaps the reason is the model we have built
• Influence of k-fold evaluation: different classification for the same pattern in different folds


Conclusions

• Distinguished 4 categories of conjunctions in NEs

• Presented the problem as one of classification

• Experiment with machine-learned classifiers
• Results: F=0.833
• Simple tag set used
• Some examples are truly ambiguous even for humans


Further Work

• Multiple conjunctions
• Human supervised N-gram based preprocessing
• Abbreviation preprocessing
• Limit the number of InitCapped tags
• Take into account the syntactic number of tokens
• Use contextual information (e.g. syntactic number of associated verb)
• Extend the evaluation data
• Evaluation with full named entity recognition process
