TRANSCRIPT
© 2010 IBM Corporation
SystemT: an Algebraic Approach to Declarative Information Extraction
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li,
Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan
IBM Research - Almaden
Information Extraction (IE)
■ Distill structured data from unstructured and semi-structured text
■ Exploit the extracted data in your applications

Example (from Cohen's IE tutorial, 2003):

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted annotations:
Name             | Title   | Organization
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | Founder | Free Soft..
Information Extraction in Enterprise Applications
■ Information extraction is essential for many emerging enterprise applications
– Semantic search, compliance, BI over text, …
■ Requirements are driven by the enterprise apps that rely on IE for practical success
– Accuracy
• Garbage-in garbage-out: Usefulness of application is often tied to quality of extraction
– Scalability
• Large data volumes, often orders of magnitude larger than classical NLP corpora
• Many applications (e.g. email) require on-the-fly information extraction
– Flexible Runtime Model
• Heterogeneous runtime environments with different resource constraints, from laptop applications to distributed environments
– Transparency
• Customer complaints need to be addressed ASAP without compromising the overall experience
– Usability
• Building an accurate IE system is labor-intensive
• Critical for establishing an ecosystem of users
A Brief History of IE in the NLP Community

Rule-Based:
■ 1978-1997: MUC (Message Understanding Conference) – DARPA competition 1987 to 1997
– FRUMP [DeJong82]
– FASTUS [Appelt93]
– TextPro, PROTEUS
■ 1998: Common Pattern Specification Language (CPSL) standard [Appelt98]
– Standard for subsequent rule-based systems
■ 1999-present: Commercial products, GATE

Machine Learning:
■ At first: Simple techniques like Naive Bayes
■ 1990s: Learning rules
– AUTOSLOG [Riloff93]
– CRYSTAL [Soderland98]
– SRV [Freitag98]
■ 2000s: More specialized models
– Maximum Entropy Models [Berger96]
– Hidden Markov Models [Leek97]
– Maximum Entropy Markov Models [McCallum00]
– Conditional Random Fields [Lafferty01]
– Automatic feature expansion
Evolution of the SystemT Project

Timeline (2004-2010):
■ Custom Code
■ Grammar (CPSL-style cascading grammar system)
■ Grammar++ (Grammar + extensions outside the scope of grammars)
■ SystemT (algebraic information extraction system)

Evolutionary triggers between stages: diverse data sets and complex extraction tasks; performance and expressivity; large number of annotators.
Outline
■ Challenges in Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Finite-state Grammars
■ Common formalism underlying most rule-based IE systems
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over the lexical features of these tokens
■ Several levels of processing → Cascading Grammars
– Typically, at higher levels of the grammar, larger segments of text are analyzed and annotated
■ Common Pattern Specification Language (CPSL)
– A common language to specify and represent finite-state transducers
– Each transducer accepts a sequence of annotations and outputs a sequence of annotations
Example: Simplified Person Annotator

Example document (excerpt): "… Tomorrow, we will meet Mark Scott, Howard Smith and …"
(the slide shows the same document after each stage: tokenization, Level 1, and Level 2)

Tokenization (preprocessing step)

Level 1:
⟨Gazetteer⟩[type = LastGaz] → ⟨Last⟩
⟨Gazetteer⟩[type = FirstGaz] → ⟨First⟩
⟨Token⟩[~ "[A-Z]\w+"] → ⟨Caps⟩

Level 2:
⟨First⟩ ⟨Last⟩ → ⟨Person⟩
⟨First⟩ ⟨Caps⟩ → ⟨Person⟩
⟨First⟩ → ⟨Person⟩

Notes: rule priority is used to prefer First over Caps, and First is preferred over Last since it was declared earlier. The rigid rule priority in Level 1 causes partial results.
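The partial-results problem can be sketched in a few lines of Python. This is a toy model, not real CPSL: the function names (`level1`, `apply_level2`) and the single-label-per-token simplification are assumptions for illustration.

```python
# Toy sketch of why rigid rule priority yields partial results.

def level1(tokens, first_gaz, last_gaz):
    """Rigid priority: each token receives exactly ONE label, from the first
    matching rule, since the grammar keeps a single annotation sequence."""
    labels = []
    for tok in tokens:
        if tok in first_gaz:
            labels.append('First')      # First wins over Last by priority
        elif tok in last_gaz:
            labels.append('Last')
        elif tok[:1].isupper():
            labels.append('Caps')
        else:
            labels.append('Token')
    return labels

def apply_level2(labels):
    """Level 2 patterns tried in priority order; first match consumes input."""
    rules = [(['First', 'Last'], 'Person'),
             (['First', 'Caps'], 'Person'),
             (['First'], 'Person')]
    out, i = [], 0
    while i < len(labels):
        for pattern, tag in rules:
            if labels[i:i + len(pattern)] == pattern:
                out.append((tag, i, i + len(pattern)))
                i += len(pattern)
                break
        else:
            i += 1
    return out

# "Scott" is in BOTH gazetteers; Level 1's rigid priority labels it First,
# so <First><Last> never matches and "Mark Scott" splits into two Persons.
labels = level1(["meet", "Mark", "Scott"], {"Mark", "Scott"}, {"Scott"})
people = apply_level2(labels)
```

Running this produces two single-token Person matches ("Mark" and "Scott") instead of one "Mark Scott", which is exactly the partial result the slide warns about.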
A More Complex Example: Signature

Extraction Task: Identify Signatures

Example signature:
Laura Haas, PhD
Distinguished Engineer and Director, Computer Science
Almaden Research Center
408-927-1700
http://www.almaden.ibm.com/cs
(annotated with Person, Organization, Phone, URL)

Constraints:
– Start with Person
– At least 1 Phone
– At least 2 of {Phone, Organization, URL, Email, Address}
– End with one of these
– Within 250 characters

Difficult to express using grammars (counting and aggregation not supported)
Limitations of Classical Grammar-based Extraction
■ Expressivity problems
– Rigid Matching Priority
• Leads to mistakes when two rules of the same priority match the same region of text
– Lossy Sequencing
• If input annotations to a grammar phase overlap, the CPSL engine must drop some of them → may lead to mistakes
– Limited expressivity in Rule Patterns
• Cannot express aggregation operations or span overlap conditions
■ Performance problems
– Complete pass through the tokens for each rule
– Many of these passes are wasted work
CPSL Extensions to Address Certain Limitations
■ Lossy Sequencing
– Grammar rules operate on graphs of annotations
■ Rigid Matching Priority
– Additional matching regimes introduced
■ Limited Expressivity
– Rule patterns expanded to allow more expressivity
■ Performance
– Faster finite state machines

Popular CPSL extensions: JAPE, AFst, XTDL
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Algebraic IE System from IBM Research - Almaden
■ Approaching IE differently
– Identify the most basic operations
– Create an operator for each basic operation
– Compose operators to build complex annotators
■ Benefits:
– Richer, cleaner rule semantics
– Better performance through optimization
SystemT Overview

Input Document Stream → AQL → SystemT Optimizer → Compiled Operator Graph → SystemT Runtime → Annotated Document Stream

– AQL: rule language with familiar SQL-like syntax; specifies annotator semantics declaratively
– SystemT Optimizer: chooses an efficient execution plan that implements the semantics
– SystemT Runtime: highly scalable, embeddable Java runtime
AQL: SystemT's Rule Language
■ Declarative language for defining annotators
– Compiles into SystemT's algebra
■ Main features
– Separates semantics from implementation
– Familiar syntax
– Full expressive power of the algebra
AQL by Example

create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);

(matches a ⟨First⟩ annotation followed within 0 tokens by a ⟨Caps⟩ annotation)
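To make the semantics of this view concrete, here is a minimal Python model of it. Spans are (begin, end) character offsets; the names `follows_tok` and `combine_spans` mirror AQL's built-ins, but the implementations are simplified sketches (the token-distance check just counts whitespace-separated words in the gap).

```python
# A minimal Python model of the FirstCaps view's semantics.

def follows_tok(text, a, b, min_tok, max_tok):
    """True if span b starts after span a ends, with between min_tok and
    max_tok tokens in the gap (simplified: whitespace-separated words)."""
    if b[0] < a[1]:
        return False
    gap_tokens = len(text[a[1]:b[0]].split())
    return min_tok <= gap_tokens <= max_tok

def combine_spans(a, b):
    """Smallest span covering both input spans (like AQL's CombineSpans)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def first_caps(text, firsts, capses):
    """The select/from/where above as a Cartesian product + join predicate."""
    return [combine_spans(f, c)
            for f in firsts for c in capses
            if follows_tok(text, f, c, 0, 0)]
```

For example, with a First span over "Mark" and a Caps span over "Scott" in "we will meet Mark Scott today", the view yields a single span covering "Mark Scott".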
Regular Expression Extraction Operator

Example: the Regex operator with pattern [A-Z][a-z]+, applied to an input tuple holding the document "… we will meet Mark Scott and …", emits one output tuple per match: one with a span over "Mark" and one with a span over "Scott".
Some Example Operators
■ Regex
– Find all matches of a character-based regular expression
■ Dictionary
– Find all matches of an exhaustive dictionary of terms
■ Join
– Find pairs of sub-annotations that match a predicate
■ Block
– Identify contiguous blocks of lower-level matches
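As one illustration, the Dictionary operator can be sketched as follows. This is a simplified model, not SystemT's implementation: the real operator is a single-pass hash-based matcher, while this sketch uses a regex alternation for brevity, and the name `dictionary_op` is hypothetical.

```python
import re

def dictionary_op(text, terms, label):
    """Sketch of a Dictionary operator: find all whole-word matches of an
    exhaustive term list, emitting (begin, end, label) spans."""
    # longest-first alternation so longer terms win over their prefixes
    alts = '|'.join(map(re.escape, sorted(terms, key=len, reverse=True)))
    return [(m.start(), m.end(), label)
            for m in re.finditer(r'\b(?:' + alts + r')\b', text)]
```

Applied to "we will meet Mark Scott" with a first-name dictionary {"Mark", "Howard"}, it emits one span over "Mark".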
Revisiting Person Example

Input: "… Tomorrow, we will meet Mark Scott, Howard Smith …"

– Dictionary⟨First⟩, Dictionary⟨Last⟩, and Regex⟨Caps⟩ extract first-name, last-name, and capitalized-word matches
– Join⟨First, Last⟩ and Join⟨First, Caps⟩ produce candidate persons such as "Mark Scott" and "Howard Smith"
– Union combines the candidates from both joins; the input may contain overlapping annotations (no Lossy Sequencing problem), and the output may contain overlapping annotations (no Rigid Matching Regimes)
– Consolidate, an explicit operator for resolving ambiguity, produces the final results: "Mark Scott", "Howard Smith"

The rich set of algebraic operators supported by SystemT (Regex, Join, Union, Consolidate) alleviates the expressivity problems of CPSL.
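The explicit overlap resolution in the last step can be sketched as follows. This is a hedged sketch of one consolidation policy (containment); SystemT's Consolidate supports several policies, and the function name here is illustrative.

```python
def consolidate(spans):
    """Sketch of a Consolidate operator with a 'containment' policy: drop
    every span strictly contained in another span, resolving overlap
    explicitly instead of through a rigid matching regime."""
    survivors = []
    for s in spans:
        contained = any(o != s and o[0] <= s[0] and s[1] <= o[1]
                        for o in spans)
        if not contained:
            survivors.append(s)
    return sorted(set(survivors))
```

For instance, given the candidate spans "Mark Scott" (13, 23), "Mark" (13, 17), "Scott" (18, 23), and "Howard Smith" (24, 36), the two contained spans are dropped and the longer person matches survive.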
A More Complex Example: Signature

Execution plan sketch:
– Person, Organization, Phone, and URL annotations feed a Union of "contact info" annotations
– A Block operator finds blocks of two or more "contact info" patterns
– A final Join with Person, whose join predicates enforce the additional constraints, produces the Signature annotation
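The Block step above can be sketched in Python. This is an assumed, simplified grouping rule (spans whose gaps stay within a character budget), with the 250-character limit from the Signature constraints used as the default; the name `block_op` is hypothetical.

```python
def block_op(spans, min_count=2, max_sep=250):
    """Sketch of a Block operator: group sorted (begin, end) annotations into
    maximal runs where each span begins within max_sep characters of the end
    of the previous one; keep runs with at least min_count members."""
    out, run = [], []
    for s in sorted(spans):
        if run and s[0] - run[-1][1] > max_sep:
            if len(run) >= min_count:
                out.append((run[0][0], run[-1][1]))
            run = []
        run.append(s)
    if len(run) >= min_count:
        out.append((run[0][0], run[-1][1]))
    return out
```

Three contact-info spans clustered at the top of a document form one block; an isolated span far below is discarded for failing the two-or-more requirement.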
Addressing Issues of Grammar-based Systems

■ Grammar
– Rigid matching regimes
– Lossy sequencing
– Impossible to support native aggregation

■ SystemT
– No fixed sequencing: retain all matches and consolidate overlapping matches explicitly
– Retain all matches and discard ambiguous matches explicitly
– Rich set of operators

Theorem: The class of extraction tasks expressible as AQL queries is a strict superset of that expressible through expanded code-free CPSL grammars.
Comparison with Grammar-based Annotators

Dataset    Entity Type    System    Precision (Exact/Partial)   Recall (Exact/Partial)   F-measure (Exact/Partial)
ACE 2005   Location       SystemT   93.04 / 93.85               71.10 / 71.72            80.60 / 81.30
                          ANNIE     88.40 / 90.43               63.72 / 65.18            74.06 / 75.76
ACE 2005   Organization   SystemT   82.17 / 86.25               47.97 / 50.35            60.58 / 63.58
                          ANNIE     47.92 / 56.06               38.07 / 44.54            42.43 / 49.64
ACE 2005   Person         SystemT   89.96 / 91.94               68.28 / 69.78            77.63 / 79.34
                          ANNIE     39.68 / 76.69               30.89 / 59.72            34.74 / 67.15
Enron      Person         SystemT   85.17 / 90.32               68.69 / 72.84            76.05 / 80.65
                          ANNIE     57.05 / 76.84               48.59 / 65.46            52.48 / 70.69

Extraction Task: Named-entity extraction
Systems compared: SystemT (out-of-box) vs. ANNIE (uses JAPE in GATE)
Comparison with State-of-the-art Results

Dataset      Entity Type    System    Precision   Recall   F-measure
CoNLL 2003   Person         Florian   92.49       95.24    93.85
                            SystemT   96.32       92.39    94.32
CoNLL 2003   Organization   Florian   85.93       83.44    84.67
                            SystemT   92.25       85.31    88.65
CoNLL 2003   Location       Florian   90.59       91.73    91.15
                            SystemT   93.11       91.61    92.35
Enron        Person         Minkov    81.1        74.9     77.9
                            SystemT   87.27       81.82    84.46

Extraction Task: Named-entity extraction
Systems compared: SystemT (customized) vs. [Florian et al.'03] [Minkov et al.'05]
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
SystemT in IBM Products and Research Projects

Research Projects (Transferred):
– SIMPLE (Almaden): patent search
– Midas Finance (Almaden): compliance
– Omnifind Personal Email Search (Almaden): email search
– EISM (Watson): security
– SystemS (Watson): customer engagement
– Gumshoe (Almaden): W3 search (live in Q2 2011)

IBM Products:
– IBM Content Analyzer: business insights
– InfoSphere MashupHub: mashups
– eDiscovery Analyzer: compliance/search
– Optim Integrated Data Management: data redaction
– Lotus Notes, Expeditor, Symphony: LiveText client mashups
SystemT Move to SWG
■ Will be run by the Business Analytics side of SWG
■ All existing and new products will be supported by SWG
SystemT Annotator Library
■ Named Entities
– Complex entities: Person, Location (including Address), Organization
– "Simple" entities: Phone Number, Email Address, URL, Time, Date
– ConferenceCall, Signature, Agenda, ForwardBlock, DrivingDirection, …
– Relationships: Person → Phone / Address / EmailAddress
■ Blogs: Sentiment, InformalReview
■ Financial
– Merger/acquisition, Joint Venture, Earnings announcement, Analyst earnings estimate
– Directors and officers, Corporate Actions, Appointment / Resignation of Officers / Directors, Institutional Loan Agreements, Beneficial Ownership, Subsidiaries, …
■ Web pages: Homepage, Geography, Category, …
■ Healthcare: Disease, Drug, ChemicalCompound
■ Many more…
Multilingual Support
■ Via integration with LanguageWare
– Tokenization, POS tagging
■ Increasing language coverage for annotators
– Major Western languages (SWG): DE, EN, ES, FR, IT, NL, PT
• Named Entities + Sentiment analysis
– Chinese (Almaden + CRL)
• Named Entities + Sentiment analysis (ongoing)
– Japanese (SWG)
• Named Entities
– Extension of Named Entity annotators for Indian names in English text (IRL)
Beyond Annotators
■ Syntactic Coreference Resolution (Almaden)
– Requested by multiple teams
– Recently delivered to eDiscovery Analyzer
• Only name variants have been included in eDA 2.2
• Pronoun resolution will be included in the next version
– Resulted in new primitives that are currently being considered as native operations in SystemT
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Performance
■ Performance issues with grammars
– Complete pass through the tokens for each grammar level
– Many of these passes are wasted work
■ Dominant approach: Make each pass go faster
– Doesn't solve the root problem!
■ Algebraic approach: Build a query optimizer!
Optimizations
■ Query optimization is a familiar topic in databases
■ What's different in text?
– Operations over sequences and spans
– Document boundaries
– Costs concentrated in extraction operators (dictionary, regular expression)
■ Can leverage these characteristics
– Text-specific optimizations
– Significant performance improvements
SystemT Block Diagram

Input Document Stream → AQL → SystemT Optimizer → Compiled Operator Graph → SystemT Runtime → Annotated Document Stream
The SystemT Optimizer
■ AQL allows the development of highly complex rulesets
– Example: the annotator stack for InfoSphere MashupHub consists of more than 950 AQL statements
■ Execution plan optimization is crucial to achieving acceptable throughput for enterprise apps
– A naïve execution plan (direct translation from AQL to algebra) would be painfully slow
■ Constant feedback loop:
1. The SystemT optimizer improves…
2. …enabling more complex rules…
3. …which improve precision and recall…
4. …and drive the need for more optimization
Example: Shared Dictionary Matching (SDM)
■ Rewrite-based optimization
– Applied to the algebraic plan during postprocessing
■ Evaluate multiple dictionaries in a single pass
– Separate Dictionary operators over D1 and D2 feeding a subplan are rewritten into a single SDM Dictionary operator
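The core idea of SDM can be sketched as follows. This is a hedged sketch, not SystemT's implementation: it merges all dictionaries into one hash table keyed by term, so every dictionary is evaluated with a single pass over the token stream instead of one pass per dictionary. The function name and the token-index output format are assumptions.

```python
def shared_dictionary_matching(tokens, dictionaries):
    """Sketch of Shared Dictionary Matching: merge all dictionaries into one
    hash table (term -> list of dictionary names) and evaluate every
    dictionary in a single pass over the tokens."""
    merged = {}
    for name, terms in dictionaries.items():
        for term in terms:
            merged.setdefault(term.lower(), []).append(name)
    matches = {name: [] for name in dictionaries}
    for i, tok in enumerate(tokens):            # ONE pass, regardless of
        for name in merged.get(tok.lower(), ()):  # how many dictionaries
            matches[name].append(i)             # token index of each match
    return matches
```

The per-token cost is one hash lookup, so adding dictionaries no longer adds passes over the document.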
Example: Restricted Span Evaluation (RSE)
■ Leverage the sequential nature of text
– Join predicates on character or token distance
■ Only evaluate the inner operand on the relevant portions of the document
– e.g., join Dictionary⟨First⟩ with Regex⟨Caps⟩ via an RSEJoin: only look for regex matches in the vicinity of a first name ("…we will meet Mark Scott…")
■ Limited applicability
– Need to guarantee exactly the same results
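RSE can be sketched as follows. This is an illustrative simplification, not SystemT's RSEJoin: the window size, the whitespace-only adjacency test, and the name `rse_join` are assumptions for the sketch.

```python
import re

CAPS = re.compile(r'[A-Z][a-z]+')

def rse_join(text, first_spans, window=20):
    """Sketch of Restricted Span Evaluation: rather than running the Caps
    regex over the whole document, evaluate it only in a small window to the
    right of each <First> span, since the join predicate bounds where a
    useful match can start."""
    people = []
    for f in first_spans:
        region = text[f[1]:f[1] + window]       # restricted evaluation region
        for m in CAPS.finditer(region):
            caps = (f[1] + m.start(), f[1] + m.end())
            if text[f[1]:caps[0]].strip() == '':  # zero tokens in between
                people.append((f[0], caps[1]))
    return people
```

With a First span over "Mark" in "we will meet Mark Scott soon", the regex is evaluated over a 20-character window instead of the full document, yet yields the same "Mark Scott" result.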
Some Example Execution Plans

Goal: find First followed within 0 tokens by Caps.
– Plan A (naïve): evaluate First and Caps independently, then Join
– Plan B: evaluate First, extract the text to the right, and identify Caps starting within 0 tokens
– Plan C: evaluate Caps, extract the text to the left, and identify First ending within 0 tokens
Performance Comparison (with ANNIE)

[Chart: throughput (KB/sec, 0-700) vs. average document size (KB, 0-100) for SystemT vs. ANNIE, an open source entity tagger; SystemT is >10x faster]

Task: Named Entity
Dataset: Different document collections from the Enron corpus, obtained by randomly sampling 1000 documents for each size
Performance Comparison on Larger Documents

Dataset              Document Size                  Throughput (KB/sec)   Average Memory (MB)
                     Range              Average     ANNIE     SystemT     ANNIE     SystemT
Web Crawl            68 B – 388 KB      8.8 KB      42.8      498.8       201.8     77.2
Medium SEC Filings   240 KB – 0.9 MB    401 KB      26.3      703.5       601.8     143.7
Large SEC Filings    1 MB – 3.4 MB      1.54 MB     21.1      954.5       2683.5    189.6

Datasets: Web crawl and filings from the Securities and Exchange Commission (SEC)

Throughput benefits carry over for a wide variety of document sizes; SystemT also has a much lower memory footprint.

Theorem: For any acyclic token-based FST T, there exists an operator graph G such that evaluating T and G has the same computational complexity.
SystemT Block Diagram

Input Document Stream → AQL → SystemT Optimizer → Compiled Operator Graph → SystemT Runtime → Annotated Document Stream
SystemT Runtime Environment
■ Compact Java-based runtime
– Small memory footprint
– High performance
■ Designed to be embedded in a larger system
– Lotus Notes
– UIMA/GATE
– Hadoop
– Custom integrations
Scaling SystemT: From Laptop to Cluster

In Lotus Notes Live Text: the SystemT Runtime is embedded directly in the Lotus Notes client, annotating email documents for message display.

In Cognos Consumer Insights: many copies of the SystemT Runtime run on a Hadoop cluster. Each copy is wrapped as a Jaql function (with input and output adapters) and driven by the Jaql runtime on top of Hadoop Map-Reduce.
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Rule Development for Information Extraction

Development cycle: Develop → Test → Analyze
Maintenance cycle: Deploy → Refine → Test

Very labor-intensive and time-consuming
---------------------------------------
create view ValidLastNameAll as
select N.lastname as lastname
from LastNameAll N
-- do not allow partially all capitalized words
where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, N.lastname))
and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*-(\p{Lu}\p{M}*)+/, N.lastname));

create view LastName as
select C.lastname as lastname
--from Consolidate(ValidLastNameAll.lastname) C;
from ValidLastNameAll C
consolidate on C.lastname;
-- Find dictionary matches for all first names
-- Mostly US first names
create view StrictFirstName1 as
select D.match as firstname
from Dictionary('strictFirst.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- German first names
create view StrictFirstName2 as
select D.match as firstname
from Dictionary('strictFirst_german.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- nick names for US first names
create view StrictFirstName3 as
select D.match as firstname
from Dictionary('strictNickName.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- German first names from BluePages
create view StrictFirstName4 as
select D.match as firstname
from Dictionary('strictFirst_german_bluePages.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Italy first names from BluePages
create view StrictFirstName5 as
select D.match as firstname
from Dictionary('names/strictFirst_italy.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- France first names from BluePages
create view StrictFirstName6 as
select D.match as firstname
from Dictionary('names/strictFirst_france.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Spain first names from BluePages
create view StrictFirstName7 as
select D.match as firstname
from Dictionary('names/strictFirst_spain.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Indian first names from BluePages
-- TODO: still need to clean up the remaining entries
create view StrictFirstName8 as
select D.match as firstname
from Dictionary('names/strictFirst_india.partial.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Israel first names from BluePages
create view StrictFirstName9 as
select D.match as firstname
from Dictionary('names/strictFirst_israel.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);
-- union all the dictionary matches for first names
create view StrictFirstName as
(select S.firstname as firstname from StrictFirstName1 S)
union all
(select S.firstname as firstname from StrictFirstName2 S)
union all
(select S.firstname as firstname from StrictFirstName3 S)
union all
(select S.firstname as firstname from StrictFirstName4 S)
union all
(select S.firstname as firstname from StrictFirstName5 S)
union all
(select S.firstname as firstname from StrictFirstName6 S)
union all
(select S.firstname as firstname from StrictFirstName7 S)
union all
(select S.firstname as firstname from StrictFirstName8 S)
union all
(select S.firstname as firstname from StrictFirstName9 S);
-- Relaxed versions of first name
create view RelaxedFirstName1 as
select CombineSpans(S.firstname, CP.name) as firstname
from StrictFirstName S, StrictCapsPerson CP
where FollowsTok(S.firstname, CP.name, 1, 1)
and MatchesRegex(/\-/, SpanBetween(S.firstname, CP.name));

create view RelaxedFirstName2 as
select CombineSpans(CP.name, S.firstname) as firstname
from StrictFirstName S, StrictCapsPerson CP
where FollowsTok(CP.name, S.firstname, 1, 1)
and MatchesRegex(/\-/, SpanBetween(CP.name, S.firstname));
-- all the first names
create view FirstNameAll as
(select N.firstname as firstname from StrictFirstName N)
union all
(select N.firstname as firstname from RelaxedFirstName1 N)
union all
(select N.firstname as firstname from RelaxedFirstName2 N);

create view ValidFirstNameAll as
select N.firstname as firstname
from FirstNameAll N
where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, N.firstname))
and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*-(\p{Lu}\p{M}*)+/, N.firstname));

create view FirstName as
select C.firstname as firstname
--from Consolidate(ValidFirstNameAll.firstname) C;
from ValidFirstNameAll C
consolidate on C.firstname;
-- Combine all dictionary matches for both last names and first names
create view NameDict as
select D.match as name
from Dictionary('name.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict1 as
select D.match as name
from Dictionary('names/name_italy.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict2 as
select D.match as name
from Dictionary('names/name_france.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict3 as
select D.match as name
from Dictionary('names/name_spain.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict4 as
select D.match as name
from FirstName FN, InitialWord IW, CapsPerson CP
where FollowsTok(FN.firstname, IW.word, 0, 0)
and FollowsTok(IW.word, CP.name, 0, 0);
/**
 * Translation for Rule 3r2
 *
 * This relaxed version of rule '3' will find person names like Thomas B.M. David
 * But it only insists that the second word is in the person dictionary
 */
/*
<rule annotation=Person id=3r2>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person3r2 as
select CombineSpans(CP.name, LN.lastname) as person
from LastName LN, InitialWord IW, CapsPerson CP
where FollowsTok(CP.name, IW.word, 0, 0)
and FollowsTok(IW.word, LN.lastname, 0, 0);
/**
 * Translation for Rule 4
 *
 * This rule will find person names like David Thomas
 */
/*
<rule annotation=Person id=4>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4WithNewLine as
select CombineSpans(FN.firstname, LN.lastname) as person
from FirstName FN, LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0);

-- Yunyao: 05/20/2008 revised to Person4WrongCandidates due to performance reason
-- NOTE: current optimizer executes Equals first, thus making Person4Wrong very expensive
--create view Person4Wrong as
--select CombineSpans(FN.firstname, LN.lastname) as person
--from FirstName FN, LastName LN
--where FollowsTok(FN.firstname, LN.lastname, 0, 0)
-- and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname))
-- and Equals(GetText(FN.firstname), GetText(LN.lastname));

create view Person4WrongCandidates as
select FN.firstname as firstname, LN.lastname as lastname
from FirstName FN, LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0)
and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname));

create view Person4 as
(select P.person as person from Person4WithNewLine P)
minus
(select CombineSpans(P.firstname, P.lastname) as person
 from Person4WrongCandidates P
 where Equals(GetText(P.firstname), GetText(P.lastname)));

/**
 * Translation for Rule 4a
 *
 * This rule will find person names like Thomas, David
 */
/*
<rule annotation=Person id=4a>
<internal>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>\,</token>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4a as
select CombineSpans(LN.lastname, FN.firstname) as person
from FirstName FN, LastName LN
where FollowsTok(LN.lastname, FN.firstname, 1, 1)
and ContainsRegex(/,/, SpanBetween(LN.lastname, FN.firstname));
-- relaxed version of Rule 4a
-- Yunyao: split the following rule into two to improve performance
-- TODO: Test case for optimizer
--create view Person4ar1 as
--select CombineSpans(CP.name, FN.firstname) as person
--from FirstName FN, CapsPerson CP
--where FollowsTok(CP.name, FN.firstname, 1, 1)
--and ContainsRegex(/,/, SpanBetween(CP.name, FN.firstname))
--and Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(CP.name, 10)))
--and Not(MatchesRegex(/(?i)(.+fully)/, CP.name))
--and GreaterThan(GetBegin(CP.name), 10);

create view Person4ar1temp as
select FN.firstname as firstname, CP.name as name
from FirstName FN, CapsPerson CP
where FollowsTok(CP.name, FN.firstname, 1, 1)
and ContainsRegex(/,/, SpanBetween(CP.name, FN.firstname));

create view Person4ar1 as
select CombineSpans(P.name, P.firstname) as person
from Person4ar1temp P
where Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(P.name, 10))) --'
and Not(MatchesRegex(/(?i)(.+fully)/, P.name))
and GreaterThan(GetBegin(P.name), 10);

create view Person4ar2 as
select CombineSpans(LN.lastname, CP.name) as person
from CapsPerson CP, LastName LN
where FollowsTok(LN.lastname, CP.name, 0, 1)
and ContainsRegex(/,/, SpanBetween(LN.lastname, CP.name));
/*** Translation for Rule2** This rule will handles names of persons like B.M. Thomas David, where Thomas occurs in some person dictionary*//*<rule annotation=Person id=2><internal><token attribute={etc}>INITIALWORD</token><token attribute={etc}PERSON{etc}>CAPSPERSON</token><token attribute={etc}>CAPSPERSON</token></internal></rule>*/
create view Person2 asselect CombineSpans(IW.word, CP.name) as personfrom InitialWord IW,
PersonDict P,CapsPerson CP
where FollowsTok(IW.word, P.name, 0, 0)and FollowsTok(P.name, CP.name, 0, 0);
/*** Translation for Rule 2a** The rule handles names of persons like B.M. Thomas David, where David occurs in some person dictionary*//*<rule annotation=Person id=2a><internal><token attribute={etc}>INITIALWORD</token><token attribute={etc}>CAPSPERSON</token><token attribute={etc}>NEWLINE</token>?<token attribute={etc}PERSON{etc}>CAPSPERSON</token></internal></rule>*/
create view Person2a asselect CombineSpans(IW.word, P.name) as personfrom InitialWord IW,
CapsPerson CP,PersonDict P
where FollowsTok(IW.word, CP.name, 0, 0)and FollowsTok(CP.name, P.name, 0, 0);
/*<rule annotation=Person id=4r1><internal><token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token><token attribute={etc}>NEWLINE</token>?<token attribute={etc}>CAPSPERSON</token></internal></rule>*/create view Person4r1 asselect CombineSpans(FN.firstname, CP.name) as personfrom FirstName FN,
CapsPerson CPwhere FollowsTok(FN.firstname, CP.name, 0, 0);
/**
 * Translation for Rule 4r2
 *
 * This relaxed version of rule '4' will find person names like Thomas, David,
 * but it only insists that the SECOND word is in some person dictionary
 */
/*
<rule annotation=Person id=4r2>
<token attribute={etc}>ANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4r2 as
select CombineSpans(CP.name, LN.lastname) as person
from CapsPerson CP,
     LastName LN
where FollowsTok(CP.name, LN.lastname, 0, 0);
/**
 * Translation for Rule 5
 *
 * This rule will find other single-token person first names
 */
/*
<rule annotation=Person id=5>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person5 as
select CombineSpans(IW.word, FN.firstname) as person
from InitialWord IW,
     FirstName FN
where FollowsTok(IW.word, FN.firstname, 0, 0);
/**
 * Translation for Rule 6
 *
 * This rule will find other single-token person last names
 */
/*
<rule annotation=Person id=6>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person6 as
select CombineSpans(IW.word, LN.lastname) as person
from InitialWord IW,
     LastName LN
where FollowsTok(IW.word, LN.lastname, 0, 0);
--==========================================================
-- End of rules
--
-- Create final list of names based on all the matches extracted
--==========================================================
/**
 * Union all matches found by strong rules, except the ones that come
 * directly from dictionary matches
 */
create view PersonStrongWithNewLine as
(select P.person as person from Person1 P)
--union all
--(select P.person as person from Person1a_more P)
union all
(select P.person as person from Person3 P)
union all
(select P.person as person from Person4 P)
union all
(select P.person as person from Person3P1 P);
create view PersonStrongSingleTokenOnly as
(select P.person as person from Person5 P)
union all
(select P.person as person from Person6 P)
union all
(select P.firstname as person from FirstName P)
union all
(select P.lastname as person from LastName P)
union all
(select P.person as person from Person1a P);
-- Yunyao: added 05/09/2008 to expand person names with suffix
create view PersonStrongSingleTokenOnlyExpanded1 as
select CombineSpans(P.person, S.suffix) as person
from PersonStrongSingleTokenOnly P,
     PersonSuffix S
where FollowsTok(P.person, S.suffix, 0, 0);
-- Yunyao: added 04/14/2009 to extend a single-token person name with a single initial
create view PersonStrongSingleTokenOnlyExpanded2 as
select CombineSpans(R.person, RightContext(R.person, 2)) as person
from PersonStrongSingleTokenOnly R
where MatchesRegex(/ +[\p{Upper}]\b\s*/, RightContext(R.person, 3));
create view PersonStrongSingleToken as
(select P.person as person from PersonStrongSingleTokenOnly P)
union all
(select P.person as person from PersonStrongSingleTokenOnlyExpanded1 P)
union all
(select P.person as person from PersonStrongSingleTokenOnlyExpanded2 P);
/**
 * Union all matches found by weak rules
 */
create view PersonWeak1WithNewLine as
(select P.person as person from Person3r1 P)
union all
(select P.person as person from Person3r2 P)
union all
(select P.person as person from Person4r1 P)
union all
(select P.person as person from Person4r2 P)
union all
(select P.person as person from Person2 P)
union all
(select P.person as person from Person2a P)
union all
(select P.person as person from Person3P2 P)
union all
(select P.person as person from Person3P3 P);
-- weak rules that identify (LastName, FirstName)
create view PersonWeak2WithNewLine as
(select P.person as person from Person4a P)
union all
(select P.person as person from Person4ar1 P)
union all
(select P.person as person from Person4ar2 P);
--include 'core/GenericNE/Person-FilterNewLineSingle.aql';
--include 'core/GenericNE/Person-Filter.aql';
create view PersonBase as
(select P.person as person from PersonStrongWithNewLine P)
union all
(select P.person as person from PersonWeak1WithNewLine P)
union all
(select P.person as person from PersonWeak2WithNewLine P);

output view PersonBase;
from Dictionary('names/name_israel.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);
create view NamesAll as
(select P.name as name from NameDict P)
union all
(select P.name as name from NameDict1 P)
union all
(select P.name as name from NameDict2 P)
union all
(select P.name as name from NameDict3 P)
union all
(select P.name as name from NameDict4 P)
union all
(select P.firstname as name from FirstName P)
union all
(select P.lastname as name from LastName P);
create view PersonDict as
select C.name as name
--from Consolidate(NamesAll.name) C;
from NamesAll C
consolidate on C.name;
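The `consolidate on` clause collapses overlapping matches from the unioned dictionaries into a single span each. One common consolidation policy is containment, where a match strictly inside another is discarded. A minimal sketch of that policy (my own simplification, not SystemT's actual implementation, which supports several policies):

```python
def consolidate(spans):
    """Keep only spans that are not strictly contained in another span:
    a sketch of one policy behind AQL's 'consolidate on' clause."""
    def contained_in(s, t):
        # s lies inside t and is not identical to it
        return t[0] <= s[0] and s[1] <= t[1] and s != t
    return [s for s in spans if not any(contained_in(s, t) for t in spans)]

# (2, 3) lies inside (0, 5), so only the enclosing span survives
matches = [(0, 5), (2, 3), (10, 12)]
result = consolidate(matches)
```

This matters here because the same name can match several dictionaries at once (e.g. both a first-name and a generic name dictionary); without consolidation, every downstream join would multiply those duplicates.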
--==========================================================
-- Actual Rules
--==========================================================
-- For 3-part Person names
create view Person3P1 as
select CombineSpans(F.firstname, L.lastname) as person
from StrictFirstName F,
     StrictCapsPersonR S,
     StrictLastName L
where FollowsTok(F.firstname, S.name, 0, 0)
  --and FollowsTok(S.name, L.lastname, 0, 0)
  and FollowsTok(F.firstname, L.lastname, 1, 1)
  and Not(Equals(GetText(F.firstname), GetText(L.lastname)))
  and Not(Equals(GetText(F.firstname), GetText(S.name)))
  and Not(Equals(GetText(S.name), GetText(L.lastname)))
  and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(F.firstname, L.lastname)));
create view Person3P2 as
select CombineSpans(P.name, L.lastname) as person
from PersonDict P,
     StrictCapsPersonR S,
     StrictLastName L
where FollowsTok(P.name, S.name, 0, 0)
  --and FollowsTok(S.name, L.lastname, 0, 0)
  and FollowsTok(P.name, L.lastname, 1, 1)
  and Not(Equals(GetText(P.name), GetText(L.lastname)))
  and Not(Equals(GetText(P.name), GetText(S.name)))
  and Not(Equals(GetText(S.name), GetText(L.lastname)))
  and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(P.name, L.lastname)));
create view Person3P3 as
select CombineSpans(F.firstname, P.name) as person
from PersonDict P,
     StrictCapsPersonR S,
     StrictFirstName F
where FollowsTok(F.firstname, S.name, 0, 0)
  --and FollowsTok(S.name, P.name, 0, 0)
  and FollowsTok(F.firstname, P.name, 1, 1)
  and Not(Equals(GetText(P.name), GetText(F.firstname)))
  and Not(Equals(GetText(P.name), GetText(S.name)))
  and Not(Equals(GetText(S.name), GetText(F.firstname)))
  and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(F.firstname, P.name)));
/**
 * Translation for Rule 1
 * Handles names of persons like Mr. Vladimir E. Putin
 */
/*
<rule annotation=Person id=1>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person1 as
select CombineSpans(CP1.name, CP2.name) as person
from Initial I,
     CapsPerson CP1,
     InitialWord IW,
     CapsPerson CP2
where FollowsTok(I.initial, CP1.name, 0, 0)
  and FollowsTok(CP1.name, IW.word, 0, 0)
  and FollowsTok(IW.word, CP2.name, 0, 0);
--and Not(ContainsRegex(/[\n\r]/, SpanBetween(I.initial, CP2.name)));
/**
 * Translation for Rule 1a
 * Handles names of persons like Mr. Vladimir Putin
 */
/*
<rule annotation=Person id=1a>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>{1,3}
</internal>
</rule>
*/
-- Split into two rules so that single-token annotations are separated from the others

-- Single-token annotations
create view Person1a1 as
select CP1.name as person
from Initial I,
     CapsPerson CP1
where FollowsTok(I.initial, CP1.name, 0, 0)
--- start changing this block
--- disallow newline
  and Not(ContainsRegex(/[\n\t]/, SpanBetween(I.initial, CP1.name)))
--- end changing this block
;
-- Yunyao: added 05/09/2008 to match patterns such as "Mr. B. B. Buy"
/*
create view Person1a2 as
select CombineSpans(name.block, CP1.name) as person
from Initial I,
     BlockTok(0, 1, 2, InitialWord.word) name,
     CapsPerson CP1
where FollowsTok(I.initial, name.block, 0, 0)
  and FollowsTok(name.block, CP1.name, 0, 0)
  and Not(ContainsRegex(/[\n\t]/, CombineSpans(I.initial, CP1.name)));
*/
create view Person1a as
select P.person as person from Person1a1 P;
-- union all
-- (select P.person as person from Person1a2 P);
/*
create view Person1a_more as
select name.block as person
from Initial I,
     BlockTok(0, 2, 3, CapsPerson.name) name
where FollowsTok(I.initial, name.block, 0, 0)
  and Not(ContainsRegex(/[\n\t]/, name.block))
--- start changing this block
-- disallow newline
  and Not(ContainsRegex(/[\n\t]/, SpanBetween(I.initial, name.block)))
--- end changing this block
;
*/
/**
 * Translation for Rule 3
 * Find person names like Thomas B.M. David
 */
/*
<rule annotation=Person id=3>
<internal>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person3 as
select CombineSpans(P1.name, P2.name) as person
from PersonDict P1,
     --InitialWord IW,
     WeakInitialWord IW,
     PersonDict P2
where FollowsTok(P1.name, IW.word, 0, 0)
  and FollowsTok(IW.word, P2.name, 0, 0)
  and Not(Equals(GetText(P1.name), GetText(P2.name)));
/**
 * Translation for Rule 3r1
 *
 * This relaxed version of rule '3' will find person names like Thomas B.M. David,
 * but it only insists that the first word is in the person dictionary
 */
/*
<rule annotation=Person id=3r1>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person3r1 as
-- Find dictionary matches for all title initials
create view Initial as
select D.match as initial
from Dictionary('InitialDict', Doc.text) D;
-- Yunyao: added 05/09/2008 to capture person name suffix
create dictionary PersonSuffixDict as
(
  ',jr.', ',jr', 'III', 'IV', 'V', 'VI'
);

create view PersonSuffix as
select D.match as suffix
from Dictionary('PersonSuffixDict', Doc.text) D;
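Dictionary() scans the document for occurrences of any entry in the named dictionary and emits one match span per hit. A toy token-level approximation (my own sketch; the real operator works over SystemT's tokenization and supports configurable case folding):

```python
def dictionary_matches(entries, tokens):
    """Toy version of AQL's Dictionary(): return (first, last) token spans
    whose text matches a dictionary entry, ignoring case.
    Multi-token entries are compared token-by-token."""
    entry_set = {tuple(e.lower().split()) for e in entries}
    longest = max(len(e) for e in entry_set)
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + longest, len(tokens)) + 1):
            if tuple(t.lower() for t in tokens[i:j]) in entry_set:
                hits.append((i, j - 1))
    return hits

tokens = ['Mr.', 'John', 'Smith', 'III']
suffix_hits = dictionary_matches(['III', 'IV', 'V', 'VI'], tokens)  # [(3, 3)]
```

The PersonSuffix view then feeds the suffix-expansion rule above: a strong single-token name immediately followed (FollowsTok 0, 0) by one of these suffix matches gets widened to cover both.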
-- Find capitalized words that look like person names and are not in the non-name dictionaries
create view CapsPersonCandidate as
select R.match as name
--from Regex(/\b\p{Upper}\p{Lower}[\p{Alpha}]{1,20}\b/, Doc.text) R
--from Regex(/\b\p{Upper}\p{Lower}[\p{Alpha}]{0,10}(['-][\p{Upper}])?[\p{Alpha}]{1,10}\b/, Doc.text) R
-- change to enable unicode match
--from Regex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*[\p{L}\p{M}*]{0,10}(['-][\p{Lu}\p{M}*])?[\p{L}\p{M}*]{1,10}\b/, Doc.text) R
--from Regex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*[\p{L}\p{M}*]{0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}\b/, Doc.text) R
-- Allow fully capitalized words
--from Regex(/\b\p{Lu}\p{M}*(\p{L}\p{M}*){0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}\b/, Doc.text) R
from RegexTok(/\p{Lu}\p{M}*(\p{L}\p{M}*){0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}/, 4, Doc.text) R
where Not(ContainsDicts(
    'FilterPersonDict',
    'filterPerson_position.dict',
    'filterPerson_german.dict',
    'InitialDict',
    'StrongPhoneVariantDictionary',
    'stateList.dict',
    'organization_suffix.dict',
    'industryType_suffix.dict',
    'streetSuffix_forPerson.dict',
    'wkday.dict',
    'nationality.dict',
    'stateListAbbrev.dict',
    'stateAbbrv.ChicagoAPStyle.dict', R.match));
create view CapsPerson as
select C.name as name
from CapsPersonCandidate C
where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, C.name))
  and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*-(\p{Lu}\p{M}*)+/, C.name));
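The two MatchesRegex filters above reject hyphenated candidates that mix an all-caps chunk with a lowercase-bearing chunk, such as "IBM-branded". An ASCII-only stand-in using stdlib Python (the AQL patterns use Unicode property classes \p{Lu}/\p{Ll}/\p{M}, which Python's built-in re module does not support, so this is a simplified approximation):

```python
import re

# ASCII stand-ins for the two AQL filters on CapsPersonCandidate
UPPER_THEN_LOWER = re.compile(r'[A-Z]+-.*[a-z].*')  # e.g. 'IBM-branded'
LOWER_THEN_UPPER = re.compile(r'.*[a-z].*-[A-Z]+')  # e.g. 'branded-IBM'

def keep_caps_person(name):
    """Mimic MatchesRegex (full-span match): drop candidates where an
    all-caps chunk is hyphenated onto a chunk containing lowercase."""
    return (UPPER_THEN_LOWER.fullmatch(name) is None and
            LOWER_THEN_UPPER.fullmatch(name) is None)

keep_caps_person('Smith-Jones')   # kept: a plausible hyphenated surname
keep_caps_person('IBM-branded')   # rejected by the first filter
```

Note the use of fullmatch rather than search: AQL's MatchesRegex anchors the pattern to the entire span, unlike ContainsRegex, which matches anywhere inside it.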
-- Find strict capitalized words with two letters or more (relaxed version of StrictCapsPerson)

--============================================================
-- TODO: need to think through how to deal with hyphenated names.
-- One way is to run Regex(pattern, CP.name) and enforce that CP.name does not contain '
-- Need more testing before confirming the change.

create view CapsPersonNoP as
select CP.name as name
from CapsPerson CP
where Not(ContainsRegex(/'/, CP.name)); --'
create view StrictCapsPersonR as
select R.match as name
--from Regex(/\b\p{Lu}\p{M}*(\p{L}\p{M}*){1,20}\b/, CapsPersonNoP.name) R;
from RegexTok(/\p{Lu}\p{M}*(\p{L}\p{M}*){1,20}/, 1, CapsPersonNoP.name) R;
--============================================================

-- Find strict capitalized words
create view StrictCapsPerson as
select R.name as name
from StrictCapsPersonR R
where MatchesRegex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*(\p{L}\p{M}*){1,20}\b/, R.name);
-- Find dictionary matches for all last names
create view StrictLastName1 as
select D.match as lastname
from Dictionary('strictLast.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName2 as
select D.match as lastname
from Dictionary('strictLast_german.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName3 as
select D.match as lastname
from Dictionary('strictLast_german_bluePages.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName4 as
select D.match as lastname
from Dictionary('uniqMostCommonSurname.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName5 as
select D.match as lastname
from Dictionary('names/strictLast_italy.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName6 as
select D.match as lastname
from Dictionary('names/strictLast_france.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName7 as
select D.match as lastname
from Dictionary('names/strictLast_spain.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName8 as
select D.match as lastname
from Dictionary('names/strictLast_india.partial.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName9 as
select D.match as lastname
from Dictionary('names/strictLast_israel.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName as
(select S.lastname as lastname from StrictLastName1 S)
union all
(select S.lastname as lastname from StrictLastName2 S)
union all
(select S.lastname as lastname from StrictLastName3 S)
union all
(select S.lastname as lastname from StrictLastName4 S)
union all
(select S.lastname as lastname from StrictLastName5 S)
union all
(select S.lastname as lastname from StrictLastName6 S)
union all
(select S.lastname as lastname from StrictLastName7 S)
union all
(select S.lastname as lastname from StrictLastName8 S)
union all
(select S.lastname as lastname from StrictLastName9 S);
-- Relaxed version of last name
create view RelaxedLastName1 as
select CombineSpans(SL.lastname, CP.name) as lastname
from StrictLastName SL,
     StrictCapsPerson CP
where FollowsTok(SL.lastname, CP.name, 1, 1)
  and MatchesRegex(/\-/, SpanBetween(SL.lastname, CP.name));

create view RelaxedLastName2 as
select CombineSpans(CP.name, SL.lastname) as lastname
from StrictLastName SL,
     StrictCapsPerson CP
where FollowsTok(CP.name, SL.lastname, 1, 1)
  and MatchesRegex(/\-/, SpanBetween(CP.name, SL.lastname));
-- all the last names
create view LastNameAll as
(select N.lastname as lastname from StrictLastName N)
union all
(select N.lastname as lastname from RelaxedLastName1 N)
union all
(select N.lastname as lastname from RelaxedLastName2 N);

create view ValidLastNameAll as
select N.lastname as lastname
---------------------------------------
-- Document Preprocessing
---------------------------------------
create view Doc as
select D.text as text
from DocScan D;

----------------------------------------
-- Basic Named Entity Annotators
----------------------------------------

-- Find initial words
create view InitialWord1 as
select R.match as word
--from Regex(/\b([\p{Upper}]\.\s*){1,5}\b/, Doc.text) R
from RegexTok(/([\p{Upper}]\.\s*){1,5}/, 10, Doc.text) R
-- added on 04/18/2008
where Not(MatchesRegex(/M\.D\./, R.match));
-- Yunyao: added on 11/21/2008 to capture names with prefix
-- (we use it as an initial to avoid adding too many complex rules)
create view InitialWord2 as
select D.match as word
from Dictionary('specialNamePrefix.dict', Doc.text) D;

create view InitialWord as
(select I.word as word from InitialWord1 I)
union all
(select I.word as word from InitialWord2 I);
-- Find weak initial words
create view WeakInitialWord as
select R.match as word
--from Regex(/\b([\p{Upper}]\.?\s*){1,5}\b/, Doc.text) R;
from RegexTok(/([\p{Upper}]\.?\s*){1,5}/, 10, Doc.text) R
-- added on 05/12/2008
-- Do not allow a weak initial word to be a word longer than three characters
where Not(ContainsRegex(/[\p{Upper}]{3}/, R.match))
-- added on 04/14/2009
-- Do not allow weak initial words to match the timezone
  and Not(ContainsDict('timeZone.dict', R.match));
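The two negative predicates on WeakInitialWord can be mimicked in a few lines. An ASCII sketch of the first one (my own illustration; the AQL uses the Unicode class \p{Upper}, and the dictionary test is replaced here by a plain set lookup):

```python
import re

TIMEZONES = {'PST', 'EST', 'GMT', 'CET'}  # stand-in for timeZone.dict

def is_weak_initial_word(word):
    """Sketch of the WeakInitialWord filters: reject candidates with three
    or more consecutive uppercase letters (ContainsRegex /[A-Z]{3}/) or
    that appear in the timezone dictionary."""
    if re.search(r'[A-Z]{3}', word):
        return False
    return word not in TIMEZONES

is_weak_initial_word('B.M.')  # accepted: dotted initials
is_weak_initial_word('USA')   # rejected: three consecutive capitals
```

The point of the three-capitals test is that genuine initials are interleaved with periods, so any run of three bare capitals is more likely an acronym than a name prefix.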
-----------------------------------------------
-- Strong Phone Numbers
-----------------------------------------------
create dictionary StrongPhoneVariantDictionary as
(
  'phone', 'cell', 'contact', 'direct', 'office',
  -- Yunyao: Added new strong clues for phone numbers
  'tel', 'dial', 'Telefon', 'mobile', 'Ph', 'Phone Number', 'Direct Line',
  'Telephone No', 'TTY', 'Toll Free', 'Toll-free',
  -- German
  'Fon', 'Telefon Geschaeftsstelle', 'Telefon Geschäftsstelle',
  'Telefon Zweigstelle', 'Telefon Hauptsitz',
  'Telefon (Geschaeftsstelle)', 'Telefon (Geschäftsstelle)',
  'Telefon (Zweigstelle)', 'Telefon (Hauptsitz)',
  'Telefonnummer', 'Telefon Geschaeftssitz', 'Telefon Geschäftssitz',
  'Telefon (Geschaeftssitz)', 'Telefon (Geschäftssitz)',
  'Telefon Persönlich', 'Telefon persoenlich',
  'Telefon (Persönlich)', 'Telefon (persoenlich)',
  'Handy', 'Handy-Nummer', 'Telefon arbeit', 'Telefon (arbeit)'
);
--include 'core/GenericNE/Person.aql';
create dictionary FilterPersonDict as
(
  'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher', 'All', 'Tell',
  'Friends', 'Friend', 'Colleague', 'Colleagues', 'Managers', 'If',
  'Customer', 'Users', 'User', 'Valued', 'Executive', 'Chairs',
  'New', 'Owner', 'Conference', 'Please', 'Outlook', 'Lotus', 'Notes',
  'This', 'That', 'There', 'Here', 'Subscribers', 'What', 'When', 'Where', 'Which',
  'With', 'While', 'Thanks', 'Thanksgiving', 'Senator', 'Platinum', 'Perspective',
  'Manager', 'Ambassador', 'Professor', 'Dear', 'Contact', 'Cheers', 'Athelet',
  'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center', 'The', 'Take', 'Junior',
  'Both', 'Communities', 'Greetings', 'Hope', 'Restaurants', 'Properties',
  'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My', 'His', 'Her',
  'Their', 'Popcorn', 'Name', 'July', 'June', 'Join',
  'Business', 'Administrative', 'South', 'Members',
  'Address', 'Please', 'List', 'Public', 'Inc', 'Parkway', 'Brother', 'Buy', 'Then',
  'Services', 'Statements', 'President', 'Governor', 'Commissioner',
  'Commitment', 'Commits', 'Hey', 'Director', 'End', 'Exit', 'Experiences', 'Finance',
  'Elementary', 'Wednesday', 'Nov', 'Infrastructure', 'Inside', 'Convention',
  'Judge', 'Lady', 'Friday', 'Project', 'Projected', 'Recalls', 'Regards',
  'Recently', 'Administration',
  'Independence', 'Denied', 'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike',
  'Was', 'Were', 'Secretary', 'Speaker', 'Chairman', 'Consider', 'Consultant',
  'County', 'Court', 'Defensive', 'Northwestern', 'Place', 'Hi', 'Futures', 'Athlete',
  'Invitational', 'System', 'International', 'Main', 'Online', 'Ideally'
  -- more entries
  ,'If', 'Our', 'About', 'Analyst', 'On', 'Of', 'By', 'HR', 'Mkt',
  'Pre', 'Post', 'Condominium', 'Ice', 'Surname', 'Lastname',
  'firstname', 'Name', 'familyname',
  -- Italian greeting
  'Ciao',
  -- Spanish greeting
  'Hola',
  -- French greeting
  'Bonjour',
  -- new entries
  'Pro', 'Bono', 'Enterprises', 'Group', 'Said', 'Says', 'Assistant',
  'Vice', 'Warden', 'Contribution',
  'Research', 'Development', 'Product', 'Sales',
  'Support', 'Manager', 'Telephone', 'Phone', 'Contact', 'Information',
  'Electronics', 'Managed', 'West', 'East', 'North', 'South', 'Teaches',
  'Ministry', 'Church', 'Association',
  'Laboratories', 'Living', 'Community', 'Visiting',
  'Officer', 'After', 'Pls', 'FYI', 'Only', 'Additionally',
  'Adding', 'Acquire', 'Addition', 'America',
  -- short phrases that are likely to be at the start of a sentence
  'Yes', 'No', 'Ja', 'Nein', 'Kein', 'Keine', 'Gegenstimme',
  -- TODO: to be double checked
  'Another', 'Anyway', 'Associate', 'At', 'Athletes', 'It',
  'Enron', 'EnronXGate', 'Have', 'However', 'Company', 'Companies', 'IBM', 'Annual',
  -- common verbs that appear with person names in financial reports;
  -- ideally we want a general comprehensive verb list to use as a filter dictionary
  'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees', 'Warns', 'Announces', 'Reviews'
  -- Laura 06/02/2009: new filter dict for titles for the SEC domain in filterPerson_title.dict
);
create dictionary GreetingsDict as
(
  'Hey', 'Hi', 'Hello', 'Dear',
  -- German greetings
  'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo',
  -- Italian
  'Ciao',
  -- Spanish
  'Hola',
  -- French
  'Bonjour'
);
create dictionary InitialDict as
(
  'rev.', 'col.', 'reverend', 'prof.', 'professor.', 'lady',
  'miss.', 'mrs.', 'mrs', 'mr.', 'pt.', 'ms.',
  'messrs.', 'dr.', 'master.', 'marquis', 'monsieur', 'ds', 'di'
  --'Dear' (Yunyao: commented out to avoid mismatches such as Dear Member),
  --'Junior' (Yunyao: commented out to avoid mismatches such as Junior National [team player];
  -- if we can have a large negative dictionary to eliminate such mismatches,
  -- then this may be recovered)
  --'Name:' (Yunyao: commented out to avoid mismatches such as 'Name: Last Name')
  -- for German names
  -- TODO: need further test
  ,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor',
  'Herr Professor', 'Frau professor', 'Baron', 'graf'
);
IE Rule Development Is Hard---------------------------------------
SystemT’s Person extractor
~250 AQL rules
“Global financial services firm Morgan Stanley announced … “
Person
Lessons Learned
� Managing rule sets for multiple products is difficult
– Maintain a core generic library
– Expose hooks for application-specific customizations
• Propagate some customizations back to generic library
– New requests keep coming!
� Creating an ecosystem of developers is important
– Writing rules is still an “art”
– 30+ AQL rule developers across Almaden, CRL, CDL, IRL, SVL, TRL, YSL, Watson, SWG + one business partner
• Some involved in the AQL working group
� Several new research problems have surfaced
How to assist developers in building and maintaining rules?
SystemT Development Environment (Ongoing Effort)
Develop
Test
Analyze
• Regex learning [Li08]
• Suggest rule changes [Liu10]
• Rule induction
• Use reference information

• Track provenance [Liu10]
• Contextual clue discovery

• Concordance Viewer
• Active labeling

• NE Interface [Chiticariu10]
• Tagger UI [Kandogan07]
Development
Deploy
Refine
Test
Maintenance
Outline
� Challenges in Grammar-based IE systems
� SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
� Related Work
Related Work (1/2): Declarative IE Systems from the Database Community
� Common vision:
– Separate semantics from order of execution
– Build the system around a language like SQL or Datalog
� Different interpretations of declarativity
High-level declarative: CIMPLE (U. Wisconsin)
• Overall IE pipeline replaced with a declarative language
• Individual extraction components still “black boxes”

Mixed declarative: PSOX (Yahoo!), SQoUT (Columbia U.)
• Declarative language for some (not all) extraction operations
• Both at the individual and pipeline level

Completely declarative: SystemT (IBM), BayesStore (UC Berkeley)
• One declarative language covers all stages of extraction
Related Work (2/2): Design Considerations
� Optimization
– Granularity
• High-level: annotator composition
• Low-level: basic extraction operators
– Strategy: Rewrite-based vs. Cost-based
� Runtime Model
– Document-Centric vs. Collection-Centric
� More details: SIGMOD 2010 tutorial [Chiticariu et al., 2010]
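The document-centric vs. collection-centric distinction above can be made concrete: a document-centric runtime applies the compiled extractor to one document at a time and streams results out, while a collection-centric one materializes intermediate results across the whole corpus. A minimal sketch of the document-at-a-time contract (illustrative only; not SystemT's actual API, and the extractor here is a made-up stand-in):

```python
def run_document_centric(annotate, documents):
    """Apply an extractor to one document at a time and stream the results,
    so memory use is bounded by one document rather than the collection."""
    for doc in documents:
        yield doc, annotate(doc)

# Hypothetical extractor: count capitalized tokens per document
def count_caps(text):
    return sum(1 for tok in text.split() if tok[:1].isupper())

results = list(run_document_centric(count_caps,
                                    ['Bill Gates spoke', 'no names here']))
# [('Bill Gates spoke', 2), ('no names here', 0)]
```

This is why the document-centric model suits on-the-fly extraction (e.g. annotating email as it arrives), while collection-centric processing is a better fit for batch analytics over a fixed corpus.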
Summary
� SystemT
– Declarative IE system based on an algebraic framework
– Expressivity and performance advantages over grammar-based IE systems
– Text-specific optimizations
� Ongoing Work
– Tooling support for rule development/maintenance
– Improved optimization strategies
– New operators for advanced features (e.g. co-reference resolution)
Thank you!
� For more information…
– Download a free version of SystemT (newer version coming soon)
• https://www.alphaworks.ibm.com/tech/systemt/
– Visit the SystemT home page
• http://www.almaden.ibm.com/cs/projects/systemt/
– Contact me