TRANSCRIPT
© 2010 IBM Corporation
SystemT: an Algebraic Approach to Declarative Information Extraction
Laura Chiticariu, Rajasekar Krishnamurthy, Yunyao Li,
Sriram Raghavan, Frederick Reiss, Shivakumar Vaithyanathan
IBM Research - Almaden
Information Extraction (IE)
■ Distill structured data from unstructured and semi-structured text
■ Exploit the extracted data in your applications

Example (from Cohen's IE tutorial, 2003):

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted annotations:
Name             | Title   | Organization
Bill Gates       | CEO     | Microsoft
Bill Veghte      | VP      | Microsoft
Richard Stallman | Founder | Free Soft..
Information Extraction in Enterprise Applications
■ Information extraction is essential for many emerging enterprise applications
– Semantic search, compliance, BI over text, …
■ Requirements are driven by the enterprise apps that rely on IE for practical success
– Accuracy
• Garbage-in garbage-out: Usefulness of application is often tied to quality of extraction
– Scalability
• Large data volumes, often orders of magnitude larger than classical NLP corpora
• Many applications (e.g. email) require on-the-fly information extraction
– Flexible Runtime Model
• Heterogeneous runtime environments with different resource constraints, from laptop applications to distributed environments
– Transparency
• Customer complaints need to be addressed ASAP without compromising the overall experience
– Usability
• Building an accurate IE system is labor-intensive
• Critical for establishing an ecosystem of users
A Brief History of IE in the NLP Community

Rule-Based:
■ 1978-1997: MUC (Message Understanding Conference) – DARPA competition 1987 to 1997
– FRUMP [DeJong82]
– FASTUS [Appelt93]
– TextPro, PROTEUS
■ 1998: Common Pattern Specification Language (CPSL) standard [Appelt98]
– Standard for subsequent rule-based systems
■ 1999-present: Commercial products, GATE

Machine Learning:
■ At first: Simple techniques like Naive Bayes
■ 1990s: Learning rules
– AUTOSLOG [Riloff93]
– CRYSTAL [Soderland98]
– SRV [Freitag98]
■ 2000s: More specialized models
– Maximum Entropy Models [Berger96]
– Hidden Markov Models [Leek97]
– Maximum Entropy Markov Models [McCallum00]
– Conditional Random Fields [Lafferty01]
– Automatic feature expansion
Evolution of the SystemT Project

Timeline (2004-2010):
■ Custom Code
■ Grammar (CPSL-style cascading grammar system)
■ Grammar++ (Grammar + extensions outside the scope of grammars)
■ SystemT (algebraic information extraction system)

Evolutionary triggers between stages: diverse data sets and complex extraction tasks; performance and expressivity; large number of annotators.
Outline
■ Challenges in Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Finite-state Grammars
■ Common formalism underlying most rule-based IE systems
– Input text viewed as a sequence of tokens
– Rules expressed as regular expression patterns over the lexical features of these tokens
■ Several levels of processing → Cascading Grammars
– Typically, at higher levels of the grammar, larger segments of text are analyzed and annotated
■ Common Pattern Specification Language (CPSL)
– A common language to specify and represent finite-state transducers
– Each transducer accepts a sequence of annotations and outputs a sequence of annotations
Example: Simplified Person Annotator

Example document (excerpt): "… Tomorrow, we will meet Mark Scott, Howard Smith and …"
(the slide shows the same document after each stage: tokenization, Level 1, and Level 2)

Tokenization (preprocessing step)

Level 1:
⟨Gazetteer⟩[type = LastGaz] → ⟨Last⟩
⟨Gazetteer⟩[type = FirstGaz] → ⟨First⟩
⟨Token⟩[~ "[A-Z]\w+"] → ⟨Caps⟩

Level 2:
⟨First⟩ ⟨Last⟩ → ⟨Person⟩
⟨First⟩ ⟨Caps⟩ → ⟨Person⟩
⟨First⟩ → ⟨Person⟩

Notes: rule priority is used to prefer First over Caps, and First is preferred over Last since it was declared earlier. The rigid rule priority in Level 1 causes partial results.
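The partial-results problem can be sketched in a few lines of Python. This is a toy model, not real CPSL: the function names (`level1`, `apply_level2`) and the single-label-per-token simplification are assumptions for illustration.

```python
# Toy sketch of why rigid rule priority yields partial results.

def level1(tokens, first_gaz, last_gaz):
    """Rigid priority: each token receives exactly ONE label, from the first
    matching rule, since the grammar keeps a single annotation sequence."""
    labels = []
    for tok in tokens:
        if tok in first_gaz:
            labels.append('First')      # First wins over Last by priority
        elif tok in last_gaz:
            labels.append('Last')
        elif tok[:1].isupper():
            labels.append('Caps')
        else:
            labels.append('Token')
    return labels

def apply_level2(labels):
    """Level 2 patterns tried in priority order; first match consumes input."""
    rules = [(['First', 'Last'], 'Person'),
             (['First', 'Caps'], 'Person'),
             (['First'], 'Person')]
    out, i = [], 0
    while i < len(labels):
        for pattern, tag in rules:
            if labels[i:i + len(pattern)] == pattern:
                out.append((tag, i, i + len(pattern)))
                i += len(pattern)
                break
        else:
            i += 1
    return out

# "Scott" is in BOTH gazetteers; Level 1's rigid priority labels it First,
# so <First><Last> never matches and "Mark Scott" splits into two Persons.
labels = level1(["meet", "Mark", "Scott"], {"Mark", "Scott"}, {"Scott"})
people = apply_level2(labels)
```

Running this produces two single-token Person matches ("Mark" and "Scott") instead of one "Mark Scott", which is exactly the partial result the slide warns about.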
A More Complex Example: Signature

Extraction Task: Identify Signatures

Example signature:
Laura Haas, PhD
Distinguished Engineer and Director, Computer Science
Almaden Research Center
408-927-1700
http://www.almaden.ibm.com/cs
(annotated with Person, Organization, Phone, URL)

Constraints:
– Start with Person
– At least 1 Phone
– At least 2 of {Phone, Organization, URL, Email, Address}
– End with one of these
– Within 250 characters

Difficult to express using grammars (counting and aggregation not supported)
Limitations of Classical Grammar-based Extraction
■ Expressivity problems
– Rigid Matching Priority
• Leads to mistakes when two rules of the same priority match the same region of text
– Lossy Sequencing
• If input annotations to a grammar phase overlap, the CPSL engine must drop some of them → may lead to mistakes
– Limited expressivity in Rule Patterns
• Cannot express aggregation operations or span overlap conditions
■ Performance problems
– Complete pass through the tokens for each rule
– Many of these passes are wasted work
CPSL Extensions to Address Certain Limitations
■ Lossy Sequencing
– Grammar rules operate on graphs of annotations
■ Rigid Matching Priority
– Additional matching regimes introduced
■ Limited Expressivity
– Rule patterns expanded to allow more expressivity
■ Performance
– Faster finite state machines

Popular CPSL extensions: JAPE, AFst, XTDL
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Algebraic IE System from IBM Research - Almaden
■ Approaching IE differently
– Identify the most basic operations
– Create an operator for each basic operation
– Compose operators to build complex annotators
■ Benefits:
– Richer, cleaner rule semantics
– Better performance through optimization
SystemT Overview

Input Document Stream → AQL → SystemT Optimizer → Compiled Operator Graph → SystemT Runtime → Annotated Document Stream

– AQL: rule language with familiar SQL-like syntax; specifies annotator semantics declaratively
– SystemT Optimizer: chooses an efficient execution plan that implements the semantics
– SystemT Runtime: highly scalable, embeddable Java runtime
AQL: SystemT's Rule Language
■ Declarative language for defining annotators
– Compiles into SystemT's algebra
■ Main features
– Separates semantics from implementation
– Familiar syntax
– Full expressive power of the algebra
AQL by Example

create view FirstCaps as
select CombineSpans(F.name, C.name) as name
from First F, Caps C
where FollowsTok(F.name, C.name, 0, 0);

(matches a ⟨First⟩ annotation followed within 0 tokens by a ⟨Caps⟩ annotation)
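To make the semantics of this view concrete, here is a minimal Python model of it. Spans are (begin, end) character offsets; the names `follows_tok` and `combine_spans` mirror AQL's built-ins, but the implementations are simplified sketches (the token-distance check just counts whitespace-separated words in the gap).

```python
# A minimal Python model of the FirstCaps view's semantics.

def follows_tok(text, a, b, min_tok, max_tok):
    """True if span b starts after span a ends, with between min_tok and
    max_tok tokens in the gap (simplified: whitespace-separated words)."""
    if b[0] < a[1]:
        return False
    gap_tokens = len(text[a[1]:b[0]].split())
    return min_tok <= gap_tokens <= max_tok

def combine_spans(a, b):
    """Smallest span covering both input spans (like AQL's CombineSpans)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def first_caps(text, firsts, capses):
    """The select/from/where above as a Cartesian product + join predicate."""
    return [combine_spans(f, c)
            for f in firsts for c in capses
            if follows_tok(text, f, c, 0, 0)]
```

For example, with a First span over "Mark" and a Caps span over "Scott" in "we will meet Mark Scott today", the view yields a single span covering "Mark Scott".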
Regular Expression Extraction Operator

Example: the Regex operator with pattern [A-Z][a-z]+, applied to an input tuple holding the document "… we will meet Mark Scott and …", emits one output tuple per match: one with a span over "Mark" and one with a span over "Scott".
Some Example Operators
■ Regex
– Find all matches of a character-based regular expression
■ Dictionary
– Find all matches of an exhaustive dictionary of terms
■ Join
– Find pairs of sub-annotations that match a predicate
■ Block
– Identify contiguous blocks of lower-level matches
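As one illustration, the Dictionary operator can be sketched as follows. This is a simplified model, not SystemT's implementation: the real operator is a single-pass hash-based matcher, while this sketch uses a regex alternation for brevity, and the name `dictionary_op` is hypothetical.

```python
import re

def dictionary_op(text, terms, label):
    """Sketch of a Dictionary operator: find all whole-word matches of an
    exhaustive term list, emitting (begin, end, label) spans."""
    # longest-first alternation so longer terms win over their prefixes
    alts = '|'.join(map(re.escape, sorted(terms, key=len, reverse=True)))
    return [(m.start(), m.end(), label)
            for m in re.finditer(r'\b(?:' + alts + r')\b', text)]
```

Applied to "we will meet Mark Scott" with a first-name dictionary {"Mark", "Howard"}, it emits one span over "Mark".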
Revisiting Person Example

Input: "… Tomorrow, we will meet Mark Scott, Howard Smith …"

– Dictionary⟨First⟩, Dictionary⟨Last⟩, and Regex⟨Caps⟩ extract first-name, last-name, and capitalized-word matches
– Join⟨First, Last⟩ and Join⟨First, Caps⟩ produce candidate persons such as "Mark Scott" and "Howard Smith"
– Union combines the candidates from both joins; the input may contain overlapping annotations (no Lossy Sequencing problem), and the output may contain overlapping annotations (no Rigid Matching Regimes)
– Consolidate, an explicit operator for resolving ambiguity, produces the final results: "Mark Scott", "Howard Smith"

The rich set of algebraic operators supported by SystemT (Regex, Join, Union, Consolidate) alleviates the expressivity problems of CPSL.
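The explicit overlap resolution in the last step can be sketched as follows. This is a hedged sketch of one consolidation policy (containment); SystemT's Consolidate supports several policies, and the function name here is illustrative.

```python
def consolidate(spans):
    """Sketch of a Consolidate operator with a 'containment' policy: drop
    every span strictly contained in another span, resolving overlap
    explicitly instead of through a rigid matching regime."""
    survivors = []
    for s in spans:
        contained = any(o != s and o[0] <= s[0] and s[1] <= o[1]
                        for o in spans)
        if not contained:
            survivors.append(s)
    return sorted(set(survivors))
```

For instance, given the candidate spans "Mark Scott" (13, 23), "Mark" (13, 17), "Scott" (18, 23), and "Howard Smith" (24, 36), the two contained spans are dropped and the longer person matches survive.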
A More Complex Example: Signature

Execution plan sketch:
– Person, Organization, Phone, and URL annotations feed a Union of "contact info" annotations
– A Block operator finds blocks of two or more "contact info" patterns
– A final Join with Person, whose join predicates enforce the additional constraints, produces the Signature annotation
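The Block step above can be sketched in Python. This is an assumed, simplified grouping rule (spans whose gaps stay within a character budget), with the 250-character limit from the Signature constraints used as the default; the name `block_op` is hypothetical.

```python
def block_op(spans, min_count=2, max_sep=250):
    """Sketch of a Block operator: group sorted (begin, end) annotations into
    maximal runs where each span begins within max_sep characters of the end
    of the previous one; keep runs with at least min_count members."""
    out, run = [], []
    for s in sorted(spans):
        if run and s[0] - run[-1][1] > max_sep:
            if len(run) >= min_count:
                out.append((run[0][0], run[-1][1]))
            run = []
        run.append(s)
    if len(run) >= min_count:
        out.append((run[0][0], run[-1][1]))
    return out
```

Three contact-info spans clustered at the top of a document form one block; an isolated span far below is discarded for failing the two-or-more requirement.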
Addressing Issues of Grammar-based Systems

■ Grammar
– Rigid matching regimes
– Lossy sequencing
– Impossible to support native aggregation

■ SystemT
– No fixed sequencing: retain all matches and consolidate overlapping matches explicitly
– Retain all matches and discard ambiguous matches explicitly
– Rich set of operators

Theorem: The class of extraction tasks expressible as AQL queries is a strict superset of that expressible through expanded code-free CPSL grammars.
Comparison with Grammar-based Annotators

Dataset    Entity Type    System    Precision (Exact/Partial)   Recall (Exact/Partial)   F-measure (Exact/Partial)
ACE 2005   Location       SystemT   93.04 / 93.85               71.10 / 71.72            80.60 / 81.30
                          ANNIE     88.40 / 90.43               63.72 / 65.18            74.06 / 75.76
ACE 2005   Organization   SystemT   82.17 / 86.25               47.97 / 50.35            60.58 / 63.58
                          ANNIE     47.92 / 56.06               38.07 / 44.54            42.43 / 49.64
ACE 2005   Person         SystemT   89.96 / 91.94               68.28 / 69.78            77.63 / 79.34
                          ANNIE     39.68 / 76.69               30.89 / 59.72            34.74 / 67.15
Enron      Person         SystemT   85.17 / 90.32               68.69 / 72.84            76.05 / 80.65
                          ANNIE     57.05 / 76.84               48.59 / 65.46            52.48 / 70.69

Extraction Task: Named-entity extraction
Systems compared: SystemT (out-of-box) vs. ANNIE (uses JAPE in GATE)
Comparison with State-of-the-art Results

Dataset      Entity Type    System    Precision   Recall   F-measure
CoNLL 2003   Person         Florian   92.49       95.24    93.85
                            SystemT   96.32       92.39    94.32
CoNLL 2003   Organization   Florian   85.93       83.44    84.67
                            SystemT   92.25       85.31    88.65
CoNLL 2003   Location       Florian   90.59       91.73    91.15
                            SystemT   93.11       91.61    92.35
Enron        Person         Minkov    81.1        74.9     77.9
                            SystemT   87.27       81.82    84.46

Extraction Task: Named-entity extraction
Systems compared: SystemT (customized) vs. [Florian et al.'03] [Minkov et al.'05]
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
SystemT in IBM Products and Research Projects

Research Projects (Transferred):
– SIMPLE (Almaden): patent search
– Midas Finance (Almaden): compliance
– Omnifind Personal Email Search (Almaden): email search
– EISM (Watson): security
– SystemS (Watson): customer engagement
– Gumshoe (Almaden): W3 search (live in Q2 2011)

IBM Products:
– IBM Content Analyzer: business insights
– InfoSphere MashupHub: mashups
– eDiscovery Analyzer: compliance/search
– Optim Integrated Data Management: data redaction
– Lotus Notes, Expeditor, Symphony: LiveText client mashups
SystemT Move to SWG
■ Will be run by the Business Analytics side of SWG
■ All existing and new products will be supported by SWG
SystemT Annotator Library
■ Named Entities
– Complex entities: Person, Location (including Address), Organization
– "Simple" entities: Phone Number, Email Address, URL, Time, Date
– ConferenceCall, Signature, Agenda, ForwardBlock, DrivingDirection, …
– Relationships: Person → Phone / Address / EmailAddress
■ Blogs: Sentiment, InformalReview
■ Financial
– Merger/acquisition, Joint Venture, Earnings announcement, Analyst earnings estimate
– Directors and officers, Corporate Actions, Appointment / Resignation of Officers / Directors, Institutional Loan Agreements, Beneficial Ownership, Subsidiaries, …
■ Web pages: Homepage, Geography, Category, …
■ Healthcare: Disease, Drug, ChemicalCompound
■ Many more…
Multilingual Support
■ Via integration with LanguageWare
– Tokenization, POS tagging
■ Increasing language coverage for annotators
– Major Western languages (SWG): DE, EN, ES, FR, IT, NL, PT
• Named Entities + Sentiment analysis
– Chinese (Almaden + CRL)
• Named Entities + Sentiment analysis (ongoing)
– Japanese (SWG)
• Named Entities
– Extension of Named Entity annotators for Indian names in English text (IRL)
Beyond Annotators
■ Syntactic Coreference Resolution (Almaden)
– Requested by multiple teams
– Recently delivered to eDiscovery Analyzer
• Only name variants have been included in eDA 2.2
• Pronoun resolution will be included in the next version
– Resulted in new primitives that are currently being considered as native operations in SystemT
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Performance
■ Performance issues with grammars
– Complete pass through the tokens for each grammar level
– Many of these passes are wasted work
■ Dominant approach: Make each pass go faster
– Doesn't solve the root problem!
■ Algebraic approach: Build a query optimizer!
Optimizations
■ Query optimization is a familiar topic in databases
■ What's different in text?
– Operations over sequences and spans
– Document boundaries
– Costs concentrated in extraction operators (dictionary, regular expression)
■ Can leverage these characteristics
– Text-specific optimizations
– Significant performance improvements
SystemT Block Diagram

Input Document Stream → AQL → SystemT Optimizer → Compiled Operator Graph → SystemT Runtime → Annotated Document Stream
The SystemT Optimizer
■ AQL allows the development of highly complex rulesets
– Example: the annotator stack for InfoSphere MashupHub consists of more than 950 AQL statements
■ Execution plan optimization is crucial to achieving acceptable throughput for enterprise apps
– A naïve execution plan (direct translation from AQL to algebra) would be painfully slow
■ Constant feedback loop:
1. The SystemT optimizer improves…
2. …enabling more complex rules…
3. …which improve precision and recall…
4. …and drive the need for more optimization
Example: Shared Dictionary Matching (SDM)
■ Rewrite-based optimization
– Applied to the algebraic plan during postprocessing
■ Evaluate multiple dictionaries in a single pass
– Separate Dictionary operators over D1 and D2 feeding a subplan are rewritten into a single SDM Dictionary operator
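The core idea of SDM can be sketched as follows. This is a hedged sketch, not SystemT's implementation: it merges all dictionaries into one hash table keyed by term, so every dictionary is evaluated with a single pass over the token stream instead of one pass per dictionary. The function name and the token-index output format are assumptions.

```python
def shared_dictionary_matching(tokens, dictionaries):
    """Sketch of Shared Dictionary Matching: merge all dictionaries into one
    hash table (term -> list of dictionary names) and evaluate every
    dictionary in a single pass over the tokens."""
    merged = {}
    for name, terms in dictionaries.items():
        for term in terms:
            merged.setdefault(term.lower(), []).append(name)
    matches = {name: [] for name in dictionaries}
    for i, tok in enumerate(tokens):            # ONE pass, regardless of
        for name in merged.get(tok.lower(), ()):  # how many dictionaries
            matches[name].append(i)             # token index of each match
    return matches
```

The per-token cost is one hash lookup, so adding dictionaries no longer adds passes over the document.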
Example: Restricted Span Evaluation (RSE)
■ Leverage the sequential nature of text
– Join predicates on character or token distance
■ Only evaluate the inner operand on the relevant portions of the document
– e.g., join Dictionary⟨First⟩ with Regex⟨Caps⟩ via an RSEJoin: only look for regex matches in the vicinity of a first name ("…we will meet Mark Scott…")
■ Limited applicability
– Need to guarantee exactly the same results
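RSE can be sketched as follows. This is an illustrative simplification, not SystemT's RSEJoin: the window size, the whitespace-only adjacency test, and the name `rse_join` are assumptions for the sketch.

```python
import re

CAPS = re.compile(r'[A-Z][a-z]+')

def rse_join(text, first_spans, window=20):
    """Sketch of Restricted Span Evaluation: rather than running the Caps
    regex over the whole document, evaluate it only in a small window to the
    right of each <First> span, since the join predicate bounds where a
    useful match can start."""
    people = []
    for f in first_spans:
        region = text[f[1]:f[1] + window]       # restricted evaluation region
        for m in CAPS.finditer(region):
            caps = (f[1] + m.start(), f[1] + m.end())
            if text[f[1]:caps[0]].strip() == '':  # zero tokens in between
                people.append((f[0], caps[1]))
    return people
```

With a First span over "Mark" in "we will meet Mark Scott soon", the regex is evaluated over a 20-character window instead of the full document, yet yields the same "Mark Scott" result.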
Some Example Execution Plans

Goal: find First followed within 0 tokens by Caps.
– Plan A (naïve): evaluate First and Caps independently, then Join
– Plan B: evaluate First, extract the text to the right, and identify Caps starting within 0 tokens
– Plan C: evaluate Caps, extract the text to the left, and identify First ending within 0 tokens
Performance Comparison (with ANNIE)

[Chart: throughput (KB/sec, 0-700) vs. average document size (KB, 0-100) for SystemT vs. ANNIE, an open source entity tagger; SystemT is >10x faster]

Task: Named Entity
Dataset: Different document collections from the Enron corpus, obtained by randomly sampling 1000 documents for each size
Performance Comparison on Larger Documents

Dataset              Document Size                  Throughput (KB/sec)   Average Memory (MB)
                     Range              Average     ANNIE     SystemT     ANNIE     SystemT
Web Crawl            68 B – 388 KB      8.8 KB      42.8      498.8       201.8     77.2
Medium SEC Filings   240 KB – 0.9 MB    401 KB      26.3      703.5       601.8     143.7
Large SEC Filings    1 MB – 3.4 MB      1.54 MB     21.1      954.5       2683.5    189.6

Datasets: Web crawl and filings from the Securities and Exchange Commission (SEC)

Throughput benefits carry over for a wide variety of document sizes; SystemT also has a much lower memory footprint.

Theorem: For any acyclic token-based FST T, there exists an operator graph G such that evaluating T and G has the same computational complexity.
SystemT Block Diagram

Input Document Stream → AQL → SystemT Optimizer → Compiled Operator Graph → SystemT Runtime → Annotated Document Stream
SystemT Runtime Environment
■ Compact Java-based runtime
– Small memory footprint
– High performance
■ Designed to be embedded in a larger system
– Lotus Notes
– UIMA/GATE
– Hadoop
– Custom integrations
Scaling SystemT: From Laptop to Cluster

In Lotus Notes Live Text: the SystemT Runtime is embedded directly in the Lotus Notes client, annotating email documents for message display.

In Cognos Consumer Insights: many copies of the SystemT Runtime run on a Hadoop cluster. Each copy is wrapped as a Jaql function (with input and output adapters) and driven by the Jaql runtime on top of Hadoop Map-Reduce.
Outline
■ Limitations of Grammar-based IE systems
■ SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
■ Related Work
Rule Development for Information Extraction

Development cycle: Develop → Test → Analyze
Maintenance cycle: Deploy → Refine → Test

Very labor-intensive and time-consuming
---------------------------------------
create view ValidLastNameAll as
select N.lastname as lastname
from LastNameAll N
-- do not allow partially all capitalized words
where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, N.lastname))
and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*-(\p{Lu}\p{M}*)+/, N.lastname));

create view LastName as
select C.lastname as lastname
--from Consolidate(ValidLastNameAll.lastname) C;
from ValidLastNameAll C
consolidate on C.lastname;
-- Find dictionary matches for all first names
-- Mostly US first names
create view StrictFirstName1 as
select D.match as firstname
from Dictionary('strictFirst.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- German first names
create view StrictFirstName2 as
select D.match as firstname
from Dictionary('strictFirst_german.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- nick names for US first names
create view StrictFirstName3 as
select D.match as firstname
from Dictionary('strictNickName.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- German first names from BluePages
create view StrictFirstName4 as
select D.match as firstname
from Dictionary('strictFirst_german_bluePages.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Italy first names from BluePages
create view StrictFirstName5 as
select D.match as firstname
from Dictionary('names/strictFirst_italy.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- France first names from BluePages
create view StrictFirstName6 as
select D.match as firstname
from Dictionary('names/strictFirst_france.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Spain first names from BluePages
create view StrictFirstName7 as
select D.match as firstname
from Dictionary('names/strictFirst_spain.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Indian first names from BluePages
-- TODO: still need to clean up the remaining entries
create view StrictFirstName8 as
select D.match as firstname
from Dictionary('names/strictFirst_india.partial.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

-- Israel first names from BluePages
create view StrictFirstName9 as
select D.match as firstname
from Dictionary('names/strictFirst_israel.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);
-- union all the dictionary matches for first names
create view StrictFirstName as
(select S.firstname as firstname from StrictFirstName1 S)
union all
(select S.firstname as firstname from StrictFirstName2 S)
union all
(select S.firstname as firstname from StrictFirstName3 S)
union all
(select S.firstname as firstname from StrictFirstName4 S)
union all
(select S.firstname as firstname from StrictFirstName5 S)
union all
(select S.firstname as firstname from StrictFirstName6 S)
union all
(select S.firstname as firstname from StrictFirstName7 S)
union all
(select S.firstname as firstname from StrictFirstName8 S)
union all
(select S.firstname as firstname from StrictFirstName9 S);
-- Relaxed versions of first name
create view RelaxedFirstName1 as
select CombineSpans(S.firstname, CP.name) as firstname
from StrictFirstName S, StrictCapsPerson CP
where FollowsTok(S.firstname, CP.name, 1, 1)
and MatchesRegex(/\-/, SpanBetween(S.firstname, CP.name));

create view RelaxedFirstName2 as
select CombineSpans(CP.name, S.firstname) as firstname
from StrictFirstName S, StrictCapsPerson CP
where FollowsTok(CP.name, S.firstname, 1, 1)
and MatchesRegex(/\-/, SpanBetween(CP.name, S.firstname));
-- all the first names
create view FirstNameAll as
(select N.firstname as firstname from StrictFirstName N)
union all
(select N.firstname as firstname from RelaxedFirstName1 N)
union all
(select N.firstname as firstname from RelaxedFirstName2 N);

create view ValidFirstNameAll as
select N.firstname as firstname
from FirstNameAll N
where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, N.firstname))
and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*-(\p{Lu}\p{M}*)+/, N.firstname));

create view FirstName as
select C.firstname as firstname
--from Consolidate(ValidFirstNameAll.firstname) C;
from ValidFirstNameAll C
consolidate on C.firstname;
-- Combine all dictionary matches for both last names and first names
create view NameDict as
select D.match as name
from Dictionary('name.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
--where MatchesRegex(/\p{Upper}.{1,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict1 as
select D.match as name
from Dictionary('names/name_italy.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict2 as
select D.match as name
from Dictionary('names/name_france.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict3 as
select D.match as name
from Dictionary('names/name_spain.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);

create view NameDict4 as
select D.match as name
from FirstName FN, InitialWord IW, CapsPerson CP
where FollowsTok(FN.firstname, IW.word, 0, 0)
and FollowsTok(IW.word, CP.name, 0, 0);
/**
 * Translation for Rule 3r2
 *
 * This relaxed version of rule '3' will find person names like Thomas B.M. David
 * But it only insists that the second word is in the person dictionary
 */
/*
<rule annotation=Person id=3r2>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person3r2 as
select CombineSpans(CP.name, LN.lastname) as person
from LastName LN, InitialWord IW, CapsPerson CP
where FollowsTok(CP.name, IW.word, 0, 0)
and FollowsTok(IW.word, LN.lastname, 0, 0);
/**
 * Translation for Rule 4
 *
 * This rule will find person names like David Thomas
 */
/*
<rule annotation=Person id=4>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4WithNewLine as
select CombineSpans(FN.firstname, LN.lastname) as person
from FirstName FN, LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0);

-- Yunyao: 05/20/2008 revised to Person4WrongCandidates due to performance reason
-- NOTE: current optimizer executes Equals first, thus making Person4Wrong very expensive
--create view Person4Wrong as
--select CombineSpans(FN.firstname, LN.lastname) as person
--from FirstName FN, LastName LN
--where FollowsTok(FN.firstname, LN.lastname, 0, 0)
-- and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname))
-- and Equals(GetText(FN.firstname), GetText(LN.lastname));

create view Person4WrongCandidates as
select FN.firstname as firstname, LN.lastname as lastname
from FirstName FN, LastName LN
where FollowsTok(FN.firstname, LN.lastname, 0, 0)
and ContainsRegex(/[\n\r]/, SpanBetween(FN.firstname, LN.lastname));

create view Person4 as
(select P.person as person from Person4WithNewLine P)
minus
(select CombineSpans(P.firstname, P.lastname) as person
 from Person4WrongCandidates P
 where Equals(GetText(P.firstname), GetText(P.lastname)));

/**
 * Translation for Rule 4a
 *
 * This rule will find person names like Thomas, David
 */
/*
<rule annotation=Person id=4a>
<internal>
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>\,</token>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4a as
select CombineSpans(LN.lastname, FN.firstname) as person
from FirstName FN, LastName LN
where FollowsTok(LN.lastname, FN.firstname, 1, 1)
and ContainsRegex(/,/, SpanBetween(LN.lastname, FN.firstname));
-- relaxed version of Rule 4a
-- Yunyao: split the following rule into two to improve performance
-- TODO: Test case for optimizer
--create view Person4ar1 as
--select CombineSpans(CP.name, FN.firstname) as person
--from FirstName FN, CapsPerson CP
--where FollowsTok(CP.name, FN.firstname, 1, 1)
--and ContainsRegex(/,/, SpanBetween(CP.name, FN.firstname))
--and Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(CP.name, 10)))
--and Not(MatchesRegex(/(?i)(.+fully)/, CP.name))
--and GreaterThan(GetBegin(CP.name), 10);

create view Person4ar1temp as
select FN.firstname as firstname, CP.name as name
from FirstName FN, CapsPerson CP
where FollowsTok(CP.name, FN.firstname, 1, 1)
and ContainsRegex(/,/, SpanBetween(CP.name, FN.firstname));

create view Person4ar1 as
select CombineSpans(P.name, P.firstname) as person
from Person4ar1temp P
where Not(MatchesRegex(/(.|\n|\r)*(\.|\?|!|'|\sat|\sin)( )*/, LeftContext(P.name, 10))) --'
and Not(MatchesRegex(/(?i)(.+fully)/, P.name))
and GreaterThan(GetBegin(P.name), 10);

create view Person4ar2 as
select CombineSpans(LN.lastname, CP.name) as person
from CapsPerson CP, LastName LN
where FollowsTok(LN.lastname, CP.name, 0, 1)
and ContainsRegex(/,/, SpanBetween(LN.lastname, CP.name));
/*** Translation for Rule2** This rule will handles names of persons like B.M. Thomas David, where Thomas occurs in some person dictionary*//*<rule annotation=Person id=2><internal><token attribute={etc}>INITIALWORD</token><token attribute={etc}PERSON{etc}>CAPSPERSON</token><token attribute={etc}>CAPSPERSON</token></internal></rule>*/
create view Person2 asselect CombineSpans(IW.word, CP.name) as personfrom InitialWord IW,
PersonDict P,CapsPerson CP
where FollowsTok(IW.word, P.name, 0, 0)and FollowsTok(P.name, CP.name, 0, 0);
/*** Translation for Rule 2a** The rule handles names of persons like B.M. Thomas David, where David occurs in some person dictionary*//*<rule annotation=Person id=2a><internal><token attribute={etc}>INITIALWORD</token><token attribute={etc}>CAPSPERSON</token><token attribute={etc}>NEWLINE</token>?<token attribute={etc}PERSON{etc}>CAPSPERSON</token></internal></rule>*/
create view Person2a asselect CombineSpans(IW.word, P.name) as personfrom InitialWord IW,
CapsPerson CP,PersonDict P
where FollowsTok(IW.word, CP.name, 0, 0)and FollowsTok(CP.name, P.name, 0, 0);
/*<rule annotation=Person id=4r1><internal><token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token><token attribute={etc}>NEWLINE</token>?<token attribute={etc}>CAPSPERSON</token></internal></rule>*/create view Person4r1 asselect CombineSpans(FN.firstname, CP.name) as personfrom FirstName FN,
CapsPerson CPwhere FollowsTok(FN.firstname, CP.name, 0, 0);
/**
 * Translation for Rule 4r2
 *
 * This relaxed version of rule '4' will find person names like Thomas, David,
 * but it only insists that the SECOND word is in some person dictionary
 */
/*
<rule annotation=Person id=4r2>
<token attribute={etc}>ANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>NEWLINE</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person4r2 as
select CombineSpans(CP.name, LN.lastname) as person
from CapsPerson CP,
     LastName LN
where FollowsTok(CP.name, LN.lastname, 0, 0);
/**
 * Translation for Rule 5
 *
 * This rule will find other single-token person first names
 */
/*
<rule annotation=Person id=5>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person5 as
select CombineSpans(IW.word, FN.firstname) as person
from InitialWord IW,
     FirstName FN
where FollowsTok(IW.word, FN.firstname, 0, 0);
/**
 * Translation for Rule 6
 *
 * This rule will find other single-token person last names
 */
/*
<rule annotation=Person id=6>
<internal>
<token attribute={etc}>INITIALWORD</token>?
<token attribute={etc}PERSON:ST:LNAME{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person6 as
select CombineSpans(IW.word, LN.lastname) as person
from InitialWord IW,
     LastName LN
where FollowsTok(IW.word, LN.lastname, 0, 0);
--==========================================================
-- End of rules
--
-- Create final list of names based on all the matches extracted
--==========================================================
/**
 * Union all matches found by strong rules, except the ones that come
 * directly from dictionary matches
 */
create view PersonStrongWithNewLine as
(select P.person as person from Person1 P)
--union all
--(select P.person as person from Person1a_more P)
union all
(select P.person as person from Person3 P)
union all
(select P.person as person from Person4 P)
union all
(select P.person as person from Person3P1 P);
create view PersonStrongSingleTokenOnly as
(select P.person as person from Person5 P)
union all
(select P.person as person from Person6 P)
union all
(select P.firstname as person from FirstName P)
union all
(select P.lastname as person from LastName P)
union all
(select P.person as person from Person1a P);
-- Yunyao: added 05/09/2008 to expand person names with suffix
create view PersonStrongSingleTokenOnlyExpanded1 as
select CombineSpans(P.person, S.suffix) as person
from PersonStrongSingleTokenOnly P,
     PersonSuffix S
where FollowsTok(P.person, S.suffix, 0, 0);
-- Yunyao: added 04/14/2009 to extend a single-token person name with a single initial
create view PersonStrongSingleTokenOnlyExpanded2 as
select CombineSpans(R.person, RightContext(R.person, 2)) as person
from PersonStrongSingleTokenOnly R
where MatchesRegex(/ +[\p{Upper}]\b\s*/, RightContext(R.person, 3));
create view PersonStrongSingleToken as
(select P.person as person from PersonStrongSingleTokenOnly P)
union all
(select P.person as person from PersonStrongSingleTokenOnlyExpanded1 P)
union all
(select P.person as person from PersonStrongSingleTokenOnlyExpanded2 P);
/**
 * Union all matches found by weak rules
 */
create view PersonWeak1WithNewLine as
(select P.person as person from Person3r1 P)
union all
(select P.person as person from Person3r2 P)
union all
(select P.person as person from Person4r1 P)
union all
(select P.person as person from Person4r2 P)
union all
(select P.person as person from Person2 P)
union all
(select P.person as person from Person2a P)
union all
(select P.person as person from Person3P2 P)
union all
(select P.person as person from Person3P3 P);
-- weak rules that identify (LastName, FirstName)
create view PersonWeak2WithNewLine as
(select P.person as person from Person4a P)
union all
(select P.person as person from Person4ar1 P)
union all
(select P.person as person from Person4ar2 P);
--include 'core/GenericNE/Person-FilterNewLineSingle.aql';
--include 'core/GenericNE/Person-Filter.aql';
create view PersonBase as
(select P.person as person from PersonStrongWithNewLine P)
union all
(select P.person as person from PersonWeak1WithNewLine P)
union all
(select P.person as person from PersonWeak2WithNewLine P);

output view PersonBase;
from Dictionary('names/name_israel.dict', Doc.text) D
where MatchesRegex(/\p{Lu}\p{M}*.{1,20}/, D.match);
create view NamesAll as
(select P.name as name from NameDict P)
union all
(select P.name as name from NameDict1 P)
union all
(select P.name as name from NameDict2 P)
union all
(select P.name as name from NameDict3 P)
union all
(select P.name as name from NameDict4 P)
union all
(select P.firstname as name from FirstName P)
union all
(select P.lastname as name from LastName P);
create view PersonDict as
select C.name as name
--from Consolidate(NamesAll.name) C;
from NamesAll C
consolidate on C.name;
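The `consolidate on` clause collapses overlapping matches from the unioned dictionaries into a single span each. One common consolidation policy is containment, where a match strictly inside another is discarded. A minimal sketch of that policy (my own simplification, not SystemT's actual implementation, which supports several policies):

```python
def consolidate(spans):
    """Keep only spans that are not strictly contained in another span:
    a sketch of one policy behind AQL's 'consolidate on' clause."""
    def contained_in(s, t):
        # s lies inside t and is not identical to it
        return t[0] <= s[0] and s[1] <= t[1] and s != t
    return [s for s in spans if not any(contained_in(s, t) for t in spans)]

# (2, 3) lies inside (0, 5), so only the enclosing span survives
matches = [(0, 5), (2, 3), (10, 12)]
result = consolidate(matches)
```

This matters here because the same name can match several dictionaries at once (e.g. both a first-name and a generic name dictionary); without consolidation, every downstream join would multiply those duplicates.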
--==========================================================
-- Actual Rules
--==========================================================
-- For 3-part Person names
create view Person3P1 as
select CombineSpans(F.firstname, L.lastname) as person
from StrictFirstName F,
     StrictCapsPersonR S,
     StrictLastName L
where FollowsTok(F.firstname, S.name, 0, 0)
  --and FollowsTok(S.name, L.lastname, 0, 0)
  and FollowsTok(F.firstname, L.lastname, 1, 1)
  and Not(Equals(GetText(F.firstname), GetText(L.lastname)))
  and Not(Equals(GetText(F.firstname), GetText(S.name)))
  and Not(Equals(GetText(S.name), GetText(L.lastname)))
  and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(F.firstname, L.lastname)));
create view Person3P2 as
select CombineSpans(P.name, L.lastname) as person
from PersonDict P,
     StrictCapsPersonR S,
     StrictLastName L
where FollowsTok(P.name, S.name, 0, 0)
  --and FollowsTok(S.name, L.lastname, 0, 0)
  and FollowsTok(P.name, L.lastname, 1, 1)
  and Not(Equals(GetText(P.name), GetText(L.lastname)))
  and Not(Equals(GetText(P.name), GetText(S.name)))
  and Not(Equals(GetText(S.name), GetText(L.lastname)))
  and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(P.name, L.lastname)));
create view Person3P3 as
select CombineSpans(F.firstname, P.name) as person
from PersonDict P,
     StrictCapsPersonR S,
     StrictFirstName F
where FollowsTok(F.firstname, S.name, 0, 0)
  --and FollowsTok(S.name, P.name, 0, 0)
  and FollowsTok(F.firstname, P.name, 1, 1)
  and Not(Equals(GetText(P.name), GetText(F.firstname)))
  and Not(Equals(GetText(P.name), GetText(S.name)))
  and Not(Equals(GetText(S.name), GetText(F.firstname)))
  and Not(ContainsRegex(/[\n\r\t]/, SpanBetween(F.firstname, P.name)));
/**
 * Translation for Rule 1
 * Handles names of persons like Mr. Vladimir E. Putin
 */
/*
<rule annotation=Person id=1>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person1 as
select CombineSpans(CP1.name, CP2.name) as person
from Initial I,
     CapsPerson CP1,
     InitialWord IW,
     CapsPerson CP2
where FollowsTok(I.initial, CP1.name, 0, 0)
  and FollowsTok(CP1.name, IW.word, 0, 0)
  and FollowsTok(IW.word, CP2.name, 0, 0);
--and Not(ContainsRegex(/[\n\r]/, SpanBetween(I.initial, CP2.name)));
/**
 * Translation for Rule 1a
 * Handles names of persons like Mr. Vladimir Putin
 */
/*
<rule annotation=Person id=1a>
<token attribute={etc}INITIAL{etc}>CANYWORD</token>
<internal>
<token attribute={etc}>CAPSPERSON</token>{1,3}
</internal>
</rule>
*/
-- Split into two rules so that single-token annotations are separated from the others

-- Single-token annotations
create view Person1a1 as
select CP1.name as person
from Initial I,
     CapsPerson CP1
where FollowsTok(I.initial, CP1.name, 0, 0)
--- start changing this block
--- disallow newline
  and Not(ContainsRegex(/[\n\t]/, SpanBetween(I.initial, CP1.name)))
--- end changing this block
;
-- Yunyao: added 05/09/2008 to match patterns such as "Mr. B. B. Buy"
/*
create view Person1a2 as
select CombineSpans(name.block, CP1.name) as person
from Initial I,
     BlockTok(0, 1, 2, InitialWord.word) name,
     CapsPerson CP1
where FollowsTok(I.initial, name.block, 0, 0)
  and FollowsTok(name.block, CP1.name, 0, 0)
  and Not(ContainsRegex(/[\n\t]/, CombineSpans(I.initial, CP1.name)));
*/
create view Person1a as
select P.person as person from Person1a1 P;
-- union all
-- (select P.person as person from Person1a2 P);
/*
create view Person1a_more as
select name.block as person
from Initial I,
     BlockTok(0, 2, 3, CapsPerson.name) name
where FollowsTok(I.initial, name.block, 0, 0)
  and Not(ContainsRegex(/[\n\t]/, name.block))
--- start changing this block
-- disallow newline
  and Not(ContainsRegex(/[\n\t]/, SpanBetween(I.initial, name.block)))
--- end changing this block
;
*/
/**
 * Translation for Rule 3
 * Find person names like Thomas B.M. David
 */
/*
<rule annotation=Person id=3>
<internal>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}PERSON{etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person3 as
select CombineSpans(P1.name, P2.name) as person
from PersonDict P1,
     --InitialWord IW,
     WeakInitialWord IW,
     PersonDict P2
where FollowsTok(P1.name, IW.word, 0, 0)
  and FollowsTok(IW.word, P2.name, 0, 0)
  and Not(Equals(GetText(P1.name), GetText(P2.name)));
/**
 * Translation for Rule 3r1
 *
 * This relaxed version of rule '3' will find person names like Thomas B.M. David,
 * but it only insists that the first word is in the person dictionary
 */
/*
<rule annotation=Person id=3r1>
<internal>
<token attribute={etc}PERSON:ST:FNAME{etc}>CAPSPERSON</token>
<token attribute={etc}>INITIALWORD</token>
<token attribute={etc}>CAPSPERSON</token>
</internal>
</rule>
*/
create view Person3r1 as
-- Find dictionary matches for all title initials
create view Initial as
select D.match as initial
from Dictionary('InitialDict', Doc.text) D;
-- Yunyao: added 05/09/2008 to capture person name suffix
create dictionary PersonSuffixDict as
(
  ',jr.', ',jr', 'III', 'IV', 'V', 'VI'
);

create view PersonSuffix as
select D.match as suffix
from Dictionary('PersonSuffixDict', Doc.text) D;
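Dictionary() scans the document for occurrences of any entry in the named dictionary and emits one match span per hit. A toy token-level approximation (my own sketch; the real operator works over SystemT's tokenization and supports configurable case folding):

```python
def dictionary_matches(entries, tokens):
    """Toy version of AQL's Dictionary(): return (first, last) token spans
    whose text matches a dictionary entry, ignoring case.
    Multi-token entries are compared token-by-token."""
    entry_set = {tuple(e.lower().split()) for e in entries}
    longest = max(len(e) for e in entry_set)
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + longest, len(tokens)) + 1):
            if tuple(t.lower() for t in tokens[i:j]) in entry_set:
                hits.append((i, j - 1))
    return hits

tokens = ['Mr.', 'John', 'Smith', 'III']
suffix_hits = dictionary_matches(['III', 'IV', 'V', 'VI'], tokens)  # [(3, 3)]
```

The PersonSuffix view then feeds the suffix-expansion rule above: a strong single-token name immediately followed (FollowsTok 0, 0) by one of these suffix matches gets widened to cover both.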
-- Find capitalized words that look like person names and are not in the non-name dictionaries
create view CapsPersonCandidate as
select R.match as name
--from Regex(/\b\p{Upper}\p{Lower}[\p{Alpha}]{1,20}\b/, Doc.text) R
--from Regex(/\b\p{Upper}\p{Lower}[\p{Alpha}]{0,10}(['-][\p{Upper}])?[\p{Alpha}]{1,10}\b/, Doc.text) R
-- change to enable unicode match
--from Regex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*[\p{L}\p{M}*]{0,10}(['-][\p{Lu}\p{M}*])?[\p{L}\p{M}*]{1,10}\b/, Doc.text) R
--from Regex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*[\p{L}\p{M}*]{0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}\b/, Doc.text) R
-- Allow fully capitalized words
--from Regex(/\b\p{Lu}\p{M}*(\p{L}\p{M}*){0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}\b/, Doc.text) R
from RegexTok(/\p{Lu}\p{M}*(\p{L}\p{M}*){0,10}(['-][\p{Lu}\p{M}*])?(\p{L}\p{M}*){1,10}/, 4, Doc.text) R
where Not(ContainsDicts(
    'FilterPersonDict',
    'filterPerson_position.dict',
    'filterPerson_german.dict',
    'InitialDict',
    'StrongPhoneVariantDictionary',
    'stateList.dict',
    'organization_suffix.dict',
    'industryType_suffix.dict',
    'streetSuffix_forPerson.dict',
    'wkday.dict',
    'nationality.dict',
    'stateListAbbrev.dict',
    'stateAbbrv.ChicagoAPStyle.dict', R.match));
create view CapsPerson as
select C.name as name
from CapsPersonCandidate C
where Not(MatchesRegex(/(\p{Lu}\p{M}*)+-.*([\p{Ll}\p{Lo}]\p{M}*).*/, C.name))
  and Not(MatchesRegex(/.*([\p{Ll}\p{Lo}]\p{M}*).*-(\p{Lu}\p{M}*)+/, C.name));
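The two MatchesRegex filters above reject hyphenated candidates that mix an all-caps chunk with a lowercase-bearing chunk, such as "IBM-branded". An ASCII-only stand-in using stdlib Python (the AQL patterns use Unicode property classes \p{Lu}/\p{Ll}/\p{M}, which Python's built-in re module does not support, so this is a simplified approximation):

```python
import re

# ASCII stand-ins for the two AQL filters on CapsPersonCandidate
UPPER_THEN_LOWER = re.compile(r'[A-Z]+-.*[a-z].*')  # e.g. 'IBM-branded'
LOWER_THEN_UPPER = re.compile(r'.*[a-z].*-[A-Z]+')  # e.g. 'branded-IBM'

def keep_caps_person(name):
    """Mimic MatchesRegex (full-span match): drop candidates where an
    all-caps chunk is hyphenated onto a chunk containing lowercase."""
    return (UPPER_THEN_LOWER.fullmatch(name) is None and
            LOWER_THEN_UPPER.fullmatch(name) is None)

keep_caps_person('Smith-Jones')   # kept: a plausible hyphenated surname
keep_caps_person('IBM-branded')   # rejected by the first filter
```

Note the use of fullmatch rather than search: AQL's MatchesRegex anchors the pattern to the entire span, unlike ContainsRegex, which matches anywhere inside it.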
-- Find strict capitalized words with two letters or more (relaxed version of StrictCapsPerson)

--============================================================
-- TODO: need to think through how to deal with hyphenated names.
-- One way is to run Regex(pattern, CP.name) and enforce that CP.name does not contain '
-- Need more testing before confirming the change.

create view CapsPersonNoP as
select CP.name as name
from CapsPerson CP
where Not(ContainsRegex(/'/, CP.name)); --'
create view StrictCapsPersonR as
select R.match as name
--from Regex(/\b\p{Lu}\p{M}*(\p{L}\p{M}*){1,20}\b/, CapsPersonNoP.name) R;
from RegexTok(/\p{Lu}\p{M}*(\p{L}\p{M}*){1,20}/, 1, CapsPersonNoP.name) R;
--============================================================

-- Find strict capitalized words
create view StrictCapsPerson as
select R.name as name
from StrictCapsPersonR R
where MatchesRegex(/\b\p{Lu}\p{M}*[\p{Ll}\p{Lo}]\p{M}*(\p{L}\p{M}*){1,20}\b/, R.name);
-- Find dictionary matches for all last names
create view StrictLastName1 as
select D.match as lastname
from Dictionary('strictLast.dict', Doc.text) D
--where MatchesRegex(/\p{Upper}\p{Lower}[\p{Alpha}]{0,20}/, D.match);
-- changed to enable unicode match
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName2 as
select D.match as lastname
from Dictionary('strictLast_german.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName3 as
select D.match as lastname
from Dictionary('strictLast_german_bluePages.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName4 as
select D.match as lastname
from Dictionary('uniqMostCommonSurname.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName5 as
select D.match as lastname
from Dictionary('names/strictLast_italy.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName6 as
select D.match as lastname
from Dictionary('names/strictLast_france.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName7 as
select D.match as lastname
from Dictionary('names/strictLast_spain.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName8 as
select D.match as lastname
from Dictionary('names/strictLast_india.partial.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName9 as
select D.match as lastname
from Dictionary('names/strictLast_israel.dict', Doc.text) D
where MatchesRegex(/((\p{L}\p{M}*)+\s+)?\p{Lu}\p{M}*.{1,20}/, D.match);

create view StrictLastName as
(select S.lastname as lastname from StrictLastName1 S)
union all
(select S.lastname as lastname from StrictLastName2 S)
union all
(select S.lastname as lastname from StrictLastName3 S)
union all
(select S.lastname as lastname from StrictLastName4 S)
union all
(select S.lastname as lastname from StrictLastName5 S)
union all
(select S.lastname as lastname from StrictLastName6 S)
union all
(select S.lastname as lastname from StrictLastName7 S)
union all
(select S.lastname as lastname from StrictLastName8 S)
union all
(select S.lastname as lastname from StrictLastName9 S);
-- Relaxed version of last name
create view RelaxedLastName1 as
select CombineSpans(SL.lastname, CP.name) as lastname
from StrictLastName SL,
     StrictCapsPerson CP
where FollowsTok(SL.lastname, CP.name, 1, 1)
  and MatchesRegex(/\-/, SpanBetween(SL.lastname, CP.name));

create view RelaxedLastName2 as
select CombineSpans(CP.name, SL.lastname) as lastname
from StrictLastName SL,
     StrictCapsPerson CP
where FollowsTok(CP.name, SL.lastname, 1, 1)
  and MatchesRegex(/\-/, SpanBetween(CP.name, SL.lastname));
-- all the last names
create view LastNameAll as
(select N.lastname as lastname from StrictLastName N)
union all
(select N.lastname as lastname from RelaxedLastName1 N)
union all
(select N.lastname as lastname from RelaxedLastName2 N);

create view ValidLastNameAll as
select N.lastname as lastname
---------------------------------------
-- Document Preprocessing
---------------------------------------
create view Doc as
select D.text as text
from DocScan D;

----------------------------------------
-- Basic Named Entity Annotators
----------------------------------------

-- Find initial words
create view InitialWord1 as
select R.match as word
--from Regex(/\b([\p{Upper}]\.\s*){1,5}\b/, Doc.text) R
from RegexTok(/([\p{Upper}]\.\s*){1,5}/, 10, Doc.text) R
-- added on 04/18/2008
where Not(MatchesRegex(/M\.D\./, R.match));
-- Yunyao: added on 11/21/2008 to capture names with prefix
-- (we use it as an initial to avoid adding too many complex rules)
create view InitialWord2 as
select D.match as word
from Dictionary('specialNamePrefix.dict', Doc.text) D;

create view InitialWord as
(select I.word as word from InitialWord1 I)
union all
(select I.word as word from InitialWord2 I);
-- Find weak initial words
create view WeakInitialWord as
select R.match as word
--from Regex(/\b([\p{Upper}]\.?\s*){1,5}\b/, Doc.text) R;
from RegexTok(/([\p{Upper}]\.?\s*){1,5}/, 10, Doc.text) R
-- added on 05/12/2008
-- Do not allow a weak initial word to be a word longer than three characters
where Not(ContainsRegex(/[\p{Upper}]{3}/, R.match))
-- added on 04/14/2009
-- Do not allow weak initial words to match the timezone
  and Not(ContainsDict('timeZone.dict', R.match));
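The two negative predicates on WeakInitialWord can be mimicked in a few lines. An ASCII sketch of the first one (my own illustration; the AQL uses the Unicode class \p{Upper}, and the dictionary test is replaced here by a plain set lookup):

```python
import re

TIMEZONES = {'PST', 'EST', 'GMT', 'CET'}  # stand-in for timeZone.dict

def is_weak_initial_word(word):
    """Sketch of the WeakInitialWord filters: reject candidates with three
    or more consecutive uppercase letters (ContainsRegex /[A-Z]{3}/) or
    that appear in the timezone dictionary."""
    if re.search(r'[A-Z]{3}', word):
        return False
    return word not in TIMEZONES

is_weak_initial_word('B.M.')  # accepted: dotted initials
is_weak_initial_word('USA')   # rejected: three consecutive capitals
```

The point of the three-capitals test is that genuine initials are interleaved with periods, so any run of three bare capitals is more likely an acronym than a name prefix.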
-----------------------------------------------
-- Strong Phone Numbers
-----------------------------------------------
create dictionary StrongPhoneVariantDictionary as
(
  'phone', 'cell', 'contact', 'direct', 'office',
  -- Yunyao: Added new strong clues for phone numbers
  'tel', 'dial', 'Telefon', 'mobile', 'Ph', 'Phone Number', 'Direct Line',
  'Telephone No', 'TTY', 'Toll Free', 'Toll-free',
  -- German
  'Fon', 'Telefon Geschaeftsstelle', 'Telefon Geschäftsstelle',
  'Telefon Zweigstelle', 'Telefon Hauptsitz',
  'Telefon (Geschaeftsstelle)', 'Telefon (Geschäftsstelle)',
  'Telefon (Zweigstelle)', 'Telefon (Hauptsitz)',
  'Telefonnummer', 'Telefon Geschaeftssitz', 'Telefon Geschäftssitz',
  'Telefon (Geschaeftssitz)', 'Telefon (Geschäftssitz)',
  'Telefon Persönlich', 'Telefon persoenlich',
  'Telefon (Persönlich)', 'Telefon (persoenlich)',
  'Handy', 'Handy-Nummer', 'Telefon arbeit', 'Telefon (arbeit)'
);
--include 'core/GenericNE/Person.aql';
create dictionary FilterPersonDict as
(
  'Travel', 'Fellow', 'Sir', 'IBMer', 'Researcher', 'All', 'Tell',
  'Friends', 'Friend', 'Colleague', 'Colleagues', 'Managers', 'If',
  'Customer', 'Users', 'User', 'Valued', 'Executive', 'Chairs',
  'New', 'Owner', 'Conference', 'Please', 'Outlook', 'Lotus', 'Notes',
  'This', 'That', 'There', 'Here', 'Subscribers', 'What', 'When', 'Where', 'Which',
  'With', 'While', 'Thanks', 'Thanksgiving', 'Senator', 'Platinum', 'Perspective',
  'Manager', 'Ambassador', 'Professor', 'Dear', 'Contact', 'Cheers', 'Athelet',
  'And', 'Act', 'But', 'Hello', 'Call', 'From', 'Center', 'The', 'Take', 'Junior',
  'Both', 'Communities', 'Greetings', 'Hope', 'Restaurants', 'Properties',
  'Let', 'Corp', 'Memorial', 'You', 'Your', 'Our', 'My', 'His', 'Her',
  'Their', 'Popcorn', 'Name', 'July', 'June', 'Join',
  'Business', 'Administrative', 'South', 'Members',
  'Address', 'Please', 'List', 'Public', 'Inc', 'Parkway', 'Brother', 'Buy', 'Then',
  'Services', 'Statements', 'President', 'Governor', 'Commissioner',
  'Commitment', 'Commits', 'Hey', 'Director', 'End', 'Exit', 'Experiences', 'Finance',
  'Elementary', 'Wednesday', 'Nov', 'Infrastructure', 'Inside', 'Convention',
  'Judge', 'Lady', 'Friday', 'Project', 'Projected', 'Recalls', 'Regards',
  'Recently', 'Administration',
  'Independence', 'Denied', 'Unfortunately', 'Under', 'Uncle', 'Utility', 'Unlike',
  'Was', 'Were', 'Secretary', 'Speaker', 'Chairman', 'Consider', 'Consultant',
  'County', 'Court', 'Defensive', 'Northwestern', 'Place', 'Hi', 'Futures', 'Athlete',
  'Invitational', 'System', 'International', 'Main', 'Online', 'Ideally'
  -- more entries
  ,'If', 'Our', 'About', 'Analyst', 'On', 'Of', 'By', 'HR', 'Mkt',
  'Pre', 'Post', 'Condominium', 'Ice', 'Surname', 'Lastname',
  'firstname', 'Name', 'familyname',
  -- Italian greeting
  'Ciao',
  -- Spanish greeting
  'Hola',
  -- French greeting
  'Bonjour',
  -- new entries
  'Pro', 'Bono', 'Enterprises', 'Group', 'Said', 'Says', 'Assistant',
  'Vice', 'Warden', 'Contribution',
  'Research', 'Development', 'Product', 'Sales',
  'Support', 'Manager', 'Telephone', 'Phone', 'Contact', 'Information',
  'Electronics', 'Managed', 'West', 'East', 'North', 'South', 'Teaches',
  'Ministry', 'Church', 'Association',
  'Laboratories', 'Living', 'Community', 'Visiting',
  'Officer', 'After', 'Pls', 'FYI', 'Only', 'Additionally',
  'Adding', 'Acquire', 'Addition', 'America',
  -- short phrases that are likely to be at the start of a sentence
  'Yes', 'No', 'Ja', 'Nein', 'Kein', 'Keine', 'Gegenstimme',
  -- TODO: to be double checked
  'Another', 'Anyway', 'Associate', 'At', 'Athletes', 'It',
  'Enron', 'EnronXGate', 'Have', 'However', 'Company', 'Companies', 'IBM', 'Annual',
  -- common verbs that appear with person names in financial reports;
  -- ideally we want a general comprehensive verb list to use as a filter dictionary
  'Joins', 'Downgrades', 'Upgrades', 'Reports', 'Sees', 'Warns', 'Announces', 'Reviews'
  -- Laura 06/02/2009: new filter dict for titles for the SEC domain in filterPerson_title.dict
);
create dictionary GreetingsDict as
(
  'Hey', 'Hi', 'Hello', 'Dear',
  -- German greetings
  'Liebe', 'Lieber', 'Herr', 'Frau', 'Hallo',
  -- Italian
  'Ciao',
  -- Spanish
  'Hola',
  -- French
  'Bonjour'
);
create dictionary InitialDict as
(
  'rev.', 'col.', 'reverend', 'prof.', 'professor.', 'lady',
  'miss.', 'mrs.', 'mrs', 'mr.', 'pt.', 'ms.',
  'messrs.', 'dr.', 'master.', 'marquis', 'monsieur', 'ds', 'di'
  --'Dear' (Yunyao: commented out to avoid mismatches such as Dear Member),
  --'Junior' (Yunyao: commented out to avoid mismatches such as Junior National [team player];
  -- if we can have a large negative dictionary to eliminate such mismatches,
  -- then this may be recovered)
  --'Name:' (Yunyao: commented out to avoid mismatches such as 'Name: Last Name')
  -- for German names
  -- TODO: need further test
  ,'herr', 'Fraeulein', 'Doktor', 'Herr Doktor', 'Frau Doktor',
  'Herr Professor', 'Frau professor', 'Baron', 'graf'
);
IE Rule Development Is Hard---------------------------------------
SystemT’s Person extractor
~250 AQL rules
“Global financial services firm Morgan Stanley announced … “
Person
Lessons Learned
� Managing rule sets for multiple products is difficult
– Maintain a core generic library
– Expose hooks for application-specific customizations
• Propagate some customizations back to generic library
– New requests keep coming!
� Creating an ecosystem of developers is important
– Writing rules is still an “art”
– 30+ AQL rule developers across Almaden, CRL, CDL, IRL, SVL, TRL, YSL, Watson, SWG + one business partner
• Some involved in the AQL working group
� Several new research problems have surfaced
How to assist developers in building and maintaining rules?
SystemT Development Environment (Ongoing Effort)
Develop
Test
Analyze
• Regex learning [Li08]
• Suggest rule changes [Liu10]
• Rule induction
• Use reference information

• Track provenance [Liu10]
• Contextual clue discovery

• Concordance Viewer
• Active labeling

• NE Interface [Chiticariu10]
• Tagger UI [Kandogan07]
Development
Deploy
Refine
Test
Maintenance
Outline
� Challenges in Grammar-based IE systems
� SystemT: an Algebraic approach
– The AQL language
– Annotator Library
– Performance
– Tooling
� Related Work
Related Work (1/2): Declarative IE Systems from the Database Community
� Common vision:
– Separate semantics from order of execution
– Build the system around a language like SQL or Datalog
� Different interpretations of declarativity
High-level declarative: CIMPLE (U. Wisconsin)
• Overall IE pipeline replaced with a declarative language
• Individual extraction components still “black boxes”

Mixed declarative: PSOX (Yahoo!), SQoUT (Columbia U.)
• Declarative language for some (not all) extraction operations
• Both at the individual and pipeline level

Completely declarative: SystemT (IBM), BayesStore (UC Berkeley)
• One declarative language covers all stages of extraction
Related Work (2/2): Design Considerations
� Optimization
– Granularity
• High-level: annotator composition
• Low-level: basic extraction operators
– Strategy: Rewrite-based vs. Cost-based
� Runtime Model
– Document-Centric vs. Collection-Centric
� More details: SIGMOD 2010 tutorial [Chiticariu et al., 2010]
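The document-centric vs. collection-centric distinction above can be made concrete: a document-centric runtime applies the compiled extractor to one document at a time and streams results out, while a collection-centric one materializes intermediate results across the whole corpus. A minimal sketch of the document-at-a-time contract (illustrative only; not SystemT's actual API, and the extractor here is a made-up stand-in):

```python
def run_document_centric(annotate, documents):
    """Apply an extractor to one document at a time and stream the results,
    so memory use is bounded by one document rather than the collection."""
    for doc in documents:
        yield doc, annotate(doc)

# Hypothetical extractor: count capitalized tokens per document
def count_caps(text):
    return sum(1 for tok in text.split() if tok[:1].isupper())

results = list(run_document_centric(count_caps,
                                    ['Bill Gates spoke', 'no names here']))
# [('Bill Gates spoke', 2), ('no names here', 0)]
```

This is why the document-centric model suits on-the-fly extraction (e.g. annotating email as it arrives), while collection-centric processing is a better fit for batch analytics over a fixed corpus.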
Summary
� SystemT
– Declarative IE system based on an algebraic framework
– Expressivity and performance advantages over grammar-based IE systems
– Text-specific optimizations
� Ongoing Work
– Tooling support for rule development/maintenance
– Improved optimization strategies
– New operators for advanced features (e.g. co-reference resolution)
Thank you!
� For more information…
– Download a free version of SystemT (newer version coming soon)
• https://www.alphaworks.ibm.com/tech/systemt/
– Visit the SystemT home page
• http://www.almaden.ibm.com/cs/projects/systemt/
– Contact me