Patent Processing with GATE
Kalina Bontcheva, Valentin Tablan University of Sheffield
University of Sheffield NLP
GATE Summer School - July 27-31, 2009
Outline
• Why patent annotation?
• The data model
• The annotation guidelines
• Building the IE pipeline
• Evaluation
• Scaling up and optimisation
• Find the needle in the annotation (hay)stack
What is Semantic Annotation?
• Semantic Annotation:
  - is about attaching tags and/or ontology classes to text segments;
  - creates a richer data space and can allow conceptual search.
• Suitable for high-value content
• Can be:
  - fully automatic, semi-automatic, manual
  - social
  - learned
Semantic Annotation
Why annotate patents?
• Simple text search works well for the Web, but:
  - patent searchers require high recall (web search requires high precision);
  - patents don't contain hyperlinks;
  - patent searchers need richer semantics than offered by simple text search;
  - patent text is amenable to HLT due to regularities and sub-language effects.
How can annotation help?
• Format irregularities: “Fig. 3”, “FIG 3”, “Figure 3”, etc.
• Data normalisation:
  - “Figures 3 to 5” -> FIG 3, FIG 4, FIG 5
  - “23rd Oct 1998” -> 19981023
• Text mining, discovery of:
  - product names and materials;
  - references to other patents, publications and prior art;
  - measurements;
  - etc.
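The two normalisation examples above can be sketched in a few lines of Python. This is only an illustration: the regular expressions and function names are hypothetical and are not taken from the SAM grammars, and the date parser assumes an English locale and the day/month-abbreviation/year layout shown on the slide.

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration; not the patterns used in the SAM pipeline.
FIG_RANGE = re.compile(r"\bfig(?:ure)?s?\.?\s*(\d+)\s*(?:to|-)\s*(\d+)", re.IGNORECASE)
FIG_SINGLE = re.compile(r"\bfig(?:ure)?s?\.?\s*(\d+[a-z]?)", re.IGNORECASE)
ORDINAL_SUFFIX = re.compile(r"(?<=\d)(?:st|nd|rd|th)\b", re.IGNORECASE)


def normalise_figures(text):
    """Map one figure mention (single or range) to canonical 'FIG n' ids."""
    m = FIG_RANGE.search(text)
    if m:
        first, last = int(m.group(1)), int(m.group(2))
        return [f"FIG {n}" for n in range(first, last + 1)]
    m = FIG_SINGLE.search(text)
    return [f"FIG {m.group(1)}"] if m else []


def normalise_date(text):
    """Turn a date such as '23rd Oct 1998' into 'YYYYMMDD'."""
    cleaned = ORDINAL_SUFFIX.sub("", text.strip())
    return datetime.strptime(cleaned, "%d %b %Y").strftime("%Y%m%d")
```

With these definitions, “Fig. 3”, “FIG 3” and “Figure 3” all map to the same id, and a range is expanded into one id per figure.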
Manual vs. Automatic
• Manual SA:
  - high quality
  - very expensive
  - requires small data or many users (e.g. Flickr, del.icio.us)
• Automatic SA:
  - inexpensive
  - medium quality
  - can only do simple tasks
• Patent data:
  - too large to annotate manually
  - too difficult to annotate fully automatically
The SAM Projects
• Collaboration between Matrixware, Sheffield GATE team, and Ontotext
• Started in 2007 and ongoing:
  - Pilot study for applicability of Semantic Annotation to patents
  - GATE Teamware: infrastructure for collaborative semantic annotation
  - Large scale experiments
  - Mimir: large scale indexing infrastructure supporting hybrid search (text, annotations, meaning)
Technologies
[Diagram: technology stack by layer. Legend: Sheffield, Ontotext, Other]
• Data Enrichment (Semantic Annotation): Teamware, with GATE, OWLIM, TRREE, JBPM, etc.
• Knowledge Management: KIM, with GATE, OWLIM, TRREE, Lucene, etc.
• Data Access (Search/Browsing): Large Scale Hybrid Index, with GATE, ORDI, TRREE, MG4J, etc.
Teamware revisited: A Key SAM Infrastructure
Collaborative Semantic Annotation Environment
• Tools for semi-automatic annotation;
• Scalable distributed text analytics processing;
• Data curation;
• User/role management;
• Web-based user interface.
Semantic Annotation Experiments
Wide Annotation
• Cover a range of generally useful concepts: documents, document parts, references
• High level detail
Deep Annotation
• Cover a narrow range of concepts: measurements
• As much detail as possible
Data Model
Example Bibliographic Data
Example measurements
Example References
The Patent Annotation Guidelines
• 11 pages (10 point font), with concrete examples, general rules, specific guidelines per type, lists of exceptions, etc.
• The section on annotating measurements is 2 pages long!
• The clearer the guidelines – the better Inter-Annotator Agreement you’re likely to achieve
• The higher the IAA – the better automatic results can be obtained (less noise!)
• The lengthier the annotations – the more scope for error there is, e.g., references to other papers had the lowest IAA
Annotating Scalar Measurements
• numeric value including formulae
• always related to a unit
• more than one value can be related to the same unit
  ... [80]% of them measure less than [6] um [2] ...
  [2x10^-7] Torr
  [29G×½]” needle
  [3], [5], [6] cm
  turbulence intensity may be greater than [0.055], [0.06] ...
Annotating Measurement Units
• including compound units
• always related to at least one scalarValue
• do not include a final dot
• %, :, / should be annotated as unit
  deposition rates up to 20 [nm/sec]
  a fatigue life of 400 MM [cycles]
  ratio is approximately 9[:]7
Annotation Schemas: Measurements Example
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <element name="Measurement">
    <complexType>
      <attribute name="type" use="required">
        <simpleType>
          <restriction base="string">
            <enumeration value="scalarValue"/>
            <enumeration value="unit"/>
          </restriction>
        </simpleType>
      </attribute>
      <attribute name="requires-attention" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="true"/>
            <enumeration value="false"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>
The IE Pipeline
• JAPE rules vs. machine learning
  - Moving the goal posts: dealing with unstable annotation guidelines
    - JAPE: just change a few rules, hopefully
    - ML: could require significant manual re-annotation effort of the training data
  - Bootstrapping training data creation with JAPE patterns significantly reduces the manual effort
  - For ML to be successful, we need IAA to be as high as possible; otherwise there is a noisy data problem
  - Insufficient training data initially, so we chose the JAPE approach
Example JAPEs for References
Macro: FIGNUMBER
// Numbers 3, 45, also 3a, 3b
(
  {Token.kind == "number"}
  ({Token.length == "1", Token.kind == "word"})?
)

Rule: IgnoreFigRefsIfThere
Priority: 1000
(
  {Reference.type == "Figure"}
)
-->
{}

// FIGNUMBERBRACKETS is a macro that is not shown on this slide
Rule: FindFigRefs
Priority: 50
(
  (
    ({Token.root == "figure"} | {Token.root == "fig"})
    ({Token.string == "."})?
    ((FIGNUMBER) | (FIGNUMBERBRACKETS)):number
  ):figref
)
-->
:figref.Reference = {type = "Figure", id = :number.Token.string}
Example Rule for Measurements
Rule: SimpleMeasure
/*
 * Number followed by a unit.
 */
(
  ({Token.kind == "number"})
):amount
({Lookup.majorType == "unit"}):unit
-->
:amount.Measurement = {type = scalarValue, rule = "measurement.SimpleMeasure"},
:unit.Measurement = {type = unit, rule = "measurement.SimpleMeasure"}
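For readers without a GATE installation, the idea behind SimpleMeasure (a number immediately followed by a known unit) can be approximated in plain Python. This is a rough analogue, not GATE code: the `UNITS` list is a hypothetical stand-in for the gazetteer behind `Lookup.majorType == "unit"`, and the number pattern is far simpler than GATE's tokeniser.

```python
import re

# Hypothetical mini-gazetteer standing in for Lookup.majorType == "unit".
UNITS = ["nm/sec", "cycles", "Torr", "cm", "um", "%"]

# Longest units first so that a longer unit wins over a shorter one.
_UNIT_ALT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
SIMPLE_MEASURE = re.compile(
    rf"(?P<amount>\d+(?:\.\d+)?)\s*(?P<unit>{_UNIT_ALT})(?![A-Za-z])"
)


def simple_measure(text):
    """Return (type, start, end, string) tuples for number-plus-unit matches."""
    annotations = []
    for m in SIMPLE_MEASURE.finditer(text):
        for group, kind in (("amount", "scalarValue"), ("unit", "unit")):
            annotations.append((kind, m.start(group), m.end(group), m.group(group)))
    return annotations
```

As with the JAPE rule, each match yields two annotations, one of type scalarValue and one of type unit. Cases such as "400 MM cycles" need further rules, because the number and the unit are not adjacent.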
The IE Annotation Pipeline
Hands-on: Identify More Patterns
• Open Teamware and log in
• Find corpus patents-sample
• Run ANNIC to identify some patterns for references to tables and figures, and for measurements
  - There are already POS tags, Lookup annotations, and morphological ones
  - Units for measurements are Lookup.majorType == "unit"
The Teamware Annotation Project
• Iterated between JAPE grammar development, manual annotation for gold-standard creation, measuring IAA and precision/recall for JAPE improvements
• Initially gold standard doubly annotated until good IAA is obtained, then moved to 1 annotator per document
• Had 15 annotators working at the same time
Measuring IAA with Teamware
• Open Teamware
• Find corpus patents-double-annotation
• Measure IAA with the respective tool
• Analyse the disagreements with the AnnDiff tool
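The slides do not say which agreement measure the Teamware tool reports. One common choice for span annotations is F-measure between two annotators, treating one as the reference. The sketch below assumes strict matching (same start, end and type) and uses hypothetical function names; it is not the Teamware implementation.

```python
def iaa_f1(annotator_a, annotator_b):
    """Strict-match F1 between two collections of (start, end, type) annotations."""
    a, b = set(annotator_a), set(annotator_b)
    if not a and not b:
        return 1.0  # nothing to annotate and both agree on that
    matches = len(a & b)
    if matches == 0:
        return 0.0
    precision = matches / len(b)  # annotator A treated as the reference
    recall = matches / len(a)
    return 2 * precision * recall / (precision + recall)


def disagreements(annotator_a, annotator_b):
    """Annotations made by exactly one of the two annotators."""
    return sorted(set(annotator_a) ^ set(annotator_b))
```

For example, if two annotators agree on two measurements but mark different end offsets for one reference, the score is 2/3 and `disagreements` lists both versions of that reference for the curator.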
Producing the Gold Standard
• Selected patents from two very different fields: mechanical engineering and biomedical technology
• 51 patents, 2.5 million characters
• 15 annotators, 1 curator reconciling the differences
The Evaluation Gold Standard
Preliminary Results
Running GATE Apps on Millions of Documents
• Processed 1.3 million patents in 6 days with 12 parallel processes.
• Data sets from Matrixware:
  - American patents (USPTO): 1.3 million, 108 GB, average file size 85 KB
  - European patents (EPO): 27 thousand, 780 MB, average file size 29 KB
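As a back-of-the-envelope check on the first bullet, the figures imply roughly 2.5 documents per second overall, or a little under 5 seconds per document for each process. The calculation assumes six full days of wall-clock time and an even split across the 12 processes, neither of which the slides state explicitly.

```python
def throughput(documents, days, processes):
    """Return (docs per second overall, seconds per doc per process)."""
    seconds = days * 24 * 60 * 60
    docs_per_second = documents / seconds
    seconds_per_doc_per_process = processes / docs_per_second
    return docs_per_second, seconds_per_doc_per_process


overall, per_process = throughput(documents=1_300_000, days=6, processes=12)
print(f"{overall:.2f} docs/s overall, {per_process:.2f} s per doc per process")
```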
Large-scale Parallel IE
• Our experiments were carried out on the IRF’s supercomputer with Java (jrockit-R27.4.0-jdk1.5.0_12) with up to 12 processes
• SGI Altix 4700 system comprising 20 nodes each with four 1.4GHz Itanium cores and 18GB RAM
• In comparison, we found it 4x faster on Intel Core 2 2.4GHz
Large-Scale, Parallel IE (2)
• GATE Cloud (A3): dispatches documents to process in parallel; does not stop on error
  - Ongoing project, moving towards Hadoop
  - Contact Hamish for further details
• Benchmarking facilities: generate time stamps for each resource and display charts from them
  - Help optimise the IE pipelines, esp. JAPE rules
  - Doubled the speed of the patent processing pipeline
  - For a similar third-party GATE-based application we achieved a 10-fold improvement
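The "dispatch in parallel, do not stop on error" behaviour can be illustrated with a small Python sketch. It is not GATE Cloud code: `process_all` and its arguments are hypothetical, and it uses threads for brevity where the experiments above used separate processes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(doc_ids, process_one, workers=12):
    """Run process_one over doc_ids in parallel; record failures instead of stopping."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_one, doc_id): doc_id for doc_id in doc_ids}
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                results[doc_id] = future.result()
            except Exception as exc:  # one bad document must not end the run
                failures[doc_id] = repr(exc)
    return results, failures


def fake_annotate(doc_id):
    """Hypothetical stand-in for running an IE pipeline on one document."""
    if doc_id == "bad":
        raise ValueError("malformed document")
    return f"{doc_id}: annotated"
```

Calling `process_all(["a", "bad", "c"], fake_annotate)` returns results for "a" and "c" and a failure record for "bad", so the failed documents can be inspected or retried after the run.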
Optimisation Results
MIMIR: Accessing the Text and the Semantic Annotations
• Documents: 981,315
• Tokens: 7,228,889,715 (> 7 billion)
• Distinct tokens: 18,539,315 (> 18m)
• Annotation occurrences: 151,775,533 (> 151m)