Patent Processing with GATE

Kalina Bontcheva, Valentin Tablan University of Sheffield

University of Sheffield NLP

GATE Summer School - July 27-31, 2009

Outline

• Why patent annotation?

• The data model

• The annotation guidelines

• Building the IE pipeline

• Evaluation

• Scaling up and optimisation

• Find the needle in the annotation (hay)stack

What is Semantic Annotation?

• Semantic Annotation:
  - is about attaching tags and/or ontology classes to text segments;
  - creates a richer data space and can allow conceptual search;
  - is suitable for high-value content.

• Can be:
  - fully automatic, semi-automatic, manual;
  - social;
  - learned.

Semantic Annotation

Why annotate patents?

• Simple text search works well for the Web, but:
  - patent searchers require high recall (web search requires high precision);
  - patents don't contain hyperlinks;
  - patent searchers need richer semantics than offered by simple text search;
  - patent text is amenable to HLT due to regularities and sub-language effects.

How can annotation help?

• Format irregularities
  - “Fig. 3”, “FIG 3”, “Figure 3”, etc.

• Data normalisation
  - “Figures. 3 to 5” -> FIG 3, FIG 4, FIG 5
  - “23rd Oct 1998” -> 19981023

• Text mining: discovery of
  - product names and materials;
  - references to other patents, publications and prior art;
  - measurements;
  - etc.
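The two normalisation examples can be sketched in a few lines of Python. This is a minimal illustration, not the rules used in the project: the regular expression, the function names and the canonical "FIG n" spelling are all assumptions made for this sketch.

```python
import re
from datetime import datetime

# Assumed pattern: "Fig", "FIG", "Figure", "Figures" (optional dot), a number,
# and optionally "to N" or "-N" for a range.
FIG_RE = re.compile(
    r"\b(?:figures?|figs?)\.?\s*(\d+)(?:\s*(?:to|-)\s*(\d+))?",
    re.IGNORECASE,
)


def normalise_figures(text):
    """Return one canonical 'FIG n' label per figure referenced in text."""
    labels = []
    for match in FIG_RE.finditer(text):
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) else first
        # Expand a range such as "3 to 5" into its individual figures.
        labels.extend(f"FIG {n}" for n in range(first, last + 1))
    return labels


def normalise_date(text):
    """Convert a date such as '23rd Oct 1998' to 'YYYYMMDD'."""
    # Drop the ordinal suffix so that strptime can parse the day number.
    cleaned = re.sub(r"(\d+)(?:st|nd|rd|th)\b", r"\1", text.strip())
    return datetime.strptime(cleaned, "%d %b %Y").strftime("%Y%m%d")
```

For instance, `normalise_figures("Figures. 3 to 5")` expands the range into three labels, and `normalise_date("23rd Oct 1998")` gives the compact form shown on the slide.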

Manual vs. Automatic

• Manual SA
  - high quality
  - very expensive
  - requires small data or many users (e.g. flickr, del.icio.us)

• Automatic SA
  - inexpensive
  - medium quality
  - can only do simple tasks

• Patent data
  - too large to annotate manually
  - too difficult to annotate fully automatically

The SAM Projects

• Collaboration between Matrixware, Sheffield GATE team, and Ontotext

• Started in 2007 and ongoing
  - Pilot study for applicability of Semantic Annotation to patents
  - GATE Teamware: infrastructure for collaborative semantic annotation
  - Large scale experiments
  - Mimir: large scale indexing infrastructure supporting hybrid search (text, annotations, meaning)

Technologies

[Diagram: three layers of the technology stack]

• Data Enrichment (Semantic Annotation): Teamware, built on GATE, OWLIM, TRREE, JBPM, etc.

• Knowledge Management: KIM, built on GATE, OWLIM, TRREE, Lucene, etc.

• Data Access (Search/Browsing): Large Scale Hybrid Index, built on GATE, ORDI, TRREE, MG4J, etc.

Legend: Sheffield, Ontotext, Other

Teamware revisited: A Key SAM Infrastructure

Collaborative Semantic Annotation Environment

• Tools for semi-automatic annotation;

• Scalable distributed text analytics processing;

• Data curation;

• User/role management;

• Web-based user interface.

Semantic Annotation Experiments

• Wide Annotation
  - Cover a range of generally useful concepts: documents, document parts, references
  - High level detail

• Deep Annotation
  - Cover a narrow range of concepts: measurements
  - As much detail as possible

Data Model

Example Bibliographic Data

Example measurements

Example References

The Patent Annotation Guidelines

• 11 pages (10 point font), with concrete examples, general rules, specific guidelines per type, lists of exceptions, etc.

• The section on annotating measurements is 2 pages long!

• The clearer the guidelines – the better Inter-Annotator Agreement you’re likely to achieve

• The higher the IAA – the better automatic results can be obtained (less noise!)

• The lengthier the annotations – the more scope for error there is, e.g., references to other papers had the lowest IAA

Annotating Scalar Measurements

• numeric value including formulae

• always related to a unit

• more than one value can be related to the same unit

  ... [80]% of them measure less than [6] um [2] ...
  [2x10^-7] Torr
  [29G×½]” needle
  [3], [5], [6] cm
  turbulence intensity may be greater than [0.055], [0.06] ...

Annotating Measurement Units

• including compound unit

• always related to at least one scalarValue

• do not include a final dot

• %, :, / should be annotated as unit

  deposition rates up to 20 [nm/sec]
  a fatigue life of 400 MM [cycles]
  ratio is approximately 9[:]7

Annotation Schemas: Measurements Example

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <element name="Measurement">
    <complexType>
      <attribute name="type" use="required">
        <simpleType>
          <restriction base="string">
            <enumeration value="scalarValue"/>
            <enumeration value="unit"/>
          </restriction>
        </simpleType>
      </attribute>
      <attribute name="requires-attention" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="true"/>
            <enumeration value="false"/>
          </restriction>
        </simpleType>
      </attribute>
    </complexType>
  </element>
</schema>
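To make the constraints in the schema concrete, here is a small Python sketch that checks a Measurement annotation's features against the same two attributes. The dictionary layout and the function name `validate_features` are hypothetical and exist only for illustration; GATE itself reads the XML schema directly.

```python
# Allowed feature values, copied from the schema on the slide.
# The dict layout itself is an assumption made for this sketch.
MEASUREMENT_SCHEMA = {
    "type": {"required": True, "values": {"scalarValue", "unit"}},
    "requires-attention": {"required": False, "values": {"true", "false"}},
}


def validate_features(features, schema=MEASUREMENT_SCHEMA):
    """Return a list of problems found in an annotation's feature map."""
    errors = []
    for name, spec in schema.items():
        if name not in features:
            # Only required attributes must be present.
            if spec["required"]:
                errors.append(f"missing required feature: {name}")
        elif features[name] not in spec["values"]:
            errors.append(f"invalid value for {name}: {features[name]!r}")
    return errors
```

An empty list means the annotation satisfies the enumerations; features not listed in the schema are ignored here.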

The IE Pipeline

• JAPE Rules vs Machine Learning
  - Moving the goal posts: dealing with unstable annotation guidelines
      JAPE: just change a few rules, hopefully
      ML: could require significant manual re-annotation effort of the training data
  - Bootstrapping training data creation with JAPE patterns significantly reduces the manual effort
  - For ML to be successful, we need IAA to be as high as possible; otherwise there is a noisy data problem
  - Insufficient training data initially, so chose JAPE approach

Example JAPEs for References

Macro: FIGNUMBER
// Numbers 3, 45, also 3a, 3b
(
  {Token.kind == "number"}
  ({Token.length == "1", Token.kind == "word"})?
)

Rule: IgnoreFigRefsIfThere
Priority: 1000
(
  {Reference.type == "Figure"}
)
-->
{}

Rule: FindFigRefs
Priority: 50
(
  (
    ({Token.root == "figure"} | {Token.root == "fig"})
    ({Token.string == "."})?
    ((FIGNUMBER) | (FIGNUMBERBRACKETS)):number
  ):figref
)
-->
:figref.Reference = {type = "Figure", id = :number.Token.string}

Example Rule for Measurements

Rule: SimpleMeasure
/*
 * Number followed by a unit.
 */
(
  ({Token.kind == "number"})
):amount
({Lookup.majorType == "unit"}):unit
-->
:amount.Measurement = {type = scalarValue, rule = "measurement.SimpleMeasure"},
:unit.Measurement = {type = unit, rule = "measurement.SimpleMeasure"}
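For readers unfamiliar with JAPE, the behaviour of SimpleMeasure can be approximated in Python. This is only a rough analogue: a regular expression stands in for the Token annotations, and the short `UNITS` list is a hypothetical stand-in for the gazetteer that produces `Lookup.majorType == "unit"`.

```python
import re

# Hypothetical mini-gazetteer; the real pipeline uses Lookup annotations.
UNITS = ["nm/sec", "Torr", "cm", "um", "%"]

NUMBER = r"\d+(?:\.\d+)?"
# Try longer units first so that compound units win over their prefixes.
UNIT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
SIMPLE_MEASURE = re.compile(rf"(?P<amount>{NUMBER})\s*(?P<unit>{UNIT})")


def simple_measure(text):
    """Annotate each 'number followed by a unit' as two Measurement spans."""
    annotations = []
    for match in SIMPLE_MEASURE.finditer(text):
        for group, kind in (("amount", "scalarValue"), ("unit", "unit")):
            annotations.append({
                "start": match.start(group),
                "end": match.end(group),
                "text": match.group(group),
                "type": kind,
                "rule": "measurement.SimpleMeasure",
            })
    return annotations
```

As in the JAPE rule, the number and the unit become separate Measurement annotations, each recording which rule created it.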

The IE Annotation Pipeline

Hands-on: Identify More Patterns

• Open Teamware and log in

• Find corpus patents-sample

• Run ANNIC to identify some patterns for references to tables and figures, and for measurements
  - There are already POS tags, Lookup annotations, morphological ones
  - Units for measurements are Lookup.majorType == “unit”

The Teamware Annotation Project

• Iterated between JAPE grammar development, manual annotation for gold-standard creation, measuring IAA and precision/recall for JAPE improvements

• Initially the gold standard was doubly annotated until good IAA was obtained, then we moved to 1 annotator per document

• Had 15 annotators working at the same time

Measuring IAA with Teamware

• Open Teamware

• Find corpus patents-double-annotation

• Measure IAA with the respective tool

• Analyse the disagreements with the AnnDiff tool
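One common way to express IAA for span annotations is to treat one annotator as the key and score the other with strict-match precision, recall and F1. The sketch below assumes that measure and represents each annotation as a hypothetical `(start, end, type)` tuple; it is not a description of what the Teamware tool computes internally.

```python
def agreement(annotator_a, annotator_b):
    """Strict-match precision, recall and F1, with annotator A as the key.

    Each annotation is a (start, end, type) tuple; two annotations agree
    only if all three fields are identical.
    """
    key, response = set(annotator_a), set(annotator_b)
    matched = len(key & response)
    precision = matched / len(response) if response else 0.0
    recall = matched / len(key) if key else 0.0
    if matched == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The annotations left out of the intersection are the disagreements that a diff tool would then display side by side.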

Producing the Gold Standard

• Selected patents from two very different fields: mechanical engineering and biomedical technology

• 51 patents, 2.5 million characters

• 15 annotators, 1 curator reconciling the differences

The Evaluation Gold Standard

Preliminary Results

Running GATE Apps on Millions of Documents

• Processed 1.3 million patents in 6 days with 12 parallel processes.

• Data sets from Matrixware:
  - American patents (USPTO): 1.3 million, 108 GB, average file size 85 KB.
  - European patents (EPO): 27 thousand, 780 MB, average file size 29 KB.
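A quick back-of-the-envelope check of these figures, as a Python sketch. It takes the rounded numbers from the slide at face value and assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes), so the results are approximate.

```python
patents = 1_300_000
days = 6
processes = 12

seconds = days * 24 * 60 * 60                 # wall-clock seconds in 6 days
docs_per_second = patents / seconds           # overall throughput
# Average time one process spends on one document.
seconds_per_doc_per_process = processes / docs_per_second

# Average file sizes implied by the corpus totals (decimal units assumed).
avg_uspto_kb = 108e9 / patents / 1e3
avg_epo_kb = 780e6 / 27_000 / 1e3

print(f"{docs_per_second:.2f} documents per second overall")
print(f"{seconds_per_doc_per_process:.1f} s per document per process")
print(f"USPTO average: {avg_uspto_kb:.0f} KB, EPO average: {avg_epo_kb:.0f} KB")
```

Under these assumptions the pipeline handled roughly 2.5 documents per second overall, or just under 5 seconds per document per process. The implied averages (about 83 KB and 29 KB) are close to those quoted on the slide, and the small USPTO difference is consistent with rounding of the totals.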

Large-scale Parallel IE

• Our experiments were carried out on the IRF’s supercomputer with Java (jrockit-R27.4.0-jdk1.5.0_12) with up to 12 processes

• SGI Altix 4700 system comprising 20 nodes each with four 1.4GHz Itanium cores and 18GB RAM

• In comparison, we found it 4x faster on Intel Core 2 2.4GHz

Large-Scale, Parallel IE (2)

• GATE Cloud (A3): dispatches documents to process in parallel; does not stop on error
  - Ongoing project, moving towards Hadoop
  - Contact Hamish for further details

• Benchmarking facilities: generate time stamps for each resource and display charts from them
  - Help optimising the IE pipelines, esp. JAPE rules
  - Doubled the speed of the patent processing pipeline
  - For a similar third-party GATE-based application we achieved a 10-fold improvement
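The two ideas on this slide, parallel dispatch that survives per-document errors and time stamps for benchmarking, can be illustrated with a generic Python sketch. It is not GATE Cloud code: `process_all` and `toy_pipeline` are hypothetical names, and a thread pool stands in for the separate processes used in the experiments.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def toy_pipeline(text):
    """Hypothetical stand-in for an IE pipeline; fails on one input."""
    if text == "bad":
        raise ValueError("cannot parse document")
    return text.upper()


def process_all(documents, pipeline, workers=12):
    """Run pipeline on every document; record failures instead of stopping."""
    results, failures, timings = {}, {}, {}

    def timed(text):
        # Time each document so that slow cases can be charted later.
        start = time.perf_counter()
        output = pipeline(text)
        return output, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(timed, text): doc_id
                   for doc_id, text in documents.items()}
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                results[doc_id], timings[doc_id] = future.result()
            except Exception as error:
                # One bad document must not abort the whole batch.
                failures[doc_id] = repr(error)
    return results, failures, timings
```

Calling `process_all({"a": "ok", "b": "bad"}, toy_pipeline)` returns the output for `a`, an error entry for `b`, and a timing for each document that succeeded.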

Optimisation Results

MIMIR: Accessing the Text and the Semantic Annotations

• Documents: 981,315

• Tokens: 7,228,889,715 (> 7 billion)

• Distinct tokens: 18,539,315 (> 18m)

• Annotation occurrences: 151,775,533 (> 151m)
