Natural Language Processing within the Archaeotools Project
DESCRIPTION
Natural Language Processing within the Archaeotools Project. Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009.

TRANSCRIPT
Natural Language Processing within the Archaeotools Project
Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009
“To support research, learning and teaching with high quality and dependable digital resources.”
AHRC-EPSRC-JISC eScience research grants scheme:
AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks
BUILDS UPON: Common Information Environment Enhanced Geospatial browser
PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield
Joint Information Systems Committee
• Work package 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media) – reported on at CAA, Budapest.
• Work package 2 – Natural Language Processing / Data-mining of Grey Literature.
• Work package 3 – Data-mining of Historic Literature; plus geoXwalk.
Three distinct Work packages:
• WP1 Datasets include:
– National Monuments Records (Scotland, Wales, England)
– Excavation Index (EH)
– Archive Holdings
– Local Authority Historic Environment Records
• WP2/3 Datasets include:
– ‘Grey’ (Gray) Literature
– Proceedings of the Society of Antiquaries of Scotland (PSAS)
• Thesauri include:
– Thesaurus of Monument Types (TMT)
– Thesaurus of Object Types
– MIDAS Period list
– UK Government list of administrative areas: County, District, Parish (CDP)
[Architecture diagram, labels: non-MIDAS input; Oracle RDBMS; MIDAS XML Record; Information Extraction; RDF Resource; Knowledge triple store; XML Docs of Thesaurus; Query; User Interface; When, Where, What ontologies as entries to faceted index.]
BARROW
“I never said she stole my money”
Someone else said it, but I didn’t.
“I NEVER said she stole my money”
I simply didn’t ever say it.
“I never SAID she stole my money”
I might have implied it, but I never said it.
“I never said SHE stole my money”
I said someone stole it, I didn’t say it was she.
“I never said she STOLE my money”
I just said she probably borrowed it.
“I never said she stole MY money”
I said she stole someone else’s money.
“I never said she stole my MONEY”
I said she stole something, but not my money.
“I never said she stole my money”
Was it Bonnie or Clyde?
State-of-the-art review – approaches to rule induction. There are two mainstream methodologies for rule induction:
Human handcrafted rules (rule-based systems) – built manually by analysing example annotations and deriving human-readable discriminative patterns.
• Easy to understand, easy to implement, effective for structured texts and simple patterns, and no need to train learning models, but…
• Not robust to less-structured texts, and time-consuming and difficult to derive rules for large amounts of example annotations.
Machine-learned rules (Machine Learning) – built automatically by analysing example annotations and converting features into numeric representations, which are consumed by mathematical models to derive discriminative patterns that are not human-readable.
• Very robust; copes with large amounts of data and complex patterns; we only select features, and the machine analyses the examples and induces the rules, but…
• Very sensitive to feature selection; implementation and feature tuning are difficult and take time; may not work well with small numbers of examples.
The fundamental idea...
The fundamental idea is to study the features of positive and negative examples of entities, and/or their surrounding N words, over a large collection of annotated documents (training data prepared by humans), and to design rules that capture instances of a given type (Nadeau et al., 2006). The rules are then applied to a new corpus to classify each individual token (both previously seen and unseen) into suitable classes.
Features - descriptors or characteristic attributes of words designed for algorithmic consumption
Positive examples - instances of a given type to be extracted
Negative examples - any text units that are not annotated as the given type
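This setup can be sketched in a few lines (illustrative names and data, not the project's actual code): each token is labelled with its annotated class, or "O" if un-annotated, yielding positive and negative training examples.

```python
# Hypothetical sketch: projecting annotated spans onto tokens to produce
# token-level training examples, as described above.

def label_tokens(tokens, annotations):
    """Mark each token with its class (positive) or "O" (negative)."""
    labels = ["O"] * len(tokens)          # everything is negative by default
    for start, end, cls in annotations:   # spans over token indices
        for i in range(start, end):
            labels[i] = cls
    return list(zip(tokens, labels))

# Illustrative sentence and annotations (class names follow the slide deck).
tokens = ["The", "Bronze", "Age", "barrow", "at", "Stonehenge", "."]
annotations = [(1, 3, "coverage.temporal"),            # "Bronze Age"
               (3, 4, "subject"),                      # "barrow"
               (5, 6, "coverage.spatial.placename")]   # "Stonehenge"

print(label_tokens(tokens, annotations))
```

A real system would operate on character offsets and whole documents, but the positive/negative split works the same way.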
The fundamental idea...
Example annotations in highlighted colours are positive examples.
Un-annotated texts are negative examples.
Features of this annotation:
• first_letter_capitalised: true
• word_found_in_gazetteer: true
• preceded_by: the
• followed_by: period
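A minimal sketch of computing the features listed above for a single token (the function name and gazetteer are illustrative assumptions, not the project's code):

```python
def token_features(tokens, i, gazetteer):
    """Compute features like those on the slide for the token at position i."""
    word = tokens[i]
    return {
        "first_letter_capitalised": word[:1].isupper(),
        "word_found_in_gazetteer": word.lower() in gazetteer,
        "preceded_by": tokens[i - 1].lower() if i > 0 else None,
        "followed_by": tokens[i + 1] if i + 1 < len(tokens) else None,
    }

# Toy gazetteer and sentence for illustration.
gazetteer = {"york", "sheffield"}
tokens = ["the", "York", "excavation", "."]
print(token_features(tokens, 1, gazetteer))
```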
Rule-based systems are good for extracting information that matches simple patterns and/or occurs in regular contexts, and are therefore applied to:
• Grid reference (easting and northing)
• Report title*
• Report creator*
• Report publisher*
• Report publication date*
• Report publisher contact
• Bibliography & references
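As an illustration of such a pattern-based rule, a simplified regular expression for UK Ordnance Survey grid references (assumed format: a two-letter grid square followed by easting and northing digit groups; real reports vary in spacing and precision) might look like:

```python
import re

# Simplified OSGB grid reference pattern, e.g. "SU 1234 5678".
# The leading letter of a 100 km square is one of H, J, N, O, S, T.
GRID_REF = re.compile(r"\b([HJNOST][A-Z])\s?(\d{2,5})\s?(\d{2,5})\b")

text = "The site lies at NGR SU 1234 5678, west of the church."
m = GRID_REF.search(text)
print(m.groups())  # ('SU', '1234', '5678')
```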
Machine Learning is good for extracting information that cannot be matched by patterns, occurs irregularly in context, or exists in large amounts, and is therefore applied to:
• What (subject)
• Where (place name)
• When (temporal info)
• Event date
From the 1st batch of annotated corpus: 35 unique annotated documents. Number of annotations by class:
publisher.name: 93
title: 53
date.event: 129
coverage.temporal: 2185
subject: 7935
publisher.contact: 21
date.publication: 28
coverage.spatial.placename: 1467
creator: 67
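Summing these counts gives 11,978 annotations in total, with the subject class alone accounting for roughly two thirds. A few lines of arithmetic make the class imbalance explicit:

```python
# Annotation counts per class, taken from the slide above.
counts = {
    "publisher.name": 93, "title": 53, "date.event": 129,
    "coverage.temporal": 2185, "subject": 7935, "publisher.contact": 21,
    "date.publication": 28, "coverage.spatial.placename": 1467, "creator": 67,
}

total = sum(counts.values())  # 11978
for cls, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{cls:28s} {n:5d}  {100 * n / total:5.1f}%")
```

The long tail of rare classes (publisher.contact, date.publication) is exactly where machine learning tends to struggle for lack of examples.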
Class: What (subject), Where (placename), When (temporal), Event date
Useful features to test*:
• word text
• word stem (root)
• word lemma (root form in dictionary)
• word orthography
• word part-of-speech
• word position in document (e.g., on page #)
• word position in page (e.g., position relative to page start offset and end offset)
• word membership in gazetteer
• word general entity class (e.g., person, organisation, location, date, time)
• plus the above features of the preceding 5 and succeeding 5 words
* These features are generally applied to every other class too. See following slides.
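The ±5-word context window described above can be sketched as follows (a minimal illustration, not the project's actual feature extractor; here only the lower-cased word text of each neighbour is collected):

```python
def window_features(tokens, i, n=5):
    """Collect a descriptor for the n tokens before and after position i."""
    feats = {}
    for offset in range(-n, n + 1):
        j = i + offset
        if offset == 0 or not (0 <= j < len(tokens)):
            continue  # skip the target token itself and out-of-range positions
        feats[f"word[{offset:+d}]"] = tokens[j].lower()
    return feats

tokens = "remains of a Bronze Age barrow were recorded near the henge".split()
print(window_features(tokens, 5))  # context features for "barrow"
```

In practice each neighbouring position would carry the full feature set (stem, lemma, orthography, part-of-speech, and so on), not just the word text.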
Class / Useful features to test:
• Title – features marked with * plus special word identifier (e.g., report, survey, evaluation)
• Creator – features marked with * plus relative position to title
• Publisher name – features marked with * plus relative position to title
• Publisher contact – features marked with * plus relative position to title
• Publication date – features marked with * plus relative position to title
• Grid reference points – identifier special word token (e.g., grid point, grid reference, easting, northing); pattern
• Bibliography/references – identifier special word token; pattern (e.g., person name followed by year)
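The "person name followed by year" pattern mentioned for bibliography entries could be sketched as a regular expression (illustrative only; real reference formats vary widely):

```python
import re

# Assumed shape: capitalised surname, optional initials, then a year
# in parentheses, e.g. "Richards, J. (2008)".
REF = re.compile(r"\b([A-Z][a-z]+(?:,\s[A-Z]\.)*)\s\((\d{4})\)")

line = "Richards, J. (2008) Archaeotools: faceted classification and NLP."
print(REF.findall(line))  # [('Richards, J.', '2008')]
```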