Natural Language Processing within the Archaeotools Project
DESCRIPTION
Natural Language Processing within the Archaeotools Project. Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009.

TRANSCRIPT
Natural Language Processing within the Archaeotools Project
Michael Charno, Stuart Jeffrey, Julian Richards, Fabio Ciravegna, Stewart Waller, Sam Chapman and Ziqi Zhang. CAA Williamsburg, March 2009
“To support research, learning and teaching with high quality and dependable digital resources.”
AHRC-EPSRC-JISC eScience research grants scheme:
AIM: To allow archaeologists to discover, share and analyse datasets and legacy publications which have hitherto been very difficult to integrate into existing digital frameworks
BUILDS UPON: Common Information Environment Enhanced Geospatial browser
PARTNERS: Natural Language Processing Research Group, Department of Computer Science, University of Sheffield
Joint Information Systems Committee
• Work package 1 – Advanced Faceted Classification / Geo-spatial browser – 1m+ records; 4 primary facets (What, Where, When and Media) – reported on at CAA, Budapest.
• Work package 2 – Natural Language Processing / Data-mining of Grey Literature.
• Work package 3 – Data-mining of Historic Literature; plus geoXwalk.
Three distinct Work packages:
• WP1 Datasets include:
– National Monuments Records (Scotland, Wales, England)
– Excavation Index (EH)
– Archive Holdings
– Local Authority Historic Environment Records
• WP2/3 Datasets include:
– ‘Grey’ (Gray) Literature
– Proceedings of the Society of Antiquaries of Scotland (PSAS)
• Thesauri include:
– Thesaurus of Monument Types (TMT)
– Thesaurus of Object Types
– MIDAS Period list
– UK Government list of administrative areas: County, District, Parish (CDP)
[Architecture diagram, labels: non-MIDAS input; Oracle RDBMS; MIDAS XML Record; Information Extraction; RDF Resource; Knowledge triple store; XML Docs of Thesaurus; Query; User Interface; When, Where, What ontologies as entries to faceted index.]
BARROW
“I never said she stole my money”
Someone else said it, but I didn’t.
“I NEVER said she stole my money”
I simply didn’t ever say it.
“I never SAID she stole my money”
I might have implied it, but I never said it.
“I never said SHE stole my money”
I said someone stole it, I didn’t say it was she.
“I never said she STOLE my money”
I just said she probably borrowed it.
“I never said she stole MY money”
I said she stole someone else’s money.
“I never said she stole my MONEY”
I said she stole something, but not my money.
“I never said she stole my money”
Was it Bonnie or Clyde?
State-of-the-art review – approaches to rule induction. There are two mainstream methodologies for rule induction:
Human handcrafted rules (rule-based systems) – built manually by analysing example annotations and deriving human-readable discriminative patterns.
• Easy to understand, easy to implement, effective for structured texts and simple patterns, and no need to train learning models, but…
• Not robust to less-structured texts, and time-consuming and difficult to derive rules for large amounts of example annotations.
Machine-learned rules (Machine Learning) – built automatically by analysing example annotations and converting features into numeric representations, which are consumed by mathematical models to derive discriminative patterns that are not human-readable.
• Very robust; copes with large amounts of data and complex patterns; we only select features, and the machine analyses the examples and induces the rules, but…
• Very sensitive to feature selection; implementation and feature tuning are difficult and take time; may not work well with small numbers of examples.
The fundamental idea...
The fundamental idea is to study the features of positive and negative examples of entities, and/or their surrounding N words, over a large collection of annotated documents (training data prepared by humans), and to design rules that capture instances of a given type (Nadeau et al., 2006). The rules are then applied to a new corpus to classify each individual token (both previously seen and unseen) into suitable classes.
Features - descriptors or characteristic attributes of words designed for algorithmic consumption
Positive examples - instances of a given type to be extracted
Negative examples - any text units that are not annotated as the given type
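This setup can be sketched in a few lines (illustrative names and data, not the project's actual code): each token is labelled with its annotated class, or "O" if un-annotated, yielding positive and negative training examples.

```python
# Hypothetical sketch: projecting annotated spans onto tokens to produce
# token-level training examples, as described above.

def label_tokens(tokens, annotations):
    """Mark each token with its class (positive) or "O" (negative)."""
    labels = ["O"] * len(tokens)          # everything is negative by default
    for start, end, cls in annotations:   # spans over token indices
        for i in range(start, end):
            labels[i] = cls
    return list(zip(tokens, labels))

# Illustrative sentence and annotations (class names follow the slide deck).
tokens = ["The", "Bronze", "Age", "barrow", "at", "Stonehenge", "."]
annotations = [(1, 3, "coverage.temporal"),            # "Bronze Age"
               (3, 4, "subject"),                      # "barrow"
               (5, 6, "coverage.spatial.placename")]   # "Stonehenge"

print(label_tokens(tokens, annotations))
```

A real system would operate on character offsets and whole documents, but the positive/negative split works the same way.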
The fundamental idea...
Example annotations in highlighted colours are positive examples.
Un-annotated texts are negative examples.
Features of this annotation:
• first_letter_capitalised: true
• word_found_in_gazetteer: true
• preceded_by: the
• followed_by: period
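A minimal sketch of computing the features listed above for a single token (the function name and gazetteer are illustrative assumptions, not the project's code):

```python
def token_features(tokens, i, gazetteer):
    """Compute features like those on the slide for the token at position i."""
    word = tokens[i]
    return {
        "first_letter_capitalised": word[:1].isupper(),
        "word_found_in_gazetteer": word.lower() in gazetteer,
        "preceded_by": tokens[i - 1].lower() if i > 0 else None,
        "followed_by": tokens[i + 1] if i + 1 < len(tokens) else None,
    }

# Toy gazetteer and sentence for illustration.
gazetteer = {"york", "sheffield"}
tokens = ["the", "York", "excavation", "."]
print(token_features(tokens, 1, gazetteer))
```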
Rule-based systems are good for extracting information that matches simple patterns and/or occurs in regular contexts, and are therefore applied to:
• Grid reference (easting and northing)
• Report title*
• Report creator*
• Report publisher*
• Report publication date*
• Report publisher contact
• Bibliography & references
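As an illustration of such a pattern-based rule, a simplified regular expression for UK Ordnance Survey grid references (assumed format: a two-letter grid square followed by easting and northing digit groups; real reports vary in spacing and precision) might look like:

```python
import re

# Simplified OSGB grid reference pattern, e.g. "SU 1234 5678".
# The leading letter of a 100 km square is one of H, J, N, O, S, T.
GRID_REF = re.compile(r"\b([HJNOST][A-Z])\s?(\d{2,5})\s?(\d{2,5})\b")

text = "The site lies at NGR SU 1234 5678, west of the church."
m = GRID_REF.search(text)
print(m.groups())  # ('SU', '1234', '5678')
```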
Machine Learning is good for extracting information that cannot be matched by patterns, occurs irregularly in context, or exists in large amounts, and is therefore applied to:
• What (subject)
• Where (place name)
• When (temporal info)
• Event date
From the 1st batch of annotated corpus: 35 unique annotated documents. Number of annotations by class:
publisher.name: 93
title: 53
date.event: 129
coverage.temporal: 2185
subject: 7935
publisher.contact: 21
date.publication: 28
coverage.spatial.placename: 1467
creator: 67
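Summing these counts gives 11,978 annotations in total, with the subject class alone accounting for roughly two thirds. A few lines of arithmetic make the class imbalance explicit:

```python
# Annotation counts per class, taken from the slide above.
counts = {
    "publisher.name": 93, "title": 53, "date.event": 129,
    "coverage.temporal": 2185, "subject": 7935, "publisher.contact": 21,
    "date.publication": 28, "coverage.spatial.placename": 1467, "creator": 67,
}

total = sum(counts.values())  # 11978
for cls, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{cls:28s} {n:5d}  {100 * n / total:5.1f}%")
```

The long tail of rare classes (publisher.contact, date.publication) is exactly where machine learning tends to struggle for lack of examples.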
Class: What (subject), Where (placename), When (temporal), Event date
Useful features to test*:
• word text
• word stem (root)
• word lemma (root form in dictionary)
• word orthography
• word part-of-speech
• word position in document (e.g., on page #)
• word position in page (e.g., position relative to page start offset and end offset)
• word membership in gazetteer
• word general entity class (e.g., person, organisation, location, date, time)
• plus the above features of the preceding 5 and succeeding 5 words
* These features are generally applied to every other class too. See following slides.
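The ±5-word context window described above can be sketched as follows (a minimal illustration, not the project's actual feature extractor; here only the lower-cased word text of each neighbour is collected):

```python
def window_features(tokens, i, n=5):
    """Collect a descriptor for the n tokens before and after position i."""
    feats = {}
    for offset in range(-n, n + 1):
        j = i + offset
        if offset == 0 or not (0 <= j < len(tokens)):
            continue  # skip the target token itself and out-of-range positions
        feats[f"word[{offset:+d}]"] = tokens[j].lower()
    return feats

tokens = "remains of a Bronze Age barrow were recorded near the henge".split()
print(window_features(tokens, 5))  # context features for "barrow"
```

In practice each neighbouring position would carry the full feature set (stem, lemma, orthography, part-of-speech, and so on), not just the word text.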
Class / Useful features to test:
• Title – features marked with * plus special word identifier (e.g., report, survey, evaluation)
• Creator – features marked with * plus relative position to title
• Publisher name – features marked with * plus relative position to title
• Publisher contact – features marked with * plus relative position to title
• Publication date – features marked with * plus relative position to title
• Grid reference points – identifier special word token (e.g., grid point, grid reference, easting, northing); pattern
• Bibliography/references – identifier special word token; pattern (e.g., person name followed by year)
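The "person name followed by year" pattern mentioned for bibliography entries could be sketched as a regular expression (illustrative only; real reference formats vary widely):

```python
import re

# Assumed shape: capitalised surname, optional initials, then a year
# in parentheses, e.g. "Richards, J. (2008)".
REF = re.compile(r"\b([A-Z][a-z]+(?:,\s[A-Z]\.)*)\s\((\d{4})\)")

line = "Richards, J. (2008) Archaeotools: faceted classification and NLP."
print(REF.findall(line))  # [('Richards, J.', '2008')]
```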