IR & Metadata

Metadata

Didn’t we already talk about this?

We discussed what metadata is and its types
– Data about data
– Descriptive metadata is external to the meaning of the content
– Semantic metadata is related to the content

How is it created?
– Catalogers, authors, data entry, etc.
– Requires lots of human effort

Automating Metadata

Can some metadata be assigned automatically?
– Yes, depending on how willing you are to live with mistakes
– But humans also make mistakes …

How to determine metadata values?
– Natural language processing
– Pattern matching
– Term/phrase recognition
– Information retrieval

Natural Language Processing

Use rules of sentence construction (grammar) to “understand” the meaning of the text.

Difficulties
– Grammar is not from grammar school
– Human communication requires non-literal interpretation

What types of metadata fields could NLP provide?

Example: weather forecasts

Pattern Matching

Use patterns (e.g. regular expressions) to locate and interpret specific forms of meaning

Difficulties
– Patterns must be expressible in the pattern language
– Lots of variations require lots of patterns
– Polysemy

What types of metadata fields could pattern matching provide?
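As a rough illustration of pattern matching, the sketch below pulls dates and e-mail addresses out of free text with regular expressions. The two patterns, the field names, and the sample sentence are assumptions made for this example; they cover only one date format and simple addresses, which is exactly the "lots of variations require lots of patterns" problem.

```python
import re

# Hypothetical patterns: ISO-style dates only, and simple e-mail addresses.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def extract_pattern_metadata(text):
    """Return metadata candidates found by simple pattern matching."""
    dates = ["-".join(m.groups()) for m in DATE_PATTERN.finditer(text)]
    emails = EMAIL_PATTERN.findall(text)
    return {"dates": dates, "emails": emails}


sample = "Last updated 2004-03-15. Contact: jane.doe@example.edu for details."
print(extract_pattern_metadata(sample))
```

A date written as "March 15, 2004" would be missed entirely, so each new surface form needs its own pattern.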

Term/Phrase Matching

Look for specific terms or phrases in order to determine document characteristics

Difficulties
– No understanding of context
– Polysemy

What types of metadata fields could term/phrase matching provide?
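A minimal sketch of term/phrase matching, assuming a small hand-built phrase list. The phrases and subject values are invented for illustration. The second call shows the polysemy problem: "Java" the island is tagged as programming.

```python
# Hypothetical phrase list mapping trigger phrases to a subject value.
SUBJECT_PHRASES = {
    "information retrieval": "Information Science",
    "controlled vocabulary": "Information Science",
    "java": "Computer Programming",
}


def assign_subjects(text):
    """Return the subjects whose trigger phrase occurs anywhere in the text."""
    lowered = text.lower()
    matches = {
        subject
        for phrase, subject in SUBJECT_PHRASES.items()
        if phrase in lowered  # plain substring test, no notion of context
    }
    return sorted(matches)


print(assign_subjects("A guide to information retrieval systems"))
print(assign_subjects("Travel guide to the island of Java"))
```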

Information Retrieval

Use statistical analysis of vocabulary use and document structure to determine document characteristics

Difficulties
– No understanding of terms
– No understanding of semantic context

What types of metadata fields could information retrieval provide?
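One common statistical approach is TF-IDF term weighting, sketched below as a way to propose keyword candidates. This is a generic illustration, not the method used by any tool named in these slides. The smoothed IDF formula, the tokenizer, and the three-document corpus are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(doc, corpus, k=3):
    """Rank the terms of `doc` by TF-IDF computed against `corpus`."""
    term_freq = Counter(tokenize(doc))
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(tokenize(other)))
    n_docs = len(corpus)
    scores = {}
    for term, tf in term_freq.items():
        # Smoothed inverse document frequency (one of several variants).
        idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
        scores[term] = tf * idf
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


corpus = [
    "the metadata record describes the metadata of the resource",
    "the weather forecast for the weekend",
    "the library catalog of the university",
]
print(top_terms(corpus[0], corpus, k=1))
```

The ranking depends only on counts, so a frequent function word can still score well unless stopwords are removed first.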

Practical Metadata

No metadata extraction algorithm works 100% of the time
– Could send results to a human to okay
  • Still requires lots of human resources
– Need to decide how good the algorithm has to be, or how sure it must be (if it provides confidence values), before accepting results

INFOMINE
– Project crawling and generating metadata for scholarly resources on the Web
– Has 100,000 automatically created records
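The accept-or-review decision can be sketched as a simple threshold rule on confidence values. The two thresholds, the field names, and the sample candidates below are assumptions chosen for illustration. Real cutoffs would come from measuring the extractor against expert-assigned data.

```python
def route_by_confidence(candidates, accept_at=0.9, review_at=0.5):
    """Split (field, value, confidence) triples into three queues."""
    accepted, needs_review, rejected = [], [], []
    for field, value, confidence in candidates:
        if confidence >= accept_at:
            accepted.append((field, value))      # trusted without review
        elif confidence >= review_at:
            needs_review.append((field, value))  # sent to a human to okay
        else:
            rejected.append((field, value))      # too unsure to keep
    return accepted, needs_review, rejected


# Hypothetical extractor output.
candidates = [
    ("title", "Soil Survey Database", 0.97),
    ("creator", "Webmaster", 0.62),
    ("subject", "Astronomy", 0.20),
]
accepted, needs_review, rejected = route_by_confidence(candidates)
print(accepted, needs_review, rejected)
```

Raising `accept_at` lowers the error rate of accepted records but sends more work to human reviewers.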

Types of Metadata Extraction

Assignment
– Assigns values drawn from the text of the document
– NLP, pattern matching, term/phrase matching

Classification
– Assigns values from a controlled vocabulary
– Uses machine learning during a training stage to match document attributes (e.g. term vector) to an element in the controlled vocabulary

Evaluating Metadata Extraction

Automatic evaluation
– Based on a document set with metadata previously assigned by human experts
– Compare similarity between system-assigned and human-assigned metadata
– Limited to document/metadata sets where the values are known

Human evaluation
– Subject experts rate the appropriateness of the assigned metadata
– Allows for near misses and alternate values
– Expensive to do

Metadata Extraction Metrics

Single-value metadata fields
– Accuracy is a good performance measure
– Partial match fields
  • Parent or child in ontological hierarchies

Multi-value metadata fields
– Precision = # right / # assigned
– Recall = # right / # of expert-assigned values

Semantic summaries and keyphrases
– Content-word precision = # same words / # words
– Content-word recall = # same words / # expert words
– Requires stopword removal and stemming
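The multi-value and content-word metrics above can be computed as in the following sketch. The stopword list and the suffix-stripping "stemmer" are deliberately crude stand-ins (assumptions for this example). Comparing content words as sets rather than with repeat counts is also an assumption.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def precision_recall(assigned, expert):
    """Precision and recall of system-assigned values against expert values."""
    assigned, expert = set(assigned), set(expert)
    right = len(assigned & expert)
    precision = right / len(assigned) if assigned else 0.0
    recall = right / len(expert) if expert else 0.0
    return precision, recall


def crude_stem(word):
    """Strip a trailing 'ing' or 's'; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_words(text):
    """Lowercase, drop stopwords, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(w) for w in words if w not in STOPWORDS}


def content_word_scores(system_text, expert_text):
    """Content-word precision and recall between two summaries."""
    return precision_recall(content_words(system_text), content_words(expert_text))


# Hypothetical creator field: 2 of 3 assigned values are right, 2 of 4 expert values found.
p, r = precision_recall(
    ["Smith, J.", "Jones, A.", "Webmaster"],
    ["Smith, J.", "Jones, A.", "Lee, K.", "Chen, M."],
)
print(round(p, 3), r)

print(content_word_scores(
    "Online guides to searching the library catalogs",
    "Library catalog search guide",
))
```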

INFOMINE Assignments

Title
– Single value open text field
– Title tag worked well

Creator
– Multiple value field
– Used “creator” meta tag if there (good precision, no smarts)

Keyphrase
– Used “keyword” meta tag with PhraseRate (IR approach)
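Reading the title tag and meta tags can be sketched with Python's standard `html.parser` module. This is not INFOMINE's code. The meta tag names looked up (`creator`, `keywords`) and the sample page are assumptions, and real pages use many other spellings.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and all <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta.setdefault(attrs["name"].lower(), []).append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html):
    """Return title, creator, and keyword candidates taken from the page head."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    keywords = [
        k.strip()
        for value in parser.meta.get("keywords", [])
        for k in value.split(",")
        if k.strip()
    ]
    return {
        "title": parser.title.strip(),
        "creator": parser.meta.get("creator", []),  # multiple values allowed
        "keywords": keywords,
    }


# Hypothetical page used only for this example.
page = """<html><head><title> Soil Survey Database </title>
<meta name="creator" content="Jane Doe">
<meta name="creator" content="John Roe">
<meta name="keywords" content="soil, survey, agriculture">
</head><body></body></html>"""
print(extract_head_metadata(page))
```

Taking the tag value as-is gives good precision when the tag exists, but it returns nothing for pages that omit it.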

INFOMINE Assignments

Description
– 1-2 paragraphs long
– Meta tags and AutoAnnotator (NLP + IR approach)

LCSH
– Selected from over 200,000 values
– Determines nearest neighbor in human-assigned data set (IR and ML)

INFOMINE Category
– Put document in set of nine categories
– Uses nine binary classifiers created using ML
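The nearest-neighbor idea can be sketched as follows: represent each document as a term-count vector, find the most similar human-labeled document by cosine similarity, and copy its heading. This is a generic illustration under assumed data, not INFOMINE's implementation. The labeled examples and headings are made up.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_heading(text, labeled):
    """Return the heading of the labeled document most similar to `text`."""
    query = term_vector(text)
    _, heading = max(labeled, key=lambda pair: cosine(query, term_vector(pair[0])))
    return heading


# Hypothetical human-assigned training data: (document text, heading).
labeled = [
    ("soil erosion and crop rotation on farms", "Agriculture"),
    ("stars galaxies and telescopes", "Astronomy"),
]
print(nearest_neighbor_heading("crop yields and soil quality", labeled))
```

With a vocabulary of over 200,000 headings, the quality of the result depends heavily on how many labeled examples exist for each heading.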

Summary

Metadata is useful but expensive
– Lots of human effort to generate
– Need to automate when possible

Metadata generation
– NLP, pattern matching, term/phrase matching, IR
– Approaches appropriate for generating different types of metadata

Evaluating generated metadata
– Automatic vs. human evaluation
– Accuracy, precision/recall, etc.
