
IR & Metadata

Post on 19-Dec-2015




Metadata

Didn’t we already talk about this?

We discussed what metadata is and its types
– Data about data
– Descriptive metadata is external to meaning of content
– Semantic metadata is related to content

How is it created?
– Catalogers, authors, data entry, etc.
– Requires lots of human effort


Automating Metadata

Can some metadata be assigned automatically?
– Yes, depending on how willing you are to live with mistakes
– But humans also make mistakes …

How to determine metadata values?
– Natural language processing
– Pattern matching
– Term/phrase recognition
– Information retrieval


Natural Language Processing

Use rules of sentence construction (grammar) to “understand” the meaning of the text.

Difficulties
– Grammar is not from grammar school
– Human communication requires non-literal interpretation

What types of metadata fields could NLP provide?

Example: weather forecasts


Pattern Matching

Use patterns (e.g. regular expressions) to locate and interpret specific forms of meaning

Difficulties
– Patterns must be expressible in pattern language
– Lots of variations require lots of patterns
– Polysemy

What types of metadata fields could pattern matching provide?
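As one possible answer to the question above, here is a minimal sketch of pattern matching for date and e-mail fields. The three regular expressions and the function names are assumptions made for illustration; they cover only two date layouts, which shows why lots of variations require lots of patterns.

```python
import re

# Hypothetical patterns; a real collection would need many more variants.
DATE_ISO = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")            # 2016-01-04
DATE_DMY = re.compile(r"\b(\d{1,2})-([A-Z][a-z]{2})-(\d{4})\b")  # 19-Dec-2015
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

MONTHS = {"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
          "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12}


def extract_dates(text):
    """Return dates found in text, normalized to YYYY-MM-DD strings."""
    found = []
    for year, month, day in DATE_ISO.findall(text):
        found.append(f"{year}-{month}-{day}")
    for day, month_name, year in DATE_DMY.findall(text):
        if month_name in MONTHS:  # skip look-alikes such as 12-Xyz-2015
            found.append(f"{year}-{MONTHS[month_name]:02d}-{int(day):02d}")
    return found


def extract_emails(text):
    """Return every substring that looks like an e-mail address."""
    return EMAIL.findall(text)
```

A call such as `extract_dates("Posted on 19-Dec-2015")` normalizes the match to a single format, but a date written as "December 19th" would be missed until another pattern is added.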


Term/Phrase Matching

Look for specific terms or phrases in order to determine document characteristics

Difficulties
– No understanding of context
– Polysemy

What types of metadata fields could term/phrase matching provide?
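To make the idea concrete, the sketch below guesses a "resource type" field by counting cue phrases. The phrase lists, the field, and the function name are all hypothetical; they are not taken from any real system.

```python
import re

# Hypothetical cue phrases for each value of a "resource type" field.
CUE_PHRASES = {
    "syllabus": ["office hours", "grading policy", "course schedule"],
    "paper": ["abstract", "related work", "references"],
    "forecast": ["chance of rain", "high of", "wind gusts"],
}


def match_resource_type(text):
    """Count cue-phrase hits per type; return (best_type, hits) or (None, 0)."""
    lowered = text.lower()
    counts = {}
    for label, phrases in CUE_PHRASES.items():
        counts[label] = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
    best = max(counts, key=counts.get)
    return (best, counts[best]) if counts[best] > 0 else (None, 0)
```

Because the matcher has no understanding of context, a sentence about an "abstract painting" is labeled as a paper, which is the polysemy problem listed above.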


Information Retrieval

Use statistical analysis of vocabulary use and document structure to determine document characteristics

Difficulties
– No understanding of terms
– No understanding of semantic context

What types of metadata fields could information retrieval provide?
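One common statistical technique is TF-IDF weighting, sketched below as a way to propose keyword candidates. TF-IDF is used here as an illustrative stand-in for "statistical analysis of vocabulary use"; the stopword list and function names are assumptions.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms(docs, doc_index, k=3):
    """Rank the terms of one document by TF-IDF against the whole collection."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    term_freq = Counter(tokenized[doc_index])
    n_docs = len(docs)
    scores = {t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

The ranking depends only on counts: a term is favored because it is frequent in this document and rare elsewhere, not because the code understands what it means.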


Practical Metadata

No metadata extraction algorithm works 100% of the time
– Could send results to a human to okay
• Still requires lots of human resources
– Need to decide, before accepting results, how good the algorithm has to be or, if it provides confidence values, how sure it has to be

INFOMINE
– Project crawling and generating metadata for scholarly resources on the Web
– Has 100,000 automatically created records
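The accept-or-review decision described on this slide can be sketched as a simple threshold rule. The two threshold values and the function name are hypothetical, and nothing here describes how INFOMINE itself handled confidence.

```python
def route_results(candidates, accept_at=0.9, review_at=0.6):
    """Split (field, value, confidence) triples into accepted, review, rejected.

    accept_at and review_at are illustrative thresholds, not recommended values.
    """
    accepted, review, rejected = [], [], []
    for field, value, confidence in candidates:
        if confidence >= accept_at:
            accepted.append((field, value))   # trusted without a human check
        elif confidence >= review_at:
            review.append((field, value))     # sent to a human to okay
        else:
            rejected.append((field, value))   # too unreliable to keep
    return accepted, review, rejected
```

Raising `accept_at` reduces the mistakes that slip through but sends more records to human review, which is the trade-off the slide points at.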


Types of Metadata Extraction

Assignment
– Assigns values drawn from text of the document
– NLP, pattern matching, term/phrase matching

Classification
– Assigns values from a controlled vocabulary
– Use machine learning during training stage to match document attributes (e.g. term vector) to element in controlled vocabulary
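Classification can be sketched with a nearest-neighbor rule over term vectors: a new document receives the controlled-vocabulary label of the most similar labeled example. Nearest neighbor is chosen here only because it is short to write; it stands in for whatever learning method a real system would train, and the labels and names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def classify(text, labeled_examples):
    """Return (label, similarity) of the most similar labeled example."""
    vec = term_vector(text)
    best_label, best_sim = None, 0.0
    for example_text, label in labeled_examples:
        sim = cosine(vec, term_vector(example_text))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim
```

The returned similarity can double as a rough confidence value; a document that shares no terms with any example gets `(None, 0.0)` rather than a forced label.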


Evaluating Metadata Extraction

Automatic evaluation
– Based on document set with human-expert previously assigned metadata
– Compare similarity between system-assigned and human-assigned metadata
– Limited to document/metadata sets where the values are known

Human evaluation
– Subject experts rate the appropriateness of the assigned metadata
– Allows for near misses and alternate values
– Expensive to do


Metadata Extraction Metrics

Single-value metadata fields
– Accuracy is a good performance measure
– Partial match fields
• Parent or child in ontological hierarchies

Multi-value metadata fields
– Precision = # right / # assigned
– Recall = # right / # of expert-assigned values

Semantic summaries and keyphrases
– Content-word precision = # same words / # words
– Content-word recall = # same words / # expert words
– Requires stopword removal and stemming
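The precision and recall formulas above can be sketched directly. The version below treats values as sets of unique items, uses a tiny illustrative stopword list, and applies a deliberately crude suffix-stripping stemmer; all three are simplifying assumptions, and a real evaluation would use an established stemmer.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def precision_recall(assigned, expert):
    """Precision = right / assigned, recall = right / expert-assigned."""
    assigned, expert = set(assigned), set(expert)
    right = len(assigned & expert)
    precision = right / len(assigned) if assigned else 0.0
    recall = right / len(expert) if expert else 0.0
    return precision, recall


def content_words(text):
    """Lowercase, drop stopwords, and strip a few common suffixes."""
    stems = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOPWORDS:
            continue
        for suffix in ("ing", "es", "s"):
            # Keep at least three letters so short words are not mangled.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        stems.append(word)
    return stems


def content_word_scores(system_text, expert_text):
    """Content-word precision and recall between two summaries or keyphrases."""
    return precision_recall(content_words(system_text), content_words(expert_text))
```

For example, comparing the system keyphrase "catalogs of rare maps" with the expert keyphrase "map catalog" matches both expert words after stemming, so recall is 1.0 while the extra word "rare" lowers precision.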


INFOMINE Assignments

Title
– Single value open text field
– Title tag worked well

Creator
– Multiple value field
– Used “creator” meta tag if there (good precision, no smarts)

Keyphrase
– Used “keyword” meta tag with PhraseRate (IR approach)
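Reading the title tag and meta tags, as described for the three fields above, might look like the following sketch built on Python's standard `html.parser` module. The class and function names are hypothetical, accepting `keywords` alongside `keyword` is an assumption, and PhraseRate is not reproduced here.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <title> text and creator/keyword <meta> values from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.creators = []
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = (attr.get("content") or "").strip()
            if not content:
                return
            if name == "creator":
                self.creators.append(content)  # multiple value field
            elif name in ("keyword", "keywords"):
                # Assumes comma-separated keyword lists.
                self.keywords.extend(
                    k.strip() for k in content.split(",") if k.strip()
                )

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_record(html):
    """Return a dict with title, creator, and keyphrase values for one page."""
    parser = MetaTagExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "creator": parser.creators,
        "keyphrase": parser.keywords,
    }
```

This is the "no smarts" approach: precision is good when the tags exist, but a page without them yields empty fields.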


INFOMINE Assignments

Description
– 1-2 paragraphs long
– Meta tags and AutoAnnotator (NLP + IR approach)

LCSH
– Selected from over 200,000 values
– Determines nearest neighbor in human-assigned data set (IR and ML)

INFOMINE Category
– Put document in set of nine categories
– Uses nine binary classifiers created using ML


Summary

Metadata is useful but expensive
– Lots of human effort to generate
– Need to automate when possible

Metadata generation
– NLP, pattern matching, term/phrase matching, IR
– Approaches appropriate for generating different types of metadata

Evaluating generated metadata
– Automatic vs. human evaluation
– Accuracy, precision/recall, etc.