Post on 19-Dec-2015
IR & Metadata
Metadata
Didn’t we already talk about this?
We discussed what metadata is and its types
– Data about data
– Descriptive metadata is external to the meaning of the content
– Semantic metadata is related to the content
How is it created?
– Catalogers, authors, data entry, etc.
– Requires lots of human effort
Automating Metadata
Can some metadata be assigned automatically?
– Yes, depending on how willing you are to live with mistakes
– But humans also make mistakes …
How to determine metadata values?
– Natural language processing
– Pattern matching
– Term/phrase recognition
– Information retrieval
Natural Language Processing
Use rules of sentence construction (grammar) to “understand” the meaning of the text.
Difficulties
– Grammar is not from grammar school
– Human communication requires non-literal interpretation
What types of metadata fields could NLP provide?
Example: weather forecasts
Pattern Matching
Use patterns (e.g. regular expressions) to locate and interpret specific forms of meaning
Difficulties
– Patterns must be expressible in the pattern language
– Lots of variations require lots of patterns
– Polysemy
What types of metadata fields could pattern matching provide?
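As a minimal sketch of the pattern-matching approach, the snippet below pulls a date field out of free text with a regular expression. The pattern, the function name `extract_dates`, and the sample sentence are illustrative assumptions, not taken from the slides.

```python
import re

# Hypothetical pattern: ISO-style dates such as 2004-03-15.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_dates(text):
    """Return every YYYY-MM-DD date in the text as a (year, month, day) tuple of ints."""
    return [tuple(int(g) for g in m.groups()) for m in DATE_PATTERN.finditer(text)]

sample = "Updated 2004-03-15; originally posted 15-Mar-2004."
print(extract_dates(sample))
```

Only the first date in the sample is found. The second uses a different written form and would need its own pattern, which is the "lots of variations require lots of patterns" difficulty in practice.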
Term/Phrase Matching
Look for specific terms or phrases in order to determine document characteristics
Difficulties
– No understanding of context
– Polysemy
What types of metadata fields could term/phrase matching provide?
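Term/phrase matching could be sketched as a simple lookup, assuming a hand-built phrase list. The phrases, the "resource type" values, and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical phrase list mapping trigger phrases to a "resource type" value.
PHRASE_TO_TYPE = {
    "table of contents": "book",
    "abstract": "article",
    "course syllabus": "syllabus",
}

def guess_resource_types(text):
    """Return the sorted resource types whose trigger phrase occurs in the text."""
    lowered = text.lower()
    return sorted({rtype for phrase, rtype in PHRASE_TO_TYPE.items() if phrase in lowered})

print(guess_resource_types("Abstract: we study metadata extraction."))
print(guess_resource_types("A gallery of abstract art."))
```

Both calls report "article", although the second text is about painting. The matcher has no notion of context, so the two senses of "abstract" are indistinguishable to it.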
Information Retrieval
Use statistical analysis of vocabulary use and document structure to determine document characteristics
Difficulties
– No understanding of terms
– No understanding of semantic context
What types of metadata fields could information retrieval provide?
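One statistical approach that fits this description is TF-IDF term weighting, sketched here for suggesting keyword values. The slides do not name a specific weighting scheme, so TF-IDF, the smoothing used, and the function names are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_terms(doc, corpus, k=3):
    """Rank the terms of doc by TF-IDF against corpus (a list of document strings)."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for s in doc_sets if term in s)  # number of documents containing the term
        scores[term] = count * math.log((1 + n) / (1 + df))  # smoothed IDF
    # Highest score first; ties broken alphabetically for a stable result.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the metadata record describes the metadata fields",
]
print(top_terms(corpus[2], corpus, k=1))
```

The ranking rests purely on counts: "the" scores zero because it appears in every document, and "metadata" wins because it is frequent in one document and absent from the others. Nothing in the computation knows what either word means.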
Practical Metadata
No metadata extraction algorithm works 100% of the time
– Could send results to a human to okay
• Still requires lots of human resources
– Need to decide how good the algorithm has to be, or (if it provides confidence values) how sure it must be, before accepting results
INFOMINE
– Project crawling and generating metadata for scholarly resources on the Web
– Has 100,000 automatically created records
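The accept-or-review decision could be sketched as a confidence threshold, assuming the extractor reports a confidence value between 0 and 1 for each result. The 0.9 default and the function name are illustrative assumptions, not INFOMINE's settings.

```python
def route_results(predictions, threshold=0.9):
    """Split (field, value, confidence) triples into auto-accepted and needs-review lists."""
    accepted, review = [], []
    for field, value, confidence in predictions:
        target = accepted if confidence >= threshold else review
        target.append((field, value))
    return accepted, review

predictions = [
    ("title", "Bird Guide", 0.97),
    ("creator", "A. Author", 0.55),
]
accepted, review = route_results(predictions)
print(accepted)
print(review)
```

Raising the threshold sends more records to human review and lowers the error rate of what is accepted automatically; lowering it does the opposite.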
Types of Metadata Extraction
Assignment
– Assigns values drawn from the text of the document
– NLP, pattern matching, term/phrase matching
Classification
– Assigns values from a controlled vocabulary
– Use machine learning during a training stage to match document attributes (e.g. term vector) to an element in the controlled vocabulary
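A minimal sketch of classification follows, using a nearest-neighbor match on term vectors with cosine similarity. Nearest neighbor is only one simple choice of learner, and the two-label vocabulary and training texts are invented for illustration.

```python
import math
import re
from collections import Counter

def term_vector(text):
    """Represent a text as a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two Counter term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify(text, labeled_examples):
    """Return the controlled-vocabulary label of the most similar training example."""
    vec = term_vector(text)
    best_label, _ = max(
        ((label, cosine(vec, term_vector(example))) for example, label in labeled_examples),
        key=lambda pair: pair[1],
    )
    return best_label

# Hypothetical training set: (document text, label from the controlled vocabulary).
training = [
    ("stars galaxies telescope orbit", "Astronomy"),
    ("cells genes protein organism", "Biology"),
]
print(classify("a new telescope images distant galaxies", training))
```

Unlike assignment, the returned value never has to appear in the document itself; it always comes from the label set supplied with the training examples.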
Evaluating Metadata Extraction
Automatic evaluation
– Based on a document set with metadata previously assigned by human experts
– Compare similarity between system-assigned and human-assigned metadata
– Limited to document/metadata sets where the values are known
Human evaluation
– Subject experts rate the appropriateness of the assigned metadata
– Allows for near misses and alternate values
– Expensive to do
Metadata Extraction Metrics
Single-value metadata fields
– Accuracy is a good performance measure
– Partial match fields
• Parent or child in ontological hierarchies
Multi-value metadata fields
– Precision = # right / # assigned
– Recall = # right / # of expert-assigned values
Semantic summaries and keyphrases
– Content-word precision = # same words / # words
– Content-word recall = # same words / # expert words
– Requires stopword removal and stemming
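The precision and recall formulas above can be sketched as follows. Several details are assumptions: values are compared as sets, "# words" is read as the number of content words in the system output, the stopword list is a tiny illustrative one, and `crude_stem` is a stand-in for a real stemmer.

```python
import re

def precision_recall(assigned, expert):
    """Precision = # right / # assigned; recall = # right / # expert-assigned values."""
    assigned, expert = set(assigned), set(expert)
    right = len(assigned & expert)
    precision = right / len(assigned) if assigned else 0.0
    recall = right / len(expert) if expert else 0.0
    return precision, recall

STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}  # illustrative only

def crude_stem(word):
    """Placeholder stemmer: strips a trailing plural 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def content_words(text):
    """Distinct stemmed words of the text with stopwords removed."""
    return {crude_stem(w) for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def content_word_scores(system_text, expert_text):
    """Content-word precision and recall between a system value and an expert value."""
    return precision_recall(content_words(system_text), content_words(expert_text))

print(precision_recall(["a", "b", "c"], ["b", "c", "d", "e"]))
print(content_word_scores("Field guide to the birds of Ohio", "Ohio bird guides"))
```

In the second call, stopword removal drops "to", "the", and "of", and stemming lets "birds" match "bird" and "guide" match "guides". Without those two steps the texts would share only "Ohio".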
INFOMINE Assignments
Title
– Single value open text field
– Title tag worked well
Creator
– Multiple value field
– Used “creator” meta tag if there (good precision, no smarts)
Keyphrase
– Used “keyword” meta tag with PhraseRate (IR approach)
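Reading the title tag and meta tags might look like the sketch below, built on Python's standard `html.parser` module. This is not INFOMINE's code. The class and function names are hypothetical, and treating the slide's “keyword” meta tag as the conventional `keywords` name is an assumption.

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect the <title> text and the content of named <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # A meta name may repeat, so keep every value (multiple value field).
            self.meta.setdefault(attrs["name"].lower(), []).append(attrs["content"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_fields(html):
    """Return title, creator, and keywords values found in the page markup."""
    parser = MetaTagExtractor()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "creator": parser.meta.get("creator", []),
        "keywords": parser.meta.get("keywords", []),  # assumed meta name
    }

page = (
    "<html><head><title> Bird Guide </title>"
    '<meta name="creator" content="A. Author">'
    '<meta name="keywords" content="birds, field guide">'
    "</head><body></body></html>"
)
print(extract_fields(page))
```

This is the "no smarts" case: the values are copied as the page author wrote them, so precision is good when the tags are present and nothing is returned when they are missing.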
INFOMINE Assignments
Description
– 1-2 paragraphs long
– Meta tags and AutoAnnotator (NLP + IR approach)
LCSH
– Selected from over 200,000 values
– Determines nearest neighbor in human-assigned data set (IR and ML)
INFOMINE Category
– Put document in set of nine categories
– Uses nine binary classifiers created using ML
Summary
Metadata is useful but expensive
– Lots of human effort to generate
– Need to automate when possible
Metadata generation
– NLP, pattern matching, term/phrase matching, IR
– Approaches appropriate for generating different types of metadata
Evaluating generated metadata
– Automatic vs. human evaluation
– Accuracy, precision/recall, etc.