Quality Taxonomies
Jim Nisbet, Senior Vice President of Technology
Semio Corporation
Knowledge Technologies 2001, March 5th, 2001
Ontology / Taxonomy
Root Ontology
Taxonomy Generation
Static Discovery
Dynamic Discovery
What is Quality?
“Best value for the money”
According to this definition, you are entitled to high performance from a costly product; likewise, a low-cost product or service can be expected to deliver less. For example, a loose demo delivery is both predictable and acceptable, since its quality is low conformance at low cost.
What is Quality?
“Good Quality is Nominal Conformance”
Taxonomy Quality is defined as Taxonomy Conformance to:
• Valid requirements;
• Explicitly documented development standards; and,
• Implicit characteristics that are expected of all professionally developed taxonomies, such as the desire for good maintainability.
Standards
ISO 2788-1986
• International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Monolingual Thesauri. 2nd ed. n.p.: ISO, 1986. (ISO 2788-1986(E)). (Available in the U.S. from American National Standards Institute)
ISO 5964-1985
• International Organization for Standardization. Documentation - Guidelines for the Establishment and Development of Multilingual Thesauri. n.p.: ISO, 1985. (ISO 5964-1985(E)). (Available in the U.S. from American National Standards Institute)
ANSI/NISO Z39.19-1993
• National Information Standards Organization. Guidelines for the Construction, Format, and Management of Monolingual Thesauri. Bethesda, MD: NISO Press, 1994. 69p. (ANSI/NISO Z39.19-1993)
SEMIO Quality Plan v1 2000
ISO/IEC 13250 Topic Maps
RDF
• Please refer to RDF at http://www.w3.org/RDF and XML at http://www.w3.org/XML
Project Plan
1. Kick-off
2. Requirements Review
3. Lexicon Review
4. Taxonomy Review
5. Tags Review
6. Final Review
1. Kick-off
Objectives
• Purpose
• Scope
• Scale
• Users
• Conditions of receipt
Roles
• Supplier
• Customer
  – Admin
  – KE
  – Experts
  – Users
Planning Training and Transfer
2. Requirements Review
• Sources
• Lexicon
• Ontology
• Install
Sources
• Dispersion (Multiplicity, Size, Homogeneity)
• Refresh
• Access

Features                      Internet, News, E-Mail   Reports, Patents   E-Trade, Logs
Informative content                     -                      +                +
Number of topics covered                +                      +                -
Structured information                  -                      +                +
Size of records                         -                      +                -
Number of records                       +                      -                +
Typical Patterns
Disparity
• Adjust sources
• Adjust crawl strategy
• Isolate communities / taxonomies
Lexicon
• Vocabularies, etc.
• Substitutions: Acronyms, Synonyms, etc.
• Preferred Keywords: Brand Names, etc.
• Banned Keywords
Typical Patterns
Lack of requirements
• Use Librarian Resources
Ontology
• Thesaurus?
• Is the information domain analysis complete, consistent, and accurate?
• Is the partitioning of the problem complete?
Typical Patterns
Directory versus Taxonomy
• Isolate “directory” branches
Thesaurus versus Taxonomy
• Put an ontology on top of the thesaurus
• Check as soon as possible that the thesaurus generics match the extracted lexicon
Very high-level design for top-category requirements
• Plan to work bottom-up
See also Taxonomy (functions, combinations, etc.)
Install
Implementation / Integration:
• Are external and internal interfaces properly defined?
• Are all requirements traceable to the system level?
• Has prototyping been conducted for the user/customer?
• Is performance achievable within the constraints imposed by other system elements?
• Are requirements consistent with schedule, resources, and budget?
Typical Patterns
• Scale
• Security
• Missing Documents
3. Lexicon Review
Coverage
• Extracted words / Words
• (Extracted Index / Index)
Sources benchmarking
• Coverage
• Extraction quality
• Topic distribution
Structure
• Most Frequent Phrases
• Most Productive Generics
Substitutions
Exceptions
Typical Patterns
Low level of frequency / quality for the most meaningful content
• Increase size of value corpus
• Filter and re-import lexicon
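The coverage ratios listed for the lexicon review (Extracted words / Words, Extracted Index / Index) can be sketched as a simple set ratio. This is only an illustration: the function name `coverage` and the sample word sets are hypothetical, and the slides do not say whether coverage is counted over distinct words or over occurrences; distinct words are assumed here.

```python
def coverage(extracted, reference):
    """Share of reference items that also appear among the extracted items.

    Assumption: coverage is measured over distinct items, not occurrences.
    """
    reference = set(reference)
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference & set(extracted)) / len(reference)


# Hypothetical sample data, not taken from the presentation.
corpus_words = {"tax", "loan", "liability", "asset", "fund"}
extracted_words = {"tax", "loan", "liability"}

word_coverage = coverage(extracted_words, corpus_words)
print(f"word coverage: {word_coverage:.2f}")
```

The same function could be applied to index entries by passing the extracted index and the reference index instead of word sets.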
4. Taxonomy Review
Taxonomy Operation
• Correctness
• Reliability
• Usability
• Integrity
• Efficiency
Taxonomy Revision
• Maintainability
• Flexibility
• Testability
Taxonomy Transition
• Portability
• Reusability
• Interoperability
[Figure: tree diagram of taxonomic ranks across Level 0 to Level 4]
UB = unique beginner, lf = life-form, g = generic, s = specific, v = varietal
Unique Beginner: Tax
Life Form: Liability
Generic: Loan
Specific: Term loan
Varietal: Short-term loan
Folk Taxonomies Design
The Berlin and Kay model: Taxonomy = Nomenclature + Terminology
Correctness
• Accuracy
• Completeness
• Consistency
Accuracy
• Precision
• Recall
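Precision and recall can be computed for a single category by comparing the documents the taxonomy assigned to it with the documents a reviewer judged relevant. The sketch below uses the standard definitions; the function name and the document identifiers are hypothetical.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for one category.

    assigned: documents the taxonomy placed in the category
    relevant: documents a reviewer judged to belong there
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical review data for one category.
assigned = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc3", "doc5"}

p, r = precision_recall(assigned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```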
Completeness
Taxonomy Maps Lexicon Collection
Concentration Works Against Quality
Lexicon
Document Collection
Maps
Taxonomy
Tagging
• Tagging Coverage
• Ontology Coverage
• Hook Coverage
• Map Coverage
• Lexical Coverage
• Collection Coverage
Consistency: Typical Patterns
• Objectivization
• Hyperonymy
• Speciation
• Necessity
Objectivization
Employment
  Firing
  Hiring
  Salaries
• Avoid functional categories
• Don’t mix functions / objects
• Exhaust scripts
• Match idiomatic phrases
Genericity
Parts
  Air Conditioning
  Belts and Hoses
  Body
  Brake System
  Chassis
  Engine
  Exhaust System
  Fuel System
  Glass
  Ignition
• Avoid meronymy
• Don’t mix meronymy / hyperonymy
• Exhaust prototypes
Speciation
Person
  Unwelcome person
    Unpleasant person
      Selfish person
        Opportunist
          Backscratcher
(WordNet)
• Avoid “strings” of categories
• Avoid (non-idioms) properties for categories
Necessity
Tax
Individuals Corporations
Assets Liability Assets Liability
Tax
Individuals Corporations
Assets Liability
Individuals Corporations
Avoid non-productive categories
Avoid combinations of categories
Nomenclature (Design Structure) Quality Index
Depth Width Balance
[Figure repeated: tree diagram of taxonomic ranks across Level 0 to Level 4]
UB = unique beginner, lf = life-form, g = generic, s = specific, v = varietal
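Depth, width, and balance can be measured directly on a taxonomy stored as a parent-to-children mapping. The slides do not define these measures, so the definitions below are assumptions: depth is the deepest leaf level, width is the largest number of categories on any level, and balance is the ratio of the shallowest to the deepest leaf level (1.0 means all leaves sit at the same level). The "Assets" sibling is added to the Tax example purely for illustration.

```python
def level_widths(tree, root):
    """Number of categories on each level, starting with the root level."""
    widths = []
    frontier = [root]
    while frontier:
        widths.append(len(frontier))
        frontier = [child for node in frontier for child in tree.get(node, [])]
    return widths


def leaf_depths(tree, root):
    """Level of every leaf category (root is level 0)."""
    depths = []
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        children = tree.get(node, [])
        if children:
            stack.extend((child, level + 1) for child in children)
        else:
            depths.append(level)
    return depths


# Parent -> children; "Assets" is a hypothetical sibling for illustration.
tree = {
    "Tax": ["Liability", "Assets"],
    "Liability": ["Loan"],
    "Loan": ["Term loan"],
    "Term loan": ["Short-term loan"],
}

widths = level_widths(tree, "Tax")
depths = leaf_depths(tree, "Tax")
depth = max(depths)
width = max(widths)
# Assumed balance measure: shallowest leaf level / deepest leaf level.
balance = min(depths) / max(depths) if max(depths) else 1.0
print(depth, width, balance)
```

A low balance value, as in this example, points to one branch being much more developed than its siblings.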
Complexity Index
Cyclomatic complexity increases with the number of Cross References within the Taxonomy, giving an indication of complexity and difficulty of testing.
Taxonomy Complexity Index combines:
• autonomy
• closure
• similarity
• typicality
• commonality
• redundancy
• stability
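One way to see why cross references raise cyclomatic complexity is to treat the taxonomy as a graph and apply McCabe's formula V(G) = E - N + 2P (edges, nodes, connected components). Applying the formula to a taxonomy graph this way is an assumption; the slides give no formula. The category names and the cross reference are hypothetical.

```python
def cyclomatic_complexity(nodes, edges, components=1):
    """McCabe's V(G) = E - N + 2P for a graph with P connected components."""
    return len(edges) - len(nodes) + 2 * components


# Hypothetical taxonomy fragment: hierarchy edges plus one cross reference.
nodes = {"Tax", "Liability", "Assets", "Loan"}
hierarchy = {("Tax", "Liability"), ("Tax", "Assets"), ("Liability", "Loan")}
cross_refs = {("Loan", "Assets")}

base = cyclomatic_complexity(nodes, hierarchy)
with_refs = cyclomatic_complexity(nodes, hierarchy | cross_refs)
print(base, with_refs)
```

A pure tree always has E = N - 1, so its value is 1; under this reading each cross reference adds exactly one to the index.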
Maturity Index
IEEE Std 982.1-1988 suggests a software maturity index, applied here to taxonomies, to provide an indication of the stability of the taxonomy.
Maturity Index combines:
• number of modules in the current ontology / taxonomy.
• number of modules in the current ontology / taxonomy that have been changed.
• number of modules added to the current ontology / taxonomy.
• number of modules deleted from the previous version of the ontology / taxonomy.
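The four counts above are usually combined as in the software maturity index of IEEE 982.1-1988: (total - (added + changed + deleted)) / total. The sketch below applies that formula to a taxonomy, assuming a "module" is a category or branch; the counts are hypothetical.

```python
def maturity_index(total, changed, added, deleted):
    """Maturity index in the IEEE 982.1-1988 form.

    total:   modules in the current ontology / taxonomy
    changed: current modules that have been changed
    added:   modules added to the current version
    deleted: modules deleted from the previous version
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - (added + changed + deleted)) / total


# Hypothetical release: 200 categories, 10 changed, 6 added, 4 deleted.
mi = maturity_index(total=200, changed=10, added=6, deleted=4)
print(f"maturity index: {mi:.2f}")
```

The value approaches 1.0 as the taxonomy stabilizes between versions.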
5. Tags Review
• Document coverage
• Concepts coverage

<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag>
      <tagname>Liability</tagname>
      <weight>1.289</weight>
    </tag>
    <tag>
      <tagname>Federal Funds</tagname>
      <weight>0.746</weight>
    </tag>
  </document>
</tagset>
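A tagset in the format shown can be read with Python's standard XML parser to estimate document and concept coverage. This is a sketch under assumptions: the element names come from the sample, while the coverage definitions, the function names, and the list of taxonomy concepts are hypothetical.

```python
import xml.etree.ElementTree as ET

TAGSET = """<tagset>
  <document>
    <docurl>http://www.TaxSource.com</docurl>
    <tag><tagname>Liability</tagname><weight>1.289</weight></tag>
    <tag><tagname>Federal Funds</tagname><weight>0.746</weight></tag>
  </document>
</tagset>"""


def read_tags(xml_text):
    """Map each document URL to a {tag name: weight} dictionary."""
    root = ET.fromstring(xml_text)
    return {
        doc.findtext("docurl"): {
            tag.findtext("tagname"): float(tag.findtext("weight"))
            for tag in doc.iter("tag")
        }
        for doc in root.iter("document")
    }


def document_coverage(tags_by_doc):
    """Assumed definition: share of documents carrying at least one tag."""
    return sum(1 for tags in tags_by_doc.values() if tags) / len(tags_by_doc)


def concept_coverage(tags_by_doc, concepts):
    """Assumed definition: share of taxonomy concepts used as a tag somewhere."""
    used = {name for tags in tags_by_doc.values() for name in tags}
    return len(used & set(concepts)) / len(concepts)


tags_by_doc = read_tags(TAGSET)
# Hypothetical list of taxonomy concepts for the example.
concepts = {"Liability", "Federal Funds", "Loan", "Assets"}
doc_cov = document_coverage(tags_by_doc)
concept_cov = concept_coverage(tags_by_doc, concepts)
print(doc_cov, concept_cov)
```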
6. Final Review
Receipt Maintenance