Content Management and the role of taxonomies Judith Molka-Danielsen Oct. 13, 2003

Post on 21-Dec-2015


Page 1:

Content Management and the role of taxonomies

Judith Molka-Danielsen
Oct. 13, 2003

Page 2:

Primary Challenges for Content Management Systems

Heterogeneous data sources: create a normalized representation of the data so that humans and machines can read it equally well.
  Retrieving data from an RDBMS involves programmatic access (ODBC, SQL).
  HTML files consist of tagged text that mixes stylistic and structural information. Browsers interpret the same code in different ways, which confuses automated programs even though humans manage it.
  Word processing applications (Word, Acrobat) store binary data that is converted to text by a proprietary interpreter and an associated viewer. We want viewers to interoperate with other formats.
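
A minimal sketch of the normalization idea, assuming Python's standard sqlite3 and html.parser modules stand in for the RDBMS and the HTML source. The table name `notes`, the record fields and the sample data are hypothetical and are not taken from the slides.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def record_from_html(source_id, html):
    """Turn tagged text into the common record shape."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"source": source_id, "kind": "html", "text": " ".join(parser.chunks)}


def records_from_db(conn):
    """Programmatic access through SQL, mapped to the same record shape."""
    rows = conn.execute("SELECT id, body FROM notes ORDER BY id")
    return [{"source": f"db:{i}", "kind": "rdbms", "text": body} for i, body in rows]


# Hypothetical sample data: one database row and one HTML page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES ('Container 42 loaded')")

page = "<html><body><h1>Ship</h1><p>Departs <b>Monday</b></p></body></html>"
records = records_from_db(conn) + [record_from_html("page:1", page)]
```

Both sources end up as dictionaries with the same keys, so later steps (classification, metadata extraction) do not need to know where a record came from.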

Page 3:

Primary Challenges for Content Management Systems

Distribution of data sources
  Access involves protocols (HTTP, HTTPS, FTP, SCP,…) that can pass through firewalls.
  Business applications still need security, and views must be limited to selected individuals and groups.
  Additional technologies (XML, IIOP, SOAP and Web Services) are being used to build tools for integrating systems.
  Delivering messages to components over HTTP requires a messaging protocol. The Simple Object Access Protocol (SOAP), which is written in XML, is emerging as that protocol.
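
To make "written in XML" concrete, here is a small sketch that builds a SOAP 1.1 envelope with Python's standard xml.etree.ElementTree. The application namespace, the operation name `GetDocument` and its parameter are hypothetical; the sketch only constructs the message and does not send it over HTTP.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "http://example.org/cms"  # hypothetical application namespace


def build_envelope(operation, params):
    """Wrap one operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical request for document number 17.
message = build_envelope("GetDocument", {"documentId": 17})
```

The resulting string would normally be sent as the body of an HTTP POST to the component that implements the operation.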

Page 4:

Primary Challenges for Content Management Systems

What is being used to identify distributed data sources: distributed directories and protocols
  The Domain Name System (DNS) is a hierarchically distributed directory of names (home.himolde.no) and IP addresses.
  The X.500 directory service is a hierarchically distributed directory of objects. Object attribute-value pairs may be stored and looked up.
  LDAP is a protocol for accessing a directory service. Most visions of the Web imagine “federated” servers to help find objects.
  UDDI is one protocol for advertising and discovering Web services.
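
The attribute-value lookup described above can be sketched with a toy in-memory directory. This is not LDAP or X.500 itself: the entry names, attributes and URLs below are hypothetical, and a real deployment would query a directory server instead.

```python
# Hypothetical entries keyed by a hierarchical name, most specific part first.
directory = {
    "cn=report,ou=docs,o=example": {
        "type": "document", "format": "pdf",
        "location": "http://docs.example.org/report.pdf",
    },
    "cn=prices,ou=docs,o=example": {
        "type": "document", "format": "xml",
        "location": "http://docs.example.org/prices.xml",
    },
    "cn=printer1,ou=devices,o=example": {
        "type": "printer",
        "location": "http://devices.example.org/printer1",
    },
}


def search(base, **attributes):
    """Return (name, location) for entries under `base` matching every attribute."""
    matches = []
    for name, entry in directory.items():
        in_subtree = name == base or name.endswith("," + base)
        if in_subtree and all(entry.get(k) == v for k, v in attributes.items()):
            matches.append((name, entry["location"]))
    return sorted(matches)


xml_docs = search("ou=docs,o=example", format="xml")
```

The client asks for objects by attribute and value and gets back where they live, instead of having to know the location in advance.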

Page 5:

The Web Today

[Diagram: a client, a web server and four DN servers. Steps: 1. Location Lookup; 2. Object Request.]

Page 6:

The Web with Object Directories

[Diagram: a client, two web servers, four DN servers and an LDAP server. Steps: 1. Registration; 2. Attribute/Value Request and Object/Location Response; 3. The Rest.]

Page 7:

Primary Challenges for Content Management Systems

Data size and the relevance factor
  Large repositories such as the WWW need a system for drilling down to subsets of relevant information.
  Speed and automation are critical. (Find not just more results, but better ones.)
  Find a particular needle in a haystack with a billion needles.
  Find all the needles that are similar to some other needle that has already been discovered.
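
One common way to "find needles similar to a known needle" is to rank documents by cosine similarity of their word counts. The slides do not name a technique, so the sketch below is only an illustration, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query, documents, top=2):
    """Return the `top` documents ranked by similarity to the query text."""
    q = vectorize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:top]


# Hypothetical haystack.
documents = [
    "container ship leaves port",
    "ship carries container to port",
    "student registers for course",
]
best = most_similar("container ship port", documents, top=1)
```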

Page 8:

What can help? Semantic web technology

XML and the Resource Description Framework (RDF) will allow XML tags to be labeled in conjunction with a referential knowledge representation.

Machine-based inference engines should replace today's search engines.

New editors are needed that make it easy to add semantic information to content, just as some editors already let users who do not know HTML syntax create web pages.

Page 9:

Syntactic Integration

Page 10:

Structural Integration

Page 11:

Semantic Integration

Page 12:

RDF

RDF provides a simple data model for expressing statements using (subject, predicate, object) triples, and an associated serialization syntax in XML. All three elements of the triple can be defined within the current document or refer to another resource on the Web.

As an example of RDF applied in a logistics context, we model three entities: ship, container and item.

Page 13:

RDF in use

In RDF we can express relations between entities, such as a ship transports a container, and a container contains an item. These relations can be, but need not be, hierarchical: for example, a business can be the owner of the transported item and at the same time the user of the container. It is important to note that these relations can change over time: ownership moves from one business to another, and containers move from ships to trucks for further transportation. These transitions may trigger events, like financial transactions or notifications.

An ontology can be used to define all the concepts used in a certain schema (or set of schemas), together with their meaning.
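
The ship, container and item relations can be sketched as plain (subject, predicate, object) triples. This uses Python tuples rather than an RDF library, and all identifiers (`ship:Aurora`, `business:Acme` and so on) are hypothetical.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("ship:Aurora", "transports", "container:C1"),
    ("container:C1", "contains", "item:I1"),
    ("business:Acme", "owns", "item:I1"),
    ("business:Acme", "uses", "container:C1"),
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def subjects(predicate, obj):
    """All subjects s such that (s, predicate, obj) is stated."""
    return {s for s, p, o in triples if p == predicate and o == obj}


def move_container(container, new_carrier, on_event):
    """Relations change over time: swap the carrier and notify a listener."""
    for old_carrier in subjects("transports", container):
        triples.discard((old_carrier, "transports", container))
    triples.add((new_carrier, "transports", container))
    on_event(f"{container} now transported by {new_carrier}")


events = []
move_container("container:C1", "truck:T9", events.append)
```

Note that `business:Acme` is related to both the item and the container, which is the non-hierarchical case the slide describes.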

Page 14:

Components of Semantic Technology

Classification
Metadata
Ontologies (taxonomies)

Page 15:

Classification
  General keyword searches lead to many irrelevant results.
  An automatic classification system could, for example, divide 1000 stories into 5 categories, so that keyword searches return more relevant results.
  Techniques for classification:
    Statistical analysis and pattern matching
    Rule-based methods
    Linguistic analysis
    Bayesian theory (probabilistic)
    Ontology-driven: name-entity and domain-phrase recognition
    Committee-based approaches, which use various techniques
  Classification is more precise if documents are tagged with metadata and conform to a predetermined schema.
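
As an illustration of the Bayesian technique in the list above, here is a small naive Bayes text classifier with Laplace smoothing. The training stories and the two category names are hypothetical; a real system would learn from far more data.

```python
import math
from collections import Counter, defaultdict


def train(labelled):
    """Count words per category and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, doc_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # category prior
        for word in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training stories.
training = [
    ("ship arrives at port", "shipping"),
    ("container loaded on ship", "shipping"),
    ("stock price rises", "finance"),
    ("bank reports profit", "finance"),
]
model = train(training)
```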

Page 16:

Metadata

Data about the data
Levels of metadata:
  Syntactic
  Structural
  Semantic

Page 17:

Syntactic Metadata

General information
Of little use for context determination
Document size, location, date of creation, etc.
Used in:
  Assessment of the document’s relevance
  Version tracking
  User-level access policies
Email and documents in file systems have this information.
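
For a file in a file system, this kind of metadata can be read without opening the content at all. A minimal sketch using Python's os.stat follows; the field names are hypothetical, and it reports the modification time because a true creation time is not available portably.

```python
import os
import tempfile
from datetime import datetime, timezone


def syntactic_metadata(path):
    """Size, location and timestamp of a file, read from the file system."""
    info = os.stat(path)
    return {
        "location": os.path.abspath(path),
        "size_bytes": info.st_size,
        # Modification time; creation time is platform dependent.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


# Hypothetical sample document written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello taxonomy")
    sample_path = handle.name

metadata = syntactic_metadata(sample_path)
os.remove(sample_path)
```

Nothing here says what the document is about, which is why the slide notes that this level is of little use for determining context.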

Page 18:

Structural Metadata

Information about the structure of content
Varies widely with document type
XML allows creators to enclose content within meaningful tags.
Can make associations between content from multiple documents.
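
A short sketch of how meaningful tags allow associations across documents, using Python's standard xml.etree.ElementTree. The two documents and their tag names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that both tag a container identifier.
doc_a = "<shipment><ship>Aurora</ship><container>C1</container></shipment>"
doc_b = "<invoice><container>C1</container><amount currency='NOK'>1200</amount></invoice>"


def tagged_values(xml_text):
    """Return the set of (tag, text) pairs found in a document."""
    root = ET.fromstring(xml_text)
    return {(el.tag, el.text) for el in root.iter() if el.text and el.text.strip()}


# Content that appears under the same tag in both documents.
shared = tagged_values(doc_a) & tagged_values(doc_b)
```

Because both creators used a `container` tag, a program can tell that the shipment and the invoice refer to the same container without understanding the rest of either document.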

Page 19:

Semantic Metadata

Semantic Metadata is “data which may be associated explicitly or implicitly with a given piece of content (such as a document) and whose relevance for that content is determined by its ontological position (its context) within one or more domains of knowledge.”

Page 20:

Semantic Metadata

Metadata receives its contextual information from a reference knowledgebase.

Metadata that is extracted from any document may be stored as a snapshot of that document’s relevant information.

The metadata contained within this snapshot simply references the instances of name-entities, which are stored in the ontology.

Each name-entity has related information stored: synonyms, attributes, related entities.

Page 21:

Semantic Metadata

Documents can link to each other in several ways:
  Explicit metadata: documents that mention exactly the same metadata.
  Implicitly related metadata: documents that contain synonyms or hierarchically related name-entities.
  Ontological associations: documents linked through name-entity associations, for example one document mentions a company name while another mentions its ticker symbol.
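
The company name and ticker symbol case can be sketched with a tiny reference knowledge base that maps each name-entity to its known surface forms. The entities, ticker symbols and documents below are all hypothetical.

```python
import re

# Hypothetical mini-ontology: each name-entity lists the forms it may appear as.
ontology = {
    "entity:ExampleCorp": {"example corp", "exc"},
    "entity:SampleBank": {"sample bank", "smb"},
}

documents = {
    "doc1": "Example Corp announced a new container terminal.",
    "doc2": "Shares of EXC rose after the announcement.",
    "doc3": "Sample Bank financed the project.",
}


def entities_in(text):
    """Name-entities whose company name or ticker occurs as a whole word."""
    lowered = text.lower()
    return {
        entity
        for entity, forms in ontology.items()
        if any(re.search(r"\b" + re.escape(form) + r"\b", lowered) for form in forms)
    }


def related(doc_id):
    """Other documents that mention at least one of the same name-entities."""
    mine = entities_in(documents[doc_id])
    return {
        other
        for other, text in documents.items()
        if other != doc_id and mine & entities_in(text)
    }
```

Here `doc1` and `doc2` share no keyword for the company, yet they are linked because both resolve to the same name-entity.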

Page 22:

Standards: DCML defines a generic element set that is not specific to any domain of knowledge. It can be used as a top domain.

Page 23:

Forms of knowledge representation
  Dictionary: terms are the keys and definitions are the values. There are no links between terms.
  Thesaurus: includes antonyms and synonyms. The pieces of knowledge are linked.
  Taxonomy: includes etymological information (derivation), and synonyms are organized hierarchically (inheritance).
    Flower is a subclass of plant. But a rose may be related to love. Associations may be emotional, cultural, temporal.
    Relevant associations can be discovered by a data-analysis system utilizing a reference knowledge base.
  Ontology: the labeling of the relationships in the taxonomy.
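
The difference between the hierarchical taxonomy and the labeled relationships of an ontology can be sketched as two small structures. The link from rose to flower and the label `symbolizes` are assumptions added for illustration; the slide only states that flower is a subclass of plant and that a rose may be related to love.

```python
# Taxonomy: unlabeled subclass (inheritance) links.
subclass_of = {"rose": "flower", "flower": "plant"}

# Ontology: explicitly labeled relationships between concepts.
labeled_relations = [("rose", "symbolizes", "love")]


def is_a(term, ancestor):
    """Follow subclass links upward to test inheritance."""
    while term in subclass_of:
        term = subclass_of[term]
        if term == ancestor:
            return True
    return False


def relations_of(term):
    """Labeled, non-hierarchical associations for a concept."""
    return [(label, target) for source, label, target in labeled_relations if source == term]
```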

Page 24:

Types of Metadata

Page 25:

Ontology Description Languages

Knowledge model building in a given domain is subjective.
There are problems combining independently developed ontologies.
The Resource Description Framework (RDF) and RDF-Schema (RDF-S) data model tries to address this:
  Resource: an item of interest at the atomic level, such as an entity, concept or document. Each resource is uniquely identified by a URI.
  Properties: descriptive characteristics and attributes of a resource. They may be associative, relating one resource to another.
  Statement: what is known as an RDF triple. It contains a reference to a resource, a property name, and that property’s value. These identifiers take the form of link addresses.

Page 26:

Ontology Description Languages

RDF-S (specification for ontology modeling): http://www.w3.org/TR/2000/CR-rdf-schema-20000327/
Dublin Core Metadata Initiative: http://dc2003.ischool.washington.edu/program.html
DARPA Agent Markup Language + Ontology Inference Layer (DAML+OIL) expands on RDF-S. Classes are defined as elements and can be related to other classes in disjunction, union, or equality.
The W3C has a Web Ontology Language (OWL) that is based on DAML+OIL.

Page 27:

Meta-data Interpretation

DAML (DARPA) endeavors to interpret a simple ontology to infer information about resources.

Put very simply:
  If people have names
  If students are people
  If resource X is about a student
  Then resource X should have a name

This kind of inference could easily be constructed within the context of an object-oriented directory.
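
The inference above fits in a few lines once the three "if" statements are stored as data. This is a sketch of the reasoning only, not of DAML; the dictionary names and identifiers are hypothetical.

```python
# "People have names": properties declared for a class.
has_property = {"Person": {"name"}}

# "Students are people": subclass links.
subclass_of = {"Student": "Person"}

# "Resource X is about a student": what each resource describes.
about = {"resourceX": "Student"}


def expected_properties(resource):
    """Collect properties inherited along the subclass chain."""
    cls = about.get(resource)
    properties = set()
    while cls is not None:
        properties |= has_property.get(cls, set())
        cls = subclass_of.get(cls)
    return properties


inferred = expected_properties("resourceX")
```

Walking up the class chain is the same mechanism as attribute inheritance in an object-oriented directory, which is the point the slide makes.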

Page 28:

Schema Interpretation – and integration

Consider two sets of resources:
  For set A, the attributes are structured in accord with the kind of metadata described on the previous slide.
  Imagine the same for set B, but using different attribute names and values.
  Accept that the attribute-values are called resource descriptions, and that a document called a resource description schema defines the relations for each set.
  Imagine the two schemas are related through a third schema.
  Finally, imagine an engine that relates resources in set A to resources in set B based on schema-level inferences.
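
A rough sketch of that engine: each schema maps its own attribute names onto a shared third vocabulary, and resources are matched after translation. The Dublin Core element names `dc:title` and `dc:creator` are used as the shared vocabulary by assumption; the local attribute names and resource descriptions are hypothetical.

```python
# Each schema relates local attribute names to the shared (third) schema.
schema_a = {"title": "dc:title", "author": "dc:creator"}
schema_b = {"headline": "dc:title", "byline": "dc:creator"}

# Hypothetical resource descriptions for the two sets.
set_a = {"a1": {"title": "Container Logistics", "author": "J. Doe"}}
set_b = {
    "b1": {"headline": "Container Logistics", "byline": "J. Doe"},
    "b2": {"headline": "Other Story", "byline": "J. Doe"},
}


def to_shared(description, schema):
    """Translate a description into the shared vocabulary."""
    return {schema[key]: value for key, value in description.items() if key in schema}


def match(resources_a, resources_b):
    """Pair resources whose translated descriptions are identical."""
    pairs = []
    for id_a, desc_a in resources_a.items():
        for id_b, desc_b in resources_b.items():
            if to_shared(desc_a, schema_a) == to_shared(desc_b, schema_b):
                pairs.append((id_a, id_b))
    return pairs


matches = match(set_a, set_b)
```

Exact equality is the simplest possible matching rule; a real engine would also reason over subclass and equivalence relations defined at the schema level.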

Page 29:

The Semantic Web Vision

[Diagram: a client, two web servers, four DN servers, three LDAP servers and four schema servers. Steps: 1. Schema Registration; 2. Description Association; 3. Object Query; 4. Inferencing; 5. The Rest.]

Page 30:

Sample Knowledgebases

WordNet is a networked thesaurus, developed at Princeton, in the form of a lexical matrix. It maps word forms to word meanings in a many-to-many relationship. A set of synonymous word forms that share one meaning is a synset. It is not an ontology because it does not contain the real-world information required in labeled relationships, such as that a “branch” is an administrative division with a chairman above it.

Open Directory Project: http://www.dmoz.org/

The National Library of Medicine has an ontology system, the Unified Medical Language System (UMLS), with researchers and institutions contributing to it. http://www.nlm.nih.gov/research/umls/

Page 31:

Toolkits should provide for:
  Establishing configurable parameters
  Extraction agents and classifier modules
  Accepting training sets of data and learning from patterns, so that future items are classified without a manual trigger
  An easily navigable visual environment
  Tracking the date and time of data entry
ROADS provides tools for creating subject gateways: http://www.ilrt.bristol.ac.uk/roads/

Page 32:

Extracting Wrapper Technologies
  WysiWyg Web Wrapper Factory (W4F): crawls and retrieves data from web pages to create wrappers that represent the content of the pages.
  ANDES: uses XPath rules.
  XWRAP toolkit: has interactive rule formulation.
  S-CREAM (semiautomatic creation of metadata): lets the user annotate documents.
  Ontoprise (product by Semagix): http://www.ontoprise.com
  BUT, an ontology-driven classifier and domain-specific metadata annotator allows searching on classification by keyword AND on implied entity association. (See example on next slide.)
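
To illustrate what a wrapper built from XPath rules does, here is a sketch that uses the limited XPath subset in Python's standard xml.etree.ElementTree. It is not the ANDES or W4F implementation; the page, the rule table and the field names are hypothetical, and the page must be well-formed XHTML for this parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page with one table of containers.
page = """<html><body><table>
<tr class="item"><td>C1</td><td>Aurora</td></tr>
<tr class="item"><td>C2</td><td>Borealis</td></tr>
<tr class="footer"><td>2 containers</td></tr>
</table></body></html>"""

# Extraction rules: which rows to take, and where each field sits in a row.
ROW_RULE = ".//tr[@class='item']"
FIELD_RULES = {"container": "td[1]", "ship": "td[2]"}


def wrap(xhtml):
    """Apply the rules and return one record per matching row."""
    root = ET.fromstring(xhtml)
    return [
        {field: row.findtext(path) for field, path in FIELD_RULES.items()}
        for row in root.findall(ROW_RULE)
    ]


rows = wrap(page)
```

The rules live in data rather than code, so adapting the wrapper to a changed page layout means editing the rule table only.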

Page 33:
Page 34:

Semagix Visualizer is a visualization tool for viewing an ontology or schema.
