TU/e technische universiteit eindhoven
Web Data and Metadata
Geert-Jan Houben
Contents
• Evolution in Web data
• Techniques and Languages for Web data:
– XML
– XML Querying: XQuery
– RDF (& RQL)
– OWL
Note: here the context, not the details!
Evolution
Future of the Web
1. Common syntax: XML
• HTML: a fixed set of tags complicates the identification of information elements
• XML allows data structures to be defined:
  • Tags with freely chosen names
    – No predefined tags: this enables definition, transmission, validation and interpretation of data between applications (and organizations)
  • Freely chosen attributes
  • Simple definition: DTD
  • Extended definition: XML Schema
<skills>
  <people>
    <person>
      <name>Bob</name>
      <know-how>Quilt</know-how>
    </person>
    <person>
      <name>Peter</name>
      <know-how>Quilt</know-how>
      <know-how>XML-GL</know-how>
    </person>
  </people>
  <seminars>
    <seminar>
      <topic>Quilt</topic>
      <participant>
        <name>Karin</name>
        <name>Alice</name>
      </participant>
    </seminar>
  </seminars>
</skills>
//person/name[../know-how="Quilt"]
union
//seminar[topic="Quilt"]/participant/name
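A minimal sketch of how the two path expressions and their union could be evaluated, assuming Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The first expression is therefore rewritten with the predicate on `person` instead of the `..` step, and the union is a plain list concatenation. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The skills document from the slide above.
SKILLS = """
<skills>
  <people>
    <person><name>Bob</name><know-how>Quilt</know-how></person>
    <person><name>Peter</name><know-how>Quilt</know-how><know-how>XML-GL</know-how></person>
  </people>
  <seminars>
    <seminar>
      <topic>Quilt</topic>
      <participant><name>Karin</name><name>Alice</name></participant>
    </seminar>
  </seminars>
</skills>
"""

root = ET.fromstring(SKILLS)

# //person/name[../know-how="Quilt"], rewritten for ElementTree's XPath subset
experts = [n.text for n in root.findall(".//person[know-how='Quilt']/name")]

# //seminar[topic="Quilt"]/participant/name
attendees = [n.text for n in root.findall(".//seminar[topic='Quilt']/participant/name")]

# Union of both results; no name occurs twice in this document,
# so concatenation is enough here.
quilt_people = experts + attendees
print(quilt_people)
```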
Future of the Web
2. Specification of meaning: RDF
• Resource: denotes an information item, e.g. via a URL
• Property type: name of a property of a resource
• Value: value for that property
Example: Resource = URL of web page
Property type = “author”
Value = “John Smith”
<?xml:namespace ns = "http://www.w3.org/RDF/RDF/" prefix = "RDF" ?>
<?xml:namespace ns = "http://purl.oclc.org/DC/" prefix = "DC" ?>
<?xml:namespace ns = "http://person.org/BusinessCard/" prefix = "CARD" ?>
<RDF:RDF>
  <RDF:Description RDF:HREF = "http://uri-of-Document-1">
    <DC:Creator RDF:HREF = "#Creator_001"/>
  </RDF:Description>
  <RDF:Description ID="Creator_001">
    <CARD:Name>John Smith</CARD:Name>
    <CARD:Email>[email protected]</CARD:Email>
    <CARD:Affiliation>Home, Inc.</CARD:Affiliation>
  </RDF:Description>
</RDF:RDF>
Future of the Web
3. Meaning: ontologies
• Ontology = a vocabulary with associated meaning
• Possibility to define synonyms, specializations and other relationships
• Use of same ontology = contract on meaning of words (tags, attributes)
• Often industry or domain dependent
Future of the Web
4. Logic to derive conclusions
• Necessary in electronic commerce: what do the messages exchanged between supplier and customer mean?
5. Goal: trust in the meaning of communication between Web systems, and hence the possibility to automate using agents
Ref: www.w3.org
Web Data Integration
• WIS repository (back-end) typically assembled from different heterogeneous sources, e.g. databases, files, WWW
• To manage (coordinate) data from different sources, metadata helps to structure the data
Metadata
• Describing the data and its availability
• Sometimes provided by sources
• Needed by IS
• Engineering metadata:
  – Meaning
  – Validity
  – Quality
• Specifying “logistics” of data
XML
Semistructured data
XML: Complex data
• Structure is irregular (missing/extra data)
• Schema does not exist or is unknown
• Schema is rapidly evolving
• Relational and ODB models are too rigid
• The standard, HTML, is a document/hypertext language
• Solution: semistructured data model XML
  – data model consists of a type definition language, a query/update language and more
XML Environment
• Follow-up of SGML, markup language for documents, and OO databases
• XML: eXtensible Markup Language
  – W3C and most industrial companies [B2B]
  – Main idea: separate content and presentation
  – Use tags to represent structure and semantics
Ref: www-rocq.inria.fr/~abitebou/pub/lics01.ppt
HTML = Hypertext Language
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
Information System
HTML
The <b> X23 </b> new camera replaces the <b> X22 </b>. It comes equipped with a flash (worth by itself <i>53.99 $</i>) and provides great quality for only <i>359.99 $</i>.
Text + presentation
Where is the data? (hard)
XML = Semistructured Data
Ref   Name    Price
X23   Camera  359.99
R2D2  Robot   19350.00
Z25   PC      1299.99
...
Information System
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
    <description> … </description>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
    <description> … </description>
  </product>
  ...
</product-table>
XML
Data + Structure
Semistructured: more flexible (easy)
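To illustrate why recovering the data is easy once it is marked up, here is a small sketch that rebuilds the (Ref, Name, Price) rows from a well-formed copy of the product table. It assumes Python's standard `xml.etree.ElementTree`; the `description` elements are left out, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the product table (descriptions omitted).
PRODUCTS = """
<product-table>
  <product reference="X23">
    <designation> camera </designation>
    <price unit="Dollars"> 359.99 </price>
  </product>
  <product reference="R2D2">
    <designation> Robot </designation>
    <price unit="Dollars"> 19350 </price>
  </product>
</product-table>
"""

table = ET.fromstring(PRODUCTS)

# Each <product> element maps directly back to one row of the source table.
rows = [
    (
        p.get("reference"),
        p.findtext("designation").strip(),
        float(p.findtext("price")),
    )
    for p in table.iter("product")
]
print(rows)
```

The same extraction from the HTML version would require guessing which `<b>` and `<i>` fragments are references and prices.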
XML Flexibility
• no fixed set of tags
• no fixed interpretation/rendering of tags
• no fixed structure
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is going wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Confirm this is electric</comment>
    </item>
    <item partNum="926-AA">
      <productName>Baby Monitor</productName>
      <quantity>1</quantity>
      <USPrice>39.98</USPrice>
      <shipDate>1999-05-21</shipDate>
    </item>
  </items>
</purchaseOrder>
XML Documents
• elements and attributes
• elements are ordered
• attribute values are strings
• well-formed documents (e.g. proper nesting)
• namespaces: vocabularies for tags
• valid documents: DTD, Schema
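Well-formedness can be checked by any XML parser without a DTD or schema. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` (a non-validating parser); the helper name `is_well_formed` is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed("<a><b>x</b></a>"))    # properly nested
print(is_well_formed("<a><b>x</a></b>"))    # improperly nested
print(is_well_formed("<a>x</a><a>y</a>"))   # more than one root element
```

Validity against a DTD or XML Schema is a separate, stronger check that this parser does not perform.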
DTD: a grammar
Catalog      → Product*
Product      → Name Price? Cat (Part Quantity)*
Part         → BasicPart + ComposedPart
BasicPart    → Name
ComposedPart → Name (Part Quantity)*
XML Schema
• to define a class of documents: conforming to a schema
• in XML syntax
• built-in types
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      Purchase order schema for Example.com.
      Copyright 2000 Example.com. All rights reserved.
    </xsd:documentation>
  </xsd:annotation>
  <xsd:element name="purchaseOrder" type="PurchaseOrderType"/>
  <xsd:element name="comment" type="xsd:string"/>
  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <xsd:element name="shipTo" type="USAddress"/>
      <xsd:element name="billTo" type="USAddress"/>
      <xsd:element ref="comment" minOccurs="0"/>
      <xsd:element name="items" type="Items"/>
    </xsd:sequence>
    <xsd:attribute name="orderDate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:complexType name="USAddress">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="street" type="xsd:string"/>
      <xsd:element name="city" type="xsd:string"/>
      <xsd:element name="state" type="xsd:string"/>
      <xsd:element name="zip" type="xsd:decimal"/>
    </xsd:sequence>
    <xsd:attribute name="country" type="xsd:NMTOKEN" fixed="US"/>
  </xsd:complexType>
  ...
</xsd:schema>
  ...
  <xsd:complexType name="Items">
    <xsd:sequence>
      <xsd:element name="item" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="productName" type="xsd:string"/>
            <xsd:element name="quantity">
              <xsd:simpleType>
                <xsd:restriction base="xsd:positiveInteger">
                  <xsd:maxExclusive value="100"/>
                </xsd:restriction>
              </xsd:simpleType>
            </xsd:element>
            <xsd:element name="USPrice" type="xsd:decimal"/>
            <xsd:element ref="comment" minOccurs="0"/>
            <xsd:element name="shipDate" type="xsd:date" minOccurs="0"/>
          </xsd:sequence>
          <xsd:attribute name="partNum" type="SKU" use="required"/>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
  <!-- Stock Keeping Unit, a code for identifying products -->
  <xsd:simpleType name="SKU">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\d{3}-[A-Z]{2}"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
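The two simple-type restrictions in the schema (the `SKU` pattern and the `quantity` bound) can be mimicked by hand to see what a validator would accept. This is only a sketch with Python's standard `re` module, not a schema validator; the function names are hypothetical, and XSD pattern and lexical rules are approximated by Python regular expressions.

```python
import re

# Same pattern as the SKU simple type: three digits, a hyphen, two capitals.
SKU_PATTERN = re.compile(r"\d{3}-[A-Z]{2}")

# Approximate lexical form of xsd:positiveInteger (optional "+", digits).
POSITIVE_INT = re.compile(r"\+?[0-9]+")


def is_valid_sku(value: str) -> bool:
    # XSD patterns must match the whole value, hence fullmatch.
    return SKU_PATTERN.fullmatch(value) is not None


def is_valid_quantity(value: str) -> bool:
    # positiveInteger with maxExclusive="100" allows 1 through 99.
    if POSITIVE_INT.fullmatch(value) is None:
        return False
    return 1 <= int(value) < 100


for part in ("872-AA", "926-AA", "872-aa"):
    print(part, is_valid_sku(part))
for qty in ("1", "99", "100", "0"):
    print(qty, is_valid_quantity(qty))
```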
Typing XML
• Not really in the true spirit of the Web, but essential for data management: query optimization, user interfaces, applications
• Differences with standard database typing
  – Collections are sequences instead of sets
  – Types may be very large (e.g., from integration)
  – Data is more irregular, so types should be more permissive
  – Sometimes a new issue: you have the data and must extract its type (an approximate type)
More on XML
• The Database Models course in BIS, given by De Bra and Paredaens, will pay much more attention to the XML data model.
• Also, look at the W3C site: w3c.org
XML Querying
XQuery
XML query language
• XML is used for data exchange on the Web
• W3C develops standard: XML Query Working Group
• XML Query Data Model
• XPath and XQuery
Ref: www.w3.org/XML/Query
XPath
• Path expressions in OO databases
  /Students/Student/Status
• Semistructured:
  – missing parts
    /Students//Status
  – conditions
    /Students/Student[Status="U4"]
• Indexing, wildcards
• Selection, string manipulation, aggregation, attribute existence, union
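The three path expressions above can be tried on a small document. A sketch using Python's standard `xml.etree.ElementTree`, whose XPath subset is evaluated relative to the root element, so `/Students/...` becomes `./...`. The sample data, including the `Group` and `Name` elements, is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one student lacks a Status, one is nested deeper.
STUDENTS = """
<Students>
  <Student><Name>Ann</Name><Status>U4</Status></Student>
  <Student><Name>Ben</Name></Student>
  <Group>
    <Student><Name>Cy</Name><Status>U2</Status></Student>
  </Group>
</Students>
"""

root = ET.fromstring(STUDENTS)

# /Students/Student/Status : fixed path, only direct Student children
direct = [s.text for s in root.findall("./Student/Status")]

# /Students//Status : Status at any depth, tolerating missing or extra levels
anywhere = [s.text for s in root.findall(".//Status")]

# /Students/Student[Status="U4"] : condition on a child element
u4_names = [s.findtext("Name") for s in root.findall("./Student[Status='U4']")]

print(direct, anywhere, u4_names)
```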
XSLT
• XSL: eXtensible Stylesheet Language
  – (XSLT: XSL Transformations)
• declarative language for transforming XML documents using an XSLT processor
XQuery
• http://www.w3.org/XML/Query
• “the” standard for XML querying
• Goal WG: “data model for XML documents, a set of query operators on that data model, and a query language based on these query operators”
• General query language (next to XPath + XSLT)
XQuery Path Expressions
Based on XPath
In the second chapter of the document named “zoo.xml”, find the figure(s) with caption “Tree Frogs”.
document("zoo.xml")/chapter[2]//figure[caption="Tree Frogs"]
Find captions of figures that are referenced by <figref> elements in the chapter of “zoo.xml” with title “Frogs”.
document("zoo.xml")/chapter[title="Frogs"]//figref/@refid->fig/caption
XQuery Element Constructor
Generate an <emp> element that has an “empid” attribute. The value of the attribute and the content of the subelements are specified by variables that are bound in other parts of the query.
<emp empid={$id}>
  {$name}
  {$job}
</emp>
XQuery FLWR Expression
FOR var IN expr       binding clause
LET var := expr       binding clause
WHERE expr            selection predicate
RETURN expr           output generation
List the titles of books published by Morgan Kaufmann in 1998.
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan Kaufmann" AND $b/year = "1998"
RETURN $b/title
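For readers more used to general-purpose languages, the FOR/WHERE/RETURN query above reads much like a list comprehension. A sketch in Python with the standard `xml.etree.ElementTree`; the contents of `bib.xml` are not given in the slides, so the sample document here is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB = """
<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>2000</year></book>
  <book><title>Book C</title><publisher>Other Press</publisher><year>1998</year></book>
</bib>
"""

bib = ET.fromstring(BIB)

titles = [
    b.findtext("title")                               # RETURN $b/title
    for b in bib.iter("book")                         # FOR $b IN ...//book
    if b.findtext("publisher") == "Morgan Kaufmann"   # WHERE ... AND ...
    and b.findtext("year") == "1998"
]
print(titles)
```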
FLWR Expression
List each publisher and the average price of its books.
FOR $p IN distinct(document("bib.xml")//publisher)
LET $a := avg(document("bib.xml")//book[publisher=$p]/price)
RETURN
<publisher>
<name>{$p/text()}</name>
<avgprice>{$a}</avgprice>
</publisher>
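The grouping performed by this query (iterate over distinct publishers, then aggregate the prices of each publisher's books) can be sketched in Python as follows. The sample `bib.xml` content is invented, and the result is a plain dictionary rather than the `<publisher>` elements that the query constructs.

```python
import xml.etree.ElementTree as ET
from statistics import mean

# Hypothetical stand-in for bib.xml.
BIB = """
<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><price>40.00</price></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><price>60.00</price></book>
  <book><title>Book C</title><publisher>Other Press</publisher><price>30.00</price></book>
</bib>
"""

bib = ET.fromstring(BIB)

# FOR $p IN distinct(...//publisher): distinct values in first-seen order
publishers = list(dict.fromkeys(p.text for p in bib.iter("publisher")))

# LET $a := avg(...//book[publisher=$p]/price)
avg_price = {
    p: mean(
        float(b.findtext("price"))
        for b in bib.iter("book")
        if b.findtext("publisher") == p
    )
    for p in publishers
}
print(avg_price)
```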
Operators and Functions
Find the maximum depth of the document named “partlist.xml”.
NAMESPACE xsd = "http://www.w3.org/2001/XMLSchema-datatypes"
FUNCTION depth(ELEMENT $e) RETURNS xsd:integer
{
  -- An empty element has depth 1
  -- Otherwise, add 1 to max depth of children
  IF empty($e/*) THEN 1
  ELSE max(depth($e/*)) + 1
}
depth(document("partlist.xml"))
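The same recursion, written as a sketch in Python over an `xml.etree.ElementTree` element. The sample part list is invented, and the function is applied to the root element rather than to a document node.

```python
import xml.etree.ElementTree as ET


def depth(e: ET.Element) -> int:
    """Depth of the tree rooted at e: 1 for a childless element,
    otherwise 1 plus the maximum depth of its child elements."""
    children = list(e)
    if not children:
        return 1
    return 1 + max(depth(c) for c in children)


# Hypothetical stand-in for partlist.xml: parts nested three levels deep.
PARTLIST = "<partlist><part><part><part/></part></part><part/></partlist>"

print(depth(ET.fromstring(PARTLIST)))
```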
Conditional Expression
Make a list of holdings, ordered by title. For journals, include the editor, and for all other holdings, include the author.
FOR $h IN //holding
RETURN
  <holding>
    {$h/title,
     IF $h/@type = "Journal"
     THEN $h/editor
     ELSE $h/author
    }
  </holding> SORTBY (title)
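A sketch of the same conditional construction in Python, assuming `xml.etree.ElementTree` and an invented holdings document. The helper name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical holdings; each has both an author and an editor so that
# the effect of the condition is visible.
HOLDINGS = """
<holdings>
  <holding type="Journal"><title>Web Journal</title><editor>E. Editor</editor><author>X</author></holding>
  <holding type="Book"><title>Data on the Web</title><author>A. Author</author><editor>Y</editor></holding>
</holdings>
"""

root = ET.fromstring(HOLDINGS)


def summarize(h: ET.Element) -> ET.Element:
    """Build a new <holding> with the title plus editor (journals) or author (others)."""
    out = ET.Element("holding")
    out.extend(h.findall("title"))
    branch = "editor" if h.get("type") == "Journal" else "author"
    out.extend(h.findall(branch))
    return out


# SORTBY (title)
result = sorted(
    (summarize(h) for h in root.iter("holding")),
    key=lambda e: e.findtext("title", default=""),
)

for e in result:
    print(ET.tostring(e, encoding="unicode"))
```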
Quantified Expressions
Find titles of books in which both sailing and windsurfing are mentioned in the same paragraph.
FOR $b IN //book
WHERE SOME $p IN $b//para SATISFIES
  contains($p, "sailing") AND contains($p, "windsurfing")
RETURN $b/title
Find titles of books in which sailing is mentioned in every paragraph.
FOR $b IN //book
WHERE EVERY $p IN $b//para SATISFIES
  contains($p, "sailing")
RETURN $b/title
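SOME and EVERY correspond to existential and universal quantification, which Python expresses with `any` and `all`. The sketch below uses an invented document; note that a book without paragraphs satisfies the EVERY condition vacuously.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
BOOKS = """
<library>
  <book><title>Book 1</title><para>sailing and windsurfing</para><para>sailing only</para></book>
  <book><title>Book 2</title><para>sailing</para><para>windsurfing</para></book>
  <book><title>Book 3</title></book>
</library>
"""

root = ET.fromstring(BOOKS)


def text_of(e: ET.Element) -> str:
    """All text inside e, including text of nested elements."""
    return "".join(e.itertext())


# SOME $p IN $b//para SATISFIES contains(...) AND contains(...)
some_titles = [
    b.findtext("title")
    for b in root.iter("book")
    if any("sailing" in text_of(p) and "windsurfing" in text_of(p) for p in b.iter("para"))
]

# EVERY $p IN $b//para SATISFIES contains(...)
every_titles = [
    b.findtext("title")
    for b in root.iter("book")
    if all("sailing" in text_of(p) for p in b.iter("para"))
]

print(some_titles, every_titles)
```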
Other expressions
• Sequence-related expressions
  – Example: ($x,$y,$z)
  – PRECEDES, FOLLOWS
• Operators on data types
  – INSTANCEOF
  – CAST
  – TREAT
More on XQuery
• The Database Models course in BIS, given by De Bra and Paredaens, will pay much more attention to XML query languages.
• Also, look at the W3C site: w3c.org
RDF
RQL
Resource Description Framework
• W3C standard for metadata description
• Describes the “meaning” of data like Web sites, parts of HTML pages, etc.
• Makes data “machine-understandable”: allows automated data processing
• Framework that allows you to make simple assertions about anything: distributed and extensible (as is the Web)
• “meaning” expressed via “subclass of”
Ref: www.w3.org/RDF, www.w3.org/TR/rdf-primer
Basic RDF Model
• Recognizes 3 object types:
  – Resources – always named by URI, e.g. web site, part of web page, others
  – Properties – an attribute of a Resource, its characteristics
  – Statements – Resource + Property + Property Value
Basic RDF Model Example
• RDF representation of the sentence:
  “Ora Lassila is the creator of the resource www.w3.org/Home/Lassila.”
Statement:
Subject (Resource) www.w3.org/Home/Lassila
Predicate (Property) Creator
Object (Literal) “Ora Lassila”
Basic RDF Model Example
• In general: <subject> HAS <predicate> <object>
  here: www.w3.org/Home/Lassila HAS Creator Ora Lassila
• Diagram of the statement:
  www.w3.org/Home/Lassila --Creator--> Ora Lassila
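One way to picture the model is as a set of (subject, predicate, object) triples. The sketch below is a plain-Python illustration, not an RDF library; the `Statement` type and the `objects` helper are hypothetical names.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property
    object: str     # the property value (a literal here)


statements = [
    Statement("www.w3.org/Home/Lassila", "Creator", "Ora Lassila"),
]


def objects(stmts: List[Statement], subject: str, predicate: str) -> List[str]:
    """All values of `predicate` for `subject`."""
    return [s.object for s in stmts if s.subject == subject and s.predicate == predicate]


print(objects(statements, "www.w3.org/Home/Lassila", "Creator"))
```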
RDF and XML
• RDF can be implemented using XML
• The complete XML for the previous example is:
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://description.org/schema/">
  <rdf:Description about="www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>
Structured Value Example
• “The employee with ID 85740, Ora Lassila, with Email [email protected], is the creator of the resource www.w3.org/Home/Lassila”
Diagram of the statements:
  www.w3.org/Home/Lassila --Creator--> www.w3.org/staffid/85740
  www.w3.org/staffid/85740 --Name--> Ora Lassila
  www.w3.org/staffid/85740 --Email--> [email protected]
In XML it is:
<rdf:RDF>
  <rdf:Description about="www.w3.org/Home/Lassila">
    <s:Creator>
      <rdf:Description about="www.w3.org/staffid/85740">
        <v:Name>Ora Lassila</v:Name>
        <v:Email>[email protected]</v:Email>
      </rdf:Description>
    </s:Creator>
  </rdf:Description>
</rdf:RDF>
RDF - more
• Property value can be a literal or a resource
• One subject can have more than one property
• It is possible to make statements about statements
• It is possible to refer to a collection of resources (containers) of 3 types:
  – Bag – a property has multiple values, order has no significance
  – Sequence – a property has multiple values, order is significant
  – Alternative – list of literals/resources representing alternatives for a single property
RDF Schemas and Namespaces
• The meaning of terms used in statements, like “Creator”, “Name”, “Email”, is expressed by referring to RDF Schemas (“domain-definition”)
• An RDF Schema provides information about the interpretation of the statements in a given RDF model
• An RDF Schema is usually a separate document
• To avoid confusion between different definitions of the same term, RDF Schemas use the Namespace facility.
xmlns:s="http://description.org/schema"
xmlns:v="http://description.org/differentschema"
<s:Creator>Ora Lassila</s:Creator>
<v:Creator>Ora Lassila</v:Creator>
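A namespace-aware parser shows that the two `Creator` elements are different terms even though their local names coincide. A sketch with Python's standard `xml.etree.ElementTree`, which expands each prefix to its namespace URI; the wrapping `rdf:RDF` document is assembled here from the fragments above.

```python
import xml.etree.ElementTree as ET

DOC = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:s="http://description.org/schema"
         xmlns:v="http://description.org/differentschema">
  <rdf:Description about="www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
    <v:Creator>Ora Lassila</v:Creator>
  </rdf:Description>
</rdf:RDF>
"""

root = ET.fromstring(DOC)
description = root[0]

# ElementTree reports tags as "{namespace-uri}local-name".
tags = [child.tag for child in description]
for tag in tags:
    print(tag)
```

Both elements carry the same text, but their expanded names differ, so each is interpreted against its own schema.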
RDF Query Language
• Querying RDF metadata
  – SQL/XQL style approach, viewing RDF metadata as a relational or XML database [RDF Query Specification (IBM)]
  – viewing Web descriptions by RDF metadata as a knowledge base, applying knowledge representation and reasoning techniques [W3C related]
• RQL
Ref: 139.91.183.30:9090/RDF/publications/bda01.PDF, 139.91.183.30:8999/RQLdemo/
RQL
subClassOf(Artist) subClassOf^(Artist)
SELECT $C1, $C2 FROM {$C1}creates{$C2}
SELECT X, Y FROM {X}last_modified{Y} WHERE Y >= 2000-01-01
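The slide does not spell out the semantics of the two `subClassOf` forms. The sketch below assumes the usual RQL reading, in which `subClassOf(Artist)` returns all (transitive) subclasses and `subClassOf^(Artist)` only the direct ones. It is a plain-Python illustration over an invented class hierarchy, not an RQL engine; every class name except `Artist` and both function names are hypothetical.

```python
# Hypothetical schema: child class -> parent class.
SUBCLASS_OF = {
    "Painter": "Artist",
    "Sculptor": "Artist",
    "Cubist": "Painter",
}


def direct_subclasses(cls: str) -> list:
    """Assumed meaning of subClassOf^(cls): immediate subclasses only."""
    return sorted(c for c, parent in SUBCLASS_OF.items() if parent == cls)


def all_subclasses(cls: str) -> list:
    """Assumed meaning of subClassOf(cls): transitive closure of the hierarchy."""
    found = []
    for c in direct_subclasses(cls):
        found.append(c)
        found.extend(all_subclasses(c))
    return sorted(found)


print(direct_subclasses("Artist"))
print(all_subclasses("Artist"))
```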
OWL
OWL
• Web Ontology Language
• used to explicitly represent meaning of terms in vocabularies and relationships between terms: ontology
  – ontology engineering
• beyond XML and RDF(S)
• revision of DAML+OIL
Stack
• XML: surface syntax for structured documents (no semantic constraints on meaning)
• XML Schema: restricting structure of XML documents
• RDF: datamodel for objects (resources) and relationships, provides simple semantics for this datamodel
• RDF Schema: vocabulary for describing properties and classes of RDF resources, with semantics for generalization-hierarchies
• OWL: adds vocabulary for describing properties and classes, e.g. relations between classes (disjoint), cardinality (exactly one), equality, richer typing of properties, characteristics of properties (symmetry), enumerated classes
OWL Sublanguages
• OWL Lite: classification hierarchy and simple constraints
• OWL DL: maximum expressiveness while retaining computational completeness and decidability (description logics)
• OWL Full: maximum expressiveness and syntactic freedom of RDF with no computational guarantees
OWL Lite
• RDF Schema features: Class, rdf:Property, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range, Individual
• (In)Equality: equivalentClass, equivalentProperty, sameIndividualAs, differentFrom, allDifferent
• Property characteristics: inverseOf, TransitiveProperty, SymmetricProperty, FunctionalProperty, InverseFunctionalProperty
• Property type restrictions: allValuesFrom, someValuesFrom
• Restricted cardinality: minCardinality (0/1), maxCardinality (0/1), cardinality (0/1)
• Class intersection: intersectionOf
OWL DL and Full
• Class axioms: oneOf, disjointWith, equivalentClass, rdfs:subClassOf (both applied to class expressions)
• Boolean combinations of class expressions: unionOf, intersectionOf, complementOf
• Arbitrary cardinality: minCardinality, maxCardinality, cardinality
References
• There is a lot of information available through the W3C site.
• Depending on your background, have a close look at some of the languages and the ideas behind them.