Semistructured data and XML
© Institutt for Informatikk INF3100 – 30.03.2016 – Ahmet Soylu
Unstructured data
• data can be of any type
• does not necessarily follow any format or sequence
• does not follow any rules
• is not predictable
• examples include: text, video, sound, images
Structured data
• data is organized in semantic chunks (entities)
• similar entities are grouped together (classes)
• entities in the same group have the same descriptions (attributes)
• descriptions for all entities in a group (schema):
  - have the same defined format
  - have a predefined length
  - are all present
  - follow the same order
Semistructured data
• organized in semantic entities
• similar entities are grouped together
• entities in the same group may not have the same attributes
• order of attributes is not necessarily important
• not all attributes may be required
• size of the same attribute may differ within a group
• type of the same attribute may differ within a group
Semistructured data Why semistructured data?
• Integration of databases
  - similar data with different schemas
• Information sharing on the Web
  - e.g., XML, JSON etc.
• Flexible: irregular structure, evolves rapidly
  - add new attributes freely
  - empty values
  - new relationships without needing to change a schema
Semistructured data Example
name: Peter Wood
email: [email protected], [email protected]

name:
  first name: Mark
  last name: Levene
email: [email protected]

name: Alex Poulovassilis
affiliation: Birkbeck
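The three records above can be sketched in Python as dictionaries that belong to one group but differ in shape. This is only an illustration: the email addresses are placeholders (the originals are not recoverable from the slide), and the helper names `attribute_names` and `emails` are hypothetical.

```python
# Three records from the same group, each with a different set and shape of attributes.
# The email addresses are placeholders, not the ones from the slide.
people = [
    {"name": "Peter Wood", "email": ["p.wood@example.org", "ptw@example.org"]},
    {"name": {"first name": "Mark", "last name": "Levene"}, "email": "mark@example.org"},
    {"name": "Alex Poulovassilis", "affiliation": "Birkbeck"},
]


def attribute_names(records):
    """Return, per record, the sorted names of the attributes that are present."""
    return [sorted(record) for record in records]


def emails(record):
    """Normalise the optional, single- or multi-valued email attribute to a list."""
    value = record.get("email", [])
    return value if isinstance(value, list) else [value]


for person in people:
    print(sorted(person), emails(person))
```

Code that consumes such data has to cope with missing attributes and with values whose type varies, which is what `emails` does for one attribute.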
Semistructured data Representation
Labelled directed graph:
• nodes: leaf or interior
• schema information is in the edge labels
• data is stored at the leaves

[Figure: labelled directed graph rooted at StarMovieData. Edge labels: Star, Movie, Name, Address, Street, City, Title, Year, StarsIn, StarOf. Leaf values: Carrie Fisher, Mark Hamill, Maple, Hollywood, Oak, Redwood, Locust, Malibu, Star Wars, 1977.]
Semistructured data Information integration
[Figure: an interface placed between two databases and other applications.]

• No common schema, legacy-database problem
• Approach: semistructured data with wrappers
Semistructured data Markup languages
• Markup languages allow marking up documents by representing structural, presentational, and semantic information alongside content
• Markup languages play a key role, notably XML
• XML is derived from SGML (Standard Generalized Markup Language)
• SGML is an ISO standard technology for defining markup languages
• HTML is another example of a markup language originally derived from SGML
XML Extensible Markup Language
• Follows a tag-based notation, similar to HTML
• HTML tags talk about the presentation while XML tags talk about the meaning

HTML:
  <html>
    <body>
      <i>This is italic</i>
      <p>This is a paragraph.</p>
    </body>
  </html>

XML:
  <note>
    <to>Tove</to>
    <from>Jani</from>
    <subject />
    <heading>Reminder</heading>
    <body>Call me!</body>
  </note>
XML With and without schema
XML can be used in different modes:
• Well-formed XML
  - no predefined schema
  - invent your own tags
  - nesting rules have to be obeyed (syntactically correct), i.e., the document has to be well-formed
• Valid XML
  - involves a schema definition
  - allowable tags and grammar are specified
  - sits between strict-schema and schemaless models
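Well-formedness can be checked with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the helper name `is_well_formed` is hypothetical. Note that this parser only checks well-formedness and does not validate against a DTD or XML Schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, i.e. tags are properly nested and closed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Invented tags are fine as long as the nesting rules are obeyed.
print(is_well_formed("<note><to>Tove</to><subject /></note>"))
# Overlapping tags break the nesting rules.
print(is_well_formed("<note><to>Tove</note></to>"))
```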
Well-formed XML
Begins with a declaration of the document type (i.e., XML)
It has a root element that is the entire body
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<someTag>
  ...
</someTag>

• encoding: character encoding
• standalone: well-formed or valid
• someTag: root element
Well-formed XML example
<?xml version="1.0" encoding="utf-8"?>
<StarMovieData>
  <Star>
    <Name>Carrie Fisher</Name>
    <Address>
      <Street>123 Maple Street</Street>
      <City>Hollywood</City>
    </Address>
  </Star>
  <Movie>
    <Title>Star Wars</Title>
    <Year>1977</Year>
  </Movie>
</StarMovieData>

[Figure: the same document drawn as a tree rooted at StarMovieData, with a Star subtree (Name, Address with Street and City) and a Movie subtree (Title, Year). Leaf values: Carrie Fisher, Maple, Hollywood, Star Wars, 1977.]
Well-formed XML Attributes
• XML elements can have attributes within opening tags
• An alternative way to represent a leaf node
• Attributes can represent labeled arcs

<Movie><Title>Star Wars</Title><Year>1977</Year></Movie>
<Movie year="1977"><Title>Star Wars</Title></Movie>
<Movie title="Star Wars" year="1977"></Movie>
<Movie title="Star Wars" year="1977" />
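The two representations of a leaf node are read differently by a parser. A small sketch with Python's standard `xml.etree.ElementTree` (variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

# The same movie, once with leaf data in subelements and once in attributes.
as_elements = ET.fromstring(
    "<Movie><Title>Star Wars</Title><Year>1977</Year></Movie>"
)
as_attributes = ET.fromstring('<Movie title="Star Wars" year="1977" />')

# Subelement text is reached by navigating to the child element.
year_from_element = as_elements.findtext("Year")
# Attribute values are looked up by name on the element itself.
year_from_attribute = as_attributes.get("year")

print(year_from_element, year_from_attribute)
```

Attribute values must be quoted, and an attribute name can occur at most once per element, whereas a subelement may repeat.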
Well-formed XML Attributes
Attributes can also represent relationships:

<?xml version="1.0" encoding="utf-8"?>
<StarMovieData>
  <Star starID="cf" starredIn="sw">
    <Name>Carrie Fisher</Name>
    <Address>
      <Street>123 Maple Street</Street>
      <City>Hollywood</City>
    </Address>
  </Star>
  <Movie movieID="sw" starOf="cf">
    <Title>Star Wars</Title>
    <Year>1977</Year>
  </Movie>
</StarMovieData>
Well-formed XML Namespaces
• Can qualify the tags in the XML document
• Facilitate reuse of vocabularies
• Use several vocabularies in the same XML document without name conflicts
• A namespace is specified by a URI, which is typically a URL that refers to a document describing the interpretation of the tags in the namespace
• This document can be an XML document, an informal document (HTML), ... or nothing
Well-formed XML Namespaces
HTML table:
  <table>
    <tr>
      <td>Apples</td>
      <td>Bananas</td>
    </tr>
  </table>

A real table:
  <table>
    <name>African Coffee Table</name>
    <width>80</width>
    <length>120</length>
  </table>

Both in one document, qualified by namespaces:
  <root>
    <h:table xmlns:h="http://www.w3.org/">
      <h:tr>
        <h:td>Apples</h:td>
        <h:td>Bananas</h:td>
      </h:tr>
    </h:table>
    <f:table xmlns:f="http://www.furniture.com">
      <f:name>African Coffee Table</f:name>
      <f:width>80</f:width>
      <f:length>120</f:length>
    </f:table>
  </root>
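A parser keeps the two `table` elements apart by expanding each prefix to its namespace URI. A minimal sketch with Python's standard `xml.etree.ElementTree`, using the combined document from the slide (the variable names and the `ns` prefix map are illustrative):

```python
import xml.etree.ElementTree as ET

document = """
<root>
  <h:table xmlns:h="http://www.w3.org/">
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table xmlns:f="http://www.furniture.com">
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>
"""

root = ET.fromstring(document)

# Prefixes used in queries are mapped to namespace URIs; only the URIs matter.
ns = {"h": "http://www.w3.org/", "f": "http://www.furniture.com"}

html_cells = [td.text for td in root.findall("h:table/h:tr/h:td", ns)]
furniture_name = root.findtext("f:table/f:name", namespaces=ns)

# Internally each tag is stored as {namespace-URI}local-name.
expanded_tags = [child.tag for child in root]

print(html_cells, furniture_name, expanded_tags)
```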
Well-formed XML XML and Databases
• It is common for computers to share data across the internet by passing messages in XML
• It is increasingly common for XML to be used for data storage, similar to relational databases
• How do we achieve efficient data access with XML?
  - Store XML data in parsed form, e.g., via SAX (Simple API for XML) and DOM (Document Object Model)
  - Represent documents and their elements as relations and store them in conventional databases
Well-formed XML XML and Databases
A possible relational schema for storing XML is:

• DocRoot(docID, rootElementID): relates document IDs to the IDs of their root element
• SubElement(parentID, childID, position): connects an element to each of its immediate subelements
• ElementAttribute(elementID, name, value): relates elements to their attributes
• ElementValue(elementID, value): relates leaf elements to their values
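One way to populate these four relations is a depth-first walk over the parsed document. The sketch below is an assumption-laden illustration, not a prescribed method: the function name `shred`, the sequential integer element IDs, and the tuple layout are all hypothetical. The four relations as given have no column for the tag name, so the sketch does not record it either.

```python
import itertools
import xml.etree.ElementTree as ET


def shred(doc_id, xml_text):
    """Split one XML document into the four relations, each a list of tuples."""
    ids = itertools.count(1)  # hypothetical generator of element IDs
    doc_root, sub_element, element_attribute, element_value = [], [], [], []

    def visit(element):
        element_id = next(ids)
        # ElementAttribute(elementID, name, value)
        for name, value in element.attrib.items():
            element_attribute.append((element_id, name, value))
        children = list(element)
        # ElementValue(elementID, value): only leaf elements carry a value
        if not children and element.text and element.text.strip():
            element_value.append((element_id, element.text.strip()))
        # SubElement(parentID, childID, position)
        for position, child in enumerate(children, start=1):
            child_id = visit(child)
            sub_element.append((element_id, child_id, position))
        return element_id

    # DocRoot(docID, rootElementID)
    doc_root.append((doc_id, visit(ET.fromstring(xml_text))))
    return doc_root, sub_element, element_attribute, element_value


relations = shred(
    "d1", '<Movie movieID="sw"><Title>Star Wars</Title><Year>1977</Year></Movie>'
)
for relation in relations:
    print(relation)
```

The `position` column preserves the order of subelements, which a relation would otherwise lose.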
Valid XML
• Valid: well-formed and follows a particular schema
• A schema is a definition of the syntax of an XML-based language (i.e., it defines a class of XML documents)
• Allows automatically interpreting the meaning or semantics of the elements
• Two prominent alternatives: XML DTD (Document Type Definition) and XML Schema
Valid XML XML DTD
<!DOCTYPE StarMovieData [
  <!ELEMENT StarMovieData (Star*, Movie*)>
  <!ELEMENT Star (Name, Address+)>
  <!ATTLIST Star
    starId    ID     #REQUIRED
    starredIn IDREFS #IMPLIED>
  <!ELEMENT Name (#PCDATA)>
  <!ELEMENT Address (Street, (City | Zip))>
  <!ELEMENT Street (#PCDATA)>
  <!ELEMENT City (#PCDATA)>
  <!ELEMENT Movie (Title, Year, Genre)>
  <!ATTLIST Movie
    movieId ID     #REQUIRED
    starsOf IDREFS #IMPLIED>
  <!ELEMENT Title (#PCDATA)>
  <!ELEMENT Year (#PCDATA)>
  <!ELEMENT Genre (Comedy | Drama | SciFi)>
]>

• ELEMENT: element declaration
• ATTLIST: attribute declarations
• #PCDATA: data should be parsed
• CDATA: data should not be parsed
• #REQUIRED: attribute must be present
• #IMPLIED: attribute is optional
• ID: defines an identifier
• IDREF, IDREFS: reference(s) to other elements

• *: element may occur any number of times
• +: element may occur 1 or more times
• ?: element may occur 0 or 1 time
• |: exactly 1 option appears
Valid XML XML Schema
• It is more powerful than DTD
  - provides far more control for the developer over what is legal, and a detailed way to define what the data can and cannot contain
  - allows arbitrary restrictions on the number of occurrences of subelements
  - allows declaring types such as integer, float, ...
  - gives the ability to declare keys and foreign keys
• XML schemas themselves are XML documents
XML Schema
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
</xs:schema>
XML Schema Elements and simple types
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Title" type="xs:string" />
  <xs:element name="Year" type="xs:integer" />
</xs:schema>
XML Schema Complex types
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="movieType">
    <xs:sequence>
      <xs:element name="Title" type="xs:string" />
      <xs:element name="Year" type="xs:integer" />
    </xs:sequence>
  </xs:complexType>
  <xs:element name="Movies">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Movie" type="movieType" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XML Schema Example XML document
<?xml version="1.0" encoding="utf-8"?>
<Movies xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="Movies.xsd">
  <Movie>
    <Title>Star Wars</Title>
    <Year>1977</Year>
  </Movie>
  <Movie> … </Movie>
  …
</Movies>
XML Schema Attributes
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="movieType">
    <xs:sequence>
      <xs:element name="Title" type="xs:string" />
      <xs:element name="Year" type="xs:integer" />
    </xs:sequence>
    <!-- attribute declarations must follow the content model -->
    <xs:attribute name="movieID" type="xs:string" use="required" />
    <xs:attribute name="starOf" type="xs:string" />
  </xs:complexType>
  <xs:element name="Movies">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Movie" type="movieType" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XML Schema Example XML Document
<?xml version="1.0" encoding="utf-8"?>
<Movies xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="Movies.xsd">
  <Movie movieID="sw">
    <Title>Star Wars</Title>
    <Year>1977</Year>
  </Movie>
  <Movie movieID="rj"> … </Movie>
  …
</Movies>
XML Schema Restricted Simple Types
Restrict numerical values with minInclusive and maxInclusive:

<xs:simpleType name="movieYearType">
  <xs:restriction base="xs:integer">
    <xs:minInclusive value="1915" />
  </xs:restriction>
</xs:simpleType>

Restrict values to an enumerated type:

<xs:simpleType name="genreType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="comedy" />
    <xs:enumeration value="drama" />
    <xs:enumeration value="sciFi" />
  </xs:restriction>
</xs:simpleType>
XML Schema Restricted Simple Types
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="movieType">
    <xs:sequence>
      <xs:element name="Title" type="xs:string" />
      <xs:element name="Year" type="movieYearType" />
      <xs:element name="Genre" type="genreType" />
    </xs:sequence>
    <xs:attribute name="movieID" type="xs:string" use="required" />
    <xs:attribute name="starOf" type="xs:string" />
  </xs:complexType>
  <xs:element name="Movies">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Movie" type="movieType" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XML Schema Keys
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  ...
  <xs:element name="Movies">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Movie" type="movieType" minOccurs="0" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
    <xs:key name="movieKey">
      <xs:selector xpath="Movie" />
      <xs:field xpath="Title" />
      <xs:field xpath="Year" />
    </xs:key>
  </xs:element>
</xs:schema>
XML Schema Foreign Keys
…
<xs:element name="Stars">
  <xs:complexType>
    …
    <xs:element name="StarredIn" minOccurs="0" maxOccurs="unbounded">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="title" type="xs:string" />
          <xs:element name="year" type="xs:integer" />
        </xs:sequence>
      </xs:complexType>
    </xs:element>
    …
  </xs:complexType>
  <xs:keyref name="movieRef" refer="movieKey">
    <xs:selector xpath="Star/StarredIn" />
    <xs:field xpath="title" />
    <xs:field xpath="year" />
  </xs:keyref>
</xs:element>
XML Programming Languages
• XPath
  - uses path expressions to navigate in XML documents
• XQuery
  - is the language for querying XML data and is built on XPath expressions (like SQL for DBs)
• XSLT
  - transforms an XML document into another XML document
XPath
• XPath expressions generally return a sequence of items that satisfy certain patterns
• A sequence of elements can be specified using an absolute or relative path

/Movies : root element and all its content
/Movies/Movie : all Movie elements inside (direct children of) the Movies element
/Movies//Title : all Title elements inside (at any level) the Movies element
* : any element
/Movies/Movie[Year="1980"] : all Movie elements with Year value 1980
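These path expressions can be tried out with the limited XPath subset in Python's standard `xml.etree.ElementTree`. In the sketch below the two-movie document is sample data made up for illustration, and paths are written relative to the root element because `ElementTree` does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Sample data for illustration only.
movies = ET.fromstring("""
<Movies>
  <Movie><Title>Star Wars</Title><Year>1977</Year></Movie>
  <Movie><Title>The Empire Strikes Back</Title><Year>1980</Year></Movie>
</Movies>
""")

# /Movies/Movie : direct children of the root element
direct_children = movies.findall("Movie")

# /Movies//Title : Title elements at any depth below the root
all_titles = [title.text for title in movies.findall(".//Title")]

# /Movies/Movie[Year="1980"] : predicate on the text of a child element
from_1980 = [m.findtext("Title") for m in movies.findall("Movie[Year='1980']")]

# * : any element
any_children = [child.tag for child in movies.findall("*")]

print(len(direct_children), all_titles, from_1980, any_children)
```

Note that the predicate attaches directly to the step it filters (`Movie[...]`), with no slash in between.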
XQuery
• Allows specification of more complex queries on one or more documents
• The typical form of an XQuery query is known as a FLWR expression:

FOR <variable bindings to individual nodes>
LET <variable bindings to collection of nodes>
WHERE <qualifier conditions>
RETURN <query result specification>
XQuery Example XML Document
<?xml version="1.0" encoding="utf-8"?>
<Movies>
  <Movie genre="comedy">
    <Title>Bruce Almighty</Title>
    <Star><Name>Jim Carrey</Name></Star>
  </Movie>
  <Movie genre="comedy">
    <Title>Dumb &amp; Dumber</Title>
    <Star><Name>Jim Carrey</Name></Star>
  </Movie>
  <Movie genre="drama">
    <Title>The Truman Show</Title>
    <Star><Name>Jim Carrey</Name></Star>
  </Movie>
  <Movie genre="comedy">
    <Title>Nine Months</Title>
    <Star><Name>Hugh Grant</Name></Star>
  </Movie>
</Movies>
XQuery Example XQuery
Find all comedy movies in which Jim Carrey is an actor:

let $movies := doc("movies.xml")
for $movie in $movies//Movie[@genre="comedy"]
where $movie/Star[Name="Jim Carrey"]
return $movie/Title

Find the cities of the stars that are mentioned in movies.xml:

let $movies := doc("movies.xml")
let $stars := doc("stars.xml")
for $s1 in $movies/Movies/Movie/Star,
    $s2 in $stars/Stars/Star
where data($s1) = data($s2/Name)
return $s2/Address/City
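Python's standard library has no XQuery engine, so the following is only a rough procedural analogue of the first query, written with `xml.etree.ElementTree` to show how the for, where, and return clauses line up with a loop, a condition, and a collected result. The function name `comedies_with` is hypothetical, and the movie document is inlined (with `&` escaped) rather than loaded from movies.xml.

```python
import xml.etree.ElementTree as ET

movies = ET.fromstring("""
<Movies>
  <Movie genre="comedy"><Title>Bruce Almighty</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="comedy"><Title>Dumb &amp; Dumber</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="drama"><Title>The Truman Show</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="comedy"><Title>Nine Months</Title><Star><Name>Hugh Grant</Name></Star></Movie>
</Movies>
""")


def comedies_with(star_name):
    """Titles of comedy movies that list star_name among their stars."""
    titles = []
    # for $movie in $movies//Movie[@genre="comedy"]
    for movie in movies.findall(".//Movie[@genre='comedy']"):
        names = [name.text for name in movie.findall("Star/Name")]
        # where $movie/Star[Name="..."]
        if star_name in names:
            # return $movie/Title (here: its text)
            titles.append(movie.findtext("Title"))
    return titles


print(comedies_with("Jim Carrey"))
```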
XQuery Other features
• Eliminating duplicates
  - let $s := distinct-values(…)
• Quantifiers
  - every $s in … satisfies …
  - some $s in … satisfies …
• Aggregation (count, sum, max, …)
• Branching
  - if (…) then … else …
XSLT
• Extensible Stylesheet Language for Transformations
• original purpose is to transform XML documents to other document forms (XML, HTML etc.)
• in practice it is another query language
• uses XPath for navigating in XML documents
XSLT Example
XSLT stylesheet:
  <?xml version="1.0" encoding="utf-8" ?>
  <xsl:stylesheet xmlns:xsl="http:...XSL/Transform" version="1.0">
    <xsl:output method="xml" indent="yes" />
    <xsl:template match="/Movies">
      <ComedyMovies>
        <xsl:apply-templates />
      </ComedyMovies>
    ...

XML document (input):
  <?xml version="1.0" encoding="utf-8"?>
  <Movies>
    <Movie genre="comedy">
      <Title>Bruce Almighty</Title>
      <Star><Name>Jim Carrey</Name></Star>
    </Movie>
    ...

[Figure: the XSLT processor takes the stylesheet and the input XML document and produces the output XML document.]

XML document (output):
  <?xml version="1.0" encoding="utf-8"?>
  <ComedyMovies>
    <Comedy title="Bruce Almighty" />
    <Comedy title="Dumb &amp; Dumber" />
    <Comedy title="Nine Months" />
  </ComedyMovies>
XSLT Example
<?xml version="1.0" encoding="utf-8" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" indent="yes" />
  <xsl:template match="/Movies">
    <ComedyMovies>
      <xsl:apply-templates />
    </ComedyMovies>
  </xsl:template>
  <xsl:template match="Movie[@genre='comedy']">
    <xsl:apply-templates select="Title" />
  </xsl:template>
  <!-- skip non-comedy movies; the built-in rules would otherwise reach their Title -->
  <xsl:template match="Movie" />
  <xsl:template match="Title">
    <Comedy title="{.}" />
  </xsl:template>
</xsl:stylesheet>
Some online resources
• XML: http://www.w3schools.com/xml/
• XPath: www.w3schools.com/xpath/
• XPath tester: http://www.xpathtester.com/test
• XQuery: www.w3schools.com/xquery/
• XQuery tester: http://www.zorba-xquery.com/html/demo
• XSLT: www.w3schools.com/XSL/
• XSLT tester: http://www.w3.org/2005/08/online_xslt/