1
San Diego Supercomputer Center
National Partnership for Advanced Computational Infrastructure
Information Integration with XML, Part II
Chaitan Baru, Richard Marciano
{baru,marciano}@sdsc.edu
Data Intensive Computing Group
San Diego Supercomputer Center
2
PART II
• Storing XML documents
• Querying XML documents
• XML and GIS
• Technical Issues
• Projects at SDSC
3
Storing XML documents
• Pure XML data servers
  • Documents are stored in native XML form
  • XML-based query languages are used to retrieve data
• Relational DBMSs
  • Documents are stored as BLOBs
  • Or, XML elements are mapped to columns in tables
  • SQL is used to retrieve data
4
Storing in “pure” XML data servers
• eXcelon, from eXcelon Corp. (ex ODI)
• Dynamic Application Platform
  • Data Server
  • Toolbox
  • Xconnects
• B2B Integration Services
  • B2B Translator
  • Business Process Workflow Engine
  • Enterprise Connectivity
  • Business Module eXtensions
5
eXcelon
• Stores XML and non-XML (BLOB) data
• Supports queries and indexes on stored XML data
• Uses a file system metaphor
• Supports the use of (server-side) XSL stylesheets
• Provides visual tools (Studio, Explorer, Manager, Stylus)
• Provides Web & COM client interfaces
• Provides Java & COM APIs to extend data server
• Supports DOM for data access on the server
• Can distribute XML data access across caches
• Connects to 70 sources using ADBC / ADO
6
eXcelon Tool Box
• Studio:
  • define XML schemas (DCD)
  • generate XML-based pages
• Explorer:
  • browser to view, import, organize, modify, query and set security on data
  • XPath / XQL query wizard
• Manager:
  • administer & configure
  • set server properties
  • set load balancing parameters
• Stylus:
  • Build Web pages using XML & XSL
  • Transforms XML to HTML
7
eXcelon Xconnects
Connect to any data source:
• Cobol
• dBaseIII
• Act
• etc.
8
Programming against eXcelon
• Out-of-the-box tools: “no writing code”
  • create / update / delete / query XML
• Programming:
  • In eXcelon server extensions
    • COM / Java & DOM to manipulate XML contained in eXcelon XMLStores
  • In Web server
    • Active Server application that uses the eXcelon COM client API and ships HTML to the browser. XSL can be applied in the context of the Web server
  • In browser
    • DHTML (VBScript, JavaScript, Visual Basic) or Java applet that manipulates XML. Apply XSL stylesheet in the browser
9
Storing XML documents in RDBMSs
• Store documents as BLOBs
• Map document elements into a set of relational tables
  • Need a DTD or schema for documents
  • Need to map the XML DTD or schema into a relational schema
  • Relational schema will capture the hierarchical “containment” relationship among elements as 1-1 or 1-many relationships
10
Example of an XML document and DTD
[Figure: document tree]
Publication (attribute: Pub_ID)
  Title: XML Tutorial
  AuthorName: Richard Marciano
  AuthorName: Chaitan Baru
  Abstract
  Section
    Heading: Intro
    Para
    Para
DTD
<!ELEMENT Publications (Publication)*>
<!ELEMENT Publication (Title, AuthorName+, Abstract, Section*)>
<!ELEMENT Section (Heading, Paragraph*)>
<!ATTLIST Publication Pub_ID ID #REQUIRED>
11
Store data as BLOBs in RDBMS
• Store XML document as BLOB, with text/path indexes
[Figure: an XML document (<title>, <abstract>) stored in the RDBMS as a text BLOB with a text index]
12
Provide indexing of XML documents
• Store specified elements as columns in a table
[Figure: an XML document (<title>, <abstract>) stored in the RDBMS as a text BLOB with a text index, plus a Title column with a column index]
Map DTD to a relational schema
• Un-nest the DTD hierarchy
• Stop at a point where it is “sufficient” to represent an element as a single compound value, rather than a hierarchy (e.g. Address)
Resulting tables:
• Publication (Pub_ID, Title, Abstract)
• Author (Auth_ID, Pub_ID, AuthName)
• Section (Sec_Num, Pub_ID, Heading)
• Paragraph (Sec_Num, Pub_ID, Para_Num, Text)
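The un-nesting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard-library `sqlite3` and `xml.etree.ElementTree` modules rather than any product named in these slides, the sample document (including the `Pub_ID` value `p1` and the paragraph text) is made up, and `Auth_ID` and `Sec_Num` are simply numbered within each publication.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample document that follows the Publication DTD.
SAMPLE = """
<Publications>
  <Publication Pub_ID="p1">
    <Title>XML Tutorial</Title>
    <AuthorName>Richard Marciano</AuthorName>
    <AuthorName>Chaitan Baru</AuthorName>
    <Abstract>...</Abstract>
    <Section>
      <Heading>Intro</Heading>
      <Paragraph>First paragraph.</Paragraph>
      <Paragraph>Second paragraph.</Paragraph>
    </Section>
  </Publication>
</Publications>
"""

# One table per un-nested element, as on the slide.
SCHEMA = """
CREATE TABLE Publication (Pub_ID TEXT PRIMARY KEY, Title TEXT, Abstract TEXT);
CREATE TABLE Author    (Auth_ID INTEGER, Pub_ID TEXT, AuthName TEXT);
CREATE TABLE Section   (Sec_Num INTEGER, Pub_ID TEXT, Heading TEXT);
CREATE TABLE Paragraph (Sec_Num INTEGER, Pub_ID TEXT, Para_Num INTEGER, Text TEXT);
"""

def shred(xml_text, conn):
    """Walk the document tree and insert one row per element instance."""
    root = ET.fromstring(xml_text)
    for pub in root.findall("Publication"):
        pub_id = pub.get("Pub_ID")
        conn.execute("INSERT INTO Publication VALUES (?, ?, ?)",
                     (pub_id, pub.findtext("Title"), pub.findtext("Abstract")))
        for auth_id, author in enumerate(pub.findall("AuthorName"), start=1):
            conn.execute("INSERT INTO Author VALUES (?, ?, ?)",
                         (auth_id, pub_id, author.text))
        for sec_num, section in enumerate(pub.findall("Section"), start=1):
            conn.execute("INSERT INTO Section VALUES (?, ?, ?)",
                         (sec_num, pub_id, section.findtext("Heading")))
            for para_num, para in enumerate(section.findall("Paragraph"), start=1):
                conn.execute("INSERT INTO Paragraph VALUES (?, ?, ?, ?)",
                             (sec_num, pub_id, para_num, para.text))

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
shred(SAMPLE, conn)
authors = [r[0] for r in conn.execute("SELECT AuthName FROM Author ORDER BY Auth_ID")]
print(authors)
```

The 1-many containment (Publication to Author, Section to Paragraph) shows up as the repeated `Pub_ID` and `Sec_Num` key columns in the child tables.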
14
Storing un-nested DTD hierarchies
• Store document elements across multiple tables
  • (Not yet available in COTS products)
[Figure: an XML document (<title>, two <author> elements, <abstract>) stored in the RDBMS across two tables: Publication (Pub_ID, Title, Abstract) and Author (Auth_ID, Pub_ID, AuthName)]
15
Retrieving XML data from DBMS
• Retrieving from pure XML data servers
  • Use XML query languages, e.g. XQL
• Retrieving from RDBMS
  • Use SQL to query data from database tables
  • “Wrap” output of SQL query as an XML document
  • Define XML views over relational schemas (Xviews)
    • Use SQL statement(s) to create XML output
16
Retrieving from “pure” XML servers
• XML Query Language (XQL), supported by eXcelon
• Reference
  • http://www.w3.org/TandS/QL/QL98/pp/xql.html
• Example: Publication DTD
<!ELEMENT Publications (Publication)*>
<!ELEMENT Publication (Title, AuthorName+, Abstract, Section*)>
<!ELEMENT Section (Heading, Paragraph*)>
Publications
  Publication
    (Title, AuthorName+, Abstract, Section*(Heading, Paragraph*))
17
XQL
• Example queries:
  • Output all section headings of all publications
    /Publications/Publication/Section/Heading
  • Output all documents that have a section called “Conclusion”
    /Publications/Publication[Section/Heading="Conclusion"]
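For readers without an XQL engine at hand, the two queries can be approximated with Python's standard `xml.etree.ElementTree`. This is a sketch, not XQL itself: ElementTree supports only a subset of path syntax, so the filter on `Section/Heading` is written as a Python condition, and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the Publication DTD (abridged).
DOC = """
<Publications>
  <Publication><Title>XML Tutorial</Title>
    <Section><Heading>Intro</Heading></Section>
    <Section><Heading>Conclusion</Heading></Section>
  </Publication>
  <Publication><Title>XML Primer</Title>
    <Section><Heading>Overview</Heading></Section>
  </Publication>
</Publications>
"""
root = ET.fromstring(DOC)  # root is the <Publications> element

# Equivalent of /Publications/Publication/Section/Heading
headings = [h.text for h in root.findall("Publication/Section/Heading")]

# Equivalent of /Publications/Publication[Section/Heading="Conclusion"]
with_conclusion = [
    pub.findtext("Title")
    for pub in root.findall("Publication")
    if any(h.text == "Conclusion" for h in pub.findall("Section/Heading"))
]

print(headings)
print(with_conclusion)
```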
18
XML-QL
• XML Query Language (XML-QL)
  • http://www.w3.org/TR/1998/NOTE-xml-ql-19980819/
• XML-QL example
WHERE <Publications>
        <Publication>
          <Title> XML Tutorial </Title>
          <Section> $S </Section>
          <AuthorName> $A </AuthorName>
        </Publication>
      </Publications> IN "www.sdsc.edu/publications/pubs.xml"
CONSTRUCT $A
• Meaning: list all authors of all publications with title = “XML Tutorial” that have at least one section and one author
19
XMAS
• XML Matching And Structuring (XMAS)
CONSTRUCT <my_authors>
            <my_author> $A </my_author>
          </my_authors>
WHERE <Publications>
        <Publication>
          <Title> $T </Title>
          <Section> $S </Section>
          <AuthorName> $A </AuthorName>
        </Publication>
      </Publications> IN "http://www.sdsc.edu/publications/pubs.xml"
  AND substr("XML Tutorial", $T)
20
Retrieving XML from RDBMS servers
• Wrapping SQL output in XML
• Example query:
SELECT Title, AuthName
FROM Publication, Author
WHERE Publication.Pub_ID = Author.Pub_ID
• Result:
<result>
  <row>
    <title> XML Tutorial </title>
    <author>Marciano</author>
  </row>
  <row>
    <title> XML Tutorial </title>
    <author>Baru</author>
  </row>
</result>
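The wrapping step is mechanical, as the following Python sketch suggests. It assumes an in-memory SQLite database standing in for the RDBMS; the helper name `query_to_xml` and the inserted rows are hypothetical, and the `ORDER BY` clause is added only to make the row order predictable.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Publication (Pub_ID TEXT, Title TEXT, Abstract TEXT);
CREATE TABLE Author (Auth_ID INTEGER, Pub_ID TEXT, AuthName TEXT);
INSERT INTO Publication VALUES ('p1', 'XML Tutorial', '...');
INSERT INTO Author VALUES (1, 'p1', 'Marciano');
INSERT INTO Author VALUES (2, 'p1', 'Baru');
""")

def query_to_xml(conn, sql, tags):
    """Run a query and wrap each result row as <row> inside <result>."""
    result = ET.Element("result")
    for row in conn.execute(sql):
        row_el = ET.SubElement(result, "row")
        for tag, value in zip(tags, row):
            # NULL columns become empty elements.
            ET.SubElement(row_el, tag).text = "" if value is None else str(value)
    return ET.tostring(result, encoding="unicode")

xml_out = query_to_xml(
    conn,
    "SELECT Title, AuthName FROM Publication, Author "
    "WHERE Publication.Pub_ID = Author.Pub_ID ORDER BY Author.Auth_ID",
    ["title", "author"],
)
print(xml_out)
```

Note that the output stays as flat as the relational result: the title is repeated in every row, which is the limitation that XML views over the schema are meant to address.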
21
Retrieving from RDBMS Servers - 2
• Define XML views over relational schemas
  • How to interpret relational data as XML documents
  • Relational schemas are “flat”, XML documents are hierarchical
[Figure: relational hierarchy with an example at each level]
• Database: PublicationsDB
• Relations: Publication
• Tuples: t1, t2, t3
• Attributes: Title, Author, Abstract
22
The Xviews concept
• Derive a DTD by “identifying” a “containment” relationship among the set of tables
• Example: the “canonical” data warehouse schema
[Figure: Lineitem table linked to Region, Product, and Customer tables]
• Candidate containment: Lineitem contains Customer, Product, Region
• DTD
<!ELEMENT Lineitem (Customer, Product, Region)>
The Xviews concept
• Alternative containment: Customer contains Lineitem; Lineitem contains Product, Region
• DTD
<!ELEMENT Customer (Lineitem*)>
<!ELEMENT Lineitem (Product, Region)>
• Note: outer joins are needed in order to output customers who have no lineitems
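The outer-join remark can be made concrete with a small Python sketch. It is not the Xviews implementation: the column names, the sample rows, the `Customers` wrapper element, and the `name` attribute are all hypothetical, and SQLite stands in for the RDBMS. A `LEFT OUTER JOIN` keeps the customer with no lineitems, which an inner join would silently drop.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (Cust_ID INTEGER, Name TEXT);
CREATE TABLE Lineitem (Item_ID INTEGER, Cust_ID INTEGER, Product TEXT, Region TEXT);
INSERT INTO Customer VALUES (1, 'Acme');
INSERT INTO Customer VALUES (2, 'Globex');
INSERT INTO Lineitem VALUES (10, 1, 'Widget', 'West');
INSERT INTO Lineitem VALUES (11, 1, 'Gadget', 'East');
""")

rows = conn.execute("""
    SELECT c.Cust_ID, c.Name, l.Item_ID, l.Product, l.Region
    FROM Customer c LEFT OUTER JOIN Lineitem l ON c.Cust_ID = l.Cust_ID
    ORDER BY c.Cust_ID, l.Item_ID
""").fetchall()

# Regroup the flat join result into the Customer (Lineitem*) containment.
root = ET.Element("Customers")
current_id, cust_el = None, None
for cust_id, name, item_id, product, region in rows:
    if cust_id != current_id:
        cust_el = ET.SubElement(root, "Customer", name=name)
        current_id = cust_id
    if item_id is not None:  # NULL marks a customer with no lineitems
        item_el = ET.SubElement(cust_el, "Lineitem")
        ET.SubElement(item_el, "Product").text = product
        ET.SubElement(item_el, "Region").text = region

print(ET.tostring(root, encoding="unicode"))
```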
24
RDBMS product support
[Figure: SQL queries run against database tables in the RDBMS; the query output is packaged into an XML document]
25
Oracle: XML SQL and XSQL Utility
• Retrieving data as XML
  • Generates an XML document from SQL queries
  • Outputs text or a Document Object Model (DOM) from a SQL query string or a JDBC ResultSet object
• Inserting XML data into tables
  • Writes data from an XML document into a (single) database table or (updateable) view
26
XSQL Utility/Servlet
[Figure: architecture diagram]
• Components: Web browser, Web Server, XSQL Servlet, XSLT processor, XML SQL utility, Oracle 8i
• Request: {xsql filename, params, XSL stylesheet}
• Input: XML-formatted SQL queries (.xsql)
• Response: query result in XML, or transformed into HTML by XSL
27
XSQL Example
Example XSQL file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="query1.xsl"?>
<query connection="PublicationDB"
       doc-element="Publications"
       row-element="Publication">
  SELECT title, abstract, authorname
  FROM publication p, author a
  WHERE p.Pub_ID = a.Pub_ID
</query>
Sample XML output:
<Publications>
  <Publication>
    <title>XML Tutorial</title>
    <abstract>...</abstract>
    <authorname>Marciano</authorname>
  </Publication>
  <Publication>
    <title>XML Tutorial</title>
    <abstract>...</abstract>
    <authorname>Baru</authorname>
  </Publication>
  ..... more rows...
</Publications>
28
DB2: XML Extender
• XML Extender in UDB Version 6.1
• XML_Column type and Document Access Definitions (DADs)
• Insertion into a column of type XML_Column triggers extraction of elements specified in DADs
[Figure: an XML document (<title>, <abstract>) inserted into an XML Column in the RDBMS; per the DAD, Title is extracted into its own column alongside the XML BLOB]
29
GIS and XML
• Represent GIS metadata in XML
• Represent spatial features in XML
30
GIS & XML: 1st experiment (the data)
31
GIS & XML: 1st experiment (XML wrapping)
32
Exporting GIS data in XML
33
Work in Progress: Spatial XML Markup Languages
• Geography Markup Language (GML) 1.0
  • OGC Working Draft 17-Jan-2000
• Web Mapping Testbed (WMT): NIMA, USACoE, FGDC, NASA, USDA, USGS ...
• Digital Earth (www.digitalearth.gov)
• AXL (ArcXML) pre-release
  • part of ESRI ArcIMS 31-Jan-2000
34
GML
• GML: XML specification to encode geographic information, for both data storage and data transport
• Initial release deals with OGC Simple Features:
  • vector geodata: e.g. digital map info (streets, population, land use zones, property lines, watersheds, etc.)
• GML is not concerned with the visualization of geographic features (drawing of maps)
[Figure: uses of GML in XML]
• Direct rendering to a graphic format
• Transformation into a vector graphics rendering format
  • SVG
  • VML
  • VRML
• Direct routing without visualization to a numerical model
35
GML
• Goal: enable organizations to share geographic information and to enable linked geographic datasets
• When GML data is exchanged over the Internet, it is transmitted in a “feature collection”
• GML Simple Features:
  • geometry classes: Point, LineString, Polygon
  • geometry properties: coordinate lists, spatial reference system name
    • pointproperty
    • linestringproperty
    • polygonproperty
    • multipointproperty
    • multilineproperty
    • multipolygonproperty
36
GML Example
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE FeatureCollection SYSTEM "FeatureCollection.dtd" [
  <!--Description : Illustration of an area feature using the polygon property.
      Author : Ron Lake -->
]>
<FeatureCollection xmlns:ogcgml="http://www.opengis.org/gml#">
  <BoundingBox>
    <coordinates>0.0,0.0 3.0,4.0</coordinates>
  </BoundingBox>
  <Feature typeName="http://www.usgs.org/tp#Building" ID="1">
    <Description>Hotel Vancouver</Description>
    <Property typeName="http://www.usgs.org/tp#Number of Rooms" type="int">4</Property>
    <polygonproperty parseType="Resource" roleName="http://www.usgs.org/tp#extent"
                     srsName="http://www.opengis.org/srs/epsg:26751">
      <type resource="http://www.opengis.org/gml#Polygon" />
      <boundary parseType="Resource">
        <type resource="http://www.opengis.org/gml#LineString" />
        <coordinates>0.0,0.0 1.123,1.56 2.34,4.5 0.0,0.0</coordinates>
      </boundary>
    </polygonproperty>
  </Feature>
</FeatureCollection>
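A consumer of such a document mostly needs to turn the `coordinates` text into numeric vertices. The Python sketch below shows one way to do that with the standard library; it works on an abridged fragment of the example (DOCTYPE, attributes on inner elements, and the `type` elements are omitted), and the helper name `parse_coordinates` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged fragment of the Feature from the GML example.
FRAGMENT = """
<Feature typeName="http://www.usgs.org/tp#Building" ID="1">
  <Description>Hotel Vancouver</Description>
  <polygonproperty>
    <boundary>
      <coordinates>0.0,0.0 1.123,1.56 2.34,4.5 0.0,0.0</coordinates>
    </boundary>
  </polygonproperty>
</Feature>
"""

def parse_coordinates(text):
    """Split 'x,y x,y ...' into a list of (x, y) float tuples."""
    return [tuple(float(v) for v in pair.split(",")) for pair in text.split()]

feature = ET.fromstring(FRAGMENT)
ring = parse_coordinates(feature.findtext("polygonproperty/boundary/coordinates"))

# A polygon boundary should end at the vertex where it started.
is_closed = ring[0] == ring[-1]

print(feature.findtext("Description"), ring, is_closed)
```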
37
ArcXML (AXL)
• Being developed by ESRI, available in ArcIMS 3.0
• Format for data exchange within ArcIMS 3.0
• Provides tags for:
  • Request / Response between Client, Middleware, and Server
  • MapService Configuration
  • Viewer Configuration
38
AXL
• Config tags:
  • Properties (properties, extent, background, imagesize, output, featurecoordsys, etc.)
  • Workspaces (workspaces, SDEworkspaces, shapeworkspaces, imageworkspaces, etc.)
  • Layers (layer, dataset, query, coordsys)
  • Renderers (simple, group, scaledependent, valuemap, simplelabel, valuemap, etc.)
  • Symbols (simplemarker, rastermarker, simpleline, hashline, simplefill, simplepolygon, rasterfill, gradientfill, text, etc.)
  • Acetate layer objects (object, point, line, polygon, text, scalebar, northarrow)
• Admin tags: (admin, addservice, changeservice, removeservice, image)
• Request tags:
  • (request, get_service_info, get_map, get_features, get_extract, get_geocode)
  • Feature Server Request Tags (layer, query, spatialquery, spatialfilter, envelope)
• Response tags: (response, error)
  • serviceinfo (serviceinfo, layerinfo, fclass, field)
  • featureserver
  • queryserver (features, feature)
  • imageserver (map, output, legend)
39
Sample Request/Response AXL
• Example—Get_Map
<ARCXML VERSION="1.0">
<REQUEST>
<GET_MAP>
<PROPERTIES>
<EXTENT MINX="-180" MINY="-90" MAXX="180" MAXY="90" />
</PROPERTIES>
</GET_MAP>
</REQUEST>
</ARCXML>
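A client would typically assemble such a request programmatically rather than by string concatenation. The following Python sketch builds the same GET_MAP request with the standard library; the function name `get_map_request` is hypothetical, and nothing here talks to an actual ArcIMS server.

```python
import xml.etree.ElementTree as ET

def get_map_request(minx, miny, maxx, maxy):
    """Build an ARCXML GET_MAP request for the given map extent."""
    arcxml = ET.Element("ARCXML", VERSION="1.0")
    request = ET.SubElement(arcxml, "REQUEST")
    get_map = ET.SubElement(request, "GET_MAP")
    props = ET.SubElement(get_map, "PROPERTIES")
    ET.SubElement(props, "EXTENT",
                  MINX=str(minx), MINY=str(miny),
                  MAXX=str(maxx), MAXY=str(maxy))
    return ET.tostring(arcxml, encoding="unicode")

request_xml = get_map_request(-180, -90, 180, 90)
print(request_xml)
```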
40
Sample MapService AXL
• Example
<WORKSPACES>
  <SHAPEWORKSPACE name="shp_ws-0"
    directory="D:\Data\ESRIDATA\USA" />
  <SDEWORKSPACE name="sde_ws-0"
    server="ims" instance="esri_sde"
    user="gdt" password="gdt" />
</WORKSPACES>
41
Sample Viewer AXL
• Example
<WORKSPACES>
  <MAPPERWORKSPACE name="mapper_ws-0"
    url="http://mammoth" service="baseimage" />
</WORKSPACES>
<LAYER type="image" name="baseimage" visible="false"
  minscale="0.0" maxscale="1.7976931348623157E308">
  <DATASET name="baseimage" type="image"
    workspace="mapper_ws-0" />
</LAYER>
46
Open Technical Issues
• DTD inference
• DTD evolution
• Specifying access controls on XML documents
• Specifying, enforcing intra-document constraints
47
DTD Inference
• Document collections without DTDs
• “Tight” vs. “loose” DTDs
• Document 1:
<title> XML Tutorial </title>
<author> Richard Marciano </author>
• Possible DTD
<!ELEMENT document (title, author)>
48
DTD Inference
• Document 2:
<title> XML Tutorial </title>
<author> Richard Marciano </author>
<author> Chaitan Baru </author>
• Document DTD 1
<!ELEMENT document (title, (author1 | (author1, author2)))>
• Document DTD 2
<!ELEMENT document (title, author+)>
49
DTD Inference
• Alternative DTDs. Introduce an extra level (authors) in the tree
<!ELEMENT document (title, authors)>
<!ELEMENT authors (author1 | (author1, author2))>
OR
<!ELEMENT document (title, authors)>
<!ELEMENT authors (author+)>
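To make the inference idea concrete, here is a deliberately naive Python sketch that derives a “loose” content model from sample documents by counting how often each child element occurs. It is not an algorithm from these slides: it assumes every document has a `document` root, that children appear in a consistent order, and that only one level of nesting needs to be described. The function name `infer_content_model` is hypothetical.

```python
import xml.etree.ElementTree as ET

def infer_content_model(documents):
    """Infer a single <!ELEMENT ...> declaration for the shared root element."""
    roots = [ET.fromstring(d) for d in documents]

    # Child tags in first-seen order across all documents.
    order = []
    for root in roots:
        for child in root:
            if child.tag not in order:
                order.append(child.tag)

    parts = []
    for tag in order:
        counts = [sum(1 for c in root if c.tag == tag) for root in roots]
        lo, hi = min(counts), max(counts)
        if lo == 0:
            suffix = "?" if hi == 1 else "*"   # optional, possibly repeated
        else:
            suffix = "" if hi == 1 else "+"    # required, possibly repeated
        parts.append(tag + suffix)
    return "<!ELEMENT %s (%s)>" % (roots[0].tag, ", ".join(parts))

doc1 = ("<document><title>XML Tutorial</title>"
        "<author>Richard Marciano</author></document>")
doc2 = ("<document><title>XML Tutorial</title>"
        "<author>Richard Marciano</author><author>Chaitan Baru</author></document>")

dtd_one = infer_content_model([doc1])
dtd_both = infer_content_model([doc1, doc2])
print(dtd_one)
print(dtd_both)
```

With only Document 1 the result is the tight single-author model; adding Document 2 loosens it to `author+`, which accepts any number of authors and not just the one or two actually observed.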
50
DTD Evolution
• Example, Document 3:
<title> XML Primer </title>
<author> Richard Marciano </author>
<author> Chaitan Baru </author>
<keywords> XML, XSL, Schema </keywords>
• Document does not satisfy the Document DTD
  • Report an error
  • Record as exception and store the document
  • Evolve the DTD
51
Specifying access controls
• User is associated with a level of access
• Document elements are assigned levels of access
• Example
<abstract level="unclassified">….</abstract>
<section level="classified"><heading>Introduction</heading></section>
<section level="top secret"><heading>Architecture</heading></section>
• Stylesheet processor matches authorization level of user with authorization level of the document element
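The matching step can be sketched in Python as a filter that prunes every element above the user's clearance before the document is rendered. The slides describe doing this in a stylesheet processor; the code below is only a stand-in for that logic. The ordering of the three levels, the `Publication` wrapper element, the default of treating unlabeled elements as unclassified, and the name `filter_for_user` are all assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed ordering of access levels, lowest to highest.
LEVELS = {"unclassified": 0, "classified": 1, "top secret": 2}

DOC = """
<Publication>
  <abstract level="unclassified">...</abstract>
  <section level="classified"><heading>Introduction</heading></section>
  <section level="top secret"><heading>Architecture</heading></section>
</Publication>
"""

def filter_for_user(xml_text, user_level):
    """Return the document tree with elements above user_level removed."""
    root = ET.fromstring(xml_text)
    clearance = LEVELS[user_level]
    # Materialize the traversal first so removals do not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            # Elements without a level attribute are treated as unclassified.
            if LEVELS[child.get("level", "unclassified")] > clearance:
                parent.remove(child)
    return root

visible = [h.text for h in filter_for_user(DOC, "classified").iter("heading")]
print(visible)
```

Removing a section also removes everything nested inside it, so an unlabeled `heading` never leaks out of a section the user may not see.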
52
Specifying access controls
• Do access control processing in the stylesheet language
• Useful for content dependent access control
• Example
  If title contains “nuclear” then show only abstract
  Else show the full document
• Access control processing should be done on server side in secure fashion
53
Specifying constraints
• Enforcing intra-document constraints
• Constraints on structure
  • Example: A short paper may contain only one section, but long papers must have at least two.
<!ELEMENT Publication (Title, AuthorName*, Section*)>
<!ATTLIST Publication Type CDATA #REQUIRED>
  • Specify type of document in Type attribute. Use that to check if document satisfies the constraint
• Constraints on value
54
Specifying constraints
• Example of value constraint
<!ELEMENT Publication (Title, AuthorName*, Section*)>
<!ATTLIST Publication NumSecs CDATA #REQUIRED>
<Publication NumSecs="3">
  <Title>…</Title>
  <AuthorName>…</AuthorName>
  <Section>……</Section>
  <Section>……</Section>
  <Section>……</Section>
</Publication>
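A DTD cannot express either kind of constraint by itself, so the check has to happen in application code. The Python sketch below is one possible form of such a check. The `Type` values `short` and `long`, the reading of “only one section” as “at most one”, and the function name `check_publication` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def check_publication(xml_text):
    """Return a list of constraint violations (empty if the document passes)."""
    pub = ET.fromstring(xml_text)
    num_sections = len(pub.findall("Section"))
    errors = []

    # Value constraint: NumSecs must equal the actual number of sections.
    declared = pub.get("NumSecs")
    if declared is not None and int(declared) != num_sections:
        errors.append("NumSecs=%s but found %d Section elements"
                      % (declared, num_sections))

    # Structure constraint, keyed on the (assumed) Type attribute values.
    pub_type = pub.get("Type")
    if pub_type == "short" and num_sections > 1:
        errors.append("short paper has more than one section")
    if pub_type == "long" and num_sections < 2:
        errors.append("long paper has fewer than two sections")
    return errors

ok_doc = ('<Publication NumSecs="3"><Title>T</Title>'
          '<Section/><Section/><Section/></Publication>')
bad_doc = ('<Publication NumSecs="3" Type="short"><Title>T</Title>'
           '<Section/><Section/></Publication>')

print(check_publication(ok_doc))
print(check_publication(bad_doc))
```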
55
Projects at SDSC
• National Archives and Records Administration, NARA
  • Persistent Archives and Electronic Records
  • NHPRC
• NPACI Neuroscience
  • Federation of multiple brain image databases
• I2T: An Information Integration Testbed for Digital Government
  • Funded by NSF
  • Spatial mediation, wrapping of “unstructured” text
56
Projects at SDSC
• InterLib and California Digital Library
  • Funded by the NSF DLI-2 program
  • Implemented the Art Museum Image Consortium (AMICO) Digital Library at SDSC
• Community of Science, Inc. (www.cos.com)
  • Specifying XML standards for Current Research Information Systems (CRIS)
  • Enable creation of warehouse of research information and enable e-commerce
57
Projects at SDSC
• ESRI
  • Developers of ArcInfo, ArcView, ArcIMS products
  • Evaluate ArcXML (AXL) standard
  • Keep AXL developments in sync with activities in OpenGIS Consortium, e.g. the evolving Geography Markup Language (GML) standard
  • Connect AXL with other XML Web standards such as WAP (Wireless Application Protocol)
58
Projects at SDSC
• NEES
  • Proposal to NSF’s Networked Earthquake Engineering Simulation (NEES) program
  • Develop NeesML, an XML-based standard for representing earthquake engineering simulation metadata and data
  • NeesML will facilitate the creation of a NEES Curated Database, a warehouse of earthquake engineering simulation information