what is metadata? - nc state computer science · exercise produce an example xml document...
TRANSCRIPT
Module 4: XML Representation
ConceptsParsing and ValidationSchemas
c©Munindar P. Singh, CSC 513, Spring 2008 p.106
What is Metadata?Literally, data about data
Description of data that captures someuseful property regarding its
Structure and meaningProvenance: originsTreatment as permitted or allowed:storage, representation, processing,presentation, or sharing
Markup is metadata pertaining to mediaartifacts (documents, images), generallyspecified for suitable parsable units
c©Munindar P. Singh, CSC 513, Spring 2008 p.107
Motivations for MetadataMediating information structure (surrogate formeaning) over time and space
Storage: extend life of informationInteroperation for businessInteroperation (and storage) for regulatoryreasonsGeneral themes
Make meaning of information explicitEnable reuse across applications:repurposing compare to screen-scrapingEnable better tools to improve productivity
Reduce need for detailed prior agreementsc©Munindar P. Singh, CSC 513, Spring 2008 p.108
Markup HistoryHow much prior agreement do you need?
No markup: significant prior agreement
Comma Separated Values (CSV): no nesting
Ad hoc tags
SGML (Standard Generalized Markup L): complex,few reliable tools; used for document management
HTML (HyperText ML): simplistic, fixed, unprincipledvocabulary that mixes structure and display
XML (eXtensible ML): simple, yet extensible subset ofSGML to capture custom vocabularies
Machine processibleComprehensible to people: easier debugging
c©Munindar P. Singh, CSC 513, Spring 2008 p.109
Uses of XML
Supporting arms-length relationshipsExchanging information across softwarecomponents, even within an administrativedomainStoring information in nonproprietary formatRepresenting semistructured descriptions:
Products, services, catalogsContractsQueries, requests, invocations, responses(as in SOAP): basis for Web services
c©Munindar P. Singh, CSC 513, Spring 2008 p.110
Example XML Document
<?xml vers ion ="1.0"? > <!−− processing i n s t r u c t i o n −−><topelem a t t r 0 =" foo "> <!−− exac t l y one roo t −−>
3 <subelem a t t r 1 ="v1 " a t t r 2 ="v2">Opt iona l t e x t (PCDATA) <!−− parsed charac te r data −−><subsubelem a t t r 1 ="v1 " a t t r 2 ="v2 " / >
</subelem><nul l_e lem / >
8 <short_elem a t t r 3 ="v3 " / ></ topelem >
c©Munindar P. Singh, CSC 513, Spring 2008 p.111
ExerciseProduce an example XML documentcorresponding to a directed graph
c©Munindar P. Singh, CSC 513, Spring 2008 p.112
Compare with Lisp
List processing languageS-expressionsCons pairs: car and cdrLists as nil-terminated s-expressionsArbitrary structures built from few primitivesUntypedEasy parsingRegularity of structure encourages recursion
c©Munindar P. Singh, CSC 513, Spring 2008 p.113
ExerciseProduce an example XML documentcorresponding to
An invoice from Locke Brothers for 100 unitsof door locks at $19.95, each ordered on 15January and delivered to Custom HomeBuildersFactor in certified delivery via UPS for$200.00 on 18 JanuaryFactor in addresses and contact info for eachpartyFactor in late payments
c©Munindar P. Singh, CSC 513, Spring 2008 p.114
Meaning in XML
Relational DBMSs work for highly structuredinformation, but rely on column names formeaningSame problem in XML (reliance on namesfor meaning) but better connections to richermeaning representations
c©Munindar P. Singh, CSC 513, Spring 2008 p.115
XML Namespaces: 1
Because XML supports custom vocabulariesand interoperation, there is a high risk ofname collisionA namespace is a collection of namesNamespaces must be identical or disjoint
Crucial to support independentdevelopment of vocabulariesMAC addressesPostal and telephone codesVehicle identification numbersDomains as for the InternetOn the Web, use URIs for uniqueness
c©Munindar P. Singh, CSC 513, Spring 2008 p.116
XML Namespaces: 21 <!−− xml∗ i s reserved −−>
<?xml vers ion ="1.0"? >< a r b i t : top xmlns ="a URI " <!−− d e f a u l t namespace −−>
xmlns : a r b i t =" h t t p : / / wherever . i t . might . be / a r b i t −ns "xmlns : random=" h t t p : / / another . one / random−ns">
6 < a r b i t : aElem a t t r 1 ="v1 " a t t r 2 ="v2">Opt iona l t e x t (PCDATA)< a r b i t : bElem a t t r 1 ="v1 " a t t r 2 ="v2 " / >
</ a r b i t : aElem><random : simple_elem/ >
11 <random : aElem a t t r 3 ="v3 " / ><!−− compare a r b i t : aElem −−>
</ a r b i t : top >
c©Munindar P. Singh, CSC 513, Spring 2008 p.117
Uniform Resource IdentifierURIs are abstractWhat matters is their (purported) uniquenessURIs have no proper syntax per seKinds of URIs
URLs, as in browsing: not used instandards any moreURNs, which leave the mapping of namesto locations up in the air
Good design: the URI resource existsIdeally, as a description of the resource inRDDLUse a URL or URN
c©Munindar P. Singh, CSC 513, Spring 2008 p.118
RDDLResource Directory Description Language
Meant to solve the problem that a URI maynot have any real content, but people expectto see some (human readable) contentCaptures namespace description for people
XML SchemaText description
c©Munindar P. Singh, CSC 513, Spring 2008 p.119
Well-Formedness and Parsing
An XML document maps to a parse tree (ifwell-formed; otherwise not XML)
Each element must end (exactly once):obvious nesting structure (one root)An attribute can have at most oneoccurrence within an element; anattribute’s value must be a quoted string
Well-formed XML documents can be parsed
c©Munindar P. Singh, CSC 513, Spring 2008 p.120
XML InfoSet
A standardization of the low-level aspects of XMLWhat an element looks likeWhat an attribute looks likeWhat comments and namespace referenceslook likeOrdering of attributes is irrelevantRepresentations of strings and characters
Primarily directed at tool vendors
c©Munindar P. Singh, CSC 513, Spring 2008 p.121
Elements Versus Attributes: 1
Elements are essential for XML: structureand expressiveness
Have subelements and attributesCan be repeatedLoosely might correspond toindependently existing entitiesCan capture all there is to attributes
c©Munindar P. Singh, CSC 513, Spring 2008 p.122
Elements Versus Attributes: 2Attributes are not essential
End of the road: no subelements orattributesLike text; restricted to string valuesGuaranteed unique for each elementCapture adjunct information about anelementGreat as references to elements
Good idea to use in such cases to improvereadability
c©Munindar P. Singh, CSC 513, Spring 2008 p.123
Elements Versus Attributes: 3
< invo ice >2 < p r i ce currency = ’USD’ >
19.95</ pr ice >
</ invo ice >
Or< invo i ce amount = ’19.95 ’ currency = ’USD’ / >
Or even< invo i ce amount = ’USD 19.95 ’ / >
c©Munindar P. Singh, CSC 513, Spring 2008 p.124
Validating
Verifying whether a document matches a givengrammar (assumes well-formedness)
Applications have an explicit or implicitsyntax (i.e., grammar) for their particularelements and attributes
Explicit is better have definitionsBest to refer to definitions in separatedocuments
When docs are produced by externalsoftware components or by humanintervention, they should be validated
c©Munindar P. Singh, CSC 513, Spring 2008 p.125
Specifying Document Grammars
Verifying whether a document matches a givengrammar
Implicitly in the applicationWorst possible solution, because it isdifficult to develop and maintain
Explicit in a formal document; languagesinclude
Document Type Definition (DTD): inessence obsoleteXML Schema: good and prevalentRelax NG: (supposedly) better but not asprevalent
c©Munindar P. Singh, CSC 513, Spring 2008 p.126
XML Schema
Same syntax as regular XML documentsLocal scoping of subelement namesIncorporates namespaces(Data) Types
Primitive (built-in): string, integer, float,date, ID (key), IDREF (foreign key), . . .simpleType constructors: list, unionRestrictions: intervals, lengths,enumerations, regex patterns,Flexible ordering of elements
Key and referential integrity constraints
c©Munindar P. Singh, CSC 513, Spring 2008 p.127
XML Schema: complexType
Specifies types of elements with structure:Must use a compositor if ≥ 1subelementsSubelements with typesMin and max occurrences (default 1) ofsubelements
Elements with text content are easyEMPTY elements: easy
Example?Compare to nulls, later
c©Munindar P. Singh, CSC 513, Spring 2008 p.128
XML Schema: Compositors
Sequence: ordered listCan occur within other compositorsAllows varying min and max occurrence
All: unorderedMust occur directly below root elementMax occurrence of each element is 1
Choice: exclusive orCan occur within other compositors
c©Munindar P. Singh, CSC 513, Spring 2008 p.129
XML Schema: Main Namespaces
Part of the standardxsd: http://www.w3.org/2001/XMLSchema
Terms for defining schemas: schema,element, attribute, . . .The schema element has an attributetargetNamespace
xsi: http://www.w3.org/2001/XMLSchema-instance
Terms for use in instances:schemaLocation,noNamespaceSchemaLocation, nil, type
targetNamespace: user-defined
c©Munindar P. Singh, CSC 513, Spring 2008 p.130
XML Schema Instance Doc
<!−− Comment −−><Music xmlns =" h t t p : / / a . b . c / Muse"
xmlns : x s i =" the standard−x s i "4 x s i : schemaLocation ="schema−URI schema−l o ca t i on−URL">
<!−− Not ice space charac te r i n above s t r i n g −−>. . .
</ Music>
Define null values as<aElem x s i : n i l =" t r u e " / >
c©Munindar P. Singh, CSC 513, Spring 2008 p.131
XML Schema: NillableAn xsd:element declaration may statenillable=’true’
An instance of the element might statexsi:nil="true"The instance would be valid even if nocontent is present, even if content isrequired by default
c©Munindar P. Singh, CSC 513, Spring 2008 p.132
Creating XML Schema Docs: 1
Included into the same namespace as theincluding doc<xsd : schema xmlns : xsd =" the−standard−xsd "
xsd : targetNamespace =" the−t a r g e t ">< inc lude xsd : schemaLocation =" par t−one . xsd " / >
4 < inc lude xsd : schemaLocation =" par t−two . xsd " / ><!−− schemaLocation as i n xsd , not x s i −−>
</ xsd : schema>
c©Munindar P. Singh, CSC 513, Spring 2008 p.133
Creating XML Schema Docs: 2
Use import instead of includeImports may have different targetsIncluded schemas have the same targetSpecify namespaces from whichschemas are to be importedLocation of schemas not required andmay be ignored if provided
c©Munindar P. Singh, CSC 513, Spring 2008 p.134
Foreign Attributes in XML Schema
XML Schema elements allow attributes that areforeign, i.e., with a namespace other than the xsdnamespace
Must have an explicit namespaceCan be used to insert any additionalinformation, not interpreted by a processorSpecific usage is with attributes from thexlink: namespace
<xsd : schema><xsd : element name= ’ course ’ type = ’cT ’
x l i n k : r o l e = ’ work ’ ncsu : o f f e r i n g = ’ t rue ’ >4 </ xsd : schema>
c©Munindar P. Singh, CSC 513, Spring 2008 p.135
XML Schema Style Guidelines: 1
Flatten the structure of the schemaDon’t nest declarations as you would adesired instance documentMake sure that element names are notreusedUnqualified attributes cannot be globalIf dealing with legacy documents with thesame element names having differentmeanings, place them in differentnamespaces where possible
Use named types where appropriate
c©Munindar P. Singh, CSC 513, Spring 2008 p.136
XML Schema Style Guidelines: 2
Don’t have elements with mixed contentDon’t have attribute values that need parsingAdd unique IDs for information that mayrepeatGroup information that may repeatEmphasize commonalities and reuse
Derive types from related typesCreate attribute groups
c©Munindar P. Singh, CSC 513, Spring 2008 p.137
XML Schema Documentationxsd:annotation
Should be the first subelement, except forthe whole schemaContainer for two mixed-contentsubelements
xsd:documentation: for humansxsd:appinfo: for machine-processible data
Such as application-specific metadataPossibly using the Dublin Corevocabulary, which describes librarycontent and other media
c©Munindar P. Singh, CSC 513, Spring 2008 p.138
Module 5: XML Manipulation
Key XML query and manipulation languagesinclude
XPathXQueryXSLT
c©Munindar P. Singh, CSC 513, Spring 2008 p.139
Metaphors for Handling XML: 1
How we conceptualize what XML documents aredetermines our approach for handling suchdocuments
Text: an XML document is textIgnore any structure and perform simplepattern matches
Tags: an XML document is text interspersedwith tags
Treat each tag as an “event” duringreading a document, as in SAX (SimpleAPI for XML)Construct regular expressions as inscreen scraping
c©Munindar P. Singh, CSC 513, Spring 2008 p.140
Metaphors for Handling XML: 2
Tree: an XML document is a treeWalk the tree using DOM (DocumentObject Model)
Template: an XML document has regularstructure
Let XPath, XSLT, XQuery do the workThought: an XML document represents agraph structure
Access knowledge via RDF or OWL
c©Munindar P. Singh, CSC 513, Spring 2008 p.141
XPath
Used as part of XPointer, SQL/XML, XQuery,and XSLT
Models XML documents as trees with nodesElementsAttributesText (PCDATA)CommentsRoot node: above root of document
c©Munindar P. Singh, CSC 513, Spring 2008 p.142
Achtung!
Parent in XPath is like parent as traditionallyin computer scienceChild in XPath is confusing:
An attribute is not a child of its parentMakes a difference for recursion (e.g., inXSLT apply-templates)
Our terminology follows computer science:e-children, a-children, t-childrenSets via et-, ta-, and so on
c©Munindar P. Singh, CSC 513, Spring 2008 p.143
XPath Location Paths: 1Relative or absoluteReminiscent of file system paths, but muchmore subtle
Name of an element to walk downLeading /: root/: indicates walking down a tree.: currently matched (context) node..: parent node
c©Munindar P. Singh, CSC 513, Spring 2008 p.144
XPath Location Paths: 2
@attr: to check existence or access value ofthe given attributetext(): extract the textcomment(): extract the comment[ ]: generalized array accessorsVariety of axes, discussed below
c©Munindar P. Singh, CSC 513, Spring 2008 p.145
XPath Navigation
Select children according to position, e.g., [j],where j could be 1 . . . last()Descendant-or-self operator, //
.//elem finds all elems under the currentnode//elem finds all elems in the document
Wildcard, *:collects e-children (subelements) of thenode where it is applied, but omits thet-children@*: finds all attribute values
c©Munindar P. Singh, CSC 513, Spring 2008 p.146
XPath Queries (Selection Conditions)Attributes: //Song[@genre="jazz"]Text: //Song[starts-with(.//group, "Led")]Existence of attribute: //Song[@genre]Existence of subelement: //Song[group]Boolean operators: and, not, orSet operator: union (|), analogous to choiceArithmetic operators: >, <, . . .String functions: contains(), concat(),length(), starts-with(), ends-with()distinct-values()Aggregates: sum(), count()
c©Munindar P. Singh, CSC 513, Spring 2008 p.147
XPath Axes: 1Axes are addressable node sets based on thedocument tree and the current node
Axes facilitate navigation of a treeSeveral are definedMostly straightforward but some of themorder the nodes as the reverse of othersSome captured via special notation
current, child, parent, attribute, . . .
c©Munindar P. Singh, CSC 513, Spring 2008 p.148
XPath Axes: 2
preceding: nodes that precede the start ofthe context node (not ancestors, attributes,namespace nodes)following: nodes that follow the end of thecontext node (not descendants, attributes,namespace nodes)preceding-sibling: preceding nodes that arechildren of the same parent, in reversedocument orderfollowing-sibling: following nodes that arechildren of the same parent
c©Munindar P. Singh, CSC 513, Spring 2008 p.149
XPath Axes: 3ancestor: proper ancestors, i.e., element nodes (otherthan the context node) that contain the context node,in reverse document order
descendant: proper descendants
ancestor-or-self: ancestors, including self (if itmatches the next condition)
descendant-or-self: descendants, including self (if itmatches the next condition)
c©Munindar P. Singh, CSC 513, Spring 2008 p.150
XPath Axes: 4
Longer syntax: child::SongSome captured via special notation
self::*:child::node(): node() matches all nodespreceding::*descendant::text()ancestor::Songdescendant-or-self::node(), whichabbreviates to //Compare /descendant-or-self::Song[1](first descendant Song) and //Song[1](first Songs (children of their parents))
c©Munindar P. Singh, CSC 513, Spring 2008 p.151
XPath Axes: 5Each axis has a principal node kind
attribute: attributenamespace: namespaceAll other axes: element
* matches whatever is the principal node kind of thecurrent axis
node() matches all nodes
c©Munindar P. Singh, CSC 513, Spring 2008 p.152
XPointer
Enables pointing to specific parts of documentsCombines XPath with URLsURL to get to a document; XPath to walkdown the documentCan be used to formulate queries, e.g.,
Song-URL#xpointer(//Song[@genre="jazz"])The part after # is a fragment identifier
Fine-grained addressability enhances theWeb architecture
High-level “conceptual” identification of node sets
c©Munindar P. Singh, CSC 513, Spring 2008 p.153
XQuery
The official query language for XML, now aW3C recommendation, as version 1.0Given a non-XML syntax, easier on thehuman eye than XMLAn XML rendition, XqueryX, is in the works
c©Munindar P. Singh, CSC 513, Spring 2008 p.154
XQuery Basic Paradigm
The basic paradigm mimics the SQL(SELECT–FROM–WHERE) clause
1 f o r $x i n doc ( ’ q2 . xml ’ ) / / Songwhere $x / @lg = ’ en ’r e t u r n<Engl ish−Sgr name= ’ { $x / Sgr /@name} ’ t i = ’ { $x / @ti } ’ / >
c©Munindar P. Singh, CSC 513, Spring 2008 p.155
FLWOR Expressions
Pronounced “flower”For: iterative binding of variables over rangeof valuesLet: one shot binding of variables over vectorof valuesWhere (optional)Order by (sort: optional)Return (required)
Need at least one of for or let
c©Munindar P. Singh, CSC 513, Spring 2008 p.156
XQuery For Clause
The for clauseIntroduces one or more variablesGenerates possible bindings for eachvariableActs as a mapping functor or iterator
In essence, all possible combinations ofbindings are generated: like a Cartesianproduct in relational algebraThe bindings form an ordered list
c©Munindar P. Singh, CSC 513, Spring 2008 p.157
XQuery Where Clause
The where clauseSelects the combinations of bindings that aredesiredBehaves like the where clause in SQL, inessence producing a join based on theCartesian product
c©Munindar P. Singh, CSC 513, Spring 2008 p.158
XQuery Return Clause
The return clauseSpecifies what node-sets are returned basedon the selected combinations of bindings
c©Munindar P. Singh, CSC 513, Spring 2008 p.159
XQuery Let Clause
The let clauseLike for, introduces one or more variablesLike for, generates possible bindings foreach variableUnlike for, generates the bindings as a list inone shot (no iteration)
c©Munindar P. Singh, CSC 513, Spring 2008 p.160
XQuery Order By Clause
The order by clauseSpecifies how the vector of variable bindingsis to be sorted before the return clauseSorting expressions can be nested byseparating them with commasVariants allow specifying
descending or ascending (default)empty greatest or empty least toaccommodate empty elementsstable sorts: stable order bycollations: order by $t collationcollation-URI: (obscure, so skip)
c©Munindar P. Singh, CSC 513, Spring 2008 p.161
XQuery Positional Variables
The for clause can be enhanced with a positionalvariable
A positional variable captures the position ofthe main variable in the given for clause withrespect to the expression from which themain variable is generatedIntroduce a positional variable via the at $varconstruct
c©Munindar P. Singh, CSC 513, Spring 2008 p.162
XQuery Declarations
The declare clause specifies things likeNamespaces: declare namespacepref=’value’
Predefined prefixes include XML, XMLSchema, XML Schema-Instance, XPath,and local
Settings: declare boundary-space preserve(or strip)Default collation: a URI to be used forcollation when no collation is specified
c©Munindar P. Singh, CSC 513, Spring 2008 p.163
XQuery Quantification: 1
Two quantifiers some and everyEach quantifier expression evaluates to trueor falseEach quantifier introduces a bound variable,analogous to for
1 f o r $x i n . . .where some $y i n . . .s a t i s f i e s $y . . . $xr e t u r n . . .
Here the second $x refers to the same variableas the first
c©Munindar P. Singh, CSC 513, Spring 2008 p.164
XQuery Quantification: 2
A typical useful quantified expression would usevariables that were introduced outside of itsscope
The order of evaluation isimplementation-dependent: enablesoptimizationIf some bindings produce errors, this canmattersome: trivially false if no variable bindingsare found that satisfy itevery: trivially true if no variable bindings arefound
c©Munindar P. Singh, CSC 513, Spring 2008 p.165
Variables: Scoping, Bound, and Free
for, let, some, and every introduce variablesThe visibility variable follows typical scopingrulesA variable referenced within a scope is
Bound if it is declared within the scopeFree if it not declared within the scope
1 f o r $x i n . . .where some $x i n . . .s a t i s f i e s . . .r e t u r n . . .
Here the two $x refer to different variables
c©Munindar P. Singh, CSC 513, Spring 2008 p.166
XQuery Conditionals
Like a classical if-then-else clauseThe else is not optionalEmpty sequences or node sets, written ( ),indicate that nothing is returned
c©Munindar P. Singh, CSC 513, Spring 2008 p.167
XQuery Constructors
Braces { } to delimit expressions that areevaluated to generate the content to be included;analogous to macros
document { }: to create a document nodewith the specified contentselement { } { }: to create an element
element foo { ’bar’ }: creates<foo>Bar</foo>element { ’foo’ } { ’bar’ }: also evaluatesthe name expression
attribute { } { }: likewisetext { body}: simpler, because anonymous
c©Munindar P. Singh, CSC 513, Spring 2008 p.168
XQuery Effective Boolean Value
Analogous to Lisp, a general value can betreated as if it were a Boolean
A xs:boolean value maps to itselfEmpty sequence maps to falseSequence whose first member is a nodemaps to trueA numeric that is 0, negative, or NaN mapsto false, else trueAn empty string maps to false, others to true
c©Munindar P. Singh, CSC 513, Spring 2008 p.169
Defining Functions
1 dec lare f u n c t i o n l o c a l : i t em f top ( $ t ){
l o c a l : i t em f ( $t , ( ) )} ;
Here local: is the namespace of the queryThe arguments are specified in parenthesesAll of XQuery may be used within thedefining bracesSuch functions can be used in place ofXPath expressions
c©Munindar P. Singh, CSC 513, Spring 2008 p.170
Functions with Types
1 dec lare f u n c t i o n l o c a l : i t em f top ( $ t as element ( ) )as element ( ) ∗
{l o c a l : i t em f ( $t , ( ) )
} ;
Return types as aboveAlso possible for parameters, but ignore suchfor this course
c©Munindar P. Singh, CSC 513, Spring 2008 p.171
XSLT
A programming language with a functional flavorSpecifies (stylesheet) transforms fromdocuments to documentsCan be included in a document (best not to)
<?xml vers ion ="1.0"? ><?xml−s t y l eshee t type =" t e x t / x s l "
h re f ="URL−to−xs l−sheet "?><main−element >
5 . . .</main−element >
c©Munindar P. Singh, CSC 513, Spring 2008 p.172
XQuery versus XSLT: 1
Competitors in some ways, butShare a basis in XPathConsequently share the same data modelSame type systems (in the type-sensitiveversions)XSLT got out first and has a sizablefollowing, but XQuery has strong backingamong vendors and researchers
c©Munindar P. Singh, CSC 513, Spring 2008 p.173
XQuery versus XSLT: 2
XQuery is geared for querying databasesSupported by major relational DBMSvendors in their XML offeringsSupported by native XML DBMSsOffers superior coverage of processingjoinsIs more logical (like SQL) and potentiallymore optimizable
XSLT is geared for transforming documentsIs functional rather than declarativeBased on template matching
c©Munindar P. Singh, CSC 513, Spring 2008 p.174
XQuery versus XSLT: 3
There is a bit of an arms race between themTypes
XSLT 1.0 didn’t support typesXQuery 1.0 doesXSLT 2.0 does too
XQuery presumably will be enhanced withcapabilities to make updates, but XSLT couldtoo
c©Munindar P. Singh, CSC 513, Spring 2008 p.175
XSLT Stylesheets
A programming language that follows XMLsyntax
Use the XSLT namespace (conventionallyabbreviated xsl)Includes a large number of primitives,especially:
<copy-of> (deep copy)<copy> (shallow copy)<value-of><for-each select="..."><if test="..."><choose>
c©Munindar P. Singh, CSC 513, Spring 2008 p.176
XSLT Templates: 1
A pattern to specify where the giventransform should apply: an XPath expression
This match only works on the root:< x s l : template match =" / " >
. . .</ x s l : template >
Example: Duplicate text in an element< x s l : template match=" t e x t ( ) " >
2 < x s l : value−of s e l e c t = ’ . ’ / >< x s l : value−of s e l e c t = ’ . ’ / >
</ x s l : template >
c©Munindar P. Singh, CSC 513, Spring 2008 p.177
XSLT Templates: 2
If no pattern is specified, apply recursively onet-children via <xsl:apply-templates/>By default, if no other template matches,recursively apply to et-children of currentnode (ignores attributes) and to root:
1 < x s l : template match = "∗ | / " >< x s l : apply−templates / >
</ x s l : template >
c©Munindar P. Singh, CSC 513, Spring 2008 p.178
XSLT Templates: 3
Copy text node by defaultUse an empty template to override thedefault:< x s l : template match="X" / >
2 <!−− X = des i red pa t t e rn −−>
Confine ourselves to the examples discussed inclass (ignore explicit priorities, for example)
c©Munindar P. Singh, CSC 513, Spring 2008 p.179
XSLT Templates: 4
Templates can be namedTemplates can have parameters
Values for parameters are supplied atinvocationEmpty node sets by defaultAdditional parameters are ignored
c©Munindar P. Singh, CSC 513, Spring 2008 p.180
XSLT VariablesExplicitly declaredValues are node setsConvenient way to document templates
c©Munindar P. Singh, CSC 513, Spring 2008 p.181
Document Object Model (DOM)
Basis for parsing XML, which provides anode-labeled tree in its API
Conceptually simple: traverse by requestingelement, its attribute values, and its childrenProcessing program reflects documentstructure, as in recursive descentCan edit documentsInefficient for large documents: parses themfirst entirely even if a tiny part is neededCan validate with respect to a schema
c©Munindar P. Singh, CSC 513, Spring 2008 p.182
DOM Example
DOMParser p = new DOMParser ( ) ;p . parse ( " f i lename " ) ;
3 Document d = p . getDocument ( )Element s = d . getDocumentElement ( ) ;NodeList l = s . getElementsByTagName ( " member " ) ;Element m = ( Element ) l . i tem ( 0 ) ;i n t code = m. g e t A t t r i b u t e ( " code " ) ;
8 NodeList k ids = m. getChi ldNodes ( ) ;Node k id = k ids . i tem ( 0 ) ;S t r i n g elemName = ( ( Element ) k i d ) . getTagName ( ) ; . . .
c©Munindar P. Singh, CSC 513, Spring 2008 p.183
Simple API for XML (SAX)
Parser generates a sequence of events:startElement, endElement, . . .
Programmer implements these as callbacksMore control for the programmer
Processing program does not necessarilyreflect document structure
c©Munindar P. Singh, CSC 513, Spring 2008 p.184
SAX Example: 1
c lass MemberProcess extends Defau l tHand ler {p u b l i c vo id s ta r tE lement ( S t r i n g u r i , S t r i n g n ,
S t r i n g qName, A t t r i b u t e s a t t r s ) {i f ( n . equals ( " member " ) ) code = a t t r s . getValue ( " code " )
5 i f ( n . equals ( " p r o j e c t " ) ) i n P r o j e c t = t r ue ;b u f f e r . rese t ( ) ;
}
. . .
c©Munindar P. Singh, CSC 513, Spring 2008 p.185
SAX Example: 2
1
. . .
p u b l i c vo id endElement ( S t r i n g u r i , S t r i n g n ,S t r i n g qName) {
6 i f ( n . equals ( " p r o j e c t " ) ) i n P r o j e c t = f a l s e ;i f ( n . equals ( " member " ) && ! i n P r o j e c t )
. . . do something . . .}
}
c©Munindar P. Singh, CSC 513, Spring 2008 p.186
SAX FiltersA component that mediates between anXMLReader (parser) and a client
A filter would present a modified set ofevents to the clientTypical uses:
Make minor modifications to the structureSearch for patterns efficiently
What kinds of patterns, though?Ideally modularize treatment of differentevent patternsIn general, a filter can alter the structure ofthe document
c©Munindar P. Singh, CSC 513, Spring 2008 p.187
Creating XML from Legacy Sources
Often need to read in information from non-XMLsources
From relational databasesEasier because of structureSupported by vendor tools
From flat files, CSV documents, HTML Webpages
Bit of a black art: lots of heuristicsTools based on regular expressions
c©Munindar P. Singh, CSC 513, Spring 2008 p.188
Programming with XML
LimitationsDifficult to construct and maintaindocumentsInternal structures are cumbersome;hence the criticisms of DOM parsers
Emerging approaches provide superiorbinding from XML to
Programming languagesRelational databases
Check pull-based versus push-based parsers
c©Munindar P. Singh, CSC 513, Spring 2008 p.189
Module 6: XML Storage
The major aspects of storing XML includeXML KeysConcepts: Data and Document CentrismStorageMapping to relational schemasSQL/XML
c©Munindar P. Singh, CSC 513, Spring 2008 p.190
Integrity Constraints in XML
Entity: xsd:unique and xsd:keyReferential: xsd:keyrefData type: XML Schema specificationsValue: Solve custom queries using XPath orXQuery
Entity and referential constraints are based onXPath
c©Munindar P. Singh, CSC 513, Spring 2008 p.191
XML Keys: 1
Keys serve as generalized identifiers, and arecaptured via XML Schema elements:
Unique: candidate keyThe selected elements yield unique fieldtuples
Key: primary key, which means candidatekey plus
The tuples exist for each selected elementKeyref: foreign key
Each tuple of fields of a selected elementcorresponds to an element in thereferenced key
c©Munindar P. Singh, CSC 513, Spring 2008 p.192
XML Keys: 2
Two subelements built using restrictedapplication of XPath from within XML Schema
Selector: specify a set of objects: this is thescope over which uniqueness appliesField: specify what is unique for eachmember of the above set: this is the identifierwithin the targeted scope
Multiple fields are treated as ordered toproduce a tuple of values for eachmember of the setThe order matters for matching keyref tokey
c©Munindar P. Singh, CSC 513, Spring 2008 p.193
Selector XPath Expression
A selector finds descendant elements of thecontext node
The sublanguage of XPath used allowsChildren via ./child or ./* or childDescendants via .// (not within a path)Choice via |
The subset of XPath used does not allowParents or ancestorstext()AttributesFancy axes such as preceding,preceding-sibling, . . .
c©Munindar P. Singh, CSC 513, Spring 2008 p.194
Field XPath Expression
A field finds a unique descendant element(simple type only) or attribute of the context node
The subset of XPath used allowsChildren via ./child or ./*Descendants via .// (not within a path)Choice via |Attributes via @attribute or @*
The subset of XPath used does not allowParents or ancestorstext()Fancy axes such as preceding, . . .
An element yields its text()c©Munindar P. Singh, CSC 513, Spring 2008 p.195
XML Foreign Keys
<keyre f name = " . . . " r e f e r =" pr imary−key−name">< s e l e c t o r xpath = " . . . " / >< f i e l d name = " . . . " / >
</ keyref >
Relational requirement: foreign keys don’thave to be unique or non-null, but if onecomponent is null, then all components mustbe null.
c©Munindar P. Singh, CSC 513, Spring 2008 p.196
Placing Keys in Schemas
Keys are associated with elements, not withtypesThus the . in a key selector expression isboundCould have been (but are not) associatedwith types where the . could be bound towhichever element was an instance of thetype
c©Munindar P. Singh, CSC 513, Spring 2008 p.197
Data-Centric View: 1
1 < r e l a t i o n name= ’ Student ’ >< tup le >< a t t r 1 >V11</ a t t r 1 >
. . .< a t t r n >V1n</ a t t r n >
</ tup le >6 . . .
</ r e l a t i o n >
Extract and store via mapping to DB modelRegular, homogeneous structure
c©Munindar P. Singh, CSC 513, Spring 2008 p.198
Data-Centric View: 2Ideally, no mixed content: an elementcontains text or subelements, not bothAny mixed content would be templatic, i.e.,
Generated from a database via suitabletransformationsGenerated via a form that a user or anapplication fills out
Order among siblings likely irrelevant (as isorder among relational columns)
Expensive if documents are repeatedly parsedand instantiated
c©Munindar P. Singh, CSC 513, Spring 2008 p.199
Document-Centric ViewIrregular: doesn’t map well to a relationHeterogeneous dataDepending on entire doc forapplication-specific meaning
c©Munindar P. Singh, CSC 513, Spring 2008 p.200
Data- vs Document-Centric ViewsData-centric: data is the main thing
XML simply renders the data for transportStore as dataConvert to/from XML as neededThe structure is important
Document-centric: documents are the mainthing
Documents are complex (e.g., designdocuments) and irregularStore documents whereverUse DBMS where it facilitates performingimportant searches
c©Munindar P. Singh, CSC 513, Spring 2008 p.201
Storing Documents in Databases
Use character large objects (CLOBs) withinDB: searchable only as textStore paths to external files containing docs
Simple, but no support for integrityUse some structured elements for easysearch as well as unstructured clobs or filesHeterogeneity complicates mappings totyped OO programming languages
Storing documents in their entirety maysometimes be necessary for external reasons,such as regulatory compliance
c©Munindar P. Singh, CSC 513, Spring 2008 p.202
Database Features
Storage: schema definition languageQuerying: query languageTransactions: concurrencyRecovery
c©Munindar P. Singh, CSC 513, Spring 2008 p.203
Potential DBMS Types for XML: 1
Object-orientedNice structureIntellectual basis of many XML concepts,including schema representations andpath expressionsNot highly popular in standalone products
RelationalLimited structuring ability (1NF: each cellis atomic)Extremely popularWell optimized for flat queries
c©Munindar P. Singh, CSC 513, Spring 2008 p.204
Potential DBMS Types for XML: 2
Object relational: hybrids of aboveNot highly popular in standalone products
Custom XML stores or native XMLdatabases
Emerging ideas: may lack core databasefeatures (e.g., recovery, . . . )Enable fancier content managementsystemsLeading open source products:
Apache Xindice (server; XPath)Berkeley DB XML (libraries; XQuery)
c©Munindar P. Singh, CSC 513, Spring 2008 p.205
XML to Relational DatabasesUsing large objectsFlatten XML structuresReferring to external files
Recall that for a relational schema, its entire setof attributes is necessarily a superkey
c©Munindar P. Singh, CSC 513, Spring 2008 p.206
Artificial Representation: Repetitious
Capturing an object hierarchy in a relationImagine an artificial identifier for each nodeConstruct a relation with three mainrelational attributes or columns
One column for the identifierOne column for the name of an attribute(i.e., element name)One column for the value (assumes thevalue would fit into the same relationaltype: potentially this could be CLOB orBLOB)
c©Munindar P. Singh, CSC 513, Spring 2008 p.207
Artificial Representation: Graph
Use four generic relations to represent a graphVertices:
Element ID, NameContents
Element ID, Text, number (to allowmultiple text nodes)
AttributesID, Attribute name, Attribute value
EdgesSource ID, Target ID
Better typed than repetitious style because thishas no nulls
c©Munindar P. Singh, CSC 513, Spring 2008 p.208
Shallow Representation: 1
The “natural” approaches are based ontuple-generating elements (TGEs)
Choose one XML element type as the TGETGE corresponds to a tupleThe key is based on an ID attribute or textof the TGE
A relational attribute (column) for eachsubelement or attributeEasiest if there is an attribute for IDs andthere are no other attributes
c©Munindar P. Singh, CSC 513, Spring 2008 p.209
Shallow Representation: 2
ConsequencesNulls for missing subelements canproliferateSubelements with structure (subelementsor attributes) aren’t represented wellAncestors cannot be searched for
c©Munindar P. Singh, CSC 513, Spring 2008 p.210
Deep Representation
Also called shredding an XML documentChoose a TGE as beforeA column for each descendant, except that
Can skip wrapper elements (no text, onlysubelements), but must reconstruct themto create an XML document
ConsequencesNulls for missing subelementsLots of columns in a relationAncestors cannot be searched forLoses structural information
c©Munindar P. Singh, CSC 513, Spring 2008 p.211
Representing Ancestors
Ancestors are the elements that are above thescope of the given TGE
Choose a TGE as beforeA column for each descendant as beforeA column for each ancestor (that needs to besearched)
Appropriate attributes or text fields tomake the search worthwhile
ConsequencesNulls for missing subelementsLots of columns in a relation
c©Munindar P. Singh, CSC 513, Spring 2008 p.212
Generalized TGE
Each element is a TGE, yielding a differentrelationA column for each terminal child: attribute ortextA column for each ancestor to capture theentire path from root to this node
Must promote uniquifying content so thateach TGE yields unique tuples
ConsequencesNulls for missing subelementsLots of relationsLots of columns in a relation
c©Munindar P. Singh, CSC 513, Spring 2008 p.213
Variations in Structure
Create separate relations for each variantConsequences
Lots of possible structures to storeQueries would not be succinctAcceptable only if we know in advancethat the number of variants is small andthe data in each is substantial
c©Munindar P. Singh, CSC 513, Spring 2008 p.214
Semistructured Representation
Create two (sets of) relationsSpecific part: one (or more) relations basedon one of the natural approachesGeneric part: one relation based on anartificial approach
c©Munindar P. Singh, CSC 513, Spring 2008 p.215
Thoughtful Design
The above approaches are not sensitive tothe meaning and motivation behind the XMLstructureUnderstand the XML structure via aconceptual model (in terms of entities andrelationships)Avoid unnecessary nesting in the XMLstructure, if possibleDesign a corresponding relational schema byhand
This is not always possible, though
c©Munindar P. Singh, CSC 513, Spring 2008 p.216
Evaluation
How does the above work for data-centric anddocument-centric views?
Compare with respect toDocument structureDocument “roundtripping” (compare &,&, #a39)Normalization
Are the documents unique?Are the documents unique up to“isomorphism”?
c©Munindar P. Singh, CSC 513, Spring 2008 p.217
Schema Evolution
A big problem for databases in practical settingsFor relational schemas, certain kinds ofupdates are simpler than othersCan have consequences on optimizationXML schemas can be evolved by using XSLTto map old data to new schema
c©Munindar P. Singh, CSC 513, Spring 2008 p.218
From Relations to XML
Mapping a relation schema (set of relations plusfunctional dependencies) to an XML document
Map relation R to an element RE with key orunique constraintsMap column C of R to an attribute of RE orequivalently a child element with just textMap relation S with a foreign key to R to
A child element SE of RE (omit foreignkey content from SE): works if only onesuch RE for SE; ORAn element SE that includes the foreignkey content, and includes a keyref to RE
c©Munindar P. Singh, CSC 513, Spring 2008 p.219
Null Value: 1A special value, not in any domain, butcombinable with any domain
Need?Possible meanings
Not applicableUnknown: missingQuestionable existenceAbsent (known but absent)
Hazards of null values?
c©Munindar P. Singh, CSC 513, Spring 2008 p.220
Null Value: 2
XML Schema enables developing custom nullvalues for each domain
Create an arbitrary value thatMatches the given data typeIs not a valid value of the domain,however
Design applications to understand specificrestricted type
c©Munindar P. Singh, CSC 513, Spring 2008 p.221
XML Schema Null
<elem/> (equivalently <elem></elem>)means that the element contains the emptystring
This is not nullxsi defines the attribute nil
Used as <elem xsi:nil="true"/> if elem isdeclared nillable (via nillable="true")
c©Munindar P. Singh, CSC 513, Spring 2008 p.222
Quick Look at SQL
Structured Query LanguageData Definition Language: CREATE TABLEData Manipulation Language: SELECT,INSERT, DELETE, UPDATEBasic paradigm for SELECTSELECT t1 . column−1, t1 . column−2 . . . tm . column−nFROM tab le −1 t1 , tab le−m tm
3 WHERE t1 . column−3=t4 . column−4 AND . . .
c©Munindar P. Singh, CSC 513, Spring 2008 p.223