documents: form vs. content ?
1
Digital preservation. Principles and potential role of XML
Giovanni Michetti
Urbino, 9th October 2002
2
Documents: form vs. content?
Traditional environment:
Form
Content
3
Documents: form vs. content?
Digital environment:
Form
Content
4
Documents: structure
Structure is unavoidably inside documents
Complexity grows, structure grows
Structure is (part of the) message
We deal with structure not only in the digital environment
5
Documents: structure and digital environment
Moving information onto new media
Need for functionalities to manage the explosive growth of information
Need to make structure explicit
6
Markup
The proper description of an information resource requires:
identifying its logical components
making its structure explicit
Markup
7
Markup
Markup: every means of making the interpretation of a document explicit
8
From a record ...
University of Urbino
Faculty of Arts
Rome, 1st August 2002
Dr. Giovanni Michetti
Protocol n. 1234/AB
Subject: Teaching appointment
We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.
The Dean
Prof. Giorgio Cerboni Baiardi
Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]
9
… to a marked record ...
<XML><letter>
<sender>University of Urbino
Faculty of Arts</sender>
<date>Rome, 1st August 2002</date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to</text>
<author>The Dean
Prof. Giorgio Cerboni Baiardi</author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter></XML>
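Once the structure is explicit, software can address each logical component by name. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and an abridged copy of the record; the helper name letter_fields is hypothetical:

```python
import xml.etree.ElementTree as ET

# Abridged copy of the marked-up record (text shortened, heading omitted).
RECORD = """<XML><letter>
<sender>University of Urbino Faculty of Arts</sender>
<date>Rome, 1st August 2002</date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records.</text>
<author>The Dean Prof. Giorgio Cerboni Baiardi</author>
</letter></XML>"""


def letter_fields(xml_text):
    """Return a dict mapping each child tag of <letter> to its text."""
    root = ET.fromstring(xml_text)
    letter = root.find("letter")
    return {child.tag: (child.text or "").strip() for child in letter}


fields = letter_fields(RECORD)
print(fields["addressee"])
print(fields["protocolnumb"])
```

The same lookup on the unmarked record would require guessing where each component starts and ends.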
10
… to a DTD ...
<!ELEMENT letter (sender, date, addressee, protocolnumb, subject, text, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
11
… to a more precise DTD
<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachments?, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT attachments (#PCDATA)>
12
Let’s refine the markup ...
<XML><letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to</text>
<author><role>The Dean</role><name>Prof. Giorgio Cerboni Baiardi</name></author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter></XML>
13
... keeping on refining ...
<XML><letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
[Protocolnumb + Subject + Text]
<author><role>The Dean</role><name><title>Prof.</title><propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name></author>
<heading><bureau>Faculty of Arts</bureau><address>Piano S. Lucia 6 - 61029 Urbino</address>
<tel>Tel: 0722.320125</tel><fax>Fax: 0722.322553</fax><email>Email: [email protected]</email></heading>
</letter></XML>
14
… and let’s refine the DTD
<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading)>
<!ELEMENT sender (body, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title, propername, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau, address, tel, fax, email)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
15
The final DTD
<!ELEMENT letter (sender, date, addressee+, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading?)>
<!ELEMENT sender (body?, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role?, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title?, propername?, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau?, address?, tel?, fax?, email?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>
16
XML declaration
Every XML document should start with an XML declaration, like
<?xml version="1.0"?>
Such a declaration must be right at the start of the document: there must be nothing before it (comments, processing instructions, white space, ...)
17
XML declaration
A parser uses the first five characters, <?xml, to work out which character encoding the document uses
The version attribute must have the value 1.0
18
XML declaration
It is possible to specify the character encoding using the optional encoding attribute.
Example:
<?xml version="1.0" encoding="ISO-8859-1"?>
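The rules on position and encoding can be checked with any parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the name "René" is made up to exercise a non-ASCII byte:

```python
import xml.etree.ElementTree as ET

# Declaration at the very start, with an explicit encoding.
# The byte 0xE9 is "é" in ISO-8859-1.
good = b'<?xml version="1.0" encoding="ISO-8859-1"?><author>Ren\xe9</author>'
author = ET.fromstring(good).text

# A single space before the declaration makes the document unparseable.
bad = b' <?xml version="1.0"?><author>Giovanni Michetti</author>'
try:
    ET.fromstring(bad)
    leading_space_accepted = True
except ET.ParseError:
    leading_space_accepted = False

print(author, leading_space_accepted)
```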
19
Elements
Elements are the most important components of XML documents: they are the logical components through which you can identify the structure of documents. Example:
<author>Giovanni Michetti</author>
Here < and > are the delimiters, author is the tag-name, <author> is the start-tag, </author> is the end-tag, Giovanni Michetti is the content, and the whole is the element
20
Elements
Each start-tag must have a corresponding end-tag (starting with a forward slash)
Empty elements (like <img>, <br>, <hr> in HTML) are represented by a tag starting with a delimiter and ending with a forward slash before the closing bracket. Example: <image/>
21
Attributes
Attributes are expressed as name-value pairs associated with elements and appearing only in start-tags
Names are separated from their values by an equals sign (=). Values are wrapped in single or double quotes
Attributes must be associated with elements
The order of the attributes inside a start-tag does not matter
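A short sketch of these rules, assuming Python's standard xml.etree.ElementTree module; the attribute names lang and year are hypothetical and do not come from the letter example:

```python
import xml.etree.ElementTree as ET

# Same attributes, different order and different quote style.
a = ET.fromstring('<letter lang="en" year="2002"/>')
b = ET.fromstring("<letter year='2002' lang='en'/>")

# The parser exposes attributes as name-value pairs; order is not significant.
same_attributes = a.attrib == b.attrib
lang = a.get("lang")

print(same_attributes, lang)
```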
22
XML tree
An XML document is a kind of hierarchical tree. It starts from a root (the root or document element) and develops into child elements, which can be siblings of one another
23
XML tree
Each element has one and only one parent (except the root)
Each element is completely wrapped inside another element (again, except the root)
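The parent, child and sibling relationships can be walked programmatically. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and a shortened letter; the helper path_to_root is hypothetical:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<letter><sender><body>University of Urbino</body>"
    "<bureau>Faculty of Arts</bureau></sender>"
    "<addressee>Dr. Giovanni Michetti</addressee></letter>"
)

# ElementTree elements do not store a parent pointer,
# so build a child -> parent map by visiting every element once.
parent_of = {child: parent for parent in doc.iter() for child in parent}


def path_to_root(element):
    """Return the tag names from the element up to the root."""
    tags = [element.tag]
    while element in parent_of:
        element = parent_of[element]
        tags.append(element.tag)
    return tags


body = doc.find("sender/body")
siblings = [child.tag for child in parent_of[body]]

print(path_to_root(body))
print(siblings)
```

Every element except the root appears exactly once as a key of parent_of, which is the "one and only one parent" rule.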
24
Entities
Example:
<author>Giovanni Michetti</author>
The string Giovanni Michetti (the element content) is also called character data. Character data can appear anywhere inside elements, or as values of attributes
25
Entities
There are special characters that are not allowed in text blocks: what if we want to use the less-than symbol < in a mathematical formula (a < b)?
Stratagem 1
Stratagem 2
26
Entities
1. CDATA sections: They start with the CDATA start marker
<![CDATA[
and end with the CDATA end marker
]]>
27
Entities
2. Entity references
Example:
&lt; stands for <
The parser recognizes the entity reference &lt; and substitutes it with the proper value <
28
Entities
A parser is a piece of software able to read and interpret an XML document. A parser reads the XML document as plain text
Some parsers (validating parsers) are able to check the conformance of an XML document with a DTD
29
Entities
Standard (i.e. predefined) entities:
&lt;    <
&gt;    >
&amp;   &
&apos;  '
&quot;  "
Any XML parser recognizes these entities and substitutes them with the proper values
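Both stratagems give the parser the same character data, while a bare < is rejected. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the element name formula is hypothetical:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Stratagem 1: a CDATA section.
with_cdata = ET.fromstring("<formula><![CDATA[a < b]]></formula>").text

# Stratagem 2: an entity reference.
with_entity = ET.fromstring("<formula>a &lt; b</formula>").text

# A bare "<" inside character data is not allowed.
try:
    ET.fromstring("<formula>a < b</formula>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

# When writing XML, escape() replaces &, < and > with entity references.
escaped = escape("a < b & c")

print(with_cdata, with_entity, raw_accepted, escaped)
```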
30
Well-formed documents
Any XML document must be well formed: it has to comply with some constraints, some of which are:
Each start-tag has a corresponding end-tag
Elements can’t overlap
There must be one and only one root element
Attribute values must be quoted
An element can’t contain different attributes with the same name
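Each of these constraints can be tried against a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper is_well_formed and the tiny sample documents are made up for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One sample per constraint, plus one document that respects them all.
checks = {
    "ok": '<letter id="1"><date>2002</date></letter>',
    "missing end-tag": "<letter><date>2002</letter>",
    "overlapping elements": "<letter><date><place></date></place></letter>",
    "two roots": "<letter/><letter/>",
    "unquoted attribute": "<letter id=1/>",
    "duplicate attribute": '<letter id="1" id="2"/>',
}
results = {name: is_well_formed(doc) for name, doc in checks.items()}

for name, outcome in results.items():
    print(name, outcome)
```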
31
Document Type Definition (DTD)
Once we have created a set of tags and attributes, we need to share it with other users so that everyone adopts the same syntax
We need a Document Type Definition (DTD)
32
Document Type Definition (DTD)
A DTD defines what markup can be used in a document that is supposed to conform to a specific structure, whose components are identified by tags
33
Document Type Definition (DTD)
For example, a DTD defines what elements a document can contain, their occurrences, their order, and so on
A DTD can set out which attributes an element can take and whether a value is required. It is also possible to define a set of predefined values for the attributes, and so on
34
Internal and external DTD
A DTD can be an external file or it can be included as part of the XML document. If it is an external file, the XML document must contain an explicit reference inside the Document Type Declaration:
<!DOCTYPE MyXMLDocs SYSTEM "file.dtd">
35
Internal and external DTD
A DTD can also be written inside the document type declaration. In this case we have an internal DTD, like:
<!DOCTYPE MyXMLDoc [
<!ELEMENT MyXMLDoc (#PCDATA)>
]>
In this case, all the constraints on the structure of the document are provided as declarations inside the square brackets
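Whether those constraints are enforced depends on the parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module, which is a non-validating parser; the element name extra is made up to break the declared content model:

```python
import xml.etree.ElementTree as ET

INTERNAL_DTD = "<!DOCTYPE MyXMLDoc [\n<!ELEMENT MyXMLDoc (#PCDATA)>\n]>\n"

conforming = INTERNAL_DTD + "<MyXMLDoc>some text</MyXMLDoc>"
# <extra> is not declared, so this document does not conform to the DTD.
violating = INTERNAL_DTD + "<MyXMLDoc><extra/></MyXMLDoc>"

text = ET.fromstring(conforming).text

# A non-validating parser reads the declarations but does not enforce them:
# the second document is still accepted because it is well formed.
child_tags = [child.tag for child in ET.fromstring(violating)]

print(text, child_tags)
```

Rejecting the second document requires a validating parser, which the Python standard library does not provide.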
36
Element declarations
A DTD is a set of declarations, the most important of which is the element declaration. Any DTD must have at least one element declaration (the one for the root element)
The syntax for a declaration is:
<!ELEMENT elementname (contentmodel)>
37
Element declarations
Example:
<!ELEMENT anthology (poem+)>
<!ELEMENT poem (title?, (stanza+|line+))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>
38
Cardinality suffixes
Cardinality suffixes are symbols used to specify how many times an element can occur at a certain point of the structure. Symbols used are:
?       0-1
+       1-n
*       0-n
(none)  1
39
Connectors
Connectors are symbols used to specify order and relationships between components of a model
Symbols used are:
, (comma): sequence, the components appear in the given order
| (vertical line): choice, exactly one of the components appears
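Suffixes and connectors behave much like regular-expression operators over the sequence of child elements. A rough sketch in Python, not a DTD validator: the helpers model_to_regex and children_match are hypothetical, and they handle only content models made of element names (no #PCDATA, EMPTY or ANY):

```python
import re

NAME = re.compile(r"[A-Za-z_][\w.-]*")


def model_to_regex(content_model):
    """Translate an element-only content model into a compiled regex."""
    compact = re.sub(r"\s+", "", content_model)
    # Each element name becomes a group matching "name " (name plus a space).
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group()) + " )", compact)
    # A comma means "followed by", which is plain concatenation in a regex;
    # the suffixes ?, +, * and the connector | already mean the same thing.
    return re.compile(pattern.replace(",", ""))


def children_match(content_model, child_tags):
    """Check a list of child tag names against a content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return model_to_regex(content_model).fullmatch(sequence) is not None


POEM = "(title?, (stanza+|line+))"
LETTER = ("(sender, date, addressee+, precedent?, protocolnumb, classif?, "
          "subject, text, attachment?, author, heading?)")

print(children_match(POEM, ["title", "stanza", "stanza"]))
print(children_match(POEM, ["stanza", "line"]))
```

The second call mixes stanza and line, which the choice connector forbids.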
40
Attribute declarations
An attribute declaration defines the attributes associated with a given element
The syntax for a declaration is:
<!ATTLIST element_name attribute_definition*>
where an attribute definition is like:
attribute_name attribute_type default_declaration
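A sketch of an attribute declaration in use, assuming Python's standard xml.etree.ElementTree module. The attributes lang (with a default value) and status (with a set of predefined values) are hypothetical and are not part of the letter DTD:

```python
import xml.etree.ElementTree as ET

# Internal DTD declaring two hypothetical attributes for <letter>:
# "lang" is free text with default "it"; "status" is enumerated and required.
DOC = """<!DOCTYPE letter [
<!ELEMENT letter (#PCDATA)>
<!ATTLIST letter
    lang   CDATA          "it"
    status (draft|final)  #REQUIRED>
]>
<letter status="final">Teaching appointment</letter>"""

letter = ET.fromstring(DOC)

# The start-tag only specifies status; lang comes from the declared default.
print(letter.get("status"), letter.get("lang"))
```

The default is filled in from the internal DTD even though this parser does not validate, so it would not complain if status were missing or had an undeclared value.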
41
Valid documents
Well-formed documents: XML documents conforming to the rules laid down in the XML 1.0 specification
Valid documents: well-formed documents conforming to the rules laid down in a DTD
42
Stylesheets
So far the structure. But how can we render documents in the proper way?
Stylesheets
43
Stylesheets
Since content is separated from style, we no longer need to rewrite the whole document each time we want to change the layout: we simply need to change the “instructions” that control rendering. In other words, we can modify representation without modifying content
XSL (eXtensible Stylesheet Language) is a style language based upon DSSSL (Document Style Semantics and Specification Language)
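The separation can be illustrated without XSL. In this sketch plain Python functions stand in for two stylesheets (this is not XSL, and the function names render_plain and render_html are hypothetical): the content stays untouched while the rendering "instructions" change.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The content: a shortened letter, never modified below.
CONTENT = "<letter><subject>Teaching appointment</subject><author>The Dean</author></letter>"


def render_plain(letter):
    """First set of rendering instructions: plain text."""
    return f"{letter.findtext('subject').upper()}\n-- {letter.findtext('author')}"


def render_html(letter):
    """Second set of rendering instructions: an HTML fragment."""
    return (f"<h1>{escape(letter.findtext('subject'))}</h1>"
            f"<p><em>{escape(letter.findtext('author'))}</em></p>")


letter = ET.fromstring(CONTENT)
plain = render_plain(letter)
html = render_html(letter)

print(plain)
print(html)
```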
44
So far the document …
… but a document is (generally) part of a file, which is in turn part of a series or a more complex archival collection
Archival bond
45
The object of analysis: from documents ...
46
… to files ...
47
.....
48
… to series
49
Archives: a complex system of relationships
File
Series
Archive
Document
50
Preserving, of course; but what?
Preserving
Original data
Context allowing data to be interpreted
Hardware
??
?
51
Preserving context
Preserving the context
Need to manage a network of metadata
52
XML technologies
XML Schema
Document Object Model (DOM)
Simple API for XML (SAX)
XSLT/XPath
XML Query
XLink
XPointer
XML Base
XForms
XML Fragment Interchange
XInclude
53
XML features
It’s a formal, non-proprietary standard: it is acceptable to a wide range of users
It’s a meta-language: it allows DTDs to be defined and documents to be validated
It allows highly structured documents to be managed
It’s human-readable and self-descriptive: good chances to last
It uses Unicode text: no problems related to internationalization
54
XML features
It’s a family of technologies
It’s modular
It’s license-free and platform-independent
It can be transported across the Web using existing transport protocols: re-use of communication and security structures already in place
55
XML features
It allows metadata to be managed easily
It provides a very good mechanism for representing the layout
It’s easy, powerful, but not too expensive
56
XML double-edged features
1. It’s a meta-language: it allows DTDs to be defined, hence the danger of specialization (each user community with its own language)
Without a common language, XML is not so competitive with respect to other mechanisms of data interchange
XSL does allow translation between different encodings, but it could be quite complex
RosettaNet and OASIS: trying to adopt common languages
57
XML double-edged features
2. It’s self-descriptive: you can create documents without using a DTD ...
58
XML double-edged features
3. It supports sophisticated searching by means of the tags embedded in the text, but poor markup (incomplete or incorrect) greatly reduces search effectiveness
59
XML limitations
It’s a syntax: it contains no semantics, so you need to use other XML modules such as XML Schema and RDF
It’s based upon text: the size of the markup can be much larger than the data itself
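The size overhead can be measured directly. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and a fragment with no entity references or CDATA sections (so that character counts compare cleanly); the helper markup_overhead is hypothetical:

```python
import xml.etree.ElementTree as ET

# Finely marked-up author block, as in the refined letter.
RECORD = (
    "<author><role>The Dean</role><name><title>Prof.</title>"
    "<propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name></author>"
)


def markup_overhead(xml_text):
    """Return (data_chars, markup_chars) for an XML string."""
    data_chars = len("".join(ET.fromstring(xml_text).itertext()))
    return data_chars, len(xml_text) - data_chars


data_chars, markup_chars = markup_overhead(RECORD)
print(data_chars, markup_chars)
```

In this fragment the tags take up several times more characters than the data they describe, and the ratio grows as the markup is refined.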
60
Preservation
Some considerations ...