Digital preservation. Principles and potential role of XML. Giovanni Michetti, Urbino, 9th October 2002

Page 1: Documents: form vs. content ?

Digital preservation. Principles and potential role of XML

Giovanni Michetti

Urbino, 9th October 2002

Page 2

Documents: form vs. content?

Traditional environment:

Form

Content

Page 3

Documents: form vs. content?

Digital environment:

Form

Content

Page 4

Documents: structure

Structure is unavoidably inside documents

As complexity grows, structure grows

Structure is (part of) the message

We deal with structure in every environment, not only the digital one

Page 5

Documents: structure and digital environment

Moving information onto new media

Need for functionalities to manage the explosive growth of information

Need to make structure explicit

Page 6

Markup

The proper description of an information resource requires:

identifying its logical components

making its structure explicit

Markup

Page 7

Markup

Markup: any means of making the interpretation of a document explicit

Page 8

From a record ...

University of Urbino
Faculty of Arts

Rome, 1st August 2002
Dr. Giovanni Michetti

Protocol n. 1234/AB
Subject: Teaching appointment

We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.

The Dean
Prof. Giorgio Cerboni Baiardi

Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]

Page 9

… to a marked record ...

<?xml version="1.0"?>
<letter>
<sender>University of Urbino
Faculty of Arts</sender>
<date>Rome, 1st August 2002</date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.</text>
<author>The Dean
Prof. Giorgio Cerboni Baiardi</author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter>
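Once the components are tagged, software can address them by name. The following is only a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a shortened copy of the letter; the variable names are illustrative and not part of the original slides.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the marked-up letter; the root element is <letter>.
LETTER = """<letter>
  <sender>University of Urbino, Faculty of Arts</sender>
  <date>Rome, 1st August 2002</date>
  <addressee>Dr. Giovanni Michetti</addressee>
  <protocolnumb>Protocol n. 1234/AB</protocolnumb>
  <subject>Subject: Teaching appointment</subject>
</letter>"""

root = ET.fromstring(LETTER)

# Each logical component is now reachable through its tag name.
fields = {child.tag: child.text for child in root}

print(fields["addressee"])
print(fields["subject"])
```

The same lookup on the unmarked record would require guessing where each component starts and ends.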

Page 10

… to a DTD ...

<!ELEMENT letter (sender, date, addressee, protocolnumb, subject, text, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>

Page 11

… to a more precise DTD

<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachments?, author, heading)>
<!ELEMENT sender (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachments (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT heading (#PCDATA)>

Page 12

Let’s refine the markup ...

<?xml version="1.0"?>
<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<protocolnumb>Protocol n. 1234/AB</protocolnumb>
<subject>Subject: Teaching appointment</subject>
<text>We inform you that you have been offered the teaching of Analysis and treatment of digital records by the Faculty of Arts Council, during the meeting of 30th July 2002. We remind you that for the stipulation of the contract we need, according to the legislative decree n. 80/1998, the authorization by the administration you belong to.</text>
<author><role>The Dean</role><name>Prof. Giorgio Cerboni Baiardi</name></author>
<heading>Faculty of Arts
Piano S. Lucia 6 - 61029 Urbino
Tel: 0722.320125 Fax: 0722.322553 Email: [email protected]</heading>
</letter>

Page 13

... keeping on refining ...

<?xml version="1.0"?>
<letter>
<sender><body>University of Urbino</body>
<bureau>Faculty of Arts</bureau></sender>
<date><place>Rome,</place><time>1st August 2002</time></date>
<addressee>Dr. Giovanni Michetti</addressee>
<!-- protocolnumb, subject and text as before -->
<author><role>The Dean</role><name><title>Prof.</title><propername>Giorgio</propername><surname>Cerboni Baiardi</surname></name></author>
<heading><bureau>Faculty of Arts</bureau>
<address>Piano S. Lucia 6 - 61029 Urbino</address>
<tel>Tel: 0722.320125</tel><fax>Fax: 0722.322553</fax><email>Email: [email protected]</email></heading>
</letter>

Page 14

… and let’s refine the DTD

<!ELEMENT letter (sender, date, addressee, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading)>
<!ELEMENT sender (body, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title, propername, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau, address, tel, fax, email)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>

Page 15

The final DTD

<!ELEMENT letter (sender, date, addressee+, precedent?, protocolnumb, classif?, subject, text, attachment?, author, heading?)>
<!ELEMENT sender (body?, bureau)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT bureau (#PCDATA)>
<!ELEMENT date (place, time)>
<!ELEMENT place (#PCDATA)>
<!ELEMENT time (#PCDATA)>
<!ELEMENT addressee (#PCDATA)>
<!ELEMENT precedent (#PCDATA)>
<!ELEMENT protocolnumb (#PCDATA)>
<!ELEMENT classif (#PCDATA)>
<!ELEMENT subject (#PCDATA)>
<!ELEMENT text (#PCDATA)>
<!ELEMENT attachment (#PCDATA)>
<!ELEMENT author (role?, name)>
<!ELEMENT role (#PCDATA)>
<!ELEMENT name (title?, propername?, surname)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT propername (#PCDATA)>
<!ELEMENT surname (#PCDATA)>
<!ELEMENT heading (bureau?, address?, tel?, fax?, email?)>
<!ELEMENT address (#PCDATA)>
<!ELEMENT tel (#PCDATA)>
<!ELEMENT fax (#PCDATA)>
<!ELEMENT email (#PCDATA)>

Page 16

XML declaration

Every XML document should start with an XML declaration, like

<?xml version="1.0"?>

The declaration must be at the very start of the document: nothing may come before it (comments, processing instructions, white space, ...)

Page 17

XML declaration

A parser uses the first 5 characters, <?xml, to work out which character encoding the document uses

The version attribute must have value 1.0

Page 18

XML declaration

It is possible to specify the character encoding using the optional encoding attribute.

Example:

<?xml version="1.0" encoding="ISO-8859-1"?>
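The two rules about the declaration (it names the encoding, and it must come first) can be checked with any parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the element name city and its content are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The declared encoding tells the parser how to decode the bytes that follow.
data = '<?xml version="1.0" encoding="ISO-8859-1"?><city>Forlì</city>'.encode("iso-8859-1")
city = ET.fromstring(data).text

# A declaration that is not at the very start of the document is rejected.
try:
    ET.fromstring(b' <?xml version="1.0"?><city>Urbino</city>')
    leading_space_accepted = True
except ET.ParseError:
    leading_space_accepted = False

print(city)
print(leading_space_accepted)
```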

Page 19

Elements

Elements are the most important components of XML documents: they are the logical components through which you can identify the structure of documents. Example:

<author>Giovanni Michetti</author>

delimiters: < and >
tag name: author
start-tag: <author>
content: Giovanni Michetti
end-tag: </author>
element: start-tag, content and end-tag together

Page 20

Elements

Each start-tag must have a corresponding end-tag (in which the tag name is preceded by a forward slash)

Empty elements (like <img>, <br>, <hr> in HTML) are represented by a single tag that ends with a forward slash before the closing bracket. Example: <image/>
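A parser exposes the parts of an element directly. The sketch below assumes Python's standard xml.etree.ElementTree module and also checks that the empty-element tag is just a shorter spelling of a start-tag followed at once by an end-tag.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<author>Giovanni Michetti</author>")
print(element.tag)   # the tag name
print(element.text)  # the content

# <image/> and <image></image> describe the same empty element.
short_form = ET.fromstring("<image/>")
long_form = ET.fromstring("<image></image>")
same = (short_form.tag, short_form.text, list(short_form)) == (
    long_form.tag,
    long_form.text,
    list(long_form),
)
print(same)
```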

Page 21

Attributes

Attributes are expressed as name-value pairs associated with elements and appearing only in start-tags (including empty-element tags)

Names are separated from their values by an equal sign (=). Values are wrapped in single or double quotes

Attributes must be associated with elements

The order of the attributes inside a start-tag does not matter
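A small check of the quoting and ordering rules, assuming Python's standard xml.etree.ElementTree module. The attribute names lang and status are hypothetical; they do not appear in the letter DTD shown earlier.

```python
import xml.etree.ElementTree as ET

# Same attributes, written in a different order and with different quotes.
first = ET.fromstring('<letter lang="it" status="draft"/>')
second = ET.fromstring("<letter status='draft' lang='it'/>")

print(first.attrib == second.attrib)
print(first.get("lang"))
```

The parser hands attributes over as a name-to-value mapping, which is why neither the order nor the kind of quote survives parsing.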

Page 22

XML tree

An XML document is a hierarchical tree. It starts from a root (the root or document element) and branches into child elements, which can have siblings

Page 23

XML tree

Each element has one and only one parent (except the root)

Each element is completely nested inside another element (except the root)
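The parent-child relationships can be listed explicitly. This is a sketch assuming Python's standard xml.etree.ElementTree module, which stores only children, so the child-to-parent lookup (the hypothetical name parent_of) is built by hand.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<letter><sender><body>University of Urbino</body>"
    "<bureau>Faculty of Arts</bureau></sender>"
    "<addressee>Dr. Giovanni Michetti</addressee></letter>"
)

# Map every child element to the single element that contains it.
parent_of = {child: parent for parent in doc.iter() for child in parent}

for element in doc.iter():
    parent = parent_of.get(element)
    print(element.tag, "<-", parent.tag if parent is not None else "(root)")
```

Every element except the root appears exactly once as a key, which is the "one and only one parent" rule.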

Page 24

Entities

Example:

<author>Giovanni Michetti</author>

The string Giovanni Michetti (the element content) is also called character data. Character data can appear anywhere inside elements, or as values of attributes

Page 25

Entities

There are special characters that are not allowed in text blocks: what if we want to use the less-than symbol < in a mathematical formula (a < b)?

Stratagem 1: CDATA sections

Stratagem 2: entity references

Page 26

Entities

1. CDATA sections: they start with the CDATA start marker

<![CDATA[

and end with the CDATA end marker

]]>

Page 27

Entities

2. Entity references. Example:

&lt; -> <

The parser recognizes the entity reference &lt; and substitutes it with the proper value <
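Both stratagems give the parser the same character data. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the element name formula is made up for the illustration.

```python
import xml.etree.ElementTree as ET

with_cdata = ET.fromstring("<formula><![CDATA[a < b]]></formula>")
with_entity = ET.fromstring("<formula>a &lt; b</formula>")

print(with_cdata.text)
print(with_entity.text)

# The raw character is rejected: the parser takes it for the start of a tag.
try:
    ET.fromstring("<formula>a < b</formula>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

print(raw_accepted)
```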

Page 28

Entities

A parser is a piece of software able to read and interpret an XML document. A parser reads the XML document as plain text

Some parsers (validating parsers) are able to check the conformance of an XML document with a DTD

Page 29

Entities

Standard (i.e. predefined) entities:

&lt; -> <
&gt; -> >
&amp; -> &
&apos; -> '
&quot; -> "

Any XML parser recognizes these entities and substitutes them with the proper values
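The substitution works in both directions: a parser expands the references when reading, and text has to be escaped before it is written into a document. A sketch assuming Python's standard xml.etree.ElementTree module and the escape helper from xml.sax.saxutils; the sample strings are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Reading: the five predefined entity references are expanded by the parser.
parsed = ET.fromstring("<t>&lt; &gt; &amp; &apos; &quot;</t>").text
print(parsed)

# Writing: escape() replaces &, < and > so the text can be placed inside an element.
escaped = escape("R&D: a < b")
print(escaped)
```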

Page 30

Well-formed documents

Any XML document must be well formed: it has to comply with some constraints, some of which are:

Each start-tag has a corresponding end-tag

Elements can’t overlap

There must be one and only one root element

Attribute values must be quoted

An element can’t contain different attributes with the same name
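Each of these constraints can be tried against a parser. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper is_well_formed and the sample documents are hypothetical and written only for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as an XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "well-formed": '<a x="1"><b/></a>',
    "missing end-tag": "<a><b></a>",
    "overlapping elements": "<a><b><c></b></c></a>",
    "two root elements": "<a/><b/>",
    "unquoted attribute value": "<a x=1/>",
    "duplicate attribute": '<a x="1" x="2"/>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, ok)
```

Only the first sample is accepted; a parser stops at the first violation it meets.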

Page 31

Document Type Definition (DTD)

Once we have created a set of tags and attributes, we need to share it with other users so that everyone adopts the same syntax

We need a Document Type Definition (DTD)

Page 32

Document Type Definition (DTD)

A DTD defines what markup can be used in a document that is supposed to conform to a specific structure, whose components are identified by tags

Page 33

Document Type Definition (DTD)

For example, a DTD defines what elements a document can contain, how many times they can occur, in which order, and so on

A DTD can set out which attributes an element can take and whether they are required. It is also possible to define a set of predefined values for the attributes, and so on

Page 34

Internal and external DTD

A DTD can be an external file or it can be included as part of the XML document. If it is an external file, the XML document must contain an explicit reference to it inside the document type declaration:

<!DOCTYPE MyXMLDoc SYSTEM "file.dtd">

Page 35

Internal and external DTD

A DTD can also be written inside the document type declaration. In this case we have an internal DTD, like:

<!DOCTYPE MyXMLDoc [
<!ELEMENT MyXMLDoc (#PCDATA)>
]>

In this case, all the constraints on the structure of the document are provided as declarations inside the square brackets
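What a parser sees in the document type declaration can be inspected through callbacks. The sketch below assumes Python's standard xml.parsers.expat module, which is a non-validating parser: it reads the declarations but does not enforce them and does not fetch the external file. The function read_doctype is a hypothetical helper.

```python
from xml.parsers import expat

def read_doctype(document: str):
    """Return (root name, system identifier, declared element names)."""
    info = {"name": None, "system_id": None, "elements": []}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["name"] = name
        info["system_id"] = system_id

    def on_element_decl(name, model):
        info["elements"].append(name)

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(document, True)
    return info["name"], info["system_id"], info["elements"]

internal = "<!DOCTYPE MyXMLDoc [<!ELEMENT MyXMLDoc (#PCDATA)>]><MyXMLDoc>hello</MyXMLDoc>"
external = '<!DOCTYPE MyXMLDoc SYSTEM "file.dtd"><MyXMLDoc>hello</MyXMLDoc>'

print(read_doctype(internal))
print(read_doctype(external))
```

With the internal DTD the element declaration is reported; with the external one only the reference to file.dtd is known to the parser.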

Page 36

Element declarations

A DTD is a set of declarations, the most important of which is the element declaration. Any DTD must have at least one element declaration (for the root element)

The syntax for a declaration is:

<!ELEMENT elementname (contentmodel)>

Page 37

Element declarations

Example:

<!ELEMENT anthology (poem+)>
<!ELEMENT poem (title?, (stanza+|line+))>
<!ELEMENT title (#PCDATA)>
<!ELEMENT stanza (line+)>
<!ELEMENT line (#PCDATA)>

Page 38

Cardinality suffixes

Cardinality suffixes are symbols used to specify how many times an element can occur at a certain point of the structure. Symbols used are:

? zero or one (0-1)
+ one or more (1-n)
* zero or more (0-n)
(none) exactly one (1)

Page 39

Connectors

Connectors are symbols used to specify order and relationships between components of a model

Symbols used are:

, (comma): sequence, the components must appear in the given order

| (vertical line): choice, exactly one of the alternatives must appear
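Cardinality suffixes and connectors behave like the operators of a regular expression. The sketch below is not DTD validation: it assumes Python's standard re and xml.etree.ElementTree modules and translates one content model, (title?, (stanza+|line+)), into a regular expression by hand. It checks only the direct children of a poem element.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model (title?, (stanza+ | line+)).
# Each child is written as its tag name followed by a space.
POEM_MODEL = re.compile(r"(title )?((stanza )+|(line )+)")

def matches_model(xml_text: str) -> bool:
    """Check the sequence of child tags of <poem> against the model."""
    poem = ET.fromstring(xml_text)
    children = "".join(child.tag + " " for child in poem)
    return POEM_MODEL.fullmatch(children) is not None

titled = "<poem><title>T</title><line>a</line><line>b</line></poem>"
mixed = "<poem><line>a</line><stanza><line>b</line></stanza></poem>"
title_only = "<poem><title>T</title></poem>"

print(matches_model(titled))
print(matches_model(mixed))
print(matches_model(title_only))
```

The second poem fails because the vertical line allows either stanzas or lines, not a mixture; the third fails because + requires at least one occurrence.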

Page 40

Attribute declarations

An attribute declaration defines the attributes associated with a given element

The syntax for a declaration is:

<!ATTLIST element_name attribute_definition*>

where an attribute definition is like:

attribute_name attribute_type default_declaration
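As an illustration of the syntax, the sketch below declares two attributes in an internal DTD and lets a parser report them. It assumes Python's standard xml.parsers.expat module; the attributes id and status are hypothetical and are not part of the letter DTD shown earlier.

```python
from xml.parsers import expat

DOCUMENT = """<!DOCTYPE letter [
  <!ELEMENT letter (#PCDATA)>
  <!ATTLIST letter
            id     CDATA         #REQUIRED
            status (draft|final) "draft">
]>
<letter id="1234/AB">text</letter>"""

declared = []

def on_attlist(element_name, attribute_name, attribute_type, default, required):
    # Called once for every attribute definition in an ATTLIST declaration.
    declared.append((element_name, attribute_name, default, bool(required)))

parser = expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist
parser.Parse(DOCUMENT, True)

for entry in declared:
    print(entry)
```

Here id uses the type CDATA with the default declaration #REQUIRED, while status lists its allowed values and gives "draft" as the default.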

Page 41

Valid documents

Well-formed documents: XML documents conforming to the rules laid down in the XML 1.0 specification

Valid documents: well-formed documents conforming to the rules laid down in a DTD

Page 42

Stylesheets

So far the structure. But how can we render documents in the proper way?

Stylesheets

Page 43

Stylesheets

Since content is separated from style, we no longer need to rewrite the whole document each time we want to change the layout: we simply need to change the “instructions” that govern rendering. In other words, we can modify representation without modifying content

XSL (eXtensible Stylesheet Language) is a style language based upon DSSSL (Document Style Semantics and Specification Language)

Page 44

So far the document …

… but a document is (generally) part of a file, which is in turn part of a series or a more complex archival collection

Archival bond

Page 45

The object of analysis: from documents ...

Page 46

… to files ...

Page 47

.....

Page 48

… to series

Page 49

Archives: a complex system of relationships

File

Series

Archive

Document

Page 50

Preserving, of course; but what?

Preserving

Original data?

Context allowing data to be interpreted?

Hardware?

Page 51

Preserving context

Preserving the context

Need to manage a network of metadata

Page 52

XML technologies

XML Schema

Document Object Model (DOM)

Simple API for XML (SAX)

XSLT/XPath

XML Query

XLink

XPointer

XML Base

XForms

XML Fragment Interchange

XInclude
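Two of the technologies listed, DOM and SAX, are alternative ways for a program to read the same document. The sketch below assumes Python's standard xml.dom.minidom and xml.sax modules; the class TagCollector and the sample document are hypothetical.

```python
import xml.sax
from xml.dom import minidom

DOCUMENT = b"<letter><sender>University of Urbino</sender><date>1st August 2002</date></letter>"

# DOM: the whole document is loaded into a tree that can be navigated afterwards.
dom = minidom.parseString(DOCUMENT)
dom_tags = [node.tagName for node in dom.documentElement.childNodes]

# SAX: the parser reports events while reading; nothing is kept unless the handler keeps it.
class TagCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        self.tags.append(name)

collector = TagCollector()
xml.sax.parseString(DOCUMENT, collector)

print(dom_tags)
print(collector.tags)
```

DOM suits documents that must be navigated or modified in memory; SAX suits large documents that only need to be read once.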

Page 53

XML features

It’s a formal, non-proprietary standard: it is acceptable to a wide range of users

It’s a meta-language: it allows DTDs to be defined and documents to be validated

It allows highly structured documents to be managed

It’s human-readable and self-descriptive: good chances to last

It uses Unicode text: no problems related to internationalization

Page 54

XML features

It’s a family of technologies

It’s modular

It’s license-free and platform-independent

It can be transported across the Web using existing transport protocols: re-use of communication and security structures already in place

Page 55

XML features

It allows metadata to be managed easily

It provides a very good mechanism for representing the layout

It’s easy, powerful, but not too expensive

Page 56

XML double-edged features

1. It’s a meta-language: it allows DTDs to be defined, hence the danger of specialization (each user community with its own language)

Without a common language, XML is not so competitive with respect to other mechanisms of data interchange

XSL does allow translation between different encodings, but it can be quite complex

RosettaNet and OASIS: trying to adopt common languages

Page 57

XML double-edged features

2. It’s self-descriptive: you can create documents without using a DTD ...

Page 58

XML double-edged features

3. It supports sophisticated searching by means of the tags embedded in the text, but poor markup (incomplete or incorrect) greatly reduces search effectiveness
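The difference between searching the text and searching through the tags can be shown on a tiny example. A sketch assuming Python's standard xml.etree.ElementTree module; the two letters are invented for the illustration and reuse element names from the refined markup.

```python
import xml.etree.ElementTree as ET

ARCHIVE = ET.fromstring(
    "<series>"
    "<letter><author><surname>Cerboni Baiardi</surname></author>"
    "<text>Reply to Dr. Michetti</text></letter>"
    "<letter><author><surname>Michetti</surname></author>"
    "<text>Acceptance of the appointment</text></letter>"
    "</series>"
)

# Plain-text search: every letter that mentions the name anywhere.
mentions = [l for l in ARCHIVE if "Michetti" in "".join(l.itertext())]

# Markup-aware search: only the letters whose author has that surname.
authored = [l for l in ARCHIVE if l.findtext("author/surname") == "Michetti"]

print(len(mentions))
print(len(authored))
```

If the surname element were missing or misused in some letters, the second search would silently miss them, which is the risk the slide points out.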

Page 59

XML limitations

It’s a syntax: it contains no semantics, so you need to use other XML technologies such as XML Schema and RDF

It’s based upon text: the size of the markup can be much larger than the data itself
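The overhead is easy to measure on a fragment of the refined letter. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; it simply counts characters of data against characters of markup.

```python
import xml.etree.ElementTree as ET

RECORD = (
    "<name><title>Prof.</title><propername>Giorgio</propername>"
    "<surname>Cerboni Baiardi</surname></name>"
)

# Characters that carry data versus characters spent on tags.
data_chars = len("".join(ET.fromstring(RECORD).itertext()))
markup_chars = len(RECORD) - data_chars

print(data_chars, markup_chars)
```

In this fragment the tags take up more characters than the name they describe.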

Page 60

Preservation

Some considerations ...

Page 61

Thanks to all

Giovanni Michetti

[email protected]