

IS 257 – Fall 2011 2011.08.25 - SLIDE 1

XML Foundations: Introduction

Ray R. Larson

University of California, Berkeley

School of Information

IS 242: XML Foundations

This lecture is largely based on an earlier lecture by Eric Wilde


Abstract

• The Extensible Markup Language (XML) was introduced in 1998 to enable content providers to publish their content on the Web in an application-specific format. HTML was thought to lack sufficient semantics, since its only purpose was (and is) the preparation of content for Web-based publishing. XML was the first step towards machine-readable data formats for the Web, a trend that has been taken to higher levels with the introduction of the Semantic Web. XML appeared when the Web was in the steepest part of its success curve, and since then has taken over as the globally accepted format for the exchange of machine-readable structured data.


XML Overview

• More and more value switches from goods to information

• Information sharing needs well-defined structures

• Business agility and flexibility are critical success factors

• Standardized formats prevent lock-in and incompatibilities

• XML is the most successful format for structured data

• XML technologies are widely used and universally available

• XML for B2B enables better workflow engineering

• XML for B2C is a good interface between B2B and Web interfaces

• XML is a mission-critical success factor for optimizing ROI and minimizing interoperability risks in today's fast-moving globalized fragmented business landscape …


Plan for the Course

• XML Basics and how to apply them
• Describing classes of XML documents
• Combining different vocabularies of XML documents
• Selecting parts of an XML document
• Transforming XML into something else (or XML again)
• A more complicated way to describe classes of XML documents
• Even more ways of describing classes of XML documents
• How does all of this relate to databases?
• What to expect as future developments


What will we be doing?

• Projects
  – Encoded Archival Description/Encoded Archival Context
  – iTunes XML as the common theme (linking with other data)
  – how to understand an XML document representing an iTunes library
  – how to write a schema describing this document's structure
  – how to select parts of the library (tracks, playlists, artists, …)
  – how to transform libraries/playlists (into HTML, Atom, …)
• Tools
  – XML editor such as Altova XML Spy (XSLT and XQuery included) or Oxygen
  – XSLT Processor such as Saxon
  – XQuery Processor such as Saxon
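As a rough preview of "selecting parts of the library", here is a minimal Python sketch. It relies on the fact that an iTunes library export is an Apple property list, which Python's standard `plistlib` module can read. The sample document is a hand-written, heavily simplified stand-in (the track names and artists are made up), and `tracks_by_artist` is a hypothetical helper, not one of the course tools.

```python
import plistlib

# Hand-written, heavily simplified stand-in for an iTunes library export.
# Real exports carry many more keys per track; all values here are made up.
LIBRARY_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<dict>
  <key>Tracks</key>
  <dict>
    <key>101</key>
    <dict>
      <key>Track ID</key><integer>101</integer>
      <key>Name</key><string>First Song</string>
      <key>Artist</key><string>Example Artist</string>
    </dict>
    <key>102</key>
    <dict>
      <key>Track ID</key><integer>102</integer>
      <key>Name</key><string>Second Song</string>
      <key>Artist</key><string>Another Artist</string>
    </dict>
  </dict>
</dict>
</plist>
"""


def tracks_by_artist(library_bytes, artist):
    """Return the names of all tracks whose Artist entry equals `artist`."""
    library = plistlib.loads(library_bytes)  # the plist becomes nested dicts
    return [
        track["Name"]
        for track in library["Tracks"].values()
        if track.get("Artist") == artist
    ]


print(tracks_by_artist(LIBRARY_XML, "Example Artist"))
```

The same selection could be written in XPath or XQuery; the flat key/value layout of property lists makes that more awkward than it looks, which is part of why the format is an instructive example.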


Outline

• Varia

• What is XML?

• Why XML?

• Beyond XML


About the course

• All subject to change…

• Web site at
  – http://courses.ischool.berkeley.edu/i242/f11/
• Office hours TuTh 2-3
• TA: Yiming Liu
  – Office, lab hours TBA
• Guest lecturers when away
  – Eric Wilde, Jeroen van Rotterdam (EMC)


Outline

• Varia

• What is XML?
  – What is XML Good for?
  – What is XML NOT Good for?

• Why XML?

• Beyond XML


XML Yin/Yang

• XML is …
  – … great for exchanging trees (if this is what you want to do)
  – … platform-independent (even your mobile phone processes XML)
  – … a foundation for other technologies (some of which we will look at)
• XML is not …
  – … a programming language (ever programmed comma-separated values?)
  – … capturing semantics (without higher-layer consensus, XML is worthless)
  – … ensuring interoperability (we both use bits! we can interoperate!)


Why use XML

• Because you want to share data
  – share it in a format which is widely used and easy to use
  – enable others to use it on various platforms with existing tools
• Because you want to share data cheaply
  – it is easier to use XML than to invent something new
  – it is even easier to use an existing XML schema than to invent a new one
• Because you want to share data openly
  – if you invent new formats, people must process them
  – avoid applying the "security through obscurity" principle inadvertently
  – application-specific processing should be deferred to higher layers


Is XML self-describing?

• XML is often said to be "self-describing"
  – many people think this is the same as "self-explanatory"
  – the catch is what exactly you refer to by "describing"
• Database data cannot live without a database
  – database data is simply content; the structure is provided by a DBMS
  – XML documents have their structure encoded within them
  – compared to database data, XML in fact is "self-describing"
• What is the gap between "self-describing" and "self-explanatory"?
  – it is impossible to find out how the document could be modified
  – there are no semantics associated with either structure or content
  – so "self-describing" means you can guess a lot, but you may be wrong
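The gap between "self-describing" and "self-explanatory" can be made concrete with a small Python sketch. The order document and the `outline` helper are made up for illustration; only the standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Made-up document: no schema, no documentation, just the markup itself.
DOC = """<order id="17">
  <item qty="2">widget</item>
  <date>03/04/2011</date>
</order>"""


def outline(element, depth=0):
    """Return one line per element: indentation, tag name, attribute names."""
    attributes = "".join(f" @{name}" for name in element.attrib)
    lines = ["  " * depth + element.tag + attributes]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


structure = outline(ET.fromstring(DOC))
print("\n".join(structure))
```

The structure (an order containing an item and a date) is recovered from the document alone, with no DBMS involved. What the document does not say is whether 03/04/2011 means 3 April or 4 March, whether a second `date` would be allowed, or whether `qty` may be omitted. Those are guesses until someone supplies a schema and an agreed meaning.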


XML is Character-based

• XML is not a binary format, it is based on Unicode
  – "binary structures" cannot (or rather should not) be described using XML
• Multimedia formats often are binary
  – image formats such as GIF, JPEG, and PNG
  – audio formats such as MP3 and AAC
  – video formats such as MPEG4 and H.264
• But: multimedia also uses many XML formats
  – vector graphics formats such as Scalable Vector Graphics (SVG)
  – Synchronized Multimedia Integration Language (SMIL) for describing presentations
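One way to see why binary data "should not" be described using XML: arbitrary bytes have to be re-encoded as characters before they can appear in a document at all. The sketch below uses Base64 for that, which is a common choice but an assumption here (the slides do not name an encoding); the `attachment` element and its `encoding` attribute are made up.

```python
import base64
import xml.etree.ElementTree as ET

# Arbitrary bytes, including NUL, which is not a legal XML character.
raw = bytes([0x00, 0x01, 0xFF, 0x89, 0x50])

# Encode the bytes as ASCII text so they can travel as element content.
attachment = ET.Element("attachment", {"encoding": "base64"})
attachment.text = base64.b64encode(raw).decode("ascii")
serialized = ET.tostring(attachment, encoding="unicode")

# The receiver parses the XML and decodes the text back into bytes.
recovered = base64.b64decode(ET.fromstring(serialized).text)

print(serialized)
print(recovered == raw)
```

The round trip works, but Base64 inflates the payload by about a third and the content is opaque to every XML tool, so little is gained over keeping the binary file next to the XML and referring to it.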


XML is a syntax for trees

• Not all data is easily represented by trees
  – overlapping markup (multiple "views" of the same content)
  – graph-like structures which are less constrained than trees
• What is it that you have in your tree?
  – XML encodes a structure purely on the syntactic level
  – what the structures mean is in no way described by XML
  – XML structures must be accompanied by semantic descriptions
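The tree constraint is enforced by every XML parser: elements must nest properly, so overlapping markup is rejected outright. A minimal Python sketch with two made-up fragments (`is_well_formed` is a hypothetical helper built on the standard `xml.etree.ElementTree` module):

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts `text` as a single XML tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested: the italic span is split so that it never crosses <b>.
nested = "<p><b>bold <i>both</i></b><i> italic</i></p>"

# Overlapping: <i> opens inside <b> but closes after it.
overlapping = "<p><b>bold <i>both</b> italic</i></p>"

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```

Representing the second fragment requires a workaround such as splitting one of the spans, as in the first fragment, which changes the structure being recorded.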


XML Usage

• XML can be used in different ways
  – people should be able to use your XML directly using standard tools
  – if they absolutely need a set of special tools, something is wrong
• XML is hip, so everybody wants to use it
  – many things have been created ad-hoc and without much planning
  – if you start something which is XML-based, use XML responsibly
  – if you have to use some "bad XML", complain about it
• Finding the balance can be hard
  – XML is great for prototyping and experiments
  – once you decide to redesign your XML, it may be too late
  – XML documents may be short-lived, XML schemas are definitely not


Outline

• Varia

• What is XML?

• Why XML?
  – Pre-XML problems
  – XML on the Web
  – XML today

• Beyond XML


Web Technology

• Early Web: URI+HTTP+HTML
  – URIs identify resources (in a human-readable way)
  – HTTP retrieves resources (using a simple protocol)
  – HTML is the resource format (using a simple data format)
• The early Web was a distributed hypermedia system
  – not designed by hypermedia researchers or companies
  – simple enough to be adopted very fast
• The Web today uses many different technologies
  – URI+HTTP+HTML for basic Web publishing
  – CSS & JavaScript (maybe even Ajax) for advanced publishing
• JavaScript & XML (a.k.a. Ajax)
  – scripts dynamically loading data from a server
  – machine-to-machine interaction: the server and the script


From Humans to Machines

• The Web was designed for humans
  – HTML is a language for describing page layout and links
  – machines were only used for implementing it
• Search engines were the first machine users on the Web
  – they made the Web's success possible
  – they demonstrated how hard it is to "understand" HTML pages
  – search engines are still a very active field of research

• A bigger Web needs more automation


SGML, HTML and XML

• Standard Generalized Markup Language (SGML)
  – a language for designing document types
  – a very complex standard with many expensive and non-interoperable implementations
• Hypertext Markup Language (HTML)
  – implements a simple SGML document type
  – its syntax is SGML syntax, it is not defined by HTML itself
  – uses very few SGML features, dedicated processors are rather easy to build
• Extensible Markup Language (XML)
  – a language for designing document types (i.e., classes of documents)
  – a greatly simplified version of SGML, omitting many obscure features
  – a specification with no optional parts!


XML Documents on the Web

• XML's idea was that content should be published as XML
  – stylesheets could then be used to render human-readable views
  – machines could simply use the underlying XML
• There are (almost) no XML documents on the Web
  – stylesheet support depends on browsers (software has a long life!)
  – many content providers do not want to publish machine-readable data
• There are many XML documents behind HTML documents
  – content does not have to be made public in a machine-readable way
  – browser-independent HTML can be produced from XML
  – XML technologies can be leveraged on the server-side
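The "XML behind HTML" pattern can be sketched in a few lines of Python. In practice this step is typically an XSLT stylesheet run on the server; the hand-written `render_html` function below is only a stand-in for such a stylesheet, and the playlist vocabulary is made up.

```python
import xml.etree.ElementTree as ET

# Made-up source document that stays on the server.
SOURCE = """<playlist title="Road trip">
  <track artist="Example Artist">First Song</track>
  <track artist="Another Artist">Second Song</track>
</playlist>"""


def render_html(xml_text):
    """Build an HTML fragment from the playlist XML."""
    playlist = ET.fromstring(xml_text)
    section = ET.Element("section")
    ET.SubElement(section, "h1").text = playlist.get("title")
    items = ET.SubElement(section, "ul")
    for track in playlist.iter("track"):
        entry = ET.SubElement(items, "li")
        entry.text = f"{track.text} ({track.get('artist')})"
    # method="html" serializes using HTML rather than XML conventions.
    return ET.tostring(section, encoding="unicode", method="html")


html = render_html(SOURCE)
print(html)
```

The browser only ever sees the generated HTML, so nothing depends on client-side stylesheet support, and the provider decides whether the underlying XML is published at all.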


XML Documents Elsewhere

• XML is not used as intended, but it is very successful
  – as a server-side foundation for Web publishing
  – as a B2B-focused format with no Web publishing in mind
• XML has been successful for several reasons
  – being there at the right time (Internet bubble)
  – politically correct (the W3C is OS-agnostic)
  – technically sound (simple and no optional parts)
  – human-readable based on a well-known syntax
  – great for rapid prototyping and experiments


Used Everywhere

• Very small: Messages from sensors
  – e.g., building automation or car electronics
  – mostly implemented in hardware or firmware
• Very large: Genome sequences
  – encoding the results of genome analyses
  – yields very large XML documents (several gigabytes)
• Very different processing requirements
  – very fast processing (time critical applications)
  – memory-conserving processing (very large documents)
  – incremental processing (streaming)
  – random access (only small parts required)
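For the "very large" end of the spectrum, incremental processing avoids building the whole tree in memory. Here is a minimal sketch using `iterparse` from Python's standard library; the `genome`/`sequence` vocabulary is made up, and a real multi-gigabyte file would be opened from disk rather than from an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET


def count_long_sequences(stream, min_length):
    """Count <sequence> elements whose text has at least `min_length` characters.

    Each element is examined when its end tag arrives and then emptied,
    so the sequence text never accumulates in memory.
    """
    count = 0
    for _event, element in ET.iterparse(stream, events=("end",)):
        if element.tag == "sequence":
            if len(element.text or "") >= min_length:
                count += 1
            element.clear()  # drop text and children of the finished element
    return count


# Tiny made-up stand-in for a very large document.
sample = io.BytesIO(
    b"<genome>"
    b"<sequence>ACGT</sequence>"
    b"<sequence>ACGTACGTAC</sequence>"
    b"<sequence>AC</sequence>"
    b"</genome>"
)
result = count_long_sequences(sample, min_length=4)
print(result)
```

The trade-off is the one listed above: streaming conserves memory but gives up random access, since an element is gone once it has been processed.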


This course and XML

• "XML is ASCII for the 21st century"
  – information professionals should know and use XML
  – you will see it in many projects
  – you will hopefully use it in many projects
  – you will be able to build and test prototypes very rapidly
• What do you need for using XML?
  – XML and some kind of schema language
  – XSLT for processing it
  – XQuery and XML databases for search and access


Sharing Concepts

• XML is a syntax for trees
  – trees are just structured data
  – for doing something useful, you must understand the trees
• Schema-based sharing of concepts is possible
  – HTML works great because everybody is using it
  – Anything beyond HTML's capabilities needs a new schema
• General sharing of concepts is hard
  – the AI community tried for decades and failed
  – micro-formats are a more humble approach to "reusable shared concepts"
  – agreement in communities gets exponentially harder with their size


The Semantic Web

• Technologies for describing concepts
  – the foundation of successful interaction is mutual understanding
  – describe your XML using Semantic Web technologies
• XML core technologies do not convey any meaning
  – XML is a language for exchanging trees
  – XML schema languages describe what trees may be exchanged
  – XML schema languages are for markup design
• Semantic Web technologies have received a lot of attention
  – and a lot of research funding (latest rebranding: Linked Data)
  – success for the most general approaches is questionable at best
  – debatable success of AI's overall promises ("thinking machines")
  – modest approaches are more promising and likely to succeed