xml and databases - computer science and engineeringcs4317/10s1/lectures/01_handout.pdf · xml and...

18
XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW - Semester 1, 2010 XML and Databases Sebastian Maneth NICTA and UNSW (today: Kim Nguyen) Lecture 1 Introduction to XML CSE@UNSW - Semester 1, 2010 3 Similar to HTML (Berners-Lee, CERN W3C) use your own tags. Amount/popularity of XML data is growing steadily (faster than computing power) XML 4 Similar to HTML (Berners-Lee, CERN W3C) use your own tags. Amount/popularity of XML data is growing steadily (faster than computing power) HTML pages are *tiny* (couple of Kbytes) XML documents can be huge (GBytes) XML Databases 5 XML and Databases This course will introduce you to the world of XML and to the challenges of dealing with XML in a RDMS. Some of these challenges are Existing (DB) technology cannot be applied to XML data. 6 XML and Databases This course will introduce you to the world of XML and to the challenges of dealing with XML in a RDMS. Some of these challenges are Existing (DB) technology cannot be applied to XML data. can handle huge amounts of data stored in relations storage management index structures join/sort algorithms

Upload: others

Post on 25-Jan-2020

75 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

XM

L a

nd D

atabase

s

Sebast

ian M

aneth

NIC

TA

and U

NS

W

Le

ctu

re 1

Intr

od

uctio

n to

XM

L

CS

E@

UN

SW

-

Se

me

ste

r 1

, 2

01

0

XM

L a

nd D

atabase

s

Sebast

ian M

aneth

NIC

TA

and U

NS

W

(today:

Kim

Nguye

n)

Le

ctu

re 1

Intr

od

uctio

n to

XM

L

CS

E@

UN

SW

-

Se

me

ste

r 1

, 2

01

0

3

àS

imila

r to

HT

ML

(B

ern

ers

-Le

e, C

ER

N à

W3

C)

use

yo

ur

ow

n ta

gs.

àA

mo

un

t/p

op

ula

rity

of

XM

Ld

ata

is g

row

ing

ste

ad

ily(f

ast

er

tha

n c

om

pu

ting

po

we

r)

XM

L

4

àS

imila

r to

HT

ML

(B

ern

ers

-Le

e, C

ER

N à

W3

C)

use

yo

ur

ow

n ta

gs.

àA

mo

un

t/p

op

ula

rity

of

XM

Ld

ata

is g

row

ing

ste

ad

ily(f

ast

er

tha

n c

om

pu

ting

po

we

r)

HT

ML

pa

ge

s a

re

*tin

y* (

cou

ple

of K

byt

es)

XM

Ld

ocu

me

nts

ca

n b

e

hu

ge

(GB

yte

s)

XM

L

Data

base

s

5

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

So

me

of

the

se c

ha

llen

ge

s a

re

Exi

stin

g (

DB

) te

chn

olo

gy

can

no

tb

e a

pp

lied

to X

ML

da

ta.

6

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

So

me

of

the

se c

ha

llen

ge

s a

re

Exi

stin

g (

DB

) te

chn

olo

gy

can

no

tb

e a

pp

lied

to X

ML

da

ta.

can

ha

nd

le h

ug

e a

mo

un

ts o

f d

ata

sto

red

in r

ela

tio

ns

àst

ora

ge

ma

na

ge

me

nt

àin

de

x st

ruct

ure

join

/so

rt a

lgo

rith

ms

à…

Page 2: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

7

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

So

me

of

the

se c

ha

llen

ge

s a

re

Exi

stin

g (

DB

) te

chn

olo

gy

can

no

tb

e a

pp

lied

to X

ML

da

ta.

can

ha

nd

le h

ug

e a

mo

un

ts o

f d

ata

sto

red

in r

ela

tio

ns

àst

ora

ge

ma

na

ge

me

nt

àin

de

x st

ruct

ure

join

/so

rt a

lgo

rith

ms

à…

XM

L d

ata

.

tre

est

ruct

ure

d

8

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

So

me

of

the

se c

ha

llen

ge

s a

re

Exi

stin

g (

DB

) te

chn

olo

gy

can

no

tb

e a

pp

lied

to X

ML

da

ta.

can

ha

nd

le h

ug

e a

mo

un

ts o

f d

ata

sto

red

in r

ela

tio

ns

àst

ora

ge

ma

na

ge

me

nt

àin

de

x st

ruct

ure

join

/so

rt a

lgo

rith

ms

à…

tre

est

ruct

ure

d

Ho

w to

sto

rea

tr

ee

, in

a D

B t

ab

le/r

ela

tion

??

1

23

4

56

no

de

fch

ild n

sib

1

2

-

2

-

33

-4

4

5

-

5

-

66

--

XM

L d

ata

.

1

23

4

56

no

de

fch

ild n

sib

1

2

-

2

-

33

-4

4

5

-

5

-

66

--

tre

eta

ble

“sh

red

din

g”

9

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

So

me

of

the

se c

ha

llen

ge

s a

re

Exi

stin

g (

DB

) te

chn

olo

gy

can

no

tb

e a

pp

lied

to X

ML

da

ta.

àh

ow

do

we

sto

re t

ree

s?

àca

n w

e b

en

efit

fro

m i

nd

ex

str

uc

ture

s?

àh

ow

to

imp

lem

en

t tre

e n

av

iga

tio

n?

Ad

ditio

na

l ch

alle

ng

es p

ose

d b

y W

3C

’s X

Qu

ery

pro

po

sa

l.

àa

no

tion

of

ord

er

àa

co

mp

lex

ty

pe

/ s

ch

em

a s

ys

tem

àp

oss

ibili

ty to

co

nst

ruct

ne

w tr

ee

no

de

s o

n t

he

fly

.

tre

est

ruct

ure

d

XM

L d

ata

.

10

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

So

me

of

the

se c

ha

llen

ge

s a

re

Exi

stin

g (

DB

) te

chn

olo

gy

can

no

tb

e a

pp

lied

to X

ML

da

ta.

àh

ow

do

we

sto

re t

ree

s?

àca

n w

e b

en

efit

fro

m i

nd

ex

str

uc

ture

s?

àh

ow

to

imp

lem

en

t tre

e n

av

iga

tio

n?

Ad

ditio

na

l ch

alle

ng

es p

ose

d b

y W

3C

’s X

Qu

ery

pro

po

sa

l.

àa

no

tion

of

ord

er

àa

co

mp

lex

ty

pe

/ s

ch

em

a s

ys

tem

àp

oss

ibili

ty to

co

nst

ruct

ne

w tr

ee

no

de

s o

n t

he

fly

.

tre

est

ruct

ure

d

XM

L d

ata

.

XM

L =

T

hre

at

to D

ata

ba

se

s…

?!

11

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

So

me

of

the

se c

ha

llen

ge

s a

re

Exi

stin

g (

DB

) te

chn

olo

gy

can

no

tb

e a

pp

lied

to X

ML

da

ta.

àh

ow

do

we

sto

re t

ree

s?

àca

n w

e b

en

efit

fro

m i

nd

ex

str

uc

ture

s?

àh

ow

to

imp

lem

en

t tre

e n

av

iga

tio

n?

Ad

ditio

na

l ch

alle

ng

es p

ose

d b

y W

3C

’s X

Qu

ery

pro

po

sa

l.

àa

no

tion

of

ord

er

àa

co

mp

lex

ty

pe

/ s

ch

em

a s

ys

tem

àp

oss

ibili

ty to

co

nst

ruct

ne

w tr

ee

no

de

s o

n t

he

fly

.

tre

est

ruct

ure

d

XM

L d

ata

.

XM

L =

T

hre

at

to D

ata

ba

se

s!!

12

XM

L a

nd

Da

tab

ase

s

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

Yo

u w

ill le

arn

ab

ou

t

àT

ree

str

uct

ure

d d

ata

(XM

L)

àX

ML

pa

rse

rs &

effi

cie

nt m

em

ory

re

pre

sen

tatio

n

àQ

ue

ry la

ng

ua

ge

sfo

r X

ML (

XP

ath

, X

Qu

ery

, X

SLT

…)

àE

ffici

en

t e

valu

atio

n u

sin

g fi

nite

-sta

te a

uto

ma

ta

àM

ap

pin

g X

ML

to d

ata

ba

ses

àA

dva

nce

d to

pic

s (q

ue

ry o

ptim

iza

tion

s,

acc

ess

co

ntr

ol,

up

da

te la

ng

ua

ge

s…

)

Page 3: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

13

Th

is c

ou

rse

will

àin

tro

du

ce y

ou

to

the

w

orl

d o

f X

ML

an

d to

th

e c

ha

llen

ge

so

f d

ea

ling

with

XM

L in

a R

DM

S.

Yo

u w

ill le

arn

ab

ou

t

àT

ree

str

uct

ure

d d

ata

(XM

L)

àX

ML

pa

rse

rs &

effi

cie

nt m

em

ory

re

pre

sen

tatio

n

àQ

ue

ry L

an

gu

ag

es

for

XM

L (

XP

ath

, X

Qu

ery

, X

SLT

…)

àE

ffici

en

t e

valu

atio

n u

sin

g fi

nite

-sta

te a

uto

ma

ta

àM

ap

pin

g X

ML

to d

ata

ba

ses

àA

dva

nce

d T

op

ics

(qu

ery

op

timiz

atio

ns,

a

cce

ss c

on

tro

l,u

pd

ate

la

ng

ua

ge

s…

)

Yo

u w

ill N

OT

lea

rn a

bo

ut

àH

ack

ing

CG

I scr

ipts

àH

TM

Java

Scr

ipt

… Yo

u N

EE

D t

o p

rog

ram

in

àJa

va o

C+

+

XM

L a

nd

Da

tab

ase

s

14

Ab

ou

t X

ML

àX

ML is t

he

Wo

rld

Wid

e W

eb

Co

nso

rtiu

m’s

(W

3C

, http://www.w3.org/

) E

xte

nsi

ble

Ma

rku

p L

an

gu

ag

e

àW

e h

op

e to

co

nvi

nce

yo

u th

at

XM

L is

no

t ye

t a

no

the

r h

ype

d T

LA

,b

ut

is u

sefu

l te

chn

olo

gy.

àY

ou

will

be

com

e b

est

frie

nd

s w

ith o

ne

of t

he

mo

st

imp

ort

an

t d

ata

str

uctu

res in

Co

mp

utin

g S

cie

nce

, th

e

tre

e.

X

ML

is a

ll a

bo

ut

tre

e-s

ha

pe

d d

ata

.

àY

ou

will

lea

rn to

ap

ply

a n

um

be

r o

f clo

sely

re

late

d

XM

L s

tan

da

rds

:

> R

ep

rese

ntin

g d

ata

:

XM

Lits

elf,

DT

D, X

ML

Sc

he

ma

, X

ML

dia

lect

s>

In

terf

ace

s to

co

nn

ect

PL

s to

XM

L:

D

OM

, S

AX

> L

an

gu

ag

es

to q

ue

ry/tra

nsf

orm

XM

L:

X

Pa

th,

XQ

ue

ry, X

SLT

.

15

Ab

ou

t X

ML

We

will

talk

ab

ou

t a

lgo

rith

ms a

nd

pro

gra

mm

ing

te

ch

niq

ue

sto

effi

cie

ntly

ma

nip

ula

te X

ML

da

ta:

àR

eg

ula

r e

xp

res

sio

ns

can

be

use

d to

va

lid

ate

XM

L d

ata

àF

init

e s

tate

au

tom

ata

lie a

t th

e h

ea

rt o

f hig

hly

effi

cie

nt

XP

ath

im

ple

me

nta

tio

ns

àTre

e t

rav

ers

als

ma

y b

e u

sed

to p

rep

roce

ss X

ML

tre

es

in o

rde

r to

sup

po

rt X

Pa

th e

va

lua

tio

n,

to s

tore

XM

L t

ree

sin

da

tab

ase

s, e

tc.

In th

e e

nd

, yo

u s

ho

uld

be

ab

le to

dig

est

th

e th

ick

pile

of

rela

ted

W3

CX

__

_ s

tan

da

rds.

(lik

e, X

Qu

ery

, X

Po

inte

r, X

Lin

k, X

HT

ML

, X

Incl

ud

e, X

ML B

ase

, X

ML S

ch

em

a,

16

Co

urs

e O

rga

niz

atio

n

Le

ctu

re

Tu

esd

ay,

15

:00

–1

8:0

0C

he

mic

alS

c M

17

(e

x A

pp

lied

Sc)

(K

-F1

0-M

17

)

Le

ctu

rer

Se

ba

stia

n M

an

eth

Co

nsu

lt

F

rid

ay,

11

:00

-12

:00

(E

50

8,

L5

) A

ll e

ma

il to

cs

43

17

@cs

e.u

nsw

.ed

u.a

u

Tu

tori

al

Tu

esd

ay,

12

:00

-14

:00

@ Q

ua

dra

ng

le G

04

0 (

K-E

15

-G0

40

) -

-b

efo

rele

ctu

reW

ed

ne

sda

y, 1

2:0

0-1

4:0

0 @

Qu

ad

ran

gle

G0

22

(K

-E1

5-G

02

2)

We

dn

esd

ay,

14

:00

-16

:00

@ Q

ua

dra

ng

le G

02

2 (

K-E

15

-G0

22

) T

hu

rsd

ay,

14

:00

-16

:00

@ Q

ua

dra

ng

le G

04

0 (

K-E

15

-G0

40

) T

hu

rsd

ay,

16

:00

-18

:00

@ Q

ua

dra

ng

le G

02

2 (

K-E

15

-G0

22

)

Tu

tors

?,

Kim

Ng

uye

n

All

em

ail

to

cs4

31

7@

cse

.un

sw.e

du

.au

17

Co

urs

e O

rga

niz

atio

n

Le

ctu

re

Tu

esd

ay,

15

:00

–1

8:0

0C

he

mic

alS

c M

17

(e

x A

pp

lied

Sc)

(K

-F1

0-M

17

)

Le

ctu

rer

Se

ba

stia

n M

an

eth

Co

nsu

lt

F

rid

ay,

11

:00

-12

:00

(E

50

8,

L5

) A

ll e

ma

il to

cs

43

17

@cs

e.u

nsw

.ed

u.a

u

Bo

ok

No

ne

!

Co

urs

e s

lide

s o

f M

arc

Sch

oll,

Un

i Ko

nst

an

zh

ttp

://w

ww

.inf.

un

i-ko

nst

an

z.d

e/d

bis

/te

ach

ing

/ws0

50

6/d

ata

ba

se-x

ml/X

ML

DB

.pd

f

Th

eo

ry /

PL

orie

nte

d, b

oo

k d

raft

:h

ttp

://a

rbre

.is.s

.u-t

oky

o.a

c.jp

/~h

ah

oso

ya/x

mlb

oo

k/

Su

gg

este

d r

ea

din

g m

ate

ria

l:

18

Co

urs

e O

rga

niz

atio

n

Le

ctu

re

Tu

esd

ay,

15

:00

–1

8:0

0C

he

mic

alS

c M

17

(e

x A

pp

lied

Sc)

(K

-F1

0-M

17

)

Le

ctu

rer

Se

ba

stia

n M

an

eth

Co

nsu

lt

F

rid

ay,

11

:00

-12

:00

(E

50

8,

L5

) A

ll e

ma

il to

cs

43

17

@cs

e.u

nsw

.ed

u.a

u

Pro

gra

mm

ing

As

sig

nm

en

ts

5 a

ssig

nm

en

ts, d

ue

eve

ry o

the

r M

on

da

y. (

1st

is d

ue

1

5th

M

arc

h2

nd

is d

ue

29

thM

arc

h…

)

Pe

r a

ssig

nm

en

t:

10

po

ints

(to

tal:

60

po

ints

) (+

2 b

on

us

po

ints

)

Fin

al

Ex

am

:

Fin

al e

xam

:

40

po

ints

(m

ust

ge

t 1

6/4

0to

pa

ss,

tha

t is

40

%)

Page 4: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

19

Ou

tlin

e -

Ass

ign

me

nts

Yo

u c

an

fre

ely

ch

oo

se to

pro

gra

m y

ou

r a

ssig

nm

en

ts in

àC

/ C

++

, o

Java

Ho

we

ver,

yo

ur

cod

e

mu

st

co

mp

ile

wit

h

gc

c/

g+

+, ja

va

c,

as

inst

alle

d o

n C

SE

lin

ux

syst

em

s!

Su

bm

it co

de

(u

sin

g give

) b

y M

on

da

y 2

3:5

9 (

eve

ry o

the

r w

ee

k)

Ass

ign

me

nt 4

(ha

rde

r) g

ets

fou

r w

ee

ks /

cou

nts

do

ub

le (

20

Po

ints

)d

ue

da

te 1

7th

Ma

y

20

Ou

tlin

e -

Ass

ign

me

nts

1.

Re

ad

XM

L,

usi

ng

DO

M p

ars

er.

Cre

ate

do

cum

en

t sta

tistic

s

2.

SA

X P

ars

e in

to m

em

ory

str

uct

ure

: Tre

e v

s D

AG

3.

Ma

p X

ML

into

RD

BM

S

4.

XP

ath

eva

lua

tion

ove

r m

ain

me

mo

ry s

tru

ctu

res

(+

str

ea

min

g su

pp

ort

)

5.

XP

ath

into

SQ

L T

ran

sla

tion

13

da

ys

2 w

ee

ks

2 w

ee

ks(+

1 w

ee

k b

rea

k)4

we

eks

2 w

ee

ks

21

Ou

tlin

e -

Le

ctu

res

1.

Intr

od

uct

ion

to X

ML

, E

nco

din

gs,

Pa

rse

rs2

.M

em

ory

Re

pre

sen

tatio

ns

for

XM

L: S

pa

ce v

sA

cce

ss S

pe

ed

3.

RD

BM

S R

ep

rese

nta

tion

of

XM

L4

.D

TD

s, S

che

ma

s, R

eg

ula

r E

xpre

ssio

ns,

Am

big

uity

5.

No

de

Se

lect

ing

Qu

erie

s: X

Pa

th6

.E

ffici

en

t X

Pa

th E

valu

atio

n7

.X

Pa

th P

rop

ert

ies:

ba

ckw

ard

axe

s, c

on

tain

me

nt t

est

8.

Str

ea

min

g E

valu

atio

n: h

ow

mu

ch m

em

ory

do

yo

u n

ee

d?

9.

XP

ath

Eva

lua

tion

usi

ng

RD

BM

S1

0.

Pro

pe

rtie

s o

f XP

ath

11

.X

SLT

12

.X

Qu

ery

13

.W

rap

up

, Exa

m P

rep

ara

tion

, Op

en

qu

est

ion

s, e

tc

22

Le

ctu

re 1

XM

L I

ntr

od

uct

ion

Le

ctu

re 1

XM

L I

ntr

od

uct

ion

23

Ou

tlin

e

1.

Th

ree

m

oti

va

tio

ns

fo

r X

ML

(1)

re

ligio

us

(2)

pra

ctic

al

(3)

th

eo

retic

al /

ma

the

ma

tica

l

2.

We

ll-f

orm

ed

XM

L

3.

Ch

ara

cte

r E

nc

od

ing

s

4.

Pa

rse

rs f

or

XM

L

àp

ars

ing

into

DO

M

(Do

cum

en

t Ob

ject

Mo

de

l)

24

XM

L I

ntr

od

uct

ion

Relig

iou

sm

otiv

atio

n f

or

XM

L:

to h

ave

one

langua

ge

to s

peak

about

data

.

Page 5: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

25

26

àX

ML

is a

Data

Exch

an

ge F

orm

at

1974

SG

ML

(C

harles

Gold

farb

at I

BM

Rese

arc

h)

1989

HT

ML

(T

im B

ern

ers

-Lee a

t CE

RN

/Geneva

)

1994

Bern

ers

-Lee fo

unds

Web C

onso

rtiu

m (

W3C

)

1996

XM

L(W

3C

dra

ft, v

1.0

in 1

998)

XM

L M

otiv

atio

n

(h

isto

rica

l)

27

àX

ML

is a

Data

Exch

an

ge F

orm

at

1974

SG

ML

(C

harles

Gold

farb

at I

BM

Rese

arc

h)

1989

HT

ML

(T

im B

ern

ers

-Lee a

t CE

RN

/Geneva

)

1994

Bern

ers

-Lee fo

unds

Web C

onso

rtiu

m (

W3C

)

1996

XM

L(W

3C

dra

ft, v

1.0

in 1

998)

XM

L M

otiv

atio

n

(h

isto

rica

l)

http://www.w3.org/TR/REC-xml/

28

(2)

Pra

ctic

al

XM

L=

da

ta

Text

file

Tom Henzinger

EPFL

[email protected]

… Helmut Seidl

TU Munich

[email protected]

29

Tom Henzinger

EPFL

[email protected]

… Helmut Seidl

TU Munich

[email protected]

Text

file

<Related>

<colleague>

<name>Tom Henzinger</name>

<affil>EPFL</affil>

<email>

[email protected]

</email>

</colleague>

… <friend>

<name>Helmut Seidl</name>

<affil>TU Munich</affil>

<email>[email protected]

</email>

</friend>

</Related>

XM

L d

ocu

men

t

“mark

it

up!”

(2)

Pra

ctic

al

XM

L=

da

ta

+

str

uc

ture

30

Tom Henzinger

EPFL

[email protected]

… Helmut Seidl

TU Munich

[email protected]

Text

file

<Related>

<colleague>

<name>Tom Henzinger</name>

<affil>EPFL</affil>

<email>

[email protected]

</email></colleague>

</colleague>

… <friend>

<name>Helmut Seidl</name>

<affil>TU Munich</affil>

<email>[email protected]

</email>

</friend>

</Related>

XM

L d

ocu

men

t

“mark

it

up!”

(2)

Pra

ctic

al

XM

L=

da

ta

+

str

uc

ture

Is th

is a

go

od

“te

mp

late

”??

Wh

at

ab

ou

t la

st/

firs

t n

am

e?

Se

ve

ral a

ffil’

s / e

ma

il’s…

?

Page 6: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

31

XM

L D

ocu

me

nts

àO

rdin

ary

text

file

s (

UT

F-8

, U

TF

-16

, U

CS

-4 …

)

àO

rig

ina

tes

fro

m t

ype

sett

ing

/Do

cPro

cess

ing

co

mm

un

ity

àId

ea

of

lab

ele

d b

racke

ts (

“ma

rk u

p”)

fo

r str

uctu

re is n

ot

ne

w!

(alre

ad

y u

se

d b

y C

ho

msky in

th

e 1

96

0’s

)

àB

rack

ets

de

scrib

e a

tre

e s

tru

ctu

re

àA

llow

s a

pp

lica

tion

s fr

om

diff

ere

nt v

en

do

rs to

exc

ha

ng

e d

ata

!

ès

tan

da

rdiz

ed

, e

xtr

em

ely

wid

ely

ac

ce

pte

d!

32

XM

L D

ocu

me

nts

àO

rdin

ary

text

file

s (

UT

F-8

, U

TF

-16

, U

CS

-4 …

)

àO

rig

ina

tes

fro

m t

ype

sett

ing

/Do

cPro

cess

ing

co

mm

un

ity

àId

ea

of

lab

ele

d b

racke

ts (

“ma

rk u

p”)

fo

r str

uctu

re is n

ot

ne

w!

(alre

ad

y u

se

d b

y C

ho

msky in

th

e 1

96

0’s

)

àB

rack

ets

de

scrib

e a

tre

e s

tru

ctu

re

àA

llow

s a

pp

lica

tion

s fr

om

diff

ere

nt v

en

do

rs to

exc

ha

ng

e d

ata

!

ès

tan

da

rdiz

ed

, e

xtr

em

ely

wid

ely

ac

ce

pte

d!

So

cia

l Im

plic

atio

ns!

All

scie

nce

s (b

iolo

gy,

ge

og

rap

hy,

me

teo

rolo

gy,

astr

olo

gy…

)

ha

ve

ow

n X

ML “

dia

lects

” to

sto

re t

he

ird

ata

op

tima

lly

33

XM

L D

ocu

me

nts

àO

rdin

ary

text

file

s (

UT

F-8

, U

TF

-16

, U

CS

-4 …

)

àO

rig

ina

tes

fro

m t

ype

sett

ing

/Do

cPro

cess

ing

co

mm

un

ity

àId

ea

of

lab

ele

d b

racke

ts (

“ma

rk u

p”)

fo

r str

uctu

re is n

ot

ne

w!

(alre

ad

y u

se

d b

y C

ho

msky in

th

e 1

96

0’s

)

àB

rack

ets

de

scrib

e a

tre

e s

tru

ctu

re

àA

llow

s a

pp

lica

tion

s fr

om

diff

ere

nt v

en

do

rs to

exc

ha

ng

e d

ata

!

ès

tan

da

rdiz

ed

, e

xtr

em

ely

wid

ely

ac

ce

pte

d!

Pro

ble

mh

igh

ly v

erb

ose

, lo

ts o

f re

pe

titiv

e m

ark

up

, la

rge

file

s

34

XM

L D

ocu

me

nts

àO

rdin

ary

text

file

s (

UT

F-8

, U

TF

-16

, U

CS

-4 …

)

àO

rig

ina

tes

fro

m t

ype

sett

ing

/Do

cPro

cess

ing

co

mm

un

ity

àId

ea

of

lab

ele

d b

racke

ts (

“ma

rk u

p”)

fo

r str

uctu

re is n

ot

ne

w!

(alre

ad

y u

se

d b

y C

ho

msky in

th

e 1

96

0’s

)

àB

rack

ets

de

scrib

e a

tre

e s

tru

ctu

re

àA

llow

s a

pp

lica

tion

s fr

om

diff

ere

nt v

en

do

rs to

exc

ha

ng

e d

ata

!

ès

tan

da

rdiz

ed

, e

xtr

em

ely

wid

ely

ac

ce

pte

d!

Co

ntr

a..

hig

hly

ve

rbo

se, l

ots

of

rep

etit

ive

ma

rku

p,

larg

e fi

les

Pro

..w

e h

ave

a s

tan

da

rd!

A S

tan

da

rd!

A S

TAN

DA

RD

! à

JY

ou

ne

ve

r n

ee

d to

write

a p

ars

er

ag

ain

! U

se

XM

L!

J

35

XM

L D

ocu

me

nts

… in

ste

ad

of

writin

g a

pa

rse

r, y

ou

sim

ply

fix

yo

ur

ow

n “

XM

L d

iale

ct”

,

by d

escri

bin

g a

ll “a

dm

issib

le te

mp

late

s”

(+

ma

yb

e e

ve

n th

e s

pe

cific

da

ta ty

pe

s th

at m

ay

ap

pe

ar

insi

de

).

Yo

u d

o t

his

, u

sin

g a

n X

ML T

yp

e d

efin

itio

n la

ng

ua

ge

such

a

s D

TD

o

r R

ela

x N

G (

Oa

sis)

.

Of c

ou

rse

, su

ch ty

pe

de

finiti

on

lan

gu

ag

es

are

SIM

PL

E,

be

cau

se y

ou

wa

nt t

he

pa

rse

rs to

be

effi

cie

nt!

Th

ey

are

sim

ilar

to E

BN

F. à

con

text

-fre

e g

ram

ma

r w

ith

re

g.

exp

r’s in

the

rig

ht-

ha

nd

sid

es.

J

36

XM

L D

ocu

me

nts

Ele

me

nt n

am

es

an

d th

eir c

on

ten

t

Exa

mp

le D

TD

(Do

cum

en

t Typ

e D

esc

rip

tion

)

Related à

(colleague | friend | family)*

colleague à

(name,affil*,email*)

friend à

(name,affil*,email*)

family à

(name,affil*,email*)

name à

(#PCDATA)

Page 7: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

37

XM

L D

ocu

me

nts

Re

late

d

frie

nd

c

olle

ag

ue

fa

mily

na

me

affi

l e

ma

iln

am

e e

ma

il e

ma

il

Vic

tor

..

Ele

me

nt n

am

es

an

d th

eir c

on

ten

t

Exa

mp

le D

TD

(D

ocu

me

nt T

ype

De

scri

ptio

n)

Related à

(colleague | friend | family)*

colleague à

(name,affil*,email*)

friend à

(name,affil*,email*)

family à

(name,affil*,email*)

name à

(#PCDATA)

38

XM

L D

ocu

me

nts

Exa

mp

le D

TD

Related à

(colleague | friend | family)*

colleague à

(name,affil*,email*)

friend à

(name,affil*,email*)

family à

(name,affil*,email*)

name à

(#PCDATA)

Re

late

d

frie

nd

c

olle

ag

ue

fa

mily

na

me

affi

l e

ma

iln

am

e e

ma

il e

ma

il

Vic

tor

..

Ele

me

nt n

am

es

an

d th

eir c

on

ten

t

frie

nd

c

olle

ag

ue

fa

mily

“Ele

me

nt n

od

e”

39

XM

L D

ocu

me

nts

Re

late

d

frie

nd

c

olle

ag

ue

fa

mily

na

me

affi

l e

ma

il

Vic

tor

..

Ele

me

nt n

am

es

an

d th

eir c

on

ten

t

frie

nd

c

olle

ag

ue

fa

mily

“Ele

me

nt n

od

e”

Vic

tor

..

“Te

xt n

od

e”

Exa

mp

le D

TD

Related à

(colleague | friend | family)*

colleague à

(name,affil*,email*)

friend à

(name,affil*,email*)

family à

(name,affil*,email*)

name à

(#PCDATA)

na

me

em

ail

em

ail

40

XM

L D

ocu

me

nts

Re

late

d

frie

nd

c

olle

ag

ue

fa

mily

na

me

affi

l e

ma

il

Vic

tor

..

Ele

me

nt n

am

es

an

d th

eir c

on

ten

t

frie

nd

c

olle

ag

ue

fa

mily

“Ele

me

nt n

od

e”

Vic

tor

..

“Te

xt n

od

e”

Te

rmin

olo

gy

do

cum

en

t is

va

lid

wrt

th

e D

TD

“It v

ali

da

tes”

Te

rmin

olo

gy

do

cum

en

t is

va

lid

wrt

th

e D

TD

“It v

ali

da

tes”

Te

rmin

olo

gy

do

cum

en

t is

Exa

mp

le D

TD

Related à

(colleague | friend | family)*

colleague à

(name,affil*,email*)

friend à

(name,affil*,email*)

family à

(name,affil*,email*)

name à

(#PCDATA)

na

me

em

ail

em

ail

41

XM

L D

ocu

me

nts

Wh

at e

lse

: (

be

sid

es

ele

me

nt

an

d t

ext

no

de

s)

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

com

me

nts

àn

am

esp

ace

en

tity

refe

ren

ces

(tw

o k

ind

s)

42

XM

L D

ocu

me

nts

Wh

at e

lse

: (

be

sid

es

ele

me

nt

an

d t

ext

no

de

s)

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

s

àco

mm

en

tsà

na

me

spa

ces

àe

ntit

y re

fere

nce

s (t

wo

kin

ds)

<fa

mily

re

l=“b

roth

er”

,ag

e=

“25

”><

na

me

>… <

/fam

ily>

Page 8: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

43

XM

L D

ocu

me

nts

Wh

at e

lse

:

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

s

àco

mm

en

tsà

na

me

spa

ces

àe

ntit

y re

fere

nce

s (t

wo

kin

ds)

<fa

mily

re

l=“b

roth

er”

,ag

e=

“25

”><

na

me

>… <

/fam

ily>

<?

ph

p s

ql (“

SE

LE

CT

* F

RO

M …

”) …

?>

Se

e2

.6 P

roce

ssin

g In

stru

ctio

ns

44

XM

L D

ocu

me

nts

Wh

at e

lse

:

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

s

àco

mm

en

ts

<!--

some comment -->

àn

am

esp

ace

en

tity

refe

ren

ces

(tw

o k

ind

s)

<fa

mily

re

l=“b

roth

er”

,ag

e=

“25

”><

na

me

>… <

/fam

ily>

<?

ph

p s

ql (“

SE

LE

CT

* F

RO

M …

”) …

?>

Se

e2

.6 P

roce

ssin

g In

stru

ctio

ns

45

XM

L D

ocu

me

nts

Wh

at e

lse

:

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

s

àco

mm

en

ts

<!--

some comment -->

àn

am

esp

ace

en

tity

refe

ren

ces

(tw

o k

ind

s)

<fa

mily

re

l=“b

roth

er”

,ag

e=

“25

”><

na

me

>… <

/fam

ily>

<?

ph

p s

ql (“

SE

LE

CT

* F

RO

M …

”) …

?>

Se

e2

.6 P

roce

ssin

g In

stru

ctio

ns

<!-

-th

e 'p

rice

' ele

me

nt's

na

me

spa

ce is

htt

p://

eco

mm

erc

e.o

rg/s

che

ma

-->

<

ed

i:price

xm

lns:

ed

i='h

ttp:/

/eco

mm

erc

e.o

rg/s

che

ma

' un

its=

'Eu

ro'>

32

.18

</e

di:p

rice

> 46

XM

L D

ocu

me

nts

Wh

at e

lse

:

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

s

àco

mm

en

ts

<!--

some comment -->

àn

am

esp

ace

en

tity

refe

ren

ces

(tw

o k

ind

s)

<fa

mily

re

l=“b

roth

er”

,ag

e=

“25

”><

na

me

>… <

/fam

ily>

<?

ph

p s

ql (“

SE

LE

CT

* F

RO

M …

”) …

?>

Se

e2

.6 P

roce

ssin

g In

stru

ctio

ns

<!-

-th

e 'p

rice

' ele

me

nt's

na

me

spa

ce is

htt

p://

eco

mm

erc

e.o

rg/s

che

ma

-->

<

ed

i:price

xm

lns:

ed

i='h

ttp:/

/eco

mm

erc

e.o

rg/s

che

ma

' un

its=

'Eu

ro'>

32

.18

</e

di:p

rice

>

47

XM

L D

ocu

me

nts

Wh

at e

lse

:

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

s

àco

mm

en

ts

<!--

some comment -->

àn

am

esp

ace

en

tity

refe

ren

ce

s(t

wo

kin

ds)

<fa

mily

re

l=“b

roth

er”

,ag

e=

“25

”><

na

me

>… <

/fam

ily>

<?

ph

p s

ql (“

SE

LE

CT

* F

RO

M …

”) …

?>

Se

e2

.6 P

roce

ssin

g In

stru

ctio

ns

<!-

-th

e 'p

rice

' ele

me

nt's

na

me

spa

ce is

htt

p://

eco

mm

erc

e.o

rg/s

che

ma

-->

<

ed

i:price

xm

lns:

ed

i='h

ttp:/

/eco

mm

erc

e.o

rg/s

che

ma

' un

its=

'Eu

ro'>

32

.18

</e

di:p

rice

>

ch

ara

cte

r re

fere

nce

Typ

e <

key>

less

-th

an

</k

ey>

(&

#x3

C;)

to

sa

ve o

ptio

ns.

48

XM

L D

ocu

me

nts

Wh

at e

lse

:

àa

ttrib

ute

pro

cess

ing

inst

ruct

ion

s

àco

mm

en

ts

<!--

some comment -->

àn

am

esp

ace

en

tity

refe

ren

ce

s(t

wo

kin

ds)

<fa

mily

re

l=“b

roth

er”

,ag

e=

“25

”><

na

me

>… <

/fam

ily>

<?

ph

p s

ql (“

SE

LE

CT

* F

RO

M …

”) …

?>

Se

e2

.6 P

roce

ssin

g In

stru

ctio

ns

<!-

-th

e 'p

rice

' ele

me

nt's

na

me

spa

ce is

htt

p://

eco

mm

erc

e.o

rg/s

che

ma

-->

<

ed

i:price

xm

lns:

ed

i='h

ttp:/

/eco

mm

erc

e.o

rg/s

che

ma

' un

its=

'Eu

ro'>

32

.18

</e

di:p

rice

>

ch

ara

cte

r re

fere

nce

Typ

e <

key>

less

-th

an

</k

ey>

(&

#x3

C;)

to

sa

ve o

ptio

ns.

Th

is d

ocu

me

nt w

as

pre

pa

red

on

&d

ocd

ate

;an

d

Page 9: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

49

Ea

rly

Ma

rku

p

Th

e te

rm m

ark

up

ha

s b

ee

n c

oin

ed

by

the

ty

pe

se

ttin

gco

mm

un

ity,

no

tb

y co

mp

ute

r sc

ien

tist.

With

the

ad

ven

t o

f prin

ting

pre

ss,

write

rs a

nd

ed

itors

use

d

(oft

en

ma

rgin

al)

no

tes

to in

stru

ct p

rin

ters

toà

Se

lect

ce

rta

in fo

nts

àL

et

pa

ssa

ge

s o

f te

xt s

tan

d o

ut

àIn

de

nt

a li

ne

of t

ext

, e

tc

Pro

ofr

ea

de

rs u

se a

sp

eci

al s

et

of

sym

bo

ls, t

he

ir s

pe

cia

l ma

rku

p

lan

gu

ag

e,

to id

en

tify

typ

os,

form

att

ing

glit

che

s, a

nd

sim

ilar

err

on

eo

us

fra

gm

en

ts o

f te

xt.

Th

e m

ark

up

lan

gu

ag

e is

de

sig

ne

d to

be

ea

sily

re

cog

niz

ab

le in

the

act

ua

l flo

w o

f te

xt.

50

Ea

rly

Ma

rku

p

Co

mp

ute

r sc

ien

tists

ad

op

ted

the

ma

rku

p id

ea

–o

rig

ina

lly to

a

nn

ota

te p

rog

ram

so

urc

e c

od

e:

àD

esi

gn

the

ma

rku

p la

ng

ua

ge

su

ch t

ha

t its

co

nst

ruct

s a

re

ea

sil

y r

ec

og

niz

ab

leb

y a

ma

chin

e.

àA

pp

roa

ch

es

(1)

Ma

rku

p is

wri

tte

n u

sin

g a

sp

ec

ial

se

t o

f c

ha

rac

ters

, dis

join

t fro

mth

e s

et o

f ch

ara

cte

rs t

ha

t fo

rm t

he

toke

ns

of

the

pro

gra

m(2

) M

ark

up

occ

urs

in p

lace

s in

th

e s

ou

rce

file

wh

ere

pro

gra

m c

od

e

ma

y n

ot

ap

pe

ar

(pro

gra

m l

ay

ou

t).

Exa

mp

le:

Fo

rtra

n 7

7 fi

xed

fo

rm s

ou

rce

:

àF

ort

ran

sta

tem

en

ts s

tart

in c

olu

mn

7 a

nd

do

no

t e

xce

ed

co

lum

n 7

2,

àa

Fo

rtra

n s

tate

me

nt

lon

ge

r th

an

66

ch

ars

ma

y b

e c

on

tinu

ed

on

th

e n

ext

lin

eIf

a c

ha

ract

er

no

t in

{ 0

,!,_

} is

pla

ce in

co

lum

n 6

of

the

co

ntin

uin

g li

ne

àco

mm

en

t lin

es s

tart

with

a “

C”

or

“*”

in c

olu

mn

1,

àN

um

eric

lab

els

(D

O,

FO

RM

AT

sta

tem

en

ts)

ha

ve to

be

pla

ced

in c

olu

mn

s 1

-5.

51

52

Sa

mp

le M

ark

up

Ap

plic

atio

n

A C

om

ic S

trip

Fin

de

r

àN

ext

8 s

lide

s fr

om

Ma

rc S

ch

oll’

s 2

00

5 le

ctu

re.

53

54

Page 10: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

55

56

57

58

59

60

Page 11: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

61

To

da

y,

XM

L h

as

ma

ny

frie

nd

s:

Qu

ery

Lan

gu

ag

es

XP

ath

, X

SLT,

XQ

uery

, …

(most

ly b

y W

3C

)

Imp

lem

en

tati

on

s(P

ars

ers

, Valid

ato

rs, T

ransl

ato

rs)

SA

X,

Xala

n, G

ala

x, X

erx

es, …

(by IB

M/A

pache, M

icro

soft

, O

racle

, S

un…

)

Cu

rren

t Is

su

es

-D

B/P

L s

upport

(

“data

bin

din

g”,

JB

ind,

Casto

r, Z

eus…

)

-st

ora

ge s

upport

(c

om

pre

ssio

n,

data

optim

izatio

n)

62

XM

L: t

ypic

al u

sag

e s

cen

ario

<Pro

duct>

<pro

duct_

id>

m1

01

</pro

duct_

id>

<nam

e>

Sony w

alk

man <

/nam

e>

<curr

ency>

AU

D <

/curr

ency>

<pri

ce>

20

0.0

0 <

/pri

ce>

<gst>

10

% <

/gst>

</Pro

duct>

X M L

XM

L

Sty

les

he

et

XM

L

Sty

les

he

et

Pre

sen

tatio

nF

orm

at i

nfo

XM

L S

tyle

sh

ee

t

Pre

sen

tatio

nF

orm

at i

nfo

XM

L S

tyle

sh

ee

t

On

e d

ata

so

urc

seve

ral

(dyn

am

ic g

en

.) v

iew

s

Do

cum

en

t str

uct

ure

De

f. o

f p

rice

, g

st,

DT

D,

XM

L S

che

ma

63

XM

L: w

he

re is

it u

sed

?

-D

ocu

me

nt

form

ats

:S

VG

(ve

cto

r im

ag

es)

, Op

en

Do

cum

en

tFo

rma

t (O

pe

nO

ffice

, Go

og

leD

ocs

),D

ocB

oo

k(T

ext

fo

rma

ttin

g),

EP

UB

(e

lect

ron

ic b

oo

ks),

XH

TM

L (

we

b),

-P

roto

col f

orm

ats

:S

OA

P (

We

bS

erv

ice

s), X

MP

P (

Jab

be

r, G

oo

gle

Ta

lk),

AJA

X (

cust

om

XM

L

me

ssa

ge

s u

sed

fo

r in

tera

ctiv

e w

eb

site

s: F

ace

bo

ok,

Gm

ail,

Tw

ee

ter,

…),

RS

S f

ee

ds (

use

d fo

r b

log

s/n

ew

s s

ite

s),

-C

ust

om

da

ta f

orm

at:

Bio

-in

form

atic

s, L

ing

uis

tics,

Co

nfig

ura

tion

file

s, G

eo

gra

ph

ic d

ata

(O

pe

nS

tre

etM

ap

, Go

og

le m

ap

s),

Bib

liog

rap

hic

Re

sou

rce

s (M

ed

Lin

e, A

DC

)

64

Reg

ula

r T

ree l

an

gu

ag

es

(R

EG

T).

Many

chara

cteriza

tions:

R

eg. Tre

e G

ram

mars

Tre

e A

uto

mata

MS

O L

ogic

Nic

e p

ropert

ies:

Clo

sed u

nder

inte

rsection (u

nio

n, com

ple

ment)

Decid

able

equiv

ale

nce

CF

RE

GT

(as

bra

cke

ted

str

ing

s)

anb

nc* a

* bncn

(3)

Th

eo

retic

al /

Ma

the

ma

tica

l Co

nte

xt-F

ree

:

E ::=E ‘+’ E

E ::= E ‘x’ E

E ::= [1-9][0-9]*

65

2.

We

ll-F

orm

ed

XM

L

http://www.w3.org/TR/REC-xml/

Fro

m th

e W

3C

XM

L re

com

me

nd

atio

n

“A t

extu

al o

bje

ct

is w

ell-f

orm

ed

XM

Lif,

(1)

take

n a

s a

wh

ole

, it m

atc

he

s t

he

pro

du

cti

on

la

be

led

do

cu

me

nt

(2)

it m

ee

ts a

ll th

e w

ell-f

orm

ed

ne

ss

co

ns

tra

ints

giv

en

in

th

is s

pe

cific

atio

n ..”

http://www.w3.org/TR/REC-xml/

do

cu

me

nt=

sta

rt s

ymb

ol o

f a

co

nte

xt-f

ree

gra

mm

ar

(“

XM

L g

ram

ma

r”)

à(1

) co

nta

ins

the

co

nte

x-f

ree

pro

pe

rtie

so

f w

ell-

form

ed

XM

(2)

con

tain

s th

e

co

nte

xt-

de

pe

nd

en

t p

rop

ert

ies

of

we

ll-fo

rme

d X

ML

Th

ere

are

10

WF

Cs

(we

ll-fo

rme

dn

ess

co

nst

rain

ts).

E.g

.: E

lem

en

t Ty

pe

Ma

tch

“Th

e N

am

e in

an

ele

me

nt’s e

nd

ta

g m

ust

ma

tch

the

ele

me

nt n

am

e in

th

e s

tart

ta

g.”

66

2.

We

ll-F

orm

ed

XM

L

http://www.w3.org/TR/REC-xml/

Fro

m th

e W

3C

XM

L re

com

me

nd

atio

n

“A t

extu

al o

bje

ct

is w

ell-f

orm

ed

XM

Lif,

(1)

take

n a

s a

wh

ole

, it m

atc

he

s t

he

pro

du

cti

on

la

be

led

do

cu

me

nt

(2)

it m

ee

ts a

ll th

e w

ell-f

orm

ed

ne

ss

co

ns

tra

ints

giv

en

in

th

is s

pe

cific

atio

n ..”

http://www.w3.org/TR/REC-xml/

do

cu

me

nt=

sta

rt s

ymb

ol o

f a

co

nte

xt-f

ree

gra

mm

ar

(“

XM

L g

ram

ma

r”)

à(1

) co

nta

ins

the

co

nte

x-f

ree

pro

pe

rtie

so

f w

ell-

form

ed

XM

(2)

con

tain

s th

e

co

nte

xt-

de

pe

nd

en

t p

rop

ert

ies

of

we

ll-fo

rme

d X

ML

Th

ere

are

10

WF

Cs

(we

ll-fo

rme

dn

ess

co

nst

rain

ts).

E.g

.: E

lem

en

t Ty

pe

Ma

tch

“Th

e N

am

e in

an

ele

me

nt’s e

nd

ta

g m

ust

ma

tch

the

ele

me

nt n

am

e in

th

e s

tart

ta

g.”

àW

hy

is t

his

no

tco

nte

xt-f

ree

?

Page 12: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

67

2.

We

ll-F

orm

ed

XM

LC

on

text

-fre

e g

ram

ma

r in

EB

NF

= S

yste

m o

f p

rod

uct

ion

ru

les

o

f th

e fo

rm

lhs ::= rhs

lhs

a n

on

term

ina

l sym

bo

l (e

.g,

do

cu

me

nt)

rhs

a s

trin

g o

ver

no

nte

rmin

al a

nd

te

rmin

al s

ymb

ols

.

Ad

diti

on

ally

(E

BN

F),

we

ma

y u

se r

eg

ula

r e

xpre

ssio

ns

in rhs

. S

uch

as:

r* denoting , r, rr, rrr, …

zero

or

mo

re r

ep

ititio

ns

r+ denoting rr*

on

e o

r m

ore

re

piti

tion

sr? denoting r |

op

tion

al r

[abc] denoting a | b | c

cha

ract

er

cla

ss[a-z] denoting a | b | … | z

cha

ract

er

cla

ss

lhs ::= rhs

68

XM

L G

ram

ma

r -

EB

NF

-sty

le[1]

document

::=

prologelementMisc*

[2]

Char

::=

a Unicode character

[3]

S::=

(‘ ’ | ‘\t’ | ‘\n’ | ‘\r’)+

[4] NameChar ::= (Letter | Digit | ‘.’ | ‘-’ | ‘:’

[5]

Name

::=

(Letter| '_' | ':') (NameChar)*

[22]

prolog

::=

XMLDecl? Misc* (doctypedeclMisc*)?

[23]

XMLDecl

::=

'<?xml' VersionInfoEncodingDecl? SDDecl? S? '?>‘

[24]VersionInfo

::=

S'version'Eq("'"VersionNum"'"|'"'VersionNum'"')

[25]

Eq

::=

S? '=' S?

[26]VersionNum

::=

'1.0‘

[39]

element

::=

EmptyElemTag

| STagcontentEtag

[40]

STag

::=

'<' Name(SAttribute)* S? '>'

[41]Attribute

::=

NameEqAttValue

[42]

ETag

::=

'</' NameS? '>‘

[43]

content

::=

(element| Reference| CharData?)*

[44]EmptyElemTag::=

'<' Name(SAttribute)* S? '/>‘

[67]Reference

::=

EntityRef| CharRef

[68]EntityRef

::=

'&' Name';‘

[84] Letter ::= [a-zA-Z]

[88] Digit ::= [0-9]

69

XM

L G

ram

ma

r -

EB

NF

-sty

le

As

usu

al,

the

XM

L g

ram

ma

r ca

n b

e s

yste

ma

tica

lly tr

an

sfo

rme

d in

to

a p

rog

ram

, an

XM

L p

ars

er,

to b

e u

sed

to c

he

ck th

e s

ynta

x o

f XM

L in

pu

t

Pa

rsin

g X

ML

1.

Sta

rtin

g w

ith t

he

sym

bo

l document

, th

e p

ars

er

use

s th

e lhs::=rhs

rule

s to

exp

an

d s

ymb

ols

, co

nst

ruct

ing

a p

ars

e t

ree

.

2.

Le

ave

s o

f th

e p

ars

e tr

ee

are

ch

ara

cte

rs w

hic

h h

ave

no

fu

rth

er

exp

an

sio

n

3.

Th

e X

ML

inp

ut i

s p

ars

ed

su

cce

ssfu

lly if

it p

erf

ect

ly m

atc

he

s th

e

pa

rse

tre

e’s

fro

nt

(co

nca

ten

ate

th

e p

ars

e tre

e’s

le

ave

s fro

m le

ft-t

o-r

igh

t,w

hile

re

mo

vin

g

sym

bo

ls).

70

Ex

am

ple

1

Pa

rse

tre

e f

or

XM

L in

pu

t

<bubble speaker=“phb”>Um... No.</bubble>

71

Ex

am

ple

2

Pa

rse

tre

e f

or

XM

L in

pu

t

<?xml version=“1.0”?><foo/>

72

2.

We

ll-F

orm

ed

XM

L

Te

rmin

olo

gy

•ta

g n

am

es

name,email,author

, …

•s

tart

ta

g<name>

,

en

d t

ag</name>

•e

lem

en

ts<name> … </name>

, <author age=“99”> … </author>

•e

lem

en

ts m

ay

be

ne

ste

d

•e

mp

ty e

lem

en

t<red></red>

is a

bb

revi

ate

d a

s <red/>

•a

n X

ML

do

cum

en

t: s

ing

le r

oo

t e

lem

en

t

<someTag> …. </someTag>

•w

ell-f

orm

ed

co

ns

tra

ints

àb

eg

in/e

nd

tag

s m

atc

no

att

rib

ute

na

me

ma

y a

pp

ea

r m

ore

th

an

on

cein

a s

tart

ta

g o

r e

mp

ty e

lem

en

t ta

a p

ars

ed

en

tity

mu

st n

ot

con

tain

a r

ecu

rsiv

ere

fere

nce

to it

self,

eith

er

dire

ctly

or

ind

ire

ctly

co

nte

xt-

de

pe

nd

en

t

pro

pe

rtie

s

Page 13: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

73

We

ll-fo

rme

dX

ML

(fra

gm

en

ts)

<Staff>

<Name>

<FirstName> Sebastian </FirstName>

<LastName> Maneth </LastName>

</Name>

<Login> smaneth </Login>

<Ext> 2481 </Ext>

</Staff>

a Staff

ele

me

nt

<Name>

<FirstName> Sebastian </FirstName>

<LastName> Maneth </LastName>

</Name>

a Name

ele

me

nt

No

n-w

ell-

form

ed

XM

L

<foo> oops </bar>

<foo> oops </Foo>

<foo> oops .. <EOT>

<a><b><c></a></b></c>

74

We

ll-fo

rme

dX

ML

(fra

gm

en

ts)

<Staff>

<Name>

<FirstName> Sebastian </FirstName>

<LastName> Maneth </LastName>

</Name>

<Login> smaneth </Login>

<Ext> 2481 </Ext>

</Staff>

a Staff

ele

me

nt

<Name>

<FirstName> Sebastian </FirstName>

<LastName> Maneth </LastName>

</Name>

a Name

ele

me

nt

No

n-w

ell-

form

ed

XM

L

<foo> oops </bar>

<foo> oops </Foo>

<foo> oops .. <EOT>

<a><b><c></a></b></c>

Qu

es

tio

ns

Ho

wca

n y

ou

imp

lem

en

tth

e th

ree

we

ll-fo

rme

d c

on

stra

ints

?

Wh

en

, d

urin

g p

ars

ing

, d

o y

ou

ap

ply

the

ch

eck

s?

75

Ch

ara

cte

r E

nco

din

g

•F

or

a c

om

pu

ter, a

ch

ara

cte

r lik

e X

is n

oth

ing

bu

t a

n 8

(1

6/3

2)

bit

nu

mb

er

wh

ose

va

lue

is

inte

rpre

ted

as

the

ch

ara

cte

r X

, w

he

n n

ee

de

d.

•P

rob

lem

: m

an

y su

ch

nu

mb

er à

cha

ract

er

ma

pp

ing

s, th

e s

o c

alle

de

nc

od

ing

sa

re in

use

tod

ay.

•D

ue

to

the

hu

ge

am

ou

nt o

f ch

ara

cte

rs n

ee

de

d b

y th

e g

lob

al c

om

pu

ting

co

mm

un

ity to

da

y (

La

tin

, H

eb

rew

, A

rab

ic, G

ree

k,

Ja

pa

ne

se

, C

hin

ese

…),

co

nfl

icti

ng

in

ters

ec

tio

ns

be

twe

en

en

cod

ing

s a

re c

om

mo

n.

Ex

am

ple

0xcb 0xe4 0xd3

iso

-88

59

-7Æ

0xcb 0xe4 0xd3

iso

-88

59

-15

Ë

ä Ó

76

Un

ico

de

•T

he

Un

ico

de

http://www.unicode.org

Initi

ativ

e a

ims

to d

efin

e a

ne

w

en

cod

ing

tha

t tr

ies

to e

mb

race

all

cha

ract

er

ne

ed

s.

•T

he

Un

ico

de

en

co

din

g c

on

tain

s c

ha

racte

rs o

f “a

ll” la

ng

ua

ge

s o

f th

e w

orl

d

plu

s s

cie

ntific,

ma

the

ma

tica

l, te

ch

nic

al, b

ox d

raw

ing

, …

sym

bo

ls

•R

an

ge

of

the

Un

ico

de

en

cod

ing

: 0x0000-0x10FFFF (=16*65536)

àC

od

es

tha

t fit

into

th

e fi

rst

16

bits

(d

en

ote

d U+0000-U+FFFF

)e

nco

de

the

mo

st w

ide

ly u

sed

lan

gu

ag

es

an

d t

he

ir c

ha

ract

ers

(Ba

sic

Mu

ltilin

gu

al

Pla

ne

, B

MP

)

àC

od

es

U+

00

00

-U+

00

7F

ha

ve b

ee

n a

ssig

ne

d to

ma

tch

the

7-b

it A

SC

IIe

nco

din

g w

hic

h is

pe

rva

sive

tod

ay.

77

Un

ico

de

Tra

nsf

orm

atio

n F

orm

ats

Cu

rre

nt C

PU

s o

pe

rate

mo

st e

ffici

en

tly o

n 3

2-b

it w

ord

s(1

6-b

it w

ord

s, b

yte

s)

Un

ico

de

thu

s d

eve

lop

ed

Un

ico

de

Tra

nsf

orm

atio

n F

orm

ats

(U

TF

s) w

hic

h d

efin

eh

ow

a U

nic

od

e c

ha

ract

er

cod

e b

etw

ee

n U+0000

an

d U+10FFFF

is t

o b

em

ap

pe

d in

to a

32

-bit

wo

rd (

16-b

it w

ord

, b

yte

).

UT

F-3

2

àS

imp

ly m

ap

exa

ctly

to

the

co

rre

spo

nd

ing

32

-bit

valu

Fo

r e

ach

Un

ico

de

ch

ara

cte

r in

UT

F-3

2:

w

ast

e o

f a

t le

ast

11

bits

!

78

Un

ico

de

Tra

nsf

orm

atio

n F

orm

ats

Cu

rre

nt C

PU

s o

pe

rate

mo

st e

ffici

en

tly o

n 3

2-b

it w

ord

s(1

6-b

it w

ord

s, b

yte

s)

Un

ico

de

thu

s d

eve

lop

ed

Un

ico

de

Tra

nsf

orm

atio

n F

orm

ats

(U

TF

s) w

hic

h d

efin

eh

ow

a U

nic

od

e c

ha

ract

er

cod

e b

etw

ee

n U+0000

an

d U+10FFFF

is t

o b

em

ap

pe

d in

to a

32

-bit

wo

rd (

16-b

it w

ord

, b

yte

).

UT

F-3

2

àS

imp

ly m

ap

exa

ctly

to

the

co

rre

spo

nd

ing

32

-bit

valu

Fo

r e

ach

Un

ico

de

ch

ara

cte

r in

UT

F-3

2:

w

ast

e o

f a

t le

ast

11

bits

!

UT

F-1

6

Ma

p a

Un

ico

de

ch

ara

cte

r in

to o

ne

or

two

16

-bit

wo

rds

àU

+0

00

0 t

o

U+

FF

FF

m

ap

exa

ctly

to t

he

co

rre

spo

nd

ing

16

-bit

valu

ab

ove

U+

FF

FF

: su

btr

act

0x0

10

00

0 a

nd

the

n f

ill th

e

□‘s

in

1101 10□□

□□□□□□□□

1101 11□□

□□□□□□□□

E.g

. U

nic

od

e c

ha

ract

er U+012345

(0x012345 -

0x010000 = 0x02345

)

UT

F-1

6: 1101 1000 0000 1000

1101 1111 0100 0101

Page 14: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

79

Un

ico

de

Tra

nsf

orm

atio

n F

orm

ats

•UT

F-1

6 w

ork

s co

rre

ctly

, b

eca

use

the

ch

ara

cte

r co

de

s b

etw

ee

n

1101 10□□□□□□□□□□and

1101 11□□□□□□□□□□

(with

ea

ch □

rep

lace

d b

y a

0)

are

left

un

ass

ign

ed

in U

nic

od

e!!!

(ra

ng

e 0

xD8

00

–0

xDF

FF

is r

ese

rve

d)

80

Un

ico

de

Tra

nsf

orm

atio

n F

orm

ats

•UT

F-1

6 w

ork

s co

rre

ctly

, b

eca

use

the

ch

ara

cte

r co

de

s b

etw

ee

n

1101 10□□□□□□□□□□and

1101 11□□□□□□□□□□

(with

ea

ch □

rep

lace

d b

y a

0)

are

left

un

ass

ign

ed

in U

nic

od

e!!!

(ra

ng

e 0

xD8

00

–0

xDF

FF

is r

ese

rve

d)

•UT

F-1

6is

th

e o

nly

en

cod

ing

in J

ava

: a

char

is 1

6 b

its

larg

e, n

ot

8 a

s in

C/C

++

WA

RN

ING

: a

“sin

gle

” ch

ara

cte

r w

hic

h u

se

s tw

o 1

6 b

its f

ield

s is m

ad

e u

p o

f 2

ch

ars

!

Exa

mp

le, th

e s

trin

g “

ta

ke

s t

wo

16

bit f

ield

s:

String s = “\uD834\uDD1E”;

s.length(); // returns 2

s.charAt(1); //returns \uDD1E, it’s an INVALID character.

s.codePointAt(1); //returns the integer 0xD834DD1E

g “

81

UT

F-8

Ma

ps

a u

nic

od

e c

ha

ract

er

into

1

, 2

, 3

, o

r 4

b

yte

s.

Un

ico

de

ra

ng

e

B

yte

se

qu

en

ce

U+000000 à

U+00007F 0□□□□□□□

U+000080 à

U+0007FF 110□□□□□10□□□□□□

U+000800 à

U+00FFFF 1110□□□□10□□□□□□10□□□□□□

U+010000 à

U+10FFFF 11110□□□10□□□□□□10□□□□□□10□□□□□□

Sp

are

bits

(□

) a

re f

ille

d fr

om

rig

ht

to le

ft. P

ad

to t

he

left

with

0-b

its.

E.g

. U+00A9

in U

TF

-8is

11000010 10101001

U+2260

in U

TF

-8is

11100010 10001001 10100000

82

UT

F-8

àF

or

a U

TF

-8 m

ulti

-byt

e s

eq

ue

nce

, th

e le

ng

th o

f th

e s

eq

ue

nce

is

eq

ua

l to

the

nu

mb

er

of

lea

din

g 1

-bits

(in

th

e f

irst

byt

e)

àC

ha

ract

er

bo

un

da

rie

s a

re s

imp

le to

de

tect

àU

TF

-8 e

nco

din

g d

oe

s n

ot a

ffect

(b

ina

ry)

sort

ord

er

àTe

xt p

roce

ssin

g s

oft

wa

re d

esi

gn

ed

to d

ea

l with

7-b

it A

SC

IIre

ma

ins

fun

ctio

na

l.

(esp

eci

ally

tru

e f

or

the

C p

rog

ram

min

g la

ng

ua

ge

an

d it

s st

rin

g (char[]

)re

pre

sen

tatio

n)

83

XM

L a

nd

Un

ico

de

àA

co

nfo

rmin

g X

ML

pa

rse

r is

re

qu

ire

dto

co

rre

ctly

pro

cess

UT

F-8

an

dU

TF

-16

en

cod

ed

do

cum

en

ts. (

Th

e W

3C

XM

L R

eco

mm

en

da

tion

p

red

ate

s th

e U

TF

-32

de

finiti

on

)

àD

ocu

me

nts

tha

t u

se a

diff

ere

nt e

nco

din

g m

ust

an

no

un

ce s

o u

sin

g th

eX

ML

text

de

cla

ratio

n, e

.g.,

<?xml encoding=“iso-8859-15”?>

or

<?xml encoding=“utf-32”?>

àO

the

rwis

e, a

n X

ML

pa

rse

r is

en

cou

rag

ed

to

gu

es

s th

e e

nco

din

g w

hile

rea

din

g th

e v

ery

firs

t b

yte

s o

f th

e in

pu

t X

ML

do

cum

en

t:

He

ad

of

do

c (

by

tes

)

E

nc

od

ing

gu

es

s

0x00 0x3C 0x00 0x3F

UT

F-1

6

(little

En

dia

n)

0x3C 0x00 0x3F 0x00

UT

F-1

6

(b

ig E

nd

ian

)0x3c 0x3F 0x78 0x6D

UT

F-8

(o

r A

SC

II)

No

tice

: <

= U

+0

03

C,

? =

U+

00

3F,

x =

U+

00

78

, m =

U+

00

6D

84

XM

L a

nd

Un

ico

de

àW

ha

t d

oe

s “

gu

ess t

he

en

co

din

g”

me

an

? U

nd

er

wh

ich

circu

msta

nce

s

do

es

the

pa

rse

r k

no

wit

ha

s d

ete

rmin

ed

the

co

rre

ct e

nco

din

g?

Are

the

re c

ase

s w

he

n it

ca

nN

OT

de

term

ine

the

co

rre

ct e

nco

din

g?

àW

ha

t a

bo

ut e

ffici

en

cy o

f th

e U

TF

s? F

or

diff

ere

nt

text

s, c

om

pa

re th

e

spa

ce r

eq

uir

em

en

t in

UT

F-8

/16

an

d U

TF

-32

ag

ain

st e

ach

oth

er.

Wh

ich

ch

ara

cte

rs d

o y

ou

fin

d a

bo

ve 0

xFF

FF

in U

nic

od

e?

Ca

n y

ou

ima

gin

e a

sce

na

rio

wh

ere

UT

F-3

2 is

fa

ste

rth

an

UT

F-8

/16

?

Qu

es

tio

ns

Page 15: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

85

Th

e X

ML

Pro

cess

ing

Mo

de

l

àO

n th

e p

hy

sic

alsi

de

, X

ML

de

fine

s n

oth

ing

bu

t a

fla

t te

xt

form

at,

i.e.,

it

de

fine

s a

se

t o

f (e

.g. U

TF

-8/1

6)

cha

ract

er

seq

ue

nce

s b

ein

g

we

ll-fo

rme

d X

ML

.

àA

pp

lica

tion

s th

at

wa

nt

to a

na

lyze

an

d tr

an

sfo

rm X

ML

da

ta in

an

ym

ea

nin

gfu

l wa

y w

ill fi

nd

pro

cess

ing

fla

t ch

ara

cte

r se

qu

en

ces

ha

rda

nd

ine

ffici

en

t!

àT

he

ne

stin

g o

f XM

L e

lem

en

ts a

nd

att

rib

ute

s, h

ow

eve

r, d

efin

es

a lo

gic

al

tre

e-l

ike

str

uc

ture

.

Sta

ff

Nam

e

First

Nam

eLas

tNam

e

Log

inE

xtensi

on

86

Th

e X

ML

Pro

cess

ing

Mo

de

l

àV

irtu

ally

all

XM

L a

pp

lic

ati

on

s o

pe

rate

on

th

e lo

gic

al

tre

e v

iew

wh

ich

is p

rovi

de

d to

th

em

by

an

XM

L p

roce

sso

r (i.e

., “

pa

rse

& s

tore

”).

àX

ML p

roce

sso

rs a

re w

ide

ly a

va

ilab

le (

e.g

., A

pa

ch

e’s

Xe

rce

s).

Ho

w is

th

e X

ML

pro

cess

or

sup

po

sed

to co

mm

un

ica

te th

e

XM

L tr

ee

str

uct

ure

to

th

e a

pp

lica

tion

?

87

Th

e X

ML

Pro

cess

ing

Mo

de

l

àV

irtu

ally

all

XM

L a

pp

lic

ati

on

s o

pe

rate

on

th

e lo

gic

al

tre

e v

iew

wh

ich

is p

rovi

de

d to

th

em

by

an

XM

L p

roce

sso

r (i.e

., “

pa

rse

& s

tore

”).

àX

ML p

roce

sso

rs a

re w

ide

ly a

va

ilab

le (

e.g

., A

pa

ch

e’s

Xe

rce

s).

Ho

w is

th

e X

ML

pro

cess

or

sup

po

sed

to co

mm

un

ica

te th

e

XM

L tr

ee

str

uct

ure

to

th

e a

pp

lica

tion

?

àF

or

ma

ny P

L’s

th

ere

are

“d

ata

bin

din

g”

too

ls.

Giv

es

very

fle

xib

le w

ay

to g

et

PL

vie

w o

f th

e X

ML

tre

e s

tru

ctu

re.

88

Th

e X

ML

Pro

cess

ing

Mo

de

l

àV

irtu

ally

all

XM

L a

pp

lic

ati

on

s o

pe

rate

on

th

e lo

gic

al

tre

e v

iew

wh

ich

is p

rovi

de

d to

th

em

by

an

XM

L p

roce

sso

r (i.e

., “

pa

rse

& s

tore

”).

àX

ML p

roce

sso

rs a

re w

ide

ly a

va

ilab

le (

e.g

., A

pa

ch

e’s

Xe

rce

s).

Ho

w is

th

e X

ML

pro

cess

or

sup

po

sed

to co

mm

un

ica

te th

e

XM

L tr

ee

str

uct

ure

to

th

e a

pp

lica

tion

?

àF

or

ma

ny P

L’s

th

ere

are

“d

ata

bin

din

g”

too

ls.

Giv

es

very

fle

xib

le w

ay

to g

et

PL

vie

w o

f th

e X

ML

tre

e s

tru

ctu

re.

Bu

t firs

t, le

t’s s

ee

wh

at

the

sta

nd

ard

sa

ys…

89

Th

e X

ML

Pro

cess

ing

Mo

de

l

àV

irtu

ally

all

XM

L a

pp

lic

ati

on

s o

pe

rate

on

th

e lo

gic

al

tre

e v

iew

wh

ich

is p

rovi

de

d to

th

em

by

an

XM

L p

roce

sso

r(i.e

., “

pa

rse

& s

tore

”).

àX

ML p

roce

sso

rs a

re w

ide

ly a

va

ilab

le (

e.g

., A

pa

ch

e’s

Xe

rce

s).

Ho

w is

th

e X

ML

pro

cess

or

sup

po

sed

to co

mm

un

ica

te th

e

XM

L tr

ee

str

uct

ure

to

th

e a

pp

lica

tion

?

àth

rou

gh

a fi

xed

inte

rfa

ce o

f acc

ess

or

fun

ctio

ns:

Th

e X

ML

Info

rma

tion

Se

t

Th

e a

cce

sso

r fu

nct

ion

s o

pe

rate

on

diff

ere

nt

typ

es

of n

od

e o

bje

cts:

NO

DE

DO

CE

LE

MA

TT

RC

HA

R

90

Acce

sso

r F

un

ctio

ns (“

no

de

pro

pe

rtie

s”)

Node type Property Comment

DOC childern: DOCàELEM root element

base-uri: DOCàSTRING

version : DOCàSTRING

ELEM localname : ELEMàSTRING

children : ELEMà[NODE] [..]=sequence

attributes: ELEMà[ATTR] type

parent : ELEMàNODE

ATTR localname : ATTRàSTRING

value : ATTRàSTRING

owner : ATTRàELEM

CHAR code : CHARàUNICODE a single character

parent : CHARàELEM

XM

L In

form

atio

n S

et

-http://www.w3.org/TR/xml-infoset

Page 16: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

91

Info

rma

tion

se

t o

f a

sa

mp

le d

ocu

me

nt

<?xml version=“1.0”?>

<forecast date=“Thu, May 16”>

<condition>sunny</condition>

<temperature unit=“Celsius”>23</temperature>

</forecast>

children(d) = e1 code(c1) = U+0073 (=‘s’)

base-uri(d) = “file:/…” parent(c1)= e2

version(d) = “1.0” . . .

code(c5) = U+0079 (=‘y’)

localname(e1) = “forecast” parent(c5)= e2

children(e1) = [e2,e3]

attributes(e1)= [a1] localname(e3) =“temperature”

parent(e1) = d children(e3) =[c6,c7]

attributes(e3)=[a2]

localname(a1) = “date” parent(e3) =a1

value(a1) = “Thu, May 16” . . .

owner(a1) = e1

localname(e2) = “condition”

children(e2) = [c1,c2,c3,c4,c5]

attributes(e2)= []

parent(e2) = e1

92

Qu

es

tio

ns

(1)

A N

OD

E ty

pe

ca

n b

e o

ne

of

DO

C,

EL

EM

, AT

TR

, o

r C

HA

R.

In th

e t

wo

pla

ces

of

the

pro

pe

rty

fun

ctio

ns

wh

ere

NO

DE

ap

pe

ars

, w

hic

h o

f th

e f

ou

r ty

pe

s m

ay

actu

ally

ap

pe

ar

the

re?

Fo

r in

sta

nce

, is

this

allo

we

d?

localname(e1) = “condition”

children(e1) = [c1,e2,c2]

(2)

Wh

at

ab

ou

t WH

ITE

SP

AC

E?

Wh

ere

in a

n X

ML

do

cum

en

t do

es

it m

att

er,

an

d w

he

re n

ot?

Wh

ere

in th

e I

nfo

set

Exa

mp

le (

pre

vio

us

slid

e)

are

the

re

turn

s a

nd

ind

en

tatio

ns

of

the

do

cum

en

t?(d

id w

e d

o a

mis

take

? If

so

, w

ha

t is

the

co

rre

ct In

fose

t?)

93

Qu

ery

ing

th

e In

fos

et

Usi

ng

the

In

fose

t, w

e c

an

an

aly

ze a

giv

en

XM

L d

ocu

me

nt i

n m

an

y w

ays

.F

or

inst

an

ce:

àF

ind

all

EL

EM

no

de

s w

ith lo

caln

am

e=bubble

, o

wn

ing

an

AT

TR

no

de

with

lo

caln

am

e=speaker

an

d v

alu

e=Dilbert

.

àL

ist

all panel

EL

EM

no

de

s co

nta

inin

g a

bu

bb

le s

po

ken

by “Dogbert”

àS

tart

ing

in p

an

el 2

(A

TT

R no

), f

ind

all

bu

bb

les

follo

win

g th

ose

sp

oke

n b

y “

Alic

e”

Su

ch q

ue

rie

s a

pp

ea

r ve

ry o

fte

n a

nd

ca

n c

on

ven

ien

tly b

e d

esc

rib

ed

usi

ng

XP

ath

qu

eri

es:

à//bubble/[@speaker=“Dilbert”]

à//panel[.//bubble[@speaker=“Dogbert”]]

à//panel[@no=“2”]//bubble[@speaker=“Alice”]/following::bubble

94

3.

Pa

rse

rs f

or

XM

L

Two

diff

ere

nt

ap

pro

ach

es:

(1)

Pa

rse

r st

ore

s d

ocu

me

nt i

nto

a f

ixe

d (

sta

nd

ard

) d

ata

str

uct

ure

(e

.g.,

an

In

fose

t co

mp

lian

t da

ta s

tru

ctu

re,

such

as

DO

M)

parser.parse(“foo.xml”);

doc = parser.getDocument ();

(2)

Pa

rse

r tr

igg

ers

“e

ve

nts

”. D

oe

s n

ot sto

re!

Use

r h

as

to w

rite

ow

n c

od

e o

n h

ow

to s

tore

/ p

roce

ss t

he

eve

nts

trig

ge

red

by

the

pa

rse

r.

DO

M =

Do

cum

en

t Ob

ject

Mo

de

l

àW

3C

sta

nd

ard

, se

e http://www.w3.org/TR/REC-DOM-Level-1/

parser.parse(“foo.xml”);

doc = parser.getDocument ();

95

DO

M –

Do

cum

en

t O

bje

ct M

od

el

àL

an

gu

ag

e a

nd

pla

tfo

rm-in

de

pe

nd

en

t vie

w o

f X

ML

àD

OM

AP

Is e

xist

fo

r m

an

y P

Ls

(Ja

va,

C+

+, C

, P

erl,

Pyth

on

, …

)

DO

M r

elie

s o

n t

wo

ma

in c

on

cep

ts

(1)

Th

e X

ML

pro

cess

or

con

stru

cts

the

c

om

ple

te X

ML

do

cu

me

nt

tre

e(in

-me

mo

ry)

(2)

Th

e X

ML

ap

plic

atio

n is

sue

s D

OM

lib

rary

ca

lls to

ex

plo

rea

nd

m

an

ipu

late

the

XM

L tr

ee

, or

to g

en

era

ten

ew

XM

L tr

ee

s.

Ad

van

tag

es

ea

sy t

o u

se•

on

ce in

me

mo

ry,

no

tric

ky is

sue

s w

ith X

ML

syn

tax

an

ymo

re•

all

DO

M tr

ee

s se

ria

lize

to w

ell-

form

ed

XM

L (e

ven

aft

er

arb

itra

ry u

pd

ate

s)!

96

DO

M –

Do

cum

en

t O

bje

ct M

od

el

àL

an

gu

ag

e a

nd

pla

tfo

rm-in

de

pe

nd

en

t vie

w o

f X

ML

àD

OM

AP

Is e

xist

fo

r m

an

y P

Ls

(Ja

va,

C+

+, C

, P

erl,

Pyth

on

, …

)

DO

M r

elie

s o

n t

wo

ma

in c

on

cep

ts

(1)

Th

e X

ML

pro

cess

or

con

stru

cts

the

c

om

ple

te X

ML

do

cu

me

nt

tre

e(in

-me

mo

ry)

(2)

Th

e X

ML

ap

plic

atio

n is

sue

s D

OM

lib

rary

ca

lls to

ex

plo

rea

nd

m

an

ipu

late

the

XM

L tr

ee

, or

to g

en

era

ten

ew

XM

L tr

ee

s.

Dis

ad

van

tag

e

Use

s L

OT

S o

f m

em

ory

!!

Ad

van

tag

es

ea

sy t

o u

se•

on

ce in

me

mo

ry,

no

tric

ky is

sue

s w

ith X

ML

syn

tax

arise

an

ymo

re•

all

DO

M tr

ee

s se

ria

lize

to w

ell-

fro

me

d X

ML

(eve

n a

fte

r a

rbitr

ary

up

da

tes)

!

Page 17: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

97

DO

M L

eve

l 1 (

Co

re)

No

de

Pro

ce

ssin

gIn

str

uctio

nC

ha

racte

rDa

taA

ttr

Ele

me

nt

Do

cu

me

nt

Text

Co

mm

en

t

CD

ATA

sect

ion

Na

me

No

de

Ma

pN

od

eL

ist

Ch

ara

cte

r st

rin

gs

(DO

M t

ype

DO

MS

trin

g)

are

de

fine

d to

be

en

cod

ed

usi

ng

UT

F-1

6 (

e.g

., J

ava

DO

M r

ere

sen

ts ty

pe

DO

MS

trin

gu

sin

gits

String

typ

e).

98

DO

M L

eve

l 1 (

Co

re)

So

me

me

tho

ds

DOM type Method Comment

Node nodeName

nodeValue

parentNode : Node

firstChild : Node leftmost child

nextSibling : Node returns NULL for root elem

or last child or attributes

childNodes : NodeList

attributes : NamedNodeMap

ownerDocument: Document

replaceChild : Node

Document createElement : Element creates element with

given tag name

createComment : Comment

getElementsByTagName: NodeList list of all Elem nodes

in document order

: DOMString redefined in subclasses

99

DO

M L

eve

l 1 (

Co

re)

Na

me

,Va

lue

, an

d a

ttrib

ute

sd

ep

en

d o

n t

he

ty

pe

o

f th

e c

urr

en

t no

de

.

10

0

DO

M L

eve

l 1 (

Co

re)

So

me

de

tails

Cre

atin

g a

n element/attribute u

sin

g createElement/createAttribute

do

es

no

tw

ire

the

ne

w n

od

e w

ith th

e X

ML

tre

e s

tru

ctu

re y

et.

àC

all

insertBefore, replaceChild

, …

,

to w

ire

a n

od

e a

t a

n e

xp

licity p

ositio

n

DO

M ty

pe

NodeList

ma

kes

up

fo

the

lack

of

colle

ctio

n d

ata

typ

es

in m

ost

pro

gra

mm

ing

lan

gu

ag

es

DO

M ty

pe

NamedNodeMap

rep

rese

nts

an

association table

(no

de

s m

ay

be

a

cce

sse

d b

y n

am

e)

Exa

mp

le:

v0

a1

a2

bubble

speaker

togetAttributes

“speaker” à

a1

“to” à

a2

A N

am

ed

No

de

Ma

p

Me

tho

ds:

getNamedItem

, setNamedItem

,…

bubble.

getAttributes().

getNamedItem(“ … “)

10

1

E.g

. F

ind

all

occ

urr

en

ces

of Dogbert

spe

aki

ng

(a

ttrib

ute

speaker

of e

lem

en

t bubble

)

10

2

Qu

es

tio

ns

Giv

en

an

XM

L fil

e o

f, s

ay,

50

K, h

ow

larg

e is

its D

OM

re

pre

sen

tatio

n in

ma

in m

em

ory

?

Ho

w m

uch

larg

er, in

the

worst case

, is

a D

OM

re

pre

sen

tatio

n

with

re

spe

ct to

th

e s

ize

of t

he

XM

L d

ocu

me

nt?

(d

iffic

ult!

)

Ho

w c

ou

ld w

e d

ecr

ea

se th

e m

em

ory

ne

ed

of

DO

M,

wh

ilep

rese

rvin

g it

s fu

nct

ion

alit

y?

Page 18: XML and Databases - Computer Science and Engineeringcs4317/10s1/lectures/01_handout.pdf · XML and Databases Sebastian Maneth NICTA and UNSW Lecture 1 Introduction to XML CSE@UNSW

10

3

Ne

xt

“Eve

n trig

ge

r” P

ars

ers

fo

r X

ML

:

àB

uild

yo

ur

ow

n X

ML

da

ta s

tru

ctu

rea

nd

fill

it

up

as t

he

pa

rse

r tr

igg

ers

in

pu

t “e

ve

nts

”.

EN

DL

ect

ure

1