Introduction to XML Data Management Issues

Upload: ellen-wood

Post on 25-Dec-2015


Page 1

Introduction to XML

Data Management Issues

Page 2

Types of data

Structured

Semi-structured

Page 3

Structured Data

Data is organized in entities (tables).

Entities have attributes.

Page 4

Current Database World

– Structure: Relational Database Management System (DBMS): everything is a table

– Software: MS Access, Oracle, …

Page 5

Example of a table (patients)

Page 6

Example of a group of tables

Page 7

MS Access Table Links

Page 8

World of Web Data

– Easy document exchange

– Unstructured (or poorly structured) data: everything is a document

– No standard for query languages

Page 9

World of Web Data

Example

– An organization A publishes financial data on its web pages (HTML), generated from a DBMS.

– A second organization B wants to run some financial analyses but can access only the web data.

[Figure: diagram labeled RDBMS, A, B and HTML]

Page 10

Semi-structured Data

Data can be of any type.

It does not necessarily follow a fixed format or a fixed set of rules.

Examples include:

– text
– video
– sound
– images

Page 11

Characteristics of Semi-Structured Data

Structure is irregular: attributes may be missing or added.

Parts of the data lack structure, e.g., images.

Some parts may yield little structure, e.g., plain text.

Page 12

Semi-structured Data Definition

Data that is inherently self-describing and does not conform to an explicit and fixed rule is known as semi-structured data.

The structure of the data is contained within the data itself.

Page 13

Example of Semi-Structured Data

name: Peter Wood
email: [email protected], [email protected]
----------
name:
• first name: Mark
• last name: Levene
email: [email protected]
----------
name: Alex Smith
affiliation: StFX

Page 14

IMDB – A Motivating Example

The Internet Movie Database is a classical example of a collection of semi-structured data.

Although the information pertaining to different movies may be essentially similar, its structure may differ from entry to entry.

Let us consider an example movie database.

Page 15

An Example Movie Database

Page 16

IMDB: Irregularity in Structure

• Different layout for movies and TV series
• Movie entries show Director, Writers and Stars
• TV entries show just Creators & Stars

Captain Phillips (Movie)

Lost (TV Series)

Page 17

XML – An Embodiment of Semi-structured Data

XML can be used to represent semi-structured data.

Page 18

What is XML?

XML stands for EXtensible Markup Language.

XML is a markup language much like HTML (tags).

XML was designed to describe data.

XML tags are not predefined. You must define your own tags.

Page 19

The main difference between XML and HTML

XML and HTML were designed with different goals:

XML was designed to describe data and to focus on what data is.

HTML was designed to display data and to focus on how data looks.

It is important to understand that XML is not a replacement for HTML.

Page 20

XML does not DO anything

Maybe it is a little hard to understand, but XML DOES NOT DO ANYTHING. XML was created to structure, store and send information.

The note below has a header and a message body. It also has sender and receiver information. But still, this XML document does not DO anything. It is just pure information wrapped in XML tags. Someone must write a piece of software to send, receive or display it.

<note>

<to>John</to>

<from>Mary</from>

<heading>Reminder</heading>

<body>Don't forget me this weekend!</body>

</note>
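As an illustration of "someone must write a piece of software", here is a minimal Python sketch that reads the note and displays it. The slides do not prescribe a language or library; the choice of the standard-library `xml.etree.ElementTree` module and the `describe_note` helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The note from the slide, copied as plain text.
NOTE_XML = """
<note>
  <to>John</to>
  <from>Mary</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
"""


def describe_note(xml_text: str) -> str:
    """Turn the note document into a one-line, human-readable summary."""
    note = ET.fromstring(xml_text)
    # findtext returns the text content of the first matching child element.
    return (
        f"{note.findtext('heading')}: {note.findtext('from')} -> "
        f"{note.findtext('to')}: {note.findtext('body')}"
    )


print(describe_note(NOTE_XML))
```

The XML itself stays passive; all of the "doing" lives in the program that chooses which elements to read and how to present them.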

Page 21

XML is free and extensible

XML tags are not predefined. You must "invent" your own tags.

The tags used to mark up HTML documents and the structure of HTML documents are predefined (like <b>, <i>, <h1>, etc.).

XML allows authors to define their own tags and their own document structure.

The tags in the example above (like <to> and <from>) are not defined in any XML standard. These tags are "invented" by the author of the XML document.

Page 22

XML is used to Exchange Data

With XML, data can be exchanged between incompatible systems.

In the real world, computer systems and databases contain data in incompatible formats. One of the most time-consuming challenges for developers has been to exchange data between such systems over the Internet.

Since XML data is stored in plain text format, XML provides a software- and hardware-independent way of sharing data.
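A small sketch of the exchange idea, assuming Python's standard-library `xml.etree.ElementTree`: one function plays the sending system and turns a record into XML text, another plays the receiving system and rebuilds the record from that text alone. The `patient` record and both function names are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def record_to_xml(tag: str, record: dict) -> str:
    """Serialize a flat record to XML text (the 'sending' system)."""
    root = ET.Element(tag)
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text: str) -> dict:
    """Rebuild the record from XML text (the 'receiving' system)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


patient = {"id": 17, "name": "Jane Doe", "ward": "B2"}  # hypothetical data
wire_text = record_to_xml("patient", patient)
print(wire_text)
print(xml_to_record(wire_text))
```

Note that the receiver gets every value back as a string: plain text carries no data types, so the two sides still have to agree on what each element means.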

Page 23

XML can be used to Create new Languages

XML is the basis of WML (the Wireless Markup Language), the markup language used with WAP (the Wireless Application Protocol).

WML is used to mark up Internet applications for handheld devices like mobile phones.

Page 24

XML and Microsoft Office

Starting with Office 2007, Microsoft changed the default file format of Word, Excel and PowerPoint documents.

They are saved in an XML-based format.

So a Word file is a ZIP archive holding a number of files, including the text in XML format.

Advantages:

– Small file size
– Compatibility with other software

Older Word files have the extension DOC; new ones use DOCX.
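The "ZIP archive holding XML files" idea can be sketched in Python with the standard-library `zipfile` module. The archive below is a tiny stand-in built in memory, not a real Word file: the member name `word/document.xml` matches where DOCX keeps its main text, but the `<document>` and `<paragraph>` elements are a simplified, hypothetical vocabulary.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Build a tiny stand-in archive in memory. A real DOCX has more parts
# and a richer, namespaced XML vocabulary than this hypothetical one.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr(
        "word/document.xml",
        "<document><paragraph>Hello from XML</paragraph></document>",
    )

# Reading it back: list the members, then parse the XML part.
with zipfile.ZipFile(buffer) as archive:
    members = archive.namelist()
    root = ET.fromstring(archive.read("word/document.xml"))

print(members)
print(root.findtext("paragraph"))
```

Opening a real DOCX file the same way (passing its path to `zipfile.ZipFile`) lists its XML parts.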

Page 25

XML Syntax

The syntax rules of XML are very simple and very strict. The rules are very easy to learn, and very easy to use.

Because of this, creating software that can read and manipulate XML is very easy to do.

Page 26

All XML elements must have a closing tag

Elements (tags) are the basic building blocks of any XML document.

With XML, it is illegal to omit the closing tag.

In HTML some elements do not have to have a closing tag. The following code is legal in HTML:

<p>This is a paragraph

In XML all elements must have a closing tag, like this:

<par>This is a paragraph</par>

Page 27

XML tags are case sensitive

Unlike HTML, XML tags are case sensitive.

With XML, the tag <Letter> is different from the tag <letter>.

Opening and closing tags must therefore be written with the same case:

<Message>This is incorrect</message>

<message>This is correct</message>

Page 28

All XML elements must be properly nested

Improper nesting of tags makes no sense to XML.

In HTML, browsers tolerate elements that are improperly nested within each other, like this:

<b><i>This text is bold and italic</b></i>

In XML all elements must be properly nested within each other, like this:

<bold><italic>This text is bold and italic</italic></bold>

Page 29

All XML documents must have a root element (tag)

All XML documents must contain a single tag pair to define a root element.

All other elements must be within this root element.

All elements can have sub elements (child elements). Sub elements must be correctly nested within their parent element:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>

Page 30

With XML, white space is preserved

With XML, the white space in your document is not truncated.

This is unlike HTML. With HTML, a sentence like this:

Hello              my name is John,

will be displayed like this:

Hello my name is John,

because HTML collapses consecutive white space characters into a single space.
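A quick way to see the difference, sketched in Python with the standard-library `xml.etree.ElementTree` parser. The `<greeting>` element name is made up for the example, and the "collapsed" string only imitates what an HTML renderer would display.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring("<greeting>Hello              my name is John</greeting>")

# The XML parser hands back the text exactly as written, spaces included.
preserved = element.text

# Roughly what an HTML renderer shows: runs of white space become one space.
collapsed = " ".join(preserved.split())

print(repr(preserved))
print(repr(collapsed))
```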

Page 31

Element Naming

XML elements must follow these naming rules:

Names can contain letters, digits, and other characters such as hyphens, underscores and periods.

Names must start with a letter or an underscore, not with a digit or other punctuation character.

Names must not start with the letters xml (or XML, Xml, ...).

Names cannot contain spaces.
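The naming rules can be approximated with a short check in Python. This is a simplified, ASCII-only sketch: the XML specification also allows many non-ASCII letters, and colons (normally reserved for namespaces), so `is_valid_element_name` is a hypothetical helper and not a full implementation of the standard.

```python
import re

# First character: a letter or underscore.
# Remaining characters: letters, digits, underscores, periods, hyphens.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_valid_element_name(name: str) -> bool:
    """Apply the slide's naming rules to an ASCII element name."""
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" (in any case) are reserved
    return _NAME.fullmatch(name) is not None


for candidate in ["patient", "first_name", "2ndAddress", "XmlData", "first name"]:
    print(candidate, is_valid_element_name(candidate))
```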

Page 32

Element Naming

Any name can be used, no words are reserved, but the idea is to make names descriptive.

XML documents often have a corresponding database, in which fields exist corresponding to elements in the XML document. A good practice is to use the naming rules of your database for the elements in the XML documents.

Page 33

Comments in XML

The syntax for writing comments in XML is similar to that of HTML.

<!-- This is a comment -->

Page 34

Errors in XML will stop the XML program

The World Wide Web Consortium (W3C) XML specification states that a program should not continue to process an XML document if it finds a well-formedness error. The reason is that XML software should be easy to write, and that all XML documents should be compatible.

With HTML it was possible to create documents with lots of errors (like when you forget an end tag). One of the main reasons that HTML browsers are so big and incompatible is that they have their own ways to figure out what a document should look like when they encounter an HTML error.

With XML this should not be possible.
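This behaviour is easy to observe. The sketch below, assuming Python's standard-library `xml.etree.ElementTree` parser, feeds it one well-formed document and one violation of each syntax rule from the earlier slides; the `SAMPLES` table and the `check` helper are names invented for this example.

```python
import xml.etree.ElementTree as ET

SAMPLES = {
    "well formed": "<message>This is correct</message>",
    "missing closing tag": "<par>This is a paragraph",
    "case mismatch": "<Message>This is incorrect</message>",
    "improper nesting": "<b><i>bold and italic</b></i>",
    "two root elements": "<a></a><b></b>",
}


def check(xml_text: str) -> str:
    """Return 'ok' if the text parses, otherwise the parser's complaint."""
    try:
        ET.fromstring(xml_text)
        return "ok"
    except ET.ParseError as error:
        # The parser stops at the first error instead of guessing a repair.
        return f"rejected ({error})"


results = {label: check(text) for label, text in SAMPLES.items()}
for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

Only the first sample is accepted; for the others the parser raises an error rather than trying to recover, which is the opposite of how HTML browsers treat broken markup.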

Page 35

XML and Web Browsers

Internet Explorer 5.0+, Google Chrome and Firefox support XML.

Page 36

Viewing XML Files

If you open an XML document in IE (or other browsers), it will display the document with color-coded root and child elements. A plus (+) or minus sign (-) to the left of the elements can be clicked to expand or collapse the element structure.

If you want to view the raw XML source, you must select "View Source" from the browser menu.

If an erroneous XML file is opened, the browser will report the error.

Page 37

Other Examples

Viewing some XML documents will help you get the XML feeling.

An XML CD catalog: a CD collection, stored as XML data.

An XML plant catalog: a plant catalog from a plant shop, stored as XML data.

A Simple Food Menu: a breakfast food menu from a restaurant, stored as XML data.

Page 38

Why does XML display like this?

XML documents do not carry information about how to display the data.

Since XML tags are "invented" by the author of the XML document, browsers do not know if a tag like <table> describes an HTML table or a dining table.

Without any information about how to display the data, most browsers will just display the XML document as it is.

Page 39

The XML Rules (Summary)

1. Single, unique root element

2. Matching open/close tags

3. Consistent capitalisation

4. Correctly nested elements

5. Tag naming

<?xml version="1.0"?>
<company id="4859">
  <name>3Months.com</name>
  <type>Web Development</type>
  <address>
    <street>Wakefield st</street>
    <city>Wellington</city>
    <country>New Zealand</country>
  </address>
</company>

Page 40

Authoring XML Documents

A basic XML document is a single XML element that may, but need not, contain nested XML elements.

Example:

<books>
  <book>
    <title>Second Chance</title>
    <author>Matthew Dunn</author>
    <ISBN>123456789</ISBN>
  </book>
</books>
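XML documents are often produced by a program instead of being typed by hand. As a sketch, assuming Python's standard-library `xml.etree.ElementTree`, the books example can be authored like this; the library takes care of matching tags and nesting.

```python
import xml.etree.ElementTree as ET

# Build the tree top-down: each SubElement call nests a child in its parent.
books = ET.Element("books")
book = ET.SubElement(books, "book")
ET.SubElement(book, "title").text = "Second Chance"
ET.SubElement(book, "author").text = "Matthew Dunn"
ET.SubElement(book, "ISBN").text = "123456789"

# Serialize to text; every opening tag gets its closing tag automatically.
document = ET.tostring(books, encoding="unicode")
print(document)
```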

Page 41

Use of XML and HTML together

This is pure data in an XML file.

This is a pure formatting file used to display the same data.

View the result with Google Chrome or IE 6+.

Page 42

Converting Relational Database to XML

Example: Exporting the following data into XML

Relational Database:

Store (sid, name, phone)

Book (bid, title, authors)

BookStore (sid, bid, price, stock)

[Figure: diagram of Store, Book and BookStore with their attributes sid, name, phone, bid, title, authors, price, stock]

Page 43

Converting Relational Database to XML (Cont'd)

XML:

<store>
  <sid>123</sid>
  <name>Chapter</name>
  <phone>429-8976</phone>
  <book>
    <title>The Da Vinci Code</title>
    <authors>Dan Brown</authors>
    <bid>987</bid>
  </book>
  <book>…</book>
  …
</store>
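One possible way to perform this export, sketched in Python with the standard-library `sqlite3` and `xml.etree.ElementTree` modules. Several things here are assumptions for illustration and not part of the slides: the in-memory SQLite database, the column types, the price and stock values, the `export_stores` function, and the `<stores>` wrapper element (added so that several stores can share a single root).

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Store (sid INTEGER PRIMARY KEY, name TEXT, phone TEXT);
    CREATE TABLE Book (bid INTEGER PRIMARY KEY, title TEXT, authors TEXT);
    CREATE TABLE BookStore (sid INTEGER, bid INTEGER, price REAL, stock INTEGER,
                            PRIMARY KEY (sid, bid));
    INSERT INTO Store VALUES (123, 'Chapter', '429-8976');
    INSERT INTO Book VALUES (987, 'The Da Vinci Code', 'Dan Brown');
    -- price and stock are made-up values for illustration
    INSERT INTO BookStore VALUES (123, 987, 9.99, 5);
""")


def export_stores(connection: sqlite3.Connection) -> ET.Element:
    """Nest each store's books under its <store> element."""
    root = ET.Element("stores")
    store_rows = connection.execute(
        "SELECT sid, name, phone FROM Store ORDER BY sid").fetchall()
    for sid, name, phone in store_rows:
        store = ET.SubElement(root, "store")
        ET.SubElement(store, "sid").text = str(sid)
        ET.SubElement(store, "name").text = name
        ET.SubElement(store, "phone").text = phone
        # The BookStore link table tells us which books this store carries.
        book_rows = connection.execute(
            "SELECT b.title, b.authors, b.bid FROM Book b "
            "JOIN BookStore bs ON bs.bid = b.bid "
            "WHERE bs.sid = ? ORDER BY b.bid", (sid,)).fetchall()
        for title, authors, bid in book_rows:
            book = ET.SubElement(store, "book")
            ET.SubElement(book, "title").text = title
            ET.SubElement(book, "authors").text = authors
            ET.SubElement(book, "bid").text = str(bid)
    return root


stores = export_stores(conn)
print(ET.tostring(stores, encoding="unicode"))
```

The join through BookStore is what turns the flat link table into nesting: in the relational schema the relationship is a separate table, while in XML it is expressed by placing each <book> inside its <store>.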

Page 44

Examples

Example of a database

Example of a database converted to XML

Page 45

XML representation of a sample Movie Database

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<IMDb>
  <Movies>
    <Movie>
      <Title>The Notebook</Title>
      <Actor>Ryan Gosling</Actor>
      <Actor>Rachel McAdams</Actor>
      <Director>Nick Cassavetes</Director>
    </Movie>
    <Movie>
      <Title>300</Title>
      <Actor>Gerard Butler</Actor>
      <Actor>Lena Headey</Actor>
      <Director>Zack Snyder</Director>
    </Movie>
  </Movies>
</IMDb>
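Once the movie data is in XML, a program can query it by element name. A minimal sketch, assuming Python's standard-library `xml.etree.ElementTree`; the XML declaration is left out of the string for simplicity, and `actors_by_title` and `titles_directed_by` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

IMDB_XML = """
<IMDb>
  <Movies>
    <Movie>
      <Title>The Notebook</Title>
      <Actor>Ryan Gosling</Actor>
      <Actor>Rachel McAdams</Actor>
      <Director>Nick Cassavetes</Director>
    </Movie>
    <Movie>
      <Title>300</Title>
      <Actor>Gerard Butler</Actor>
      <Actor>Lena Headey</Actor>
      <Director>Zack Snyder</Director>
    </Movie>
  </Movies>
</IMDb>
"""

imdb = ET.fromstring(IMDB_XML)


def actors_by_title(root: ET.Element) -> dict:
    """Map each movie title to its list of actors."""
    cast = {}
    for movie in root.findall("./Movies/Movie"):
        title = movie.findtext("Title", default="").strip()
        # A movie may list any number of <Actor> elements, including none.
        cast[title] = [(actor.text or "").strip() for actor in movie.findall("Actor")]
    return cast


def titles_directed_by(root: ET.Element, director: str) -> list:
    """Titles of movies whose <Director> matches; entries without one are skipped."""
    return [
        movie.findtext("Title", default="").strip()
        for movie in root.iter("Movie")
        if (movie.findtext("Director") or "").strip() == director
    ]


print(actors_by_title(imdb))
print(titles_directed_by(imdb, "Zack Snyder"))
```

Because the queries look elements up by name and tolerate missing ones, they keep working when entries have irregular structure, such as a TV series with Creators but no Director.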

Page 46

XML Joke

Question: When should I use XML?

Answer: When you need a buzzword in your resume.