the joy of sax (and dom, and jdom…) bill maccartney 11 october 2004

28
The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Upload: jeffry-woods

Post on 26-Dec-2015

223 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

The Joy of SAX(and DOM, and JDOM…)

Bill MacCartney11 October 2004

Page 2: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Roadmap

What are XML APIs good for? Overview of JAXP

JAXP XML Parsers SAX vs. DOM

SAX SAX Architecture Using SAX SAXExample.java

DOM DOM Architecture Using DOM DOMExample.java

JDOM

Page 3: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

What are XML APIs for?

You want to read/write data from/to XML files, and you don't want to write an XML parser.

Applications: processing an XML-tagged corpus

saving configs, prefs, parameters, etc. as XML files

sharing results with outside users in portable format example: typed dependency relations

alternative to serialization for persistent stores doesn't break with changes to class definition

human-readable

Page 4: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Overview of JAXP

javax.xml.parsers The main JAXP APIs, which provide a common interface for various SAX and DOM parsers.

org.w3c.dom Defines the Document class (a DOM), as well as classes for all of the components of a DOM.

org.xml.sax Defines the basic SAX APIs.

javax.xml.transform Defines the XSLT APIs that let you transform XML into other forms. (Not covered today.)

JAXP = Java API for XML Processing Provides a common interface for creating and using the

standard SAX, DOM, and XSLT APIs in Java.

All JAXP packages are included standard in JDK 1.4+.The key packages are:

Page 5: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

JAXP XML Parsers

javax.xml.parsers defines abstract classes DocumentBuilder (for DOM) and SAXParser (for SAX). It also defines factory classes DocumentBuilderFactory and

SAXParserFactory. By default, these give you the “reference implementation” of DocumentBuilder and SAXParser, but they are intended to be vendor-neutral factory classes, so that you could swap in a different implementation if you preferred.

The JDK includes three XML parser implementations from Apache: Crimson: The original. Small and fast. Based on code donated to

Apache by Sun. Standard implementation for J2SE 1.4.

Xerces: More features. Supports XML Schema. Based on code donated to Apache by IBM.

Xerces 2: The future. Standard implementation for J2SE 5.0.

Page 6: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

SAX vs. DOM

Java-specific

interprets XML as a stream of events

you supply event-handling callbacks

SAX parser invokes your event-handlers as it parses

doesn't build data model in memory

serial access

very fast, lightweight

good choice when

a) no data model is needed, or

b) natural structure for data model is list, matrix, etc.

W3C standard for representing structured documents

platform and language neutral(not Java-specific!)

interprets XML as a tree of nodes

builds data model in memory

enables random access to data

therefore good for interactive apps

more CPU- and memory-intensive

good choice when data model has natural tree structure

SAX = Simple API for XML DOM = Document Object Model

There is also JDOM … more later

Page 7: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

SAX Architecture

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Page 8: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Using SAX

import javax.xml.parsers.*;import org.xml.sax.*;import org.xml.sax.helpers.*;

// get a SAXParser objectSAXParserFactory factory = SAXParserFactory.newInstance();SAXParser saxParser = factory.newSAXParser();

// invoke parser using your custom content handlersaxParser.parse(inputStream, myContentHandler);saxParser.parse(file, myContentHandler);saxParser.parse(url, myContentHandler);

(This reflects SAX 1, which you can still use, but SAX 2 prefers a new incantation…)

Here’s the standard recipe for starting with SAX:

Page 9: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Using SAX 2

// tell SAX which XML parser you want (here, it’s Crimson)System.setProperty("org.xml.sax.driver",

"org.apache.crimson.parser.XMLReaderImpl");

// get an XMLReader objectXMLReader reader = XMLReaderFactory.createXMLReader();

// tell the XMLReader to use your custom content handlerreader.setContentHandler(myContentHandler);

// Have the XMLReader parse input from Reader myReader:reader.parse(new InputSource(myReader));

But where does myContentHandler come from?

In SAX 2, the following usage is preferred:

Page 10: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Defining a ContentHandler

Easiest route: define a new class which extends org.xml.sax.helpers.DefaultHandler.

Override event-handling methods from DefaultHandler:

(All are no-ops in DefaultHandler.)

startDocument() // receive notice of start of documentendDocument() // receive notice of end of documentstartElement() // receive notice of start of each elementendElement() // receive notice of end of each elementcharacters() // receive a chunk of character dataerror() // receive notice of recoverable parser error

// ...plus more...

Page 11: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

startElement()and endElement()

For simple usage, ignore namespaceURI and localName, and just use qName (the “qualified” name).

XML namespaces are described in an appendix, below.

startElement() and endElement() events always come in pairs:

“<foo/>” will generate calls:

startElement("", "", "foo", null)endElement("", "", "foo")

startElement(String namespaceURI, // for use w/ namespaces String localName, // for use w/ namespaces String qName, // "qualified" name -- use this one! Attributes atts)

endElement(String namespaceURI, String localName, String qName)

The SAXParser invokes your callbacks to notify you of events:

Page 12: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

SAX Attributes

Every call to startElement() includes an Attributes object which represents all the XML attributes for that element.

Methods in the Attributes interface:

getLength() // return number of attributesgetIndex(String qName) // look up attribute's index by qNamegetValue(String qName) // look up attribute's value by qNamegetValue(int index) // look up attribute's value by index// ... and others ...

Page 13: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

SAX characters()

May be called multiple times within each block of character data—for example, once per line.

So, you may want to use calls to characters() to accumulate characters in a StringBuffer, and stop accumulating at the next call to startElement().

public void characters(char[] ch, // buffer containing chars int start, // start position in buffer int length) // num of chars to read

The characters() event handler receives notification of character data (i.e. content that is not part of an XML element):

Page 14: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

SAXExample: Input XML

<?xml version="1.0" encoding="UTF-8"?>

<dots> this is before the first dot and it continues on multiple lines <dot x="9" y="81" /> <dot x="11" y="121" /> <flip> flip is on <dot x="196" y="14" /> <dot x="169" y="13" /> </flip> flip is off <dot x="12" y="144" /> <extra>stuff</extra> <!-- a final comment --></dots>

Page 15: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

SAXExample: Code

Please see SAXExample.java

Page 16: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

SAXExample: Input Output

startDocumentstartElement: dots (0 attributes)characters: this is before the first dot and it continues on multiple linesstartElement: dot (2 attributes)endElement: dotstartElement: dot (2 attributes)endElement: dotstartElement: flip (0 attributes)characters: flip is onstartElement: dot (2 attributes)endElement: dotstartElement: dot (2 attributes)endElement: dotendElement: flipcharacters: flip is offstartElement: dot (2 attributes)endElement: dotstartElement: extra (0 attributes)characters: stuffendElement: extraendElement: dotsendDocument

Finished parsing input. Got the following dots:[(9, 81), (11, 121), (14, 196), (13, 169), (12, 144)]

<?xml version="1.0" encoding="UTF-8"?>

<dots> this is before the first dot and it continues on multiple lines <dot x="9" y="81" /> <dot x="11" y="121" /> <flip> flip is on <dot x="196" y="14" /> <dot x="169" y="13" /> </flip> flip is off <dot x="12" y="144" /> <extra>stuff</extra> <!-- a final comment --></dots>

Page 17: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

DOM Architecture

QuickTime™ and aTIFF (Uncompressed) decompressor

are needed to see this picture.

Page 18: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

DOM Document Structure

Document+---Element <dots> +---Text "this is before the first dot | and it continues on multiple lines" +---Element <dot> +---Text "" +---Element <dot> +---Text "" +---Element <flip> | +---Text "flip is on" | +---Element <dot> | +---Text "" | +---Element <dot> | +---Text "" +---Text "flip is off" +---Element <dot> +---Text "" +---Element <extra> | +---Text "stuff" +---Text "" +---Comment "a final comment" +---Text ""

XML Input:Document structure:

There’s a text node between every pair of element nodes, even if the text is empty.

XML comments appear in special comment nodes.

Element attributes do not appear in tree—available through Element object.

<?xml version="1.0" encoding="UTF-8"?>

<dots> this is before the first dot and it continues on multiple lines <dot x="9" y="81" /> <dot x="11" y="121" /> <flip> flip is on <dot x="196" y="14" /> <dot x="169" y="13" /> </flip> flip is off <dot x="12" y="144" /> <extra>stuff</extra> <!-- a final comment --></dots>

Page 19: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Using DOM

import javax.xml.parsers.*;import org.w3c.dom.*;

// get a DocumentBuilder objectDocumentBuilderFactory dbf =

DocumentBuilderFactory.newInstance();DocumentBuilder db = null;try { db = dbf.newDocumentBuilder();} catch (ParserConfigurationException e) { e.printStackTrace();}

// invoke parser to get a DocumentDocument doc = db.parse(inputStream);Document doc = db.parse(file);Document doc = db.parse(url);

Here’s the basic recipe for getting started with DOM:

Page 20: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

DOM Document access idioms

// get the root of the Document treeElement root = doc.getDocumentElement();

// get nodes in subtree by tag nameNodeList dots = root.getElementsByTagName("dot");

// get first dot elementElement firstDot = (Element) dots.item(0);

// get x attribute of first dotString x = firstDot.getAttribute("x");

OK, say we have a Document. How do we get at the pieces of it?

Here are some common idioms:

Page 21: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

More Document accessors

Node access methods:

String getNodeName()short getNodeType()Document getOwnerDocument()boolean hasChildNodes()NodeList getChildNodes()Node getFirstChild()Node getLastChild()Node getParentNode()Node getNextSibling()Node getPreviousSibling()boolean hasAttributes()... and more ...

Element extends Node and adds these access methods:

String getTagName()boolean hasAttribute(String name)String getAttribute(String name)NodeList getElementsByTagName(String name)… and more …

Document extends Node and adds these access methods:

Element getDocumentElement()DocumentType getDoctype()... plus the Element methods just mentioned ...... and more ...

e.g. DOCUMENT_NODE, ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, etc.

Page 22: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Creating & manipulating Documents

// get new empty Document from DocumentBuilderDocument doc = db.newDocument();

// create a new <dots> Element and add to Document as rootElement root = doc.createElement("dots");doc.appendChild(root);

// create a new <dot> Element and add as child of rootElement dot = doc.createElement("dot");dot.setAttribute("x", "9");dot.setAttribute("y", "81");root.appendChild(dot);

The DOM API also includes lots of methods for creating and manipulating Document objects:

Page 23: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

More Document manipulators

Node manipulation methods:

void setNodeValue(String nodeValue)Node appendChild(Node newChild)Node insertBefore(Node newChild, Node refChild)Node removeChild(Node oldChild)... and more ...

Element manipulation methods:

void setAttribute(String name, String value)void removeAttribute(String name)… and more …

Document manipulation methods:

Text createTextNode(String data)Comment createCommentNode(String data)... and more ...

Page 24: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Writing a Document as XML

Strangely, since JAXP 1.1, there is no simple, documented way to write out a Document object as XML.

Instead, you can exploit an undocumented trick: cast the Document to a Crimson XmlDocument, which knows how to write itself out:

There is a supported way to write Documents as XML via the XSLT library, but it is far more clumsy than this two-line trick.

Of course, one could just walk the Document tree and write XML using printlns.

JDOM remedies this with easy XML output!

import org.apache.crimson.tree.XmlDocument;XmlDocument x = (XmlDocument) doc;x.write(out, "UTF-8");

Page 25: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

DOMExample: Code

Please see DOMExample.java

Page 26: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

JDOM Overview

DOM can be awkward for Java programmers Language-neutral does not use Java features

Example: getChildNodes() returns a NodeList, which is not a List. (NodeList.iterator() is not defined.)

JDOM looks like a good alternative: open source project, Apache license, late beta builds on top of JAXP, integrates with SAX and DOM similar to DOM model, but no shared code API designed to be easy & obvious for Java programmers exploits power of Java language: collections, method overloading rumored to become integrated in future JDKs XML output is easy! Key packages: org.jdom, org.jdom.transform,

org.jdom.input, org.jdom.output.

Page 27: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

DOM vs. JDOM

The DOM way:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();DocumentBuilder builder = factory.newDocumentBuilder();Document doc = builder.newDocument();Element root = doc.createElement("root");Text text = doc.createText("This is the root");root.appendChild(text);doc.appendChild(root);

The JDOM way:

Document doc = new Document();Element e = new Element("root");e.setText("This is the root");doc.addContent(e);

schweet!

Page 28: The Joy of SAX (and DOM, and JDOM…) Bill MacCartney 11 October 2004

Pointers

Everything in this tutorial (slides, example code, example data) will be archived at http://nlp.stanford.edu/local/for your future reference.

There’s a good JAXP/SAX/DOM tutorial at:http://java.sun.com/xml/jaxp/dist/1.1/docs/tutorial/

You can learn more about JDOM athttp://www.jdom.org/docs/faq.html