Parsing an XML Document

To process an XML document, you need to parse it. A parser is a program that reads a file, confirms that the file has the correct format, breaks it up into the constituent elements, and lets a programmer access those elements. The Java library supplies two kinds of XML parsers:

Tree parsers such as the Document Object Model (DOM) parser that read an XML document into a tree structure.

Streaming parsers such as the Simple API for XML (SAX) parser that generate events as they read an XML document.

The DOM parser is easy to use for most purposes, and we explain it first. Consider a streaming parser if you process very long documents whose tree structures would use up a lot of memory, or if you are interested in only a few elements and don't care about their context. For more information, see the section "Streaming Parsers."

The DOM parser interface is standardized by the World Wide Web Consortium (W3C). The org.w3c.dom package contains the definitions of interface types such as Document and Element. Different suppliers, such as the Apache Organization and IBM, have written DOM parsers whose classes implement these interfaces. The Sun Java API for XML Processing (JAXP) library actually makes it possible to plug in any of these parsers. But Sun also includes its own DOM parser in the Java SDK. We use the Sun parser in this chapter.

To read an XML document, you need a DocumentBuilder object, which you get from a DocumentBuilderFactory, like this:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();

You can now read a document from a file:

File f = . . .
Document doc = builder.parse(f);

Alternatively, you can use a URL:

URL u = . . .
Document doc = builder.parse(u.toString()); // parse accepts the URL as a string

You can even specify an arbitrary input stream:

InputStream in = . . .
Document doc = builder.parse(in);

Note

If you use an input stream as an input source, then the parser will not be able to locate other files that are referenced relative to the location of the document, such as a DTD in the same directory. You can install an "entity resolver" to overcome that problem.
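The following is a minimal sketch of such an entity resolver, not a definitive recipe. It assumes a document that refers to a DTD named font.dtd; the class name EntityResolverDemo, the FONT_DTD string, and the resolverCalls counter are hypothetical and exist only for illustration. The resolver hands the parser the DTD text from memory, so the parser never has to look for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EntityResolverDemo
{
   // Hypothetical DTD text; a real program might load it from the class path.
   static final String FONT_DTD =
      "<!ELEMENT font (name, size)>\n"
      + "<!ELEMENT name (#PCDATA)>\n"
      + "<!ELEMENT size (#PCDATA)>\n";

   // Counts how often the resolver supplied the DTD (for illustration only).
   static int resolverCalls = 0;

   static Document parseWithResolver(String xml) throws Exception
   {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setEntityResolver((publicId, systemId) ->
         {
            if (systemId != null && systemId.endsWith("font.dtd"))
            {
               resolverCalls++;
               return new InputSource(new StringReader(FONT_DTD));
            }
            return null; // let the parser use its default lookup
         });
      // Parsing from a stream gives the parser no base location for font.dtd.
      InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
      return builder.parse(in);
   }

   public static void main(String[] args) throws Exception
   {
      String xml = "<?xml version=\"1.0\"?>\n"
         + "<!DOCTYPE font SYSTEM \"font.dtd\">\n"
         + "<font><name>Helvetica</name><size>36</size></font>";
      Document doc = parseWithResolver(xml);
      System.out.println(doc.getDocumentElement().getTagName());
   }
}
```

Returning null from the resolver tells the parser to fall back to its normal behavior, so the sketch only intercepts the one DTD it knows about.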

The Document object is an in-memory representation of the tree structure of the XML document. It is composed of objects whose classes implement the Node interface and its various subinterfaces. Figure 2-1 shows the inheritance hierarchy of the subinterfaces.

Figure 2-1. The Node interface and its subinterfaces

You start analyzing the contents of a document by calling the getDocumentElement method. It returns the root element.

Element root = doc.getDocumentElement();

For example, if you are processing a document

<?xml version="1.0"?>
<font>
   . . .
</font>

then calling getDocumentElement returns the font element.

The getTagName method returns the tag name of an element. In the preceding example, root.getTagName() returns the string "font".
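As a small, self-contained illustration of these first steps, the sketch below parses a document held in a string and reports the tag name of its root element. The class name RootElementDemo and the helper rootTagName are hypothetical; only the DocumentBuilderFactory, DocumentBuilder, Document, and Element calls come from the text above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class RootElementDemo
{
   // Parses the given XML text and returns the tag name of the root element.
   static String rootTagName(String xml) throws Exception
   {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.parse(
         new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      Element root = doc.getDocumentElement();
      return root.getTagName();
   }

   public static void main(String[] args) throws Exception
   {
      String xml = "<?xml version=\"1.0\"?><font><name>Helvetica</name></font>";
      System.out.println(rootTagName(xml)); // prints font
   }
}
```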

To get the element's children (which may be subelements, text, comments, or other nodes), use the getChildNodes method. That method returns a collection of type NodeList. That type was invented before the standard Java collections, and it has a different access protocol. The item method gets the item with a given index, and the getLength method gives the total count of the items. Therefore, you can enumerate all children like this:

NodeList children = root.getChildNodes();
for (int i = 0; i < children.getLength(); i++)
{
   Node child = children.item(i);
   . . .
}

Be careful when analyzing the children. Suppose, for example, that you are processing the document

<font>
   <name>Helvetica</name>
   <size>36</size>
</font>

You would expect the font element to have two children, but the parser reports five:

The whitespace between <font> and <name>

The name element

The whitespace between </name> and <size>

The size element

The whitespace between </size> and </font>

Figure 2-2 shows the DOM tree.

Figure 2-2. A simple DOM tree

If you expect only subelements, then you can ignore the whitespace:

for (int i = 0; i < children.getLength(); i++)
{
   Node child = children.item(i);
   if (child instanceof Element)
   {
      Element childElement = (Element) child;
      . . .
   }
}

Now you look at only two elements, with tag names name and size.
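To make the difference between five child nodes and two child elements concrete, here is a sketch that parses a font document with line breaks and indentation and then filters the children. The class ChildElementsDemo, the FONT_XML constant, and the helpers parseRoot and childElements are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class ChildElementsDemo
{
   // A font document with line breaks and indentation between the tags.
   static final String FONT_XML =
      "<font>\n   <name>Helvetica</name>\n   <size>36</size>\n</font>";

   static Element parseRoot(String xml) throws Exception
   {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
         .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
         .getDocumentElement();
   }

   // Returns only those children that are elements, skipping text and comments.
   static List<Element> childElements(Element parent)
   {
      List<Element> result = new ArrayList<>();
      NodeList children = parent.getChildNodes();
      for (int i = 0; i < children.getLength(); i++)
      {
         Node child = children.item(i);
         if (child instanceof Element) result.add((Element) child);
      }
      return result;
   }

   public static void main(String[] args) throws Exception
   {
      Element root = parseRoot(FONT_XML);
      System.out.println(root.getChildNodes().getLength()); // prints 5
      System.out.println(childElements(root).size()); // prints 2
   }
}
```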

As you will see in the next section, you can do even better if your document has a DTD. Then the parser knows which elements don't have text nodes as children, and it can suppress the whitespace for you.

When analyzing the name and size elements, you want to retrieve the text strings that they contain. Those text strings are themselves contained in child nodes of type Text. Because you know that these Text nodes are the only children, you can use the getFirstChild method without having to traverse another NodeList. Then use the getData method to retrieve the string stored in the Text node.

for (int i = 0; i < children.getLength(); i++)
{
   Node child = children.item(i);
   if (child instanceof Element)
   {
      Element childElement = (Element) child;
      Text textNode = (Text) childElement.getFirstChild();
      String text = textNode.getData().trim();
      if (childElement.getTagName().equals("name"))
         name = text;
      else if (childElement.getTagName().equals("size"))
         size = Integer.parseInt(text);
   }
}

Tip

It is a good idea to call trim on the return value of the getData method. If the author of an XML file puts the beginning and the ending tag on separate lines, such as

<size>
   36
</size>

then the parser includes all line breaks and spaces in the text node data. Calling the trim method removes the whitespace surrounding the actual data.
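The sketch below puts the text retrieval and the trim call together in one runnable piece. It assumes a font document whose size element spreads over several lines; the class FontTextDemo and the helper childText are hypothetical. Unlike the fragment above, the helper returns null when an element is missing or does not start with a Text node, which is one possible design choice rather than the only one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

class FontTextDemo
{
   // The size element puts its start and end tags on separate lines.
   static final String FONT_XML =
      "<font>\n   <name>Helvetica</name>\n   <size>\n      36\n   </size>\n</font>";

   static Element parseRoot(String xml) throws Exception
   {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
         .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
         .getDocumentElement();
   }

   // Returns the trimmed text of the first child element with the given tag
   // name, or null if there is no such element or it has no leading text.
   static String childText(Element parent, String tagName)
   {
      NodeList children = parent.getChildNodes();
      for (int i = 0; i < children.getLength(); i++)
      {
         Node child = children.item(i);
         if (child instanceof Element && ((Element) child).getTagName().equals(tagName))
         {
            Node first = child.getFirstChild();
            if (first instanceof Text) return ((Text) first).getData().trim();
            return null;
         }
      }
      return null;
   }

   public static void main(String[] args) throws Exception
   {
      Element root = parseRoot(FONT_XML);
      String name = childText(root, "name");
      int size = Integer.parseInt(childText(root, "size"));
      System.out.println(name + " " + size); // prints Helvetica 36
   }
}
```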

You can also get the last child with the getLastChild method, and the next sibling of a node with getNextSibling. Therefore, another way of traversing a set of child nodes is

for (Node childNode = element.getFirstChild();
   childNode != null;
   childNode = childNode.getNextSibling())
{
   . . .
}
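As a runnable variant of this loop, the following sketch collects the node names of all children by following the sibling links. The class SiblingTraversalDemo and the helper childNodeNames are hypothetical. Text nodes report the name #text, which again makes the whitespace nodes visible.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class SiblingTraversalDemo
{
   static final String FONT_XML =
      "<font>\n   <name>Helvetica</name>\n   <size>36</size>\n</font>";

   static Element parseRoot(String xml) throws Exception
   {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
         .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
         .getDocumentElement();
   }

   // Walks the children from first to last by following getNextSibling.
   static List<String> childNodeNames(Element element)
   {
      List<String> names = new ArrayList<>();
      for (Node childNode = element.getFirstChild();
         childNode != null;
         childNode = childNode.getNextSibling())
      {
         names.add(childNode.getNodeName());
      }
      return names;
   }

   public static void main(String[] args) throws Exception
   {
      // prints [#text, name, #text, size, #text]
      System.out.println(childNodeNames(parseRoot(FONT_XML)));
   }
}
```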

To enumerate the attributes of a node, call the getAttributes method. It returns a NamedNodeMap object that contains Node objects describing the attributes. You can traverse the nodes in a NamedNodeMap in the same way as a NodeList. Then call the getNodeName and getNodeValue methods to get the attribute names and values.

NamedNodeMap attributes = element.getAttributes();
for (int i = 0; i < attributes.getLength(); i++)
{
   Node attribute = attributes.item(i);
   String name = attribute.getNodeName();
   String value = attribute.getNodeValue();
   . . .
}

Alternatively, if you know the name of an attribute, you can retrieve the corresponding value directly:

String unit = element.getAttribute("unit");
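Both ways of reading attributes can be tried out with the sketch below. It assumes a size element with two made-up attributes, unit and style; the class AttributeDemo and the helper attributeMap are hypothetical. The attributes are copied into a sorted map because a NamedNodeMap does not promise any particular order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

class AttributeDemo
{
   static final String SIZE_XML = "<size unit=\"pt\" style=\"bold\">36</size>";

   static Element parseRoot(String xml) throws Exception
   {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
         .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
         .getDocumentElement();
   }

   // Copies all attributes of an element into a map sorted by attribute name.
   static Map<String, String> attributeMap(Element element)
   {
      Map<String, String> result = new TreeMap<>();
      NamedNodeMap attributes = element.getAttributes();
      for (int i = 0; i < attributes.getLength(); i++)
      {
         Node attribute = attributes.item(i);
         result.put(attribute.getNodeName(), attribute.getNodeValue());
      }
      return result;
   }

   public static void main(String[] args) throws Exception
   {
      Element size = parseRoot(SIZE_XML);
      System.out.println(attributeMap(size)); // prints {style=bold, unit=pt}
      System.out.println(size.getAttribute("unit")); // prints pt
      System.out.println(size.getAttribute("missing").isEmpty()); // prints true
   }
}
```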

You have now seen how to analyze a DOM tree. The program in Listing 2-1 puts these techniques to work. You can use the File -> Open menu option to read in an XML file. A DocumentBuilder object parses the XML file and produces a Document object. The program displays the Document object as a tree (see Figure 2-3).

Figure 2-3. A parse tree of an XML document

The tree display shows clearly how child elements are surrounded by text containing whitespace and comments. For greater clarity, the program displays newline and return characters as \n and \r. (Otherwise, they would show up as hollow boxes, the default symbol for a character that Swing cannot draw in a string.)

In Chapter 6, you will learn the techniques that this program uses to display the tree and the attribute tables. The DOMTreeModel class implements the TreeModel interface. The getRoot method returns the root element of the document. The getChild method gets the node list of children and returns the item with the requested index. The tree cell renderer displays the following:

For elements, the element tag name and a table of all attributes.

For character data, the interface (Text, Comment, or CDATASection), followed by the data, with newline and return characters replaced by \n and \r.

For all other node types, the class name followed by the result of toString.
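If you want to inspect a DOM tree without a Swing user interface, the same rendering rules can be approximated on the console. The following is only a sketch under that assumption, not the program from the listing: the class DomTreePrinter and its methods escape, describe, append, and dump are hypothetical, and attributes are left out for brevity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.CDATASection;
import org.w3c.dom.CharacterData;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

class DomTreePrinter
{
   // Makes return, newline, and tab characters visible as \r, \n, and \t.
   static String escape(String s)
   {
      return s.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t");
   }

   // Describes a single node, following the rules of the tree cell renderer.
   static String describe(Node node)
   {
      if (node instanceof Element) return "Element: " + ((Element) node).getTagName();
      if (node instanceof CharacterData)
      {
         String data = escape(((CharacterData) node).getData());
         // CDATASection extends Text, so it must be tested first.
         if (node instanceof CDATASection) return "CDATASection: " + data;
         if (node instanceof Text) return "Text: " + data;
         if (node instanceof Comment) return "Comment: " + data;
      }
      return node.getClass() + ": " + node;
   }

   // Appends the node and, recursively, its children with two-space indentation.
   static void append(Node node, int depth, StringBuilder out)
   {
      for (int i = 0; i < depth; i++) out.append("  ");
      out.append(describe(node)).append('\n');
      for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling())
         append(child, depth + 1, out);
   }

   static String dump(String xml) throws Exception
   {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
         .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      StringBuilder out = new StringBuilder();
      append(doc.getDocumentElement(), 0, out);
      return out.toString();
   }

   public static void main(String[] args) throws Exception
   {
      System.out.print(dump("<font>\n   <name>Helvetica</name>\n   <size>36</size>\n</font>"));
   }
}
```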

Listing 2-1. DOMTreeTest.java

import java.awt.*;
import java.awt.event.*;
import java.io.*;
import javax.swing.*;
import javax.swing.event.*;
import javax.swing.table.*;
import javax.swing.tree.*;
import javax.xml.parsers.*;
import org.w3c.dom.*;

/**
 * This program displays an XML document as a tree.
 * @version 1.11 2007-06-24
 * @author Cay Horstmann
 */
public class DOMTreeTest
{
   public static void main(String[] args)
   {
      EventQueue.invokeLater(new Runnable()
         {
            public void run()
            {
               JFrame frame = new DOMTreeFrame();
               frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
               frame.setVisible(true);
            }
         });
   }
}

/**
 * This frame contains a tree that displays the contents of an XML document.
 */
class DOMTreeFrame extends JFrame
{
   public DOMTreeFrame()
   {
      setTitle("DOMTreeTest");
      setSize(DEFAULT_WIDTH, DEFAULT_HEIGHT);

      JMenu fileMenu = new JMenu("File");
      JMenuItem openItem = new JMenuItem("Open");
      openItem.addActionListener(new ActionListener()
         {
            public void actionPerformed(ActionEvent event)
            {
               openFile();
            }
         });
      fileMenu.add(openItem);

      JMenuItem exitItem = new JMenuItem("Exit");
      exitItem.addActionListener(new ActionListener()
         {
            public void actionPerformed(ActionEvent event)
            {
               System.exit(0);
            }
         });
      fileMenu.add(exitItem);

      JMenuBar menuBar = new JMenuBar();
      menuBar.add(fileMenu);
      setJMenuBar(menuBar);
   }

   /**
    * Open a file and load the document.
    */
   public void openFile()
   {
      JFileChooser chooser = new JFileChooser();
      chooser.setCurrentDirectory(new File("."));

      chooser.setFileFilter(new javax.swing.filechooser.FileFilter()
         {
            public boolean accept(File f)
            {
               return f.isDirectory() || f.getName().toLowerCase().endsWith(".xml");
            }

            public String getDescription()
            {
               return "XML files";
            }
         });
      int r = chooser.showOpenDialog(this);
      if (r != JFileChooser.APPROVE_OPTION) return;
      final File file = chooser.getSelectedFile();

      new SwingWorker<Document, Void>()
         {
            protected Document doInBackground() throws Exception
            {
               if (builder == null)
               {
                  DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                  builder = factory.newDocumentBuilder();
               }
               return builder.parse(file);
            }

            protected void done()
            {
               try
               {
                  Document doc = get();
                  JTree tree = new JTree(new DOMTreeModel(doc));
                  tree.setCellRenderer(new DOMTreeCellRenderer());

                  setContentPane(new JScrollPane(tree));
                  validate();
               }
               catch (Exception e)
               {
                  JOptionPane.showMessageDialog(DOMTreeFrame.this, e);
               }
            }
         }.execute();
   }

   private DocumentBuilder builder;
   private static final int DEFAULT_WIDTH = 400;
   private static final int DEFAULT_HEIGHT = 400;
}

/**
 * This tree model describes the tree structure of an XML document.
 */
class DOMTreeModel implements TreeModel
{
   /**
    * Constructs a document tree model.
    * @param doc the document
    */
   public DOMTreeModel(Document doc)
   {
      this.doc = doc;
   }

   public Object getRoot()
   {
      return doc.getDocumentElement();
   }

   public int getChildCount(Object parent)
   {
      Node node = (Node) parent;
      NodeList list = node.getChildNodes();
      return list.getLength();
   }

   public Object getChild(Object parent, int index)
   {
      Node node = (Node) parent;
      NodeList list = node.getChildNodes();
      return list.item(index);
   }

   public int getIndexOfChild(Object parent, Object child)
   {
      Node node = (Node) parent;
      NodeList list = node.getChildNodes();
      for (int i = 0; i < list.getLength(); i++)
         if (getChild(node, i) == child) return i;
      return -1;
   }

   public boolean isLeaf(Object node)
   {
      return getChildCount(node) == 0;
   }

   public void valueForPathChanged(TreePath path, Object newValue)
   {
   }

   public void addTreeModelListener(TreeModelListener l)
   {
   }

   public void removeTreeModelListener(TreeModelListener l)
   {
   }

   private Document doc;
}

/**
 * This class renders an XML node.
 */
class DOMTreeCellRenderer extends DefaultTreeCellRenderer
{
   public Component getTreeCellRendererComponent(JTree tree, Object value, boolean selected,
         boolean expanded, boolean leaf, int row, boolean hasFocus)
   {
      Node node = (Node) value;
      if (node instanceof Element) return elementPanel((Element) node);

      super.getTreeCellRendererComponent(tree, value, selected, expanded, leaf, row, hasFocus);
      if (node instanceof CharacterData) setText(characterString((CharacterData) node));
      else setText(node.getClass() + ": " + node.toString());
      return this;
   }

   public static JPanel elementPanel(Element e)
   {
      JPanel panel = new JPanel();
      panel.add(new JLabel("Element: " + e.getTagName()));
      final NamedNodeMap map = e.getAttributes();
      panel.add(new JTable(new AbstractTableModel()
         {
            public int getRowCount()
            {
               return map.getLength();
            }

            public int getColumnCount()
            {
               return 2;
            }

            public Object getValueAt(int r, int c)
            {
               return c == 0 ? map.item(r).getNodeName() : map.item(r).getNodeValue();
            }
         }));
      return panel;
   }

   public static String characterString(CharacterData node)
   {
      StringBuilder builder = new StringBuilder(node.getData());
      for (int i = 0; i < builder.length(); i++)
      {
         if (builder.charAt(i) == '\r')
         {
            builder.replace(i, i + 1, "\\r");
            i++;
         }
         else if (builder.charAt(i) == '\n')
         {
            builder.replace(i, i + 1, "\\n");
            i++;
         }
         else if (builder.charAt(i) == '\t')
         {
            builder.replace(i, i + 1, "\\t");
            i++;
         }
      }
      if (node instanceof CDATASection) builder.insert(0, "CDATASection: ");
      else if (node instanceof Text) builder.insert(0, "Text: ");
      else if (node instanceof Comment) builder.insert(0, "Comment: ");

      return builder.toString();
   }
}

javax.xml.parsers.DocumentBuilderFactory 1.4

static DocumentBuilderFactory newInstance()

returns an instance of the DocumentBuilderFactory class.

DocumentBuilder newDocumentBuilder()

returns an instance of the DocumentBuilder class.

javax.xml.parsers.DocumentBuilder 1.4

Document parse(File f)
Document parse(String url)
Document parse(InputStream in)

parses an XML document from the given file, URL, or input stream and returns the parsed document.

org.w3c.dom.Document 1.4

Element getDocumentElement()

returns the root element of the document.

org.w3c.dom.Element 1.4

String getTagName()

returns the name of the element.

String getAttribute(String name)

returns the value of the attribute with the given name, or the empty string if there is no such attribute.

org.w3c.dom.Node 1.4

NodeList getChildNodes()

returns a node list that contains all children of this node.

Node getFirstChild()
Node getLastChild()

gets the first or last child node of this node, or null if this node has no children.

Node getNextSibling()
Node getPreviousSibling()

gets the next or previous sibling of this node, or null if there is no such sibling.

Node getParentNode()

gets the parent of this node, or null if this node is the document node.

NamedNodeMap getAttributes()

returns a node map that contains Attr nodes that describe all attributes of this node.

String getNodeName()

returns the name of this node. If the node is an Attr node, then the name is the attribute name.

String getNodeValue()

returns the value of this node. If the node is an Attr node, then the value is the attribute value.

org.w3c.dom.CharacterData 1.4

String getData()

returns the text stored in this node.

org.w3c.dom.NodeList 1.4

int getLength()

returns the number of nodes in this list.

Node item(int index)

returns the node with the given index. The index is between 0 and getLength() - 1.

org.w3c.dom.NamedNodeMap 1.4

int getLength()

returns the number of nodes in this map.

Node item(int index)

returns the node with the given index. The index is between 0 and getLength() - 1.
