Working with XML and Java
In this article, I will demonstrate some basic techniques for reading XML file data using straight Java code and no third-party libraries. Although I am in fact a fan of third-party libraries (examples include JDOM and DOM4J, among others), sometimes they are a tool too powerful for a simple job. You may simply want to minimize the overall footprint of the application you deliver. Your own licensing terms might make bundling such libraries a sensitive matter. Or you might just want to further your own knowledge and develop a closer understanding of what these tools are helping you to do in the first place. Regardless, the standard Java API provides plenty of the tools needed to get the work done yourself.
This is a rather introductory article; however, the reader is still assumed to know how to write and run Java programs (using whichever editor or IDE suits their fancy) and to be able to reference the standard API Javadocs for further information. (If you do not yet have a Java API bookmark, the online "JavaSE 6" docs are a good one to keep, although you can also download the API docs for a faster local version.) It also will be helpful (but not required) for the reader to have a basic understanding of XML files and XPath expressions. I will not go into any great detail on these topics, so let your favorite search engine be your guide to more information.
Basics of XML with Java
Classes, Interfaces, and the W3C
One thing worth noting is that the World Wide Web Consortium (W3C) established a general pattern and naming convention for working with XML data structures, and many programming languages, Java included, accommodate these patterns. This goes a long way toward making the development conventions portable across languages, although it can be argued that these conventions can feel a bit out of place in a particular language. (The extra bit of convenience and syntactic sugar that many third-party XML libraries provide over these standard conventions is perhaps one reason they flourish.) Thus it is that the interface types in the standard Java API are present in the org.w3c.* package space.
The main, top-level interface in the org.w3c.dom package is the Node interface. Speaking rather loosely, anything in an XML document is a node of some kind: a comment, an element, an attribute in that element, or even the raw text sitting between an element's start and end tags. All of these different kinds of information have their own interface types in org.w3c.dom, but they all share Node as their parent interface. Given that this is the case, it is worth taking a quick glance at the methods declared here. Some of the notable ones include:
|Methods of org.w3c.dom.Node||Description/Purpose|
|getChildNodes()||Returns a NodeList of all child Nodes. Be careful! This returns all types of nodes, not just "Element" nodes.|
|getFirstChild() getLastChild()||Returns the first/last Node of the available child Nodes.|
|getNodeType()||Returns which variety of Node (Comment, Element, Text, and so forth) the node is. The exact "types" of nodes are available as constants in the Node interface itself, such as Node.COMMENT_NODE.|
|getNodeName()||Returns, as a String, what you might think of as the name or specifier of a Node. This is most usually useful for Element and Attr (that is, attribute) Node types.|
|getNodeValue()||Returns, as a String, what you might think of as the value for a Node. This really only applies to certain Node types. This is most usually useful for the Attr (that is, attribute) Node type.|
|getAttributes()||Returns a NamedNodeMap for the attributes in the start tag of the Node (which really should be an Element to be meaningful).|
|appendChild(Node newChild) insertBefore(Node newChild, Node refChild) removeChild(Node oldChild)||Adds, inserts, or removes child Nodes to/from a Node.|
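To make the warning about getChildNodes() concrete, here is a minimal sketch that parses a small XML string and counts the children of the root element by node type. The sample XML, the class name NodeMethodsDemo, and its helper methods are invented for this illustration; getDocumentElement() is a Document method that returns the root element directly.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeMethodsDemo {

    // Hypothetical sample data, made up for this example.
    static final String SAMPLE_XML =
        "<library>\n"
      + "  <!-- a comment -->\n"
      + "  <book id=\"b1\">Dune</book>\n"
      + "  <book id=\"b2\">Emma</book>\n"
      + "</library>";

    // Parses an XML string into a Document, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Counts the direct children of 'parent' that have the given node type.
    static int countChildrenOfType(Node parent, short nodeType) {
        NodeList children = parent.getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == nodeType) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Node root = parse(SAMPLE_XML).getDocumentElement();
        // The whitespace between tags shows up as Text nodes.
        System.out.println("all children: " + root.getChildNodes().getLength());
        System.out.println("elements:     " + countChildrenOfType(root, Node.ELEMENT_NODE));
        System.out.println("comments:     " + countChildrenOfType(root, Node.COMMENT_NODE));
        System.out.println("text nodes:   " + countChildrenOfType(root, Node.TEXT_NODE));
    }
}
```

With this sample, only two of the seven children are elements; the rest are the comment and the whitespace-only text between the tags, which is why code that walks getChildNodes() usually checks getNodeType() first.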
Of course, the methods not mentioned also have their utility. It is also worth looking at the NodeList and NamedNodeMap interfaces. The NodeList is somewhat analogous to a standard List-based collection (of the java.util.* variety) of Nodes; remember that the naming conventions and methods have been more or less standardized by the W3C. The notable methods of NodeList are:
|Methods of org.w3c.dom.NodeList||Description/Purpose|
|getLength()||Returns number of Nodes in the collection. (Similar to size() in a standard Collection type.)|
|item(int index)||Returns the Node at the specified index in the collection. The index is counted from zero. (Similar to get(int index) in a standard Collection type.)|
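Because NodeList is not a java.util collection, it cannot be used in a for-each loop directly. One possible workaround, sketched below, is to copy it into a java.util.List using only getLength() and item(int). The class name NodeListDemo, its helpers, and the sample XML are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NodeListDemo {

    // Copies a NodeList into a java.util.List using getLength() and item(int).
    static List<Node> toList(NodeList nodes) {
        List<Node> result = new ArrayList<>(nodes.getLength());
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i));
        }
        return result;
    }

    // Returns the children of the root of a made-up sample document.
    // The XML has no whitespace between tags, so every child is an element.
    static NodeList sampleChildren() {
        String xml = "<colors><color>red</color><color>green</color>"
                   + "<color>blue</color></colors>";
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getChildNodes();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (Node color : toList(sampleChildren())) {
            // The text between the tags is itself a child Text node.
            System.out.println(color.getNodeName() + " = "
                + color.getFirstChild().getNodeValue());
        }
    }
}
```

Note that the copy is a snapshot: later changes to the document are not reflected in the List.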
Likewise, the NamedNodeMap is somewhat akin to a standard Map-based collection. It has the same pair of methods as in the NodeList, and includes a few others that are more appropriate for working with a map.
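As a rough sketch of the map-like side of NamedNodeMap, the following reads an element's attributes both by index (getLength() and item(int)) and by name (getNamedItem(String)). The class name NamedNodeMapDemo, its helpers, and the sample XML are assumptions made for this example. The attributes are copied into a TreeMap because I would not rely on a NamedNodeMap returning them in any particular order.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamedNodeMapDemo {

    // Parses an XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Copies a node's attributes into a name-sorted Map of name -> value.
    static Map<String, String> attributesOf(Node node) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attrs = node.getAttributes(); // null unless node is an Element
        if (attrs == null) {
            return result;
        }
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            result.put(attr.getNodeName(), attr.getNodeValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        Element book = parseRoot("<book id=\"b1\" lang=\"en\">Dune</book>");

        System.out.println(attributesOf(book)); // prints {id=b1, lang=en}

        // Lookup by name; getNamedItem returns null if no such attribute exists.
        Node lang = book.getAttributes().getNamedItem("lang");
        System.out.println(lang.getNodeValue()); // prints en
    }
}
```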
Also note there is an interface type called Document. This is the top-level container object of all the different nodes that are represented in an XML file. When you load such a file (and I will get to that shortly), you will start your work with a Document type, and you then drill down through the various accessor methods highlighted above (getFirstChild(), for example) to get to the information you need.
At this point, some of you may be asking: "So where are the concrete objects that flesh out these interface types?" Well, the W3C just designed how the objects need to look and work. Implementation details are left to language and tool makers. In straight Java, the factory classes that hand you those implementations are present in the javax.xml.* package structure (javax.xml.parsers, for example). With that being said, what might seem odd at first is that you never, on your own, instantiate a concrete Element object. That is, there is nothing that provides Element e = new Element();. Instead, you use a static factory method to obtain a builder that can, for example, create a new Document, and then use method calls on the resulting object to create additional nodes of various types. (And then note that you need to call appendChild or insertBefore to tie a child node to its parent node.)
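The factory-based approach described above can be sketched as follows. This is only one way to do it: the class name BuildDocumentDemo, its helpers, and the element names and values are invented for illustration, while the factory and node-creation calls are from the standard javax.xml.parsers and org.w3c.dom APIs.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class BuildDocumentDemo {

    // Obtains an empty Document through the static factory method.
    static Document newEmptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser found", e);
        }
    }

    // Builds <library><book id="b1">Dune</book><book id="b2">Emma</book></library>.
    static Document buildSample() {
        Document doc = newEmptyDocument();

        // Nodes are created by the Document, never with 'new'.
        Element library = doc.createElement("library");
        doc.appendChild(library); // a created node is detached until attached

        Element second = doc.createElement("book");
        second.setAttribute("id", "b2");
        second.appendChild(doc.createTextNode("Emma"));
        library.appendChild(second);

        Element first = doc.createElement("book");
        first.setAttribute("id", "b1");
        first.appendChild(doc.createTextNode("Dune"));
        library.insertBefore(first, second); // place it ahead of the existing child

        return doc;
    }

    public static void main(String[] args) {
        Element root = buildSample().getDocumentElement();
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            Element book = (Element) child; // safe here: only elements were added
            System.out.println(book.getAttribute("id") + ": "
                + book.getFirstChild().getNodeValue());
        }
        // prints:
        // b1: Dune
        // b2: Emma
    }
}
```

The cast to Element in main is safe only because this document was built by hand with no whitespace text nodes; for a parsed file you would check getNodeType() before casting.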