Introduction
In this article, I will demonstrate some basic techniques for reading XML file data using straight Java code and no third-party libraries. Although I am in fact a fan of third-party libraries (examples include JDOM and DOM4J, among others), sometimes they are a tool too powerful for a simple job. You simply may want to minimize the overall footprint of the application you deliver. Your own licensing might get a little sensitive about including such libraries. You just might want to further your own knowledge and develop a closer understanding of what these tools are helping you to do in the first place. Regardless, the standard Java API provides plenty of tools needed to get the work done yourself.
I will present a rather introductory article; however, the reader is still assumed to know how to write and run Java programs (using whichever editor or IDE suits their fancy) and be able to reference the standard API Javadocs for further information. (If you do not yet have a Java API bookmark, use this one to get the online “JavaSE 6” docs, although you can also download the API docs for a faster local version.) It also will be helpful (but not required) for the reader to have a basic understanding of XML files and XPath expressions. I will not go into any great detail on these topics, so let your favorite search engine be your guide to more information.
Basics of XML with Java
Classes, Interfaces, and the W3C
One thing worth noting is that the World Wide Web Consortium (W3C) established a general pattern and naming convention for working with XML data structures, and many programming languages, Java included, accommodate these patterns. This goes a long way to making the development conventions portable across languages, although it can be argued that these conventions can feel a bit out of place in a particular language. (The extra bit of convenience and syntactic sugar that many third-party XML libraries provide over these standard conventions is perhaps one reason they flourish.) Thus it is that the interface types in the standard Java API are present in the org.w3c.* package space.
The main, top-level interface in the org.w3c.dom package is the Node interface. Speaking rather loosely, anything in an XML document is a node of some kind: a comment, an element, an attribute in that element, or even the raw text sitting between an element’s start and end tags. All of these different kinds of information have their own interface types in org.w3c.dom, but they all share Node as their parent interface. Given this is the case, it is worth taking a quick glance at the methods declared here. Some of the notable ones include:
Methods of org.w3c.dom.Node | Description/Purpose |
---|---|
getChildNodes() | Returns a NodeList of all child Nodes. Be careful! Returns all types of nodes, not just “Element” nodes. |
getFirstChild(), getLastChild() | Returns the first/last Node of the available child Nodes. |
getNodeType() | Returns which variety of Node (Comment, Element, Text, and so forth) the node is. The exact “types” of nodes are available as constants in the Node interface itself, such as Node.COMMENT_NODE. |
getNodeName() | Returns, as a String, what you might think of as the name or specifier of a Node. This is most usually useful for Element and Attr (that is, attribute) Node types. |
getNodeValue() | Returns, as a String, what you might think of as the value for a Node. This really only applies to certain Node types. This is most usually useful for the Attr (that is, attribute) Node type. |
getAttributes() | Returns a NamedNodeMap for the attributes in the start tag of the Node (which really should be an Element to be meaningful). |
appendChild(Node newChild), insertBefore(Node newChild, Node refChild), removeChild(Node oldChild) | Adds, inserts, or removes child Nodes to/from a Node. |
Of course, the methods not mentioned also have their utility. It is also worth looking at the NodeList and NamedNodeMap interfaces. The NodeList is somewhat analogous to a standard List-based collection (of the java.util.* variety) of Nodes; remember the naming conventions and methods have been more or less standardized by the W3C. The notable methods of NodeList are:
Methods of org.w3c.dom.NodeList | Description/Purpose |
---|---|
getLength() | Returns number of Nodes in the collection. (Similar to size() in a standard Collection type.) |
item(int index) | Returns the Node at the specified index in the collection. The index is counted from zero. (Similar to get(int index) in a standard Collection type.) |
Likewise, the NamedNodeMap is somewhat akin to a standard Map-based collection. It has the same pair of methods as the NodeList, and includes a few others that are more appropriate for working with a map, such as getNamedItem(String name).
Also note there is an interface type called Document. This is the top-level container object of all the different nodes that are represented in an XML file. When you load such a file (and I will get to that shortly), you will start your work with a Document type, and you then drill down through the various accessor methods highlighted above (getFirstChild(), for example) to get to the information you need.
At this point, some of you may be asking: “So where are the concrete objects that flesh out these interface types?” Well, the W3C just designed how the objects need to look and work. Implementation details are left to language and tool makers. In straight Java, the entry points to such implementations are present in the javax.xml.* package structure. With that being said, what might seem odd at first is that you never, on your own, instantiate a concrete Element object. That is, there is nothing that provides Element e = new Element();. Instead, you start from a static factory method to, for example, create a new Document, and then use method calls on the resulting object to create additional nodes of various types. (And then note you need to call appendChild or insertBefore to tie a child node to its parent node.)
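The factory-then-create pattern described above can be sketched as follows. This is only a minimal illustration, not a recommended design: the class name BuildDocumentSketch and the element and attribute values are assumptions borrowed from the demo file used later in the article.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name, used only for this illustration.
class BuildDocumentSketch {

    // Builds <demo><child name="child_one">Hello</child></demo> in memory.
    static Document build() throws Exception {
        // The Document comes from a factory-made builder, never from "new".
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        // Every other node is created by asking the Document for it.
        Element root = doc.createElement("demo");
        Element child = doc.createElement("child");
        child.setAttribute("name", "child_one");
        child.appendChild(doc.createTextNode("Hello"));

        // Created nodes stay detached until they are tied to a parent.
        root.appendChild(child);
        doc.appendChild(root);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Element root = build().getDocumentElement();
        Element child = (Element) root.getFirstChild();
        System.out.println(root.getNodeName() + " -> " + child.getNodeName()
                + " name=" + child.getAttribute("name")
                + " text=" + child.getTextContent());
    }
}
```

Because this tree is built by hand, it contains no whitespace text nodes, so the cast of getFirstChild() to Element is safe here. That would not hold for a pretty-printed file loaded from disk.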
DOM versus SAX
The preceding section talked about the classes and interfaces that contain the data that represents an XML document once it has been read from a file. However, the very process of reading in a file drags in a whole other discussion point: DOM versus SAX. In the Document Object Model (DOM) mode, the entire file is loaded and stored completely in memory all at once: The root element node, all descendants, text nodes, attributes, comments, processing instructions are all present in one big tree underneath the top-level Document object. You are given the entire tree when the parsing is finished, assuming there were no exceptions thrown while parsing the file; and you can inspect this big tree of nodes in any way you want at any time you want, adding or removing nodes, changing values, or whatever suits your fancy. This approach is the easiest one for you to deal with and is what is used for the examples in this article.
The alternative, the Simple API for XML (SAX), works quite a bit differently. Instead of loading up a complete node tree and giving you the result when finished, the SAX model functions as an event-callback model: the SAX parser reads just enough of the file at a time to know whether it has just read the start tag of an element, read characters in a text node, read a comment node, and so forth. For each such event, it calls a method to relay the information as it happens. It may happen that you have registered no callback method for a particular SAX event (more specifically, you have not overridden a “do nothing” callback method in some convenience class) and in this case, you have no knowledge that some particular node was encountered by the parser. The reasons for using a SAX parser are explained more completely in other sources. However, one of the big ones is that some XML documents are large. Really, really large. Larger than you want to hold in memory all at once. In cases such as this, a SAX-based approach lets you listen in on the parser and only respond to the little pieces of the file you are interested in. (In actual practice, writing code for a SAX parser can turn pretty ugly, because now you are responsible for keeping track of state, such as how many levels beneath a node you currently are, so you know when to change your own processing rules.) Due to the number of details involved, I will not discuss a SAX-based approach further.
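Purely for contrast, here is a minimal sketch of the callback style just described. It overrides a single “do nothing” method of the org.xml.sax.helpers.DefaultHandler convenience class to count start tags without ever building a tree. The class name SaxCountSketch and the inline copy of the demo document are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name, used only for this illustration.
class SaxCountSketch {

    static final String DEMO_XML =
          "<?xml version=\"1.0\"?>\n"
        + "<demo>\n"
        + "\t<child name=\"child_one\">Hello</child>\n"
        + "\t<child name=\"child_two\">World</child>\n"
        + "</demo>\n";

    // Counts the start tags with the given name as the parser streams past them.
    static int countElements(String xml, final String elementName) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Only this event is handled; all others fall through to
                // DefaultHandler's empty implementations and are ignored.
                if (qName.equals(elementName)) {
                    count[0]++;
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("child elements seen: " + countElements(DEMO_XML, "child"));
    }
}
```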
By Example
The XML Data File
First, create a rather arbitrary XML document. I will be working with a file named “demo.xml” that looks like this:
    <?xml version="1.0"?>
    <demo>
        <child name="child_one">Hello</child>
        <child name="child_two">World</child>
    </demo>
Create a similar five-line file in a text editor and save it. You will next start working on code to load the file into a Document and then explore what you do to navigate around and read information from it.
The Java Code
To keep things simple at this point, you can start by importing these packages:
    import java.io.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

Note: Full-package imports are a bad, lazy habit. If your editor or IDE will automatically insert import statements for each class you reference, let it!
Next, instantiate a standard File object for your XML file. Something like this:
    File file = new File("demo.xml");
Although there are variations in the factory methods coming up next that can try to work directly with a filename (that is, a String object), thereby skipping the need for instantiating a File object, doing things this way lets you make a quick sanity check via file.exists() and possibly file.canRead(). This lets you print a polite warning message and exit before getting further weighed down in the code. (In such cases of errors, do be so polite as to exit with a status other than zero, because another program can determine your program had a problem if the exit code is non-zero.) Consider:
    String FILE_NAME = "demo.xml";
    File file = new File(FILE_NAME);
    // Bail out early, with a non-zero exit status, if the file is unusable.
    if ( ! (file.exists() && file.canRead())) {
        System.err.println("Error: cannot read "+FILE_NAME+". Exiting now.");
        System.exit(1);
    }

To actually get the XML file loaded into a Document object, follow this recipe:

    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = dbFactory.newDocumentBuilder();
    Document doc = builder.parse(file);
Important: These three lines may throw a few different types of exceptions while doing their work. The code for this demo will be given below and will include a fairly lazy try/catch around everything. Again, this is a bad habit, but this keeps the article/example simple.
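For reference, here is a hedged sketch of what a less lazy version of that recipe might look like, catching the three checked exception types these calls declare. The class name ParseErrorsSketch, the tryParse helper, and the message strings are assumptions for illustration. The input is parsed from a String so the sketch is self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this illustration.
class ParseErrorsSketch {

    // Parses the given XML text and reports which step, if any, failed.
    static String tryParse(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return "parsed: " + doc.getDocumentElement().getNodeName();
        } catch (ParserConfigurationException e) {
            // Thrown by newDocumentBuilder() if the factory settings cannot be met.
            return "parser setup failed: " + e.getMessage();
        } catch (SAXException e) {
            // Thrown by parse() when the input is not well-formed XML.
            return "not well-formed: " + e.getMessage();
        } catch (IOException e) {
            // Thrown by parse() when the input cannot be read.
            return "could not read input: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("<demo/>"));
        System.out.println(tryParse("<demo>"));   // unclosed tag
    }
}
```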
With a fully-loaded Document object at your disposal, you now are ready to start using some of the accessor methods, mentioned earlier, to walk through the XML file structure and look at information. The first thing you probably want is to call getFirstChild() on the Document to get the root node. In your case, this is the “demo” element. (This works here because nothing precedes the root element in demo.xml; if a file has a comment or DOCTYPE ahead of the root, getFirstChild() returns that node instead, and getDocumentElement() is the safer call.) To illustrate:

    Node rootNode = doc.getFirstChild();
    System.out.println("Root node name is "+rootNode.getNodeName());
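One caveat is worth a small sketch: getFirstChild() returns whatever node happens to come first under the Document, which is the root element only when nothing (such as a comment) precedes it. The Document interface also offers getDocumentElement(), which always returns the root element. The class name RootElementSketch and the sample document with a leading comment are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class RootElementSketch {

    // A variant of the demo file with a comment ahead of the root element.
    static final String XML_WITH_COMMENT =
          "<?xml version=\"1.0\"?>\n"
        + "<!-- generated file -->\n"
        + "<demo/>\n";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    static String firstChildName(String xml) throws Exception {
        return parse(xml).getFirstChild().getNodeName();
    }

    static String documentElementName(String xml) throws Exception {
        return parse(xml).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstChildName(XML_WITH_COMMENT));       // prints "#comment"
        System.out.println(documentElementName(XML_WITH_COMMENT));  // prints "demo"
    }
}
```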
Now, try working with a NodeList for all Nodes under your rootNode element:
    NodeList nodes = rootNode.getChildNodes();
    System.out.println("There are "+nodes.getLength()+" total nodes.");

If you try that with the demo.xml file given above, you should get a message saying there are five nodes. Would you expect that? (Why not two nodes?) Be careful to note that getChildNodes() returns all nodes of all types, including the whitespace-only text nodes created by your formatting, such as the newline and tab characters. (If you carefully remove all the formatting, you will see the NodeList only has two nodes.) This means that, if you want to loop through all the children of a given node, but only work with actual Element nodes, you need to check the node type and skip anything not of interest. To count only Element-type Nodes in a given NodeList, for example, do the following:

    int count=0;
    for (int i=0 ; i<nodes.getLength() ; i++) {
        Node node = nodes.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            count++;
        }
    }
    System.out.println("There are "+count+" Element nodes.");
Now, you should have the number you might have naïvely expected earlier: two. As an alternative to checking getNodeType() and comparing it to the named constant in the Node interface, you also could check if (node instanceof Element). You will get the same effect in this case.
The last item that bears discussion is working with attributes of an element. Consider:
    Element child1 = (Element) nodes.item(1);
    NamedNodeMap nodeMap = child1.getAttributes();
    for (int i=0 ; i<nodeMap.getLength() ; i++) {
        Node node = nodeMap.item(i);
        String name = node.getNodeName();
        String value = node.getNodeValue();
        System.out.println("Attribute ("+name+") has value ("+value+")");
    }

In this particular case, because I “know” that nodes.item(1) is an Element, I can cast it to an Element type. (This is not a requirement for the code to function, and when you don’t “know” ahead of time, consider checking with getNodeType() or using the instanceof operator.) The API docs state that getAttributes() returns null when called on anything other than an Element, and that the map it returns contains [only] Attr nodes. This being the case, you can skip some node type checking in this area and get straight to the business of asking each attribute Node for its name and its value.
When your assumptions about node types go wrong—as might be the case for adding an extra blank line in just the right (wrong?) spot of the file above—you’ll get ClassCastExceptions (or something similar) thrown at you. Code other than what is present in a demonstrational article should take all necessary precautions to insulate itself from the developer’s assumptions and the user’s actions. (The time you think you’ll save now will be paid back, with interest, in debugging later.)
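As a hedged sketch of one such precaution, the helper below filters a node's children with instanceof before casting, so a stray non-Element node cannot trigger a ClassCastException. The class name ChildElementsSketch, its method names, and the sample document with an extra comment ahead of the first child are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class ChildElementsSketch {

    // The demo file with a comment added, so nodes.item(1) is no longer an Element.
    static final String XML_WITH_EXTRA_NODE =
          "<demo>\n"
        + "\t<!-- a stray comment -->\n"
        + "\t<child name=\"child_one\">Hello</child>\n"
        + "\t<child name=\"child_two\">World</child>\n"
        + "</demo>\n";

    // Returns only the Element children, skipping text, comments, and the rest.
    static List<Element> childElements(Node parent) {
        List<Element> result = new ArrayList<Element>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            if (node instanceof Element) {
                result.add((Element) node);   // safe: type was checked first
            }
        }
        return result;
    }

    // Collects the "name" attribute of each child element of the root.
    static List<String> childNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> names = new ArrayList<String>();
        for (Element child : childElements(doc.getDocumentElement())) {
            names.add(child.getAttribute("name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames(XML_WITH_EXTRA_NODE));
    }
}
```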
Here’s the completed demo code, including the various lazy “don’t do this” shortcuts I’ve already warned you about:
    import java.io.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class Demo {
        public static void main(String[] args) {
            String FILE_NAME = "demo.xml";
            File file = new File(FILE_NAME);
            if ( ! (file.exists() && file.canRead())) {
                System.err.println("Error: cannot read "+FILE_NAME+". Exiting now.");
                System.exit(1);
            }
            try {
                // Load the file into a Document.
                DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
                DocumentBuilder builder = dbFactory.newDocumentBuilder();
                Document doc = builder.parse(file);

                Node rootNode = doc.getFirstChild();
                System.out.println("Root node name: " + rootNode.getNodeName());

                NodeList nodes = rootNode.getChildNodes();
                System.out.println("There are "+ nodes.getLength() + " total nodes.");

                // Count only the Element children.
                int count=0;
                for (int i=0 ; i<nodes.getLength() ; i++) {
                    Node node = nodes.item(i);
                    if (node instanceof Element) {
                        count++;
                    }
                }
                System.out.println("There are "+count+" Element nodes.");

                // Lazy: assumes item(1) is the first child Element.
                Element child1 = (Element) nodes.item(1);
                NamedNodeMap nodeMap = child1.getAttributes();
                for (int i=0 ; i<nodeMap.getLength() ; i++) {
                    Node node = nodeMap.item(i);
                    String name = node.getNodeName();
                    String value = node.getNodeValue();
                    System.out.println("Attribute ("+name+") has value ("+value+")");
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.exit(1);
            }
        }
    }
Enabling XML parser validation
By default, the parser (that is, the DocumentBuilder instance) returned by the factory will be non-validating. (Well-formedness errors will show up as a parse exception either way.) This means any careless (or malicious) user could insert any random and unexpected elements into the XML document and cause you all manner of grief. Of course, this is what DTDs and Schemas are for: They define the legal structure of a given XML document. Turning on validation mode is simply a matter of calling setValidating(true) on the DocumentBuilderFactory instance before having it return you a new DocumentBuilder. (And obviously, the input XML document must declare a DTD to validate against; setValidating(true) essentially controls DTD validation, whereas schema validation is configured separately, for example through the factory’s setSchema method.) Now, validation errors will be reported (as SAXParseException objects handed to the parser’s ErrorHandler) when the DocumentBuilder attempts to parse the file into a Document.
Note: You can refine the exact behavior of these exceptions by creating a custom org.xml.sax.ErrorHandler implementation and registering it with your DocumentBuilder instance prior to having it parse your file—you can intercept the validation exceptions and do whatever you want, although it is likely you will wrap them in a custom exception and just re-throw them (to process them elsewhere in your parsing logic). You will be given a default ErrorHandler implementation if you omit this step, but it adds an extra line of output warning you of this point.
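A minimal sketch of that setup follows, assuming a handler that simply re-throws every reported error so that an invalid document makes parse() fail. The class name ValidationSketch, the isValid helper, and the inline DTD are all assumptions made for illustration; the article's demo.xml declares no DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical class name, used only for this illustration.
class ValidationSketch {

    // Hypothetical inline DTD describing the demo document's structure.
    static final String PROLOG =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE demo [\n"
        + "  <!ELEMENT demo (child*)>\n"
        + "  <!ELEMENT child (#PCDATA)>\n"
        + "  <!ATTLIST child name CDATA #REQUIRED>\n"
        + "]>\n";

    // Returns true if the body parses cleanly against the inline DTD.
    static boolean isValid(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);   // must be set before the builder is created
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                // Warnings are ignored in this sketch.
            }
            public void error(SAXParseException e) throws SAXException {
                throw e;   // validity errors arrive here; re-throw to abort parsing
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e;   // well-formedness errors
            }
        });
        try {
            builder.parse(new InputSource(new StringReader(PROLOG + body)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<demo><child name=\"child_one\">Hello</child></demo>"));
        System.out.println(isValid("<demo><bogus/></demo>"));
    }
}
```

Real code would more likely wrap the SAXParseException in an application-specific exception, as the note above suggests, rather than reduce it to a boolean.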
Drill-Down Overload and XPath
The nature of the standard W3C DOM accessor methods is that they tend to induce a lot of method calls to drill down through your XML DOM tree to get to the piece of data you need. This can be all the more compounded by possible checks to be sure the Node you are working on really is a meaningful Element, and not just a spurious Text node present only due to pretty-print formatting. The solution to this is to use XPath expressions to do the work for you.
Quick Overview of XPath Syntax
Just as a regular expression (regex) engine has all the logic necessary to greatly simplify the job of parsing an arbitrary piece of text against a tricky formatting pattern, an XPath expression engine likewise contains the code needed to walk through a DOM tree given a text-based set of instructions. And just like regular expressions, there are a lot of little details to get straight before you can say you really understand things.
Fortunately, you can do some fairly common things knowing only a couple of details. Thus, with a heavy amount of simplification, here is an overly brief survival guide to a few parts of the XPath syntax:
- The element(s) you want to navigate through are denoted using their node name. Parent elements are separated from child elements by a forward slash. Use a leading forward slash to denote the top of the document. (In this respect, a series of elements looks like a directory path on a UNIX-style operating system.) Example: /root/child/subchild
- An attribute is denoted by its name prefixed by an “@” (at) symbol. Furthermore, a given attribute of an element is written as though it were a child: a forward slash separates the two. Example: /root/child/@attr
- Because XPath expressions can match more than one node, you can qualify just which node you want by putting an index number in square brackets at the end of the element name. (Warning: Such indexes start counting at 1, not 0.) Example: /root/child[1]
- You also can specify another XPath expression in the square brackets as a way of saying “the element(s) in the node list that match this additional constraint.” This tends to just be a simple attribute comparison. Example: /root/child[@attr='some_value']
XPath is probably more easily understood (at least at first) by example. Using the short XML document you experimented with earlier, the expression “/demo/child” returns two Element nodes, one for each “child” underneath the one and only “demo” root element. The expression “/demo/child[1]” would just return the first one. The expression “/demo/child[1]/@name” would return the value of the “name” attribute of the first child element, that is, “child_one”. Finally, the expression “/demo/child[@name='child_two']” would return the child element whose own name attribute contained the value “child_two”; this is the second possible child element.
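To see these expressions in action, here is a small sketch that evaluates each one against an inline copy of the demo document and prints the result as a String. The javax.xml.xpath calls it relies on are covered under “Using XPath Expressions” below. The class name XPathExpressionsSketch and the eval helper are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class XPathExpressionsSketch {

    static final String DEMO_XML =
          "<?xml version=\"1.0\"?>\n"
        + "<demo>\n"
        + "\t<child name=\"child_one\">Hello</child>\n"
        + "\t<child name=\"child_two\">World</child>\n"
        + "</demo>\n";

    // Evaluates an expression against the demo document, returning its string value.
    static String eval(String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(DEMO_XML)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(eval("/demo/child[1]"));                   // Hello
        System.out.println(eval("/demo/child[1]/@name"));             // child_one
        System.out.println(eval("/demo/child[@name='child_two']"));   // World
        System.out.println(eval("count(/demo/child)"));               // 2
    }
}
```

When an expression matches an element, the String form of evaluate returns the element's text content, which is why the first and third lines print “Hello” and “World” rather than a node.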
From this point, XPath can get rather hairy: multiple constraints (formally called “predicates”) can be stacked on an element (and even nested in others), there are ways to navigate backwards and sideways through sibling elements (instead of only downward through children), and there are more ways to match values than simply using an equal sign. Special notation exists to specifically match different node types (like comments and processing instructions). Wildcarding, convenience shorthand notations, and namespace prefixes also can muddy up things pretty quickly. In other words, to really extract the maximum amount of power from XPath expressions for rather complex needs, you will need an actual reference. (And you will need a bit of practice, but this will definitely be time well spent if your tasks take you into XML processing territory.)
Note: If you ever become involved with XSLT to transform an input XML document into a different output XML document, you will use a lot of XPath expressions.
Using XPath Expressions
Again, to keep things simple, you will need to import the javax.xml.xpath.* package space. Creating and using an XPath expression uses a fairly similar pattern to creating and using an XML parser:
    XPathFactory xpFactory = XPathFactory.newInstance();
    XPath xpath = xpFactory.newXPath();
    String value = xpath.evaluate("/demo/child[1]", someNode);
As an option, you also can compile an XPath expression in a separate step from evaluating it. This allows the engine to check that the XPath expression passed to it makes any sense at all. (An exception is thrown if not.) Also, note the evaluate method operates on a Node, not just a Document. What probably wasn’t clear above is that XPath expressions need not start with forward slash. Omitting the slash gives a relative XPath expression whose evaluation depends on the Node it operates on. (This might be another reason you would compile an XPath expression first so as to reuse the resulting XPathExpression object.)
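The two points above, compiling ahead of time and evaluating a relative expression against different nodes, can be sketched together. In this hedged example the relative expression “@name” is compiled once and then reused against each child element found by an absolute expression. The class name CompiledXPathSketch and its helper methods are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class CompiledXPathSketch {

    static final String DEMO_XML =
          "<demo>\n"
        + "\t<child name=\"child_one\">Hello</child>\n"
        + "\t<child name=\"child_two\">World</child>\n"
        + "</demo>\n";

    // Reuses one compiled, relative expression against every child element.
    static List<String> childNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // No leading slash: the result depends on the node it is evaluated on.
        XPathExpression nameOf = xpath.compile("@name");

        // Asking for a NODESET returns the matching nodes instead of a String.
        NodeList children = (NodeList) xpath.evaluate(
                "/demo/child", doc, XPathConstants.NODESET);

        List<String> names = new ArrayList<String>();
        for (int i = 0; i < children.getLength(); i++) {
            names.add(nameOf.evaluate(children.item(i)));
        }
        return names;
    }

    // Compiling up front surfaces syntax errors before any document is involved.
    static boolean compiles(String expression) {
        try {
            XPathFactory.newInstance().newXPath().compile(expression);
            return true;
        } catch (XPathExpressionException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childNames(DEMO_XML));
        System.out.println(compiles("/demo/child[1]"));
        System.out.println(compiles("/demo/child["));   // unbalanced bracket
    }
}
```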
Example Revisited
Revisiting the demo.xml file from above, consider the following example that parses the file into a DOM object and then uses some XPath evaluations to display some information:
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = dbFactory.newDocumentBuilder();
    Document doc = builder.parse(new File("demo.xml"));

    XPathFactory xpFactory = XPathFactory.newInstance();
    XPath xpath = xpFactory.newXPath();

    // count() yields a number; the String form of evaluate returns it as text.
    int count = Integer.parseInt(xpath.evaluate("count(/demo/child)", doc));
    System.out.println("There are "+count+" child elements:");

    // XPath indexes start at 1, not 0.
    for (int i=1 ; i<=count ; i++) {
        String text = xpath.evaluate("/demo/child["+i+"]", doc);
        System.out.printf("i=%d child text is '%s'%n", i, text);
    }
One interesting point to note is that you are no longer having to deal explicitly with getting a NodeList object just to call the getLength() method—the XPath language provides a built-in count() function that accomplishes the same objective. (In fact, in complex XPath statements, the value of a count() function can be used as a constraint predicate inside a larger XPath expression to, for example, get nodes with no children.) Also, this example has no references to Node or Element objects, because you are using the form of the evaluate method that directly returns Strings. Of course, in a more real-world scenario, you are likely to mix and match approaches somewhat, depending on your exact needs.
Conclusion
You have just had a quick tour of some of Java’s built-in XML processing capabilities. Although the programming style of working with XML does in some respects differ from that of the rest of the API, the upshot of learning this approach is that it is at least “mentally portable” to many other languages. (Even JavaScript is functionally similar, should you ever dabble in AJAX-style response processing.) Of course, those with a real need to do intensive XML munging should probably at least look in on the many third-party libraries, but you should not be afraid to take things on yourself if requirements so demand.
Other Ideas
A great way to utilize XPath expressions to streamline, say, reading an XML configuration file for your application would be to store them in a standard properties file. Then, you simply can fetch an appropriate XPath expression from this file based on an appropriately named “key” value, and evaluate it against the configuration file DOM already loaded. This allows you some flexibility to rearrange the structure of the configuration file (if needed) without having to rearrange any existing source code. With the right helper files in place, large sections of your own code won’t even know that XML is involved.
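A minimal sketch of that idea, under stated assumptions: the property keys (first.child.name, second.child.text), the class name XPathPropertiesSketch, and the lookup helper are all hypothetical, and both the properties and the XML are read from Strings here so the example is self-contained. A real application would load each from its own file.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class XPathPropertiesSketch {

    // Stand-in for a properties file mapping hypothetical keys to XPath expressions.
    static final String EXPRESSIONS =
          "first.child.name=/demo/child[1]/@name\n"
        + "second.child.text=/demo/child[2]\n";

    // Stand-in for the XML configuration file.
    static final String CONFIG_XML =
          "<demo>\n"
        + "\t<child name=\"child_one\">Hello</child>\n"
        + "\t<child name=\"child_two\">World</child>\n"
        + "</demo>\n";

    // Callers ask for a value by key and never see the XPath or the XML structure.
    static String lookup(String key) throws Exception {
        Properties expressions = new Properties();
        expressions.load(new StringReader(EXPRESSIONS));
        String expression = expressions.getProperty(key);
        if (expression == null) {
            throw new IllegalArgumentException("No XPath configured for key: " + key);
        }
        Document config = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(CONFIG_XML)));
        return XPathFactory.newInstance().newXPath().evaluate(expression, config);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(lookup("first.child.name"));
        System.out.println(lookup("second.child.text"));
    }
}
```

If the configuration file's structure changes, only the expressions in the properties text need to change; callers of lookup keep using the same keys.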