October 21, 2016

Working with XML and Java

  • January 2, 2008
  • By Rob Lybarger

Enabling XML parser validation

By default, the parser (that is, the DocumentBuilder instance) returned by the factory is non-validating. (Well-formedness errors show up as a parse exception either way.) This means a careless or malicious user could insert arbitrary, unexpected elements into the XML document and cause you all manner of grief. Of course, this is what DTDs and schemas are for: they define the legal structure of a given XML document. Turning on validation is a matter of calling setValidating(true) on the DocumentBuilderFactory instance before asking it for a new DocumentBuilder. (Obviously, the input XML document must declare something to validate against. Note that setValidating(true) by itself enables DTD validation; validating against a W3C XML Schema takes additional factory configuration, such as supplying a Schema object through setSchema.) Validation errors are then reported when the DocumentBuilder parses the file into a Document. Be aware that the default error handler in the JDK's bundled parser only prints validation errors and lets parsing continue; to have them surface as exceptions, register an ErrorHandler that throws them.

Note: You can control how these errors are handled by creating a custom org.xml.sax.ErrorHandler implementation and registering it with your DocumentBuilder instance before it parses your file. You can intercept the validation errors and do whatever you want with them, although you will most likely wrap them in a custom exception and re-throw them (to process them elsewhere in your parsing logic). If you omit this step, you are given a default ErrorHandler implementation, which adds an extra line of output warning you that no handler was set.
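As a rough illustration of the two steps above, here is a minimal sketch that turns on DTD validation and registers an ErrorHandler that re-throws what it receives. The class name ValidatingParseSketch, the nested ThrowingErrorHandler, and the inline DEMO_DTD are hypothetical names made up for this example, and the DTD is only an assumed shape for a "demo" document; adapt them to your own input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidatingParseSketch {

    // Hypothetical internal DTD subset, assumed for this sketch only.
    static final String DEMO_DTD =
        "<!DOCTYPE demo ["
      + "<!ELEMENT demo (child*)>"
      + "<!ELEMENT child EMPTY>"
      + "<!ATTLIST child name CDATA #REQUIRED>"
      + "]>";

    /** Re-throws validation errors so they surface as exceptions from parse(). */
    static class ThrowingErrorHandler implements ErrorHandler {
        @Override
        public void warning(SAXParseException e) {
            // Warnings are ignored in this sketch.
        }

        @Override
        public void error(SAXParseException e) throws SAXException {
            throw e; // validity errors arrive here
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            throw e; // well-formedness errors arrive here
        }
    }

    static Document parseValidated(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // must happen before newDocumentBuilder()
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ThrowingErrorHandler());
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

With this in place, parsing DEMO_DTD followed by a conforming "demo" element should return a Document, while a document containing an undeclared element should fail with a SAXParseException. In real code you would probably wrap that exception in one of your own, as the note suggests.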

Drill-Down Overload and XPath

The nature of the standard W3C DOM accessor methods is that they tend to require a lot of method calls to drill down through your XML DOM tree to the piece of data you need. This is compounded by the checks needed to be sure the Node you are working on really is a meaningful Element, and not just a whitespace-only Text node present only due to pretty-print formatting. The solution is to use XPath expressions to do the work for you.
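To make the drill-down problem concrete, here is a small sketch of what the manual approach tends to look like. The class DomDrillDownSketch and its methods are hypothetical, and the sample document in the usage note is only an assumed stand-in with a "demo" root and "child" elements carrying a "name" attribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomDrillDownSketch {

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Counts every child node of the root, including whitespace Text nodes. */
    static int rawChildCount(String xml) throws Exception {
        return parse(xml).getDocumentElement().getChildNodes().getLength();
    }

    /** Returns the "name" attribute of the first "child" element, or null if there is none. */
    static String firstChildName(String xml) throws Exception {
        Element root = parse(xml).getDocumentElement();
        NodeList kids = root.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node node = kids.item(i);
            // Skip Text nodes created by indentation and line breaks.
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "child".equals(node.getNodeName())) {
                return ((Element) node).getAttribute("name");
            }
        }
        return null;
    }
}
```

For a pretty-printed document with two "child" elements, rawChildCount reports five nodes (three of them whitespace), which is why the type check in firstChildName is needed. Each extra level of nesting would add another loop like this one.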

Quick Overview of XPath Syntax

Just as a regular expression (regex) engine has all the logic necessary to greatly simplify the job of parsing an arbitrary piece of text against a tricky formatting pattern, an XPath expression engine likewise contains the code needed to walk through a DOM tree given a text-based set of instructions. And just like regular expressions, there are a lot of little details to get straight before you can say you really understand things.

Fortunately, you can do some fairly common things knowing only a couple details. Thus, with a heavy amount of simplification, here is an overly brief survivor's guide to a few parts of the XPath syntax:

  • The element(s) you want to navigate through are denoted using their node name. Parent elements are separated from child elements by a forward slash. Use a leading forward slash to denote the top of the document. (In this respect, a series of elements looks like a directory path on a UNIX-style operating system.) Example: /root/child/subchild
  • An attribute is denoted by its name prefixed by an "@" (at) symbol. Furthermore, a given attribute of an element is written as though it were a child: a forward slash separates the two. Example: /root/child/@attr
  • Because XPath expressions can match more than one node, you can qualify just which node you want by putting an index number in square brackets at the end of the element name. (Warning: Such indexes start counting at 1, not 0.) Example: /root/child[1]
  • You also can specify another XPath expression in the square brackets as a way of saying "the element(s) in the node list that match this additional constraint." This tends to just be a simple attribute comparison. Example: /root/child[@attr='some_value']

XPath is probably more easily understood (at least at first) by example. Using the short XML document you experimented with earlier, the expression "/demo/child" returns two Element nodes, one for each "child" underneath the one and only "demo" root element. The expression "/demo/child[1]" would just return the first one. The expression "/demo/child[1]/@name" would return the value of the "name" attribute of the first child element, that is, "child_one". Finally, the expression "/demo/child[@name='child_two']" would return the child element whose own name attribute contained the value "child_two" -- this is the second possible child element.
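If you want to try these four expressions yourself, the following sketch evaluates them with the javax.xml.xpath API that the next section introduces. The class XPathSyntaxSketch, its helper methods, and the DEMO_XML string are hypothetical; DEMO_XML is only an assumed reconstruction of the demo document, based on the element and attribute names mentioned above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSyntaxSketch {

    // Assumed stand-in for the demo document used earlier in the article.
    static final String DEMO_XML =
        "<demo>\n"
      + "  <child name=\"child_one\"/>\n"
      + "  <child name=\"child_two\"/>\n"
      + "</demo>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Number of nodes the expression selects. */
    static int countMatches(String xml, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
            (NodeList) xpath.evaluate(expr, parse(xml), XPathConstants.NODESET);
        return nodes.getLength();
    }

    /** String value of the first node the expression selects. */
    static String stringValue(String xml, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, parse(xml));
    }
}
```

Against DEMO_XML, countMatches gives 2 for "/demo/child" and 1 for both "/demo/child[1]" and "/demo/child[@name='child_two']", while stringValue gives "child_one" for "/demo/child[1]/@name".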

From this point, XPath can get rather hairy: multiple constraints (formally called "predicates") can be stacked on an element (and even nested in others), there are ways to navigate backwards and sideways through sibling elements (instead of only downward through children), and there are more ways to match values than simply using an equal sign. Special notation exists to match specific node types (like comments and processing instructions). Wildcarding, convenience shorthand notations, and namespace prefixes also can muddy things up pretty quickly. In other words, to really extract the maximum amount of power from XPath expressions for rather complex needs, you will need an actual reference. (And you will need a bit of practice, but this will definitely be time well spent if your tasks take you into XML processing territory.)
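To give a taste of a few of these features without going into a full reference, here is a small sketch. The class XPathBeyondBasicsSketch, its helpers, and the DEMO_XML string (an assumed demo document with a comment added for illustration) are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathBeyondBasicsSketch {

    // Assumed demo document; the comment exists only to show comment().
    static final String DEMO_XML =
        "<demo>"
      + "<!-- sample -->"
      + "<child name=\"child_one\"/>"
      + "<child name=\"child_two\"/>"
      + "</demo>";

    static Object eval(String expr, QName returnType) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(DEMO_XML)));
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc, returnType);
    }

    static String str(String expr) throws Exception {
        return (String) eval(expr, XPathConstants.STRING);
    }

    static double num(String expr) throws Exception {
        return (Double) eval(expr, XPathConstants.NUMBER);
    }

    // Sideways navigation: the next "child" sibling after child_one.
    static final String NEXT_SIBLING =
        "/demo/child[@name='child_one']/following-sibling::child[1]/@name";

    // Stacked predicates: filter with a function, then pick by position.
    static final String STACKED =
        "/demo/child[starts-with(@name, 'child_')][2]/@name";

    // Wildcard: count every element directly under the root.
    static final String WILDCARD_COUNT = "count(/demo/*)";

    // Node-type test: select the comment under the root.
    static final String COMMENT = "/demo/comment()";
}
```

Against this document, str(NEXT_SIBLING) and str(STACKED) both yield "child_two", num(WILDCARD_COUNT) yields 2.0 (the comment is not an element), and str(COMMENT) yields the comment text.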

Note: If you ever become involved with XSLT to transform an input XML document into a different output XML document, you will use a lot of XPath expressions.

Using XPath Expressions

Again, to keep things simple, import everything in the javax.xml.xpath package. Creating and using an XPath expression follows a pattern fairly similar to creating and using an XML parser:

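// Requires: import javax.xml.xpath.*;
XPathFactory xpFactory = XPathFactory.newInstance();
XPath xpath = xpFactory.newXPath();
// someNode is any DOM Node (often the Document); the result is the string value of the first match.
String value = xpath.evaluate("/demo/child[1]", someNode);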

As an option, you also can compile an XPath expression in a separate step from evaluating it. This lets the engine check up front that the XPath expression passed to it makes any sense at all. (An XPathExpressionException is thrown if not.) Also, note that the evaluate method operates on a Node, not just a Document. What probably wasn't clear above is that XPath expressions need not start with a forward slash. Omitting the slash gives a relative XPath expression whose result depends on the Node it is evaluated against. (This might be another reason to compile an XPath expression first, so you can reuse the resulting XPathExpression object.)
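A minimal sketch of both ideas follows: compiling once and reusing the result, and evaluating a relative expression against different context nodes. The class CompiledXPathSketch and its methods are hypothetical, and the input is assumed to be a "demo" document with "child" elements that carry a "name" attribute.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CompiledXPathSketch {

    /** Collects the "name" attribute of every /demo/child element. */
    static List<String> childNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression children = xpath.compile("/demo/child"); // absolute
        XPathExpression nameOf = xpath.compile("@name");         // relative, compiled once

        NodeList nodes = (NodeList) children.evaluate(doc, XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // Same compiled expression, different context node each time.
            names.add(nameOf.evaluate(nodes.item(i)));
        }
        return names;
    }

    /** Uses compile() purely as a syntax check. */
    static boolean isValidExpression(String expr) {
        try {
            XPathFactory.newInstance().newXPath().compile(expr);
            return true;
        } catch (XPathExpressionException e) {
            return false;
        }
    }
}
```

Here the relative expression "@name" has no meaning on its own; it picks up its meaning from whichever "child" element it is evaluated against. The isValidExpression helper shows the early syntax check: an unbalanced expression such as "/demo/child[" is rejected at compile time, before any document is involved.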

