Understanding the XPath Data Model

The XPath 2.0 data model is based on an XML document's infoset. A document's infoset contains all the document's data reduced to a standard form in a set of properties; you can read all about infosets at http://www.w3.org/TR/xml-infoset/.

For example, if you're working with a processing instruction, the infoset will contain a number of properties for that processing instruction, including target, content, base-uri, and parent. These properties are then translated into XPath 2.0 data model properties. The details of how this works are not directly important to us because they're handled by the software, and the XPath 2.0 properties for a node aren't directly available to us anyway (these properties are accessed by the XPath 2.0 processor when you use the XPath 2.0 language). However, it's good to know how the process works in overview.
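To make the idea of per-node properties concrete, here is a minimal sketch that reads comparable properties from a processing instruction. It assumes Python's standard xml.dom.minidom module, which is a DOM implementation rather than an XPath 2.0 processor, so it only approximates the target, content, and parent properties named above (it exposes no base-uri). The sample document and the dictionary keys are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document with one processing instruction before the root.
document = minidom.parseString('<?xml-stylesheet href="style.css"?><doc/>')

# The processing instruction is the first child of the document node.
pi = document.firstChild

# DOM attributes that roughly correspond to the infoset properties in the text.
properties = {
    "target": pi.target,                # name that follows "<?"
    "content": pi.data,                 # everything after the target
    "parent": pi.parentNode.nodeName,   # the owning document node
}
```

Here the parent is reported as "#document", which is the DOM's name for the document node.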

In general, an XML document is first reduced to its infoset, which may be validated against an XML schema (although XPath 2.0 makes provisions for DTD validation, the W3C's focus is clearly on schemas), resulting in a Post-Schema-Validation Infoset (PSVI). The PSVI's properties are then converted to the corresponding XPath 2.0 data model properties and made available to XPath processors.

All the data from the PSVI is represented in sequences (single items, like single nodes, are represented as singleton sequences). As you know, sequences can contain nodes or atomic values, or a mix of the two. The XPath 2.0 data model uses the same seven node kinds as the XPath 1.0 data model does (except that root nodes are now called document nodes):

  • Document nodes

  • Element nodes

  • Attribute nodes

  • Processing instruction nodes

  • Comment nodes

  • Text nodes

  • Namespace nodes
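As a rough illustration of these node kinds, the sketch below walks a small document and labels each node. It assumes Python's standard xml.dom.minidom module, which implements the DOM rather than the XPath 2.0 data model. The DOM has no namespace nodes, so only six of the seven kinds can show up. The XPATH_KIND table, the node_kinds helper, and the sample document are hypothetical names introduced for this example.

```python
from xml.dom import Node, minidom

# Map DOM node types to the XPath 2.0 node-kind names used in the text.
# The DOM has no namespace nodes, so that kind is missing from this table.
XPATH_KIND = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.ATTRIBUTE_NODE: "attribute",
    Node.PROCESSING_INSTRUCTION_NODE: "processing-instruction",
    Node.COMMENT_NODE: "comment",
    Node.TEXT_NODE: "text",
}


def node_kinds(node):
    """Yield the kind of a node, then of its attributes, then of its descendants."""
    yield XPATH_KIND[node.nodeType]
    if node.nodeType == Node.ELEMENT_NODE:
        for i in range(node.attributes.length):
            yield XPATH_KIND[node.attributes.item(i).nodeType]
    for child in node.childNodes:
        yield from node_kinds(child)


SAMPLE = '<?xml version="1.0"?><!-- note --><?target data?><doc id="1">text</doc>'
kinds = list(node_kinds(minidom.parseString(SAMPLE)))
```

The resulting list starts with the document node and then follows document order, with an element's attributes listed right after the element itself.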

Note that in the XPath 2.0 data model, each node has two kinds of values: its string value and its typed value. The string value is the node's content expressed as text. The typed value, on the other hand, is that content interpreted as the type the node has been declared to be. For example, if you've declared an element to contain decimal data, and it holds the string "1.0", its typed value will be the decimal value 1.0. As a result of schema validation, every element and attribute node has a type annotation, which is the name of the type against which the node was successfully validated. For attribute nodes, the type annotation is always the name of a simple type. For element nodes, the type annotation may be the name of a simple or a complex type. Because the type annotation carries this type information, nodes also have an associated typed value.

Typed values of attributes and elements based on simple types are just sequences of atomic values corresponding to the node's content after validation. The typed value of an element based on a complex type, on the other hand, is considered undefined.
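The relationship between a string value and a typed value can be sketched in a few lines. This is not how an XPath 2.0 processor is implemented; it simply assumes a hypothetical table, ATOMIC_CONSTRUCTORS, that maps a handful of XML Schema type names to Python constructors, and a hypothetical typed_value helper that returns the result as a sequence (a Python list) of atomic values.

```python
from decimal import Decimal

# Hypothetical mapping from a few XML Schema simple type names to Python
# constructors; a real processor supports many more types.
ATOMIC_CONSTRUCTORS = {
    "xs:string": str,
    "xs:decimal": Decimal,
    "xs:integer": int,
}


def typed_value(string_value, type_name, is_list=False):
    """Return the typed value as a sequence (list) of atomic values.

    is_list marks a whitespace-separated list type, an assumption made
    here to show a typed value holding more than one atomic value.
    """
    construct = ATOMIC_CONSTRUCTORS[type_name]
    tokens = string_value.split() if is_list else [string_value]
    return [construct(token) for token in tokens]


# The example from the text: an element declared as decimal holding "1.0".
price = typed_value("1.0", "xs:decimal")

# A list-typed attribute yields several atomic values.
sizes = typed_value("8 10 12", "xs:integer", is_list=True)
```

Even the single decimal comes back as a one-item list, mirroring the way the data model treats a single item as a singleton sequence.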

Atomic values are values of the primitive simple types defined by the XML Schema specification, or of types derived from those types by restriction in a schema.

That's what the picture looks like in overview. Now we'll take a closer look at the various legal items in the XPath 2.0 data model—nodes and atomic values—starting with the kinds of nodes allowed.

The first node kind we'll take a look at is the document node in XPath 2.0.

Document Nodes

The document node (the same as XPath 1.0 root nodes) encapsulates the entire XML document—it's the starting point in the tree that describes the XML document. In the XPath 2.0 data model, document nodes have a number of properties derived from the PSVI. You don't access these properties directly (the software you're using does)—but they give you an idea of the data that is available for a document node:

  • base-uri

  • children

  • unparsed-entities

  • document-uri

Every document node must have a unique identity and must be distinct from all other nodes. If there are children, they must consist only of element, processing instruction, comment, and text nodes. Attribute, namespace, and document nodes cannot be direct children of a document node.

Note also that the sequence of nodes in the children property is ordered (those nodes will be in document order), and the children property must not contain two consecutive text nodes (they must be merged to normalize that text). In well-formed XML documents, the children of the document node must not be empty and must consist only of element nodes, processing-instruction nodes, and comment nodes. Exactly one of these children, the document element, is an element node. (Don't confuse the document element with the document node—the document element contains all the other elements in the document.)

The information available to XPath 2.0 software about a document node includes the base URI, the node kind ("document" in this case), the string value of the node (the string values of all text node descendants, concatenated in document order), the typed value of the node (which is its string value), the node's children, and the URI of the document itself.
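The string value of a document node, and the rule that exactly one child is the document element, can be checked with a short sketch. It assumes Python's standard xml.dom.minidom module as a stand-in for a real XPath 2.0 processor; the string_value helper and the sample document are hypothetical and exist only for this illustration.

```python
from xml.dom import Node, minidom


def string_value(node):
    """Concatenate the data of all descendant text nodes, in document order."""
    if node.nodeType == Node.TEXT_NODE:
        return node.data
    return "".join(string_value(child) for child in node.childNodes)


document = minidom.parseString(
    "<order><item>pen</item><!-- gift --><item>ink</item></order>"
)

# Comments contribute nothing; only text node descendants are concatenated.
document_text = string_value(document)

# A well-formed document has exactly one element child: the document element.
element_children = [
    child for child in document.childNodes
    if child.nodeType == Node.ELEMENT_NODE
]
```

Note that the two text nodes are joined with no separator, so the result is "penink" rather than "pen ink".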

Element Nodes

Element nodes encapsulate XML elements. In the XPath 2.0 data model, elements have these properties:

  • base-uri

  • node-name

  • parent

  • type

  • children

  • attributes

  • namespaces

In addition, element nodes must have a type annotation, which indicates what type of element they are. (As mentioned earlier, exactly how the type annotation works is implementation-specific at this point, and is not defined by XPath 2.0.) Element nodes must also have a unique identity, distinct from all other nodes. If there are children, the children of an element must be only element, processing instruction, comment, and text nodes. Attribute, namespace, and document nodes cannot be element node children.

Also, the children property may not contain two consecutive text nodes, and the sequence of nodes in the children property is ordered (in document order). The attributes of an element must have distinct names, as must its namespace nodes, if there are any. And no namespace node may have the name "xmlns".
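Two of these constraints can be seen in practice with a small sketch. It assumes Python's standard xml.dom.minidom module, whose DOM trees do allow adjacent text nodes until normalize() is called, so the merge that the data model requires has to be requested explicitly. The sample element and variable names are made up for this example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

document = minidom.parseString("<greeting/>")
greeting = document.documentElement

# Build two adjacent text nodes, which the XPath 2.0 data model does not allow.
greeting.appendChild(document.createTextNode("Hello, "))
greeting.appendChild(document.createTextNode("world"))
before = len(greeting.childNodes)

# normalize() merges adjacent text nodes into a single text node.
greeting.normalize()
after = len(greeting.childNodes)
merged_text = greeting.firstChild.data

# Duplicate attribute names are rejected by the parser as not well-formed.
try:
    minidom.parseString('<item id="1" id="2"/>')
    duplicate_rejected = False
except ExpatError:
    duplicate_rejected = True
```

After normalization the element has a single text child, which is the shape an XPath processor expects to see.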
