Java JAXP, Exposing a DOM Tree
Java Programming Notes # 2204
- Preface
- Preview
- Discussion and Sample Code
- Run the Programs
- Summary
- What's Next?
- Complete Program Listings
Preface
What is JAXP?
As the name implies, the Java API for XML Processing (JAXP) is an API designed to help you write programs for processing XML documents. JAXP is very important for many reasons, not the least of which is the fact that it is a critical part of the Java Web Services Developer Pack (JWSDP). As you are probably already aware, web services are expected by many to be a very important aspect of the Internet of the future.
This is the third lesson in a series designed to initially help you understand how to use JAXP, and to eventually help you understand how to use the JWSDP.
The first lesson was entitled Java API for XML Processing (JAXP), Getting Started. The previous lesson was entitled Getting Started with Java JAXP and XSL Transformations (XSLT).
What is XML?
XML is an acronym for the eXtensible Markup Language. I will not attempt to teach XML in this series of tutorial lessons. Rather, I will assume that you already understand XML, and I will teach you how to use JAXP to write programs for creating and processing XML documents.
I have published numerous tutorial lessons on XML at Gamelan.com and www.DickBaldwin.com. You may find it useful to refer to those lessons. In addition, I provided a review of the salient aspects of XML in the first lesson in this series. From time to time, I will also provide background information regarding XML in the lessons in this series.
Viewing tip
You may find it useful to open another copy of this lesson in a separate browser window. That will make it easier for you to scroll back and forth among the different listings and figures while you are reading about them.
Supplementary material
I recommend that you also study the other lessons in my extensive collection of online Java tutorials. You will find those lessons published at Gamelan.com. However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there. You will find a consolidated index at www.DickBaldwin.com.
Preview
A tree structure in memory
A DOM parser can be used to create a tree structure in memory that represents an XML document. In Java, that tree structure is encapsulated in an object of the interface type Document. Document and its superinterface Node declare numerous methods that may be used to navigate, extract information from, modify, and otherwise manipulate the DOM tree. As is always the case, classes that implement Document must provide concrete definitions of those methods.
Many operations are possible
Given an object of type Document, there are many methods that can be invoked on the object to perform a variety of operations. For example, it is possible to move nodes from one location in the tree to another, thus rearranging the structure of the XML document represented by the Document object. It is also possible to delete nodes, to insert new nodes, and to recursively traverse the tree, extracting information about the nodes along the way.
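To make those operations concrete, here is a minimal sketch that moves, inserts, and deletes nodes using methods declared in the Node interface. The class name DomEditSketch, the helper methods, and the tiny XML document with elements named A, B, C, and D are all hypothetical and are not part of the programs discussed in this lesson.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the element names are made up for illustration.
class DomEditSketch {

  // Parse XML text into a Document, wrapping any checked exception.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Concatenate the names of the element children of parent, in tree order.
  static String childNames(Node parent) {
    StringBuilder sb = new StringBuilder();
    for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
      if (n.getNodeType() == Node.ELEMENT_NODE) {
        sb.append(n.getNodeName());
      }
    }
    return sb.toString();
  }

  static String demo() {
    Document doc = parse("<A><B/><C/></A>");
    Element root = doc.getDocumentElement();
    Node b = root.getFirstChild();
    Node c = root.getLastChild();

    // Move: appending a node that is already in the tree relocates it.
    root.appendChild(b);                           // children are now C, B

    // Insert: create a new element and place it ahead of C.
    root.insertBefore(doc.createElement("D"), c);  // children are now D, C, B

    // Delete: remove C from its parent.
    root.removeChild(c);                           // children are now D, B

    return childNames(root);
  }

  public static void main(String[] args) {
    System.out.println(demo());  // prints DB
  }
}
```

Note that no separate "move" method is needed. Appending or inserting a node that already belongs to the tree removes it from its old position first.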
I showed you ...
In the previous lesson on Java JAXP, I began by providing a brief
review of XSL and XSL Transformations (XSLT).
Then I showed you how to create an identity Transformer
object, and how to use that object to:
- Display a DOM tree structure on the screen in XML format.
- Write the contents of a DOM tree structure into an output XML file.
I will show you ...
In this lesson, I will show you how to write a program to display a
DOM tree on the screen in a format that is much easier to interpret
than raw XML code. I will explain two different versions of the
program. One version will simply identify text nodes in the
output tree. The other will display the value of text nodes in
the output tree. The first version will ignore attributes in the
output tree. The second version will include attributes in the
output tree.
Discussion and Sample Code
#document DOCUMENT_NODE Figure 1 |
The physical tree structure shown in Figure 1 represents the corresponding XML document as a visual tree. As I discuss the various parts of the XML document, you should be able to correlate those parts of the document to the tree structure shown in Figure 1.
The sample XML file named DomTree01.xml
The tree structure in Figure 1 corresponds to an XML file named DomTree01.xml. As is often the case, I will discuss the XML files and the programs in fragments. A complete listing of DomTree01.xml is shown in Listing 21 near the end of the lesson. Listing 1 shows the beginning of the XML file.
<?xml version="1.0"?> |
That portion of the XML file shown in Listing 1 consists of five items that are represented by the following nodes in the DOM tree:
- A Document node
- A Document-Type node
- A Comment node
- A Processing Instruction node representing a stylesheet
- A Processing Instruction node representing a dummy processing instruction
The five items are separated by blank lines in Listing 1, so you should be able to correlate them visually with the five nodes in the above list.
The DOM tree exposed
Figure 2 shows a reproduction of the first five lines from Figure 1. Each line in Figure 2 represents a node in the DOM tree. You should be able to correlate each line in Figure 2 with one of the nodes in the above list, and also with one of the items in Listing 1 (except for the DOCUMENT_NODE for which there is no explicit item in Listing 1).
The indentation in Figure 2 indicates that the nodes represented by the last four lines are children of the Document node represented by the first line.
#document DOCUMENT_NODE Figure 2 |
The prolog of the XML document
Listing 1 shows the prolog for this XML document, which includes everything prior to the start tag for the root element. Figure 2 shows the DOM nodes associated with the prolog.
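The relationship between prolog items and DOM nodes can be checked with a few lines of code. The following is a minimal sketch, not the program from this lesson. The class name PrologNodesSketch and the small XML document embedded in it are hypothetical. It parses a document whose prolog contains a document type declaration, a comment, and a processing instruction, and then reports the numeric type code of each child of the Document node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the XML text below is made up for illustration.
class PrologNodesSketch {

  // Parse XML text into a Document, wrapping any checked exception.
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Return the type codes of the Document node's children, space separated.
  static String childTypes() {
    String xml =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE A [<!ELEMENT A (#PCDATA)>]>\n"
      + "<!-- a comment -->\n"
      + "<?dummy instruction?>\n"
      + "<A>text</A>";
    NodeList kids = parse(xml).getChildNodes();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < kids.getLength(); i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(kids.item(i).getNodeType());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // 10 = DOCUMENT_TYPE_NODE, 8 = COMMENT_NODE,
    // 7 = PROCESSING_INSTRUCTION_NODE, 1 = ELEMENT_NODE
    System.out.println(childTypes());  // prints 10 8 7 1
  }
}
```

The XML declaration on the first line of the text does not show up among the children, and neither do the line breaks that separate the prolog items.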
The root element in the XML document
Listing 2 shows the XML code for the root element and the six nodes following the root-element node in the DOM tree. The XML code in Listing 2 produces the following node types in the DOM tree, with the parent-child relationships shown.
- An Element node named A, which is the root element node
- An Element node named Q
- A Text node
- An Element node named B
- An Element node named C
- A Text node
- A CDATA Section node
<A> |
Referring back to Figure 1, you can see that the Element node named A is a child of the Document node, which forms the root of the DOM tree. The node for element A is the root element node for the DOM tree (which is different from the root node for the DOM tree). All of the data stored in an XML document is stored in the root element node and its children.
Figure 3 shows a reproduction of the next seven lines from Figure 1, showing the tree structure and the parent-child relationships among the nodes. The nodes shown in Figure 3 correspond to the XML code in Listing 2.
A ELEMENT_NODE Figure 3 |
Easier to interpret
Unless you have a lot of practice reading XML code, you may have concluded by now that the representations of the DOM tree in Figures 2 and 3 are much easier to get your mind around than the raw XML shown in Listings 1 and 2.
Node types seen thus far
So far, we have seen the following types of nodes:
- Document node
- Document-Type node
- Comment node
- Processing Instruction node
- Element node
- Text node
- CDATA Section node
The Document node and the XML declaration
According to XML in a Nutshell by Harold and Means, which I recommend as an excellent book,
As I mentioned earlier, every XML DOM tree is rooted in a Document node, even in the absence of an XML declaration. Apparently, the DOM tree does not contain a node that represents the XML declaration, and the XML document doesn't contain any specific text that represents the Document node.
Although the XML declaration is used for information purposes by a validating XML parser, if it is possible to recover the XML declaration from the DOM tree, I don't know how to do that at this time.
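For readers using a later Java release than the SDK 1.4.2 used for this lesson, DOM Level 3 (available in Java 5 and later) added three methods to the Document interface that report the contents of the XML declaration: getXmlVersion, getXmlEncoding, and getXmlStandalone. There is still no node in the tree for the declaration. The following is a minimal sketch under that assumption. The class name XmlDeclSketch and the one-element document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class; requires a DOM Level 3 parser (Java 5 or later).
class XmlDeclSketch {

  // Parse a tiny document and report what the XML declaration said.
  static String describe() {
    byte[] xml =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><A/>"
            .getBytes(StandardCharsets.UTF_8);
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml));
      // These values come from the Document object itself,
      // not from any child node in the tree.
      return doc.getXmlVersion() + " "
          + doc.getXmlEncoding() + " "
          + doc.getXmlStandalone();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(describe());
  }
}
```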
Document-Type node
A valid XML document contains a reference to a Document Type Declaration (DTD) to which the document should be compared for validation purposes. The DTD can also be included in the XML document prolog, as is the case in Listing 1.
According to XML in a Nutshell,
For example, the DTD in Listing 1 states that the element named A must contain the elements named Q, B, and B, in that order. I'm not going to try to explain the rules for writing DTDs. There are numerous tutorials on the Web that you can refer to in this regard.
The DTD in Listing 1 produced the Document-Type node in the tree in Figure 2.
Comment node
A comment in XML means pretty much the same thing as a comment in Java. XML comments are generally ignored by XML processors. They are intended primarily for human consumption.
Listing 1 contains an XML comment with the file name and some other information. This comment produced the Comment node in the tree of Figure 2.
Processing Instruction node
XML processing instructions begin with <? and end with ?>. Processing instructions are intended to provide instructions to processing programs that may be called upon to process an XML document.
Listing 1 contains two separate processing instructions. The two processing instructions gave rise to the two Processing Instruction nodes in the tree in Figure 2.
Element node
As you learned in the previous two lessons, XML syntax includes elements, consisting of start tags, end tags, optional content, and optional attributes.
Listing 2 contains all or part of several elements. The elements gave rise to the Element nodes in Figure 3. The text content of the elements gave rise to the Text nodes in Figure 3.
Text node
When you include text as part or all of the content of an XML element, each chunk of text gives rise to a text node in the DOM tree. Figure 3 shows two text nodes produced by the text content of the elements in Listing 2.
CDATA Section node
XML recognizes two kinds of text data, PCDATA and CDATA. PCDATA stands for parsed character data. CDATA stands for character data.
The primary difference between the two is as follows. PCDATA cannot contain certain characters such as left angle brackets (<) and ampersands (&). The reason is that a left angle bracket would confuse the parser, causing it to believe that it had encountered the first character in a start or end tag. Therefore, if these characters appear in PCDATA, they must be represented by entities, such as &lt;.
A CDATA section
When a block of text is marked as a CDATA section, the parser does not scan it for markup. Therefore, it can contain any characters, including left angle brackets and ampersands (the one exception is the sequence ]]>, which ends the section). A CDATA section always begins with <![CDATA[ and always ends with ]]>.
Listing 2 contains a block of CDATA, which gave rise to the CDATA Section node in Figure 3.
Note that the Element node named C in Figure 3 has two children. One child is a text node. The other child is a CDATA Section node.
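The two-children arrangement is easy to reproduce. Here is a minimal sketch, assuming a parser factory left at its default settings (in particular, with coalescing turned off so that CDATA sections are kept as separate nodes). The class name CdataSketch and the content of the element named C are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the element content is made up for illustration.
class CdataSketch {

  // List each child of C as typeCode:value, separated by a vertical bar.
  static String demo() {
    String xml = "<C>plain text<![CDATA[a < b & c]]></C>";
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      NodeList kids = doc.getDocumentElement().getChildNodes();
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < kids.getLength(); i++) {
        Node kid = kids.item(i);
        if (i > 0) {
          sb.append('|');
        }
        // 3 = TEXT_NODE, 4 = CDATA_SECTION_NODE
        sb.append(kid.getNodeType()).append(':').append(kid.getNodeValue());
      }
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());  // prints 3:plain text|4:a < b & c
  }
}
```

The left angle bracket and the ampersand inside the CDATA section arrive in the node value unchanged, with no entities required.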
An interesting case involving whitespace
I'm not going to bore you by discussing the entire XML document in this level of detail. By now, you should be able to compare the XML in Listing 21 with the DOM tree represented by Figure 1, and understand how the XML code relates to the DOM tree.
However, there is one tricky aspect involving whitespace that deserves a little more explanation. The DOM tree nodes shown in Figure 4 represent the XML code shown in Listing 3.
E ELEMENT_NODE Figure 4 |
Too many text nodes
I have colored the obvious text in Listing 3 green for emphasis. At first glance, it would appear that there are too many Text nodes showing in Figure 4 to correspond to the text shown in Listing 3.
<E>First list item in E |
Figure 5 shows another representation of the DOM tree, similar to Figure 4, except that the actual text belonging to each Text node is shown in Figure 5.
E ELEMENT_NODE Figure 5 |
Note the blank lines in Figure 5. These are caused by newline characters in the actual XML code in Listing 3. In particular, there are two Text nodes belonging to the element named E. One of those Text nodes appears before the element named G and the other appears after the element named G. The Text node after the element named G was caused by the newline character immediately following the end tag for the element named G.
Element E may contain PCDATA
This happens because of one line in the DTD shown in Listing 1 and repeated below for convenience.
<!ELEMENT E (#PCDATA | G)*>
This DTD statement says that the content for an element named E may contain Text nodes (#PCDATA) and/or elements named G in any number and in any order. Thus, simple newline characters inserted into the XML to make it easier to read were interpreted as Text nodes. This gave rise to what appears to be extra Text nodes in Figure 4.
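The effect of that DTD statement can be seen in isolation. The following is a minimal sketch, not the program from this lesson. The class name WhitespaceTextSketch is hypothetical, and the XML text is a cut-down approximation of the element named E rather than a copy of the listing. Newline characters in text values are displayed as \n so that they are visible.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; the XML text approximates the element named E.
class WhitespaceTextSketch {

  // Describe each child of E, making newline characters visible as \n.
  static String describe() {
    String xml =
        "<!DOCTYPE E [<!ELEMENT E (#PCDATA | G)*><!ELEMENT G (#PCDATA)>]>\n"
      + "<E>First list item in E\n"
      + "<G>item in G</G>\n"
      + "</E>";
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      NodeList kids = doc.getDocumentElement().getChildNodes();
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < kids.getLength(); i++) {
        Node kid = kids.item(i);
        if (i > 0) {
          sb.append(' ');
        }
        if (kid.getNodeType() == Node.TEXT_NODE) {
          sb.append("text(")
            .append(kid.getNodeValue().replace("\n", "\\n"))
            .append(')');
        } else {
          sb.append(kid.getNodeName());
        }
      }
      return sb.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // prints text(First list item in E\n) G text(\n)
    System.out.println(describe());
  }
}
```

The second Text node contains nothing but the newline character that follows the end tag for G, which is the kind of node that shows up as a blank line in the display.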
That's probably enough talk. It's time to see some Java code.
The program named DomTree01
With the preceding discussion as background, I will now discuss the program named DomTree01, which was used to process the file named DomTree01.xml and to produce the DOM tree representation shown in Figure 1. As usual, I will discuss the program in fragments. A complete listing of the program is shown in Listing 20 near the end of the lesson.
Purpose and limitations of the program
This program produces a text-based output on the screen that represents the DOM tree structure for an XML file. Note that although the code was written to support these node types, the program was not actually tested for the following node types:
- DOCUMENT_FRAGMENT_NODE
- ENTITY_NODE
- ENTITY_REFERENCE_NODE
- NOTATION_NODE
Also note that for simplicity, no effort was made to cause the program to produce meaningful output in the event of errors and exceptions.
The program was tested using Sun's SDK 1.4.2 under WinXP.
Overall program structure
This program consists of a single class with a main method that runs as a Java application. Listing 4 shows the beginning of the class definition and the beginning of the main method.
public class DomTree01{ |
- It declares and initializes an instance variable that is used later for control of indentation in the output display.
- It also provides usage instructions if the user starts the program with the wrong number of command-line arguments.
Two command-line parameters are required. The first parameter is the path and file name of the file containing the XML document to be processed. The second command-line parameter is either "y" or "n" specifying whether or not the parser should attempt to validate the XML document.
Steps for creating a Document object
As you learned in an earlier lesson, three steps are required to create a Document object:
- Create a DocumentBuilderFactory object
- Use the DocumentBuilderFactory object to create a DocumentBuilder object
- Use the parse method of the DocumentBuilder object to create a Document object
The first step in the above list is accomplished by the code in Listing 5.
try{ |
This wasn't discussed in the previous lessons because it only works with a validating parser. The parsers used in the two previous lessons were not validating parsers.
Create a Document object
The remaining two steps required to create a Document object are accomplished in Listing 6.
//Get a DocumentBuilder (parser) object |
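The three steps can be sketched as follows. This is a minimal sketch under stated assumptions, not a copy of Listings 5 and 6. The class name ParseStepsSketch and the method named load are hypothetical, and the XML is supplied as a String (wrapped in an InputSource) rather than as a file name from the command line so that the example is self-contained. The boolean parameter plays the role of the "y" or "n" command-line parameter.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class showing the three steps in one method.
class ParseStepsSketch {

  static Document load(String xml, boolean validate) {
    try {
      // Step 1: create a DocumentBuilderFactory object and configure it.
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(validate);

      // Step 2: use the factory to create a DocumentBuilder (parser) object.
      DocumentBuilder builder = factory.newDocumentBuilder();

      // Step 3: use the parse method to create a Document object.
      return builder.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // A made-up document that is valid against its own internal DTD.
    String xml = "<!DOCTYPE A [<!ELEMENT A (#PCDATA)>]><A>hi</A>";
    Document doc = load(xml, true);
    System.out.println(doc.getDocumentElement().getNodeName());  // prints A
  }
}
```

To read from a file instead, the parse method also has an overloaded version that accepts a java.io.File object.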
Process the Document object
Code that is new to this lesson begins in Listing 7. The code in Listing 7 instantiates a new object of the program class and invokes the processNode method on that object, passing the Document object's reference as a parameter.
//Instantiate an object of this class |
The processNode method
The processNode method, which begins in Listing 8, is used to recursively process the DOM tree, identifying and displaying the tree structure along the way.
private void processNode(Node node){ |
The code in Listing 8 simply checks to confirm that the incoming reference does not have a value of null. If it does, the code in Listing 8 prints an error message and returns.
Perform the recursive processing on the incoming node
The code in Listing 9 shows the beginning of what happens if the incoming parameter is not null.
indent++; |
Indentation
Recall the instance variable named indent that was declared and initialized in Listing 4. Each time control enters the processNode method (with a non-null Node parameter), the value of that instance variable is incremented. Each time control exits the method (except for the case of a null Node parameter), the value of that instance variable is decremented. Therefore, at any point in time, the value of indent indicates the current depth (in the DOM tree) of the node that is being examined.
Get node name and type
The variable named indent is incremented in Listing 9. Following this, two methods are called on the incoming Node parameter to get and save the name and the type of the node currently being examined.
Some types of nodes have generic names, such as #text. Other types of nodes have actual names, which match element names in the XML document.
The doIndent method
At this point, I am going to skip ahead and show you a very simple method named doIndent (which actually appears near the end of the program code in Listing 20). The code for this method is shown in Listing 10.
private void doIndent(){ |
Display the name of the node
Returning to the discussion of the processNode method, Listing 11 invokes the doIndent method to produce the required indentation, and then displays the name of the current node, followed by a space. Note that the cursor remains immediately to the right of the space and does not advance to the next line at this time.
doIndent(); |
Recall that the invocation of the getNodeType method in Listing 9 returned a numeric type code (the Node interface declares the return type as short). The Node interface defines about a dozen symbolic constants that correlate the type values to names such as CDATA_SECTION_NODE.
A switch statement
Listing 12 shows the beginning of a switch statement that uses the type value from Listing 9, along with the constants from the Node interface, to display the alphanumeric node type to the right of the node name that was displayed by the code in Listing 11.
switch(type){ |
For example, the code in Listings 11 and 12 would produce output similar to that shown in Figure 6 (the indentation may be different for different XML documents).
#cdata-section CDATA_SECTION_NODE Figure 6 |
The remainder of the switch statement
Listing 13 shows the remainder of the switch statement. There is nothing special about the code in Listing 13. As each node is examined, the code in Listing 11 performs the proper indentation and displays the name of the node. Then one of the cases in the switch statement is invoked to display the alphanumeric node type to the right of the node name and to advance the display cursor to the next line.
case Node.COMMENT_NODE:{ |
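The idea behind the switch statement can be sketched as a small helper method. This is a minimal sketch, not the code from Listings 12 and 13. The class name NodeTypeNames and the method named typeName are hypothetical, and the sketch returns the name as a String instead of printing it. The twelve constants are the node type constants declared in the Node interface.

```java
import org.w3c.dom.Node;

// Hypothetical helper that maps a node type code to its symbolic name.
class NodeTypeNames {

  static String typeName(short type) {
    switch (type) {
      case Node.ELEMENT_NODE:                return "ELEMENT_NODE";
      case Node.ATTRIBUTE_NODE:              return "ATTRIBUTE_NODE";
      case Node.TEXT_NODE:                   return "TEXT_NODE";
      case Node.CDATA_SECTION_NODE:          return "CDATA_SECTION_NODE";
      case Node.ENTITY_REFERENCE_NODE:       return "ENTITY_REFERENCE_NODE";
      case Node.ENTITY_NODE:                 return "ENTITY_NODE";
      case Node.PROCESSING_INSTRUCTION_NODE: return "PROCESSING_INSTRUCTION_NODE";
      case Node.COMMENT_NODE:                return "COMMENT_NODE";
      case Node.DOCUMENT_NODE:               return "DOCUMENT_NODE";
      case Node.DOCUMENT_TYPE_NODE:          return "DOCUMENT_TYPE_NODE";
      case Node.DOCUMENT_FRAGMENT_NODE:      return "DOCUMENT_FRAGMENT_NODE";
      case Node.NOTATION_NODE:               return "NOTATION_NODE";
      default:                               return "UNKNOWN(" + type + ")";
    }
  }

  public static void main(String[] args) {
    System.out.println(typeName(Node.CDATA_SECTION_NODE));
  }
}
```

Returning a String rather than printing inside each case is a design choice made for this sketch. It keeps the mapping separate from the display code.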
Following the switch statement, the code in Listing 14 invokes the getChildNodes method on the current node to get a list of the nodes that are children of the current node. That list is returned as an object of type NodeList. The NodeList object's reference is stored in the reference variable named children.
NodeList children = node.getChildNodes(); |
- A method named getLength returns the number of nodes in the list.
- A method named item takes a parameter of type int, and uses that parameter to return the Node object's reference that is stored at that index.
Provided that the NodeList reference in the variable named children is not null, the code in Listing 15 uses a for loop to process each node whose reference is stored in the list.
if (children != null){ |
This causes the program to recursively examine every node in the DOM tree (except for attribute nodes), extracting and displaying information about each node as it is examined. This includes nodes in the prolog of the XML document as well as nodes in the body of the XML document.
Decrease indentation level and terminate processNode method
When all of the recursive invocations of the processNode method made for the children of the current node have returned, the current invocation of processNode decreases the value of the variable named indent just before it terminates, as shown in Listing 16.
indent--; |
The end of the doIndent method signals the end of the class and the end of the program named DomTree01.
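Putting the pieces together, the overall shape of the recursion can be sketched as follows. This is a condensed sketch under stated assumptions, not a copy of DomTree01. The class name TreeDumpSketch and the method named dump are hypothetical, the output is collected in a StringBuilder instead of being printed directly, the numeric type code is shown in place of the alphanumeric type name, and I have assumed two spaces per indentation level with indent starting at -1 so that the Document node is displayed at the left margin.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical condensed version of the recursive traversal.
class TreeDumpSketch {
  private int indent = -1;  // assumed starting value
  private final StringBuilder out = new StringBuilder();

  // Display one node, then recursively display its children.
  private void processNode(Node node) {
    if (node == null) {
      out.append("Nothing to do, node is null\n");
      return;
    }
    indent++;  // one level deeper on entry
    doIndent();
    out.append(node.getNodeName()).append(' ')
       .append(node.getNodeType()).append('\n');

    NodeList children = node.getChildNodes();
    if (children != null) {
      for (int i = 0; i < children.getLength(); i++) {
        processNode(children.item(i));
      }
    }
    indent--;  // restore the depth on exit
  }

  // Append two spaces for each level of depth.
  private void doIndent() {
    for (int i = 0; i < indent; i++) {
      out.append("  ");
    }
  }

  // Parse the XML text and return the indented tree display.
  static String dump(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      TreeDumpSketch sketch = new TreeDumpSketch();
      sketch.processNode(doc);
      return sketch.out.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.print(dump("<A><Q>text</Q><B/></A>"));
  }
}
```

For the made-up document in the main method, the display shows #document at the left margin, A indented one level, Q and B indented two levels, and the #text child of Q indented three levels.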
The program named DomTree02
The program named DomTree02 is an upgraded version of DomTree01. This program displays the actual text belonging to text nodes instead of simply showing the type of node as TEXT_NODE.
DomTree02 also displays attribute names and values, which is not the case with DomTree01.
Sample output from DomTree02
Figure 7 shows the output produced by using DomTree02 to process the XML file named DomTree02.xml. (You can view a listing of this XML file in Listing 23 near the end of the lesson.)
I colored the attributes red and the text green in Figure 7 to make them easy to spot.
#document DOCUMENT_NODE Figure 7 |
Displaying text versus displaying node type
Sometimes it can be very useful to display the actual text values in the tree. At other times, the text is so voluminous that it completely overwhelms the display, making it difficult to pick out the structure of the tree. In those cases, the version that simply identifies the node as a text node is probably advantageous.
Will discuss in fragments
I will discuss the program named DomTree02 in fragments. A complete listing of the program is shown in Listing 22 near the end of the lesson.
Large portions of this program are identical or very similar to the code in the program named DomTree01, discussed earlier in this lesson. Therefore, I won't repeat the discussion of that code. Rather, I will restrict this discussion to those parts of this program that differ from the earlier program.
The main method in this program is essentially the same as the main method in the previous program, so I will skip a discussion of the main method.
As before, the method named processNode is used to recursively process the entire DOM tree, extracting and displaying information about the nodes in the tree along the way. The method named processNode in this program is the same as in the previous program except for the code in a couple of cases in the switch statement.
New features in DomTree02
Previously, the cases in the switch statement were used to display the alphanumeric type of each node in the tree. In this program, the case for TEXT_NODE is modified to cause the actual text value of the text node to be displayed instead of the type of the node.
In addition, the case for ELEMENT_NODE in this program is modified to get and display the names and values of all attributes associated with elements.
The ELEMENT_NODE case
I will begin by explaining the changes to the ELEMENT_NODE case in the switch statement. Listing 17 shows the beginning of the ELEMENT_NODE case.
private void processNode(Node node){ |
There is a very important conceptual issue to deal with here. Specifically, attribute nodes are not child nodes of element nodes. All child nodes of an element node can be obtained in a collection of type NodeList by invoking the method named getChildNodes on the element node, but that collection does not include the element's attributes.
In order to get the attributes belonging to an element node, it is necessary to invoke the method named getAttributes on the element node. This method returns a reference to an object of type NamedNodeMap containing unordered references to the attribute nodes.
NamedNodeMap versus NodeList
A NamedNodeMap is a different type of data structure than a NodeList.
A NodeList is an ordered collection of references to Node objects. Items in the list are accessed on the basis of an ordinal index. They cannot be accessed on the basis of the name of a node. The order of the items in the list matches the ordering of the corresponding nodes in the DOM tree.
NamedNodeMap
Sun describes objects of type NamedNodeMap as
Sun goes on to tell us,
Therefore, references to objects representing attribute nodes can be accessed in a NamedNodeMap object either on the basis of the attribute name, or on the basis of an ordinal index. I will use an ordinal index in this program, as shown in Listing 18.
Get and display name and value of attribute nodes
Listing 18 shows the remaining code for the ELEMENT_NODE case in the switch statement.
for(int i = 0; i < attrLen; i++){ |
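Both ways of accessing a NamedNodeMap can be sketched in a few lines. This is a minimal sketch, not the code from Listings 17 and 18. The class name AttrListSketch, the method names, and the element with attributes named id and lang are hypothetical. Because the order of the items in a NamedNodeMap is not specified, the sketch sorts the name=value pairs before joining them so that the result is predictable.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the attribute names are made up for illustration.
class AttrListSketch {

  // Collect name=value pairs by ordinal index, then sort them.
  static String describe(Element element) {
    NamedNodeMap attrs = element.getAttributes();
    List<String> pairs = new ArrayList<>();
    for (int i = 0; i < attrs.getLength(); i++) {
      Node attr = attrs.item(i);
      pairs.add(attr.getNodeName() + "=" + attr.getNodeValue());
    }
    Collections.sort(pairs);  // map order is unspecified, so sort for display
    return String.join(" ", pairs);
  }

  static String demo() {
    try {
      Element root = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader("<B lang=\"en\" id=\"b1\"/>")))
          .getDocumentElement();
      // Access by index (inside describe) and access by name (getNamedItem).
      String byName = root.getAttributes().getNamedItem("lang").getNodeValue();
      return describe(root) + " | " + byName;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());  // prints id=b1 lang=en | en
  }
}
```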
The modified TEXT_NODE case
Listing 19 shows the modified TEXT_NODE case in the switch statement, and the end of the switch statement.
//Case code deleted for brevity |
This version invokes the method named getNodeValue on the node and displays the String that is returned by that method. This code produced the green text values for the text nodes represented in Figure 7.
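The difference between the two versions of the TEXT_NODE case comes down to which method supplies the string to display. Here is a minimal sketch of that choice. The class name TextValueSketch and the method named label are hypothetical, and the sketch returns the string instead of printing it. For an Element node, getNodeValue returns null, which is why only text nodes are displayed by value.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class contrasting node names with node values.
class TextValueSketch {

  // Text nodes are shown by value; every other node is shown by name.
  static String label(Node node) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      return node.getNodeValue();
    }
    return node.getNodeName();
  }

  static String demo() {
    try {
      Element q = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader("<Q>hello</Q>")))
          .getDocumentElement();
      Node text = q.getFirstChild();
      // Prints as: Q / null / hello
      return label(q) + " / " + q.getNodeValue() + " / " + label(text);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());  // prints Q / null / hello
  }
}
```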
Beyond this point, both programs are the same
The remainder of this program is the same as DomTree01, and therefore, doesn't merit further discussion.
Run the Programs
I encourage you to copy the code and XML data from Listings 20 through 23 into your text editor. Compile and execute the programs. Experiment with them, making changes, and observing the results of your changes.
Summary
In this lesson, I showed you how to write a program to display a DOM tree on the screen in a format that is much easier to interpret than raw XML code. I explained two different versions of the program. One version simply identifies text nodes in the output tree. The other version displays the value of text nodes in the output tree. Also, the first version ignores attributes in the output tree, while the second version includes attributes in the output tree.
What's Next?
In the next lesson, I will explain default XSLT behavior and show you how to write Java code that mimics that behavior. The resulting Java code will serve as a skeleton for more advanced transformation programs.
Complete Program Listings
/*File DomTree01.java |
/*File DomTree02.java |
<?xml version="1.0"?> |
Copyright 2003, Richard G. Baldwin. Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.
About the author
Richard Baldwin is a college professor (at Austin Community College in Austin, TX) and private consultant whose primary focus is a combination of Java, C#, and XML. In addition to the many platform and/or language independent benefits of Java and C# applications, he believes that a combination of Java, C#, and XML will become the primary driving force in the delivery of structured information on the Web.
Richard has participated in numerous consulting projects, and he frequently provides onsite training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Programming Tutorials, which has gained a worldwide following among experienced and aspiring programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.
-end-
This article was originally published on December 24, 2003