Java Programming Notes # 2206
- Preface
- Preview
- Some Details Regarding XSLT
- Discussion and Sample Code
- Run the Program
- Summary
- What’s Next?
- Complete Program Listings
Preface
In this lesson, I will explain default XSLT behavior,
and will show you how to write Java code that mimics that
behavior.
The resulting Java code serves as a skeleton for more advanced
transformation programs.
What is JAXP?
JAXP is an
API designed
to help you write programs for creating and processing XML
documents. JAXP is
very important for many reasons, not the least of which is the
fact that it is a critical part of Sun’s Java Web Services Developer
Pack
(JWSDP). As you are probably already aware, web services is
expected by many to be a very important aspect of the Internet of the
future.
This lesson is one in a series designed to help you
understand how to use JAXP and how to use the JWSDP.
The first lesson in this series was
entitled Java
API for XML Processing (JAXP), Getting Started.
The
previous lesson was entitled Java
JAXP, Exposing a DOM Tree.
What is XML?
XML is an acronym for the eXtensible Markup Language.
I will assume that you already
understand
XML, and will teach you how to use JAXP to write programs for
creating and processing XML documents.
What are XSL
and XSLT?
I provided quite a lot of background material on XSL and XSLT
in a previous lesson in this series. A brief review of
that
material follows.
XSL is an acronym for Extensible Stylesheet Language.
XSLT is an acronym for XSL Transformations.
The W3C is a
governing body that has published many important documents on XML, XSL,
and
XSLT.
The uses of XSLT include the following:
- Transforming non-XML documents into XML documents.
- Transforming XML documents into other XML documents.
- Transforming XML documents into non-XML documents.
Viewing tip
You may find it useful to open another copy of this lesson in a
separate browser window. That will make it easier for you to
scroll back and forth among the different listings and figures while
you are reading about them.
Supplementary material
I recommend that you also study the other lessons in my extensive
collection of online Java and XML tutorials. You will find those
lessons
published at Gamelan.com.
As of the date of this writing, Gamelan doesn’t maintain a
consolidated index of my tutorial lessons, and sometimes
they are difficult to locate there. You will find a consolidated
index at www.DickBaldwin.com.
Preview
A tree structure in memory
A DOM parser can be used to
create a tree structure in memory that represents an XML
document. In Java, that tree structure is encapsulated in an
object of the interface type Document. Document
and its superinterface Node declare numerous methods that can
be used to navigate, extract information from, modify, and otherwise
manipulate the DOM tree. As
is always the case, classes that implement Document must
provide concrete definitions of those methods.
Many operations are possible
Given an object of type Document, there are many
methods that
can be invoked on the object to perform a variety of operations.
For example, it is possible to write Java code to move nodes from one
location in the tree
to another location in the tree, thus rearranging the structure of the
XML document represented by the Document object. It is
possible to delete nodes, and to insert new nodes. It is
also possible
to
recursively traverse the tree, extracting information about the nodes
along
the way.
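The following is a minimal sketch of the kinds of operations described above: moving a node, deleting a node, and inserting a new node. The class name DomEditSketch, the helper methods, and the tiny sample document are assumptions made for illustration only; they are not part of the program discussed in this lesson.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomEditSketch {

    // Hypothetical helper: parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the names of the root element's child elements, in document order.
    static String childNames(Document doc) {
        StringBuilder names = new StringBuilder();
        for (Node child = doc.getDocumentElement().getFirstChild();
                child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.append(child.getNodeName()).append(' ');
            }
        }
        return names.toString().trim();
    }

    // Moves the first child to the end, deletes the first <b>,
    // and inserts a new <d> in front of the remaining children.
    static String rearrange(String xml) {
        Document doc = parse(xml);
        Element root = doc.getDocumentElement();

        // Move: appendChild detaches the node from its old position first.
        root.appendChild(root.getFirstChild());

        // Delete: remove the first <b> element from its parent.
        Node b = root.getElementsByTagName("b").item(0);
        b.getParentNode().removeChild(b);

        // Insert: create a new element and place it before the first child.
        Element d = doc.createElement("d");
        root.insertBefore(d, root.getFirstChild());

        return childNames(doc);
    }

    public static void main(String[] args) {
        System.out.println(rearrange("<r><a/><b/><c/></r>")); // prints: d c a
    }
}
```

All of the methods used here (appendChild, removeChild, insertBefore, createElement) are declared by the Node and Document interfaces, so the same calls work regardless of which parser produced the tree.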
Two ways to
transform an XML document
There are at least two ways to transform the contents of an XML
document into another document:
- By writing Java code to manipulate the DOM and perform the transformation.
- By using XSLT to perform the transformation.
It should be possible to write Java code to perform any
transformation that can be performed using XSLT, but the reverse may
not be true.
General
description of XSLT
Here is a partial quotation from XML In A Nutshell, (which I highly recommend), by
Elliotte Rusty Harold and
W. Scott Means. This quotation provides a general description of
XSLT:
“…
(XSLT) is a functional programming language used to specify how an
input XML document is converted into another text document — possibly,
though not necessarily, another XML document. An XSLT processor
reads both an input XML document and an XSLT stylesheet (which is
itself an XML document because XSLT is an XML application) and produces
a result tree as output. … Documents can be transformed using a
standalone program or as part of a larger program that communicates
with the XSLT processor through its API.”
In this lesson, I will provide and explain a larger program that communicates
with the XSLT processor through its API. The program will also
execute Java code that mimics the transformation provided by XSLT.
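As a rough illustration of what it means for a program to communicate with the XSLT processor through its API, here is a minimal sketch that uses the JAXP classes in javax.xml.transform. The class name XsltApiSketch, the one-rule stylesheet, and the sample XML are assumptions made up for this sketch; the program discussed later in this lesson reads its XML and stylesheet from files instead.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltApiSketch {

    // Hypothetical stylesheet: wraps the text of every title element in brackets.
    static final String STYLESHEET =
            "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='text'/>"
            + "<xsl:template match='title'>[<xsl:value-of select='.'/>]</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet text to the XML text and returns the result as a string.
    static String transform(String xml, String xsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            // The Transformer object encapsulates the compiled stylesheet.
            Transformer transformer =
                    factory.newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<book><title>Java</title></book>";
        System.out.println(transform(xml, STYLESHEET)); // prints: [Java]
    }
}
```

The book element has no matching rule in this stylesheet, so the processor falls back on a built-in rule to reach the title element. That fallback behavior is the subject of the remainder of this lesson.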
Advantages
and disadvantages
As is usually the case, there are advantages and disadvantages to
both approaches to
document transformation.
As an example of an advantage provided by XSLT, if it is possible to
perform the required
transformation using XSLT, that approach will probably require you to
write less code than would be required to perform the same
transformation by writing a Java program from scratch.
A large
library of functions
With the XSLT transformation process, you write a stylesheet, which
is somewhat analogous to a driver program in a more conventional
programming environment. That driver program accesses and
uses functions from a large library of pre-written functions to perform
a series of well-defined operations on the DOM tree to produce
the desired transformation.
(XSLT
authors don’t call them functions. Rather, they are called XSLT
elements. According to XML
In A Nutshell, there are 37 standard
XSLT
elements. Also according to XML In A Nutshell, most
XSLT
processors also provide various nonstandard extension elements and
allow you to write your own extension elements in languages such as
Java.)
Is there a
similar library of Java methods?
I am not aware of a library of Java methods in the public domain
that emulates the 37 standard XSLT Elements. However, I freely
admit that such a library may exist and I may simply not know
about it.
Therefore, to write a Java program that emulates an XSLT
transformation, you need to either
- Create your own library of Java methods and use that library with your Java code to perform the transformation, or
- Start from scratch each time and write a custom program to perform the transformation.
A skeleton
library of Java methods
This lesson, and several lessons to follow this one, will show you
how to write the skeleton of a Java library containing methods that
emulate the most common XSLT elements. Once you have the library,
writing Java code to transform XML documents consists simply of writing
a short driver program to access and use those methods. Thus,
given the proper library of methods, it is no more difficult to write a
driver Java program to perform the transformation than it is to write
an
XSLT stylesheet.
Library is
not my primary purpose
However, my primary purpose in these lessons is not to provide such
a library, but rather is to help you understand how to use a DOM
tree to create, modify, and manipulate XML documents. By
comparing Java code that manipulates a DOM tree with similar XSLT
operations, you will have an opportunity to learn a little about XSLT
in the process of learning how to manipulate a DOM tree using Java code.
If you already know a lot about XSLT, you may learn a little
about Java by studying these lessons. If you already know a lot
about Java, you may learn a little about XSLT. If you don’t
already know either
Java or XSLT, you may learn a little about both.
Debugging
XSLT can be difficult
While writing a Java program to emulate an XSLT Transformation may
require you to write more code than writing a stylesheet, in my
opinion, it is much easier to debug a Java program that fails to
deliver the desired result than it is to debug an XSL stylesheet that
fails to deliver. This is an advantage of
using Java code over XSLT. I find XSLT to be extremely difficult
to debug (but I haven’t attempted to
use a fancy XSLT debugger, several of which are freely available on the
Internet).
Java
provides more detailed control
Another difference in using Java code relative to XSLT has to do
with
the detailed control of the transformation process. I
believe, (but cannot prove),
that it is possible to write Java programs
to provide transformations that are not possible using standard XSLT
elements. If I am correct, this may be another
advantage of writing Java code over using XSLT.
Some
Details Regarding XSLT
The following is a partial quotation from XML In A Nutshell. (Note that I will be referring to
this excellent book several more times in this lesson. For
brevity, I will refer to it simply as Nutshell.)
“XSLT
is an XML application for specifying rules by which one XML document is
transformed into another XML document. An XSLT document — that
is, an XSLT stylesheet — contains template rules. Each template
rule has a pattern and a template. An XSLT processor compares the
elements and other nodes in an input XML document to the template-rule
patterns in a stylesheet. When one matches, it writes the
template from that rule into the output tree. … XSLT uses the
XPath syntax to identify matching nodes.”
My
explanation
Let’s see if I can explain this process in my own words.
Assume that an XML document has been parsed so as to produce a DOM tree
in memory that represents the XML document. (The creation of a DOM tree in this manner
was discussed in several previous lessons
in this series.)
An XSLT processor starts examining the DOM tree at its root
node. It
obtains instructions from the XSLT stylesheet telling it how to
navigate the
tree, and what to do with each node that it encounters along the way.
Finding
matching template rules
As each node is encountered, the processor searches the stylesheet
looking for instructions on how to treat that node. (These instructions will be referred to
later as template rules.) If the processor finds
instructions that match the node type, it performs the operations
indicated by the
instructions. If it doesn’t find matching instructions, it
executes built-in instructions appropriate to that node.
(An XML
document can contain seven different types of nodes. The
different types will be identified later. This lesson will
describe and explain the built-in
instructions for six of those seven node types. Java code will be
developed that emulates the built-in
instructions for each of the six types of nodes.)
Establishing
the context node
An XPath expression can be
used to point to a specific node and to
establish that node as the context node. Once a context node is
established, there are at least two XSLT elements that can be used to
manage the traversal among children of that node:
- xsl:apply-templates
  - select, optional attribute
  - mode, optional attribute
  - xsl:sort, optional XSLT element
- xsl:for-each
  - select, required attribute
  - xsl:sort, optional XSLT element
The
xsl:apply-templates XSLT element
The first of these, xsl:apply-templates,
examines and processes all child nodes of the context node that match
an optional select
attribute.
(When
combined with a default template rule to be discussed later, this often
results in a recursive examination and processing of all descendant
nodes of the context node.)
According to Nutshell,
“The
xsl:apply-templates instruction tells the processor to search for and
apply the highest-priority template in the stylesheet that matches each
node identified by the select attribute.”
Applying
template rules
As each node is examined, the processor searches the stylesheet to
determine if the XSLT programmer has provided a template rule that
matches the node and defines how that
node should be treated. If a matching template rule is found, the
node is treated in the manner prescribed by the template rule.
Literal text
in the XSLT stylesheet elements
You can think of the XSLT process as operating on an input DOM tree
to produce an output DOM tree. If the template rule being applied
contains literal text, that literal text is used to
create a text node in the output tree.
(I will
explain how this feature is used to transform XML documents into XHTML
documents in a future lesson.)
If no match
is found
If a matching template rule is not found, the processor executes a
built-in template rule appropriate to the type of node involved.
Built-in template rules are provided by the XSLT processor to handle
the seven different types of nodes in an XML document:
- root node
- element node
- attribute node
- text node
- comment node
- processing instruction node
- namespace node
This lesson will explain the built-in rules that handle the first
six types of nodes in the above list.
Recursion is
common
As mentioned earlier, the combination of xsl:apply-templates and a built-in
template rule often produces recursion. Assuming that there is
nothing in a matching template rule that stops
the recursion operation, recursion continues until all descendant nodes
of the original context node have been examined and processed.
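The recursion described above can be sketched in Java as a single method that calls itself for each child node. This is only an illustration under stated assumptions: the class name BuiltInRulesSketch and its methods are hypothetical, and the sketch relies on the standard XSLT 1.0 built-in rules (process the children of root and element nodes, copy the value of text nodes, do nothing for comments and processing instructions).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BuiltInRulesSketch {

    // Hypothetical helper: parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Applies the built-in rule that corresponds to the type of the node.
    static void applyBuiltInRule(Node node, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
            case Node.ELEMENT_NODE:
                // Like <xsl:apply-templates/>: process every child in document
                // order. Attributes are never reached because they are not children.
                for (Node child = node.getFirstChild(); child != null;
                        child = child.getNextSibling()) {
                    applyBuiltInRule(child, out);
                }
                break;
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
            case Node.ATTRIBUTE_NODE:
                // Like <xsl:value-of select="."/>: copy the value to the output.
                out.append(node.getNodeValue());
                break;
            default:
                // Comments and processing instructions produce no output.
                break;
        }
    }

    // Emulates the transformation produced by the built-in rules alone.
    static String transform(String xml) {
        StringBuilder out = new StringBuilder();
        applyBuiltInRule(parse(xml), out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet type='text/xsl' href='Dom11a.xsl'?>"
                + "<books><!-- a comment --><book id='1'><title>Java</title>"
                + "<price>$19.60</price></book></books>";
        System.out.println(transform(xml)); // prints: Java$19.60
    }
}
```

The recursion stops naturally at text nodes, comments, and processing instructions, none of which have children.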
The mode
attribute
The mode attribute of xsl:apply-templates makes it
possible to cause different template
rules to match nodes of the same type at different places in the DOM
tree.
Sorting
The optional xsl:sort
element makes it possible to modify the
order in which the nodes are examined.
Iterative
operation
The second XSLT element in the above list, xsl:for-each, executes an iterative
examination and processing of all child nodes of the context node that
match the required select attribute.
According to Nutshell,
“The
xsl:for-each instruction iterates over the nodes identified by its
select attribute and applies templates to each one.”
In other words, the processor will visit, one at a time, each node
selected by the select attribute (typically child nodes of the
context node). As each node is visited, it becomes the current node,
and the processor instantiates the template that forms the body of the
xsl:for-each element for that node.
Unlike xsl:apply-templates, xsl:for-each does not search the
stylesheet for a matching template rule. Template rules, including the
built-in template rules, come into play only if the body of the
xsl:for-each element itself contains an xsl:apply-templates element.
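A loose Java analogue of the iterative operation is an ordinary loop over the child nodes of the context node. The sketch below assumes a select value that simply names a child element, and a loop body equivalent to xsl:value-of with a select value of "."; the class name ForEachSketch and its methods are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ForEachSketch {

    // Hypothetical helper: parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Iterates over the child elements of the context node whose name matches
    // childName, and collects the text value of each one.
    static List<String> forEachChild(Node context, String childName) {
        List<String> results = new ArrayList<>();
        for (Node child = context.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && child.getNodeName().equals(childName)) {
                // Loop body: the concatenated text of the element's descendants.
                results.add(child.getTextContent());
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Document doc = parse("<books><book>A</book><cd>X</cd><book>B</book></books>");
        System.out.println(forEachChild(doc.getDocumentElement(), "book")); // prints: [A, B]
    }
}
```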
As before, the optional xsl:sort
element makes it possible to modify the
order in which the nodes are examined. I will explain this in
detail in a future lesson.
Combined
operations
Frequently a stylesheet will combine recursive and iterative
operations to produce more complex operations.
Enough talk, let’s
see some code
I will begin by discussing the XML file named Dom11.xml (shown in Listing 29) along with
the XSL
stylesheet file named Dom11.xsl
(shown in Listing 30).
These two listings are provided near the end of the lesson.
After explaining the transformation produced by applying this
stylesheet to this XML document, I will explain the transformation
produced by applying the empty stylesheet
named Dom11a.xsl, (shown in Listing 33), to a nearly
identical XML document.
(The two XML files are the
same except that they refer to different stylesheet files, one of which
is empty.)
A Java program
named Dom11
Following that, I will explain a Java program (shown in Listing 31) that
emulates the behavior of the stylesheets shown in Listings 30 and 33
when
applied to the XML file shown in Listing 29.
I will explain that the Java program shown in Listing 31 emulates
the behavior of the empty stylesheet shown in Listing 33, and will
explain why that is true.
Discussion
and Sample Code
The XML
file named Dom11.xml
The XML file shown in Listing 29 is relatively straightforward. A
tree view of that XML file is shown in Figure 1.
(A program named DomTree02, discussed in an earlier lesson, was used to
produce this tree view of the XML file.
The values of the text nodes in Figure 1 were manually highlighted in
red to make it easier to refer to those values later in this lesson.)
#document DOCUMENT_NODE Figure 1 |
A database of books
As you may already have figured out,
this XML document represents a small database containing information
about books. However, the structure and content of this XML file
was not intended to have any purpose other than to illustrate the
default
behavior of the built-in XSLT template rules.
The XSL
stylesheet file named Dom11.xsl
The stylesheet file shown in Listing 30 is very important relative to
the purpose
of this lesson, so I will discuss it in detail.
Recall that an XSL stylesheet is itself an XML file, and can therefore
be represented as a tree. I will begin by showing you an
abbreviated version of a tree view of the stylesheet, as shown in
Figure 2.
#document DOCUMENT_NODE Figure 2 |
Why abbreviated?
The reason that I refer to this as
an abbreviated version is because I manually deleted comment nodes and
extraneous text nodes in order to emphasize the important elements in
the document.
(I also manually inserted a line break in the third line to force the material to fit into this narrow
publication format.)
The root element
The root node of all XML documents is the document node. However,
in addition to the root node, there is also a root element.
As you can see from Figure 2, the root element in the XSL document is
of type xsl:stylesheet.
The root element has two attributes, each of which is standard for XSL
stylesheets.
The first attribute points to the XSLT namespace URI, which you can
read about in the W3C
Recommendation. The second attribute provides the XSLT
version. According to Nutshell, the version must be 1.0.
Also, according to Nutshell,
“… correct. If even so much as a single character is wrong, the
stylesheet processor will output the stylesheet itself instead of
either the input document or the transformed input document.”
Unable to
verify this behavior
I have been unable to verify this behavior experimentally. When I
delete a character from the XSL namespace URI and then load the XML
file into IE 6.0, there is simply no output. The browser screen
remains blank. When I modify the XSL namespace URI and attempt to
use JAXP to apply the stylesheet to the XML file, the system throws
several errors and the program aborts. Neither approach seems to “output the stylesheet itself” as
indicated by Nutshell.
Children of the
root element node
As you can see from Figure 2, the root element node has two child
nodes, both of which are of type xsl:template.
Here is what XSLT
and XPath On The Edge by Jeni Tennison has to say about xsl:template:
“… can be applied (if a match pattern is specified) or called (if a name
is specified).”
As you can see from the attribute values in Figure 2, a match pattern
is provided for both of the xsl:template
nodes in Figure 2.
(Templates that specify a match pattern are also called template rules.)
Back to basics
Getting back to XSLT basics, whenever the XSLT processor encounters a
node while traversing the DOM tree, it will examine all of the template
rules in the stylesheet searching for one whose match pattern matches
the node. If it finds a matching template rule, it will execute
the instructions contained as elements within the template rule.
If it doesn’t find a match, it will execute a built-in template rule
that matches the node.
An explicit
representation of a built-in template rule
Consider the first child node of the xsl:stylesheet
root element in Figure 2. Listing 1 shows this template rule in
XSL syntax, (extracted from Listing
30).
<xsl:template match="*|/">
  <xsl:apply-templates/>
</xsl:template>
Listing 1
The template rule shown in Listing 1
is an explicit representation of one of the built-in template
rules.
Matching the
root node and element nodes
Consider the match pattern for this template rule (the text value of the attribute named
match). According to Nutshell,
“The asterisk * is an XPath wild-card pattern that matches all element
nodes, regardless of what name they have or what namespace they’re in.
The forward slash / is an XPath pattern that matches the root
node.
This is the first node the processor selects for processing, and
therefore this is the first template rule the processor executes
(unless a nondefault template rule also matches the root node).
… the vertical bar combines these two expressions so that it matches
both the root node and element nodes.”
The
<xsl:apply-templates/> element
Now consider the <xsl:apply-templates/> element
that makes up the body of this template rule. This element causes
the processor to process all child nodes of each matching node,
examining nodes, searching for matching template rules, and executing
the elements embedded in matching template rules along the way.
Again, according to Nutshell, still speaking of the template rule in
Listing 1,
“In isolation, this rule means that the XSLT processor eventually finds and
applies templates to all nodes except attribute and namespace nodes
because every nonattribute, non-namespace node is either the root node,
a child of the root node, or a child of an element. Only
attribute and namespace nodes are not children of their parents.”
An explicit
representation of a built-in template rule
Once again, the template rule shown in Listing 1 is an explicit
representation of one of the built-in template rules. If I were
to remove this template rule from the stylesheet, and then apply the
stylesheet to the XML document, this template rule would still be
applied where appropriate by the XSLT processor, because it is built
into the processor.
Handling text
nodes by default
Listing 2 shows the template rule, in XSL syntax, that corresponds to
the second child node of the root element node in Figure 2. Once
again, this is a template rule with a match pattern. This
template rule is also an explicit representation of one of the built-in
rules, which copies the value of text and attribute nodes into the
output document.
<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>
Listing 2
The match
pattern
The text() in the value of
the attribute named match is an XPath pattern matching all text
nodes. The @* is
an XPath pattern matching all attribute nodes. The vertical bar
combines the two patterns. Hence, the template rule matches all
text and all attribute nodes.
The
xsl:value-of element
Once a match is made, the behavior of the rule is governed by the
single element that is embedded in the rule. The xsl:value-of element, with a select value of “.”, returns the
text value of the context or current node. (This is similar to the use of a single
period to represent the current directory in some file management
systems such as MSDOS.)
Text value to
the output
Therefore, whenever the XSLT processor applies this template rule to a
text or attribute node, the text value of that node is sent to the
output document (a text node is
created in the output tree).
If the node is a text node, the value is simply the text in the node.
If the node is an attribute node, the value is the attribute value, but
not the attribute name.
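In Java, the closest counterpart to this rule for text and attribute nodes is the getNodeValue method of the Node interface. The sketch below is only an illustration; the class name ValueOfSketch and the sample element are assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ValueOfSketch {

    // Hypothetical helper: parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the text value of a text node or an attribute node.
    // (For an element node, getNodeValue returns null, so this sketch
    // is intended only for text and attribute nodes.)
    static String valueOf(Node node) {
        return node.getNodeValue();
    }

    public static void main(String[] args) {
        Element book = parse("<book id='42'>Java</book>").getDocumentElement();

        // Text node: the value is simply the text in the node.
        System.out.println(valueOf(book.getFirstChild())); // prints: Java

        // Attribute node: the value is the attribute value, not the name.
        Node id = book.getAttributeNode("id");
        System.out.println(valueOf(id));       // prints: 42
        System.out.println(id.getNodeName());  // prints: id
    }
}
```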
The output
Now it’s time for the big question. What does the output look
like when the stylesheet shown in Listing 30 is used to transform the
XML document shown in Listing 29? The result of such a
transformation is shown in Figure 3.
(Note that I manually inserted a line break near the end of the fourth line
in Figure 3 to force the material to fit in this narrow publication
format. This caused the text $19.60 to move down to the fifth
line.)
<?xml version="1.0" encoding="UTF-8"?> Figure 3 |
The XML declaration
The first line in Figure 3 is an XML declaration that was placed there
by the XSLT processor independent of the content of the XML file.
The text in the
output
If you compare the text in Figure 3 with the material highlighted in
red in Figure 1, you will see that the output produced by this
stylesheet containing only explicit representations of default template
rules is the concatenation of text
values for all the element nodes in the XML document.
Line breaks in
the output
The two line breaks following the words Java and rules in Figure 3 correspond to the
line breaks in the text portion of the title
element shown in Listing 3. (This
element was extracted from the original XML file in Listing 29.)
<title>Java |
Because these two line breaks occur within the text portion of the
element, they also appear in the output in Figure 3. In other
words, the line breaks are considered by the XSLT processor to be a
legitimate part of the text content of the element.
The remaining line breaks in the XML file shown in Listing 29 occur
between XML tags. Therefore, they are not considered to be a part
of the text content of any element and they do not appear in Figure 3.
No attribute
values in the output
You may have noticed that even though a couple of the elements in the
XML file have attributes (see Figure
1), and one of the template rules matches attribute nodes, the
attribute values do not appear in the output shown in Figure 3.
Nutshell explains this in the following way:
“… should happen when an attribute node is reached, by default the XSLT
processor never reaches attribute nodes and, therefore, never outputs
the value of an attribute.”
Nutshell goes on to tell us,
“… this template only if a specific rule applies templates to them, and
none of the default rules do this because attributes are not considered
to be children of their parents. In other words, if element E has
an attribute A, then E is the parent of A, but A is not the child of E.”
Finally, Nutshell tells us,
“… element with <xsl:apply-templates/> does not apply templates to
attributes of the element. To do that, the xsl:apply-templates
element must contain an XPath expression specifically selecting
attributes.”
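The DOM API draws a similar line between attributes and children, which the following sketch tries to show. The child list of an element does not include its attributes; they must be requested separately through getAttributes, much as a stylesheet must select them explicitly with a pattern such as @*. The class name AttributeSketch and its methods are hypothetical, and the order in which a NamedNodeMap returns multiple attributes is implementation dependent.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeSketch {

    // Hypothetical helper: parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Roughly emulates applying templates with select="@*|node()":
    // attributes of an element are visited first, then its children.
    static void collect(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.ATTRIBUTE_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        if (type == Node.ELEMENT_NODE) {
            // Attributes are not in the child list; ask for them explicitly.
            NamedNodeMap attributes = node.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                collect(attributes.item(i), out);
            }
        }
        if (type == Node.ELEMENT_NODE || type == Node.DOCUMENT_NODE) {
            for (Node child = node.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                collect(child, out);
            }
        }
    }

    static String textWithAttributes(String xml) {
        StringBuilder out = new StringBuilder();
        collect(parse(xml), out);
        return out.toString();
    }

    public static void main(String[] args) {
        Element book = parse("<book id='42'>Java</book>").getDocumentElement();
        System.out.println(book.getChildNodes().getLength()); // prints: 1 (text node only)
        System.out.println(book.getAttributes().getLength()); // prints: 1 (the id attribute)
        System.out.println(textWithAttributes("<book id='42'>Java</book>")); // prints: 42Java
    }
}
```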
Applying an
empty stylesheet
Now consider the stylesheet shown in Listing 33, as shown in
abbreviated tree format in Figure 4.
(Comment nodes and extraneous text nodes were manually removed from Figure 4.)
#document DOCUMENT_NODE Figure 4 |
Unlike Figure 2, the stylesheet
represented by Figure 4 doesn’t contain any template rules. In
fact, except for the root (document)
node and the xsl:stylesheet
root
element node, the stylesheet is completely empty.
Produces
exactly the same output
However, applying the empty stylesheet to the XML file discussed
earlier produces exactly the same result as applying the stylesheet
shown in Listing 30 and Figure 2 to that XML file.
This is because the two template rules shown in Listing 30 and Figure 2
replicate the behavior of two of the built-in template rules.
Therefore, removing them from the stylesheet has no impact on the
result produced by applying the stylesheet to the XML file. If
they are needed, they are available as built-in rules of the XSLT
processor.
Transformation
behavior of an empty stylesheet
As explained above, removing the two template rules from the previous
stylesheet to produce an empty stylesheet had absolutely no impact on
the transformation result. The result produced by the previous
stylesheet was identical to that produced by the empty stylesheet.
According to Nutshell, when you transform an XML document using an
empty stylesheet,
“… declaration plus the text of the input document. … Markup from the
input document has been stripped. The net effect of applying an
empty stylesheet … to any input XML document is to reproduce the
content but not the markup of the input document. To change that,
we’ll need to add template rules to the stylesheet telling the XSLT
processor how to handle the specific elements in the input
document. In the absence of explicit template rules, an XSLT
processor falls back on built-in rules …”
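One way to check this claim for yourself is to apply both kinds of stylesheet to the same input through JAXP and compare the results. The following is a minimal sketch under stated assumptions: the class name EmptyStylesheetSketch, the in-memory stylesheets, and the small sample document are made up for illustration, and the two explicit rules are the ones described earlier in this lesson.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EmptyStylesheetSketch {

    // A stylesheet with no template rules at all.
    static final String EMPTY =
            "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'/>";

    // A stylesheet that spells out two of the built-in rules explicitly.
    static final String EXPLICIT =
            "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:template match='*|/'><xsl:apply-templates/></xsl:template>"
            + "<xsl:template match='text()|@*'><xsl:value-of select='.'/></xsl:template>"
            + "</xsl:stylesheet>";

    // Hypothetical sample input, written without whitespace between tags.
    static final String SAMPLE =
            "<books><book id='1'><title>Java</title>"
            + "<price>$19.60</price></book></books>";

    // Applies the stylesheet text to the XML text and returns the result.
    static String transform(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String fromEmpty = transform(SAMPLE, EMPTY);
        String fromExplicit = transform(SAMPLE, EXPLICIT);
        System.out.println(fromEmpty);
        System.out.println(fromExplicit);
        System.out.println("Same result: " + fromEmpty.equals(fromExplicit));
    }
}
```

In both cases the markup and the attribute value are gone, and only the text of the title and price elements remains after the XML declaration.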
Combined output
Whenever the XSLT processor
encounters a node for which you haven’t defined a matching template
rule, the default template rule for that type of node will be
applied.
Therefore, the total output is often a combination of the output
produced by template rules that you provide and the output produced by
built-in template rules.
If you are going to create a stylesheet containing template rules of
your own design, it is very important for you to understand the
default behavior provided by the built-in template rules.
Other built-in
template rules
I have explained the behavior of the built-in template rules that cover
the following four types of nodes:
- root node
- element node
- attribute node
- text node
I will explain the behavior of the
built-in template rules that cover the following two types of nodes
later in this lesson:
- comment node
- processing instruction node
I will also have some comments about namespace nodes later in this
lesson.
A Java program
that emulates the built-in template rules
Now let’s change direction and concentrate on Java code rather than
XSLT elements. The
following paragraphs describe a Java program named Dom11.
The primary purposes of
this lesson are to:
- Demonstrate Java code that replicates the behavior of the built-in template rules for six of the seven possible types of nodes.
- Provide a skeleton program that can be expanded later to provide more complex behavior.
This program implements six built-in
template rules of an XSLT processor. In addition, it emulates
several XSLT elements that are required to support the built-in
rules, such as xsl:value-of and xsl:apply-templates.
As such, the program serves as the skeleton for the definition of
custom template rules.
Behavior of the
program
As written, this program extracts and concatenates all text values from
a specified XML file, and writes that text into a result file, using
two different approaches:
- An XSLT transformation operating under program control.
- Program code that emulates the behavior of the XSLT transformation.
In particular, this program
illustrates Java code that emulates the XSLT templates in the files
named Dom11.xsl and Dom11a.xsl. These two XSL
files differ in terms of their dependence on the built-in templates.
As you saw in the earlier discussion, both XSL files produce the same
result when processed against the XML files named Dom11.xml and Dom11a.xml, demonstrating the
behavior of the built-in template rules. The execution of these
built-in template rules causes the contents of every text node to be
concatenated and written into the result file.
The program code in this program emulates those built-in template rules
and produces the same results.
Usage
instructions
The program requires three command line arguments in the following
order:
- The name of the input XML file (must be Dom11.xml or Dom11a.xml).
- The name of the output file to be produced by the XSLT transformation.
- The name of the output file to be produced by the program code that emulates the XSLT transformation.
Order of execution
The program begins by executing code to transform the incoming XML file
in a way that mimics the XSLT transformation. Along the way, it
saves the processing instructions, (one
of which contains the name of the stylesheet file), for later
use by the code that governs the XSLT transformation process. (Otherwise,
the code that performs the XSLT transformation later would have to
search the DOM tree for the XSL stylesheet file name.)
The name of the XSL
stylesheet file is extracted from the processing instruction in the XML
file. Then the program
uses the XSL style sheet to transform the XML file into a result file.
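To make the idea of extracting the stylesheet name from a processing instruction more concrete, here is a minimal sketch. It is not the code from the program discussed in this lesson: the class name StylesheetPiSketch, the method findStylesheetHref, and the regular expression used to pick out the href value are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

class StylesheetPiSketch {

    // Matches href="..." or href='...' inside the data of a processing instruction.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1");

    // Hypothetical helper: parses an XML string into a DOM tree.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the href value of the first xml-stylesheet processing
    // instruction that is a child of the document node, or null if none.
    static String findStylesheetHref(Document doc) {
        for (Node node = doc.getFirstChild(); node != null;
                node = node.getNextSibling()) {
            if (node.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                ProcessingInstruction pi = (ProcessingInstruction) node;
                if (pi.getTarget().equals("xml-stylesheet")) {
                    // getData returns everything after the target as one string.
                    Matcher matcher = HREF.matcher(pi.getData());
                    if (matcher.find()) {
                        return matcher.group(2);
                    }
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<?xml-stylesheet type='text/xsl' href='Dom11.xsl'?><books/>");
        System.out.println(findStylesheetHref(doc)); // prints: Dom11.xsl
    }
}
```

The DOM treats the contents of a processing instruction as a single string, so the program has to pick the href value out of that string itself.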
Errors,
exceptions, and testing
No effort was made to provide meaningful information about errors and
exceptions. If an error or exception occurs, the default behavior
for that error or exception will occur.
The program was tested using SDK 1.4.2 under WinXP.
Will discuss in
fragments
I will discuss this program in fragments. A complete listing of
the program is shown in Listing 31 near the end of the lesson.
Listing 4 shows the beginning of the class named Dom11 and the beginning of the main method.
public class Dom11{ |
The code in Listing 4 declares a couple of variables, one of
which will be used later to save processing instruction nodes.
Then the code in Listing 4 provides usage instructions based on
command-line arguments.
Parse the input
XML file
The code in Listing 5 parses the input XML file, producing an object of
type Document, which is a DOM
tree in memory.
try{ |
Steps for creating a Document object
There is nothing new in the code in Listing 5. I have discussed
the code required to create a Document
object in several previous lessons beginning with the lesson entitled Java
API for XML Processing (JAXP), Getting Started.
As you saw in those earlier lessons, creating a Document
object involves three steps:
- Create a DocumentBuilderFactory object
- Use the DocumentBuilderFactory object to create a DocumentBuilder object
- Use the DocumentBuilder object to create a Document object
Both the DocumentBuilderFactory class and the DocumentBuilder
class belong to the javax.xml.parsers package. As of this
writing, this package is part of J2SE 1.4.2.
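The three steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the code from Listing 5. The class name ParseSketch and the helper names parse and rootName are assumptions made for this example, and the XML is read from a String instead of a file to keep the sketch self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class ParseSketch {
    // Parses XML text into a DOM tree using the three steps.
    static Document parse(String xml) {
        try {
            // Step 1: create a DocumentBuilderFactory object.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Step 2: use the factory to create a DocumentBuilder object.
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Step 3: use the builder to create a Document object.
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Like the program in this lesson, the sketch makes no effort
            // to report errors in a meaningful way.
            throw new IllegalStateException(e);
        }
    }

    // Returns the name of the root element, to show that the tree exists.
    static String rootName(String xml) {
        return parse(xml).getDocumentElement().getNodeName();
    }

    public static void main(String[] args) {
        System.out.println(rootName("<book><title>Java</title></book>"));
    }
}
```

To parse a file instead, the DocumentBuilder class also provides a parse method that accepts a java.io.File object.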
Transformation
through program code
The code in Listing 6 begins the process of transforming the DOM tree
into an output file through the execution of program code (as opposed to an XSLT transformation).
The code begins by instantiating a new object of the Dom11 class.
Dom11 thisObj = new Dom11(); |
Get an output
stream
Then the program gets an output stream for the output produced by the
program code. This stream points to an output file that was
specified by the third command-line parameter.
Process the DOM
tree
The code in Listing 7 invokes the processDocumentNode
method to process the DOM tree. This method (and the methods that it calls)
begins with the Document node,
and processes all
the nodes in the DOM tree to produce the required output.
thisObj.processDocumentNode(document); |
Note that the code in Listing 7
passes the Document object’s
reference to the method named processDocumentNode.
This is the root node of the entire DOM tree, and can be treated as
type Node, because the Document interface extends the Node interface.
Set the main method aside
My explanation of this program will follow the execution thread through
the program. At this point, I will set the discussion of the main method aside temporarily and
come back to it later when the processDocumentNode
method returns control to the main
method.
The
processDocumentNode method
The entire processDocumentNode
method is shown in Listing 8.
void processDocumentNode(Node node){ |
This method is used to produce any text required in the output at
the document level, such as the XML declaration for an XML
document. (As you can see from
Listing 8, the code in this method writes an XML declaration into the
output.)
Invoke the
processNode method
Despite the name that I chose to give to the processDocumentNode method, it
doesn’t actually process the document node directly. Rather, after
sending any required text to the output, it invokes the
method named processNode to
actually process the document node.
(Note that the Document object’s
reference is passed to the method named processNode in Listing 8.)
When the DOM
tree has been processed …
When the processNode method
returns (after the entire DOM tree
has been processed), the processDocumentNode
method flushes the output stream and returns control to the main method.
As you will see
later, subsequent code in the main
method invokes a method that will perform an XSLT transformation on the
XML file and write the output into a different output file. I
will discuss that method later in this lesson.
The processNode
method
There are seven possible types of nodes in an XML document:
- root or document node
- element node
- attribute node
- text node
- comment node
- processing instruction node
- namespace node
The processNode method handles
the first six types and ignores namespace nodes.
(Note that the DOM API does not provide a way to identify namespace nodes by type, because
there is no constant in the Node interface that corresponds to
namespace nodes. This will become clearer later as we examine the
code in the processNode
method.)
Get and save
the node type
The beginning of the processNode
method is shown in Listing 9. Note that the method receives an
incoming parameter, which is a reference to an object of type Node. This can include any of
the seven node types that can occur in a DOM tree.
If the parameter doesn’t point to an actual object, the method simply
returns, as opposed to throwing a NullPointerException.
void processNode(Node node){ |
The final statement in Listing 9 invokes the getNodeType method to get and save
the type of the node whose reference was received as an incoming
parameter.
Process the node
Each time the processNode
method is invoked, it receives a Node
object’s reference as an incoming parameter. The code in Listing
9 determines the type of the incoming node. Listing 10 shows the
beginning of a switch
statement that is used to initiate the processing of each incoming node
based on its type.
switch (type){ |
The switch statement has six
cases to handle six types of nodes, plus a default case to ignore
namespace nodes.
The
DOCUMENT_NODE case
The code in Listing 10 will be executed whenever the incoming method
parameter points to a document node.
(Note that this will happen only once during the processing of a DOM
tree. The first node processed will always be the document node,
and there is only one document node in a DOM tree.)
DOCUMENT_NODE is a constant (public static final variable) that
is defined in the Node
interface. (The interface
provides similar constants for all node types other than namespace
nodes.) These constants can be used to distinguish between
different node types.
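As a rough illustration of how these constants can be used in a switch statement, the following sketch maps a node to a descriptive label. It is not the code from Listing 10. The class name NodeTypeSketch and the method names describe and describeTopLevel are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class NodeTypeSketch {
    // Switches on the value returned by getNodeType, using the
    // constants defined in the Node interface.
    static String describe(Node node) {
        if (node == null) {
            return "none";
        }
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE: return "document";
            case Node.ELEMENT_NODE: return "element";
            case Node.ATTRIBUTE_NODE: return "attribute";
            case Node.TEXT_NODE: return "text";
            case Node.COMMENT_NODE: return "comment";
            case Node.PROCESSING_INSTRUCTION_NODE: return "processing instruction";
            default: return "other"; // no Node constant exists for namespace nodes
        }
    }

    // Describes the document node followed by each of its child nodes.
    static String describeTopLevel(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder sb = new StringBuilder(describe(doc));
            NodeList children = doc.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                sb.append(", ").append(describe(children.item(i)));
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeTopLevel("<?pi data?><!--c--><root/>"));
    }
}
```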
Will invoke
default behavior in this case
Note that the code in the case in Listing 10 is an if/else construct. If the
conditional clause in the if statement
evaluates to true (which is not
possible in this case), the code in the if statement will be executed.
(This is where I will place the code
for custom template rules in subsequent lessons.)
If the conditional clause in the if statement does not evaluate to
true, the code in the else
statement will be executed. (This
is where I have placed the code that mimics the built-in template
rules.)
Note that the code in the else
statement in Listing 10 invokes a method named defElOrRtNodeTemp. When I
discuss this method momentarily, you will see that its behavior mimics
one of the built-in template rules that I discussed earlier in this
lesson. Before getting to that, however, I want to give you a
preview of how I will define custom template rules in future lessons.
Creating custom
template rules
As you will see in subsequent lessons, the process for creating a
custom template rule is as follows:
- Go to the method named processNode, which I am discussing right now.
- Identify the case for the node type in the switch statement.
- Change the conditional clause in the if statement for that case to implement a match for a particular node of that type.
- Write code in the body of the if statement to implement the custom template rule.
If the modified conditional clause
evaluates to true, the custom template rule will be executed. If
false, the
default rule will be executed.
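The if/else pattern described above might look something like the following sketch. Everything specific here is an assumption made for illustration: the class name CustomRuleSketch, the element name title used in the match, and the square brackets produced by the custom rule do not come from the program in this lesson. For brevity, the sketch also shares one if/else construct between the document node case and the element node case, whereas the program gives each case its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class CustomRuleSketch {
    private final StringBuilder out = new StringBuilder();

    void processNode(Node node) {
        if (node == null) {
            return;
        }
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
            case Node.ELEMENT_NODE: {
                if (node.getNodeName().equals("title")) {
                    // Custom rule (hypothetical): wrap the element's text in brackets.
                    out.append("[").append(node.getTextContent()).append("]");
                } else {
                    // Default rule: process all child nodes.
                    NodeList children = node.getChildNodes();
                    for (int i = 0; i < children.getLength(); i++) {
                        processNode(children.item(i));
                    }
                }
                break;
            }
            case Node.TEXT_NODE: {
                // Default rule for text nodes: send the text to the output.
                out.append(node.getNodeValue());
                break;
            }
            default: {
                break; // all other node types produce no output in this sketch
            }
        }
    }

    // Parses the XML text and returns the output produced by processNode.
    static String transform(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            CustomRuleSketch sketch = new CustomRuleSketch();
            sketch.processNode(doc);
            return sketch.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            transform("<book><title>Java</title><author>Ann</author></book>"));
    }
}
```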
The
ELEMENT_NODE case
Before getting to the discussion of the method named defElOrRtNodeTemp, I want to show
you the ELEMENT_NODE case in
Listing 11.
(Note that this is still part of the switch
statement that was begun in Listing 10.)
case Node.ELEMENT_NODE:{ |
Except for the type of node in the first line in Listing 11, the code
in this case is identical to the code in the DOCUMENT_NODE case shown in Listing
10. Note in particular that the default behavior for this case
invokes the same method as the default behavior for the document node
case.
As before, the code in the if
statement is not reachable in this program.
(This will be true for every case in this program, because this program is
designed specifically to exhibit the same behavior as the built-in XSLT
template rules.)
The method
named defElOrRtNodeTemp
Still following the execution thread, I will set my discussion of the switch
statement aside temporarily and discuss the method named defElOrRtNodeTemp. As
mentioned above, this method is invoked
as the default behavior for document nodes and element nodes in
Listings 10 and 11.
I will return to my discussion of the switch
statement shortly.
The entire method named defElOrRtNodeTemp
is shown in Listing 12.
void defElOrRtNodeTemp(Node node) |
Behavior of the
method named defElOrRtNodeTemp
This method mimics the behavior of the built-in XSLT template rule
shown in Listing 1, and repeated in Figure 5 below for convenient
viewing.
<xsl:template match="*|/"> Figure 5 |
As I indicated earlier, the match
pattern for this template rule matches the document node and all
element nodes.
(This is why the method is invoked from the
cases in the switch statement
corresponding to the document node and an element node.)
Code is
straightforward
The code in this method is relatively straightforward. First it
tests to confirm that the incoming parameter points to a node of the
correct type, and throws an exception if the incoming parameter is not
of the correct type.
If the incoming parameter is of the correct type, the code in the
method invokes a method named applyTemplates
passing the node as a parameter to that method.
(Note the similarity between the code in
Listing 12 and the XSLT template rule in Figure 5.)
The method
named applyTemplates
Continuing to follow the execution thread, I will now discuss the
method named applyTemplates,
shown in Listing 13.
void applyTemplates(Node node,String select){ |
Behavior of the
apply-templates rule
The applyTemplates method
partially emulates the XSLT apply-templates
rule discussed earlier in this lesson, and shown in Figure 6.
<xsl:apply-templates Figure 6 |
The apply-templates
rule has two attributes, select and
mode.
(The applyTemplates method does not
support the mode attribute. Perhaps I will update the method in a
future lesson to support this attribute.)
As I explained earlier in this lesson, the apply-templates
rule processes all child nodes of the context node that match
an optional select
attribute. If the select attribute is omitted, all
child nodes are processed.
Behavior
of the method named applyTemplates
The applyTemplates method
shown in Listing 13 receives two incoming parameters:
- The context node.
- The select parameter.
If the select parameter is null, the method
examines and processes all child nodes of the context node.
Otherwise, it processes only those child nodes that match the select parameter.
The code in Listing 13 invokes the getChildNodes
method on the context node to get a list of all child nodes of the
context node. If there are no child nodes, it quietly returns.
A recursive
method call
If there are child nodes, the
method uses a for loop to
process all child nodes that match the select
parameter as described above.
(The match is
based on the name of the node obtained by invoking the method named getNodeName on the child node being
examined.)
For each matching child node, the applyTemplates
method makes a recursive call to the method
named processNode, passing the
child node’s reference as a parameter to the processNode method.
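The interplay between applyTemplates and processNode described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code from Listing 13: the class name ApplyTemplatesSketch and the helper method run are hypothetical, the output is collected in a StringBuilder rather than written to a file, and matching is done by comparing the select parameter with the value returned by getNodeName.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class ApplyTemplatesSketch {
    private final StringBuilder out = new StringBuilder();

    // Default behavior only: document and element nodes process their
    // children, and text nodes send their text to the output.
    void processNode(Node node) {
        if (node == null) {
            return;
        }
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
            case Node.ELEMENT_NODE:
                applyTemplates(node, null);
                break;
            case Node.TEXT_NODE:
                out.append(node.getNodeValue());
                break;
            default:
                break;
        }
    }

    // Processes all child nodes of the context node when select is null.
    // Otherwise processes only the child nodes whose names match select.
    void applyTemplates(Node context, String select) {
        NodeList children = context.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (select == null || select.equals(child.getNodeName())) {
                processNode(child); // processNode calls back into applyTemplates
            }
        }
    }

    // Applies templates to the root element of the XML text.
    static String run(String xml, String select) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            ApplyTemplatesSketch sketch = new ApplyTemplatesSketch();
            sketch.applyTemplates(doc.getDocumentElement(), select);
            return sketch.out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<r><a>1</a><b>2</b><a>3</a></r>";
        System.out.println(run(xml, null)); // all children of r
        System.out.println(run(xml, "a"));  // only the children named a
    }
}
```

With no child nodes, the for loop body never runs, so the method quietly returns, which matches the behavior described above.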
Return to
defElOrRtNodeTemp method
Eventually, the recursive process will end, and control will return to
the defElOrRtNodeTemp method
shown in Listing 12. From there, control will return to either
the DOCUMENT_NODE case or the ELEMENT_NODE
case in the switch statement
in Listing 10 or Listing 11 from which the defElOrRtNodeTemp
method was called.
That, in turn, brings us back to a discussion of the other cases in the
switch statement.
The TEXT_NODE
and ATTRIBUTE_NODE cases
The next two cases from the switch
statement that I will discuss are shown in Listing 14. (The switch
statement began in Listing 10.)
Listing 14 shows the cases for text nodes and attribute nodes. I
have grouped these two cases together because the default behavior of
both cases is to invoke the method named defTextOrAttrTemp, and to send the String returned by that method to
the output.
case Node.TEXT_NODE:{ |
The
defTextOrAttrTemp method
Once again, following the execution thread, I will now discuss the
method named defTextOrAttrTemp. This method is called whenever:
- The processNode method is called with a reference to either a text node or an attribute node, and
- The default behavior for the node type is executed.
Listing 15 shows the entire method named defTextOrAttrTemp.
String defTextOrAttrTemp(Node node) |
Emulates a
built-in XSLT template rule
This method emulates the built-in XSLT template rule shown in Listing 2
and repeated in Figure 7 below for convenient viewing.
<xsl:template match="text()|@*"> Figure 7 |
As I told you earlier, this template
rule matches all text nodes and all attribute nodes. Therefore,
the defTextOrAttrTemp
method is invoked by the default behavior of either the TEXT_NODE case or the ATTRIBUTE_NODE case in the switch statement in Listing 14.
Similar behavior
Once again, note the similarity between the method named defTextOrAttrTemp in Listing 15 and
the template rule shown in Figure 7.
In Figure 7, the template rule executes the xsl:value-of XSLT element to send
the value of the context node to the output.
The method shown in Listing 15 invokes a method named valueOf,
passing “.” as a parameter (note the
period between the quotation marks). The value returned by
that method is
sent to the output by the code in the default behaviors of the two
cases in Listing 14.
The method named
valueOf
The method named valueOf,
which begins in Listing 16, is
fairly complex. I will discuss portions of this method
in this lesson and will discuss the remainder of the method in
subsequent lessons.
This method emulates an <xsl:value-of
select=”???”/> XSLT element.
Three forms of
method call
The method requires two parameters. The first parameter is of
type Node, and is the context
node. The second parameter is of type String and is a select parameter.
The valueOf method recognizes
three forms of call:
- valueOf(Node theNode,String “@attrName”)
- valueOf(Node theNode,String “.”)
- valueOf(Node theNode,String “nodeName”)
In the first form, the method returns the text value of the named
attribute of theNode. An attribute is specified by a select value
that begins with @. If the attribute doesn’t
exist, the method returns an empty string.
In the second form, which is the only form actually used in this
program, the value of the select parameter is a String containing a single
period. In this form, the method returns the concatenated text
values of the context node and all
descendants
of the context node (including text
nodes that are children of the context node).
In the third form, the method returns the concatenated text values of
all descendants of a specified child node of the context node. If
the context node has more than one child node with
the specified name, only the first one found is processed.
The others are ignored.
Features not
supported
The valueOf method does not
support the following features, which are standard features of the xsl:value-of XSLT element:
- disable-output-escaping
- processing instruction nodes
- comment nodes
- namespace nodes
Will discuss
the second form only
Since the second form of call listed above is the only form actually
used in this program, I will discuss only those portions of the method
that support that form. I will defer discussion of the other
portions of the method until they are used in subsequent lessons.
Process the
context node
The code in Listing 16 picks up at the point where it is determined
that the incoming value for select
is a String object’s reference
with a value of “.” (note the period
between the quotation marks). This is a request to return
the value of the context node.
This method supports two possibilities for the context node:
- Element node – return the concatenated text values of all descendant nodes of the context node.
- Text node – return the text value of the text node.
Clearly the first possibility is the more complex of the two, but as
you will see, recursion makes it easy to accomplish.
When the
context node is an element node …
The code in Listing 16 shows the beginning of the code required to
process the context node as an element node.
public String valueOf(Node node,String select){ |
Get list of
child nodes
In preparation for processing all descendant nodes of the context node,
the code in Listing 17 gets a list of child nodes, along with the
length of the list.
In addition, the code in Listing 17 initializes a String variable named nodeTextValue that will be used to
collect the concatenated text values of the descendant nodes.
Note that this variable is initialized to contain an empty string.
NodeList childNodes = |
Process child
nodes of context node
Having gotten a list of child nodes of the context node, all that is
required to accomplish the objective is to make a series of recursive
calls to the valueOf method,
passing each child
node in turn to the valueOf method
as shown in Listing 18.
for(int j = 0; j < listLen; j++){ |
Each child node becomes the new context node upon re-entry into the valueOf method, and each call
requests the value of the context node (the current child node) by passing
“.” for the select parameter.
Concatenation
The code in Listing 18 also deals with concatenation. The value
returned from each call to the valueOf
method is concatenated with the text value already stored in the
variable named nodeTextValue.
Finally, after all child nodes have been processed, the code in Listing
18 returns the concatenated value stored
in the variable named nodeTextValue.
When the
context node is a text node …
If you understood all of the above (including
the recursion),
you should find it easy to
understand the code shown in Listing 19. Listing 19 shows the
case where the context node is a text node.
}else if(nodeType == Node.TEXT_NODE){ |
In this case, the method simply returns the value obtained by invoking getNodeValue on the text node.
One other
possibility
There is one other possibility that is handled by the code in Listing
20. That possibility is that the context node is neither a text
node nor an element node. In that case, the valueOf method returns an empty
string.
}else{ |
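Taken together, the three possibilities described above for the second form of call can be sketched as shown below. This is an illustration assembled from the description in this section, not the complete valueOf method. The class name ValueOfSketch and the helper valueOfRoot are assumptions, and the sketch rejects any select value other than a single period because the other two forms are not covered here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class ValueOfSketch {
    static String valueOf(Node node, String select) {
        if (!".".equals(select)) {
            // The attribute form and the child-name form are omitted here.
            throw new IllegalArgumentException("only \".\" is supported in this sketch");
        }
        int nodeType = node.getNodeType();
        if (nodeType == Node.ELEMENT_NODE) {
            // Get the list of child nodes and its length.
            NodeList childNodes = node.getChildNodes();
            int listLen = childNodes.getLength();
            String nodeTextValue = "";
            // Each child node becomes the context node of a recursive call.
            for (int j = 0; j < listLen; j++) {
                nodeTextValue += valueOf(childNodes.item(j), ".");
            }
            return nodeTextValue;
        } else if (nodeType == Node.TEXT_NODE) {
            // A text node simply returns its own value.
            return node.getNodeValue();
        } else {
            // Neither an element node nor a text node.
            return "";
        }
    }

    // Returns the value of the root element of the XML text.
    static String valueOfRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return valueOf(doc.getDocumentElement(), ".");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(valueOfRoot("<a>x<b>y<!--c--></b>z</a>"));
    }
}
```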
Other types of
nodes in the switch statement
Returning to the switch
statement that began in Listing 10, we find two additional cases, each
of which invokes the same method by default:
- COMMENT_NODE
- PROCESSING_INSTRUCTION_NODE
The default behavior of the cases corresponding to both of these node
types is to invoke the method named defComOrProcInstrTemp.
case Node.COMMENT_NODE:{ |
Save all
processing instructions
I will discuss the defComOrProcInstrTemp
method shortly. First, however, I will explain the extra code
that appears in the default portion of the processing instruction node
case in Listing 21.
The purpose of a processing instruction in an XML file is to provide
instructions to processing programs such as this one. The XML
file shown in Listing 29 contains the three processing instructions
shown in Listing 22.
<?dummy-target dummy-data="def"?> |
Stylesheet
identified in a processing instruction
The first and third of the three processing instructions are dummy
processing instructions put there
to test the capabilities of this program. However, the
processing instruction in the middle is a real processing instruction
that specifies the name of the file containing a stylesheet. That
stylesheet will be used later when this program causes an XSLT
transformation to take place using the XML file in Listing 29, and the
stylesheet file identified in Listing 22. (That
stylesheet actually appears in Listing 30.)
In order to use that processing instruction to identify the stylesheet
file, this program must capture the processing instruction and extract
the file name from the processing instruction. A statement in the
second case in Listing 21 causes references to all processing
instruction nodes to be added to and saved in a static variable of the
Dom11 class named procInstr.
That information will be used later to extract the name of the
stylesheet file from the processing instruction.
The
defComOrProcInstrTemp method
Both of the switch cases shown
in Listing 21 invoke this method as their default behavior. A
complete listing of the defComOrProcInstrTemp
method is shown in Listing 23.
String defComOrProcInstrTemp(Node node) |
The defComOrProcInstrTemp
method emulates the built-in template rule shown in Figure 8.
<xsl:template Figure 8 |
According to Nutshell, the built-in
template rule for comments and processing instructions doesn’t output
anything into the output tree. Therefore, the defComOrProcInstrTemp method shown
in Listing 23 simply returns an empty string.
The namespace
node case
The default case for the switch
statement begun in Listing 10 is shown in Listing 24.
default:{ |
Since the switch statement
contains explicit cases for six of the seven possible types of nodes in
a DOM tree, the default case will be activated only in the case of
namespace nodes. As I mentioned earlier, the Node interface doesn’t provide a
constant that can be used to identify namespace nodes, so it isn’t
possible to create an explicit case for namespace nodes.
Also, here is what Nutshell has to say about the built-in template rule
for namespace nodes:
“… template rule … instructs the processor not to copy any part of
the namespace node to the output.”
Therefore, the default case in Listing 24, which catches all namespace
nodes, doesn’t send anything to the output.
End of the
processNode method
I have discussed everything of significance in the processNode method. Continuing
to follow the execution thread, I will now turn my attention back to
the main method.
Perform an XSLT
transformation
After the code has been executed to process the document using program
code (beginning with the invocation
of the processDocumentNode
method in Listing 7), the
statement in Listing 25 invokes the doXslTransform
method to cause the XML document to be transformed using the stylesheet
identified in one of the processing instructions in the XML file.
thisObj.doXslTransform( |
Stylesheet
reference has been saved
The success of the method call in Listing 25 depends on the
stylesheet processing instruction having been saved while the document
was being processed. Otherwise, it would be necessary to add code
in this method to search the DOM tree for the stylesheet processing
instruction.
All processing instructions are saved in a Vector object by this
program. The Vector
object’s reference is passed as the third parameter to this
method. The first parameter is a reference to the Document or root node in the DOM
tree. The second parameter is the name of the output file.
The
doXslTransform method
The doXslTransform method
begins in Listing 26. This method uses an XSLT stylesheet file to
transform an incoming Document object
into an output file. A large portion of the code in this method
is dedicated to:
- Identifying the processing instruction containing the stylesheet information.
- Extracting the stylesheet information from the processing instruction.
Identify the
processing instruction containing the stylesheet reference
The code in Listing 26 searches the Vector
object seeking a processing instruction node that contains a stylesheet
reference.
void doXslTransform(Document document, |
How does this
work?
To see how this code works, first take a look at the processing
instruction in the XML file that contains the stylesheet
reference. This processing instruction was shown in Listing 22,
and is repeated below in Figure 9 for convenient viewing.
<?xml-stylesheet Figure 9 |
The purpose of a processing instruction is to provide information to
processing programs that will be used to process the XML file.
Format of a
processing instruction
According to Nutshell,
“A processing instruction begins with
<? and ends with ?>. Immediately following the <? is an
XML name called the target, possibly the name of the application for
which this processing instruction is intended or possibly just an
identifier for this particular processing instruction. The rest
of the processing instruction contains text in a format appropriate for
the application for which the instruction is intended.”
Applying this knowledge to the stylesheet processing instruction in
Figure 9, you can see that the target consists of the following
text: xml-stylesheet.
Accessing the
target and the data
The target of a processing instruction node can be accessed in Java by
invoking the getTarget method
on the processing
instruction node’s reference.
The remainder of the text in the processing instruction can be accessed
by invoking the getData method
on the same reference.
The code in Listing 26 examines each of the objects in the Vector, invoking getTarget and getData, searching for a processing
instruction whose target and data match that which is known to be true
for a stylesheet. When a match is found, the code breaks out of
the for loop.
If no match is found, the code in Listing 26 throws an exception.
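The search described above might be sketched as follows. Several details are assumptions made for this illustration rather than facts about Listing 26: the class and method names are hypothetical, a List is used where the program uses a Vector, only the processing instructions that are direct children of the Document are saved, and a match is defined as a target of xml-stylesheet together with data that contains an href pseudo-attribute. The file name style.xsl is also hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical class name, used only for this illustration.
class StylesheetPiSketch {
    // Saves every processing instruction found among the Document's children.
    static List<ProcessingInstruction> savePis(Document doc) {
        List<ProcessingInstruction> saved = new ArrayList<>();
        NodeList children = doc.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                saved.add((ProcessingInstruction) child);
            }
        }
        return saved;
    }

    // Examines each saved processing instruction using getTarget and getData.
    static ProcessingInstruction findStylesheet(List<ProcessingInstruction> saved) {
        for (ProcessingInstruction pi : saved) {
            // Assumed matching criteria; the program's exact test may differ.
            if ("xml-stylesheet".equals(pi.getTarget())
                    && pi.getData().contains("href=")) {
                return pi;
            }
        }
        throw new IllegalStateException("no stylesheet processing instruction found");
    }

    // Returns the data portion of the stylesheet processing instruction.
    static String stylesheetData(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return findStylesheet(savePis(doc)).getData();
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(stylesheetData(
            "<?dummy-target dummy-data=\"def\"?>"
            + "<?xml-stylesheet type=\"text/xsl\" href=\"style.xsl\"?><root/>"));
    }
}
```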
Extract the
stylesheet file name
Having identified the processing instruction that contains the
stylesheet reference, the code in Listing 27 uses the getData method of the ProcessingInstruction interface,
along with some methods of the String
class to extract the name of the file containing the stylesheet.
String xslFile = pi.getData(). |
The ability to extract the file name is based on the known format of
the stylesheet processing instruction.
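One way to pull the file name out of the data string with ordinary String methods is sketched below. This is not the expression used in Listing 27. The class name HrefSketch and the method name extractHref are hypothetical, and the sketch assumes that the data contains an href pseudo-attribute whose value is enclosed in matching single or double quotes.

```java
// Hypothetical class name, used only for this illustration.
class HrefSketch {
    // Extracts the quoted value that follows href= in the data of a
    // stylesheet processing instruction.
    static String extractHref(String data) {
        int start = data.indexOf("href=");
        if (start < 0) {
            throw new IllegalArgumentException("no href found in: " + data);
        }
        start += "href=".length();        // index of the opening quote
        if (start >= data.length()) {
            throw new IllegalArgumentException("no value after href=");
        }
        char quote = data.charAt(start);  // either " or '
        int end = data.indexOf(quote, start + 1);
        if (end < 0) {
            throw new IllegalArgumentException("unterminated href value");
        }
        return data.substring(start + 1, end);
    }

    public static void main(String[] args) {
        // style.xsl is a made-up file name for this example.
        System.out.println(extractHref("type=\"text/xsl\" href=\"style.xsl\""));
    }
}
```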
Do the XSLT
transformation
The remaining code in the doXslTransform
method is shown in Listing 28.
//Get a TransformerFactory object |
You have seen
this code before
The code in Listing 28 is not new to this series of lessons. This
code was discussed in detail in the earlier lesson entitled Getting
Started with Java JAXP and XSL Transformations (XSLT).
Therefore, other than to point out one
difference relative to the previous code, and to review the steps
involved, I won’t discuss the code in Listing 28 further in this lesson.
Steps for creating a Transformer object
The following two steps
are required to create a Transformer object. Once a Transformer object is available, it
can be used to transform one DOM tree into another DOM tree.
- Create a TransformerFactory object by invoking the static newInstance method of the TransformerFactory class.
- Invoke the newTransformer method on the TransformerFactory object.
One important
difference
There is one important difference between the code in Listing
28 and the code in the earlier lesson. The two programs invoke
different overloaded versions of the newTransformer
method of the TransformerFactory
class.
The earlier lesson entitled Getting
Started with Java JAXP and XSL Transformations (XSLT)
invoked a version that took no parameters and returned a Transformer object that simply
copies a source tree to a result tree.
The code in Listing 28 invokes a version of the newTransformer method that takes
the stylesheet file as an input parameter and returns a Transformer object that uses the
stylesheet file to perform an XSLT transformation.
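The version of newTransformer that takes a stylesheet can be sketched as follows. This is an illustration, not the code from Listing 28. The class name TransformSketch is hypothetical, and both the XML and the stylesheet are read from strings through StreamSource objects, whereas the program in this lesson transforms a Document object using a stylesheet file. The stylesheet in the sketch is an assumption as well: it contains no template rules, so only the built-in template rules apply, and it requests text output so that no XML declaration is written.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name, used only for this illustration.
class TransformSketch {
    // A stylesheet with no template rules of its own.
    static final String BUILT_IN_ONLY =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "</xsl:stylesheet>";

    static String transform(String xml, String xsl) {
        try {
            // Create a TransformerFactory object.
            TransformerFactory factory = TransformerFactory.newInstance();
            // Pass the stylesheet to newTransformer so that the Transformer
            // performs an XSLT transformation instead of a simple copy.
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter result = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)),
                new StreamResult(result));
            return result.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<a><b>x</b><c>y</c></a>", BUILT_IN_ONLY));
    }
}
```

Because only the built-in template rules are in effect, the output consists of the concatenated contents of the text nodes, which is the behavior that the Java code in this lesson emulates.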
That concludes the discussion of the program named Dom11.
Run the Program
I encourage you to copy the Java code, XML files, and XSL files from
the listings near the end of this lesson. Compile and execute the
programs. Experiment with them, making changes and observing the
results of your changes.
Summary
I explained default XSLT behavior
and showed you how to write Java code that mimics that behavior.
The resulting Java code serves as a skeleton for more advanced
transformation programs.
What’s Next?
In the next lesson, I will show you
how to
write a Java program that mimics an XSLT transformation for converting
an XML file into a text file. I will also show that once you
have a
library of Java
methods that
emulate XSLT elements, it is no more difficult to
write a Java program to transform an XML document than it is to
write an XSL stylesheet to transform the same document.
Complete Program Listings
Complete listings of the various files discussed in this lesson are
contained in the listings that follow.
<?xml version="1.0"?> |
<?xml version='1.0'?> |
/*File Dom11.java |
<?xml version="1.0"?> |
<?xml version='1.0'?> |
Copyright 2004, Richard G. Baldwin. Reproduction in whole or
in
part in any form or medium without express written permission from
Richard
Baldwin is prohibited.
About the author
Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform and/or language independent
benefits of Java and C# applications, he believes that a combination of
Java, C#, and XML will become the primary driving force in the delivery
of structured information on the Web.
Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas. He is the author of Baldwin’s
Programming Tutorials, which
has gained a worldwide following among experienced and aspiring
programmers. He has also published articles in JavaPro magazine.
Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.