Java API for XML Processing (JAXP), Getting Started

Java Programming Notes # 2200


Preface

What is JAXP?

As the name implies, the Java API for XML Processing (JAXP) is an API
provided by Sun to help you write programs that process XML
documents.  JAXP is important for many reasons, not the least of which
is that it is a critical part of the Java Web Services Developer Pack
(Java WSDP).  According to Sun,

“The
Java WSDP is an all-in-one download containing key technologies to
simplify building of Web services using the Java 2 Platform.”

This is the first lesson in a series designed to initially help you
understand how to use JAXP, and to eventually help you understand how
to use the Java WSDP.

What is XML?

If you have been around Information Technology (IT) for the past
several years, it is doubtful that you have escaped hearing about
the eXtensible Markup Language (XML). 
However, if you are like many of the professional programmers who
enroll in my Java courses, you may not yet know much about XML.

I will not attempt to teach XML in this series of tutorial
lessons.   Rather, I will assume that you already understand
XML.  I will teach you how to use JAXP to write programs for
creating and processing XML documents.

Regarding XML, let me simply refer you to the numerous tutorial lessons
on XML that I have previously published at Gamelan.com and
www.DickBaldwin.com.  However, as a convenience to you, I will review
many of the salient aspects of XML later in this document under General
Background Information on XML.

Viewing tip

You may find it useful to open another copy of this lesson in a
separate browser window.  That will make it easier for you to
scroll back and forth among the different listings and figures while
you
are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive
collection of online Java tutorials.  You will find those lessons
published at Gamelan.com.  However, as of the date of this writing,
Gamelan doesn’t maintain a consolidated index of my Java tutorial
lessons, and sometimes they are difficult to locate there.  You will
find a consolidated index at www.DickBaldwin.com.

General Background Information on XML

Purpose of this section

As mentioned earlier, the purpose of this section is to review the
salient aspects of XML for those who are unfamiliar with the
topic.  Those of you who already know about XML can skip ahead to
the Preview section.  Those of you who are
just getting your feet wet in XML (and may have found the XML water
to be a little deep)
should continue reading this section.

Jargon

Computer people are the world’s worst at inventing new jargon. 
XML people seem to be the worst of the worst in this regard.

Go to an XML convention and everything that you hear will be X-this,
X-that, X-everything.  Sometimes I get dizzy just trying to keep
the
various X’s separated from one another.  In this explanation
of XML, I will try either to avoid jargon or to explain it the first
time I use it.

So, just what is XML?

There are many definitions and descriptions for XML.  I like
the
one given in Figure 1.

XML gives us a way to create and maintain structured
documents
in plain text that can be rendered in a
variety of different ways.

A primary objective of XML is to separate content from
presentation.

Figure 1


What do I mean by a “structured document?”

I will answer this question by providing an example.  A book is
typically a structured document.

In its simplest form, a book may be composed of chapters.  The
chapters may be composed of sections.  The sections may contain
illustrations and tables.  The tables are composed of rows and
columns.

Thus, it should be possible to draw a hierarchical diagram that
illustrates the structure of a book, and most people who are familiar
with books will probably recognize it as such.

What do I mean by “plain text?”

Characters such as the letters of the alphabet and punctuation marks
are represented in the computer by numeric values, similar to a simple
substitution code that a child might devise.

ASCII is an encoding scheme

For example in one popular encoding scheme (ASCII), the upper-case
version of the character “A” is represented by the numeric value 65, a
“B” is represented by the value 66, a “C” is represented by 67, etc.
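To see this correspondence for yourself, here is a minimal Java sketch.  The class name AsciiCodes and its two helper methods are invented for this illustration; the only library facility it relies on is the standard US_ASCII charset.

```java
import java.nio.charset.StandardCharsets;

class AsciiCodes {
    // Returns the numeric value that represents a single character.
    static int codeOf(char c) {
        return (int) c;
    }

    // Encodes text as ASCII bytes; each byte holds one character's numeric value.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        for (char c : new char[] {'A', 'B', 'C'}) {
            System.out.println(c + " = " + codeOf(c));
        }
    }
}
```

Running main prints one line per character, beginning with A = 65.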

The actual correspondence between the characters and the specific
numeric values representing the characters has been described by
several different encoding schemes over the years.

ASCII is an acronym

One of the most common and enduring schemes for encoding characters
is a scheme that was devised many years ago by a committee of the
American Standards Association.

The name of the scheme, ASCII, is an acronym for American Standard
Code for Information Interchange.

Figure 2 contains a quotation from one author
regarding the ASCII code (or plain text).  (The quotation expands the
acronym somewhat differently, but its description of plain text is
what matters here.)

“This stands for American Standards Committee on Information
Interchange. What it means in practice is plain text, that is to say
text which is readable directly without using any special software.
The advantage of ASCII is that it is a lowest common denominator which
can be displayed on any platform. The disadvantage is that it is rather
limited and somewhat boring. The text cannot display bold, italics or
underlined fonts, and there is no scope for graphics or hypertext.
However,
it is simple, … and is almost idiot-proof as a means of information
exchange. To see a short example of ASCII click HERE, or to
see a journal article in ASCII click HERE.”

Figure 2

XML is not confined to the ASCII code

XML is not confined to the use of the ASCII encoding scheme. Several
different encoding schemes can be used.

However, all of them have been selected to make it possible to read
a raw XML document without the requirement for any special software (other
than perhaps a text editor or the DOS type command).

What is a raw XML document?

A raw XML document is the string of sequential characters that makes
up the document, before any specific rendering has been applied to the
document.

What do I mean by “rendering?”

In modern computer jargon, rendering typically means to
present something for human consumption.

Rendering a drawing

For example, in a computer, drawings and images are nothing more or
less than sets of numbers and possibly formulas.  Those numbers
and formulas, taken at face value, usually mean very little to most
human observers.

Recognition by a human observer

When we speak of rendering a drawing or an image, we usually mean
that we are going to present it in a way that makes it look like a
drawing or an image to a human observer. In other words, we convert the
numbers and formulas that comprise the drawing to a set of colored dots
(pixels) that a human observer will recognize as a drawing.

Rendering a document

When we speak of rendering a document, we usually mean that we are
going to present it in a way that a human will recognize it as a book,
a newspaper, or some other document style that can be read by the human
observer.

Passing information through typography

Rendering, in this case, often means to present some of the material
in boldface, some of the material in Italics, some of
the material underlined, some of the material in color, etc.  (For example, when you
view this document using an HTML browser, it is rendered to show boldface,
Italics, color, etc.)

To separate presentation from content

Raw XML doesn’t exhibit any of these presentation properties, such
as boldface, Italics, or color.  Remember, a main objective of XML is
to separate presentation from content.  XML provides only the
content.  The presentation of that content must come from somewhere
else.

Consider a newspaper

These days, there are at least two different ways to render a
newspaper. One way is to print the information (daily news),
mostly in black and white, on large sheets of low-grade paper commonly
known as newsprint. This is the rendering format that ends up on my
driveway each
morning.

My online newspaper

Another way to render a newspaper is to present the information on a
computer screen, usually in full color, with the information content
trying
to fight its way through dozens of animated advertisements.

USA Today

For example, USA Today provides a rendering format of this sort for
the online version of its newspaper.  Most of you are probably already
familiar with the newsprint rendering of that well-known newspaper.

The news doesn’t change

The base information for the newspaper doesn’t (or shouldn’t)
change for the newsprint and online renderings. After all, news is news
and the content of the news shouldn’t depend on how it is presented.
What does change is the manner in which that information is presented.

A newspaper is a structured document consisting of pages, columns,
etc., which could be maintained using XML.

The great promise of XML

When the information content of a newspaper is created and maintained
in XML, that same information content can be rendered on newsprint
paper, on your computer screen, on your cell phone, or potentially in
other formats without having to rewrite the information content.

Not necessarily boring

If you visit the above link to
the journal article rendered solely in ASCII, you will probably agree
that from a presentation viewpoint it is pretty boring (no offense
intended to the author of the article).

However, XML documents created and maintained in plain text need not
necessarily be boring.

When you combine a rendering engine with XML …

It is possible to apply a rendering engine (such as XSL) to the XML
content and to render that content in rich and exciting ways.  (XSL is
an acronym for the Extensible Stylesheet Language, and is an advanced
topic that I will be covering in future lessons in this series.)

Separating content from presentation

XML is responsible for maintaining the content, independent of
presentation.

A rendering engine, such as XSL, is responsible for rendering that
content in ways that are appropriate for the application.

Achieving Structure

So, just how does XML use plain text to create and maintain
structure?

Consider the following simple structure that represents a
book.  (This
book certainly wasn’t written by me, because it is much too brief.)

The book described by the structure in Figure 3 has two chapters
with some
text in each chapter. 

Begin Book

Begin Chapter 1
Text for Chapter 1
End Chapter 1

Begin Chapter 2
Text for Chapter 2
End Chapter 2

End Book
Figure 3

A simple example

A real book obviously has a lot more structure than this.  For
example, a typical book will probably have a Foreword or a Preface.
A typical book will usually have a Table of Contents.

Breaking the structure down further produces paragraphs within the
text, words within the paragraphs, etc.  Also, a book will
frequently have an Alphabetical Index.

However, I am trying to keep this example as simple as possible, so
I
left those things out.

A primary objective

In the earlier description, I told you “A primary objective of
XML is to separate content from presentation.”

This separation, and the fact that the XML document is maintained in
plain text, makes it possible to share the same physical
document among different computers in a way that they all
understand.  (This is often not true, for example, for
documents that are maintained in the proprietary formats of word
processing software.)

Many different computers and operating systems

Sharing of a document among different computers is no small
accomplishment. Over the years, dozens of different types of computers
have been built, operating under different operating systems, and
running thousands of different programs.

As a result, the modern computer world is often like being on an
island where everyone speaks a different language.

A common language for structured documents

XML attempts to rectify this situation by providing a common
language for
structured documents.

What Does XML Contribute?

I am going to ease into the technical details later.  At this
point, suffice it to say that XML provides a definition of a simple
scheme by which the structure and the content of a document can be
established.

Even I can understand an XML document

The resulting physical document is so simple that any computer (and
most humans)
can read it with only a modest amount of preparation.

You will sometimes see XML referred to as a “meta” language.

What does meta mean?

In computer jargon, the term meta is often used to identify
something that
provides information about other information.

Stock and bond price information

For example, consider the listings of stock prices, bond prices, and
mutual fund prices that commonly appear in most daily newspapers.

The various tables on the page provide information about the bid and
ask prices for the various stock, bond, and mutual fund instruments.

What you need is meta information

But, how do you read those charts?  How do you extract
information from the charts?  You need some information about the
information contained in the charts.  You need some meta
information.

Stock and bond meta information

Usually somewhere on the page, you will find an explanation as to
how to
interpret the information presented throughout the remainder of the
page.

You could probably think of the information contained in the
explanation as meta information. It provides information about other
information.

What about the alphabetical index in a book?

Is the alphabetical index of a book a form of
meta information?  Probably so.

For example, the alphabetical index can tell
you if the book contains information about XML or other topics of
interest to you.  If so, it will tell you where in the book you
can find that information.

The index can also tell you where to find
information about elements and attributes that I will
discuss later.  So, yes, in my opinion, the alphabetical index in
a book provides meta information.

So, why might people refer to XML as a meta language?

If you write a book and maintain its content in XML, XML doesn’t
tell you
how to structure the document that represents your book.

XML provides a set of rules for structuring

Rather, XML provides you with a set of rules that you can use to
establish your own structure and content when you create the document
that represents your book.

XML is not the language that you use to establish the structure and
content of your book.  Rather, XML tells you how to create your
own language for creating structure and maintaining content.

It is up to you to decide how you will use those rules to define
your own
language for establishing the structure and content of your book.

Invent your own language

You might say that XML is a language that provides information about
a new language that you are free to invent.

Does everyone use a different language?

As it turns out, different groups of people having common interests
have gotten together and have used XML to invent a common language by
which the persons in the group can create, maintain, and exchange
document structure in their areas of interest.

The Chemical Markup Language

For example, a group of chemists
has gotten together and has used the XML rules to invent a common
language by which they create and exchange structured documents on
chemistry.

MathML

Similarly, a group of mathematicians
has gotten together and has invented a common language by which they
create and exchange structured documents on mathematics.

XML is easily transported

If you follow the rules for creating an XML document, the document
that you create can easily be transported among various computers and
rendered in a variety of different ways.

Two different renderings

For example, you might want to have two different renderings of your
book. One rendering might be in conventional printed format and the
other rendering might be in an online format.

No requirement to modify the XML source document

The use of XML makes it practical to render your book in two or more
different ways without any requirement to modify the original document
that
you produce.

This freedom to invent your own language is what leads to the name:
eXtensible Markup Language, or XML.

Applying XML

Now let’s look at a couple of sample XML documents, either of which
might reasonably represent the simple book presented earlier.

The first sample XML document is shown in Listing 1.

<?xml version="1.0"?>
<book>
<chap>
Text for Chapter 1
</chap>

<chap>
Text for Chapter 2
</chap>
</book>

Listing 1

This example shows typical XML syntax.
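Because this series is about JAXP, it may help to see how little code is needed to read such a document.  The following is only a sketch under stated assumptions: the class name Listing1Parser is invented, the text of Listing 1 is held in a string rather than a file so that the example is self-contained, and a reasonably current JDK is assumed.  It uses the DocumentBuilderFactory and DocumentBuilder classes from the javax.xml.parsers package.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class Listing1Parser {
    // The document from Listing 1, held in a string for this illustration.
    static final String BOOK_XML =
        "<?xml version=\"1.0\"?>\n"
        + "<book>\n"
        + "<chap>\nText for Chapter 1\n</chap>\n"
        + "\n"
        + "<chap>\nText for Chapter 2\n</chap>\n"
        + "</book>\n";

    // Parses an XML string into an in-memory DOM Document.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Counts the chap elements found anywhere in the document.
    static int countChapters(String xml) {
        return parse(xml).getElementsByTagName("chap").getLength();
    }

    public static void main(String[] args) {
        System.out.println("Chapters found: " + countChapters(BOOK_XML));
    }
}
```

For the document in Listing 1, the program reports two chapters.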

Compare with earlier book description

If you compare this example with the informal book example given
earlier in Figure 3, you should see a one-to-one correspondence between
the “elements” in this XML document and the informal
description of the book presented earlier.

An improved example

Listing 2 shows a modest improvement over the XML code in Listing 1,
by including an “attribute” named number in each of the
chapter elements.  This attribute contains the chapter number and
is part of the information that defines the structure of the book. 

<?xml version="1.0"?>
<book>
<chap number="1">
Text for Chapter 1
</chap>

<chap number="2">
Text for Chapter 2
</chap>
</book>

Listing 2

The book represented by the XML code in Listing 2 has two chapters
with some text in each chapter.  This XML code contains an attribute
that describes the chapter number in each chapter element.
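A program can retrieve the value of such an attribute by name.  The sketch below is illustrative only: the class name ChapterNumbers and its numbers method are invented, and the text of Listing 2 is again held in a string.  The getAttribute method belongs to the standard org.w3c.dom.Element interface.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChapterNumbers {
    // The document from Listing 2, held in a string for this illustration.
    static final String BOOK_XML =
        "<?xml version=\"1.0\"?>\n"
        + "<book>\n"
        + "<chap number=\"1\">\nText for Chapter 1\n</chap>\n"
        + "\n"
        + "<chap number=\"2\">\nText for Chapter 2\n</chap>\n"
        + "</book>\n";

    // Collects the value of the number attribute from every chap element.
    static List<String> numbers(String xml) {
        try {
            NodeList chapters = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getElementsByTagName("chap");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < chapters.getLength(); i++) {
                Element chapter = (Element) chapters.item(i);
                result.add(chapter.getAttribute("number"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Chapter numbers: " + numbers(BOOK_XML));
    }
}
```

Note that the attribute values come back as strings.  XML itself attaches no numeric meaning to the value "1".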

Now consider a new jargon word: tag.

What is a tag?

The common jargon for XML items enclosed in angle brackets (such as
the one shown in Figure 4) is tag.  (You may be
familiar with this jargon based on HTML experience.)

<book>
Figure 4

Start tags and end tags

The tag shown in Figure 4 is often referred to as a start tag
or a beginning tag.

The tag shown in Figure 5 is often referred to as an end tag.

</book>
Figure 5

The end tag contains a slash

What is the difference between a start tag and an end
tag?  In this case, the start tag and the end tag
differ only in that the end tag contains a slash character.

Sometimes there are other differences

However, the start tag can also contain optional attributes
as discussed
below.  (There is also another form where the start tag and
end tag
are combined into something often called an empty element.)

What is an element?

It is time to learn the meaning of the jargon element, content, and
attribute.

Using widely accepted XML jargon, I will call the sequence of
characters in Figure 6 an element.

An element begins with a start tag and ends with an end
tag
and includes everything in between. 

<chap number="1">Text for Chapter 1</chap>
Figure 6

Color coded for clarity

I used artificial color coding in Figure 6 to make it easier to
refer to
the different parts of the element.

(Note however, that because an XML document is maintained in plain
text, the characters in an XML document do not have color properties.)

What is the content?

The characters in between the tags (rendered in green in Figure 6)
constitute the content.  (For more information on content, use your
browser to search for the word content in The XML FAQ.)

What is an attribute?

The characters rendered in blue in Figure 6 constitute an attribute.

To recap so you will remember it

An element consists of a start tag and an end tag
with the content being sandwiched in between the two tags.  The content
is part of the element.

May include optional attributes

The start tag may contain optional attributes.  In this example, a
single attribute provides the number value for the chapter.  The start
tag can contain any number of attributes, including none.

Tell me more about attributes

The term attribute is a commonly used term in computer
science and
usually has about the same meaning, regardless of whether the
discussion revolves
around XML, Java programming, or database management.

Attributes belong to things, or things have attributes

A chapter in a book is a thing.  A chapter has a number. 
In this example, the chapter number is an attribute of the chapter
element.

An apple has a color, red or green.  An apple also has a taste,
sweet or sour.

A dog has a size, small, medium, or large.

In the above statements, number, color, taste, and size
are
attributes.  Those attributes have values like red, green,
sweet, sour, small, medium,
and large.

As you can see, attributes are a very common part of the world in
which we live and work.

People have attributes

A person also has attributes, and each attribute has a value.

Figure 7 contains a list of some of the attributes (along with
their values)
that might be used to describe a person. 

name="Joe"
height="84"
weight="176"
complexion="pale"
sex="male"
training="Java programmer"
degree="Masters"
Figure 7

Obviously, there are many more attributes that could be used to
describe a person.

The importance of an attribute depends on the context

The decision as to which of many possible attributes are important
depends on the context in which the person is being considered.

Attributes for basketball players

For example, if the person is being considered in the context of
being a candidate for an all-male basketball team, the height, weight,
and sex attributes of a person will probably be
important considerations.

Attributes for programmers

On the other hand, if the person is being considered in the context
of being a candidate for employment as a programmer, the height,
weight,
and sex attributes should not be important at all,
but the training and degree attributes might be very
important.

Why does XML use attributes?

Earlier in this lesson, I suggested that the most common modern use
of the word rendering means to present something for human
consumption.  Usually, but not always, that refers to visual
consumption.  (My grandmother used to render fat to make soap,
but that is not modern usage of the term.)

Multiple renderings for the same document

I gave an example of a newspaper that can either be rendered on
newsprint paper, or can be rendered on a computer screen.

What is a rendering engine?

If the newspaper (structured document) is created and
maintained as an XML document, then some sort of computer program (often
referred to as a rendering engine)
will probably be used to render
it into the desired presentation format.

What about rendering our book?

Our book could also be rendered in a variety of different ways.

Regardless of how the book is rendered, it will probably be useful
to separate and number the chapters.

The value of the number attribute for each chapter element
could be used by the rendering engine to present the chapter number for
a specific rendering.
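To make the idea concrete, here is a toy "rendering engine" written as a sketch.  Everything about it is an assumption made for illustration: the class name TinyRenderer, the render method, and the use of a format string to stand in for a presentation style.  A real rendering engine would be far more elaborate.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TinyRenderer {
    static final String BOOK_XML =
        "<?xml version=\"1.0\"?>"
        + "<book>"
        + "<chap number=\"1\">Text for Chapter 1</chap>"
        + "<chap number=\"2\">Text for Chapter 2</chap>"
        + "</book>";

    // Renders every chap element with the given format string.  The first
    // placeholder receives the number attribute, the second the content.
    static String render(String xml, String chapterFormat) {
        try {
            NodeList chapters = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getElementsByTagName("chap");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < chapters.getLength(); i++) {
                Element chapter = (Element) chapters.item(i);
                out.append(String.format(chapterFormat,
                    chapter.getAttribute("number"),
                    chapter.getTextContent().trim()));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not render XML", e);
        }
    }

    public static void main(String[] args) {
        // A plain-text rendering of the book.
        System.out.print(render(BOOK_XML, "CHAPTER %s%n%s%n%n"));
        // An HTML rendering of exactly the same content.
        System.out.print(render(BOOK_XML, "<h1>Chapter %s</h1><p>%s</p>%n"));
    }
}
```

The XML document is the same in both calls.  Only the presentation instructions differ, which is the point of keeping the chapter number in an attribute rather than typing "Chapter 1" into the content.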

Chapter numbers may be rendered differently

In some renderings, the number might appear on an otherwise blank
page that begins a new chapter.  This is common in printed books,
but is not common in online presentations.

In a different rendering, the chapter number might appear in the
upper right or left-hand corner of each page.

Separation of content from presentation

To reiterate, one of the most important characteristics of XML (as
opposed to HTML) is that XML separates content from presentation.

The XML document contains information about structure and
content.  It does not contain presentation information (as
does HTML).

Presentation of XML requires a rendering engine

The presentation of an XML document requires the use of a rendering
engine of some sort to render the XML document in a particular
presentation style.

IE 5.0 (and later) contains a rendering engine

As an example of rendering, IE 5.0 (and later versions)
contains a rendering engine for XML.  When provided with an XML
document and no rendering instructions, IE will render the XML document
in a default format similar to that shown in Figure 8.

Figure 8 IE Rendering of XML File

This default rendering of an XML document is designed to emphasize
the tree structure of an XML document.  With the IE default
rendering, the nodes in the tree can be collapsed and expanded by
clicking the – and + symbols on the left, much as you can collapse and
expand the nodes in Windows Explorer (File Manager).

When provided with an XML document and appropriate rendering
instructions (such as an XSLT document), IE can transform XML
data into HTML data and render it in the browser window in different
formats.
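A browser is not the only program that can apply an XSLT document.  Without getting ahead of the future lessons, the same kind of transformation can be sketched with the standard javax.xml.transform classes.  The stylesheet below is a deliberately tiny example written for this illustration, and the class name BookToHtml is invented.  Treat it as a preview under those assumptions, not as a recommended stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BookToHtml {
    static final String BOOK_XML =
        "<?xml version=\"1.0\"?>"
        + "<book>"
        + "<chap number=\"1\">Text for Chapter 1</chap>"
        + "<chap number=\"2\">Text for Chapter 2</chap>"
        + "</book>";

    // Illustrative stylesheet: each chap becomes a heading and a paragraph.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/book\">"
        + "<html><body><xsl:apply-templates select=\"chap\"/></body></html>"
        + "</xsl:template>"
        + "<xsl:template match=\"chap\">"
        + "<h1>Chapter <xsl:value-of select=\"@number\"/></h1>"
        + "<p><xsl:value-of select=\"normalize-space(.)\"/></p>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the XML and returns the result as a string.
    static String transform(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(BOOK_XML, STYLESHEET));
    }
}
```

The rendering instructions live entirely in the stylesheet.  Supplying a different stylesheet would produce a different presentation from the same XML document.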

What is an XSLT document?

I will have a lot to say about the Extensible Stylesheet Language (XSL),
and stylesheet transformations (XSLT) in future lessons.

Attributes may be useful in rendering

Now getting back to attributes, they provide information about XML
elements that may be useful to the rendering engine.

If the attribute values for an element are not important in a
particular presentation context, the rendering engine for that context
will simply ignore them.  If they are important in a particular
context, the rendering engine will use them.

(The default IE rendering engine makes no use of
attributes, but does display them along with the other information in
the XML document.)

Elements, content, etc.

So far in this lesson, I have introduced tags, elements,
content and attributes.  I have discussed tags
and attributes in detail.  Now let’s continue the
discussion with particular emphasis on elements and content.

What is meant by content?

You already know about start tags and end tags.  You also know that an
element consists of a start tag (with optional attributes), an end tag,
and the content in between as shown in Figure 9.

<chapter number="1">Content for Chapter 1</chapter>
Figure 9

In Figure 9, the optional attribute is colored blue and the content
is colored green.

(Recall however, that because an XML document is maintained in plain
text, the characters in an XML document do not have color
properties.  I used color in this lesson simply to aid in the
explanation.)

Elements can be nested

Elements can be nested inside other elements in the construction of
the XML document as shown in Figure 10. 

<book>
<chapter number="1">Content for Chapter 1
</chapter>
<chapter number="2">
Content for Chapter 2
</chapter>
</book>
Figure 10

Color coding and indentation

In Figure 10, the tags belonging to the book element are
shown in
blue while the tags belonging to the chapter elements are shown
in
green.

I also provided artificial indentation to make it easier to see that
two chapter elements are nested inside a single book
element.

Indentation is common

Such indentation is common in the presentation of raw XML data for
human consumption.  For example, the default rendering of an XML
document by IE is an indented tree structure as shown in Figure 8.
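A program can recover the same nesting from the document itself, without any help from indentation in the raw text.  The following sketch (the class name TreeOutline and its methods are invented for illustration) parses the document from Figure 10 and produces one line per element, indented according to how deeply the element is nested.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class TreeOutline {
    // The document from Figure 10, written without any indentation.
    static final String BOOK_XML =
        "<book>"
        + "<chapter number=\"1\">Content for Chapter 1</chapter>"
        + "<chapter number=\"2\">Content for Chapter 2</chapter>"
        + "</book>";

    // Returns one line per element, indented two spaces per nesting level.
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Recursively visits element nodes; text nodes are skipped.
    private static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName()).append('\n');
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(BOOK_XML));
    }
}
```

The outline shows book on the first line with the two chapter elements indented beneath it.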

Identify the elements

The book element consists of its start tag, its end tag, and
everything in between (including nested elements), as shown in
Figure 11.

<book>
...
</book>
Figure 11

Each chapter element consists of its start tag, its end tag, and
everything in between, as shown in Figure 12.

  <chapter number="1">
...
</chapter>
Figure 12

Content of the book element

In this case, the two chapter elements form the content
of
the book element.

So, what is an element?

The element is the fundamental unit of information in an XML
document.  Most XML processing programs (such as rendering
engines)
depend on this fundamental unit of information in order to
do their job.

An XML document is an element

Apart from the XML declaration on the first line, the entire XML
document is an element.  As shown in Listing 2, the document consists
of the book element, which contains everything else.  This outermost
element is often referred to as the root element.

To be of much use, an XML document must have other elements nested
inside the root element.  For example, a nested element
can define some type of information, such as chapter in our
book example. Other possibilities would be table elements and appendix
elements.

Meta information

Through the use of attributes, the element often defines
information about the information provided by the XML document (sometimes
referred to as meta information).

In our book example, the number attribute provides the
chapter number
for each of the chapter elements. In effect, the chapter number
is
information about the information contained in the chapter.

The content

Sandwiched in between the start tag and the end tag
of an element, we find the information (content) that the XML
document is designed to convey.

So, what are elements good for?

By using a well-defined structure (based on XML elements) to
create and maintain your document, you make it much easier to write
computer programs that can be used to render, and otherwise process
your document.

Writing programs to process XML documents

At some point, you might want to visit one of my earlier articles
entitled “What is SAX, Part 1.”

(You will find a link to that article at www.dickbaldwin.com.)

That article describes how to write computer programs (using the
Java programming language)
that decompose an XML document into its elements
for some useful purpose.

In that article, I explain that SAX (the Simple API for XML) supports
an event-based approach to XML document processing.  (If you have a
background in event-driven programming, such as in Java or Visual
Basic, you will like the SAX approach.)

Parsing events

An event-based approach reports parsing events (such as the start and
end of elements) to the program using callbacks.  The program
implements and registers event handlers (callback methods) for the
different events.

Code in the event handlers is designed to achieve the objective of
the program. 
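As a small preview, and only as a sketch, the following code registers two callback methods with a SAX parser obtained through JAXP.  The class name SaxEventLog and the idea of recording each event in a list are assumptions made for this illustration.  DefaultHandler, SAXParserFactory, and the startElement and endElement callbacks are part of the standard API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventLog {
    static final String BOOK_XML =
        "<book>"
        + "<chap number=\"1\">Text for Chapter 1</chap>"
        + "<chap number=\"2\">Text for Chapter 2</chap>"
        + "</book>";

    // Parses the XML and records each start-of-element and
    // end-of-element event in the order the parser reports them.
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                log.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end " + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events(BOOK_XML)) {
            System.out.println(event);
        }
    }
}
```

The program never asks the parser for an element.  The parser calls the handlers as it encounters each tag, so the events arrive in document order.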

Not critical to understanding XML

I will have a great deal more to say about processing XML documents
using SAX in future lessons. I realize that a discussion of
event-driven programming for the processing of XML documents might not
be classified as “information for Getting Started with JAXP.” 
It is not even critical
for an understanding of XML.

However, it is a good way to illustrate the benefits provided by XML
elements.  Don’t worry too much about SAX at this point.  Just keep
studying, and at some point in the future, it will fall into place.

What have we learned so far?

So far in this lesson, I have introduced you to tags, elements,
content, and attributes.  I have discussed tags,
attributes,
and elements in detail. Now, I will discuss
content in detail.

What is content?

Of the four terms mentioned above, content is the easy one. 
Sandwiched in between the start tag and the end tag of
an element, we find the information (content) that the XML
document is designed to convey.

This is where we put the information for which the document was
created.

An XML newspaper

For example, if the XML document is being used for creation and
maintenance of material for a newspaper, the content is the news.

A Java programming textbook

If the XML document is being used for creation and maintenance of a
Java programming textbook, the content contains the information about
Java programming that we want to present to the student.

Tags, attributes, and elements define structure

The content is the raw information. The tags, attributes, and
elements define the structure into which we insert that information.

Why do we need structure?

One of the primary objectives of XML is to separate content from
presentation.

If we insert the raw material as content into a structure defined by
the tags, elements, and attributes, then that raw material can be
presented (rendered) in a variety of ways.  It can also be
searched in a variety of ways that can produce results that are more
meaningful
than simple keyword searches.
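As one illustration of a structured search, consider the sketch below.  It uses XPath, which is not discussed in this lesson; it appears here only because the javax.xml.xpath package in current JDKs offers a readily available way to ask a question in terms of elements and attributes.  The class name StructuredSearch and its query method are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class StructuredSearch {
    static final String BOOK_XML =
        "<?xml version=\"1.0\"?>"
        + "<book>"
        + "<chap number=\"1\">Text for Chapter 1</chap>"
        + "<chap number=\"2\">Text for Chapter 2</chap>"
        + "</book>";

    // Evaluates an XPath expression against the XML and returns
    // the result as a string.
    static String query(String xml, String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Query failed", e);
        }
    }

    public static void main(String[] args) {
        // A keyword search for "Chapter" would match both chapters.
        // This query uses the structure to select exactly one of them.
        System.out.println(
            query(BOOK_XML, "normalize-space(/book/chap[@number='2'])"));
    }
}
```

The query selects the chapter whose number attribute is 2 and returns only its content, something a keyword search over the raw text cannot express.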

Same content, different renderings

For example, an XML document can be used to represent a newspaper.

Then that document can be presented as an ordinary hard copy
newspaper by printing the content on newsprint in a format defined by
the structure.  Typically, we would use a rendering engine
designed for that purpose.

The same XML document can be used to present the same information in
a completely different rendering on a computer screen. Again, we would
probably use a rendering engine designed for that purpose.

Rendering engine formats the content

In both cases, the rendering engine would examine the structure
defined by the tags, elements, and attributes and would then format and
present the news (content) in a format appropriate for the
presentation media being used.

What does the future hold for XML?

Obviously, I believe that XML has a very bright future. 
Otherwise, I wouldn’t be making the kind of substantial investment in
time and energy that I am making in order to understand XML.

I base this belief on the fact that many large companies, including
Microsoft and IBM, have adopted XML as an important part of their
future.

XML will grease the skids of electronic commerce

For example, here are some of the things that Simon Phipps, IBM’s
chief XML and Java evangelist, had to say in his keynote speech at the
Software Development East conference a few years ago.

“Because it allows companies to share information with
customers or business partners without first negotiating technical
details, Extensible Markup Language (XML) will grease the skids of
electronic business
and become the assumed data format at the end of 2001.”

XML provides vendor independence

Phipps went on to say: 

“Other successful Internet technologies let people run their
systems without having to take into account another company’s own
computer systems, notably TCP/IP for networking, Java for programming,
and Web browsers for content delivery. XML fills the data formatting
piece of the puzzle.”

“These technologies do not create dependencies. It means you
can build solutions that are completely agnostic about the platforms
and software that you use.”

XML can reduce system costs

In the speech, entitled “Escaping Entropy Death,” Phipps
noted that users are reaching the point where the cost of simply owning
some systems is exceeding the value they provide.

“The key benefit to IT managers that adopt XML and other
non-proprietary standards is that they will greatly reduce the cost of
maintaining a computer’s systems and will allow them to extend existing
systems.”

“In the next decade, you can’t just ask when can you have [a
new application]. You also have to ask how much will it cost to own.”

No more vendor-imposed standards

According to Phipps: 

“The solution, interestingly enough, is not constant
innovation. You have to redeem the best of the parts you have and
combine them with the best of the future.”

Phipps contended that the IT industry has moved on from the era of “vendor-imposed
standards.”

This is an interesting observation by a representative from
IBM.  I grew up on computers during an era when IBM was the vendor
who imposed the standards.

Some would say that the role of imposing standards has now been
assumed by Microsoft (much to the dismay of IBM management).

What about Microsoft and XML?

Microsoft is making a huge investment in XML.  As mentioned
earlier, Microsoft’s IE browser currently supports XML documents, XSL
stylesheets, and XSL transforms.

(You can find links to several articles that I have previously
written discussing the rendering of XML documents using XSLT at www.DickBaldwin.com.)

In addition, many aspects of Microsoft’s latest MS.NET product
depend extensively
on XML.

The XSL Debugger from Microsoft

XSL is complex (much more complex than XML).  Designing
an XSL stylesheet, to be used by a rendering engine to properly render
an XML document, can be a daunting task.

To help us in that regard, Microsoft has developed an XSL debugger,
and has made it freely available for downloading.  As of the date
of this writing, the debugger can be downloaded from http://www.vbxml.com/xsldebugger/.
  I will discuss the use of this debugger in future lessons that
discuss the creation of XML processing programs using XSLT and JAXP.

Check out XML in MS Word

If you happen to have a copy of Microsoft Word around, use it to
create a simple HTML file.  Load that file into your HTML browser
and view the source.  When you do, you will find XML appearing at
various locations in the control information created by Word in that
HTML document.

What have we learned so far?

So far in this lesson, I have discussed tags, elements,
content, and attributes in detail.  I have also
presented a short sales pitch designed to convince you of the
importance of XML.

Now we are ready to move on to a new set of topics:  valid
documents, well-formed documents, and the DTD.

What is a DTD?

The quotation in Figure 13 was extracted from The XML FAQ.

“A DTD is usually a file (or several files to be used
together) which contains a formal definition of a particular type of
document. This sets out what names can be used for elements, where
they may occur, and how they all fit together. For example, if you want
a document type to describe <LIST>s which contain <ITEM>s,
part of your DTD would contain something like 

     <!ELEMENT item (#pcdata)> 

     <!ELEMENT list (item)+> 

This defines items containing text, and lists containing
items. 

It’s a formal language which lets processors
automatically parse a document and identify where every element comes
and how they relate to each other, so that stylesheets, navigators,
browsers, search engines, databases, printing routines, and other
applications
can be used.”

Figure 13

DTDs can be very complicated

I included the above quotation to emphasize one important point –
DTDs are, or can be, very complicated.

The reality is that the creation of a DTD of any significance is a
very complex task.

Don’t panic

However, despite their complexity, many of you will never need to
worry about having to create DTDs for the following two reasons:

  • XML does not require the use of a DTD.
  • Even when it is necessary to use a DTD, someone else may have
    already created it for you.

Many “standard” DTDs have already been developed and are available for
your use without any requirement for you to develop them.

The three amigos

An XML document has two very close friends, one of which is
optional.

I’m going to refer to them as three files just so I will have
something to call them (but they don’t have to be separate physical
files).

One file contains the content

One file contains the content of the document (words, pictures,
etc.).  This is the part that the author wants to expose to
the client. This file contains the XML code that I have been discussing
up to this point.

This is the file that is composed of elements, having start tags,
end tags, attributes, and content.  For convenience, the file name
often has an extension of xml, although that is not a requirement.

A second file contains the DTD

A second file contains the DTD, which meets the above definition
that was extracted from the FAQ.  This file is optional.

(Note that a modern alternative to the DTD is often
called a schema.  A schema, when it is available, serves the same
purpose as a DTD, but is often more powerful.  I will have more to
say about schema in future lessons.)

A third file contains a stylesheet

A third file contains a stylesheet, which establishes how the
content (that optionally conforms to the DTD) is to be rendered on
the output device for a particular application.

This file defines how the author wants the material to be presented
to the client.

Rendering the XML document

Different stylesheets are often used with the same XML data to cause
that data to be rendered in different ways.  For example, a tag
with an attribute of “red” might cause something to be presented bright
red according to one stylesheet and dull red according to another
stylesheet.  (It might even be presented as some shade of
green according to still another stylesheet, but that wouldn’t be a
very good design.)

DTD is optional, stylesheet is not

With XML, the DTD is optional but the stylesheet (or some
processing mechanism that substitutes for a stylesheet) is
required.  At least that is true if the XML document is ever to be
rendered for the benefit of a client.

Something must provide rendering specifications

Remember, XML separates content from presentation.

There is no presentation information in the XML document itself.

Therefore, rendering specifications must be provided to make it
possible to render the content of the XML document in the manner
intended by the author.

A stylesheet is typical, but not required

Typically, the rendering specifications are contained in a
stylesheet.  The stylesheet is used by a rendering engine to
render the XML document according to the specifications in the
stylesheet.

However, it is possible that the specifications could be hard-coded
into a program written specifically for the purpose of rendering the
XML document.  In that case, a stylesheet might not be required.

Rendering XML with XSL and MS IE

As mentioned earlier, I have published several articles that deal
with using IE to render XML using stylesheets written in XSL.  You
will find links to those articles at www.DickBaldwin.com.

Now back to the DTD.

A DTD can be very complex

Again, according to The XML FAQ:

“… the design and construction of a DTD can be a complex
and non-trivial task, so XML has been designed so it can be used either
with or without a DTD.  DTDless  operation means
you can invent markup without  having to define it formally. 

To make this work, a DTDless file in effect ‘defines’ its own
markup, informally, by the existence and location of elements where you
create them. 

But when an XML application such as a browser encounters a
DTDless file, it needs to be able to understand the document structure
as it reads it, because it has no DTD to tell it what to expect, so
some changes have been made to the rules.”

Figure 14

What does this really mean?

It means that it is possible to create and process an XML document
without the requirement for a DTD.  A little later, I will discuss
this possibility in connection with the term well-formed.

In the meantime…

You don’t always have the luxury of avoiding the DTD.  In some
situations, you may be required to create an XML document that meets
specifications that someone else has defined.

Hopefully, a DTD will be available

Ideally, in those cases, the person who defined the specifications
has also created a DTD and can provide it to you for your use.

A valid document

Here is a new term — a valid XML document.

In the normal sense of the word, if something is not valid,
that usually means that it is not any good.  However, that is not
the case for XML.

An invalid XML document can be a good XML document

An invalid XML document can be a perfectly good and useful
XML document. 
A very large percentage of useful XML documents are not valid XML
documents.

So, what is a valid XML document?

Drum roll please!!!  Without further delay, a valid XML
document is one that conforms to an existing DTD in every respect.

For example…

Unless the DTD allows an element with the name “color”, an XML
document containing an element with that name is not valid according
to that DTD (but it might be valid according to some other DTD).

Validity is not a requirement of XML

Many very useful XML documents are not valid, simply because they
were not constructed according to an existing DTD.

To make a long story short, validation against a DTD can often be
very useful, but may not be required.
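To make the idea of validity concrete, here is a minimal Java sketch that asks a JAXP parser to validate two small documents against a DTD.  The DTD is modeled on the list/item example quoted from the FAQ and is embedded in the document as an internal subset.  The class name ValidityDemo and the method name isValid are my own inventions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class ValidityDemo {
    // Hypothetical internal DTD: a list holds one or more item elements.
    public static final String DTD =
        "<!DOCTYPE list [\n"
        + "  <!ELEMENT list (item)+>\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "]>\n";

    /** Returns true if the XML text conforms to the DTD it carries. */
    public static boolean isValid(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // request a validating parser
        DocumentBuilder builder = factory.newDocumentBuilder();
        final boolean[] ok = {true};
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            // Validity problems are reported here as recoverable errors.
            public void error(SAXParseException e) { ok[0] = false; }
            // Well-formedness problems are fatal and abort the parse.
            public void fatalError(SAXParseException e)
                    throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return ok[0];
    }

    public static void main(String[] args) throws Exception {
        // Conforms to the DTD.
        System.out.println(isValid(DTD + "<list><item>one</item></list>"));
        // Well-formed, but the DTD does not allow an element named color.
        System.out.println(isValid(DTD + "<list><color>red</color></list>"));
    }
}
```

The second document is still well-formed, so the parser builds a tree for it.  It is simply not valid according to this particular DTD.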

A well-formed document

Here is another new term — a well-formed document.

The concept of being well-formed was introduced as a requirement of
XML, to deal with the situation where a DTD is not available (an
invalid document).

Again, according to The XML FAQ:

“For example, HTML’s <IMG> element is defined as
‘EMPTY’: it doesn’t have an end-tag. Without a DTD, an XML application
would have no way to know whether or not to expect an end-tag for an
element, so the concept of ‘well-formed’ has been
introduced.

This makes the start and end of every element, and the
occurrence of EMPTY elements completely unambiguous.”

Figure 15

What is an HTML <IMG> tag?

Although you may not know anything about the HTML <IMG> tag,
you do know about start tags and end tags from previous
discussion in this article.

Although HTML is related to XML (a distant cousin that combines
content and presentation in the same document),
HTML documents are
not required to be well-formed.

The quotation in Figure 15 refers to the use of a start tag (<IMG>)
in HTML that doesn’t require an end tag.  If used in that
manner in an XML document, the document would not be well-formed.

All XML documents must be well-formed

XML documents need not be valid.  However:

All XML documents must be well-formed.

What does it mean to be well-formed?

For a rigorous definition of a well-formed document, see http://www.w3.org/TR/2000/REC-xml-20001006#sec-well-formed.

From a somewhat less rigorous viewpoint, XML documents must adhere
to the following rules to be well-formed.

  • Every start-tag must have a matching end-tag.  All elements
    that can contain character data must have both start and end
    tags.  (Empty elements have a different requirement, which I
    will discuss later.)
  • Tags can’t overlap.  In other words, all elements must be
    properly nested.  If one element contains another element, the
    entire second element must be defined inside the start and end tags of
    the first element.
  • XML documents can have only one root element.
  • Element names must obey XML naming conventions.
  • XML is case sensitive, so the names in a start tag and its
    matching end tag must agree in case.
  • XML will keep white space in your text.
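Several of the rules above can be checked by handing small documents to a non-validating parser, which rejects anything that is not well-formed.  The following is a minimal sketch using JAXP.  The class name WellFormedDemo and the method name isWellFormed are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedDemo {
    /** Returns true if a non-validating parser accepts the text. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness rule was violated
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<a><b>text</b></a>"));   // properly nested
        System.out.println(isWellFormed("<a><b>text</a></b>"));   // overlapping tags
        System.out.println(isWellFormed("<a>one</a><a>two</a>")); // two root elements
        System.out.println(isWellFormed("<Book>text</book>"));    // case mismatch
        System.out.println(isWellFormed("<a>text"));              // missing end tag
    }
}
```

Only the first document is accepted.  Each of the others breaks one of the rules in the list.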

What is character data?

Although not rigorously true, for purposes of this discussion, let’s
just say that the content that we discussed in an earlier
section comprises character data.

Other requirements

All attribute values must be in quotes (apostrophes or double
quotes).  You already know about attributes.  I discussed
them earlier in this lesson.

You can surround the value with apostrophes (single quotes)
if the attribute value contains a double quote.  Conversely, an
attribute value that is surrounded by double quotes can contain
apostrophes.

Dealing with empty elements

We must also deal with empty elements.  Empty elements are those
that have no content at all (no character data and no child
elements).  You can deal with empty elements by writing them in
either of the two ways shown in Figure 16.

<book></book>

<book/>

Figure 16

You will recognize the format of the first line as simply writing a
start tag followed immediately by an end tag with nothing in
between.  The format of the second line in Figure 16 has a slash
at the end of the word book.

The second format is preferable

This is the first time in this lesson that I have mentioned the
second format, which is actually preferable.

One reason the second format is preferable is that because of word
wrap and other causes, you could end up with the first format in Figure
16 being converted to that shown in Figure 17.

<book>

</book>

Figure 17

Really not empty

Once this happens, although the element may look empty to you, it
really isn’t empty.  Rather it contains whatever characters are
used by that platform to represent a newline character
sequence.

Typically a newline is either a carriage return character, a line
feed character, or a combination of the two.  While these
characters are not visible, their presence will cause an element to be not
empty.

If an element is supposed to be empty, but it is not really empty,
this can cause problems when the XML file is processed.
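A short experiment makes the difference visible.  The following sketch parses the three forms with a JAXP DOM parser and counts the child nodes of the root element.  The class name EmptyElementDemo and the method name childCount are made up for this illustration, and no DTD is involved.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EmptyElementDemo {
    /** Parses the text and counts the child nodes of the root element. */
    public static int childCount(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes().getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(childCount("<book></book>"));   // prints 0
        System.out.println(childCount("<book/>"));          // prints 0
        // The newline between the tags becomes a text node.
        System.out.println(childCount("<book>\n</book>"));  // prints 1
    }
}
```

The first two forms produce an element with no children.  The third produces an element with one child, a text node holding the newline.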

The preferred approach

So, to reiterate, the preferred approach for representing an empty
element is as shown by the second line in Figure 16.

Empty element can contain attributes

Note that an empty element can contain one or more attributes inside
the start tag, as shown by the example in Figure 18.

<book author="baldwin" price="$9.95" />

Figure 18

Again, note the slash character at the end.

Another rule:  No markup characters are allowed

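For a document to be well-formed, it must not have the markup
characters < or & in the text data.  A literal > character is best
avoided as well (strictly speaking, the XML specification prohibits it
in text data only as part of the sequence ]]>).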

What is a markup character?

Since the < character represents the beginning of a new tag, if it
were included in the text data, it would cause the processor to become
confused.  The > character represents the end of a tag.  The XML
specification permits it in most text data, but replacing it as well
is common practice and avoids surprises.  The solution to this problem
(entities, as described below) also makes it necessary to exclude the
& character from the text data.

The solution

If you need your text to include the < character, the > character,
or the & character, you can represent them using &lt; &gt; and &amp;
instead.  (Note that I purposely omitted the use of a comma in this
list of entities to avoid having a comma become confused with the
required syntax for an entity, which always begins with an ampersand
and always ends with a semicolon.)

Entities

According to the prevailing jargon, these are called entities. 
You insert them into your text in place of the prohibited characters.

Entities always start with an ampersand character and end with a
semicolon.  It is that combination of characters that the
processor uses to distinguish them from ordinary text.

Other common entities

Although it may not be necessary for well-formedness, it is also
common practice to use an entity to represent the quotation mark
character (") by the entity &quot;.  It is also possible to use an
entity to represent many other characters, including characters that
don’t appear on a standard English-language keyboard.
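The following sketch shows what a parser does with these entities.  The XML text contains the entities, and the parser hands back the characters they stand for.  The class name EntityDemo and the method name textOf are hypothetical, and getTextContent assumes a Java release that supports DOM Level 3 (Java 5 or later).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityDemo {
    /** Parses the text and returns the character data of the root element. */
    public static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // The entities in the XML text are replaced by the parser.
        System.out.println(textOf("<note>5 &lt; 7 &amp;&amp; 7 &gt; 5</note>"));
        System.out.println(textOf("<note>say &quot;hi&quot;</note>"));
    }
}
```

The program sees the characters <, >, &, and " in the text it gets back, even though the XML file itself contains only the entities.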

Recap of validity and well-formed requirements

Valid XML files are those which have (or refer to) a DTD and
which conform to the DTD in all respects.

XML files must be well-formed, but there is no requirement for them
to be valid.  Therefore, a DTD is not required, in which case
validity is impossible to establish.

If XML documents do have or refer to a DTD, they must conform to it,
which makes them valid.

Why use a DTD if it is not required?

There are several reasons to use a DTD, in spite of the fact that
XML doesn’t
require one.

Enforcing format specifications

Suppose, for example, that you have been charged with publishing a
weekly newsletter, and you intend to produce the newsletter as an XML
file.

Suppose also that you occasionally have a guest editor who produces
the newsletter on your behalf.

Establish format specifications

You will probably establish a set of format specifications for your
newsletter and you will need to publish those specifications for the
benefit
of the guest editors.

No guarantee of compliance

However, simply publishing a document containing format
specifications does not ensure that the guest editors will comply with
the specifications.

Use a DTD to enforce format specifications

You can enforce the format specifications by also establishing a DTD
that matches the specifications.

Then, if either you or one of your guest editors produces an XML
document that doesn’t meet the specifications, the XML processor that
you use to render your newsletter into its final form will notify you
that the document is not valid.

Improved parser diagnostic data

Another reason that I have found a DTD to be useful is the
following.

I am occasionally called upon to write a Java program that will
parse and
process an XML document in some fashion.

My experience is that the parsers that I have used are much more
effective in identifying XML structural problems when the XML document
has a DTD than when it doesn’t.

By this I mean that often the diagnostic information provided by the
parser is more helpful when the XML document has a DTD.

This tends to make it easier to repair the document because a
validating parser does a better job of isolating the problem.

More than you wanted to know

And that is probably more than you ever wanted to know about
XML.  Now it’s time to terminate this review of XML and get to the
meat of this series of tutorial lessons – using Java JAXP to process
XML documents.

Preview

Having taken a very long detour to help the XML newcomers catch up
with everyone else, I will now get back on track and begin discussing
JAXP.

XML by itself isn’t very useful

In reality, an XML document is simply a text document constructed
according to a certain set of rules and containing information that the
author of the document may want to expose to a client.  (The
client
could be a human, or could be another computer.)

Taken by itself, the XML document isn’t worth much, particularly in
those cases where the client is a human.  To be very useful, the
XML document must be combined with a program that is designed to do
something useful with that document.  In other words, in order for
an XML document to be useful to you, you need access to a program that
can process that document to your satisfaction.

DOM and SAX

Regardless of the intended result, many XML processing programs
begin by applying a software construct called a parser to the
XML document.  The parser performs several different functions.
One important function is quality control.  A non-validating
parser will test the XML document to confirm that it is
well-formed.  A validating parser will confirm well-formedness,
and will also test the XML document to confirm that it conforms to
the specified DTD or schema.

Two of the most common types of parsers are:

  • A parser based on the Document Object Model, otherwise
    known as DOM.
  • A parser based on the Simple API for XML, otherwise known
    as SAX.

I will have a great deal more to say about DOM and SAX in future
lessons.   For purposes of this lesson, I need to provide a brief
introduction to DOM because I will use a DOM-based parser in the sample
program to be discussed later.

Brief introduction to DOM

An XML document can be viewed as a tree structure where the elements
constitute the nodes in the tree.  Some of the nodes have child
nodes and some do not.

(Usually those nodes that have no children are referred
to as leaf nodes.  This notation is based on the concept of a
physical tree where the root subdivides into trunk, limbs, branches,
twigs, and finally leaves.  However, the leaves don’t
subdivide.  Leaves on a physical tree don’t have children.)

An example of a tree structure

Referring back to the XML document in Listing 1, the element named book
could be viewed as the root of a tree structure.  It has two
children, which are the elements named chap.  Each of the
elements named chap has a child, which is the text shown in Listing
1.  The text forms the leaves of this tree.

A tree structure in memory

A DOM parser can be used to create a tree structure in memory, which
represents an XML document.  In Java, that tree structure is
encapsulated in an object of the interface type Document.
Document declares numerous methods.  Document is
also a subinterface of Node, and inherits many method
declarations from Node.
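Here is a minimal sketch of that relationship.  It parses a small stand-in document (a book with two chap elements, written here as a string rather than read from a file) and then treats the resulting Document as a Node.  The class name DocumentNodeDemo and the method name parse are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DocumentNodeDemo {
    /** Builds a DOM tree in memory from XML text. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book><chap>One</chap><chap>Two</chap></book>");

        // Document is a subinterface of Node, so this assignment is legal.
        Node asNode = doc;
        System.out.println(asNode.getNodeType() == Node.DOCUMENT_NODE); // true

        Element root = doc.getDocumentElement();
        System.out.println(root.getNodeName());                 // book
        System.out.println(root.getChildNodes().getLength());   // 2
        // The text inside the first chap element is a leaf node.
        System.out.println(root.getFirstChild().getFirstChild().getNodeValue()); // One
    }
}
```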

Many operations are possible

Given an object of type Document, there are many methods
that can be invoked on the object to perform a variety of
operations.  For example, it is possible to move nodes from one
location in the tree to another location in the tree, thus rearranging
the structure of the XML document represented by the Document
object.  It is also possible to delete nodes, and to insert new
nodes.  As you will see in the sample program in this lesson, it
is also possible to recursively traverse the
tree, extracting information about the nodes along the way.
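As a rough illustration of moving, inserting, and deleting nodes, consider the following sketch.  It is not part of the sample program for this lesson.  The class name RearrangeDemo and the helper methods parse and order are hypothetical, and the input is a small string rather than a file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RearrangeDemo {
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    /** Joins the text of each child of the given element with commas. */
    public static String order(Element parent) {
        StringBuilder sb = new StringBuilder();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (i > 0) sb.append(',');
            sb.append(kids.item(i).getTextContent());
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<book><chap>one</chap><chap>two</chap></book>");
        Element root = doc.getDocumentElement();

        // Move: appending a node that is already in the tree relocates it.
        root.appendChild(root.getFirstChild());
        System.out.println(order(root)); // two,one

        // Insert: create a new element and place it at the front.
        Element extra = doc.createElement("chap");
        extra.setTextContent("three");
        root.insertBefore(extra, root.getFirstChild());
        System.out.println(order(root)); // three,two,one

        // Delete: remove the last child.
        root.removeChild(root.getLastChild());
        System.out.println(order(root)); // three,two
    }
}
```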

I will show you …

In this lesson, I will show you how to:

  • Use JAXP, DOM, and an input XML file to create a Document object
    that represents the XML file.
  • Recursively traverse the DOM tree, getting information about each
    node in the tree along the way.
  • Use the information about the nodes to create a new XML
    file that represents the Document object.

The Document object represents the original XML file and the
DOM tree is not modified in this example.  The final XML file
represents the unmodified Document object, which represents the
original XML file.  Therefore, the final XML file will be
functionally equivalent to the original XML file.

Nothing fancy intended

This sample program is not intended to do anything fancy. 
Rather, it is intended simply to help you take the first small step
into the fascinating world of Java, JAXP, and XML.

Discussion and Sample Code


In total, this sample program consists of a source file named
Dom02.java, a source file named Dom02Writer.java, and an XML file
named Dom02.xml.  I will discuss these files in fragments.  Complete
listings of the three files are shown beginning with Listing 28 near
the end of the lesson.

The XML file named Dom02.xml

I will begin my discussion with the XML file named Dom02.xml.
A listing of this file begins in Listing 3.

An XML file always starts with a prolog, which is the part of the XML
document that precedes the XML data. The minimal prolog, shown in
Listing 3, contains a declaration that identifies the document as an
XML document.

(Note that the declaration may also contain additional
information that is not included in this simple XML document.)

<?xml version="1.0"?>

Listing 3

The root element

The root element of this XML document is named bookOfPoems.
An abbreviated form of the root element (with all of its
content removed) is shown in Listing 4.

<bookOfPoems>
...
</bookOfPoems>

Listing 4

Children of the root element

As shown in Listing 5, the root element contains two child elements
named poem.  (For clarity, I eliminated the content of
each of the poem elements in Listing 5.)

<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
...
</poem>
<?processor ProcInstr="Dummy"?>
<!--Comment-->

<poem PoemNumber="2" DumAtr="dum val">
...
</poem>
</bookOfPoems>

Listing 5

Processing instructions and comments

Listing 5 also shows a processing instruction (colored red
for identification), and a comment (colored blue for
identification).

Comments are (or may be) ignored by XML processors. 
Processing instructions are intended to provide instructions to XML
processors.  Depending on the overall design, some XML processors
may pay attention to some processing instructions and ignore
others. 
For example, a given XML document may be processed by two or more
processors
for different purposes.  The document may contain different
processing
instructions for the different XML processors.
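A DOM parser reports processing instructions and comments as nodes of their own types, alongside the elements.  The following sketch lists the children of a root element shaped like the one in Listing 5, written on one line so that no whitespace nodes appear.  The class name NodeKindsDemo and the method name describeChildren are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeKindsDemo {
    /** Describes each child node of the root element. */
    public static List<String> describeChildren(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList kids = doc.getDocumentElement().getChildNodes();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            switch (n.getNodeType()) {
                case Node.ELEMENT_NODE:
                    result.add("element " + n.getNodeName());
                    break;
                case Node.PROCESSING_INSTRUCTION_NODE:
                    // Node name is the target; node value is the data.
                    result.add("instruction " + n.getNodeName()
                        + " [" + n.getNodeValue() + "]");
                    break;
                case Node.COMMENT_NODE:
                    result.add("comment [" + n.getNodeValue() + "]");
                    break;
                default:
                    result.add("other");
                    break;
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bookOfPoems><poem/>"
            + "<?processor ProcInstr=\"Dummy\"?><!--Comment-->"
            + "<poem/></bookOfPoems>";
        for (String line : describeChildren(xml)) {
            System.out.println(line);
        }
    }
}
```

A program that walks the tree can then decide for itself which processing instructions to act on and whether to keep or discard the comments.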

Attributes of the poem element

Listing 5 also shows that each of the poem elements has two
attributes (colored green for identification):

  • PoemNumber
  • DumAtr
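Once the document has been parsed into a DOM tree, attribute values like these can be read from the element.  Here is a minimal sketch.  The class name AttributeDemo and the method name attribute are hypothetical, and the poem element is parsed on its own for brevity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class AttributeDemo {
    /** Parses the text and returns the named attribute of the root element. */
    public static String attribute(String xml, String name) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)))
            .getDocumentElement();
        // getAttribute returns an empty string if the attribute is absent.
        return root.getAttribute(name);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<poem PoemNumber=\"1\" DumAtr=\"dum val\"/>";
        System.out.println(attribute(xml, "PoemNumber")); // 1
        System.out.println(attribute(xml, "DumAtr"));     // dum val
    }
}
```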

Content of the first poem element

Listing 6 shows the content of the first poem element (colored
blue for identification).

  <poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>

</poem>

Listing 6

As you can see from Listing 6, the content of the first poem
element consists of a sequence of four elements named line.
The content of each of the line elements is the text that
constitutes one line in the poem.  When this XML document is
converted to a DOM tree, each of the text lines will constitute one
leaf node in the tree.

Content of the second poem element

Listing 7 shows the content of the second poem element. 
There is nothing new here, except for the indication that I could never
make a living as a poet.

  <poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>

</poem>

Listing 7

The entire XML document

Listing 8 shows the entire XML document with the same color coding as
above, so that you can identify all the parts, and view them in context:

<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>

</poem>
<?processor ProcInstr="Dummy"?>
<!--Comment-->
<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>

</poem>
</bookOfPoems>

Listing 8

It is important to note that although I have presented
this XML document with different colors to identify the different
parts, there is no color in an actual XML document.  Recall from the
earlier discussion that one of the most important aspects of XML
documents is that they exist in plain text, which doesn’t include
attributes such as boldface, italics, underline, or color.  This makes
XML documents easily transportable among different kinds of computers
and different operating systems.

The class named Dom02

The controlling class for this program is named Dom02.  I
will discuss this class in fragments.  As mentioned earlier, a
complete listing of the class is provided in Listing 28 near the end
of the lesson.

This class, when executed in conjunction with the class named Dom02Writer:

  • Creates a Document object using JAXP, DOM, and the input
    XML file named Dom02.xml.
  • Traverses the DOM tree, getting information about each element (each
    node in the tree).
  • Uses the information describing the nodes to create an output XML
    file that represents the Document object (and is
    functionally equivalent to the input XML file).

Why not identical?

By now you may be wondering why I used the weasel words “functionally
equivalent” instead of saying that the output XML file is identical
to the input XML file.  This has to do with the topic of
whitespace, which is a fairly complex topic in XML.  (I will have
much more to say about whitespace in future lessons.)

For now, suffice it to say that much of the whitespace in Listing 8
(newlines, indentation, etc.) was put there for cosmetic reasons.  For
reasons that I won’t attempt to explain in this simple example, some of
that cosmetic whitespace is not reflected in the output XML file.

Input and output file names

The names of the input and output XML files are provided to this
program by command-line arguments when the program is executed. 
The name of the input file is the first argument, and the name of the
output file is the second argument.

DocumentBuilder and Document objects

The program creates a DOM parser object, of type DocumentBuilder,
based on JAXP.  This object, along with its parse method,
is used to create a Document object (DOM tree) that
represents the input XML file.

Traverse the tree

The Document object’s reference is passed to the writeXmlFile
method of an anonymous object of the Dom02Writer class, which
traverses the tree and produces the output XML file representing that
tree.  As you will see, this is by far the most complex part of
the entire operation.  (In the next lesson, I will show you how
to accomplish the same thing with less complexity.)
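To give a feel for what such a traversal involves, here is a simplified sketch of a recursive walk that rebuilds XML text from a DOM tree.  This is not the Dom02Writer class.  The class name TraverseDemo and the methods serialize, walk, and escape are hypothetical, the result is returned as a string instead of being written to a file, and node types such as DOCTYPE declarations are simply skipped.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TraverseDemo {
    /** Parses XML text, walks the tree, and returns the rebuilt text. */
    public static String serialize(String xml) throws Exception {
        Node doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc, out);
        return out.toString();
    }

    static void walk(Node node, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                out.append('<').append(node.getNodeName());
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    out.append(' ').append(a.getNodeName()).append("=\"")
                       .append(escape(a.getNodeValue())).append('"');
                }
                out.append('>');
                walkChildren(node, out);
                out.append("</").append(node.getNodeName()).append('>');
                break;
            }
            case Node.TEXT_NODE:
            case Node.CDATA_SECTION_NODE:
                out.append(escape(node.getNodeValue()));
                break;
            case Node.COMMENT_NODE:
                out.append("<!--").append(node.getNodeValue()).append("-->");
                break;
            case Node.PROCESSING_INSTRUCTION_NODE:
                out.append("<?").append(node.getNodeName()).append(' ')
                   .append(node.getNodeValue()).append("?>");
                break;
            default: // the document node itself, and anything not handled
                walkChildren(node, out);
                break;
        }
    }

    static void walkChildren(Node node, StringBuilder out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, out); // recursion: each child is handled the same way
        }
    }

    /** Replaces markup characters with the predefined entities. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(
            "<a x=\"1\"><b>t &amp; u</b><!--c--><?p d?></a>"));
    }
}
```

Even this stripped-down version has to treat each node type differently and has to put the entities back into the text, which gives some idea of why the traversal is the most involved part of the job.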

Miscellaneous comments about the program

The program was tested using Sun’s SDK 1.4.2 under Windows XP along
with the file named Dom02.xml described above.

No effort was made to provide meaningful information about errors and
exceptions.  The topic of providing such meaningful information,
particularly regarding parsing errors, is fairly complex, and will be
addressed in a future lesson.

Import directives

Because the primary purpose of this lesson is to get you started using
JAXP, I will highlight the first three import directives, and
the classes that they represent, in Listing 9.

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import org.w3c.dom.Document;


import java.io.File;

Listing 9

Steps for creating a Document object

As you will see when we get into the code, creating a Document
object involves three steps:

  1. Create a DocumentBuilderFactory object
  2. Use the DocumentBuilderFactory object to create a DocumentBuilder
    object
  3. Use the DocumentBuilder object to create a Document
    object

Both the DocumentBuilderFactory class and the DocumentBuilder
class belong to the javax.xml.parsers package.  As of this
writing, this package is part of J2SE 1.4.2.
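
The three steps listed above can be sketched in a few lines.  This is only a minimal sketch, not the Dom02 program itself.  The class name ThreeStepsSketch, the helper method parseString, and the child element named poem are hypothetical, and the XML is parsed from an in-memory string (using a current JDK) instead of a file so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ThreeStepsSketch {

  // Hypothetical helper: parses XML held in a String and returns the DOM tree.
  static Document parseString(String xml) {
    try {
      // Step 1: create a DocumentBuilderFactory object.
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Step 2: use the factory to create a DocumentBuilder (parser) object.
      DocumentBuilder builder = factory.newDocumentBuilder();
      // Step 3: use the DocumentBuilder to create a Document object.
      return builder.parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      // Checked exceptions are wrapped only to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document document = parseString("<bookOfPoems><poem/></bookOfPoems>");
    // Print the name of the root element of the parsed document.
    System.out.println(document.getDocumentElement().getNodeName());
  }
}
```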

The DocumentBuilderFactory Class

According to Sun, the DocumentBuilderFactory class

“Defines a factory API that enables applications to
obtain a parser that produces DOM object trees from XML documents.”

The DocumentBuilderFactory class extends Object,
and defines about fifteen methods, one of which is a static method
named
newInstance.  As is often the case with factory objects,
the
newInstance method is used to create an object of the class.

The class also defines the newDocumentBuilder instance method,
which is used to create objects of the DocumentBuilder class,
discussed in the next section.

(Note that although the quotation from Sun in the next
section uses the terminology DocumentBuilderFactory.newDocumentBuilder
method, the newDocumentBuilder method is an
instance method and is not a static or class method.)

The DocumentBuilder Class

According to Sun, the DocumentBuilder class

“Defines the API to obtain DOM Document instances from an
XML document. Using this class, an application programmer can obtain a
Document from XML.

An instance of this class can be obtained from the
DocumentBuilderFactory.newDocumentBuilder method. Once an instance of
this class is obtained, XML can be parsed from a variety of input
sources. These input sources are InputStreams, Files, URLs, and SAX
InputSources.”

This class also extends Object, and defines about ten methods,
which include several overloaded versions of the parse instance
method.  When the parse method is invoked and passed an
input source containing XML, the method returns a Document
object (DOM tree) that represents the XML.

The code in this program will pass the file named Dom02.xml to
the parse method, thus producing a DOM tree that represents the XML
contained in that file.

The Document interface

Document is an interface in the org.w3c.dom package,
which extends the Node interface belonging to the same package.
 Thus, when we invoke the parse method described above,
the method returns a reference to an object instantiated from a class
that implements the Document interface.  The reference is
returned as type Document, not as the name of the class from
which
the object was actually instantiated.

(Because Document extends Node, that
object
could also be treated as type Node when appropriate.)

Don’t know and don’t care

As is often the case in situations like this, we don’t know, and
usually don’t care about the actual name of the class from which the
Document object was instantiated, so long as the class correctly
implements the methods declared in Document and Node.
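
If you are curious, the following minimal sketch shows both ideas: a Document reference can be passed to a method that expects a Node, and the name of the implementing class can be printed but is otherwise irrelevant.  The class name DocumentAsNodeSketch and its helper methods are hypothetical, and the printed class name depends on whichever parser implementation your JDK supplies.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class DocumentAsNodeSketch {

  // Hypothetical helper: creates an empty Document without parsing a file.
  static Document newDocument() {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().newDocument();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Accepts any Node, so a Document reference can be passed directly.
  static boolean isDocumentNode(Node node) {
    return node.getNodeType() == Node.DOCUMENT_NODE;
  }

  public static void main(String[] args) {
    Document document = newDocument();
    // The implementing class name depends on the parser in use.
    System.out.println(document.getClass().getName());
    // The same reference is treated as type Node here.
    System.out.println(isDocumentNode(document));
  }
}
```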

What does Sun have to say?

Sun has this to say about a Document object:

“The Document interface represents the entire HTML or
XML
document. Conceptually, it is the root of the document tree, and
provides
the primary access to the document’s data.”

Sun describes a Node as follows:

“The Node interface is the primary datatype for the
entire Document Object Model. It represents a single node in the
document tree. While all objects implementing the Node interface expose
methods for
dealing with children, not all objects implementing the Node interface
may have children. For example, Text nodes may not have children, and
adding children to such nodes results in a DOMException being raised.”

Methods of Document and Node

The Document and Node interfaces declare a large
number of methods, which make it possible to manipulate and perform
operations on the DOM tree structure encapsulated in the Document
object.  We will see several of those methods being used in the
class named Dom02Writer, as it traverses the tree to create an
output XML file that represents the tree.

The File class

The fourth import directive in Listing 9 imports the File
class.   I will assume that you already know all you need to know
about this class.  If not, see my tutorial lessons on file I/O at
www.DickBaldwin.com.

Enough talk, let’s see some code

Listing 10 shows the beginning of the class named Dom02, and
the main method for that class. 

public class Dom02{

  public static void main(String argv[]) {
    if (argv.length != 2) {
      System.err.println(
                "usage: java Dom02 fileIn fileOut");
      System.exit(0);
    }//end if

Listing 10

The code in Listing 10 simply checks to confirm that the user entered
the correct number of command-line arguments, and terminates with an
error message if not true.

Recall that argv[0] should contain the name of the input XML file and
argv[1] should contain the name of the output XML file.

A DocumentBuilderFactory object

The code in Listing 11 creates and configures an object of type DocumentBuilderFactory,
which is capable of producing objects of
type DocumentBuilder.  Objects of type DocumentBuilder
are, in turn, capable of producing objects of type Document.

    try{
      DocumentBuilderFactory factory =
                 DocumentBuilderFactory.newInstance();

      //Configure the factory object
      factory.setValidating(false);
      factory.setNamespaceAware(false);

Listing 11

Configuration

An object of the DocumentBuilderFactory class provides several
methods (such as setValidating), which can be used to
control the behavior of DocumentBuilder objects produced by the
factory object.  For example, if you want the parser
that will be produced by this and the following code to be a validating
parser, you must invoke the setValidating method at this point,
passing true as a parameter.

(Note that the validating and namespaceAware
properties are false by default, so inclusion of the corresponding
statements in Listing 11 didn’t accomplish anything, other than to
illustrate the location and use of these methods.)
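
You can confirm the default values, and see that the factory’s configuration carries over to the parser that it produces, with a minimal sketch such as the following.  The class name FactoryDefaultsSketch and its two helper methods are hypothetical.  The methods isValidating and isNamespaceAware belong to the standard DocumentBuilderFactory and DocumentBuilder classes.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

class FactoryDefaultsSketch {

  // Returns {validating, namespaceAware} for a freshly created factory.
  static boolean[] defaults() {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    return new boolean[] {
        factory.isValidating(), factory.isNamespaceAware()};
  }

  // Configures the factory first, then asks the resulting parser
  // how it was configured.
  static boolean builderIsValidating(boolean validating) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setValidating(validating);
      DocumentBuilder builder = factory.newDocumentBuilder();
      return builder.isValidating();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] d = defaults();
    System.out.println("validating by default: " + d[0]);
    System.out.println("namespaceAware by default: " + d[1]);
    System.out.println("after setValidating(true): "
        + builderIsValidating(true));
  }
}
```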

Get a DocumentBuilder (parser) object

As described earlier, the code in Listing 12 invokes the newDocumentBuilder
method on the factory object produced in Listing 11, to produce a DocumentBuilder
object.  That object’s reference is saved in the local variable
named builder.

      DocumentBuilder builder =
                            factory.newDocumentBuilder();

Listing 12

The object produced by the code in Listing 12 is the kind of object
that is commonly referred to in the XML literature as an XML parser.

(Thus, it would have been equally appropriate to save the
object’s reference in a variable named parser.)

Create a Document object

The code in Listing 13 invokes the parse method on the DocumentBuilder
(parser) object to parse the XML file whose name and path were
provided by the user as the first command-line argument (argv[0]).

Since this is a non-validating parser, the parse method
will confirm only that the XML is well-formed.  (The parser will not
attempt to validate the XML.)  If the XML is not well-formed,
the parse method will throw an exception.  If the XML is
well-formed, the parse method will create an object that
represents the XML in a DOM tree, and return that object’s reference
as the interface type Document.

      Document document = builder.parse(
                                   new File(argv[0]));

Listing 13

The code in Listing 13 saves the Document object’s reference in
the local variable named document.
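
The well-formedness check performed by the parse method can be demonstrated with a minimal sketch like the one below.  The class name WellFormedSketch and the method isWellFormed are hypothetical, and the XML is read from a string so that no files are needed.  A parsing failure is reported as a SAXException (typically its subclass SAXParseException).  Be aware that the parser’s default error handler may also write its own message to the standard error device.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WellFormedSketch {

  // Returns true if the non-validating parser accepts the XML,
  // and false if the parse method throws a SAXException.
  static boolean isWellFormed(String xml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      builder.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (SAXException e) {
      // The XML is not well-formed.
      return false;
    } catch (ParserConfigurationException | IOException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(isWellFormed("<a><b/></a>"));
    // The b element is never closed, so this is not well-formed.
    System.out.println(isWellFormed("<a><b></a>"));
  }
}
```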

Process the DOM tree

At this point, the DOM tree represents the XML in the input file.
 The methods of the Document and Node interfaces
could be used to perform a variety of operations on that tree, such as
moving nodes, deleting nodes, inserting new nodes, modifying text
nodes, etc.  Having performed such operations, the program could
then create a new XML file that represents the modified DOM tree.

In this simple program, however, we won’t modify the DOM tree. 
Rather, we will simply create a new XML file that represents the
unmodified DOM tree.   Thus, the output XML file should be
functionally equivalent to the input XML file.

Create the output file

This program will invoke a method named writeXmlFile on
an anonymous object of the Dom02Writer class to create the
output file, whose name and path were provided by the user as the
second command-line argument.   The writeXmlFile method is
invoked by the code in Listing 14, passing the Document
object’s reference as a parameter to the method.

The writeXmlFile method will recursively traverse the DOM tree
represented by the Document object.  Along the way,
it will extract information about each of the nodes and use this
information to construct the elements in the output XML file.

      new Dom02Writer(argv[1]).
                           writeXmlFile(document);
    }catch(Exception e){
      e.printStackTrace(System.err);
    }//end catch

  }// end main()
} // class Dom02

Listing 14

The catch block

Listing 14 also contains the catch block that receives control
if any of the code in the try block beginning in Listing 11
throws an error or an exception.

As mentioned earlier, the code in this catch block makes no
attempt to provide meaningful information in the event of an error or
an exception.   The code to provide meaningful information in the
event of parsing errors can be rather complex, and is a topic that will
be covered in a future lesson.

End of the Dom02 class

The code in Listing 14 also signals the end of the Dom02 class,
and the main method belonging to that class.

The Dom02Writer class

This class provides a utility method named writeXmlFile, which
receives a Document object’s reference as a parameter and
writes an output XML file that matches the information encapsulated in
the
Document object.

The output file is created by recursively traversing the DOM tree
encapsulated in the Document object, identifying each of the
nodes in that tree, and converting each node to text in an XML format.

No effort is made to insert spaces and line breaks to make the
output cosmetically pleasing.  Also, nothing is done to eliminate
cosmetic whitespace that may exist in the Document object.

The name of the output XML file is established as a parameter to the
constructor for the class.

Testing

This class was briefly tested using SDK 1.4.2 and WinXP. Note
however that this class has not been thoroughly tested. If you use the
class for a critical application, be sure to test it thoroughly before
using it.

The class definition

The beginning of the Dom02Writer class, including an instance
variable and the constructor is shown in Listing 15.  (See the
complete listing near the end of the lesson for the required import
directives.)

public class Dom02Writer {
  private PrintWriter out;

  public Dom02Writer(String xmlFile) {
    try {
      out = new PrintWriter(
                      new FileOutputStream(xmlFile));
    }catch (Exception e) {
      e.printStackTrace(System.err);
    }//end catch
  }//end constructor

Listing 15

The constructor

The constructor is very straightforward, having nothing to do with JAXP
or XML.  The purpose of the constructor is to receive the output
file name as an incoming parameter and to establish an output stream
of type PrintWriter that is used to write information to the
output file.

If this code is unfamiliar to you, you can learn about Java stream I/O
at www.DickBaldwin.com.

The writeXmlFile method

Listing 16 shows the entire method named writeXmlFile, which
converts an incoming Document object to an output XML file.

  public void writeXmlFile(Document document){
    try {
      writeNode(document);
    }catch (Exception e) {
      e.printStackTrace(System.err);
    }//end catch
  }//end writeXmlFile()

Listing 16

This method is also straightforward.  All that it does is
pass the Document object’s reference to a recursive method
named writeNode.

What does this mean?

Recall that I told you earlier that, according to Sun,

“The Document interface represents the entire … XML
document. Conceptually, it is the root of the document tree, …”

Recall also that when discussing a Document object, I told you

“Because Document extends Node, that
object could
also be treated as type Node when appropriate.”

We’re now going to put all of that to the test.  In effect, the Document
object is a Node, which represents
the root node of the DOM tree, and we can pass its reference to the
method named writeNode, which requires an incoming parameter of
type
Node.  

Recursion

Here is where things get a little more complicated, particularly if you
don’t have a strong background in recursive algorithms. 
The writeNode method implements a recursive algorithm.

(Typically at a time like this, I would tell you that if
you don’t understand recursion, you could visit my web site where you
will find tutorial lessons that explain recursion.  However, I
have just realized that despite the fact that I have published several
hundred lessons on OOP and Java, I have never published a lesson that
concentrates on
the implementation of recursion in Java.  Therefore, the best that
I can do at this point is to tell you to fire up your Google search
engine
and search for the keywords Java and recursion.  You will probably
find many sites that deal with recursion in Java.)
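
In the meantime, here is a minimal sketch of recursion on a DOM tree.  It is much simpler than writeNode and is not part of the Dom02 program.  The class name RecursionSketch and its methods are hypothetical.  The method countNodes counts a node, then calls itself once for each child.  The recursion stops when a node has no children.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RecursionSketch {

  // Counts the given node plus all of its descendants.
  // Attributes are not counted because they are not child nodes.
  static int countNodes(Node node) {
    if (node == null) {
      return 0;
    }
    int count = 1; // count this node
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      // Recursive call: count the subtree below each child.
      count += countNodes(children.item(i));
    }
    return count;
  }

  // Hypothetical helper: parses a string and counts the nodes,
  // starting from the Document node.
  static int countNodesInXml(String xml) {
    try {
      Node document = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return countNodes(document);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // Document, a, b, c, and the text node inside c.
    System.out.println(countNodesInXml("<a><b/><c>text</c></a>"));
  }
}
```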

The writeNode method

The writeNode method, which begins in Listing 17, is invoked
recursively to convert Node data to XML format and to write the
XML format data to the output file.

The method begins by executing code designed to avoid the infamous NullPointerException
that occurs when the incoming reference
fails to refer to an actual object of type Node.  In this
event, the method writes a message to the standard error device and
returns without doing anything else.

  public void writeNode(Node node) {
    if (node == null) {
      System.err.println(
                  "Nothing to do, node is null");
      return;
    }//end if

Listing 17

Process the node based on its type

The code in Listing 18 invokes the getNodeType method to
determine the type of the node whose reference was received as an incoming
parameter.  According to Sun, this method returns a short
value representing the type of the node.  (Why did I treat it
as type int?  Just an oversight I suppose.)

    int type = node.getNodeType();

Listing 18

The Sun documentation shows that the Node interface defines
final static variables that represent the following types (variables
defined in an interface are implicitly final):

  • ATTRIBUTE_NODE
  • CDATA_SECTION_NODE
  • COMMENT_NODE
  • DOCUMENT_FRAGMENT_NODE
  • DOCUMENT_NODE
  • DOCUMENT_TYPE_NODE
  • ELEMENT_NODE
  • ENTITY_NODE
  • ENTITY_REFERENCE_NODE
  • NOTATION_NODE
  • PROCESSING_INSTRUCTION_NODE
  • TEXT_NODE

These values will be used in a switch statement to identify the
type of incoming node, and to take appropriate action regarding
the information written in the output XML file.  (Note, however,
that this simple test case was not designed to test all possibilities
in the above list.)

Process the Document node

I will discuss each case in the switch statement separately.
  Listing 19 shows the code that is executed when the incoming Node
object is type
DOCUMENT_NODE.


    switch (type) {
      case Node.DOCUMENT_NODE: {
        out.print("<?xml version=\"1.0\"?>");

        //Get and write the root element of the
        // Document.  Note that this is a
        // recursive call.
        writeNode(
          ((Document)node).getDocumentElement());
        out.flush();
        break;
      }//end case Node.DOCUMENT_NODE

Listing 19

The code in Listing 19 begins by writing the XML declaration, which
indicates that the file contains XML data.  (Strictly speaking, the
declaration is optional in XML 1.0, but when it is present it must
appear at the very beginning of the file.)

The getDocumentElement method

Then the code in Listing 19 downcasts the Node object’s
reference to type Document and invokes the getDocumentElement
method on that reference.  Here is what Sun has to say about this
method:

“This is a convenience attribute that allows direct
access to the child node that is the root element of the document. For
HTML documents, this is the element with the tagName “HTML”.”

For the XML file being processed in this example, this will be the
element named bookOfPoems.

(Although I’m not certain, I suspect that the
documentation author intended to say convenience method instead of
convenience
attribute.)

An object of interface type Element

The getDocumentElement method returns a reference to
an object of the interface type Element, which is a
subinterface
of the Node interface.

Here is what Sun has to say about objects of type Element:

“The Element interface represents an element in an HTML
or XML document. Elements may have attributes associated with them;
since the Element interface inherits from Node, the generic Node
interface attribute attributes may be used to retrieve the set of all
attributes for an element. There are methods on the Element interface
to retrieve either an Attr
object by name or an attribute value by name. In XML, where an
attribute
value may contain entity references, an Attr object should be retrieved
to examine the possibly fairly complex sub-tree representing the
attribute
value.”

The interface declares about fifteen methods, which make it possible to
perform various operations on an Element object.

A recursive call to the writeNode method

The code in Listing 19 gets the Element object (which
is also a Node object) corresponding to the root element of
the XML document and passes that object’s reference, recursively, to
the writeNode method.

When the writeNode method ultimately returns, the code in
Listing 19 flushes the output buffer to ensure that all data that has
been written to the output buffer is actually written to the output
file.

Important:  The statement that reads out.flush
and all of the remaining code in this method will not be executed until
the recursive call to
writeNode() returns.

In effect, the code for the DOCUMENT_NODE case in the writeNode
method (Listing 19) simply gets the object in the DOM tree
corresponding to the root element in the XML document and passes it
recursively to
the writeNode method.  This causes the information
corresponding to the root element to be written to the output file.

Node type ELEMENT_NODE

Listing 20 shows the beginning of the code in the switch case where
the node type is ELEMENT_NODE.

      case Node.ELEMENT_NODE: {
        out.print('<');//begin the start tag
        out.print(node.getNodeName());

Listing 20

The code in Listing 20 is simple enough.

  • Begin the case clause for type ELEMENT_NODE.
  • Write a left angle bracket (“<”) into the output file to begin
    the tag for the element.
  • Get and write the name of the node immediately following the left
    angle bracket.

The Attr interface

An element can have none, one, or more attributes.  The Attr interface
extends the Node interface.  Here is part of what Sun has
to say about the Attr interface:

“The Attr interface represents an attribute in an
Element
object. Typically the allowable values for the attribute are defined in
a
document type definition.

Attr objects inherit the Node interface, but since they are
not actually child nodes of the element they describe, the DOM does not
consider them part of the document tree. Thus, the Node attributes
parentNode,
previousSibling, and nextSibling have a null value for Attr objects.
The DOM takes the view that attributes are properties of elements
rather
than having a separate identity from the elements they are associated
with; this should make it more efficient to implement such features as
default attributes associated with all elements of a given type.
Furthermore,
Attr nodes may not be immediate children of a DocumentFragment.
However,
they can be associated with Element nodes contained within a
DocumentFragment.
In short, users and implementors of the DOM need to be aware that Attr
nodes have some things in common with other objects inheriting the Node
interface, but they also are quite distinct.”

Process the attributes, if any

Continuing with the case for node type ELEMENT_NODE, the code in
Listing 21 gets the attributes, if any, belonging to the element and
writes them into the output file in the correct XML format.

        //Get attributes into an array
        Attr attrs[] = getAttrArray(
                           node.getAttributes());

        //Process attributes in the array.
        for (int i = 0; i < attrs.length; i++){
          Attr attr = attrs[i];
          out.print(' ');//write a space
          out.print(attr.getNodeName());
          out.print("=\"");//write ="
          //Convert <,>,&, and quotation char to
          // entities and write the text
          // containing the entities.
          out.print(
                  strToXML(attr.getNodeValue()));
          out.print('"');//write closing quote
        }//end for loop
        out.print('>');//write end of start tag

Listing 21

Get attributes into an array object

The code begins by invoking the getAttrArray method (which
is defined later in this class) to get the attributes and to store
them in an array object of type Attr. I will explain the getAttrArray
method later.  For now, suffice it to say that the getAttrArray
method returns a reference to an array object of type Attr
where each element in the array represents one of the attributes
associated with the node being processed.

Process the attributes, if any

With one exception, the code to process the array, getting the name and
value of each attribute and writing them into the output XML file is
straightforward. All it really amounts to is invoking the getNodeName
and getNodeValue methods to get the name and the value of the
attribute, and then creating the correct sequence of text, spaces, and
punctuation characters.  For the case where the node is of type Attr,
these two methods simply return strings.

The strToXML method

The exception mentioned above has to do with the call to the method
named strToXML. This method is used to replace extraneous angle
brackets, ampersands, and quotation marks in the text with the
corresponding
XML entities. I will explain the inner workings of this method later in
this lesson.

Nested elements

At this point, we must deal with the possibility that this node may
have children, and must process them if they exist.  This is
accomplished by the code in Listing 22, where we are still dealing with
the switch
case of node type ELEMENT_NODE.

        NodeList children = node.getChildNodes();

        if (children != null) {//nested elements
          int len = children.getLength();
          //Iterate on NodeList of child nodes.
          for (int i = 0; i < len; i++) {
            //Write each of the nested elements
            // recursively.
            writeNode(children.item(i));
          }//end for loop
        }//end if
        break;
      }//end case Node.ELEMENT_NODE

Listing 22

Listing 22 begins by invoking the getChildNodes method on
the current node to get an object of type NodeList containing
a collection of the children of this node.

The items in the NodeList object are accessible via an
integral index, starting from 0, via a method named item.
The item method takes an integral index as a parameter, and
returns a reference to an object of type Node.

A NodeList object also provides a method named getLength,
which returns the number of nodes in the list.

Getting the nodes in the list

The getChildNodes method returns an empty list if there are no
children. (If there are children, getLength returns a value
greater than zero.)

Assuming that you are comfortable with recursion, the code in Listing
22 is straightforward:

  • Invoke getLength to get the number of nodes.
  • Use a for loop to iterate on each of the nodes.
  • Make a recursive call inside the for loop to the writeNode
    method to process each child node.

That ends the processing for the switch case ELEMENT_NODE.
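
The following minimal sketch uses the same getChildNodes, getLength, and item methods to show what a NodeList actually contains.  The class name ChildNodesSketch and its methods are hypothetical.  Note that with this non-validating parser configuration, cosmetic whitespace between elements shows up in the list as text nodes, which is why the number of child nodes can be larger than the number of child elements.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ChildNodesSketch {

  // Hypothetical helper: parses a string and returns its root element.
  static Element rootOf(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Number of child nodes of the root, of any node type.
  static int childCount(String xml) {
    return rootOf(xml).getChildNodes().getLength();
  }

  // Number of child nodes of the root that are elements.
  static int elementChildCount(String xml) {
    NodeList children = rootOf(xml).getChildNodes();
    int count = 0;
    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    String xml = "<a>\n  <b/>\n  <c/>\n</a>";
    System.out.println("all children: " + childCount(xml));
    System.out.println("element children: " + elementChildCount(xml));
  }
}
```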

Entity reference nodes

The code in Listing 23 is the code for the switch case
ENTITY_REFERENCE_NODE.

      case Node.ENTITY_REFERENCE_NODE:{
        out.print('&');
        out.print(node.getNodeName());
        out.print(';');
        break;
      }//end case Node.ENTITY_REFERENCE_NODE

Listing 23

The code in Listing 23 sandwiches the name of a node of type
ENTITY_REFERENCE_NODE between an ampersand and a semicolon and writes
the combination into
the output file. This produces an entity reference in the output XML
file.

Briefly, an entity reference is a reference to something that has been
defined elsewhere. Since this lesson is not intended to teach you about
entities, I will drop it at that. The sample XML file used to test this
program didn’t contain any entity references, so this code has not been
tested.

Text nodes

The code in Listing 24 handles the following switch cases:

  • CDATA_SECTION_NODE
  • TEXT_NODE

      case Node.CDATA_SECTION_NODE:
      case Node.TEXT_NODE: {
        //Eliminate <,>,& and quotation marks and
        // write to output file.
        out.print(strToXML(node.getNodeValue()));
        break;
      }//end case Node.TEXT_NODE

Listing 24

Without getting into the technical XML details as to why, a block of
text can be represented by either of two node types:

  • CDATA_SECTION_NODE
  • TEXT_NODE

(The sample XML file that I used to test this program contained only
the second type.)

The processing for this type of node, shown in Listing 24, is very
simple:

  • Get the value of the node, which contains the actual text.
  • Invoke the strToXML method to replace angle brackets,
    ampersands, and quotation marks with entities.
  • Write the modified text to the output file.

Note, however, that by replacing angle brackets, ampersands, and
quotation marks with entities, the code in Listing 24 essentially
converts CDATA into PCDATA.  In some cases, that may not be
desirable, so this may not be the best approach for dealing with CDATA.
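
One possible alternative, shown here only as a minimal sketch and not as part of the Dom02Writer class, is to handle the two node types separately and to write a CDATA section back out inside its original delimiters.  The class name CdataSketch and its methods are hypothetical.  The sketch assumes the default parser configuration, in which CDATA sections are kept as separate nodes, and it assumes that the node value does not itself contain the sequence ]]>, which is always true for a node produced by parsing.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class CdataSketch {

  // Replaces &, <, >, and the quotation mark with XML entities.
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
        .replace(">", "&gt;").replace("\"", "&quot;");
  }

  // Writes CDATA sections verbatim and escapes ordinary text.
  static String render(Node node) {
    if (node.getNodeType() == Node.CDATA_SECTION_NODE) {
      return "<![CDATA[" + node.getNodeValue() + "]]>";
    }
    return escape(node.getNodeValue());
  }

  // Hypothetical helper: renders the first child of the root element.
  static String renderFirstChild(String xml) {
    try {
      Node first = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement().getFirstChild();
      return render(first);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(renderFirstChild("<a><![CDATA[x < y]]></a>"));
    System.out.println(renderFirstChild("<a>plain</a>"));
  }
}
```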

Processing instruction nodes

The code in Listing 25 is the code for switch case
PROCESSING_INSTRUCTION_NODE.

      case Node.PROCESSING_INSTRUCTION_NODE:{
        out.print("<?");
        out.print(node.getNodeName());
        String data = node.getNodeValue();
        if (data != null && data.length() > 0){
          out.print(' ');//write space
          out.print(data);
        }//end if
        out.print("?>");
        break;
      }//end Node.PROCESSING_INSTRUCTION_NODE
    }//end switch

Listing 25

Based on what you have learned up to this point, the processing of this
node type in Listing 25 should be straightforward.  In this
case, the getNodeName method returns a string corresponding to
the target of the processing instruction.  The getNodeValue
method returns a string consisting of the “entire content excluding
the target.”

The target string is written into the output XML file preceded by
“<?”.

If the string returned by getNodeValue is not null and has a
length greater than zero, that string is then written into the output
file preceded by a space.

Finally the characters “?>” are written into the output file
completing the processing instruction.
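
To see the target and the data separately, consider the following minimal sketch, which rebuilds a processing instruction using the same logic as Listing 25.  The class name ProcessingInstructionSketch, the method renderFirstNode, and the file name style.xsl are hypothetical.  The sketch looks at the first child of the Document node, which is where a processing instruction that precedes the root element ends up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ProcessingInstructionSketch {

  // Rebuilds the processing instruction that is the first child of
  // the Document node, or returns "" if the first child is not one.
  static String renderFirstNode(String xml) {
    Document document;
    try {
      document = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    Node node = document.getFirstChild();
    if (node.getNodeType() != Node.PROCESSING_INSTRUCTION_NODE) {
      return "";
    }
    StringBuilder sb = new StringBuilder("<?");
    sb.append(node.getNodeName());       // the target
    String data = node.getNodeValue();   // everything after the target
    if (data != null && data.length() > 0) {
      sb.append(' ').append(data);
    }
    return sb.append("?>").toString();
  }

  public static void main(String[] args) {
    System.out.println(renderFirstNode(
        "<?xml-stylesheet type=\"text/xsl\" href=\"style.xsl\"?><a/>"));
  }
}
```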

Close the element

There is one more thing that needs to be done before exiting the
writeNode method being used to process a node. As shown in
Listing 26, if the node being processed is an element, the end tag for the
element needs to be created and written to the output file. That is
accomplished in a straightforward manner in Listing 26.

    //Now write the end tag for element nodes
    if (type == Node.ELEMENT_NODE) {
      out.print("</");
      out.print(node.getNodeName());
      out.print('>');
    }//end if

  }//end writeNode(Node)

Listing 26

Listing 26 also signals the end of the writeNode method.

Utility methods

That brings us to some utility methods that are invoked by the code
discussed above.

The strToXML method

The purpose of the strToXML method, shown in Listing 27, is to
modify and return a String object replacing angle brackets,
ampersands, and quotation marks with XML entities.

  private String strToXML(String s) {
    StringBuffer str = new StringBuffer();

    int len = (s != null) ? s.length() : 0;

    for (int i = 0; i < len; i++) {
      char ch = s.charAt(i);
      switch (ch) {
        case '<': {
          str.append("&lt;");
          break;
        }//end case '<'
        case '>': {
          str.append("&gt;");
          break;
        }//end case '>'
        case '&': {
          str.append("&amp;");
          break;
        }//end case '&'
        case '"': {
          str.append("&quot;");
          break;
        }//end case '"'
        default: {
          str.append(ch);
        }//end default
      }//end switch
    }//end for loop

    return str.toString();

  }//end strToXML()

Listing 27

The method receives a String object’s reference as an incoming
parameter.  It replaces the <, >, &, and quotation mark
characters in that string with XML entities, and returns the modified
string.

The code in Listing 27 is completely straightforward, and shouldn’t
require further explanation.
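
If you would like to convince yourself that the replacement does no harm, the following minimal sketch performs a round trip: it escapes a string, wraps it in an element, parses the result, and reads the text back.  The class name EscapeRoundTripSketch, the method roundTrip, and the element name t are hypothetical.  The strToXML method here is a standalone copy of the logic in Listing 27 that uses StringBuilder instead of StringBuffer, which assumes a JDK newer than the one used for this lesson.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class EscapeRoundTripSketch {

  // Standalone copy of the escaping logic described above.
  static String strToXML(String s) {
    StringBuilder str = new StringBuilder();
    int len = (s != null) ? s.length() : 0;
    for (int i = 0; i < len; i++) {
      char ch = s.charAt(i);
      switch (ch) {
        case '<': str.append("&lt;"); break;
        case '>': str.append("&gt;"); break;
        case '&': str.append("&amp;"); break;
        case '"': str.append("&quot;"); break;
        default: str.append(ch);
      }
    }
    return str.toString();
  }

  // Escapes the text, embeds it in an element, parses the element,
  // and returns the text that the parser reports.
  static String roundTrip(String text) {
    String xml = "<t>" + strToXML(text) + "</t>";
    try {
      Document document = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return document.getDocumentElement().getTextContent();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String original = "a<b & \"c\"";
    System.out.println(strToXML(original));
    System.out.println(roundTrip(original).equals(original));
  }
}
```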

The getAttrArray method

In the earlier discussion of attributes, I promised to provide
a further discussion of the getAttrArray method shown in
Listing 28.  Briefly, this method converts a NamedNodeMap
into an array object of type Attr.

  private Attr[] getAttrArray(
                             NamedNodeMap attrs){
    int len = (attrs != null) ?
                           attrs.getLength() : 0;
    Attr array[] = new Attr[len];
    for (int i = 0; i < len; i++) {
      array[i] = (Attr)attrs.item(i);
    }//end for loop

    return array;
  }//end getAttrArray()

} // end class Dom02Writer

Listing 28

The getAttributes method

Backtracking a bit, the code in Listing 21 invokes the getAttributes
method on the node, and passes the returned value as a parameter to the
getAttrArray method shown in Listing 28.

The getAttributes method returns a reference to an object
of type NamedNodeMap containing the attributes of the node (if
it is an Element) and null otherwise.

Thus, the getAttrArray method shown in Listing 28 receives an
incoming parameter of type NamedNodeMap, which may be null.

The NamedNodeMap interface

Here is part of what Sun has to say about a NamedNodeMap object:

“Objects implementing the NamedNodeMap interface are
used
to represent collections of nodes that can be accessed by name. …
Objects
contained in an object implementing NamedNodeMap may also be accessed
by
an ordinal index, but this is simply to allow convenient enumeration of
the
contents of a NamedNodeMap, and does not imply that the DOM specifies
an
order to these Nodes.”

A NamedNodeMap object provides several methods, which can
be used to

  • Get the number of items in the collection.
  • Access the items in the collection.
  • Remove items from the collection.
  • Add items to the collection.

The method named item

The code in Listing 28 takes advantage of the fact that “Objects
contained in an object implementing NamedNodeMap may also be accessed
by an ordinal index, …”  This is accomplished by
invoking the method named item on the NamedNodeMap
object, passing an ordinal index as a parameter.

Given this information, the process for converting a NamedNodeMap
object into an array object of type Attr, as implemented by the
getAttrArray method in Listing 28, is relatively straightforward:

  • Get required length for the array.
  • Instantiate the new array object of the proper length.
  • Use a for loop and the item method to extract
    each item from the NamedNodeMap object and use it to populate
    the array object.
  • Return the array object.
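
The same steps can be exercised on a small document with a minimal sketch such as the following.  The class name AttributeListSketch, the method attributePairs, and the attribute names title and author are hypothetical.  Because the DOM does not specify an order for the items in a NamedNodeMap, the sketch sorts the results before returning them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttributeListSketch {

  // Returns the root element's attributes as sorted name=value strings.
  static List<String> attributePairs(String xml) {
    Node root;
    try {
      root = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    NamedNodeMap attrs = root.getAttributes();
    // Guard against null, as getAttrArray does.
    int len = (attrs != null) ? attrs.getLength() : 0;
    List<String> pairs = new ArrayList<>();
    for (int i = 0; i < len; i++) {
      // Access each item by ordinal index.
      Attr attr = (Attr) attrs.item(i);
      pairs.add(attr.getName() + "=" + attr.getValue());
    }
    Collections.sort(pairs);
    return pairs;
  }

  public static void main(String[] args) {
    System.out.println(
        attributePairs("<poem title=\"Ode\" author=\"Anon\"/>"));
  }
}
```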

End of class Dom02Writer

The code in Listing 28 also signals the end of the class definition for
the class named Dom02Writer.

Run the Program

I encourage you to copy the code from Listings 28, 29, and 30 into
your text editor, compile it, and execute it.  Experiment with it,
making changes, and observing the results of your changes.

Summary

In this first lesson on Java JAXP, I began by providing a brief
description of JAXP and XML.  Then I reviewed the salient aspects
of XML for those who need to catch up on XML technology.

Following that, I provided a brief discussion of the Document Object
Model (DOM) and the Simple API for XML (SAX).  I discussed how a
DOM object represents an XML document as a tree structure in
memory.  I explained that once you have the tree structure in
memory, there are many operations that you can perform to create,
manipulate, and/or modify the structure.  Then you can convert
that modified tree structure into a new XML document.

Using two sample Java class files, I showed you how to:

  • Use JAXP, DOM, and an input XML file to create a Document object
    that represents the XML file.
  • Recursively traverse the DOM tree, getting information about each
    node in the tree along the way.
  • Use the information about the nodes to create a new XML file that
    represents the Document object.
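
As a compact reminder of the first two items, here is a minimal sketch that parses XML text into a Document object and then recursively traverses the tree, counting the nodes of a given type.  The class name TraverseDemo and the methods parse and countNodes are assumptions made for this illustration.  Also, the sketch parses a string rather than a file so that it can stand alone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's program.
public class TraverseDemo {

  // Parse XML text into a DOM Document using a parser that
  // is non-validating and not namespace aware.
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory =
          DocumentBuilderFactory.newInstance();
      factory.setValidating(false);
      factory.setNamespaceAware(false);
      return factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // Recursively count the nodes of the specified type at or
  // below the specified node.
  static int countNodes(Node node, short type) {
    if (node == null) {
      return 0;
    }
    int count = (node.getNodeType() == type) ? 1 : 0;
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      count += countNodes(children.item(i), type);
    }
    return count;
  }

  public static void main(String[] args) {
    Document doc = parse(
        "<bookOfPoems><poem>"
        + "<line>Roses are red,</line>"
        + "<line>Violets are blue.</line>"
        + "</poem></bookOfPoems>");
    System.out.println("Elements: "
        + countNodes(doc, Node.ELEMENT_NODE));
  }
}
```

For the sample text shown in main, the sketch reports four elements: one bookOfPoems element, one poem element, and two line elements.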

What’s Next?

What I did not do in this lesson (but will do in a future lesson) is
show you how to modify the tree structure for purposes of creating a
modified XML file.

The things that you learned about traversing the tree structure and
getting information about each node in the tree will serve you well in
the future.   However, if all you need to do is to write an output
XML file that represents the DOM, there is an easier way to do that
using Extensible Stylesheet Language Transformations (XSLT).  That
will be the primary topic of the next lesson.
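
To give you a feel for that approach, here is a minimal sketch, assuming the standard javax.xml.transform classes.  A Transformer obtained without a stylesheet simply copies its source to its result, so it can serialize a Document object directly.  The class name IdentityTransformDemo and the methods domToXml and roundTrip are assumptions made for this illustration, and the sketch writes to a string instead of a file so that it can stand alone.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the lesson's program.
public class IdentityTransformDemo {

  // Serialize a Document object to XML text using a
  // Transformer that was created without a stylesheet.
  static String domToXml(Document document) {
    try {
      Transformer transformer =
          TransformerFactory.newInstance().newTransformer();
      StringWriter writer = new StringWriter();
      transformer.transform(
          new DOMSource(document), new StreamResult(writer));
      return writer.toString();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  // Hypothetical helper: parse XML text and write it back
  // out again.
  static String roundTrip(String xml) {
    try {
      Document document = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return domToXml(document);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(roundTrip(
        "<poem><line>Roses are red,</line></poem>"));
  }
}
```

To write a file instead, you could construct the StreamResult from a File object rather than from a StringWriter.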

In this lesson, I didn’t show you how to write code that produces
meaningful output in the event of a parser error or exception.  I
will also cover that topic in the next lesson.
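
In the meantime, here is one possible sketch of more meaningful error output.  It is not necessarily the approach that I will use in the next lesson.  It catches the SAXParseException thrown by the parser and reports the line and column where the problem was detected.  The class name ParseErrorDemo and the method describeParseError are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the lesson's program.
public class ParseErrorDemo {

  // Return null if the XML text parses cleanly.  Otherwise,
  // return a short description of the problem.
  static String describeParseError(String xml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory
          .newInstance().newDocumentBuilder();
      // DefaultHandler ignores warnings and recoverable
      // errors, and rethrows fatal errors.
      builder.setErrorHandler(new DefaultHandler());
      builder.parse(new InputSource(new StringReader(xml)));
      return null;
    } catch (SAXParseException e) {
      return "Parse error at line " + e.getLineNumber()
          + ", column " + e.getColumnNumber()
          + ": " + e.getMessage();
    } catch (Exception e) {
      return "Unexpected problem: " + e;
    }
  }

  public static void main(String[] args) {
    // The end tag for element b is missing.
    System.out.println(describeParseError("<a><b></a>"));
  }
}
```

The exact wording of the message depends on the parser that JAXP selects at runtime, so the sketch makes no attempt to interpret it.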

Complete Program Listings


Complete listings of the two Java classes and the XML document
discussed in this lesson are shown in Listings 28, 29, and 30 below.

/*File Dom02.java
Copyright 2003 R.G.Baldwin

This program and the class named Dom02Writer used
by this program show you how to:

1. Create a Document object using JAXP, DOM, and
an input XML file.
2. Traverse the DOM tree getting information
about each element.
3. Use the information describing the elements to
create an XML file that represents the
Document object.

The input XML file name is provided by the user
as the first command-line argument. The output
XML file name is provided by the user as the
second command-line argument.

The program requires access to the following
class file:
Dom02Writer.class

The program instantiates a DOM parser object
based on JAXP. The parser is non-validating
and is not namespace aware.

The program uses the parse() method of the parser
object to parse an XML file specified on the
command line. The parse method returns an object
of type Document that represents the parsed XML
file.

The program passes the Document object to a
method named writeXmlFile() on an object of a
class named Dom02Writer. The purpose of this
method and this class is to write an XML file
that represents the information contained in the
Document object.

Tested using JDK 1.4.2 and WinXP with an XML
file that reads as follows:

<?xml version="1.0"?>
<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>
<?processor ProcInstr="Dummy"?>
<!--Comment-->
<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>
</bookOfPoems>

When viewed with an editor that restores most of
the cosmetic XML structure, the output file looks
like the following (note that the comment from
the input XML file is missing in the output
XML file):

<?xml version="1.0"?><bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>
<?processor ProcInstr="Dummy"?>

<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>
</bookOfPoems>


Note. No effort was made to provide meaningful
information about errors and exceptions. This is
a complex topic that will be covered in a
subsequent sample program.
************************************************/

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import java.io.File;
import org.w3c.dom.Document;

public class Dom02 {

  public static void main(String argv[]) {
    if (argv.length != 2) {
      System.err.println(
          "usage: java Dom02 fileIn fileOut");
      //Exit with a non-zero status to signal
      // incorrect usage.
      System.exit(1);
    }//end if

    try{
      //Get a factory object for DocumentBuilder
      // objects
      DocumentBuilderFactory factory =
          DocumentBuilderFactory.newInstance();

      //Configure the factory object
      factory.setValidating(false);
      factory.setNamespaceAware(false);

      //Get a DocumentBuilder (parser) object
      DocumentBuilder builder =
          factory.newDocumentBuilder();

      //Parse the XML input file to create a
      // Document object that represents the
      // input XML file.
      Document document = builder.parse(
          new File(argv[0]));

      //Use an anonymous object of the
      // Dom02Writer class to traverse the
      // Document object, extracting information
      // about each of the nodes, and using that
      // information to write an output XML
      // file that represents the Document
      // object.
      new Dom02Writer(argv[1]).
          writeXmlFile(document);
    }catch(Exception e){
      e.printStackTrace(System.err);
    }//end catch

  }// end main()
} // class Dom02

Listing 28
/*File Dom02Writer.java
Copyright 2003 R.G.Baldwin

This class provides a utility method named
writeXmlFile() that receives a DOM Document
object as a parameter and writes an output XML
file that matches the information contained in
the Document object.

The output file is created by recursively
traversing the Document object, identifying each
of the nodes in that object, and converting each
node to text in an XML format.

No effort is made to insert spaces and line
breaks to make the output cosmetically pleasing.
Also, nothing is done to eliminate cosmetic
whitespace that may exist in the Document object.

The name of the XML file is established as a
parameter to the constructor for the class.

A cosmetically pleasing view of the output file
can be obtained by opening the output file in
IE 5.0 or later.

Briefly tested using JDK 1.4.2 and WinXP. Note
however that this class has not been thoroughly
tested. If you use it for a critical application,
test it thoroughly before using it.
************************************************/

import java.io.PrintWriter;
import java.io.FileOutputStream;

import org.w3c.dom.*;

public class Dom02Writer {
  private PrintWriter out;

  //-------------------------------------------//

  public Dom02Writer(String xmlFile) {
    try {
      out = new PrintWriter(
          new FileOutputStream(xmlFile));
    }catch (Exception e) {
      e.printStackTrace(System.err);
    }//end catch
  }//end constructor
  //-------------------------------------------//

  //This method converts an incoming Document
  // object to an output XML file
  public void writeXmlFile(Document document){
    try {
      //Write the contents of the Document object
      // into an output file in XML file format.
      writeNode(document);
    }catch (Exception e) {
      e.printStackTrace(System.err);
    }//end catch
  }//end writeXmlFile()
  //-------------------------------------------//

  //This method is used recursively to convert
  // node data to XML format and to write the XML
  // format data to the output file.
  public void writeNode(Node node) {
    if (node == null) {
      System.err.println(
          "Nothing to do, node is null");
      return;
    }//end if

    //Process the node based on its type.
    int type = node.getNodeType();

    switch (type) {
      //Process the Document node
      case Node.DOCUMENT_NODE: {
        //Write a required line for an XML
        // document.
        out.print("<?xml version=\"1.0\"?>");

        //Get and write the root element of the
        // Document. Note that this is a
        // recursive call.
        writeNode(
            ((Document)node).getDocumentElement());
        out.flush();
        break;
      }//end case Node.DOCUMENT_NODE

      //Write an element with attributes
      case Node.ELEMENT_NODE: {
        out.print('<');//begin the start tag
        out.print(node.getNodeName());

        //Get and write the attributes belonging
        // to the element. First get the
        // attributes in the form of an array.
        Attr attrs[] = getAttrArray(
            node.getAttributes());

        //Now process all of the attributes in
        // the array.
        for (int i = 0; i < attrs.length; i++){
          Attr attr = attrs[i];
          out.print(' ');//write a space
          out.print(attr.getNodeName());
          out.print("=\"");//write ="
          //Convert <,>,&, and quotation char to
          // entities and write the text
          // containing the entities.
          out.print(
              strToXML(attr.getNodeValue()));
          out.print('"');//write closing quote
        }//end for loop
        out.print('>');//write end of start tag

        //Deal with the possibility that there
        // may be other elements nested in this
        // element.
        NodeList children = node.getChildNodes();
        if (children != null) {//nested elements
          int len = children.getLength();
          //Iterate on NodeList of child nodes.
          for (int i = 0; i < len; i++) {
            //Write each of the nested elements
            // recursively.
            writeNode(children.item(i));
          }//end for loop
        }//end if
        break;
      }//end case Node.ELEMENT_NODE

      //Handle entity reference nodes
      case Node.ENTITY_REFERENCE_NODE:{
        out.print('&');
        out.print(node.getNodeName());
        out.print(';');
        break;
      }//end case Node.ENTITY_REFERENCE_NODE

      //Handle text
      case Node.CDATA_SECTION_NODE:
      case Node.TEXT_NODE: {
        //Convert <,>,& and quotation marks to
        // entities and write to output file.
        out.print(strToXML(node.getNodeValue()));
        break;
      }//end case Node.TEXT_NODE

      //Handle processing instruction
      case Node.PROCESSING_INSTRUCTION_NODE:{
        out.print("<?");
        out.print(node.getNodeName());
        String data = node.getNodeValue();
        if (data != null && data.length() > 0){
          out.print(' ');//write space
          out.print(data);
        }//end if
        out.print("?>");
        break;
      }//end Node.PROCESSING_INSTRUCTION_NODE
    }//end switch

    //Now write the end tag for element nodes
    if (type == Node.ELEMENT_NODE) {
      out.print("</");
      out.print(node.getNodeName());
      out.print('>');

    }//end if

  }//end writeNode(Node)
  //-------------------------------------------//

  //The following methods are utility methods

  //This method inserts entities in place
  // of <,>,&, and quotation mark
  private String strToXML(String s) {
    StringBuffer str = new StringBuffer();

    int len = (s != null) ? s.length() : 0;

    for (int i = 0; i < len; i++) {
      char ch = s.charAt(i);
      switch (ch) {
        case '<': {
          str.append("&lt;");
          break;
        }//end case '<'
        case '>': {
          str.append("&gt;");
          break;
        }//end case '>'
        case '&': {
          str.append("&amp;");
          break;
        }//end case '&'
        case '"': {
          str.append("&quot;");
          break;
        }//end case '"'
        default: {
          str.append(ch);
        }//end default
      }//end switch
    }//end for loop

    return str.toString();

  }//end strToXML()
  //-------------------------------------------//

  //This method converts a NamedNodeMap into an
  // array of type Attr
  private Attr[] getAttrArray(
      NamedNodeMap attrs){
    int len = (attrs != null) ?
        attrs.getLength() : 0;
    Attr array[] = new Attr[len];
    for (int i = 0; i < len; i++) {
      array[i] = (Attr)attrs.item(i);
    }//end for loop

    return array;
  }//end getAttrArray()

  //-------------------------------------------//

} // end class Dom02Writer

Listing 29
<?xml version="1.0"?>
<bookOfPoems>
<poem PoemNumber="1" DumAtr="dum val">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>
<?processor ProcInstr="Dummy"?>
<!--Comment-->
<poem PoemNumber="2" DumAtr="dum val">
<line>Roses are pink,</line>
<line>Dandelions are yellow,</line>
<line>If you like Java,</line>
<line>You are a good fellow.</line>
</poem>
</bookOfPoems>

Listing 30


Copyright 2003, Richard G. Baldwin.  Reproduction in whole or in
part in any form or medium without express written permission from
Richard Baldwin is prohibited.

About the author

Richard Baldwin
is a college professor (at Austin Community College in Austin, TX) and
private consultant whose primary focus is a combination of Java, C#,
and XML. In addition to the many platform and/or language independent
benefits of Java and C# applications, he believes that a combination
of Java, C#, and XML will become the primary driving force in the
delivery of structured information on the Web.

Richard has participated in numerous consulting projects, and he
frequently provides onsite training at the high-tech companies located
in and around Austin, Texas.  He is the author of Baldwin’s
Programming Tutorials, which
has gained a worldwide following among experienced and aspiring
programmers. He has also published articles in JavaPro magazine.

Richard holds an MSEE degree from Southern Methodist University
and has many years of experience in the application of computer
technology to real-world problems.

Baldwin@DickBaldwin.com

-end-
 
