Parsing XML Documents: Events, Part 4
Preface
This is the fourth in a series of six lessons designed to teach you how to write custom XML processing programs using a SAX-based parser and the Java programming language. A sample program is provided and explained in the sixth lesson in the series. The five lessons leading up to that lesson are intended to prepare you to understand the complex technical material presented in the sixth lesson.
I maintain a consolidated index of hyperlinks to all of my XML articles at my personal website. You can easily locate and access my XML articles from there.
Preview
One of the common ways to create custom XML processing tools is through the use of an event-based parser that implements the SAX interface, along with either the Java or the Python programming language. In the previous lesson, I provided quite a few details about SAX.
One way to parse an XML document is to analyze the XML document as a stream of text, recognizing the various components as they are encountered, and applying the processing algorithm as the components are recognized. A very common way to implement this approach is by using a concept (often referred to as SAX) that will examine the sequence of characters that comprise the XML text and raise events (such as the start and end of elements) as the components in the document are encountered.
An event-based parser reports events to the processing program using callbacks. The program implements and registers event handlers for the different types of events. Code written into the event handlers is designed to achieve the overall objective of the program.
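The callback arrangement described above can be sketched in a few lines of Java. This is only a minimal sketch under stated assumptions: it uses the JAXP parser bundled with the JDK and the SAX 2 DefaultHandler class rather than XML4J and SAX 1, and the names CallbackSketch, RecordingHandler, and parse are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class CallbackSketch {
    // Hypothetical handler: records each element event that the parser reports.
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }
    }

    // Registers the handler with a parser, parses the text, and returns the recorded events.
    static List<String> parse(String xml) throws Exception {
        // Assumption: the JDK's built-in JAXP parser stands in for XML4J here.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        RecordingHandler handler = new RecordingHandler();
        reader.setContentHandler(handler);                     // register the callbacks
        reader.parse(new InputSource(new StringReader(xml)));  // parser invokes them as it reads
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<poem><line>Roses are red,</line></poem>"));
    }
}
```

The program never asks the parser for an element. It simply waits for the parser to call back, which is the essence of the event-based approach.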
In the previous lesson entitled Parsing XML Documents: Events, Part 3, I promised that I would continue the discussion of SAX in this lesson. I also promised to introduce you to a software product from IBM named XML for Java or XML4J for short. This product supports both SAX and DOM, and lets the application programmer combine the two approaches in a single application.
Several parsers are freely available that implement SAX for event-based parsing. Links to some of those parsers can be found at the following URL. Notable among the available parsers are IBM's XML for Java and Sun's Java API for XML.
IBM's XML for Java (XML4J)
IBM's XML for Java is a validating XML parser written in 100% pure Java. As of this writing, it can be downloaded free of charge from the IBM alphaWorks site. This is the parser that I will use (in most cases) in the sample programs that I will provide for this series of articles on SAX and DOM.
In terms of a recommendation, Version 1 of IBM's XML for Java was the highest-rated Java XML parser in Java Report's February 1999 review of XML parsers. The parser has been upgraded to add new features since that report was written. Of particular importance to application programmers, the parser supports several existing standards. As of January 20, 2001, according to IBM:
"The XML Parser for Java Version 3.1.1 Release (XML4J-3_1_1) is now available. This release contains public and stable support of the DOM Level 1, and SAX Level 1 specifications. It also contains implementations of the DOM Level 2, SAX Level 2 implementations, and partial April 7 W3C Schema implementations but these are considered experimental, as the specifications themselves are still subject to change."
Support of SAX is of particular interest in this tutorial lesson. Support for DOM will become important in subsequent lessons.
What does it mean to say that the parser supports the SAX Level 1 specifications? The SAX specification consists primarily of a set of interface definitions. There are few, if any, concrete class and method definitions in SAX. SAX is therefore a specification of how an event-based parser should behave from the programming interface viewpoint.
A parser that implements SAX Level 1 implements the interfaces defined in SAX Level 1, providing concrete class and method definitions for the methods declared in the various SAX interfaces. Exactly how those interfaces are implemented is up to the designer of the parser product. Those classes and methods are implemented in such a way that application programmers can gain access to the event-based capabilities of the parser using the method signatures declared in the SAX interface definitions.
The programmer doesn't need to know the names of the actual classes used to implement the interfaces. Thus, programming to the SAX interface produces code that is largely vendor-independent with respect to the parser itself. One vendor may implement the interface methods differently from another vendor insofar as the names of the classes and the inner workings of the methods are concerned. However, the programming interface and the resulting behavior of the methods will be as defined in SAX.
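The vendor independence described above can be illustrated with a short sketch. It assumes the JAXP factory in the JDK and the SAX 2 XMLReader interface (the SAX 1 counterpart is org.xml.sax.Parser). The class name InterfaceSketch and the method countElements are hypothetical. Note that the processing code refers only to types declared by SAX and never names the concrete parser class.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class InterfaceSketch {
    // Counts elements using only SAX-declared types; any conforming parser can be passed in.
    static int countElements(XMLReader reader, String xml) throws Exception {
        final int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the factory returns whatever SAX implementation is installed.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // The concrete class name differs from vendor to vendor; the code above never needs it.
        System.out.println("Concrete class: " + reader.getClass().getName());
        System.out.println("Elements: " + countElements(reader, "<a><b/><b/></a>"));
    }
}
```

Swapping in a different conforming parser would change the class name that is printed, but countElements would not need to change.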
For example, SAX declares a method named startDocument(). This is the declaration for an event-handler method that will be invoked by the parser when the parser encounters the beginning of the XML document. It is the responsibility of the application programmer to override this method to provide the desired behavior when the parser begins parsing a document.
If the terminology "override this method" is new to you, see my online Java tutorials for an explanation of this and other Object-Oriented Programming concepts.
It is the responsibility of the parser software to invoke this overridden method when the parser begins parsing a document. That invocation runs the code that the application programmer wrote into the method, so the program behaves as its designer intended. A default version of the method must also be available, to be invoked when the application programmer chooses not to override it. (SAX 1 supplies do-nothing defaults for the event-handler methods in its HandlerBase class.)
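As a rough illustration of overriding startDocument() and of falling back on a default version, consider the following sketch. It assumes the SAX 2 DefaultHandler class and the JDK's JAXP parser rather than XML4J. The names StartDocumentSketch, AnnouncingHandler, and startCount are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class StartDocumentSketch {
    // Hypothetical handler that overrides only startDocument().
    static class AnnouncingHandler extends DefaultHandler {
        int startCount = 0;

        @Override
        public void startDocument() {
            startCount++;
            System.out.println("Parser reported the start of the document");
        }
    }

    // Parses the given XML text with the supplied handler registered.
    static void parse(String xml, DefaultHandler handler) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bookOfPoems><poem/></bookOfPoems>";

        // The overridden method is invoked by the parser, not by this program.
        parse(xml, new AnnouncingHandler());

        // Nothing is overridden here, so the inherited defaults run and do nothing.
        parse(xml, new DefaultHandler());
    }
}
```

All of the other event-handler methods in AnnouncingHandler are inherited defaults, so the parser can still report every event without the programmer having to write a method for each one.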
In addition to supporting SAX and DOM, the IBM parser also provides a number of other capabilities. According to IBM:
"The rich generating and validating capabilities allow the XML4J Parser to be used for:"
A FAQ is available inside the XML4J download package that should answer many of the questions that you may have about the parser.
As mentioned earlier, the IBM parser is not the only SAX and DOM compliant parser available. Sun provides a parser that you can download free of charge. You will have to become a registered member of the Java Developer Connection to download this parser, but registration is free. This is what Sun has to say about their parser:
Java(TM) Project X Technology Release 2 is a maintenance release that offers full conformance to the XML 1.0 specification and SAX 1.0 APIs and continues to lead the industry with substantial improvements in performance.
Java(TM) Project X is the code name for XML technology services written completely in the Java language. This package provides core XML capabilities including a fast XML parser with optional validation and an in-memory object model tree that supports the W3C DOM Level 1 recommendation. With Java Project X, developers can build robust, flexible XML-oriented applications and network services.
Another interesting parser is the OpenXML parser that you can download free of charge from OpenXML. The documentation indicates that this parser is also SAX and DOM compliant. You should be able to use any of these parsers to work with the application programs that I will begin developing in the next lesson.
What's Next?
My plan for the next lesson is to continue the discussion of SAX-based parsers. I will show you how to write a simple Java program that uses XML4J to parse the XML document shown in Listing 1. The program will deliver a series of events to the appropriate event handler methods as the parser traverses the XML document. The event handler methods will extract and display information about the XML document.
<?xml version="1.0"?>
<bookOfPoems>
  <poem PoemNumber="1" DummyAttribute="dummy value">
    <line>Roses are red,</line>
    <line>Violets are blue.</line>
    <line>Sugar is sweet,</line>
    <line>and so are you.</line>
  </poem>
  <poem PoemNumber="2" DummyAttribute="dummy value">
    <line>Twas the night before Christmas,</line>
    <line>And all through the house,</line>
    <line>Not a creature was stirring,</line>
    <line>Not even a mouse.</line>
  </poem>
</bookOfPoems>
Listing 1
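To give a feel for how event handlers might extract information from the document in Listing 1, here is a rough sketch. It is not the program presented in the next lesson. It assumes the JDK's JAXP parser and the SAX 2 DefaultHandler class rather than XML4J, and the names PoemEventsSketch, PoemHandler, and report are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class PoemEventsSketch {
    // The document from Listing 1, embedded as a string for convenience.
    static final String LISTING_1 =
        "<?xml version=\"1.0\"?>\n"
        + "<bookOfPoems>\n"
        + " <poem PoemNumber=\"1\" DummyAttribute=\"dummy value\">\n"
        + "  <line>Roses are red,</line>\n"
        + "  <line>Violets are blue.</line>\n"
        + "  <line>Sugar is sweet,</line>\n"
        + "  <line>and so are you.</line>\n"
        + " </poem>\n"
        + " <poem PoemNumber=\"2\" DummyAttribute=\"dummy value\">\n"
        + "  <line>Twas the night before Christmas,</line>\n"
        + "  <line>And all through the house,</line>\n"
        + "  <line>Not a creature was stirring,</line>\n"
        + "  <line>Not even a mouse.</line>\n"
        + " </poem>\n"
        + "</bookOfPoems>\n";

    // Hypothetical handler: notes each poem's number and the text of each line.
    static class PoemHandler extends DefaultHandler {
        final List<String> report = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0); // discard whitespace collected between elements
            if ("poem".equals(qName)) {
                report.add("poem " + atts.getValue("PoemNumber"));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so accumulate it.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("line".equals(qName)) {
                report.add("  " + text.toString().trim());
            }
        }
    }

    static List<String> parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        PoemHandler handler = new PoemHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.report;
    }

    public static void main(String[] args) throws Exception {
        for (String entry : parse(LISTING_1)) {
            System.out.println(entry);
        }
    }
}
```

The handler accumulates text in characters() because a parser is free to deliver the content of a single element in more than one call.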
A Tree View
Even though this series of lessons is intended to explain event-based parsing, I am going to take a side trip here and show you a tree view of the XML file in Listing 1. Those of you who have studied my previous lessons on XSLT will know that when you load the above XML file into IE 5.0 or later, it will be rendered in a tree format that looks something like Listing 2. (Note that I had to manually insert a couple of line breaks to force the text to fit in this narrow presentation format.)
<?xml version="1.0" ?>
- <bookOfPoems>
  - <poem PoemNumber="1" DummyAttribute="dummy value">
      <line>Roses are red,</line>
      <line>Violets are blue.</line>
      <line>Sugar is sweet,</line>
      <line>and so are you.</line>
    </poem>
  - <poem PoemNumber="2" DummyAttribute="dummy value">
      <line>Twas the night before Christmas,
      </line>
      <line>And all through the house,</line>
      <line>Not a creature was stirring,</line>
      <line>Not even a mouse.</line>
    </poem>
  </bookOfPoems>
Listing 2
The IE5 rendering actually looks much nicer than Listing 2, because that rendering uses color to separate the different parts of the output. (Rather than spend the time to add the color to this presentation, I will let you load the file into IE5 and see for yourself what it looks like.)
When you load the file into IE5, you will get a dynamic tree structure that can be expanded and collapsed using the minus signs on the left-hand side of the display. Each minus sign flags a node in the tree that has child nodes. When you click on one of the minus signs, the node collapses to hide all of its children, and the minus turns into a plus. If you later click on the plus, the children of that node are again displayed.
Summary
A common way to create custom XML processing tools is through the use of an event-based parser that implements the SAX interface. In the previous lesson, I provided quite a few details about the SAX interface. In this lesson, I introduced you to a specific event-based parser that implements the SAX interface. That parser is currently available for downloading at no charge from IBM, and is known as IBM's XML for Java (XML4J). In addition, I pointed you to some additional SAX-based parsers that are also currently available for downloading at no cost. Finally, I showed you both a raw view and a tree view of a simple XML file that will be used in the next lesson to illustrate the use of IBM's XML4J parser software.
Copyright 2000, Richard G. Baldwin. Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.
Richard Baldwin (email@example.com) is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.
Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.
Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.