Parsing XML Documents: Events, Part 3
Preface
This is the third in a series of six lessons designed to teach you how to write custom XML processing programs using a SAX-based parser and the Java programming language. A sample program is provided and explained in the sixth lesson in the series. The five lessons leading up to that lesson are intended to prepare you to understand the complex technical material presented in the sixth lesson.
I maintain a consolidated index of hyperlinks to all of my XML articles at my personal website. You can easily locate and access my XML articles from there.
Preview
One of the common ways to create custom XML processing tools is through the use of an event-based parser that implements the SAX interface, along with either the Java or the Python programming language. In the previous lesson, I showed you how to write general event-driven programs using the Java programming language. In this and the next few lessons, I will show you how to use the Java programming language and a SAX parser to write programs for doing custom processing of your XML documents. In subsequent lessons, I will show you how to use a DOM parser to parse and process XML documents on the basis of the Document Object Model.
One way to parse an XML document is to analyze the XML document as a stream of text, recognizing the various components that make up the document as they are encountered. With this approach, a processing algorithm is applied as the components are recognized. A very common way to implement this approach is by using a concept (often referred to as SAX) that examines the sequence of characters that comprise the XML text and raises events (such as the start and end of elements) as the components in the document are encountered. An event-based parser reports events back to the processing program using callbacks. The processing program implements and registers event handlers for the different types of events. Code written into the event handlers is designed to achieve the overall objective of the program.
So, the bottom line is, XML can be a very useful tool, but in order for it to be useful, computers must be equipped with programs that understand, speak, read, and write XML. Where do these programs come from? Obviously programmers write them. Programmers need programming tools, and SAX is a programming tool.
Computer programs are written using programming languages, such as Fortran, Python, Pascal, C, C++, and Java. There was a time when programmers like myself started every new program from scratch and reinvented the wheel on a daily basis. There were many reasons why we did this. One of the important reasons was the fact that it was often easier to reinvent a wheel than to try to use an old wheel that was integrated into a previously-written program. (Another reason, which was not a very good one, was that we felt that we were being more creative when we did it this way. Unfortunately, some programmers still feel that way.)
Fortunately, some modern programmers, including myself (yes, sometimes, you can teach an old dog new tricks), have learned that starting from scratch every time is not the best approach. Some among us have learned the value of "reusable code." There is a body of technology, often referred to as Object Oriented Programming, which has as one of its major advantages a strong emphasis on the reusability of code. In this technology area, reusable code typically comes in the form of class libraries that make it fairly easy to do difficult tasks the same way every time without the need to reinvent the code every time. (A good example of code reuse through a class library is the creation and rendering of a typical button in a graphical user interface (GUI).)
Object oriented programming (of the Java variety at least) also brings with it another higher-level and very abstract concept known as an interface. In a nutshell, an interface definition in a programming language such as Java specifies the programming interface to a module of code. And this is where SAX comes in.
What is SAX? For the most part, SAX is simply a set of interface definitions. Those definitions specify one of the ways that application programs can interact with XML documents. There are other ways for programs to interact with XML documents as well. Prominent among them is the Document Object Model, or DOM, which will be the topic for a later tutorial lesson.
Another modern programming concept is event-driven programming. In a nutshell, event-driven programming describes a programming style where the program goes into a quiescent state and waits for some interesting event to happen. When an event happens, an event handler springs into action and takes some appropriate action related to the event. An event can be anything of interest in a particular context, such as a change in the price of a stock, or a mouse click on a GUI button, for example.
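The event-driven style described above, together with the idea of an interface, can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the names PriceListener, StockTicker, and EventDemo are hypothetical and are not part of SAX or of any library discussed in this lesson.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface: the contract between an event source and its handlers.
interface PriceListener {
    // Callback invoked by the event source when a price changes.
    void priceChanged(String symbol, double newPrice);
}

// Hypothetical event source that notifies registered handlers.
class StockTicker {
    private final List<PriceListener> listeners = new ArrayList<>();

    // Register a handler that wants to hear about price changes.
    void addPriceListener(PriceListener listener) {
        listeners.add(listener);
    }

    // Simulate an event: notify every registered handler in registration order.
    void publish(String symbol, double newPrice) {
        for (PriceListener listener : listeners) {
            listener.priceChanged(symbol, newPrice);
        }
    }
}

class EventDemo {
    // Registers one handler, fires two events, and returns what the handler recorded.
    static List<String> run() {
        List<String> log = new ArrayList<>();
        StockTicker ticker = new StockTicker();
        ticker.addPriceListener((symbol, price) -> log.add(symbol + "=" + price));
        ticker.publish("ABC", 10.5);
        ticker.publish("XYZ", 42.0);
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The handler code never asks the ticker for prices. It simply waits to be called, which is the same relationship a SAX application has with its parser.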
SAX provides the following benefits to the programmer who is interested in writing programs for processing XML documents:
- A standard programming interface
- An event-driven programming model
SAX is not a commercial product that is intended for sale. Rather, SAX is a technical specification that explains how those who develop software products for XML document processing should go about it.
SAX is a technical specification provided by Megginson Technologies Ltd.
As of 1/20/01,
- SAX 2.0 is a free API for event-based XML parsing
- SAX has become a de facto standard for event-based XML parsers
- SAX can be downloaded from http://www.megginson.com/SAX/Java/index.html.
According to Megginson Technologies,
"SAX is a common interface implemented for many different XML parsers (and things that pose as XML parsers), just as the JDBC is a common interface implemented for many different relational databases (and things that pose as relational databases)."
You may not need to download SAX in order to use it. If you are using a SAX-based parser (such as IBM's XML for Java, or Sun's Java API for XML) for the development of XML processing programs, the SAX libraries and documentation may already be contained in the libraries and documentation for the parser. A separate download of SAX may not be necessary. I will have more to say about IBM's XML for Java in the next lesson in this series.
So what is a parser? A parser, in this context, is a software tool that preprocesses an XML document in some fashion, handing the results over to an application program. The primary purpose of the parser is to do much of the hard work up front and to provide the application program with the XML information in a form that is easier to work with than would otherwise be the case.
Once again, quoting the folks at Megginson Technologies,
"SAX, the Simple API for XML, is a standard interface for event-based XML parsing, developed collaboratively by the members of the XML-DEV mailing list, currently hosted by OASIS. SAX 2.0 was released on Friday 5 May 2000, and is free for both commercial and non-commercial use."
People who write parser programs have the option of basing those programs on SAX.
So what is an API? The world of programming is a world of jargon. API is the common jargon for Application Programming Interface. An API usually contains a variety of features that make it easier for the application programmer to write what might otherwise be difficult programs (such as GUI programs). There are at least two types of XML parser APIs commonly used for the development of programs to process XML documents:
- Tree-based APIs
- Event-based APIs.
As I demonstrated in an earlier lesson entitled "Transformations (XSLT) using IE5, Trees, Nodes, and Templates, Part I," an XML document can be viewed as a tree. A tree-based API can be used to convert an XML document into an internal tree structure. This makes it possible for an application program to navigate and manipulate the tree to achieve its processing objectives. Typically, once the processing of the tree has been completed, the tree is converted back into a modified XML document. The Document Object Model (DOM) working group at the W3C is responsible for developing a standard tree-based API for XML.
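The parse, modify, and write-back cycle of a tree-based API might look something like the following sketch. It assumes the DOM and transformation classes that ship with the standard Java runtime (javax.xml.parsers, javax.xml.transform, org.w3c.dom). The class name TreeDemo, the method addPrice, and the sample document are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class TreeDemo {
    // Parses XML text into a tree, edits the tree, and writes it back out as text.
    static String addPrice(String xml, String price) throws Exception {
        // Step 1: convert the XML document into an in-memory tree.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // Step 2: navigate to the root and manipulate the tree.
        Element root = doc.getDocumentElement();
        Element priceElement = doc.createElement("price");
        priceElement.setTextContent(price);
        root.appendChild(priceElement);

        // Step 3: convert the modified tree back into XML text.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(addPrice("<book><title>Java</title></book>", "29.95"));
    }
}
```

Note that the whole document is held in memory between step 1 and step 3, which is the cost that an event-based approach avoids.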
An event-based API
However, this tutorial lesson isn't about DOM and tree-based APIs. It is about SAX, which provides an event-based API. As mentioned earlier, an event-based API reports parsing events (such as the start and end of elements) back to the application using callbacks. The application implements and registers event handlers for each of the different parsing events that are of interest in the context of the application. Code in the event handlers is designed to achieve the objective of the application.
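To make the callback idea concrete, here is a minimal sketch that implements a handler and registers it with a SAX2 parser. It assumes the parser bundled with the Java runtime, obtained through javax.xml.parsers.SAXParserFactory; any other SAX2 parser could be substituted. The class names EventLogger and SaxDemo and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: overrides only the callbacks of interest to this application.
class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser is allowed to deliver one run of text in several chunks;
        // this sketch simply logs each non-blank chunk as it arrives.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);
        }
    }
}

class SaxDemo {
    static List<String> parse(String xml) throws Exception {
        // Obtain whatever SAX2 parser the Java runtime supplies.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EventLogger handler = new EventLogger();
        reader.setContentHandler(handler); // register for document events
        reader.setErrorHandler(handler);   // register for error events
        // The parser now drives the program, calling back into the handler.
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : parse("<note><to>Joe</to></note>")) {
            System.out.println(event);
        }
    }
}
```

Extending DefaultHandler is a convenience: it supplies empty implementations of every callback, so the application overrides only the events it cares about.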
The process is similar (but not identical) to creating and registering event listeners in the Java Delegation Event Model discussed in the previous lesson entitled Parsing XML Documents: Events, Part 2. (If you would like to learn more about event driven programming in Java, I have published a large number of tutorial lessons on this topic on my web site.)
In some cases, using an event-based API can be more efficient than using a tree-based API. This is particularly true when the objectives of the processing program can be achieved during a single on-the-fly pass through the XML document. In such cases, creating a tree, modifying the tree, and converting the tree back to an XML document may require more computing effort than an event-based approach. In addition, Java programmers, who are often familiar with event-driven programming, may find the event-based API to be more familiar ground. Generally, an event-based API provides simpler, lower-level access to an XML document.
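As an illustration of single-pass processing, the following sketch totals the contents of every price element while the document streams past, without ever building a tree. It assumes the parser supplied by the Java runtime through SAXParserFactory. The element name price and the class names PriceTotalHandler and SinglePassDemo are hypothetical choices made for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps only a running total, never the whole document.
class PriceTotalHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inPrice = false;
    double total = 0.0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("price".equals(qName)) {
            inPrice = true;
            text.setLength(0); // start collecting text for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several chunks, so accumulate until the end tag.
        if (inPrice) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("price".equals(qName)) {
            total += Double.parseDouble(text.toString().trim());
            inPrice = false;
        }
    }
}

class SinglePassDemo {
    static double totalPrice(String xml) throws Exception {
        PriceTotalHandler handler = new PriceTotalHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.total;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<order><item><price>2.50</price></item>"
                   + "<item><price>4.25</price></item></order>";
        System.out.println(totalPrice(xml));
    }
}
```

The memory used by the handler stays the same regardless of how many items the document contains, which is where the efficiency advantage comes from.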
SAX is not the only approach to event-based parsing of XML documents. If you decide to use an event-based parser (instead of a tree-based parser), why should you care whether or not the parser is based on SAX? There are several advantages to using a parser based on SAX. Foremost among them is standardization. If you learn how to use one SAX-based parser, then you will know how to use most, if not all, SAX-based parsers. Another advantage is code portability among parsers. As long as you maintain version compatibility, code written for one SAX-based parser should be compatible with another SAX-based parser with few, if any, modifications required.
Summary
SAX, the Simple API for XML, is a standard interface for event-based XML parsing. In this lesson, I have provided information about SAX that is intended to prepare you for the use of SAX in subsequent lessons.
What's Next?
In the next lesson, I will continue the discussion of SAX-based parsers, and in particular will introduce you to a software product from IBM named XML for Java or XML4J for short. This product supports both SAX and DOM, and lets the application programmer combine the two approaches in a single application.
As of 1/20/01, you can download this parser from IBM free of charge. You might want to go ahead and download and install it and be prepared for the next tutorial lesson in this series.
Copyright 2000, Richard G. Baldwin. Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.
Richard Baldwin (firstname.lastname@example.org) is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.
Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.
Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.