Parsing XML Documents: Events, Part 5
Preface
This is the fifth in a series of six lessons designed to teach you how to write custom XML processing programs using a SAX-based parser and the Java programming language. A sample program is provided and explained in the sixth lesson in the series. The five lessons leading up to that lesson are intended to prepare you to understand the complex technical material presented in the sixth lesson.
I maintain a consolidated index of hyperlinks to all of my XML articles at my personal website. You can easily locate and access my XML articles from there.
In order for XML to be useful, you must be able to process your XML documents so as to produce a useful output. That is what this series of six lessons is all about -- processing XML documents.
Preview
One of the common ways to create custom XML processing tools is through the use of an event-based parser that implements the SAX interface, along with either the Java or Python programming language. In the previous lesson, I introduced you to a parser program from IBM known as XML4J that is currently available for free downloading.
In this lesson, I will show and discuss the output produced by using the Java programming language and the XML4J parser to process an XML file containing an XML syntax error. The syntax error was purposely introduced into the file to illustrate the error handling capability of a SAX-based XML parser.
In the next lesson, I will show and discuss the details of the actual Java program that was used to produce the results shown in this lesson.
An event-based parser reports events to the processing program using callbacks. The program implements and registers event handlers for the different types of events. Code written into the event handlers is designed to achieve the overall objective of the program.
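The callback arrangement described above can be illustrated with a minimal sketch. This is not the sample program from this series. It assumes the JAXP parser bundled with the JDK rather than XML4J, and it uses the SAX2 DefaultHandler class rather than the SAX Level 1 interfaces. The class name CallbackSketch and the method name recordEvents are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; assumes the JDK's built-in JAXP parser, not XML4J.
class CallbackSketch {

    // Parses the XML text and returns the names of the reported events, in order.
    static List<String> recordEvents(String xml) {
        final List<String> events = new ArrayList<>();

        // The handler overrides only the callbacks it is interested in.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("startElement:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement:" + qName);
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }
        };

        try {
            // Passing the handler to parse() registers it for this run.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(recordEvents("<poem><line>Roses are red,</line></poem>"));
    }
}
```

The program never asks the parser for the next element. The parser drives the process and calls back into the handler each time it recognizes something in the document.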
IBM's XML for Java is a validating XML parser written in 100% pure Java. This is the parser that I use in the sample program discussed in this and the next lesson in this series. According to IBM, XML4J Version 3.1.1 contains public and stable support of the SAX Level 1 specifications.
A Sample Program
In the previous lessons of this series of articles on SAX, I have promised to show you how to write a Java program that uses XML4J to parse a simple XML document.
I promised that the program will deliver a series of events to the appropriate event handler methods as the parser traverses the XML document, and that the event handler methods will extract and display information about the XML document. In this lesson, I will:
- Discuss the general aspects of the program
- Show the output
- Discuss the output
<?xml version="1.0"?>
<bookOfPoems>
  <poem PoemNumber="1" DummyAttribute="dummy value">
    <line>Roses are red,</line>
    <line>Violets are blue.</line>
    <line>Sugar is sweet,</line>
    <line>and so are you.</line>
  </poem>
  <poem PoemNumber="2" DummyAttribute="dummy value">
    <line>Twas the night before Christmas,</line>
    <line>And all through the house,
    <line>Not a creature was stirring,</line>
    <line>Not even a mouse.</line>
  </poem>
</bookOfPoems>
Listing 1
As you can see from Listing 1, the XML file used with this sample program represents the rudimentary aspects of a book of poems. It contains one verse each from two well-known poems.
(Do you see anything incorrect in the XML file of Listing 1?)
Sometimes I find it easier to visualize the overall element structure of an XML document by removing everything but the tags. Listing 2 is a representation of the element structure with the attributes and the content of each element removed.
<?xml version="1.0"?>
<bookOfPoems>
  <poem>
    <line></line>
    <line></line>
    <line></line>
    <line></line>
  </poem>
  <poem>
    <line></line>
    <line>
    <line></line>
    <line></line>
  </poem>
</bookOfPoems>
Listing 2
This book of poems contains two poems, one about roses, and the other about a mouse. The XML markup for the first poem is correct from a syntax viewpoint. However, a syntax error was purposely introduced into the second poem to illustrate the error-handling capability of SAX and the IBM parser. The error is in the second line element of the second poem in Listing 2 above: that element is missing its end tag (</line>).
This program uses the IBM XML Parser for Java (XML4J) along with the XML file shown earlier to illustrate the trapping and handling of parser events along with customized error handling. The purpose of the program is to
- Traverse the XML file
- Display the elements
- Display the attributes
- Display the text of the poems.
The first part of the output from the program is shown in Listing 3. This part deals only with the beginning of the Document element, the beginning of the bookOfPoems element, and the first poem element. A section of output shown later deals with the remainder of the XML file.
I manually inserted some line breaks to force the output material shown in Listing 3 to fit in this narrow presentation format. I also deleted some blank lines to reduce the overall size of the output listing. If you compare this output with the raw XML document shown in Listing 1, you will see that the first poem was parsed and displayed successfully. The output produced by the program included:
- The beginning and ending of each element
- The element names
- The attribute values for the elements
- The contents of each element (the text of the poem)
Start Document
Start element: bookOfPoems
Start element: poem
Attribute: PoemNumber, Value = 1, Type = CDATA
Attribute: DummyAttribute, Value = dummy value, Type = CDATA
Start element: line
Roses are red,
End element: line
Start element: line
Violets are blue.
End element: line
Start element: line
Sugar is sweet,
End element: line
Start element: line
and so are you.
End element: line
End element: poem
Listing 3
Each portion of the output was produced by an event handler that the parser invoked. Each handler extracted and displayed information about the part of the XML document it was invoked for. For example, the first line of Listing 3, which reads Start Document, resulted from the parser detecting the beginning of the document and invoking the corresponding event handler. Listing 3 also shows the result of detecting the beginning and the end of each element, with two exceptions: the endings of the Document and bookOfPoems elements do not appear because, as mentioned earlier, this output covers only the beginning of the Document, the beginning of the bookOfPoems element, and the first poem element. Additional output is shown later in Listing 4.
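To make the link between handlers and output lines concrete, here is a hedged sketch of handlers that produce output in the style of Listing 3. It is not the program discussed in the next lesson. It assumes the JDK's built-in JAXP parser and the SAX2 DefaultHandler class, and the names DisplaySketch and describe are hypothetical. The "End Document" line is my own choice of wording, since Listing 3 does not show one.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; assumes the JDK's built-in JAXP parser, not XML4J.
class DisplaySketch {

    // Returns one line of text per event, in the style of Listing 3.
    static String describe(String xml) {
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            // Character data may arrive in several chunks, so buffer it.
            private final StringBuilder text = new StringBuilder();

            // Emits buffered text, skipping whitespace-only content.
            private void flushText() {
                String s = text.toString().trim();
                if (!s.isEmpty()) {
                    out.append(s).append('\n');
                }
                text.setLength(0);
            }

            @Override
            public void startDocument() {
                out.append("Start Document\n");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                flushText();
                out.append("Start element: ").append(qName).append('\n');
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append("Attribute: ").append(atts.getQName(i))
                       .append(", Value = ").append(atts.getValue(i))
                       .append(", Type = ").append(atts.getType(i))
                       .append('\n');
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flushText();
                out.append("End element: ").append(qName).append('\n');
            }

            @Override
            public void endDocument() {
                out.append("End Document\n");
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe(
                "<poem PoemNumber=\"1\"><line>Roses are red,</line></poem>"));
    }
}
```

Buffering the character data is a design choice in this sketch: SAX allows a parser to deliver the text of one element in more than one characters() call, so the text is collected and written out when the next start or end tag arrives.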
As mentioned earlier, a syntax error was purposely introduced into the second poem in the XML file. The second poem was displayed as shown in Listing 4 below. This output is a continuation of the output shown in Listing 3, and as before, I manually inserted some line breaks to force the text to fit in this narrow display format. The line with the missing end tag is the one that reads "And all through the house," in Listing 4, which is followed by a Start element line instead of the expected End element line.
Start element: poem
Attribute: PoemNumber, Value = 2, Type = CDATA
Attribute: DummyAttribute, Value = dummy value, Type = CDATA
Start element: line
Twas the night before Christmas,
End element: line
Start element: line
And all through the house,
Start element: line
Not a creature was stirring,
End element: line
Start element: line
Not even a mouse.
End element: line
systemID: file:///D:/Baldwin
  /AA-School/JavaProg/Combined/Java
  /Sax01.xml
[Fatal Error] Sax01.xml:17:9: The element type "line" must be terminated by the matching end-tag "</line>".
Terminating
Listing 4
Note that a fatal error occurred at the point where the parser was able to determine that the end tag was missing from one of the earlier lines in the poem. The error was detected, and error processing began, following the last line in the second poem. The output from error processing begins with the line that reads "systemID:".
As you can see by comparing the position of the missing end tag with the position of the error output in Listing 4, the error determination was not made until several lines beyond the actual missing tag. A customized error message was produced showing the line number and character number where the error was detected, along with the nature of the error. This delay in detecting the problem resulted from the fact that no DTD was provided and the XML4J parser was used in its non-validating mode. Therefore, the parser initially treated the appearance of a start tag ahead of an expected end tag as a nesting condition. It wasn't until the parser reached the </poem> end tag while a line element was still open that it could determine that this was not an allowable nesting condition and that an end tag was missing.
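The kind of customized error handling described above can be sketched as follows. This is an illustration under stated assumptions, not the error handler from the sample program: it uses the JDK's built-in JAXP parser rather than XML4J, the names ErrorSketch and parseAndReport are hypothetical, and the system ID passed in is a made-up value. The calls getSystemId, getLineNumber, getColumnNumber, and getMessage are standard methods of SAXParseException.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class; assumes the JDK's built-in JAXP parser, not XML4J.
class ErrorSketch {

    // Parses the XML text and returns the event trace plus any error report.
    static String parseAndReport(String xml, String systemId) {
        final StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                out.append("Start element: ").append(qName).append('\n');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("End element: ").append(qName).append('\n');
            }

            // Invoked by the parser when the document is not well formed.
            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                out.append("systemID: ").append(e.getSystemId()).append('\n');
                out.append("Fatal error at line ").append(e.getLineNumber())
                   .append(", column ").append(e.getColumnNumber()).append('\n');
                out.append("Message: ").append(e.getMessage()).append('\n');
                out.append("Terminating\n");
                throw e; // stop parsing
            }
        };

        try {
            InputSource source = new InputSource(new StringReader(xml));
            source.setSystemId(systemId); // identifies the document in reports
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (SAXParseException e) {
            // Already reported by fatalError(); nothing more to do.
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first line element is missing its end tag, as in Listing 1.
        String xml = "<poem>\n<line>one\n<line>two</line>\n</poem>";
        System.out.print(parseAndReport(xml, "file:///example/poems.xml"));
    }
}
```

In this sketch the start and end of the nested line element are reported normally. The fatalError callback runs only when the parser reaches the </poem> end tag, which matches the delay discussed above.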
Presumably, if there had been a DTD specifying that <line> tags may not be nested inside of <line> tags, a validating parser would have recognized the error as soon as it occurred.
Summary
One of the common ways to create custom XML processing tools is through the use of an event-based parser that implements the SAX interface, along with either the Java or Python programming language. In this lesson, I have shown and discussed the output produced by using the Java programming language and the XML4J parser to process an XML file containing an XML syntax error. The syntax error was purposely introduced into the file to illustrate the error handling capability of a SAX-based XML parser.
Copyright 2000, Richard G. Baldwin. Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.
Richard Baldwin (firstname.lastname@example.org) is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.
Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.
Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.