
Well-Formed XML

This article is brought to you by Hungry Minds, Inc., publisher of Elliotte Rusty Harold’s XML Bible, 2nd Edition.

HTML 4.0 has nearly 100 different elements. Most of these elements have a dozen or more possible attributes for several thousand different possible variations. Since XML is more powerful than HTML, you might think that you need to learn even more elements, but you don’t. XML gets its power through simplicity and extensibility, not through a plethora of elements.

In fact, XML predefines no elements at all. Instead XML allows you to define your own elements as needed. However, these elements and the documents built from them are not completely arbitrary. Instead, they have to follow a specific set of rules elaborated in this chapter. A document that follows these rules is said to be well-formed. Well-formedness is the minimum criterion necessary for XML processors and browsers to read files. This article examines the rules for well-formed documents. It explores the different constructs that make up an XML document (tags, text, attributes, elements, and so on) and discusses the primary rules each of these must follow. Particular attention is paid to how XML differs from HTML. Along the way I introduce several new XML constructs, including comments, processing instructions, entity references, and CDATA sections. This article isn’t an exhaustive discussion of well-formedness rules. Some of the rules I present here must be adjusted slightly for documents that have a document type definition (DTD), and there are additional rules for well-formedness that define the relationship between the document and its DTD.

Well-Formedness Rules

Although XML allows you to invent as many different elements and attributes as you need, these elements and attributes, as well as their contents and the documents that contain them, must all follow certain rules in order to be well-formed. If a document is not well-formed, any attempts to read it or render it will fail.

The XML specification strictly prohibits XML parsers from trying to fix and understand malformed documents. All a conforming parser is allowed to do is report the error. It may not fix the error. It may not make a best-faith effort to render what the author intended. It may not ignore the offending malformed markup. All it can do is report the error and exit.

Note: The objective here is to avoid the bug-for-bug compatibility wars that have hindered HTML, and that have made writing HTML parsers and renderers so difficult. Because Web browsers allow malformed HTML, Web-page designers don’t make the extra effort to ensure that their HTML is correct. In fact, they even rely on bugs in individual browsers to achieve special effects. In order to properly display the huge installed base of HTML pages, every new Web browser must support every quirk of all the Web browsers that have come before. The marketplace would ignore any browser that strictly adhered to the HTML standard. It is to avoid this sorry state that XML processors are explicitly required to only accept well-formed XML.

To be well-formed, an XML document must follow more than 100 different rules. However, most of these rules simply forbid things that you’re not very likely to do anyway if you follow the examples given in this book. For instance, one rule is that the name of the element must immediately follow the < of the element’s start tag. For example, <triangle> is a legal start tag but < triangle> isn’t. On the other hand, the same rule says that it is OK to have extra space before the tag’s closing angle bracket. That is, both <triangle> and <triangle > are well-formed start tags. Another rule says that element names must have at least one character; that is, <> is not a legal start tag, and </> is not a legal end tag. Chances are it never would have occurred to you to create an element with a zero-length name, but computers are dumber than human beings, and need to have constraints like this spelled out for them very formally. XML’s well-formedness rules are designed to be understood by software rather than human beings, so quite a few of them are a little technical and won’t present much of a problem in practice. The only source for the complete list of rules is the XML specification itself. However, if you follow the rules given here, and check your work with an XML parser such as Xerces before distributing your documents, they should be fine.
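The start-tag rules just described can be tried out with any conforming parser. The sketch below is only an illustration: it assumes Python's standard xml.etree.ElementTree module rather than Xerces, and the helper name is_well_formed is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False if it reports an error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# The element name must immediately follow the opening <.
print(is_well_formed("<triangle></triangle>"))    # True
print(is_well_formed("<triangle ></triangle >"))  # True: space before > is allowed
print(is_well_formed("< triangle></triangle>"))   # False: space after < is an error
print(is_well_formed("<></>"))                    # False: zero-length element name
```

Note that the parser only says yes or no; it never tries to repair the rejected documents.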

Cross Reference: The XML specification itself is found in Appendix C. The formal syntax the XML specification uses is called the Backus-Naur Form, or BNF for short. BNF grammars are an outgrowth of compiler theory that very formally define what is and is not a syntactically correct program or, in the case of XML, a syntactically correct document. A parser can compare any document to the XML BNF grammar character by character and determine definitively whether or not it satisfies the rules of XML. There are no borderline cases. BNF grammars, properly written, leave no room for interpretation. The advantage of this should be obvious to anyone who’s had to struggle with HTML documents that display in one browser but not in another.

As well as matching the BNF grammar, a well-formed XML document must also meet various well-formedness constraints that specify conditions that can’t be easily described in the BNF syntax. Well-formedness is the minimum level that a document must achieve to be parsed. Appendix B provides an annotated description of the complete XML 1.0 BNF grammar as well as all of the well-formedness constraints.

XML Documents

An XML document is made up of text that’s divided between markup and character data. It is a sequence of characters with a fixed length that adheres to certain constraints. It may or may not be a file. For instance, an XML document may be:

  • A CLOB field in an Oracle database
  • The result of a query against a database that combines several records from different tables
  • A data structure created in memory by a Java program
  • A data stream created on the fly by a CGI program written in Perl
  • Some combination of several different files, each of which is embedded in another
  • One part of a larger file containing several XML documents

However, nothing essential is lost if you think of an XML document as a file, as long as you keep in the back of your mind that it might not really be a file on a hard drive.

XML documents are made up of storage units called entities. Each entity contains either text or binary data, never both. Text data consists of characters. Binary data is used for images, applets, and the like.

Note: To use a concrete example, a raw HTML file that includes <IMG> tags is an entity but not a document. An HTML file plus all the pictures embedded in it with <IMG> tags is a complete document.

The XML declaration

In this and the next several chapters, I treat only simple XML documents that are made up of a single entity, the document itself. Furthermore, these documents only contain text data, not binary data such as images or applets. Such documents can be understood completely on their own without reading any other files. In other words, they stand alone. Such a document normally contains a standalone pseudo-attribute in its XML declaration with the value yes, similar to this one.

<?xml version="1.0" standalone="yes"?>

Note: I call this a pseudo-attribute because technically only elements can have attributes. The XML declaration is not an element. Therefore standalone is not an attribute even if it looks like one.

External entities and entity references can be used to combine multiple files and other data sources to create a single XML document. These documents cannot be parsed without reference to other files. Therefore, they normally have a standalone pseudo-attribute with the value no.

<?xml version="1.0" standalone="no"?>

If a document does not have an XML declaration, or if a document has an XML declaration but that XML declaration does not have a standalone pseudo-attribute, then the value no is assumed. That is, the document is assumed to be incapable of standing on its own, and the parser will prepare itself to read external pieces as necessary. If the document can, in fact, stand on its own, nothing is lost by the parser being ready to read an extra piece.

XML documents do not have to include XML declarations, although they should unless you’ve got a specific reason not to include them. If an XML document does include an XML declaration, then this declaration must be the first thing in the file (except possibly for an invisible Unicode byte order mark). XML processors determine which character set is being used (UTF-8, big-endian Unicode, or little-endian Unicode) by reading the first several bytes of a file and comparing those bytes against various encodings of the string <?xml . Nothing should come before this, including white space. For instance, this line is not an acceptable way to start an XML file because of the extra spaces at the front of the line.

              <?xml version="1.0" standalone="yes"?>
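To see the rule in action, the following sketch feeds the same document to a parser twice, once as is and once with leading spaces. It assumes Python's standard xml.etree.ElementTree module; other conforming parsers should also report an error, though the wording of the message will differ.

```python
import xml.etree.ElementTree as ET

good = '<?xml version="1.0" standalone="yes"?><GREETING>Hello XML!</GREETING>'
bad = "   " + good  # same document, but with spaces before the declaration

root = ET.fromstring(good)
print(root.tag)  # GREETING

try:
    ET.fromstring(bad)
    declaration_error = None
except ET.ParseError as error:
    # The parser reports the misplaced declaration instead of skipping the spaces.
    declaration_error = error
    print("rejected:", error)
```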

A document must have exactly one root element that completely contains all other elements.

An XML document has a root element that completely contains all other elements of the document. This is also sometimes called the document element, although this element does not have to have the name document or root. Root elements are delimited by a start tag and an end tag, just like any other element. For instance, consider Listing 1.

Listing 1: greeting.xml

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>

In this document, the root element is GREETING. The XML declaration is not an element. Therefore, it does not have to be included inside the root element. Similarly, other nonelement data in an XML document, such as an xml-stylesheet processing instruction, a DOCTYPE declaration, or comments, do not have to be inside the root element. But all other elements (other than the root itself) and all raw character data must be contained in the root element.

Text in XML

An XML document is made up of text. Text is made up of characters. A character is a letter, a digit, a punctuation mark, a space or tab, or some similar thing. XML uses the Unicode character set, which includes not only the usual letters and symbols from English and other Western European alphabets, but also the Cyrillic, Greek, Hebrew, Arabic, and Devanagari alphabets, as well as the most common Han ideographs for Chinese and Japanese, and Korean Hangul syllables. For now, I’ll stick to the English language, the Roman script, and the ASCII character set; but I’ll introduce many alternatives in the next chapter.

A document’s text is divided into character data and markup. To a first approximation, markup describes a document’s logical structure, while character data provides the basic information of the document. For example, in Listing 1, <?xml version="1.0" standalone="yes"?>, <GREETING>, and </GREETING> are markup. Hello XML!, along with its surrounding white space, is the character data. A big advantage of XML over other formats is that it clearly separates the actual data of a document from its markup.

To be more precise, markup includes all tags, processing instructions, DTDs, entity references, character references, comments, CDATA section delimiters, and the XML declaration. Everything else is character data. However, this is tricky because when a document is processed some of the markup turns into character data. For example, the markup &gt; is turned into the greater than sign character (>). The character data that’s left after the document is processed, and after all markup that refers to character data has been replaced by the actual character data, is called parsed character data, or PCDATA for short.
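A short sketch shows markup turning into character data during processing. It assumes Python's standard xml.etree.ElementTree module, and the COMPARISON element name is invented for the example.

```python
import xml.etree.ElementTree as ET

# &gt; and &lt; are markup in the source text...
element = ET.fromstring("<COMPARISON>3 &gt; 2 and 2 &lt; 3</COMPARISON>")

# ...but the application sees ordinary characters after parsing.
print(element.text)  # 3 > 2 and 2 < 3
```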

Elements and Tags

An XML document is a singly rooted hierarchical structure of elements. Each element is delimited by a start tag (also known as an opening tag) and an end tag (also known as a closing tag) or is represented by a single, empty element tag. An XML tag has the same form as an HTML tag. That is, start tags begin with a < followed by the name of the element the tags start, and they end with the first > after the opening < (for example, <GREETING>). End tags begin with a </ followed by the name of the element the tag finishes and are terminated by a > (for example, </GREETING>). Empty element tags begin with a < followed by the name of the element and are terminated with a /> (for example, <GREETING/>).

Element names

Every element has a name made up of one or more characters. This is the name included in the element’s start and end tags. Element names begin with a letter such as y or A or an underscore _. Subsequent characters in the name may include letters, digits, underscores, hyphens, and periods. They cannot include other punctuation marks such as %, ^, or &. They cannot include white space. (The underscore often substitutes for white space.) Both lower- and uppercase letters may be used in XML names. In this book, I mostly follow the convention of making my names uppercase, mainly because this makes them stand out better in the text. However, when I’m using a tag set that was developed by other people it is necessary to adopt their case convention. For example, the following are legal XML start tags with legal XML names:

<HELP>
<Book>
<volume>
<heading1>
<section.paragraph>
<Mary_Smith>
<_8ball>

Note: Colons are also technically legal in tag names. However, these are reserved for use with namespaces. Namespaces allow you to mix and match XML applications that may use the same tag names. Chapter 13 introduces namespaces. Until then, you should not use colons in your tag names.

The following are not legal start tags because they don’t contain legal XML names:

<Book%7>
<volume control>
<3heading>
<Mary Smith>
<.employee.salary>

Note: The rules for element names actually apply to names of many other things as well. The same rules are used for attribute names, ID attribute values, entity names, and a number of other constructs you’ll encounter over the next several chapters.
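One rough way to test the legal and illegal names listed above is to wrap each in an empty-element tag and hand it to a parser. This sketch assumes Python's standard xml.etree.ElementTree module, and parses_as_element_name is a made-up helper. The check is crude: it tests whether the whole tag parses, so a string that happens to form a name plus a valid attribute would slip through.

```python
import xml.etree.ElementTree as ET

legal = ["HELP", "Book", "volume", "heading1", "section.paragraph", "Mary_Smith", "_8ball"]
illegal = ["Book%7", "volume control", "3heading", "Mary Smith", ".employee.salary"]


def parses_as_element_name(name: str) -> bool:
    """Wrap the candidate in an empty-element tag and see whether the parser accepts it."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


for name in legal + illegal:
    print(f"{name!r}: {parses_as_element_name(name)}")
```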

Every start tag must have a corresponding end tag

Web browsers are relatively forgiving if you forget to close an HTML tag. For instance, if you include a <B> tag in your document but no corresponding </B> tag, the entire document after the <B> tag will be made bold. However, the document will still be displayed.

XML is not so forgiving. Every nonempty tag (that is, every tag that does not end with />) must be closed with the corresponding end tag. If a document fails to close an element with the right end tag, the browser or renderer reports an error message and does not display any of the document’s content in any form.

End tags have the same name as the corresponding start tag, but are prefixed with a / after the initial angle bracket. For example, if the start tag is <FOO> the end tag is </FOO>. These are the end tags for the previous set of legal start tags.

</HELP>
</Book>
</volume>
</heading1>
</section.paragraph>
</Mary_Smith>
</_8ball>

XML names are case sensitive. This is different from HTML in which <P> and <p> are the same tag, and a </p> can close a <P> tag. The following are not end tags for the set of legal start tags we’ve been discussing:

</help>
</book>
</Volume>
</HEADING1>
</Section.Paragraph>
</MARY_SMITH>
</_8BALL>
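Case sensitivity is easy to confirm with a parser. The sketch below assumes Python's standard xml.etree.ElementTree module and simply records whether each document is accepted.

```python
import xml.etree.ElementTree as ET

results = {}
for document in ("<Book>XML Bible</Book>", "<Book>XML Bible</book>"):
    try:
        ET.fromstring(document)
        results[document] = True
    except ET.ParseError:
        # </book> does not close <Book>, so the parser reports a mismatched tag.
        results[document] = False

print(results)
```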

Empty element tags

Many HTML elements do not need closing tags. For example, there are no </IMG>, </HR>, or </BR> tags in HTML, and the </LI> tag is optional. Some page authors do include </LI> tags after their list items, and some HTML tools also use </LI>. However, the HTML 4.0 standard specifically says that this is not required, and the presence of an optional </LI> has no effect on the rendered output.

This is not the case in XML. The whole point of XML is to allow new elements and their corresponding tags to be discovered as a document is parsed. Thus, unrecognized tags may not be ignored. Furthermore, an XML processor must be capable of determining on the fly whether a tag it has never seen before does or does not have an end tag. It does this by looking for special empty-element tags that end in />.

Elements that are represented by a single tag without a closing tag are called empty elements because they have no content. Tags that represent empty elements are called empty-element tags. These empty element tags are closed with a slash and a closing angle bracket (/>); for example, <BR/> or <HR/>. From the perspective of XML, these are the same as the equivalent syntax using both start and end tags with nothing in between them; for example, <BR></BR> and <HR></HR>.
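The equivalence of the two spellings can be checked by parsing both and comparing what the application sees. This is a sketch assuming Python's standard xml.etree.ElementTree module, in which an element with no content has a text value of None and no children.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<HR/>")
long_form = ET.fromstring("<HR></HR>")

for element in (short_form, long_form):
    # No text, no children: the element is empty in both spellings.
    print(element.tag, element.text, len(element))  # HR None 0
```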

However, empty element tags can only be used when the element is truly empty, not when the end tag is simply omitted. For example, in HTML you might write an unordered list like this:

<UL>
<LI>I've a Feeling We're Not in Kansas Anymore
<LI>Buddies
<LI>Everybody Loves You
</UL>

In XML, you cannot simply replace the <LI> tags with <LI/> because the elements are not truly empty. Instead they contain text. In normal HTML the closing </LI> tag is omitted by the editor and filled in by the parser. This is not the same thing as the element itself being empty. The first LI element above contains the content I've a Feeling We're Not in Kansas Anymore. In XML, you must close these tags like this:

<UL>
<LI>I've a Feeling We're Not in Kansas Anymore</LI>
<LI>Buddies</LI>
<LI>Everybody Loves You</LI>
</UL>

On the other hand, a BR or HR or IMG element really is empty. It doesn’t contain any text or child elements. Thus, in XML, you have two choices for these elements. You can either write them with a start and an end tag in which the end tag immediately follows the start tag (for example, <HR></HR>), or you can write them with an empty element tag as in <HR/>.

Note: Current Web browsers deal inconsistently with empty element tags. For instance, some browsers will insert a line break when they see a <HR/> tag and some won’t. Furthermore, the problem may arise even without empty element tags. Some browsers insert two horizontal lines when they see <HR></HR> and some insert one horizontal line. The most generally compatible scheme is to use an extra attribute before the closing />. The class attribute is often a good choice; for example, <HR CLASS="empty"/>. XSLT offers a few more ways to maintain compatibility with legacy browsers. Chapter 17 discusses these methods.

Elements may nest but may not overlap

Elements may contain (and indeed often do contain) other elements. However, elements may not overlap. Practically, this means that if an element contains a start tag for an element, it must also contain the corresponding end tag. Conversely, an element may not contain an end tag without its matching start tag. For example, this is legal XML.

<H1><CITE>What the Butler Saw</CITE></H1>

However, the following is not legal XML because the closing </CITE> tag comes before the closing </H1> tag:

<H1><CITE>What the Butler Saw</H1></CITE>

Most HTML browsers can handle this case with ease. However, XML browsers are required to report an error for this construct.
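The difference between nesting and overlapping shows up as soon as the two fragments are parsed. The sketch below assumes Python's standard xml.etree.ElementTree module; the exact error text depends on the parser.

```python
import xml.etree.ElementTree as ET

nested = "<H1><CITE>What the Butler Saw</CITE></H1>"
overlapping = "<H1><CITE>What the Butler Saw</H1></CITE>"

# The properly nested version parses, and CITE is reachable as a child of H1.
title = ET.fromstring(nested).find("CITE").text
print(title)  # What the Butler Saw

try:
    ET.fromstring(overlapping)
    overlap_error = None
except ET.ParseError as error:
    # </H1> arrives while CITE is still open, so the parser gives up.
    overlap_error = error
    print("rejected:", error)
```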

Empty element tags may appear anywhere, of course. For example,

<PLAYWRIGHTS>Oscar Wilde<HR/>Joe Orton</PLAYWRIGHTS>

This implies that for all nonroot elements, there is exactly one other element that contains the element, but which does not contain any other element containing the element. This immediate container is called the parent of the element. The contained element is called the child of the parent element. Thus each nonroot element always has exactly one parent, but a single element may have an indefinite number of children or no children at all.

Consider Listing 2. The root element is the PLAYS element. This contains two PLAY children. Each PLAY element contains three children: TITLE, AUTHOR, and YEAR. Each of these contains only character data, not more children.

Listing 2: Parents and Children

<?xml version="1.0" standalone="yes"?>
<PLAYS>
  <PLAY>
    <TITLE>What the Butler Saw</TITLE>
    <AUTHOR>Joe Orton</AUTHOR>
    <YEAR>1969</YEAR>
  </PLAY>
  <PLAY>
    <TITLE>The Ideal Husband</TITLE>
    <AUTHOR>Oscar Wilde</AUTHOR>
    <YEAR>1895</YEAR>
  </PLAY>
</PLAYS>

In programmer terms, this means that XML documents form a tree. It starts from the root and gradually bushes out to the leaves on the ends. Trees have a number of nice properties that make them congenial to programmatic traversal, although this doesn’t matter so much to you as the author of the document.

Note: Trees are more commonly drawn from the top down. That is, the root of the tree is shown at the top of the picture rather than the bottom. While this looks less like a real tree, it doesn’t affect the topology of the data structure in the least.
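To give a feel for programmatic traversal, here is a small sketch that walks the tree of Listing 2 from the root down and prints each element indented under its parent. It assumes Python's standard xml.etree.ElementTree module, and outline is a made-up helper name.

```python
import xml.etree.ElementTree as ET

LISTING_2 = """<?xml version="1.0" standalone="yes"?>
<PLAYS>
  <PLAY>
    <TITLE>What the Butler Saw</TITLE>
    <AUTHOR>Joe Orton</AUTHOR>
    <YEAR>1969</YEAR>
  </PLAY>
  <PLAY>
    <TITLE>The Ideal Husband</TITLE>
    <AUTHOR>Oscar Wilde</AUTHOR>
    <YEAR>1895</YEAR>
  </PLAY>
</PLAYS>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before their children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(LISTING_2)
print("\n".join(outline(root)))
```

Because every nonroot element has exactly one parent, a simple recursive function like this visits each element exactly once.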

Attributes

Elements may optionally have attributes. Each attribute of an element is encoded in the start tag of the element as a name-value pair separated by an equals sign (=) and, optionally, some extra white space. The attribute value is enclosed in single or double quotes. For example,

<GREETING LANGUAGE="English">
  Hello XML!
  <MOVIE SRC = 'WavingHand.mov'/>
</GREETING>

Here the GREETING element has a LANGUAGE attribute that has the value English. The MOVIE element has an SRC attribute with the value WavingHand.mov.
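A program reading this fragment sees the attributes as name-value pairs attached to each element. The sketch below assumes Python's standard xml.etree.ElementTree module and writes the SRC value with both of its single quotes in place.

```python
import xml.etree.ElementTree as ET

greeting = ET.fromstring(
    """<GREETING LANGUAGE="English">
  Hello XML!
  <MOVIE SRC = 'WavingHand.mov'/>
</GREETING>"""
)

# The quotes and the white space around = are markup, not part of the value.
print(greeting.get("LANGUAGE"))            # English
print(greeting.find("MOVIE").get("SRC"))   # WavingHand.mov
```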

Attribute names

Attribute names are strings that follow the same rules as element names. That is, attribute names must contain one or more characters, and the first character must be a letter or the underscore (_). Subsequent characters in the name may include letters, digits, underscores, hyphens, and periods. They may not include white space or other punctuation marks.

The same element may not have two attributes with the same name. For example, this is illegal:

<RECTANGLE SIDE="8" SIDE="10"/>

Attribute names are case sensitive. The SIDE attribute is not the same as the side or the Side attribute. Therefore, the following is legal:

<BOX SIDE="8" side="10" Side="31"/>

However, this is extremely confusing, and I strongly urge you not to write markup that depends on case.
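Both cases can be checked directly. This sketch assumes Python's standard xml.etree.ElementTree module: the three differently cased attributes are kept apart, while the repeated SIDE attribute is rejected.

```python
import xml.etree.ElementTree as ET

# Three distinct attributes, because names are case sensitive.
box = ET.fromstring('<BOX SIDE="8" side="10" Side="31"/>')
print(box.attrib)

try:
    ET.fromstring('<RECTANGLE SIDE="8" SIDE="10"/>')
    duplicate_error = None
except ET.ParseError as error:
    # The same name twice on one element violates well-formedness.
    duplicate_error = error
    print("rejected:", error)
```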

Attribute values

Attribute values are strings. Even when the string shows a number, as in the LENGTH attribute below, that number is the two characters 7 and 2, not the binary number 72.

<RULE LENGTH="72"/>

If you’re writing a program to process XML, you’ll need to convert the string to a number before performing arithmetic on it.
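As a small sketch of that conversion, assuming Python's standard xml.etree.ElementTree module, the value arrives as a string and has to be converted explicitly:

```python
import xml.etree.ElementTree as ET

rule = ET.fromstring('<RULE LENGTH="72"/>')

raw = rule.get("LENGTH")  # the two-character string "72"
length = int(raw)         # explicit conversion before doing arithmetic

print(raw + raw)        # string concatenation: 7272
print(length + length)  # integer addition: 144
```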

Unlike attribute names, there are few limits on the content of an attribute value. Attribute values may contain white space, begin with a number, or contain any punctuation characters (except, sometimes, for single and double quotes). The only characters an attribute value may not contain literally are the opening angle bracket < and the ampersand &, which must be written as the &lt; and &amp; entity references (discussed soon). The closing angle bracket > is permitted as is, though it can also be written as &gt;.

XML attribute values are delimited by quote marks. Unlike HTML attribute values, XML attribute values must be enclosed in quotes whether or not the attribute value includes spaces. For example,

<A HREF="http://www.ibiblio.org/">IBiblio</A>

Most people choose double quotes. However, you can also use single quotes, which is useful if the attribute value itself contains a double quote. For example,

<IMG SRC="sistinechapel.jpg"
     ALT='And God said, "Let there be light,"
          and there was light'/>

If the attribute value contains both single and double quotes, then the one that’s used to delimit the string must be replaced with the proper entity reference. I generally just go ahead and replace both, which is always legal. For example,

<RECTANGLE LENGTH='8&apos;7&quot;' WIDTH="10&apos;6&quot;"/>

If an attribute value includes both single and double quotes, you may use the entity reference &apos; for a single quote (an apostrophe) and &quot; for a double quote. For example,

<PARAM NAME="joke" VALUE="The diner said,   &quot;Waiter, There&apos;s a fly in my soup!&quot;">
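After parsing, the entity references in an attribute value are replaced by the quote characters they stand for. The sketch below assumes Python's standard xml.etree.ElementTree module and reuses the RECTANGLE example.

```python
import xml.etree.ElementTree as ET

rectangle = ET.fromstring(
    """<RECTANGLE LENGTH='8&apos;7&quot;' WIDTH="10&apos;6&quot;"/>"""
)

# The application sees real quote characters, whichever delimiter was used.
print(rectangle.get("LENGTH"))  # 8'7"
print(rectangle.get("WIDTH"))   # 10'6"
```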

Entity References

You’re probably familiar with a number of entity references from HTML. For example, &copy; inserts the copyright symbol (©) and &reg; inserts the registered trademark symbol (®). XML predefines the five entity references listed in Table 1. These predefined entity references are used in XML documents in place of specific characters that would otherwise be interpreted as part of markup. For instance, the entity reference &lt; stands for the less than sign (<), which would otherwise be interpreted as beginning a tag.

Table 1: XML Predefined Entity References

Entity Reference    Character
&amp;               &
&lt;                <
&gt;                >
&quot;              "
&apos;              '

Caution: In XML, unlike HTML, entity references must end with a semicolon. &gt; is a correct entity reference; &gt is not.

XML assumes that the opening angle bracket always starts a tag, and that the ampersand always starts an entity reference. (This is often true of HTML as well, but most browsers are more forgiving.) For example, consider this line,

<H1>A Homage to Ben & Jerry's    New York Super Fudge Chunk Ice Cream</H1>

Web browsers that treat this as HTML will probably display it correctly. However, XML parsers will reject it. You should escape the ampersand with &amp; like this:

<H1>A Homage to Ben &amp; Jerry's      New York Super Fudge Chunk Ice Cream</H1>

The open angle bracket (<) is similar. Consider this common Java code embedded in HTML:

<CODE>  for (int i = 0; i <= args.length; i++ ) { </CODE>

Both XML and HTML consider the less than sign in <= to be the start of a tag. The tag continues until the next >. Thus a Web browser treating this fragment as HTML will render this line as

      for (int i = 0; i

rather than

      for (int i = 0; i <= args.length; i++ ) {

The = args.length; i++ ) { is interpreted as part of an unrecognized tag. Again, an XML parser will reject this line completely because it’s malformed.

The less than sign can be included in text in both XML and HTML by writing it as &lt;. For example,

<CODE>  for (int i = 0; i &lt;= args.length; i++ ) { </CODE>

Raw less than signs and ampersands in normal XML text are always interpreted as starting tags and entity references respectively. (The abnormal text is CDATA sections, described below.) Therefore, less than signs and ampersands that are text rather than markup must always be encoded as &lt; and &amp; respectively. Attribute values are text, too, and as you already saw, entity references may be used inside attribute values.

Greater than signs, double quotes, and apostrophes must be encoded when they would otherwise be interpreted as part of markup. However, it’s easier just to get in the habit of encoding all of them rather than trying to figure out whether a particular use would or would not be interpreted as markup.
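If the text is being generated by a program, the escaping can be left to a library routine. The sketch below assumes Python's standard library, whose xml.sax.saxutils.escape function replaces &, <, and > with the corresponding entity references, and then parses the result to confirm that the original text comes back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

java_line = "for (int i = 0; i <= args.length; i++ ) {"
heading = "A Homage to Ben & Jerry's"

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
code_markup = f"<CODE>{escape(java_line)}</CODE>"
heading_markup = f"<H1>{escape(heading)}</H1>"
print(code_markup)     # <CODE>for (int i = 0; i &lt;= args.length; i++ ) {</CODE>
print(heading_markup)  # <H1>A Homage to Ben &amp; Jerry's</H1>

# Parsing the escaped markup gives back the original text.
code_text = ET.fromstring(code_markup).text
heading_text = ET.fromstring(heading_markup).text
print(code_text == java_line, heading_text == heading)  # True True
```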

Other than the five entity references already discussed, you can only use an entity reference if you define it in a DTD first. Since you don’t know about DTDs yet, if the ampersand character & appears anywhere in your document, it must be immediately followed by amp;, lt;, gt;, apos;, or quot;. All other uses violate well-formedness.

Cross Reference: Chapter 10 teaches you how to define new entity references for other characters and longer strings of text using DTDs.

Comments

XML comments are almost exactly like HTML comments. They begin with <!-- and end with -->. All data between the <!-- and --> is ignored by the XML processor. It’s as if it weren’t there. This can be used to make notes to yourself or your coauthors, or to temporarily comment out sections of the document that aren’t ready, as Listing 3 demonstrates.

Listing 3: An XML document that contains a comment

<?xml version="1.0" standalone="yes"?>
<!-- This is Listing 6-3 from The XML Bible -->
<GREETING>
Hello XML!
<!--Goodbye XML-->
</GREETING>

Since comments aren’t elements, they may be placed before or after the root element. However, comments may not come before the XML declaration, which must be the very first thing in the document. For example, this is not a well-formed XML document:

<!-- This is Listing 6-3 from The XML Bible -->
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
<!--Goodbye XML-->
</GREETING>

Comments may not be placed inside a tag. For example, this is also illegal:

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING <!--Goodbye--> >

However, comments may surround and hide tags. In Listing 4, the <ANTIGREETING> tag and all its children are commented out. They are not shown when the document is rendered. It’s as if they don’t exist.

Listing 4: A comment that comments out an element

<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
  <GREETING>
    Hello XML!
  </GREETING>
  <!--
  <ANTIGREETING>
    Goodbye XML!
  </ANTIGREETING>
  -->
</DOCUMENT>
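A quick sketch confirms that the commented-out element never reaches the application. It assumes Python's standard xml.etree.ElementTree module, whose default parser discards comments.

```python
import xml.etree.ElementTree as ET

LISTING_4 = """<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
  <GREETING>
    Hello XML!
  </GREETING>
  <!--
  <ANTIGREETING>
    Goodbye XML!
  </ANTIGREETING>
  -->
</DOCUMENT>"""

root = ET.fromstring(LISTING_4)

# Only GREETING survives; ANTIGREETING was inside the comment.
children = [child.tag for child in root]
print(children)  # ['GREETING']
```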

Because comments effectively delete sections of text, you must take care to ensure that the remaining text is still a well-formed XML document. For example, be careful not to comment out essential tags, as in this malformed document:

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
<!--
</GREETING>
-->

Once the commented text is removed what remains is

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!

Because the <GREETING> tag is no longer matched by a closing </GREETING> tag, this is no longer a well-formed XML document.

There is one final constraint on comments. The two-hyphen string -- may not occur inside a comment except as part of its opening or closing delimiter. For example, this is an illegal comment:

<!-- The red door--that is, the second one--was left open -->

This means, among other things, that you cannot nest comments like this:

<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
  <GREETING>
    Hello XML!
  </GREETING>
  <!--
  <ANTIGREETING>
    <!--Goodbye XML!-->
  </ANTIGREETING>
  -->
</DOCUMENT>

It also means that you may run into trouble if you’re commenting out a lot of C, Java, or JavaScript source code that’s full of expressions such as i-- or numberLeft--. Generally, it’s not too hard to work around this problem once you recognize it.
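The double-hyphen constraint can be checked with a parser as well. The sketch below assumes Python's standard xml.etree.ElementTree module and wraps each comment in a NOTE element, a name invented for this example. The first comment is a reworded version of the illegal one with the double hyphens removed.

```python
import xml.etree.ElementTree as ET

comments = [
    "<!-- The red door (that is, the second one) was left open -->",
    "<!-- The red door--that is, the second one--was left open -->",
]

accepted = []
for comment in comments:
    try:
        ET.fromstring(f"<NOTE>{comment}</NOTE>")
        accepted.append(True)
    except ET.ParseError:
        # A -- inside the comment body is reported as an error.
        accepted.append(False)

print(accepted)  # [True, False]
```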

Processing Instructions

Processing instructions are like comments that are intended for computer programs reading the document rather than people reading the document. However, XML parsers are required to pass along the contents of processing instructions to the application on whose behalf they’re parsing, unlike comments, which a parser is allowed to silently discard. The application that receives the information is free to ignore any processing instruction it doesn’t understand.

Processing instructions begin with <? and end with ?>. The starting <? is followed by an XML name called the target, which identifies the program that the instruction is intended for, followed by data for that program. For example, you saw this processing instruction in the last chapter.

<?xml-stylesheet type="text/xml" href="5-2.xsl"?>

The target of this processing instruction is xml-stylesheet, a standard name that means the data in this processing instruction is intended for any Web browser that can apply a style sheet to the document. type="text/xml" href="5-2.xsl" is the processing instruction data that will be passed to the application reading the document. If that application happens to be a Web browser that understands XSLT, then it will apply the style sheet 5-2.xsl to the document and render the result. If that application is anything other than a Web browser, it will simply ignore the processing instruction.

Note: Appearances to the contrary, the XML declaration is technically not a processing instruction. The difference is academic unless you’re writing a program to read an XML document using an XML parser. In that case, the parser’s API will provide different methods to get the contents of processing instructions and the contents of the XML declaration.

xml-stylesheet processing instructions are always placed in the document’s prolog between the XML declaration and the root element start tag. Other processing instructions may also be placed in the prolog, or at almost any other convenient location in the XML document, either before, after, or inside the root element. For example, PHP processing instructions generally appear wherever you want the PHP processor to place its output. The only places a processing instruction may not appear are inside a tag or before the XML declaration.

The target of a processing instruction may be the name of the program it is intended for or it may be a generic identifier such as xml-stylesheet that many different programs recognize. The target name xml (or XML, Xml, xMl, or any other variation) is reserved for use by the World Wide Web Consortium. However, you’re free to use any other convenient name for processing instruction targets. Different applications support different processing instructions. Most applications simply ignore any processing instruction whose target they don’t recognize.

The xml-stylesheet processing instruction uses a very common format for processing instructions in which the data is divided into pseudo-attributes; that is, the data is passed as name-value pairs, and the values are delimited by quotes. However, as with the XML declaration, these are not true attributes because a processing instruction is not a tag. Furthermore, this format is optional. Some processing instructions will use this style; others won’t. The only limit on the content of processing instruction data is that it may not contain the two-character sequence ?> that signals the end of a processing instruction. Otherwise, it’s free to contain any legal characters that may appear in XML documents. For example, this is a legal processing instruction.

<?html-signature
  Copyright 2001 <a href=http://www.macfaq.com/personal.html>
    Elliotte Rusty Harold</a><br>
    <a href=mailto:elharo@metalab.unc.edu>
      elharo@metalab.unc.edu</a><br>
    Last Modified May 3, 2001
?>

In this example, the target is html-signature. The rest of the processing instruction is data and contains a lot of malformed HTML that would otherwise be illegal in an XML document. Some programs might read this, recognize the html-signature target, and copy the data into the signature of an HTML page. Other programs that don’t recognize the html-signature target will simply ignore it.
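Because the pseudo-attribute format is only a convention, an application that wants name-value pairs has to split the data itself. The following is a rough sketch of one way to do that; parse_pseudo_attributes is a hypothetical helper, and the regular expression is an assumption about the name="value" style rather than anything the XML specification requires.

```python
import re

# Matches name="value" or name='value'. This pattern is an assumption about the
# pseudo-attribute style; processing instruction data is not required to follow it.
PSEUDO_ATTRIBUTE = re.compile(r"""([A-Za-z_][\w.\-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def parse_pseudo_attributes(data):
    """Hypothetical helper: split processing instruction data into name-value pairs."""
    pairs = {}
    for name, double_quoted, single_quoted in PSEUDO_ATTRIBUTE.findall(data):
        pairs[name] = double_quoted or single_quoted
    return pairs

print(parse_pseudo_attributes('type="text/xml" href="5-2.xsl"'))
# Free-form data such as the html-signature example has no quoted pairs to find.
print(parse_pseudo_attributes("Copyright 2001 <a href=http://www.macfaq.com/personal.html>"))
```

The first call yields the type and href pairs; the second yields an empty dictionary, so an application handling html-signature would have to work with the raw data string instead.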

CDATA Sections

Suppose your document contains one or more large blocks of text that have a lot of <, >, &, or " characters but no markup. This would be true for a Java or HTML tutorial, for example. It would be inconvenient to have to replace each instance of one of these characters with the equivalent entity reference. Instead, you can include the block of text in a CDATA section.

CDATA sections begin with <![CDATA[ and end with ]]>. For example:

<![CDATA[
System.out.print("<");
if (x <= args.length && y > z) {
  System.out.println(args[x - y]);
}
System.out.println(">");
]]>

The only text that’s not allowed within a CDATA section is the closing CDATA tag ]]>. Comments may appear in CDATA sections, but do not act as comments. That is, both the comment tags and all the text they contain will be displayed.

Most of the time anything inside a pair of <> angle brackets is markup, and anything that’s not is character data. However, in CDATA sections, all text is pure character data. Anything that looks like a tag or an entity reference is really just the text of the tag or the entity reference. The XML processor does not try to interpret it in any way. CDATA sections are used when you want all text to be interpreted as pure character data rather than as markup.
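A quick way to convince yourself of this is to parse a small document and look at what the application receives. The sketch below uses Python's standard xml.etree.ElementTree module; the SAMPLE element and its contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Everything between <![CDATA[ and ]]> should come back as literal text,
# including the things that look like a comment and an entity reference.
DOCUMENT = (
    "<SAMPLE><![CDATA[if (x <= args.length && y > z) "
    "{ <!-- not a comment --> &amp; }]]></SAMPLE>"
)

sample = ET.fromstring(DOCUMENT)
print(sample.text)
print("child elements:", len(sample))
```

The text comes back exactly as typed, comment delimiters and &amp; included, and the element has no children.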

CDATA sections are extremely useful if you’re trying to write about HTML or XML in XML. For example, this book contains many small blocks of XML code. The word processor I’m using doesn’t care about that. But if I were to convert this book to XML, I’d have to painstakingly replace all the less than signs with &lt; and all the ampersands with &amp; like this:

&lt;?xml version="1.0" standalone="yes"?&gt;
&lt;GREETING&gt;Hello XML!&lt;/GREETING&gt;
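When the text is produced by a program rather than typed by hand, the replacement can be automated. As a small sketch, Python's standard xml.sax.saxutils.escape function performs these substitutions; the SOURCE string is just the sample document from above.

```python
from xml.sax.saxutils import escape

SOURCE = '<?xml version="1.0" standalone="yes"?>\n<GREETING>Hello XML!</GREETING>'

# escape() replaces &, <, and > with &amp;, &lt;, and &gt;.
# Quote marks are left alone unless you pass extra entities.
print(escape(SOURCE))
```

This helps generated documents, but it does nothing for text you are editing by hand.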

To avoid having to do this, I can instead use a CDATA section to indicate that a block of text is to be presented as is with no translation. For example:

<![CDATA[<?xml version="1.0" standalone="yes"?>
<GREETING>Hello XML!</GREETING>]]>

Note: Because ]]> may not appear in a CDATA section, CDATA sections cannot nest. This makes it relatively difficult to write about CDATA sections in XML. If you need to do this, you just have to bite the bullet and use the &lt; and &amp; escapes.
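If the text is generated by a program, one widely used workaround (not something this article relies on) is to end the CDATA section in the middle of each ]]> and immediately start a new one, so that the forbidden sequence never appears whole inside a single section. The to_cdata helper and the EXAMPLE element below are hypothetical names for this sketch.

```python
import xml.etree.ElementTree as ET

def to_cdata(text):
    """Hypothetical helper: wrap text in CDATA, splitting each ]]> across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

original = "<![CDATA[<GREETING>Hello XML!</GREETING>]]>"
document = "<EXAMPLE>" + to_cdata(original) + "</EXAMPLE>"

print(document)
# The parser joins the two adjacent sections back into the original text.
print(ET.fromstring(document).text == original)
```

The result is hard to read by eye, which is why the escapes remain the more practical choice when you are writing by hand.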

CDATA sections aren’t needed that often, but when they are needed, they’re needed badly.

Tools

It is not particularly difficult to write well-formed XML documents that follow the rules described in this article. However, XML browsers are less forgiving of poor syntax than are HTML browsers, so you do need to be careful.

If you violate any well-formedness constraints, XML parsers and browsers will report a syntax error. Thus, the process of writing XML can be a little like the process of writing code in a real programming language. You write it; then you compile it; then when the compilation fails, you note the errors reported and fix them. In the case of XML you parse the document rather than compile it, but the pattern is the same.

Generally, this is an iterative process in which you go through several edit-parse cycles before you get your first look at the finished document. Despite this, there’s no question that writing XML is a lot easier than writing C or Java source code. With a little practice, you’ll get to the point where you have relatively few errors and can write XML almost as quickly as you can type.
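If you don't have a checker handy, almost any XML parser will do for the parse half of the cycle. Here is a minimal sketch using Python's standard xml.etree.ElementTree module; check_well_formed is a hypothetical function name, and the exact wording of the error message depends on the underlying parser.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    """Return None if xml_text parses, otherwise a (line, column, message) tuple."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        line, column = error.position
        return line, column, str(error)
    return None

# A well-formed document produces no error.
print(check_well_formed("<GREETING>Hello XML!</GREETING>"))
# An unquoted attribute value is legal HTML but not well-formed XML.
print(check_well_formed("<P ALIGN=CENTER>Hello XML!</P>"))
```

Like a compiler, the parser stops at the first error it finds, so you fix that one, parse again, and repeat until the function returns None.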

There are several tools that will help you clean up your pages, most notably RUWF (Are You Well Formed?) from XML.COM and Tidy from Dave Raggett of the W3C.

RUWF

Any tool that can check XML documents for well-formedness can test well-formed HTML documents as well. One of the easiest to use is the RUWF well-formedness checker from XML.COM at http://www.xml.com/pub/a/tools/ruwf/check.html . Simply type in the URL of the page that you want to check, and RUWF returns the first several dozen errors on the page.

Here’s the first batch of errors RUWF found on the White House home page. Most of these errors are malformed XML, but legal (if not necessarily well styled) HTML. However, at least one error (“Line 55, column 30: Encountered </FONT> with no start-tag.”) is a problem for both HTML and XML.

Line 28, column 7: Encountered </HEAD> expected </META>
...assumed </META> ...assumed </META> ...assumed </META>
...assumed </META>
Line 36, column 12, character '0': after AttrName= in start-tag
Line 37, column 12, character '0': after AttrName= in start-tag
Line 38, column 12, character '0': after AttrName= in start-tag
Line 40, column 12, character '0': after AttrName= in start-tag
Line 41, column 10, character 'A': after AttrName= in start-tag
Line 42, column 12, character '0': after AttrName= in start-tag
Line 43, column 14: Encountered </CENTER> expected </br>
...assumed </br> ...assumed </br>
Line 51, column 11, character '+': after AttrName= in start-tag
Line 52, column 51, character '0': after AttrName= in start-tag
Line 54, column 57: after &
Line 55, column 30: Encountered </FONT> with no start-tag.
Line 57, column 10, character 'A': after AttrName= in start-tag
Line 59, column 15, character '+': after AttrName= in start-tag

Tidy

After you’ve identified the problems, you’ll want to fix them. Many common problems, such as missing quote marks around attribute values, can be fixed automatically. The most convenient tool for doing this is Dave Raggett’s command line program HTML Tidy. Tidy is a character mode program written in ANSI C that can be compiled and run on most platforms, including Windows, UNIX, BeOS, and Mac.

Tidy cleans up HTML files in several ways, not all of which are relevant to XML well-formedness. In fact, in its default mode Tidy tends to remove unnecessary (for HTML, but not for XML) end tags such as </LI>, and to make other modifications that break well-formedness. However, you can use the -asxml switch to specify that you want well-formed XML output. For example, to convert the file index.html to well-formed XML, you would type this command from a DOS window or shell prompt:

C:\> tidy -m -asxml index.html

The -m flag tells Tidy to convert the file in place. The -asxml flag tells Tidy to format the output as XML.

Summary

In this article, you learned about XML’s well-formedness rules. In particular, you learned:

  • XML documents are sequences of characters that meet certain well-formedness criteria.
  • The text of an XML document is divided into character data and markup.
  • An XML document is a tree structure made up of elements.
  • Tags delimit elements.
  • Start tags and empty tags may contain attributes, which describe elements.
  • Entity references allow you to include <, >, &, ", and ' in your document.
  • CDATA sections are useful for embedding text that contains a lot of <, >, and & characters.
  • Comments can document your code for other people who read it, but parsers may ignore them. Comments can also hide sections of the document that aren’t ready.
  • Processing instructions allow you to pass application-specific information to particular applications.

About the Author

Elliotte Rusty Harold is an internationally respected writer, programmer, and educator both on the Internet and off. He got his start writing FAQ lists for the Macintosh newsgroups on Usenet and has since branched out into books, Web sites, and newsletters. He’s an adjunct professor of computer science at Polytechnic University in Brooklyn, New York. His books include XML Bible, The Java Developer’s Resource, Java Network Programming, Java Secrets, JavaBeans, XML: Extensible Markup Language, and Java I/O.

© Copyright Hungry Minds, All Rights Reserved
