Well-Formed XML
Entity References
You're probably familiar with a number of entity references from HTML. For example, &copy; inserts the copyright symbol (©) and &reg; inserts the registered trademark symbol (®). XML predefines the five entity references listed in Table 1. These predefined entity references are used in XML documents in place of specific characters that would otherwise be interpreted as part of markup. For instance, the entity reference &lt; stands for the less than sign (<), which would otherwise be interpreted as beginning a tag.
Table 1
XML Predefined Entity References
| Entity Reference | Character |
| &amp; | & |
| &lt; | < |
| &gt; | > |
| &quot; | " |
| &apos; | ' |
Caution: In XML, unlike HTML, entity references must end with a semicolon. &gt; is a correct entity reference; &gt is not.
XML assumes that the opening angle bracket always starts a tag, and that the ampersand always starts an entity reference. (This is often true of HTML as well, but most browsers are more forgiving.) For example, consider this line,
<H1>A Homage to Ben & Jerry's
New York Super Fudge Chunk Ice Cream</H1>
Web browsers that treat this as HTML will probably display it correctly. However, XML parsers will reject it. You should escape the ampersand with &amp; like this:
<H1>A Homage to Ben &amp; Jerry's
New York Super Fudge Chunk Ice Cream</H1>
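One way to see the difference between a raw and an escaped ampersand is to hand both to a parser. The following is a minimal sketch that assumes Python 3 and its standard xml.etree.ElementTree module; that choice of parser is mine, made only for illustration, and the helper name parses is hypothetical. Any conforming XML parser should reach the same verdicts.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

raw = "<H1>A Homage to Ben & Jerry's New York Super Fudge Chunk Ice Cream</H1>"
escaped = "<H1>A Homage to Ben &amp; Jerry's New York Super Fudge Chunk Ice Cream</H1>"

print(parses(raw))      # the raw ampersand makes the document malformed
print(parses(escaped))  # the escaped ampersand is accepted

# The application receives the character itself, not the entity reference.
heading_text = ET.fromstring(escaped).text
print(heading_text)
```

Note that the entity reference exists only in the document's source text; once parsed, the heading contains an ordinary ampersand.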
The open angle bracket (<) is similar. Consider this common Java code embedded in HTML:
<CODE> for (int i = 0; i <= args.length; i++ ) {
</CODE>
Both XML and HTML consider the less than sign in <= to be the start of a tag. The tag continues until the next >. Thus, a Web browser treating this fragment as HTML will render this line as:
for (int i = 0; i
rather than
for (int i = 0; i <= args.length; i++ ) {
The = args.length; i++ ) { is interpreted as part of an unrecognized tag. Again, an XML parser will reject this line completely because it's malformed.
The less than sign can be included in text in both XML and HTML by writing it as &lt;. For example,
<CODE> for (int i = 0; i &lt;= args.length; i++ ) {
</CODE>
Raw less than signs and ampersands in normal XML text are always interpreted as starting tags and entity references respectively. (The one exception is text in CDATA sections, described below.) Therefore, less than signs and ampersands that are text rather than markup must always be encoded as &lt; and &amp; respectively. Attribute values are text, too, and as you already saw, entity references may be used inside attribute values.
Greater than signs, double quotes, and apostrophes must be encoded when they would otherwise be interpreted as part of markup. However, it's easier just to get in the habit of encoding all of them rather than trying to figure out whether a particular use would or would not be interpreted as markup.
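If text is generated by a program rather than typed by hand, the escaping habit can be delegated to a library. The sketch below assumes Python 3 and uses the standard xml.sax.saxutils.escape function, which always encodes &, <, and >. The wrapper escape_all and the sample strings are hypothetical names introduced here to show how the other two predefined references can be added.

```python
from xml.sax.saxutils import escape

# escape() encodes the ampersand first, so the references it produces
# are not escaped a second time.
java_line = "for (int i = 0; i <= args.length; i++ ) {"
escaped_line = escape(java_line)
print(escaped_line)

def escape_all(text):
    """Encode all five characters that have predefined entity references."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})

sample = "Ben & Jerry's \"New York\" <Super Fudge>"
escaped_sample = escape_all(sample)
print(escaped_sample)
```

Encoding all five characters everywhere is slightly more than the rules require, but it means no decision has to be made about where the text will end up.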
Other than the five entity references already discussed, you can only use an entity reference if you define it in a DTD first. Since you don't know about DTDs yet, if the ampersand character & appears anywhere in your document, it must be immediately followed by amp;, lt;, gt;, apos;, or quot;. All other uses violate well-formedness.
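The difference between predefined and undefined entity references shows up as soon as a document is parsed. This is a small sketch, again assuming Python 3 and the standard xml.etree.ElementTree parser as a stand-in for any XML parser. The two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# All five predefined references are available without a DTD.
predefined = "<P>&amp; &lt; &gt; &quot; &apos;</P>"
decoded = ET.fromstring(predefined).text
print(decoded)

# &copy; is familiar from HTML, but nothing in this document defines it.
undefined = "<P>&copy; 2001</P>"
try:
    ET.fromstring(undefined)
    undefined_accepted = True
except ET.ParseError as err:
    undefined_accepted = False
    print("rejected:", err)
```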
Cross Reference: Chapter 10 teaches you how to define new entity references for other characters and longer strings of text using DTDs.
Comments
XML comments are almost exactly like HTML comments. They begin with <!-- and end with -->. All data between the <!-- and --> is ignored by the XML processor. It's as if it weren't there. This can be used to make notes to yourself or your coauthors, or to temporarily comment out sections of the document that aren't ready, as Listing 3 demonstrates.
Listing 3: An XML document that contains a comment
<?xml version="1.0" standalone="yes"?>
<!-- This is Listing 6-3 from The XML Bible -->
<GREETING>
Hello XML!
<!--Goodbye XML-->
</GREETING>
Since comments aren't elements, they may be placed before or after the root element. However, comments may not come before the XML declaration, which must be the very first thing in the document. For example, this is not a well-formed XML document:
<!-- This is Listing 6-3 from The XML Bible -->
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
<!--Goodbye XML-->
</GREETING>
Comments may not be placed inside a tag. For example, this is also illegal:
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING <!--Goodbye--> >
However, comments may surround and hide tags. In Listing 4, the <ANTIGREETING> tag and all its children are commented out. They are not shown when the document is rendered. It's as if they don't exist.
Listing 4: A comment that comments out an element
<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
<GREETING>
Hello XML!
</GREETING>
<!--
<ANTIGREETING>
Goodbye XML!
</ANTIGREETING>
-->
</DOCUMENT>
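To confirm that a commented-out element really is invisible to an application, Listing 4 can be parsed and its children listed. This sketch assumes Python 3 and the standard xml.etree.ElementTree module, which by default discards comments, as a parser is allowed to do. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

listing_4 = """<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
<GREETING>
Hello XML!
</GREETING>
<!--
<ANTIGREETING>
Goodbye XML!
</ANTIGREETING>
-->
</DOCUMENT>"""

root = ET.fromstring(listing_4)

# Only elements outside the comment appear in the tree.
child_tags = [child.tag for child in root]
print(child_tags)

# Searching for the commented-out element finds nothing.
antigreeting = root.find("ANTIGREETING")
print(antigreeting)
```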
Because comments effectively delete sections of text, you must take care to ensure that the remaining text is still a well-formed XML document. For example, be careful not to comment out essential tags, as in this malformed document:
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
<!--
</GREETING>
-->
Once the commented text is removed what remains is
<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
Because the <GREETING> tag is no longer matched by a closing </GREETING> tag, this is no longer a well-formed XML document.
There is one final constraint on comments. The two-hyphen string -- may not occur inside a comment except as part of its opening or closing tag. For example, this is an illegal comment:
<!-- The red door-- that is, the second one -- was left open -->
This means, among other things, that you cannot nest comments like this:
<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
<GREETING>
Hello XML!
</GREETING>
<!--
<ANTIGREETING>
<!--Goodbye XML!-->
</ANTIGREETING>
-->
</DOCUMENT>
It also means that you may run into trouble if you're commenting out a lot of C, Java, or JavaScript source code that's full of expressions such as i-- or numberLeft--. Generally, it's not too hard to work around this problem once you recognize it.
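The constraints on comments described in this section can be checked mechanically. The following sketch assumes Python 3 and the standard xml.etree.ElementTree parser. The helper parses and the shortened sample documents are hypothetical, written to mirror the examples above rather than copied from them.

```python
import xml.etree.ElementTree as ET

def parses(text):
    """Return True if text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

examples = {
    "comment after root element":
        "<GREETING>Hello XML!</GREETING><!-- fine -->",
    "comment before XML declaration":
        '<!-- too early --><?xml version="1.0"?><GREETING/>',
    "comment inside a tag":
        "<GREETING>Hello XML!</GREETING <!--Goodbye--> >",
    "double hyphen inside a comment":
        "<GREETING><!-- The red door-- that is, the second one --></GREETING>",
    "nested comment":
        "<GREETING><!-- <!--Goodbye XML!--> --></GREETING>",
    "essential tag commented out":
        "<GREETING>Hello XML!<!-- </GREETING> -->",
}

# Map each description to the parser's verdict.
results = {label: parses(text) for label, text in examples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "malformed")
```

Only the first example should be reported as well-formed; each of the others breaks one of the rules above.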
Processing Instructions
Processing instructions are like comments that are intended for computer programs reading the document rather than people reading the document. However, XML parsers are required to pass along the contents of processing instructions to the application on whose behalf they're parsing, unlike comments, which a parser is allowed to silently discard. The application that receives the information is free to ignore any processing instruction it doesn't understand.
Processing instructions begin with <? and end with ?>. The starting <? is followed by an XML name called the target, which identifies the program that the instruction is intended for, followed by data for that program. For example, you saw this processing instruction in the last chapter.
<?xml-stylesheet type="text/xml" href="5-2.xsl"?>
The target of this processing instruction is xml-stylesheet, a standard name that means the data in this processing instruction is intended for any Web browser that can apply a style sheet to the document. type="text/xml" href="5-2.xsl" is the processing instruction data that will be passed to the application reading the document. If that application happens to be a Web browser that understands XSLT, then it will apply the style sheet 5-2.xsl to the document and render the result. If that application is anything other than a Web browser, it will simply ignore the processing instruction.
Note: Appearances to the contrary, the XML declaration is technically not a processing instruction. The difference is academic unless you're writing a program to read an XML document using an XML parser. In that case, the parser's API will provide different methods to get the contents of processing instructions and the contents of the XML declaration.
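To illustrate how a parser hands processing instructions to an application, here is a minimal sketch that assumes Python 3 and the standard xml.dom.minidom module; other parser APIs expose the same information under different names. The sample document is assembled here for illustration.

```python
from xml.dom import minidom, Node

document = minidom.parseString(
    '<?xml version="1.0" standalone="yes"?>'
    '<?xml-stylesheet type="text/xml" href="5-2.xsl"?>'
    '<GREETING>Hello XML!</GREETING>'
)

# Collect the target and data of every processing instruction in the prolog.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The XML declaration does not show up in this list, which is consistent with the point that it is technically not a processing instruction.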
xml-stylesheet processing instructions are always placed in the document's prolog between the XML declaration and the root element start tag. Other processing instructions may also be placed in the prolog, or at almost any other convenient location in the XML document, either before, after, or inside the root element. For example, PHP processing instructions generally appear wherever you want the PHP processor to place its output. The only places a processing instruction may not appear are inside a tag and before the XML declaration.
The target of a processing instruction may be the name of the program it is intended for or it may be a generic identifier such as xml-stylesheet that many different programs recognize. The target name xml (or XML, Xml, xMl, or any other variation) is reserved for use by the World Wide Web Consortium. However, you're free to use any other convenient name for processing instruction targets. Different applications support different processing instructions. Most applications simply ignore any processing instruction whose target they don't recognize.
The xml-stylesheet processing instruction uses a very common format for processing instructions in which the data is divided into pseudo-attributes; that is, the data is passed as name-value pairs, and the values are delimited by quotes. However, as with the XML declaration, these are not true attributes because a processing instruction is not a tag. Furthermore, this format is optional. Some processing instructions will use this style; others won't. The only limit on the content of processing instruction data is that it may not contain the two-character sequence ?> that signals the end of a processing instruction. Otherwise, it's free to contain any legal characters that may appear in XML documents. For example, this is a legal processing instruction.
<?html-signature
Copyright 2001 <a href=http://www.macfaq.com/personal.html>
Elliotte Rusty Harold</a><br>
<a href=mailto:elharo@metalab.unc.edu>
elharo@metalab.unc.edu</a><br>
Last Modified May 3, 2001
?>
In this example, the target is html-signature. The rest of the processing instruction is data and contains a lot of malformed HTML that would otherwise be illegal in an XML document. Some programs might read this, recognize the html-signature target, and copy the data into the signature of an HTML page. Other programs that don't recognize the html-signature target will simply ignore it.
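Because pseudo-attributes are only a convention, the parser delivers processing instruction data as one plain string and the application must split it up itself. The sketch below assumes Python 3. The function parse_pseudo_attributes and its regular expression are hypothetical, a simplified illustration rather than the rules any particular program applies.

```python
import re

# A name, an equals sign, and a value in double or single quotes.
PSEUDO_ATTRIBUTE = re.compile(
    r"""([A-Za-z_][\w.\-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')"""
)

def parse_pseudo_attributes(data):
    """Return the quoted name-value pairs found in processing instruction data."""
    pairs = {}
    for name, double_quoted, single_quoted in PSEUDO_ATTRIBUTE.findall(data):
        pairs[name] = double_quoted or single_quoted
    return pairs

stylesheet_data = 'type="text/xml" href="5-2.xsl"'
stylesheet_pairs = parse_pseudo_attributes(stylesheet_data)
print(stylesheet_pairs)

# Data that does not follow the convention yields no pairs at all.
signature_data = "Copyright 2001 <a href=http://www.macfaq.com/personal.html>"
signature_pairs = parse_pseudo_attributes(signature_data)
print(signature_pairs)
```

An application that expects pseudo-attributes simply finds none in the html-signature data, since the href value there is unquoted.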
CDATA Sections
Suppose your document contains one or more large blocks of text that have a lot of <, >, or & characters but no markup. This would be true for a Java or HTML tutorial, for example. It would be inconvenient to have to replace each instance of one of these characters with the equivalent entity reference. Instead, you can include the block of text in a CDATA section.
CDATA sections begin with <![CDATA[ and end with ]]>. For example:
<![CDATA[
System.out.print("<");
if (x <= args.length && y > z) {
  System.out.println(args[x - y]);
}
System.out.println(">");
]]>
The only text that's not allowed within a CDATA section is the closing CDATA tag ]]>. Comments may appear in CDATA sections, but do not act as comments. That is, both the comment tags and all the text they contain will be displayed.
Most of the time anything inside a pair of <> angle brackets is markup, and anything that's not is character data. However, in CDATA sections, all text is pure character data. Anything that looks like a tag or an entity reference is really just the text of the tag or the entity reference. The XML processor does not try to interpret it in any way. CDATA sections are used when you want all text to be interpreted as pure character data rather than as markup.
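From the application's point of view, a CDATA section and the equivalent escaped text are indistinguishable. The following sketch, which assumes Python 3 and the standard xml.etree.ElementTree parser, compares the two forms. The CODE element and the sample strings are chosen here only for illustration.

```python
import xml.etree.ElementTree as ET

java_code = "if (x <= args.length && y > z) {"

# The same character data, written once in a CDATA section and once with escapes.
with_cdata = ET.fromstring("<CODE><![CDATA[" + java_code + "]]></CODE>")
with_escapes = ET.fromstring(
    "<CODE>if (x &lt;= args.length &amp;&amp; y &gt; z) {</CODE>"
)
print(with_cdata.text == with_escapes.text == java_code)

# Inside CDATA, things that look like markup are kept as literal text.
looks_like_markup = ET.fromstring(
    "<CODE><![CDATA[<!-- not a comment --> &amp; <B>]]></CODE>"
)
print(looks_like_markup.text)
```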
CDATA sections are extremely useful if you're trying to write about HTML or XML in XML. For example, this book contains many small blocks of XML code. The word processor I'm using doesn't care about that. But if I were to convert this book to XML, I'd have to painstakingly replace all the less than signs with &lt; and all the ampersands with &amp; like this:
&lt;?xml version="1.0" standalone="yes"?>
&lt;GREETING>
Hello XML!
&lt;/GREETING>
To avoid having to do this, I can instead use a CDATA section to indicate that a block of text is to be presented as is with no translation. For example:
<![CDATA[<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>]]>
Note: Because ]]> may not appear in a CDATA section, CDATA sections cannot nest. This makes it relatively difficult to write about CDATA sections in XML. If you need to do this, you just have to bite the bullet and use the &lt; and &amp; escapes.
CDATA sections aren't needed that often, but when they are needed, they're needed badly.
Tools
It is not particularly difficult to write well-formed XML documents that follow the rules described in this article. However, XML browsers are less forgiving of poor syntax than are HTML browsers, so you do need to be careful.
If you violate any well-formedness constraints, XML parsers and browsers will report a syntax error. Thus, the process of writing XML can be a little like the process of writing code in a real programming language. You write it; then you compile it; then when the compilation fails, you note the errors reported and fix them. In the case of XML you parse the document rather than compile it, but the pattern is the same.
Generally, this is an iterative process in which you go through several edit-parse cycles before you get your first look at the finished document. Despite this, there's no question that writing XML is a lot easier than writing C or Java source code. With a little practice, you'll get to the point where you have relatively few errors and can write XML almost as quickly as you can type.
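The edit-parse cycle can be tried with nothing more than a scripting language's built-in parser. This is a minimal sketch assuming Python 3 and the standard xml.etree.ElementTree module. The function first_error and the draft document are hypothetical. Unlike a checker that lists many problems at once, this parser stops at the first error it meets.

```python
import xml.etree.ElementTree as ET

def first_error(text):
    """Return None if text is well-formed, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None

# First pass: the raw ampersand on the second line is reported.
draft = "<GREETING>\nHello XML & goodbye HTML!\n</GREETING>"
draft_result = first_error(draft)
print(draft_result)

# Edit, then parse again: no errors remain.
fixed = draft.replace(" & ", " &amp; ")
fixed_result = first_error(fixed)
print(fixed_result)
```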
There are several tools that will help you clean up your pages, most notably RUWF (Are You Well Formed?) from XML.COM and Tidy from Dave Raggett of the W3C.
RUWF
Any tool that can check XML documents for well-formedness can test well-formed HTML documents as well. One of the easiest to use is the RUWF well-formedness checker from XML.COM at http://www.xml.com/pub/a/tools/ruwf/check.html. Simply type in the URL of the page that you want to check, and RUWF returns the first several dozen errors on the page.
Here's the first batch of errors RUWF found on the White House home page. Most of these errors are malformed XML, but legal (if not necessarily well styled) HTML. However, at least one error ("Line 55, column 30: Encountered </FONT> with no start-tag.") is a problem for both HTML and XML.
Line 28, column 7: Encountered </HEAD> expected </META>
...assumed </META>
...assumed </META>
...assumed </META>
...assumed </META>
Line 36, column 12, character '0': after AttrName= in start-tag
Line 37, column 12, character '0': after AttrName= in start-tag
Line 38, column 12, character '0': after AttrName= in start-tag
Line 40, column 12, character '0': after AttrName= in start-tag
Line 41, column 10, character 'A': after AttrName= in start-tag
Line 42, column 12, character '0': after AttrName= in start-tag
Line 43, column 14: Encountered </CENTER> expected </br>
...assumed </br>
...assumed </br>
Line 51, column 11, character '+': after AttrName= in start-tag
Line 52, column 51, character '0': after AttrName= in start-tag
Line 54, column 57: after &
Line 55, column 30: Encountered </FONT> with no start-tag.
Line 57, column 10, character 'A': after AttrName= in start-tag
Line 59, column 15, character '+': after AttrName= in start-tag
Tidy
After you've identified the problems, you'll want to fix them. Many common problems (for instance, putting quote marks around attribute values) can be fixed automatically. The most convenient tool for doing this is Dave Raggett's command line program HTML Tidy. Tidy is a character mode program written in ANSI C that can be compiled and run on most platforms, including Windows, UNIX, BeOS, and Mac.
Tidy cleans up HTML files in several ways, not all of which are relevant to XML well-formedness. In fact, in its default mode Tidy tends to remove unnecessary (for HTML, but not for XML) end tags such as </LI>, and to make other modifications that break well-formedness. However, you can use the -asxml switch to specify that you want well-formed XML output. For example, to convert the file index.html to well-formed XML, you would type this command from a DOS window or shell prompt:
C:> tidy -m -asxml index.html
The -m flag tells Tidy to convert the file in place. The -asxml flag tells Tidy to format the output as XML.
Summary
In this article, you learned about XML's well-formedness rules. In particular, you learned:
- XML documents are sequences of characters that meet certain well-formedness criteria.
- The text of an XML document is divided into character data and markup.
- An XML document is a tree structure made up of elements.
- Tags delimit elements.
- Start tags and empty tags may contain attributes, which describe elements.
- Entity references allow you to include <, >, &, ", and ' in your document.
- CDATA sections are useful for embedding text that contains a lot of <, >, and & characters.
- Comments can document your code for other people who read it, but parsers may ignore them. Comments can also hide sections of the document that aren't ready.
- Processing instructions allow you to pass application-specific information to particular applications.
About the Author
Elliotte Rusty Harold is an internationally respected writer, programmer, and educator both on the Internet and off. He got his start writing FAQ lists for the Macintosh newsgroups on Usenet and has since branched out into books, Web sites, and newsletters. He's an adjunct professor of computer science at Polytechnic University in Brooklyn, New York. His books include "XML Bible", "The Java Developer's Resource", "Java Network Programming", "Java Secrets", "JavaBeans", "XML: Extensible Markup Language", and "Java I/O".
This article is brought to you by Hungry Minds, Inc.,
publisher of XML Bible, 2nd Edition. © Copyright Hungry Minds, All Rights Reserved.


