Well-Formed XML

  • July 9, 2001
  • By Elliotte R Harold
This article is brought to you by Hungry Minds, Inc., publisher of Elliotte Rusty Harold's XML Bible, 2nd Edition

HTML 4.0 has nearly 100 different elements. Most of these elements have a dozen or more possible attributes for several thousand different possible variations. Since XML is more powerful than HTML, you might think that you need to learn even more elements, but you don't. XML gets its power through simplicity and extensibility, not through a plethora of elements.

In fact, XML predefines no elements at all. Instead XML allows you to define your own elements as needed. However, these elements and the documents built from them are not completely arbitrary. Instead, they have to follow a specific set of rules elaborated in this chapter. A document that follows these rules is said to be well-formed. Well-formedness is the minimum criterion that XML processors and browsers require in order to read files. This article examines the rules for well-formed documents. It explores the different constructs that make up an XML document -- tags, text, attributes, elements, and so on -- and discusses the primary rules each of these must follow. Particular attention is paid to how XML differs from HTML. Along the way I introduce several new XML constructs, including comments, processing instructions, entity references, and CDATA sections. This article isn't an exhaustive discussion of well-formedness rules. Some of the rules I present here must be adjusted slightly for documents that have a document type definition (DTD), and there are additional rules for well-formedness that define the relationship between the document and its DTD.

Well-Formedness Rules

Although XML allows you to invent as many different elements and attributes as you need, these elements and attributes, as well as their contents and the documents that contain them, must all follow certain rules in order to be well-formed. If a document is not well-formed, any attempts to read it or render it will fail.

The XML specification strictly prohibits XML parsers from trying to fix and understand malformed documents. All a conforming parser is allowed to do is report the error. It may not fix the error. It may not make a best-faith effort to render what the author intended. It may not ignore the offending malformed markup. All it can do is report the error and exit.

Note: The objective here is to avoid the bug-for-bug compatibility wars that have hindered HTML, and that have made writing HTML parsers and renderers so difficult. Because Web browsers allow malformed HTML, Web-page designers don't make the extra effort to ensure that their HTML is correct. In fact, they even rely on bugs in individual browsers to achieve special effects. In order to properly display the huge installed base of HTML pages, every new Web browser must support every quirk of all the Web browsers that have come before. The marketplace would ignore any browser that strictly adhered to the HTML standard. It is to avoid this sorry state that XML processors are explicitly required to only accept well-formed XML.

To be well-formed, an XML document must follow more than 100 different rules. However, most of these rules simply forbid things that you're not very likely to do anyway if you follow the examples given in this book. For instance, one rule is that the name of the element must immediately follow the < of the element's start tag. For example, <triangle> is a legal start tag but < triangle> isn't. On the other hand, the same rule says that it is OK to have extra space before the tag's closing angle bracket. That is, both <triangle> and <triangle > are well-formed start tags. Another rule says that element names must have at least one character; that is, <> is not a legal start tag, and </> is not a legal end tag. Chances are it never would have occurred to you to create an element with a zero-length name, but computers are dumber than human beings, and need to have constraints like this spelled out for them very formally. XML's well-formedness rules are designed to be understood by software rather than human beings, so quite a few of them are a little technical and won't present much of a problem in practice. The only source for the complete list of rules is the XML specification itself. However, if you follow the rules given here, and check your work with an XML parser such as Xerces before distributing your documents, they should be fine.
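As a rough illustration of checking your work with a parser, the sketch below uses Python's standard xml.etree.ElementTree module in place of Xerces. The helper name is_well_formed is made up for this example, and the test strings are the triangle tags discussed above, given matching end tags so that each forms a complete document.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        # The parser only reports the problem; it never repairs the markup.
        return False


print(is_well_formed("<triangle></triangle>"))    # name directly after <
print(is_well_formed("<triangle ></triangle >"))  # space before > is allowed
print(is_well_formed("< triangle></triangle>"))   # space after < is not
print(is_well_formed("<></>"))                    # zero-length element name
```

The first two calls should report True and the last two False, which matches the rules described above.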

Cross Reference: The XML specification itself is found in Appendix C. The formal syntax the XML specification uses is called Backus-Naur Form, or BNF for short. BNF grammars are an outgrowth of compiler theory that very formally defines what is and is not a syntactically correct program or, in the case of XML, a syntactically correct document. A parser can compare any document to the XML BNF grammar character by character and determine definitively whether or not it satisfies the rules of XML. There are no borderline cases. BNF grammars, properly written, leave no room for interpretation. The advantage of this should be obvious to anyone who's had to struggle with HTML documents that display in one browser but not in another.

As well as matching the BNF grammar, a well-formed XML document must also meet various well-formedness constraints that specify conditions that can't be easily described in the BNF syntax. Well-formedness is the minimum level that a document must achieve to be parsed. Appendix B provides an annotated description of the complete XML 1.0 BNF grammar as well as all of the well-formedness constraints.

XML Documents

An XML document is made up of text that's divided between markup and character data. It is a sequence of characters with a fixed length that adheres to certain constraints. It may or may not be a file. For instance, an XML document may be:

  • A CLOB field in an Oracle database
  • The result of a query against a database that combines several records from different tables
  • A data structure created in memory by a Java program
  • A data stream created on the fly by a CGI program written in Perl
  • Some combination of several different files, each of which is embedded in another
  • One part of a larger file containing several XML documents

However, nothing essential is lost if you think of an XML document as a file, as long as you keep in the back of your mind that it might not really be a file on a hard drive.

XML documents are made up of storage units called entities. Each entity contains either text or binary data, never both. Text data is composed of characters. Binary data is used for images and applets and the like.

Note: To use a concrete example, a raw HTML file that includes <IMG> tags is an entity but not a document. An HTML file plus all the pictures embedded in it with <IMG> tags is a complete document.

The XML declaration

In this and the next several chapters, I treat only simple XML documents that are made up of a single entity, the document itself. Furthermore, these documents only contain text data, not binary data such as images or applets. Such documents can be understood completely on their own without reading any other files. In other words, they stand alone. Such a document normally contains a standalone pseudo-attribute in its XML declaration with the value yes, similar to this one.

<?xml version="1.0" standalone="yes"?>

Note: I call this a pseudo-attribute because technically only elements can have attributes. The XML declaration is not an element. Therefore standalone is not an attribute even if it looks like one.

External entities and entity references can be used to combine multiple files and other data sources to create a single XML document. These documents cannot be parsed without reference to other files. Therefore, they normally have a standalone pseudo-attribute with the value no.

<?xml version="1.0" standalone="no"?>

If a document does not have an XML declaration, or if a document has an XML declaration but that XML declaration does not have a standalone pseudo-attribute, then the value no is assumed. That is, the document is assumed to be incapable of standing on its own, and the parser will prepare itself to read external pieces as necessary. If the document can, in fact, stand on its own, nothing is lost by the parser being ready to read an extra piece.
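To see how a parser reports the standalone pseudo-attribute, here is a minimal sketch using Python's standard xml.parsers.expat module. The function name standalone_flag is an invention for this example. Expat passes 1 for yes, 0 for no, and -1 when the pseudo-attribute is omitted; the sketch also returns -1 when there is no XML declaration at all.

```python
import xml.parsers.expat


def standalone_flag(document: str) -> int:
    """Return 1 for standalone="yes", 0 for "no", -1 if it was not given."""
    result = {"standalone": -1}  # stays -1 if no declaration is seen

    def on_xml_decl(version, encoding, standalone):
        # Called by expat once, when it parses the XML declaration.
        result["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(document, True)
    return result["standalone"]


print(standalone_flag('<?xml version="1.0" standalone="yes"?><GREETING/>'))
print(standalone_flag('<?xml version="1.0" standalone="no"?><GREETING/>'))
print(standalone_flag('<?xml version="1.0"?><GREETING/>'))
print(standalone_flag('<GREETING/>'))
```

A program that receives -1 would treat the document as if it had said no, in line with the default described above.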

XML documents do not have to include XML declarations, although they should unless you've got a specific reason not to include them. If an XML document does include an XML declaration, then this declaration must be the first thing in the file (except possibly for an invisible Unicode byte order mark). XML processors determine which character set is being used (UTF-8, big-endian Unicode, or little-endian Unicode) by reading the first several bytes of a file and comparing those bytes against various encodings of the string <?xml . Nothing should come before this, including white space. For instance, this line is not an acceptable way to start an XML file because of the extra spaces at the front of the line.

              <?xml version="1.0" standalone="yes"?>
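The effect of leading white space can be checked with a parser. This is a small sketch using Python's standard xml.etree.ElementTree; the variable names are chosen only for illustration.

```python
import xml.etree.ElementTree as ET

good = '<?xml version="1.0" standalone="yes"?><GREETING>Hello XML!</GREETING>'
bad = "   " + good  # three spaces ahead of the XML declaration

root = ET.fromstring(good)
print(root.tag)

try:
    ET.fromstring(bad)
    leading_space_accepted = True
except ET.ParseError as err:
    # The declaration is no longer the first thing in the document.
    leading_space_accepted = False
    print("rejected:", err)
```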

A document must have exactly one root element that completely contains all other elements.

An XML document has a root element that completely contains all other elements of the document. This is also sometimes called the document element, although this element does not have to have the name document or root. Root elements are delimited by a start tag and an end tag, just like any other element. For instance, consider Listing 1.

Listing 1: greeting.xml

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>

In this document, the root element is GREETING. The XML declaration is not an element. Therefore, it does not have to be included inside the root element. Similarly, other nonelement data in an XML document, such as an xml-stylesheet processing instruction, a DOCTYPE declaration, or comments, do not have to be inside the root element. But all other elements (other than the root itself) and all raw character data must be contained in the root element.
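The single-root rule can be sketched with Python's standard xml.etree.ElementTree. The helper parses and the three sample strings are made up for this example: comments may sit outside the root element, but a second root element or stray character data may not.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the text is accepted as a well-formed document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Comments are allowed before and after the root element.
one_root = "<!-- before --><GREETING>Hello XML!</GREETING><!-- after -->"
# Two top-level elements: there is no single root.
two_roots = "<GREETING>Hello</GREETING><GREETING>again</GREETING>"
# Raw character data outside the root element.
stray_text = "<GREETING>Hello</GREETING> and goodbye"

for sample in (one_root, two_roots, stray_text):
    print(parses(sample), sample)
```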

Text in XML

An XML document is made up of text. Text is made up of characters. A character is a letter, a digit, a punctuation mark, a space or tab, or some similar thing. XML uses the Unicode character set, which not only includes the usual letters and symbols from English and other Western European alphabets, but also the Cyrillic, Greek, Hebrew, Arabic, and Devanagari alphabets, as well as the most common Han ideographs for Chinese and Japanese and the Hangul syllables for Korean. For now, I'll stick to the English language, the Roman script, and the ASCII character set; but I'll introduce many alternatives in the next chapter.

A document's text is divided into character data and markup. To a first approximation, markup describes a document's logical structure, while character data provides the basic information of the document. For example, in Listing 1, <?xml version="1.0" standalone="yes"?>, <GREETING>, and </GREETING> are markup. Hello XML!, along with its surrounding white space, is the character data. A big advantage of XML over other formats is that it clearly separates the actual data of a document from its markup.

To be more precise, markup includes all tags, processing instructions, DTDs, entity references, character references, comments, CDATA section delimiters, and the XML declaration. Everything else is character data. However, this is tricky because when a document is processed some of the markup turns into character data. For example, the markup &gt; is turned into the greater than sign character (>). The character data that's left after the document is processed, and after all markup that refers to character data has been replaced by the actual character data, is called parsed character data, or PCDATA for short.
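A short sketch of markup turning into character data, using Python's standard xml.etree.ElementTree. The EXPR element name is invented for this example; the entity references are the predefined ones.

```python
import xml.etree.ElementTree as ET

# The source text contains three entity references (markup).
element = ET.fromstring("<EXPR>3 &gt; 2 &amp; 1 &lt; 2</EXPR>")

# After parsing, only the characters they stand for remain.
print(element.text)  # 3 > 2 & 1 < 2
```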

Elements and Tags

An XML document is a singly rooted hierarchical structure of elements. Each element is delimited by a start tag (also known as an opening tag) and an end tag (also known as a closing tag) or is represented by a single, empty element tag. An XML tag has the same form as an HTML tag. That is, start tags begin with a < followed by the name of the element the tags start, and they end with the first > after the opening < (for example, <GREETING>). End tags begin with a </ followed by the name of the element the tag finishes and are terminated by a > (for example, </GREETING>). Empty element tags begin with a < followed by the name of the element and are terminated with a /> (for example, <GREETING/>).

Element names

Every element has a name made up of one or more characters. This is the name included in the element's start and end tags. Element names begin with a letter such as y or A or an underscore _. Subsequent characters in the name may include letters, digits, underscores, hyphens, and periods. They cannot include other punctuation marks such as %, ^, or &. They cannot include white space. (The underscore often substitutes for white space.) Both lower- and uppercase letters may be used in XML names. In this book, I mostly follow the convention of making my names uppercase, mainly because this makes them stand out better in the text. However, when I'm using a tag set that was developed by other people it is necessary to adopt their case convention. For example, the following are legal XML start tags with legal XML names:

<HELP>
<Book>
<volume>
<heading1>
<section.paragraph>
<Mary_Smith>
<_8ball>

Note: Colons are also technically legal in tag names. However, these are reserved for use with namespaces. Namespaces allow you to mix and match XML applications that may use the same tag names. Chapter 13 introduces namespaces. Until then, you should not use colons in your tag names.

The following are not legal start tags because they don't contain legal XML names:

<Book%7>
<volume control>
<3heading>
<Mary Smith>
<.employee.salary>

Note: The rules for element names actually apply to names of many other things as well. The same rules are used for attribute names, ID attribute values, entity names, and a number of other constructs you'll encounter over the next several chapters.
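The two lists of tags above can be run through a parser as a quick check. This sketch uses Python's standard xml.etree.ElementTree and writes each name as an empty-element tag so that it forms a complete document; the helper accepts_as_tag is a name made up here.

```python
import xml.etree.ElementTree as ET

legal = ["HELP", "Book", "volume", "heading1",
         "section.paragraph", "Mary_Smith", "_8ball"]
illegal = ["Book%7", "volume control", "3heading",
           "Mary Smith", ".employee.salary"]


def accepts_as_tag(name: str) -> bool:
    """Return True if <name/> is accepted as a well-formed document."""
    try:
        ET.fromstring(f"<{name}/>")
        return True
    except ET.ParseError:
        return False


for name in legal + illegal:
    print(f"<{name}>", accepts_as_tag(name))
```

Note that the names containing a space fail for a slightly different reason: the parser reads the second word as an attribute that has no value.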

Every start tag must have a corresponding end tag

Web browsers are relatively forgiving if you forget to close an HTML tag. For instance, if you include a <B> tag in your document but no corresponding </B> tag, the entire document after the <B> tag will be made bold. However, the document will still be displayed.

XML is not so forgiving. Every nonempty tag -- that is, tags that do not end with /> -- must be closed with the corresponding end tag. If a document fails to close an element with the right end tag, the browser or renderer reports an error message and does not display any of the document's content in any form.

End tags have the same name as the corresponding start tag, but are prefixed with a / after the initial angle bracket. For example, if the start tag is <FOO> the end tag is </FOO>. These are the end tags for the previous set of legal start tags.

</HELP>
</Book>
</volume>
</heading1>
</section.paragraph>
</Mary_Smith>
</_8ball>

XML names are case sensitive. This is different from HTML in which <P> and <p> are the same tag, and a </p> can close a <P> tag. The following are not end tags for the set of legal start tags we've been discussing:

</help>
</book>
</Volume>
</HEADING1>
</Section.Paragraph>
</MARY_SMITH>
</_8BALL>
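Case sensitivity is easy to confirm with a parser. The sketch below uses Python's standard xml.etree.ElementTree; the element content is made up for illustration.

```python
import xml.etree.ElementTree as ET

matching = ET.fromstring("<Book>XML Bible</Book>")
print(matching.tag)

try:
    ET.fromstring("<Book>XML Bible</book>")  # end tag differs only in case
    case_mismatch_rejected = False
except ET.ParseError as err:
    case_mismatch_rejected = True
    print("rejected:", err)
```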

Empty element tags

Many HTML elements do not have closing tags. For example, there are no </LI>, </IMG>, </HR>, or </BR> tags in HTML. Some page authors do include </LI> tags after their list items, and some HTML tools also use </LI>. However, the HTML 4.0 standard specifically denies that this is required. Like all unrecognized tags in HTML, the presence of an unnecessary </LI> has no effect on the rendered output.

This is not the case in XML. The whole point of XML is to allow new elements and their corresponding tags to be discovered as a document is parsed. Thus, unrecognized tags may not be ignored. Furthermore, an XML processor must be capable of determining on the fly whether a tag it has never seen before does or does not have an end tag. It does this by looking for special empty-element tags that end in />.

Elements that are represented by a single tag without a closing tag are called empty elements because they have no content. Tags that represent empty elements are called empty-element tags. These empty element tags are closed with a slash and a closing angle bracket (/>); for example, <BR/> or <HR/>. From the perspective of XML, these are the same as the equivalent syntax using both start and end tags with nothing in between them; for example, <BR></BR> and <HR></HR>.

However, empty element tags can only be used when the element is truly empty, not when the end tag is simply omitted. For example, in HTML you might write an unordered list like this:

<UL>
<LI>I've a Feeling We're Not in Kansas Anymore
<LI>Buddies
<LI>Everybody Loves You
</UL>

In XML, you cannot simply replace the <LI> tags with <LI/> because the elements are not truly empty. Instead they contain text. In normal HTML the closing </LI> tag is omitted by the editor and filled in by the parser. This is not the same thing as the element itself being empty. The first LI element above contains the content I've a Feeling We're Not in Kansas Anymore. In XML, you must close these tags like this:

<UL>
<LI>I've a Feeling We're Not in Kansas Anymore</LI>
<LI>Buddies</LI>
<LI>Everybody Loves You</LI>
</UL>

On the other hand, a BR or HR or IMG element really is empty. It doesn't contain any text or child elements. Thus, in XML, you have two choices for these elements. You can either write them with a start and an end tag in which the end tag immediately follows the start tag -- for example, <HR></HR> -- or you can write them with an empty element tag as in <HR/>.
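As a sketch of this equivalence, the following uses Python's standard xml.etree.ElementTree to parse both spellings of an empty HR element and compare what the parser hands back. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<HR/>")
long_form = ET.fromstring("<HR></HR>")

# Both spellings yield an element with the same name, no text, no children.
same = (
    short_form.tag == long_form.tag
    and short_form.text == long_form.text
    and len(short_form) == len(long_form) == 0
)
print(same)

# A list item, by contrast, is not empty: it carries character data.
item = ET.fromstring("<LI>Buddies</LI>")
print(item.text)
```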

Note: Current Web browsers deal inconsistently with empty element tags. For instance, some browsers will insert a line break when they see a <HR/> tag and some won't. Furthermore, the problem may arise even without empty element tags. Some browsers insert two horizontal lines when they see <HR></HR> and some insert one horizontal line. The most generally compatible scheme is to use an extra attribute before the closing />. The class attribute is often a good choice -- for example, <HR CLASS="empty"/>. XSLT offers a few more ways to maintain compatibility with legacy browsers. Chapter 17 discusses these methods.

Elements may nest but may not overlap

Elements may contain (and indeed often do contain) other elements. However, elements may not overlap. Practically, this means that if an element contains a start tag for an element, it must also contain the corresponding end tag. Conversely, an element may not contain an end tag without its matching start tag. For example, this is legal XML.

<H1><CITE>What the Butler Saw</CITE></H1>

However, the following is not legal XML because the closing </CITE> tag comes before the closing </H1> tag:

<H1><CITE>What the Butler Saw</H1></CITE>

Most HTML browsers can handle this case with ease. However, XML browsers are required to report an error for this construct.
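The nested and overlapping forms can be compared with a parser. This is a minimal sketch using Python's standard xml.etree.ElementTree, with variable names chosen only for this example.

```python
import xml.etree.ElementTree as ET

nested = "<H1><CITE>What the Butler Saw</CITE></H1>"
overlapping = "<H1><CITE>What the Butler Saw</H1></CITE>"

heading = ET.fromstring(nested)
print(heading[0].tag, heading[0].text)  # the CITE child and its text

try:
    ET.fromstring(overlapping)
    overlap_rejected = False
except ET.ParseError as err:
    # The parser expected </CITE> but found </H1>.
    overlap_rejected = True
    print("rejected:", err)
```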

Empty element tags may appear anywhere, of course. For example,

<PLAYWRIGHTS>Oscar Wilde<HR/>Joe Orton</PLAYWRIGHTS>

This implies that for all nonroot elements, there is exactly one other element that contains the element, but which does not contain any other element containing the element. This immediate container is called the parent of the element. The contained element is called the child of the parent element. Thus each nonroot element always has exactly one parent, but a single element may have an indefinite number of children or no children at all.

Consider Listing 2. The root element is the PLAYS element. This contains two PLAY children. Each PLAY element contains three children: TITLE, AUTHOR, and YEAR. Each of these contains only character data, not more children.

Listing 2: Parents and Children

<?xml version="1.0" standalone="yes"?>
<PLAYS>
  <PLAY>
    <TITLE>What the Butler Saw</TITLE>
    <AUTHOR>Joe Orton</AUTHOR>
    <YEAR>1969</YEAR>
  </PLAY>
  <PLAY>
    <TITLE>The Ideal Husband</TITLE>
    <AUTHOR>Oscar Wilde</AUTHOR>
    <YEAR>1895</YEAR>
  </PLAY>
</PLAYS>

In programmer terms, this means that XML documents form a tree. It starts from the root and gradually bushes out to the leaves on the ends. Trees have a number of nice properties that make them congenial to programmatic traversal, although this doesn't matter so much to you as the author of the document.

Note: Trees are more commonly drawn from the top down. That is, the root of the tree is shown at the top of the picture rather than the bottom. While this looks less like a real tree, it doesn't affect the topology of the data structure in the least.
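For readers who do want to traverse the tree programmatically, here is a brief sketch using Python's standard xml.etree.ElementTree on the text of Listing 2. The outline function is a hypothetical helper written for this example; it walks from the root down to the leaves and indents each element name by its depth.

```python
import xml.etree.ElementTree as ET

LISTING_2 = """<?xml version="1.0" standalone="yes"?>
<PLAYS>
  <PLAY>
    <TITLE>What the Butler Saw</TITLE>
    <AUTHOR>Joe Orton</AUTHOR>
    <YEAR>1969</YEAR>
  </PLAY>
  <PLAY>
    <TITLE>The Ideal Husband</TITLE>
    <AUTHOR>Oscar Wilde</AUTHOR>
    <YEAR>1895</YEAR>
  </PLAY>
</PLAYS>"""


def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:  # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(LISTING_2)
print("\n".join(outline(root)))

titles = [play.findtext("TITLE") for play in root.findall("PLAY")]
print(titles)
```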

Attributes

Elements may optionally have attributes. Each attribute of an element is encoded in the start tag of the element as a name-value pair separated by an equals sign (=) and, optionally, some extra white space. The attribute value is enclosed in single or double quotes. For example,

<GREETING LANGUAGE="English">
  Hello XML!
  <MOVIE SRC = 'WavingHand.mov'/>
</GREETING>

Here the GREETING element has a LANGUAGE attribute that has the value English. The MOVIE element has an SRC attribute with the value WavingHand.mov.

Attribute names

Attribute names are strings that follow the same rules as element names. That is, attribute names must contain one or more characters, and the first character must be a letter or the underscore (_). Subsequent characters in the name may include letters, digits, underscores, hyphens, and periods. They may not include white space or other punctuation marks.

The same element may not have two attributes with the same name. For example, this is illegal:

<RECTANGLE SIDE="8" SIDE="10"/>

Attribute names are case sensitive. The SIDE attribute is not the same as the side or the Side attribute. Therefore, the following is legal:

<BOX SIDE="8" side="10" Side="31"/>

However, this is extremely confusing, and I strongly urge you not to write markup that depends on case.
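Both cases can be tried against a parser. This sketch uses Python's standard xml.etree.ElementTree; the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

try:
    ET.fromstring('<RECTANGLE SIDE="8" SIDE="10"/>')
    duplicate_rejected = False
except ET.ParseError as err:
    duplicate_rejected = True
    print("rejected:", err)

# Names that differ only in case are three separate attributes.
box = ET.fromstring('<BOX SIDE="8" side="10" Side="31"/>')
print(sorted(box.attrib.items()))
```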

Attribute values

Attribute values are strings. Even when the string shows a number, as in the LENGTH attribute below, that number is the two characters 7 and 2, not the binary number 72.

<RULE LENGTH="72"/>

If you're writing a program to process XML, you'll need to convert the string to a number before performing arithmetic on it.
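A minimal sketch of that conversion, using Python's standard xml.etree.ElementTree. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

rule = ET.fromstring('<RULE LENGTH="72"/>')

raw = rule.get("LENGTH")  # the parser hands back a string
length = int(raw)         # convert before doing arithmetic

print(raw + "1")    # string concatenation: 721
print(length + 1)   # integer addition: 73
```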

Unlike attribute names, there are few limits on the content of an attribute value. Attribute values may contain white space, begin with a number, or contain any punctuation characters (except, sometimes, for single and double quotes). The only characters an attribute value may not contain literally are the left angle bracket < and the ampersand &, though these can be included using the &lt; and &amp; entity references (discussed soon). The right angle bracket > is allowed as is, although it can also be written as &gt;.

XML attribute values are delimited by quote marks. Unlike HTML attribute values, XML attribute values must be enclosed in quotes whether or not the attribute value includes spaces. For example:

<A HREF="http://www.ibiblio.org/">IBiblio</A>
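The quoting requirement can be confirmed with a parser. This sketch uses Python's standard xml.etree.ElementTree; the unquoted variant is written the way HTML would tolerate it.

```python
import xml.etree.ElementTree as ET

quoted = '<A HREF="http://www.ibiblio.org/">IBiblio</A>'
unquoted = '<A HREF=http://www.ibiblio.org/>IBiblio</A>'

link = ET.fromstring(quoted)
print(link.get("HREF"))

try:
    ET.fromstring(unquoted)
    unquoted_rejected = False
except ET.ParseError as err:
    unquoted_rejected = True
    print("rejected:", err)
```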

Most people choose double quotes. However, you can also use single quotes, which is useful if the attribute value itself contains a double quote. For example:

<IMG SRC="sistinechapel.jpg"
     ALT='And God said, "Let there be light,"
          and there was light'/>

If the attribute value contains both single and double quotes, then the one that's used to delimit the string must be replaced with the proper entity reference wherever it appears inside the value. I generally just go ahead and replace both, which is always legal. For example,

<RECTANGLE LENGTH='8&apos;7&quot;' WIDTH="10&apos;6&quot;"/>

The entity reference &apos; stands for a single quote (an apostrophe) and &quot; for a double quote, so a longer value that needs both can use them freely. For example,

<PARAM NAME="joke" VALUE="The diner said,
     &quot;Waiter, There&apos;s a fly in my 
     soup!&quot;">
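To see what a program receives once these references are resolved, here is a short sketch using Python's standard xml.etree.ElementTree on the RECTANGLE example. The variable names are chosen only for this example.

```python
import xml.etree.ElementTree as ET

source = """
<RECTANGLE LENGTH='8&apos;7&quot;' WIDTH="10&apos;6&quot;"/>
"""

rectangle = ET.fromstring(source.strip())

# The parser replaces &apos; and &quot; with the literal quote characters.
print(rectangle.get("LENGTH"))  # 8'7"
print(rectangle.get("WIDTH"))   # 10'6"
```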
