Making Mistakes with XML
Since the first publication of the XML 1.0 Recommendation in 1998, hundreds of applications, data models, document formats, tools, specifications, libraries, references, tutorials, books, papers, excitement, enthusiasm, and energy have exploded onto the scene, making this relatively simple idea one of the most important developments in the computer industry since the microprocessor.
But it hasn't always been a smooth ride. It's just as easy to misuse and abuse XML as it is to get it right. In this article, I discuss what I feel are the top ten mistakes you can make with XML.
10. Failing to Parse Your Own Documents
With XML pervading configuration files and document formats, we're still hand-editing XML as much as dealing with the machine-generated variety. But changing attributes and adding elements to an XML configuration file is an invitation to omit a quote or to forget to escape an ampersand. And nothing is more frustrating than not discovering the problem until after the slow startup time of the program that uses the configuration file, or until after you get a message back from a distant colleague saying the file you sent doesn't work.
And yet running a quick and easy "lint check" of an XML file is, well, quick and easy! Many Linux systems as well as Mac OS X include the xmllint tool, and it can be compiled for Windows. For example, you might check the Tomcat Web server configuration file before starting it up as follows:
% xmllint --noout /Volumes/web/conf/server.xml
/Volumes/web/conf/server.xml:121: parser error: Unescaped '<' not allowed in attribute values
The --noout option tells xmllint not to write the parsed XML file to the screen, which it normally does if it finds no errors—in other words, no news is good news. You should run xmllint on any XML file you edit by hand, be it an XHTML document, configuration file, DocBook fragment, or otherwise.
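When xmllint isn't at hand, the same well-formedness check can be approximated with a scripting language's built-in parser. The sketch below assumes Python's standard xml.etree.ElementTree module. check_well_formed is a hypothetical helper name, not part of any tool mentioned above, and its error wording will differ from xmllint's.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message already carries the line and column of the failure.
        return str(err)
    return None


# A well-formed fragment, then the three hand-editing slips described above
# (the element and attribute names are made up for illustration).
print(check_well_formed("<Connector port='8080'/>"))
print(check_well_formed("<Connector name='a<b'/>"))       # unescaped '<'
print(check_well_formed("<note>Q&A</note>"))              # unescaped ampersand
print(check_well_formed("<Connector port='8080></Connector>"))  # missing quote
```

As with xmllint, no news is good news: the helper returns None for a clean document and a message only when something is wrong.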
9. Using Characters That Can't Be Handled
One of the things that makes XML so successful is that it supports internationalization as a matter of course. Characters in XML come from the Unicode character set, which supports a huge number of languages. As a result, XML documents can include Armenian, Devanagari, Cherokee, Mongolian, Deseret, and Cypriot, all in the same document.
However, XML is often used for interchange between systems. And, although XML may support an enormous variety of characters, target systems might not. For example, suppose you want to transfer the contents of a database from a system using ISO-8859-1 (Latin 1) into another system using US-ASCII. It's not enough to mark the XML document's encoding as follows:
<?xml version='1.0' encoding='US-ASCII'?>
The encoding in the XML declaration merely states what the document's encoding is. Nothing prevents a US-ASCII document from containing numeric character references, such as:
<product id='x45'>Caf&#233; avec cr&#232;me</product>
And while I prefer my café black instead of with crème, a US-ASCII database is going to choke on coffee with such accented characters. (Also, note that even if you've got full Unicode-supporting systems on each end of a data transfer, XML itself supports only a well-defined subset of Unicode.)
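One way to catch this before the transfer is to parse the document and test every piece of character data against the target encoding. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and Python's codec name "ascii" for US-ASCII. unencodable_text is a hypothetical helper, and it checks only element text and attribute values, not element names, comments, or processing instructions.

```python
import xml.etree.ElementTree as ET


def unencodable_text(xml_bytes, target_encoding):
    """List (tag, character) pairs the target encoding cannot represent."""
    problems = []
    root = ET.fromstring(xml_bytes)
    for elem in root.iter():
        # Character references are already expanded by the parser, so
        # &#233; arrives here as the single character U+00E9.
        chunks = [elem.text or "", elem.tail or "", *elem.attrib.values()]
        for chunk in chunks:
            for ch in chunk:
                try:
                    ch.encode(target_encoding)
                except UnicodeEncodeError:
                    problems.append((elem.tag, ch))
    return problems


doc = (b"<?xml version='1.0' encoding='US-ASCII'?>"
       b"<product id='x45'>Caf&#233; avec cr&#232;me</product>")

print(unencodable_text(doc, "ascii"))    # reports the two accented letters
print(unencodable_text(doc, "latin-1"))  # empty list: Latin 1 can hold both
```

The document itself is pure US-ASCII on the wire, yet the data it carries is not, which is exactly the trap described above.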
8. Using Named Character Entities
XML clearly shows its SGML roots with features such as doctypes, public and system identifiers, and character entities. Named character entities are, at first glance, a nice feature that enables you to assign a name to a sequence of text once, and have that text appear wherever the name is used. Here's an example:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN'
  'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd' [
  <!ENTITY prodname 'Y-Box 361'>
]>
<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
  <head>
    <title>&prodname;&reg; User Manual</title>
  </head>
  <body>
    <p>Congratulations on your new &prodname;! ...
Here, both in the title and in the p, the product name "Y-Box 361" will appear, and we add a registered trademark symbol in the title, too. This enables the company producing this document to change the name of the product easily (should it be discovered to sound too much like an existing game console from Microsoft, for example).
What's the problem?
- To define entities, you've got to have a DOCTYPE declaration. But, the DOCTYPE declaration is one of the most loathed parts of XML, and there's repeated talk of removing it altogether from future versions of XML. In fact, Norman Walsh created a leaner, meaner version of XML 1.1 called XML Kernel (or XMLK). The differences between XMLK and XML? Only one: no doctypes.
- Certain applications of XML forbid the use of the DOCTYPE declaration. SOAP is one such example.
- In order to expand entities, an XML parser must be able to parse a document type definition (DTD), whether internal in the document or external in the referenced DTD. However, lightweight XML parsers, such as those in mobile phones or embedded computers, often forgo such code in order to save space.
- The DTD referenced in the DOCTYPE declaration may define entities as well (such as the definition for &reg; used in the title element above). But if the parser can't retrieve xhtml1-strict.dtd because the network's down, it will have no idea what text to put in place of &reg;.
Avoid named entities. (For that matter, avoid doctypes!)
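The difference is easy to see with a parser that never reads an external DTD. The sketch below assumes Python's standard xml.etree.ElementTree module, which behaves that way, and swaps the named entity &reg; for the numeric character reference &#174; (the same registered trademark sign, U+00AE). parsed_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parsed_text(xml_text):
    """Return the root element's text, or None if the parser rejects it."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None


# &reg; is defined only in a DTD, so with no DTD available it is undefined.
named = "<title>Y-Box 361&reg; User Manual</title>"

# &#174; needs no declaration at all; every XML parser can expand it.
numeric = "<title>Y-Box 361&#174; User Manual</title>"

print(parsed_text(named))    # None: the document is rejected
print(parsed_text(numeric))  # the title, with the trademark sign in place
```

A numeric reference is less readable than a name, but it works with no DOCTYPE, no network, and no DTD-aware parser.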
7. Describing Resources with XML
XML is often used to express metadata. Metadata is, of course, data about data. An MP3 music file contains the compressed audio of your favorite new Ashlee Simpson track, "L.O.V.E." from the new I Am Me album, a "pop" music hit that's 2 minutes and 33 seconds in length. The data is the MP3 file, whereas the artist name, track name, album, genre, and music length are all metadata. You might be tempted to define a brand new XML vocabulary with elements called artist, album, and so forth, for your music library management system.
There's already another technology out there that's ideal for such tasks, called the Resource Description Framework, or RDF. RDF lets you make metadata statements such as "artist is Ashlee Simpson" and do so in a way that's extensible and compatible with other RDF systems. By using RDF, you can avoid making your own new XML language, which brings us to mistake #6 …
6. Making New XML Languages
XML brings the power of language syntax to the masses. Gone are the days of writing lexical scanners from scratch. All you have to do now is pick your elements and attributes and suddenly you've got a new XML-based language for whatever the task at hand may be.
The question is: Should you?
There are about 1000 XML languages in the wild, and chances are someone else has already tackled the problem you'd want to solve with a fresh language. By reusing an existing language, you improve the chance that existing applications will interoperate with your own. You save the time (and money) of weighing options that someone else has already considered. And "not invented here" is never a good reason to reinvent. Tim Bray, one of the authors of the original XML Specification, has even more to say.
But, if you absolutely have to design an XML language from scratch, read on!
5. Overusing Elements, Underusing Attributes
XML evolved from a world of document markup where marking elements with tags in mixed content, <example>like <emphasis>this</emphasis></example>, is commonplace. Yet for some reason, many people carry that habit over when they use XML for data interchange. Consider this (hypothetical) XML-based procedure call:
<?xml version='1.0' encoding='UTF-8'?>
<procCall>
  <target>server</target>
  <function>lookup</function>
  <parameters>
    <parameter>
      <param-name>id</param-name>
      <param-value>x73</param-value>
    </parameter>
    <parameter>
      <param-name>type</param-name>
      <param-value>product</param-value>
    </parameter>
  </parameters>
</procCall>
At 362 characters, that's a really verbose procedure call. By using attributes instead, you can streamline it:
<?xml version='1.0' encoding='UTF-8'?>
<procCall target='server' function='lookup'>
  <parameter name='id' value='x73'/>
  <parameter name='type' value='product'/>
</procCall>
That comes out to 175 characters, less than half. It's easier to type by hand and easier to read, too. Prefer using attributes over elements when order doesn't matter (elements mandate ordering, attributes don't), when the attribute values are small, and when there's no additional markup in the values. Otherwise, go ahead and use elements.
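To confirm that nothing is lost in the shorter form, both documents can be read back into the same data. This is a sketch only, assuming Python's standard xml.etree.ElementTree module, two-space indentation, and no trailing newline. The two reader functions are hypothetical names invented for this illustration.

```python
import xml.etree.ElementTree as ET

VERBOSE = """\
<?xml version='1.0' encoding='UTF-8'?>
<procCall>
  <target>server</target>
  <function>lookup</function>
  <parameters>
    <parameter>
      <param-name>id</param-name>
      <param-value>x73</param-value>
    </parameter>
    <parameter>
      <param-name>type</param-name>
      <param-value>product</param-value>
    </parameter>
  </parameters>
</procCall>"""

COMPACT = """\
<?xml version='1.0' encoding='UTF-8'?>
<procCall target='server' function='lookup'>
  <parameter name='id' value='x73'/>
  <parameter name='type' value='product'/>
</procCall>"""


def call_from_elements(xml_text):
    """Read (target, function, parameters) from the element-heavy form."""
    root = ET.fromstring(xml_text)
    params = {p.findtext("param-name"): p.findtext("param-value")
              for p in root.iter("parameter")}
    return root.findtext("target"), root.findtext("function"), params


def call_from_attributes(xml_text):
    """Read (target, function, parameters) from the attribute-based form."""
    root = ET.fromstring(xml_text)
    params = {p.get("name"): p.get("value") for p in root.iter("parameter")}
    return root.get("target"), root.get("function"), params


# Both forms describe the same call; only the size differs.
print(call_from_elements(VERBOSE) == call_from_attributes(COMPACT))
print(len(VERBOSE), len(COMPACT))
```

The attribute-based reader is also shorter, since each value is a single lookup on the element rather than a search for a child.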