http://www.developer.com/


Making Mistakes with XML


February 6, 2006

Since the first publication of the XML 1.0 Recommendation in 1998, hundreds of applications, data models, document formats, tools, specifications, libraries, references, tutorials, books, papers, excitement, enthusiasm, and energy have exploded onto the scene, making this relatively simple idea one of the most important developments in the computer industry since the microprocessor.

But it hasn't always been a smooth ride. It's just as easy to misuse and abuse XML as it is to get it right. In this article, I discuss what I feel are the top ten mistakes you can make with XML.

10. Failing to Parse Your Own Documents

With XML pervading configuration files and document formats, we're still hand-editing XML as much as dealing with the machine-generated variety. But changing attributes and adding elements to an XML configuration file is an invitation to omit a quote or to forget to escape an ampersand. And nothing is more frustrating than not discovering the problem until after the slow startup time of the program that uses the configuration file, or until after you get a message back from a distant colleague saying the file you sent doesn't work.

And yet running a quick and easy "lint check" of an XML file is, well, quick and easy! Many Linux systems as well as Mac OS X include the xmllint tool, and it can be compiled for Windows. For example, you might check the Tomcat Web server configuration file before starting it up as follows:

% xmllint --noout /Volumes/web/conf/server.xml
/Volumes/web/conf/server.xml:121: parser error:
   Unescaped '<' not allowed in attribute values

The --noout option tells xmllint not to write the parsed XML file to the screen, which it normally does if it finds no errors—in other words, no news is good news. You should run xmllint on any XML file you edit by hand, be it an XHTML document, configuration file, DocBook fragment, or otherwise.
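
If xmllint isn't available, the same well-formedness check can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree module. The function name check_well_formed and the two sample strings are made up for illustration; this is a minimal sketch, not a replacement for xmllint.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a short error description."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # err.position is a (line, column) tuple reported by the parser
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


# Hypothetical configuration fragments, one good and one broken
good = "<Server port='8005'><Service name='Catalina'/></Server>"
bad = "<Server port='8005'><Service name='a<b'/></Server>"

print(check_well_formed(good))  # prints None: no news is good news
print(check_well_formed(bad))   # prints the parser's complaint
```

Like xmllint without a DTD or schema, this only checks that the file is well-formed, not that it's a valid configuration for the program that reads it.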

9. Using Characters That Can't Be Handled

One of the things that makes XML so successful is that it supports internationalization as a matter of course. Characters in XML come from the Unicode character set which supports a huge number of languages. As a result, XML documents can include Armenian, Devanagari, Cherokee, Mongolian, Deseret, and Cypriot—all in the same document.

However, XML is often used for interchange between systems. And, although XML may support an enormous variety of characters, target systems might not. For example, suppose you want to transfer the contents of a database from a system using ISO-8859-1 (Latin 1) into another system using US-ASCII. It's not enough to mark the XML document's encoding as follows:

<?xml version='1.0' encoding='US-ASCII'?>

The encoding in the XML declaration merely states what the document's own encoding is. There's nothing to prevent a US-ASCII document from containing numeric character references, such as:

<product id='x45'>Caf&#x00e9; avec cr&#x00e8;me</product>

And while I prefer my café black instead of with crème, a US-ASCII database is going to choke on coffee with such accented characters. (Also, note that even if you've got full Unicode-supporting systems on each end of a data transfer, XML itself supports only a well-defined subset of Unicode.)
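
To see why the declaration alone isn't enough, here's a minimal Python sketch (standard library only) that parses a US-ASCII document and then checks whether the text it actually carries would fit in a US-ASCII target. The helper name fits_in_ascii is hypothetical.

```python
import xml.etree.ElementTree as ET

# The bytes on the wire are pure US-ASCII...
doc = (b"<?xml version='1.0' encoding='US-ASCII'?>"
       b"<product id='x45'>Caf&#x00e9; avec cr&#x00e8;me</product>")

product = ET.fromstring(doc)


def fits_in_ascii(text):
    """True if every character of text can be stored as US-ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


# ...but the parsed content is not.
print(fits_in_ascii(doc.decode("ascii")))  # True
print(fits_in_ascii(product.text))         # False
```

A check like this on the sending side, before the transfer, is one way to catch the problem while there's still a chance to transliterate or reject the offending values.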

8. Using Named Character Entities

XML clearly shows its SGML roots with features such as doctypes, public and system identifiers, and character entities. Named character entities are, at first glance, a nice feature that enables you to assign a name to a piece of text once, and have that text appear wherever the name is used. Here's an example:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.0 Strict//EN'
   'http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd' [
<!ENTITY prodname 'Y-Box 361'>
]>
<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
   <head>
      <title>&prodname;&reg; User Manual</title>
   </head>
   <body>
      <p>Congratulations on your new &prodname;! ...

Here, both in the title and in the p, the product name "Y-Box 361" will appear, and we add a registered trademark symbol in the title, too. This enables the company producing this document to change the name of the product easily (should it be discovered to sound too much like an existing game console from Microsoft, for example).

What's the problem?

  • To define entities, you've got to have a DOCTYPE declaration. But, the DOCTYPE declaration is one of the most loathed parts of XML, and there's repeated talk of removing it altogether from future versions of XML. In fact, Norman Walsh created a leaner, meaner version of XML 1.1 called XML Kernel (or XMLK). The differences between XMLK and XML? Only one: no doctypes.
  • Certain applications of XML forbid the use of the DOCTYPE declaration. SOAP is one such example.
  • In order to expand entities, an XML parser must be able to parse a document type definition (DTD), whether internal in the document or external in the referenced DTD. However, lightweight XML parsers, such as those in mobile phones or embedded computers, often forgo such code in order to save space.
  • The DTD referenced in the DOCTYPE declaration may define entities as well (such as the definition for &reg; used in the title element above). But, if the parser can't retrieve xhtml1-strict.dtd because the network's down, it will have no idea what text to put in place of &reg;.

Avoid named entities. (For that matter, avoid doctypes!)
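
The last point in the list is easy to reproduce. In this sketch, Python's standard xml.etree.ElementTree parser (which doesn't fetch external DTDs) gives up on &reg; but handles the equivalent numeric character reference without any DOCTYPE at all. The sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET

with_entity = "<title>Y-Box 361&reg; User Manual</title>"
with_char_ref = "<title>Y-Box 361&#xAE; User Manual</title>"

try:
    ET.fromstring(with_entity)
    entity_parsed = True
except ET.ParseError as err:
    # No DTD was read, so the parser has no definition for &reg;
    entity_parsed = False
    print("named entity failed:", err)

# A numeric character reference needs no declaration
title = ET.fromstring(with_char_ref)
print(title.text)
```

Numeric character references cover symbols like the registered trademark sign; they don't replace the text-substitution use of entities such as &prodname;, which is better handled by whatever tool generates the document.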

7. Describing Resources with XML

XML is often used to express metadata. Metadata is, of course, data about data. An MP3 music file contains the compressed audio of your favorite new Ashlee Simpson track, "L.O.V.E." from the new I Am Me album, a "pop" music hit that's 2 minutes and 33 seconds in length. The data is the MP3 file, whereas the artist name, track name, album, genre, and music length are all metadata. You might be tempted to define a brand new XML vocabulary with elements called artist, album, and so forth, for your music library management system.

Don't!

There's already another technology out there that's ideal for such tasks, called the Resource Description Framework, or RDF. RDF lets you make metadata statements such as "artist is Ashlee Simpson" and do so in a way that's extensible and compatible with other RDF systems. By using RDF, you can avoid making your own new XML language, which brings us to mistake #6 …

6. Making New XML Languages

XML brings the power of language syntax to the masses. Gone are the days of writing lexical scanners from scratch. All you have to do now is pick your elements and attributes and suddenly you've got a new XML-based language for whatever the task at hand may be.

The question is: Should you?

There are about 1000 XML languages in the wild, and chances are someone else has already tackled what you'd want to tackle with a fresh language. By reusing, you gain a chance that existing applications will interoperate with your own applications. You save time (and money) from having to consider options that someone else has already considered. And, "not invented here" is never a good reason to re-invent. Tim Bray, one of the authors of the original XML Specification, has even more to say.

But, if you absolutely have to design an XML language from scratch, read on!

5. Overusing Elements, Underusing Attributes

XML evolved from a world of document markup where marking elements with tags in mixed content, <example>like <emphasis>this</emphasis></example>, is commonplace. Yet for some reason, many people take that notion into XML when used for data interchange. Consider this (hypothetical) XML-based procedure call:

<?xml version='1.0' encoding='UTF-8'?>
<procCall>
   <target>server</target>
   <function>lookup</function>
   <parameters>
      <parameter>
         <param-name>id</param-name>
         <param-value>x73</param-value>
      </parameter>
      <parameter>
         <param-name>type</param-name>
         <param-value>product</param-value>
      </parameter>
   </parameters>
</procCall>

At 362 characters, that's a really verbose procedure call. By using attributes instead, you can streamline it:

<?xml version='1.0' encoding='UTF-8'?>
<procCall target='server' function='lookup'>
   <parameter name='id' value='x73'/>
   <parameter name='type' value='product'/>
</procCall>

That comes out to 175 characters, less than half. It's easier to type by hand and easier to read, too. Prefer using attributes over elements when order doesn't matter (elements mandate ordering, attributes don't), when the attribute values are small, and when there's no additional markup in the values. Otherwise, go ahead and use elements.
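
The attribute-based form is also short to consume. As a rough sketch, here's how a receiver might unpack it with Python's standard xml.etree.ElementTree module. The function name parse_proc_call is hypothetical; the element and attribute names are the ones from the example above.

```python
import xml.etree.ElementTree as ET

CALL = b"""<?xml version='1.0' encoding='UTF-8'?>
<procCall target='server' function='lookup'>
   <parameter name='id' value='x73'/>
   <parameter name='type' value='product'/>
</procCall>"""


def parse_proc_call(xml_bytes):
    """Return (target, function, parameters) from a procCall document."""
    root = ET.fromstring(xml_bytes)
    # Each parameter is a name/value attribute pair; collect them in a dict
    params = {p.get("name"): p.get("value") for p in root.findall("parameter")}
    return root.get("target"), root.get("function"), params


print(parse_proc_call(CALL))
```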

4. Designing a Language that Resists Change

So, you've decided to design your own XML language instead of re-using one. You've made a data model, picked names of elements and attributes, determined the minimal set of the vocabulary you need to solve your problem, even managed the entire process. It's hard work, but you've done it, released it, and people love it.

Now they want version 2.

Although making a new XML language is hard work, it's even harder to design one that can evolve over time and be backwards-compatible with earlier incarnations. Making a language that's resilient to change up-front will pay off and ensure the survivability of the language and applications based on it.

One approach is to include attributes such as version and compatible-with in your elements. The values of these attributes can indicate what version of your language an element comes from, and what versions an element is compatible with. This affords applications an opportunity to take some more intelligent action based on what they receive and what they generate. Another approach is to leverage the "must-ignore/must-understand" design pattern (although it suffers some shortcomings).
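
As a minimal sketch of the first approach, an application could compare the versions an element claims against the ones it understands before acting on it. Everything here is an assumption made for illustration: the catalog element, the space-separated format of compatible-with, the default of 1.0 when no version is given, and the SUPPORTED set aren't part of any real vocabulary.

```python
import xml.etree.ElementTree as ET

# Versions of the (hypothetical) language this application understands
SUPPORTED = {"1.0", "1.1"}


def can_process(element):
    """True if the element's version, or any version it says it is
    compatible with, is one this application supports."""
    # Assumption: an element without a version attribute is version 1.0
    claimed = {element.get("version", "1.0")}
    # Assumption: compatible-with holds a space-separated list of versions
    claimed.update(element.get("compatible-with", "").split())
    return bool(claimed & SUPPORTED)


new_but_compatible = ET.fromstring(
    "<catalog version='2.0' compatible-with='1.1 1.0'/>")
new_and_incompatible = ET.fromstring("<catalog version='3.0'/>")

print(can_process(new_but_compatible))    # True
print(can_process(new_and_incompatible))  # False
```

An application that gets False back can skip the element, fall back to a default, or report the mismatch, rather than failing on markup it doesn't recognize.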

3. Storing Images in XML

Compression in file formats such as PNG and JPG can work wonders on images, turning millions of bytes into mere thousands. And yet, those same compact bytes are often converted into the printable base-64 format, expanding the size by 33%, just to put them into an XML document. And, with so much XML still being hand-edited, the sheer girth of a base-64 encoded image is enough to crash even the most hardy of text editors. (DOM parsers have been rumored to explode while ingesting base-64 encoded video clips.)
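
The 33% figure falls straight out of the encoding: base-64 turns every 3 bytes of input into 4 printable characters. A quick check in Python, using an arbitrary 3,000-byte stand-in for image data:

```python
import base64

image_bytes = bytes(3000)  # stand-in for 3,000 bytes of image data
encoded = base64.b64encode(image_bytes)

# 3,000 bytes in, 4,000 characters out
growth = len(encoded) / len(image_bytes) - 1
print(len(image_bytes), len(encoded), f"{growth:.0%}")
```

Line breaks added to keep the encoded text readable push the overhead slightly higher still.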

XML documents are great at containing characters, but lousy at containing binary data like images, audio, video, and other media. People often try to shoehorn such media into XML with base-64 encoding to avoid metadata in the XML from being separated from the data (the media) it's describing.

Technologies exist, though, that obviate such needs. XLink can describe various kinds of links to other media. MIME messages can contain multiple parts: One part may be the XML document, the other parts the media; SOAP with Attachments uses just such a technique. (Unparsed entities don't count because they require the doctype declaration; refer to point #8.)

2. Storing XML in XML

XML is nice in that it combines metadata with data. Element and attribute names identify the meanings of their contained values: <measurement units='celsius'>23.3</measurement> is clearly a measurement of 23.3 degrees Celsius. But, when the value to be stored in XML is itself XML, things get tricky. You see this most often in RSS feeds, where XHTML markup goes into the site summary in escaped form, such as:

<title>Whither XML?</title>
<link>http://bitterness.cx/blog/938182/</link>
<description>In today's &lt;em&gt;rant&lt;/em&gt;, guest blogger
   &lt;strong&gt;Ken Doe&lt;/strong&gt; gets down and dirty with everyone's
   favorite whipping-tech: &lt;code&gt;XML&lt;/code&gt;.
</description>

Sometimes the XHTML is enclosed in a CDATA section instead of being escaped; the end result is the same. Why is this bad? For one thing, there's nothing more than mere convention (in this RSS example) that says that the content of an element such as description is in fact markup to be unescaped and rendered. And, in general, it hides the contained XML that tools could otherwise manipulate and make use of. In addition, blindly escaping such content or enclosing it in a CDATA section does not address text encoding issues: The escaped content might be in ISO-8859-4, whereas the XML document is in MacRoman.
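
A short Python sketch shows what "hiding" means in practice. To the parser, escaped markup is just a string, whereas markup used directly shows up as elements that tools can walk. The two description elements are made-up samples.

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring(
    "<description>Today's &lt;em&gt;rant&lt;/em&gt;</description>")
direct = ET.fromstring(
    "<description>Today's <em>rant</em></description>")

# Escaped: no child elements, only text that happens to contain '<'
print(len(escaped), repr(escaped.text))

# Direct: the em element is visible to XPath, XSLT, and friends
print(len(direct), direct.find("em").text)
```

To get at the em element in the escaped version, a consumer has to guess that the string is markup and run a second parse over it.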

Norman Walsh makes several recommendations on how to avoid these problems, including sticking to plain text, using XML markup directly, or even using base-64 encoding. I'll let you decide what approach makes the most sense.

1. Using XML

American psychologist Abraham Maslow observed, "He that is good with a hammer tends to think everything is a nail." If XML is in your toolbox, does every problem start to look like markup? XML is not a silver bullet, nor a panacea for all the problems of application development, Web site deployment, data interchange, procedure call representation, and so forth.

XML comes from the world of document markup, where logical tree structures are the norm. The fact that it can be used to mark up data as well as documents is convenient. But, if your data is not tree shaped, XML is not appropriate. A table of temperature measurements taken on a 2D surface is perfectly happy existing as a series of comma-separated values with column headings "X," "Y," and "degrees celsius."

XML is great for the interchange of data. But, for the storage of data, it's weak. Computing an average temperature out of that table of temperature measurements is a one-liner if you store the data in a relational database. It's a simple function if it's in a spreadsheet. But, if it's in an XML document, it requires XPath gymnastics. Finding a specific measurement requires delving into the dubious world of XQuery.
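
For comparison, here's that average as a single SQL statement, sketched with Python's built-in sqlite3 module and an in-memory database. The table name, column names, and the three readings are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE measurements (x REAL, y REAL, degrees REAL)")
db.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(0.0, 0.0, 20.0), (0.0, 1.0, 22.0), (1.0, 0.0, 24.0)],
)

# The one-liner: the database does the arithmetic
(average,) = db.execute("SELECT AVG(degrees) FROM measurements").fetchone()
print(average)  # 22.0
```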

For internationalized processing of a standardized syntax, XML is clearly a winner. But, all that power is overkill for an application configuration file that just wants to know where to find the font directory or what the address of the mail server is. The fad of applications using XML for their configuration files is dismaying, to say the least, especially when many development languages include their own simpler configuration facilities (such as Java's properties files or Python's ConfigParser-style files).
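
Here's roughly what the simpler alternative looks like with Python's standard configparser module (the Python 3 name for ConfigParser). The section and option names are hypothetical.

```python
import configparser

# A hypothetical application configuration, INI style
SETTINGS = """
[paths]
font_dir = /usr/share/fonts

[mail]
server = mail.example.com
"""

config = configparser.ConfigParser()
config.read_string(SETTINGS)

font_dir = config["paths"]["font_dir"]
mail_server = config["mail"]["server"]
print(font_dir, mail_server)
```

There are no quotes to forget and no ampersands to escape, which also sidesteps mistake #10 for this kind of file.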

AJAX is certainly bringing to the browser a better Web experience that feels almost like a desktop application. But, there's really no requirement for the data that goes back and forth between the browser and the server to be in XML. When designing a new AJAX application, consider the JavaScript Object Notation (JSON). JSON provides a more efficient way of transferring data specifically for AJAX. What makes JSON clever is that it uses the JavaScript language itself to represent data: no XML parsing is required in the browser. As a result, applications send and retrieve fewer bytes and perform less in-browser processing, improving the "desktop application" experience even more.
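
To make the size argument concrete, this sketch encodes the same small record both ways and compares byte counts. It's written in Python rather than JavaScript, the record is made up, and the XML layout is only one of many possible designs, so treat the numbers as indicative; real savings depend on the data.

```python
import json
import xml.etree.ElementTree as ET

record = {"id": "x73", "type": "product", "name": "Y-Box 361"}

# JSON: the structure maps directly onto the language's own objects
as_json = json.dumps(record, separators=(",", ":")).encode("utf-8")

# XML: one child element per field (one of many possible designs)
root = ET.Element("record")
for key, value in record.items():
    ET.SubElement(root, key).text = value
as_xml = ET.tostring(root, encoding="utf-8")

print(len(as_json), len(as_xml))

# The JSON form round-trips without an XML parser
assert json.loads(as_json) == record
```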

XML is indeed a powerful tool. But, not always the best tool.

Conclusion

The eight years since the publication of the first XML Recommendation have led to a set of best practices, as well as a set of common mistakes you can make with this technology. Just as a chainsaw can cause personal injury, XML too has its own set of hazards. What will the next eight years bring? Undoubtedly, you'll be making a different set of mistakes, but we're creatures of habit, and some bad habits do endure.

About the Author

Sean Kelly has been making mistakes with XML since shortly after the first XML Recommendation was published. He's learned quite a bit from those mistakes, though, and currently provides XML, Python, Java, Web services, and other software development consulting for the medical, aerospace, and digital media industries. He's available for hire; he'll try his best to minimize the mistakes he'll make for you.

He resides in an undisclosed location with his wife Mary and daughter Ariana, who are routinely annoyed by his home automation hobby.
