
Making Mistakes with XML

  • February 6, 2006
  • By Sean Kelly

4. Designing a Language that Resists Change

So, you've decided to design your own XML language instead of re-using one. You've made a data model, picked names of elements and attributes, determined the minimal set of the vocabulary you need to solve your problem, even managed the entire process. It's hard work, but you've done it, released it, and people love it.

Now they want version 2.

Although making a new XML language is hard work, it's even harder to design one that can evolve over time and be backwards-compatible with earlier incarnations. Making a language that's resilient to change up-front will pay off and ensure the survivability of the language and applications based on it.

One approach is to include attributes such as version and compatible-with in your elements. The values of these attributes can indicate what version of your language an element comes from, and what versions an element is compatible with. This affords applications an opportunity to take some more intelligent action based on what they receive and what they generate. Another approach is to leverage the "must-ignore/must-understand" design pattern (although it suffers some shortcomings).
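As a rough sketch of the version-attribute approach, the Python below checks each element against the versions an application supports. The element names, the space-separated format of compatible-with, and the default of "1.0" for a missing version attribute are assumptions made for illustration; they are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Versions of the hypothetical language that this application understands.
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def can_process(element):
    """Return True if the element's own version, or any version it says it
    is compatible with, is one this application supports."""
    # Assumption: an element with no version attribute is treated as 1.0.
    declared = {element.get("version", "1.0")}
    # Assumption: compatible-with holds a space-separated list of versions.
    declared.update(element.get("compatible-with", "").split())
    return bool(declared & SUPPORTED_VERSIONS)


doc = ET.fromstring(
    "<readings>"
    '<measurement version="2.0" compatible-with="1.0 1.1" units="celsius">'
    "23.3</measurement>"
    '<spectrum version="2.0"/>'
    "</readings>"
)

for child in doc:
    action = "process" if can_process(child) else "skip"
    print(child.tag, action)
```

Here the version 2.0 measurement is still processed because it declares compatibility with 1.0 and 1.1, whereas the spectrum element, which makes no such claim, is skipped rather than misread.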

3. Storing Images in XML

Compression in file formats such as PNG and JPG can work wonders on images, turning millions of bytes into mere thousands. And yet, those same compact bytes are often converted into the printable base-64 format, expanding the size by 33%, just to put them into an XML document. And, with so much XML still being hand-edited, the sheer girth of a base-64 encoded image is enough to crash even the most hardy of text editors. (DOM parsers have been rumored to explode while ingesting base-64 encoded video clips.)
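The 33% figure follows from base-64 turning every 3 bytes of input into 4 printable characters. A quick check in Python, using random bytes as a stand-in for already-compressed image data:

```python
import base64
import os

raw = os.urandom(300_000)          # stand-in for compressed image bytes
encoded = base64.b64encode(raw)    # what would be pasted into the XML

overhead = len(encoded) / len(raw) - 1
print(f"{len(raw)} bytes -> {len(encoded)} bytes (+{overhead:.0%})")
```

Encoders that wrap the output into fixed-width lines add a little more on top of that for the line breaks.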

XML documents are great at containing characters, but lousy at containing binary data like images, audio, video, and other media. People often try to shoehorn such media into XML with base-64 encoding to avoid metadata in the XML from being separated from the data (the media) it's describing.

Technologies exist, though, that obviate such needs. XLink can describe various kinds of links to other media. MIME messages can contain multiple parts: one part may be the XML document, the other parts the media; SOAP with Attachments uses just such a technique. (Unparsed entities don't count because they require the doctype declaration; refer to point #8.)
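To illustrate the linking approach, here is a minimal sketch in which the XML carries only metadata plus an XLink reference, and the image stays in its own file. The photo element, its title attribute, and the path images/sunset.png are made up for the example; only the XLink namespace and the type and href attributes come from the XLink specification.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# The document describes the image and points at it instead of embedding it.
doc = ET.fromstring(
    f'<photo xmlns:xlink="{XLINK}" title="Sunset" '
    'xlink:type="simple" xlink:href="images/sunset.png"/>'
)

# ElementTree exposes namespaced attributes as "{namespace-uri}localname".
href = doc.get(f"{{{XLINK}}}href")
print(doc.get("title"), "->", href)
```

An application would then fetch or open the referenced file itself, so the XML stays small and editable no matter how large the media is.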

2. Storing XML in XML

XML is nice in that it combines metadata with data. Element and attribute names identify the meanings of their contained values: <measurement units='celsius'>23.3</measurement> is clearly a measurement of 23.3 degrees celsius. But, when the value to be stored in XML is itself XML, things get tricky. You see this most often in RSS feeds, where XHTML markup goes into the site summary in escaped form, such as:

<title>Whither XML?</title>
<description>In today's &lt;em&gt;rant&lt;/em&gt;, guest blogger
   &lt;strong&gt;Ken Doe&lt;/strong&gt; gets down and dirty with everyone's
   favorite whipping-tech: &lt;code&gt;XML&lt;/code&gt;.
</description>

Sometimes the XHTML is enclosed in a CDATA section instead of being escaped; the end result is the same. Why is this bad? For one thing, there's nothing more than mere convention (in this RSS example) that says that the content of an element such as description is in fact markup to be unescaped and rendered. And, in general, it hides the contained XML that tools could otherwise manipulate and make use of. In addition, blindly escaping such content or enclosing it in a CDATA section does not address text encoding issues: The escaped content might be in ISO-8859-4, whereas the XML document is in MacRoman.
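The way escaping hides markup from tools can be seen with any XML parser. The sketch below uses Python's ElementTree on two made-up description elements, one escaped and one with the markup included directly:

```python
import xml.etree.ElementTree as ET

escaped = ET.fromstring(
    "<description>Today's &lt;em&gt;rant&lt;/em&gt; is about XML.</description>"
)
direct = ET.fromstring(
    "<description>Today's <em>rant</em> is about XML.</description>"
)

# Escaped: the parser sees no child elements, only a string with angle brackets.
print(len(escaped), repr(escaped.text))

# Direct: the em element is a real node that tools can find and transform.
print(len(direct), direct[0].tag)

# Recovering the structure from the escaped form takes a second parse, which
# succeeds only if the embedded text happens to be well-formed.
inner = ET.fromstring(f"<wrapper>{escaped.text}</wrapper>")
print(inner[0].tag)
```

Nothing in the escaped document tells a consumer that this second parse is expected; that knowledge lives only in convention.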

Norman Walsh makes several recommendations on how to avoid these problems, including sticking to plain text, using XML markup directly, or even using base-64 encoding. I'll let you decide what approach makes the most sense.

1. Using XML

American psychologist Abraham Maslow observed, "He that is good with a hammer tends to think everything is a nail." If XML is in your toolbox, does every problem start to look like markup? XML is not a silver bullet, nor a panacea for all the problems of application development, Web site deployment, data interchange, procedure call representation, and so forth.

XML comes from the world of document markup, where logical tree structures are the norm. The fact that it can be used to mark up data as well as documents is convenient. But, if your data is not tree shaped, XML is not appropriate. A table of temperature measurements taken on a 2D surface is perfectly happy existing as a series of comma-separated values with column headings "X," "Y," and "degrees celsius."

XML is great for the interchange of data. But, for the storage of data, it's weak. Computing an average temperature out of that table of temperature measurements is a one-liner if you store the data in a relational database. It's a simple function if it's in a spreadsheet. But, if it's in an XML document, it requires XPath gymnastics. Finding a specific measurement requires delving into the dubious world of XQuery.
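For comparison, here is the same average computed both ways in Python. The table name, column names, and sample readings are invented for the example. ElementTree supports only a subset of XPath, so the XML side selects the nodes with a path and does the arithmetic in Python.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Invented sample data: (x, y, degrees Celsius).
rows = [(0, 0, 21.5), (0, 1, 22.0), (1, 0, 23.3), (1, 1, 24.0)]

# Relational storage: the average is a single query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE measurement (x INTEGER, y INTEGER, celsius REAL)")
db.executemany("INSERT INTO measurement VALUES (?, ?, ?)", rows)
(sql_avg,) = db.execute("SELECT AVG(celsius) FROM measurement").fetchone()

# XML storage: select the nodes, convert the text, then compute by hand.
xml_doc = ET.fromstring(
    "<measurements>"
    + "".join(
        f'<measurement x="{x}" y="{y}" units="celsius">{c}</measurement>'
        for x, y, c in rows
    )
    + "</measurements>"
)
values = [float(m.text) for m in xml_doc.findall("./measurement")]
xml_avg = sum(values) / len(values)

print(sql_avg, xml_avg)
```

Both routes reach the same number, but the database does the typing and aggregation for you, while the XML version has to convert strings to numbers and ignores the units attribute entirely.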

For internationalized processing of a standardized syntax, XML is clearly a winner. But, all that power is overkill for an application configuration file that just wants to know where to find the font directory or what the address of the mail server is. The fad of applications using XML for their configuration files is dismaying, to say the least, especially when many development languages include their own simpler configuration facilities (such as Java's properties files or Python's ConfigParser-style files).
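As an illustration of how little is needed for the font-directory and mail-server case, here is a sketch using Python's standard configparser module. The section names, keys, and values are placeholders.

```python
import configparser

# Placeholder settings; in practice this text would live in a file.
CONFIG_TEXT = """
[paths]
font_dir = /usr/share/fonts

[mail]
server = mail.example.com
port = 25
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

font_dir = config["paths"]["font_dir"]
mail_server = config["mail"]["server"]
mail_port = config.getint("mail", "port")   # converted to int on request

print(font_dir, mail_server, mail_port)
```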

AJAX is certainly bringing to the browser a better Web experience that feels almost like a desktop application. But, there's really no requirement for the data that goes back and forth between the browser and the server to be in XML. When designing a new AJAX application, consider the JavaScript Object Notation (JSON). JSON provides a more efficient way of transferring data specifically for AJAX. What makes JSON clever is that it uses the JavaScript language itself to represent data: no XML parsing is required in the browser. As a result, applications send and retrieve fewer bytes and perform less in-browser processing, improving the "desktop application" experience even more.
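The byte savings can be illustrated on the server side as well. The sketch below serializes one made-up temperature reading both ways in Python; the field names and the XML layout are assumptions, and the exact difference will vary with how the XML is designed.

```python
import json
import xml.etree.ElementTree as ET

# A single invented reading.
reading = {"x": 1, "y": 0, "celsius": 23.3}

# Compact JSON, as it might be sent to the browser.
as_json = json.dumps(reading, separators=(",", ":"))

# One plausible XML rendering of the same reading.
element = ET.Element("measurement", x="1", y="0", units="celsius")
element.text = "23.3"
as_xml = ET.tostring(element, encoding="unicode")

print(as_json, len(as_json))
print(as_xml, len(as_xml))

# Reading the JSON back yields native numbers and a dict, with no tree to walk.
decoded = json.loads(as_json)
```

For this record the JSON form is shorter, and decoding it gives typed values directly, whereas the XML form needs its attribute and text strings converted by hand.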

XML is indeed a powerful tool, but it is not always the best tool.


The eight years since the publication of the first XML Recommendation have produced a set of best practices, as well as a set of common mistakes you can make with this technology. Just as a chainsaw can cause personal injury, XML has its own set of hazards. What will the next eight years bring? Undoubtedly, you'll be making a different set of mistakes, but because we are creatures of habit, some bad habits will endure.

About the Author

Sean Kelly has been making mistakes with XML since shortly after the first XML Recommendation was published. He's learned quite a bit from those mistakes, though, and currently provides XML, Python, Java, Web services, and other software development consulting for the medical, aerospace, and digital media industries. He's available for hire; he'll try his best to minimize the mistakes he'll make for you.

He resides in an undisclosed location with his wife Mary and daughter Ariana, who are routinely annoyed by his home automation hobby.

