Making Mistakes with XML, Page 2
4. Designing a Language that Resists Change
So, you've decided to design your own XML language instead of re-using one. You've made a data model, picked names of elements and attributes, determined the minimal set of the vocabulary you need to solve your problem, even managed the entire process. It's hard work, but you've done it, released it, and people love it.
Now they want version 2.
One approach is to include attributes such as version and compatible-with in your elements. The values of these attributes can indicate what version of your language an element comes from, and what versions an element is compatible with. This affords applications an opportunity to take some more intelligent action based on what they receive and what they generate. Another approach is to leverage the "must-ignore/must-understand" design pattern (although it suffers some shortcomings).
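As a rough sketch of the first approach, here is how an application might use those attributes to decide whether it can handle an element. The element names, the version numbers, and the choice to make compatible-with a space-separated list are all assumptions made for illustration; your language would define its own rules.

```python
import xml.etree.ElementTree as ET

# Versions this hypothetical application knows how to process.
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def can_process(element):
    """Return True if the element's own version, or any version it claims
    compatibility with, is one this application supports."""
    declared = element.get("version")
    # Assumption: compatible-with holds a space-separated list of versions.
    compatible = element.get("compatible-with", "").split()
    candidates = [v for v in [declared, *compatible] if v]
    return any(v in SUPPORTED_VERSIONS for v in candidates)


# A version 2.0 document that declares itself readable by 1.x consumers.
doc = ET.fromstring(
    '<measurements version="2.0" compatible-with="1.1 1.0">'
    '<measurement units="celsius">23.3</measurement>'
    '</measurements>'
)

# A later document that has dropped compatibility with 1.x.
newer = ET.fromstring('<measurements version="3.0" compatible-with="2.0"/>')

print(can_process(doc))    # True
print(can_process(newer))  # False
```

An application that gets False back can then reject the document, warn the user, or fall back to a degraded mode instead of misreading data it doesn't understand.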
3. Storing Images in XML
Compression in file formats such as PNG and JPG can work wonders on images, turning millions of bytes into mere thousands. And yet, those same compact bytes are often converted into the printable base-64 format, expanding the size by about 33%, just to put them into an XML document. And, with so much XML still being hand-edited, the sheer girth of a base-64 encoded image is enough to crash even the hardiest of text editors. (DOM parsers have been rumored to explode while ingesting base-64 encoded video clips.)
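You can check the size penalty yourself. This small sketch uses random bytes as a stand-in for compressed image data (an assumption; any binary payload behaves the same way) and measures what base-64 does to it. Every 3 bytes of input become 4 printable characters of output.

```python
import base64
import os

# Stand-in for compressed image bytes; real PNG or JPG data grows the same way.
raw = os.urandom(30_000)

encoded = base64.b64encode(raw)
overhead = len(encoded) / len(raw) - 1

print(f"{len(raw)} bytes -> {len(encoded)} characters ({overhead:.0%} larger)")
# 30000 bytes -> 40000 characters (33% larger)
```

And that is before any line breaks are added to make the encoded text fit in a document.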
XML documents are great at containing characters, but lousy at containing binary data like images, audio, video, and other media. People often try to shoehorn such media into XML with base-64 encoding to keep the metadata in the XML from being separated from the data (the media) it's describing.
Technologies exist, though, that obviate such needs. XLink can describe various kinds of links to other media. MIME messages can contain multiple parts: one part may be the XML document, the other parts the media; SOAP with Attachments uses just such a technique. (Unparsed entities don't count because they require the doctype declaration; refer to point #8.)
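To make the XLink option concrete, here is a minimal sketch in which the XML carries only the metadata and a link, and the image stays in its own file. The photo, caption, and image element names and the images/sunset.png path are made up for this example; only the XLink namespace and its type and href attributes come from the XLink specification.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# The document describes the image and points at it instead of embedding it.
doc = ET.fromstring(
    '<photo xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<caption>Sunset over the bay</caption>'
    '<image xlink:type="simple" xlink:href="images/sunset.png"/>'
    '</photo>'
)


def linked_media(root):
    """Return the xlink:href value of every element that carries one."""
    key = f"{{{XLINK_NS}}}href"
    return [el.get(key) for el in root.iter() if el.get(key) is not None]


print(linked_media(doc))  # ['images/sunset.png']
```

The application can then fetch or open each linked file with whatever binary-friendly tool suits it, and the XML stays small enough to hand-edit.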
2. Storing XML in XML
XML is nice in that it combines metadata with data. Element and attribute names identify the meanings of their contained values: <measurement units='celsius'>23.3</measurement> is clearly a measurement of 23.3 degrees celsius. But, when the value to be stored in XML is itself XML, things get tricky. You see this most often in RSS feeds, where XHTML markup goes into the site summary in escaped form, such as:
<title>Whither XML?</title>
<link>http://bitterness.cx/blog/938182/</link>
<description>In today's &lt;em&gt;rant&lt;/em&gt;, guest blogger &lt;strong&gt;Ken Doe&lt;/strong&gt; gets down and dirty with everyone's favorite whipping-tech: &lt;code&gt;XML&lt;/code&gt;. </description>
Sometimes the XHTML is enclosed in a CDATA section instead of being escaped; the end result is the same. Why is this bad? For one thing, there's nothing more than mere convention (in this RSS example) that says that the content of an element such as description is in fact markup to be unescaped and rendered. And, in general, it hides the contained XML that tools could otherwise manipulate and make use of. In addition, blindly escaping such content or enclosing it in a CDATA section does not address text encoding issues: The escaped content might be in ISO-8859-4, whereas the XML document is in MacRoman.
Norman Walsh makes several recommendations on how to avoid these problems, including sticking to plain text, using XML markup directly, or even using base-64 encoding. I'll let you decide what approach makes the most sense.
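To see what gets hidden, compare how a parser treats the two forms. This is only a sketch with a shortened description element; a real feed would also put the embedded XHTML in its own namespace, which is left out here for brevity.

```python
import xml.etree.ElementTree as ET

# Escaped markup, as in the RSS example: the tags are just characters.
escaped = ET.fromstring(
    "<description>In today's &lt;em&gt;rant&lt;/em&gt;, XML.</description>"
)

# The same content with the markup used directly.
direct = ET.fromstring(
    "<description>In today's <em>rant</em>, XML.</description>"
)

# Escaped markup produces no child elements, only one long string.
escaped_children = [child.tag for child in escaped]
# Direct markup is part of the tree, so tools can find and manipulate it.
direct_children = [child.tag for child in direct]

print(escaped_children)  # []
print(direct_children)   # ['em']
print(escaped.text)      # In today's <em>rant</em>, XML.
```

With the escaped form, anything that wants the em element has to guess that the text is markup and parse it a second time.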
1. Using XML
American psychologist Abraham Maslow observed, "He that is good with a hammer tends to think everything is a nail." If XML is in your toolbox, does every problem start to look like markup? XML is not a silver bullet, nor a panacea for all the problems of application development, Web site deployment, data interchange, procedure call representation, and so forth.
XML comes from the world of document markup, where logical tree structures are the norm. The fact that it can be used to mark up data as well as documents is convenient. But, if your data is not tree shaped, XML is not appropriate. A table of temperature measurements taken on a 2D surface is perfectly happy existing as a series of comma-separated values with column headings "X," "Y," and "degrees celsius."
XML is great for the interchange of data. But, for the storage of data, it's weak. Computing an average temperature out of that table of temperature measurements is a one-liner if you store the data in a relational database. It's a simple function if it's in a spreadsheet. But, if it's in an XML document, it requires XPath gymnastics. Finding a specific measurement requires delving into the dubious world of XQuery.
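Here is roughly what that relational one-liner looks like, using the sqlite3 module from Python's standard library. The table name, column names, and sample readings are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (x REAL, y REAL, celsius REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(0.0, 0.0, 22.5), (0.0, 1.0, 23.5), (1.0, 0.0, 24.0), (1.0, 1.0, 26.0)],
)

# The average temperature really is a single query.
(average,) = conn.execute("SELECT AVG(celsius) FROM measurements").fetchone()

# So is finding one specific measurement.
(at_point,) = conn.execute(
    "SELECT celsius FROM measurements WHERE x = ? AND y = ?", (1.0, 0.0)
).fetchone()

print(average)   # 24.0
print(at_point)  # 24.0
```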
For internationalized processing of a standardized syntax, XML is clearly a winner. But, all that power is overkill for an application configuration file that just wants to know where to find the font directory or what the address of the mail server is. The fad of applications using XML for their configuration files is dismaying, to say the least, especially when many development languages include their own simpler configuration facilities (such as Java's properties files or Python's ConfigParser-style files).
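For comparison, this is about all it takes with Python's configparser module. The section names, keys, and values below are hypothetical; they just mirror the font directory and mail server mentioned above.

```python
import configparser

# In practice this text would live in a file such as app.ini (assumed name).
CONFIG_TEXT = """
[paths]
font_directory = /usr/share/fonts

[mail]
server = mail.example.com
"""

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

font_directory = config["paths"]["font_directory"]
mail_server = config.get("mail", "server")

print(font_directory)  # /usr/share/fonts
print(mail_server)     # mail.example.com
```

No schema, no namespaces, no parser setup: two settings, two lookups.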
XML is indeed a powerful tool. But, not always the best tool.
The eight years since the publication of the first XML Recommendation have produced a set of best practices, as well as a set of common mistakes you can make with this technology. Just as a chainsaw can cause personal injury, XML too has its own set of hazards. What will the next eight years bring? Undoubtedly, you'll be making a different set of mistakes, but we are creatures of habit, and some bad habits do endure.
About the Author
Sean Kelly has been making mistakes with XML since shortly after the first XML Recommendation was published. He's learned quite a bit from those mistakes, though, and currently provides XML, Python, Java, Web services, and other software development consulting for the medical, aerospace, and digital media industries. He's available for hire; he'll try his best to minimize the mistakes he'll make for you.
He resides in an undisclosed location with his wife Mary and daughter Ariana, who are routinely annoyed by his home automation hobby.