
XML for Beginners, Part 6: Valid Documents, Well-Formed Documents, and the DTD


Preface

I have authored numerous online articles on XML.  These articles cover the waterfront from introductory to advanced.  I maintain a consolidated index of hyperlinks to all of my XML articles at my personal website so that you can access earlier articles from there.

Introduction

This is the last installment in a series of articles designed to explain XML to beginners.

Experts skip this article

Those of you who already know all about XML can skip ahead to something more challenging, such as some of my articles on XSL.  You will find links to all of my articles at my personal website.

Beginners, keep reading

Those of you who are just getting your feet wet in this area (and may have found the XML water to be a little deep), keep reading.

What is XML?

In an earlier article, I provided the following brief description of XML.
 

XML gives us a way to create and maintain structured documents in plain text that can be rendered in a variety of different ways.

A primary objective of XML is to separate content from presentation.

Since then, I have been working to break down the jargon into plain English and have provided some examples of structured documents and rendering.

What have we learned so far?

So far in this series on XML for Beginners, I have discussed tags, elements, content, and attributes in detail.

Now we are ready to move along to a new set of topics:  valid documents, well-formed documents, and the DTD.

What is a DTD?

According to the FAQ:
 

“A DTD is usually a file (or several files to be used together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur, and how they all fit together. For example, if you want a document type to describe <LIST>s which contain <ITEM>s, part of your DTD would contain something like 

     <!ELEMENT item (#pcdata)> 

     <!ELEMENT list (item)+> 

This defines items containing text, and lists containing items. 

It’s a formal language which lets processors automatically parse a document and identify where every element comes and how they relate to each other, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used.”

DTDs can be very complicated

I included the above quotation to emphasize one important point:  DTDs are complicated.

The reality is that the creation of a DTD of any significance is a very complex task.

Don’t panic

However, despite their complexity, many of you will never need to worry about having to create DTDs for the following two reasons:

  • Fortunately, XML does not require the use of a DTD.
  • Even when it is necessary to use a DTD, someone else may have already created it for you.

Many “standard” DTDs have already been developed and are available for your use without any requirement for you to develop them.

The three amigos

An XML document has two very close friends, one of which is optional.

I’m going to refer to them as files just so I will have something to call them (but they don’t have to be separate physical files).

One file contains the content

One file contains the content of the document (words, pictures, etc.).  This is the part that the author wants to expose to the client. I have discussed this in several previous articles.

This is the file that is composed of elements, having start tags, end tags, attributes, and content.

A second file contains the DTD

A second file is the DTD, which meets the above definition that was extracted from the FAQ.  This file is optional.

A third file contains a stylesheet

A third file contains a stylesheet that establishes how the content (that optionally conforms to the DTD) is to be rendered on the output device.

This file defines how the author wants the material to be presented to the client.
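The three pieces can be seen side by side in one small sample. The following is only a sketch, assuming Python and its standard library. The element names follow the FAQ example quoted earlier (with the keyword written as #PCDATA, because XML is case-sensitive), while the sample text and the stylesheet name list.css are invented for illustration. Attaching the stylesheet with an xml-stylesheet processing instruction is also an assumption here, since it is only one of several ways to do it.

```python
import xml.dom.minidom

# Hypothetical document: the sample text and the stylesheet name
# "list.css" are invented for illustration.
DOCUMENT = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="list.css"?>
<!DOCTYPE list [
  <!ELEMENT list (item)+>
  <!ELEMENT item (#PCDATA)>
]>
<list>
  <item>tags</item>
  <item>attributes</item>
</list>
"""

doc = xml.dom.minidom.parseString(DOCUMENT)

# Friend 1: the content, held in elements.
items = [node.firstChild.data for node in doc.getElementsByTagName("item")]

# Friend 2: the DTD, carried here inside the DOCTYPE declaration
# rather than in a separate physical file.
doctype_name = doc.doctype.name

# Friend 3: the stylesheet, named by a processing instruction.
stylesheet_refs = [
    node.data
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
    and node.target == "xml-stylesheet"
]

print(items)
print(doctype_name)
print(stylesheet_refs)
```

Note that the DTD lives inside the document in this sketch, which illustrates the earlier point that the three "files" don't have to be separate physical files.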

Rendering the XML document

For example, a tag with an attribute value of “red” might cause something to be presented bright red according to one stylesheet and dull red according to another stylesheet.  (It might even be presented as some shade of green according to still another stylesheet.)

DTD is optional, stylesheet is not

With XML, the DTD is optional but the stylesheet (or some processing mechanism that substitutes for a stylesheet) is required.  At least that is true if the XML document is ever to be rendered.

Something must provide rendering specifications

Remember, XML separates content from presentation.  Let me say it in boldface:  There is no presentation information in the XML document itself.

Therefore, something must be able to render the content of the XML document in the manner that the author intended.

A stylesheet is typical, but not required

Typically, the rendering specifications are contained in a stylesheet.  The stylesheet is used by a rendering engine to render the XML document according to the specifications in the stylesheet.

However, it is possible that the specifications could be hard-coded into a program written specifically for the purpose of rendering the XML document.  In that case, a stylesheet might not be required.
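As a rough illustration of hard-coded rendering, here is a minimal sketch, assuming Python's standard library. The sample content and both function names are hypothetical. The same content is rendered two different ways, and the presentation decisions live entirely in the program rather than in the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical content; the XML itself says nothing about presentation.
CONTENT = "<list><item>tags</item><item>attributes</item></list>"


def render_as_bullets(xml_text):
    """Render each <item> as a bulleted line of plain text."""
    root = ET.fromstring(xml_text)
    return "\n".join("* " + item.text for item in root.iter("item"))


def render_as_numbers(xml_text):
    """Render the same content as a numbered list instead."""
    root = ET.fromstring(xml_text)
    return "\n".join(
        f"{number}. {item.text}"
        for number, item in enumerate(root.iter("item"), start=1)
    )


print(render_as_bullets(CONTENT))
print(render_as_numbers(CONTENT))
```

Changing the presentation here means changing the program, which is the trade-off a stylesheet is meant to avoid.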

Rendering XML with IE5 and CSS

I have written several articles that deal with the use of Microsoft IE5 to render XML files using Cascading Style Sheets (CSS).  You will find links to those articles at my personal website.

Rendering XML with XSL

In addition, I have several new articles in the works that deal with rendering XML using stylesheets written in XSL.  When they are published, you will find links to those articles at my personal website as well.

Now back to the DTD.

A DTD can be very complex

Again, according to the FAQ:
 

“… the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD.  DTDless operation means you can invent markup without having to define it formally. 

To make this work, a DTDless file in effect ‘defines’ its own markup, informally, by the existence and location of elements where you create them. 

But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure as it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules.”

What does this really mean?

It means that it is possible to create and process an XML document without the requirement for a DTD.  A little later, I will discuss this possibility in connection with the term well-formed.

In the meantime…

You don’t always have the luxury of avoiding the DTD.  In some situations, you may be required to create an XML document that meets specifications that someone else has defined.

Hopefully, a DTD will be available

Ideally, in those cases, the person who defined the specifications has also created a DTD and can provide it to you for your use.

What is a valid document?

Here is a new term — a valid XML document.

In the normal sense of the word, if something is not valid, that usually means that it is not any good.  However, that is not the case for XML.

An invalid XML document can be a good XML document

An invalid XML document can be a perfectly good and useful XML document.  A very large percentage of useful XML documents are invalid.

So, what is a valid XML document?

Drum roll please!!!  Without further delay, a valid XML document is one that conforms to an existing DTD in every respect.

For example…

Unless the DTD allows an element with the name “color”, an XML document containing an element with that name is not valid according to that DTD (but it might be valid according to some other DTD).

Validity is not a requirement of XML

Because XML does not require a DTD, in general, an XML processor cannot require validation of the document.

Many very useful XML documents are not valid, simply because they were not constructed according to an existing DTD.

To make a long story short, validation against a DTD can often be very useful, but is not required.
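The difference between "not valid" and "not usable" can be seen with a parser that does not validate. The following is a minimal sketch, assuming Python's standard-library parser, which checks only that a document is well-formed and does not compare it against the DTD. The document and the DECLARED set are hypothetical, and the name check at the end is a crude stand-in for one validity rule, not a real DTD validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose DTD allows only <list> and <item>,
# yet whose content uses an undeclared <color> element.
DOCUMENT = """<!DOCTYPE list [
  <!ELEMENT list (item)+>
  <!ELEMENT item (#PCDATA)>
]>
<list><color>red</color></list>"""

# The parser accepts the document because it is well-formed,
# even though it is not valid according to its own DTD.
root = ET.fromstring(DOCUMENT)
found = [element.tag for element in root.iter()]

# Crude stand-in for one validity rule (NOT a real DTD validator):
# report element names that the DTD above never declares.
DECLARED = {"list", "item"}
undeclared = [name for name in found if name not in DECLARED]

print(found)
print(undeclared)
```

A validating processor would reject this document, whereas a non-validating one processes it without complaint.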

What is a well-formed document?

Here is another new term — a well-formed document.

The concept of being well-formed was introduced as a requirement of XML to deal with the situation where a DTD is not available (so that the document cannot be valid).

Again, according to the FAQ:
 

“For example, HTML’s <IMG> element is defined as ‘EMPTY’: it doesn’t have an end-tag. Without a DTD, an XML application would have no way to know whether or not to expect an end-tag for an element, so the concept of ‘well-formed’ has been introduced.

This makes the start and end of every element, and the occurrence of EMPTY elements completely unambiguous.”

What is an <IMG> tag?

Although you may not know anything about the HTML <IMG> tag, you do know about start tags and end tags from previous articles in this series.

HTML documents are not required to be well-formed

Although HTML is related to XML (a distant cousin that combines content and presentation in the same document), HTML documents are not required to be well-formed.

The above quotation is referring to the use of a start tag in HTML that doesn’t require an end tag.  If used in that manner in an XML document, the document would not be well-formed.

All XML documents must be well-formed

Let me say it again, in boldface.  XML documents need not be valid, but all XML documents must be well-formed.

What does it mean to be well-formed?

There are several requirements for an XML document to be well-formed.  I will discuss them separately.

Start and end tags are required

To be well-formed, all elements that can contain character data must have both start and end tags.  (Empty elements have a different requirement that I will discuss later.)
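One way to see this rule in action is to hand a parser two nearly identical documents. This is a minimal sketch, assuming Python's standard library. The helper name is_well_formed and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Every start tag has a matching end tag.
complete = "<book><title>XML for Beginners</title></book>"

# The <title> element is never closed.
missing_end_tag = "<book><title>XML for Beginners</book>"

print(is_well_formed(complete))
print(is_well_formed(missing_end_tag))
```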

What is character data?

For purposes of this explanation, let’s just say that the content that we discussed in an earlier lesson comprises character data.

Are there any other requirements for well-formed?

All attribute values must be in quotes (apostrophes or double quotes).  You already know about attributes.  I discussed them in an earlier lesson in this series.

You can surround the value with apostrophes (single quotes) if the attribute value contains a double quote.  An attribute value that is surrounded by double quotes can contain apostrophes.
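The quoting rules can be tried out with a short sketch like the following, which assumes Python's standard library. The helper name attributes_or_none and the sample attribute values are hypothetical.

```python
import xml.etree.ElementTree as ET


def attributes_or_none(xml_text):
    """Return the root element's attributes, or None if parsing fails."""
    try:
        return ET.fromstring(xml_text).attrib
    except ET.ParseError:
        return None


# No quotes around the value: not well-formed.
unquoted = "<book price=9.95>text</book>"

# Double quotes outside, apostrophe inside.
double_quoted = """<book title="Baldwin's Book">text</book>"""

# Apostrophes outside, double quotes inside.
single_quoted = """<book title='The "Big" Book'>text</book>"""

print(attributes_or_none(unquoted))
print(attributes_or_none(double_quoted))
print(attributes_or_none(single_quoted))
```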

Is that all?

No, we must also deal with empty elements.  Empty elements are those that don’t contain any content (neither character data nor other elements).

Dealing with empty elements

We can deal with empty elements by writing them in either of the following two ways:
 

<book></book>

<book/>

You will recognize the first format as simply writing a start tag followed immediately by an end tag with nothing in between.

The second format is preferable

This is the first time that I have mentioned the second format, which is actually preferable.

One reason that the second format is preferable is that you could end up with the following for the first format.
 

<book>
</book>

Really not empty

Although this element may look empty to you, it really isn’t empty.  Rather it contains whatever characters are used by that platform to represent a newline character sequence.

Typically a newline is either a carriage return character, a line feed character, or a combination of the two.  While these characters are not visible, their presence will cause an element to be not empty.

If an element is supposed to be empty, but it is not really empty, this can cause problems when the XML file is processed.
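The hidden newline is easy to observe. Here is a minimal sketch, assuming Python's standard library, in which the helper name content_of is hypothetical. This particular parser reports an element with no character data as None and reports a line break inside the element as a newline character.

```python
import xml.etree.ElementTree as ET


def content_of(xml_text):
    """Return the character data directly inside the root element."""
    return ET.fromstring(xml_text).text


# Both of these forms are truly empty, so there is no character data.
print(repr(content_of("<book></book>")))
print(repr(content_of("<book/>")))

# The tags are on separate lines, so the element contains a newline.
print(repr(content_of("<book>\n</book>")))
```

A program that tests for "no content" would treat the third form differently from the first two.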

The preferred approach

So, to reiterate, the preferred approach for representing an empty element is as follows:
 

<book/>

In case you haven’t spotted the difference yet, note the slash character immediately before the right-angle bracket at the end of the tag.

Empty element can contain attributes

Note that an empty element can contain one or more attributes inside the start tag, as shown in the following example.
 

<book author="baldwin" price="$9.95" />

Again, note the slash character at the end.

No markup characters are allowed

For a document to be well-formed, it must not have markup characters (< or &) in the text data.

What is a markup character?

Since the < character represents the beginning of a new tag, if it were included in the text data, it would cause the processor to become confused.  The solution to this problem (described below) also makes it necessary to exclude the & character from the text data.

The solution

If you need your text to include the < character or the & character, you can represent them using &lt; and &amp; instead.

Entities

According to the prevailing jargon, these things are called entities (more formally, entity references).  You insert them into your text in place of the prohibited characters.

Note that both entities start with an ampersand character and end with a semicolon.  It is that combination of characters that the processor uses to distinguish them from ordinary text.
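The substitution can be done by hand, or a library can do it. The following is a minimal sketch, assuming Python's standard library, where the escape function in xml.sax.saxutils performs the replacement. The sample text and the helper name is_well_formed are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical text containing both prohibited characters.
raw_text = "1 < 2 & 2 < 3"

# Replace & with &amp; and < with &lt; before placing the text in a document.
escaped = escape(raw_text)
print(escaped)

# The raw text breaks the document; the escaped text does not.
print(is_well_formed("<note>" + raw_text + "</note>"))
print(is_well_formed("<note>" + escaped + "</note>"))

# When the document is parsed, the original characters come back.
round_trip = ET.fromstring("<note>" + escaped + "</note>").text
print(round_trip)
```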

Elements must nest properly

Another requirement of a well-formed document is:

  • If one element contains another element, the entire second element must be defined inside the start and end tags of the first element.
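To see what improper nesting looks like to a parser, consider this minimal sketch, which assumes Python's standard library. The helper name parse_error and the sample documents are hypothetical, and the exact wording of the error message depends on the parser.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return the parser's error message, or None if parsing succeeds."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as error:
        return str(error)


# <title> begins and ends inside <book>.
proper = "<book><title>XML</title></book>"

# <title> begins inside <book> but ends outside it.
overlapping = "<book><title>XML</book></title>"

print(parse_error(proper))
print(parse_error(overlapping))
```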

Recap of validity and well-formed requirements

Valid XML files are those which have (or refer to) a DTD and which conform to the DTD in all respects.

XML files must be well-formed, but there is no requirement for them to be valid.

A DTD is not required, in which case validity is impossible to establish.

If an XML document has a DTD, it must conform to that DTD in order to be valid.

Why use a DTD if it is not required?

There are several reasons to use a DTD, in spite of the fact that XML doesn’t require one.

Enforcing format specifications

Suppose, for example, that you have been charged with publishing a weekly newsletter, and you intend to produce the newsletter as an XML file.

Suppose also that you occasionally have a guest editor who produces the newsletter on your behalf.

Establish format specifications

You will probably establish a set of format specifications for your newsletter and you will need to publish those specifications for the benefit of the guest editors.

No guarantee of compliance

However, simply publishing a document containing format specifications does not ensure that the guest editors will comply with the specifications.

Use a DTD to enforce format specifications

You can enforce the format specifications by also establishing a DTD that matches the specifications.

Then, if either you or one of your guest editors produces an XML document that doesn’t meet the specifications, the XML processor that you use to render your newsletter into its final form (assuming that it validates the document against the DTD) will notify you that the document is not valid.

Improved parser diagnostic data

Another reason that I have found a DTD to be useful goes as follows.

I am occasionally called upon to write a Java program that will parse and process an XML document in some fashion.

My experience is that the parsers that I have used are much more effective in identifying XML structural problems when the XML document has a DTD than when it doesn’t.

By this I mean that often the diagnostic information provided by the parser is more helpful when the XML document has a DTD.

This tends to make it easier to repair the document because it does a better job of isolating the problem.

What’s Next?

Unless I think of something more that I need to explain, this will be the last article in the series, XML for Beginners.

Future articles will dig more deeply into some of the topics that I have touched on lightly here (such as DTDs and XSL) and many important topics that weren’t even mentioned here.

Keep checking back.  XML is already a very broad topic, and it continues to evolve.  Every day I learn something about XML that I didn’t know the day before.  I will try to pass some of that knowledge along to you.

Copyright 2000, Richard G. Baldwin.  Reproduction in whole or in part in any form or medium without express written permission from Richard Baldwin is prohibited.


About the author:

Richard Baldwin is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two.  He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas.  He is the author of Baldwin’s Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.
