Well-Formed HTML

This article is brought to you by Hungry Minds, Inc., publisher of Elliotte Rusty Harold’s XML Bible, 2nd Edition

Well-formed HTML is HTML that adheres to XML’s well-formedness constraints but only uses standard HTML tags. Well-formed HTML is easier to read than the sloppy HTML most humans and WYSIWYG tools such as FrontPage write. It’s also easier for Web robots and automated search engines to understand. It’s more robust, and less likely to break when you make a change. And it’s less likely to be subject to annoying cross-browser and cross-platform differences in rendering. Furthermore, you can then use XML tools to work on your HTML documents, while still maintaining backward compatibility with browsers that don’t support XML.
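To make the point about XML tools concrete, here is a minimal Python sketch that feeds a well-formed HTML page to a standard XML parser. The sample page and the function names paragraph_texts and is_well_formed are illustrative assumptions, not part of the original article. It only works for pages that use no HTML-specific entity references such as &nbsp;.

```python
import xml.etree.ElementTree as ET

# A small well-formed HTML page (hypothetical sample for illustration).
PAGE = """<html>
<head><title>Maxims</title></head>
<body>
<p>The well-bred contradict other people.</p>
<p>The wise contradict themselves.</p>
</body>
</html>"""


def paragraph_texts(html_source):
    """Parse well-formed HTML with an XML parser and return each p element's text."""
    root = ET.fromstring(html_source)  # raises ET.ParseError if not well-formed
    return [p.text for p in root.iter("p")]


def is_well_formed(html_source):
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(html_source)
        return True
    except ET.ParseError:
        return False


print(paragraph_texts(PAGE))
print(is_well_formed("<p>Unclosed paragraph"))  # False
```

The same page still displays in an ordinary browser, which is the backward compatibility the paragraph above describes.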

Rules for HTML

Real-world Web pages are extremely sloppy. Tags aren’t closed. Elements overlap. Raw less than signs are included in pages. Semicolons are omitted from the ends of entity references. Web pages with these problems are technically incorrect, but most Web browsers accept them. Nonetheless, your Web pages will be cleaner, display faster, and be easier to maintain if you fix these problems.

Some of the common problems that you need to look for on Web pages include:

    1. Start tags without matching end tags (unclosed elements)

    2. End tags without start tags (orphaned tags)

    3. Overlapping elements

    4. Unquoted attributes

    5. Unescaped <, >, and & signs

    6. Documents without root elements

    7. End tags in a different case than the corresponding start tag

I’ve listed these in rough order of importance. Exact details vary from tag to tag, however. For instance, an unclosed <STRONG> tag will turn all elements following it bold. However, an unclosed <LI> or <P> tag causes no problems at all.

There are also some rules that only apply to XML documents that might actually cause problems if you attempt to integrate them into your existing HTML pages. These XML-only constructs include:

    8. Start documents with an XML declaration.

    9. Close empty element tags with a />.

    10. Only use the &amp;, &lt;, &gt;, &apos;, and &quot; entity references.

Fixing these problems isn’t hard, but there are a few pitfalls to trip up the unwary. Let’s explore them.

Close all elements

Any element that contains content, whether text or other child elements, should have a start tag and an end tag. HTML doesn’t absolutely require this. For instance, <P>, <DT>, <DD>, and <LI> are often used in isolation. However, this relies on the Web browser to make a good guess at where the element ends, and browsers don’t always do quite what authors want or expect. Therefore, it’s best to explicitly close all start tags.

Probably the biggest change this requires to how you write HTML is thinking of <P> as a container rather than a simple paragraph break mark. For instance, previously you would have formatted these maxims from Oscar Wilde’s Phrases and Philosophies for the Use of the Young like this:

Wickedness is a myth invented by good people to account for the
curious attractiveness of others.<P>
Those who see any difference between soul and body have
neither.<P>
Religions die when they are proved to be true. Science is the
record of dead religions.<P>
The well-bred contradict other people. The wise contradict
themselves.<P>

Now you have to format them like this instead:

<P>Wickedness is a myth invented by good people to account for the
curious attractiveness of others.</P>
<P>Those who see any difference between soul and body have neither.</P>
<P>Religions die when they are proved to be true. Science is the
record of dead religions.</P>
<P>The well-bred contradict other people. The wise contradict
themselves.</P>
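For pages where <P> was used purely as a separator, the conversion can be roughed out mechanically. The Python sketch below is an assumption-laden illustration: the function name wrap_paragraphs is hypothetical, and it assumes flat text with no existing </P> tags and no attributes on <P>.

```python
import re


def wrap_paragraphs(text):
    """Turn <P>-as-separator text into <P>...</P> containers.

    Assumes flat text: no existing </P> tags and no attributes on <P>.
    """
    chunks = re.split(r"<p>", text, flags=re.IGNORECASE)
    # Skip empty chunks, such as the one after a trailing <P>.
    return "\n".join(f"<P>{chunk.strip()}</P>" for chunk in chunks if chunk.strip())


old_style = (
    "The well-bred contradict other people.<P>"
    "The wise contradict themselves.<P>"
)
print(wrap_paragraphs(old_style))
```

Real pages usually need a hand check afterward, since a <P> that sits inside a list or table is not a simple separator.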

You’ve probably been taught to think of <P> as ending a paragraph. Now you have to think of it as beginning one. This does offer you some advantages though. For instance, you can easily assign a variety of formatting attributes to a paragraph. For example, here’s the original HTML title of House Resolution 581 as seen on http://thomas.loc.gov/home/hres581.html:

<center>
<p>
<h2>House Calendar No. 272</h2>
<p>
<h1>105TH CONGRESS 2D SESSION H. RES. 581</h1>
<p>[Report No. 106-795]
<p>
<b>Authorizing and directing the Committee on the
Judiciary to investigate whether sufficient grounds
exist for the impeachment of William Jefferson Clinton,
President of the United States.</b>
</center>

Here’s the same text, but using well-formed HTML. The align attribute now replaces the deprecated center element, and a CSS style attribute is used instead of the <b> tag.

<h2 align="center">House Calendar No. 272</h2>
<h1 align="center">105TH CONGRESS 2D SESSION H. RES. 581</h1>
<p align="center">[Report No. 106-795]</p>
<p align="center" style="font-weight: bold">
Authorizing and directing the Committee on the Judiciary to
investigate whether sufficient grounds exist for the
impeachment of William Jefferson Clinton,
President of the United States.</p>

Delete orphaned end tags; don’t let elements overlap

When editing pages, it’s not uncommon to remove a start tag and forget to remove its associated end tag. In HTML, an orphaned end tag, such as a </STRONG> or </TD> that doesn’t have any matching start tag, is unlikely to cause problems by itself. However, it does make the file longer than it needs to be, increases the time that it takes to download the document, and has the potential to confuse people or tools that are trying to understand and edit the HTML source. Therefore, you should make sure that each end tag is properly matched with a start tag.

However, more often an end tag that doesn’t match any start tag means that elements incorrectly overlap. Most elements that overlap on Web pages are quite easy to fix. For instance, consider this common problem found on the White House home page (http://www.whitehouse.gov/, November 4, 1998).

<font size=2><b>
<!-- New Begin -->
<a href="/WH/New/html/19981104-12244.html">Remarks Of The
President Regarding Social Security</a>
<BR>
<!-- New End -->
</font></b>

Because the b element starts inside the font element, it must end inside the font element. All that’s needed to fix it is to swap the end tags like this:

<font size=2><b>
<!-- New Begin -->
<a href="/WH/New/html/19981104-12244.html">
Remarks Of The President Regarding Social Security</a>
<BR>
<!-- New End -->
</b></font>

Alternately, you can swap the start tags instead:

<b><font size=2>
<!-- New Begin -->
<a href="/WH/New/html/19981104-12244.html">
Remarks Of The President Regarding Social Security</a>
<BR>
<!-- New End -->
</font></b>

Occasionally, you may have a tougher problem. For example, consider this larger fragment from the same page. The problem tags are the <FONT size=+1> start tag and its matching </FONT> end tag:

<TD valign=TOP width=85>
<FONT size=+1>
<A HREF="/WH/New"><img border=0
src="/WH/images/pin_calendar.gif"
align=LEFT height=50 width=75 hspace=5 vspace=5></A><br>
</TD>
<TD valign=TOP width=225>
<A HREF="/WH/New"><B>What's New:</B></A><br>
</FONT>
What's happening at the White <nobr>House - </nobr><br>
<font size=2><b>
<!-- New Begin -->
<a href="/WH/New/html/19981104-12244.html">
Remarks Of The President Regarding Social Security</a>
<BR>
<!-- New End -->
</font></b>
</TD>

Here the <FONT size=+1> element begins inside the first <TD valign=TOP width=85> element and continues past that element into the <TD valign=TOP width=225> element where it finishes. The proper solution in this case is to close the FONT element immediately before the first </TD> closing tag, and to then add a new <FONT size=+1> start tag immediately after the start of the second TD element, like this:

<TD valign=TOP width=85>
<FONT size=+1>
<A HREF="/WH/New"><img border=0
src="/WH/images/pin_calendar.gif"
align=LEFT height=50 width=75 hspace=5 vspace=5></A><br>
</FONT></TD>
<TD valign=TOP width=225>
<FONT size=+1>
<A HREF="/WH/New"><B>What's New:</B></A><br>
</FONT>
What's happening at the White <nobr>House - </nobr><br>
<b><font size=2>
<!-- New Begin -->
<a href="/WH/New/html/19981104-12244.html">
Remarks Of The President Regarding Social Security</a>
<BR>
<!-- New End -->
</font></b>
</TD>
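Orphaned and overlapping tags are tedious to spot by eye, so a small checker can help. The following Python sketch uses the standard html.parser module with a stack of open elements. The names NestingChecker, check_nesting, and EMPTY_ELEMENTS are hypothetical, and the list of empty elements is deliberately short. Because it applies well-formedness rules, it also reports elements such as <p> or <li> that are never closed.

```python
from html.parser import HTMLParser

# Elements that are always empty in HTML; they never get an end tag.
EMPTY_ELEMENTS = {"br", "hr", "img", "meta", "input", "link"}


class NestingChecker(HTMLParser):
    """Collect end tags that are orphaned or that close elements out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of element names still open
        self.problems = []   # findings, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in EMPTY_ELEMENTS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in EMPTY_ELEMENTS:
            return
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()  # properly nested
        elif tag in self.open_tags:
            self.problems.append(f"</{tag}> overlaps <{self.open_tags[-1]}>")
            # Drop the most recent matching start tag so checking can continue.
            index = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[index]
        else:
            self.problems.append(f"</{tag}> has no start tag")


def check_nesting(html_source):
    """Return a list of nesting problems found in html_source."""
    checker = NestingChecker()
    checker.feed(html_source)
    checker.close()
    unclosed = [f"<{tag}> is never closed" for tag in checker.open_tags]
    return checker.problems + unclosed


print(check_nesting("<font size=2><b>Remarks</font></b>"))
print(check_nesting("<font size=2><b>Remarks</b></font>"))
```

The parser lowercases tag names, so the reports use lowercase even when the source does not.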

Quote all attributes

HTML attributes only require quote marks if they contain embedded white space. Nonetheless, it doesn’t hurt to include them. Furthermore, using quote marks may help in the future, if you later decide to change the attribute value to something that does include white space. It’s quite easy to forget to add the quote marks later, especially if the attribute is something like an ALT in an <IMG> whose malformedness is not immediately apparent when viewing the document in a Web browser. For instance, consider this <IMG> tag:

<IMG SRC=cup.gif WIDTH=89 HEIGHT=67 ALT=Cup>

It should be rewritten like this:

<IMG SRC="cup.gif" WIDTH="89" HEIGHT="67" ALT="Cup">

The previous fragment from the White House home page has a lot of attributes that require quoting. When the quote marks are fixed, it looks like this:

<TD valign="TOP" width="85">
<FONT size="+1">
<A HREF="/WH/New"><img border="0"
src="/WH/images/pin_calendar.gif"
align="LEFT" height="50" width="75" hspace="5" vspace="5"></A><br>
</FONT></TD>
<TD valign="TOP" width="225">
<FONT size="+1">
<A HREF="/WH/New"><B>What's New:</B></A><br>
</FONT>
What's happening at the White <nobr>House - </nobr><br>
<b><font size="2">
<!-- New Begin -->
<a href="/WH/New/html/19981104-12244.html">
Remarks Of The President Regarding Social Security</a>
<BR>
<!-- New End -->
</font></b>
</TD>
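Quoting attributes by hand across a large page is error-prone, so it is tempting to script it. This is a rough Python sketch using regular expressions. The function name quote_attributes is hypothetical, and the approach assumes that no quoted attribute value contains text that itself looks like name=value.

```python
import re

# A start tag: "<", a letter, then anything up to the next ">".
TAG = re.compile(r"<[A-Za-z][^<>]*>")
# An attribute whose value does not begin with a quote mark.
UNQUOTED = re.compile(r"""(\s[A-Za-z][A-Za-z0-9-]*)=([^\s"'>]+)""")


def quote_attributes(html_source):
    """Wrap unquoted attribute values in double quotes, inside start tags only."""
    def fix_tag(match):
        return UNQUOTED.sub(r'\1="\2"', match.group(0))
    return TAG.sub(fix_tag, html_source)


print(quote_attributes("<IMG SRC=cup.gif WIDTH=89 HEIGHT=67 ALT=Cup>"))
```

Restricting the substitution to start tags keeps ordinary text such as x=1 in the page body untouched.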

Escape <, >, and & signs

HTML is more forgiving of loose less than signs and ampersands than is XML. Nonetheless, even in pure HTML, they do cause trouble, especially if they’re followed immediately by some other character. For instance, consider this e-mail address as it might easily be copied and pasted from the From: header in Eudora:

Elliotte Rusty Harold <elharo@metalab.unc.edu>

Were it to be rendered in HTML, this is all you would see:

Elliotte Rusty Harold

The e-mail address has been unintentionally hidden by the angle brackets. Anytime you want to include a raw less than sign or ampersand in HTML, you really should use the &lt; and &amp; entity references. The correct HTML for such a line would be:

From: Elliotte Rusty Harold &lt;elharo@metalab.unc.edu&gt;

You’re slightly less likely to see problems with an unescaped greater than sign because this will only be interpreted as markup if it’s preceded by an as yet unfinished tag. However, there may be such unfinished tags in a document, and a nearby greater than sign can mask their presence. For example, consider this fragment of Java code.

for (int i=0;i<10;i++) {
  for (int j=20;j>10;j--) {

It’s likely to be rendered as

for (int i=0;i10;j--) {

If these are only 2 lines in a 100-line program, it’s entirely possible you’ll miss the problem when casually proofreading. On the other hand, if the greater than sign is escaped, the unescaped less than sign will probably obscure the rest of the program, and the problem will be much more obvious.
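When text is generated or pasted in by a program, the escaping can be done automatically. Here is a small Python sketch using the standard library's html.escape. The wrapper name escape_text is an assumption for illustration.

```python
from html import escape


def escape_text(text):
    """Escape &, <, and > so the text can be placed inside an HTML element."""
    # quote=False leaves quote marks alone; they only matter inside attribute values.
    return escape(text, quote=False)


print(escape_text("From: Elliotte Rusty Harold <elharo@metalab.unc.edu>"))
print(escape_text("for (int i=0;i<10;i++) {"))
```

Escape the raw text once, before it is wrapped in markup. Running the function over text that is already escaped turns &lt; into &amp;lt;.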

Use the same case for all tags

HTML isn’t case sensitive, but XML is. If you open an element with <TD> you can’t close it with </td>. When I went back to the White House home page for the second edition of this book, I found that they’d fixed the problems I noted above. However, this time I found a lot of elements like this:

<A href="/WH/Services"><B>Commonly Requested Federal Services:</B></a>

The end tags need to at least match the case of the corresponding start tags. Thus in this example, </a> should be </A>, like this:

<A href="/WH/Services"><B>Commonly Requested Federal Services:</B></A>

However, most of the time I’d go a little further. In particular, I recommend picking a single convention for tag case, either all uppercase or all lowercase, and sticking to it throughout the document. This is easier than trying to remember details of each tag. In this book, I’m mostly using all uppercase tags so that the tags will stand out in the text, but for HTML I normally use all lowercase because it’s much easier to type and because, eventually, XHTML will require it. Thus, I’d rewrite the above fragment like this:

<a href="/WH/Services"><b>Commonly Requested Federal Services:</b></a>
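Normalizing tag case is mechanical enough to script. The Python sketch below lowercases element names in start and end tags. The function name lowercase_tags is hypothetical. It leaves attribute names alone, and it will also change any unescaped < followed by letters in comments or body text, so treat it as a starting point only.

```python
import re

# "<" or "</" followed by an element name.
TAG_NAME = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")


def lowercase_tags(html_source):
    """Lowercase element names in start and end tags; attributes are untouched."""
    return TAG_NAME.sub(
        lambda m: "<" + m.group(1) + m.group(2).lower(), html_source
    )


print(lowercase_tags(
    '<A href="/WH/Services"><B>Commonly Requested Federal Services:</B></a>'
))
```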

Cross Reference: XHTML is discussed in Chapter 22.

Include a root element

The root element for HTML files is supposed to be html. Most browsers forgive a failure to include this. Nonetheless, it’s definitely better to make the very first tag in your document <html> and the very last </html>. If any extra text or tags have gotten in front of <html> or behind </html>, move them between <html> and </html>.

One common manifestation of this problem is simply forgetting to include </html> at the end of the document. I always begin my documents by typing <html> and </html>, then type in between them, rather than waiting until I’ve finished writing the document and hoping that by that point, possibly days later, I still remember that I need to put in a closing </html> tag.
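An XML parser makes this rule easy to verify, because it rejects both a missing end tag and stray content after the root element. The following Python sketch assumes the page is otherwise well-formed. The function name has_html_root is hypothetical.

```python
import xml.etree.ElementTree as ET


def has_html_root(source):
    """True if the document is well-formed and its single root element is html."""
    try:
        return ET.fromstring(source).tag.lower() == "html"
    except ET.ParseError:
        # Covers a missing </html> as well as tags or text after it.
        return False


print(has_html_root("<html><body><p>Hi</p></body></html>"))
print(has_html_root("<html><body></body></html><p>stray</p>"))
```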

Close empty tags with a />

Empty tags are the bête noire of converting HTML to well-formed XML. HTML does not formally recognize the XML <elementname/> syntax for empty tags. You can convert <BR> to <BR/>, <HR> to <HR/>, <IMG> to <IMG/>, and so on quite easily. However, it’s a tossup whether any given browser will render the transformed tags properly or not.

Caution: Do not confuse truly empty elements such as <BR>, <HR>, and <IMG> with elements that do contain content but often only have a start tag in standard HTML, such as <P>, <LI>, <DT>, and <DD>.

The simplest solution, and the one approved by the XML specification, is to replace the empty tags with start tag/end tag pairs with no content. The browser should then ignore the unrecognized end tag. For example,

<BR></BR><HR></HR><IMG SRC="cup.gif" WIDTH="89" HEIGHT="67" ALT="Cup"></IMG>

This seems to work well in practice with one notable exception. Netscape treats </BR> the same as <BR>; that is, as a signal to break the line. Thus while <BR> is a single line break, <BR></BR> is a double line break, more akin to a paragraph mark in practice. Furthermore, Netscape ignores <BR/> completely. Web sites that must support legacy browsers (essentially all Web sites) thus cannot use either <BR></BR> or <BR/>. What does seem to work in practice for XML and legacy browsers is this:

<BR />

Note the space between <BR and />. If the space bothers you, you can add an extra attribute like this:

<BR CLASS="empty"/>
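The space-before-slash form can be applied across a page with a short script. This Python sketch handles only <BR>, <HR>, and <IMG>, the three empty elements named in this section. The function name close_empty_tags is hypothetical, and the pattern assumes that attribute values contain no > character.

```python
import re

# An empty-element start tag, with or without an existing trailing slash.
EMPTY_TAG = re.compile(r"<(br|hr|img)\b([^<>]*?)\s*/?>", re.IGNORECASE)


def close_empty_tags(html_source):
    """Rewrite <BR>, <HR>, and <IMG ...> so that each ends with ' />'."""
    return EMPTY_TAG.sub(r"<\1\2 />", html_source)


print(close_empty_tags("<BR>"))
print(close_empty_tags('<IMG SRC="cup.gif" WIDTH="89" HEIGHT="67" ALT="Cup">'))
```

Tags already written as <BR/> or <BR /> come out as <BR />, so the function is safe to run more than once.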

Don’t use any entity references other than &amp;, &lt;, &gt;, &apos;, and &quot;

Many Web pages don’t need entity references other than &amp;, &lt;, &gt;, &apos;, and &quot;. However, the HTML 4.0 specification does define many more including:

  • &trade;, the trademark symbol (™)
  • &copy;, the copyright symbol (©)
  • &infin;, the infinity symbol (∞)
  • &pi;, the lowercase Greek letter pi (π)

There are several hundred others. These are just a sample. However, using any of these will make your document not well-formed. The real solution to this problem is to use a DTD. I discuss the effect that DTDs have on entity references in Chapter 10 of my book XML Bible, 2nd Edition. In the meantime, there are several short-term solutions.

The simplest is to write your document in a character set that has all the symbols you need, and then use a <META> directive to specify the character set in use. For example, to specify that your document uses UTF-8 encoding, a character set discussed in the next chapter that contains all the characters you’re likely to want, you would place this <META> directive in the head of your document.

<META http-equiv="Content-Type" content="text/html; charset=UTF-8"></META>

Alternately, you can simply configure your Web server to emit the necessary content type header:

Content-Type: text/html; charset=UTF-8

However, it’s normally easier to use the <META> tag.

The problem with this approach is that many browsers are not capable of displaying the UTF-8 character set. The same is true of most of the other character sets that you’re likely to use to provide these special characters.

HTML 4.0 supports numeric character references just like XML’s; that is, you can replace a character with &#, the decimal value of the character in Unicode, and a semicolon, or with &#x, the hexadecimal value, and a semicolon. For example:

  • &#8482; is the trademark symbol (™)
  • &#169; is the copyright symbol (©)
  • &#8734; is the infinity symbol (∞)
  • &#960; is the lowercase Greek letter pi (π)

Unfortunately, HTML 3.2 only officially supports the numeric character references between 0 and 255 (ISO Latin-1), and many commonly used Web browsers won’t recognize character references outside this range.
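Where numeric references are acceptable for your audience, swapping them in for the named HTML entities can be automated. This Python sketch relies on the standard html.entities.name2codepoint table of HTML 4 entity names. The function name to_numeric_references is hypothetical. It leaves XML's five predefined references and any unknown names as they are.

```python
import re
from html.entities import name2codepoint

# The five references every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def to_numeric_references(html_source):
    """Replace named HTML entity references with decimal character references."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave as written
        return f"&#{name2codepoint[name]};"
    return ENTITY_REF.sub(replace, html_source)


print(to_numeric_references("&trade; &copy; &infin; &pi;"))
```

The output for the four sample entities matches the decimal values listed above.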

If you’re really desperate for well-formed XML that’s backward compatible with HTML, you can include these characters as inline images. For example:

  • <img src="tm.gif" width="12" height="12" alt="TM"></img> includes the trademark symbol (™)
  • <img src="copyright.gif" width="12" height="12" alt="Copyright"></img> includes the copyright symbol (©)
  • <img src="infinity.gif" width="12" height="12" alt="infinity"></img> includes the infinity symbol (∞)
  • <img src="pi.gif" width="12" height="12" alt="pi"></img> includes the lowercase Greek letter pi (π)

In practice, however, I don’t recommend using these characters as inline images. Well-formedness is not nearly so important in HTML that it justifies the added download and rendering time that using characters as inline images imposes on your readers.

Don’t include an XML declaration

HTML documents don’t need XML declarations. However, they can have them. Web browsers should simply ignore tags they don’t recognize. From their perspective, the line

<?xml version="1.0" standalone="yes"?>

is just another tag. Because browsers that don’t understand XML don’t understand the <?xml?> tag, they quietly ignore it. However, I’ve encountered strange behaviors when different browsers are presented with an HTML document that includes an XML declaration. When faced with such a file, Internet Explorer 4.0 for the Mac tried to download the file rather than displaying it. Netscape Navigator 3.0 showed the declaration as text at the top of the document. Admittedly, these are older browsers, but they are still used by many millions of people. Consequently, since the XML declaration is not required for XML documents and since it doesn’t really add a lot to XMLized HTML pages, I’ve removed it from my Web sites.
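If existing pages already start with an XML declaration, removing it before publishing is a simple text operation. A minimal Python sketch follows. The function name strip_xml_declaration is hypothetical, and it assumes the declaration is the first thing in the file.

```python
import re

# An XML declaration at the very start of the document, plus trailing white space.
# Requiring white space after "xml" keeps other processing instructions intact.
XML_DECLARATION = re.compile(r"\A\s*<\?xml\s[^>]*\?>\s*")


def strip_xml_declaration(source):
    """Remove a leading XML declaration, if present, and return the rest."""
    return XML_DECLARATION.sub("", source, count=1)


page = '<?xml version="1.0" standalone="yes"?>\n<html></html>'
print(strip_xml_declaration(page))
```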

Summary

In this article, you learned about creating well-formed HTML.

About the Author

Elliotte Rusty Harold is an internationally respected writer, programmer, and educator both on the Internet and off. He got his start writing FAQ lists for the Macintosh newsgroups on Usenet and has since branched out into books, Web sites, and newsletters. He’s an adjunct professor of computer science at Polytechnic University in Brooklyn, New York. His books include XML Bible, The Java Developer’s Resource, Java Network Programming, Java Secrets, JavaBeans, XML: Extensible Markup Language, and Java I/O.

© Copyright Hungry Minds, All Rights Reserved
