Friday, October 21, 2005

There's no such thing as an HTML comment, and why that's important

If a comment contains two consecutive dashes and its document has an XHTML doctype declaration, Gecko-based browsers like Netscape 7 and Firefox will show the content after the double dashes. The reason is convoluted but understandable:
  • In the SGML spec, <!--this is a comment--> is actually two nested sets of delimiters:
  • The <! and > are SGML declaration delimiters. Declarations are usually self-enclosed structural definitions meant for a parser, not content meant for a viewer, and do not look or behave like HTML tags:
    <!ELEMENT gallery (imagebase*,image+) --an imagebase child element is optional, but there must be at least one image-- >
    Declarations abound in other kinds of SGML documents (such as DTDs), but HTML only demonstrates one in action, the doctype:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
  • Inside declarations, double-dashes are the actual comment delimiters (see the example with ELEMENT above).
  • Because HTML comes from SGML, its creators saw no reason to reinvent the wheel. Instead of creating an HTML comment tag, they reused an empty SGML declaration with SGML's own declaration comment syntax. There is, in fact, no such thing as an HTML comment tag. When the W3C released the HTML spec, they only reiterated this fine-grained point in a separate document comparing SGML and HTML, because they figured everyone was familiar enough with SGML's 15-year-old standard to understand this already. Double dashes were a familiar choice in any case: they are comment delimiters in several programming languages, and "dash-dash-space-endofline" is the delimiter used by mail programs to separate your message from your sig.
The authors of NCSA Mosaic, one of the earliest widely used browsers, likely reasoned that in the absence of any other SGML declaration types besides doctype, there was no need to parse "HTML comments" beyond the predictable <!-- and --> start/end pair. Newcomers to the browser-writing scene who were unfamiliar with SGML assumed this pair was one set of delimiters, not two. The distinction is almost academic.
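
To make the two readings concrete, here is a rough Python sketch of both rules. It is a simplified model written for this post, not Gecko's or Mosaic's actual parser code, and the function names are made up for illustration. The "naive" rule treats <!-- and --> as one pair; the "SGML" rule treats <! and > as the declaration and lets every -- toggle comment state inside it.

```python
def naive_comment_end(text, start):
    """Tag-soup rule: the comment runs from '<!--' to the first '-->'.

    Returns the index just past the comment, or -1 if it never closes.
    """
    end = text.find("-->", start + 4)
    return -1 if end == -1 else end + 3


def sgml_comment_end(text, start):
    """Simplified SGML rule (an assumption-laden model, not a full parser).

    Inside '<!' ... '>', each '--' opens or closes a comment, and '>'
    only closes the declaration while no comment is open.
    Returns the index just past the declaration, or -1 if it never closes.
    """
    i = start + 2  # skip the '<!' declaration opener
    in_comment = False
    while i < len(text):
        if text.startswith("--", i):
            in_comment = not in_comment  # '--' toggles comment state
            i += 2
        elif text[i] == ">" and not in_comment:
            return i + 1
        else:
            i += 1
    return -1


ok = "<!-- plain comment -->text"
broken = "<!-- a -- b -->rest"

print(naive_comment_end(ok, 0), sgml_comment_end(ok, 0))          # 22 22
print(naive_comment_end(broken, 0), sgml_comment_end(broken, 0))  # 15 -1
```

With a well-behaved comment the two rules agree. With a stray -- in the middle, the naive rule still stops at the first -->, while the SGML reading finds a comment still open when it reaches the >, so the declaration does not end where the author expected. That disagreement is the kind of mismatch behind the behavior described at the top of this post.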

Doctypes change the rules. Remember, doctypes are declarations, and declarations define a parser's behavior. In their absence (or in the presence of an HTML 4.0 doctype), browsers typically revert to "quirks mode," which often means they parse HTML using the parser they had in 1999. Add an XHTML doctype, however, and the document will be parsed to stricter guidelines more tightly conforming to the XML spec...

...in theory. Gecko browsers (Mozilla/Firefox/Netscape) currently enforce SGML comment syntax in the presence of an XHTML doctype; IE and Opera do not. For now, there isn't much more incentive to do so than there was in 1999; however, as browser-based XML/XSLT web apps take off, parsing the XML datasets' custom DTDs will become more necessary, and as hinted before, DTDs are just a laundry list of SGML declarations. Given their common heritage and syntax, it's conceivable that a browser's DTD parser and the XHTML/XML/XSLT parser will merge enough to use the same rule for SGML comments in all contexts.

You'll see more SGML in web pages in the future, especially when web servers start using XHTML's correct MIME type (application/xhtml+xml) and stop using HTML's "text/html". This change in MIME type triggers even stricter XML parsing in Gecko and Opera, which means that XML-nonconformant content such as JavaScript and CSS inside web pages has to be escaped with <![CDATA[ ... ]]> blocks. (If you're using external .css/.js files, none of this is an issue.)

Unlike SGML comments, CDATA doesn't hide content; it tells the XML parser to treat it as raw character data rather than markup (similar to how the PRE tag works), which allows JavaScript/CSS operators like < > -- to exist without triggering the error that started this article. Remember, this is 2005 and the number of browsers today that still attempt to render the contents of style/script blocks as text is practically nil, even counting microbrowsers, Lynx, and Netscape 4.7. SGML comment hiding of style/script blocks is unnecessary and deprecated.
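
If you want to see the CDATA effect without a browser, a strict XML parser will do. The sketch below uses Python's standard xml.etree.ElementTree as a stand-in for a browser's XML mode (an assumption for illustration only; browsers use their own parsers), and the helper name parses_as_xml is made up for this example.

```python
import xml.etree.ElementTree as ET

# A script body containing a bare '<', which XML reads as the start of a tag.
RAW = "<script>if (a < b) { a--; }</script>"

# The same script body wrapped in a CDATA section.
WRAPPED = "<script><![CDATA[if (a < b) { a--; }]]></script>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(RAW))            # False
print(parses_as_xml(WRAPPED))        # True
print(ET.fromstring(WRAPPED).text)   # if (a < b) { a--; }
```

The wrapped version parses, and the script text comes back intact rather than hidden, which is the difference between CDATA and comment hiding.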
