Surfin' Safari

Understanding HTML, XML and XHTML

Posted by Maciej Stachowiak on Wednesday, September 20th, 2006 at 3:35 pm

OR, Close your <script> and <canvas> tags!

The relationships among HTML, XML and XHTML are an area of considerable confusion on the web. We often see questions on the webkit-dev mailing list where people wonder why their seemingly XHTML documents result in HTML output. Or we’re asked why an XML construct like <b /> doesn’t actually close the bold tag.

This article will attempt to clear up some of that confusion.

You may be wondering what the subtitle has to do with the title. Well, the HTML/XHTML distinction may seem like an obscure topic, but it can have significant practical effects. In particular, it is likely to affect Dashboard Widget developers in a huge way in upcoming WebKit versions. I’ll explain further at the end.

What are HTML, XML and XHTML?

The original language of the World Wide Web is HTML (HyperText Markup Language), often referred to by its current version, HTML 4.01 or just HTML4 for short. HTML was originally an application of SGML (Standard Generalized Markup Language), a sort of meta-language for making markup languages. SGML is quite complicated, and in practice most browsers do not actually follow all of its oddities. HTML as actually used on the web is best described as a custom language influenced by SGML.

Another important thing to note about HTML is that all HTML user agents (a catchall term for programs that read HTML, including web browsers, search engine web crawlers, and so forth) have extremely lenient error handling. Many technically illegal constructs, like misnested tags or bad attribute names, are allowed or safely ignored. This error handling is relatively consistent between browsers, but there are lots of differences in edge cases, because the behavior is not documented or part of any standard. This is why it is a good idea to validate your documents.

XML and XHTML are quite different. XML (eXtensible Markup Language) grew out of a desire to be able to use more than just the fixed vocabulary of HTML on the web. It is a meta-markup language, like SGML, but one that simplifies many aspects to make it easier to make a generic parser. XHTML (eXtensible HyperText Markup Language) is a reformulation of HTML in XML syntax. While very similar in many respects, it has a few key differences.

First, XML always needs close tags, and has a special syntax for tags that don’t need a close tag. In HTML, some tags, such as img, are always assumed to be empty and close themselves. Others, like p, may close implicitly based on other content. And others, like div, always need to have a close tag. In XML (including XHTML), any tag can be made self-closing by putting a slash before the closing angle bracket, for example <img src="funfun.jpg"/>. In HTML that would just be <img src="funfun.jpg">.

Second, XML has draconian error-handling rules. In contrast to the leniency of HTML parsers, XML parsers are required to fail catastrophically if they encounter even the simplest syntax error in an XML document. This gives you better odds of generating valid XML, but it also makes it very easy for a trivial error to completely break your document.
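The contrast between lenient and draconian error handling can be seen with two parsers from the Python standard library. This is only a rough sketch: `html.parser` is a tokenizer rather than a full browser parser, and the `TagLogger`, `parse_leniently` and `parse_strictly` names are made up for illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Misnested markup: <b> is closed before the <i> that was opened inside it.
MISNESTED = "<p><b><i>bold italic</b></i></p>"


class TagLogger(HTMLParser):
    """Records every tag the lenient HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append("open " + tag)

    def handle_endtag(self, tag):
        self.events.append("close " + tag)


def parse_leniently(markup):
    """HTML-style handling: the misnesting is reported, not rejected."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def parse_strictly(markup):
    """XML-style handling: the first well-formedness error is fatal."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        return "fatal: " + str(error)
    return "ok"


print(parse_leniently(MISNESTED))
print(parse_strictly(MISNESTED))
```

The HTML tokenizer walks through all six tags without complaint, while the XML parser stops at the first mismatched close tag and produces no document at all.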

HTML-compatible XHTML

When XML and XHTML were first standardized, no browser supported them natively. To enable at least partial use of XHTML, the W3C came up with something called “HTML-compatible XHTML”. This is a set of guidelines for making valid XHTML documents that can still more or less be processed as HTML. The basic idea is to use self-closing syntax for tags where HTML doesn’t want a close tag, like img, br or link, with an extra space before the slash. So our ever-popular image example would look like this: <img src="funfun.jpg" />. The details are described in Appendix C of the XHTML 1.0 standard.

It’s important to note that this is kind of a hack, and depends on the de facto error handling behavior of HTML parsers. They don’t really understand the XML self-closing syntax, but writing things this way makes them treat / as an attribute, and then discard it because it’s not a legal attribute name. And if you tried to do something like <div />, they wouldn’t understand that the div is supposed to be empty.
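To make the `<div />` pitfall concrete, here is a simplified model of how an HTML parser tracks open elements when it ignores the trailing slash. It is a sketch under stated assumptions, not browser code: `VOID_ELEMENTS` is a partial, illustrative list, and `OpenElementTracker` and `still_open` are hypothetical names. Python's `html.parser` would by default honor the slash XML-style, so the sketch overrides that to mimic the HTML behavior described above.

```python
from html.parser import HTMLParser

# Elements that HTML always treats as empty (an illustrative subset).
VOID_ELEMENTS = {"img", "br", "hr", "link", "meta", "input"}


class OpenElementTracker(HTMLParser):
    """Tracks which elements are still open, ignoring any trailing slash."""

    def __init__(self):
        super().__init__()
        self.open_elements = []

    def handle_starttag(self, tag, attrs):
        # Void elements close themselves; everything else stays open.
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Mimic HTML parsing: the "/" is discarded, so <div /> is just <div>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        # Close the matching element and anything opened inside it.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass


def still_open(markup):
    """Returns the elements left open after the markup has been read."""
    tracker = OpenElementTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.open_elements


print(still_open('<img src="funfun.jpg" /><p>hi</p>'))
print(still_open('<div /><p>hi</p>'))
```

The img case leaves nothing open because img is empty in HTML regardless of the slash. The div case leaves the div open, so everything after it ends up inside it.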

There are also many other subtle differences between HTML and XHTML that aren’t covered by this simple syntax hack. In XHTML, tag names are case sensitive, scripts behave in subtly different ways, and missing implicit elements like <tbody> aren’t generated automatically by the parser.

So if you take an XHTML document written in this style and process it as HTML, you aren’t really getting XHTML at all – and trying to treat it as XHTML later may result in all sorts of breakage.

What determines if my document is HTML or XHTML?

You may be a bit thrown off by the last section’s talk of treating XHTML as HTML. After all, if my document is XHTML, that should be the end of the story, right? I put in an XHTML doctype! But it turns out that things are not so simple.

So what really determines if a document is HTML or XHTML? The one and only thing that controls whether a document is HTML or XHTML is the MIME type. If the document is served with a text/html MIME type, it is treated as HTML. If it is served as application/xhtml+xml or text/xml, it gets treated as XHTML. In particular, none of the following things will cause your document to be treated as XHTML:

  • Using an XHTML doctype declaration
  • Putting an XML declaration at the top
  • Using XHTML-specific syntax like self-closing tags
  • Validating it as XHTML

In fact, the vast majority of supposedly XHTML documents on the internet are served as text/html. Which means they are not XHTML at all, but actually invalid HTML that’s getting by on the error handling of HTML parsers. All those “Valid XHTML 1.0!” links on the web are really saying “Invalid HTML 4.01!”.
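The MIME type rule can be written down as a small decision function. This is a minimal sketch of the rule as stated above, not how any particular browser implements it; `processing_mode` is a hypothetical name, and types the article does not mention are simply reported as "other".

```python
XHTML_TYPES = {"application/xhtml+xml", "text/xml"}


def processing_mode(content_type, document_text=""):
    """Picks HTML or XHTML processing from the Content-Type header alone.

    document_text is accepted only to stress that it is never consulted:
    doctypes, XML declarations and self-closing tags change nothing.
    """
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime_type = content_type.split(";")[0].strip().lower()
    if mime_type == "text/html":
        return "html"
    if mime_type in XHTML_TYPES:
        return "xhtml"
    return "other"


XHTML_LOOKING = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><br /></body></html>'
)

print(processing_mode("text/html; charset=utf-8", XHTML_LOOKING))
print(processing_mode("application/xhtml+xml", XHTML_LOOKING))
```

The same XHTML-looking document is handled as HTML in the first call and as XHTML in the second, purely because of the header.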

HTML is probably what you want

Perhaps you’re just now realizing that your lovingly crafted valid XHTML document is actually invalid HTML. You have a couple of choices:

  1. Serve your content as application/xhtml+xml. That’s probably not such a good idea though. Microsoft Internet Explorer will not handle XHTML at all, and serving it such a MIME type will lead it to offer the page as a download instead of rendering it. Unless you’re willing to completely lock out IE users, you probably don’t want to take this option.
  2. Serve as text/html to IE, but as application/xhtml+xml to other browsers. This way your content at least has a chance of working in IE, and uses HTML-compatible XHTML for its original intended purpose, as a fallback compatibility hack. However, there are still downsides. Your documents will be processed in entirely different ways in IE vs other browsers. A construct that may be perfectly valid HTML could totally break XML parsing, due to the strict error handling rules. Or conversely, some kinds of valid XHTML changes might result in an HTML document that looks wrong. Furthermore, the XHTML modes of the browsers that support it are not nearly as mature or well tested as the HTML modes. This is definitely the case for Safari. Mozilla also discourages this practice due to lack of support for incremental rendering, and maintains a list of some of the many differences in processing XHTML vs HTML.
  3. Stick with the status quo. That is, keep generating XHTML content but serve it as HTML. The disadvantages here are mainly that you lose out on HTML validators, which would validate the document in a way that matches how browsers parse it, and that you run the risk of subtle incompatibilities if your document is ever actually processed as XHTML. But this also raises the question: what do you think you are getting out of using XHTML? You may have heard a lot of hype about it, experts may have told you it’s the next big thing, but what kind of benefits do you get if in the end it’s just treated as HTML tag soup?
  4. Serve valid HTML. This is the option I recommend – serve valid HTML documents with a text/html MIME type. This way you’ll be using the best-tested mode of web browsers, won’t have to worry as much about weird compatibility issues, and will get the most benefit out of HTML-based toolchains.
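For readers curious what option 2 involves, the server-side choice is commonly made from the request's Accept header. The sketch below is one hypothetical way to do it and makes two loud assumptions: that a browser able to process XHTML lists application/xhtml+xml in its Accept header, and that q-values can be ignored. `choose_content_type` and both sample headers are illustrative only.

```python
def choose_content_type(accept_header):
    """Returns the MIME type to serve, based on what the client advertises.

    Simplification: q-values and wildcards are not weighed; the XHTML type
    is used only when the client names it explicitly.
    """
    offered = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    if "application/xhtml+xml" in offered:
        return "application/xhtml+xml"
    return "text/html"


# Illustrative header from a client that names the XHTML type.
XHTML_AWARE = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
# Illustrative header from a client that only names image types and */*.
NOT_XHTML_AWARE = "image/gif, image/jpeg, */*"

print(choose_content_type(XHTML_AWARE))
print(choose_content_type(NOT_XHTML_AWARE))
```

Even with such a switch in place, the downsides listed under option 2 still apply: the same document goes through two different parsers depending on who asks for it.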

So overall it seems best to go with HTML, and follow through consistently. But don’t just take my word for it. Leading web standards experts like Ian Hickson, Anne van Kesteren and Mark Pilgrim have all pointed out the pitfalls of serving XHTML as HTML.

Best practices

On today’s web, full XHTML processing is not an option, so the best thing to do is to stick consistently with HTML4. Here’s how to do that:

  1. Use an HTML4 doctype declaration, ideally one that will trigger “standards mode” in web browsers. One example of an HTML4 standards mode doctype is <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
  2. Serve your content with the text/html MIME type, or for local content give it a .html or .htm suffix. This will lead browsers, search engines, and other apps to properly process your content as HTML.
  3. Validate your content as HTML, not as XHTML. One handy way is using a validation service, such as the W3C Validator. (But beware, the validator looks at your doctype instead of the MIME type, unlike browsers.)

Unfortunately, sometimes you are not fully in control of the content you produce. For example, this very blog is published with WordPress, which generates XHTML-style tags. If you find yourself in this same boat, encourage your tools vendors to provide support for generating valid HTML.

About those <script> and <canvas> tags

I promised at the start of this post to tell you all what this had to do with closing script and canvas tags in Dashboard widgets. Well, the upshot is that XML-style self-closing syntax in HTML is not always so innocuous.

The Safari 2.0 version of WebKit had a special quirk for treating script elements with the self-closing syntax (like this: <script src="myscript.js" />) as if they were actually properly closed. At the time Gecko-based browsers like Firefox had a similar quirk, and we decided to copy it for compatibility with particular web sites. However, future versions of Firefox will remove this quirk, and this kind of behavior is going to be explicitly outlawed by future standards that build on HTML, such as Web Apps 1.0. So we will probably remove this quirk in future versions of WebKit as well. Unfortunately, HTML relying on this parsing quirk has crept into a lot of Dashboard widgets. A WebKit that didn’t support this quirk would lead to broken widgets – the external script code would never run.

There is a similar issue with the canvas element, as it makes its way through the standardization process. canvas was originally implemented in Safari as an empty tag like img, but standards and other browsers have all gone with making it require an explicit close tag, to support fallback content. Widgets hit two different pitfalls here – many use the XML self-closing syntax (<canvas />), while others have just a plain old unclosed (<canvas>) tag. Either way, you need to change to using an explicit close tag (<canvas></canvas>), or future WebKit versions will think all the rest of your document is inside the canvas element and won’t render it.
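A quick way to audit a widget for both pitfalls is to scan its HTML source with a couple of regular expressions. This is a rough sketch, not a real parser: it would be fooled by a ">" inside an attribute value or by markup inside comments, and the `find_self_closed` and `canvas_needs_close_tag` names are made up for this example.

```python
import re

# <script ... /> or <canvas ... /> written with XML self-closing syntax.
SELF_CLOSED = re.compile(r"<(script|canvas)\b[^>]*/>", re.IGNORECASE)
CANVAS_OPEN = re.compile(r"<canvas\b", re.IGNORECASE)
CANVAS_CLOSE = re.compile(r"</canvas\s*>", re.IGNORECASE)


def find_self_closed(markup):
    """Returns the names of script/canvas tags that use the "/>" syntax."""
    return [match.group(1).lower() for match in SELF_CLOSED.finditer(markup)]


def canvas_needs_close_tag(markup):
    """True when some <canvas> start tag has no matching </canvas>."""
    opens = len(CANVAS_OPEN.findall(markup))
    closes = len(CANVAS_CLOSE.findall(markup))
    return opens > closes


WIDGET = (
    '<script src="myscript.js" />'
    '<canvas id="gauge"/>'
    '<script src="ok.js"></script>'
)

print(find_self_closed(WIDGET))
print(canvas_needs_close_tag(WIDGET))
```

Any tag reported by `find_self_closed` should be rewritten with an explicit close tag, and a True result from `canvas_needs_close_tag` means at least one canvas still lacks its </canvas>.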

Conclusion

It’s easy to get confused about HTML and XHTML, and many of the experts out there give misleading advice on the subject. Fortunately, most of the time it doesn’t matter. But sometimes it does, and can badly break your content. So make sure you understand the difference, and serve up some good clean markup.

95 Responses to “Understanding HTML, XML and XHTML”

  1. Federico Says:

    The Safari 2.0 version of WebKit had a special quirk for treating script elements with the self-closing syntax…
    In text/html, right?

  2. Pingback from Dear WebKit Open Source Project « Slash dev Slash null:

    [...] Dear WebKit Open Source Project Regarding you recent blog post: “Understanding HTML, XML, and XHTML”. Before we begin, I’d like to point out that I did read the w [...]

  3. mcroft Says:

    Yeah, I’ve been bit by weird self-closing tag attempts in HTML. I ran into an HTML page that used <a id="myid" />, which makes sense for anchors as “targets I link to within a document”, but not so much for anchors as “enclosing content that links to some other document or target”. Not that I see any way to change that overloaded tag type at this stage.

    So http://www.whiterose.org/test/testanchor2.html is a page that validates as XHTML 1.0 Transitional and serves as text/html. The anchor tag with the misguided self-closing “/” confuses the hell out of all the browsers I saw, and is similarly broken on all browsers I tried. To be honest, I’m not even sure if it should just self-close or not.

    The weirdest thing is that Safari and Konqueror display different behavior than Mozilla family/IE6 if the self-closing anchor tag is inside 2 nested divs, but not just 1. I ran into this in the wild on a reasonably popular blog and tried to report it as a Safari bug (8879), but it was closed as badly reported. I’d be happy to have help describing it better, because while it would be nice if it worked right, it should at least not work too differently from Mozilla and IE. This still happens differently with revision 16490 and Firefox 2.0b2.

  4. Trackback from Ones and Zeros:

    Every now and then I try to fix stuff, or at least identify what needs fixed…

    A while back I was bit by wierd self-closing tag attempts in HTML. I ran into an HTML page that used <a…

  5. Drew Says:

    In fact, the vast majority of supposedly XHTML documents on the internet are served as text/html. Which means they are not XHTML at all, but actually invalid HTML that’s getting by on the error handling of HTML parsers. All those “Valid XHTML 1.0!” links on the web are really saying “Invalid HTML 4.01!”.

    Does anyone find it ironic that this blog is marked as XHTML 1.0 Transitional?

  6. Pingback from Kernel Mustard » Blog Archive » It turns out that nobody knows how to use XHTML:

    [...] to use XHTML The Surfin’ Safari blog has a post by maciej pointing out that most people use XHTML wrong, including (in particular) almost everyone that displays the “Valid XHTML 1.0 [...]

  7. skybrian Says:

    In answer to “what do you get out of this?”

    The main thing I get is an easy way to write rigorous unit tests. If all my server’s pages are well-formed XML, I can use any convenient XML parser in my unit tests, and the parser will tell me if there are mismatched tags and many other problems. Then I can run XPath expressions on the DOM tree to make sure the right content makes it onto the page in the right spot. This can also be done with a forgiving parser like HTML Tidy, but that lets through too much obviously bad HTML, leading to more time debugging when something goes wrong.

    (And the same thing could probably also be done using a strict HTML parser that generates a document model supporting xpath queries, but they seem a bit harder to find, while XML parsers are everywhere.)

    So all I really care about is that the page looks fine in all browsers and is also well-formed XML. Give me that and I’ll use whatever mime-type or document declaration the standards people say to use; I don’t care. The way that seems to work the best right now is to have the browser treat the page as (broken, but acceptable) HTML.
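The testing approach skybrian describes could be sketched roughly as follows, using the XML parser and the limited XPath support in the Python standard library. The page content, the `content` id and the `heading_text` helper are all invented for illustration and do not come from the comment.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for XPath queries against XHTML elements.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# A made-up well-formed page, standing in for real server output.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Orders</title></head>
<body><div id="content"><h1>Your orders</h1></div></body>
</html>"""


def heading_text(page_source):
    """Parses the page strictly, then looks up the main heading by path."""
    # fromstring raises ET.ParseError on mismatched tags and similar errors,
    # which is what makes the test fail loudly on broken markup.
    root = ET.fromstring(page_source)
    heading = root.find(".//x:div[@id='content']/x:h1", XHTML_NS)
    return None if heading is None else heading.text


print(heading_text(PAGE))
```

A test would assert on the returned text and let any ParseError fail the run, so mismatched tags surface immediately instead of being silently repaired.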

  8. dutchcelt Says:

    I’ve argued with Anne, Faruk and other webheads on this topic often enough. The problem is that at one time the “gurus”, I use the term loosely, compelled us to use XHTML and now we’re back to HTML. Somebody make their mind up! The fact is, and this post actually suggests this already, that it doesn’t really matter that much, just as long as you make an informed choice. Writing XHTML is fine; just be aware of the pitfalls.

    Personally I use both. XHTML for web applications and HTML for content driven/focused websites. Most CMS systems can’t handle (x)HTML reliably so I go for the format that gives me the most margin for error within its specification.

  9. Philippe Says:

    1. Thanks for advocating HTML 4.0. It would be nicer though to suggest a ‘strict’ doctype, esp for new documents.

    2. WordPress can be made to work with an HTML 4.0 DTD:
    http://www.robertnyman.com/2006/09/20/how-to-deliver-html-instead-of-xhtml-with-wordpress/
    There are plugins for other tools, like Textpattern.

    3. Thanks for adding decent HTTP response headers to WebKit (noticed recently).

  10. Mark Rowe Says:

    Drew, that is a very good point. We recently went over the static portions of the website and ensured that they are valid HTML 4.01 Strict, but the blog is a little bit trickier. Some of the HTML that WordPress generates is XHTML-ish, so we need to make a few tweaks to get it to generate valid HTML. It should be there in the next day or two. Practicing what we preach is always a good idea :-)

  11. Matt Round Says:

    I can’t help but sigh every time I read a well-meaning article recommending HTML ahead of XHTML, as I know it’ll confuse many novice developers and give others the excuse to stick with outdated practices for a few more years.

    Even though XHTML served as HTML isn’t technically correct, you’re overlooking one key advantage: a developer only has to know one syntax for markup and XML. No one should be taught HTML nowadays, and a move from invalid HTML to valid XHTML served as HTML is still a worthwhile step forward. Tackling the parsing and scripting issues with true XHTML is a whole separate battle.

  12. Pingback from Origa.me » Blog Archive » Understanding HTML, XML and XHTML:

    [...] e; don’t try and move the whole world back to 1999. There you go … rant over Understanding HTML, XML and XHTML: “OR, Close your <script> and <canvas> tags! The relationships [...]

  13. mcroft Says:

    Should the w3c validator have a category for HTML-compatible XHTML that checks the MIME type? Lots of developers think they’re doing pretty well if they get the w3c blessing. Even if it just described the actual situation for the page better, it might help educate developers. “Warning, this page contains valid XHTML 1.0 Transitional but is served as text/html by the web server. User agents will interpret this as invalid HTML4.”

    Would that help? Would the w3c validator folks agree?

  14. Pingback from crawlspace|media :: blog:

    [...] HTML & XHTML Clarity September 21st, 2006 Understanding HTML, XML and XHTML via Surfin’ Safari. The gist…if you’re serving XHTML as HTML yo [...]

  15. Pingback from write.myobie.com » Blog Archive » XHTML or HTML:

    [...] ive Post XHTML or HTML Read: http://webkit.org/blog/?p=68 When I first decided to use XHTML, it was so that when browsers could understand it, I [...]

  16. Pingback from Naik’s News » Understanding HTML, XML and XHTML:

    [...] zed |

    [link][more] Click here for original website post by Cyrus Farivar and re-bloged by Naik Michel

    [...]

  17. luomat Says:

    http://www.w3.org/TR/xhtml-media-types/#summary says there’s nothing wrong with serving XHTML as text/html if it’s HTML compatible.

    So we can write XHTML, serve it today as HTML, and then when IE6 dies, switch to a better media type.

    I still fail to see why we have to have this argument by some expert every 6 months. Aren’t there more productive things to do than rehash arguments which have been made repeatedly by others?

  18. xenon Says:

    @mcroft, I think that would help remove some of the confusion.

  19. elmer fudd Says:

    this is getting ridiculous!!! Please explain the part where you said something like: “html is more understood by browsers than XHTML”

    if i’m declaring MY document as XHTML, then it’s MY responsibility to make it proper, BUT the browser HAS TO accept the document declaration and parse it as such.

    XHTML is supposed to remove a large swath of ambiguities from the old HTML docs. It’s supposed to make browsers leaner. Are you saying that we’ve been led by the nose?
    At this point I couldn’t care less. If safari doesn’t like XHTML, it will join the internet exploder club and become the last browser i code for. XHTML works well with AJAX parsing snippets or server side scripting languages going through document nodes, thanks to well documented xml parsers:: Things that you haven’t addressed in this post.
    At this point you are penalising the people that are striving most to make the web a more harmonious place. Are you trying to justify apple’s lame attempt at a web editor?

    I understand there are issues with the specs, but i’m sure it’s something that can be worked out.

  20. h3h Says:

    It’s good to raise awareness about the differences between XHTML and HTML, but I’d advise most to take it with a grain of salt. More reading:

    Sending XHTML as text/html Considered Harmful to Feelings

    Correct HTML

  21. js.eu Says:

    I don’t believe Safari or Apple is against XHTML or XML. Elmer was a bit aggressive on this topic, as I believe many in here were.

    Let’s be real: specs are specs. Safari works well with XHTML; all it wants is to conform with the specs. IE won’t work if all browsers tend to conform with something that others won’t. Safari needs a way to support XML within the specs, and a solution has to be found.

    what about using something like this..

    for xhtml

    this way IE would load because the mime type is text/html and Safari would be smart to know that it should treat it as xhtml.

    does this work on safari?

  22. js.eu Says:

    Oops, couldn’t post HTML in here. Here it is (I hope):
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Site Title</title>
    <meta name="ROBOTS" content="ALL" />

    </head>

    </html>

  23. js.eu Says:

    for xhtml  we would have..
    <meta http-equiv="Content-Type" content="text/xml; charset=utf-8" />

  24. donnie Says:

    The entire argument of separating style and data is left out here. In my opinion XHTML is just a transitioning markup between HTML and XML. XSLT is left entirely out of this equation. I disagree 100% with declaring your doctype as plain html. HTML is an obsolete markup, and its looser set of syntax error handling is a detriment, not a feature. Read STANDARDS here. When code breaks it’s a good thing, it keeps everyone on the same page, so we don’t have the Microsofts of the world mishandling MIME types. It seems, Maciej, that you are discouraging standards practices here.

  25. nickfitz Says:

    @Elmer Fudd: It is not correct to say that “if i’m declaring MY document as XHTML… the browser HAS TO accept the document declaration and parse it as such”.

    Neither the HTML 4.01 nor the XHTML 1.0 Recommendations impose any requirement on User Agents to pay any attention at all to the Document Type Declaration. If a User Agent wishes to claim to be a Validating User Agent, then it will use the declared DTD to validate the document, but this still says nothing about how it should then parse it into a render tree. However, the Recommendations do go into some detail about the MIME type with which the page is served being used to determine how a document should be parsed.

    In other words, if you want a browser (none of which, AFAIK, claim to be Validating User Agents when dealing with XHTML) to treat your content as XHTML, you need to serve it with the correct MIME type. Your Document Type Declaration has no bearing on the matter.

  26. Pingback from Another Caffeinated Day:

    [...] and visa versa.If you spend a lot of time coding XHTML 1.0 and 1.1 web content like I do, read this post on Understanding HTML, XML and XHTML. You won't waste y [...]

  27. Mark Rowe Says:

    @donnie: The last time I checked, HTML 4 was as much a standard as XHTML 1.0. It is quite clear that HTML 4 is not obsolete. The majority of the web uses HTML 4, browsers support it to a much better level than XHTML. If the major browsers supported XHTML to as good a level as their HTML 4 support then your point may have some validity. Until this happens the most pragmatic approach is to simply use the working, valid alternative: HTML 4.

    “When code breaks it’s a good thing, it keeps everyone on the same page, so we don’t have the Microsofts of the world mishandling MIME types.”

    Are you really suggesting that everyone switch to XHTML and serve it with the correct MIME type? As Maciej notes, this will result in Internet Explorer prompting the user to save the file to disk rather than rendering it as a web page. The alternative would be to serve XHTML as text/html, which is exactly the mishandling of MIME types that you complain about . . .

  28. donnie Says:

    @mark: The point I was trying to make is that the separation of data and style was left out. HTML is _becoming_ an obsolete language. Just because “the majority of the web uses HTML 4” doesn’t mean that the abstraction I mentioned above should no longer be a goal. Naturally, technology is progressive. To revert to HTML seems lazy to me. I am not advocating leaving MSIE users in the dust by serving up true XHTML, but using well formed and semantic XHTML will give a web designer headway and understanding to the disciplines of the future. Like I said XHTML is a transitional markup, meant to smooth the eventual adoption of XML/XSLT/CSS.

    Of course, there are those who leave the progression up to others and continue on in the fashion of Peter Keating. If that’s your bag then that’s fine. But I don’t understand why stepping toward the abstraction of content and style here is not considered a “best practice.” Not that you can’t achieve a level of abstraction using HTML, but I think we can agree the layer is thicker using XHTML.

  29. Mark Rowe Says:

    @donnie: I’m confused as to why you think the use of XHTML is required to produce well formed, semantic markup. It is every bit as possible to create valid, standards-compliant, semantic markup using HTML 4 as it is with XHTML. It is equally possible to separate content from styling with HTML 4 and XHTML. An XML document can very easily be transformed via XSLT to produce a valid HTML 4 document that is styled via CSS. Heck, if you perform the XSLT server-side you could even serve up valid XHTML, with the correct MIME-type, to user agents that accept it while serving valid HTML 4 to others.

    As the XHTML 1.0 specification states, XHTML is simply “a reformulation of the three HTML 4 document types as applications of XML 1.0″. Can you enlighten me as to how HTML 4 as an XML dialect increases abstraction?

  30. Pingback from The Ankle Biter @ Quibbling.net » Blog Archive » links for 2006-09-22:

    [...] « usability-gasm, and a cool product links for 2006-09-22 Surfin’ Safari – Blog Archive » Understanding HTML, XML and XHTML via Jeff, an educational post on stuff I was on [...]

  31. Pingback from Justin Blanton | Understanding HTML, XML and XHTML:

    [...] This is a “bit” and not a regular weblog entry Understanding HTML, XML and XHTML.
    © 1999-2006 Justin Blanton [...]

  32. Pingback from links for 2006-09-22 « greencrab capsules:

    [...] ant Guide (tags: food wiki) UNIX productivity tips (tags: linux productivity tutorial) Surfin’ Safari – Blog Archive » Understanding HTML, XML and XHTML (tags: xml web) UNIX History (tags: linux his [...]

  33. ppk Says:

    >Another option is to just stick with the status quo – generate XHTML content but serve it as HTML. [...] this also raises the >question: what do you think you are getting out of using XHTML?

    The possibility of importing these pages as XML later on. On QuirksMode.org I moved all pages to XHTML 1.0 because in the future I might decide to implement an Ajax-y kind of interface that loads the content pages as XML. With HTML that’s impossible, with XHTML it’s quite easy.

    In short: forward compatibility.

  34. elmer fudd Says:

    to nickfitz
    I’m using xhtml because i strive to make the document structure behave according to intentions. I come from a typographic background in print design. Clients send me lousy word documents that don’t even use proper indentation and list styles. Obviously it’s my work to clean up the client’s mess. Now I’m told that using xml type documents is not good. This blog is telling me to appease the lowest common denominator à la Microsoft: look where it got them!

    If a browser cannot handle the presentation, then offer the document in a simple layout. You either can or you can’t. at least the info will be available and understandable.

    Please note that that’s my primary aim, i haven’t even touched on the MVC paradigm. What has happened to the great ideals and aspirations of the internet? Are we getting bogged down by the attrition of the unwashed masses? I’m striving to reach the standards because they work, now I’m being kicked in the face because many can’t get there. So what they pay somebody who can!

    And for an other contributor: afaik XHTML conforms to XML. If you want to use your particular XML schema you still can and in modern browsers you can style it as well, but your schema will not be a standard.

    The biggest problem -from my ignorant perspective- is getting CSS and javascript to behave as needs be.
    I NEED THE STRUCTURE PROVIDED BY XHTML. Most of it comes from typographic rules of old. Why do we have to invent the wheel instead of capitalising and building on traditions that work.

    I’m sorry if I sound rude. I appreciate the work that goes here and the major effort needed. Respect

  35. squareman Says:

    Like PPK, I think it is a good idea for forward compatibility; but more so I code to XHTML for a bit of social engineering. What do I mean? Read on.

    Yes, everyone who has said it is right about HTML and XHTML being equally capable of doing separation of presentation from content from behavior. I would love to be coding in HTML 4.01 STRICT, but there is a _giant_ backlog of existing content with things like target="" attributes and other deprecated items that I unfortunately must continue to support. I cannot code to 4.01 STRICT and have it validate. My choices are then 4.01 TRANSITIONAL or XHTML 1 TRANSITIONAL, both allowing deprecated elements but XHTML allowing a smaller subset of them (tossing FONT and other heinous ones out the window).

    Yes, the document is being served as text/html and ultimately, therefore, as HTML. But, and that’s a huge “but,” specifying that DOCTYPE is getting people around the company to write better formed, more-semantic markup. It’s improving the quality of the code—not because XHTML is a better format, but because of the perceived need of people to write better markup.

  36. squareman Says:

    @Elmer Fudd

    I don’t think you’re getting it yet. You still seem to think there’s a difference in the semantic structure of XHTML and HTML. There is no significant difference. You can write good markup in both.

    Read very carefully Mark Rowe’s comments at http://webkit.org/blog/?p=68#comment-12131 on this page.

  37. Pingback from Ajaxian » Why doesn’t <script /> work?:

    [...] work for script (and canvas and div and …). The Safari folks have spelled it all out in Understanding HTML, XML and XHTML. The articles goes into the differences, and you end up with the obvious conclusio [...]

  38. Pingback from Surfin’ Safari - Blog Archive » Understanding HTML, XML and XHTML - Matt Heerema : Web Design:

    [...]

    Surfin’ Safari – Blog Archive » Understanding HTML, XML and XHTML

    The relationships among HTML, XML and XHTML are an area of considerable confusion on the web. …This article will [...]

  39. maciej Says:

    @squareman

    The XHTML 1.0 Transitional DTD includes the exact same set of elements and attributes as HTML 4.01 Transitional, including the font element.

    Check it out if you don’t believe me:

    http://www.w3.org/TR/xhtml1/dtds.html#a_dtd_XHTML-1.0-Transitional

    Note the declaration of the font element:

  40. Myrd Says:

    You mentioned that the mime-type the document is served with is the way Safari (and other browsers) choose which rendering mode to use. What about locally opened content? If I open a file on my computer with Safari, how will Safari decide if it’s XHTML or HTML? Does it look at the doctype, or generate a mime-type from Mac OS X’s file system metadata? If so, how does that work?

  41. squareman Says:

    @maciej

    [blush] How embarrassing, caught in a mistake. Still, the “social engineering” aspect of it works well for me as people tend to be a little more careful about their code in my 9-5 corporate job.

  42. Pingback from BarelyBlogging » Blog Archive » links for 2006-09-23:

    [...]
    « links for 2006-09-16

    links for 2006-09-23

    Surfin’ Safari – Blog Archive » Understanding HTML, XML and XHTML (tags: html xhtml webdesign xml) Cross-docume [...]

  43. Pingback from links for 2006-09-22 « kobak del.icio.us könyvjelzői:

    [...] l.icio.us könyvjelzői automata blog links for 2006-09-22 Surfin’ Safari – Blog Archive » Understanding HTML, XML and XHTML The WebKit folks have nicely summed up html, xml and [...]

  44. maciej Says:

    @ppk

    “The possibility of importing these pages as XML later on. On QuirksMode.org I moved all pages to XHTML 1.0 because in the future I might decide to implement an Ajax-y kind of interface that loads the content pages as XML. With HTML that’s impossible, with XHTML it’s quite easy.”

    Actually, lots of Ajax-y stuff works fine with HTML. For example, check out: http://microformats.org/wiki/rest/ahah

    If you want a parsed DOM and not just markup to insert, you can use an offscreen iframe. And future versions of the XMLHttpRequest spec are likely to support “responseHTML” to parse transmitted HTML content. I wouldn’t be surprised if responseHTML becomes widely available sooner than it becomes practical to deploy XHTML documents on the web.

    “In short: forward compatibility.”

    This is the other problem with XHTML – each version the W3C comes out with breaks backwards compatibility. XHTML 1.1 drops a number of elements and attributes from 1.0, and XHTML 2.0 is shaping up to be wildly incompatible. So you’re not actually getting forward compatibility. In contrast, Web Apps 1.0 (aka HTML5) will be fully backwards compatible with HTML4 (and XHTML1).

    That being said, it is certainly possible for advanced authors to make documents that are both correct XHTML and will work the same whether served as HTML or as XHTML, but this is quite difficult and there are no validators around to help you get it really right. So it is definitely not a good idea for the everyday author.

  45. elmer fudd Says:

    @squareman:
    i’m interested in the strict xml qualities of XHTML. Squareman: HTML 4 is for wishy-washy hippies!
    };-{)~

    @maciej:
    These technologies are in a state of flux. That’s why it’s important to have the document declaration up top, first thing in the document. You tell ppk to use html 4 because in the future there’s going to be a “responseHTML” yet you frown upon the W3C deprecating stuff in new versions of XHTML. A browser can do with it what it may. I like Unix’s shebang declaration at the start of a script: you can have all the versions you want. Now i understand the problem of having a browser that cannot parse an XHTML 1.1 or XHTML 9.8.

    AFAIK good information architecture tries to focus on the content structure first and XHTML 1 does it quite well. Admittedly there are issues that are being addressed in XHTML 2, like the H1-6 and proper handling of document sections. The key issue is that if I code in XHTML 1, I ‘might’ be able to repurpose a document faster to XHTML 2 than if I code it in HTML, but correct me if i’m wrong. Indeed some enterprising ppl might issue ‘standard’ xslt transforms for browsers that cannot understand newer XHTML standards to downgrade the document for accessibility at least.

    NOTE: please ppl don’t take my comments as arrogant, it’s not my intention. I’m writing off the top of my head, and my language suffers. indeed most of you are way better qualified people than me. I come from a world of traditional design and ‘dabble in programming’. So i’m in no position to impose anything. It’s just that things are getting a bit awry. please note my handle: i do shoot blindly from the hip sometimes ;-)

  46. maciej Says:

    @elmer fudd:

    Also, a browser can not, in fact, “do what it may”. The IETF MIME type registrations for “text/html” and “application/xhtml+xml” tell a browser how to interpret the content and require ignoring any doctype declaration. The W3C’s Technical Architecture Group also reiterates this point in a recent finding: http://www.w3.org/2001/tag/doc/mime-respect-20060412

    So browsers are just following standards when they interpret “text/html” documents as HTML. This is what a lot of people don’t understand. The doctype declaration makes no difference, and that is what the standards require.
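
    The rule described above can be sketched in a few lines of Python. This is only a simplified model, not any browser's actual code: the function names `parser_mode` and `mode_for_response` are hypothetical, and the list of XML media types is an assumption for illustration. The point it shows is that the parser is chosen from the Content-Type header alone, and the doctype in the body is never consulted.

```python
# Hypothetical sketch: choose HTML vs. XML parsing from the MIME type only.

XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)


def parser_mode(content_type: str) -> str:
    """Return "html", "xml" or "other" for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"
    # Assumed set of XML media types, for illustration only.
    if mime in ("application/xhtml+xml", "application/xml", "text/xml"):
        return "xml"
    return "other"


def mode_for_response(content_type: str, body: str) -> str:
    """The body (and any doctype in it) is deliberately ignored."""
    return parser_mode(content_type)


if __name__ == "__main__":
    page = XHTML_DOCTYPE + "<html></html>"
    print(mode_for_response("text/html; charset=utf-8", page))
    print(mode_for_response("application/xhtml+xml", page))
```

    In this model the same bytes, XHTML doctype included, land in the HTML parser when served as text/html and in the XML parser when served as application/xhtml+xml.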

    I’ll also add that XHTML 1 and HTML 4 have *the exact same* semantic structure. The only difference is surface syntax. So if you think you need XHTML 1 for good content structure, I’m not sure you have full clarity on what XHTML is.

  47. Pingback from AnySurfer blogt » Die browsers toch:

    [...] er 2006
    Die browsers toch
    Written by Roel Van Gils at 9:53
    A nice article on Webkit.org about the way browsers really interpret XHTML. It basically comes down to the fact that it doesn’t make a bit [...]

  48. elmer fudd Says:

    @ maciej:
    IETF MIME: fine you’re right, but the types of uses referred to in the document are:
    process and display, process and store/save/pass on,
    don’t process and display, don’t process and save/store/pass on
    If the server is sending text/html or application/xhtml+xml or text/xml the browser/application SHOULD render as best it COULD, preferably according to doctype. Or else do as the document advises and ask the recipient to decide: but i’d hate to let granny decide on whether the information i’m sending her is html or xhtml :o

    correct me if i’m wrong, but afaik html 4 doesn’t REQUIRE closing tags for most elements. And i cannot validate on such a loose dtd (I use BBEdit for template coding and usually grep semi-automatically through client supplied word docs). I want to know if my document doesn’t pass muster with XML parsers. I can grep my way through complex documents, but I find it difficult to be precise. xml and machine parsing “should” make that effortless (I’m getting into perl processing: started with regexp but want to move to libXML). I’d recommend ppl to read Information Architecture*, I know many did. In the end, what counts is the information…

    My rusty and mouldy 2c:
    When I code an HTML template I zone it in enclosing DIVs/ULs for different sub-structures with well defined boundaries, the main document story or list of stories, page furniture, related stories and other stuff. This way I can re-purpose stuff in a jiffy (theoretically) I tend to set these sub-structures as absolutely positioned divs to create a boundary: if one breaks it doesn’t affect the others much. I also wrap up the whole thing in a “chase” (as in letterpress) and position it relative/absolute in the document window. AFAIK the body element is interpreted too vaguely by most browsers.

    Personally i don’t like liquid designs that much. if something breaks, you get a mess spilling over and breaking everything. Indeed browsers that take this trend too much to heart will end up wasting too many resources for nothing. for accessibility browsers can provide a simple zoom feature. I hate reading 50 words per line on a 20″ screen.

    * http://www.amazon.com/Information-Architecture-World-Wide-Web/dp/0596000359/sr=8-1/qid=1159021271/ref=pd_bbs_1/102-3784862-9532111?ie=UTF8&s=books

  49. macnoid Says:

    I realize that many have only heard of xhtml because they learn it when they’re learning about css. These are two independent standards, but they are easily conflated. Most texts teach them as the proper way to reform bad web authoring habits. They share similar validation tools. Their proponents share similar calls for future compatibility. They both fail in similar browsers :-) . And (when you can dump IE from the equation) they both seem very easy, logical and perhaps even preferable.

    My like for xhtml+xml comes from programmer laziness though. Despite its wordiness, I like xml flavors in things like file formats, property lists, user preferences, and so forth. This isn’t some kind of “universal default”, but in cases where I don’t care about the data format I tend to fall back on some flavor of xml so I can focus on other things up front. I like that some of the “toy” apps I’ve made can quickly pre-, post-, or chain process xhtml+xml “borrowed” from the web. Users of some of my web tools want xhtml output as their interchange format. Sure, when my tools originate a new file I could use html, sgml, or probably even old word perfect file formats since I’m in control. But xhtml has caught on among web authors as a good format. It lets me focus on the aspects I’m interested in and leave the rendering up to the webkit or mozilla team. :-)

    One worry is that some may read maciej’s article as justification to give up on xhtml. Paraphrasing one part of the article it might be interpreted like: “Is xhtml really what you want? It’s completely incompatible with IE, and if that’s not enough consider that Safari and Mozilla don’t implement it really well. Forget about xhtml all together and just go back to html4.”

    Instead I hope that its message is interpreted more as a “line in the sand” call for the attention of web authors. “Hey! Drop these hybrid methods. Serve your content as xhtml+xml or go back to strict html. It’s the hybrid serving methods that are broken; not the web browsers.”

    I certainly hope that web authors will still speak of xhtml as a favored interchange and authoring format. You may need some automated post-processing before it hits the web, but to me that’s preferable to authoring in html or using one of those hybrid serving tricks.

    Because of the IE elephant, it’s only through non-web uses that the xhtml standards will ever really be implemented and evolve. Dashboard widgets are an excellent playground for xhtml so I really hope the widget world will become the place where standard xhtml tools thrive. I wish there were a way to take that “expertise” and move it straight to people’s work for the web i.e. a hybrid serving method that worked well, but perhaps this article is the best that can be said about the matter until the standards improve.

    I really hope people don’t give up on xhtml and wipe it from their minds though. My preference would be to author and validate xhtml, but convert it into html4 as a final, automated step before it’s served on the web. Photoshop users prefer working in raw file formats and then converting to a gif or jpeg at the end. Office users prefer working in doc or rtf formats and then converting to something else at the end. From the programmer perspective, I see xhtml+xml as being a distinctly better common format for exchanging information during the authoring and validation periods. It’s not an ideal, but it’s human and machine parsable which is important during the authoring stage.

  50. Pingback from Simplicity: digital hub » Differences between XML, XHTML, and HTML:

    [...] XHTML and HTML, and how XHTML relates to XML. Well, you get the picture, go check it out: Surfin’ Safari: Understanding HTML, XML and XHTML No comments have been [...]

  51. ianji Says:

    I switched my site to pure XHTML about three months ago. I haven’t *completely* locked out IE users because most of them find my pages through Google, and Google helpfully translates XHTML to HTML for the benefit of poor IE users. Yes my site is IE hostile but IE does not account for the majority of my traffic anyway and I have no obligation to serve my content universally – if you want to see it you use a compatible browser – it is no skin off my nose. As well as Linux and Mac users and Windows/Firefox users I also get plenty of traffic from mobile devices (I serve XHTML Mobile Profile 1.0 as application/xhtml+xml). Lots of people are spending time coding XHTML rendering engines and it seems a shame that they have so little content to play with. Now they have my site, and the more “pure XHTML” sites people put up the more incentive there will be to improve rendering engines, which will encourage more people to put up sites. Microsoft are holding everyone back – it would be trivial for them to install a plugin to do what Google already does for them and translate XHTML into usable HTML. No it doesn’t look great but at least as an interim measure. I hope that the WebKit people continue to develop the XHTML engine and put at least as much effort into it as the tag soup parser.

  52. Pingback from Rambles of a University Systems Manager » One of the 50 billion reasons:

    [...] ted in rambles by jayoung on September 24th, 2006

    That web development is so hard. Understanding HTML, XHTML, and XML (also, one of the reasons that I stay subscribed to the Hyatt Safari blog)

    [...]

  53. Pingback from Shoob » Blog Archive » links for 2006-09-27:

    [...] to search queries, the firm said. Both forms of search were far a (tags: pay-per-click) Surfin’ Safari – Blog Archive » Understanding HTML, XML and XHTML XML.com: The Road to XHTML 2.0: MIME Types [...]

  54. Pingback from Joe Dolson Accessible Web Design | On Transitional Doctype Declarations:

    [...] rgence in discussion: Jack Pickard at Accessites.org Roger Johansson, 456BereaStreet.com Surfin’ Safari My opinion falls squarely in the XHTML 1.0 Strict camp. I don’t feel there’s any [...]

  55. Pingback from Why I Use HTML 4.01 Strict | yellow5.us:

    [...] Strict”> A couple days ago, Surfin’ Safari posted a guide on understanding HTML, XML and XHTML, in which they recommend against using XHTML for a few reasons.
    Initially after r [...]

  56. Pingback from release » Stealing from Jeff: XHTML vs. HTML:

    [...] called “valid XHTML” really is no such thing. These sources have more on why: Surfin’ Safari Anne Van Kesteren Mark Pilgrim Ian Hickson One big reason: there’s no good way to serve [...]

  57. Pingback from Notes | Blog Archive » Understanding HTML, XML and XHTML:

    [...] HTML There was some news to me in this very long article on the Webkit blog. Link: webkit.org/blog/?p=68  Categorie(s): Web Design by Eric, September 29th, 2006 [...]

  58. Pingback from Solutions Log » XHTML vs HTML:

    [...]

    « Image uploads in Django

    XHTML vs HTML

    If XHTML served as text is really just invalid HTML that renders predictably, then why does a document with a HTML 4 [...]

  59. Andrew Says:

    Fantastic post. Many thanks for clearing this up!

  60. robburns Says:

    First I want to say how much I typically learn from reading the Surfin’ Safari blog. Unfortunately this current post does not stand up to the typically informative posts on the blog. While it’s important for web authors and user agent developers to understand the issues surrounding the transition from html to xhtml, this post is filled with errors, omissions and misstatements. The W3C offers a much better piece on the intricacies of authoring html and xhtml compatible documents at http://www.w3.org/TR/xhtml1/guidelines.html. Many have already pointed out some of the misconceptions in this post in the comments above. Let me just try to highlight some of the problems with this post.

    1) I think it overstates the xhtml parsing problems with Gecko and WebKit. Though one may encounter bugs every now and then, they usually have very easy workarounds. I know I’ve encountered a bug or two where one or the other declared a fatal error by mistake, but I am unable to recall or reproduce those bugs right now.

    Second, the post overstates the issue of invalid HTML 4.01. Who cares? If one follows the W3C guidelines above and produces HTML in that way, it’s technically invalid. It doesn’t matter. Every major browser deals with it just fine. There are a few extra “/” characters in the tags. The user agents ignore those. If one duplicates the language in the lang and xml:lang attributes and the “id” and “name” attributes, the browsers ignore that as well.

    Third, the introduction of XML and XHTML has been a large factor in encouraging less error-prone HTML. By following many of the XML syntax rules authors are producing better HTML even when they don’t try to produce XHTML. In other words they’re not omitting quotes when they shouldn’t; they’re closing elements that require closing; and they’re paying a bit closer attention to issues of validity.

    Uninformed and misguided posts like this serve, whether intentionally or unintentionally, as FUD that slows the adoption of these evolving and superior recommendations. The X in XHTML is key here. SVG, MathML, XForms, PList, etc are all extensions to XHTML that are not possible in HTML. The Web Apps proposal (mentioned in the post) is another misguided approach adding to this FUD. The argument is that if we make small incremental improvements to HTML we won’t have to take advantage of the vastly superior recommendations of these various XML flavors. Instead we (as web authors and user agent developers) should embrace these superior recommendations. Sure, there’s the 50-pound gorilla of Internet Explorer which is trying to undermine these recommendations too. But why should the open source community help them out? If Mozilla and WebKit were not waiting for Web Apps 1.0 and had already implemented XForms 1.1, it would be easier to implement Web Apps 1.0 on top of an XForms implementation (since it largely tries to implement those same concepts without the benefit of XPath, XEvents and so on). Plus there’s no guarantee that IE will adopt Web Apps any more than it will adopt XML and XForms. Here’s a better alternative. How about creating a WebKit or Gecko plugin for Internet Explorer that handles XHTML? Whenever a user encounters an application/xhtml+xml page, the page would encourage them to download and install the plug-in to be able to view modern web pages.

    Anyway that’s an aside to the main issue. There are issues with HTML and XHTML. The W3C covers many best practices to deal with those differences. The post mentions differences in script behavior, but then doesn’t elaborate. Those are the sorts of thought-provoking topics I would expect from the Surfin’ Safari blog, not more of this FUD over the dangers of serving a valid XHTML document as if it were an HTML document (oh my!). This issue of the differences in script behavior (and DOM behavior) is the only new information provided by this post, but then there’s no further elaboration.

  62. Mark Rowe Says:

    While it’s important for web authors and user agent developers to understand the issues surrounding the transition from html to xhtml, this post is filled with errors, omissions and misstatements.

    If there are any errors or misstatements in the post, I would encourage you to correct them. As far as I can see, all of the issues that you have raised are matters of opinion rather than fact. Disagreeing with Maciej doesn’t make him wrong (or you, for that matter).

    1) I think it overstates the xhtml parsing problems with Gecko and WebKit.

    The post says very little on the matter of XHTML parsing, other than that the XHTML modes of the browsers that support it are not nearly as mature or well tested as the HTML modes. This is definitely the case for Safari. And Mozilla also discourages this practice due to lack of support for incremental rendering. The lack of incremental rendering in Mozilla is a big downer, the differences between the XHTML and HTML DOMs can cause issues with existing JavaScript code, and the lack of testing makes it possible you’ll run into undocumented issues. The simple fact is that the XHTML modes of browsers are tested a lot less thoroughly than the HTML modes, as there is very little real-world XHTML content.

    Second the post overstates the issue of invalid HTML 4.01. Who cares?

    The main points that I took from the post are that if you serve XHTML content as text/html you should expect it to be treated as HTML, and that certain quirks that allowed you to omit closing tags from certain elements are being removed. These two items, in combination, can cause compatibility problems. If you have an XHTML document with a tag such as <script src="file.js" type="text/javascript" /> it will be treated correctly by an XHTML user agent. If you serve the same document to an HTML 4 user agent it will be treated only as an opening script tag, as the script element is not defined as being self-closing. XHTML served as HTML isn’t always going to result in the same parsing as when the content is served as XHTML.
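
    A rough way to see this difference is the Python sketch below. It uses <b /> rather than <script /> to keep the example small, since script content has extra parsing rules of its own. The XML side uses the standard library's real XML parser. The HTML side is only a toy model built on Python's HTML tokenizer (the class name `ToyHTMLTree` and the short list of empty elements are assumptions, and this is not how any browser is implemented): it simply ignores the trailing slash on elements that are not defined as empty.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Toy subset of the elements HTML 4 defines as empty (an assumption,
# not the complete list).
VOID = {"br", "hr", "img", "input", "link", "meta"}

MARKUP = "<p><b />bold?</p>"


class ToyHTMLTree(HTMLParser):
    """Toy model of tag-soup parsing: the "/" in "<b />" is ignored."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open tag names
        self.text_parents = []   # (innermost open tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Python's tokenizer reports "<b />" separately; model the HTML
        # behaviour by treating it as an ordinary start tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.open_elements:
            # Close everything up to and including the matching tag.
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        parent = self.open_elements[-1] if self.open_elements else None
        self.text_parents.append((parent, data))


def text_parent_as_xml(markup):
    """Name of the element that owns the text under XML rules."""
    root = ET.fromstring(markup)
    b = root.find("b")
    # In XML, <b /> is a complete, empty element. The following text is
    # stored as its "tail", which means it belongs to the parent <p>.
    return "b" if b.text else root.tag


def text_parent_as_html(markup):
    """Name of the element that owns the text in the toy HTML model."""
    parser = ToyHTMLTree()
    parser.feed(markup)
    parser.close()
    return parser.text_parents[0][0]


if __name__ == "__main__":
    print("XML: ", text_parent_as_xml(MARKUP))
    print("HTML:", text_parent_as_html(MARKUP))
```

    Under XML rules the text sits directly inside the paragraph; in the toy HTML model it ends up inside the still-open b element, which is the same effect as a <script /> tag that never closes.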

  63. robburns Says:

    mcroft writes:

    Should the w3c validator have a category for HTML-compatible XHTML that checks the MIME type? Lots of developers think they’re doing pretty well if they get the w3c blessing. Even if it just described the actual situation for the page better, it might help educate developers. “Warning, this page contains valid XHTML 1.0 Transitional but is served as text/html by the web server. User agents will interpret this as invalid HTML4.”

    I think the idea of adding HTML compatible XHTML validation to validators is a good one. However, the warning is again just FUD. Following appendix C of the XHTML recommendation does not need any warning. Just knowing that the markup will work served either way is what an author needs to know from the validator (leaving aside the DOM issues; there could be a warning about that though).

    Myrd writes:

    You mentioned that the mime-type the document is served with is the way Safari (and other browsers) choose which rendering mode to use. What about locally opened content? If I open a file on my computer with Safari, how will Safari decide if it’s XHTML or HTML? Does it look at the doctype, or generate a mime-type from Mac OS X’s file system metadata? If so, how does that work?

    Changing the filename extension to .xhtml will cause WebKit and Firefox to process the file as application/xhtml+xml.
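
    As a rough illustration of that answer, here is a small Python sketch of an extension-to-MIME-type lookup for local files. The table and the function name `mime_for_local_file` are hypothetical. Only the .xhtml entry comes from the comment above; the .html and .htm entries and the fallback type are assumptions, and the real mapping inside WebKit or Firefox may differ.

```python
import os.path

# Hypothetical lookup table. Only the ".xhtml" row is taken from the
# discussion; the other rows are assumptions for illustration.
EXTENSION_TO_MIME = {
    ".xhtml": "application/xhtml+xml",
    ".html": "text/html",
    ".htm": "text/html",
}


def mime_for_local_file(path, default="application/octet-stream"):
    """Guess a MIME type for a local file from its extension alone."""
    extension = os.path.splitext(path)[1].lower()
    return EXTENSION_TO_MIME.get(extension, default)


if __name__ == "__main__":
    for name in ("widget.xhtml", "index.html", "notes.txt"):
        print(name, "->", mime_for_local_file(name))
```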

  64. robburns Says:

    If there are any errors or misstatements in the post, I would encourage you to correct them. As far as I can see, all of the issues that you have raised are matters of opinion rather than fact. Disagreeing with Maciej doesn’t make him wrong (or you, for that matter).

    Well, I wasn’t going to get that pedantic about it, but since you ask.

    SGML is quite complicated, and in practice most browsers do not actually follow all of its oddities. HTML as actually used on the web is best described as a custom language influenced by SGML.

    This gets it backwards. SGML may be complicated, but the real-world treatment of HTML by browsers is infinitely more complicated. One of the benefits XHTML and XML bring us is that, nearly a decade after the introduction of XML, we shouldn’t still be trying to match the complicated error-handling of Internet Explorer.

    Second, XML has draconian error-handling rules.

    This takes one of the benefits of XML and spins it as if it’s a negative. Authors want their content to be rendered at least somewhat similarly across platforms. The stricter error-handling was added to help authors identify mistakes during authorship. This helps ensure the author’s meaning is recorded in the document, not a subsequent loose interpretation of the meaning by an easy-going error handler.

    Whenever I write something — whether in plain english or in a markup language — I want to know at the time I write it that it may be subjected to misinterpretation. That’s what XML error-handling brings us at least for our markup.
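
    The difference in error handling being discussed can be seen with Python's standard library, as a minimal sketch. The helper names `xml_accepts` and `html_events` are made up for this example, and Python's `html.parser` is only a lenient tokenizer, not a browser engine, so it stands in for tag-soup behavior only loosely. The XML parser stops with a fatal error on misnested tags, while the HTML tokenizer reports every tag and leaves the interpretation to whoever consumes the events.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The <b> and <i> elements overlap instead of nesting.
MISNESTED = "<p><b><i>text</b></i></p>"


def xml_accepts(markup):
    """True if the markup is well-formed XML, False on a fatal error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records start and end tags in the order they are seen."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    """Tokenize leniently; misnesting does not raise an exception."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.events


if __name__ == "__main__":
    print("XML accepts misnested markup:", xml_accepts(MISNESTED))
    print("HTML tokenizer events:", html_events(MISNESTED))
```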

    It’s important to note that this is [a] kind of a hack, and depends on the de facto error handling behavior of HTML parsers.

    Here the post refers to the W3C HTML compatibility guidelines in XHTML 1.0’s appendix C as a hack. The XHTML 1.0 recommendation tries to move the technology forward. And it also includes a transition strategy that relies on transitioning from where we are (the de facto error handling behavior of HTML parsers) to move us to where we should be (using XHTML and stricter XML parsers). That’s not a hack in my view.

    There are also many other subtle differences between HTML and XHTML that aren’t covered by this simple syntax hack. In XHTML, tag names are case sensitive, scripts behave in subtly different ways, and missing implicit elements like <tbody> aren’t generated automatically by the parser.

    All of these issues are discussed in the appendix C of the XHTML 1.0 recommendation. The issue of scripting and the DOM is something that may not be covered as much as it should. However, authoring documents to the XHTML 1.0 and DOM 3.0 specification and serving them as HTML does not create the problems implied by the post. There may be issues, but those issues are not covered in this post.
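
    Two of the differences quoted above, case sensitivity and the missing implicit <tbody>, can be checked on the XML side with a short Python sketch. The helper names are hypothetical. One caveat: Python's `html.parser` only tokenizes, so it can show that HTML tag names are folded to lower case, but it does not build a tree and therefore cannot show a browser's HTML parser inserting <tbody>.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

TABLE = "<table><tr><td>cell</td></tr></table>"


def xml_root_tag(markup):
    """Root element name exactly as the XML parser reports it."""
    return ET.fromstring(markup).tag


def xml_is_well_formed(markup):
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def xml_has_tbody(markup):
    """True only if a <tbody> child is literally present in the source."""
    return ET.fromstring(markup).find("tbody") is not None


class StartTags(HTMLParser):
    """Collects start tag names as the HTML tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(markup):
    parser = StartTags()
    parser.feed(markup)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    print(xml_root_tag("<P>x</P>"))        # XML keeps the case as written
    print(html_start_tags("<P>x</P>"))     # HTML tokenizer lower-cases names
    print(xml_is_well_formed("<P>x</p>"))  # P and p are different names in XML
    print(xml_has_tbody(TABLE))            # XML parser adds no implicit tbody
```

    A script that looks up table rows through a tbody element, or compares tag names against upper-case strings, is the kind of code that can behave differently once the same document goes through an XML parser.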

    So if you take an XHTML document written in this style and process it as HTML, you aren’t really getting XHTML at all – and trying to treat it as XHTML later may result in all sorts of breakage.

    This is just plain false. If you author documents to the XHTML recommendations (including appendix C) and DOM 3 you are getting XHTML exactly. Treating as HTML 4.01 now (text/html) and then treating it as XHTML 1.0 later (application/xhtml+xml) will not break anything.

    Which means they are not XHTML at all, but actually invalid HTML that’s getting by on the error handling of HTML parsers. All those “Valid XHTML 1.0!” links on the web are really saying “Invalid HTML 4.01!”.

    Such documents are valid XHTML, so this is another inaccuracy. When XHTML 1.0 documents are served as text/html (which is perfectly acceptable according to W3C recommendations) you are serving your XHTML 1.0 documents with the text/html mime type. Yes the browser treats it as tag soup. But it treats it as tag soup in the sense that it treats all HTML as tag soup. So, this raises the question I posed before: So what?

    I could go on, but I think that’s enough. One of the key things XHTML and XML have brought to authors (and there are many more benefits that this post and all the other posts like it try to gloss over) is to differentiate between validity and well-formedness. In HTML authoring these two concepts are collapsed. If I write a paragraph <p> and then move on to a <table>, am I writing a well-formed and valid document where the paragraph has implicitly ended and a table began? Or am I writing an invalid and ill-formed document where I’ve placed a table inside my paragraph?

    Let me just conclude by discussing the two elements that relate to the post’s subtitle: <script> and <canvas>. The <script> example violates the W3C recommendations of Appendix C, where any elements that are not defined as empty elements should include explicit close tags. On the other hand, the <canvas> element is an exception in that it’s not part of the smooth transition map provided by the W3C because it is not yet a stabilized standard. For the sake of Apple’s widget developers, the WebKit team might want to add fallback content to the <canvas> element only with the transition to XHTML. So for instance <canvas> in HTML would have no content, while XHTML would require <canvas></canvas> or <canvas />. Obviously this would create some incompatibilities between WebKit and Gecko, but only for the HTML 4.01 authoring that I hope after all this time we can begin to leave behind.

  65. maciej Says:

    @robburns

    Since you mentioned specific claimed inaccuracies, I’ll reply:


    SGML is quite complicated, and in practice most browsers do not actually follow all of its oddities. HTML as actually used on the web is best described as a custom language influenced by SGML.

    This gets it backwards. SGML may be complicated, but the real-world treatment of HTML by browsers is infinitely more complicated.

    Your comment doesn’t in any way refute my factual statement. I gave no opinion on which is more complicated. It is simply a fact that SGML sets parsing requirements in some unusual situations that no browser respects. One example is SGML comment parsing. This isn’t a value judgment, it’s just a true fact.


    Second, XML has draconian error-handling rules.

    This takes one of the benefits of XML and spins it as if it’s a negative.

    Once again, you are taking a statement of fact, recasting it as an opinion, and disagreeing with the opinion. Here is Tim Bray, who had a huge influence on XML’s error handling rules, describing it as “draconian”.

    Here the post refers to the W3C HTML compatibility guidelines in XHTML 1.0’s appendix C as a hack. … That’s not a hack in my view.

    This is a difference of opinion, not a technical inaccuracy. It is definitely a fact that Appendix C depends on de facto HTML error handling.

    All of these issues are discussed in the appendix C of the XHTML 1.0 recommendation. … There may be issues, but those issues are not covered in this post.

    So are the issues discussed by the post real or not? Does Appendix C mention imaginary issues? Whether they are mentioned or not, no one reads Appendix C, and there’s no tool out there to validate against its recommendation.

    This is just plain false. If you author documents to the XHTML recommendations (including appendix C) and DOM 3 you are getting XHTML exactly. Treating as HTML 4.01 now (text/html) and then treating it as XHTML 1.0 later (application/xhtml+xml) will not break anything.

    Finally, a real factual disagreement! Unfortunately, you’ve provided no facts to back these assertions. Please provide supporting evidence if you’d like to debate this. Note that the W3C TAG says “text/html” content must be treated as HTML 4.01 per internet standards. If you claim it “is” XHTML, this can make sense only in the same abstract sense as it “is” also plaintext: it won’t get processed as such, but maybe it theoretically could. Your claim about DOM3 is also wrong. First of all, what DOM calls you use has nothing to do with whether a document is HTML or XHTML. Second, there are standard DOM calls that will behave differently depending on whether your document is HTML or XHTML.


    Which means they are not XHTML at all, but actually invalid HTML that’s getting by on the error handling of HTML parsers. All those “Valid XHTML 1.0!” links on the web are really saying “Invalid HTML 4.01!”.

    Such documents are valid XHTML, so this is another inaccuracy.

    This is about as relevant as saying “Valid plain text!” It may be totally true, but it’s also totally irrelevant to how the document will be processed. Whereas serving “Valid HTML” will generally give you more consistent cross-browser processing, and so actually is relevant.

    I think the bottom line is that you disagree with my opinions on what format content authors should use on the web, and are recasting this as disagreement on the facts. It’s perfectly possible for people to look at the same set of facts and come to different conclusions on the right thing to do, based on their values. Regardless of which format they think it’s best to use, I think a lot of people learned new things about HTML vs. XHTML from my post.

    I also think you need to re-evaluate your faith in all things done by the W3C. For instance, your claim that they have a “smooth transition map” is wild, since they’ve already managed to break backwards compatibility with XHTML 1.0 twice (1.1 and the forthcoming 2.0) even though the XHTML1 transition still hasn’t happened. The reason things like Web Apps 1.0 exist and are done outside the W3C is because the W3C has blown off the real-world concerns of browser vendors and web content authors to focus on theoretical beauty. If that floats your boat, that’s cool, but a lot of us care more about whether something works than how pretty it is.

  66. robburns Says:

    Your response doesn’t really address what I’m trying to say, so let me just begin to try to clarify by suggesting the type of information that would help authors with these issues. First there’s the Appendix C that the post references. To summarize the important issues there:
    write documents as XHTML 1 (strict, transitional or frameset)
    when using a self-closing tag add an extra space before the slash like this: <br />
    when declaring a language on an element include both the ‘lang’ and ‘xml:lang’ attributes set to the same value
    whenever including ‘name’ on elements that use the deprecated ‘name’ attribute include an ‘id’ set to the same value.
    do not use the self-closing shortcut unless an element is defined by the schema to have no content (so don’t use <script />, use <script></script>)
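    The last rule in the list above lends itself to a mechanical check. The following Python sketch is only a heuristic under stated assumptions: it uses a regular expression rather than a real parser, the function name is hypothetical, and the set of empty elements is the list of elements declared EMPTY in HTML 4.01.

```python
import re

# Elements declared EMPTY in HTML 4.01; only these may safely use the "<tag />" shortcut.
EMPTY_ELEMENTS = {
    "area", "base", "basefont", "br", "col", "frame", "hr",
    "img", "input", "isindex", "link", "meta", "param",
}

# Matches a start tag written with the self-closing shortcut, e.g. <br /> or <script src="x" />.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*/>")


def misused_self_closing(markup: str) -> list:
    """Return the names of self-closed elements that are not defined as empty."""
    return [
        match.group(1)
        for match in SELF_CLOSING.finditer(markup)
        if match.group(1).lower() not in EMPTY_ELEMENTS
    ]


# Hypothetical sample: <br /> is fine, <script /> and <canvas /> are flagged.
sample = '<p>Hi<br /><script src="a.js" /><canvas /></p>'
print(misused_self_closing(sample))
```

    A regular expression will be fooled by comments, CDATA sections and unusual attribute values, so this is a quick lint at best and no substitute for a validator.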

    This leads to the following validator warnings:

    When the language is declared on an element:

    if the doctype is set to one of the XHTML1 doctypes the author will see errors for each ‘lang’ attribute because that attribute is not defined in the XHTML1 schema.
    if the doctype is set to one of the HTML4.01 doctypes the author will see errors for each ‘xml:lang’ attribute because that attribute is not defined in the HTML4.01 schema.

    regardless of doctype setting both ‘name’ and ‘id’ attributes to the same value (of type ID) is technically wrong, but I doubt any validator will catch that anyway.
    the slash in self-closing tags is technically incorrect but most validators have been updated to overlook that issue

    None of this is cause for alarm. This merely requires authors to be sentient beings who are made aware of these validation warnings and understand that they can safely ignore them. If I’m wrong about that and there are some practical consequences to ignoring those warnings in one browser or another please let us know.

    There may be other issues that authors should be made aware of. For example:

    To ensure one stays out of quirksmode (in IE) I believe one should also leave off the xml declaration (so that means one has to stick with xml 1.0 and utf-8 or utf-16 encoding).
    When using the DOM avoid DOM level-0 and check for the availability of the namespace methods/functions and call the appropriate method whether the document is served as application/xhtml+xml or text/html. I’m not suggesting the same document be served concurrently as both types, but rather providing suggestions for forward-looking authors to be ready to make the transition with ease and agility.

    So if we return to the possible courses of action for serving the resulting XHTML 1 document (authored according to Appendix C guidelines), the outcome depends on the needs of the authors, sites and implementations in question.

    Only one of those approaches may be ill-advised under typical circumstances: that of serving text/html to IE and serving application/xhtml+xml to everything else. Mozilla folks discourage this except under special circumstances and I think they provide good reasons.
    Serving the document as text/html. This is perfectly acceptable under W3C guidelines. Many authors want to know that it’s OK to focus only on XHTML. It is, and Appendix C (along with advice from places like Surfin’ Safari) should show them the way. Change merely the doctype, if you wish, on an Appendix C authored page from XHTML1 to HTML4.01 and that’s all (if even that; you could just leave the doctype set to XHTML1 and I am unaware of any practical negative implications)
    Serving as application/xhtml+xml and leaving older browsers and IE 6 and 7 behind. This may be a perfectly reasonable response for many sites. I think the WebKit site could take that approach with almost no negative implications. Others could do so in similar situations. Just like in the old frames days, add the following code for IE:
    This site uses XHTML. If you’re using an out-of-date browser please download one of the many free browsers: (and then present a list of browsers)
    Sticking with HTML 4. Why, when XHTML 1 works fine as a substitute? Plus, as authors were originally told (and correctly), authoring in XHTML 1 prepares them for the future. The introduction of XML and XHTML has already improved the quality of HTML authoring tools and the quality of HTML when hand-coded. Following XHTML syntax is just easier for most folks. Moving from ‘lang’ to ‘xml:lang’ and ‘name’ to ‘id’ shouldn’t be that big of a deal either.

    One other approach left out of the discussion would be to use application/xml or text/xml and then one may also support IE7 with pure XHTML.

    The point is authors need to know there’s a difference in the way browsers handle mime types. If that’s what this article wants to say I’m 110% behind you. If you’re trying to say what other misguided people (some cited here) have said, that there’s something wrong with serving HTML compatible XHTML as text/html, that’s where we part ways. Is that a matter of opinion? Yes. Except I like my opinions backed up by reasoned argument. I don’t see that in this post or in a couple of those sites you cite.
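    To make the MIME type point concrete, here is a small Python sketch of how a user agent might pick a parser from the Content-Type header alone. The function name, the return labels and the exact set of XML types are assumptions made for illustration; real browsers have more involved rules.

```python
# MIME types discussed in this thread that select XML processing (illustrative, not exhaustive).
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def parser_for(content_type: str) -> str:
    """Choose a parser label from a Content-Type header value; the doctype plays no part."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime == "text/html":
        return "html"  # lenient tag-soup parsing, even if the markup is XHTML
    if mime in XML_TYPES or mime.endswith("+xml"):
        return "xml"   # strict parsing: a well-formedness error is fatal
    return "other"


print(parser_for("text/html; charset=utf-8"))
print(parser_for("application/xhtml+xml"))
```

    Under this model the same XHTML 1.0 file gets lenient HTML parsing when sent as text/html and strict XML parsing when sent as application/xhtml+xml, which is the difference authors need to be aware of.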

  67. Pingback from SitePoint Blogs » Oct 8, 2006 News Wire:

    [...] Kevin Yank

    Understanding HTML, XML and XHTML From the Safari crew, this is one of the most well-reasoned and pragmatic takes o [...]

  68. Pingback from IE 5.2 Mac stürzt mit XHTML ab - Seite 2 - XHTMLforum:

    [...] interesting article: Understanding HTML, XML and XHTML.

    __________________
    Markus W [...]

  69. Pingback from www.web-garden.be » Blog Archive » HTML vs XML vs XHTML:

    [...] ML and XHTML are an area of considerable confusion on the web. This article will attempt to clear up some of that confusion.

    This entr [...]

  70. Pingback from CSS-Valid aber Warnungen - XHTMLforum:

    [...] document. See also Understanding HTML, XML and XHTML.

    __________________
    Markus W [...]

  71. Pingback from alexking.org: Blog > Around the web:

    [...] Posted in: General Understanding HTML, XML and XHTML BBColors 1.0 Google, YouTube and Copyright – good question. Jack Sl [...]

  72. Pingback from TechSpeak » Blog Archive » Understanding HTML, XML and XHTML:

    [...] ober 16th, 2006
    Understanding HTML, XML and XHTML
    Here is a helpful overview on Understanding HTML, XML and XHTML. (sighted on alexking.org) Posted by Ken in General

    Possibly relate [...]

  73. Pingback from Noah On » Blog Archive » XHTML and MIME headers:

    [...] ie.ch/advocacy/xhtml.fr The Safari development team posted a blog entry on this topic: http://webkit.org/blog/?p=68 Context This was originally written in September 2002 in the context of this Web log [...]

  74. Pingback from Internet Explorer zeigt weiße Seite trotz richtigem Quellcodes - Apfeltalk:

    [...] HTML variants? I use XHTML strict. That is where your problem lies. From the article Understanding HTML, XML and XHTML on the Surfin’ Safari blog by the WebKit developers: Code: Serve your [...]

  75. Pingback from techblog.tilllate.com » HTML4 verwenden. XHTML vergessen.:

    [...] not understood, and not displayed. The Safari developers see it differently: in a recent article they recommend using HTML instead of XHTML. As reasons they cite the wider spread [...]

  76. Pingback from HTML vs. XHTML at Aaron Heimlich - Web Developer:

    [...] p you out: No to XHTML – Spartanicus’ Web tips SitePoint Forums – XHTML vs HTML FAQ Understanding HTML, XML and XHTML HTML vs. XHTML – WHATWG Wiki Mozilla Web Author FAQ — Should I serve applica [...]

  77. Jakob Peterhänsel Says:

    OK, all those glamorous standards – what are they good for, if they are not used?

    Why do we make a DTD declaration if the browsers only look at the MIME type from the server? That is, to be honest, very stupid! I’m puzzled, I’m amazed, I’m out of words… almost.

    1: It would make sense to me if the browser looked at the MIME type as a ‘does this qualify as something I can process at ALL’ and then looked at the DTD to know the REAL content.
    Remember, the MIME type from the server is File Extension Based! Do you REALLY think all webpages will now get a .xml extension so they can be served as text/xml and understood by the browser?

    2: If the above is to be (.xml file endings = text/xml) how is the browser able to verify the Type of XML, and parse it? According to this article, browsers don’t read and don’t care about the DTD!

    3: May I be so bold as to suggest that browsers take the MIME type from the server ‘with a grain of salt’ and only use it as a guideline, and then look at the DTD to determine the actual document type?

    Ohh, and may I be so bold (sorry it’s late here) to note that I think it silly to Require an explicit tag close () if the tag is self-containing! Why the …. should I make a if all that is needed is ?
    To me, it seems as if all those people making up the standards are on 100+MBit connections to servers that are next door, and they don’t care about bandwidth. So you say “what’s that extra 6 letters gonna matter?” Well, in a simple page, with a few of these tags, it’s quickly 1000 letters, easily doubling the size of the page served!

    I understand it’s nice to be able to make content in the Canvas tag for browsers that don’t render it, but if _I_ as a developer decide that there should NOT be such content, why do I HAVE to make the extra tag?

    PS: Does this also mean that Script includes in the Header section of a page should have the tag ? If so, this is outright STUPID. Why does script need it, when META or LINK does not????????????

    Someone should stop smoking that grass…. ;-)

  78. Jakob Peterhänsel Says:

    Some tags were removed (by WordPress) from the above post.

  79. Max_B Says:

    Trying to write clean, valid and lasting code is getting more and more complex…
    No matter the efforts you put in you’ll be off the way!

    It is a pity that the major browser vendors, at least the open-source ones if not all, and the standards consortium are not able to set up and agree on clear guidelines and a validation process.

  80. Max_B Says:

    To elaborate on my previous desperate comment, could someone outline a clean, safe process to validate XHTML content for both cases of MIME type?
    Is it enough to validate the markup at W3C and then serve the file as html, then as xml to an xhtml capable browser like WebKit?

  81. Pingback from strictmode » Blog Archive » WordPress and Content-Types:

    [...] to see Ian Hickson’s article, Sending XHTML as text/html Considered Harmful and also Understanding HTML, XML and XHTML by the developers of a prominent web browser. This entry wa [...]

  82. Pingback from lacovnk의 일기장 » Blog Archive » XHTML에서 , IE 이녀석:

    [...] </script>, opening it and closing it separately is said to be recommended. There is also another post related to this. So then the XHTML document [...]

  83. creative4w3 Says:

    Serving content as application/xhtml+xml also causes extra issues for search engines.

  85. Pingback from links for 2007-02-24 | On Influence and Automation:

    [...] an industry insider speaking to us under conditions of anonymity. (tags: retail compusa) Understanding HTML, XML and XHTML The relationships among HTML, XML and XHTML are an area of considerable confusion [...]

  86. dunix Says:

    they should make it one language with different ways of styling, instead of making a mess with all these little different names.

  87. Nuwendoorn Says:

    In some books I read that search engines are better updated when a site is built in XML than HTML. My question is whether this is really true.

  88. Mark Rowe Says:

    Nuwendoorn: the theory is that search engines deal with your content better if you use semantic markup, not XML vs HTML. In my experience, it can be a big benefit compared to using table-laden non-semantic markup.

  89. Pingback from Tech Center Current » Blog Archive » Who says XHTML is good to use?:

    [...] s the widest browser and search engine support. The Safari development team posted on its official blog: On today’s web, the best thing to do is to make your document HTML4 all the way. Full XHTML pr [...]

  90. Pingback from holotone.net:

    [...] ground! YouTube – Kill -9 – You thought the seven layer model referred to a burrito… Surfin’ Safari – Blog Archive » Understanding HTML, XML and XHTML – OR, Close your <script> and <canvas& [...]

  91. Pingback from Web Development » Blog Archive » Why you should be using HTML 4.01 instead of XHTML:

    [...] g valid HTML 4.01 as text/html ensures the widest browser and search engine support." Apple (Safari): "On today’s web, the best thing to do is to make your document HTML4 all [...]

  92. Pingback from Joe’s Robot-like Adventures » Blog Archive » Switching back to HTML4:

    [...] ation for the possible standardization of the WHATWG proposal for HTML5 and because of the recommendation on the Surfin’ Safari Weblog. Also I’ve learned that my carefully crafted XHTML 1.1 is b [...]

  93. Pingback from Bad Web 2.0 - Helsinki cycling journey planner « “Tech IT Easy” - Jeremy Fain’s blog:

    [...] re so many things wrong in trying to publish something on the internet as XHTML Strict and serving it as text/html that it warrants it’s own article. What is worse, if y [...]

  94. Pingback from ColorThreads » Blog Archive » XHTML Safari Bug:

    [...] I knew that IE 6 doesn’t accept application/xhmlt+xml content type, which is needed for real XHTML, but that wasn’t the problem for me. I’ve created a simple test to see whether browsers [...]

  95. Pingback from soeren says » Blog Archive » XHTML remains a mess:

    [...] just isn’t any good, and pretending it doesn’t exist isn’t a good idea. (Go read this, too, particularly the section with “You have a couple of choices:”.) Rather, the answer (to [...]