Surfin' Safari

The HTML5 Parsing Algorithm

Posted by Eric Seidel on Thursday, August 5th, 2010 at 12:39 pm

Over the past few months, we’ve been hard at work implementing the parsing algorithm from HTML5. Before HTML5, there was no standard for how browsers should parse invalid HTML. As a result, each browser developed its own parsing quirks, harming interoperability for pages that contain invalid HTML. HTML5, in contrast, specifies a complete algorithm for parsing HTML documents. Switching to the HTML5 parsing algorithm has three main benefits:

  1. Better interoperability between browsers. All browsers that implement the HTML5 parsing algorithm should parse HTML the same way, which means your web page should parse the same way in Firefox 4 and the WebKit nightly, even if it contains invalid markup. Improving interoperability makes it easier to author HTML by reducing the differences between browsers.
  2. Better compatibility with web pages. A mind-boggling amount of data analysis has gone into designing the HTML parsing algorithm. By crawling the web, the designers were able to carefully weigh the trade-offs and maximize compatibility with existing web pages. By implementing the algorithm, we leverage their effort and improve the compatibility of our parser, making it less likely that users will run across broken pages.
  3. SVG and MathML in HTML. One of the cool new features of the HTML5 parsing algorithm is the ability to embed SVG and MathML directly in HTML pages. To embed SVG, you simply add an <svg> tag to your HTML page and you can use the full power of SVG.


    Look mom, SVG in HTML! (If you had an HTML5 compliant browser, the previous text would be colored and on a path.)

    (View source to see the demo SVG code inline in this HTML post.)
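
As a rough illustration of what inline SVG markup can look like (this is not the demo's actual source), here is a small JavaScript sketch that builds an HTML page with an <svg> element containing colored text on a path. The helper names pageWithInlineSvg and escapeText, the element id "curve", and the sizes and colors are all made up for this example. Note that no xmlns declarations appear anywhere: the HTML5 parser assigns the SVG and XLink namespaces itself.

```javascript
// Hypothetical helper: escape characters that would otherwise start markup.
function escapeText(text) {
  return text.replace(/&/g, '&amp;').replace(/</g, '&lt;');
}

// Hypothetical helper: returns the source of an HTML page with inline SVG.
// The <svg> element sits directly in the HTML, with no namespace declarations.
function pageWithInlineSvg(label) {
  return `<!DOCTYPE html>
<title>Inline SVG sketch</title>
<p>Text before the drawing.</p>
<svg width="300" height="120" viewBox="0 0 300 120">
  <defs>
    <path id="curve" d="M 20 90 Q 150 10 280 90" />
  </defs>
  <text fill="teal" font-size="20">
    <textPath xlink:href="#curve">${escapeText(label)}</textPath>
  </text>
</svg>
<p>Text after the drawing.</p>`;
}

console.log(pageWithInlineSvg('Look mom, SVG in HTML!'));
```

Saving the printed output as an .html file and opening it in a browser with an HTML5 parser should show the text following the curve.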

We’ve been implementing the HTML5 parsing algorithm in phases. Two months ago, we finished the first phase, which consisted of the tokenization algorithm. Late last night, we finished the second major piece: the tree builder algorithm. Together, these two algorithms form the core of the parser and consist of over 10,000 lines of code. In the next phase, we’ll tackle fragment parsing (which is used by innerHTML and HTML5test.com).
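
To make the split between the two phases a bit more concrete, here is a toy sketch in JavaScript. It is not WebKit's code and it is not the HTML5 algorithm, which also handles attributes, comments, character references, insertion modes, and many error-recovery rules. The names tokenize and buildTree are invented for illustration. The sketch only shows the overall shape: a tokenizer turns markup into a flat list of tokens, and a tree builder consumes those tokens while keeping a stack of open elements.

```javascript
// Toy two-stage parser: only the overall shape, NOT the HTML5 algorithm.

// Stage 1: turn markup into a flat list of tokens (attributes are dropped).
function tokenize(input) {
  const tokens = [];
  const tagPattern = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  let lastIndex = 0;
  let match;
  while ((match = tagPattern.exec(input)) !== null) {
    if (match.index > lastIndex) {
      tokens.push({ type: 'text', data: input.slice(lastIndex, match.index) });
    }
    tokens.push({
      type: match[1] ? 'endTag' : 'startTag',
      name: match[2].toLowerCase(),
    });
    lastIndex = tagPattern.lastIndex;
  }
  if (lastIndex < input.length) {
    tokens.push({ type: 'text', data: input.slice(lastIndex) });
  }
  return tokens;
}

// Stage 2: consume tokens and build a tree using a stack of open elements.
function buildTree(tokens) {
  const root = { name: '#root', children: [] };
  const openElements = [root];
  for (const token of tokens) {
    const current = openElements[openElements.length - 1];
    if (token.type === 'text') {
      current.children.push(token.data);
    } else if (token.type === 'startTag') {
      const element = { name: token.name, children: [] };
      current.children.push(element);
      openElements.push(element);
    } else {
      // End tag: close the nearest matching open element (and anything
      // opened inside it); ignore end tags that match nothing.
      const index = openElements.map((e) => e.name).lastIndexOf(token.name);
      if (index > 0) openElements.length = index;
    }
  }
  return root;
}

// Invalid markup: <b> is never closed, yet the tree builder still recovers.
const tree = buildTree(tokenize('<p>Hello <b>world</p>!'));
console.log(JSON.stringify(tree, null, 2));
```

Even in this toy version, the recovery rule for the unclosed <b> lives in the tree builder rather than the tokenizer, which is the kind of decision that a shared specification pins down so that every browser builds the same tree.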

One of the challenges we’ve encountered in switching to the HTML5 parsing algorithm is that some HTML documents rely on WebKit-specific parser quirks. For example, some websites use self-closing script tags (e.g., <script src="…" />). WebKit is the only major rendering engine that supports this syntax (other browsers ignore the trailing “/” and look for a “</script>” tag). By implementing HTML5, we improve interoperability with other browsers at the cost of compatibility with some WebKit-specific content. In the long run, however, we believe these changes are good for the web platform as a whole.
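
The practical effect of the self-closing script change can be sketched with a toy model in JavaScript. This is only an approximation of the behavior described above, not the real tokenizer (it ignores comments and quoted attribute values containing ">", among other things). The function name scriptContents and the file name a.js are hypothetical. The model treats the trailing “/” as if it were not there, so everything up to the next “</script>” becomes script text.

```javascript
// Toy model: for each <script> start tag, return the text that an
// HTML5-style parser would treat as the script's contents.
// A trailing "/" in the start tag is ignored, as described in the post.
function scriptContents(html) {
  const contents = [];
  const startTag = /<script\b[^>]*>/gi;
  const endTag = /<\/script[\s/>]/gi;
  while (startTag.exec(html) !== null) {
    const bodyStart = startTag.lastIndex;
    endTag.lastIndex = bodyStart; // search for the end tag after the start tag
    const close = endTag.exec(html);
    const bodyEnd = close ? close.index : html.length;
    contents.push(html.slice(bodyStart, bodyEnd));
    if (!close) break; // unclosed script: it swallows the rest of the input
    const tagEnd = html.indexOf('>', bodyEnd);
    if (tagEnd === -1) break;
    startTag.lastIndex = tagEnd + 1; // resume after the end tag
  }
  return contents;
}

const page = '<script src="a.js" /><p>Hello</p><script>run()</script>';
console.log(scriptContents(page));
```

In this model the paragraph and the second script's start tag end up inside the first script element instead of on the page, which is why content written against the old WebKit quirk can break. Writing an explicit end tag, as in <script src="a.js"></script>, avoids the problem in every browser.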

Implementing the HTML5 parsing algorithm has also given us an opportunity to give back to the community. For example, we’ve sent feedback to the W3C HTML Working Group about how to improve the correctness and compatibility of the parsing algorithm itself, and we’ve contributed over 250 test cases to the html5lib parser test suite.

We would like to invite you to take a nightly build for a spin and give the new parsing algorithm a try. If you run into any compatibility or stability problems, please let us know by filing a bug.

6 Responses to “The HTML5 Parsing Algorithm”

  1. randallfarmer Says:

    Perhaps this isn’t the place for compatibility tips, but it turns out that the current Gecko, WebKit, and Opera will all let you adopt an svg element parsed with DOMParser into an HTML document. So, for instance, this works:

    var svgDoc = (new DOMParser).parseFromString(mySVG, 'text/xml'); // 'image/svg+xml' not supported cross-browser
    var svgEl = document.adoptNode(svgDoc.querySelector('svg'));
    document.body.appendChild(svgEl);

    Then you can mess with svgEl and its children the same as any DOM nodes. I use it to insert some SVG received via JSONP in a document, insert it in a paragraph, and then mess with its vertical-align based on info inside the SVG. (The SVG has math notation (rendered LaTeX) in it, which is why it can be inline in a paragraph. And I probably didn’t pick the prettiest way to insert the SVG, but it works.)

  2. richcon Says:

    Eric,

    I noticed the embedded SVG code uses the normal XML parsing rules: a self-closing tag, and namespaces in (though nowhere in the code was that namespace declared). Yet you also indicated that XML-style self-closing tags will not be supported in HTML.)

    Are the contents of the tag parsed differently, in a full XML parser with the proper namespaces automatically registered, while everything outside of it is parsed as HTML?

  3. richcon Says:

    Oops, my code was stripped. Here it is again:

    I noticed the embedded SVG code uses the normal XML parsing rules: a self-closing [path /] tag, and namespaces in the xlink:href attribute (though nowhere in the code was that namespace declared). Yet you also indicated that XML-style self-closing [script /] tags will not be supported in HTML.

    Are the contents of the tag parsed differently, in a full XML parser with the proper namespaces automatically registered, while everything outside of it is parsed as HTML?

  4. abarth Says:

    @richcon: The contents of the tag are parsed using the “foreign content” parsing rules from HTML5. They’re designed to make it easy for you to copy and paste SVG content directly into HTML.

  5. basil Says:

    Does the HTML5 parsing algorithm yield any performance benefits?

  6. m0 Says:

    @basil: There is more discussion of that on the mailing list: https://lists.webkit.org/pipermail/webkit-dev/2010-June/thread.html#13215 Adam posted some benchmarks (although they are two months old) in that thread, which you can read here: https://lists.webkit.org/pipermail/webkit-dev/2010-June/013244.html