Announcing SunSpider 1.0

The popular SunSpider JavaScript benchmark suite was originally released by the WebKit team over five years ago, in December 2007. It was engineered around a balance of real JavaScript code from the web, and it served as a blueprint for the sorts of language-level operations that the WebKit JavaScript engine should optimize but could not yet. And optimize we did: the original version of the benchmark reported an execution time of 9206.2ms in Safari 3 on a 2.33 GHz machine, while running SunSpider 0.9 in a modern browser like Safari 6 on a recent machine, such as my 2.2GHz i7 MacBook Pro, can easily produce a score around 250ms. That is over a 30-fold improvement!

Yet despite such dramatic performance improvements and the introduction of other suites like Kraken and Octane, SunSpider continues to be a useful benchmark. SunSpider covers a broad range of JavaScript operations, from Date, String, and Regexp manipulation to a wide variety of numerical, array-oriented, object-oriented, and functional idioms. Compared to other suites, SunSpider places greater emphasis on features that make JavaScript uniquely difficult to optimize, such as ‘eval’ and ‘for/in’. Most importantly, SunSpider is a simple and practical test not just of how quickly JavaScript code can execute once it has been running for a while, but also of how quickly the engine can warm up for very short-running code. Most websites don’t have JavaScript event handlers that run for seconds at a time; thus, SunSpider’s focus on short-running tests has direct relevance to the web.
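
As a rough illustration (this is hypothetical code, not an excerpt from the suite), the function below combines both features: ‘for/in’ forces the engine to walk an object’s dynamic property list, and a direct ‘eval’ forces it to keep the enclosing scope alive, defeating many static optimizations.

    // Illustrative only; not taken from the SunSpider tests.
    function sumMatchingFields(record, filterExpr) {
        var total = 0;
        for (var key in record) {        // property set and order are dynamic
            if (!record.hasOwnProperty(key))
                continue;
            // Direct eval can see (and even introduce) local variables,
            // so the engine must keep this scope fully materialized.
            if (eval(filterExpr))
                total += record[key];
        }
        return total;
    }

    sumMatchingFields({ a: 1, b: 2, c: 30 }, "key !== 'c'");   // returns 3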

Today we’d like to announce SunSpider 1.0, an update of the suite that fixes a number of bugs and aims to further increase test accuracy and repeatability. This update adds validation of test results, improves interaction with power management, and addresses a backlog of minor issues in the harness and tests.

Test Validation

SunSpider was originally written at a time when JavaScript engines used relatively straightforward interpretation or template compilation strategies. Consequently, it didn’t make sense for the tests to defend against optimizing compiler tricks that could cause the entirety of a test to be folded away as dead code. It also didn’t make much sense to validate the tests’ output: if your execution strategy is just interpretation, then you can validate your engine’s correctness using other approaches (like test262) and leave the benchmark to measure only execution time.

But the JavaScript runtimes of today employ increasingly sophisticated execution tiers. For example, WebKit’s JavaScript engine has three execution engines: an interpreter for start-up code, a template just-in-time compiler (JIT) for polymorphic code, and an optimizing JIT for long-running structured code. Optimizing compilers for JavaScript are increasingly capable of eliminating dead code: code whose results are never used. More importantly, as the engines reach ever-higher levels of complexity, it makes sense for our benchmarks to perform some validation of correctness. Benchmarks are uniquely capable of stressing the optimizing JIT in a way that conformance tests cannot. This observation comes from our own experience: as we added validation to our benchmarks, we were able to catch bugs sooner.
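
As a sketch of the problem (again, hypothetical code rather than an actual test), the result of the loop below is never observed, so a sufficiently aggressive optimizing compiler could legally delete the whole computation and report a near-zero running time.

    // The value of 'sum' is never returned, stored, or checked, so nothing
    // prevents an optimizing JIT from treating the entire loop as dead code.
    function benchmarkBody() {
        var sum = 0;
        for (var i = 0; i < 1000000; ++i)
            sum += i * i;
    }
    benchmarkBody();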

We address both problems by adding validation checks to 23 of the 26 SunSpider tests. The validation checks are intended to add minimal overhead to the running time of the tests; our goal was under 2% overhead. The tests not covered by validation are ones whose results depend on the timezone or on random number generation; on those tests we expect different users and implementations to get different results. We also do not validate the exact results of certain math functions like Math.sin(), since ECMAScript 5 permits these functions to return implementation-dependent results.

These validation checks serve two purposes: they force the JavaScript engine to execute the test in full, and they provide a quick way of checking that the engine executed the test correctly.
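
A hypothetical check in this spirit is shown below; the real tests compare against their own precomputed expected values rather than this made-up workload.

    // Hypothetical validation pattern: run the workload, then compare its
    // result against a precomputed expected value and fail loudly on mismatch.
    var expected = 333328333350000;      // sum of i*i for i in [0, 100000)
    var sum = 0;
    for (var i = 0; i < 100000; ++i)
        sum += i * i;
    if (sum !== expected)
        throw "ERROR: bad result: expected " + expected + " but got " + sum;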

The changes to include test validation and prevent dead code elimination are covered by WebKit bugs 38446, 63863, 114852, and 114895.

Power Management

Short-running tests of the sort used in SunSpider require special care from the test harness. The SunSpider 1.0 release further improves the repeatability of test execution times by eliminating the delay that previous versions used between test executions, thus reducing the chances of interference from the operating system’s power management logic.

The original SunSpider 0.9 test harness employed a 500ms delay between tests. The intent was to give the browser a chance to complete any asynchronous activities before the next round of measurement. But performance improved, and the power management facilities in modern operating systems became more sophisticated. Running a test that took on average 10ms with a 500ms delay between test executions meant that the machine could drop into a lower clock-rate, power-saving state before each test execution. If power management did kick in, the machine would run slower, and the SunSpider score would be penalized. Whether power management kicked in or not depended on a variety of factors outside the harness’s control. Paradoxically, this would lead to slower or noisier machines exhibiting higher performance. A slower machine would be active for a higher fraction of the test time slice, reducing the likelihood of interference from power management. A noisier machine (for example, one also busy doing other work while SunSpider testing was in progress) would likewise have a chance of getting a better result, because the background activity would inhibit power management.

The SunSpider 0.9.1 update attempted to address this problem by reducing the test delay from 500ms to 10ms. But as SunSpider performance has further improved, due to a combination of hardware improvements and JavaScript engine enhancements, we are once again seeing power management cause unpredictable SunSpider slow-downs. For example, testing on a 2.7GHz MacBook Pro may occasionally produce worse results than on a 2.2GHz MacBook Pro, not because the hardware is any worse or the browser is any different, but because the faster machine appears, from the operating system’s perspective, to be active for a smaller fraction of the time. Worse, the faster machine does not consistently run slower; we have observed bimodal execution times that ping-pong between 130ms and 150ms depending on the state of the machine.

Having bimodal results on fast hardware runs counter to our goal of making SunSpider a reliable and repeatable test. We considered various solutions to this problem. We could have made the SunSpider tests run longer, but rejected this approach since one of SunSpider’s unique strengths is its short-running nature, and other JavaScript benchmarks already do a good job of providing coverage for long-running programs. We also considered requiring users to disable power management prior to running SunSpider; this could have worked, but would have made the tests more difficult to use. Ultimately we chose to remove the 10ms test delay entirely. The original intent of the delay was to improve the repeatability and reliability of SunSpider, not to undermine them. As we experimented, we found that removing the delay entirely eliminated the bimodal results on fast hardware and never significantly hurt test repeatability on any hardware platform we tested.
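
The sketch below illustrates the scheduling change; the variable and function names are placeholders rather than the harness’s actual API.

    // Hypothetical in-browser scheduling loop; names are illustrative only.
    var tests = [
        function () { for (var i = 0, s = ""; i < 1000; ++i) s += i; },
        function () { for (var i = 0, n = 0; i < 100000; ++i) n += Math.sqrt(i); }
    ];
    var times = [];
    var currentTest = 0;

    function runNextTest() {
        if (currentTest >= tests.length) {
            console.log("times (ms): " + times.join(", "));
            return;
        }
        var start = Date.now();
        tests[currentTest]();            // run one short test synchronously
        times.push(Date.now() - start);
        currentTest++;
        // SunSpider 0.9 waited 500ms here and 0.9.1 waited 10ms; 1.0 schedules
        // the next test with no delay, giving power management little chance
        // to clock the machine down between tests.
        setTimeout(runNextTest, 0);
    }

    runNextTest();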

SunSpider 1.0 eliminates the delay between test executions. Besides reducing interference from power management facilities, this change has two other useful side-effects:

  1. The SunSpider test suite now runs up to twice as fast. Typical SunSpider tests require less than 10ms to run on fast machines, so the 10ms delay between test executions meant that most of SunSpider’s running time was spent idling. Removing the delay means the suite spends its time running tests rather than waiting between them.
  2. SunSpider now gives the browser less opportunity to hide asynchronous activities from the test harness. Our goal is to measure as much of a JavaScript engine’s overheads as possible. A 10ms delay between <10ms test runs meant that a browser that aggressively postponed garbage collection (GC) activity until idle time would never have any of its GC overheads measured by the benchmark. Postponing GC activity until idle time is a well-documented technique for improving performance, but we don’t view it as a goal of SunSpider to specifically reward that aspect of a browser’s memory management strategy. GC performance is important, and we believe that SunSpider should err on the side of measuring it rather than ignoring it.

The change to improve test repeatability when power management is in effect is covered by bug 114458.

Minor Tweaks in the Harness

SunSpider 1.0 also includes a handful of additional fixes and features in the command-line and in-browser test harnesses:

  • bug 47045: The harness doesn’t actually close() its test documents.
  • bug 60937: Avoid using an undeclared variable named ‘name’, which is a DOM API.
  • bug 71799: Extend SunSpider driver to be able to run Kraken.
  • bug 80783: Add --instruments option to SunSpider to profile with Instruments.

Give it a try!

We invite you to try out our new SunSpider 1.0 benchmark suite, report your findings, and provide feedback if you encounter unexpected issues. It is still possible, though not recommended, to run previous versions of the harness from the all versions page.