Introducing the JetStream Benchmark Suite

Today we are introducing a new WebKit JavaScript benchmark suite, JetStream. JetStream codifies what our de facto process has been: combining latency and throughput benchmarks with roughly equal weighting, and covering both traditional JavaScript programming styles and newer JavaScript-based technologies that have captured our imaginations. Scores on JetStream are a good indicator of the performance users would see in advanced web applications like games.

Optimizing the performance of our JavaScript engine is a high priority for the WebKit project. Improvements we introduced in the last year include concurrent compilation, generational GC, and the FTL JIT. Engineering such improvements requires focus: we try to prioritize high-impact projects over building and maintaining complex optimizations with smaller benefits. Thus, we motivate performance work with benchmarks that illustrate the kinds of workloads WebKit users are likely to encounter. This philosophy of benchmark-driven development has long been part of WebKit.

The Previous State of JavaScript Benchmarking

As we made enhancements to the WebKit JavaScript engine, we found that no single benchmark suite was entirely representative of the scenarios we wanted to improve. We like that JSBench measures the performance of JavaScript code on popular websites, but WebKit already does very well on this benchmark. We like SunSpider for its coverage of commonly used language constructs and because its running time is representative of the running time of code on the web, but it falls short for measuring peak throughput. We like Octane, but it skews too far in the other direction: it’s useful for determining our engine’s peak throughput but isn’t sensitive enough to the performance you’d be most likely to see on typical web workloads. It also downplays novel JavaScript technologies like asm.js; only one of Octane’s 15 benchmarks is an asm.js test, and it ignores floating-point performance.

Finding good asm.js benchmarks is difficult. Even though Emscripten is gaining mindshare, its tests are long-running and, until recently, lacked a web harness. So we built our own asm.js benchmarks using tests from the LLVM test suite. These C and C++ tests are used by LLVM developers to track performance improvements in the clang/LLVM compiler stack. Emscripten itself uses LLVM to generate JavaScript code, which makes the LLVM test suite particularly appropriate for testing how well a JavaScript engine handles native-style code. Another benefit of our new tests is that they run much more quickly than the Emscripten test suite.
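To give a sense of the style of code these tests exercise, here is a small hand-written module in the asm.js idiom. It is only a sketch of the shape of Emscripten-generated code; the module and function names are illustrative, not taken from JetStream’s sources.

```js
// A minimal asm.js-style module: typed-array heap access, int32 and
// double coercions, and no garbage-collected allocation in the hot loop.
function DotModule(stdlib, foreign, heap) {
    "use asm";
    var f64 = new stdlib.Float64Array(heap);

    // Computes the dot product of two length-n double arrays packed
    // into the heap: a at elements [0, n), b at elements [n, 2n).
    function dot(n) {
        n = n | 0;                 // parameter is an int32
        var i = 0;
        var sum = 0.0;             // local is a double
        for (; (i | 0) < (n | 0); i = (i + 1) | 0) {
            sum = sum + +f64[(i << 3) >> 3] * +f64[((i + n | 0) << 3) >> 3];
        }
        return +sum;               // result is a double
    }

    return { dot: dot };
}
```

Because this is ordinary JavaScript, any engine can run it; engines that treat the "use asm" pragma specially can validate and compile the whole module ahead of time.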

Having good JavaScript benchmarks allows us to confidently pursue ambitious improvements to WebKit. For example, SunSpider guided our concurrent compilation work, while the asm.js tests and Octane’s throughput tests motivated our work on the FTL JIT. But basing our testing on a hodgepodge of different benchmark suites has become impractical. It’s difficult to tell contributors what they should be testing when no unified test suite can tell them whether their change had the desired effect on performance. We want one test suite that reports one score, and we want that score to be representative of WebKit’s future direction.

Designing the New JetStream Benchmark Suite

Different WebKit components require different approaches to measuring performance. For DOM performance, for example, we just introduced the Speedometer benchmark. In some cases the obvious approach works well: many layout and rendering optimizations can be driven by measuring page load time on representative web pages. But measuring the performance of a programming language implementation requires more subtlety. We want to increase the benchmarks’ sensitivity to core engine improvements, but not so much that we lose perspective on how those improvements play out on real websites. We want to minimize the opportunities for system noise to throw off our measurements, but when a workload is inherently prone to noise, we want the benchmark to show it. We want our benchmarks to be a high-fidelity approximation of the workloads that WebKit users are likely to care about.

JetStream combines a variety of JavaScript benchmarks, covering a range of advanced workloads and programming techniques, and reports a single score that balances them using a geometric mean. Each test is run three times, and scores are reported with 95% confidence intervals. Each benchmark measures a distinct workload, and no single optimization technique is sufficient to speed up all of them. Some benchmarks demonstrate trade-offs: aggressive or specialized optimization for one benchmark might make another slower. Demonstrating trade-offs is crucial for our work. As discussed in my previous post about our new JIT compiler, WebKit tries to adapt dynamically to the workload using different execution tiers, but this is never perfect. For example, while our new FTL JIT compiler gives us fantastic speed-ups on peak throughput tests, it causes slight regressions in some ramp-up tests. New optimizations for advanced language runtimes often run into such trade-offs, and our goal with JetStream is to have a benchmark that informs us about the trade-offs we are making.
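As a rough illustration of this kind of aggregation (a sketch of mine, not JetStream’s actual scoring code): a geometric mean weights each test’s relative improvement equally, and a Student’s t-based interval captures run-to-run noise for a small number of repetitions.

```js
// Combine per-test scores with a geometric mean, so a 2x gain on any
// one test moves the overall score by the same factor.
function geometricMean(scores) {
    var logSum = scores.reduce(function (sum, s) {
        return sum + Math.log(s);
    }, 0);
    return Math.exp(logSum / scores.length);
}

// 95% confidence interval for the mean of a small sample. The critical
// value 4.303 is the two-sided Student's t value for 2 degrees of
// freedom, matching three runs per test; other run counts need a
// different t value.
function confidenceInterval95(runs) {
    var n = runs.length;
    var mean = runs.reduce(function (a, b) { return a + b; }, 0) / n;
    var variance = runs.reduce(function (a, x) {
        return a + (x - mean) * (x - mean);
    }, 0) / (n - 1);
    var halfWidth = 4.303 * Math.sqrt(variance / n);
    return { mean: mean, low: mean - halfWidth, high: mean + halfWidth };
}
```

The geometric mean is what lets one score stand in for many tests without letting any single long-running benchmark dominate.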

JetStream includes benchmarks from the SunSpider 1.0.2 and Octane 2 JavaScript benchmark suites. It also includes benchmarks from the LLVM open source compiler project, compiled to JavaScript using Emscripten 1.13, as well as a benchmark based on the Apache Harmony open source project’s HashMap, hand-translated to JavaScript. More information about the benchmarks included in JetStream is available on the JetStream In Depth page.
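For a flavor of the hash map workload, a hand-translated Java HashMap exercises a bucket-and-chain structure along these lines, rather than JavaScript’s built-in objects. This is a sketch of mine, not the actual hand-translated Harmony code.

```js
// A minimal chained hash map, illustrating the kind of data structure
// a hand-translated Java HashMap exercises. Sketch only; not the
// actual JetStream test.
function HashMap(capacity) {
    this.buckets = new Array(capacity || 64);
}

HashMap.prototype.hashOf = function (key) {
    var h = 0;
    for (var i = 0; i < key.length; i++) {
        h = (h * 31 + key.charCodeAt(i)) | 0;   // Java-style string hash
    }
    return (h >>> 0) % this.buckets.length;
};

HashMap.prototype.put = function (key, value) {
    var index = this.hashOf(key);
    var bucket = this.buckets[index] || (this.buckets[index] = []);
    for (var i = 0; i < bucket.length; i++) {
        if (bucket[i].key === key) { bucket[i].value = value; return; }
    }
    bucket.push({ key: key, value: value });
};

HashMap.prototype.get = function (key) {
    var bucket = this.buckets[this.hashOf(key)];
    if (bucket) {
        for (var i = 0; i < bucket.length; i++) {
            if (bucket[i].key === key) return bucket[i].value;
        }
    }
    return null;
};
```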

We’re excited to be introducing this new benchmark. To run it, simply visit browserbench.org/JetStream. You can file bugs against the benchmark using WebKit’s bug management system under the Tools/Tests component.