Targeting Domains with Content Blockers

Content Blocker extensions are enjoying tremendous success on iOS and Mac. Developers are being creative with the API, and users love the speed and privacy benefits. If you are interested in writing your own Content Blocker, I suggest starting with the introduction and the official documentation.

Content Blockers are different from traditional Safari Extensions. Rather than having the engine (in this case, WebKit) ask extension code how to behave, the extension tells the engine how to behave in advance. Each Content Blocker extension defines a list of rules to follow, in order. Each rule is activated only when certain conditions are met. Those conditions are defined in the “trigger” of each rule. In this blog post, we are going to explore one particularly popular type of trigger: activation based on domain names.

Domain names and URL-filter

One of the most popular kinds of trigger is activated by matching certain domains. Typically, the author wants to prevent the browser from disclosing information to those domains. The trigger specification does not have an explicit way of matching requests to a certain domain. Instead, the “url-filter” field matches a regular expression against the complete URL of every request.

There are many ways to write a regular expression that matches a domain name. In general, the runtime performance is great regardless of how the expression is written. What happens in the backend is that the Content Blocker engine compiles the rules into a small number of optimized finite state machines. Different ways of writing the same expression generate the same machines. Some definitions are more generic than others; it is the engine’s job to optimize them.

The bad news is that different regular expressions can cause very different compile times. Some extensions compile in milliseconds while others take seconds. The more work we give to the optimizer, the longer the compilation process takes. On iOS, this means that after a user turns a content blocker on, it takes longer before it applies to loaded pages.

In the following sections, we will see how the definition of URL filters affects the compile time. The running example is to match “a-tracker.com” and all its subdomains.

Generic domain matching

Before going through the Content Blocker engine, URLs are canonicalized. The most direct way of matching a domain is to match the exact sequence of characters that can appear in a valid canonical URL. It would look something like this:

[
    {
        "trigger": {
            "url-filter": "^[a-z][-+.0-9a-z]*:/+([!$%&'()*+,-.0-9:;=a-z_~]*@)?([-%0-9a-z_]+\\.)*a-tracker\\.com[:/]"
        },
        "action": {
            "type": "block"
        }
    }
]

Compiling fifty thousand patterns of this form takes about 10 seconds in the worst case. A bit long. The reason compiling such a pattern takes a long time is its complexity. The optimizer has to spend a fair amount of time figuring out the most efficient way to handle all the character sets.
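To get a feel for which URLs a pattern like this is meant to catch, it can be tried against a few sample URLs. The following is a minimal sketch that uses Python's “re” module as a stand-in. It is not the Content Blocker engine and says nothing about compile time; the sample URLs and the helper name “is_match” are made up for illustration, and “re.IGNORECASE” only approximates a filter that does not set “url-filter-is-case-sensitive”.

```python
import re

# The generic url-filter from this section as a Python raw string, so the
# backslashes do not need the extra escaping that a JSON string requires.
GENERIC = r"^[a-z][-+.0-9a-z]*:/+([!$%&'()*+,-.0-9:;=a-z_~]*@)?([-%0-9a-z_]+\.)*a-tracker\.com[:/]"


def is_match(pattern: str, url: str) -> bool:
    """Return True if the pattern matches the URL, ignoring case."""
    # re.IGNORECASE approximates a filter that is not marked case-sensitive.
    return re.search(pattern, url, re.IGNORECASE) is not None


# Made-up sample URLs: the first three should match, the rest should not.
SAMPLES = [
    "http://a-tracker.com/",
    "https://cdn.a-tracker.com:8443/pixel.gif",
    "https://user:pw@sub.a-tracker.com/x",
    "https://nota-tracker.com/",
    "https://a-tracker.com.example.org/",
    "https://example.org/?u=a-tracker.com/",
]

for url in SAMPLES:
    print(f"{is_match(GENERIC, url)!s:5}  {url}")
```

The trailing “[:/]” is what keeps “a-tracker.com.example.org” from matching, and the anchored scheme and host structure is what keeps a mention of the domain in a query string from matching.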

Let’s simplify the expression for http-like URLs

The URL specifications give plenty of freedom in what can be defined. In most cases, we only care about the subset that matches URLs with the HTTP scheme. We can improve the pattern by making the structure stricter! First, the scheme and domain names are always lowercase in a canonical URL, so the filter can be made case-sensitive. We can change the trigger to:

"trigger": {
    "url-filter": "^[a-z][-+.0-9a-z]*://+([!$%&'()*+,-.0-9:;=A-Za-z_~]*@)?([-%0-9a-z_]+\\.)*a-tracker\\.com[:/]",
    "url-filter-is-case-sensitive": true
}

This is already 20% faster!

Next, let’s simplify the scheme definition. When compiling “^[a-z][-+.0-9a-z]*://+”, the compiler generates code matching specifically the characters defined here. Matching the first character is easy: it just needs to be between “a” and “z” inclusive. Matching the next characters is a bit more complicated because more possibilities need to be tested.

There are two options to make that part trivial. If it is okay to restrict the pattern to HTTP/HTTPS, using the pattern “^https?://+” is effective. If you need to match any scheme, the pattern “^[^:]+://” is a good option. Instead of looking for a character in the range from “a” to “z”, the compiler can simply skip input until the first “:” character. This is a simpler operation, which makes it easier to optimize.

We can use the same idea to match anything in front of the domain. Instead of matching the username, password and subdomains exactly, we can accept anything except the characters that end the domain name (“:” and “/”). We end up with a simplified expression like this:

"trigger": {
    "url-filter": "^[^:]+://+([^:/]+\\.)?a-tracker\\.com[:/]",
    "url-filter-is-case-sensitive": true
}

Compiling fifty thousand patterns like this takes 2.7 seconds in the worst case, almost four times faster than the original pattern!
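The simplified pattern can be checked the same way. This is again only a sketch with Python's “re” module standing in for the real engine, with made-up sample URLs and a hypothetical helper named “is_match”; matching is case-sensitive here, mirroring “url-filter-is-case-sensitive”.

```python
import re

# The simplified, case-sensitive url-filter as a Python raw string.
SIMPLE = r"^[^:]+://+([^:/]+\.)?a-tracker\.com[:/]"


def is_match(url: str) -> bool:
    """Return True if the simplified pattern matches the URL."""
    return re.search(SIMPLE, url) is not None


# Made-up sample URLs mapped to the result this sketch expects.
EXPECTED = {
    "http://a-tracker.com/": True,
    "https://cdn.a-tracker.com:8443/pixel.gif": True,
    # "[^:/]+" may itself contain dots, so several subdomain levels match.
    "https://a.b.a-tracker.com/": True,
    "https://user@sub.a-tracker.com/x": True,
    "https://nota-tracker.com/": False,
    "https://a-tracker.com.example.org/": False,
    # "[^:/]+" stops at the ":" between user name and password.
    "https://user:pw@sub.a-tracker.com/x": False,
}

for url, expected in EXPECTED.items():
    assert is_match(url) is expected, url
    print(f"{expected!s:5}  {url}")
```

One difference shows up in this approximation: a URL with a “user:password@” prefix is no longer matched, because “[^:/]+” cannot cross the colon. Whether such URLs matter in practice depends on what the URLs look like by the time they reach the engine, which this sketch does not model.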

Conclusion

The Content Blocker compiler is there to optimize the performance of running Content Blockers. Such optimizations can take time. If your extension has several thousand rules, it is worth looking into using simple regular expressions to reduce the compile time. Doing so ensures that users can enjoy the benefits of your extension as soon as it is activated. For domain matching, my advice is to use triggers of the form:

"trigger": {
    "url-filter": "^[^:]+://+([^:/]+\\.)?domain-to-match\\.tld[:/]",
    "url-filter-is-case-sensitive": true
}
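With thousands of domains, rules of this form are usually generated rather than written by hand. Here is a minimal sketch of such a generator, assuming every rule uses the “block” action as in the first example. The helper names “domain_trigger” and “block_rules” and the second domain are hypothetical. Letting “json.dumps” serialize the list takes care of doubling the backslashes, which JSON strings require.

```python
import json

# Characters expected in a lowercase, canonical host name (an assumption of
# this sketch; anything else is rejected rather than escaped).
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-.")


def domain_trigger(domain: str) -> dict:
    """Build a trigger of the recommended form for one domain."""
    if not domain or not set(domain) <= ALLOWED:
        raise ValueError(f"unexpected character in domain: {domain!r}")
    # "." is the only character in ALLOWED that is special in this position.
    escaped = domain.replace(".", "\\.")
    return {
        "url-filter": "^[^:]+://+([^:/]+\\.)?" + escaped + "[:/]",
        "url-filter-is-case-sensitive": True,
    }


def block_rules(domains) -> list:
    """Wrap each domain's trigger in a rule with a "block" action."""
    return [
        {"trigger": domain_trigger(d), "action": {"type": "block"}}
        for d in domains
    ]


rules = block_rules(["a-tracker.com", "ads.example"])
print(json.dumps(rules, indent=4))
```

Rejecting unexpected characters instead of escaping them is a deliberate choice in this sketch: it avoids relying on escape sequences that the Content Blocker regular expression syntax may not accept.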

For other cases, keep in mind the following:

  • Use “url-filter-is-case-sensitive” when possible. It halves the number of characters to consider.
  • Do not forget to escape special characters. In particular, an unescaped “.” can make a pattern much more generic than it needs to be.
  • Sometimes, it is simpler to define which characters should not match instead of listing everything that can match.
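The point about escaping “.” is easy to underestimate. The sketch below compares an escaped and an unescaped variant of the HTTP/HTTPS pattern, using Python's “re” module as a stand-in for the real engine. The sample URLs and the helper name “is_match” are made up for illustration.

```python
import re

# Same pattern twice; the second one forgets to escape the dots.
ESCAPED = r"^https?://+([^:/]+\.)?a-tracker\.com[:/]"
UNESCAPED = r"^https?://+([^:/]+.)?a-tracker.com[:/]"


def is_match(pattern: str, url: str) -> bool:
    """Return True if the pattern matches the URL (case-sensitive)."""
    return re.search(pattern, url) is not None


# Made-up sample URLs.
SAMPLES = [
    "https://cdn.a-tracker.com/",          # intended target
    "https://nota-tracker.com/",           # unrelated domain
    "https://example.org/a-tracker.com/",  # domain name only in the path
]

for url in SAMPLES:
    print(f"escaped={is_match(ESCAPED, url)!s:5} "
          f"unescaped={is_match(UNESCAPED, url)!s:5} {url}")
```

In this approximation, the unescaped variant also matches the unrelated domain (the “.” stands in for the “t” of “nota-tracker”) and the URL that only mentions the domain in its path (the “.” stands in for the “/”), while the escaped variant matches only the intended target.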

I would be happy to answer questions about anything related to Content Blockers in WebKit. You can find me on Twitter at @awfulben. Brian can answer questions regarding the use of Content Blockers in Safari. As usual, you can also get in touch with @jonathandavis with any other questions.