"CanLII's robots.txt" by Addison Cameron-Huff

CanLII's robots.txt file provides preferential access to its database of laws & cases to users of Google vs. other search engines. This blog post explains how they're doing it and why that is inconsistent with CanLII's privileged position in the Canadian legal publishing landscape.

As a preface to this post: I'm a big fan of CanLII's service and use it regularly for my legal practice. Just because something is great doesn't mean it can't be better.

CanLII is the main distribution channel for Canadian laws and cases. It is the only "free access" service for Canadian case law. In Ontario, it is one of three corporations to which the court system distributes court decisions. At the Superior Court level, CanLII is the only place you can find decisions for free because the court does not publish them online itself (only Court of Appeal decisions are still published directly by the courts).

CanLII is the only free way that the vast majority of Canadians can access court cases. Although CanLII is not a government body (it's a non-profit corporation funded by Canadian lawyers, notaries and paralegals), there is a moral obligation when an entity is given such an important role by the courts (see also the Montreal Declaration on Free Access to Law).

Given CanLII's important role (and mission statement), it takes an odd approach to third-party search engines.

CanLII's terms only permit "indexing" (inclusion into search results) of pages on its website (i.e. cases) that are "...authorized by the instructions in the robots exclusion file at <http://www.canlii.org/robots.txt>....". But the robots.txt file that CanLII has written prohibits all indexing by everyone except Google:

User-agent: Googlebot
Disallow: /en/search
Disallow: /fr/search
Disallow: /search
Disallow: /eliisa
Disallow: /images
...

User-agent: *
Disallow: /

For those who don't speak robots.txt, the above means that "Googlebot" (the name of Google's web crawler, which gathers pages for its search engine) can index all pages except the listed ones (which appear to be the majority of the CanLII site). Any other search engine is not allowed to include any of CanLII's pages.

The result of CanLII's robots.txt file is that Google doesn't list most of the cases on CanLII. CanLII even blocks access to Ontario Court of Appeal cases ("Disallow: /en/on/onca/") that are available from the Court of Appeal on its public website.

Search engines other than Google can't include anything.
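For readers who want to check this reading of the file themselves, here is a minimal sketch using Python's standard urllib.robotparser module. It only uses the rules quoted in this post (the lines elided by "..." are left out, so the real file blocks more than this), and the helper names build_parser and can_crawl, the agent name "bingbot" and the test path /en/ are my own illustrative choices, not anything taken from CanLII.

```python
from urllib.robotparser import RobotFileParser

# Only the rules quoted in this post. The real file contains more
# Disallow lines (elided as "..." in the excerpt), so this is a partial
# reconstruction for illustration, not a copy of CanLII's robots.txt.
ROBOTS_EXCERPT = """\
User-agent: Googlebot
Disallow: /en/search
Disallow: /fr/search
Disallow: /search
Disallow: /eliisa
Disallow: /images
Disallow: /en/on/onca/

User-agent: *
Disallow: /
"""


def build_parser(text):
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def can_crawl(parser, agent, path):
    """Return True if `agent` may fetch `path` under the parsed rules."""
    return parser.can_fetch(agent, "http://www.canlii.org" + path)


parser = build_parser(ROBOTS_EXCERPT)

# "bingbot" stands in for any crawler other than Googlebot;
# "/en/" is an illustrative path that no quoted Googlebot rule covers.
for agent in ("Googlebot", "bingbot"):
    for path in ("/en/", "/en/search", "/en/on/onca/"):
        verdict = "allowed" if can_crawl(parser, agent, path) else "blocked"
        print(f"{agent:10} {path:15} {verdict}")
```

Under these assumptions, Googlebot is allowed on /en/ but blocked on /en/search and /en/on/onca/, while any other crawler falls through to the "User-agent: *" group and is blocked everywhere.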

According to a blog post by David Whelan, in 2010 CanLII blocked all search engines equally. So it would seem that CanLII (through its robots.txt file at www.canlii.org) is deliberately favouring Google, but it's possible that this is just an oversight and they meant to block all search engines equally.

Why has CanLII chosen Google as the one search engine they permit (or is this just a mistake)? Is there a deal with Google or do they just prefer it? I can't find an explanation on CanLII's blog.

Curiously, Microsoft's search engine, Bing, does have lots of CanLII results (despite the robots.txt and terms of use prohibition): http://www.bing.com/search?q=site%3Acanlii.org. I'm not quite sure why, but I think it's because Bing applies robots.txt in a non-conforming way.

If CanLII wants to prohibit search engines from accessing the majority of case law, that's one thing (although not a good thing). But to give that limited access only to Google, and to allow no access at all by other search engines, is another.

An organization that the courts are entrusting with public access to cases should not be discriminating between search engines. And if they are going to discriminate then CanLII should at least explain why they prefer Google to Bing (and other search engines like DuckDuckGo).

Addison Cameron-Huff, Cryptocurrency Lawyer

Thoughts and opinions of a Toronto-based cryptocurrency lawyer who's worked in the industry since 2014.
