Top 5 Biggest Queries of 2014

During this year I became slightly addicted to the fantastic community site; a site to help people playing around with the data available in Google’s BigQuery share their queries and get help, comments, and validation on that idea.

A query can start a conversation which can end up refining or even changing the direction of the initial idea.

BigQuery contains a few different publicly available large datasets for you to query, including all of Wikipedia, Shakespeare’s works, and Github meta data.

HTTP Archive

The main use of is for discussing the contents of the HTTP Archive (there are a few about other things, however) and that’s where I’ve been focussing my nerdiness.

What follows is a summary of the five most popular HTTP Archive queries created this year, by page view. I’m hoping that you find them as fascinating as I do, and perhaps even sign up at and continue the conversation or even sign up for Big Query and submit your query for review.

Here they are, in reverse order:

5) 3rd party content: Who is guarding the cache? (1.5k views)

Doug Sillars (@dougsillars) riffs on a previous query by Ilya Grigorik to try investigating what percentage of requests are coming from 3rd parties, what is the total amount of this (in MB), and how much of it is cacheable.

I’ve run what I believe to be the same query over the entire year of 2014 and you can see the results below:

We can see that there’s a generally good show from the 3rd parties, with June and October being particularly highly cacheable; something appears to have happened in September though, as there’s a sudden drop-off after 80 of the top 100 sites whereas the other months we see that same drop-off after 90 sites.

4) Analyzing HTML, CSS, and JavaScript response bodies (2.4k views)

Ilya Grigorik (@igrigorik) gets stuck into a recent addition to the HTTP Archive (in fact, it only exists for ONE run due to the sheer volume of data); the response bodies! Mental.

By searching within the response bodies themselves – such as raw HTML, Javascript, and CSS – you’re able to look inside the inner workings of each site. The field is just text and can be interrogated by applying regular expressions or “contains” type functions.

The query he references (actually created as an example query by Steve Souders (@souders)) examines the asynchronous vs synchronous usages of the Google Analytics tracking script, which tells us that threre are 80577 async uses, 44 sync uses and a bizarre 6707 uses that fall into neither category.

I’m working on several queries myself using the response body data; it’s amazing that this is even available for querying! Do be aware that if you’re using BigQuery for this you will very quickly use up your free usage! Try downloading the mysql archive if you’re serious.

3) Sites that deliver Images using gzip/deflate encoding (4.4k views)

Paddy Ganti (@paddy_ganti) starts a great conversation by attempting to discover which domains are disobeying a guideline for reducing payload: don’t gzip images or other binary files, since their own compression algorithms will do a better job than gzip/deflate which might even result in a larger file. Yikes!

The query looks into the response’s’ content type, checking that it’s an image, and compares this with the content encoding, checking if compression has been used.

There are over 19k compressed image responses coming from Akamai alone in the latest dataset:

Although you can see the results suggest a significant number of requests are gzip or deflate encoded images, the great discussion that follows sheds some light on the reasons for this.

2) Are Popular Websites Faster? (4.9k views)

Doug Sillars (@dougsillars) has another popular query where he looks into the speed index of the most popular websites (using the “rank” column).

We’re all aware of the guideline around keeping a page load as close to a maximum of 2 seconds as possible, so do the “big sites” manage that better than the others?

If we graph the top 1000 sites – split into top 100, 100-500, and 500-1000 – and get a count of sites per Speed Index (displayed as a single whole number along the x-axis; e.g. 2 = SI 2000), we can see the relative performance of each group.

Top 100

The top 100 sites have between 25-30 sites with Speed Indexes around 2000-3000 then drop off sharply.

Top 100-500

Although the next 400 have over 60 sites each with a Speed Index of 2000 or 4000, and almost 90 sites with 3000, their drop off is smoother and there’s a long tail out to 25000.

Top 500-1000

The next 500 have a similar pattern but a much less dramatic drop off, then a gentle tail out to around 25000 again.

This shows that although there are sites in each range which achieve extremely good performance, the distribution of the remainder gets more and more spread out. Essentially the percentage of each range who achieve good performance is reduced.

The post is very detailed with lots of great visualisations of the data, leading to some interesting conclusions.

1) M dot or RWD. Which is faster? (7.6k views)

The most popular query by quite a way is another one from Doug Sillars (@dougsillars).

The key question he investigates is whether a website which redirects from the main domain to a mobile-specific domain performs better than a single responsive website.

He identifies those sites which may mobile specific using the cases below:

 WHEN HOST(requests.url)  LIKE 'm.%' then "M dot"
 WHEN HOST(requests.url)  LIKE 't.%' then "T dot"
 WHEN HOST(requests.url)  LIKE '' then "dot mobi"
 WHEN HOST(requests.url)  LIKE 'mobile%' then "mobile"
 WHEN HOST(requests.url)  LIKE 'iphone%' then "iphone"
 WHEN HOST(requests.url)  LIKE 'wap%' then "wap"
 WHEN HOST(requests.url)  LIKE 'mobil%' then "mobil"
 WHEN HOST(requests.url)  LIKE 'movil%' then "movil"
 WHEN HOST(requests.url)  LIKE 'touch%' then "touch"

The key is this clause, used to check when the HTML is being served:

 WHERE requests.firstHtml=true

These are then compared to sites whose url don’t significantly change (such as merely adding or removing “www.”).

The fascinating article goes into a heap of detail and ultimately results in the conclusion that responsively designed websites appear to outperform mobile-specific websites. Obviously, this is only true for well written sites, because it is still easy to make a complete mess of a RWD site!

Hopefully this has given you cause to head over to the website, check out what other people are looking into and possibly help out or try your own web performance detective work out over the holiday season.