A Comparison of Data: Piwik vs Google Analytics

by

A month ago, I wrote An Adventure In Analytics, describing how user privacy and accurate reporting find themselves at odds with each other. I said that I know that I'm getting traffic that I'm not reporting, and promised that I would be running two different analytic suites (Google Analytics and Piwik) side by side for a month, to see how each reported data. Broadly, the thesis here is that some ad blockers interfere with Google Analytics, therefore if I use something else, the data will look a lot different.

When I started this process, I went in with some assumptions:

  1. Piwik is going to be annoying to set up.
  2. Piwik is going to be giving me numbers that are significantly higher than those coming out of Google Analytics, because I think a substantial number of visitors are being hidden behind adblockers.
  3. Google Analytics is underreporting desktop browsers. Because desktop browsers can have adblock extensions and mobile browsers can't (broadly), the majority traffic GA isn't reporting is from desktop browsers.
  4. Because Opera started blocking GA by default, I should be seeing a lot more Opera traffic in the Piwik data.
  5. Open source software has a reputation for ugliness, so using Piwik is going to be a lot less pleasant than GA.

So now I've got a little bit over a month of comparable data, and I'm ready to report on everything I've learned.

Installing Piwik

Piwik operates basically the same as GA does (where one piece of Javascript is reporting on user activity), but the difference is that Piwik has it reporting to a PHP/MySQL package you have installed on your own server, which means the PHP and database needs to get setup. Being a server-side thing this was Boots' job to take care of, but he just now described the process to me as "Easy!" You upload all the files, create a database, and then run an install script. From there, you put in the javascript and you're all set. That's all you need to do.

That's not all you need to do

Immediately after getting Piwik set up, several people were reporting to me that their own browsers were blocking it as well. After some more research, it turns out that several services (including uBlock Origin, AdBlock Plus, and Ghostery) will, as a rule, reject any script with "piwik" in the name, as well as any script in a /piwik/ folder. This was frustrating to learn.

Internet Smart Guy Troy Hunt wrote about this recently in a blog article, but ad blockers are frequently overstepping their stated goals and blocking anything they can get away with, which makes it hard to defend their approach as "just removing the bad stuff". There's garbage that deserves to get blocked, of course, but if you're making the argument "We're blocking Google Analytics because it contributes to Big Data", you can't then also block the alternative which doesn't do that. So, in order to make everything work properly, we renamed some files to circumvent their system. To be clear, I really hate doing that. It exacerbates the problem and contributes to an arms race, but a rule that considers any traffic reporting malicious is a form of fundamentalism I can't really reason with. So, piece of advice: If you ever do set up Piwik, don't use the default settings.

That problem solved, we started collecting comparable data. Let's start by examining site visits. The chart below plots out sessions, uniques, and page views for either system. Click any label to show/hide that line.

See the Pen Daily Traffic by Lemon (@ahoylemon) on CodePen.

All this data is pretty easy to follow: Piwik gets installed, we work out the kinks, and then Piwik's numbers are consistently higher than GA's, though the difference is smaller than I was expecting. With only about a 10% difference between Piwik's numbers and GA's, we can infer that while some browsers are definitely blocking Google Analytics, the number isn't in the majority. Will that number increase over time? I'd guess yes, but I've been wrong before.

As a side note, what you're seeing here is all data I've copied into CodePens to give you a look at what things look like to me. A lot of developers are shy about sharing their own numbers, and I am one of them. However I felt this snapshot was important enough to embrace that, but I will add that all this data does not account for people who listen to episodes without visiting the site, which is over half the podcast's audience.

Okay, now that I've hyperventilated about the size of these numbers for an entire paragraph, I had this theory about desktop browsers being underreported, does that theory prove out?

See the Pen Comparative Device Use by Lemon (@ahoylemon) on CodePen.

The earlier chart showed a soft increase in total users, so it follows that this one would show a comparative increase in desktop users. The sample size is about 9% bigger in Piwik's chart, and that accounts for a 7.26% increase in desktop's slice of the pie. That correlation I can follow pretty well. As for the 1.2% difference in tablet usage? That's probably down to how each system categorizes these devices. GA has three neat categories (desktop/mobile/tablet) while the Piwik device list is less generalized with entries like "phablet", "portable media player", "game console", and "feature phone". I added the "phablet" total to "mobile", but I'm only just now realizing that the phablet number is 1.2%, so maybe that's solves the mystery? I'd argue that a big phone is still a phone though. Also (and more importantly) the word "phablet" is super dumb so it probably shouldn't be written down anywhere.

Regardless, the real takeaway here is that 25 people visited this website on a feature phone and I want to know what kind of experience they had.

So, that's devices explored, now let's see what kind of browsers people were using:

See the Pen Browser usage, Oct 26 - Nov 26 by Lemon (@ahoylemon) on CodePen.

And here we see a mess. It's kind of hard to learn anything from this comparison because everything is different. GA lists 20 different browsers in this time period, Piwik lists 32. GA has it's own category for the browser in a Kindle Fire, but Piwik has never heard of it. Piwik shows considerable traffic from Samsung Browser, which Google (probably) just categorizes as Chrome. Above I had a question on what "mobile" meant in each context, but here everything's been shifted, so it's hard to know what to make of it.

But there still are some takeways: Chrome is really popular on this site. About 18% more popular here than the U.S. average, and most of that number can be subtracted directly from Microsoft browsers, which are used much less than average. Earlier I mentioned a theory that I'd see a big uptick in Opera use, and that's been proven wrong, because even though there's slightly more on the Piwik side, I've learned that basically nobody is using Opera here.

Okay, one last thing to check on, and that's events. I built a hook that reports back when certain events happen on the site, like when somebody buys merch or clicks a social share button, and I redid the script so that it reported equally on both sides. Here's how that data stacks up.

See the Pen [ignore me] Chart In Progress by Lemon (@ahoylemon) on CodePen.

Perhaps the most obvious thing to report here is that if you have a website for a podcast, you should make sure people can play episodes directly from the website. You can assume that people are just listening to the show with whatever podcatcher they use, but the data shows that about 40% of the visits to the site include somebody using the embedded episode player.

As for the last assumption of Piwik being ugly, well, that wouldn't be a fair statement. There's some interfaces they do better: Real Time Visitor Log is much more informative than the GA counterpart and the referrals section works a lot better as well. However the overall design does lack cohesion and the icons specifically are inconsistent and pretty bad looking. In contrast, GA looks like what it is: A popular software platform produced by a company worth half a trillion dollars who employ a lot of really talented designers and developers. GA looks better, Piwik looks mostly good, but has a couple places where it could improve, like what's happening in this image:

So, what have we learned from this?

In this particular experience, I believe the data shows that Piwik is better for the site than Google Analytics is. Now that the experiment is over, I don't feel I have the need to run GA on the site anymore and so I removed it from the site today. The numbers aren't incredibly different, but this looks like the more responsible approach and Boots and I made plans to do the same thing to ballp.it when we get the chance. Other sites may follow.

Recent events have brought up some new fears of surveillance, and people have very good reasons to be wary of Big Data matrices. With this in mind, I feel that in the coming years, people who run websites have a responsibility to think more critically about the privacy concerns of their users. Those users, in turn, need to have a bit more sympathy with the sites they feel are acting in good faith. If you adblock, whitelist sites that you like.

This post is tagged:
tech advice analytics blog