The Common Data Set

Posted on August 12, 2009; edited on August 7, 2021 by Christopher Berry

If you’ve been attending the Web Analytics Association Research Committee calls, you’ll know that I’ve been troubled by this question of a ‘common data set’.

As it is right now, data that is common, clean, and relevant to web analytics is rare. To be sure, there are heaps of open source log files (I believe the Wiki Foundation made 5 terabytes available for download a while back), but a manageable CSV file that anyone can just download and work with is pretty rare.

Such a dataset would be pretty useful from a few perspectives.

For one, it would enable researchers within our community to use a verifiable data source when making assertions about the importance of different metrics. I’m dissatisfied with what I can demonstrate here: only ‘theory and definitions’. I’m certain that several other people are too. A common data set that people could dig into to demonstrate their ideas would be invaluable from a community perspective.

I think it would reduce the bullshit quotient quite a bit too.

For two, it would give directors and managers of analytics talent a great way to evaluate prospective talent using real data. Practitioners can’t share a lot of their work. Almost always it is subject to a number of NDAs. (And with good reason.)

For three, it would offer a verifiable way to test general claims. That’s the root of real community science: verifying results and allowing other people to confirm those claims.

What’s the problem? Why isn’t there such a data source now?

First, no company would want to publish current, relevant data. It would allow enterprising competitors to take advantage of them. Second, the question of format comes to mind: most analytics is hosted on third-party systems, and free logfile readers that are ‘good’ are vanishingly rare.
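On the format point, here is a minimal sketch of what turning raw log lines into a ‘manageable CSV’ might look like. It assumes the logs are in Apache Common Log Format, which may not match any particular site. The names `CLF_PATTERN`, `FIELDS`, `log_lines_to_csv`, and `SAMPLE` are hypothetical, and the sample log line is made up purely for illustration. It also leaves the visitor’s host out of the output, on the assumption that IP addresses count as personally identifiable.

```python
import csv
import io
import re

# Assumed input: Apache Common Log Format
#   host ident user [time] "method path protocol" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

# Columns kept in the shareable file (hypothetical choice).
FIELDS = ["time", "method", "path", "status", "bytes"]


def log_lines_to_csv(lines):
    """Return CSV text for lines matching the assumed format; skip the rest."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for line in lines:
        match = CLF_PATTERN.match(line.strip())
        if match is None:
            continue  # malformed line: skipped rather than guessed at
        row = match.groupdict()
        # The host field is deliberately dropped from the output.
        writer.writerow({
            "time": row["time"],
            "method": row["method"],
            "path": row["path"],
            "status": row["status"],
            "bytes": "0" if row["bytes"] == "-" else row["bytes"],
        })
    return out.getvalue()


# Made-up sample input, for illustration only.
SAMPLE = [
    '127.0.0.1 - - [12/Aug/2009:10:52:00 -0400] "GET /index.html HTTP/1.0" 200 2326',
    'not a log line',
]

if __name__ == "__main__":
    print(log_lines_to_csv(SAMPLE))
```

Even this toy version has to make judgment calls (which fields to keep, what to do with malformed lines), which is part of why a single agreed-upon format is harder than it sounds.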

There are a couple of solution spaces I can think of.

Somebody might want to buy the books and databases of a bankrupt company. Somebody might want to leave the site hosted and make all the (non-personally identifiable) information freely available so the community of web analysts could conduct an autopsy on it. We’d have to make the web analytics vendor logins all available (which would be a stretch, but not insurmountable). Though, truth be told, the thought of it being my bankrupt company – out there all exposed – makes my blood go cold. But you never know. Somebody might be into that.

There are also the great guys at Quantcast. While they don’t have clickstream data, they do make an awful lot of data available on their website for free. The problem, of course, is that it’s not all nicely ETL’d into a format that is convenient and distributable within the community.

I wouldn’t blog about it if I didn’t think that it’s an opportunity for us all to do some good and advance the practice. And it’s a practice that is worth advancing.

5 thoughts on “The Common Data Set”

Jim Novo says:

August 12, 2009 at 10:52 pm

When we were working on the very first WAA course, we proposed using a common dataset that all the vendors could run against.

Then standard screenshots from each vendor could be displayed and people could get a rough idea of the different analytical approaches using a comparable data base.

DEAD – ON – ARRIVAL
Jose says:

August 14, 2009 at 6:12 pm

Wait a second. Isn’t this what the Web Analytics ChampionShip, in a way, was going to be: a common data set available to WAA members… at least for a few weeks?

I hope there was not a vendor-related issue behind the contest being “postponed”.
Christopher Berry says:

August 18, 2009 at 7:51 pm

@JimNovo I imagine that many vendors would have just as much to hide as many companies. I’m somewhat happy to hear that you tried.

@josedavilla I certainly hope not.
Jacques Warren says:

August 19, 2009 at 2:31 pm

I’m all for it. Been talking about similar stuff by proposing a universal tagging structure (http://www.waomarketing.com/blog/?p=59), but was called a dreamer.

Our WA blogosphere is full of crappy assumptions people happily repeat without questioning. We’re in dire need of *proving* what we say, and the current inability to ensure reproducibility of results is a big hurdle.

I guess one of the reasons is that we have been boxed into a paradigm defined by technologists and commercial interests. What would science be if the nature/structure of data depended on the brand of the measurement tool…
Christopher Berry says:

August 20, 2009 at 12:49 pm

@jacqueswarren I agree with you. How things are now is frankly ass-backwards, and that’s by and large a function of the problems WA was trying to solve in the first place:

A fragmented solution set.

I can’t think of a way around that. Yet.
