If you’ve been attending the Web Analytics Association Research Committee calls, you’ll know that I’ve been troubled by this question of a ‘common data set’.
As it stands, data that is common, clean, and relevant to web analytics is rare. To be sure, there are heaps of open source log files (I believe the Wikimedia Foundation made 5 terabytes available for download a while back), but a manageable CSV file that anyone can pick up and work with is hard to find.
Such a dataset would be useful from a few perspectives.
For one, it would enable researchers within our community to use a verifiable data source when making assertions about the importance of different metrics. I’m dissatisfied with what I can demonstrate here: only ‘theory and definitions’. I’m certain that several other people are too. A common data set that people could dig into to demonstrate their ideas would be invaluable from a community perspective.
I think it would reduce the bullshit quotient quite a bit too.
For two, it would give directors and managers of analytics talent a great way to evaluate prospective talent using real data. Practitioners can’t share a lot of their work. Almost always it is subject to a number of NDAs (and with good reason).
For three, it would offer a verifiable way to test general claims. That’s the root of real community science: verifying results and allowing other people to confirm those claims.
What’s the problem? Why isn’t there such a data source now?
First, no company would want to publish current, relevant data. It would allow enterprising competitors to take advantage of it. Second, there is the question of format: most analytics is hosted on third-party systems, and free logfile readers that are ‘good’ are vanishingly rare.
There are a few possible solutions I can think of.
Somebody might want to buy the books and databases of a bankrupt company. Somebody might want to leave the site hosted and make all the (non-personally identifiable) information freely available so the community of web analysts could conduct an autopsy on it. We’d have to make all the web analytics vendor logins available (which would be a stretch, but not insurmountable). Though, truth be told, the thought of it being my bankrupt company, out there all exposed, makes my blood go cold. But you never know. Somebody might be into that.
There are also the great guys at Quantcast. While they don’t have clickstream data, they do make an awful lot of data available on their website for free. The problem, of course, is that it’s not all nicely ETL’d into a format that is convenient and distributable within the community.
I wouldn’t blog about it if I didn’t think that it’s an opportunity for us all to do some good and advance the practice. And it’s a practice that is worth advancing.