WebTrends is the most common server log file reader in use today. Many newer tools don’t rely on log files at all, but log files are still worth discussing.
Back in 1997, one group reported that it took 240 person-hours to develop Perl scripts to read the files, and another 100 person-hours to put just half a month’s worth of data into a ‘usable format’. It’s not simple, because there are four types of server logs: access, agent, error, and referrer, and each is hard to parse in its own way.
Access logs tell you the time, the IP address, and which files were downloaded (.html, .flv, .jpeg, .pdf, and so on). This is where the misleading term “hits” came from. Every file that was downloaded counted as a ‘hit’, in part because that kind of information was relatively easy to aggregate. People like to follow the path of least resistance, and since the original IT workers were people, this is what they did.
So, back in 1997 (and possibly to this day, if I were dealing with an audience that didn’t know what ‘hits’ meant), I could, if I were unethical, inflate my ‘hits’ count by cutting my main logo into 4-pixel-by-4-pixel pieces, so that loading a single page could result in several hundred ‘hits’ on the server. I recall a period around 1997–1999 when it was believed that breaking an image into dozens of smaller files would improve download performance; I can’t vouch for the actual results, but it certainly increased the number of hits on a site.
Agent logs tell you about the browser, operating system, and computer that somebody is using.
Error logs tell you about 404s and other errors, such as the number of aborted downloads.
Referrer logs tell you how people came to your site.
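To make the access-log format concrete, here is a minimal sketch, in Python rather than Perl, of what those early parsing scripts had to do. It assumes the Apache Common Log Format; the function name `parse_access_line` and the field names are my own.

```python
import re

# One Apache Common Log Format line looks like:
# 127.0.0.1 - - [10/Oct/1997:13:55:36 -0700] "GET /logo.gif HTTP/1.0" 200 2326
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_line(line):
    """Split one access-log line into named fields, or return None if malformed."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None
```

Every downloaded file — the logo, each image slice, the page itself — produces one such line, which is why counting lines gives you ‘hits’ rather than anything like visits.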
It took a number of years, but the community started moving toward ‘accesses’ (later called pageviews). Sadly, the way pageviews were conceptualized in some products meant they never approximated nice Nielsen-TV-ratings-style numbers, so of late we’ve started chasing the concept of unique visitors or ‘absolute unique visitors’, though I really question a lot of these efforts.
The path that visitors used to take through a site used to be called ‘threading’. It’s long since been renamed “pathing” or “click-path analysis”.
Log files can be enormous. Parsing and aggregating them can take a tremendous amount of computer power.
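As a toy illustration of the kind of aggregation involved — and of why ‘hits’ and pageviews diverge — the sketch below counts every request as a hit but only HTML documents as pageviews. The rule for what counts as a page is my own crude assumption; real products used far more elaborate logic over far more data.

```python
from collections import Counter

def aggregate(records):
    """Tally parsed access-log records: every request is a 'hit',
    but only HTML documents count as 'pageviews'."""
    counts = Counter()
    for rec in records:
        counts["hits"] += 1
        # Crude pageview rule: count pages, not images or other assets.
        if rec["path"].endswith((".html", ".htm", "/")):
            counts["pageviews"] += 1
    return counts
```

A page with a logo sliced into a hundred image files would register a hundred-plus hits here, but still only one pageview.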
Moreover, people can use multiple computers; I myself use three computers a day. Worse, I think this pursuit of ‘eyeball’ measurement is noble but, at this point, impractical. Nielsen measures how many people are watching a given show at a specific period of time, what I’d call a ‘synchronous’ measure. Websites, and the Internet generally, seem to be fundamentally ‘asynchronous’. So maybe asynchronous measurement works better?
Log files are where web analytics was born. And it just seems like we’re always questing for an old-media measure.