WebTrends is the most widely used server log file reader today. Many analytics tools no longer rely on log files at all, but they're still worth discussing.

Back in 1997, one group reported that it took 240 person-hours to develop Perl scripts to read the files, and another 100 person-hours to put just half a month's worth of data into a 'usable format'. It's not simple, because there are four types of server logs: access, agent, error, and referrer. Parsing all of these kinds of data is genuinely hard.

Access logs tell you the time, the IP address, and which files were downloaded (.html, .flv, .jpeg, .pdf, and so on). This is where the misleading term “hits” came from. Every file that got downloaded was counted as a ‘hit’, in part because that kind of information was relatively easy to aggregate. People like to follow the path of least resistance, and since the original IT workers were people, that's what they did.
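A minimal sketch of how that kind of 'hit' counting works, assuming access-log lines in Common Log Format (the sample entries below are hypothetical, not from any real server):

```python
import re
from collections import Counter

# Hypothetical access-log lines in Common Log Format; real logs vary by server.
LOG_LINES = """\
203.0.113.7 - - [10/Oct/1997:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
203.0.113.7 - - [10/Oct/1997:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 412
203.0.113.9 - - [10/Oct/1997:13:56:01 -0700] "GET /report.pdf HTTP/1.0" 200 10240
""".splitlines()

# CLF layout: host identd user [timestamp] "method path protocol" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)')

hits = Counter()
for line in LOG_LINES:
    m = CLF.match(line)
    if not m:
        continue
    host, when, method, path, status, size = m.groups()
    hits[path] += 1  # every file request counts as one 'hit', page or image alike

print(hits)
```

Note that the image request is counted exactly like the page request, which is how a single page with dozens of embedded images inflates the 'hits' number.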

So, back in 1997 (and possibly to this day, if I were dealing with an audience that didn't know what ‘hits’ meant), I could, if I were unethical, inflate my ‘hits’ count by taking my main logo and cutting it up into 4-pixel-by-4-pixel pieces, so that loading a single page could generate several hundred ‘hits’ on the server. I recall a period around 1997-1999 when it was believed that breaking image files into dozens of smaller pieces would improve download performance. I can't vouch for the actual results, but it certainly increased the number of hits on a site.

Agent logs tell you about the browser, operating system, and computer that somebody is using.

Error logs tell you about 404s and other errors, such as aborted downloads.

Referrer logs tell you how people came to your site.

It took a number of years, but the community started moving towards ‘accesses’ (later called pageviews). Sadly, the way pageviews were conceptualized in some products, they never approximated nice Nielsen-TV-ratings-style numbers, so lately we've started chasing the concept of unique visitors or ‘absolute unique visitors’, though I really question a lot of these efforts.

The path that visitors used to take through a site used to be called ‘threading’. It’s long since been renamed “pathing” or “click-path analysis”.
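As a rough sketch of what click-path analysis does with access-log data, grouping page requests by visitor and keeping them in time order (the visitor identifiers and page names below are hypothetical):

```python
from collections import defaultdict

# Hypothetical (visitor, page) events already extracted from an access log,
# listed in timestamp order. Real pathing also needs sessionization: splitting
# a visitor's events into visits whenever there is a long time gap.
events = [
    ("203.0.113.7", "/index.html"),
    ("203.0.113.9", "/index.html"),
    ("203.0.113.7", "/products.html"),
    ("203.0.113.7", "/checkout.html"),
    ("203.0.113.9", "/about.html"),
]

# Accumulate each visitor's ordered sequence of pages: their click path.
paths = defaultdict(list)
for visitor, page in events:
    paths[visitor].append(page)

for visitor, path in paths.items():
    print(visitor, " -> ".join(path))
```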

Log files can be enormous. Parsing and aggregating them can take a tremendous amount of computer power.

Log file analysis has problems similar to other measurement systems. Tag-based systems use cookies to track unique visitors, which can be problematic because some users (to the tune of 3-5%, and typically of a specific profile) delete their cookies. However, relying on IPs is also problematic, as some ISPs will change a user's IP address due to server loading, even midway through a session.

Moreover, people can use multiple computers; I myself use three a day. Worse, I think this pursuit of ‘eyeball’ measurement is noble but, at this point, impractical. Nielsen measures how many people are watching a given show at a specific moment in time, what I'd call a ‘synchronous’ measure. The Internet, by contrast, seems to be really ‘asynchronous’. So maybe asynchronous measurement works better?

Log files are where web analytics was born. And it just seems like we're always questing for an old-media measure.

2 thoughts on “On Log Files and Old Media Measurement”

  1. Shaina Boone says:

    I’d like to clarify here that you mean server-side generated log files rather than client-side generated log files, right? Client-side generated log files are still being used by enterprise (non-hosted) products, and sometimes in conjunction with server-side logs. The beauty of CS logs is that they can be customized to your needs by modifying the JavaScript file that generates them, so you can have multiple log files to segment data sets.

    Basic CS log file example:

    #Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)
    2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - 200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server) http://64.224.24.114/

  2. Right, right!

    Though, client-side generated logs still end up getting passed to the server, right?
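The client-side example in the comment above is in W3C Extended Log Format, which is self-describing: the "#Fields:" directive names the columns, so each space-separated data line can be zipped into a record. A minimal sketch, assuming a well-formed log (the sample lines mirror the comment's example):

```python
# W3C Extended Log Format: "#" lines are directives; "#Fields:" names the
# columns for all data lines that follow. Spaces inside a value (e.g. the
# user agent) are encoded as "+", so a plain split() recovers the columns.
SAMPLE = [
    "#Fields: date time c-ip cs-username s-ip s-port cs-method cs-uri-stem "
    "cs-uri-query sc-status sc-bytes cs-bytes time-taken cs(User-Agent) cs(Referrer)",
    "2002-05-24 20:18:01 172.224.24.114 - 206.73.118.24 80 GET /Default.htm - "
    "200 7930 248 31 Mozilla/4.0+(compatible;+MSIE+5.01;+Windows+2000+Server) "
    "http://64.224.24.114/",
]

fields = []
records = []
for line in SAMPLE:
    if line.startswith("#Fields:"):
        fields = line.split()[1:]  # column names follow the directive
    elif not line.startswith("#") and fields:
        records.append(dict(zip(fields, line.split())))

rec = records[0]
print(rec["c-ip"], rec["cs-uri-stem"], rec["sc-status"])
```

Because the field list travels with the file, the same parser handles logs whose owners customized which columns get recorded.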
