In Others’ Words

How Web Analytics evolved by adopting terms and concepts from other disciplines and making them its own

I like to tell a story about how Web Analytics has evolved over the last twenty years or so. As a guide I am going to use a couple of terms and concepts that Web Analysts have adopted from other disciplines and made their own. These are, specifically:

  • Analytics, the very word itself
  • Validity, or why we think some data is better than other data
  • Attribution, or, the rational user and why he isn’t
  • Big Data, or, not little data just more of it

I will also try to crack a few jokes, which usually does not end well (but all things strive, so bear with me).

Analytics

Analysis is the process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it. The technique has been applied in the study of mathematics and logic since before Aristotle (384–322 BC).
(Wikipedia)

So the idea of analysis/analytics clearly precedes the internet, as do I – I got my first job in online marketing at the end of the 1990s. Before that I studied education at Berlin’s Humboldt University, and as students we had a really high opinion of that whole newfangled “Internet” thing – we thought that if everybody had a say, the world would become a lot more democratic; in reality it just got a lot louder instead.

Online marketing at the end of the 1990s usually did not have much of a business plan. We were rather full of ourselves. We called ourselves the new economy and quoted Ulrich Beck on how the knowledge society would soon supplant the old-fashioned old economy which, you know, actually made things.

This was not how things actually worked out. In the first dot-com bust most of the new economy collapsed; the old economy bought what was left and, with a certain amount of schadenfreude, let us run through the hamster wheel of frugal employment for years to come.

However, these were not wasted years. There were the first experiments with mobile content and games; of course, the contemporary feature phone had a 160*25 pixel greyscale display, which was a bit too unappealing for success, but rumor has it that the idea of mobile apps has gained some traction recently. This was also the time of the browser wars, which eventually resulted in web standards being established. CSS, proposed by Håkon Wium Lie in 1994, started to gain traction, leading to a separation of content and presentation. So a lot of groundwork was laid during that time.

There was not much Web Analytics.

This was, at least in part, because there was a lot less advertising. Google barely existed, Search Engine Marketing meant buying a display banner on the Yahoo or Altavista homepage, and there were times when you had click-through rates in the double digits. You did not need that much analysis, because you knew when somebody had clicked on your banner. In a way it says something about the kind of success you can hope for with advertising today that you need a sophisticated statistics tool to find out if your ads have been successful at all; you cannot tell just by looking.

Ad serving technology became better and cheaper. Users were exposed to a much larger volume of advertising and developed certain defense mechanisms. Analytics emerged as a new and necessary field.

Analytics, Wikipedia tells us, is the process of breaking a complex topic or substance into smaller parts. What exactly do we break apart when we do Web Analytics? These days we usually think of abstract concepts – “our data”, “our business goals”, something like that. But there is actually a kind of physical substrate to our data that is broken up into different buckets to aggregate information. Back in the 90s that substrate was the server logfile, which might look like this:

[screenshot: a raw server logfile]

For each “hit” to the server the logfile contains the IP address of the user, a user agent string, date and time, the request method, the address of the requested resource, the status code and the size of the requested resource in bytes.

To make that into a useful statistic you would throw away all hits to images and other assets and keep just the entries for the content files. Then you would break the lines up by whitespace into individual fields. This would give answers to questions like “in my given timeframe, what browsers accessed my site” or “what were my top ten pages in terms of pageviews”, and similar. Undoubtedly that is useful information.
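To make this concrete, here is a minimal sketch of that kind of logfile analysis, written in Python (my choice for this post, not necessarily what anybody used back then). It assumes a combined-log-format access log named access.log; both the format details and the file name are illustrative. It throws away hits to images and other assets and counts pageviews per path.

    # A minimal sketch of 1990s-style logfile analysis.
    # Assumes the "combined" log format; field order may differ for your server.
    import re
    from collections import Counter

    LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<size>\S+)'
        r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
    )

    ASSETS = ('.gif', '.jpg', '.png', '.css', '.js', '.ico')

    def top_pages(logfile_path, n=10):
        pages = Counter()
        with open(logfile_path) as fh:
            for raw in fh:
                hit = LINE.match(raw)
                if not hit:
                    continue                 # skip malformed lines
                path = hit.group('path')
                if path.lower().endswith(ASSETS):
                    continue                 # throw away images and other assets
                pages[path] += 1             # one pageview per remaining content hit
        return pages.most_common(n)

    if __name__ == '__main__':
        for path, views in top_pages('access.log'):
            print(f'{views:8d}  {path}')

That is, roughly, the whole trick: split lines into fields, drop what you do not care about, count the rest.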

There are however a few things missing from such a logfile.

The first one is the user, which is a pretty big thing to be missing. HTTP is a “stateless” protocol: individual requests to the server are not connected to each other, and there is no built-in marker to tell which requests belong to the same visitor.

There were several attempts to make up for this. The most successful so far has been the use of cookies. A cookie is a small piece of text that is stored by the browser and sent back to the server of origin every time a request is made. Such a cookie can contain a client id, which is then used to maintain a session comprising individual pageviews. By their founders’ account this technique was pioneered by Urchin, the company that was later acquired by Google and remade into Google Analytics.
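As a rough illustration of the mechanism (not of how Urchin actually did it), here is a minimal Python sketch of a WSGI-style handler; the cookie name client_id and the two-year lifetime are made up for the example. On the first request it mints a client id and sends it back in a Set-Cookie header; on every later request the browser returns that id, so the hits can be grouped.

    # A minimal sketch of cookie-based visitor identification (WSGI-style).
    # The cookie name "client_id" and its lifetime are illustrative choices.
    import uuid
    from http.cookies import SimpleCookie

    def identify(environ):
        """Return the visitor's client id, minting a new one if none is set yet."""
        jar = SimpleCookie(environ.get('HTTP_COOKIE', ''))
        if 'client_id' in jar:
            return jar['client_id'].value, None            # returning browser
        client_id = uuid.uuid4().hex                       # brand-new visitor
        set_cookie = f'client_id={client_id}; Path=/; Max-Age=63072000'
        return client_id, set_cookie

    def app(environ, start_response):
        client_id, set_cookie = identify(environ)
        headers = [('Content-Type', 'text/plain')]
        if set_cookie:
            headers.append(('Set-Cookie', set_cookie))     # only sent on first contact
        start_response('200 OK', headers)
        return [f'hello, visitor {client_id}\n'.encode()]

Every hit that carries the same client_id can then be counted as coming from the same browser, which is exactly the assumption that breaks down next.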

This cookie-based approach worked okay-ish as long as there was a 1:1 ratio of people to computers. If you have multiple people per computer, or multiple computers and other web-enabled devices per person, the system breaks down. Likewise, if users delete their cookies, or prevent their browser from setting cookies, we cannot recognize returning users, or users at all (the reports in our analytics software will still show data, but that data does not reflect reality).

Another thing that is missing is the proper session duration. Duration is the delta between two datapoints. After the last hit for a client id there is no further datapoint to compute a delta from, so we do not know how much time a user spent on the last page of the session, nor do we actually know when the visitor left. As far as we know the visitor might linger on the last page of his visit forever.

In the end this problem was resolved by convention rather than technology. It was assumed that anybody who had not produced another interaction within 30 minutes had probably left the site. This convention became a de-facto industry standard and is still in use today.
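To show how that convention plays out, here is another minimal Python sketch; the client id and timestamps are made-up illustration data. Hits are grouped per client id, a gap of more than 30 minutes starts a new session, and the duration of a one-page session necessarily comes out as zero, because there is no second datapoint to compute a delta from.

    # A minimal sketch of the 30-minute sessionization convention.
    # Hits are (client_id, timestamp) pairs, assumed to be sorted by time.
    from datetime import datetime, timedelta

    TIMEOUT = timedelta(minutes=30)

    def sessionize(hits):
        """Group each client's hits into sessions separated by more than 30 minutes."""
        sessions = {}                                   # client_id -> list of sessions
        for client_id, ts in hits:
            client_sessions = sessions.setdefault(client_id, [])
            if client_sessions and ts - client_sessions[-1][-1] <= TIMEOUT:
                client_sessions[-1].append(ts)          # still the same visit
            else:
                client_sessions.append([ts])            # gap too long: a new visit begins
        return sessions

    def session_duration(session):
        """Delta between first and last hit; the last page itself adds nothing."""
        return session[-1] - session[0]

    hits = [
        ('abc', datetime(2016, 5, 1, 9, 0)),
        ('abc', datetime(2016, 5, 1, 9, 5)),
        ('abc', datetime(2016, 5, 1, 10, 0)),   # 55-minute gap: counted as a new session
    ]
    for client_id, sessions in sessionize(hits).items():
        for session in sessions:
            print(client_id, len(session), 'pageviews,', session_duration(session))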

So the interesting thing here is that while we use the most scientific term we could find – “Analytics” – there are uncertainties and assumptions baked into our discipline right from the start. As a science we are somewhat less exact than we would like to be.

Key Learnings

  • Analytics
    Despite our best efforts there is a certain fudge factor built in

(to be continued)
