Somewhat belatedly, here is the last part of the presentation from Measurecamp Berlin (I managed to grab one of the coveted tickets for the next Measurecamp in London, which is a good incentive to finally finish this series). This last bit turned out to be somewhat expensive: as it happened, “Big Data” was on the “forbidden word list” that required a donation per uttered instance. So that was an extra 20 for a refugee initiative, for some closing words on big data. So what does the term mean (if anything)?
Big Data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. The term “big data” refers […] seldom to a particular size of data set.
The term big data has a somewhat unclear provenance. Wikipedia suggests that one of the first people to use the term was John R. Mashey of Silicon Graphics Inc., which is defunct by now but was a pretty big deal in the 80s and 90s and best known for their high end graphics workstations. People who do work with 3D graphics have a natural interest in moving around big amounts of data.
So for Mashey “big data” was to a large extent a hardware problem. He was worried about rpm numbers of hard disks, I/O throughput rates, bus sizes etc., which he summarily calls “infrastress”. This is not usually the sense in which we use the term “big data” today.
Actually, after all the big promises and subsequent failures of big data, and the great sobering up that followed, Big Data is often described as “little data, just more of it”. However, that is not quite right either.
When I first heard about big data it was not in the context of web analytics, or even the internet. I read the term first in “Sterne und Weltraum” (the German edition of Sky and Telescope) in an article about LOFAR, the Low Frequency Array – the Effelsberg section of that radio telescope was awaiting completion at the time. The article described LOFAR as a “software telescope”. Where previously you would have to develop a theory and then aim your dish carefully at selected stars, you would now observe all of the sky at once, average out your channels, and everything that still stood out in the end would be a bona fide phenomenon.
In the end that was not how LOFAR (or anything, really) worked. But this kind of attempted paradigm shift became the driving idea behind “Big Data” – what an article in Wired would embrace as “The End of Theory”. No more hypotheses, theories or models would be necessary. You would simply collect all possible data, and the underlying reality would reveal itself.
Did it work?
Well, kind of, sort of. Not really for Web Analytics, thank god – that kind of thing could easily put me out of a job. Many tools offer some kind of data-driven automation, but for the most part we feel uneasy about yielding full control to opaque algorithms. There are a number of successful big data applications, but in Web Analytics, for better or worse, we are still building models.
Despite our best efforts there is a certain fudge factor built in
Always test multiple conflicting hypotheses to arrive at valid conclusions
Really cool (but you do not necessarily want to lead with that)
- Big Data
Yes! But for data scientists with big data sets rather than for web analysts