Friday, July 5, 2013

What is Big Data, anyway?

My graduate school advisor, Rod Fort, posed the question in this post's title on Twitter today.  I gave answers and, as he usually does, he made me think more about my answers and their precision.  Technically, what I was trying to get across was that the use of Big Data, in most cases, is terribly imprecise.  I should have been able to explain the use of the term quickly, but it took a while and a number of "well, we've always done that" from Rod.  It is thrown around a lot, and in most cases not in any meaningful way.  I got a similar reaction to my mention of a prospective certificate in Complex Systems while at Michigan (which I did not pursue--mainly because my mathematical background wasn't strong enough and I had time constraints pursuing other things).

So, assuming we want to separate the use of "Big Data" with "Analytics", I think we can amply sum up the term with the following:

Big Data describes the relationship between the ability to collect data, and the ability to do something with it.  Data is BIG at the margin at which one more unit of data would leave us unable to analyze it all with the given technological capability.

This leaves Big Data flexible for the given tool.  The growth to collect and store large amounts of data has outpaced the ability to do anything meaningful with it.  This isn't anything new.  In the same way that dynamic pricing isn't really a new idea, just a new implementation.  In the same way that analytics aren't new, just a clear recognition of the integration between statisticians, programmers, and managers in the use of the term today.  In the same way that Moneyball, the idea, isn't new.  All tend to improve over time just as any field.

When it comes to analysis of Big Data--not the term big data itself--the holy grail is to have the ability to push a button, and have the answer directly to the decision maker, what I called "streamlining" on Twitter.  But this isn't Big Data itself (and it's really a fantasy at its extreme).  Certainly we can get closer to this, but data changes, behavior changes, the world changes.  These will always have to be updated, and in many ways I don't know that Big Data and Analytics as terms are completely separable.  In this case, though, let's be specific:

Analytics is the pursuit of simplistic, streamlined statistical information in a context understandable to the decision maker.

Again, unless we believe the movie Paycheck, this won't be 100% possible.  But the fantasy idea is that the computer and its data will tell us the answer to everything.  I enjoy this quote from the Big Data article linked above:  

"May 2012 danah boyd and Kate Crawford publish “Critical Questions for Big Data” in Information, Communications, and Society...(3) Mythology: the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.”

However, we can do things to ease the use of large amounts of information to make decisions.  This requires the cooperation of statisticians, programmers, and managers.  Managers need to pose the problem in a way that is tractable and understandable.  Statisticians need to know the best methods for the distribution and variability in some given set of information, and be able to communicate this back to the manager.  The programmer needs to be able to collect the data accurately--possibly with the help of many other experts--and deliver the methodology in a way that can happen in real time or as close to it as possible.  In many ways, Dynamic Pricing is an outcome of these things--but not a new idea.  Big Data commentaries and discussions are referring to closing the gap between availability and implementation.



No comments:

Post a Comment