Hype, “Big Data”, and Towards a More Pragmatic Analytics

According to Industry Pundits™, IT departments worldwide will be spending an amount of money equal to the GDP of several midsized countries on Big Data. As someone who has been around the block a few times, I have seen many hype cycles come and go. But there is something truly staggering about the hype around Big Data these days. It is enough to make you question the whole thing – or at least wonder just how deep the inevitable trough of disillusionment is going to be!

Is It Big Data, or Big Hype?

There is, of course, a kernel of truth to the hype. Much of the most interesting work in tech in general – and the startup world in particular – is occurring in what can vaguely be called the “Big Data” space. Amazon, Facebook, and Google, three of the so-called “four horsemen of technology” (the fourth being Apple), drive their profitability via Big Data. As these companies lead, it isn’t just other startups following; corporate IT departments are suddenly looking at their long-running “Business Intelligence” initiatives and wondering why they are not seeing the same kinds of return on investment. They are thinking… if only we tweaked that “BI” initiative and somehow mixed in some “Big Data”, maybe *we* could become the next Amazon.

Sadly, many such initiatives are doomed to failure. Many companies will task IT with coming up with a Big Data initiative, but won’t really involve the business at all. In many of these cases, IT will go out and buy a product, install it on desktops (or perhaps even tablets), and subsequently declare victory. Of course, the value from this activity is dubious at best, typically resulting in lots of licenses that ultimately sit on a shelf and seldom get used.

That might be bad, but worse things can happen. Others will go forth and try to build a comprehensive platform. Because IT often works in isolation, without a specific business problem to solve, the urge is often to build a solution that a theoretical business person can use to do analytics in a generic sense. There are, of course, serious problems with this approach:

  • A tendency to spend years “perfecting the universal platform”.
  • A platform that, in an attempt to be usable by a general business user, dumbs things down and doesn’t deliver sufficient value.

Without a specific business problem to focus on, not only do you build more platform than you need, but you tend to build a platform that lacks the depth needed to solve the kinds of problems modern analytical tools can solve. Let’s face it – most business users are not experts in how to use Monte Carlo simulation, neural networks, or other tools that are in the domain of the data scientist.
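To make “Monte Carlo simulation” a little more concrete, here is a minimal sketch of the kind of thing a data scientist might reach for – estimating the chance a project overruns its budget by repeatedly simulating uncertain task costs. The task cost ranges, budget figure, and function name are invented for illustration.

```python
import random

def simulate_overrun_probability(tasks, budget, trials=10_000, seed=42):
    """Estimate P(total cost > budget) by Monte Carlo simulation.

    Each task's cost is drawn uniformly from its (low, high) estimate range;
    we count how often the simulated total exceeds the budget.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    overruns = 0
    for _ in range(trials):
        total = sum(rng.uniform(low, high) for low, high in tasks)
        if total > budget:
            overruns += 1
    return overruns / trials

# Hypothetical (low, high) cost estimates for three tasks.
tasks = [(10, 20), (5, 15), (8, 25)]
print(simulate_overrun_probability(tasks, budget=45))
```

The point of the example is not the arithmetic – it is that framing a question this way (and knowing when the framing is appropriate) is exactly the skill a general-purpose “analytics platform” cannot package up.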

Want To Do Analytics? Start with a Business Problem!

It seems like rather trite advice, and I am loath to say “never” very often, but I can say with certainty that you should never build out a Big Data or Analytics initiative without a specific business problem in mind. Sure, there are lots of people who would love to raid your corporate treasury in order to build out the “one platform to rule them all” – or even just sell you a bunch of shelfware. But that doesn’t actually accomplish what you are there to do, which is to get results, not buy software.

Our advice at ThoughtWorks is, unambiguously, to start small. Every good analytics problem has an underlying analytics question – something like “given what we know about a customer, how likely are they to leave for a competitor?” or “given a set of transactions, what is the likelihood of fraud?”. We then look at what data sources – some conventional, some far from conventional – are available to help solve the problem. We do discovery work in the data, and proceed to work up a hypothesis about how we can use said data to predict something about the customer, transaction, or other subject of interest. And then we test our hypothesis. If the test works out, we find a way to operationalize our finding for the benefit of the customer. If not, we seek the next hypothesis and try again.
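That hypothesis-test loop can be sketched in a few lines. The data set, field names, and login threshold below are all hypothetical – the point is only the shape of the loop: pose a question, state a testable hypothesis, check it against data.

```python
# Question: given what we know about a customer, how likely are they to leave?
# Hypothetical records: (monthly_logins, churned_within_year)
customers = [
    (2, True), (1, True), (3, False), (0, True),
    (12, False), (9, False), (15, False), (1, True),
    (8, False), (2, False), (11, False), (0, True),
]

def churn_rate(records):
    """Fraction of customers in `records` who churned."""
    if not records:
        return 0.0
    return sum(1 for _, churned in records if churned) / len(records)

# Hypothesis: customers with fewer than 5 logins/month churn more often.
low_usage  = [r for r in customers if r[0] < 5]
high_usage = [r for r in customers if r[0] >= 5]

low_rate, high_rate = churn_rate(low_usage), churn_rate(high_usage)
hypothesis_supported = low_rate > high_rate
print(low_rate, high_rate, hypothesis_supported)
```

A real engagement would use far richer data and proper statistical tests, but even this toy version gives you the feedback a platform-first effort never does: the hypothesis either survives contact with the data, or you move on to the next one.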

The key to success in any agile process is the feedback loop. The problem with BI and even many incarnations of more modern analytics initiatives is the distinct lack of feedback that occurs when you build a platform before you solve a problem. The reason we call this Agile Analytics is because we use these feedback loops – and the learning associated with them – to guide our efforts.

Data Science versus Data Voodoo

Of course, a feedback loop doesn’t guarantee results. The work – the science – has to be solid as well. The tools of yesterday, doing things like building data cubes to slice and dice data, might be good at telling you what happened. They do little, however, to tell you the meaning behind the data, or to support prediction. For this, we bring in data scientists – often people with PhDs in mathematics, physics, or related fields – to develop predictive algorithms. These are the kind of people who developed things like modern spam filters, which answer the question “is this email likely to be spam?” using Bayesian classifiers.
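To give a flavor of the Bayesian classifiers mentioned above, here is a toy naive Bayes spam filter. The four-message “corpus” is invented, and real filters are considerably more sophisticated, but the core idea – compare how likely a message’s words are under the spam corpus versus the ham corpus – is the same.

```python
from collections import Counter
from math import log

# Hypothetical training corpora.
spam = ["win cash now", "free cash offer", "win free prize"]
ham  = ["meeting at noon", "lunch at noon tomorrow", "project status meeting"]

def word_counts(messages):
    return Counter(w for m in messages for w in m.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(message, counts, total):
    # Laplace smoothing so unseen words don't zero out the probability.
    return sum(log((counts[w] + 1) / (total + len(vocab)))
               for w in message.split())

def is_spam(message):
    # Equal class priors here, so we compare the likelihoods directly.
    return (log_likelihood(message, spam_counts, spam_total)
            > log_likelihood(message, ham_counts, ham_total))

print(is_spam("free cash prize"))        # leans toward the spam corpus
print(is_spam("status meeting at noon")) # leans toward the ham corpus
```

Notice that nothing here is exotic – the sophistication is in knowing the technique, its assumptions, and when it applies, which is exactly the data scientist’s contribution.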

The tools of the data scientist are vast indeed. Neural networks, machine learning, natural language processing, and more are among the techniques such people use when developing analytics solutions. More importantly, they know when to use these tools, when simpler tools will suffice – and when a pivot from one technique to another is needed if a hypothesis does not work out. It is the data scientist, not the tools themselves, that makes Agile Analytics possible. This is why we cringe when we see products advertised that claim to let end users bypass the data scientist and allow users of all proficiencies to, say, apply neural networks to data sets. We would feel the same way about do-it-yourself surgery kits!

Scaling Based on Results

The pundits are right about the potential of analytics and Big Data. But they frequently understate the risk of wasting money by engaging in long-running programs that invest far too much before seeing results. ThoughtWorks took up Agile because we saw companies engage in massive waste by pouring money into projects for years before results were realized. Analytics is no different. In fact, given the amount of hype around analytics, it is even more prudent to demand results before you scale up investment.

Our call to action is to start with discovery. We are happy to help with this, but even if we don’t, we believe in starting these initiatives with small teams of fewer than six people. Start by establishing an analytics question, engage in a short discovery phase – typically around three weeks – to form and test your hypothesis, and based on your results, go from there. When Google got started, they certainly did not begin with a seven-figure budget and some enterprise software package promising to “index the internet”. Neither should you, as you engage in your analytics initiative.


Business Intelligence does not Come From a Product

There is this guy, Bradford Cross, whom I met on my first project at ThoughtWorks.  I remember the day vividly, as it was my first day at a client that, you could say, was something of a well-known company in the top tier of accounting firms.  The kind of place one might assume, without knowing better, still had a “business attire” dress code.  Well – here I am on day one, and this person walks in, wearing (I think) a green belt and unmatched pants, with messy hair – ready and eager to start work on the new project.

Welcome to ThoughtWorks, and Welcome to Silicon Valley!

Fast forward to today, and this same person, whom I remember having a passion for functional programming and math that far exceeded anything I could muster, is now one of the people behind FlightCaster – a service that uses functional programming (Clojure and Hadoop) to implement a predictive algorithm that can tell you – with far more accuracy than the airlines – when your flight will arrive!  Being the cynical skeptic that I tend to be, I really did not believe it until I started using the service myself.  While I am only two flights in, the results are very promising.

To sum up what it does, the service uses ten years’ worth of flight data, along with a certain amount of statistical wizardry that I almost certainly don’t understand, to correlate various events (plane arrival times, weather, etc.) with instances where delays occur.  More to the point, when I think of what business intelligence should be, what FlightCaster does seems far more interesting to me than what I have seen in most vendor presentations.  This is what leads me to the idea that BI comes from an idea about how correlation might occur, combined with technical folks who understand statistics and know how to harness data to do analytics.  It almost certainly never comes about because you installed a tool.
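A drastically simplified sketch of the FlightCaster idea: estimate delay probability conditioned on an observable event, straight from historical frequencies. The records and field names below are invented for illustration – the real service obviously uses vastly more data and more sophisticated statistics.

```python
from collections import defaultdict

# Hypothetical records: (weather_at_origin, delayed_more_than_30_min)
history = [
    ("storm", True), ("storm", True), ("storm", False),
    ("clear", False), ("clear", False), ("clear", True),
    ("clear", False), ("storm", True), ("clear", False),
]

def delay_probability(records):
    """Estimate P(delay | weather) from historical frequencies."""
    totals, delays = defaultdict(int), defaultdict(int)
    for weather, delayed in records:
        totals[weather] += 1
        if delayed:
            delays[weather] += 1
    return {w: delays[w] / totals[w] for w in totals}

probs = delay_probability(history)
print(probs)  # stormy departures correlate with a higher delay rate
```

Which is really the whole argument of this post in miniature: a clear question (“will my flight be late?”), data, and a bit of statistics – no enterprise BI suite required.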

So if that is where BI comes from, what is the ideal toolset for helping to realize it?  There is no shortage of vendors who will try to sell you something that locks you into their stack.  Or some whiz-bang UI that convinces you that you can go cheap on people so long as you have good tools.  A typical anti-pattern: an organization buys the tools, then basically uses them to do reporting against a data warehouse and post some simplistic results to a dashboard.  Something that, when you get down to it, could have been done without the tools, and probably without the fancy dashboard, at far lower cost using simpler means.

This is especially true today.  We now have functional languages, equipped with great math and statistics libraries, that are for the most part mainstream (Clojure and F# come to mind).  We have open source tools like Hadoop (no more big, expensive software) that make this stuff work in distributed computing environments, letting you leverage commodity hardware (no more big, expensive iron).  If FlightCaster, a service as innovative as anything I have ever seen in the BI space – probably more so – runs on this kind of stack, it is probably good enough for what most companies would ever call their own BI efforts.

So what are the barriers to this?  One barrier I have encountered is a bias against BI efforts that require “programming”.  I once gave a webinar on this topic to a group of CIOs, consulting executives, and BI product vendors.  One of my points was that the most useful BI efforts would require a technologist.  One guy in the crowd literally booed.  There is a subculture in the world of BI that thinks that if you have to have someone program something, you have done a bad job.  Why this is the case I can’t be certain, but given that there are corners of the IT world that see programming as something to be avoided, it is not entirely surprising.

Another barrier is that there are more than a few CIOs or “Directors of BI” who are charged with implementing a BI initiative, but have no clue where to start.  They know they need it, and that is as far as it goes.  An opportunistic salesperson comes on the scene, sells the SKU, and suddenly we have a BI initiative made mostly of shelfware.  It does not help that the world of “big database” (think IBM, Oracle, Microsoft) sells database products that are largely commodities, with proprietary BI extensions usually added for differentiation purposes.  With a database software sales force already ensconced in the world of the enterprise CIO, it should surprise nobody that there are folks who think you need IBM, Oracle, Microsoft, or some other product to do BI.

I am here to say hogwash.  You don’t need any of that stuff.  The tools are here, they are free, and they have clearly done BI on a scale grander than most BI efforts I have seen inside most corporate IT.  What you need is a decent use case (tell me when my flight will arrive), that is solvable by a combination of data, math, and organizational will.  Do that, and you will have Business Intelligence.
