Hype, “Big Data”, and Towards a More Pragmatic Analytics

According to Industry Pundits™, IT departments worldwide will be spending an amount of money equal to the GDP of several midsized countries on Big Data. As someone who has been around the block a few times, I have seen many hype cycles come and go. But there is something truly staggering about the hype around Big Data these days. It is enough to make you question the whole thing – or at least wonder just how deep the inevitable trough of disillusionment is going to be!

Is It Big Data, or Big Hype?

There is, of course, a kernel of truth to the hype. Much of the most interesting work in tech in general – and the startup world in particular – is occurring in what can vaguely be called the “Big Data” space. Amazon, Facebook, and Google, three of the so-called “four horsemen of technology” (the fourth being Apple), drive their profitability via Big Data. As these companies lead, it isn’t just other startups that follow: corporate IT departments are suddenly looking at their long-running “Business Intelligence” initiatives and wondering why they are not seeing the same kind of return on investment. They are thinking… if only we tweaked that “BI” initiative and somehow mixed in some “Big Data”, maybe *we* could become the next Amazon.

Sadly, many such initiatives are doomed to failure. Many companies will task IT with coming up with a big data initiative, but won’t really involve the business at all. Many of these cases will result in IT going out and buying a product, installing it on desktops (or perhaps even tablets), and subsequently declaring victory. Of course, the value from this activity is dubious at best, typically resulting in lots of licenses acquired that, ultimately, sit on a shelf and seldom get used.

That might be bad, but there are worse things that can happen. There will be others that will go forth and try to build a comprehensive platform. Because IT often works in isolation, without a specific business problem to work on, the urge is often to try to build a solution that a theoretical business person can use to do analytics in a generic sense. There are, of course, serious problems with this approach:

  • Tendency to spend years “perfecting the universal platform”.
  • Building a platform that, in an attempt to be usable by a general business user, dumbs things down and doesn’t deliver sufficient value.

Without a specific business problem to focus on, not only do you build more platform than you need, but you tend to build a platform that lacks the depth needed to solve the kinds of problems that modern analytical tools are meant to solve. Let’s face it – most business users are not experts in Monte Carlo simulation, neural networks, or other tools that belong to the domain of the data scientist.

Want To Do Analytics? Start with a Business Problem!

It seems like rather trite advice, and I am loath to say “never” very often, but I can say with certainty that you should never build out a Big Data or Analytics initiative without a specific business problem in mind. Sure, there are lots of people who would love to raid your corporate treasury in order to build out the “one platform to rule them all” – or even just sell you a bunch of shelfware. But that doesn’t accomplish what you are there to do, which is to get results, not buy software.

Our advice at ThoughtWorks is, unambiguously, to start small. Every good analytics problem has an underlying analytics question – something like “given what we know about a customer, how likely are they to leave for a competitor?” or “given a set of transactions, what is the likelihood of fraud?”. We then look at what data sources – some conventional, some far from conventional – are available to help solve the problem. We do discovery work in the data, and work up a hypothesis about how we can use that data to predict something about the customer, transaction, or other subject of interest. Then we test our hypothesis. If the test works out, we move on and find a way to operationalize our finding for the benefit of the customer. If not, we seek the next hypothesis and try again.

The key to success in any agile process is the feedback loop. The problem with BI, and even with many incarnations of more modern analytics initiatives, is the distinct lack of feedback that occurs when you build a platform before you solve a problem. We call this Agile Analytics because we use these feedback loops – and the learning associated with them – to guide our efforts.

Data Science versus Data Voodoo

Of course, a feedback loop doesn’t guarantee results. The work, the science, has to be solid as well. The tools of yesterday, such as data cubes built to slice and dice data, might be good at telling you what happened. They do little, however, to tell you the meaning behind the data, or to support prediction. For this, we bring in data scientists, often people with PhDs in mathematics, physics, or related fields, to develop predictive algorithms. These are the kind of people who developed things like modern spam filters, which use Bayesian classifiers to answer the question “is this email likely to be spam?”
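
To make the spam filter example a little more concrete, here is a minimal sketch of a naive Bayes classifier in Python. It is an illustration under our own assumptions, not a description of how any real filter is built: the function names (`train`, `spam_probability`), the four-message training set, and the simple whitespace tokenizer are all hypothetical, and a real filter would need far more data and care.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")


def train(messages):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def spam_probability(text, model):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for label in LABELS:
        # Start from the log prior: how common is this label overall?
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        log_scores[label] = score
    # Convert the two log scores back into a normalized probability.
    highest = max(log_scores.values())
    weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Hypothetical toy data, purely for illustration.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

model = train(TRAINING)
print(spam_probability("win a cash prize", model))
print(spam_probability("monday meeting", model))
```

With this toy data, the first message scores well above 0.5 and the second well below it. The point is not the code itself, but that even a simple predictive model rests on choices (priors, smoothing, tokenization) that someone has to understand and justify.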

The tools of the data scientist are vast indeed. Neural networks, machine learning, natural language processing, and more are among the techniques such people often use when developing analytics solutions. More importantly, they know when to use these tools, when simpler tools will suffice, and when to pivot from one technique to another if a hypothesis does not work out. It is the data scientist, not the tools themselves, that makes Agile Analytics possible. This is why we cringe when we see products advertised that claim to let end users bypass the data scientist and allow users of all proficiencies to, say, apply neural networks to data sets. We would feel the same way about do-it-yourself surgery kits!

Scaling Based on Results

The pundits are right about the potential around analytics and big data. But they frequently understate the risk of wasting money by engaging in long-running programs that invest far too much before seeing results. ThoughtWorks took up Agile because we saw companies engage in massive waste by investing money in projects for years before any results were realized. Analytics is no different. In fact, given the amount of hype around analytics, it is even more prudent to demand results before you scale up investment.

Our call to action is to start with discovery. We are happy to help with this, but even if we don’t, we believe in starting these initiatives with small teams of fewer than six people. Start by establishing an analytics question, engage in a short discovery phase – typically around 3 weeks – to form and test your hypothesis, and based on your results, go from there. When Google got started, they certainly did not start with a seven-figure budget and some enterprise software package promising to “index the internet”. Neither should you, as you engage in your analytics initiative.

10 thoughts on “Hype, “Big Data”, and Towards a More Pragmatic Analytics”

  1. Very good post, and I liked reading it. A solution is needed when there is a problem. Trying to put in a solution for a nonexistent business problem could bring along more unanticipated troubles. Jumping into a big data initiative without a business case is evidence of a lack of proper governance.

  2. Most times when a company calls me and says they need a BIG data solution, they really want some numbers – nay, they want to see colour images of grouped numbers…

    Well done to have described Big Data and Analytics side by side

  3. alg0rhythm says:

    I agree with most of what is said here… especially about the tendency of many organizations to hop on to trendy initiatives that end up with a lot of waste. I’ve talked to organizations that were 1 app / 1 server minimum. Going too big can be rough and wasteful.

    However… I do think one data standard is a key idea. If information flow stimulates growth, getting rid of data silos is important. Is it not possible to migrate the data of an existing organization onto one platform?
    I am in the opening stages of an experiment to see if all data… can be stored in the same warehouse with different permissioning structures.

    1. Aaron Erickson says:

      I actually don’t think it’s ever possible to be on the same platform. Too much data being generated at the edge in most organizations, much of which is the most useful/interesting. I am doing this now with a major IaaS provider – where the amount of data being generated, the organizational complexity, and the need for both owned and outside data in order to do proper analytics implores us to take a multi-platform strategy.

      In other words, data generation has reached escape velocity. It is being generated far faster than any org can warehouse it. You have to take a strategy of dealing with data as it is, not as you wish it to be.

      1. alg0rhythm says:

        Thanks for replying. It’s been a while since I managed a corporate data structure, and even then it was only segments of the enterprise. Not sure I know what you mean by the edge… and when you say platform, do you mean something like Oracle?

      2. Aaron Erickson says:

        When I say edge, I mean things like logs (structured, unstructured), sensor data coming in from devices around the network, systems in shadow IT, and anything else that tends to escape the control of IT directly. The “dark matter” of corporate data, if you will – which almost always escapes direct control from corp IT.

      3. OK, but wouldn’t it be possible to leave space and create a rubric for either manually adding or automating additions to edge data later? To analyze it in context with other data (which, without specific examples, I can’t be sure is necessary), you would have to pull it from its silo anyway, right?

      4. Aaron Erickson says:

        I suppose such a thing is possible, in theory. However, I fail to see a great reason for doing so. The analysis is usually going to be in some alt structure in most deep analytics scenarios (think Hadoop or other map-reduce platforms).

        Why do we need everything on the same platform anyway? Why even bother trying? We can analyze data in the context of other data across multiple platforms reasonably easily. The only people I see wanting to build platforms are people who either sell the platforms (think EDW vendors) or IT departments that have already invested heavily in such systems and for some reason need to justify said systems’ existence.
