According to Industry Pundits™, IT departments worldwide will be spending an amount of money equal to the GDP of several midsized countries on Big Data. As someone who has been around the block a few times, I have seen many hype cycles come and go. But there is something truly staggering about the hype around Big Data these days. It is enough to make you question the whole thing – or at least wonder just how deep the inevitable trough of disillusionment is going to be!
Is It Big Data, or Big Hype?
There is, of course, a kernel of truth to the hype. Much of the most interesting work in tech in general – and the startup world in particular – is occurring in what can vaguely be called the “Big Data” space. Amazon, Facebook, and Google, three of the so-called “four horsemen of technology” (the fourth being Apple), drive their profitability via Big Data. As these companies lead, it isn’t just other startups that follow: corporate IT departments are suddenly looking at their long-running “Business Intelligence” initiatives and wondering why they are not seeing the same kinds of return on investment. They are thinking… if only we tweaked that “BI” initiative and somehow mixed in some “Big Data”, maybe *we* could become the next Amazon.
Sadly, many such initiatives are doomed to failure. Many companies will task IT with coming up with a big data initiative, but won’t really involve the business at all. Many of these cases will result in IT going out and buying a product, installing it on desktops (or perhaps even tablets), and subsequently declaring victory. Of course, the value from this activity is dubious at best, typically resulting in lots of licenses acquired that, ultimately, sit on a shelf and seldom get used.
That might be bad, but there are worse things that can happen. There will be others that will go forth and try to build a comprehensive platform. Because IT often works in isolation, without a specific business problem to work on, the urge is often to try to build a solution that a theoretical business person can use to do analytics in a generic sense. There are, of course, serious problems with this approach:
- Tendency to spend years “perfecting the universal platform”.
- Building a platform that, in an attempt to be usable by a general business user, dumbs things down and doesn’t deliver sufficient value.
Without a specific business problem to focus on, not only do you build more platform than you need, but you tend to build a platform that lacks the depth needed to solve the kinds of problems you solve with modern analytical tools. Let’s face it – most business users are not experts in how to use Monte Carlo simulation, neural networks, or other tools that are in the domain of the data scientist.
Want To Do Analytics? Start with a Business Problem!
It seems like rather trite advice, and I am loath to say “never”, but I can say with certainty that you should never build out a Big Data or Analytics initiative without a specific business problem in mind. Sure, there are lots of people who would love to raid your corporate treasury in order to build out the “one platform to rule them all” – or even just sell you a bunch of shelfware. But that doesn’t actually do what you are there to do, which is get results, not buy software.
Our advice at ThoughtWorks is, unambiguously, to start small. Every good analytics problem has an underlying analytics question – something like “given what we know about a customer, how likely are they to leave for a competitor?” or “given a set of transactions, what is the likelihood of fraud?”. We then look at what data sources – some conventional, some far from conventional – are available to help solve the problem. We do discovery work in the data, and proceed to work up a hypothesis about how we can use said data to predict something about the customer, transaction, or other subject of interest. And then we test our hypothesis. If the test works out, we move on and find a way to operationalize our finding for the benefit of the customer. If not, we seek the next hypothesis and try again.
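To make the "form a hypothesis, then test it" step concrete, here is a minimal sketch in Python. It is not a description of any particular ThoughtWorks engagement: the hypothesis (customers who file many support tickets churn more often), the `churned` field, the group sizes, and the data itself are all made up for illustration. The test used is a simple one-sided permutation test, one of many reasonable choices.

```python
import random

def churn_rate(customers):
    """Fraction of customers in the group who churned."""
    return sum(c["churned"] for c in customers) / len(customers)

def permutation_test(group_a, group_b, trials=2000, seed=0):
    """One-sided permutation test: is group_a's churn rate higher than
    group_b's by more than random relabeling would explain?"""
    rng = random.Random(seed)
    observed = churn_rate(group_a) - churn_rate(group_b)
    pooled = group_a + group_b
    at_least_as_extreme = 0
    for _ in range(trials):
        # Shuffle the labels: if group membership is irrelevant,
        # a random split should look much like the real one.
        rng.shuffle(pooled)
        a = pooled[:len(group_a)]
        b = pooled[len(group_a):]
        if churn_rate(a) - churn_rate(b) >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / trials

# Synthetic, hypothetical data: 40 customers with many support tickets
# (16 churned) and 60 customers with few tickets (6 churned).
many_tickets = [{"churned": i < 16} for i in range(40)]
few_tickets = [{"churned": i < 6} for i in range(60)]

difference, p_value = permutation_test(many_tickets, few_tickets)
print(f"difference in churn rate: {difference:.2f}, p-value: {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be chance, which is the signal to move on to operationalizing the finding; a large one is the signal to go back and form the next hypothesis.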
The key to success in any agile process is the feedback loop. The problem with BI, and even many incarnations of more modern analytics initiatives, is the distinct lack of feedback that occurs when you build a platform before you solve a problem. We call this Agile Analytics because we use these feedback loops – and the learning associated with them – to guide our efforts.
Data Science versus Data Voodoo
Of course, a feedback loop doesn’t guarantee results. The work, the science, has to be solid as well. The tools of yesterday, doing things like building data cubes to slice and dice data, might be good at telling you what happened. They do little, however, to tell you the meaning behind the data, or to help you make predictions. For this, we bring in data scientists, often people with PhDs in mathematics, physics, or related fields, to develop predictive algorithms. These are the kind of people who developed things like modern spam filters, which use Bayesian classifiers to predict an answer to the question “is this email likely to be spam?”
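For readers who have not seen one, the idea behind a Bayesian spam classifier can be sketched in a few lines of Python. This is a toy naive Bayes filter with Laplace smoothing, not the implementation behind any real spam filter; the class name, the whitespace tokenizer, and the four training messages are all invented for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

class NaiveBayesSpamFilter:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        if min(self.doc_counts.values()) == 0:
            raise ValueError("need at least one spam and one ham example")
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Start from the log prior: how common is this class overall?
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                # Laplace (add-one) smoothing avoids zero probabilities.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            log_scores[label] = score
        # Turn the two log scores into a normalized probability of spam.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])

# Hypothetical training messages, made up for this example.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win cash prize now", "spam")
spam_filter.train("cheap pills win big", "spam")
spam_filter.train("meeting agenda for monday", "ham")
spam_filter.train("lunch on monday", "ham")

print(spam_filter.spam_probability("win a cash prize"))
print(spam_filter.spam_probability("monday meeting agenda"))
```

The arithmetic is simple; the hard part, and the part that needs a data scientist, is deciding whether the independence assumption holds well enough for the data at hand and what to do when it does not.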
The tools of the data scientist are vast indeed. Neural networks, machine learning, natural language processing, and more are among the techniques such people often use when developing analytics solutions. More importantly, they know when to use these tools, when simpler tools will suffice, and when to pivot from one technique to another if a hypothesis does not work out. It is the data scientist, not the tools themselves, who makes Agile Analytics possible. This is why we cringe when we see products advertised that claim to let end users of all proficiencies bypass the data scientist and, say, apply neural networks to data sets. We would feel the same way about do-it-yourself surgery kits!
Scaling Based on Results
The pundits are right about the potential around analytics and big data. But they frequently understate the risk of wasting money by engaging in long-running programs that invest far too much before seeing results. ThoughtWorks took up Agile because we saw companies engage in massive waste by investing money in projects for years before any results were realized. Analytics is no different. In fact, given the amount of hype around analytics, it is even more prudent that you demand results before you scale up investment.
Our call to action is to start with discovery. We are happy to help with this, but even if we don’t, we believe in starting these initiatives with small teams of fewer than six people. Start by establishing an analytics question, engage in a short discovery phase – typically around 3 weeks – to form and test your hypothesis, and based on your results, go from there. When Google got started, they certainly did not start with a seven-figure budget and some enterprise software package promising to “index the internet”. Neither should you, as you engage in your analytics initiative.