The Messy and Scary World of Big Data

Big data is a topic that is frequently discussed in the most generic terms and rarely fully explained. Even within our own articles we have avoided explaining this revolutionary sector of the business intelligence field because of its depth and complexity. This article will not give you detailed information about big data, but hopefully it will give you an understanding of its nature and get you excited about the implications for businesses. There are dozens of books out there that delve into exactly what big data is and what its broad implications are. Few people realise that every day some element of their life is influenced by big data analytics, and the impact of this field ranges widely. Some examples of the influence of big data include improving sea travel since the 1800s, forecasting flu outbreaks, predicting stock market trends, determining who to insure, finding the cheapest periods to fly and preventing exploding manhole covers in New York City. The impact of big data is only increasing as our world undergoes rapid “datafication”.

If any of the above seems interesting, I implore you to check out the book “Big Data: A Revolution That Will Transform How We Live, Work and Think” by Viktor Mayer-Schönberger and Kenneth Cukier. The book looks at the morass that is big data and breaks down its nature, implications and challenges through real-world cases and simple concepts.

In this article we won’t go into too much detail, but we will outline a few of the key concepts of big data and explain why the world of big data is a messy and scary place with a level of potential that can only be characterised as revolutionary.

Big data, as the name suggests, is at its core a huge collection of information. One of the elements that distinguishes a big data source from a ‘small’ data source is the sheer volume of the data and the frequency with which it is collected. An example from Mayer-Schönberger and Cukier’s book is the project Google launched to predict outbreaks of flu within the United States based on fifty million of the most common search queries. Google applied millions of mathematical models to the data, which eventually led to a list of forty-five search terms that could predict where and when an outbreak of flu would occur with a high degree of reliability. The scary facet of this is not only that Google can predict the spread of disease from simple searches that might resemble “headache”, “runny nose” or “flu medication”, but also that this data set is quite small in comparison to those that number in the tens of billions of rows. The messy component starts with what the data actually looks like.
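To give a feel for how a list of useful search terms might be whittled down, here is a minimal Python sketch that ranks candidate terms by how closely their weekly search counts track weekly flu cases. It is not Google's actual method. The function names, the use of a plain Pearson correlation and every number in it are assumptions made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a completely flat series carries no signal
    return cov / sqrt(var_x * var_y)


def rank_terms(term_counts, flu_cases, top_n):
    """Return the top_n (term, correlation) pairs, strongest correlation first."""
    scored = [(term, pearson(counts, flu_cases))
              for term, counts in term_counts.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up weekly figures for six weeks (illustrative only, not real data).
flu_cases = [10, 12, 30, 55, 40, 15]
term_counts = {
    "flu medication": [100, 110, 260, 480, 350, 140],
    "runny nose": [80, 95, 150, 300, 260, 120],
    "cheap flights": [500, 480, 510, 490, 505, 495],
}

top_terms = rank_terms(term_counts, flu_cases, top_n=2)
for term, score in top_terms:
    print(f"{term}: {score:.2f}")
```

With these invented numbers the two flu-related terms rise to the top while the unrelated term drops out. The real project did the same kind of sifting, only across fifty million terms rather than three.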

In most business intelligence systems, companies are dealing with small, structured data from sources such as point-of-sale systems or accounting systems. This is data that contains categories or fields such as booking number, item type or ID, transaction value, location, date and time. Big data, in contrast, is normally unstructured or at least less structured. Think about the Google example above: what does a typical search query look like? Consider some of the forms a flu-related search about “headaches” could take. Not only does the logic that sorts and organises the data have to cope with a vast range of typos, it also needs to be able to read and categorise complex sentences such as “how to get rid of my cold headache” and any of the other possible variations. With that in mind, think about the processing power and the complex set of rules your system needs in order to sort and categorise all that diverse and unorganised data.

This complex logic and huge processing power is a major hurdle for big data analytics. However, the benefits of processing this much data are astounding, as it allows for the identification of connections and events that wouldn’t normally be possible. Big data allows data analysts to look beyond individual trends and outputs and see what is occurring on a macro scale without finding out why it is occurring. If you have enough data that says cities that search “warm beanies” and “runny nose” end up with a flu outbreak 95% of the time, then you don’t need to know why that happens; you just need to know that, statistically, that is what occurs most of the time. This provides companies with clear, correlation-based, actionable intelligence, which is exactly what business intelligence and big data are about.
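To make the sorting problem described above a little more concrete, here is a small Python sketch that maps messy, typo-ridden queries onto a short list of symptom terms using the standard library's `difflib` module. The term list, the `categorise` function and the similarity cutoff of 0.8 are all assumptions chosen for illustration. A real system would need far more sophisticated rules and vastly more processing power.

```python
import difflib
import re

# Hypothetical list of terms we care about (illustrative only).
SYMPTOM_TERMS = ["headache", "cold", "flu", "fever", "cough"]


def categorise(query, cutoff=0.8):
    """Return the set of symptom terms a free-text query appears to mention.

    Each word is compared against SYMPTOM_TERMS and accepted if it is
    similar enough, which lets near-miss typos through.
    """
    words = re.findall(r"[a-z]+", query.lower())
    found = set()
    for word in words:
        matches = difflib.get_close_matches(word, SYMPTOM_TERMS, n=1, cutoff=cutoff)
        if matches:
            found.add(matches[0])
    return found


for query in ["how to get rid of my cold headache", "headahce", "cheap flights"]:
    print(query, "->", sorted(categorise(query)))
```

Even this toy version has to make a judgment call: set the cutoff too low and unrelated words start matching, set it too high and genuine typos are missed. Multiply that by billions of queries and you get a sense of the hurdle.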

Dozens of companies use the inexact nature of big data to their advantage, often finding correlations where no one would have guessed they existed. Amazon might find that people who read “Fear and Loathing in Las Vegas” also end up reading “Lord of the Flies”. Alternatively, Netflix may find that, statistically, individuals who view a drama such as “House of Cards” will also eventually watch the TV shows “Scandal” or “The West Wing”. These hypothetical examples are basic uses of the correlations that can be pulled from big data sources. The more data you have, the easier it is to draw accurate conclusions, as there is less chance of small deviations in the data causing an incorrect assumption.
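The “people who read this also read that” idea can be sketched in a few lines of Python. This is only an illustration of counting co-occurrences, not a description of how Amazon or Netflix actually build their recommendations. The `also_read` function and the reading histories are invented for the example.

```python
from collections import Counter


def also_read(histories, title):
    """For readers of `title`, return the share who also read each other title."""
    readers = [set(history) for history in histories if title in history]
    if not readers:
        return {}
    counts = Counter(other for history in readers
                     for other in history if other != title)
    return {other: n / len(readers) for other, n in counts.items()}


# Made-up reading histories, one list per customer (illustrative only).
histories = [
    ["Fear and Loathing in Las Vegas", "Lord of the Flies"],
    ["Fear and Loathing in Las Vegas", "Lord of the Flies", "Dune"],
    ["Fear and Loathing in Las Vegas", "Lord of the Flies"],
    ["Fear and Loathing in Las Vegas", "Dune"],
    ["Dune", "Emma"],
]

shares = also_read(histories, "Fear and Loathing in Las Vegas")
for other, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{other}: {share:.0%}")
```

With only five customers, a single unusual reader would swing these percentages wildly. With millions of histories, the same simple count becomes far more trustworthy.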

This article aimed to explain that big data, at its core, is all about huge quantities of messy, unorganised and complex data. This data is then analysed not to find out ‘why’ something is occurring but to find out ‘what’ is occurring. It is this ‘what’ that provides companies with actionable intelligence that can be exploited in the marketplace.

If you enjoyed this article, please go and check out Mayer-Schönberger and Cukier’s book, as it is a much clearer and more informed examination of big data than I could ever hope to manage.