Real Life Issues With Big Data In The Enterprise – The Issues With Data Consistency (Or Lack Thereof)
Large enterprises face huge challenges when dealing with their Big Data. In this article I am going to outline some of the common challenges with Big Data that I see firms dealing with on a day-to-day basis. This is a continuation of the discussion started in the article titled “The Challenges of Dealing With Big Data”.
In the previous article we discussed how many firms and discussions, in and out of the press, are focused on how to analyze and gain insight from Big Data (whether it be on Twitter or in the traditional enterprise). Furthermore, I outlined how, in my experience, the root of the true problems with Big Data often lies not in how or with what tools we analyze the data, but in how we capture it, or fail to capture it, in the first place. In essence, our failure to capture the data accurately and consistently often renders analysis of it a meaningless exercise, thanks to the Garbage In = Garbage Out (GIGO) principle. To make this issue clearer, I am going to provide some real-world examples of the Big Data issues I come across with my clients on a regular basis. Unfortunately, as I started writing this it grew more than a bit long, so I have broken it into three shorter posts, of which this is the first.
The Issues With Data Consistency (Or Lack Thereof)
Consider a large enterprise. My typical clients often run between 500 and 2,000+ different applications in their data centers, spread across anywhere from 1,500 to 100,000 servers. Now imagine that Paul Michaud is a customer of this enterprise. In my lifetime I have lived and worked all over the world, and with all the temporary corporate housing I have had, it probably amounts to over 20 addresses in my adult life. As a customer of said enterprise for most of that time, my information has had to be entered into some meaningful fraction of those systems in order to handle me as a customer. As a result, it's likely that some have me as Paul Michaud, some as P. Michaud, some as Paul K Michaud, and some even as K Michaud. Believe me, it happens…a lot. In addition, they probably have many addresses for me across the different systems, some reflecting current addresses and some that are stale. So the question is this: how does the firm analyze its business relationship with Paul Michaud when it can't even guarantee that all of these different versions of Paul Michaud are the same person?
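To make the matching problem concrete, here is a minimal sketch in Python of how one might score whether two of those name variants refer to the same customer. The records, the normalization rules, and the 0.6 threshold are illustrative assumptions of mine, not how any particular client actually does entity resolution.

```python
# A minimal sketch of the name-matching problem, using only Python's
# standard library. The records, the 0.6 threshold, and the normalization
# rules are illustrative assumptions, not a production approach.
from difflib import SequenceMatcher

# The same customer as he might appear in four different systems.
records = ["Paul Michaud", "P. Michaud", "Paul K Michaud", "K Michaud"]

def normalize(name: str) -> str:
    """Lowercase and strip punctuation so trivial differences don't dominate."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def similarity(a: str, b: str) -> float:
    """Crude string similarity between two normalized names (0.0 to 1.0)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Compare every record against one "golden" spelling and flag likely matches.
golden = "Paul Michaud"
for rec in records:
    score = similarity(golden, rec)
    print(f"{rec!r:20} score={score:.2f} likely_same={score >= 0.6}")
```

Even this toy version shows the core difficulty: the firm is reduced to guessing, with some confidence score, that these records describe one person, because nothing at the point of capture guaranteed it.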
Now consider a corporation as a client. Let's use IBM as an example. If you are, say, a large bank, you may deal with IBM and its subsidiaries in many capacities. IBM may be a client, a trading counterparty, a supplier, a customer, etc. To make it even more interesting, you may also deal with some of the subsidiaries directly in their own right, and you might have dealt with a firm in the past that IBM has since purchased (say, Cognos). The opportunities for data error here are virtually boundless. Is IBM in some systems as IBM, or International Business Machines, or even as Intl. Bus Mach? Does the system even know that IBM bought Cognos, and when you try to determine all the business you do with IBM, does that Cognos business show up in the analysis? What is likely is that some of the systems in your enterprise don't even have a concept of a corporate hierarchy in their data structures, so it is completely impossible for them to understand that there is a relationship between Cognos and IBM.
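As a rough illustration of why that hierarchy matters, the sketch below rolls business recorded against Cognos and various spellings of IBM up to a single parent entity. The alias table, the parent map, and the exposure figures are made-up assumptions on my part; real master-data and legal-entity systems are far more involved.

```python
# A minimal sketch of rolling subsidiary business up to a parent entity.
# The alias table, parent_of map, and exposure figures are all illustrative
# assumptions, not real reference data.
aliases = {
    "International Business Machines": "IBM",
    "Intl. Bus Mach": "IBM",
}
parent_of = {
    "Cognos": "IBM",  # acquired subsidiary
}

def canonical(name: str) -> str:
    """Map the various spellings of an entity back to one canonical name."""
    return aliases.get(name, name)

def ultimate_parent(name: str) -> str:
    """Walk up the ownership chain until an entity has no recorded parent."""
    name = canonical(name)
    while name in parent_of:
        name = parent_of[name]
    return name

# Business recorded in three systems, each using its own spelling or entity.
exposures = {"IBM": 5_000_000, "Intl. Bus Mach": 250_000, "Cognos": 750_000}
totals: dict[str, int] = {}
for name, amount in exposures.items():
    parent = ultimate_parent(name)
    totals[parent] = totals.get(parent, 0) + amount

print(totals)  # {'IBM': 6000000} -- only visible if the hierarchy is modelled
```

If a system has no alias table and no parent map, the Cognos business simply never shows up in the IBM total, and nobody downstream knows it is missing.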
Now some of you are probably thinking this is an exaggeration, but let me tell you this: I had a client a few years ago who had to have the same data replicated across about 450 applications. They would try to synchronize this data in real time during the day, but would also do a major replication each night using Extract, Transform and Load (ETL) processes. This probably seems pretty standard. What may surprise you, though, is that this firm had to employ over 200 full-time staff whose sole job was to fix the data errors that occurred every night in that ETL process. These errors result from a few sources:
- Human error in how the data is entered
- Differences in how each application stores the data in its internal data model (no two systems are likely to record, say, a customer in exactly the same way, and the need to translate between them results in either errors or a loss of fidelity in the data; see the sketch after this list)
- Errors in the ETL processes themselves
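To illustrate the second source in that list, here is a hedged sketch of how translating a customer record between two applications with different internal models can silently lose fidelity. The field names, the two schemas, and the record itself are hypothetical, not taken from any client system.

```python
# A minimal sketch of the second error source above: two applications store
# the "same" customer differently, and the nightly translation between them
# silently drops information. Field names and records are illustrative.

# System A keeps separate name parts and a structured address.
system_a_record = {
    "first_name": "Paul",
    "middle_initial": "K",
    "last_name": "Michaud",
    "street": "1 Main St",
    "city": "New York",
}

def to_system_b(rec: dict) -> dict:
    """System B only has a single free-text name and a one-line address."""
    return {
        "name": f"{rec['first_name']} {rec['last_name']}",   # middle initial lost
        "address": f"{rec['street']}, {rec['city']}",         # structure flattened
    }

def back_to_system_a(rec: dict) -> dict:
    """Round-tripping cannot recover what was dropped on the way out."""
    first, last = rec["name"].split(" ", 1)
    street, city = rec["address"].split(", ", 1)
    return {"first_name": first, "middle_initial": "", "last_name": last,
            "street": street, "city": city}

round_tripped = back_to_system_a(to_system_b(system_a_record))
print(round_tripped == system_a_record)  # False -- fidelity lost in translation
```

Multiply that one-field loss across 450 applications replicating nightly and it becomes clear why a small army of people ends up reconciling the results by hand.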
The bottom line is that a large enterprise often has huge numbers of errors in its data, and if we don't correct this problem at the source, then all the fancy analysis tools in the world can only do so much. At their core, they all have to assume that the data they are run against is basically good data. Unfortunately, this is often not the case.
Watch for the second part of this post to be published in the next day or so.