Real Life Issues With Big Data In The Enterprise – The Issues With Data Completeness
Completeness can mean a lot of different things. In this case I am going to define a piece of data as complete if its description contains all of the available information about the item in question, and if that description is captured in a way that faithfully represents the object in a context-neutral manner. In my experience, incomplete data is the single biggest cause of data issues in the enterprise. In fact, if completeness were handled well, the issues listed in the first article of this series would be much less likely to occur.
So here is an example from our recent past, related to the global financial crisis. Consider a bank that gives Paul Michaud a mortgage, along with about 10,000 other people. The bank then bundles all of these mortgages together and sells them as a Mortgage Backed Security (MBS). An MBS is basically a bond whose interest and principal get paid by the people paying their mortgages. The MBS then gets sold to a bunch of other banks who hold it in their portfolios.
The problem here is that the banks that buy the MBS don't even know that Paul Michaud's mortgage is in the pool of 10,000 that is responsible for paying their bonds. Often, neither does the bank that sold the MBS based on the pool in the first place, but that's another issue. Anyhow, the bank that bought the bond wants to assess the risk in its MBS portfolio. To do this it needs to project cash flows from its investments under different risk scenarios, and to do that well it would like to model the behavior of the individuals who hold the mortgages underlying its MBS. The problem is that it has no idea who holds the mortgages, how much they earn, what their debt load is, etc. At best it has some general summary statistics about the pool of 10,000. Worse yet, even if it were given all of the detailed data, none of its systems would be able to store it correctly anyhow.
At the end of the day, the firm will run a valuation and risk assessment on the data using a supercomputing cluster of 10,000+ servers and generate hundreds to thousands of reports based on it. Unfortunately, all of this fancy analysis is undermined because the data was not complete or accurate at the time of capture. The analysis is at best an approximation from which the firm may draw incorrect conclusions, as we observed a few years ago.
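To make the point about summary statistics concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the `Loan` fields, the four-loan pool, the `STRESS_DTI_LIMIT` threshold and the rule that borrowers above it stop paying are all hypothetical, not how any real bank models an MBS. It simply shows how a pool average can hide a borrower that loan-level data would reveal.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Loan:
    monthly_payment: float
    debt_to_income: float  # borrower's debt payments divided by income


# Hypothetical stress rule: borrowers above this ratio stop paying.
STRESS_DTI_LIMIT = 0.45


def stressed_cash_flow_loan_level(loans):
    """Monthly cash flow when each borrower is evaluated individually."""
    return sum(
        loan.monthly_payment
        for loan in loans
        if loan.debt_to_income <= STRESS_DTI_LIMIT
    )


def stressed_cash_flow_from_summary(loan_count, avg_payment, avg_dti):
    """Monthly cash flow when only pool-level averages are available."""
    if avg_dti > STRESS_DTI_LIMIT:
        return 0.0
    return loan_count * avg_payment


# Made-up pool: three comfortable borrowers and one heavily indebted one.
pool = [
    Loan(1000.0, 0.20),
    Loan(1000.0, 0.30),
    Loan(1000.0, 0.35),
    Loan(1000.0, 0.60),
]

summary_estimate = stressed_cash_flow_from_summary(
    len(pool),
    mean(loan.monthly_payment for loan in pool),
    mean(loan.debt_to_income for loan in pool),
)
loan_level_estimate = stressed_cash_flow_loan_level(pool)

print(summary_estimate, loan_level_estimate)
```

In this toy pool the average debt-to-income ratio sits below the threshold, so the summary-based estimate assumes every borrower keeps paying, while the loan-level estimate drops the one borrower who cannot. The gap between the two numbers is exactly the information the buyer of the MBS never received.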
In fact it is my opinion (and I told this to people in the federal government and to other top advisors to the world's financial establishment) that this is one of the primary causes of the global financial crisis of a few years ago. While everyone was harping on the need to force these banks to disclose more and more information to the government and other regulators, it would have been a garbage in, garbage out (GIGO) exercise, because the banks don't fully know what they have or the risks they face, largely because of the limitations created by incomplete data. While there were definitely lots of issues inside the banks that contributed to the crisis, I genuinely believe the banks try their best to assess value and risk in their portfolios and that they disclose what they believe those values and risks to be to the government and regulators. The issue is that it's a truly hard problem to solve, and if they can't fix the quality of their data they will always be at risk of their internal analysis being wrong.
So the bottom line here is that you need your systems to be able to capture, store, maintain and retrieve a complete representation of all of your data. At the very least, the data model inside your systems should be capable of holding complete data even if you are unable to populate it with complete data at this time. Who knows, a year from now you may be able to fill in the missing values, but if the system wasn't designed to hold them, you are going to be in trouble.
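As a rough sketch of what "capable of holding complete data" could look like, here is a small Python record type. The class name `MortgageRecord`, its fields and the `missing_fields` helper are all hypothetical, chosen only to echo the mortgage example; a real system would need far more detail. The idea is that borrower-level attributes exist in the model from day one as optional fields, so they can be left empty now and filled in later without redesigning anything.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MortgageRecord:
    # Data we assume is available when the loan is captured.
    loan_id: str
    principal: float
    interest_rate: float
    # Data we may not have yet; the model still has a place for it.
    borrower_name: Optional[str] = None
    borrower_annual_income: Optional[float] = None
    borrower_total_debt: Optional[float] = None


def missing_fields(record):
    """Return the names of fields that have not been populated yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Captured today with only the loan-level basics.
record = MortgageRecord(loan_id="L-0001", principal=250000.0, interest_rate=0.045)
missing_at_capture = missing_fields(record)

# Later, one of the missing values becomes available and simply slots in.
record.borrower_annual_income = 85000.0
missing_after_update = missing_fields(record)

print(missing_at_capture)
print(missing_after_update)
```

Keeping the gaps explicit also lets you report on how incomplete the data is, rather than silently analyzing it as if it were whole.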
Remember, at the end of the day, virtually every computer system on the planet has as its primary responsibility the need to capture, store, maintain, retrieve and process data. So if it doesn't do that primary data job right, then what's the point of building it in the first place?
As always you can reach me through Twitter, LinkedIn, by using the contact links in the author box or here through the website.













