Saturday, October 25, 2008

What Data Errors You May Find When Building a Data Warehouse

You may have seen publications telling you that you will spend the majority of your data warehouse development time building the means for both the initial and recurring extraction, transformation, and loading of data. What I have not seen, though, is much in-depth discussion of exactly what errors in that dirty data you will spend your time cleaning up. Forewarned is forearmed. If you know that certain errors may exist, you will be more likely to spot them and to plan your project to attack them in a manageable way. Perhaps the material in this paper can help you formulate a checklist of errors to check for. What follows is a list of common errors. Also, if you are a relational database expert, bear with my imprecise use of some terminology. Finally, note that when I refer to a data warehouse, I mean the database that is directly fed with data from the source systems - not the data marts (or whatever you want to call them) that are fed with cleansed data.
The categories of "errors"

I place "errors" into four categories. The word errors is in quotation marks because some errors are not, in the metaphysical sense, erroneous. So, with some awkwardness, let me suggest that errors involve data that are:

- Incomplete
- Incorrect
- Incomprehensible
- Inconsistent

Incomplete errors

These consist of:

Missing records
This means a record that should be in a source system is not there. Usually this is caused by a programmer who diddled with a file and did not clean up completely. (I read a white paper about how users have to "fess up" about bad data. Actually, system personnel usually cause MUCH more headaches than users do.) Note that you may not spot this type of error unless you have another system or old reports to tie to.
Missing fields
These are fields that should be populated but are not. Often there is a mistaken belief that the source system requires entry of a field.
Records or fields that, by design, are not being recorded
That is, by intelligent or careless design, data you want to store in the data warehouse are not being recorded anywhere. I further divide this situation into three categories. First, there may be dimension table attributes you will want to record but which are not in any system feeding the data warehouse. For example, the marketing user may have a personal classification scheme for products indicating the degree to which items are being promoted. Second, if you are feeding in the same type of data from multiple systems, you may find that one of the source systems does not record a field your user wants to store in the data warehouse. Third, there may be "transactions" you need to store in the data warehouse that are not recorded in an explicit manner. For example, updating the source system may not necessarily cause the recording of a transaction. Or, sometimes adjustments to source system data are made downstream from the source system. Off-invoice adjustments made in general ledger systems are a big offender. In this case you may find that the grain of the information to be stored in the warehouse is lost in the downstream system.

Incorrect errors

You can say that again! That is, the data really are incorrect.

Wrong (but sometimes right) codes
This usually occurs when an old transaction processing system is assigning a code that the transaction processing system users do not care about. Now if the code is not valid, you are going to catch it. The "gotcha" comes when the code is wrong but it is still a valid code. For example, you may have to extract data from an ancient repair parts ordering system that was programmed in 1968 to assign a product code of 100 to all transactions. Now, however, product code 100 stands for something other than repair parts.
Wrong calculations, aggregations
This situation refers to when you decide to or have to load data that have already been calculated or aggregated outside the data warehouse environment. You will have to make a judgment call on whether to check the data. You may find it necessary to bring data into the warehouse environment solely to allow you to check the calculation.
Duplicate records
There usually are two situations to be dealt with. First, there are duplicate records within one system whose data are feeding the warehouse. Second, there is information that is duplicated in multiple systems that feed in the same type of information. For example, maybe you are feeding in data from an order entry system for products and an order entry system for services. Unbeknownst to you, your branch in West Wauwatosa is booking services in both the product and service order entry systems. (The possibility of a situation like this may sound crazy until you encounter the quirks of real-world systems.) In both cases, note that you may miss the duplicates if you feed already aggregated data into the warehouse.
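As a minimal sketch of the first situation, a duplicate check within one feed can be as simple as counting records per key. The field names ("order_no", "branch", "amount") are invented for illustration; your keys will differ.

```python
def find_duplicates(records, key_fields):
    """Return a map of key -> records for keys that appear more than once."""
    seen = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        seen.setdefault(key, []).append(rec)
    return {k: v for k, v in seen.items() if len(v) > 1}

orders = [
    {"order_no": "A100", "branch": "West Wauwatosa", "amount": 250.0},
    {"order_no": "A100", "branch": "West Wauwatosa", "amount": 250.0},  # duplicate
    {"order_no": "A101", "branch": "East Mequon", "amount": 75.0},
]
dupes = find_duplicates(orders, ["order_no", "branch"])
```

Note that a check like this only works on detail records; as mentioned above, once the data are aggregated, the duplicates disappear into the totals.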
Wrong information entered into source system
Sometimes a source system contains data that were simply incorrectly entered into the system. For instance, someone may have keypunched 6/9/96 as 9/6/96. Now the obvious action is to correct the source system. However, sometimes, for various reasons, the source system cannot be corrected. Note that if you have many errors in a source system that cannot be corrected, you have a much larger issue in that you do not really have a reliable "system of record".
Incorrect pairing of codes
This is best described by an example. Sometimes there are supposed to be rules that state that if a part number suffix is XXX, then the category code should be either A, B, or C. In more technical terms, there is a non-arithmetic relationship between attributes whose rules have been broken.
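A rule like this can be expressed as a small lookup table and checked during extraction. This is only a sketch; the suffix and category values come from the hypothetical example above, not from any real system.

```python
# Hypothetical pairing rules: part number suffix -> allowed category codes.
PAIRING_RULES = {"XXX": {"A", "B", "C"}}

def invalid_pairings(records):
    """Return records whose suffix/category pairing breaks a known rule."""
    bad = []
    for rec in records:
        allowed = PAIRING_RULES.get(rec["suffix"])
        if allowed is not None and rec["category"] not in allowed:
            bad.append(rec)
    return bad

parts = [
    {"part": "1234-XXX", "suffix": "XXX", "category": "B"},
    {"part": "5678-XXX", "suffix": "XXX", "category": "Q"},  # breaks the rule
]
```

Calling `invalid_pairings(parts)` flags the second record for investigation.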

Incomprehensibility errors

These are the types of conditions that make source data difficult to read.

Multiple fields within one field
This is the situation where a source system has one field which contains information that the data warehouse will carry in multiple fields. By far the most common occurrence of this problem is when a whole name, e.g., "Joe E. Brown", is kept in one field in the source system and it is necessary to parse this into three fields in the warehouse.
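A first-cut parser for the simple "First Middle Last" pattern might look like the sketch below. Real name parsing needs many more cases (suffixes, titles, multi-word surnames), so anything that does not fit the pattern should be flagged for review rather than guessed at.

```python
def split_name(whole_name):
    """Split a whole-name field into first/middle/last; simple cases only."""
    parts = whole_name.split()
    if len(parts) == 3:
        return {"first": parts[0], "middle": parts[1], "last": parts[2]}
    if len(parts) == 2:
        return {"first": parts[0], "middle": "", "last": parts[1]}
    # Anything else goes through unparsed, to be flagged for manual review.
    return {"first": whole_name, "middle": "", "last": ""}

split_name("Joe E. Brown")  # {'first': 'Joe', 'middle': 'E.', 'last': 'Brown'}
```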
Weird formatting to conserve disk space
This occurs when the programmer of the source system resorted to some out of the ordinary scheme to save disk space. In addition to singular fields being formatted strangely, the programmer may also have instituted a record layout that varies.
Unknown codes
Many times you can figure out what 99% of the codes mean. However, you will usually find a handful of records with unknown codes, and usually these records contain huge or minuscule dollar amounts and are several years old.
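Isolating that handful is straightforward: compare each record's code against the known lookup table and set aside everything else for research. The codes below are invented for illustration.

```python
# Hypothetical lookup of the codes you have been able to decipher.
KNOWN_CODES = {"100": "Repair parts", "200": "Finished goods"}

def unknown_code_records(records):
    """Return records whose code is not in the known lookup table."""
    return [r for r in records if r["code"] not in KNOWN_CODES]

history = [
    {"code": "100", "amount": 12.50},
    {"code": "Q7", "amount": 1_000_000.00},  # old, unknown, and huge
]
```

`unknown_code_records(history)` returns the mystery records so they can be researched rather than silently loaded.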
Spreadsheets and word processing files
Often in order to perform the initial load of a data warehouse it is necessary to extract critical data being held in spreadsheet files and/or "merge list" files. However, often anything goes in these files. They may contain a semblance of a structure with data that are half validated.
Many-to-many relationships and hierarchical files that allow multiple parents
Watch out for this architecture in source systems. It is easy to incorrectly transfer data organized in such manner.

Inconsistency errors

The category of inconsistency errors encompasses the widest range of problems. Obviously similar data from different systems can easily be inconsistent. However, data within one system can be inconsistent across locations, reporting units, and time.

Inconsistent use of different codes
Much of the data warehousing literature gives the example of one system that uses "M" and "F" and another system that uses "1" and "2" to distinguish gender. May I suggest that you should wish this were the toughest data cleaning problem you will face.
Inconsistent meaning of a code
This is usually an issue when the definition of an organizational entity changes over time. For example, say in 1995 you have customers A, B, C, and D. In 1996, customer A buys customer B. In 1997, customer A buys customer C. In 1998, customer A sells off part of what was A and C to customer D. When you build your warehouse in 1999, depending on the type of business analysis you perform, you may face the dilemma of how to identify the sales to customers A, B, C, and D in previous years.
Overlapping codes
This is a situation where one source system records, say, all its sales to Customer A with three customer numbers and another source system records its sales to customer A with two different customer numbers. Now, the obvious solution is to use one customer number here. The problem is that there is usually some good business reason why there are five customer numbers.
Different codes with the same meaning
For example, some records may indicate a color of violet and some may indicate a color of purple. The data warehouse users may want to see these as one color. More annoyingly, sometimes spaces and other extraneous information have been inconsistently embedded in codes.
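A sketch of the cleanup: strip out embedded spaces, normalize case, and then collapse synonyms through a lookup table. The synonym table here is the violet/purple example only; in practice the business users have to supply and bless it.

```python
# Hypothetical synonym table agreed with the business users.
COLOR_SYNONYMS = {"VIOLET": "PURPLE"}

def standardize_code(raw):
    """Strip embedded spaces, normalize case, and collapse synonyms."""
    code = "".join(raw.split()).upper()    # remove embedded spaces
    return COLOR_SYNONYMS.get(code, code)  # map synonyms to one value

standardize_code(" vio let ")  # -> "PURPLE"
standardize_code("purple")     # -> "PURPLE"
```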
Inconsistent names and addresses
Strictly speaking this is a case of different codes with the same meaning. My unscientific impression of this type of problem is that decent knowledge of string searching will allow you to make name and address information 80% consistent with relative ease. Going for 90% consistency requires a huge jump in the level of effort. Going for 95% consistency requires another huge incremental jump. As for 100% consistency in a database of substantial size, you may want to decide whether sending a person to Mars is easier.
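The "easy 80%" is roughly what simple string similarity buys you. The sketch below uses Python's standard library matcher with an arbitrary cutoff; the customer names are invented, and the borderline cases that score near the cutoff are exactly where the manual effort explodes.

```python
import difflib

def likely_same(name_a, name_b, cutoff=0.8):
    """Crude name matching: case-insensitive similarity above a cutoff."""
    ratio = difflib.SequenceMatcher(
        None, name_a.upper(), name_b.upper()).ratio()
    return ratio >= cutoff

likely_same("Acme Mfg. Co.", "ACME MFG CO")  # True: easy case
likely_same("Acme Mfg. Co.", "Zenith Corp")  # False: clearly different
```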
Inconsistent business rules
This, for the most part, is a fancy way of saying that calculated numbers are calculated differently. Normally you will probably avoid loading calculated numbers into the warehouse, but sometimes it must be done. As noted before, you may have to feed data into the warehouse solely to check calculations. This can also mean that a non-arithmetic relationship between two fields (e.g., if a part number suffix is XXX, then the category code should be either A, B, or C) is not consistently followed.
Inconsistent aggregating
Strictly speaking this is a case of inconsistent business rules. In a nutshell, this refers to when you need to compare multiple sets of aggregated data and the data are aggregated differently in the source systems. I believe the most common instance of this type of problem is where data are aggregated by customer.
Inconsistent grain of the most atomic information
Certain times you need to compare multiple sets of information that are not available at the same grain. For example, customer and product profitability systems compare sales and expenses by product and customer. Often sales are recorded by product and customer but expenses are recorded by account and profit center. The problem occurs when there is not necessarily a relation between the customer or product grain of the sales data and the account - profit center grain of the expense data.
Inconsistent timing
Strictly speaking this is a case of inconsistent grain of the most atomic information. This problem especially comes into play when you buy data. For example, if you work for a pickle company you might want to analyze purchased scanner data for grocery store sales of gherkins. Perhaps you purchase weekly numbers. When someone comes up with the idea to produce a monthly report that incorporates monthly expense data from internal systems, you'll find that you are, well, in a pickle.
Inconsistent use of an attribute
For example, an order entry system may have a field labeled shipping instructions. You may find that this field contains the name of the customer purchasing agent, the e-mail address of the customer, etc. A more difficult situation is when different business policies are used to populate a field. For example, perhaps you have a fact table with ledger account numbers. You may find that entity A uses account '1000' for administrative expenses while entity B uses '1500' for administrative expenses. (This problem gets more interesting if entity A uses '1500' and entity B uses '1000' for something other than administrative expenses.)
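For the account number case, one common remedy is a per-entity account map that translates each entity's local account into a standard warehouse account. This is a sketch; the entity names and account values are the hypothetical ones from the example above.

```python
# Hypothetical per-entity map: (entity, local account) -> standard account.
ACCOUNT_MAP = {
    ("entity_a", "1000"): "ADMIN_EXPENSE",
    ("entity_b", "1500"): "ADMIN_EXPENSE",
}

def standard_account(entity, account):
    """Translate a local account to the warehouse standard, if known."""
    return ACCOUNT_MAP.get((entity, account), "UNMAPPED")
```

Keying the map on the entity as well as the account is what keeps the "more interesting" variant (entity A's '1500' meaning something else entirely) from contaminating the result.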
Inconsistent date cut-offs
Strictly speaking this is a case of inconsistent use of an attribute. This is when you are merging data from two systems that follow different policies as to dating transactions. As you can imagine, the issue comes up most with dating sales and sales returns.
Inconsistent use of nulls, spaces, empty values, etc.
Now this is not the hardest problem to correct in a warehouse. It is easy, though, to forget about this until it is discovered at the worst possible time.
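The fix is usually to collapse all the flavors of "no value" into a single sentinel before loading. The list of empty-ish tokens below is illustrative; each source system tends to have its own.

```python
# Hypothetical tokens that various sources use to mean "no value".
EMPTYISH = {"", "NULL", "N/A", "NONE", "?"}

def normalize_empty(value):
    """Collapse nulls, blanks, and empty-value tokens into None."""
    if value is None:
        return None
    if str(value).strip().upper() in EMPTYISH:
        return None
    return value

[normalize_empty(v) for v in ["  ", "null", "N/A", "blue"]]
# -> [None, None, None, 'blue']
```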
Lack of referential integrity
It is surprising how many source systems have been built without this basic check.
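The check itself is simple to retrofit during extraction: flag every fact row whose foreign key has no matching dimension row. The field names below are invented for illustration.

```python
def orphan_rows(fact_rows, dim_keys, fk_field):
    """Return fact rows whose foreign key has no matching dimension row."""
    return [r for r in fact_rows if r[fk_field] not in dim_keys]

products = {"P100", "P200"}  # keys present in the product dimension
sales = [
    {"sale_id": 1, "product": "P100", "amount": 10.0},
    {"sale_id": 2, "product": "P999", "amount": 20.0},  # orphan
]
orphan_rows(sales, products, "product")  # -> the P999 row
```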
Out of synch fact data
Certain summary information may be derived independently from data in different fact tables. For example, a total sales number may be derived from adding up either transactions in a ledger debit/credit fact table or transactions in a sales invoice fact table. Obviously there may be differences because one table is updated later than another table. Often, however, the differences are symptoms of deeper problems.
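A basic tie-out is to derive the same total both ways and compare within a tolerance; anything outside the tolerance is either a timing difference or one of those deeper problems. The numbers below are made up to show the mechanics.

```python
ledger_lines = [120.0, -20.0, 400.0]   # from the debit/credit fact table
invoice_lines = [100.0, 400.0, 0.05]   # from the sales invoice fact table

def out_of_sync(total_a, total_b, tolerance=0.01):
    """True if two independently derived totals disagree beyond a tolerance."""
    return abs(total_a - total_b) > tolerance

out_of_sync(sum(ledger_lines), sum(invoice_lines))  # True: 500.00 vs 500.05
```

A nonzero tolerance absorbs rounding; the real judgment call is deciding when a recurring small difference is a symptom worth chasing.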

Some ending thoughts

I hope this paper adds to the understanding of what takes up the majority of time in a data warehouse project. Let me offer the following ending thoughts:

Be prepared for a lot of tedious work.
Probably the most important "tools" for solving these problems are a sharp eye and endurance for checking an abundance of detail information.
You may spend much more time checking for errors than cleaning up errors.
Most of these errors do not jump out at you.
The errors of inconsistency are the most difficult to handle.
At least that is my experience.
The complexity of a data warehouse increases geometrically with the number of sources of data fed into it.
Having to reconcile inconsistent systems is the reason. For example, if it takes 100 hours to reconcile data from two source systems, you can expect that it will take on the order of 400, not 200, hours to reconcile data from four source systems.
The complexity of a data warehouse increases geometrically with the span of time of data to be fed into it.
My previous comment applies. Note, however, that reconciling inconsistencies over time may be even harder because the people who know what happened in previous years may not be around to answer your questions.
You will be faced with an economic and political question as to how erroneous the data in your system will be.
Completely fixing some of these problems can be quite expensive. More vexingly, often what constitutes "correct" data is debatable. What you do, more often than not, boils down to a question of money and politics.
