
Data Quality for a Modern Data Warehouse


Building a data warehouse from two, ten, or dozens of data sources is common. Less common is the use of data quality services (DQS) for cleansing and matching source data via business rules. Data quality as a process has been implemented in a number of technologies over time. Frequently used approaches include SQL CASE statements, Excel document lookups, and hard-coded references that grow and grow, becoming unmanageable and introducing brittleness into a data warehouse.

Cleansing

Data cleansing is the process of correcting, amending, enhancing, and standardizing source data as it is transmitted to a target warehouse. Third-party companies offer cleansing services for addresses, standardizing route designations such as “Boulevard” to “Blvd” or “Rd.” to “Road” according to the business rules in force.

Internal data cleansing is also valuable. When building a data warehouse from many LOB applications, it is not unusual to see regions or divisions stated differently among the sources. For instance, “Northwest” and “Midwest” might be the proper nomenclature for regions that are shown as “NW” or “MW” in applications.
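As a minimal sketch of this kind of rule-driven cleansing (the table and column names here are hypothetical), a reference table can map each source spelling to its approved value, replacing the ever-growing CASE statements mentioned earlier:

-- Hypothetical reference table of approved region names.
CREATE TABLE RegionCleansing (
    SourceValue   VARCHAR(50) NOT NULL PRIMARY KEY,  -- value as it appears in a LOB application
    StandardValue VARCHAR(50) NOT NULL               -- approved master data value
);

INSERT INTO RegionCleansing (SourceValue, StandardValue)
VALUES ('NW', 'Northwest'),
       ('MW', 'Midwest');

-- Cleanse staged rows on their way to the warehouse;
-- unmatched values fall through unchanged for later review.
SELECT s.RecordId,
       COALESCE(r.StandardValue, s.Region) AS Region
FROM StagingSales AS s
LEFT JOIN RegionCleansing AS r
    ON r.SourceValue = s.Region;

Keeping the mapping as data rather than code means data stewards can add a new source spelling without a code deployment.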

Additionally, there is often a need to build composite data cleansing processes. Third-party address vendors can easily return an AddressLine string as the concatenation of “Address” + ‘, ‘ + “City” + ‘, ‘ + “2-character State Code” + ‘ ‘ + “5-digit US ZipCode”. A similar operation on internal data could provide a string representation of, say, Division, Region, and LOB, yielding “Fuels – Northwest – Oregon” or another master data representation.
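Continuing with hypothetical staging columns, both composites can be derived in a single pass during the load (CONCAT is available in most modern database engines):

-- Build composite cleansed values from already-standardized columns.
SELECT CONCAT(Address, ', ', City, ', ', StateCode, ' ', ZipCode5) AS AddressLine,
       CONCAT(Division, ' - ', Region, ' - ', LOB) AS MasterDataLabel
FROM StagingAccounts;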

When processing source data through a cleansing process, varying levels of certainty can be assigned that allow results to flow to result tables or files for approved or rejected values. Over time, the “voice” of the source data may be heard as approved if certainty is 90% or higher, and rejected if certainty is 70% or lower. Rows falling between 70% and 90% become reviewable rows. It is common to store counts, metadata, and even the suspect data itself for reporting and analysis. That analysis often leads upstream to anomalies in the applications that feed the source systems.
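A sketch of this routing, with illustrative thresholds and hypothetical table names:

-- Route each cleansed row by its certainty score.
SELECT RecordId,
       CleansedValue,
       Certainty,
       CASE
           WHEN Certainty >= 0.90 THEN 'Approved'
           WHEN Certainty <= 0.70 THEN 'Rejected'
           ELSE 'Review'   -- between 70% and 90%: held for data steward review
       END AS DqStatus
FROM CleansingResults;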

Matching

Matching is another data quality initiative. Matching references a knowledge base of business rules to merge, associate, or link related data elements within or among datasets.

A scenario for matching is the convenience store at your favorite fuel station. The end caps and counter displays of chips, drinks, and energy supplements are purchased space, not random product displays based on staff selections. How that space should be allocated, and at what fee, is a recurring analysis for corporate offices.

A business rule might consider proximity to an interstate, the number of other c-stores at the intersection, and the percentage of inside purchases relative to fuel purchases. A knowledge base would contain these attributes and include them in a composite domain.

SpaceAllocation:

Proximity_OtherC-Stores_PctInsideSales

Rules for finding proper matches would follow. A store that many people enter, with few other local options, located at an interstate exit, earns a HighContact rating. Certain items score well for such a rating and are desirable items for end caps and counter space.

HighContact:
Proximity < 1 mile
Other C-stores <= 1
Pct Inside Sales > 70%

MediumContact:
Proximity < 1 mile
Other C-stores <= 3
Pct Inside Sales > 40%

LowContact:
Proximity > 1 mile
Other C-stores <= 3
Pct Inside Sales < 40%
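A sketch of how these tiers might be applied during the load, using hypothetical store attributes; a CASE expression evaluates the rules in order, so a store satisfying more than one tier takes the first (highest) match:

-- Assign a contact rating to each store (illustrative thresholds).
SELECT StoreId,
       CASE
           WHEN ProximityMiles < 1 AND OtherCStores <= 1 AND PctInsideSales > 0.70 THEN 'HighContact'
           WHEN ProximityMiles < 1 AND OtherCStores <= 3 AND PctInsideSales > 0.40 THEN 'MediumContact'
           WHEN ProximityMiles > 1 AND OtherCStores <= 3 AND PctInsideSales < 0.40 THEN 'LowContact'
           ELSE 'Unrated'   -- outside the defined tiers; route for steward review
       END AS ContactRating
FROM StoreAttributes;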

Business rules devised by subject matter experts, also known as data stewards, introduce a desirable level of flexibility to a data warehouse. Aligning incoming data from application sources with data matching capabilities provides the data analyst with results that already have a business focus applied.

Data quality services are a necessity in developing a modern data warehouse. Cleansing and matching produce reliable, authoritative data from which sound decisions can be made.

About Dave Leininger

Mr. Leininger has been a Data Consultant for 30 years. In that time, he has discussed data issues with managers and executives in hundreds of corporations and consulting companies in 20 countries. Mr. Leininger has shared his insights on data warehouse, data conversion, and knowledge management projects with multi-national banks, government agencies, educational institutions and large manufacturing companies. Reach him at Fusion Alliance at dleininger@FusionAlliance.com.
