Big Data: Practical Tips for ETL Testing
By Kaushal Amin, Chief Technology Officer for KMS Technology
Big Data represents all kinds of opportunities for all kinds of businesses, but collecting it, cleaning it up, and storing it can be a logistical nightmare. According to the latest IDC research the big data technology and services market is showing a compound annual growth rate of 27% and it will be worth $32.4 billion by 2017.
Data warehouses have a vital role to play in Big Data as companies collect information on their business operations, markets, and client behavior to identify patterns, and aggregate the results to identify business opportunities / operational improvements. As Gartner's Magic Quadrant for Data Warehouse Database Management Systems points out, "A data warehouse is simply a warehouse of data, not a specific class or type of technology."
That's one of the reasons that Data Warehouse testing is growing increasingly important for many organizations as they try to leverage the opportunities hidden in the data.
Data Warehouse "ETL" processing
The main component (brain and heart) of the data warehouse system is data Extraction, Transformation, and Loading (ETL). The challenge is to extract the data, often from a variety of disparate systems, to transform that data, so that it is uniform in terms of format and content, and finally to load the data into a warehouse where it can serve as the basis for future business intelligence needs.
It's very important that the integrity of the data is maintained. It must be stored clearly and concisely without loss, and it must be accessible for analysts. In some industries, such as healthcare, there will be strict governance rules and potentially severe consequences for error. If data warehouses are to deliver value, they require careful ETL testing to ensure that processes are working as intended.
Unique challenges of ETL testing
Testing the data warehouse and verifying that ETL processes are working correctly is very different to traditional application testing. Here are some of the main barriers to overcome:
- There is no user interface – In traditional application testing, testers can validate functionality via the input / output of data through the user interface. The user interface is expected to behave a certain way based on data input and actions taken. In data warehouse testing instead of a user interface that a tester can generally grasp quite quickly, you have data and its relationships. The key testing skills needed here are the ability to look at data, validate data processing rules, and analyze final data output. Knowledge of database query languages, such as SQL, is essential for testers to accurately do this where traditional manual testing skills are not enough.
- The volume of data is huge – We could be looking at millions of transactions per day. To verify that the extraction, transformation, and loading of that data working as designed, in real-time as code is updated.
- Data is from multiple sources – This is not a single system, it could be a long list of different systems that are feeding daily transactional data in the warehouse. Some of the data may even come from other systems utilized in cloud computing or hosted by 3rd party. The systems and their data won't necessarily agree on the format and content with each other. How do you collate that data and make sure everything gets processed consistently and connected with each other?
- Incomplete data and exceptions– The information collected by various source systems may not be complete and in many cases full of exceptions. You may find that some systems collect more data than others. You may find the data is incomplete or imperfect.
- Rules are not static – The source systems can change over time due to release upgrades or witching to another 3rdparty vendor's solution. You must be able to cope with these changes without having to redesigning data warehouse.
High level approaches for ETL testing
The entire process is about taking data from a source file, cleaning it up in a temporary staging area, and then pushing it into the data warehouse. It's vital to ensure that no data goes missing or gets incorrectly transformed. There are two high level approaches to ETL data validation that can be considered:
- Approach I: Validate the data from the data sources directly in the data warehouse. This approach validates that all the data in the source appears in the data warehouse according to the business rules. This approach does not validate the intermediate staging area and transformation processes in between the source and data warehouse.
- Approach II: Validate the data from the data sources through each step of the extract and transformation process including final load in the data warehouse. Validate the data at each transformation. For example, in stage one you would take the data from the file and verify it's all there; in stage two, you would remove any junk information that you don't want to process; in stage three, you would translate the data to clean it up and make it uniform, and so on.
The second approach, where testers are verifying each step in the process, is more time-consuming but it makes it much easier to track down any problems that occur. On the other hand, the first approach is less time-consuming, but makes it a lot more difficult to track down a bug when one is found.
Key areas of ETL testing
For the ETL process to be verified there's a lot to test. Here are some of the major considerations that should form a part of your planning, ask yourself these questions:
- Was all the expected data loaded from the source file? Bugs that result in dropped records are common. You must do a full inventory count and ensure that the right records made it into the right tables across the source landing area, staging area, and data warehouse. Were some records rejected because of missing data? Find the problems and work out why they occurred.
- Were all the records loaded into the right place? If the transaction order gets out of sync you can end up with all the records being processed, but some data gets over written with the older data or in the wrong place. It's important to verify that each data field from each source loads in the correct field / table and in the correct order.
- Do you have duplicate data? If you don't remove duplicate data then the analysis and aggregate reports could be skewed. Duplicates can happen due to the source system feeding duplicates or due to defects in the ETL processes. In either case, the duplicate check must happen as data moves across landing, staging, and finally into the data warehouse.
- Do the data transformation rules work? You may receive data with different date formats, or using different codes or descriptions for the same thing. The transformation process has to homogenize that data correctly, so that it is stored consistently. Most data warehouses have many such transformation rules which must all be verified with sample data that represents all the various permutations / possibilities.
- Have you lost any part of data? Maybe truncation has caused a problem because the source file was too big or changed the format. It is important to make sure data from source file did not get truncated as it gets processed from landing to staging to the data warehouse.
- Is data being corrected and rejected correctly? Sometimes the data may be incomplete or unclear. It may be necessary to hold this data aside and move on with rest of the data. Such data must be logged and rejected for further clarification before it can be re-processed.
- Is the system generating error logs? You need to know when data is not being processed and why. You also need a system that can gracefully handle something like a system crash or power outage without creating duplicates, losing records, or forcing the user to reprocess from the start (as much as possible).
- Can the system handle the data load? Performance might not be an issue at first, but you have to think ahead. How much data will the system have to handle 2 years from now? What happens when system already has 5+ yeas of data and new data has to be processed? It must be able to cope with the expected load, not just now, but in the future too.
There's a lot to consider for a smooth running ETL process in your data warehouse. Proper testing is a prerequisite to reaping the benefits and insights in the big data that's being collected. This article should give you an idea of the challenges that lie ahead and what needs to be tested.
About the Author
Kaushal Amin is Chief Technology Officer for KMS Technology, (www.kms-technology.com) a software development and IT services firm based in Atlanta, GA and Ho Chi Minh City, Vietnam. He was previously VP of Technology at LexisNexis and a software engineer at Intel and IBM. You may reach him at email@example.com.