Best Practices: ETL Development for Data Warehouse Projects
The tripod of technologies that are used to populate a data warehouse are (E)xtract, (T)ransform, and (L)oad, or ETL. Extract connects to a data source and withdraws data. Transforms might normalize a date format or concatenate first and last name fields. Load is the process of moving data to a destination data model.
Consider a data warehouse development project. Many tasks will need to be completed before a successful launch can be contemplated. Minutiae are important. Measured steps in the extraction of data from source systems, and in the transformation of that data, and in the loading of that data into the warehouse, are the subject of these best practices for ETL development.
The scope of the ETL development in a data warehouse project is an indicator of the complexity of the project. Software systems have not progressed to the point that ETL can simply occur by pointing to a drive, directory, or entire database. Claims that big data projects have no need for defined ETL processes are patently false.
ETL Best Practice #1: Know the Data Requirements
A data warehouse project is implemented to provide a base for analysis. There are a number of reports or visualizations that are defined during an initial requirements gathering phase. At some point, business analysts and data warehouse architects refine the data needs, and data sources are identified. Only then can ETL developers begin to implement a repeatable process.
ETL Best Practice #2: Data Cleaning and Master Data Management
What is the source of the data? Has it been approved by the data governance group? Does the data conform to the organization's master data management (MDM) and represent the authoritative source of truth? In organizations without governance and MDM, data cleansing becomes a noticeable effort in the ETL development.
ETL Best Practice #3: Map the Source [to Landing Zone] [to Staging] to Warehouse
It is not unusual to have dozens or hundreds of disparate data sources. The sources range from text files to direct database connection to machine-generated screen-scraping output. There are datatypes to consider, and security permissions to consider, and naming conventions to implement. The mapping must be managed in much the same way as source code changes are tracked. Some ETL tools have internal features for such a mapping requirement.
ETL Best Practice #4: Know the Dependencies
It is customary to load data in parallel, when possible. Even medium-sized data warehouses will have many gigabytes of data loaded every day. Yet, the data model will have dependencies on loading dimensions. ETL packages or jobs for some data will need to be completely loaded before other packages or jobs can begin.
ETL Best Practice #5: Size it up
Know the volume of expected data and growth rates and the time it will take to load the increasing volume of data. If the ETL processes are expected to run during a three hour window be certain that all processes can complete in that timeframe, now and in the future. Also, consider the archiving of incoming files, if those files cannot be reliably reproduced as point-in-time extracts from their source system, or are provided by outside parties and would not be available on a timely basis if needed. Terabytes of storage is inexpensive, both onsite and off, and a retention policy will need to be built into jobs, or jobs will need to be created to manage archives.
ETL Best Practice #6: Logging
Whether working with dozens or hundreds of feeds, capturing the count of incoming rows and the resulting count of rows to a landing zone or staging database is crucial to ensuring the expected data is being loaded. ETL tools have their own logging mechanisms. Enterprise scheduling systems have yet another set of tables for logging. Each serves a specific logging function, and it is not possible to override one for another, in most environments. A reporting system that draws upon multiple logging tables from related systems is a solution.
ETL Best Practice #7: Scheduling
Scheduling is often undertaken by a group outside of ETL development. Knowing the volume and dependencies will be critical in ensuring the infrastructure is able to perform the ETL processes reliably.
ETL Best Practice #8: Alerting
Alerts are often sent to technical managers, noting that a process has concluded successfully. With many processes, these types of alerts become noise. Alerting only when a fault has occurred is more acceptable. Or, sending an aggregated alert with status of multiple processes in a single message is often enabled. There is less noise, but these kinds of alerts are still not as effective as fault alerts.
ETL Best Practice #9: Restartability
Something unexpected will eventually happen in the midst of an ETL process. When dozens or hundreds of data sources are involved, there must be a way to determine the state of the ETL process at the time of the fault. The aforementioned logging is crucial in determining where in the flow a process stopped. Can the data be rolled back? Can the process be manually started from one or many or any of the ETL jobs?
ETL Best Practice #10: Documentation
Beyond the mapping documents, the non-functional requirements and inventory of jobs will need to be documented as text documents, spreadsheets, and workflows.
Minding these ten best practices for ETL projects will be valuable in creating a functional environment for data integration.
About the Author
Dave Leininger has been a Data Consultant for 30 years. In that time, he has discussed data issues with managers and executives in hundreds of corporations and consulting companies in 20 countries. Mr. Leininger has shared his insights on data warehouse, data conversion, and knowledge management projects with multi-national banks, government agencies, educational institutions and large manufacturing companies. Reach him at Fusion Alliance at dleininger@FusionAlliance.com.