Today I discuss how to get more value out of IBM’s DataStage ETL (Extracting, Transforming, and Loading) Tool. This information can apply to any ETL tool as well, but in this article, I will be referring to DataStage. I will be even more specific: I will be discussing the benefits of incorporating a QA (Quality Assurance) regimen to enhance your ETL process.
The scope of this article does not cover the definition of DataStage or ETL. IBM has a wealth of information, including free Redbooks, on the subject.
DataStage Quality Assurance Processes
ETL Tools move data from Point A to Point B. However, we must ensure that the data meets a quality standard. “An ounce of prevention is worth a pound of cure”, as Data Analyst Benjamin Franklin is quoted as saying.
Cleaning up inconsistent data can be costly and time-consuming. Below are some options to manage it effectively:
Maintaining Data Consistency with DataStage
Data should be consistent and complete, with primary keys defined. If we are unable to correct at the source (not able to change the source data), we must take measures to ensure integrity during the ingestion.
DataStage – Cleanse and/or Remap Data
- Use the ‘Transform’ stage to re-map columns from source to target.
- Example 1: Data types are different between source and target columns.
- Example 2: Transform Null values to Empty Strings.
- Use ‘Lookup’ tables and/or SQL statements to filter out inconsistent data.
- Use the ‘Rejected Records’ method to flag bad/missing data.
DataStage – Prevent Incomplete/Inconsistent Data from being Ingested
- Load incoming data into intermediary ‘Staging’ tables and use the ‘Difference’ stage to compare the data with the source.
- Create DataStage jobs that query the source data and ensure the quality of the data before the ingestion jobs are run. These preliminary jobs can act as a ‘Gatekeeper’ and can give the ingestion jobs the “green light”, or conversely, stop them from transferring bad data.
Volume/Number of Records – Tracking ETL Activity with DataStage
Ensure that the target Database can handle the ingestion of data. It is also a good practise to confirm the context of the data with the specifications, so that we are only moving relevant and contextual data. This decreases job duration, and at the same time, increases our quality of data.
Use DataStage to Record the Metrics of the ETL Process
Track all ETL activity, either in a log, or in a Database table. At the very least, you will want to record the following:
- Pre-ingestion count of records in the target table
- Post-ingestion count of records
- Net inserted records (The actual number of records inserted – Post minus Pre).
An absence of records for a certain table, might indicate that source data was truncated, or archived.
- Duration of job execution. This is very valuable information. Most ETL jobs run during non-peak hours to minimize performance impact to the database.
If a job started during non-peak hours and is still running during peak hours, we need to be alerted. From there, we can investigate.
In addition, duration statistics can determine scheduling times. Also, we can report to team members an estimated time that their target data will be available.
- UNIX has a ‘df’ command that you can execute on the database server to get a summary of used disk space. You could write the output to a log, or to a table, and trigger a notification when the number exceeds a threshold.
QA Related DataStage Processes
- Data cleansing and re-mapping using Transform Stage
- Comparing incoming data against Lookup tables stage (Ex: Country Codes)
- Use Rejected Records pattern and log results
- Load to intermediary tables first before committing to the final target
- Compare key fields in intermediary tables (source to target) using Difference Stage
- Log important ETL activities either to log or QA-Stats Table (duration, records inserted)
- Incorporate Gatekeeper decision jobs that can half the ETL Process if source data quality fails
The examples I have given you just scratch the surface. There are many other Quality Assurance processes that can be implemented. There are many advantages of using DataStage or any other ETL tool to do both the ingestion and at least part of the Quality Assurance tasks.
The best part? It is one-stop shopping, as they say. All the operations can be encapsulated in the same collection of jobs. The connections are already established, and the message delivery method is incorporated as well through either Email or Slack messaging. In addition, ETL tools have built-in scheduling and decision structures.
DataStage does not take the place of a well-thought-out Quality Assurance process. However, it can be the tool used to implement it. It can certainly assist you in ensuring that Data Integrity remains intact in your DataStage projects.
Custom Cloud Applications
Indellient takes a customer-first approach to help you build a modern cloud strategy on Amazon Web Services, Windows Azure and Google Cloud Platform. Our team can help you build, replatform, migrate and integrate applications, so you can benefit from the scalability, agility, and performance available through cloud technologies.