IBM Cloud Pak for Data includes a new version of DataStage
I recently attended an IBM-hosted webinar on hybrid cloud integration. The presenters also discussed the release of the next generation of DataStage. DataStage (DS) is IBM’s widely used ETL (Extract, Transform, and Load) tool. It has been rewritten to fit into IBM’s cloud environment as part of IBM Cloud Pak for Data.
DS now operates as a SaaS (Software as a Service) application. This is exciting news for companies that use ETL, and especially for ETL developers. Let me tell you about some of the features that were discussed.
A Single Point to Govern and Automate Data
DataStage is now available in a hybrid cloud environment, running natively on the Red Hat OpenShift platform and optimized for Kubernetes. This also means that all the benefits of the IBM Cloud environment are available. The Designer (previously a Windows-only client application) is now web-based. This will be welcome news for Mac users, who previously had to run DataStage in a virtualized Windows environment.
30% Performance Increase
According to the presenters, improved dynamic workload balancing in the containerized environment has resulted in a 30% increase in performance. The new parallel engine was cited as contributing to this gain as well. Another benefit of running DS in containers is that it conserves compute resources. To put this another way, when DS is ‘idling’, those resources can be used by other microservices.
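To make the 30% figure concrete, here is a minimal sketch of the arithmetic. It assumes the figure refers to throughput (rows processed per unit of time), which the webinar did not spell out. The 60-minute baseline job and the function name are hypothetical, chosen only for illustration.

```python
def runtime_after_speedup(baseline_minutes: float, throughput_gain: float) -> float:
    """Estimate runtime after a throughput gain (0.30 means 30% more throughput).

    Assumption: the job is throughput-bound, so runtime scales as 1 / (1 + gain).
    """
    if throughput_gain <= -1.0:
        raise ValueError("throughput_gain must be greater than -1.0")
    return baseline_minutes / (1.0 + throughput_gain)


# Hypothetical job that took 60 minutes on the legacy engine.
baseline = 60.0
new_runtime = runtime_after_speedup(baseline, 0.30)
saved_pct = (1.0 - new_runtime / baseline) * 100.0

print(f"New runtime: {new_runtime:.1f} minutes ({saved_pct:.1f}% less time)")
# prints "New runtime: 46.2 minutes (23.1% less time)"
```

Under this assumption, a 30% gain in throughput shortens the run by roughly 23%, not 30%, which is worth keeping in mind when estimating batch windows.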
New DataStage Features
Let us go through some of the new features of DS.
- Job logs are now simple text files that can be searched and indexed.
- Integration with Watson Knowledge Catalog and IBM Watson Studio.
- Integration with Data Virtualization.
- Automated failure resolution, plus automated backup and recovery.
- Updates, service packs, and version upgrades can be applied with a single click.
- Source control through GitHub, to either publish or release to production.
- A new licensing model for the thin (web) client, which is expected to result in significant cost savings.
- AI-assisted tools such as Smart Palette and Suggested Stages have been added to reduce hand-coding.
- DataStage Flow Designer is now a thin client.
What about my previously created DataStage Jobs?
Existing DataStage jobs (from the previous client/server version) can be imported into the new version. No need to rewrite jobs! All of the previous stages and connector types have been carried over as well.
Jobs can be run as either Apache Spark jobs or parallel jobs.
There are two versions of DS:
- DataStage Enterprise
- DataStage Enterprise Plus
Both versions allow an unlimited number of users.
I started writing this article a few months ago. At that time, I had not yet had the opportunity to use the new DataStage version. On paper, though, the features and improvements are impressive.
The major features, lower cost, integration with IBM’s cloud environment, and containerization add up to a compelling reason to move to the new version. Now that I have had a chance to develop and execute jobs in Cloud Pak for Data, I can say that, in my experience, it does process jobs faster than the legacy client/server version.
If you are looking for a cloud-integrated ETL product, I recommend you investigate Cloud Pak for Data.
Here are a few documents from IBM that you might find of interest:
- Workload balancing on DataStage – IBM Cloud Pak for Data, by Scott Brokaw and Yi Yang Ren
- DataStage on IBM Cloud Pak for Data