There was a point in time when it was common practice to manually manage and track versions of software development code across developers. It was meticulous, cumbersome, prone to error and frankly, for any developer, an unpleasant experience. Something had to be done.
The speed at which businesses evolve introduces significant challenges with releasing and managing code at an equivalent velocity.
While not perfect, many of these issues are overcome through version control systems. Most VCs enable the ability to rollback to previous changes, merge new features of a product, centralizing committed source code as well as invites the opportunity for continuous deployment.
Git is a popular option in this space due to the combination of its involvement with open source projects, branch-merge competency and distributed code control. Can the same benefits be applied to the data domain? Particularly with extract, transfer and load (ETL) routines, that are typically written in codified routines and undergo rapid changes to adapt to business requirements? The answer is YES!
ETL Version Control
An active analytics pipeline sees ETL development code undergo quite rapid changes throughout its lifecycle. This is analogous to the crown jewels of software development.
Recently, multiple ETL type vendors have supported this fact by releasing source control plugins to external tools or developing inhouse control systems. Tools such as Talend, DataStage and SSIS support this notion to varying degrees.
Recently, we engaged on a client project using the Microsoft Stack and leveraged SSIS with Github for source control. By way of tutorial, the next few sections will provide you with steps on how to get your SSIS environment setup with Github to enable source control on the ETL portion of any data project. These instructions will work for SSDT 2010 or later.
Installing Git Source Control Provider using Visual Studio
Step one: Select Tools | Extension Manager
Step two: Run Visual Studio
Step three: Go to Tools | Extension Manager; search the online gallery for “Git Source Control Provider” and install
Step four: Download Addin. Place it in the Add-ins folder under the Visual Studio user documents folder:
C:\Users\xxxxxxx\Documents\Visual Studio 2010
Step five: Restart Visual Studio
Configuring the Git Extensions
Step one: Download the Git Extension and Open it. It will be installed under C:\Program Files (x86)\GitExtensions unless otherwise specified
Step two: Go to Tools -> Settings and configure the checklist items as needed
Step three: Enter your name and email tied to the Github repository that will be leveraged for source control of the data project
Once configured you can clone the Repository
Step one: In Git Extension, click on Clone repository. Fill in the fields as shown below:
A Github window will open asking a user to log in (enter GitHub information to your repository).
Step two: After logging in, click the “Clone” button.
After the clone is successful, the project can be opened by Visual Studio by navigating to the locally cloned directory and double-clicking on the Visual Studio solution file.
Everything will now be set up to start source controlling your ETL into Git. Going forward you can use this menu drop down in Visual Studio to access different commands with git through the GUI:
We will refer to these in the next steps. There is also the option of issuing Git CMD instructions to perform the same actions, but we won’t be covering those in this tutorial. As for the Github environment, it is configured as having a Master branch, as our development branch and demo-stats is a feature branch which is then merged back into Master for commits.
Example of pulling from Master (Master = dev environment in this case)
Step One: Open local copy of solution file called *.sln
Step Two: Once open in SSDT go to GitExt > Pull
Step Three: Set Remote branch to the GitHub server version to merge changes down into the local copy
Example of merging development into Master (local development environment in this case)
Step One: Check out local branch: For example, Master
Step Two: Merge branch: dev/demo-stats (local) to Master
One Notable Drawback and Solution
Microsoft is still working out the kinks in merge conflicts with SSIS. There is one file in particular, related to the project metadata that requires manual merge resolving from time to time. This drawback applies to the merging of branches that was covered above.
If two developers are working on developing packages in SSDT in the same ETL project, merge conflicts will occur. They will reach a merge conflict for an overlapping file that is always present with SSIS development. The file affected is: *.dtproj . Therefore, a merge conflict on *.dtproj will have to be resolved manually.
If you’re merging code back you’ll need to ensure the conflict includes the package name of any additional package that has been created and merged into Master.
An example is a *.dtproj file package list that needs to include your package name.
Here is the example project metadata file with the quick solution:
For every package you’ve added to this list, you will also need to include it as an entry in the package metadata. Example below:
Setting up a proper source control in SSIS is well worth the time from a development code manageability perspective. Even with the multi-developer version controlling issues presented above, it provides a very quick and convenient way to roll back recent changes in ETL process. A benefit that cannot be understated when UAT or Production data issues occur that were unforeseen from a previous ETL promotion.
What’s been your experience (or lack thereof) with using version control on SSIS or other ETL/data projects? Send me an email or connect with me on LinkedIn to continue the conversation!
Visit Data Optyks to explore business intelligence, advanced analytics, and data science initiatives. You can always reach out to us directly with questions at firstname.lastname@example.org.