2019 – The Year of the Data Lake Revamp

Data Lakes are touted as a key component of digital transformation. But with Gartner indicating that, conservatively, around 60% of big data projects fail, we as data experts are challenged to explain how our projects will be successful. Considering this, I would like to share the common pitfalls encountered on Data Lake projects and recommendations to alter the course of your current project if you’re failing to see ROI.

Pristine Data Lake rather than a Data Swamp

Has your speed to insights improved since establishing your Data Lake? Have there been notable impacts on business decisions that led to a quantifiable return on investment? Are you answering and solving the targeted business problems?

These are a few important, high-level questions to ask when determining the effectiveness of your initiative. If your Data Lake is not delivering on these, it might be time to determine whether your project is headed towards being a glorified swamp rather than a Lake.

To help pinpoint where potential improvements can be made, consider examining the following areas of your project.

Keep a living, breathing catalog of your data assets

A modern data catalog or metadata management system should be a non-negotiable component of your platform’s semantic layer. If it’s not currently factored into your project, I highly suggest putting this on your priority list for the following reasons:

  • Without a catalog, data management becomes unfeasible and attempts at working towards a Data-as-a-Service model will prove futile. Search, discover and subscribe capabilities are necessary to deliver on data exploration fronts for both data science-driven initiatives as well as ad hoc querying.
  • Monotonous cataloging tasks are increasingly streamlined through machine learning, which lowers the barriers to bringing data into the Data Lake and helps enforce Data Lake governance. Leveraging automation and machine learning is important to allow your solution to scale in step with your data.
  • A Data Catalog is a solid step in the right direction towards spreading tribal knowledge and building on an organization’s internal intelligence. Achieve compounding returns from a Data Lake investment by keeping account of how insights were discovered and the impact those findings had on business decisions.
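To make the search, discover and subscribe idea more concrete, here is a minimal in-memory sketch in Python. It is only an illustration under stated assumptions: the `CatalogEntry` and `DataCatalog` names, their fields and the zone labels are all hypothetical, and a real catalog product would add persistence, automated metadata harvesting and access control on top of something like this.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """One data asset in the lake. All field names here are hypothetical."""
    name: str
    zone: str                      # assumed zone labels, e.g. "raw", "discovery", "gold"
    description: str
    tags: Set[str] = field(default_factory=set)
    subscribers: Set[str] = field(default_factory=set)
    insights: List[str] = field(default_factory=list)  # notes on findings and their impact


class DataCatalog:
    """Toy catalog supporting register, search, subscribe and insight notes."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[str]:
        # Case-insensitive match on name, description, or an exact tag.
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or term in {t.lower() for t in e.tags}
        )

    def subscribe(self, name: str, user: str) -> None:
        self._entries[name].subscribers.add(user)

    def record_insight(self, name: str, note: str) -> None:
        # Keeps account of how an insight was found and what it influenced.
        self._entries[name].insights.append(note)

    def get(self, name: str) -> CatalogEntry:
        return self._entries[name]
```

A user could call `catalog.search("sales")` to discover assets, `catalog.subscribe(...)` to follow one, and `catalog.record_insight(...)` to capture the tribal knowledge described above.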

Ensure the discovery environment is operative and effective

Swamps are stale. Is your Data “Lake” also?

If your Data Lake’s adoption rates have been relatively low or people in the organization just aren’t discovering new insights, consider whether the platform has characteristics that are hindering new discoveries. Below are some common but less talked about pitfalls based on our experience:

  • Data scientists are not armed with a development environment that encourages rapid exploration and testing of models. There is no streamlined approach to data science workflows. Models are built and tested locally instead of in cloud-based, containerized environments.
  • Views of the data do not fully cover the types of user personas that exist within the Data Lake. Is the data in the lake being served up adequately for all the end user types that are expecting self-service exploration? Business users may not be interested in data at the raw level but rather want access to views encompassing gold or curated data.
  • Discovery tools for each persona are either not fulfilling their purpose or non-existent altogether. While data engineers and scientists might be content with accessing the Data Lake through a command line, business users and data analysts will require their own set of tools to explore all the Lake has to offer. Some of the more modern catalogs provide easy-to-use interfaces that enable discovery for higher-level users.
  • There is no clear path to retrieving and storing data. Moreover, when a user discovers intriguing data objects in the catalog, there is no clear path to obtaining access. Your platform should have a data governance strategy that dictates the route individuals can take in order to augment their analysis with further data. This includes data locked up in sensitive zones. Without a clear path, time to insights is delayed.

The Balancing Act of Governance

Is your data governance accurately balanced?

Data Lakes must strike a fine balance between governance restrictiveness and data sensitivity. A blanket data governance policy that doesn’t properly define rules for each zone within a Data Lake can lead to problems. For example, if data governance rules are too tight in the working/discovery zones for data scientists, exploration is hindered and fewer insights are found. Gold, curated or sensitive data zones within the Data Lake, on the other hand, should have tighter governance policies. Consider reviewing your governance policies so that the restrictiveness matches the use case, the sensitivity of the data and applicable data regulations (e.g. GDPR, HIPAA).
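One way to picture zone-specific governance is as a small policy table that is consulted on every access request. The Python sketch below is an assumption-laden illustration, not a prescribed design: the zone names, role names, `ZonePolicy` fields and the `access_decision` function are all hypothetical, and real rules would come from your own governance and compliance requirements.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class ZonePolicy:
    """Governance rules for one zone. Field names are hypothetical."""
    allowed_roles: FrozenSet[str]
    requires_approval: bool  # True means access goes through a request step


# Assumed zones and roles: looser rules for discovery, tighter for gold and sensitive.
POLICIES: Dict[str, ZonePolicy] = {
    "discovery": ZonePolicy(
        allowed_roles=frozenset({"data_scientist", "data_engineer"}),
        requires_approval=False,
    ),
    "gold": ZonePolicy(
        allowed_roles=frozenset(
            {"data_scientist", "data_engineer", "analyst", "business_user"}
        ),
        requires_approval=True,
    ),
    "sensitive": ZonePolicy(
        allowed_roles=frozenset({"data_steward"}),
        requires_approval=True,
    ),
}


def access_decision(zone: str, role: str, approved: bool = False) -> str:
    """Return "allow", "request_approval" or "deny" for a zone/role pair."""
    policy = POLICIES[zone]
    if role not in policy.allowed_roles:
        return "deny"
    if policy.requires_approval and not approved:
        # Gives the user a clear next step instead of a dead end.
        return "request_approval"
    return "allow"
```

Under these assumed rules, a data scientist gets straight into the discovery zone, while an analyst asking for gold data is routed to an approval step rather than simply being refused.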

As data in the Data Lake is explored and insight is found through some flavor of data processing or data science modelling, it becomes much more important to govern the data. Again, this includes cataloging how data was processed or modelled to derive insight and, additionally, its lineage with the intention of passing on that information to others to use or provide additive value.

Conclusion

Regardless of whether you call your data project a Lake or a Reservoir, a Hadoop cluster or an array of solutions, one thing is clear: if it’s not moving your business quickly towards customer, product or operational insights, it probably needs some level of revamping for 2019.

With any luck, the information presented in this post will help you identify some key items you may have overlooked in your project and steer you in the right direction. Of course, if you have any questions or comments, or you’re thinking about starting a Data Lake project, reach out to us directly.

Visit Data Optyks to explore business intelligence, advanced analytics, and data science initiatives.


Indellient is a Canadian Software Development Company that specializes in Data Analytics, DevOps Services, and Business Process Management.

About The Author

Hello, I’m Evan Pearce. I’m a Solutions Manager at Indellient. I work closely with our clients to develop solutions by bringing together business acumen, strong technical aptitude and novel methodologies. I help our clients connect the dots when it comes to integrating different pieces of a solution together. I love discovering new ways in which technology can solve challenging problems. Some of my favorite tools are SiSense, Datastage, SSIS, Redshift, Hive, S3, PureData, Periscope, PowerBI and Presto.