I recently headed to Chicago to attend the Data Architecture Summit held by Dataversity. The event covered a broad range of popular tracks including AI, Big Data, Data Engineering & Science, Metadata Management, and Data Modelling through general sessions and in-depth tutorials. There was also a great sense of community, leading to multiple chances to share ideas and past experiences with colleagues on challenges data architects face in today’s rapidly evolving landscape of solutions and tools.
I want to focus on metadata management: why it matters, and one option that is applicable to data projects.
One of the pervasive conversations throughout the conference was the critical role metadata management tools play within the analytics ecosystem. The rapid adoption of Data Lakes, along with siloed data, has heightened questions around how data can be used and where it can be found.
Data catalogs are integral to delivering at velocity because they turn tribal knowledge into information that is accessible across an organization. This means more time spent on data analysis and less time spent on finding and re-(re-re-re)-integrating. They are one of many valuable tools that promote DataOps practices and self-service analytics.
There now exist neatly packaged metadata management solutions that encompass some of the more modern and exciting capabilities. Mainstream offerings, like Alation and Watson Knowledge Catalog, include integrated social collaboration, tagging automation through machine learning recommendations, and keyword extraction to accelerate categorizing unstructured content. Most of these features are ideal for metadata management, and a platform that includes them, plus end-to-end Data Lineage, can provide tremendous value, but it comes at quite a premium in today’s market.
A Cost-Effective and Flexible Approach for AWS
When I heard about a simple yet effective custom AWS Data Catalog solution, I thought it was worth sharing. It’s a prime example of a very search-oriented type of Data Catalog. Although the architecture is very simple, it can easily be extended over successive iterations. Here it is:
Essentially, the core components are:
- Web Front-End (Node.JS combined with a presentation layer is suitable)
In the above figure, files that will later be consumed for downstream data ingestion are placed into an inbound or processed S3 bucket. Rules applied to the bucket raise events that invoke Lambda functions, which collect metadata on the S3 files.
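As a rough illustration of that first step, here is a minimal sketch of a Lambda handler that pulls basic file metadata out of an S3 event notification. The output field names (`location`, `file_type`, and so on) are my own assumptions, not part of the solution as it was presented, and persisting the result is deliberately left out.

```python
import posixpath
from urllib.parse import unquote_plus


def extract_s3_metadata(event):
    """Return one metadata dict per object referenced in an S3 event notification."""
    records = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become "+").
        key = unquote_plus(s3["object"]["key"])
        # The file extension is a cheap first hint at content makeup.
        extension = posixpath.splitext(key)[1].lstrip(".").lower() or None
        records.append({
            "bucket": bucket,
            "key": key,
            "location": f"s3://{bucket}/{key}",  # assumed field name
            "size_bytes": s3["object"].get("size"),
            "etag": s3["object"].get("eTag"),
            "event_name": record.get("eventName"),
            "event_time": record.get("eventTime"),
            "file_type": extension,  # assumed field name
        })
    return records


def lambda_handler(event, context):
    # Entry point that Lambda would invoke; writing to a store is omitted here.
    return extract_s3_metadata(event)
```

The extraction logic lives in a plain function so it can be exercised locally with a hand-built event before it is wired to a bucket.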
This metadata, along with any data profiling (also collected through a Lambda process), is then written to a NoSQL database. Other important pieces of metadata can be picked up from the file, such as where it is located (to provide a copy to the end user if required), its content makeup, the data types of its fields, and so on. A NoSQL database is well suited to this job because the width of each metadata record can vary with the number of attributes worth recording for each file.
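To make the profiling idea concrete, here is a small sketch, assuming the incoming file is CSV text with a header row. It records a guessed type and a blank rate per column, then merges that profile with the file metadata into a single variable-width document. The function names and document layout are hypothetical, and the actual write to the NoSQL database is not shown.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-blank values: integer, float, or string."""
    non_blank = [v for v in values if v.strip()]
    if not non_blank:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in non_blank:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Return row count plus per-column type and blank rate for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = {}
    for column in reader.fieldnames or []:
        # Short rows yield None for missing fields; treat those as blank.
        values = [row[column] or "" for row in rows]
        blanks = sum(1 for v in values if not v.strip())
        columns[column] = {
            "type": infer_type(values),
            "blank_rate": blanks / len(values) if values else None,
        }
    return {"row_count": len(rows), "columns": columns}


def build_document(file_metadata, profile, **extra_attributes):
    """Merge file metadata, profile, and any extra attributes into one document."""
    return {**file_metadata, "profile": profile, **extra_attributes}
```

Because `build_document` accepts arbitrary extra attributes, two files can end up with documents of different widths, which is the flexibility the NoSQL store is there to absorb.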
Elasticsearch, in this case, acts as a search engine on top of the NoSQL database, allowing users to query metadata, files, or both together. Typically, end users wouldn’t interact with Elasticsearch directly. Rather, a web front end sends a query in the form of search parameters to Elasticsearch, which parses it and executes it against metadata or text within a Data Lake, Warehouse, or NoSQL documents.
It can empower users to answer questions such as:
- Which type of files in our Data Lake contain “Brand X” data?
- Which source contains “Product X” order volumes?
- How often is this data loaded?
- How often is “Attribute X” coming in as Blank?
- Which dashboards report on this data?
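A question like the first one above could be translated by the front end into an Elasticsearch query body along these lines. This is only a sketch: the index fields (`key`, `description`, `content`, `tags`, `file_type`) are assumed names for illustration, and sending the body to a cluster is left out.

```python
def build_search_query(text, tags=None, file_type=None, size=10):
    """Translate front-end search parameters into an Elasticsearch query body."""
    bool_query = {
        # Full-text match across several (assumed) text fields.
        "must": [{
            "multi_match": {
                "query": text,
                "fields": ["key", "description", "content"],
            }
        }]
    }
    filters = []
    if tags:
        # Exact-match filter: the document must carry at least one of the tags.
        filters.append({"terms": {"tags": list(tags)}})
    if file_type:
        filters.append({"term": {"file_type": file_type}})
    if filters:
        bool_query["filter"] = filters
    return {"size": size, "query": {"bool": bool_query}}
```

For example, `build_search_query("Brand X", file_type="csv")` produces a body that searches the text fields for “Brand X” and restricts results to CSV files.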
If working in a Data Warehouse environment, you could further improve this architecture by linking the transformed file back to the original, providing two locations: the raw and the transformed. Also, an interactive front end capable of writing back to the NoSQL DB would enhance the solution further by creating a closed-loop tagging system. There are a lot of other options to explore.
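Both extensions can be sketched with an in-memory dictionary standing in for the NoSQL table. The class and attribute names below are hypothetical; a real version would issue updates against the actual database instead.

```python
class CatalogStore:
    """In-memory stand-in for the NoSQL metadata table, keyed by file location."""

    def __init__(self):
        self.documents = {}

    def register(self, location, **attributes):
        """Create or overwrite the document for a file."""
        self.documents[location] = {"location": location, "tags": [], **attributes}

    def link_transformed(self, raw_location, transformed_location):
        """Record lineage in both directions between a raw file and its transformed copy."""
        self.documents[raw_location]["transformed_location"] = transformed_location
        self.documents[transformed_location]["raw_location"] = raw_location

    def add_tag(self, location, tag):
        """Write a user-supplied tag back to the document, ignoring duplicates."""
        tags = self.documents[location]["tags"]
        if tag not in tags:
            tags.append(tag)
        return tags
```

Storing the link on both documents means a search hit on either the raw or the transformed file can point the user to its counterpart.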
Why Start Building a Data Catalog Solution?
This solution is aimed at small to mid-sized data projects that don’t want to spend a small fortune to get started with an entry-level Catalog. So, if you want to have some search capabilities on your data, along with a scalable metadata repository that can support tagging and profiling, you’ll find this approach appealing.
We’ve already had discussions internally around how interesting this metadata solution is and how it might fit into previous data projects. If you are looking at metadata management, have thoughts around it, or are looking to navigate the solutions in the data space, feel free to reach out.