With the influx of data available to companies, it is easy to assume that this data automatically translates into usable information executives can rely on to gain a competitive edge. Although this is true in principle, it is not that simple. Getting to a state where information is accurate and accessible for your company requires a well-thought-out strategy.
Imagine the following frustrations:
- You are presented with two distinctly different views of the company's health from different reports. Is either view accurate?
- The company collects the information you need, but by the time you can access it in any meaningful way, your need has passed.
- Asking for a change in your analytics solution, such as updating a dashboard or adding some additional data, is a complex and frustrating process.
- Your reports only show surface-level information, raising questions you can't pursue further without costly IT involvement.
With all the money put into creating a data warehouse, executives expect accurate results delivered in a timely manner, without any of the frustrations above.
Let’s look at some of the common root causes and ways to mitigate issues.
Improving Analytics Data: Dealing with Bad Data
We've all heard the saying "garbage in, garbage out," and it holds true here as well. Inaccurate data in reports or dashboards is often the result of bad data residing in the data warehouse. There are a number of reasons a data warehouse can contain bad data:
- Data may simply be unavailable: delayed on its way into the data warehouse, or missing altogether.
- Spelling variations can break filters, such as "E-Commerce" vs. "eCommerce" or extra spaces at the end of words.
- Duplicate data inserted by accident, coming from multiple sources, or produced by incorrect table joins can distort or inflate numbers.
- Data is not maintained over time.
- Data is entered in incorrect fields.
- Corrupted data files are ingested.
- Files are manipulated between the data source and the ingestion process (especially if sending the file to the data warehouse is a manual step).
- Data is manipulated manually in the data warehouse.
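A couple of the issues above, spelling variations and duplicate rows, are easy to catch programmatically. Here is a minimal sketch in plain Python; the field names (`channel`, `order_id`) and the variant map are hypothetical examples, not a specific warehouse schema:

```python
# Map known spelling variants to one canonical form (hypothetical example).
CANONICAL = {"e-commerce": "eCommerce", "ecommerce": "eCommerce"}

def normalize(record):
    """Trim stray whitespace and collapse spelling variants so filters match."""
    channel = record["channel"].strip()
    record["channel"] = CANONICAL.get(channel.lower(), channel)
    return record

def deduplicate(records, key_fields=("order_id",)):
    """Drop duplicate rows (e.g. from a bad join) that would inflate totals."""
    seen, unique = set(), []
    for r in records:
        key = tuple(r[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

rows = [
    {"order_id": 1, "channel": "E-Commerce "},
    {"order_id": 1, "channel": "E-Commerce "},  # duplicate from a bad join
    {"order_id": 2, "channel": "eCommerce"},
]
clean = [normalize(r) for r in deduplicate(rows)]
```

In practice these checks usually run inside the ingestion pipeline rather than as ad hoc scripts, so bad values are caught once at the door instead of in every report.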
How to Reduce Bad Data
So how do you reduce the amount of bad data in the data warehouse? Start by setting up processes to streamline the ingestion of the data and make sure that data goes through validation checks. All of this ties into the creation of a data governance framework.
When creating a Data Governance Framework it’s important to be thorough. Your framework may include:
- Creating policies and standards
- Monitoring and controlling who has access to the data
- Defining how that data is used
- Securing the data
Having a proper framework in place prevents bad data from entering the warehouse rather than fixing it on a case-by-case basis. If you are using an ETL tool for data ingestion, some data validation/QA can be executed at that stage. Proper governance will save your business time and money and build healthy (and consistent) data habits across all your data teams.
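To make the validation/QA step concrete, here is a hedged sketch of a row-level check that could run during ingestion. The schema and rules are hypothetical and not tied to any particular ETL tool:

```python
# Hypothetical schema: expected field names and Python types for one table.
SCHEMA = {"order_id": int, "amount": float, "region": str}

def validate_row(row):
    """Return a list of problems; an empty list means the row passes QA."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in row or row[field] is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            problems.append(f"bad type for {field}: {type(row[field]).__name__}")
    return problems

good = {"order_id": 7, "amount": 19.99, "region": "EMEA"}
bad = {"order_id": "7", "amount": 19.99}  # wrong type, missing region
```

Rows that fail can be quarantined for review instead of silently loaded, which is the case-by-case cleanup a governance framework is meant to avoid.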
Creating Accurate Reports – Calculations
Now that you have good clean data, let's talk about how to get accurate reports. It's important to have a good data science team, since the output is only as good as the analyst working with and interpreting the data. It's easy to make mistakes like the following:
- Using an incorrect formula
- Performing operations in the wrong order
- Creating incorrect calculations due to a misunderstanding of the data
- Failing to normalize data, causing incorrect calculations
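A classic instance of misunderstanding the data is averaging per-group averages. This worked example (with hypothetical region figures) shows how the naive mean silently weights every region equally, no matter how many orders each actually had:

```python
# Hypothetical per-region summary data.
regions = [
    {"name": "North", "orders": 1000, "avg_order_value": 50.0},
    {"name": "South", "orders": 10, "avg_order_value": 500.0},
]

# Wrong: a simple mean of the averages treats both regions as equal in size.
naive = sum(r["avg_order_value"] for r in regions) / len(regions)

# Right: weight each region's average by its order count.
total_revenue = sum(r["orders"] * r["avg_order_value"] for r in regions)
total_orders = sum(r["orders"] for r in regions)
weighted = total_revenue / total_orders
```

The naive mean reports 275.0, while the true company-wide average order value is about 54.46, so the "same" metric can differ fivefold depending on how it is calculated.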
This is probably the most complicated yet the most important aspect of the entire process. Having a solid analytics team that understands the data, understands the business needs, and is able to create reports and dashboards that provide valuable actionable information is imperative.
Tools for More Accurate Reports
There are a number of tools that can help data analysts understand the data they are working with. Whether the analyst is an internal employee or an external consultant, the following will be helpful:
Data dictionaries can provide valuable information on the content of the data warehouse, the format in which the data is stored, and the relationships between elements. This is helpful both for IT teams and for users who are creating the reports and dashboards. Understanding the structure of the data and the database is crucial.
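As a sketch of what one data dictionary entry might capture, here is a hypothetical record expressed as structured data; real dictionaries usually live in catalog tools, but the fields (name, type, format, relationships) are the same:

```python
# Hypothetical data-dictionary entry; table and column names are examples only.
DATA_DICTIONARY = {
    "orders.order_date": {
        "type": "DATE",
        "format": "YYYY-MM-DD",
        "description": "Date the order was placed (UTC)",
        "related_to": ["shipments.order_id"],
    },
}

entry = DATA_DICTIONARY["orders.order_date"]
```

Keeping entries like this alongside the warehouse gives report builders an authoritative answer to "what does this column mean and how is it stored?" without a trip to IT.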
A business glossary helps provide end-users with information on what data is available in the data warehouse from a business standpoint. It establishes a common vocabulary, which is especially helpful when data comes from multiple sources.
More is not always better; just because you have the data doesn't mean it all needs to be displayed and shared. It's very possible to overwhelm a user with too much information. Only include actively maintained, good-quality data in data warehouses to avoid 'contaminating' good results. It helps to work with a data engineer and a business analyst to define what kind of information should be displayed and how it will best help move the company forward.
That being said, having more data means more analytics, more options, and deeper historical knowledge. You just need to make sure your infrastructure is set up to handle the volume of data, usage, and reports.