sia.hackernoon.com

I started my career as a first-generation analyst, writing SQL scripts, learning R, and publishing dashboards. As things progressed, I moved into Data Science and Data Engineering, where my focus shifted to managing the life-cycle of ML models and data pipelines. 2022 is my 16th year in the data industry, and I am still learning new ways to be productive and impactful. Today, I head a data science and data engineering function at one of the unicorns, and I would like to share my findings and where I am heading next.

When I look at the big picture, I realize that the problems most companies face are quite similar. Their vision of being data-driven has turned into a BHAG (Big Hairy Audacious Goal, pronounced “bee hag”).

We data folks like patterns, so here are my findings:

In 5 out of 10 review meetings, I have witnessed people question the reliability of the data, report, or dashboard. Additionally, heads of department (HODs) try to convince others that their data is the most accurate or reliable :)
Often, a HOD reports that the data is not updated, and the data team is already working to fix the report or data table.
A new product was launched the week before, but we have yet to figure out its performance. The data team is working on a query change and will soon update the CXO team.
Everyone has built expertise in writing complicated machine learning (ML) models, but very few talk about or deploy inference monitoring. Without efficient monitoring, there is a high probability that model drift or performance drift goes unnoticed in the coming weeks or months.
Very few companies deploy solutions or models to detect performance anomalies.

The list is long, and I am sure you can relate to it or add more.

In a nutshell, I found that data reliability is a BIG challenge and there is a need for a solution that is easy to use, understand, and deploy, and also not heavy on investment.

I am Jatin Solanki and I am on a mission to build and develop a solution to make your data reliable.

What is needed to make your data more reliable?

Complexities around data infrastructure are surging as companies gear up to gain a competitive edge and launch out-of-the-box offerings.

Every company goes through a data maturity matrix. In order to reach a level where you deploy AI models or self-service models, you need to invest in a robust foundation.

In my opinion, the foundation begins with a reliable data source, or a defined source of truth. Your data models won’t be impactful if they ingest bad data. You know the saying: garbage in, garbage out.

On a high level, here are a few checks you can implement to ensure data reliability:

Volume: Ensures all the rows/events are captured or ingested.
Freshness: Recency of the data. If your data gets updated every xx mins, this test ensures it has been updated and raises an incident if not.
Schema Change: If there is a schema change or a new feature has been launched, your data team needs to be aware so it can update the scripts.
Distribution: All the events are in an acceptable range, e.g. if a critical column shouldn’t contain null values, this test raises an alert for any null or missing values.
Lineage: This is a must-have module, yet it is often underplayed. Lineage gives the data team handy information about upstream and downstream dependencies.
Reconciliation: Recon means finding deltas between two given datasets. This could be used to understand the difference between staging and production, or between source and destination. It could also be effective for financial recon, such as reconciling payment gateway records against the sales table.
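To make these checks concrete, here is a minimal sketch in plain Python. The function names, the row format (a list of dictionaries), and the thresholds are all assumptions made for illustration; a real setup would run equivalent queries against your warehouse. Lineage is left out because it depends on metadata from your pipeline tooling.

```python
from datetime import datetime, timedelta, timezone


def check_volume(row_count, expected_min):
    """Volume: pass when at least the expected number of rows arrived."""
    return row_count >= expected_min


def check_freshness(last_updated, max_age, now=None):
    """Freshness: pass when the latest update is no older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age


def check_schema(actual_columns, expected_columns):
    """Schema change: report columns that were added or removed."""
    actual, expected = set(actual_columns), set(expected_columns)
    return {
        "added": sorted(actual - expected),
        "removed": sorted(expected - actual),
    }


def check_nulls(rows, column):
    """Distribution: count rows where a critical column is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def reconcile(source_rows, target_rows, key):
    """Reconciliation: report keys present in only one of the two datasets."""
    source_keys = {row[key] for row in source_rows}
    target_keys = {row[key] for row in target_rows}
    return {
        "missing_in_target": sorted(source_keys - target_keys),
        "missing_in_source": sorted(target_keys - source_keys),
    }
```

Each function returns a plain value, so a scheduler can run them after every load and raise an incident whenever a check fails or returns a non-empty delta.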

What next? How do we implement this?

The most common question people face:

Build versus Buy

I am a big fan of open-source tech. However, for some critical modules, I prefer buying an out-of-the-box solution because it’s scalable and already tested in the market. Developing in-house might cost you around US$2k per month, which includes a few hours of an engineer’s time along with cloud costs.

If you are inclined toward buying an out-of-the-box solution, here are a few factors that should be part of your checklist.

Should connect to popular sources with minimal config.
Extract information automatically without the need for additional code.
No-code or CLI (I leave it to you)
Lineage and Catalog module.
Data Reconciliation along with scheduling feature.
Anomaly detection
Of course, all the tests we discussed earlier, along with alerts that tell you where to debug.
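As a rough illustration of the anomaly detection item, the sketch below flags a metric (for example, a table's daily row count) when it deviates strongly from its recent history. It uses a simple z-score rule, which is my own choice for illustration and not necessarily what any vendor implements; the function name and the default threshold of 3 standard deviations are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Flag `latest` when it lies more than `threshold` sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate the spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold
```

For instance, with a history of around 1,000 rows per day, a day with only 400 rows would be flagged, while a day with 1,005 rows would not. Seasonal data usually needs a smarter baseline than a plain mean.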

A robust platform provides easy access to all the incidents and also evaluates the data health.

It should be able to automatically detect your critical data assets and apply hygiene checks.

It should group related alerts instead of pushing 100+ individual alerts.
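Alert grouping can be sketched in a few lines. The example below assumes each alert is a dictionary with hypothetical `table`, `check`, and `column` fields, and collapses them into one summary per table and check type, so the on-call person sees a handful of notifications instead of one per column.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse individual alerts into one summary per (table, check) pair."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["table"], alert["check"])].append(alert["column"])
    return [
        {
            "table": table,
            "check": check,
            "columns": sorted(columns),
            "count": len(columns),
        }
        for (table, check), columns in sorted(grouped.items())
    ]
```

Grouping by table and check is only one possible key; grouping by upstream source (using lineage) is another option when a single broken job causes failures in many downstream tables.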

Finally, the solution should help you reduce data quality incidents and make your data more reliable.

So, do I need a data observability platform?

If your answer to any of the questions or scenarios below is “Yes”, then you should procure or deploy a data observability solution right away.

Are dashboards not getting updated on a regular basis?
Are you unsure which report is accurate?
Are business stakeholders the first to learn about data incidents?
Do questions about the performance stats come up during meetings?
Do you have at least 2 members in the data team?
Have you deployed a business intelligence tool?

Just as software developers rely on solutions such as DataDog and Dynatrace to ensure web/app uptime, data leaders should invest in data observability solutions to ensure data reliability.

Also published here.

Data Observability: The First Step Towards Being Data-Driven
