I started my career as a first-generation analyst, writing SQL scripts, learning R, and publishing dashboards. As things progressed, I moved into Data Science and Data Engineering, where my focus shifted to managing the life cycle of ML models and data pipelines. 2022 is my 16th year in the data industry and I am still learning new ways to be productive and impactful. Today, I head a data science & data engineering function at one of the unicorns, and I would like to share my findings and where I am heading next.

When I look at the big picture, I realize that the problems most companies face are quite similar. Their vision of being data-driven has turned into a BHAG (Big Hairy Audacious Goal), pronounced “bee hag”.

We data folks like patterns, so here are my findings:

  1. In 5 out of 10 review meetings, I have witnessed people question the reliability of the data, report, or dashboard. Additionally, heads of department (HODs) try to convince others that their data is the most accurate or reliable :)

  2. Quite often, an HOD comes and says that the data is not updated. The data team is already working to fix the report or data table.

  3. A new product launched the week before, yet we still have not figured out its performance. The data team is working on a query change and will soon update the CXO team.

  4. Everyone has built expertise in writing complicated ML (machine learning) models, but very few talk about or deploy inference monitoring. Without efficient monitoring, there is a high probability of model drift or performance drift in the coming weeks or months.

  5. Very few companies deploy solutions or models to detect performance anomalies.

The list is long; I am sure you can relate or add more to this.
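To make findings 4 and 5 concrete, here is a minimal sketch of performance anomaly detection on a daily metric, using a trailing-window z-score. The function name `find_anomalies`, the window size, the threshold, and the sample numbers are all assumptions for illustration; a production setup would likely need to handle seasonality and trends as well.

```python
from statistics import mean, stdev


def find_anomalies(values, window=7, threshold=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` points by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is treated as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical daily order counts with a sudden drop on day 8.
daily_orders = [100, 102, 98, 101, 99, 103, 100, 101, 40, 100]
print(find_anomalies(daily_orders))  # [8]
```

The same idea applies to model monitoring: track a daily score such as prediction accuracy or mean predicted value, and flag the days that drift away from recent history.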

In a nutshell, I found that data reliability is a BIG challenge and there is a need for a solution that is easy to use, understand, and deploy, and also not heavy on investment.

I am Jatin Solanki and I am on a mission to build and develop a solution to make your data reliable.

What is needed to make your data more reliable?

Complexities around data infrastructure are surging as companies gear up to gain a competitive edge and deliver out-of-the-box offerings.

Every company goes through a data maturity matrix. In order to reach a level where you deploy AI models or self-service models, you need to invest in a robust foundation.

In my opinion, the foundation begins with a reliable data source, or a defined source of truth. Your data models won’t be impactful if they are fed bad data. As the saying goes: garbage in, garbage out.

On a high level, here are a few checks you can implement to ensure data reliability:
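As a minimal sketch, three checks that commonly come up are freshness (was the table loaded recently?), volume (did roughly the expected number of rows arrive?), and null rate (is a key column mostly populated?). The function names, thresholds, and sample rows below are hypothetical and meant only to illustrate the idea.

```python
from datetime import datetime, timedelta


def check_freshness(last_loaded_at, now, max_age=timedelta(hours=24)):
    """Pass if the table was loaded within `max_age` of `now`."""
    return now - last_loaded_at <= max_age


def check_volume(row_count, expected_rows, tolerance=0.5):
    """Pass if the row count is within `tolerance` (as a fraction) of the expected count."""
    return abs(row_count - expected_rows) <= tolerance * expected_rows


def check_null_rate(rows, column, max_null_rate=0.05):
    """Pass if the share of missing values in `column` is at most `max_null_rate`."""
    if not rows:
        return False  # An empty table is treated as a failure.
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows) <= max_null_rate


# Hypothetical inputs: a table loaded 10 hours ago with 4 rows, one missing amount.
now = datetime(2022, 6, 2, 9, 0)
last_loaded_at = datetime(2022, 6, 1, 23, 0)
rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": 7.5},
    {"order_id": 4, "amount": 3.0},
]

results = {
    "freshness": check_freshness(last_loaded_at, now),
    "volume": check_volume(len(rows), expected_rows=1000),
    "null_rate": check_null_rate(rows, "amount"),
}
print(results)
```

In this example the freshness check passes, while the volume and null-rate checks fail, which is the kind of signal you want before a stakeholder opens the dashboard.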

What next? How do we implement this?

The most common question people face:

Build versus Buy

I am a big fan of open-source tech; however, for some critical modules, I prefer buying an out-of-the-box solution because it is scalable and already tested in the market. Developing in-house might cost you around US$2k per month, which includes a few hours of an engineer’s time along with cloud costs.

If you are inclined toward buying an out-of-the-box solution, here are a few factors that should be part of your checklist.

A robust platform provides easy access to all the incidents and also evaluates the data health.

It should be able to automatically detect your critical data assets and apply hygiene checks.

It should group related alerts instead of pushing 100+ individual alerts.

Finally, the solution should help you reduce data quality incidents and make your data more reliable.
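The alert-grouping point in the checklist can be sketched as follows. This assumes each alert is a plain dictionary with hypothetical `table`, `check`, and `column` fields, and rolls alerts that share a table and check type into a single incident; a real platform would likely also use lineage and timing to decide what belongs together.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Collapse alerts that share a table and check type into one incident each."""
    grouped = defaultdict(list)
    for alert in alerts:
        grouped[(alert["table"], alert["check"])].append(alert["column"])
    return [
        {"table": table, "check": check, "columns": sorted(columns), "count": len(columns)}
        for (table, check), columns in sorted(grouped.items())
    ]


# Hypothetical raw alerts: four alerts collapse into three incidents.
alerts = [
    {"table": "orders", "check": "null_rate", "column": "amount"},
    {"table": "orders", "check": "null_rate", "column": "currency"},
    {"table": "orders", "check": "freshness", "column": "*"},
    {"table": "users", "check": "null_rate", "column": "email"},
]
for incident in group_alerts(alerts):
    print(incident)
```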

So, do I need a data observability platform?

If your answer to any of the questions or scenarios below is “Yes”, then you should procure or deploy a data observability solution right away.

  1. Are your dashboards not getting updated on a regular basis?
  2. Are you unsure which report is accurate?
  3. Are business stakeholders the first to learn about data incidents?
  4. Do questions about the performance stats come up during meetings?
  5. Do you have at least 2 members in the data team?
  6. Have you deployed a business intelligence tool?

Just as software developers rely on solutions such as Datadog and Dynatrace to ensure web/app uptime, data leaders should invest in data observability solutions to ensure data reliability.
