There’s more need for open data infrastructure for AI than ever. In this article, we’d like to share what we’ve learned, what has changed, what is broken, and why we decided to build CocoIndex (https://github.com/cocoindex-io/cocoindex) - a next-generation data pipeline for AI-native workloads, designed from the ground up as an open system that handles unstructured, multimodal, and dynamic data at scale.

From Data for Humans to Data for AI

Traditionally, data frameworks in this space were built to prepare data for humans. Over the years, we’ve seen massive progress in analytics-focused data infrastructure: platforms like Spark and Flink fundamentally changed how the world processes and transforms data at scale.

But with the rise of AI, entirely new needs — and new capabilities — have emerged. A new generation of data transformations is now required to support AI-native workloads.

So, what has changed?

New capabilities require new infrastructure.

These new patterns demand new capabilities from the data stack: working with unstructured and multimodal data, calling model and API backends as part of transformations, and keeping derived data continuously fresh as sources change.

On top of all this, we need to think about how such pipelines behave in production: scale, fault tolerance, rate limiting and backpressure, and keeping stale data out of the output.

Why patching existing data pipelines is not enough.

This is not something that can be fixed with small patches to existing data pipelines; traditional frameworks were not designed for these workloads and fall short in several key ways.

With so many limitations, developers start to handle AI-native data “natively”: hand-written Python wrapped in orchestration. It begins as a demo. Then the real concerns arrive: scaling, tolerating backend failures, picking up from where a pipeline left off when it breaks, rate limiting and backpressure, building manual integrations whenever data freshness is needed, and making sure stale input data is purged from the output. All of these are hard to handle once the pipeline runs at scale, and things start to break.

There are so many things to “fix” that patching existing systems no longer works. A new way of thinking, from the ground up, is needed.

So what are some of the design choices for CocoIndex?

Declarative Data Pipeline

Users declare the “what”, and we take care of the “how”. Once the data flow is declared, you have a production-ready pipeline: infrastructure setup (such as creating target tables with the right schema and handling schema upgrades), data processing and refresh, fault tolerance, rate limiting and backpressure, and batching requests to backends for efficiency.
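As a minimal sketch in the spirit of the CocoIndex quickstart (exact module, function, and parameter names may differ between versions, so treat them as illustrative), a flow that chunks local markdown files, embeds each chunk, and exports the results to Postgres can be declared roughly like this:

```python
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Declare the source: a directory of markdown files (path is illustrative).
    data_scope["documents"] = flow_builder.add_source(
        cocoindex.sources.LocalFile(path="markdown_files"))

    doc_embeddings = data_scope.add_collector()

    # Declare per-document transformations: split into chunks, embed each chunk.
    with data_scope["documents"].row() as doc:
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)
        with doc["chunks"].row() as chunk:
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))
            doc_embeddings.collect(
                filename=doc["filename"], location=chunk["location"],
                text=chunk["text"], embedding=chunk["embedding"])

    # Declare the target; the table schema is inferred from the collected fields.
    doc_embeddings.export(
        "doc_embeddings", cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"])
```

Everything else (creating and migrating the target table, keeping it up to date as files change, retries, batching) is the runtime's job, not yours.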

Persistent Computing Model

Most traditional data pipelines treat processing as a transient job: it terminates once all the data has been processed, and if anything changes (data or code), you process everything again from scratch. We treat pipelines as long-lived, with memory of their existing state, and perform only the reprocessing that data or code changes actually require, so the output continuously reflects the latest input data and code.

This programming model is essential for AI-native data pipelines. It unlocks out-of-the-box incremental processing, where the output is continuously updated with minimal computation as the input data and code change, and it provides the ability to trace data lineage for explainable AI.
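To make “only perform necessary reprocessing” concrete, here is a deliberately simplified concept sketch (an illustration of the idea, not CocoIndex’s actual engine): the pipeline remembers a fingerprint of every source item together with the version of the logic that processed it, recomputes only the items whose fingerprint or logic version changed, and emits deletions for items that disappeared from the source.

```python
import hashlib

def fingerprint(content: bytes) -> str:
    # Stable content hash used to detect whether a source item changed.
    return hashlib.sha256(content).hexdigest()

def incremental_update(sources: dict[str, bytes], state: dict[str, str],
                       logic_version: str, process):
    """Concept sketch only: recompute items whose content or logic changed,
    and signal deletions for items that no longer exist in the source."""
    outputs = {}
    for source_id, content in sources.items():
        stamp = f"{fingerprint(content)}@{logic_version}"
        if state.get(source_id) != stamp:
            outputs[source_id] = process(content)  # reprocess only what changed
            state[source_id] = stamp
    for stale_id in set(state) - set(sources):
        del state[stale_id]
        outputs[stale_id] = None  # stale output should be purged downstream
    return outputs
```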

Clear Data Ownership With Fine-Grained Lineage

The output of a pipeline is data derived from source data via transformation logic, so each row the pipeline creates can be traced back to the specific source rows or files, plus the specific pieces of logic, that produced it. This is essential for refreshing the output when data or logic is updated, and it also makes the output data explainable.
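As a hypothetical illustration (the field names here are made up for the example, not an actual CocoIndex schema), fine-grained lineage means every derived row carries a reference to the source item and the logic version that produced it, which is exactly the information a targeted refresh or purge needs:

```python
from dataclasses import dataclass

@dataclass
class DerivedRow:
    # Illustrative lineage fields, not a real CocoIndex schema.
    source_file: str      # which source item this row came from
    source_location: str  # where within the source (e.g. chunk offsets)
    logic_version: str    # which version of the transformation produced it
    payload: dict         # the derived data itself (e.g. text + embedding)

def rows_to_refresh(rows: list[DerivedRow], changed_files: set[str],
                    current_logic: str) -> list[DerivedRow]:
    """Select exactly the derived rows affected by a source or logic change."""
    return [r for r in rows
            if r.source_file in changed_files or r.logic_version != current_logic]
```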

Strong Type Safety

The schema of the data created by each processing step is determined and validated at pipeline declaration time, before the pipeline runs on any specific data. This catches issues earlier and enables automatic inference of the output data schema, so the target infrastructure can be set up automatically.
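For example, the schema of a custom step can come directly from its type signature. The sketch below follows the custom-function pattern described in CocoIndex’s documentation, but the decorator and type-mapping details should be treated as illustrative rather than authoritative:

```python
import dataclasses
import cocoindex

@dataclasses.dataclass
class PersonMention:
    name: str
    role: str

@cocoindex.op.function()
def extract_people(markdown: str) -> list[PersonMention]:
    """Toy extractor. The return annotation tells the framework the output
    schema (rows with `name` and `role` string fields) before any data flows."""
    people = []
    for line in markdown.splitlines():
        if line.startswith("- "):
            name, _, role = line[2:].partition(":")
            people.append(PersonMention(name=name.strip(), role=role.strip()))
    return people
```

Because the return type is known when the flow is declared, a target table with `name` and `role` columns can be created before the first row is processed, and a type mismatch surfaces at declaration time rather than halfway through a run.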

Open Ecosystem

CocoIndex is an open system that lets developers plug in their choice of ecosystem components as building blocks. AI agents should be tailored to specific domains, and each domain brings different technology choices, from sources to storage to domain-specific processing. The data stack has to be open, easily customizable, and able to bring in the user’s own building blocks.

With the rapid growth of the ecosystem (new sources, targets, data formats, transformation building blocks, and more), the system shouldn’t be bound to any specific one. Instead of waiting for a specific connector to be built, anyone should be able to create their own data pipelines by assembling flexible building blocks that work directly with internal APIs or external systems.
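For instance, a team could wrap an in-house service as a building block and use it in a flow like any built-in function. A hedged sketch (the decorator follows CocoIndex’s custom-function pattern; the endpoint and names are hypothetical):

```python
import requests
import cocoindex

@cocoindex.op.function()
def classify_with_internal_api(text: str) -> str:
    """Hypothetical building block: call an in-house classification service
    instead of waiting for an official connector to exist."""
    resp = requests.post(
        "https://ml.internal.example.com/classify",  # hypothetical endpoint
        json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()["label"]
```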

It needs to stay open.

We believe this AI-native data stack must be open.

The space is moving too fast for closed systems to keep up. Open infrastructure is something everyone can contribute to, learn from, and build upon.

Build with the ecosystem.

CocoIndex fits seamlessly into the broader data ecosystem, working well with orchestration frameworks, agentic systems, and analytical pipelines. As open, AI-native data infrastructure, it aims to help power the next generation of applied AI.

The Road Ahead

We are just getting started. AI is notoriously bad at writing data infrastructure code. CocoIndex’s abstractions, data feedback loop, and programming model are deliberately designed from the ground up to work well with AI copilots.

This puts CocoIndex on the path toward self-driving data pipelines, with data that stays auditable and controllable along the way.

🚀 To the future of building!

Support us and join the journey.

Thank you, everyone, for your support and contributions to CocoIndex. Thank you so much for your suggestions, feedback, stars, and for sharing the love for CocoIndex.

We are especially grateful for our beloved community and users. Your passion, continuous feedback, and collaboration as we got started have been invaluable, helping us iterate and improve the project every step of the way.

Looking forward to building the future of data for AI together!

⭐ Star CocoIndex on GitHub here to help us grow!