CocoIndex https://github.com/cocoindex-io/cocoindex is an ETL framework that helps you make your data ready for AI in real time.

The essential ingredient behind robust and efficient updates is incremental processing. In CocoIndex, users declare the transformation and don't need to worry about the work of keeping the index and the source in sync. In this blog we'd like to share how we handle incremental updates.

If you like our work, it would mean a lot if you could support us ❤️ with a github star! https://github.com/cocoindex-io/cocoindex

CocoIndex creates and maintains an index, and keeps the derived index up to date as the source changes, with minimal computation. That makes it suitable for ETL/RAG or any transformation task that needs low latency between source and index updates, while also minimizing computation cost.

What Are Incremental Updates?

Figuring out exactly what needs to be updated, and updating only that, without having to recompute everything from scratch.

How does it work?

You don't really need to do anything special: just focus on defining the transformations you need.

CocoIndex automatically tracks the lineage of your data and maintains a cache of computation results. When you update your source data, CocoIndex will:

  1. Identify which parts of the data have changed

  2. Only recompute transformations for the changed data

  3. Reuse cached results for unchanged data

  4. Update the index with minimal changes

And CocoIndex will handle the incremental updates for you.
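The four steps above can be sketched in plain Python. This is not CocoIndex's actual implementation, just a minimal illustration of the idea: fingerprint each source row, reuse the cached output when the fingerprint is unchanged, and recompute only what changed (the function and dict names here are hypothetical).

```python
import hashlib

def process_source(rows, cache, transform):
    """Sketch of the 4 steps: recompute only changed rows, reuse the rest.

    rows:  {source key -> content}
    cache: {source key -> {"fingerprint": ..., "output": ...}}, mutated in place
    """
    index = {}
    for key, content in rows.items():
        # Step 1: identify which parts of the data have changed.
        fingerprint = hashlib.sha256(content.encode()).hexdigest()
        cached = cache.get(key)
        if cached and cached["fingerprint"] == fingerprint:
            # Step 3: reuse cached results for unchanged data.
            index[key] = cached["output"]
        else:
            # Step 2: only recompute transformations for the changed data.
            output = transform(content)
            cache[key] = {"fingerprint": fingerprint, "output": output}
            # Step 4: update the index with minimal changes.
            index[key] = output
    return index
```

Calling this twice with one row changed in between would invoke `transform` only once the second time, for the changed row.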

CocoIndex provides two update modes for a pipeline, selectable with simple configuration:

Who needs Incremental Updates?

Many people think incremental updates only benefit large-scale data. But thinking carefully, it really depends on your computation cost and your requirements for data freshness.

Google processes data at enormous scale, and Google has enormous resources for it. Your data scale is much smaller than Google's, but so is your resource provision.

The real condition for needing incremental updates is this: say T is the maximum staleness you can accept. If you don't want to recompute the whole pipeline from scratch every T, then you need incremental updates to some degree.

What exactly are incremental updates, with examples

Let's take a look at a few examples to understand how it works.

Example 1: Update a document

Consider this scenario:

So we need to keep 3 rows, remove 2 previously existing rows, and add 2 new rows. These need to happen behind the scenes:

CocoIndex takes care of this.
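Conceptually, the keep/remove/add decision is a set difference between the old and new derived rows, keyed by something stable such as a chunk content hash. Here is a minimal sketch (not CocoIndex's real code; `diff_chunks` is a hypothetical name):

```python
def diff_chunks(old_chunks, new_chunks):
    """Given old and new chunk keys (e.g. content hashes) derived from one
    document, compute which target rows to keep, remove, and add."""
    old_set, new_set = set(old_chunks), set(new_chunks)
    keep = old_set & new_set      # unchanged rows: no write needed
    remove = old_set - new_set    # stale rows: delete from the index
    add = new_set - old_set       # new rows: insert into the index
    return keep, remove, add
```

For the scenario above, a 5-chunk document whose update shares 3 chunks with the old version yields exactly 3 to keep, 2 to remove, and 2 to add.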

Example 2: Delete a document

Continuing with the same example. If we delete the document later, we need to delete all 7 rows derived from the document. Again, this needs to be based on the lineage tracking maintained by CocoIndex.
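Lineage tracking can be sketched as a map from each source key to the index rows derived from it, so a source deletion can be propagated without rescanning the index. This is an illustrative sketch, not CocoIndex's internal data structure (the class and method names are hypothetical):

```python
class LineageTracker:
    """Track which index rows were derived from which source row."""

    def __init__(self):
        self.derived = {}  # source key -> set of derived index row ids
        self.index = {}    # index row id -> row data

    def record(self, source_key, row_id, row):
        # Remember that `row_id` in the index came from `source_key`.
        self.derived.setdefault(source_key, set()).add(row_id)
        self.index[row_id] = row

    def delete_source(self, source_key):
        # On source deletion, remove every row derived from it.
        for row_id in self.derived.pop(source_key, set()):
            del self.index[row_id]
```

Deleting a document that produced 7 rows then removes exactly those 7 rows.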

Example 3: Change of the transformation flow

The transformation flow may also change: for example, the chunking logic is upgraded, or a parameter passed to the chunker is adjusted. This may result in the following scenario:

This falls into a similar situation as document update (example 1), and CocoIndex will take care of it. The approach is similar, while this involves some additional considerations:

Example 4: Multiple inputs involved: Merge / Lookup / Clustering

All the examples above are simple cases: each single input row (e.g. a document) is processed independently by each transformation.

CocoIndex is a highly customizable framework, not limited to simple chunking and embedding. It allows users to build more complex, advanced transformations, such as:

The common theme is that during transformation, multiple input rows (coming from one or multiple sources) need to be involved at the same time. Once a single input row is updated or deleted, CocoIndex needs to fetch the other related rows from the same or other sources. Which rows are needed depends on which are involved in the transformations. CocoIndex keeps track of such relationships, fetches the related rows, and triggers the necessary reprocessing incrementally.
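The relationship tracking can be pictured as a reverse index from each input row to the transformation groups it participates in: when one row changes, only those groups are refetched and reprocessed. A minimal sketch under that assumption (hypothetical names, not CocoIndex's implementation):

```python
def reprocess_on_update(updated_key, groups, membership, process_group):
    """Reprocess only the transformation groups the updated row belongs to.

    groups:     {group id -> list of member row keys (merge/lookup/cluster)}
    membership: {row key -> list of group ids it participates in}
    """
    results = {}
    for group_id in membership.get(updated_key, ()):
        # Fetch the other related rows in the same group and reprocess it.
        related = groups[group_id]
        results[group_id] = process_group(related)
    return results
```

Updating one row in a two-row merge group reprocesses just that group; unrelated groups are untouched.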

Change Data Capture (CDC)

1. When source supports push change

Some source connectors support pushed changes. For example, Google Drive supports a drive-level changelog and sends change notifications to your public URL; this applies to both team drives and personal drives (OAuth only; service accounts are not supported). When a file is created, updated, or deleted, CocoIndex can compute based on the diff.
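The handling of a pushed change can be sketched generically: dispatch on the change type and either reprocess the file or drop its derived rows. The event schema and function names here are hypothetical, chosen only for illustration:

```python
def on_change_notification(event, reprocess, delete_derived):
    """Generic handler for one pushed change event.

    event: {"type": "created" | "updated" | "deleted", "file_id": ...}
    """
    if event["type"] in ("created", "updated"):
        # New or changed content: recompute just this file's derived rows.
        reprocess(event["file_id"])
    elif event["type"] == "deleted":
        # Source row gone: remove everything derived from it.
        delete_derived(event["file_id"])
```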

2. Metadata-based, last modified only

Some source connectors don't support pushed changes, but provide metadata and listing operations that return the most recently changed files. For example, Google Drive with a service account.

CocoIndex can monitor changes by periodically polling and comparing each file's last-modified time against the last poll time. However, this cannot capture all changes; for example, it misses files that have been deleted.
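The polling strategy reduces to a simple filter over the listing, which also makes its blind spot visible: a deleted file never appears in the listing, so it is never reported. A minimal sketch (hypothetical names):

```python
def poll_changes(list_files, last_poll_time):
    """Return paths of files modified since the last poll.

    list_files: callable returning [{"path": ..., "modified": ...}, ...]
    Deletions are invisible: a removed file simply stops being listed.
    """
    return [f["path"] for f in list_files() if f["modified"] > last_poll_time]
```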

3. Metadata-based, Fullscan

Some source connectors have limited capabilities for listing changed files, but provide metadata for all files. For example, with local files, we'd need to traverse all files in all directories and subdirectories recursively to get the full list.

When the number of files is large, it's expensive to traverse all files.
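Unlike last-modified polling, a full scan can detect deletions, because the previous scan's file set is available to diff against. A minimal sketch of that diff (hypothetical names):

```python
def full_scan_diff(current_files, previously_seen):
    """Diff two full scans to find created, updated, and deleted files.

    Both arguments map file path -> metadata (e.g. last-modified time).
    """
    created = current_files.keys() - previously_seen.keys()
    deleted = previously_seen.keys() - current_files.keys()
    updated = {path
               for path in current_files.keys() & previously_seen.keys()
               if current_files[path] != previously_seen[path]}
    return created, updated, deleted
```

The cost is proportional to the total number of files, which is exactly why full scans become expensive at scale.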

Cache

In CocoIndex, every Lego block in the pipeline can be cached. Custom functions can take a cache parameter. When True, the executor caches the result of the function for reuse during reprocessing. We recommend setting this to True for any function that is computationally intensive.

The output is reused only if all of these are unchanged: the spec (if any), the input data, and the behavior of the function. For the last one, a behavior_version needs to be provided, and it should be incremented whenever the behavior changes.

For example, this enables caching for a standalone function, see full code example here:

@cocoindex.op.executor_class(gpu=True, cache=True, behavior_version=1)
class PdfToMarkdownExecutor:
    """Executor for PdfToMarkdown."""
    ...
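The reuse condition can be pictured as a cache key derived from all three inputs: the source content, the spec, and the behavior version. This is an illustrative sketch of the idea, not CocoIndex's actual key format (`cache_key` is a hypothetical name):

```python
import hashlib
import json

def cache_key(input_content, spec, behavior_version):
    """Cached output is reused only when input, spec, and behavior are all
    unchanged; bumping behavior_version invalidates every old entry."""
    payload = json.dumps(
        {"input": input_content, "spec": spec, "behavior": behavior_version},
        sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

Changing a spec parameter (say, a chunk size) or incrementing behavior_version yields a different key, so the stale cached output is simply never looked up again.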