For a long time, siloed data systems such as databases and data warehouses were sufficient. These systems provided convenient abstractions for various data management tasks, including:

However, as needs evolved, organizations had to run multiple systems over the same data, which led to costly and time-consuming duplication and copying. It also made the pipelines behind those data movements harder to troubleshoot and maintain. This is where the data lakehouse architecture becomes valuable: it builds on the open storage layer of a data lake and lets table, catalog, and query execution layers be added as decoupled, modular components.

In a typical lakehouse, we:

The table format is what makes these capabilities possible. In this article, we will explore what Apache Iceberg is and provide resources for further learning.

How Apache Iceberg Works

Apache Iceberg consists of four key components:

Ultimately, these layers of metadata enable tools to efficiently scan the table and exclude unnecessary data files from the scan plan.
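To make the pruning idea concrete, here is a minimal sketch in Python. It is a toy model, not Iceberg's actual API or file layout: the `DataFile`, `Manifest`, and `plan_scan` names are hypothetical, and the statistics are reduced to a single integer column. It only illustrates the general principle that a manifest list can carry summary ranges for each manifest, and each manifest can carry value bounds for each data file, so a planner can skip whole manifests and individual files that cannot match a filter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file entry tracked by a manifest."""
    path: str
    lower: int  # smallest value of the filtered column in this file
    upper: int  # largest value of the filtered column in this file


@dataclass(frozen=True)
class Manifest:
    """Hypothetical stand-in for a manifest plus its summary range."""
    path: str
    lower: int  # smallest value across all files in this manifest
    upper: int  # largest value across all files in this manifest
    files: Tuple[DataFile, ...]


def overlaps(lo: int, hi: int, q_lo: int, q_hi: int) -> bool:
    """True if the closed range [lo, hi] intersects the query range [q_lo, q_hi]."""
    return lo <= q_hi and hi >= q_lo


def plan_scan(manifest_list: List[Manifest], q_lo: int, q_hi: int) -> List[str]:
    """Return the paths of data files that might hold rows in [q_lo, q_hi]."""
    plan = []
    for manifest in manifest_list:
        # First level of pruning: skip the whole manifest without opening it.
        if not overlaps(manifest.lower, manifest.upper, q_lo, q_hi):
            continue
        # Second level of pruning: skip individual data files by their bounds.
        for data_file in manifest.files:
            if overlaps(data_file.lower, data_file.upper, q_lo, q_hi):
                plan.append(data_file.path)
    return plan


# Illustrative snapshot: two manifests, each tracking two data files.
SNAPSHOT = [
    Manifest("m1.avro", 1, 200, (
        DataFile("a.parquet", 1, 100),
        DataFile("b.parquet", 101, 200),
    )),
    Manifest("m2.avro", 201, 400, (
        DataFile("c.parquet", 201, 300),
        DataFile("d.parquet", 301, 400),
    )),
]

print(plan_scan(SNAPSHOT, 150, 250))  # ['b.parquet', 'c.parquet']
print(plan_scan(SNAPSHOT, 350, 360))  # ['d.parquet']
```

In the second query the first manifest is never opened, which is where the savings come from on tables with many thousands of files. A real engine evaluates richer statistics (per-column bounds, null counts, partition values) against arbitrary filter expressions, but the two-level skip has the same shape.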

Additional Resources

Apache Iceberg Features

Data lakehouse table formats such as Apache Iceberg, Apache Hudi, and Delta Lake all provide the essential features: ACID transactions on data lakes, schema evolution, and time travel, which allows querying historical versions of a table. Beyond these, Apache Iceberg offers a variety of unique features:

Additional Resources

The Apache Iceberg Ecosystem

One of the most significant advantages of Apache Iceberg is its open and extensive vendor and developer ecosystem.

Additional Resources

Getting Hands-on

Tutorials that can be done from your laptop without cloud services

Tutorials that require cloud services