Merkle Trees 101 (Part 1): Structure, Proofs, and Real-World Uses

This may seem like a random topic, but I’m currently taking a cryptography class, and my project idea is related to homomorphic encryption. While doing some preliminary research into the topic, I stumbled across Merkle trees and distributed filesystems. So, naturally, I decided to make it the topic of my post in hopes of learning more about them and imparting some knowledge.

So, without further ado…enjoy.

Before we can dive into Merkle Trees, we first have to talk about plain old trees.

What is a Tree?

A tree is a data structure in computer science that represents a hierarchical tree structure composed of nodes. Each node can be connected to many children, but can only have one parent node; this is not the case for the root node. The root node has no parents; it’s the top-most node in the tree. Nodes with no children are called leaf nodes.

Below is a figure that visually explains this:

Root Node: the root node in the above tree is Parent. The parent node has no…parent.
Children Nodes: James, Thomas, Nick, and all of the Grandchildren are children nodes.
Leaf: the leaf nodes of a tree are the children nodes with no children of their own. So in this case, all of the Grandchildren nodes are leaf nodes as they do not have any children of their own.

The constraints put on the nodes is actually quite useful. This means that there are no loops in relations, meaning a node cannot be it’s own ancestor. Secondly, this means that we can consider each child node the root node of its own subtree. That might sound confusing, but let’s use some visuals:

Let’s show the entire tree with leaf nodes and subtrees marked:

With that in mind, how can we visit/access each node? This is called “walking the tree” or tree traversal.

Tree Traversal

We won’t be looking at tree traversal in-depth for this post; however, just at a high level, I’ll briefly describe the two most popular types.

Unlike linear data structures, which are traversed linearly (looping over the data structure itself and computing at each element), trees can be traversed in multiple ways. Two of the most common traversal methods are breadth-first and depth-first.

Depth-First: go as deep into the tree as possible before moving to the next sibling.
Pre-Order: Visit root, then left subtree, then right subtree.
In-Order: Visit left subtree, root, then right subtree.
Post-Order: Visit left subtree, right subtree, then root.
Breadth-First: explore all nodes at the current depth before moving on, also known as level-order

We now have enough background info to tackle this week's post…

What is a Merkle Tree?

A Merkle tree is a binary tree (just a tree where a node can have up to two child nodes), obviously, in which every leaf node contains the cryptographic hash of a block of data, and every non-leaf node contains the cryptographic hash of the hashes of its children (sounds confusing, I’ll show a visual in a second).

Now, what this specific tree allows us to do is securely verify that arbitrary content exists within the tree.

Let’s get a visualization of a Merkle Tree:

Inclusivity

So, how can we verify that a piece of data is included in the Merkle tree? By providing a Merkle Proof, aka an inclusion proof.

In an inclusion proof, all we need is the leaf node’s (the data block being verified) sibling and all other intermediate hashes needed to calculate the Merkle root hash. I’ll explain visually below:

Let's say we want to check if data block L1 is in the Merkle Tree. The way we would do that is:

Calculate hash(L1)
Use the Merkle Tree’s Hash 0-1 and Hash 1
This means that we will calculate Hash 0 using the Merkle tree’s Hash 0-1 and our computed Hash 0-0

This method of proof is actually very advantageous when compared to linear data structures because it scales logarithmically. This means that we need to compute log(N) hashes to prove the inclusion of a data element, where N is the number of leaf nodes. As data grows, this logarithmic scaling proves to be much more manageable.

So in our example above, we have four leaf nodes, so we only need log(4) hashes to verify any data block. Let’s see how that scales:

8 leaf nodes → log(8) → 3 hashes needed
16 leaf nodes → log(16) → 4 hashes needed
128 leaf nodes → log(128) → 7 hashes needed
1,000,000 leaf nodes → log(1000000) → 20 hashes needed

As the leaf nodes grow, the number of hashes to verify any individual data block grows at a sustainable pace.

Uses

So, where do we see Merkle Trees today? Basically everywhere. The most common and prevalent usage of Merkle Trees is within the Blockchain and Cryptocurrency space. It’s a fundamental part of Blockchain technology; they’re used to verify and organize transactions efficiently within data blocks.

Blockchain Technology:
Bitcoin and Ethereum use Merkle Trees to efficiently verify transactions.
Simplified Payment Verification (SPV) allows lightweight clients to verify transactions without downloading the entire blockchain.
Distributed File Systems:
IPFS (InterPlanetary File System) uses Merkle Trees to address and distribute content, ensuring data integrity.
Version Control Systems:
Git uses Merkle Trees to manage file versions and ensure data integrity across repositories.
Peer-to-Peer Networks:
BitTorrent utilizes Merkle Trees to verify pieces of files downloaded from multiple peers.
Trusted Platform Modules (TPMs):
Use Merkle Trees to securely store cryptographic keys and verify system integrity.

So basically, Merkle Trees are everywhere, from the file system on your computer to your version control system (Git, Mercurial) to distributed databases like Amazon DynamoDB and Google’s BigTable.

Conclusion

Merkle Trees play a crucial role in ensuring data integrity and efficient verification in distributed systems. Their hierarchical hashing mechanism allows for secure and scalable data verification, making them indispensable in modern technology.

For my next post, we’ll be implementing Merkle Trees.

If you found this post helpful or have questions, please leave a comment below. Don't forget to subscribe to stay updated on future posts exploring fascinating topics in computer science and cryptography.

Thanks for tuning in,

Ali.