The Moment JavaScript Hit Its Performance Ceiling

Visual regression testing is one of those problems that sounds simple until you actually scale it. Take a screenshot before your change, take one after, and diff them. If they don't match, something broke. A CI pipeline doing visual testing runs this comparison thousands of times per day.

The JavaScript ecosystem has solid tools for this. pixelmatch is the standard — small, fast enough for most cases, zero dependencies. It's a well-written library. But “fast enough” starts to crack when your screenshots are 2K/4K resolution, when you're running batch comparisons across hundreds of pages, and when every millisecond in CI costs a penny.

I wanted to build something more efficient.

Designing the BlazeDiff Algorithm

My first idea was to divide the image into blocks and skip unchanged regions entirely. Most UI screenshots between runs are 95%+ identical. Why scan 18 million pixels for a button color change?

At first glance, this sounds impossible. You can't know whether a block changed without touching its pixels.

But most image diff algorithms already use a cheap byte-level check before running expensive perceptual comparisons. The insight was to split those stages into two separate passes. In the cold pass, touch every pixel but only do the cheap check, marking blocks as changed or unchanged. In the hot pass, run the expensive work only on flagged blocks.

You still scan every pixel once. But the expensive work only runs on the 1-5% of blocks that actually differ.

The two-pass architecture:

  1. Cold pass: Scan every block using a cheap byte-level comparison. If any pixel in a block differs, mark it as changed and move on. If a block is fully identical, draw the grayscale output immediately and forget about it.
  2. Hot pass: Iterate only over the changed blocks. Run the full perceptual pipeline: YIQ color delta computation, anti-aliasing neighborhood detection, diff/AA color output rendering. This is the expensive work, and it only touches a small fraction of the image that actually changed.
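
The two passes can be sketched in TypeScript. This is illustrative, not BlazeDiff's actual code; it assumes both images are same-sized buffers of 32-bit RGBA pixels and shows only the cold pass that feeds the hot pass:

```typescript
type Block = { x0: number; y0: number; x1: number; y1: number };

function coldPass(
  a: Uint32Array, // image A as 32-bit RGBA pixels
  b: Uint32Array, // image B, same dimensions
  width: number,
  height: number,
  blockSize: number,
): Block[] {
  const changed: Block[] = [];

  // Cold pass: cheap byte-level equality check per block.
  for (let y0 = 0; y0 < height; y0 += blockSize) {
    for (let x0 = 0; x0 < width; x0 += blockSize) {
      const x1 = Math.min(x0 + blockSize, width);
      const y1 = Math.min(y0 + blockSize, height);

      let identical = true;
      for (let y = y0; y < y1 && identical; y++) {
        for (let x = x0; x < x1; x++) {
          if (a[y * width + x] !== b[y * width + x]) {
            identical = false; // mark and move on; no perceptual math here
            break;
          }
        }
      }

      if (!identical) changed.push({ x0, y0, x1, y1 });
      // else: draw the grayscale output for this block and forget about it
    }
  }

  // The hot pass runs the expensive perceptual pipeline only on these blocks.
  return changed;
}
```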

The block size scales with image dimensions:

export function calculateOptimalBlockSize(
	width: number,
	height: number,
): number {
	const area = width * height;

	const scale = Math.sqrt(area) / 100;
	const rawSize = 16 * Math.sqrt(scale);

	// More efficient power-of-2 rounding using bit operations
	const log2Val = Math.log(rawSize) * Math.LOG2E; // log2(rawSize)
	return 1 << Math.round(log2Val); // Bit shift instead of Math.pow(2, x)
}

Small images get 8×8 blocks. A 4K screenshot gets 128×128 blocks. The formula ensures blocks are always powers of two (cache-line friendly) and that the granularity scales with image area.
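
To sanity-check the scaling, here is a standalone copy of the formula (using Math.LOG2E, so that Math.log(x) * Math.LOG2E computes log2(x)) with a few representative sizes:

```typescript
// Standalone version of the block-size formula for experimentation.
function optimalBlockSize(width: number, height: number): number {
  const area = width * height;
  const scale = Math.sqrt(area) / 100;            // ~1.0 for a 100×100 image
  const rawSize = 16 * Math.sqrt(scale);
  const log2Val = Math.log(rawSize) * Math.LOG2E; // log2(rawSize)
  return 1 << Math.round(log2Val);                // nearest power of two
}

optimalBlockSize(40, 40);     // → 8   (small thumbnail)
optimalBlockSize(100, 100);   // → 16
optimalBlockSize(5600, 3200); // → 128 (4K-class screenshot)
```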

The cold pass is where most of the speedup comes from. For a 5600×3200 image with a small UI change, you might have 2,000 blocks total, but only 15 contain differences. The hot pass, which performs the expensive calculations, covers less than 1% of the image. The remaining 99% gets a fast grayscale fill and moves on. When nothing changed, the cold pass rejects every block immediately.

Iterating on the JavaScript Implementation

The first version of BlazeDiff was pure TypeScript. The block-based design gave significant speedups over pixelmatch's linear scan, but JavaScript itself became the next bottleneck.

I went through several optimization passes:

  1. Typed arrays everywhere. The image data arrives as Uint8Array/Uint8ClampedArray, but the YIQ color comparison works on whole 32-bit RGBA pixels. Viewing the underlying buffer as a Uint32Array lets you read a full pixel in one operation instead of four byte reads.

  2. Minimizing allocations. Pre-allocating the changedBlocks array, avoiding intermediate buffers, reusing output buffers across comparisons.

  3. Whole-buffer equality check. Before doing any per-pixel work, compare the two image buffers byte-for-byte. Buffer.compare runs as a single native memory comparison, so the “nothing changed” case short-circuits before any perceptual math runs:

    const identical = Buffer.compare(
      Buffer.from(image1.data.buffer),
      Buffer.from(image2.data.buffer)
    ) === 0; // 0 means the buffers are byte-identical
    
  4. Loop structure. Keeping the hot inner loop as tight as possible with no function calls, no conditionals that the branch predictor can't handle, and no unnecessary property lookups.
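
The Uint32Array trick from item 1 looks like this in isolation (a sketch assuming little-endian byte order, which holds on x86 and ARM). It mirrors the channel extraction the Rust SIMD code performs later:

```typescript
// One RGBA pixel stored as four bytes: R, G, B, A
const bytes = new Uint8ClampedArray([255, 128, 64, 255]);

// Reinterpret the same underlying buffer as 32-bit pixels: one read per pixel
const pixels = new Uint32Array(bytes.buffer, bytes.byteOffset, bytes.length / 4);

const px = pixels[0];         // single 32-bit read instead of four byte reads
const r = px & 0xff;          // 255 (lowest byte on little-endian)
const g = (px >>> 8) & 0xff;  // 128
const b = (px >>> 16) & 0xff; // 64
const a = (px >>> 24) & 0xff; // 255
```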

The JavaScript implementation ended up 1.5 times faster than pixelmatch on average and up to 88% faster on identical images. The numbers from benchmarks on an M1 Max:

| Benchmark | pixelmatch | BlazeDiff | Improvement |
| --- | --- | --- | --- |
| 4K image (different) | 302.29ms | 211.92ms | 29.9% |
| 4K image (identical) | 19.18ms | 2.39ms | 87.5% |
| Full page (different) | 331.94ms | 92.77ms | 72.1% |
| Full page (identical) | 63.18ms | 7.68ms | 87.8% |

The block-based optimization was clearly working. But 211ms for a single 4K diff is still slow when you're running hundreds of them. At this point the algorithm itself was optimized; the JavaScript runtime had become the bottleneck.

Discovering the Ceiling: Comparing Against odiff

While benchmarking, I discovered odiff — an image diff tool written in Zig with OCaml bindings. It was consistently faster than my JavaScript implementation.

This wasn't surprising. Native code has fundamental advantages: no garbage collector pausing the hot loop, real SIMD instructions, ahead-of-time optimization with no JIT warmup, and full control over memory layout.

The reality is clear: once the algorithm is optimized, the language and runtime become the limiting factor. You can't out-optimize JavaScript's fundamental overhead.

Porting the Algorithm to Rust

The decision to port to Rust came down to a few factors: direct access to SIMD intrinsics, no garbage collector in the hot path, and a mature bridge into the JavaScript ecosystem through napi-rs.

The Rust port follows the same two-pass architecture. Here's the core of the cold pass (the part that identifies which blocks have changed):

// Cold pass: identify changed blocks
for by in 0..blocks_y {
    for bx in 0..blocks_x {
        let start_x = bx * block_size;
        let start_y = by * block_size;
        let end_x = (start_x + block_size).min(width);
        let end_y = (start_y + block_size).min(height);

        let has_diff = block_has_perceptual_diff(
            a32, b32, width, start_x, start_y, end_x, end_y, max_delta,
        );

        if has_diff {
            changed_blocks.push((start_x, start_y, end_x, end_y));
        } else if let Some(ref mut out) = output {
            if !options.diff_mask {
                fill_block_gray_optimized(
                    image1, out, options.alpha, start_x, start_y, end_x, end_y,
                );
            }
        }
    }
}

The SIMD acceleration is where things get interesting. The YIQ color delta computation (conversion of RGB differences into perceptual differences) is the hottest inner loop. Here's the core of the NEON implementation for ARM (processing 4 pixels simultaneously):

// Loads four pixels at once, extracts RGB channels with vector instructions,
// then computes the YIQ perceptual delta.
#[cfg(target_arch = "aarch64")]
unsafe fn yiq_delta_4_neon(pixels_a: *const u32, pixels_b: *const u32) {
    let pa = vld1q_u32(pixels_a);   // Load 4 RGBA pixels
    let pb = vld1q_u32(pixels_b);

    // Extract channels
    let r_a = vandq_u32(pa, mask_ff);
    let g_a = vandq_u32(vshrq_n_u32(pa, 8), mask_ff);
    let b_a = vandq_u32(vshrq_n_u32(pa, 16), mask_ff);

    // Convert to float for YIQ transform
    let r_a_f = vcvtq_f32_u32(r_a);
    // Alpha blending with FMA
    let br_a = vfmaq_f32(v255, vsubq_f32(r_a_f, v255), alpha_norm_a);

    // YIQ: y²×0.5053 + i²×0.299 + q²×0.1957
}

On x86_64, the same logic runs as SSE4.1 (4 pixels) or AVX2+FMA (8 pixels), selected at runtime.

Feature detection happens once per diff, and the fastest available SIMD path is used for the entire operation. No runtime dispatching per-pixel.
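
For reference, here is the scalar form of the per-pixel delta those SIMD paths vectorize, sketched in TypeScript using pixelmatch's published YIQ coefficients (alpha blending omitted for brevity):

```typescript
// Perceptual color difference of two RGB pixels in YIQ space.
// Coefficients are from pixelmatch's YIQ metric; the transform is linear,
// so projecting the channel deltas gives the same result as differencing
// the per-pixel Y/I/Q values.
function yiqDelta(
  r1: number, g1: number, b1: number,
  r2: number, g2: number, b2: number,
): number {
  const dr = r1 - r2;
  const dg = g1 - g2;
  const db = b1 - b2;

  const y = dr * 0.29889531 + dg * 0.58662247 + db * 0.11448223;
  const i = dr * 0.59597799 - dg * 0.27417610 - db * 0.32180189;
  const q = dr * 0.21147017 - dg * 0.52261711 + db * 0.31114694;

  // Weighted squared distance: y²×0.5053 + i²×0.299 + q²×0.1957
  return 0.5053 * y * y + 0.299 * i * i + 0.1957 * q * q;
}
```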

The initial Rust port matched odiff-level performance. Same algorithm, different language, and suddenly the runtime overhead was gone.

Optimization Passes: Making Rust Significantly Faster

Matching odiff was the baseline. Several rounds of optimization pushed BlazeDiff well past it.

The results against odiff (including full image I/O: decode, diff, encode):

| Benchmark | odiff | BlazeDiff | Improvement |
| --- | --- | --- | --- |
| 4K/1 (5600×3200) | 1190.92ms | 293.86ms | 75.3% |
| 4K/2 | 1530.21ms | 363.50ms | 76.2% |
| 4K/3 | 1835.47ms | 389.67ms | 78.8% |
| Full page/1 | 1035.20ms | 472.99ms | 54.3% |
| Full page/2 | 598.79ms | 263.90ms | 55.9% |
Over 4× faster on 4K images and more than 2× faster on full-page screenshots. The combination of block-based early exit, SIMD vectorization, parallel I/O, and aggressive compiler optimization adds up.

The Integration Problem: Getting Rust into JavaScript

Having a fast Rust binary is great. But the target users are JavaScript developers writing visual tests with Playwright, Cypress, or Vitest. They need npm install, not cargo install.

The first approach was the obvious one:

Node.js → spawn binary → run diff → return result

Wrap the Rust binary in a Node.js package, spawn it as a child process, and parse the JSON output:

const execFileAsync = promisify(execFile); // from node:util + node:child_process

try {
  await execFileAsync(binaryPath, args);
  return { match: true };
} catch (err) {
  // Parse exit codes: 0=identical, 1=pixel-diff, 2=error
}

This works. It's simple, portable, and the binary handles all the heavy lifting. But in a CI pipeline diffing 500 screenshots, you're spawning 500 processes. Process creation isn't free. Fork, exec, memory mapping, shared library loading. The overhead per spawn is small, but it adds up. Benchmarks showed it was eating 30-40% of the total time on small images where the actual diff takes <5ms.

Why Not a Persistent Diff Server?

odiff solves this problem by running a persistent server process. You start it once, send diff requests over a socket, and it handles them without the spawn overhead.

I considered this approach and rejected it. A persistent server is one more long-lived process to start, monitor, and tear down in CI, plus a socket protocol to version and debug. Those trade-offs didn't fit the use case.

The design goal was clear: native performance with the simplicity of a function call.

Learning from Rust-Based JavaScript Tooling

The answer came from studying how other Rust-powered JavaScript tools handle this problem. Biome (the linter/formatter) uses a particularly clean architecture:

JavaScript API → N-API binding (.node file) → Rust core library

The key insight is N-API — Node.js's stable ABI for native addons. Instead of spawning a child process, the Rust code compiles to a .node shared library that loads directly into the Node.js process. Function calls from JavaScript to Rust become direct function pointer calls. No serialization, no IPC, no process overhead.

The packaging strategy is equally important. Platform-specific binaries are published as separate npm packages with os and cpu fields:

{
    "name": "@blazediff/bin-darwin-arm64",
    "os": ["darwin"],
    "cpu": ["arm64"],
    "files": ["blazediff", "blazediff.node"]
}

The main package lists these as optionalDependencies:

{
    "name": "@blazediff/bin",
    "optionalDependencies": {
        "@blazediff/bin-darwin-arm64": "3.5.0",
        "@blazediff/bin-darwin-x64": "3.5.0",
        "@blazediff/bin-linux-arm64": "3.5.0",
        "@blazediff/bin-linux-x64": "3.5.0",
        "@blazediff/bin-win32-arm64": "3.5.0",
        "@blazediff/bin-win32-x64": "3.5.0"
    }
}

npm, pnpm, and bun install only the package matching the current platform. No postinstall scripts. No compilation. Just a prebuilt binary.

The Final Architecture: Rust + N-API

BlazeDiff's final architecture uses a tiered approach. The N-API binding is the fast path. The spawned binary is the fallback.

JavaScript API
    ↓
Try N-API binding (in-process, ~0 overhead)
    ↓ (if unavailable)
Fall back to execFile (spawn binary)
    ↓
Rust diff engine
    ↓
SIMD + parallel I/O

The N-API binding is defined with napi-rs:

#[napi]
pub fn compare(
    base_path: String,
    compare_path: String,
    diff_output: Option<String>,
    options: Option<NapiDiffOptions>,
) -> Result<NapiDiffResult> {
    // Load images in parallel using rayon
    let (img1, img2) = load_images(&base_path, &compare_path)?;

    // Run the diff (`output_image` setup and options conversion elided here)
    let result = diff(&img1, &img2, output_image.as_mut(), &diff_options)?;

    Ok(NapiDiffResult {
        match_result: result.identical,
        reason: if result.identical { None } else { Some("pixel-diff".into()) },
        diff_count: Some(result.diff_count),
        diff_percentage: Some(result.diff_percentage),
    })
}

On the JavaScript side, the binding loads lazily and caches the result:

import os from "node:os";
import { createRequire } from "node:module";

let nativeBinding: NativeBinding | null = null;
let nativeBindingAttempted = false;

function tryLoadNativeBinding(): NativeBinding | null {
    if (nativeBindingAttempted) return nativeBinding;
    nativeBindingAttempted = true;

    const key = `${os.platform()}-${os.arch()}`;
    const platformInfo = PLATFORM_PACKAGES[key];
    if (!platformInfo) return null;

    try {
        const require = createRequire(import.meta.url);
        const binding = require(platformInfo.packageName) as NativeBinding;
        if (typeof binding?.compare === "function") {
            nativeBinding = binding;
            return binding;
        }
    } catch {
        // Native binding not available, will use execFile fallback
    }

    return null;
}
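
The selection logic itself reduces to a tiny dispatcher. A hypothetical sketch (the real package wires the native path to tryLoadNativeBinding and the fallback to the execFile wrapper shown earlier):

```typescript
type DiffResult = { match: boolean; reason?: string };
type CompareFn = (basePath: string, comparePath: string) => Promise<DiffResult>;

// Pick the fast path once; every subsequent diff is a plain function call.
function makeCompare(
  native: { compare: CompareFn } | null, // e.g. the result of tryLoadNativeBinding()
  spawnFallback: CompareFn,              // e.g. an execFile-based CLI wrapper
): CompareFn {
  return native ? native.compare.bind(native) : spawnFallback;
}
```

Because the decision is made once at load time, per-call overhead on the fast path is a direct in-process function call, with the spawned binary kept purely as a safety net.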

The Cargo configuration compiles the same crate as both a standalone binary and a cdylib (shared library for N-API):

[lib]
name = "blazediff"
crate-type = ["lib", "cdylib"]

[features]
default = []
napi = ["dep:napi", "dep:napi-derive"]

One codebase, two distribution targets: a CLI binary and a Node.js native module. Same SIMD optimizations, same block-based algorithm, same everything, just a different entry point.

Lessons Learned

1. Algorithms before languages.

The block-based design produced 30-88% speedups over pixelmatch in pure JavaScript, before any Rust was written. Choosing the right algorithm matters more than choosing the right language. If I'd gone straight to Rust with a pixel-by-pixel approach, I'd have a fast pixel scanner instead of a fast image differ.

2. JavaScript has a hard ceiling for numerical work.

Once the algorithm was optimized, JavaScript's runtime overhead — GC pauses, JIT warmup, lack of SIMD, property lookup chains — became the dominant cost. For tight numerical loops over large buffers, no amount of micro-optimization can overcome the V8 overhead. This isn't a criticism of JavaScript. It's a recognition that different tools serve different purposes.

3. Process architecture matters as much as algorithm design.

The jump from spawn(binary) to N-API was a larger improvement than some algorithmic changes. Architecture decisions — how you invoke code, how you transfer data, how you manage process lifecycles — compound with every call. When you're diffing 500 images, the difference between a function call and a process spawn is the difference between seconds and minutes.

4. N-API is an underrated bridge.

N-API provides a stable ABI across Node.js versions, clean platform-specific packaging via optionalDependencies, and near-zero overhead for Rust-to-JS function calls. It's the same approach used by Biome, SWC, and other high-performance JavaScript tools. If you're building a performance-critical JavaScript library and JavaScript itself is the bottleneck, N-API + Rust is a proven path.

5. Benchmark everything, trust nothing.

Every optimization was validated with benchmarks running on consistent hardware. The WASM route was explored and abandoned (I actually tried both Rust WASM and AssemblyScript — both were slower than native for this workload). The block size formula was tuned empirically across image sizes from 100×100 to 5600×3200. Performance intuition is often wrong. Measure.


JavaScript remains one of the best ecosystems for developer tooling: the package manager story, the testing framework integration, and the sheer breadth of the community. But sometimes the fastest path forward is to let native code do the heavy lifting while JavaScript remains the interface developers love.

BlazeDiff ships as a single npm install. Under the hood, it's 2,000 lines of SIMD-optimized Rust processing pixels through a two-pass block algorithm. From the developer's perspective, it's compare(a, b).

That's the goal: invisible performance. The user shouldn't have to know or care that Rust exists.