TL;DR: Caching bit shifts looks smart but made my code up to 6× slower. Modern CPUs and compilers make direct computation faster than memory lookups - measure, don’t assume.
The Interview Question
You're given an array and asked to generate its powerset - all possible subsets. For [1, 2, 3], that's:
[], [1], [2], [1,2], [3], [1,3], [2,3], [1,2,3]
A common solution uses recursion; there are plenty of solid code snippets out there demonstrating it, so I won’t dive into it here.
There’s another elegant solution that uses binary representation: for an array of length n, count from 0 to 2^n - 1, where each bit pattern represents which elements to include:
000 → []
001 → [1]
010 → [2]
011 → [1,2]
100 → [3]
101 → [1,3]
110 → [2,3]
111 → [1,2,3]
The code is beautiful:
function generatePowerset(arr) {
  const result = [];
  const n = arr.length;
  for (let mask = 0; mask < (1 << n); mask++) {
    const subset = [];
    for (let i = 0; i < n; i++) {
      if (mask & (1 << i)) { // Check if bit i is set
        subset.push(arr[i]);
      }
    }
    result.push(subset);
  }
  return result;
}
The "Optimization" That Wasn't
While implementing this in JavaScript, I noticed (1 << i) being computed millions of times. My programmer instinct screamed: "Cache it!"
// "Optimized" version
const BIT_MASKS = new Array(32);
for (let i = 0; i < 32; i++) {
  BIT_MASKS[i] = 1 << i; // Compute once
}

// Then use:
if (mask & BIT_MASKS[i]) { // Reuse millions of times
  subset.push(arr[i]);
}
This seems like textbook optimization: compute once, use many. But when I benchmarked it, something shocking happened.
The "optimized" version was 36% slower.
This sent me down a rabbit hole across languages, compilers, and CPU architecture.
The Investigation
I created three test implementations:
- Variable Shift: Compute (1 << i) every time (variable shift amount)
- Local Precalculation: Create the lookup array on every function call
- Global Precalculation: Create the lookup array once at startup
Each approach counts set bits in 32-bit integers - a perfect isolated test of bit operations.
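Concretely, here is a minimal C sketch of the three strategies (the JavaScript versions are analogous; the function and array names are mine, and the exact benchmark code in the repository may differ):

#include <stdint.h>

// Strategy 1 - variable shift: compute (1 << i) on every iteration
static int count_bits_shift(uint32_t num) {
  int count = 0;
  for (int i = 0; i < 32; i++) {
    if (num & (1u << i)) { count++; }
  }
  return count;
}

// Strategy 2 - local precalculation: rebuild the mask table on every call
static int count_bits_local(uint32_t num) {
  uint32_t masks[32];
  for (int i = 0; i < 32; i++) { masks[i] = 1u << i; }
  int count = 0;
  for (int i = 0; i < 32; i++) {
    if (num & masks[i]) { count++; }
  }
  return count;
}

// Strategy 3 - global precalculation: build the mask table once at startup
static uint32_t GLOBAL_BIT_MASKS[32];

static void init_masks(void) {
  for (int i = 0; i < 32; i++) { GLOBAL_BIT_MASKS[i] = 1u << i; }
}

static int count_bits_global(uint32_t num) {
  int count = 0;
  for (int i = 0; i < 32; i++) {
    if (num & GLOBAL_BIT_MASKS[i]) { count++; }
  }
  return count;
}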
Testing in JavaScript (V8 JIT)
Let's see how these three strategies compare in JavaScript with V8's JIT compiler:
// Dataset: 100,000 random 32-bit integers
// 50 iterations for statistical validity
Results:
No Precalc: 8.999ms (baseline)
Global Precalc: 12.250ms (+36% slower)
Local Precalc: 15.303ms (+70% slower)
Direct computation won decisively. But was this JavaScript-specific?
Testing in C (Unoptimized: -O0)
Now testing in C without compiler optimization to see the raw hardware behavior:
// Dataset: 1,000,000 numbers
// Pure C without compiler optimization
Results:
No Precalc: 126.762ms (baseline)
Global Precalc: 175.930ms (+39% slower)
Local Precalc: 217.756ms (+72% slower)
Nearly identical percentages! This suggested a fundamental hardware truth, not a language quirk.
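If you want to reproduce these numbers, here is a minimal sketch of a timing harness (my own simplified version, assuming the count_bits_* functions and init_masks from the sketch above; the repository code may differ):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define N 1000000  // dataset size (bump to 10000000 for the large-data runs)

int main(void) {
  uint32_t *data = malloc(N * sizeof *data);
  if (!data) { return 1; }
  for (size_t i = 0; i < N; i++) {
    data[i] = (uint32_t)rand();  // note: RAND_MAX is often only 31 bits
  }

  init_masks();  // fill GLOBAL_BIT_MASKS once, outside the timed region

  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);

  volatile uint64_t total = 0;  // volatile so the work isn't optimized away
  for (size_t i = 0; i < N; i++) {
    total += count_bits_shift(data[i]);  // swap in the _local / _global variants
  }

  clock_gettime(CLOCK_MONOTONIC, &end);
  double ms = (end.tv_sec - start.tv_sec) * 1e3 +
              (end.tv_nsec - start.tv_nsec) / 1e6;
  printf("total=%llu time=%.3f ms\n", (unsigned long long)total, ms);

  free(data);
  return 0;
}

Compile the same file twice - gcc -O0 for the unoptimized runs and gcc -O2 for the optimized ones - and average over multiple iterations, as the benchmarks here do.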
Testing in C (Optimized: -O2)
Here's where it got interesting. With aggressive optimization, the compiler can recognize that the global array is constant and eliminate the lookup entirely at compile time:
// Dataset: 1,000,000 numbers (3.8 MB)
Results:
Global Precalc: 1.925ms (FASTEST!)
Local Precalc: 2.006ms (+4%)
No Precalc: 2.231ms (+16%)
The compiler reversed the results! By proving the global array is constant, GCC turned memory lookups into optimized code that beats runtime computation.
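One way to make that constancy explicit, rather than relying on the optimizer to prove the init loop never runs again, is to declare the table const up front. This is my own variation, not the benchmarked code:

// Replaces the runtime-initialized table: the values are baked into
// read-only data at compile time, there is no init loop, and the
// compiler knows they can never change.
static const uint32_t GLOBAL_BIT_MASKS[32] = {
  0x00000001, 0x00000002, 0x00000004, 0x00000008,
  0x00000010, 0x00000020, 0x00000040, 0x00000080,
  0x00000100, 0x00000200, 0x00000400, 0x00000800,
  0x00001000, 0x00002000, 0x00004000, 0x00008000,
  0x00010000, 0x00020000, 0x00040000, 0x00080000,
  0x00100000, 0x00200000, 0x00400000, 0x00800000,
  0x01000000, 0x02000000, 0x04000000, 0x08000000,
  0x10000000, 0x20000000, 0x40000000, 0x80000000
};

Whether this changes the generated code at -O2 is something to verify yourself by inspecting the assembly (for example with gcc -O2 -S); I did not benchmark this variant separately.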
But when I increased the dataset to 10 million numbers (38 MB) - exceeding the CPU cache:
// Dataset: 10,000,000 numbers (38 MB)
Results:
No Precalc: 16.617ms (FASTEST)
Global Precalc: 17.575ms (+6%)
Local Precalc: 17.927ms (+8%)
Direct computation won again. The compiler's optimization hit a wall when data exceeded cache capacity.
The Fourth Approach: Running Mask
One more optimization seemed promising - instead of computing (1 << i) with a variable shift, maintain a running mask and shift by a constant:
// Instead of variable shift:
for (i = 0; i < 32; i++) {
  if (num & (1 << i)) { count++; } // Variable shift amount
}

// Use constant shift:
uint32_t mask = 1;
for (i = 0; i < 32; i++) {
  if (num & mask) { count++; }
  mask <<= 1; // Constant shift by 1
}
This appears even better:
- Shift by constant 1 (simpler than variable shift)
- One operation per iteration
- Cleaner code
Result: 5.85× slower! (97ms vs 17ms)
Here's how all four approaches compare with production-scale data:
// Dataset: 10,000,000 numbers, 1000 iterations
Results:
Variable Shift: 16.614ms (FASTEST)
Global Precalc: 17.622ms (+6%)
Local Precalc: 17.803ms (+7%)
Running Mask: 97.159ms (+485% - catastrophically slow!)
The culprit? Data dependencies destroy CPU parallelism.
Why Running Mask is So Slow
The running mask creates a tight dependency chain that prevents the CPU from executing iterations in parallel:
Variable shift (fast): Each iteration is independent
for (i = 0; i < 32; i++) {
  if (num & (1 << i)) { ... } // No dependency between iterations
}
- CPU can execute multiple iterations in parallel
- Instruction-level parallelism works
- Out-of-order execution optimizes throughput
Running mask (slow): Each iteration waits for the previous
uint32_t mask = 1;
for (i = 0; i < 32; i++) {
  if (num & mask) { ... }
  mask <<= 1; // Next iteration MUST WAIT for this!
}
- Serialized execution (no parallelism possible)
- Pipeline stalls on read-after-write hazard
- CPU sits idle waiting for mask to update
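To convince yourself that the dependency chain, and not the shift itself, is the bottleneck, here is an illustrative sketch (not something I benchmarked) that splits the work into two independent chains the CPU can overlap:

// Illustrative only: two running masks with no dependency on each other.
// Each chain still serializes internally, but the out-of-order engine can
// interleave the two chains - which is what (1 << i) gets for free on
// every single iteration.
static int count_bits_two_chains(uint32_t num) {
  uint32_t mask_lo = 1u;        // walks bits 0..15
  uint32_t mask_hi = 1u << 16;  // walks bits 16..31
  int count_lo = 0, count_hi = 0;
  for (int i = 0; i < 16; i++) {
    if (num & mask_lo) { count_lo++; }  // chain 1
    if (num & mask_hi) { count_hi++; }  // chain 2
    mask_lo <<= 1;
    mask_hi <<= 1;
  }
  return count_lo + count_hi;
}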
The Answer: Cache Architecture
The mystery unraveled when I examined my test machine's cache:
Apple Silicon M-series Cache:
- L1 Data: 128 KB
- L2: 16 MB (per P-core cluster)
- L3: None (uses System Level Cache instead)
Dataset Size vs Cache:
Dataset | Size | Fits in L2? | Winner (-O2)
---|---|---|---
1M | 3.8 MB | ✓ YES (24% of L2) | Global Precalc
10M | 38 MB | ✗ NO (238% of L2) | No Precalc
Why This Happens: The Performance Model
When Data Fits in Cache (≤ 4 MB)
Memory access latency:
- L2 cache: ~4 nanoseconds
- Bit shift: ~0.5 nanoseconds
Array lookup is 8× slower, but still fast enough that aggressive compiler optimization can make it competitive or even better:
// What GCC -O2 does:
// 1. Recognizes GLOBAL_BIT_MASKS is constant
// 2. Keeps it in L1 cache or registers
// 3. Eliminates bounds checking
// 4. Optimizes memory access patterns
// 5. Reorders instructions for pipelining
//
// Result: Array lookup becomes nearly free
When Data Exceeds Cache (> 16 MB)
Memory access latency:
- Main RAM: ~100 nanoseconds
- Bit shift: ~0.5 nanoseconds
Array lookup is now 200× slower! The compiler's optimization can't overcome the physics of memory latency.
// Even with optimization:
// - Array still must be fetched from RAM
// - Data constantly evicted from cache
// - Cache thrashing occurs
// - Memory latency dominates
//
// Result: Direct computation wins
The Fundamental Truth
At the CPU level:
Bit Shift: Single ALU instruction (0.3-0.5 ns)
↓
[CPU Register]
↓
Result
Array Lookup: Memory fetch + dereference
↓
[Check L1 Cache?] → Miss
↓
[Check L2 Cache?] → Miss
↓
[Fetch from RAM] → 100 ns!
↓
Result
Bit shifts execute in the CPU's Arithmetic Logic Unit (ALU) with no memory access. Even the fastest memory is slower than no memory access at all.
Why Compiler Optimization Changes Things
Modern compilers like GCC can perform aggressive optimizations:
For Global Arrays (-O2):
// Source code:
mask & GLOBAL_BIT_MASKS[i]
// Compiler sees:
// - GLOBAL_BIT_MASKS is constant
// - Values are powers of 2
// - Can precompute at compile time
// - Can keep in registers
// - Can eliminate runtime lookups
// Result: Sometimes faster than dynamic shifts
For Bit Shifts:
// Source code:
mask & (1 << i)
// Compiler sees:
// - Must compute at runtime (i is variable)
// - Simple instruction, but can't eliminate it
// - Can pipeline with other ops
// - Very fast, but not zero cost
// Result: Consistently fast, not optimized away
The paradox: When data fits in cache, the compiler can optimize the "slow" approach to be faster than the "fast" approach!
JavaScript JIT Limitations
V8's Just-In-Time compiler is powerful but more conservative than static compilers:
Why V8 can't optimize as aggressively:
- Must handle dynamic types
- Must maintain language semantics
- Compiles at runtime (less time for optimization)
- Less global program knowledge
- Can't prove global arrays are truly constant
- Arrays aren't really arrays
JavaScript Arrays Are Not Native Arrays
A critical difference often overlooked:
// In JavaScript, this looks like a C array:
const BIT_MASKS = new Array(32);
// But it's actually a complex data structure more like Java's ArrayList:
// - Dynamic resizing
// - Sparse storage (holes are allowed)
// - Can hold mixed types
// - Properties can be added/deleted
// - Bounds checking at runtime
// - Potential hash table backing for sparse arrays
In C:
uint32_t BIT_MASKS[32]; // Contiguous memory block
// Fixed size, single type
// Direct pointer arithmetic
// No runtime overhead
In JavaScript:
const BIT_MASKS = new Array(32);
// Actually a managed object with:
// - Type checks on access
// - Bounds validation
// - Potential deoptimization if used wrong
// - Much more overhead than raw memory access
Even when V8 optimizes JavaScript arrays to use contiguous storage (packed arrays), they still have more overhead than native C arrays. The JIT must guard against type changes, resizing, and other dynamic behaviors that don't exist in C.
Result: In JavaScript, direct bit shifts always win because:
- The JIT can't apply the same aggressive optimizations as GCC
- Array access has inherent overhead that C doesn't have
- Even "optimized" JavaScript arrays aren't as fast as C arrays
Practical Recommendations
For JavaScript/TypeScript Developers
✅ Always use direct bit shifts:
mask & (1 << i) // Use this
❌ Don't precalculate:
mask & BIT_MASKS[i] // Avoid this
Why: V8 can't optimize array lookups as aggressively as static compilers. Direct computation is 30-70% faster.
For C/C++ Developers
The answer is more nuanced:
Production builds (-O2/-O3) with small data:
- Global precalculation actually edged out the variable shift here (~16% in my 1M-number test)
- That only holds because the compiler can fold the constant table and the data fits in cache
- Variable shift (1 << i) stays close and remains the simpler, portable choice
Production builds with large data:
- Variable shift wins decisively
- Memory latency dominates at scale
- 6% faster than precalc, 485% faster than running mask
Debug builds (-O0):
- Always use variable shift (1 << i)
- 40-70% faster than precalculation
- Critical for faster debug-build runs
Never use running mask:
- The mask <<= 1 approach is 5.8× slower
- Data dependencies prevent parallelization
- Appears simpler but kills CPU pipeline
Unknown dataset size:
- Default to variable shift (1 << i)
- Safest and fastest choice across all scenarios
The General Rule
Only "cache" computations if:
- The computation is expensive (>10ns)
- Your working set fits in cache
- You're using aggressive optimization
- You've measured and confirmed benefit
Don't cache if:
- Computation is trivial (bit ops, arithmetic)
- Working set exceeds cache
- "Optimization" adds complexity
- You haven't profiled
Never create data dependencies when:
- You could compute the values independently instead
- The "simpler" operation serializes execution
- You're trading parallelism away for perceived "efficiency"
The Deeper Lesson
This investigation reveals why performance optimization is so challenging:
1. Intuition Fails
"Compute once, use many" seems obviously better. But:
- Memory is not infinitely fast
- CPU computation can be cheaper than memory access
- Scale changes everything
2. Context Matters
The "right" answer depends on:
- Language/runtime (JavaScript vs C)
- Compiler optimization level (-O0 vs -O2)
- Dataset size (cache fit)
- Hardware architecture (cache sizes)
3. Measure, Don't Assume
My intuition said "cache it." The benchmark said "don't."
In the 1990s, when CPU cycles were precious and memory was relatively fast, caching bit masks might have been faster. In 2025, with multi-GHz CPUs and memory that's relatively slower, the opposite is true.
Hardware evolution flips performance assumptions.
Conclusion
The "optimizations" of bit mask manipulation reveal surprising truths:
Precalculating bit masks makes code:
- 36-70% slower in JavaScript
- 40-70% slower in unoptimized C
- 6-7% slower in optimized C at scale
Using running mask makes code:
- 485% slower (5.85×) even with "simpler" constant shifts
- Destroys CPU parallelism through data dependencies
- Shows that "simpler" operations can be catastrophically slow
The fundamental reasons:
- CPU bit shift instructions are faster than memory access
- Independent operations enable parallelism
- Data dependencies serialize execution and stall pipelines
- Modern CPUs optimize for instruction-level parallelism
Modern compilers can sometimes overcome memory access costs with aggressive optimization when data fits in cache, but:
- At production scale, hardware realities prevail
- Data dependencies cannot be optimized away
- CPU parallelism trumps "simple" operations
The Meta-Lesson
This investigation isn't really about bit shifts. It's about:
- Questioning assumptions
- Measuring instead of guessing
- Understanding your hardware
- Recognizing that "optimization" can make things slower
- Appreciating CPU instruction-level parallelism
The next time you're tempted to "optimize" something, ask:
- Is the original operation really expensive?
- Have I measured the actual cost?
- Am I trading fast computation for slow memory access?
- Does my "optimization" exceed cache capacity?
- Am I creating data dependencies that prevent parallelism?
- Does my "simpler" approach actually serialize execution?
Sometimes the best optimization is no optimization at all. And sometimes the "obviously better" approach is 6× slower.
In short: Compute, don’t cache. Measure, don’t assume. The CPU is a master of parallel math - let it work, and keep memory out of the way.
Reproduce the Results
All source code and benchmarks are available in this repository.
Hardware: Apple Silicon M-series, macOS 24.6.0
Software: Node.js v22.14.0 (V8), GCC/Clang with C99