Imagine a world where scientific experiments yielded different results every time you ran them, even with identical setups. Or a financial model that produced varying predictions for the exact same input data. In most fields, such inconsistencies would be unacceptable. Yet, in the burgeoning world of large language models (LLMs), we’ve often accepted a curious paradox: the same prompt, run multiple times, can produce entirely different outputs. This isn’t just a quirk of “probabilistic” AI; it’s a fundamental challenge to reproducibility, reliability, and ultimately, trust in our most advanced AI systems.

The team at Thinking Machines Lab recently published a groundbreaking blog post, “Defeating Nondeterminism in LLM Inference,” which strips back the layers of abstraction to reveal the true culprit behind this frustrating inconsistency. Their work not only demystifies why LLMs behave this way but also offers a tangible path to achieving true, bitwise-identical reproducibility in LLM inference. For anyone building, deploying, or simply enthusiastic about AI, these insights are indispensable.

Unmasking the Real Cause of LLM Nondeterminism: Beyond the Obvious

Many of us might intuitively attribute LLM output variations to the “sampling” process or to the inherent parallelism of modern GPUs. Indeed, ask ChatGPT the same question repeatedly and you’ll observe different results, and lowering the temperature parameter to 0 (which in theory makes sampling deterministic by always picking the highest-probability token, known as greedy sampling) doesn’t fully resolve the issue. Even running inference on your own hardware with open-source libraries like vLLM or SGLang still yields nondeterministic outputs.

A common explanation, often called the “concurrency + floating point” hypothesis, suggests that the non-associativity of floating-point arithmetic on GPUs, combined with the nondeterministic ordering of concurrent operations, leads to these varying results. Floating-point numbers, used extensively in GPU calculations, exhibit non-associativity: (a + b) + c does not always equal a + (b + c) due to finite precision and rounding errors. This can indeed lead to subtle numerical differences depending on the order of operations.
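You can see this non-associativity in a few lines of plain Python (floats here are IEEE-754 doubles, the same kind of arithmetic GPUs apply at various precisions):

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e20, -1e20, 1.0

print((a + b) + c)  # 1.0 -> the huge values cancel first, so c survives
print(a + (b + c))  # 0.0 -> c is absorbed into -1e20 by rounding and lost
```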

However, Thinking Machines Lab’s research reveals that this hypothesis, while containing a grain of truth about floating-point behavior, doesn’t tell the full story for LLM inference. They demonstrate that running the same matrix multiplication on the same data repeatedly on a GPU consistently produces bitwise-identical results. The core issue, then, isn’t simply floating-point math or concurrency in isolation.
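A quick PyTorch sketch of that experiment (any CUDA-capable setup should reproduce it; the matrix sizes are arbitrary):

```python
import torch

# The same matmul on the same data is run-to-run deterministic on a GPU:
# every repetition yields bitwise-identical results.
A = torch.randn(2048, 2048, device="cuda")
B = torch.randn(2048, 2048, device="cuda")

ref = torch.mm(A, B)
for _ in range(100):
    assert torch.equal(torch.mm(A, B), ref)  # exact bitwise equality
print("100 repetitions, all bitwise identical")
```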

Thinking Machines Lab identifies two crucial points:

  1. Floating-point non-associativity is the underlying cause of numerical differences. The way floating-point numbers handle different scales during addition can lead to information loss and wildly different results based on the order of summation.
  2. The actual source of LLM inference nondeterminism is more subtle: it’s the lack of batch invariance in the kernels (the low-level computational units) used in the LLM’s forward pass, combined with the nondeterministic variation in batch sizes due to server load.

Let’s break down “batch invariance”: it’s the property that the result computed for a specific element in a batch is unaffected by the other elements in the batch or by the total batch size. Thinking Machines Lab empirically demonstrates that many matrix multiplication implementations are not batch-invariant: a single element’s result can change depending on whether it is computed alone or as part of a larger batch. From a user’s perspective, the server’s load (and thus the batch size) is effectively nondeterministic, so when kernels lack batch invariance, load directly shapes the output of each individual request. This issue, they note, isn’t exclusive to GPUs; it affects LLM inference endpoints served from CPUs or TPUs as well.
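Here is a sketch of that demonstration: compute one row’s result alone, then compute the same row as part of a full batch, and compare. Whether and by how much the two differ depends on your GPU and which kernels PyTorch selects, but on typical hardware they do not match bitwise:

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# The same row, computed at batch size 1 vs. inside a batch of 2048.
out_alone = torch.mm(a[:1], b)
out_batched = torch.mm(a, b)[:1]

print(torch.equal(out_alone, out_batched))    # typically False
print((out_alone - out_batched).abs().max())  # a small but nonzero gap
```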

To achieve batch invariance, Thinking Machines Lab outlines specific strategies for the critical operations in the LLM forward pass:

  1. RMSNorm: use a data-parallel reduction in which each batch element is reduced entirely within a single core, so the reduction order never depends on how many other requests share the batch.
  2. Matrix multiplication: compile one kernel configuration and reuse it across all batch sizes, forgoing tactics like Split-K that change the reduction order as shapes change, at a modest performance cost.
  3. Attention: fix the size of each split along the key/value dimension rather than the number of splits, and keep the KV cache layout consistent, so a given token is reduced in the same order regardless of how many tokens are being processed at once.
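As a conceptual illustration of the RMSNorm point, here is a per-row RMSNorm in PyTorch. At this level each row’s reduction is logically independent of the batch; the hard part, which the batch-invariant kernels handle, is guaranteeing the same property in the actual GPU reduction order:

```python
import torch

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Reduce along the hidden dimension only: each row's mean-square is
    # computed from that row alone, so no batch element can influence
    # another. True batch invariance additionally requires that the
    # kernel's reduction *order* within each row not change with batch size.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x / rms) * weight
```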

Thinking Machines Lab’s practical demonstration, using their batch-invariant-ops library integrated with vLLM, showed striking results. When prompting Qwen/Qwen3-235B-A22B-Instruct-2507 for 1000 completions at temperature=0, the default setup yielded 80 unique completions. With their batch-invariant kernels enabled, all 1000 completions were identical: the mathematical determinism we expect from temperature=0, finally achieved.
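A sketch of how you might reproduce that test yourself against a vLLM server’s OpenAI-compatible endpoint (the URL and request count here are illustrative):

```python
from openai import OpenAI

# Point the client at a locally served vLLM instance (illustrative URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completions = set()
for _ in range(1000):
    resp = client.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        prompt="Tell me about Richard Feynman",  # the prompt used in the post
        temperature=0,
        max_tokens=200,
    )
    completions.add(resp.choices[0].text)

# 1 unique completion means fully deterministic; the post saw 80 by default.
print(f"{len(completions)} unique completions")
```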

Why This Matters: From Science to Scalability

The implications of achieving deterministic LLM inference are profound for the entire AI and tech community:

  1. Reproducible science and debugging: identical inputs yield bitwise-identical outputs, so experiments, evaluations, and bug reports can be replayed exactly.
  2. True on-policy reinforcement learning: as the post highlights, when sampling and training produce numerically identical results, RL fine-tuning becomes genuinely on-policy instead of silently off-policy.
  3. Trust and auditability: deterministic behavior is a prerequisite for regulated, safety-critical, and customer-facing deployments that must be able to reproduce and explain their outputs.

A Future of Predictable AI: My Perspective

This research by Thinking Machines Lab represents a pivotal moment in LLM engineering. It moves us away from treating LLMs as inherently “probabilistic black boxes” and toward a more rigorous, principled engineering discipline. For too long, we’ve accepted numerical variations as an unavoidable consequence of parallel computing. This work forcefully rejects that “defeatism.”

I predict that the pursuit of bitwise reproducibility will become a standard requirement in enterprise-grade AI deployments, particularly in heavily regulated industries. Thinking Machines Lab does note a performance overhead: their unoptimized deterministic vLLM took 55 seconds for a workload the default configuration finished in 26 seconds, improving to 42 seconds with an optimized attention kernel. But this is a temporary trade-off. As batch-invariant kernels are further optimized and integrated into core frameworks, the performance gap is likely to narrow, making deterministic inference practical for a far wider range of applications.

This capability unlocks new frontiers for controlled experimentation, A/B testing, and robust validation of LLMs. It empowers developers and researchers to reason about their models with greater confidence, transforming LLMs from fascinating but sometimes erratic entities into predictable, controllable, and ultimately more trustworthy tools.

Practical Takeaways for AI Professionals and Enthusiasts

So, what can you do with this groundbreaking insight?

  1. Don’t assume temperature=0 guarantees reproducibility: on a shared endpoint, other users’ traffic changes the batch size, and with batch-variant kernels that changes your outputs.
  2. Measure before you trust: replay the same prompt many times (as in the sketch earlier) and count the unique completions before building workflows that depend on reproducibility.
  3. If you self-host with vLLM, experiment with the batch-invariant-ops kernels and benchmark the determinism/performance trade-off on your own workload; a sketch follows below.
  4. Treat batch size and server load as experimental variables, not harmless implementation details, when debugging or evaluating models.
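A minimal sketch of that third takeaway, assuming the set_batch_invariant_mode helper described in the batch-invariant-ops project’s README (check the repository for the current API before relying on this):

```python
import torch
from batch_invariant_ops import set_batch_invariant_mode  # assumed API from the README

a = torch.randn(2048, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with set_batch_invariant_mode():
    # Inside this context, batch-invariant kernels are swapped in, so a row
    # computed alone should match the same row computed inside a batch.
    out_alone = torch.mm(a[:1], b)
    out_batched = torch.mm(a, b)[:1]
    print(torch.equal(out_alone, out_batched))  # expected: True
```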

The Road Ahead: Towards Predictable Intelligence

The quest for predictable AI is not just an academic exercise; it’s a fundamental step towards building truly reliable, trustworthy, and impactful intelligent systems. The work from Thinking Machines Lab provides a clear blueprint for achieving this. It reminds us that even in the most complex corners of machine learning, careful engineering and a refusal to accept “good enough” can lead to breakthroughs that redefine what’s possible.