Imagine a world where scientific experiments yielded different results every time you ran them, even with identical setups. Or a financial model that produced varying predictions for the exact same input data. In most fields, such inconsistencies would be unacceptable. Yet, in the burgeoning world of large language models (LLMs), we’ve often accepted a curious paradox: the same prompt, run multiple times, can produce entirely different outputs. This isn’t just a quirk of “probabilistic” AI; it’s a fundamental challenge to reproducibility, reliability, and ultimately, trust in our most advanced AI systems.

The team at Thinking Machines Lab recently published a groundbreaking blog post, “Defeating Nondeterminism in LLM Inference,” which strips back the layers of abstraction to reveal the true culprit behind this frustrating inconsistency. Their work not only demystifies why LLMs behave this way but also offers a tangible path to achieving true, bitwise-identical reproducibility in LLM inference. For anyone building, deploying, or simply enthusiastic about AI, these insights are indispensable.

Unmasking the Real Cause of LLM Nondeterminism: Beyond the Obvious

Many of us might intuitively attribute LLM output variations to the “sampling” process or to the inherent parallelism of modern GPUs. Indeed, ask ChatGPT the same question repeatedly and you’ll observe different results, and lowering the temperature parameter to 0 (which in theory makes sampling deterministic by always picking the highest-probability token, known as greedy sampling) doesn’t fully resolve the issue. Even running inference on your own hardware with open-source libraries like vLLM or SGLang still yields nondeterministic outputs.

A common explanation, often called the “concurrency + floating point” hypothesis, suggests that the non-associativity of floating-point arithmetic on GPUs, combined with the nondeterministic ordering of concurrent operations, leads to these varying results. Floating-point numbers, used extensively in GPU calculations, exhibit non-associativity: (a + b) + c does not always equal a + (b + c) due to finite precision and rounding errors. This can indeed lead to subtle numerical differences depending on the order of operations.
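You can see this non-associativity in a few lines of plain Python (floats here are IEEE-754 doubles, the same kind of arithmetic GPUs apply at various precisions):

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e20, -1e20, 1.0

print((a + b) + c)  # 1.0 -> the huge values cancel first, so c survives
print(a + (b + c))  # 0.0 -> c is absorbed into -1e20 by rounding and lost
```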

However, Thinking Machines Lab’s research reveals that this hypothesis, while containing a grain of truth about floating-point behavior, doesn’t tell the full story for LLM inference. They demonstrate that running the same matrix multiplication on the same data repeatedly on a GPU consistently produces bitwise-identical results. The core issue, then, isn’t simply floating-point math or concurrency in isolation.
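A quick PyTorch sketch of that experiment (any CUDA-capable setup should reproduce it; the matrix sizes are arbitrary):

```python
import torch

# The same matmul on the same data is run-to-run deterministic on a GPU:
# every repetition yields bitwise-identical results.
A = torch.randn(2048, 2048, device="cuda")
B = torch.randn(2048, 2048, device="cuda")

ref = torch.mm(A, B)
for _ in range(100):
    assert torch.equal(torch.mm(A, B), ref)  # exact bitwise equality
print("100 repetitions, all bitwise identical")
```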

Thinking Machines Lab identifies two crucial points:

  1. Floating-point non-associativity is the underlying cause of numerical differences. The way floating-point numbers handle different scales during addition can lead to information loss and wildly different results based on the order of summation.
  2. The actual source of LLM inference nondeterminism is more subtle: it’s the lack of batch invariance in the kernels (the low-level computational units) used in the LLM’s forward pass, combined with the nondeterministic variation in batch sizes due to server load.

Let’s break down “batch invariance”: it’s the property that the result computed for a specific element in a batch is unaffected by the other elements in the batch or by the total batch size. Thinking Machines Lab empirically demonstrates that many matrix multiplication implementations are not batch-invariant: a single element’s result can change depending on whether it is computed alone or as part of a larger batch. From a user’s perspective, the server’s load (and thus the batch size) is effectively nondeterministic, so when kernels lack batch invariance, load directly shapes the output of each individual request. This issue, they note, isn’t exclusive to GPUs; it affects LLM inference endpoints served from CPUs or TPUs as well.
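Here is a sketch of that demonstration: compute one row’s result alone, then compute the same row as part of a full batch, and compare. Whether and by how much the two differ depends on your GPU and which kernels PyTorch selects, but on typical hardware they do not match bitwise:

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

# The same row, computed at batch size 1 vs. inside a batch of 2048.
out_alone = torch.mm(a[:1], b)
out_batched = torch.mm(a, b)[:1]

print(torch.equal(out_alone, out_batched))    # typically False
print((out_alone - out_batched).abs().max())  # a small but nonzero gap
```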

To achieve batch invariance, Thinking Machines Lab outlines specific strategies for the critical operations in the LLM forward pass:

  1. RMSNorm: use a data-parallel reduction in which each batch element is reduced entirely within a single core, so the reduction order never depends on how many other requests share the batch.
  2. Matrix multiplication: compile one kernel configuration and reuse it across all batch sizes, forgoing tactics like Split-K that change the reduction order as shapes change, at a modest performance cost.
  3. Attention: fix the size of each split along the key/value dimension rather than the number of splits, and keep the KV cache layout consistent, so a given token is reduced in the same order regardless of how many tokens are being processed at once.
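As a conceptual illustration of the RMSNorm point, here is a per-row RMSNorm in PyTorch. At this level each row’s reduction is logically independent of the batch; the hard part, which the batch-invariant kernels handle, is guaranteeing the same property in the actual GPU reduction order:

```python
import torch

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Reduce along the hidden dimension only: each row's mean-square is
    # computed from that row alone, so no batch element can influence
    # another. True batch invariance additionally requires that the
    # kernel's reduction *order* within each row not change with batch size.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (x / rms) * weight
```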

Thinking Machines Lab’s practical demonstration, using their batch-invariant-ops library integrated with vLLM, showed striking results. When prompting Qwen/Qwen3-235B-A22B-Instruct-2507 for 1000 completions at temperature=0, the default setup yielded 80 unique completions. With their batch-invariant kernels enabled, all 1000 completions were identical: the mathematical determinism we expect from temperature=0, finally achieved.
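A sketch of how you might reproduce that test yourself against a vLLM server’s OpenAI-compatible endpoint (the URL and request count here are illustrative):

```python
from openai import OpenAI

# Point the client at a locally served vLLM instance (illustrative URL).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completions = set()
for _ in range(1000):
    resp = client.completions.create(
        model="Qwen/Qwen3-235B-A22B-Instruct-2507",
        prompt="Tell me about Richard Feynman",  # the prompt used in the post
        temperature=0,
        max_tokens=200,
    )
    completions.add(resp.choices[0].text)

# 1 unique completion means fully deterministic; the post saw 80 by default.
print(f"{len(completions)} unique completions")
```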

Why This Matters: From Science to Scalability

The implications of achieving deterministic LLM inference are profound for the entire AI and tech community:

  1. Reproducible science and debugging: identical inputs yield bitwise-identical outputs, so experiments, evaluations, and bug reports can be replayed exactly.
  2. True on-policy reinforcement learning: as the post highlights, when sampling and training produce numerically identical results, RL fine-tuning becomes genuinely on-policy instead of silently off-policy.
  3. Trust and auditability: deterministic behavior is a prerequisite for regulated, safety-critical, and customer-facing deployments that must be able to reproduce and explain their outputs.

A Future of Predictable AI: My Perspective

This research by Thinking Machines Lab represents a pivotal moment in LLM engineering. It moves us away from treating LLMs as inherently “probabilistic black boxes” and toward a more rigorous, principled engineering discipline. For too long, we’ve accepted numerical variations as an unavoidable consequence of parallel computing. This work forcefully rejects that “defeatism.”

I predict that the pursuit of bitwise reproducibility will become a standard requirement in enterprise-grade AI deployments, particularly in heavily regulated industries. Thinking Machines Lab does note a performance overhead: their unoptimized deterministic vLLM took 55 seconds for a workload the default configuration finished in 26 seconds, improving to 42 seconds with an optimized attention kernel. But this is a temporary trade-off. As batch-invariant kernels are further optimized and integrated into core frameworks, the performance gap is likely to narrow, making deterministic inference practical for a far wider range of applications.

This capability unlocks new frontiers for controlled experimentation, A/B testing, and robust validation of LLMs. It empowers developers and researchers to reason about their models with greater confidence, transforming LLMs from fascinating but sometimes erratic entities into predictable, controllable, and ultimately more trustworthy tools.

Practical Takeaways for AI Professionals and Enthusiasts

So, what can you do with this groundbreaking insight?

  1. Don’t assume temperature=0 guarantees reproducibility: on a shared endpoint, other users’ traffic changes the batch size, and with batch-variant kernels that changes your outputs.
  2. Measure before you trust: replay the same prompt many times (as in the sketch earlier) and count the unique completions before building workflows that depend on reproducibility.
  3. If you self-host with vLLM, experiment with the batch-invariant-ops kernels and benchmark the determinism/performance trade-off on your own workload; a sketch follows below.
  4. Treat batch size and server load as experimental variables, not harmless implementation details, when debugging or evaluating models.
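A minimal sketch of that third takeaway, assuming the set_batch_invariant_mode helper described in the batch-invariant-ops project’s README (check the repository for the current API before relying on this):

```python
import torch
from batch_invariant_ops import set_batch_invariant_mode  # assumed API from the README

a = torch.randn(2048, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with set_batch_invariant_mode():
    # Inside this context, batch-invariant kernels are swapped in, so a row
    # computed alone should match the same row computed inside a batch.
    out_alone = torch.mm(a[:1], b)
    out_batched = torch.mm(a, b)[:1]
    print(torch.equal(out_alone, out_batched))  # expected: True
```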

The Road Ahead: Towards Predictable Intelligence

The quest for predictable AI is not just an academic exercise; it’s a fundamental step towards building truly reliable, trustworthy, and impactful intelligent systems. The work from Thinking Machines Lab provides a clear blueprint for achieving this. It reminds us that even in the most complex corners of machine learning, careful engineering and a refusal to accept “good enough” can lead to breakthroughs that redefine what’s possible.