Are our most advanced AI systems secretly bluffing? This isn’t a rhetorical question, but a critical challenge underpinning the trustworthiness and future adoption of Large Language Models (LLMs). Imagine asking a widely-used chatbot for the PhD dissertation title of a prominent researcher, Adam Kalai. You might expect a single, accurate answer. Instead, it confidently provides three different, entirely incorrect titles. Ask for his birthday, and you receive three distinct, equally false dates.

These instances, where an AI model confidently generates an answer that isn’t true, are what we call hallucinations. They are a fundamental, stubbornly persistent challenge for all LLMs, even the most capable ones such as GPT-5, although its hallucination rate is significantly lower, especially in reasoning tasks. As a tech leader deeply invested in the responsible evolution of AI, I see this phenomenon as more than a technical glitch; it’s a pivotal hurdle we must overcome to unlock AI’s full potential for reliability and trust.

Our recent research at OpenAI delves into the heart of this paradox, revealing that hallucinations aren’t a mysterious defect, but a logical outcome of current AI training and evaluation paradigms. It’s a dual problem: rooted in the statistical nature of how these models learn, and exacerbated by the incentives baked into how we measure their performance.

The Genesis of Errors: When Learning Leads to Guessing

To truly understand hallucinations, we must first look at the pretraining phase, where base models learn the distribution of language from massive text corpora. This process relies on next-word prediction, a self-supervised task where the model learns patterns by predicting what word comes next. Unlike traditional machine learning, there are no explicit “true/false” labels on every statement; the model approximates the overall language distribution.

Here’s where the statistical traps emerge:

The key takeaway from pretraining is that certain types of errors are not just possible, but statistically probable, given the inherent limitations of pattern learning on vast, diverse, and often noisy datasets. This view demystifies hallucinations: they are not a “glitch” but a natural statistical outcome.

The Perverse Incentives: How Evaluations Encourage “Bluffing”

While pretraining sets the stage for potential errors, it’s the post-training evaluation process that transforms these potential errors into confident falsehoods. We’ve essentially been “teaching to the test” in a way that prioritizes superficial accuracy over genuine understanding and honesty about uncertainty.

Think of it like a multiple-choice exam: if you don’t know the answer, a wild guess might get you lucky. Leaving it blank guarantees zero points. The same logic applies to LLMs graded on accuracy alone: a guess may earn credit, while answering “I don’t know” earns none.

This “epidemic” of penalizing uncertainty means that even as LLMs become more advanced, they are still incentivized to hallucinate, providing confident but wrong answers rather than acknowledging their limits.
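The arithmetic behind this incentive is easy to make concrete. The sketch below assumes a simple binary grader (1 point for a correct answer, 0 for a wrong answer or an abstention); the function name `expected_binary_score` and the example probabilities are illustrative assumptions, not taken from any particular benchmark.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score under an assumed 0/1 grader: 1 if correct, 0 otherwise."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    # "I don't know" always scores 0; a guess scores 1 with probability p_correct.
    return 0.0 if abstain else p_correct


# Illustrative probabilities of a guess being right (hypothetical values).
for p in (0.01, 0.25, 0.9):
    guess = expected_binary_score(p, abstain=False)
    idk = expected_binary_score(p, abstain=True)
    print(f"p={p:.2f}  guess={guess:.2f}  abstain={idk:.2f}")
```

Under this assumed grader, guessing never scores worse than abstaining, even when the model is almost certainly wrong, which is the incentive described above.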

The Path Forward: Cultivating “Intelligent Humility” in AI

The good news is that this problem is not insurmountable. To truly foster trustworthy AI, we need a paradigm shift towards what I call “Intelligent Humility”. This means we must move beyond simply trying to reduce hallucinations and instead fundamentally rethink how we build and evaluate AI, so that calibrated uncertainty and meaningful abstention are rewarded.

Here’s how we can achieve this:

  1. Redesign Evaluation Scoreboards: The most straightforward fix is to penalize confident errors more severely than acknowledging uncertainty, and award partial credit for appropriate expressions of uncertainty. This isn’t about introducing a few niche hallucination tests; it’s about reworking the primary evaluation metrics that currently dominate leaderboards. If the main scoreboards continue to reward lucky guesses, models will continue to learn to guess.
  2. Integrate Explicit Confidence Targets: We should embed clear confidence targets and penalty schemes directly into evaluation instructions. For example, a prompt could state: “Answer only if you are >t confident, since mistakes are penalized t/(1-t) points, while correct answers receive 1 point, and ‘I don’t know’ receives 0 points”. This makes the incentives transparent and encourages models to only answer when they meet a specified confidence threshold, fostering “behavioral calibration”.
  3. Elevate Abstention as a Virtue: Just as humility is a core value at OpenAI, the ability of an LLM to say “I don’t know” or to ask for clarification should be rewarded, not penalized. A model that knows its limits is often more useful and safer than one that bluffs its way to a statistically higher (but less reliable) accuracy score.
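The scoring rule quoted in item 2 can be checked with a little arithmetic. The sketch below takes that rule at face value (1 point for a correct answer, a penalty of t/(1-t) for a mistake, 0 for “I don’t know”) and computes the expected score of answering when the model is right with probability p. The function names and the example values are assumptions chosen for illustration.

```python
def penalty(t: float) -> float:
    """Points deducted for a wrong answer at confidence threshold t."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must satisfy 0 <= t < 1")
    return t / (1.0 - t)


def expected_score(p: float, t: float) -> float:
    """Expected score of answering when the answer is correct with probability p."""
    # Correct answers earn 1 point; mistakes cost penalty(t) points.
    return p * 1.0 - (1.0 - p) * penalty(t)


def should_answer(p: float, t: float) -> bool:
    """Answering beats abstaining (score 0) only when the expected score is positive."""
    return expected_score(p, t) > 0.0


t = 0.75  # hypothetical threshold; a mistake then costs 3 points
for p in (0.6, 0.75, 0.9):
    print(f"p={p:.2f}  expected={expected_score(p, t):+.2f}  answer={should_answer(p, t)}")
```

Algebraically, the expected score simplifies to (p - t)/(1 - t), so answering pays off only when p exceeds t. That is why the stated penalty makes the threshold t the rational cut-off between answering and abstaining.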

This isn’t just a technical adjustment; it’s a strategic and ethical imperative for the AI industry. By prioritizing Intelligent Humility, we can steer the field toward AI systems that are not only powerful but also reliable, transparent, and genuinely trustworthy, qualities that are essential for their integration into critical applications and for fostering public confidence.

The future of AI isn’t just about reaching higher accuracy scores; it’s about building systems that understand the nuance of knowledge, the value of honesty, and the importance of knowing when to hold back. It’s about graduating our LLMs from the “test-taking” mode of superficial performance to the real-world standard of accountable, intelligently humble assistance.