Large language models don’t just get things wrong - they present mistakes as facts. Even with new releases, the problem hasn’t gone away. Vectara’s 2025 LLM Hallucination Leaderboard puts GPT-5’s grounded error rate at 1.4% - lower than GPT-4’s 1.8%, and only 0.09 percentage points better than GPT-4o’s 1.49%.

A small improvement, but the problem remains.

The public has already seen how bad this can get. In mid-2024, Google’s AI Overviews told people to eat rocks for minerals - Google later acknowledged the issue. In early 2023, Google’s Bard demo misstated a James Webb Space Telescope fact, claiming it took the first image of an exoplanet. Add the “glue-on-pizza” tip from the same AI Overviews rollout, and the 2023 Avianca case, where two lawyers were sanctioned after citing six made-up cases generated by ChatGPT.

These might look like funny headlines, but it’s different when people actually rely on these tools. Small mistakes are just annoying, but in areas like health, law, or therapy, they can be dangerous.

What causes hallucinations in LLMs

LLMs, including ChatGPT, are trained to predict the next word in a sequence, not to verify facts. They have no built-in database of guaranteed truths; instead, they generate text by synthesising patterns from training data. When they don’t know, they guess the next words that seem most likely - and that guess can be wrong.
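To make that concrete, here is a minimal, purely illustrative sketch of next-token selection - the candidate tokens and logits are invented for illustration, not taken from any real model. The point is structural: the model scores candidates and emits the most probable one, and nothing in the loop checks whether that token is true.

```python
import math

def softmax(scores):
    """Convert raw model scores (logits) into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might assign to candidate next tokens after
# "The James Webb telescope took the first picture of ..." - numbers are made up.
candidates = ["an exoplanet", "the Moon", "a black hole", "Mars"]
logits = [2.1, 1.9, 1.3, 0.4]

probs = softmax(logits)
best = max(zip(candidates, probs), key=lambda pair: pair[1])

# The model always emits *something* - the highest-probability guess -
# and nothing here verifies whether that guess is factually correct.
print(best)  # ('an exoplanet', ~0.41): confident-sounding, and in Bard's case, wrong
```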

Training data has historically come from giant web scrapes - blogs, forums, wikis. Today a lot of the web is AI-written, so models increasingly learn from their own outputs, and mistakes get repeated and amplified.

No more free data

By mid-2023, user-generated content (UGC) platforms started locking down access. Reddit limited its free API; subreddits went dark. Twitter/X ended free API access. LinkedIn cracked down on bulk scraping. Stack Overflow said it would charge for training access to its Q&A. Quora moved more content into its Poe app. Meta tightened rate limits and legal warnings on Facebook and Instagram.

That ended the era of free data. Big AI companies moved to paid licensing, and public models were left with older, messy web data - making it more likely they would train on their own AI-written text.

Paying for access

OpenAI first signed a deal with the Associated Press in 2023, followed by multi-year agreements with Axel Springer and News Corp. By 2025, more than twenty publishers - including The Guardian and The Washington Post - had joined in. Some deals give AI models access to archives, others cover links and attribution inside products. Google also signed with AP in early 2025, while Microsoft connected Copilot to Thomson Reuters’ Westlaw for legal look-ups (for users, not for training).

The AI training-data market itself is valued at about $3.2B in 2024 and is expected to grow to $16.3B by 2034.

Where the clean data lives

Licensed and cleaned data is forming sector-specific reservoirs: licensed news archives, legal research databases such as Westlaw, scientific journals, and proprietary financial datasets.

Plenty of reservoirs stay shut. The New York Times sued OpenAI and Microsoft in December 2023, making clear it would not license its archives. The Financial Times, by contrast, signed a deal with OpenAI in April 2024. Elsevier and Wiley maintain closed scientific archives, and Bloomberg keeps its financial data proprietary. Clean data exists - but behind contracts.

Paid, specialised data is next

We’re likely heading for a split: the open web stays fine for simple tasks like quick lookups, drafting text, or answering everyday questions, while serious research, analysis, and AI builds move to clean reservoirs of data - vetted, filtered, verified - often behind subscriptions. Big companies will push this, since bad data slows them down. Expect more spending on data cleaning, labelling, and firewalls that separate reliable data from the mess.
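The “firewall” idea can be sketched very simply. Everything below - the source names, the Document fields, the allowlist - is a hypothetical illustration of separating licensed material from unverified web text, not a real pipeline.

```python
from dataclasses import dataclass

# Hypothetical record format - field and source names are illustrative, not a real schema.
@dataclass
class Document:
    source: str      # e.g. "licensed:ap-archive" or "web:unknown-blog"
    text: str
    licensed: bool   # set by the ingestion contract, not inferred

TRUSTED_SOURCES = {"licensed:ap-archive", "licensed:news-corp"}

def firewall(docs: list[Document]) -> tuple[list[Document], list[Document]]:
    """Split incoming documents into a clean pool and a quarantine pool."""
    clean, quarantine, seen = [], [], set()
    for doc in docs:
        fingerprint = doc.text.strip().lower()
        if fingerprint in seen:                 # drop exact duplicates
            continue
        seen.add(fingerprint)
        if doc.licensed and doc.source in TRUSTED_SOURCES:
            clean.append(doc)                   # goes into the curated reservoir
        else:
            quarantine.append(doc)              # unverified web text, held for review
    return clean, quarantine
```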

That setup needs role-based access built in - HR sees HR, finance sees finance, legal sees legal - so the model only pulls from what the person asking is cleared to view. This keeps private data out of answers and reduces the risk of the model pulling “facts” from the wrong pool.
Most chatbots don’t do this today. If that gap remains, the teams building role-aware search and locked-down knowledge bases will earn trust - and the contracts.
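A minimal sketch of that idea, with hypothetical roles, documents, and helper names: retrieval is filtered by the requester’s role before anything is handed to the model, so answers can only be grounded in documents that person is cleared to see.

```python
# Hypothetical knowledge base: every entry is tagged with the roles allowed to read it.
KNOWLEDGE_BASE = [
    {"roles": {"hr"},      "text": "Parental leave policy: 16 weeks paid."},
    {"roles": {"finance"}, "text": "Q3 budget forecast and variance notes."},
    {"roles": {"legal"},   "text": "Pending litigation summary, attorney-client privileged."},
]

def retrieve_for_role(query: str, role: str) -> list[str]:
    """Return only documents the requesting role is cleared to view.

    Real systems would add relevance ranking; the point here is that the
    access check happens before any text reaches the model as context.
    """
    allowed = [doc["text"] for doc in KNOWLEDGE_BASE if role in doc["roles"]]
    words = query.lower().split()
    return [text for text in allowed if any(w in text.lower() for w in words)]

def build_prompt(query: str, role: str) -> str:
    context = retrieve_for_role(query, role)
    if not context:
        return f"No documents available to the '{role}' role match this question."
    sources = "\n".join(f"- {c}" for c in context)
    return f"Answer using only these documents:\n{sources}\n\nQuestion: {query}"

# An HR user asking about budgets never sees the finance pool.
print(build_prompt("What is the Q3 budget forecast?", role="hr"))
```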

What to do with only public AI access

Prompt engineering is often the first line of defence against made-up answers - it’s inexpensive and immediate. If the prompt is unclear, the answer will be unclear. Industry practitioners stress the same point: without enough context, the output is likely to be poor, and the model is more prone to hallucinate. Clear rules and clean sources keep answers on track.

Best practices include:

- Give the model enough context: relevant documents, constraints, and the audience for the answer.
- Spell out clear rules in the prompt, including permission to answer “I don’t know”.
- Point the model at clean, verified sources instead of letting it rely on memory alone.
- Ask for citations so every claim can be checked against those sources.
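Here is a hedged sketch of what those practices can look like in a prompt. The rule wording, function name, and message format are assumptions rather than any vendor’s recommendation; the resulting messages can be sent to any chat-style API.

```python
# Illustrative template applying the practices above; the exact wording is an assumption.
SYSTEM_RULES = (
    "Answer using only the sources provided. "
    "Cite the source number for every factual claim. "
    "If the sources do not contain the answer, reply exactly: "
    "'I don't know based on the provided sources.'"
)

def grounded_messages(question: str, sources: list[str]) -> list[dict]:
    """Build a chat-style message list that pins the model to supplied sources."""
    source_block = "\n\n".join(f"[Source {i + 1}]\n{s}" for i, s in enumerate(sources))
    return [
        {"role": "system", "content": SYSTEM_RULES},
        {"role": "user", "content": f"Sources:\n{source_block}\n\nQuestion: {question}"},
    ]

messages = grounded_messages(
    "How many fabricated cases were cited in the Avianca filing?",
    sources=["Court records: the filing cited six non-existent cases generated by ChatGPT."],
)
for m in messages:
    print(f"{m['role'].upper()}: {m['content']}\n")
```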

The bottom line

By 2025, the split is clear: free public models working from the open, increasingly AI-written web on one side, and curated, licensed data reservoirs behind subscriptions on the other.

Both will continue. The difference is that one prioritises speed, the other accountability. Knowing which track you’re using matters.

Glue in pizza sauce makes a funny headline when it comes from a consumer search. In a hospital chart or a courtroom filing, it’s catastrophic. That’s why curated reservoirs and guardrails are becoming the foundations of serious AI.